[in case you missed it] Data Science Weekly - Issue 391

May 23, 2021

Issue #391 May 20 2021

Editor Picks

Developing Major League Baseball's Automated Ball/Strike System
Early in 2019, Major League Baseball announced a partnership with the Atlantic League of Professional Baseball (ALPB) to create and test an automated ball and strike calling system (ABS). The goal of the tests was to validate whether ABS was able to make and consistently communicate the correct call to the umpire quickly enough for the umpire to make the call on the field without introducing a delay...In this article, we will explain how ABS operates, review the system’s initial design and subsequent iterations, and preview some of the additional ABS changes and improvements slated for 2021...

Flat Data
Flat explores how to make it easy to work with data in git and GitHub. It builds on the “git scraping” approach pioneered by Simon Willison to offer a simple pattern for bringing working datasets into your repositories and versioning them, because developing against local datasets is faster and easier than working with data over the wire...

Sharing learnings about our [Twitter] image cropping algorithm
Twitter started using a saliency algorithm in 2018 to crop images. We did this to improve consistency in the size of photos in your timeline and to allow you to see more Tweets at a glance. The saliency algorithm works by estimating what a person might want to see first within a picture so that our system could determine how to crop an image to an easily-viewable size...In October 2020, we heard feedback from people on Twitter that our image cropping algorithm didn’t serve all people equitably. As part of our commitment to address this issue, we also shared that we'd analyze our model again for bias...

A Message from this week's Sponsor:

Metis Now Offers Online Flex Data Science Programs!

Join an Online Flex Data Science & Analytics Bootcamp and work on your own schedule with on-demand lectures, while still getting dedicated 1:1 instructor support. You’ll also get focused career support until you’re hired.

Ready to start your journey? Learn more about the Metis Online Flex Data Science & Analytics Bootcamps.

Data Science Articles & Videos

Fetching Better Beer Recommendations with Collie
In the adult world, choosing the right beer to drink is complicated. Luckily, machine learning can help! In this three part blog series, we'll use the Collie library to iteratively build up and understand deep learning recommendation models to recommend beer to users based on their previous history or a particular beer they like. Along the way, we'll work to shorten our training times and improve our results with simple optimizations, and incorporate metadata directly into the model and loss function. And, to prove just how effective the model is, I'll even input my own drink preferences and try the beers the model recommends to me...

Decoupling Value and Policy for Generalization in Reinforcement Learning
Standard deep reinforcement learning algorithms use a shared representation for the policy and value function. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic...

Serving Uncertainty
Most Machine Learning (ML) models return a point-estimate of the most likely data label, given an instance of feature data. There are many scenarios, however, where a point-estimate is not enough - where there is a need to understand the model's uncertainty in the prediction...Half-way between statistics and ML we have probabilistic programming, rooted in the methods of Bayesian inference. We demonstrate how to train such a predictive model using PyMC3 - a Probabilistic Programming Language (PPL) for Python. We will demonstrate how a single probabilistic program can be used to support requests for point-estimates, arbitrary uncertainty ranges, as well as entire distributions of predicted data labels, for a non-trivial regression task...
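As a rough illustration of the idea in the blurb above, one set of predictive samples can answer requests for a point-estimate, an uncertainty range, or the whole distribution. The sketch below is not the article's PyMC3 code: the function name `summarize_predictions` is hypothetical, and the simulated Gaussian draws are an assumption standing in for samples from a fitted probabilistic model.

```python
import random
import statistics


def summarize_predictions(samples, interval=0.9):
    """Summarize predictive samples for one input as a point estimate
    plus a central uncertainty interval."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - interval) / 2.0
    # Nearest-rank positions of the lower and upper interval bounds.
    lower = ordered[round(tail * (n - 1))]
    upper = ordered[round((1.0 - tail) * (n - 1))]
    return {"mean": statistics.mean(ordered), "lower": lower, "upper": upper}


# Simulated draws standing in for posterior predictive samples
# (an assumption; a real service would draw them from the fitted model).
rng = random.Random(0)
draws = [rng.gauss(10.0, 2.0) for _ in range(1000)]
summary = summarize_predictions(draws, interval=0.9)
print(summary)
```

A caller that needs the entire distribution would simply receive `draws` itself, while the summary covers the point-estimate and interval cases.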

Applications of Reinforcement Learning (RL): Recent examples from large US companies
RL is still not a widely used or accessible technology. Our recent analysis of Fortune 1000 companies revealed that, compared to other techniques (like deep learning), engagement in RL is very much still in the early stages across all sectors...Let’s look at some recent examples of RL in the real world. The list below includes representative examples of how some Fortune 1000 companies are beginning to use RL and related tools. While the list mostly includes companies outside the “Technology” sector, there are a few included tech companies that are using RL for chip design...

Google Replaces BERT Self-Attention with Fourier Transform: 92% Accuracy, 7 Times Faster on GPUs
A research team from Google shows that replacing transformers’ self-attention sublayers with Fourier Transform achieves 92 percent of BERT accuracy on the GLUE benchmark with training times seven times faster on GPUs and twice as fast on TPUs...
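For readers curious what "replacing self-attention with a Fourier Transform" can look like, here is a minimal sketch, assuming the mixing step is a discrete Fourier transform along the hidden dimension, then along the sequence dimension, keeping only the real part. The function names are hypothetical, and the naive O(n^2) transform is used only to keep the example dependency-free; it is not the research team's implementation.

```python
import cmath


def dft(values):
    """Naive O(n^2) discrete Fourier transform of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(x):
    """Mix token information without attention.

    x is a list of token vectors, indexed as x[token][feature].
    """
    # Transform each token's feature vector (hidden dimension).
    hidden = [dft(row) for row in x]
    # Transform each feature across tokens (sequence dimension).
    columns = [dft(list(col)) for col in zip(*hidden)]
    # columns is indexed [feature][token]; transpose back and keep the real part.
    return [
        [columns[f][t].real for f in range(len(columns))]
        for t in range(len(x))
    ]


mixed = fourier_mix([[1.0, 0.0], [0.0, 1.0]])
print(mixed)
```

The step has no learned parameters, which is one plausible reason a model built this way trains faster than one using attention.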

The data model behind Notion's flexibility
Everything you see in Notion is a block. Text, images, lists, a row in a database, even pages themselves — these are all blocks, dynamic units of information that can be transformed into other block types or moved freely within Notion. They’re the LEGOs we use to build and model information. And when put together, blocks are like LEGO sets, creating something much greater than the sum of their parts...
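The "everything is a block" idea can be sketched as a single record type whose kind can change and which can be re-parented. This is only an illustration: the class and field names (`Block`, `block_type`, `text`, `children`) are hypothetical and do not describe Notion's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare blocks by identity, not by field values
class Block:
    block_type: str
    text: str = ""
    children: List["Block"] = field(default_factory=list)

    def convert(self, new_type):
        """Change the block's type while keeping its content and children."""
        self.block_type = new_type

    def move_to(self, old_parent, new_parent):
        """Detach this block from one parent and append it to another."""
        old_parent.children.remove(self)
        new_parent.children.append(self)


page = Block("page", "Notes")
item = Block("to_do", "Buy milk")
page.children.append(item)

item.convert("heading")      # same content, different block type
archive = Block("page", "Archive")
item.move_to(page, archive)  # pages are blocks too, so they can hold blocks
```

Because pages use the same record type as their contents, converting and moving work the same way at every level of nesting.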

Are Convolutional Neural Networks or Transformers more like human vision?
Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways... In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases...

The Future of Machine Learning Lies in Better Abstractions
This week’s guest is Travis Addair, who previously led the team at Uber that was responsible for building Uber’s deep learning infrastructure...We chat with Travis Addair on how higher levels of abstractions enable non-experts to build efficient machine learning models...Travis is deeply involved with two popular open source projects related to deep learning: a) he is a maintainer of Horovod, a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet, and b) he is a co-maintainer of Ludwig, a toolbox that allows users to train and test deep learning models without the need to write code...

The Analytics Engineer
The analytics engineer sits at the intersection of the skill sets of data scientists, analysts, and data engineers. They bring a formal and rigorous software engineering practice to the efforts of analysts and data scientists, and they bring an analytical and business-outcomes mindset to the efforts of data engineering. It’s their job to build tools and infrastructure to support the efforts of the analytics and data team as a whole...Before we dive further into the role, we should cover some background on the “traditional” roles on the data team...

Falx: Synthesis-powered Visualization Authoring
With modern visualization tools like Tableau, ggplot2 or Vega-Lite, we can often create a visualization easily by mapping data columns to visual properties. However, when there is a mismatch between data layout and the design, we can't do it easily in such a way. We need to spend significant effort on data transformation and visualization scripting...Falx is a visualization-by-example tool to bypass these challenges. In Falx, we demonstrate how a few data points from the dataset would be mapped to the canvas, and Falx automatically generalizes the example to visualize the full dataset...

Tools*

Free T-Shirt After Your First Similarity Search

"Love thy nearest neighbor!"

Show the world you're into algorithms and you're a kind human. Get a free t-shirt when you try Pinecone and make your first similarity search query.

Whether you start with "hello world" or a more practical example, you'll see that deploying a similarity search feature is easy with Pinecone. Common use cases include:

Semantic search
Document search
Image/Audio/Video search
Anomaly detection
Deduplication
Question-answering
Personalized recommendations
Record matching
Automatic labeling
T-shirt acquisition :)

Try similarity search with Pinecone and get your free t-shirt after your first pinecone.query() call — while supplies last!

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Science Roles - Blue Cross and Blue Shield of IL, MT, NM, OK, TX - Chicago, IL / Richardson, TX

Health Care Service Corporation (HCSC), an Independent Licensee of the Blue Cross and Blue Shield Association, is the largest customer-owned and not-for-profit health insurer and fourth largest health insurer overall in the United States.

Our Data Science organization touches every aspect of our business, from claims processing to customer service to care management. Likewise, our portfolio employs a broad range of methodologies - from canonical tasks like using medical history to predict disease progression to applying NLP to doctors’ notes or leveraging deep learning on medical imaging. We’re hiring to expand our capabilities across the spectrum of Data Science roles. Come join us for the opportunity to work with massive datasets to drive revenue growth, improve our operational and member-facing processes, and affect how healthcare is delivered to our members.

Click here to check out our relevant job postings!

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Building with TensorFlow Lite for microcontrollers
Today, people use TensorFlow to develop large scale machine learning models. But did you know that TensorFlow can now run on microcontrollers? In this Google I/O Workshop, you'll learn about the potential of this combination. We'll show you debut demos, how to train a model, and explain where TensorFlow fits in the TinyML ecosystem...

Tutorial: Graph Neural Networks
In this tutorial, we will discuss the application of neural networks on graphs. Graph Neural Networks (GNNs) have recently gained increasing popularity in both applications and research, including domains such as social networks, knowledge graphs, recommender systems, and bioinformatics. While the theory and math behind GNNs might first seem complicated, the implementation of those models is quite simple and helps in understanding the methodology. Therefore, we will discuss the implementation of basic network layers of a GNN, namely graph convolutions, and attention layers. Finally, we will apply a GNN to node-level, edge-level, and graph-level tasks...
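To give a feel for how simple a basic graph convolution can be, here is a minimal sketch, assuming a mean-aggregation variant with self-connections: each node averages its own features with its neighbours' and then applies a shared linear map. The function name `graph_conv` and the toy graph are hypothetical, the nonlinearity is omitted, and this is not the tutorial's own code.

```python
def graph_conv(features, adjacency, weight):
    """One simplified graph convolution layer.

    features:  list of node feature vectors, features[node][d_in]
    adjacency: 0/1 matrix, adjacency[i][j] == 1 if i and j are connected
    weight:    shared projection matrix, weight[d_in][d_out]
    """
    n = len(features)
    out = []
    for i in range(n):
        # Neighbourhood of node i, including a self-connection.
        neigh = [j for j in range(n) if adjacency[i][j] or j == i]
        # Mean of the neighbourhood's feature vectors.
        mean = [
            sum(features[j][d] for j in neigh) / len(neigh)
            for d in range(len(features[0]))
        ]
        # Shared linear projection applied to the aggregated features.
        out.append([
            sum(mean[d] * weight[d][k] for d in range(len(mean)))
            for k in range(len(weight[0]))
        ])
    return out


# Toy path graph 0 - 1 - 2 with one feature per node.
features = [[1.0], [3.0], [5.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
weight = [[2.0]]
print(graph_conv(features, adjacency, weight))  # [[4.0], [6.0], [8.0]]
```

Stacking several such layers lets information travel more than one hop, which is what node-level, edge-level, and graph-level models build on.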

Top 10 AI and ML developer updates from Google I/O 2021 [YouTube Video]
In this video, AI Lead Laurence Moroney gives us the top 10 AI and ML developer updates from this year’s Google I/O. We’re excited with how much the AI and ML ecosystems have grown, and we hope these updates will help you solve any future challenges you face...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
