Data Science Weekly

Jul 08, 2021

Issue #398 July 08 2021

Editor Picks

Apple Music Stream Data Analysis
In this project, we are going to explore my personal music streaming data from Apple Music...

Languages don’t all have the same number of terms for colors – scientists have a new theory why
The ways that languages categorize color vary widely. Nonindustrialized cultures typically have far fewer words for colors than industrialized cultures. So while English has 11 words that everyone knows [black, white, red, green, yellow, blue, brown, orange, pink, purple and gray], the Papua-New Guinean language Berinmo has only five, and the Bolivian Amazonian language Tsimane’ has only three words that everyone knows, corresponding to black, white and red...The goal of our project was to understand why cultures vary so much in their color word usage...

Causal Inference in the Wild: Elasticity Pricing
Causal inference is a hot topic in machine learning, and there are many excellent primers on the theory of causal inference available. But far fewer examples of real-world applications of machine-learning-powered causal inference exist. This article introduces one such example from an industry context, using a (public) real-world dataset. It is aimed at a technical audience with an understanding of the basics of causality...Specifically, I will look at the “ideal” scenario of price elasticity estimation. This scenario is highly relevant for retailers wishing to optimize prices, for example to better manage inventory or just to be more competitive...

A Message from this week's Sponsor:

The Vector Database

Pinecone is a fully managed vector database that makes it easy to add vector similarity search to production applications. It combines state-of-the-art vector search libraries, advanced features such as live index updates, and distributed infrastructure to provide high performance and reliability at any scale. No more hassles of benchmarking and tuning algorithms or building and maintaining infrastructure for vector search.

Advanced ML teams use vector search to drastically improve results for semantic text search, image/audio search, recommendation systems, feed ranking, abuse/fraud detection, deduplication, and other applications.

3 reasons to try Pinecone:

It's production-ready: Go to production with a few lines of code, without breaking a sweat or slowing down.
It's scalable and high-performing: Search through billions of vectors in tens of milliseconds.
It's fully managed: We obsess over operations and security so you don't have to.

Try Pinecone now for free →

PS — Get a free t-shirt after you run your first query!

Data Science Articles & Videos

Building a data team at a mid-stage startup: a short story
I guess I should really call this a parable...The backdrop is: you have been brought in to grow a tiny data team (~4 people) at a mid-stage startup (~$10M annual revenue)...It's a made-up story based on n-th hand experiences (for n ≤ 3), and quite opinionated. It's a story about teams and organization, not the tech itself. As a minor note, I deliberately use the term “data scientist” to mean something very broad...

Learn you a Kedro: Write reproducible, maintainable and modular data science code
In this article, I introduce Kedro, an open-source Python framework for creating reproducible, maintainable and modular data science code. After a brief description of what it is and why it is likely to become a standard part of every data scientist’s toolchain, I describe some technical Kedro concepts and illustrate how to use them with a tutorial...

The Best Things in Life Are Model Free
I am not blindly against the model-free paradigm. In fact, the most popular methods in core control systems are model free! The most ubiquitous control scheme out there is PID [proportional integral derivative] control, and PID has only three parameters. I’d like to use this post to briefly describe PID control, explain how it is closely connected to many of the most popular methods in machine learning, and then turn to explain what PID brings to the table over the model-free methods that drive contemporary RL research...
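To make the "only three parameters" point concrete, here is a minimal sketch of a textbook discrete PID controller. It is not taken from the linked post: the class name, the gain values, the time step and the toy first-order plant (dx/dt = -x + u) are all assumptions chosen purely for illustration.

```python
class PID:
    """Textbook discrete PID controller with three gains: kp, ki, kd."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None  # no derivative estimate exists on the first step

    def step(self, error):
        # Integral term: running sum of the error over time.
        self.integral += error * self.dt
        # Derivative term: finite difference of the error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(setpoint=1.0, steps=2000, dt=0.01):
    """Drive a hypothetical plant dx/dt = -x + u toward the setpoint."""
    pid = PID(kp=2.0, ki=2.0, kd=0.1, dt=dt)  # illustrative gains, not tuned
    x = 0.0
    for _ in range(steps):
        u = pid.step(setpoint - x)   # control signal from the current error
        x += dt * (-x + u)           # explicit Euler step of the plant
    return x


final_state = simulate()
```

Note that the controller never sees a model of the plant: it reacts only to the measured error, which is the sense in which PID is model free.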

Scottish People Are More Inclined to Skip the Gym and Watch the Footy Than English People
England and Scotland's much-anticipated clash in the Euro 2020 group stages culminated in a lacklustre 0-0 draw. But, as someone who has been gathering occupancy data from one of the UK's largest gym providers, the encounter presented a unique opportunity to see how the nations' football fans adjusted their exercise routines to accommodate the occasion. Although both nations' gym usage dropped during the match, the effect was greater for the Scots and more pronounced throughout the day than for English gymgoers...

Vaccine Update System for Pakistan
To be honest, it's pretty hard to find data on vaccine progress, especially time-based data, for a country like Pakistan. So I created this small but interactive notebook that will keep updating the database until everyone is vaccinated. In this project I have used Pandas for easy web scraping to get the data from pharmaceutical-technology.com, then I created a SQLite3 database to store the data in three tables...

Evaluating Large Language Models Trained on Code
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot...

Better computer vision models by combining Transformers and convolutional neural networks
We’ve developed a new computer vision model called ConViT, which combines two widely used AI architectures — convolutional neural networks (CNNs) and Transformer-based models — in order to overcome some important limitations of each approach on its own. By leveraging both techniques, this vision Transformer-based model can outperform existing architectures, especially in the low data regime, while achieving similar performance in the large data setting...

Parallelizing neural networks on one GPU with JAX
In this article, I describe how to get your money’s worth by training dozens of networks at once. As you follow along, we’ll efficiently train dozens of small neural networks in parallel on a single GPU using the vmap function from JAX. Whether you are training ensembles, sweeping over hyperparameters, or averaging across random seeds, this technique can give you a 10x-100x improvement in computation time. If you haven’t tried JAX yet, this may give you a reason to...

Multimodal Shape Completion via IMLE
Shape completion is the problem of completing partial input shapes such as partial scans. This problem finds important applications in computer vision and robotics due to issues such as occlusion or sparsity in real-world data. However, most of the existing research related to shape completion has been focused on completing shapes by learning a one-to-one mapping which limits the diversity and creativity of the produced results. We propose a novel multimodal shape completion technique that is effectively able to learn a one-to-many mapping and generates diverse complete shapes. Our approach is based on the conditional Implicit Maximum Likelihood Estimation (IMLE) technique wherein we condition our inputs on partial 3D point clouds...

CNN Heat Maps: Class Activation Mapping (CAM)
This is the first post in an upcoming series about different techniques for visualizing which parts of an image a CNN is looking at in order to make a decision. Class Activation Mapping (CAM) is one technique for producing heat maps to highlight class-specific regions of images...

Training*

Sharpen your data skills by solving 3 questions per week – for free

Get data science interview questions frequently asked at top companies every Monday, Wednesday & Friday. Solve the problem before receiving the solution the next morning. Check your work and sharpen your skills! Join our free newsletter.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Scientist - WarnerMedia - New York, NY

WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

Reporting to the Sr. Manager, Data Science, this role will help develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Guide to Cross-Validation and Hyperparameter Search
We shall cover the following in this post: 1) How to use cross-validation to evaluate a Machine Learning model, 2) How to use cross-validation to search for best model hyperparameters, 3) How to use grid search for hyperparameter search, 4) How to search more efficiently for hyperparameters using randomized search...
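As a rough illustration of items 1 to 3, the sketch below runs k-fold cross-validation inside a grid search using only the Python standard library. It is not code from the linked guide: the one-feature ridge model, the synthetic data, the candidate grid and every function name are assumptions made for this example.

```python
import random


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y = w * x (no intercept) with penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out MSE over k folds for one hyperparameter value."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / len(scores)


def grid_search(xs, ys, grid, k=5):
    """Score every candidate in the grid and return the one with lowest CV error."""
    results = {lam: cross_val_mse(xs, ys, lam, k) for lam in grid}
    return min(results, key=results.get), results


# Synthetic data (an assumption): y = 3x plus a little Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]

best_lam, cv_results = grid_search(xs, ys, [0.0, 0.1, 1.0, 10.0])
```

A randomized search (item 4) would keep `cross_val_mse` unchanged and replace the fixed grid with candidates drawn at random from a range.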

Machine Learning for Beginners - A Curriculum
Azure Cloud Advocates at Microsoft are pleased to offer a 12-week, 24-lesson curriculum all about Machine Learning. In this curriculum, you will learn about what is sometimes called classic machine learning, using primarily Scikit-learn as a library and avoiding deep learning, which is covered in our forthcoming 'AI for Beginners' curriculum. Pair these lessons with our forthcoming 'Data Science for Beginners' curriculum, as well!...

Gradient Pseudo-swap
When a neural network we want to train with gradient descent has layers whose gradients are not smooth enough to use, we can employ a "gradient pseudo-swap": for forward propagation we use the original layer (with the bad gradients), but for back propagation we use a variation of the layer that has the same parameters we want to train and a smooth gradient...Let's see this by example...
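One way to picture the idea, as a hand-rolled sketch rather than the linked post's own example: a single neuron whose forward pass uses a hard step function (zero gradient almost everywhere), while the backward pass borrows the derivative of a sigmoid applied to the same pre-activation. The toy dataset, learning rate, epoch count and function names below are all assumptions.

```python
import math


def hard_step(z):
    """Forward-pass activation: its true gradient is zero almost everywhere."""
    return 1.0 if z > 0 else 0.0


def surrogate_grad(z):
    """Backward-pass stand-in: derivative of sigmoid(z), which is smooth."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def train(data, lr=0.5, epochs=20):
    """Fit y = hard_step(w * x + b) using the swapped gradient."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            z = w * x + b
            y = hard_step(z)                  # forward: original layer
            dloss_dy = 2.0 * (y - target)     # squared-error loss
            dz = dloss_dy * surrogate_grad(z) # backward: smooth variant
            w -= lr * dz * x
            b -= lr * dz
    return w, b


# Toy task (an assumption): output 1 when x is above 0.5, else 0.
data = [(-2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
w, b = train(data)
accuracy = sum(hard_step(w * x + b) == t for x, t in data) / len(data)
```

Both activations share the same parameters `w` and `b`, so updates computed through the smooth variant still move the layer that is actually used at prediction time.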

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
