Data Science Weekly

May 06, 2021

Issue #389

Editor Picks

The San Pellegrino label Moiré effect
Have you noticed the nice Moiré effect in the San Pellegrino label?...Notice the wavy pattern which repeats itself. If you look closely, it looks like it is obtained from the periodic repetition of a wavy curve and a line along two different directions. Hence the beating effect, which corresponds to the two curves being out of phase (the label appears darker because of the higher ink density) or in phase (it appears lighter because the two curves lie one upon the other)...In this notebook, we are going to reverse engineer how this wavy pattern is obtained...

Hard choices: AI in health care
Artificial intelligence will change the health care industry, not least by raising serious moral issues...Two of the most pressing current ethical considerations involve the potential loss of physician autonomy and the unconscious amplification of underlying biases...

Introducing Observable Plot
We're excited to announce Observable Plot, a new open-source library for faster and easier data exploration on the web!...Plot's concise API and thoughtful defaults are designed for a more joyful visualization process...Plot is informed by ten years of maintaining D3 but does not replace it. We continue to support and develop D3, and recommend its low-level approach for bespoke explanatory visualizations and as a foundation for higher-level exploratory visualization tools. In fact, Plot is built on D3! Observable Plot is more akin to Vega-Lite, another great tool for exploration...

A Message from this week's Sponsor:

Online Data Science Programs from Drexel University

Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career.

Learn more.

Data Science Articles & Videos

Practical SQL for Data Analysis: What you can do without Pandas
Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job...SQL databases have been around since the 1970s. Some of the smartest people in the world worked on making it easy to slice, dice, fetch and manipulate data quickly and efficiently. SQL databases have come such a long way that many developers and data scientists have lost track of what they can do with the database they already have!...In this article I demonstrate how to use SQL to perform fast and efficient data analysis...

L'art pour l'art: creating generative art with L-systems in Python
"Would it be possible to create a Python project that can generate those pencil-like drawings with random forests, with different types of trees?", I thought (pun intended). Of course, someone already thought of that and it even has a name: algorithmic botany...Although a lot has been written on the subject and there are quite a few open source libraries, they weren't quite what I have in mind...Instead, the idea of creating generative art in Python emerged using the following components: a) A lightweight implementation of L-systems that also supports stochastic and parametric production rules and b) Integration with p5js via pyp5js as a web-native graphing engine...In this blog post, I will lay out the ideas behind this approach...

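As a rough illustration of the L-system idea mentioned above (not the article's implementation), here is a minimal deterministic sketch. The function name `expand` and the rule dictionary are assumptions made for this example; the rules themselves are Lindenmayer's classic "algae" system. Stochastic and parametric rules, which the article covers, are left out.

```python
def expand(axiom, rules, iterations):
    """Rewrite every symbol of the string in parallel, `iterations` times."""
    current = axiom
    for _ in range(iterations):
        # Symbols without a production rule are copied unchanged.
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Lindenmayer's algae system: A -> AB, B -> A (illustrative rule set).
algae_rules = {"A": "AB", "B": "A"}

if __name__ == "__main__":
    print(expand("A", algae_rules, 4))  # prints ABAABABA
```

A drawing front end would then interpret each symbol of the expanded string as a turtle-graphics command, such as "move forward" or "turn".
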
Is Natural Language Processing Ready to Take on Legal Hearings?
Every year, California holds thousands of parole hearings for eligible prisoners...In each of those hearings, a 150-page transcript of the entire conversation is produced for the government and public to review. And most likely, that transcript will never be read...Machine learning opens the opportunity to devise a new approach: What if we could “read” thousands of hearing transcripts within minutes, writing out the most important factors for each case?...This approach would center on human discretionary judgment and use technology to ensure transparency and consistency...We call this the “Recon Approach” and believe it has applications well beyond parole...

Synthetic Data Generation Using Gaussian Mixture Model
At a conceptual level, synthetic data is not real data, but data that has been generated from real data and that has the same statistical properties as the real data. This means that if an analyst works with a synthetic dataset, they should get analysis results similar to what they would get with real data...In this notebook, we first look at the differences between KMeans and GMM (Gaussian Mixture Model) as clustering algorithms, and then see the power of GMM when used as a density estimator...
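
To make the "density estimator" idea concrete, here is a minimal sketch of the sampling half only: once a one-dimensional Gaussian mixture has been fitted, synthetic points can be drawn by picking a component according to its weight and then sampling from that component. The function name `sample_mixture` and the mixture parameters below are made up for illustration; the notebook itself presumably fits them from real data.

```python
import random


def sample_mixture(weights, means, stds, n, seed=0):
    """Draw n synthetic 1-D points from a Gaussian mixture.

    weights, means and stds describe the components; in practice they
    would come from a fitted model (hypothetical values are used below).
    """
    rng = random.Random(seed)
    # Step 1: choose a component index for every sample, by weight.
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    # Step 2: sample from the chosen component's normal distribution.
    return [rng.gauss(means[c], stds[c]) for c in components]


if __name__ == "__main__":
    synthetic = sample_mixture([0.3, 0.7], [0.0, 10.0], [1.0, 2.0], n=10_000)
    # The sample mean should land near 0.3 * 0 + 0.7 * 10 = 7.
    print(sum(synthetic) / len(synthetic))
```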

Evolution of random number generators
Prime powers have a kinda chaotic pattern in their bits. We bootstrap this idea into a good random number generator, recapitulating history...

The art of solving problems with Monte Carlo simulations
This article will explore some examples and applications of Monte Carlo simulations using the Go programming language. To keep this article fun and interactive, after each Go code snippet, you will find a link to the Go Playground, where you can run it without installing Go on your machine...Put your adventure helmets on!...

Do Wide and Deep Networks Learn the Same Things?
In “Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth”, we perform a systematic study of the similarity between wide and deep networks from the same architectural family through the lens of their hidden representations and final outputs. In very wide or very deep models, we find a characteristic block structure in their internal representations, and establish a connection between this phenomenon and model overparameterization...

Case Study: How Your Course Can Incorporate the Reproducibility Challenge
The Machine Learning Reproducibility Challenge (MLRC) is an event hosted by Papers with Code designed to encourage the publishing and sharing of reproducible scientific results in machine learning (ML)...The University of Amsterdam incorporated the MLRC into a graduate level course for students in the Master AI study program...All in all, this was a great experience both for students and TAs, with 9 papers accepted at the MLRC...

DriveGAN: Towards a Controllable High-Quality Neural Simulation
Realistic simulators are critical for training and verifying robotics systems. While most of the contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel-space, by watching unannotated sequences of frames and their associated action pairs. We introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision...

Parsing Petabytes, SpaceML Taps Satellite Images to Help Model Wildfire Risks
Teams of experts and citizen scientists help build image classifiers from satellite imagery of Earth to spot signs of natural disasters...

Tools*

Similarity Search: An Introduction from Pinecone

Similarity search (or "vector search") is a new method of searching through big data. Unlike traditional search methods, it indexes and searches through vector representations of data. It uses a combination of deep learning models and state-of-the-art algorithms to find items by their conceptual meanings rather than keywords or properties.

The ability to search for similar items, and not just exact matches, makes many tasks as easy as an API call:

Show recommended products to customers
Show recommended content to users
Personalize search results
Deduplicate documents
Match records
Search by image, audio, or video
Detect anomalies
Answer questions
And much more...

Learn more about similarity search then deploy your own similarity search application with a few lines of code using Pinecone.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Experimental Behavioral Scientist - BetterUp, Inc. - US-based, remote

BetterUp is a mobile-based coaching platform that brings personalized professional coaching to employees at all levels. We help managers lead better, teams perform better, and employees thrive personally and inspire professionally.

We are seeking an experimental behavioral scientist to join our team. In this role, you will direct a portfolio of original research to answer an essential question: What makes people happy and flourishing at work?

You’ll draw on your experience as an experimental social scientist, statistician, and lover of all things Data, to uncover groundbreaking findings at an epicenter of human experience: life at work. Your work will inform BetterUp products, inspire our customers, inform the broader scientific community, and amplify BetterUp’s reputation as a global thought-leader.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Introduction to Reinforcement Learning
In this tutorial, we aim to provide readers with a high-level overview of the fundamentals of RL as well as example code in Python, introducing the OpenAI Gym library. We begin with building intuitions about what is considered an RL problem and we introduce formal definitions as well as key terminologies that are used to describe and model an RL application. In parallel, we will focus on solving a concrete example of an RL problem (CartPole) using a classic RL algorithm called Q-learning. The fundamentals of Q-learning presented in this tutorial were key to teaching neural networks to play Atari games, as demonstrated by DeepMind in 2013. At the end of the tutorial, we present references for recommended further reading...
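
For readers who have not seen it, the core of tabular Q-learning is a single update rule. The sketch below is not taken from the tutorial: the function name `q_update`, the dictionary-based table, and the state labels are assumptions for illustration. CartPole observations are continuous, so a tabular approach would also need to discretize them first, which is omitted here.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place; return the new value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)   # unseen (state, action) pairs start at 0.0
    actions = (0, 1)         # e.g. push the cart left or right
    q[("s1", 1)] = 2.0       # hypothetical value already learned for s1
    # Target is 1.0 + 0.9 * 2.0 = 2.8, so Q(s0, 0) moves halfway to it.
    print(q_update(q, "s0", 0, reward=1.0, next_state="s1",
                   actions=actions, alpha=0.5, gamma=0.9))
```

In a full training loop this update would run after every environment step, with actions chosen by an exploration strategy such as epsilon-greedy.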

Is the concept of an 'epoch' being phased out, or even harmful? [ Reddit Discussion ]
I've noticed a trend that more papers are reporting the number of training "steps" rather than epochs. It kinda makes sense since datasets now vary in size from the dozens to the billions. This got me thinking: is the concept of an epoch potentially harmful, as it enforces the idea of the data as something finite?...

Paper Explained - Why AI is Harder Than We Think (Full Video Analysis) [Video]
The AI community has gone through regular cycles of AI Springs, where rapid progress gave rise to massive overconfidence, high funding, and overpromise, followed by these promises being unfulfilled, subsequently diving into periods of disillusionment and underfunding, called AI Winters. This paper examines the reasons for the repeated periods of overconfidence and identifies four fallacies that people make when they see rapid progress in AI...

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
