Data Science Weekly - Issue 381
Issue #381 Mar 11 2021
Editor Picks
The Data Visualizations Behind COVID-19 Skepticism
How do COVID-19 skeptics use public health data and social media to advocate for reopening the economy and against mask mandates?..We studied half a million tweets, over 41,000 visualizations, and spent six months lurking in anti-mask Facebook groups...Here’s what we found...
Announcing the 2021 AI Index Report
This latest edition significantly expands the amount of data available in the report, which was drawn from a broader set of academic, private, and non-profit organizations for calibration. The report also shows the effect of COVID-19 on AI development from multiple perspectives, including how AI helps with COVID-related drug discovery and the effect of the pandemic on hiring and private investment...
State-of-the-Art Image Generative Models
I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. The overall development is summarized, and future trends are speculated on. Many of the statements and the results here are easily applicable to other non-textual modalities, such as audio and video...
A Message from this week's Sponsor:
DOLT IS GIT FOR DATA
Having trouble managing and collaborating on your data with other data scientists on your team?
Dolt is a Git for data tool that merges MySQL and Git. Dolt databases can be forked, cloned, and merged just like Git repositories. That means multiple people can work on the same dataset without stomping on each other's changes.
Dolts can then be uploaded to Dolthub.com where they can be shared seamlessly. Dolthub makes collaboration possible for anyone who wants to have access to your data.
If you're interested please join our Discord here.
Data Science Articles & Videos
Street Network Models and Indicators for Every Urban Area in the World
Cities worldwide exhibit a variety of street network patterns and configurations that shape human mobility, equity, health, and livelihoods. This study models and analyzes the street networks of every urban area in the world, using boundaries derived from the Global Human Settlement Layer...It makes four contributions. First, it reports the methodological advances of this open‐source workflow. Second, it produces an open data repository containing street network models for each urban area. Third, it analyzes these models to produce an open data repository containing street network form indicators for each urban area. No such global urban street network indicator data set has previously existed. Fourth, it presents a summary analysis of urban street network form, reporting the first such worldwide results in the literature...
Next Raspberry Pi CPU Will Have Machine Learning Built In
At the recent tinyML Summit 2021, Raspberry Pi co-founder Eben Upton teased the future of 'Pi Silicon' and it looks like machine learning could see a massive improvement thanks to Raspberry Pi's new in-house chip development team...
Out of Distribution Generalization in Machine Learning
Machine learning has achieved tremendous success in a variety of domains in recent years. However, a lot of these success stories have been in places where the training and the testing distributions are extremely similar to each other. In everyday situations when models are tested on slightly different data than they were trained on, ML algorithms can fail spectacularly. This research attempts to formally define this problem, what sets of assumptions are reasonable to make in our data and what kind of guarantees we hope to obtain from them. Then, we focus on a certain class of out of distribution problems, their assumptions, and introduce simple algorithms that follow from these assumptions that are able to provide more reliable generalization...
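To make the train/test gap described above concrete, here is a toy sketch (ours, not taken from the research). Everything in it is an assumption chosen for illustration: the quadratic target, the input ranges, and the helper names fit_line and mse. A straight line fitted on inputs in [0, 1] looks fine on held-out points from the same range and degrades badly on inputs in [2, 3].

```python
def target(x):
    # Hypothetical ground-truth function; picked only because a line
    # approximates it well on [0, 1] and poorly elsewhere.
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(slope, intercept, xs):
    """Mean squared error of the fitted line against the target on xs."""
    return sum((slope * x + intercept - target(x)) ** 2 for x in xs) / len(xs)


# Training inputs cover [0, 1] only.
train_x = [i / 20 for i in range(21)]
slope, intercept = fit_line(train_x, [target(x) for x in train_x])

# Held-out points from the same range (in-distribution) ...
in_dist_x = [i / 20 + 0.025 for i in range(20)]
# ... versus points from a shifted range (out-of-distribution).
out_dist_x = [2 + i / 20 for i in range(21)]

in_dist_mse = mse(slope, intercept, in_dist_x)
out_dist_mse = mse(slope, intercept, out_dist_x)

print(f"in-distribution MSE:     {in_dist_mse:.4f}")
print(f"out-of-distribution MSE: {out_dist_mse:.4f}")
```

The model itself is unchanged between the two evaluations; only the input range differs, which is the situation the abstract describes.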
A New Lens on Understanding Generalization in Deep Learning
Understanding generalization is one of the fundamental unsolved problems in deep learning. Why does optimizing a model on a finite set of training data lead to good performance on a held-out test set? This problem has been studied extensively in machine learning, with a rich history going back more than 50 years...In “The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers”...we find that models that train quickly on infinite data are the same models that generalize well if they are instead trained on finite data. This connection brings new perspectives on design choices in practice, and lays a roadmap for understanding generalization from a theoretical perspective...
Meta Learning Backpropagation And Improving It
Many concepts have been proposed for meta learning with neural networks (NNs), e.g., NNs that learn to control fast weights, hyper networks, learned learning rules, and meta recurrent NNs. Our Variable Shared Meta Learning (VS-ML) unifies the above and demonstrates that simple weight-sharing and sparsity in an NN is sufficient to express powerful learning algorithms (LAs) in a reusable fashion...
Accelerating Neural Networks on Mobile and Web with Sparse Inference
On-device inference of neural networks enables a variety of real-time applications, like pose estimation and background blur, in a low-latency and privacy-conscious way. Using ML inference frameworks like TensorFlow Lite with XNNPACK ML acceleration library, engineers optimize their models to run on a variety of devices by finding a sweet spot between model size, inference speed and the quality of the predictions...One way to optimize a model is through use of sparse neural networks [1, 2, 3], which have a significant fraction of their weights set to zero. In general, this is a desirable quality as it not only reduces the model size via compression, but also makes it possible to skip a significant fraction of multiply-add operations, thereby speeding up inference...
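As a rough sketch of why zeroed weights allow multiply-adds to be skipped (this is a generic compressed-sparse-row layout in plain Python, not the XNNPACK or TensorFlow Lite implementation; the function names to_csr, csr_matvec and dense_matvec are ours):

```python
def to_csr(dense):
    """Store only the non-zero weights, row by row (compressed sparse row)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r's non-zeros end in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that touches only stored (non-zero) weights.

    Returns the output vector and the number of multiply-adds performed.
    """
    out, macs = [], 0
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
            macs += 1
        out.append(acc)
    return out, macs


def dense_matvec(dense, x):
    """Reference dense product: one multiply-add per weight, zero or not."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in dense]


# A 3x4 weight matrix in which 9 of the 12 weights are zero.
weights = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = to_csr(weights)
sparse_out, sparse_macs = csr_matvec(values, col_idx, row_ptr, x)
dense_out = dense_matvec(weights, x)
dense_macs = len(weights) * len(x)

print(sparse_out, sparse_macs, dense_macs)
```

Both products give the same result, but the sparse version stores and multiplies only the three non-zero weights instead of all twelve. Real speedups depend on the sparsity pattern and on kernel support, which this sketch does not model.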
Multimodal Neurons in Artificial Neural Networks
We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn...
ploomber - A convention over configuration workflow orchestrator
Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form and Ploomber will automatically construct the pipeline for you. Tasks can be anything from Python functions, Jupyter notebooks, Python/R/shell scripts, and SQL scripts...Once your pipeline is constructed, you'll be equipped with lots of development features to experiment faster. When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes...Here's what a pipeline task looks like...
Narratives and Counternarratives on Data Sharing in Africa
Many argue that data sharing can support research and policy design to alleviate poverty, inequality, and derivative effects in Africa. Despite the fact that the datasets in question are often extracted from African communities, conversations around the challenges of accessing and sharing African data are too often driven by non-African stakeholders...we discuss issues arising from power imbalances resulting from the legacies of colonialism, ethno-centrism, and slavery, disinvestment in building trust, lack of acknowledgement of historical and present-day extractive practices, and Western-centric policies that are ill-suited to the African context. After outlining these problems, we discuss avenues for addressing them when sharing data generated in the continent...
Why Production Machine Learning Fails — And How To Fix It
Discussing applications of ML in theory is much different than actually applying ML models at scale in production. In this article, we walk through common challenges and corresponding solutions to making ML a force multiplier for your data organization...
Training*
Quick Question For You: Do you want a Data Science job?
After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:
Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Data Scientist - HelloFresh - Chicago, IL or New York, NY
Embedded in the NYC Tech Hub, we are building a cross-functional team of data scientists, analysts and engineers with the mission to bring the modeling and analytical capabilities of our marketing organization to the next level.
As a Data Scientist, you will support the analytic needs of our Growth organization comprising Technology, Digital Product and Marketing. You will play a pivotal role in helping us continue to succeed as the leading global meal kit provider. This role will solve challenging problems using vast repositories of customer data to provide detailed and actionable insights; core responsibilities include the development and automation of Marketing BI tools, predictive modeling, professional-grade dashboarding and reporting for some of our most critical initiatives and enhancing and facilitating the information extraction process...
Want to post a job here? Email us for details >> team@datascienceweekly.org
Training & Resources
First Principles of Computer Vision
This [free, online] lecture series on computer vision is presented by Shree Nayar, T. C. Chang Professor of Computer Science at Columbia Engineering. It has been designed for students, practitioners and enthusiasts who have no prior knowledge of computer vision...
Data Science for Psychologists
This [free, online] book provides an introduction to data science that is tailored to the needs of psychologists, but is also suitable for students of the humanities and other biological or social sciences. This audience typically has a basic familiarity with statistics, but rarely an idea how data is prepared and shaped for being analyzed and tested...The materials in this book are based on a course at the University of Konstanz in 2020/2021. The course...provides an introduction to data science in R (R Core Team, 2020) from a tidyverse (Wickham, Averick, et al., 2019) perspective. Book and course are supported by the R package ds4psy (Neth, 2020), which provides datasets and functions used in the examples and exercises...
The Mathematical Engineering of Deep Learning
In this [free, online] course we focus on the mathematical engineering aspects of deep learning. For this we survey and investigate the collection of algorithms, models, and methods that allow the statistician, mathematician, or machine learning professional to use deep learning methods effectively. Many machine learning courses focus either on the practical aspects of programming deep learning, or alternatively on the full development of machine learning theory, only presenting deep learning as a special case. In contrast, in this course, we focus directly on deep learning methods, building an understanding of the engineering mathematics that drives this field...
Books
Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian