Data Science Weekly - Issue 557
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #557
July 25, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
AI models collapse when trained on recursively generated data
Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs)…
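For intuition, here is a toy sketch of the effect. It is not the paper's experimental setup: it assumes a single 1-D Gaussian stands in for the "model", and the function name, sample size, and generation count are illustrative choices. Each generation is fitted only to samples drawn from the previous generation's fit, and the fitted spread tends to shrink until the tails of the original distribution are gone.

```python
import random
import statistics


def simulate_collapse(generations=500, sample_size=10, seed=0):
    """Refit a 1-D Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the original data distribution N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original distribution
    history = [sigma]
    for _ in range(generations):
        # Train only on data produced by the previous generation's model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 100, 500):
        print(f"generation {gen:>3}: fitted std = {history[gen]:.3g}")
```

The small sample size is deliberate: estimation noise compounds across generations, which is what drives the spread toward zero in this toy setting.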
Evolution of the Italian pasta ripiena: the first steps toward a scientific classification
In this study, phylogenetic and biogeographic methods are used to investigate the evolutionary relationships between various types of Italian pasta ripiena (filled pasta) and related representatives from across Eurasia, using information from their geography, shape, content and cooking methods. Our results showed that, with the exception of the Sardinian Culurgiones, all the other pasta ripiena from Italy likely had a single origin in the northern parts of the country. Based on the proposed evolutionary hypothesis, the Italian pasta are divided into two main clades: a ravioli clade mainly characterized by a more or less flat shape, and a tortellini clade mainly characterized by a three-dimensional shape. The implications of these findings are further discussed…
Applied Machine Learning for Tabular Data
We want to create a practical guide to developing quality predictive models from tabular data…The book takes a holistic view of the predictive modeling process and focuses on a few areas that are usually left out of similar works. For example, the effectiveness of the model can be driven by how the predictors are represented. Because of this, we tightly couple feature engineering methods with machine learning models. Also, quite a lot of work happens after we have determined our best model and created the final fit. These post-modeling activities are an important part of the model development process and will be described in detail…
A Sponsor Message:
Magical tools for working with data
Building a Big Picture Data Team at StubHub
See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex.
The challenges she faced may sound familiar:
Unclear SMEs meant questions went to multiple people
Without SLAs, answer times were too long
Lack of data modeling & source-of-truth metrics generated varying results
Lack of discoverability & reproducibility cost time, efficiency and accuracy
Static reporting reserved interactivity for rare occasions
Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights…
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Building A Generative AI Platform
After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but certain applications might deviate. This is what the overall architecture looks like…
Understanding how the KernelDensityEstimator works
Histograms are great for getting a first impression of the density of a dataset. But they do have some flaws…This video will highlight an alternative that might be better suited for algorithmic purposes…
The Many Lives of Null Island
At risk of ruining the secret for you, Null Island is a long-running inside joke among cartographers. It is an imaginary island located at a real place: the coordinates of 0º latitude and 0º longitude, a location in the Atlantic Ocean off the coast of Africa where the Prime Meridian meets the Equator, hundreds of miles from any real dry land…Null Island is not just a silly place to think about when cartographers are bored, it is a phenomenon that repeatedly and annoyingly asserts itself in the middle of day-to-day cartographic work, often when you least expect it…
If you could only use 3 different file formats for the rest of your career. Which would you choose? [Reddit Discussion]
I would have to go with .parquet, .json, and .xml. Although I do think there is an argument for .xls or else I would just have to look at screen shares of what business analysts are talking about…
Julia for Economists Bootcamp
I taught a series of instructional Julia sessions at Stanford's GSB (Graduate School of Business). Each month's session was two to four hours of lectures, practical examples, and guided projects tailored towards economics research computing using Julia. Each month covers a different topic and can be attended in isolation, though the first session covers the basics of Julia and may be useful for more advanced sessions if you are not currently familiar with Julia…
The Elegance of the ASCII Table
If you’ve been a programmer or programming-adjacent nerd for a while, you’ll have doubtless come across an ASCII table…An ASCII table is useful. But did you know it’s also beautiful and elegant…
Counterfactuals in Language AI with open source language models and LLMs
In this article I’ll reflect on how counterfactuals might help us think differently about the pitfalls and potentials of Generative AI. And I’ll demonstrate with some concrete examples using open source LMs (specifically Microsoft’s Phi). I’ll show how to set up Ollama locally (it can also be done in Databricks), without too much fuss (both with and without a Docker container), so you can try it out for yourself. I’ll also compare OpenAI’s LLM response to the same prompts…
Splink 4.0.0 released
- Splink is designed for linking and deduplicating large datasets quickly and accurately. It solves the common problem of having multiple records pertaining to the same person, but with typos, missing data etc.
- After 4 years of work, it's the fastest, and probably the most accurate, free solution to this problem that works on very large datasets.
- Splink is in use by a variety of government and private sector orgs, and in academia.
- Splink is one of the most popular open source libraries published by the UK government, with 1.2k GitHub stars and >8 million downloads…
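To make the deduplication problem concrete, here is a minimal sketch using only Python's standard library. It does not use Splink or reproduce its method, which is considerably more sophisticated. The function names, the sample records, and the 0.85 threshold are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences vanish."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(names, threshold=0.85):
    """Greedily group names that resemble a cluster's first member.

    The threshold is an illustrative assumption, not a recommended value.
    """
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    records = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "maria  garcia"]
    for cluster in deduplicate(records):
        print(cluster)
```

Comparing every record against every cluster scales poorly, which is one reason dedicated tools matter on very large datasets.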
Welcome to “Feature Engineering A-Z”! This book is written to be used as a reference guide to nearly all feature engineering methods you will encounter…Each section tries to be as comprehensive as possible with the number of different methods and solutions that are presented. A section on dimensionality reduction should list all the practical methods that could be used, as well as a comparison between the methods to help the reader decide what would be most appropriate. This does not mean that all methods are recommended to use. A number of these methods have little and narrow use cases. Methods that are deemed too domain-specific have been excluded from this book…
Where’s my train?
Yesterday I presented a webinar for PyMC Labs where I solved one of the exercises from Think Bayes, called “The Red Line Problem”…the scenario: how can Bayesian inference help predict my wait time on a train platform and when should I decide to give up and take a taxi…I used this exercise to demonstrate a process for developing and testing Bayesian models in PyMC. The solution uses some common PyMC features, like the Normal, Gamma, and Poisson distributions, and some less common features, like the Interpolated and StudentT distributions…
Is scientific machine learning actually used in practice? [Reddit]
As someone whose background straddles both scientific computing and machine learning, I hear a lot about scientific machine learning (SML). The promise is that one can use machine learning to either speed up, simplify or otherwise improve numerical models…My question is, is scientific machine learning actually used in practice (industry)? Can anyone point to any real-world examples? Any companies that actually use this technology? If not, I would love to hear suggestions of why it seemingly doesn't provide any value to the market (at least for now)...
Deep Learning for Economists
Deep learning provides powerful methods to impute structured information from large-scale, unstructured text and image datasets. For example, economists might wish to detect the presence of economic activity in satellite images, or to measure the topics or entities mentioned in social media, the congressional record, or firm filings. This review introduces deep neural networks, covering methods such as classifiers, regression models, generative AI, and embedding models. Applications include classification, document digitization, record linkage, and methods for data exploration in massive scale text and image corpora…
A Sponsor Message:
Statsig shipped a new feature every day last week
Companies like Notion, OpenAI, Brex, and Anthropic use Statsig to power their experimentation and feature management. We’re always building new products. Check out the five features we shipped last week:
Meta-Analysis: Identify trends across experiments on a team or company level.
Stratified Sampling: Improve test reliability by intelligently assigning users to groups.
Differential Impact Detection: Identify when user groups are impacted differently by an experiment.
Interaction Detection: Run hundreds of concurrent experiments with confidence they don’t interact.
Collaboration: Overcome challenges in scaling a great experimentation platform.
Get up to 2M free-tier events and 10K session recordings—all for free.
Training & Resources
A User’s Guide to Statistical Inference and Regression
This book, like many before it, will try to teach you statistics. The field of statistics describes how we learn about the world using quantitative data…We will focus on two key goals:
Understand the basic ways to assess estimators
Apply these ideas to the estimation of regression models…
Linear Algebra for Data Science
Here’s a new textbook on linear algebra, where we re-imagined how and in which order linear algebra could be taught…we realized that (one of the central concepts from linear algebra that is used frequently in practice, if not every day)…were projection, and consequently singular value decomposition (SVD) as well as even less frequently positive definiteness. Unfortunately, we noticed that existing courses on linear algebra often focus much more on the invertibility (or lack thereof), to the point that many concepts are introduced not in the order of their practicality nor usefulness but in the order of the conveniences in mathematical derivations/introductions…
AI in the Sciences and Engineering (2024)
AI is having a profound impact on science by accelerating discoveries across physics, chemistry, biology, and engineering. This course presents a highly topical selection of AI applications across these fields. Emphasis is placed on using AI, particularly deep learning, to understand systems modelled by PDEs, and key scientific machine learning concepts and themes are discussed…
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian