Data Science Weekly - Issue 582

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Jan 17, 2025

January 16, 2025

Hello!

Once a week, we write this email to share the links we found most worthwhile in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Small Data
Returning to Our Roots…We have been data gatherers since the very beginning…

10 takeaways from 10 years of data science for social good
Over the past ten years, we've gotten to see a lot of attempts to apply data science and AI for social impact. On our end, we’ve worked on over 150 projects with 80+ partners including The World Bank, Bill & Melinda Gates Foundation, Candid, Microsoft, IDEO.org, and NASA. We’ve run 75+ data science competitions awarding more than $4.7 million in prizes to impact-minded developers around the world. Through this work we've built up best practices and battle scars. We've seen a lot of practices that accelerate progress, and others that hold it back…
It's a baffling fact about deep learning that model distillation works [X Thread]
method 1
- train small model M1 on dataset D
method 2 (distillation)
- train large model L on D
- train small model M2 to mimic output of L
- M2 will outperform M1
no theory explains this; it's magic…
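The two methods in the thread differ only in what the small model is trained to match. Here is a minimal sketch of that difference in plain Python, assuming the common soft-label formulation of distillation (temperature-softened teacher probabilities compared with the student's via KL divergence). The function names, logits, and temperature are illustrative assumptions and do not come from the thread:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, label):
    """Method 1: cross-entropy against the dataset label only."""
    return -math.log(softmax(student_logits)[label])

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Method 2: KL divergence from the teacher's softened
    distribution to the student's softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits for one example with three classes.
teacher = [4.0, 1.5, 0.2]
student = [2.0, 1.0, 0.5]

print(hard_label_loss(student, label=0))    # signal from the label alone
print(distillation_loss(student, teacher))  # signal from the teacher's full distribution
print(distillation_loss(teacher, teacher))  # zero when the student matches the teacher
```

The hard label carries one bit of information per example (which class is right), while the teacher's distribution also says how plausible each wrong class is. That is one commonly offered intuition, not a settled explanation.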

What’s on your mind

This Week’s Poll:

Your Biggest Data Science Headaches

Whether or not you’re using AI/LLMs, we want to hear what slows you down most.
[Take this quick 5-second poll →]
We’ll share the results next week!

Last Week’s Poll:

Last week, we asked if you’re using AI/LLMs in your workflows and tons of you responded!

56% are already using AI/LLMs.
22% want to, but haven’t started yet.
22% think it’s overhyped.

We’d love to help everyone learn from each other! We’re thinking about hosting casual office hours to discuss challenges in AI/LLMs and data science workflows. Would you join?
[Click here to share what time would work best for you (we might even hold several) →]

Data Science Articles & Videos

New Years resolutions for PyTorch in 2025
In my previous two posts "Ways to use torch.compile" and "Ways to use torch.export", I often said that PyTorch would be good for a use case, but there might be some downsides. Some of the downsides are foundational and difficult to remove. But some... just seem like a little something is missing from PyTorch. In this post, here are some things I hope we will end up shipping in 2025!…
What AI engineers can learn from qualitative research methods in HCI
At first glance, qualitative research and the highly technical, benchmark-oriented world of AI/ML seem to have nothing to do with one another. But in reality, software developers building on AI models could learn a lot from qualitative research methods…
Weak baselines
Gather close as I tell you another cautionary tale about data science…When starting a data science project, be careful about the baseline you’re using to determine success. Choosing a poor baseline (like the original analyst model) can result in thinking that fairly bad outcomes (like the one from the LSTM model) are good, just because they’re better than horrible outcomes. I’ve seen a number of cases in which a simple heuristic handily beat a complicated ML model…
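One way to guard against a weak baseline is to score a trivial heuristic on the same held-out data before celebrating a model. Below is a minimal sketch, assuming a forecasting setup where the heuristic is "repeat the previous value". The numbers and function names are made up for illustration and are not from the article:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def last_value_forecast(series):
    """Naive heuristic: predict each point as the previous observation."""
    return series[:-1]

# Toy series; the first value is only used as input to the heuristic.
series = [100, 102, 101, 105, 107, 106, 110]
actual = series[1:]

# Hypothetical output of a more complicated model for the same points.
model_predictions = [98, 104, 101, 103, 109, 106]

naive_mae = mean_absolute_error(actual, last_value_forecast(series))
model_mae = mean_absolute_error(actual, model_predictions)

# Skill score: above 0 means the model beats the heuristic, below 0 means it loses.
skill = 1 - model_mae / naive_mae

print(f"naive MAE: {naive_mae:.2f}")
print(f"model MAE: {model_mae:.2f}")
print(f"skill vs. naive: {skill:.2f}")
```

In this toy case the skill score is negative: the model loses to the heuristic, which is exactly the situation a poorly chosen baseline would hide.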
How to Train Your Robot
How to Train Your Robot is a long-term side project. I've been working on it in some form for 20 years. My lifetime goal is to make a robot as smart as my pup. It's a long road, but it has a lot of fascinating stops like machine learning, Python development, software engineering, and robotics. Come join me…
One Line Command to Launch a Notebook with Pytorch
uv is changing how accessible Python is for new users.
If you want to try out PyTorch in a Jupyter Notebook, you can install uv and then run this single-line command…
Don't use cosine similarity carelessly
Just as Midas discovered that turning everything to gold wasn't always helpful, we'll see that blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind - matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing style and typos rather than meaning. This post shows you how to be more intentional about similarity and get better results…
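Cosine similarity itself is simple to compute: it measures the angle between two vectors and ignores their length. Here is a minimal sketch in plain Python, with toy vectors standing in for embeddings (the values and names are illustrative assumptions, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]   # same direction, ten times the length
orthogonal = [3.0, 0.0, -1.0]    # dot product with v is zero

print(cosine_similarity(v, scaled))      # approximately 1: length is ignored
print(cosine_similarity(v, orthogonal))  # 0: perpendicular vectors
```

A high score only says that two embeddings point in a similar direction in the model's space. Whether that direction encodes "answers this question" or merely "written in the same style" depends on how the embeddings were trained, which is the caution the post raises.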
15 Regression Models for Continuous Y and Case Study in Ordinal Regression
This chapter concerns univariate continuous Y. There are many multivariable models for predicting such response variables…
text-2-image-Rich-Human-Feedback
Building upon Google's research, Rich Human Feedback for Text-to-Image Generation, we have collected over 1.5 million responses from 152,684 individual humans using Rapidata via the Python API. Collection took roughly 5 days…
A Deep Dive into Memorization in Deep Learning
Want to learn more about how, when and why machine learning, particularly deep learning systems memorize data? By studying memorization, you'll learn more about how machine learning systems really function, along with how privacy works from a technical point-of-view. You'll also be better able to decide how, when and where to use AI systems based on your new learnings…
Language Models: A Guide for the Perplexed
We decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyor…
How Hex builds AI for Data Scientists, with Barry McCardel
AI and data teams in a sense, kind of, do the same thing: make decisions based on data. So how do you build AI that helps data teams do their best work?…Hex was one of the first companies in their space to embrace language models and build code generation features into their data workspace…In this episode of Barrchives, I went deep with Hex’s co-founder and CEO, Barry McCardel, about Hex’s journey towards becoming an AI company…
Fitting TabPFN models in R using reticulate
TabPFN is a foundational model for tabular data, which can be used instead of models like XGBoost and random forest for predictive modelling using tabular data…This tutorial uses the packages reticulate, dplyr, and ggplot2. In addition palmerpenguins and modeldata are used for datasets…
Ace your NLP Interview
These are the questions I practiced to prepare for my interview from time to time. Some of the questions (4-5 of them) are not yet filled in, some have been answered by other people as I felt I couldn’t answer them myself due to lacking knowledge, and some have been structured better into tables by Perplexity…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #581 here.

Cutting Room Floor

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~65,442 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
