Data Science Weekly - Issue 614

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Aug 28, 2025

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

What the Jane Street Capital interns have wrought, 2025 edition
Yet again, we’re at the end of our internship season, and so it’s time to summarize what the interns were up to!…This year, I was recommended a real bumper crop of exciting projects to include. It’s kind of crazy how many great intern projects are out there. To mention a few that I’m not going to have time to cover in detail…

Learning DSPy: The power of good abstractions
You’ve likely seen plenty of posts online by users of DSPy raving about it, but you put it in the back of your mind, saying “sounds interesting… but it seems complex. I don’t really get it”. That was me, in 2024. Fast forward to today, and things look radically different, at least to my eyes. Having spent the last several weeks engaging with the amazing DSPyOSS community on X, and building with DSPy for multiple use cases, it’s finally become clear to me how useful and well-designed it really is, so I’ve decided that it’s worth an entire blog post just explaining it in terms of its abstraction philosophy…
I was wrong about tidymodels and LLMs
Today’s frontier LLMs know much more about tidymodels than I thought…Around this time last year, I started engaging with LLMs more seriously….I felt that LLMs required a good bit of prompting in order to write tidymodels code fluently…In working on a tidymodels assistant in recent months, I had been under the assumption that this was still the case. Now that I’ve put together some evals to measure the assistant’s performance, though, I realize this is no longer the case; today’s frontier LLMs “just know” tidymodels….

Data Science Articles & Videos

Lessons after one year of data science freelancing
About a year ago, I did my last internship and officially finished my master's in applied maths. I had a bit of money on the side and Yan Holtz asked me if I wanted to create an online course on matplotlib with him. My 2 options were:
- apply for jobs (I did a few interviews)
- start freelancing, create the course with Yan and look for clients
Since I felt like I had nothing to lose, I decided to go freelance and to look for a regular job if I ran out of money. I'll start by describing the upsides (money, workload, etc) of this decision, and then the downsides…
347 Applicants for One Data Engineer Position [Reddit]
I was recently the hiring manager for a relatively junior data engineering position. We were looking for someone with 2 YOE. Within minutes of posting the job, we were inundated with qualified candidates - I couldn't believe the number of people with master's degrees applying. We kept the job open for about 4 days, and received 347 candidates. I'd estimate that at least 50-100 of the candidates would've been just fine at the job, but we only needed one. All this to say - it's extremely tough to get your foot in the door right now. You're not alone if you're struggling to find a job. Keep at it!…
Airbnb Data
I work on the data team at AirROI. For a while, we offered free datasets for about 250 cities, but we always wanted to do more for the community. We recently expanded our free public dataset of property and pricing data from ~250 to nearly 1000 global Airbnb markets. As far as we know, this makes it the single largest free Airbnb dataset ever released on the internet. You can browse the collection and download here, no sign-up required…
The Most Important Machine Learning Equations: A Comprehensive Guide
This blog post is designed to be your go-to resource, covering the most critical and “mind-breaking” ML equations—enough to grasp most of the core math behind ML. Each section includes theoretical insights, the equations themselves, and practical implementations in Python, so you can see the math in action. This guide is for anyone with a basic background in math and programming who wants to deepen their understanding of ML…
Distributed/compressed regression rocks
This is a quick post about a relatively new pair of packages meant for fast, SQL-backed regressions with fixed effects in R and Python…Before I was as familiar with R, from the outside I thought that fixest was unbeatably fast for linear regression with fixed effects. Generally, fixest is definitely still king for most applied work. However, the benchmarks I’ll show below suggest that it can be beaten in at least one case (extremely large data), which has gotten me excited about big regressions using SQL backends. I wanted to share the excitement with this brief post…
The Math Behind GANs
In this post, we’ll take a deep dive into the math behind GANs. My primary source of reference is Generative Adversarial Nets by Ian Goodfellow, et al. It is in this paper that Goodfellow first outlined the concept of a GAN, which is why it only makes sense that we commence from the analysis of this paper…
Uncovering Lesser Known Mobile Adtech Domains
AppGoblin has now run over 40k apps in an emulator, tracking millions of API calls to thousands of advertising domains. Unfortunately, some of them are dark, meaning they have no landing page of any kind, and I’m unclear who controls these domains…Let’s see if we can figure them out!…
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others…
Why Stacking Sliding Windows Can't See Very Far
Modern LLMs use sliding window attention for efficiency, but why can't stacking sliding windows see as far as theory suggests? A mathematical exploration of information dilution and the exponential barrier created by residual connections…
Lecture 1 – Course Introduction (MIT How to AI Almost Anything, Spring 2025)
This course will introduce the basic principles of AI (focusing on modern deep learning and foundation models) and how we can apply AI to novel real-world data modalities. In addition, we will introduce the principles of multimodal AI that can process many modalities at once, such as connecting language and multimedia, music and art, sensing and actuation, and more…
What exactly is "prompt engineering" in data science? [Reddit]
I keep seeing people talk about prompt engineering, but I'm not sure I understand what that actually means in practice…Is it just writing one-off prompts to get a model to do something specific? Or is it more like setting up a whole system/workflow (e.g. using LangChain, agents, RAG, etc.) where prompts are just one part of the stack in developing an application?…For those of you working as data scientists:
- Are you actively building internal end-to-end agents with RAG and tool integrations (either external like MCP or creating your own internal files to serve as tools)?
- Is prompt engineering part of your daily work, or is it more of an experimental/prototyping thing?…
Monte Carlo Gradient Estimation in Machine Learning
This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis…We explore three strategies--the pathwise, score function, and measure-valued gradient estimators--exploring their historical development, derivation, and underlying assumptions…
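For readers who want a feel for two of the estimators named above, here is a minimal sketch of our own; it is not taken from the paper. It assumes a toy objective f(x) = x^2 with x drawn from a Gaussian N(mu, sigma^2), for which the gradient of E[f(x)] with respect to mu is exactly 2 * mu. The function names and the toy objective are illustrative assumptions.

```python
import random
from statistics import fmean


def score_function_estimate(mu, sigma, n, rng):
    """Score-function estimator of d/dmu E[x^2], x ~ N(mu, sigma^2).

    Uses the identity d/dmu E[f(x)] = E[f(x) * d/dmu log p(x)],
    where d/dmu log p(x) = (x - mu) / sigma^2 for a Gaussian.
    """
    samples = (rng.gauss(mu, sigma) for _ in range(n))
    return fmean(x * x * (x - mu) / sigma**2 for x in samples)


def pathwise_estimate(mu, sigma, n, rng):
    """Pathwise (reparameterization) estimator of d/dmu E[x^2].

    Writes x = mu + sigma * eps with eps ~ N(0, 1), so that
    d/dmu f(x) = f'(x) * dx/dmu = 2 * x.
    """
    noise = (rng.gauss(0.0, 1.0) for _ in range(n))
    return fmean(2.0 * (mu + sigma * eps) for eps in noise)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    mu, sigma, n = 1.5, 1.0, 200_000
    print("exact gradient :", 2 * mu)
    print("score function :", score_function_estimate(mu, sigma, n, rng))
    print("pathwise       :", pathwise_estimate(mu, sigma, n, rng))
```

Both estimators are unbiased for this toy problem. The pathwise one has noticeably lower variance here, but it needs f to be differentiable, while the score-function one only needs the log-density to be differentiable in mu.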
The Gaussian Distribution is Inevitable (And This Beautiful Principle Proves It)
How the Maximum Entropy Principle reveals the bell curve as the only honest choice when all you know is mean and variance…
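A short sketch of the standard argument behind that claim, in our own notation rather than the linked post's: maximize differential entropy subject to normalization and a fixed mean and variance, then solve the stationarity condition. The multipliers \lambda_0, \lambda_1, \lambda_2 are introduced here for illustration.

```latex
\begin{aligned}
&\max_{p}\; -\int p(x)\,\log p(x)\,dx
\quad\text{s.t.}\quad
\int p(x)\,dx = 1,\;\;
\int x\,p(x)\,dx = \mu,\;\;
\int (x-\mu)^2\,p(x)\,dx = \sigma^2 \\
&\text{Stationarity of the Lagrangian: }
-\log p(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0 \\
&\Rightarrow\; p(x) = \exp\!\big(\lambda_0 - 1 + \lambda_1 x + \lambda_2 (x-\mu)^2\big) \\
&\text{The constraints force } \lambda_1 = 0,\;\; \lambda_2 = -\tfrac{1}{2\sigma^2}
\;\Rightarrow\;
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,
\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
\end{aligned}
```

The exponent is quadratic in x because the only constraints involve x and x^2; a normalizable density of that form is Gaussian, and matching the mean and variance pins down the remaining constants.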

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #613 here.

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself or your organization to ~68,500 subscribers by sponsoring this newsletter. 30-40% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian
