Data Science Weekly - Issue 589
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #589
March 06, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Optimizing Query Performance with Materialized Views with Arun Parthiban
Delivering a great customer experience requires fast, responsive user interfaces, which depend on low-latency query responses from data systems. As a result, optimizing query performance is a key priority for engineering teams. Various techniques can help achieve this, including query optimization, precomputing results, and caching data in memory or storage. One powerful approach is Materialized Views. In this talk, we’ll explore Datadog’s Materialized Views infrastructure and how it enhances query performance while keeping costs low…
Working as a Data Engineer Sucks
At least if you look at the data I pulled from the subreddit r/dataengineering. The data tells a story about mundane tasks, stress, blurred lines between roles, and lack of recognition. But it’s not all despair and nightfall in data land, some engineers seem to thrive in their roles like a byte in a data lake…
The data validation landscape in 2025
What’s going on in the world of data validation? For those of you who don’t know, data validation is the process of checking data quality in an automated or semi-automated way—for example, checking datatypes, checking the number of missing values, and detecting whether there are anomalous numbers…Ultimately, if you want the analysis you’re doing to be accurate, then data validation is a great way to efficiently remove some of the risks of making a mistake…
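To make the three checks named above concrete (datatypes, missing values, anomalous numbers), here is a minimal sketch in plain Python. It is not taken from the linked article: the function name `validate_column`, the 10% missing-value limit, and the use of a median-absolute-deviation score with a 3.5 cutoff are all illustrative assumptions.

```python
from statistics import median


def validate_column(rows, column, expected_type=float,
                    max_missing=0.1, mad_threshold=3.5):
    """Run three basic checks on one column of a list of dict records.

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    values = [row.get(column) for row in rows]

    # Check 1: missing values (None or absent key).
    missing = sum(v is None for v in values)
    present = [v for v in values if v is not None]

    # Check 2: datatypes. Collect values that are not of the expected type.
    wrong_type = [v for v in present if not isinstance(v, expected_type)]

    # Check 3: anomalous numbers, using a robust score based on the
    # median absolute deviation (MAD) so one outlier cannot hide itself.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    anomalies = []
    if numeric:
        med = median(numeric)
        mad = median(abs(v - med) for v in numeric)
        if mad > 0:
            anomalies = [v for v in numeric
                         if 0.6745 * abs(v - med) / mad > mad_threshold]

    missing_fraction = missing / len(values) if values else 0.0
    return {
        "missing": missing,
        "too_many_missing": missing_fraction > max_missing,
        "wrong_type": wrong_type,
        "anomalies": anomalies,
    }


# Made-up example records, not real data.
rows = [
    {"temp": 20.1}, {"temp": 19.8}, {"temp": None}, {"temp": 20.4},
    {"temp": "n/a"}, {"temp": 250.0}, {"temp": 20.0},
]
report = validate_column(rows, "temp")
print(report)
```

On these made-up records the report flags one missing value, one non-numeric entry ("n/a"), and 250.0 as an anomaly. Dedicated validation libraries cover far more than this; the sketch only shows the shape of the checks.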
What’s on your mind
This Week’s Poll:
Two questions, because we got two answers last week.
I asked - what would you value MOST as a paid subscriber?
43% answered “deep dives into new research”
34% answered “curated datasets and code”
Take this quick 5-second poll →
We’ll share the results next week!
Last Week’s Poll:
Data Science Articles & Videos
Understanding Attention in LLMs
There are many excellent AI papers and tutorials that explain the attention pattern in Large Language Models. But this essentially simple pattern is often obscured by implementation details and optimizations. In this post I will try to cut to the essentials. In a nutshell, the attention machinery tries to get at a meaning of a word (more precisely, a token). This should be easy in principle: we could just look it up in the dictionary…
Composing Contracts: An Adventure in Financial Engineering
Financial and insurance contracts do not sound like promising territory for functional programming and formal semantics, but in fact we have discovered that insights from programming languages bear directly on the complex subject of describing and valuing a large class of contracts…
How do you organize your files? [Reddit]
In my current work I mostly do one-off scripts, data exploration, try 5 different ways to solve a problem, and do a lot of testing. My files are a hot mess. Someone asks me to do a project and I vaguely remember something similar I did a year ago that I could reuse but I cannot find it so I have to rewrite it. How do you manage your development work and “rough drafts” before you have a final cleaned up version?…
LLMs in medicine: evaluations, advances, and the future
Large Language Models (LLMs) have shown significant potential for medical applications yet many challenges remain. Let’s talk about the state of LLMs in medicine, how these models are evaluated, how the latest models are improving, and the future of the field…
UCSD CSE 291 - AI Agents
This course will cover the basics of (1) what LLM-based AI Agents actually are; (2) where they can be useful (and where they are not); and (3) how to safely train and deploy an agent for a given virtual domain…
Why I believe that the brain does something like gradient descent
Brains consist of billions of neurons, each connected by synapses that regulate how signals flow between cells…Given this immense complexity, it might seem implausible that brains collectively implement something as conceptually simple as gradient descent. Yet, I believe they do, albeit in a biologically grounded and decentralized manner (as written as a paper with my friend Blake Richards)…
OpenForest: a data catalog for machine learning in forest monitoring
Here, we provide a comprehensive and extensive overview of 86 open-access forest datasets across spatial scales, encompassing inventories, ground-based, aerial-based, satellite-based recordings, and country or world maps. These datasets are grouped in OpenForest, a dynamic catalog open to contributions that strives to reference all available open-access forest datasets. Moreover, in the context of these datasets, we aim to inspire research in machine learning applied to forest biology by establishing connections between contemporary topics, perspectives, and challenges inherent in both domains…
A Conceptual Introduction to Hamiltonian Monte Carlo
Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important. In this review I provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor…
Building AI Apps for Real-World Use Cases: From Basics to Production
In this 3-hour hands-on workshop, Ravin Kumar (DeepMind, Google, ex-Tesla) and Hugo Bowne-Anderson (Vanishing Gradients) will show you how to build and improve a simple AI application, simulating the complete lifecycle of AI app development…
🔬 Defog Introspect: Deep Research for your internal data
Introspect is a service that does data-focused deep research for structured data. It understands your structured data (databases or CSV/Excel files), unstructured data (PDFs), and can query the web to get additional context…
Every pod eviction in Kubernetes, explained
There are so many ways Kubernetes terminates workloads, each with a non-trivial (and not always predictable) machinery, and there’s no page that lists out all eviction modes in one place. This article will dig into Kubernetes internals to walk you through all the eviction paths that can terminate your Pods, and why “kubelet restarts don’t impact running workloads” isn’t always true, and finally I’ll leave you with a cheatsheet at the end…
Open-ended Agent Learning in the Era of Foundation Models
I really enjoyed guest lecturing in the Stanford CS Course "Self-Improving AI Agents." The talk is online, titled "Open-ended Agent Learning in the Era of Foundation Models" (w/ more emphasis on the AI Scientist and ADAS than prior versions)…
The Model is the Product
There has been a lot of speculation over the past years about what the next cycle of AI development could be. Agents? Reasoners? Actual multimodality? I think it's time to call it: the model is the product. All current factors in research and market development push in this direction…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #588 here.
Cutting Room Floor
The Interface Between Reinforcement Learning Theory and Language Model Post-Training
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
smalldiffusion: Simple and readable code for training and sampling from diffusion models
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~66,600 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian