Data Science Weekly - Issue 595
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
April 17, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Things Statisticians Claim to Hate But Secretly Love
There Are Liars, Damn Liars, and Statisticians. This Is the Truth…
Decomposing Transactional Systems
Every transactional system does four things:
It executes transactions
It orders transactions
It validates transactions
It persists transactions
All four of these things must be done before the system may acknowledge a transaction’s result to a client. However, these steps can be done in any order. They can be done concurrently. Different systems achieve different tradeoffs by reordering these steps…
Keeping it boring (and relevant) with BM25F
First introduced in 1994, BM25 eventually made its way into popular search engines like Apache Lucene and has been powering search bars across the internet for decades. Because it works well with keyword search engines, BM25 is efficient to compute over massive datasets. It also performs well in diverse settings, providing good out-of-the-box ranking in a variety of domains like site search, legal search, and more…Given BM25’s success in text search, it’s natural to wonder if it could work well for keyword-style code searches. After implementing it in Sourcegraph’s recent 6.2 release, our answer is “yes”!…
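For readers who have not seen the scoring function behind the name, here is a minimal sketch of plain BM25 (not the field-weighted BM25F variant in the title, and not Sourcegraph's or Lucene's actual implementation). The function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.2` and `b=0.75` are illustrative assumptions, and the IDF uses the common non-negative "log(1 + ...)" form.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration only; k1 and b are common defaults.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher weight; this form never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and longer documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example), already split into tokens.
docs = [
    "def parse config file".split(),
    "parse parse parse the input stream".split(),
    "render the html template".split(),
]
scores = bm25_scores("parse config".split(), docs)
```

In this toy corpus the first document matches both query terms and ranks highest, the second matches only the more common term "parse" (repeating it helps less and less), and the third matches nothing and scores zero.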
Sponsor Message
Analyse and visualise data with your AI assistant
Unlock the full potential of your data with Conjointly's Insights Explorer. The Insights Explorer is a free browser-based rswam IDE that includes an AI assistant, allowing you to generate analysis without writing additional code or installing software.
Simply ask the AI assistant questions about your data in plain English, and it will generate executable code ready to help you transform, visualise, and explore your data. Insights Explorer helps you simplify your workflow and streamline the path from data to decision.
Whether you need quick analysis or deep data exploration, the Insights Explorer helps you focus on the insights that really matter, rather than getting bogged down in technicalities and common pain points of data programming. Spend less time searching for the right libraries, looking up syntax, restructuring data or debugging code and more time interpreting results and extracting meaningful insights for business decisions.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Generative modelling in latent space
Most contemporary generative models of images, sound and video do not operate directly on pixels or waveforms. They consist of two stages: first, a compact, higher-level latent representation is extracted, and then an iterative generative process operates on this representation instead. How does this work, and why is this approach so popular?…
Diffusion Generative Model, Non-Euclidean Data, and How the Algebraic/Geometric Structure of Lie Groups Helps
It is challenging but beneficial to create diffusion generative model on manifolds. By algorithmically introducing an auxiliary momentum variable and mathematically “trivializing” it, this task becomes, for a useful class of manifolds, as easy as diffusion model in Euclidean spaces, so that many great existing developments can be used…
How to Build an Agent
It’s not that hard to build a fully functioning, code-editing agent…It seems like it would be. When you look at an agent editing files, running commands, wriggling itself out of errors, retrying different strategies - it seems like there has to be a secret behind it…There isn’t. It’s an LLM, a loop, and enough tokens…
Python TARIFF
The GREATEST, most TREMENDOUS Python package that makes importing great again!…TARIFF is a fantastic tool that lets you impose import tariffs on Python packages. We're going to bring manufacturing BACK to your codebase by making foreign imports more EXPENSIVE!…
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods?…
An Intro to DeepSeek's Distributed File System
3FS (Fire-Flyer File System) is a distributed filesystem released by DeepSeek during their open source release week. This blog post will dive into what distributed file systems are and how 3FS operates, starting with some background…
An Introduction to Stochastic Calculus
This post is about stochastic calculus, an extension of regular calculus to stochastic processes. It's not immediately obvious but the rigour needed to properly understand some of the key ideas requires going back to the measure theoretic definition of probability theory, so that's where I start in the background. From there I quickly move on to stochastic processes, the Wiener process, a particular flavour of stochastic calculus called Itô calculus, and finally end with a couple of applications. As usual, I try to include a mix of intuition, rigour where it helps intuition, and some simple examples. It's a deep and wide topic so I hope you enjoy my digest of it…
The value of a dedicated data science approach in HR
This document outlines why HR departments in large organizations benefit from a dedicated data science approach, highlighting impacts beyond recruitment. In short, my thesis is as follows: as organizations scale, so does the complexity of understanding their internal dynamics. Data tools become essential to analyzing large organizations, as they enable HR to identify patterns and insights that can drive strategic improvements across key areas…
Monte Carlo Crash Course - Continuous Probability
This chapter is a condensed review of continuous probability. If you’re comfortable working with continuous random variables, you may want to skip it…
Simple low-dimensional computations explain variability in neuronal activity
Our understanding of neural computation -- both in the brain and artificial networks -- is founded on an assumption: That neurons fire in response to a linear sum of inputs…We systematically test this assumption…
A Field Guide to Rapidly Improving AI Products
In this post, I’ll show you exactly how these successful teams operate. While every situation is unique, you’ll see patterns that apply regardless of your domain or team size. Let’s start by examining the most common mistake I see teams make: one that derails AI projects before they even begin…
Reinforcement Learning from Human Feedback Book
In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF – both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics – understudied research questions in synthetic data and evaluation – and open questions for the field…
Marimo - Reactive Notebooks for Python
Have you ever spent an afternoon wrestling with a Jupyter notebook, hoping that you ran the cells in just the right order, only to realize your outputs were completely out of sync? Today's guest has a fresh take on solving that exact problem. Akshay Agrawal is here to introduce Marimo, a reactive Python notebook that ensures your code and outputs always stay in lockstep. And that's just the start! We'll also dig into Akshay's background at Google Brain and Stanford, what it's like to work on the cutting edge of AI, and how Marimo is uniting the best of data science exploration and real software engineering…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #594 here.
Cutting Room Floor
Migrating a large codebase to Polars by Jeroen Janssens and Thijs Nieuwdorp
fastkmeans - A fast and efficient k-means implementation for PyTorch, with support for GPU and CPU
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,000 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian