Data Science Weekly - Issue 606
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
July 03, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Data and Reality, 2nd Edition
People who've listened to me for more than, like, five minutes probably know that my favorite software book, bar none, is Data and Reality. It's about how data is terrible and we can't accurately capture data in information systems because we don't know what we, as the people capturing the data, actually mean when we think about things…
An Algorithm for a Better Bookshelf - Managing the strategic positioning of empty spaces
Drop in at a library, and you’ll likely notice that most shelves aren’t full—librarians leave some empty space on each shelf. That way, when they get new books, they can slot them into place without having to move too many other books. It’s a simple-enough idea, but one that arises in a host of settings in computer science that involve sorted data, such as an alphabetically ordered census repository, or a list of connections between members of a social network. In such situations, where the entries can number in the hundreds of billions, the strategic positioning of empty spaces takes on great significance…
Stalking the Statistically Improbable Restaurant… With Data!
Last summer, I wrote about the statistically improbable restaurant, the restaurant you wouldn’t expect to find in a small American city: the excellent Nepali food in Erie, PA and Akron, OH; a gem of a Gambian restaurant in Springfield, IL. Statistically improbable restaurants often tell you something about the communities they are based in…The existence of the statistically improbable restaurant implies a statistically probable restaurant distribution: the mix of restaurants we’d expect to find in an “average” American city…
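One way to make "statistically improbable" concrete is to compare a cuisine's share of restaurants in one city against its share in a baseline. The snippet below is only a rough sketch of that idea, not the method used in the linked post; the function name `overrepresentation` and the counts are invented placeholders for illustration.

```python
def overrepresentation(city_counts, baseline_counts):
    """Ratio of each cuisine's share in a city to its share in a baseline.

    A ratio well above 1 suggests the cuisine is more common in the city
    than the baseline mix would lead us to expect.
    """
    city_total = sum(city_counts.values())
    base_total = sum(baseline_counts.values())
    ratios = {}
    for cuisine, count in city_counts.items():
        base = baseline_counts.get(cuisine, 0)
        if base == 0:
            continue  # no baseline share to compare against
        ratios[cuisine] = (count / city_total) / (base / base_total)
    return ratios


# Placeholder counts, made up purely for illustration (not real data).
city = {"cuisine_a": 5, "cuisine_b": 45}
baseline = {"cuisine_a": 10, "cuisine_b": 990}
ratios = overrepresentation(city, baseline)
print(ratios)
```

With these placeholder numbers, `cuisine_a` makes up 10% of the city's restaurants but only 1% of the baseline, so its ratio comes out at roughly 10, while `cuisine_b` lands slightly below 1.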
Data Science Articles & Videos
LLMZip: Lossless Text Compression using Large Language Models
We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme…
Bringing Posit’s Air formatter for R to Homebrew
This post walks through creating a Homebrew formula for Air, Posit’s new R formatter and language server. We explore why Air represents a significant step forward for R code formatting, discuss the technical considerations of packaging Rust-based tools for Homebrew, and demonstrate the complete process from formula creation to pull request submission…
How’s the job market for Bayesian statistics? [Reddit]
I’m a data scientist with 1 YOE. mostly worked on credit scoring models, sql, and Power BI. Lately, I’ve been thinking of going deeper into bayesian statistics and I’m currently going through the statistical rethinking book. But I’m wondering. is it worth focusing heavily on bayesian stats? Or should I pivot toward something that opens up more job opportunities? Would love to hear your thoughts or experiences!…
Basic facts about GPUs
I’ve been trying to get a better sense of how GPUs work. I’ve read a lot online, but the following posts were particularly helpful…This post collects various facts I learned from these resources…
The rise of "context engineering"
Context engineering is building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task. Most of the time when an agent is not performing reliably the underlying cause is that the appropriate context, instructions and tools have not been communicated to the model. LLM applications are evolving from single prompts to more complex, dynamic agentic systems. As such, context engineering is becoming the most important skill an AI engineer can develop…
Marketing for maintainers: Promote your project to users and contributors
Marketing your open source project can be intimidating, but three experts share their insider tips and tricks for how to get your hard work on the right people’s radars…It can be intimidating to go from writing code to writing marketing copy, so The ReadME Project’s senior editor Klint Finley gathered three experts to answer your questions about how to promote your project to both users and contributors…
Discovering cognitive strategies with tiny recurrent neural networks
Understanding how animals and humans learn from experience to make adaptive decisions is a fundamental goal of neuroscience and psychology. Normative modeling frameworks such as Bayesian inference and reinforcement learning provide valuable insights into the principles governing adaptive behavior…Here we present a novel modeling approach that leverages recurrent neural networks to discover the cognitive algorithms governing biological decision-making. We show that neural networks with just one to four units often outperform classical cognitive models and match larger neural networks in predicting the choices of individual animals and humans, across six well-studied reward-learning tasks…
Vitess for Postgres, with the co-founder of PlanetScale
Sugu Sougoumarane, co-creator of Vitess and co-founder of PlanetScale, joins me to talk about his time scaling YouTube's database infrastructure, building Vitess, and his latest project bringing sharding to Postgres with Multigres. This was a fun conversation with technical deep-dives, lessons from building distributed systems, and why he's joining Supabase to tackle this next big challenge…
A Guide to Failure in Machine Learning: Reliability and Robustness from Foundations to Practice
One of the main barriers to adoption of Machine Learning (ML) is that ML models can fail unexpectedly. In this work, we aim to provide practitioners a guide to better understand why ML models fail and equip them with techniques they can use to reason about failure. Specifically, we discuss failure as either being caused by lack of reliability or lack of robustness…
Designing the Tools that Shape Data Science – with Dr Hadley Wickham
We’re diving into the design and development of the tools that have transformed data science and statistical programming — with none other than Dr Hadley Wickham. Hadley is the Chief Scientist at Posit (formerly RStudio), where he leads the Tidyverse team…In this episode, Hadley shares key lessons he’s learnt from over two decades of building and maintaining open-source software. We also look ahead, as he discusses where he sees data science tooling headed — and the principles that will shape its future…
Uncommon Uses of Python in Commonly Used Libraries
To learn how to build more maintainable and usable Python libraries, I’ve been reading some of the most widely used Python packages. Along the way, I learned some things about Python that are off the beaten path. Here are a few things I didn’t know before…
Data Science Has Become a Pseudo-Science [Reddit]
I’ve been working in data science for the last ten years, both in industry and academia, having pursued a master’s and PhD in Europe. My experience in the industry, overall, has been very positive. I’ve had the opportunity to work with brilliant people on exciting, high-impact projects…However, over the past two years or so, it feels like the field has taken a sharp turn. Just yesterday, I attended a technical presentation from the analytics team. The project aimed to identify anomalies in a dataset composed of multiple time series, each containing a clear inflection point…The team claimed to have solved the task using “generative AI”…
Kokoro TTS
A CLI text-to-speech tool using the Kokoro model, supporting multiple languages, voices (with blending), and various input formats including EPUB books and PDF documents…
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,400 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian