Data Science Weekly - Issue 611
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #611
August 07, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
The Internals of PostgreSQL
PostgreSQL is a well-designed, open-source multi-purpose relational database system which is widely used throughout the world. It is one huge system with the integrated subsystems, each of which has a particular complex feature and works cooperatively with each other. Although understanding of the internal mechanism is crucial for both administration and integration using PostgreSQL, its hugeness and complexity make it difficult. The main purposes of this document are to explain how each subsystem works, and to provide the whole picture of PostgreSQL…
Detecting a single event
The problem is this: If we have a series of cases where no event has taken place, what is the estimated event rate? Our best estimate of the proportion of cases which have an event is zero, but there will be uncertainty in this estimate. Just because we have not seen an event yet does not mean we will never see one. We need a confidence interval for this estimate…
2025 Book Data
What genres of books have I read so far in 2025 and how many pages have I read? A short little data visualization exercise. Credit to the Libby app and Goodreads for helping me keep track of all the books I read, and to this example from the R graph gallery for inspiring my genre bump plot…
Data Science Articles & Videos
iNaturalist accelerates biodiversity research
Participatory citizen science is expanding, with iNaturalist emerging as one of the most widely used platforms globally. However, its application in research is often anecdotal. To evaluate the impact of how iNaturalist is contributing to biodiversity and conservation research, we conducted a systematic review of iNaturalist data use and compared our findings with Global Biodiversity Information Facility literature citing iNaturalist…
If I get laid off tomorrow, what's the ONE skill I should have had to stay in demand? [Reddit]
I'm a Data Engineer with 3 YOE at a Big4. With all the layoffs happening, wondering what skill would make me most marketable…What's the ONE skill I should start learning that would make me recession-proof or boost my career?…
Context Engineering — A Comprehensive Hands-On Tutorial with DSPy
Let's dissect the art and science of context engineering, one module at a time!…This article will cover the key ideas behind creating LLM applications using Context Engineering principles, visually explain these workflows, and share code snippets that apply these concepts practically…
How Instacart Built a Modern Search Infrastructure on Postgres
In a previous blog post “Improving search at Instacart using hybrid recall”, we shared our progress on adaptively combining traditional full text search with embedding-based retrieval. Embarking on this journey required rethinking our search infrastructure to support hybrid recall while ensuring scalability and reliability…In this blog post, we will dive deeper into the architecture and engineering efforts that made this possible and lessons learned along the way…
The worlds most convoluted spirograph
In this video, we're dipping our toes into the world of DSP, and learning about how to interpret the output of the world's most important algorithm: The Fast Fourier Transform…By building an understanding of all the component parts, we can understand exactly how this mathematical tool can be used to create these wild drawings, based solely on circles of different sizes rotating around one another…
Building notes: Linear regression viz in 3D
One of my favorite books on linear regression is Applied Regression Analysis & Generalized Models by John Fox. I feel like it has enough theory and practicality, and I like going back to it every now and then. One of my favorite diagrams to introduce readers to linear regression is below: I really love this plot and out of curiosity, I was wondering if I can replicate this plot in ggplot…
Common statistical tests are linear models (or: how to teach stats)
Most of the common statistical models (t-test, correlation, ANOVA; chi-square, etc.) are special cases of linear models or a very close approximation. This beautiful simplicity means that there is less to learn. In particular, it all comes down to 𝑦=𝑎⋅𝑥+𝑏 which most students know from high school. Unfortunately, stats intro courses are usually taught as if each test is an independent tool, needlessly making life more complicated for students and teachers alike…
The ultimate guide to starting a Quarto blog
This blog post is an in-depth guide on how to start blogging with Quarto…Today, I want to show you how to build a blog with Quarto. This in-depth guide is the result of hours of working with Quarto’s amazingly detailed documentation. Hopefully, it will save you a lot of time and help you start your own blog…
This is the first post of what will be a series of posts reviewing Martin Bland’s “Detecting a single event” [see Editor’s picks above] and some of the theory behind it…As an applied statistician who spends most of my days fitting linear models and running simulations, I fall out of practice with some of the fundamentals of statistical theory. And it has been refreshing to return to the basics and be reminded of just how good they are!…In reading Bland’s write up, I found I needed to re-learn a lot of things, which inspired me to start a small series covering:
The problem to be solved & an overview of the binomial distribution
Estimating likely event rates when zero events have been detected (exact 95% confidence intervals)
Estimating the power to detect at least 1 event in a given study…
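For readers who want to try the ideas in the list above, here is a minimal Python sketch. It assumes the standard exact (Clopper-Pearson) two-sided interval and a simple binomial model with independent cases; the linked posts may use a different convention, and the function names are ours, not the authors'.

```python
import math


def zero_event_upper_bound(n, conf=0.95):
    """Upper limit of the exact two-sided CI for the event rate
    when 0 events were seen in n independent cases.
    (The lower limit is 0 when no events are observed.)"""
    alpha = 1 - conf
    return 1 - (alpha / 2) ** (1 / n)


def power_at_least_one(p, n):
    """Probability of seeing at least 1 event in n cases
    if the true event rate is p."""
    return 1 - (1 - p) ** n


def cases_needed(p, power=0.8):
    """Smallest n giving at least the requested chance of
    seeing 1 or more events at true rate p."""
    return math.ceil(math.log(1 - power) / math.log(1 - p))


if __name__ == "__main__":
    print(f"0 events in 100 cases, upper 95% limit: {zero_event_upper_bound(100):.4f}")
    print(f"Chance of >=1 event at p=0.01, n=100: {power_at_least_one(0.01, 100):.4f}")
    print(f"Cases needed for 80% power at p=0.01: {cases_needed(0.01, 0.8)}")
```

With a one-sided 95% bound instead (replace `alpha / 2` with `alpha`), the upper limit is close to the familiar "rule of three" approximation of 3/n.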
Learn Rust by Reasoning with Code Agents
It's often said that Rust has a steep learning curve. I disagree with this notion. I'm a strong believer in learning by doing. Rust is a programming language, and like any language, it should be learned by applying it to real projects rather than relying solely on books or videos. However, learning by doing can't solve every problem that newcomers might encounter. While it helps with grasping the basics, when it comes to mastering Rust's advanced features like ownership, traits, lifetimes, async, we need more than just hands-on practice. We need to understand. We need to reason. Thanks to Code Agents, I discovered something even better: learning Rust by reasoning (with Code Agents)…
Scaling the r-spatial ecosystem for the modern composable data pipeline
R has long been a top choice for spatial statistics, building on the pioneering sp and spdep packages and the wide ecosystem surrounding them. With the introduction of the sf package, R became home to a first-class spatial data frame API. A growing number of R users, however, need to scale beyond the capabilities of sf. This webinar will cover three broad categories of techniques to scale spatial workflows in R, including (1) ensuring that sf code is appropriately using features targeted at larger analyses, (2) using libraries that provide lower-level access to the primitives on which sf builds, including s2, wk, and geos, and (3) using database connectors and in-memory databases to write spatial SQL and perform computations in engines like PostGIS, DuckDB, and Apache Sedona. Finally, this webinar will provide an overview of the technologies that underlie these techniques, including GeoArrow, GeoParquet, and Apache Iceberg…
How can I *give* a good data science/machine learning interview? [Reddit]
My company has decided they want to bring on some much needed help (thank god) and want me to do "the more technical side" of the interview (with others taking care of the behavioral etc)…Does anyone have any good tips on how to do the interviews, what to look for or what to include?…
AWS deleted my 10-year account and all data without warning
On July 23, 2025, AWS deleted my 10-year-old account and every byte of data I had stored with them. No warning. No grace period. No recovery options. Just complete digital annihilation…This is the story of a catastrophic internal mistake at AWS MENA, a 20-day support nightmare where I couldn’t get a straight answer to “Does my data still exist?”, and what it reveals about trusting cloud providers with your data…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #610 here.
Cutting Room Floor
Achieving 10,000x training data reduction with high-fidelity labels
Kart: Distributed version-control for geospatial and tabular data
Making of SARE: How I Designed a File Format for Encrypted Data
How I Use LLMs for Data Harmonization: A Strategic, Limited Approach
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,500 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian