Data Science Weekly - Issue 607
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #607
July 10, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February–June 2025 frontier affect the productivity of experienced open-source developers…After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down…
Women’s Pockets are Inferior
Like so many things on the internet, we could find complaints and anecdotes galore but little data illustrating just how inferior women’s pockets really are to men’s. So, we went there. We measured the pockets in both men’s and women’s pants in 20 of the US’ most popular blue jeans brands. Take a look at what we found…
Vibe / Citizen Developers bringing our Datawarehouse to it's knees [Reddit]
Received an alert this morning stating that compute usage increased 2000% on a data warehouse. I went and looked at the top queries coming in and spotted evidence of Vibe coders right away. Stuff like SELECT * or SELECT TOP 7,000,000 * with a list of 50 different tables and thousands of fields at once (like 10,000), all joined on non-clustered indexes. And not just one query like this, but tons coming through…
Data Science Articles & Videos
Out-of-Distribution Detection Methods Answer the Wrong Questions
To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection…
Mathematical modelling of cancer cell evolution and plasticity
In this review, we argue that mathematical modelling is an essential tool for understanding cancer cell evolution and phenotypic plasticity. We show that mathematical models enable us to reconstruct time-dependent tumour evolutionary dynamics from temporally-restricted biological data…
Does your company also have like a 1000 data silos? How did you deal?? [Reddit]
Our stack is starting to feel like a graveyard of data silos. Every team has their own little database or cloud storage or Kafka topic or spreadsheet or whatever, and no one knows what’s actually true anymore. We’ve got data everywhere, Excel docs in people’s inboxes… it’s a full-on Tower of Babel situation. We try to centralize stuff but it turns into endless meetings about “alignment” and nothing changes. Everyone nods, no one commits. Rinse, repeat. Has anyone actually succeeded in untangling this mess? Did you go the data mesh route? Lakehouse? Build some custom plaster yourself?…
The slow death of scaling and what comes next
Is bigger always better? How our optimization space is fundamentally changing. The slow death of scaling and what comes next…
The entire history of Computer Vision explained one core concept at a time
In this video, we are going to go through the history of CNNs specifically for Image Classification tasks – starting from those early research years, to the golden era of the mid 2010s when many of the most genius Deep Learning architectures ever were conceived, and finally discuss the latest trends in CNN research now as they compete with attention and vision-transformers…
How to Profile Models in PyTorch
This tutorial seeks to teach users about using profiling tools such as nsys, rocprof, and the torch profiler in a simple transformers training loop. We will cover how to use the PyTorch profiler to identify performance bottlenecks, understand GPU efficiency metrics, and perform initial optimizations…
European Commission presents Roadmap for lawful access to data
On 24 June, the European Commission presented a Roadmap setting out the way forward to ensure law enforcement authorities in the EU have effective and lawful access to data. The roadmap is an important deliverable under ProtectEU – the EU’s Internal Security Strategy, which the Commission presented in April this year…
Building resilient applications: design patterns for handling database outages
Database outages, whether planned or unexpected, pose significant challenges to applications. Planned outages for maintenance can be scheduled but still impact users. Unplanned outages are more disruptive and can happen at critical times. Even the most robust and resilient databases will inevitably experience outages, making application resiliency a critical consideration in modern system design…In this post, we explore design patterns for building resilient applications that gracefully handle database outages. These strategies protect against outages and ensure smooth degradation during database issues, helping maintain user experience and business continuity…
Today's AI industry is pouring billions into agent frameworks, orchestration platforms, and error recovery systems. Yet the most sophisticated AI applications still fail in production for a surprisingly mundane reason: we can't guarantee what comes out of our language models…
KSAT with Vegard Sandengen - Rust in Production Podcast
In this episode, we talk to Vegard Sandengen, a Rust engineer at KSAT, a company that provides ground station services for satellites. They use Rust to manage the data flow from hundreds of satellites, ensuring that data is received, processed, and stored efficiently. This data is then made available to customers around the world, enabling them to make informed decisions based on real-time satellite data…We dive deep into the technical challenges of building reliable, high-performance systems that operate 24/7 to capture and process satellite data. Vegard shares insights into why Rust was chosen for these mission-critical systems, how they handle the massive scale of data processing, and the unique reliability requirements when dealing with space-based infrastructure…
Prompting as Scientific Inquiry
Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science…
Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms…
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #606 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,400 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian