Data Science Weekly - Issue 586

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Feb 14, 2025

Issue #586
February 13, 2025

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Learn to Cloud
Learn to Cloud (L2C) is a junior cloud engineering courseware built on the belief that anyone can learn foundational cloud engineering skills with the right guide and discipline…

Do People Actually Hate Coldplay? A Statistical Analysis
Examining Coldplay's confusing cultural reputation…

Top Themes in Data in 2025
There are two opposing forces in the world of data: an overall consolidation within the modern data stack & a massive expansion driven by AI capabilities. AI is rewriting every rule about what’s possible with data in 2025. Here are Theory’s Top Themes in Data in 2025 with the full presentation at the bottom…

What’s on your mind

This Week’s Poll:

Do your friends understand what you do?

[Take this quick 5-second poll →]

We’ll share the results next week!

Data Science Articles & Videos

Comparing Apache, CNCF, and Commonhaus
I've used open source projects for over 30 years and contributed for about 20 of those. My first interaction with an open source foundation was with Apache when I began working with Apache Hadoop in 2008. Since then, I have contributed to many Apache projects, created Apache Samza, and was mentor and project champion for Apache Airflow…I'm frequently asked for guidance on which open source foundation to donate a project to. I've decided to share my thoughts in this post…
BAML is like building blocks for AI engineers
In this post, I’ll explain more about how BAML, a domain-specific language for helping LLMs generate better structured outputs, provides AI engineers the necessary building blocks to create more composable, testable and robust LLM and agentic workflows. If you’ve never heard of BAML, check out my previous post that introduces its fundamentals…
How do companies with hundreds of databases document them effectively? [Reddit Discussion]
For those who’ve worked in companies with tens or hundreds of databases, what documentation methods have you seen that actually work and provide value to engineers, developers, admins, and other stakeholders?…I’m curious about approaches that go beyond just listing databases, rather something that helps with understanding schemas, ownership, usage, and dependencies…Have you seen tools, templates, or processes that actually work? I’m currently working on a template containing relevant details about the database that would be attached to the documentation of the parent application/project, but my feeling is that without proper maintenance it could become outdated real fast. What’s your experience on this matter?…
Open sourcing kubenetmon: how we monitor data transfer in ClickHouse Cloud
When it comes to data transfer, cloud providers typically charge you for:
- NAT Gateways;
- Load Balancers;
- Cross-Availability Zone traffic;
- Egress, where the cost basis depends on which region you egress from and where you egress into;
- Ingress, where the cost basis also depends on which region you ingress into and where the remote is.
We set out to untangle this complexity, and this blog post is going to tell you how…
Defining Unique Identifiers
Generally speaking, a Unique Identifier (UID) is an inscription that represents (no more than) one entity within a given system. UIDs are essential to the functioning of modern information systems, so it is important to understand and define what a UID is and how it should be used…In this post, I will define unique identifiers by deducing their essential properties…
Polars Cloud: the distributed Cloud Architecture to run Polars anywhere
Our goal is to enable scalable data processing with all the flexibility and expressiveness of Polars’ API. We are working on two things: Polars Cloud and a completely novel Streaming Engine design. We will explain more about the streaming engine in later posts; today we want to share what we are building with Polars Cloud…It will be very seamless to spin up hardware and run Polars queries remotely, either in batch mode for production ETL jobs, or interactively doing data exploration. In the rest of the post, we want to explore this through a few code examples…
What companies/industries are “slow-paced”/low stress? [Reddit Discussion]
I’ve only ever worked in data science for consulting companies, which are inherently fast-paced and quite stressful. The money is good but I don’t see myself in this field forever. “Fast-pace” in my experience can be a code word for “burn you out”…Out of curiosity, do any of you have lower stress jobs in data science? My guess would be large retailers/corporations that are no longer in growth stage and just want to fine tune/maintain their production models, while also dedicating some money to R&D with more reasonable timelines…
An Unexpected Reinforcement Learning Renaissance
The era we are living through in language modeling research is one pervasive with complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1…The goal of this talk is to try and make sense of the story that is unfolding today…
How to disaggregate a log replication protocol
This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been looking at Virtual Consensus, but now I’m going to widen the view to look at how log replication protocols can be disaggregated in general (there are many ways)…
Cutting through Complexity: How Data Science Can Help Policymakers Understand the World
This chapter looks at examples of where innovations from data science are cutting through the complexities faced by policymakers in measurement, allocating resources, monitoring the natural world, making predictions, and more. These examples show the promise and potential of data science to aid policymakers, and point to where actions may be taken that would support further progress in this space…
Exploring the bioRxiv API with R, httr2, rvest, tidytext, and Datawrapper
Collect metadata and publication details for >200k preprints over a 10 year period, investigate trends, and scrape full text for sentiment analysis…
Understanding Model Calibration: A Gentle Introduction & Visual Exploration
In this blog post we’ll take a look at the most commonly used definition for calibration and then dive into a frequently used evaluation measure for Model Calibration. We’ll then cover some of the drawbacks of this measure and how these surfaced the need for additional notions of calibration, which require their own new evaluation measures…
Learn how to make QGIS Plugins with AI coding tools (video)
I recently published a post on my experience using Cursor to create a new QGIS plugin. It seems to have inspired a few people, and so I decided to record a couple videos to try to show everyone exactly the process to do it. I’ve felt that being able to build things like QGIS Plugins has been life-changing, and so I just wanted to help demystify the process…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #585 here.

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.

Promote yourself or your organization to ~66,500 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
