Data Science Weekly - Issue 593
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
April 03, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
The Art of Problem-Solving in Software Engineering: How to Make MySQL Better
This book uses MySQL challenges as case studies to explore problem analysis and resolution strategies. Readers will gain a deeper appreciation for logical reasoning, data structures, algorithms, and more through practical examples and insightful discussions…
Recent reasoning research: GRPO tweaks, base model RL, and data curation
Reasoning and reinforcement learning (RL) research has been making lots of noise, but finding the items to focus on amid the chaos is not easy. This post goes through the papers that I learned from and what they mean…
SHAP Interpretations Depend on Background Data — Here’s Why
Or why height doesn't matter in the NBA…Last week, I held my first-ever workshop on interpretable machine learning. One thing we explored was how the choice of reference or background data affects SHAP interpretations. This aspect is often underappreciated—both as a potential pitfall in interpretation and as an opportunity. I previously wrote about the power of background data in other interpretation methods, but there is more to be said regarding SHAP…
Data Science Articles & Videos
Explaining The Zoo Model
Over on his soccer substack, Devin Pleuler has brought up The Zoo model a couple of times while admitting that he doesn’t fully understand how it works. The Zoo was the winning entry to BDB 2020, which was about predicting yards after handoff…In most models, you have to come up with some heuristic for which feature contains which player (e.g., nearest to farthest from the ball carrier or left to right on the field). Changing how you enter them will result in a different prediction. Not so with The Zoo. Anywho, there was a note in Devin’s post asking if anyone would be interested in explaining how it works. I’ve made a similar model in the past, so I figured I’d give a go at explaining it, or at least how I think about it…
An Interview with Snowflake CEO Sridhar Ramaswamy About Data and AI
We cover…Ramaswamy’s background and experience at Google, and his current take on the company and the challenges it faces in search. Then we dive into Snowflake and his unexpected elevation to CEO, including topics like business models, go-to-market motions, and incentives. The rest of the interview is about AI and Snowflake’s position in the market: can Snowflake extend beyond its structured data warehouse roots before competitors like Databricks leverage AI to wrangle unstructured data into a more compelling offering?…
dayplot - Calendar heatmaps with matplotlib
A simple-to-use Python library to build calendar heatmaps with ease. It's built on top of matplotlib and leverages it to access high customization possibilities…
A love letter to the CSV format
Every month or so, a new blog article declaring the near demise of CSV in favor of some "obviously superior" format (parquet, newline-delimited JSON, MessagePack records etc.) finds its way to the reader's eyes. Sadly those articles often offer a very narrow and biased comparison and often fail to understand what makes CSV a seemingly unkillable staple of data serialization. It is therefore my intention, through this article, to write a love letter to this data format, often criticized for the wrong reasons, even more so when it is somehow deemed "cool" to hate on it. My point is not, far from it, to say that CSV is a silver bullet but rather to shine a light on some of the format's sometimes overlooked strengths…
airflow-ai-sdk - An SDK for working with LLMs and AI Agents from Apache Airflow, based on Pydantic AI
This repository contains an SDK for working with LLMs from Apache Airflow, based on Pydantic AI. It allows users to call LLMs and orchestrate agent calls directly within their Airflow pipelines using decorator-based tasks. The SDK leverages the familiar Airflow @task syntax with extensions like @task.llm, @task.llm_branch, and @task.agent…
Publish a Quarto project using GitHub Pages+GitHub Actions in 6 minutes (no need to render locally!)
In this video, we learn how to publish a Quarto project using GitHub Pages AND GitHub Actions, in a way that avoids the need to render any files locally. GitHub Actions handles the whole process from start to finish: Quarto + R will be installed, the correct versions of each R package you need will be installed, the Quarto project will be fully rendered, and then it'll be published on GitHub Pages. It's all free, and it only takes a few minutes to set up…
A Visual Exploration of Gaussian Processes
How to turn a collection of small building blocks into a versatile tool for solving regression problems…
4 Levels of GitHub Actions: A Guide to Data Workflow Automation
From a simple Python workflow to scheduled data processing and security management…GitHub Actions, an integrated Continuous Integration and Continuous Deployment (CI/CD) tool within GitHub, has established its position in the software development industry by providing a comprehensive platform for automating development and deployment workflows. However, its functionalities extend beyond this … We will delve into the use of GitHub Actions within the data domain, demonstrating how it can streamline processes for developers and data professionals by automating data retrieval from external sources and data transformation operations…
I've operated petabyte-scale ClickHouse® clusters for 5 years
I've been operating large ClickHouse clusters for years. Here's what I've learned about architecture, storage, upgrades, config, testing, costs, and ingestion…In that time, I've dealt with ClickHouse daily, helped start a company that uses it, sent critical changes to the database project, and managed many petabyte-scale clusters. Setting up a cluster is easy; the hard part is keeping it running. Let me go through the good and the bad parts, focusing on the problems you may find (so you can avoid them)…
Querying Hugging Face Datasets with the DuckDB UI
Hugging Face hosts a whopping 384k+ datasets that range from a few thousand rows to hundreds of millions. While the browser-based Data Studio (powered by DuckDB WASM) is powerful, exploring very large datasets or running complex queries can sometimes be limited by browser constraints. This is where the new DuckDB Local UI comes into play! Starting in DuckDB v1.2.1, MotherDuck and DuckDB Labs collaborated to bring a local UI to the DuckDB CLI…
Bridging the human–AI knowledge gap through concept discovery and transfer in AlphaZero
As AI systems become more capable, they may internally represent concepts outside the sphere of human knowledge. This work gives an end-to-end example of unearthing machine-unique knowledge in the domain of chess. We obtain machine-unique knowledge from an AI system (AlphaZero) by a method that finds novel yet teachable concepts and show that it can be transferred to human experts (grandmasters). In particular, the new knowledge is learned from internal mathematical representations without a priori knowing what it is or where to start. The produced knowledge from AlphaZero (new chess concepts) is then taught to four grandmasters in a setting where we can quantify their learning, showing that machine-guided discovery and teaching is possible at the highest human level…
Transformers without Normalization
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x)=tanh(αx), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning…
Data-Driven Decision Making with Sentiment Analysis in R
In this article, we will explore how to use different packages (Quanteda, Sentimentr and Textstem) to perform sentiment analysis on customer feedback by processing, analyzing, and visualizing textual data…
Last Week's Newsletter's 3 Most Clicked Links
Hands-On APIs for AI and Data Science: Python Development with FastAPI
Common statistical tests are linear models (or: how to teach stats)
Which Countries Have the Most Unique Taste in Music? A Statistical Analysis
* Based on unique clicks.
** Find last week's issue #592 here.
Cutting Room Floor
The Hard Things About Sync - Lessons learned working on sync engine at Figma
A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Testing Distributed Systems: Curated list of resources on testing distributed systems
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~67,600 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian