Data Science Weekly - Issue 601

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

May 29, 2025

Issue #601
May 22, 2025

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Iceberg Operation Journey: Takeaways for DB & Server Logs
Hi, I am SeungMin Lee from the Data Analytics Platform team at Kakao…This is the last article in the series…In the first article, we shared performing CDC (Change Data Capture) using Apache Flink to synchronize a MySQL table with another MySQL table, and in the second article, we shared our experiences gained from performing CDC from MySQL tables to Apache Iceberg and operating the system…In this article, we intend to share how best to perform partitioning and optimization for Iceberg tables based on the type of logs being collected, along with our current operational methods and the results of our tests…

How to Keep Your Data Team From Becoming a Money Pit
Today, we are going to discuss what I’ve learned as a consultant being called in to help turn around data teams and infrastructure. Ok the first story will be one of my early mistakes…

The 80/20 Guide to R You Wish You Read Years Ago
Most intermediate R users get stuck in what I call "local optima." Sure! They write code that gets the job done, but it's often held together by duct tape and hope. As someone in academia, believe me when I say, I’ve seen my fair share of scruffy R scripts that are barely readable, impossible to maintain, and one change away from collapsing. But they're missing out on improvements that require minimal effort…

Data Science Articles & Videos

Isolation Forest
Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. As the name implies, Isolation Forest is an ensemble method (similar to random forest). In other words, it uses the average of the predictions by several decision trees when assigning the final anomaly score to a given data point. Unlike other anomaly detection algorithms, which first define what’s “normal” and then report anything else as anomalous, Isolation Forest attempts to isolate anomalous data points from the get-go…
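To make the idea concrete, here is a minimal sketch of an isolation forest in plain Python. It is not the linked article's code and not scikit-learn's implementation: the function names, the toy data, and the choice to grow every tree on the full data set (the original algorithm draws a subsample per tree) are assumptions made to keep the example short.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected depth of an unsuccessful binary-search-tree lookup among
    n points; used to normalise raw tree depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # nothing to split on in this feature; stop here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(point, node, depth=0):
    """Depth at which `point` lands, plus a correction for unsplit leaves."""
    if "split" not in node:
        return depth + average_path_length(node["size"])
    side = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)

def anomaly_scores(points, n_trees=100, seed=0):
    """Score in (0, 1): the shorter the average path, the closer to 1."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(len(points))
    scores = []
    for p in points:
        mean_depth = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Hypothetical toy data: a Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
points.append((8.0, 8.0))

scores = anomaly_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous], round(scores[most_anomalous], 3))
```

The far-away point needs very few random splits to end up alone, so its average path is short and its score is the highest in the set. Nothing in the code ever models what "normal" looks like.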
New data engineer getting paid more than me, a senior DE [Reddit]
I found out that a new data engineer coming onto my team is making a few thousand more than me (a senior that's been with the company several years) annually, despite this new DE having less direct/applicable experience than me…Having to be a bit vague for obvious reasons. I have been a top individual contributor on my team every year. Every review I've received from management is overwhelmingly positive. This new DE and I are in the same geographic area, so that's not the explanation. How should I broach this with my management…
AlphaEvolve: the hype and the wonder
A dive into the buzz surrounding AlphaEvolve's jaw-dropping announcement. Did AI just revolutionize matrix multiplication and "doom" mathematicians everywhere? Not quite. This presentation cuts through the noise to explore what AlphaEvolve really achieved: a remarkable feat of AI and optimization, but not the master of mathematical creativity some claim. Learn why the future of AI in math is exciting, but not apocalyptic!...
acquaint
acquaint implements a Model Context Protocol (MCP) server for your R sessions. When configured with acquaint, MCP-enabled tools like Claude Desktop and Claude Code can run R code in the sessions you have running to answer your questions. While the package supports configuring arbitrary R functions, acquaint provides a default set of tools from btw to:
- Peruse the documentation of packages you have installed,
- Check out the objects in your global environment, and
- Retrieve metadata about your session and platform…
Understanding Panel Data Models
Panel data analysis represents one of the most powerful and versatile approaches in modern social science. By combining cross-sectional and time-series dimensions, panel data provides researchers with a multifaceted view of phenomena that would be impossible to obtain with either dimension alone. This document explores the theoretical foundations, practical applications, and interpretative nuances of the three primary panel data models: fixed effects, between effects, and random effects, with applications to county-level health research…
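As a rough illustration of how two of these estimators differ, the sketch below computes a pooled OLS slope, a fixed-effects (within) slope, and a between-effects slope on a tiny made-up panel. The county names and numbers are hypothetical, not taken from the linked document, and random effects is left out because it needs an estimate of the variance components.

```python
from collections import defaultdict

def ols_slope(pairs):
    """Slope of a simple least-squares regression of y on x."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return s_xy / s_xx

def group_by_entity(rows):
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))
    return groups

def fixed_effects_slope(rows):
    """Within estimator: demean x and y inside each entity, then regress."""
    demeaned = []
    for obs in group_by_entity(rows).values():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        demeaned.extend((x - x_bar, y - y_bar) for x, y in obs)
    return ols_slope(demeaned)

def between_effects_slope(rows):
    """Between estimator: regress entity means of y on entity means of x."""
    means = []
    for obs in group_by_entity(rows).values():
        means.append((sum(x for x, _ in obs) / len(obs),
                      sum(y for _, y in obs) / len(obs)))
    return ols_slope(means)

# Hypothetical panel: y = county intercept + 2 * x, with no noise.
# Counties with larger x have smaller intercepts, which misleads pooled OLS.
intercepts = {"county_a": 10.0, "county_b": 0.0, "county_c": -10.0}
x_values = {"county_a": [1, 2, 3], "county_b": [4, 5, 6], "county_c": [7, 8, 9]}
rows = [(c, float(x), intercepts[c] + 2.0 * x)
        for c in intercepts for x in x_values[c]]

pooled = ols_slope([(x, y) for _, x, y in rows])
within = fixed_effects_slope(rows)
between = between_effects_slope(rows)
print(pooled, within, between)
```

Demeaning inside each county removes the county-specific intercept, so the within estimator uses only variation over time, while the between estimator uses only variation across counties. In this toy panel the two even disagree in sign.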
MCP Jupyter
Jupyter MCP Server allows you to use tools like Goose or Cursor to pair with you in a JupyterLab notebook where the state of your variables, etc. is preserved by the JupyterLab kernel. The fact that state is preserved is the key to this because it allows you to pair with the Agent in a notebook, where for example if a package is not installed it will see the error and install it for you. You as the user can then do some data exploration and then hand off to the agent at any time to pick up where you left off…
An Alchemist’s Notes on Deep Learning
I have recently had the opportunity to spend lots of time learning with the excuse of pursuing a PhD. These Alchemist’s Notes are a byproduct of that process. Each page contains notes and ideas related broadly to deep learning, generative modeling, and practical engineering. I’ve actually been writing these things since 2016 on my personal website, but this site should be a more put-together version…
Data Quality Is All You Need?
Microsoft's Phi-4 is a small (14B parameters) language model that is a massive testament to the importance of data quality in training Large Language Models (LLMs)…In fact, when I went through their 36-page-long technical report, what astounded me was the fact that only one paragraph is devoted to details of the model architecture, and the rest of the report talks almost exclusively about the data or evaluation pipeline…Through this post I will walk through the training data collection and curation pipeline used in training…
SQLite-JS Extension
SQLite-JS is a powerful extension that brings JavaScript capabilities to SQLite. With this extension, you can create custom SQLite functions, aggregates, window functions, and collation sequences using JavaScript code, allowing for flexible and powerful data manipulation directly within your SQLite database…
Building software on top of Large Language Models
I presented a three-hour workshop at PyCon US yesterday titled Building software on top of Large Language Models. The goal of the workshop was to give participants everything they needed to get started writing code that makes use of LLMs. Most of the workshop was interactive: I created a detailed handout with six different exercises, then worked through them with the participants. You can access the handout here; it should be comprehensive enough that you can follow along even without having been present in the room…
CSV to HTML Table
Display any CSV file as a searchable, filterable, pretty HTML table. Done in 100% JavaScript…
2025 stack check: which DS/ML tools am I missing? [Reddit]
I work in ad-tech, where my job is to improve the product with data-driven algorithms, mostly on tabular datasets (CTR models, bidding, attribution, the usual).
Current work stack (quite classic I guess)
- pandas, numpy, scikit-learn, xgboost, statsmodels
- PyTorch (light use)
- JupyterLab & notebooks
- matplotlib, seaborn, plotly for viz
- Infra: everything runs on AWS (code is hosted on Github)
The news cycle is overflowing with LLM tools. I do use ChatGPT / Claude / Aider as helpers, but my main concern right now is the core DS/ML tooling that powers production pipelines. So, what genuinely awesome 2024-25 libraries, frameworks, or services should I try?…
DumPy: NumPy except it's OK if you're dum
What I want from an array language is:
1. Don’t make me think.
2. Run fast on GPUs.
3. Really, do not make me think.
4. Do not.
I say NumPy misses on three of these. So I’d like to propose a “fix” that—I claim—eliminates 90% of unnecessary thinking, with no loss of power. It would also fix all the things based on NumPy, for example every machine learning library…

Whenever you're ready, 3 ways we can help:

1. Want to get better at Data Science / Machine Learning Math? I have two weekly tutoring slots open. Hit reply to this email and let me know what you want to learn.
2. Looking to get a job? Check out our “Get A Data Science Job” Course. It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
3. Promote yourself/organization to ~68,400 subscribers by sponsoring this newsletter. 30-40% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
