Data Science Weekly - Issue 599
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #599
May 15, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
My path into AI
How I got here. Building a career brick by brick over 8 years…
What Every Programmer Should Know About Enumerative Combinatorics
Enumerative combinatorics is a branch of mathematics focused on counting the elements of a set…For example, determining the number of unique user IDs in a database is a problem in enumerative combinatorics…This article showcases how programmers without formal math backgrounds can use observation and pattern recognition to approach problems in enumerative combinatorics…
Insurance for AI: Easier Said than Done
In recent months, many friends have pitched or asked me about insuring AI risk. The idea is usually something like this: businesses want to adopt AI for efficiency, but they’re nervous about the AI hallucinating and making costly mistakes. Even if they buy all the best software to mitigate such mistakes, the scope of LLM outputs is so large that unpredictable, hugely expensive edge cases always remain. Insurance offers a clean way to transfer that risk…But the thesis is not that easy! While I won’t present a slam-dunk view either way, I want to discuss some of the nuances and complexities that make this market tricky, and probably smaller than it appears at first glance…
Data Science Articles & Videos
PDF to Text, a challenging problem
Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format. It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”. These glyphs may be rotated, overlap, and appear out of order, with very little semantic information attached to them. You should probably be in awe at the fact that you can open a PDF file in your favorite viewer (or browser), hit ctrl+f, and search for text…
I am a staff data scientist at a big tech company -- AMA
I’m currently a staff data scientist at a big tech company in Silicon Valley. I’ve been in the field for about 10 years since earning my PhD in Statistics. I’ve worked at companies of various sizes, from seed-stage startups to pre-IPO unicorns to some of the largest tech companies. A few caveats:
Anything I share reflects my personal experience and may carry some bias.
My experience is based in the US, particularly in Silicon Valley.
I have some people management experience but have mostly worked as an IC.
Data science is a broad term. I’m most familiar with machine learning scientist, experimentation/causal inference, and data analyst roles.
Log Link vs Log Transformation in R — The Difference that Misleads Your Entire Data Analysis
How I almost misinterpreted my results when using the wrong type of log…Although normal distributions are the most commonly used, a lot of real-world data unfortunately is not normal. When faced with extremely skewed data, it’s tempting for us to utilize log transformations to normalize the distribution and stabilize the variance. I recently worked on a project analyzing the energy consumption of training AI models…
Llama from scratch (or how to implement a paper without crying)
I want to provide some tips from my experience implementing a paper. I'm going to cover my tips so far from implementing a dramatically scaled-down version of Llama for training TinyShakespeare. This post is heavily inspired by Karpathy's Makemore series, which I highly recommend. I'm only going to loosely follow the layout of their paper; while the formatting and order of sections makes sense for publication, we're going to be implementing the paper…
How To Scale
While there are already excellent posts on scaling, I wanted to share my own understanding and the things I've learned over the past few months, and hopefully spark some discussion. I hope this post can shed light for anyone navigating the challenges of scaling up neural networks…
Pathfinding
I've recently been working on the pathfinding for NPCs in my game, which is something I've been looking forward to for a while now since it's a nice chunky problem to solve. I thought I'd write up this post about how I went about it all.
I had a few extra requirements for my pathfinding, due to how my game plays:
Must deal with a dynamic physical environment (objects can move freely and are destructible)
Have paths that prefer to keep their distance from objects but still get close when needed
Allow for wrapping around the borders of the game area (Asteroids style)…
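To make the wrap-around requirement concrete, here is a minimal sketch of a distance function for a play area whose edges wrap. It is not taken from the linked post; the names `wrapped_delta` and `wrapped_distance` and the rectangular `width` by `height` play area are assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def wrapped_delta(a: float, b: float, size: float) -> float:
    """Signed shortest displacement from a to b on an axis that wraps at `size`."""
    d = (b - a) % size   # forward displacement, always in [0, size)
    if d > size / 2:     # going backwards across the border is shorter
        d -= size
    return d


def wrapped_distance(p: Point, q: Point, width: float, height: float) -> float:
    """Straight-line distance between p and q when both axes wrap (hypothetical helper)."""
    dx = wrapped_delta(p[0], q[0], width)
    dy = wrapped_delta(p[1], q[1], height)
    return math.hypot(dx, dy)
```

A function like this could serve as the distance estimate in a grid or graph search, so that a target just across the border counts as close rather than a full screen away.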
Tackling overfitting in tree-based models
If you're using LightGBM or Random Forest, you know they're powerful. But if you're just using the default settings, you're leaving performance on the table and likely overfitting. To get models that actually work well on new, unseen data, you have to tune their hyperparameters…This post distills practical experience into a focused discussion on the most impactful hyperparameters for LightGBM and Random Forest. More importantly, the article talks about what they do, and why they matter…
What Does an AI Engineer Do, and How Can You Become One in 2025?
Discover what AI engineers do, why they’re in high demand, and how you too can become an AI engineer…
Linear Regression - A Visual Introduction To (Almost) Everything You Should Know
This article will focus mostly on how the method is used in machine learning, so we won't cover common use cases like causal inference or experimental design. And although it may seem like linear regression is overlooked in modern machine learning's ever-increasing world of complex neural network architectures, the algorithm is still widely used across a large number of domains because it is effective, easy to interpret, and easy to extend. The key ideas in linear regression are recycled everywhere, so understanding the algorithm is a must-have for a strong foundation in machine learning…
Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know:
When you would need reservoir sampling.
The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise.
A simple way to implement reservoir sampling if you want to use it…
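As a rough illustration of the idea, here is a minimal sketch of one common form of reservoir sampling (often called Algorithm R). It is not taken from the linked essay; the function name `reservoir_sample` and its parameters are assumptions chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # i + 1 items have been seen so far; keep the new one with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(open("events.log"), 10)` would keep ten lines from a file (the file name is hypothetical) without reading it into memory or knowing its length in advance.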
PageQL - Embed SQL in HTML Templates
PageQL is an experimental template language and micro Python web framework that allows embedding SQL directly inside HTML. It was inspired by the ColdFusion language, which allows embedding SQL; by Handlebars / Mustache logic-less templates; and by HTMX, which simplifies web development…
How to Fit Monotonic Smooths in JAX using Shape Constrained P-Splines
Let’s say you have a trend you are trying to model that you know to be monotonically increasing or decreasing; this could be something like default as a function of risk, power usage as a function of temperature, or CO2 emissions over time. Generalized Additive Models (GAMs) are a great general purpose modeling tool that you could use to model these relationships, but they are unconstrained and could have undesired shape behavior…Shape Constrained Additive Models :) SCAMs use a reparameterization of a traditional B-spline basis to enforce a constraint…This blog post is an attempt to recreate the logic in the SCAM paper using Python and JAX. If you want to use these types of models for real there is an R-package for SCAMs and a P-spline implementation in Python using the pygam library…
Smarter Prompts for Better Responses: Exploring Prompt Optimization and Interpretability for LLMs
Crafting effective prompts is a critical step in aligning model behavior with user intent. However, manually optimizing prompts or interpreting model outputs is a trial-and-error process that is time-consuming and difficult to scale, especially as AI systems are deployed more widely across industries…In this blog post, we will explore two areas that are gaining traction to empower developers and teams to get the most out of their AI systems.
Prompt Optimization: Tools and frameworks that help users craft better-performing prompts with less trial and error.
Prompt Interpretability: Methods that offer transparency into how prompts influence model outputs, helping users debug and refine their interactions more effectively…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #598 here.
Cutting Room Floor
Minimal Implementation of Scalable Rectified Flow Transformers
{ggstatsplot}: {ggplot2} Based Plots with Statistical Details
AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms
Whenever you're ready, 3 ways we can help:
Want to get better at Data Science / Machine Learning Math? I have two weekly tutoring slots open. Hit reply to this email and let me know what you want to learn.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,000 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian