Data Science Weekly - Issue 590
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #590
March 13, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Using generative AI to scale DuoRadio 10x faster
DuoRadio is an audio experience that improves listening comprehension through short, podcast-like radio shows featuring Duolingo’s beloved World characters. When DuoRadio first launched in late 2023, it quickly became clear that the feature had enormous learning potential—if we could expand it to more courses, languages, and learners…We scaled DuoRadio DAUs from 100K to 5.5M while cutting costs by 99%. Here’s how…
Does your data's structure match the question you want to answer?
I have this thesis that people…often get frustrated with data – or they feel like data is useless – because the structure of the data they’re collecting doesn’t allow them to answer the question they want to answer. This usually happens because someone…is trying to take an existing report or data product and apply it to their question. The problem is that these reports and data products weren’t designed to answer that specific question. Often, they weren’t designed to answer any question…
Defense Against Dishonest Charts
Charts are a window into the world. When done right, we gain an understanding of who we are, where we are, and how we can become better versions of ourselves. However, when done wrong, in the absence of truth, charts can be harmful. This is a guide to protect ourselves and to preserve what is good about turning data into visual things. We start with chart anatomy; then we look at how small changes can shift a point of view; this takes us to misleading chart varieties; and we finish with reading data and next steps…
What’s on your mind
This Week’s Poll:
Two questions, because we got two answers last week.
I asked - what would you value MOST as a paid subscriber?
43% answered “deep dives into new research”
34% answered “curated datasets and code”
Take this quick 5-second poll →
We’ll share the results next week!
Last Week’s Poll:
Data Science Articles & Videos
A minimum viable Shiny infrastructure for serving 95,000 monthly users
All too often, Shiny gets criticized for scaling poorly ‐ and yet the DynastyProcess Shiny app crushes those expectations by serving over 95,000 unique users each month! In this talk, we’ll deep dive into the motivations, architecture, and design decisions driving the development of a massively popular app and share takeaways for scaling up your own apps…
What sort of things should I be doing in my personal time to make moving companies easier? [Reddit]
I'm looking to move from my current company, but am aware that's tough right now…What things can I do to show I have the hard and soft skills these roles are looking for and show I can succeed in a place that does measure impact?…I assume personal projects are less impressive than work projects, but is there anything I can do to make up for the fact that nothing I do at work really seems impressive either?…
TinySQL - build the SQL layer of a distributed database
TinySQL is a course designed to teach you how to implement a distributed relational database in Go…
Here’s how I use LLMs to help me write code
Using LLMs to write code is difficult and unintuitive. It takes significant effort to figure out the sharp and soft edges of using them in this way, and there’s precious little guidance to help people figure out how best to apply them. If someone tells you that coding with LLMs is easy they are (probably unintentionally) misleading you. They may well have stumbled on to patterns that work, but those patterns do not come naturally to everyone. I’ve been getting great results out of LLMs for code for over two years now. Here’s my attempt at transferring some of that experience and intuition to you…
Stop using the elbow criterion for k-means and how to choose the number of clusters instead
A major challenge when using k-means clustering often is how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in literature for a long time, and we want to draw attention to some of these easy to use options, that often perform better…
When Kafka is not the right Move
When designing distributed systems, event streaming platforms such as Kafka are the preferred solution for asynchronous communication. In fact, on Kafka’s official website, the first use-case listed is messaging. My belief is that this default choice can lead to problems and unnecessary complexity. The main reason for this is the conversion between state and events, which we’ll look at in this article through a game of chess…
Python Developer Tooling Handbook
This is not a book about programming Python. Instead, the goal of this book is to help you understand the ecosystem of tools used to make Python development easier and more productive. For example, this book will help you make sense of the complex world of building Python packages: what exactly are uv, Poetry, Flit, Setuptools, and Hatch? What are the pros and cons of each? How do they compare to each other? It also covers tools for linting, formatting, and managing dependencies…
Stable Diffusion I — Mathematics Behind It
Stable Diffusion is an AI model that creates images from text by starting with pure noise (like static) and gradually removing the noise step by step until the image matches the description. It learns how to turn noise into meaningful images using millions of examples, combining deep neural networks with attention mechanisms to understand both visual patterns and text prompts…
Machine Teaching: What I Learned From My Optimizer
We’ve developed sophisticated approaches to constructing and risk-managing portfolios in our systematic disciplines. This has helped us build optimizers that we utilize across our discretionary investment strategies as well, introducing additional rigor into human-driven processes. We’ve learned a great deal from working with those tools, and they’ve sharpened our intuition as investors. We reflect below on those lessons, starting with some basics…
Anthropic released Claude Code, their competitor to Anysphere’s Cursor and Codeium’s Windsurf. Claude Code is a tool that uses an LLM as an agent to take user commands to complete software engineering tasks. In this blog post, we will try to decompose and better understand how Claude Code works under the hood…
Understanding Transformers... (beyond the Math)
One thing I've come to understand is that transformers are essentially state simulators. Each individual prediction has its own separated state - it's not carried over from the previous one. This is important because language isn't like a linear causal left-to-right progression where state progresses at a linear rate…
Scaling Beyond Postgres: How to Choose a Real-Time Analytical Database
This article will explore how real-time analytical databases address critical analytical requirements. We will explore the differences between cloud data warehouses like Snowflake and BigQuery, legacy OLAP databases like Vertica, and a new class of real-time analytical databases like ClickHouse and StarRocks that combine elements of both of these categories. We will also examine the categories of today's analytics solutions and how to choose the right one. Finally, we will discuss the history of analytical databases and how to choose a migration path to such a database…
Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #589 here.
Cutting Room Floor
Build something interactive - And it doesn't have to be data viz!
Manify: A Python Library for Learning Non-Euclidean Representations
The AI Scientist Generates its First Peer-Reviewed Scientific Publication
Automating the Search for Artificial Life with Foundation Models
From diagnosis to treatment: Advancing AMIE for longitudinal disease management
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~67,200 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian