Data Science Weekly - Issue 594
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #594
April 10, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
The evolution of graph learning
We describe how graphs and graph learning have evolved since the advent of PageRank in 1996, highlighting key studies and research…
There Are No New Ideas in AI…Only New Datasets
LLMs were invented in four major developments... all of which were datasets…If you squint just a little, these four things (DNNs → Transformer LMs → RLHF → Reasoning) summarize everything that’s happened in AI…We had DNNs (mostly image recognition systems), then we had text classifiers, then we had chatbots, now we have reasoning models (whatever those are)…Say we want to make a fifth such breakthrough; it could help to study the four cases we have here. What new research ideas led to these groundbreaking events?…
Big Book of R
If you’re like me, you can’t help but bookmark every R-related programming book you find in the hopes that one day you, or someone you know, might find it useful. Hopefully this is the only bookmark you’ll need in future ;). When I initially released this collection in late August 2020, it contained about 100 books that I’d been collecting over the previous two years. Since then I’ve found a few more and there have been contributions from many people. The collection now stands at over 400 free, open-sourced books…
Sponsor Message
Stop waiting for engineering - build AI agents with no code 👨‍💻
Wordware.ai now integrates with 2,000+ apps & data-sources!
Connect your favorite apps, build AI agents, and let any trigger start automated workflows that do actual work for you.
It’s 2025, you’re smart, you know what good looks like - you can just.. build things!
Free monthly credits until we drain our $30M seed round 😂
Try it here!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Dice and Queues
One of the key insights from queuing theory is that the average queue size for an unbounded system tends to increase significantly as utilization approaches 100%…Queuing models are often expressed using Kendall Notation (e.g. M/M/1 or M/D/1). These models specify different arrival and departure distributions, such as Poisson or Deterministic. They end up having closed-form expressions that define the average queue size as a function of different system parameters. While these equations are fairly easy to use, I wanted to get a more intuitive understanding of how these models behave and decided to build a simulation based on the idea of rolling a single die many times…
Absolutely BOMBED Interview
First interview went well, then got a technical interview scheduled for today and ABSOLUTELY BOMBED it. It was BAD BADD. It made me realize how confused I was with some of the basics when it comes to the field and that I was just jumping to more advanced skills, similar to what a lot of people on this group do. It was literally so embarrassing and I know I won’t be moving to the next steps. Basically the advice I got from the senior data scientist was to focus on the basics and don’t rush ahead to making complex models and deployments. Know the basics of SQL, Statistics (linear regression, logistic, xgboost) and how you’re getting your coefficients and what they mean, and Python…
swirl
swirl teaches you R programming and data science interactively, at your own pace, and right in the R console!…
Recent AI model progress feels mostly like bullshit
Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models…
Insights into population dynamics: A foundation model for geospatial inference
We introduce a population dynamics foundation model and dataset able to easily be adapted to solve a wide array of geospatial problems across health, socioeconomic, and environmental tasks…
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses…
plumber2
This is a complete rewrite of plumber. The purpose of the rewrite is to take everything we’ve learned from plumber, shed the bad decision that you inevitably make over the course of development, and start from scratch…You’ll find that plumber2 is very similar to plumber in a lot of ways, but diverts in key areas, resulting in API incompatibility between the two packages. Because of this you may need to update your plumber APIs if switching to plumber2…
Generative Benchmarking
Widely-used public benchmarks often rely on artificially clean datasets and generic domains, with the added concern that they were likely seen by embedding models in training. Prior efforts such as RAGAS and AIR-Bench have used synthetic testset generation to address such limitations, primarily optimizing for dataset diversity. Building on this direction, we introduce a generative benchmarking method with a new objective: representativeness. Using production data as our ground truth, we demonstrate that our generated queries reflect real user queries and that they can capture performance differences that public benchmarks may miss…
Shiny without Boundaries: One App, Multiple Destinations
This lecture tackles the challenge of deploying Shiny applications across multiple environments. We’ll explore deployment pathways including cloud services, Docker containers, on-premise servers, Electron desktop applications, WebAssembly browser solutions, and Quarto integration. You’ll learn practical implementation strategies and key trade-offs for each approach, enabling you to deploy a single Shiny application virtually anywhere—reaching users where they are while maintaining full functionality…
The Future of Open Source Table Formats: Apache Iceberg and Lance
As the scale of data continues to grow, open-source table formats have become essential for efficient data lake management. Apache Iceberg has emerged as a leader in this space, while new formats like Lance are introducing optimizations for specific workloads. In this post, we’ll explore how Iceberg and Lance address different challenges and complement each other in the evolving landscape of data lake table formats…
Plot sea surface temperature in R
The Copernicus Marine Service (CMS), also known as the Copernicus Marine Environment Monitoring Service, is the marine component of the European Union’s Copernicus Programme…It delivers free, regular, and systematic information on the state of the ocean, encompassing the Blue (physical), White (sea ice), and Green (biogeochemical) components, both globally and regionally…Create a map with sst information from Copernicus in R…
A Survey of Meta-Reinforcement Learning
In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner…
Experimenting with Reinforcement Learning with Verifiable Rewards (RLVR)
Here's the latest talk I gave, last Friday at the USC Information Sciences Institute. It's a slightly more technical version of the RL talks I've been giving, focusing on the different ways we (and the community) are experimenting with RL for reasoning. It includes a bunch of discussion on GRPO expanding on my previous video…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #593 here.
Cutting Room Floor
How causal perspectives can inform problems in computational neuroscience
Building Regulariser: Python library for regularizing building footprints in geospatial data
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~67,600 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian