Data Science Weekly - Issue 596
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
April 24, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
AI Horseless Carriages
When I use AI to build software I feel like I can create almost anything I can imagine very quickly. AI feels like a power tool. It's a lot of fun…Many AI apps don't feel like that. Their AI features feel tacked-on and useless, even counter-productive…I am beginning to suspect that these apps are the "horseless carriages" of the AI era. They're bad because they mimic old ways of building software that unnecessarily constrain the AI models they're built with…To illustrate what I mean by that, I'll start with an example of a badly designed AI app….
YAGRI: You are gonna read it
YAGNI, or "You aren't gonna need it", is a standard piece of advice that warns against over-engineering and building too many features too early. I think it's great and saves you from wasting time, which can kill a project. However, there's an exception that I call YAGRI, or "You are gonna read it". It means that you shouldn't just store the minimum data required to satisfy the current product spec. You should also store data that you'll very likely use (read), such as timestamps and contextual metadata…
Questions about the Future of AI
Considerations about economics, history, training, deployment, investment, and more…What started as an attempt to consolidate some thoughts from the last few interviews on my podcast has turned into this 6,000 word clusterf*ck of questions and considerations…
Sponsor Message
Online Data Science Programs from Drexel University
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
How (not) to do a data science project
In order for people to find value in your work in data science (and want to hire you!), you need to show them what you've worked on. In practice, this will mostly be school projects or side projects. But when someone starts working on their first data science projects, they inevitably fall into basic traps that make their project look boring/useless/ugly, even if the project is an amazing idea. Here, I want to share what I think would make a great data science project (not enterprise production-level, but mostly side projects)…
Generative AI for Data Science 101: Coding Without Learning to Code
Should one teach coding in a required introductory statistics and data science class for non-major students?…With the release of large language models that write code, we saw an opportunity for a middle ground, which we tried in Fall 2023 in a required introductory data science course in our school’s full-time MBA program. We taught students how to write English prompts to the artificial intelligence tool GitHub Copilot that could be turned into R code and executed. In this short article, we report on our experience using this new approach…
Deep Learning is Not So Mysterious or Different
Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds…
Data science is not about... [Reddit Discussion]
There are a lot of posts on LinkedIn which claim: Data science is not about Python. It's not about SQL. It's not about models. It's not about stats... But it's about storytelling and business value. There is a huge number of people who are trying to convince everyone else of this BS, IMHO. It's just not clear why... Technical stuff is much more important. It reminds me of some rich people telling everyone else that money doesn't matter…
What ‘Out-of-distribution’ Is and Is Not
Researchers want to generalize robustly to ‘out-of-distribution’ (OOD) data. Unfortunately, this term is used ambiguously, causing confusion and creating risk: people might believe they have made progress on OOD data and not realize this progress only holds in limited cases. We critique a standard definition of OOD (difference-in-distribution) and then disambiguate four meaningful types of OOD data: transformed-distributions, related-distributions, complement-distributions, and synthetic-distributions. We describe how existing OOD datasets, evaluations, and techniques fit into this framework. We provide a template for researchers to carefully present the scope of distribution shift considered in their work…
AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing
Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job. This chapter introduces two alternative load balancing methods to traditional Flow-Based with Layer 3 ECMP: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often leads to congestion and underutilized links…
Air, an extremely fast R formatter
We’re thrilled to announce Air, an extremely fast R formatter. Formatters are used to automatically style code, but I find that it’s much easier to show what Air can do rather than tell, so we’ll start with a few examples. In the video below, we’re inside Positron and we’re looking at some unformatted code. Saving the file (yep, that’s it!) invokes Air, which automatically and instantaneously formats the code…
Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support
AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator)…
Apache Iceberg Internals Dive Deep On Performance
In this blog I will go over how Apache Iceberg contributes to the performance of compute engines. Apache Iceberg is an ACID table format designed for large-scale analytics workloads. While its consistency and schema evolution features were covered in a previous blog, its impact on query performance can be equally transformative. By the end of this document, you will have a deep understanding of how Iceberg enhances performance, the trade-offs involved, and best practices for maximizing efficiency in read-heavy workloads…
Visual AI for Geospatial: Earth Monitoring for Everyone with Earth Index
Earth Index is an end-user-focused application that preprocesses global imagery through AI foundation models to enable rapid in-browser search and monitoring. Earth Genome builds Earth Index for critical applications in the environment, and it is being used today to report on illegal airstrips built in the Peruvian Amazon, track cattle factory farms across the planet for emissions modeling, and expose illegal gold mining in the Yanomami Indigenous Territory…
pyfonts: A simple and reproducible way of using fonts in matplotlib.
It’s now possible (and super easy) to use ALL fonts from Google Fonts with just a single line of code and apply them in matplotlib…
A practical guide to fine-tuning embedding models
In this report, we try to answer questions like: if/when should you fine-tune embedding models, and what are the qualities of a good fine-tuning dataset? We'll deal with the embedding part of the retrieval pipeline, which means any changes or updates will require re-ingestion of the data, unlike reranking…
A Meta-Learning Approach to Bayesian Causal Discovery
Discovering a unique causal structure is difficult due to both inherent identifiability issues and the consequences of finite data. As such, uncertainty over causal structures, such as that obtained from a Bayesian posterior, is often necessary for downstream tasks. Finding an accurate approximation to this posterior is challenging, due to the large number of possible causal graphs, as well as the difficulty in the subproblem of finding posteriors over the functional relationships of the causal edges…To address these limitations, we propose a Bayesian meta-learning model that allows for sampling causal structures from the posterior and encodes these key properties…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #595 here.
Cutting Room Floor
An LLM‑as‑Judge Won't Save The Product—Fixing Your Process Will
How I think about learning - The most important skill you can hone
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,000 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian