Data Science Weekly - Issue 620
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
October 09, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
CMU’s Future Data Systems Seminar Series — Fall 2025
Databases are all about recording the events of the past. But we are interested in the question of what should a database system look like in the future to handle more of our past so we never forget…The Carnegie Mellon University Database Research Group is exploring this question with the Future Data Systems Seminar Series. The talks in this series will present ideas on modern system architectures and technologies for storing, managing, and accessing databases…
How does gradient descent work?
The simplest optimization algorithm is deterministic gradient descent…Perhaps surprisingly, traditional analyses of gradient descent cannot capture the typical dynamics of gradient descent in deep learning. We’ll first explain why, and then we’ll present a new analysis of gradient descent that does apply in deep learning…
Data Science Articles & Videos
The State of AI 2025
Now in its eighth year, the State of AI Report 2025 is reviewed by leading AI practitioners in industry and research. It considers the following key dimensions, including a new large-scale AI usage survey section:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Survey: The largest open-access survey of 1,200 AI practitioners and their AI usage patterns.
Predictions: What we believe will happen in the next 12 months and a performance review of last year’s predictions to keep us honest.
Are traditional statistical methods better than machine learning for forecasting? [Reddit]
I have a degree in statistics but for 99% of prediction problems with data, I’ve defaulted to ML. Now, I’m specifically doing forecasting with time series, and I sometimes hear that traditional forecasting methods still outperform complex ML models (mainly deep learning), but what are some of your guys’ experience with this?…
A Guide to Landing and Excelling in HCI Research Internships
During my PhD, I completed five research internships at Adobe (2x), Google (2x), and Meta. In my first year as a research scientist at Adobe, I’ve interviewed, made offers to, and mentored four PhD interns. Having experienced internships from both sides, as a mentor and a mentee, I am writing to share some advice on how to land and excel in a research internship. My experience is mostly with HCI interns, but I believe many of these insights generalize to other subfields of CS…
Generative AI for Data Visualisation
Can generative AI create good data visualisations? This blog post compares the performance of ChatGPT, Claude, Copilot, and Gemini when presented with a generic request to visualise a dataset…
Statistics in the era of AI
How do we mentor, teach, and do stats when AI can do so much of the work?…Coding is the means to the end of running statistical tests and creating visualizations. If I hired a technician to do the fieldwork instead of doing it myself, would you have an issue with that? If I hired a consultant to write the code when I told them what I needed, would that be a problem? Then, what’s the problem in doing stats with an (AI) consultant?…The thing is, I think this isn’t a simple question. There’s a lot of nuance…
Fast Matrix Multiply on an Apple GPU
I spent the weekends of a weird month writing a computer program to multiply matrices quickly. Matrix multiplication makes up the majority of the computational effort in getting ChatGPT to talk, so considerable human effort has gone into making it fast…Thanks to this, the matrix multiplication here does ~2.5 trillion 32-bit floating point operations a second while computing the product of two 4000 × 4000 matrices on my 2022 MacBook Air…There is a bona fide canon of good blog posts about fast matrix multiplication programs, this having the dubious distinction of being the first about an Apple GPU…
WhyTorch
This is a learning tool designed to teach you how various PyTorch functions work…
Reshelving generalization - You don’t need a theorem to argue more data is better than less data
We now turn to the mystical part of the machine learning course. When does a piece of computer code that makes good predictions on a sample of collected data make good predictions on data we haven’t seen? This question asks us to be metaforecasters. To predict prediction errors. Answering the question requires a theory of how collected data relates to the data in the wilds. In machine learning, this metaprediction problem is called generalization theory. In the applied sciences, this is the central question of external validity…
Handbook of Regression Modeling in People Analytics
The aim of this book is to encourage inexperienced analytics practitioners to ‘dip their toes’ further into the wide and varied world of regression in order to deliver more targeted and precise insights to their organizations and stakeholders on the problems they are most interested in…
Data Lake File Formats
At the core of a Data Lake is the Storage Layer, which predominantly houses files in open-source formats. These data lake file formats are essentially the cloud’s version of CSVs but with enhanced capabilities…The utility of data lake file formats extends beyond mere storage. They play a crucial role in data sharing and exchange between different systems and processing frameworks. Key attributes of these formats include their ability to be split and their support for schema evolution, making them versatile across various teams and programming languages…
AI-generated ‘participants’ can lead social science experiments astray, study finds
Data produced by “silicon samples” depends on researchers’ exact choice of models, prompts, and settings…
Is there anything actually new in data engineering? [Reddit]
I really want to believe that we haven’t reached the end of the line on creativity but it seems like there is nothing new under the sun. I see open source making a bunch of noise on ideas and techniques that have been around in the commercial sector for literally decades. I really hope I am just missing something here…
btw - A complete toolkit for connecting R and LLMs
btw provides a flexible toolkit that works across different workflows:
Copy-paste to external LLMs: Quickly gather context from your R session and copy it to your clipboard for pasting into ChatGPT, Claude, or any other chat interface.
Interactive chat in R: Launch a full-featured AI assistant directly in your IDE that can explore your environment, read documentation, and help you write code.
Build LLM-powered tools: Integrate btw’s capabilities into your own applications, whether you’re creating custom chat interfaces or connecting R to coding agents…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #619 here.
Cutting Room Floor
The “Gone Too Soon” Movie Star Hall of Fame: A Statistical Analysis
Fast approximate inference without convergence worries in PyMC
Generating the mth Lexicographical Element of a Combination Using the Combinadic
A Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models
Do well-written, clear instructions beat few-shotting for tiny-LLMs?
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~68,500 subscribers by sponsoring this newsletter. 30-35% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian