Data Science Weekly - Issue 581
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
January 09, 2025
Happy New Year!
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
aRtist
Making art with R…
Agents
Intelligent agents are considered by many to be the ultimate goal of AI…The unprecedented capabilities of foundation models have opened the door to agentic applications that were previously unimaginable…This section will start with an overview of agents and then continue with two aspects that determine the capabilities of an agent: tools and planning. Agents, with their new modes of operations, have new modes of failure. This section will end with a discussion on how to evaluate agents to catch these failures…
Browser-based parquet viewer
The browser-based parquet viewer is pretty sweet: it lets you explore the file format, including schema and layout, and query the data with SQL. Natch, it is based on Apache DataFusion…
Your Input + A New Newsletter for AI Enthusiasts!
Quick Poll:
AI and LLMs are transforming workflows, but how many of us are really using them?
[Take this quick 5-second poll →]
We’ll share the results (and our insights) next week!
Introducing:
AI Agents, AI Engineering, & LLM Systems Newsletter
We’re excited to launch a new newsletter exploring cutting-edge AI agents, engineering techniques, and LLM systems. If you’re fascinated by practical AI insights and tools, this is for you!
[Subscribe to the newsletter →]
This newsletter is an extension of what you love about Data Science Weekly—practical, insightful, and focused on AI systems that work.
Data Science Articles & Videos
Dive into Time-Series Anomaly Detection: A Decade Review
While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years calls for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes existing anomaly detection solutions under a process-centric taxonomy in the time series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research…
What's your biggest time sink as a data scientist?
I'm curious what data-scientist specific task/problem is the biggest time suck for you at work. I feel like we're often building a new class of software in companies and systems that were designed for web 2.0 (or even 1.0)…
Making maps with R
Maps allow us to easily convey spatial information. Here, we show how to create both static and interactive maps by using several mapping packages including ggplot2 (Wickham, Chang, et al. 2022), leaflet (Cheng, Karambelkar, and Xie 2022), mapview (Appelhans et al. 2022), and tmap (Tennekes 2022). We create maps of areal data using several functions and parameters of the mapping packages. We also briefly describe how to plot point and raster data. Then, we show how to create maps of flows between locations with the flowmapblue package (Boyandin 2023)…
Transformer Architecture: The Positional Encoding
Let's use sinusoidal functions to inject the order of words in our model…In this article, I don’t plan to explain its architecture in depth as there are currently several great tutorials on this topic…, but alternatively, I want to discuss one specific part of the transformer’s architecture - the positional encoding…
Learning CUDA by optimizing softmax: A worklog
Optimizing softmax, especially in the context of GPU programming with CUDA, presents many opportunities for learning. In this worklog, we will start by benchmarking PyTorch's softmax operation, and then we will iteratively optimize it in CUDA. The NVIDIA GPU used for this worklog is one GTX 1050Ti (that's all I have got right now)…
AI and Startup Moats
This article is my attempt to enumerate all the possible moats you can count on that will still be relevant in the age of AI (and the ones that, I think, will not fare so well)…
Top 10 Data Engineering & AI Trends for 2025
According to industry experts, 2024 was destined to be a banner year for generative AI. Operational use cases were rising to the surface, technology was reducing barriers to entry, and general artificial intelligence was obviously right around the corner. So… did any of that happen? Well, sort of…Here’s where leading futurist and investor Tomasz Tunguz thinks data and AI stand at the end of 2024, plus a few predictions of my own…2025 data engineering trends incoming…
Model Merging and You
Model Merging is a weird and experimental technique which lets you take two models and combine them together to get a new model. This is primarily used in Large Language Models, where the likely convergent representations allow this technique to work somewhat better than you might expect, given the concept. Model merging techniques are interesting since they allow researchers to create new models which are somehow performant in new ways despite not doing any additional training. I think this is really weird, so I need to know about it…
OpenRepoWiki - You don’t need to read the code to understand how to build!
OpenRepoWiki is a tool that automatically generates a comprehensive wiki page for any given GitHub repository. I hate reading code, but I want to learn how to build stuff, from websites to databases. That's why I built OpenRepoWiki, where we can understand the purpose of the files and folders of a particular repository…
Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs
In this note, we show that there is…[a]…reason why we observe behaviors today that appear at odds with intuitions taught in classical statistics textbooks, which is much simpler to understand yet rarely discussed explicitly. In particular, many intuitions originate in fixed design settings, in which in-sample prediction error (under resampling of noisy outcomes) is of interest, while modern ML evaluates its predictions in terms of generalization error, i.e. out-of-sample prediction error in random designs…
Allen Downey - A future of data science
In the hype cycle of data science, I suggest that the "peak of inflated expectations" was in 2012, the "trough of disillusionment" was in 2016, and since then, we have climbed the "slope of enlightenment". Now, as we approach the "plateau of productivity", it's a good time to figure out how we got here and what future we want. Can we use data to answer questions, resolve debates, and make better decisions? What tools and processes make data science work? What can we learn when it does, and what goes wrong when it doesn't? In this talk, I will present my answers, and then I would like to hear yours…
Robots That Learn - UC Berkeley CS 294-277
Robot learning lecture series by Professor Jitendra Malik at University of California, Berkeley. Course TA: Toru Lin…
Transfer Learning 101
In this blog, we’ll explore the foundations and implementation of Transfer Learning from scratch (building intuitions), covering the following topics:
WHY Transfer Learning?
Practical scenarios
The Architecture behind Transfer Learning
Mathematics behind Transfer Learning
Perspective 1: Representation Learning
Perspective 2: Bayesian Approach
Hands-on Implementation…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #580 here.
Cutting Room Floor
A First Introduction to Cooperative Multi-Agent Reinforcement Learning
Medical large language models are vulnerable to data-poisoning attacks
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~65,442 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian