Data Science Weekly - Issue 566
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #566
September 26, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
- ML and LLM system design: 450 case studies to learn from 
 A database of 450 case studies from 100+ companies…
- Langfun - OO for LLMs 
 Langfun is a PyGlove powered library that aims to make language models (LM) fun to work with. Its central principle is to enable seamless integration between natural language and programming by treating language as functions. Through the introduction of Object-Oriented Prompting, Langfun empowers users to prompt LLMs using objects and types, offering enhanced control and simplifying agent development…Langfun is compatible with popular LLMs such as Gemini, GPT, Claude, all without the need for additional fine-tuning…
- Introducing Netflix’s Key-Value Data Abstraction Layer 
 In this post, we dive deep into how Netflix’s KV abstraction works, the architectural principles guiding its design, the challenges we faced in scaling diverse use cases, and the technical innovations that have allowed us to achieve the performance and reliability required by Netflix’s global operations…
A Sponsor Message
Quadratic - analyze anything, host anywhere
With Quadratic, combine the spreadsheets your organization asks for with the code that matches your team’s code-driven workflows.
Powered by code, you can build anything in Quadratic spreadsheets with Python, JavaScript, or SQL, all approachable with the power of AI.
Use the data tool that actually aligns with how your team works with data, from ad-hoc to end-to-end analytics, all in a familiar spreadsheet.
Level up your team’s analytics with Quadratic today.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
- On Impactful AI Research 
 Grad students often reach out to talk about structuring their research, e.g. how do I do research that makes a difference in the current, rather crowded AI space? Too many feel that long-term projects, proper code releases, and thoughtful benchmarks are not incentivized — or are perhaps things you do quickly and guiltily to then go back to doing 'real' research…This post distills thoughts on impact I've been sharing with folks who ask. Impact takes many forms, and I will focus only on making research impact in AI via open-source work through artifacts like models, systems, frameworks, or benchmarks…
- Bias/Variance is not the same as Approximation/Estimation 
 We study the relation between two classical results: the bias-variance decomposition, and the approximation-estimation decomposition. Both are important conceptual tools in Machine Learning, helping us describe the nature of model fitting. It is commonly stated that they are “closely related”, or “similar in spirit”. However, sometimes it is said they are equivalent. In fact they are different, but have subtle connections cutting across learning theory, classical statistics, and information geometry, that (very surprisingly) have not been previously observed…
- Running 7 Million Jobs in Parallel [Reddit Discussion] 
 Wondering what people’s thoughts are on the best tool for running 7 million tasks in parallel. Each task takes between 1.5 and 5 minutes and consists of reading from Parquet, doing some processing in Python, and writing to Snowflake…Let’s assume each task uses 1GB of memory during runtime. Right now I am thinking of using Airflow with multiple EC2 machines. Even with 64-core machines, it would take at worst 350 days to finish running this, assuming each job takes 300 seconds. Does anyone have any suggestions on what tool I can look at?…
- AI Advantage for Startups: Changing the Workflow through Services 
 Suppose you’re a startup in a competitive market with a large incumbent who owns the system of record - the software that runs the sales team or the support team or the marketing team. How do you win?…In the last decade, startups have chosen to identify a feature or workflow to improve & leverage that wedge into an advantage. Many have reached great levels of success, but few have overturned the incumbent. Early AI advantages have reinforced this advantage. How does this change?…
- PCA as an embedding technique 
 If you have text represented as a sparse vector, then there are a few things that you cannot do. In particular, not every scikit-learn model can deal with it, most notably the histogram boosted ensemble models. So what if we use PCA to turn those sparse arrays into dense embeddings?…
- Fast TRAC 🏎 A Parameter-free Optimizer for Lifelong Reinforcement Learning 
 A key challenge in lifelong reinforcement learning (RL) is the loss of plasticity, where previous learning progress hinders an agent's adaptation to new tasks. While regularization and resetting can help, they require precise hyperparameter selection at the outset and environment-dependent adjustments. Building on the principled theory of online convex optimization, we present a parameter-free optimizer for lifelong RL, called TRAC, which requires no tuning or prior knowledge about the distribution shifts…
- Redis 8.0-M01 released – One Redis for every use case 
 Redis 8 introduces seven new data structures (JSON, time series, and five probabilistic types) along with the fastest and most scalable Redis query engine to date…This release also introduces advanced capabilities like vector search, secondary indexing for full-text search, exact matching, geospatial queries, numeric data handling, and data processing. Additional key features will be revealed in upcoming milestone releases, further enhancing Redis’ real-time capabilities…
- What I’ve Learned in the Past Year Spent Building an AI Video Editor 
 Last year I was let go after just 6 months in a new role. I had left a great company and boss to take a chance on a startup, and before I’d even begun, it was over…I decided to take the event as an opportunity, and explore what was now becoming possible in video with LLMs, Diffusion models, and the growing number of other open models…
- Fixing Bias: How Error Analysis Improved My Water Forecasting Model 
 The western United States has a huge water problem. That’s why much effort goes into managing the water properly, including forecasting a season’s water supply. The Bureau of Reclamation, which manages water and power in the West, sponsored a machine learning competition on DrivenData to improve the forecasts…Months into the challenge my models’ performance looked good. But then I did a deeper analysis of the model errors and found that the models had some strange error patterns…
- CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA® Hopper™ GPUs 
 No series of CUDA® tutorials is complete without a section on GEMM (GEneral Matrix Multiplication). Arguably the most important routine on modern GPUs, GEMM constitutes the majority of compute done in neural networks, large language models, and many graphics applications. Despite its ubiquity, GEMM is notoriously hard to implement efficiently. This 3-part tutorial series aims to equip readers with a thorough understanding of how to write efficient GEMM kernels on NVIDIA Hopper GPUs using the CUTLASS library…
- PLS 205 - Experimental Design and Analysis 
 The goal of this course is to introduce graduate students in the biological sciences to the fundamental concepts and introductory statistical methods necessary to plan, conduct, and interpret effective experiments…
- rix: Reproducible Data Science Environments with 'Nix' 
 Simplifies the creation of reproducible development environments using the 'Nix' package manager. The included 'rix()' function generates a complete description of the development environment as a 'default.nix' file, which can then be built using 'Nix'. This results in project specific software environments with pinned versions of R, packages, linked system dependencies, and other tools. Additional helpers make it easy to run R code in 'Nix' software environments for testing and production…
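
For the "Running 7 Million Jobs in Parallel" item above, the estimate in the post is easy to sanity-check. The sketch below is ours, not the poster's: it assumes one task per core, perfect parallelism, no scheduling or I/O overhead, and the worst-case 300 seconds per task. The function names are hypothetical helpers written for this illustration.

```python
import math

SECONDS_PER_DAY = 86_400


def wall_clock_days(n_tasks: int, seconds_per_task: float, n_workers: int) -> float:
    """Days needed if each worker runs one task at a time, back to back."""
    rounds = math.ceil(n_tasks / n_workers)  # sequential batches of tasks
    return rounds * seconds_per_task / SECONDS_PER_DAY


def workers_needed(n_tasks: int, seconds_per_task: float, deadline_days: float) -> int:
    """Smallest worker count whose combined compute time fits in the deadline."""
    total_seconds = n_tasks * seconds_per_task
    return math.ceil(total_seconds / (deadline_days * SECONDS_PER_DAY))


# Numbers taken from the post: 7 million tasks, 300 s worst case, one 64-core machine.
one_machine = wall_clock_days(7_000_000, 300, 64)
# Hypothetical target of one week, chosen only for illustration.
week_workers = workers_needed(7_000_000, 300, 7)

print(f"{one_machine:.0f} days on 64 workers")
print(f"{week_workers} workers to finish in 7 days")
```

Under these assumptions a single 64-core machine needs roughly 380 days, in the same ballpark as the post's figure, and finishing within a week would take a few thousand concurrent workers. Whatever tool is chosen, the problem is mostly one of provisioning enough parallel capacity.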
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #565 here.
Cutting Room Floor
- Generalized Additive Models (GAMs) for Meta-Regression using brms 
- Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling 
- Priompt - (priority + prompt) is a JSX-based prompting library 
- Augmenting Statistical Models with Natural Language Parameters 
- Local File Organizer: AI File Management Run Entirely on Your Device, Privacy Assured 
- Model2Vec: Distill a Small Fast Model from any Sentence Transformer 
- Can you draw scientific conclusions with interpretable machine learning? 
- An In-Depth Guide to Contrastive Learning: Techniques, Models, and Applications 
Whenever you're ready, 2 ways we can help:
- Looking to get a job? Check out our “Get A Data Science Job” Course 
 It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
- Promote yourself/organization to ~63,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate. 
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian


