Data Science Weekly - Issue 535

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Feb 23, 2024

Issue #535
February 22, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)

And now…let's dive into some interesting links from this week.

Editor's Picks

Scaling ChatGPT: Five Real-World Engineering Challenges
Just one year after its launch, ChatGPT had more than 100M weekly users. In order to meet this explosive demand, the team at OpenAI had to overcome several scaling challenges…Evan Morikawa, who led the Applied engineering team as ChatGPT launched and scaled, reveals the engineering challenges and how the team tackled them. We cover:
1. A refresher on OpenAI, and an introduction to Evan
2. Importance of GPUs
3. How does ChatGPT4 work?
4. Five scaling challenges:
  - KV Cache & GPU RAM
  - Optimizing batch size
  - Finding the right metrics to measure
  - Finding GPUs wherever they are
  - Inability to autoscale
5. Lessons learned…

The GenAI Guidebook
You may ask: isn’t there already a lot of material out there? You’re right, and that’s precisely why this is a guidebook. So much is being shared across papers, YouTube, blog posts, tweets, GitHub, etc. that keeping track of it all without getting overwhelmed is its own challenge.
Here’s what’s inside:
1. A guided tour of the fundamentals
2. The minimal resources needed to get a great understanding
3. Deep Dives into particular libraries, topics and papers…

Why pandas feels clunky when coming from R
Five years ago I started a new role and I suddenly found myself, a staunch R fan, having to code in Python on a daily basis. Working with data, most of my Python work involved using pandas, the Python data frame library, and initially I found it quite hard and clunky to use, being used to the silky smooth API of R’s tidyverse. And you know what? It still feels hard and clunky, even now, 5 years later!…Let’s first step through a short analysis of purchases using R and the tidyverse. After that we’ll see how the same solution using Python and pandas compares…

A Message from this week's Sponsor:

Join AI leaders for GenAI Productionize 2024

Announcing GenAI Productionize 2024 – a FREE, virtual industry-first summit focused on productionizing generative AI within the enterprise!

Join AI experts from Coinbase, LinkedIn, Roblox, Databricks, JPMorgan Chase, Comcast, Fidelity and more for insights and actionable strategies on generative AI governance, frameworks, and practical techniques for evaluation and observability.

- How organizations have successfully adopted enterprise GenAI

- Practical hands-on architecture and tool insights from leading GenAI builders

- The emerging enterprise GenAI stack, from orchestration to evaluation, inference, retrieval and more

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

“AI will cure cancer” misunderstands both AI and medicine
I recently watched a late-night talk show where skeptics and enthusiasts debated AI safety. Despite their conflicting views, there was one thing they could all agree upon. “AI will cure cancer,” one panelist declared, and everyone else confidently echoed their agreement. This collective optimism strikes me as overly idealistic and raises concerns about a failure to grapple with the realities of healthcare…In many instances, claims like “AI will cure cancer” are being invoked as little more than superficial marketing slogans. To understand why, it is first necessary to understand how AI is used, and how the medical system operates…
On Averaging and Extrapolation for Gradient Descent
This work considers the effect of averaging, and more generally extrapolation, of the iterates of gradient descent in smooth convex optimization. After running the method, rather than reporting the final iterate, one can report either a convex combination of the iterates (averaging) or a generic combination of the iterates (extrapolation). For several common stepsize sequences, including recently developed accelerated periodically long stepsize schemes, we show averaging cannot improve gradient descent's worst-case performance and is, in fact, strictly worse than simply returning the last iterate. In contrast, we prove a conceptually simple and computationally cheap extrapolation scheme strictly improves the worst-case convergence rate…
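To make the distinction concrete, here is a minimal sketch (not the paper's code) that runs plain gradient descent on a small convex quadratic and compares the last iterate with the uniform average of all iterates. The test problem, the step size of 1/L, and the helper name `gradient_descent` are assumptions chosen for illustration. A single run like this says nothing about worst-case rates, and the paper's extrapolation scheme is not reproduced here.

```python
import numpy as np

def gradient_descent(grad, x0, stepsize, num_steps):
    """Run plain gradient descent and return every iterate, including x0."""
    iterates = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        x = iterates[-1]
        iterates.append(x - stepsize * grad(x))
    return np.array(iterates)

# Hypothetical test problem: f(x) = 0.5 * x^T A x with A positive definite,
# so f is smooth and convex with minimum value 0 at x = 0.
A = np.diag([1.0, 0.1])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L = 1.0  # smoothness constant = largest eigenvalue of A
iterates = gradient_descent(grad, x0=[1.0, 1.0], stepsize=1.0 / L, num_steps=20)

last_iterate = iterates[-1]
averaged_iterate = iterates.mean(axis=0)  # uniform convex combination of iterates

print("f(last iterate):    ", f(last_iterate))
print("f(averaged iterate):", f(averaged_iterate))
```

On this particular problem the averaged point has a higher objective value than the last iterate, because the average keeps a share of the early, far-from-optimal iterates.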
I'm Open Sourcing Our RAG Backend: Our CQH, GQL & CHS [Reddit]
I’m ‘open sourcing’ the backbone of our tech stack, including three components that have been pivotal in enhancing our capabilities in processing and managing complex queries, logging, and chat session optimizations…1. Complex Question Handling (CQH)…2. Generic Query Logger (GQL)…3. Chat History Summarization (CHS)…Full disclaimer: this isn’t some magic code; it’s pretty standard. Many beginners were asking the same question, which all the snippets cover…
Which Films Were Underappreciated in Their Time? A Statistical Analysis
Many films, like Fight Club, slip through the cracks only to be rediscovered after their theatrical run. In a world of IMDB, Letterboxd, and streaming, movies can live second lives that outshine their initial exhibition. So today, we'll explore the films whose secondary viewership greatly surpasses that of their initial release and the qualities unique to these movies…
Seamless Support with LangSmith
It's a common misconception that LangChain's LangSmith is only compatible with LangChain's models. In reality, LangSmith is a unified DevOps platform for developing, collaborating, testing, deploying, and monitoring LLM applications. In this blog we will explore how LangSmith can be used to enhance the OpenAI client alongside instructor…
Singular Value Decomposition Part 1: Perspectives on Linear Algebra
The singular value decomposition (SVD) of a matrix is a fundamental tool in computer science, data analysis, and statistics. It’s used for all kinds of applications from regression to prediction, to finding approximate solutions to optimization problems. In this series of two posts we’ll motivate, define, compute, and use the singular value decomposition to analyze some data…I want to spend the first post entirely on motivation and background…
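For readers who want to see the decomposition before reading the series, here is a small sketch using NumPy's `np.linalg.svd`. The matrix `M` is made-up data, and the rank-1 truncation is just one standard use of the SVD, not necessarily the example the posts work through.

```python
import numpy as np

# Hypothetical 4x3 data matrix (rows = observations, columns = features).
M = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 1.0, 4.0],
    [1.0, 1.0, 1.0],
])

# Thin SVD: M = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Multiplying the factors back together recovers M (up to rounding).
reconstructed = U @ np.diag(s) @ Vt

# Keeping only the top-k singular values gives a rank-k approximation.
k = 1
rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
approx_error = np.linalg.norm(M - rank_k)  # Frobenius norm of the residual

print("singular values:", s)
print("rank-1 approximation error:", approx_error)
```

The residual of the rank-k truncation equals the square root of the sum of the squared discarded singular values, which is why small trailing singular values can be dropped with little loss.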
Linear Regression is underrated [Reddit]
I wanted to share a quick story from the trenches of data science. I am not a data scientist but an engineer; however, I've been working on a dynamic pricing project where the client was all in on neural networks to predict product sales and figure out the best prices using an overly complicated setup. They tried linear regression once, didn't work magic instantly, so they jumped ship to the neural network, which took them days to train. I thought, "Hold on, let's not ditch linear regression just yet." Gave it another go, dove a bit deeper, and bam - it worked wonders…
tinytable - Easy, beautiful, and customizable tables in R
tinytable is a small but powerful R package to draw HTML, LaTeX, Word, PDF, Markdown, and Typst tables. The interface is minimalist, but it gives users direct and convenient access to powerful frameworks to create endlessly customizable tables…
Advanced Retrieval-Augmented Generation: From Theory to LlamaIndex Implementation
How to address limitations of naive RAG pipelines by implementing targeted advanced RAG techniques in Python…The advanced RAG paradigm comprises a set of techniques targeted at addressing known limitations of naive RAG. This article first discusses these techniques, which can be categorized into pre-retrieval, retrieval, and post-retrieval optimizations…
The Shift from Models to Compound AI Systems
This post analyzes the trend toward compound AI systems and what it means for AI developers. Why are developers building compound systems? Is this paradigm here to stay as models improve? And what are the emerging tools for developing and optimizing such systems—an area that has received far less research than model training? We argue that compound AI systems will likely be the best way to maximize AI results in the future, and might be one of the most impactful trends in AI in 2024…
AI/ML Internships [Reddit Discussion with thoughts from a FAANG hiring engineering manager]
Why is it so hard to land internships in AI/ML right now? I currently have 1 year of experience with some high-level AI projects, yet somehow I'm unable to land any internship. As for jobs, I hardly find any that require less than 5 years of experience. It's depressing, to be honest. Can anybody help?…
Let's build the GPT Tokenizer [Video]
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely…
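As a rough illustration of the train / encode() / decode() structure described above, here is a minimal byte-level Byte Pair Encoding sketch. It is not the lecture's code: the function names (`count_pairs`, `merge`, `train`) and the toy training string are assumptions, and it skips parts that production GPT tokenizers add, such as regex pre-splitting and special tokens.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order learned
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """String -> token ids, applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Token ids -> string, by expanding each id back into raw bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

toy_text = "aaabdaaabac"  # hypothetical training string
toy_merges = train(toy_text, num_merges=3)
toy_ids = encode(toy_text, toy_merges)
print(toy_ids, "->", decode(toy_ids, toy_merges))
```

Because the tokenizer is trained separately on its own text, a string that looks nothing like the training data (for example, non-English text) falls back to many single-byte tokens, which is one source of the odd behaviors the lecture traces to tokenization.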

Educational Sponsor

R in 3 Months is here to help you finally learn R

R in 3 Months is the program to help you achieve your goal of **finally** learning R.

R in 3 Months is more than just a course: it's got high-quality course material, and it has personalized feedback and a supportive community.

Over 300 people from around the world have participated in R in 3 Months. Now it's your turn. The next cohort starts March 14 and you can get $50 off if you sign up by March 1 and use coupon code DSWEEKLY.

Training & Resources

Building Diffusion Model's theory from ground up
Diffusion Model, a new generative model family, has taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as a probabilistic Markov Chain, their fundamental principle lies in the decade-old theory of Stochastic Differential Equation (SDE), as found out later by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation, and show how the idea can be 'shaped' to get to the modern form of Score-based Diffusion Models. We'll start from the very definition of 'score', how it was used in the context of generative modeling, how we achieve the necessary theoretical guarantees, how the design choices were made and finally arrive at the more 'principled' framework of Score-based Diffusion. Throughout the article, we provide several intuitive illustrations for ease of understanding…
New Data Engineering advice from a Principal [Reddit]
I see a lot of folks here asking how to break into Data Engineering, and I wanted to offer some advice beyond the fundamentals of learning tool X. I've hired and trained dozens of people in this field, and at this point I've got a pretty solid sense of what makes someone successful in it. This is what I'd personally recommend…
Mamba: The Hard Way - Let's implement Mamba in Triton.
A gentle (but mildly obsessive) tutorial notebook about GPU programming in Triton. We're getting close to mere mortals being able to do this 😂…This blog is about Mamba, a recent neural architecture that can be roughly thought of as a modern recurrent neural network (RNN). The model works really well and is a legitimate competitor with the ubiquitous Transformer architecture. It has gotten a lot of attention. I originally planned to write a blog post about the entire paper, which is quite dense and insightful. However, I became fascinated just by the S6 algorithm as described here. This algorithm describes how one can compute an extremely large RNN efficiently on modern hardware, and extends ideas explored in S4 and S5 from recent years…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #534 here.

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

PS. If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)
