Data Science Weekly - Issue 543
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
April 18, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Ten Statistical Ideas that Changed the World
Trevor Hastie and Rob Tibshirani interview authors of seminal papers in the field of Statistics. This is part of a project from Stanford's Stat 319 class in Winter 2024 to discuss important papers in the field. Please visit the website below to find the original papers, presentation slides, and summaries…
The 7 personas of Machine Learning - And what they need from you as a leader
Over the years we’ve worked with some incredibly high-performing data organizations; yet we’ve also worked with some teams that are train wrecks…There are some common traits here; while no two teams are exactly alike, each character in the narrative faces a familiar set of roadblocks…So in today’s post we dive into the seven personas on an ML team: what they do, where they stumble, and what keeps them up at night…
Embeddings are a good starting point for the AI curious app developer
Vector embeddings have been an Overton window shifting experience for me, not because they’re sufficiently advanced technology indistinguishable from magic, but the opposite. Once I started using them, it felt obvious that this was what the search experience was always supposed to be: less “How did you do that?” and more mundanely, “Why isn’t this everywhere?”…
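For readers who want to see what embedding-based search looks like in code, here is a minimal sketch (ours, not from the linked article). It ranks documents by cosine similarity to a query vector. The three-dimensional vectors and the names `TOY_INDEX` and `search` are hypothetical stand-ins; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the top_k document texts most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc for _, doc in scored[:top_k]]


# Hypothetical, hand-made "embeddings" purely for illustration.
TOY_INDEX = {
    "how to reset my password": [0.9, 0.1, 0.0],
    "forgot login credentials": [0.8, 0.2, 0.1],
    "best pizza in town": [0.0, 0.1, 0.9],
}

# A made-up query vector that sits close to the two account-related documents.
results = search([0.85, 0.15, 0.05], TOY_INDEX, top_k=2)
print(results)
```

The point of the sketch is that "search" reduces to a nearest-neighbour lookup once text is mapped to vectors; a production system would typically swap the linear scan for a vector index.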
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, but if they’re not all working properly, it can mean your company isn’t making the right decisions. Meaning teams aren’t shipping features that are actually helping customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
AI for Data Journalism: demonstrating what we can do with this stuff right now
I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference. I used the talk as an opportunity for some demo driven development—I pulled together a bunch of different project strands for the talk, then spent the following weeks turning them into releasable tools. There are 12 live demos in this talk!…
Scaling Datasets in Pipelines
A machine learning algorithm may benefit from data that is "standardized". If one column in a dataframe has a completely different scale than the rest, it may cause an overfit. To mitigate this you may consider scaling your dataset. This video will give a full introduction on what this does to your pipeline…
6 months of playing with lego bricks
No, I am not talking about the physical legos. Although as a kid I truly loved playing with them, I am talking about collaborating in the maintenance and development of scikit-lego…This is not an article about OSS (maybe I will write one in the future), all I want to say on the reasons why I (kind of) actively searched for such an opportunity: I enjoy the process of building and maintaining developer tools, and until that point my target audience has never been larger than my company for internal work, and a handful of people in the open source community. scikit-lego was a chance to maintain a library that has a fairly large reach in the data science community, yet it is not too big that I would feel overwhelmed by the amount of tech debt before being able to contribute, nor that the responsibility and pressure to maintain it would be too high. It honestly felt like the perfect opportunity to start contributing to a project that I use and love…
Is CUDA programming an in-demand skill in the industry? [Reddit]
Hi all, I am currently working as an AI engineer in a healthcare/computer vision space. Currently, the type of work I am doing is repetitive and monotonous. It mostly involves data preparation and model training. Looking to branch out and learn some other industry relevant skills. I am considering learning CUDA programming instead of going down the beaten path of learning model deployment. Does CUDA programming open any doors in additional roles? What sort of value does it add?…
Anthropic Cookbook
The Anthropic Cookbook provides code and guides designed to help developers build with Claude, providing copy-able code snippets that you can easily integrate into your own projects…
Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer
Learn how to implement RAG (Retrieval Augmented Generation) from scratch, straight from a LangChain software engineer. This Python course teaches you how to use RAG to combine your own custom data with the power of Large Language Models (LLMs)…
A Quarto Extension for Creating APA 7 Style Documents
This article template creates APA Style 7th Edition documents in .docx, .html, and .pdf. Because the .docx format is still widely used—and often required—my main priority was to ensure compatibility for…
Data Cleaning for Data Sharing Using R
Before sharing research study data, it should be vetted to ensure that it is interpretable, analyzable, and reliable. This half-day, in-person workshop will provide a foundational understanding of how to organize data for the purpose of data sharing…
Learning objectives:
Understand how to assess a data set for 7 data quality indicators
Be able to review a data set and apply a list of standardized data cleaning steps as needed
Feel comfortable using R code to clean a data set
Understand types of documentation that should be shared alongside data
What does a confidence interval mean?
The usual explanation is complicated and confusing…what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story…In this article, I argue that it can be simple -- and you don't even have to be a Bayesian…
Debugging AI With Adversarial Validation
Quickly detect a common bug in AI products using an automated technique…For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known as “drift.” This method, Adversarial Validation, is both simple and effective. The best part? It requires no complex tools or infrastructure…No matter how careful you are, bugs can still slip through the cracks. A high-value activity is to routinely audit all your AI/ML projects for drift…How It Works…
What is the most impressive thing you have achieved with only consumer grade hardware [Reddit]
I am looking for some inspiration, and want to gauge the possibilities of consumer grade hardware. For those who are learning and training their models on a local consumer grade pc with a consumer grade gpu, what's the most impressive model you have achieved? Also share specs and how long did the training take…EDIT: I'm not looking for generally impressive achievements just stuff you are personally proud of, it can be small things, impractical things (the more impractical the better tbh, I like whacky things)…
TinyLlama + SDXS = real time kids story, uncut video, all running local on single RPI-CM4
tinyllama running on 1 thread with llama2.c gives 10 tokens/sec, with another 3 threads running sdxs at ~10 sec per image. We're going to start with kids bedtime stories and work towards more powerful LLM + SD models to enable D&D/RPG games…
Training & Resources
Video lectures, Stanford EE 274 Data Compression, Theory and Applications
Progress in storage and communication technologies has led to enhanced capabilities, with a perpetual cat and mouse chase between growing the ability to handle more data and the amounts of it required by new technologies. We are all painfully aware of this conundrum as we run out of space on our phones due to the selfies, boomerang videos and documents we collect…The goal of this course is to provide an understanding of how data compression enables representing all of this information in a succinct manner. Both theoretical and practical aspects of compression will be covered. A major component of the course is learning through doing - the students will work on a pedagogical data compression library and implement specific compression techniques…
CSC2547: AI Alignment University of Toronto Computer Science Winter 2024
What is “alignment”? As AI systems start to behave less like tools and more like agents that pursue goals in surprising ways, the important question becomes whether their motivational structures are aligned with human values…the first half of the course will focus on idealized models of powerful AI systems, including optimal planners and universal induction…the second half of the course will focus on practical safety and alignment techniques in the context of large language models…
Ring Attention Explained
Context length in Large Language Models has expanded rapidly over the last few years. From GPT 3.5’s 16k tokens, to Claude 2’s 200k tokens, and recently Gemini 1.5 Pro’s 1 million tokens. The longer the context window, the more information the model can incorporate and reason about, unlocking many exciting use cases! However, increasing the context length has posed significant technical challenges, constrained by GPU memory capacity. What if we could use multiple devices to scale to a near infinite context window? Ring Attention is a promising approach to do so, and we will dive into the tricks and details in this blog…
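As a rough illustration of why splitting attention across devices is possible at all, here is a small single-process sketch (ours, not code from the linked post). It computes attention for one query over key/value blocks one block at a time, keeping only a running maximum, a running softmax denominator, and a running weighted sum. In a real Ring Attention setup each block would live on a different device and be passed around a ring; that communication is not modelled here, and the usual 1/sqrt(d) score scaling is omitted for brevity. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values):
    """Reference: softmax-weighted average of values, computed in one pass."""
    scores = [dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def blockwise_attention(query, kv_blocks):
    """Same result, but visiting one (keys, values) block at a time."""
    running_max = float("-inf")  # largest score seen so far
    denom = 0.0                  # softmax denominator, relative to running_max
    acc = None                   # weighted sum of values, relative to running_max
    for keys, values in kv_blocks:
        if acc is None:
            acc = [0.0] * len(values[0])
        scores = [dot(query, k) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[i] for w, v in zip(weights, values))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

Because each step only needs the current block plus three small running quantities, no single device ever has to hold the full set of keys and values, which is the property the blog post builds on.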
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #542 here.
Cutting Room Floor
Alchemy is all you need - On the economics of frontier models
If you mainly want to do Machine Learning, don't become a Data Scientist
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,500 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian