Data Science Weekly - Issue 542
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
April 11, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Data Q&A - Answering the real questions with Python
Today I’m starting a new project with the working title Data Q&A: Answering the real questions with Python. In each installment, I’ll take a question from Reddit’s statistics forum and answer it, using Python code to demonstrate. The first installment is a question about the harmonic mean, which is a recurring topic of discussion on Reddit. It’s in a Jupyter notebook — see the link below to run it in Colab…
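For readers who have not met the harmonic mean, here is a minimal sketch of why it matters. The two-leg trip and its speeds are an illustration of ours, not the example from the notebook; it uses only Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical trip: two legs of equal distance, driven at 30 and 60 km/h.
speeds = [30, 60]

# The arithmetic mean overstates the average speed, because more time
# is spent on the slower leg.
arithmetic = statistics.mean(speeds)

# The harmonic mean is the right average for rates over equal distances.
harmonic = statistics.harmonic_mean(speeds)

# Cross-check from first principles: total distance / total time.
leg_km = 10.0
total_time_h = leg_km / speeds[0] + leg_km / speeds[1]
from_first_principles = 2 * leg_km / total_time_h

print(f"arithmetic mean: {arithmetic}")
print(f"harmonic mean:   {harmonic}")
print(f"distance / time: {from_first_principles}")
```

The harmonic mean and the distance-over-time figure agree (about 40 km/h), while the arithmetic mean comes out higher at 45.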
A Guide to Structured Generation Using Constrained Decoding
The how, why, power, and pitfalls of constraining generative language model outputs…sometimes, no matter how explicit you are with your instructions, generative language models will get too creative, deviate off task, or simply succumb to their urge to yap…Fortunately, there are techniques that ensure language models only return outputs that conform to your requirements. This article serves as a practitioner's guide for perhaps the most powerful of these techniques: constrained decoding. We'll cover what structured generation and constrained decoding are, how they work, best practices, useful patterns, and pitfalls to avoid…
Math to Code: tutorial to teach engineers to read and implement math using NumPy
Math to Code is an interactive Python tutorial to teach engineers how to read and implement math using the NumPy library. Let’s get started!…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, but if they’re not all working properly, it can mean your company isn’t making the right decisions. Meaning teams aren’t shipping features that are actually helping customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
An introduction to Flow Matching
Flow matching (FM) is a recent generative modelling paradigm which has rapidly been gaining popularity in the deep probabilistic ML community. Flow matching combines aspects from Continuous Normalising Flows (CNFs) and Diffusion Models (DMs), alleviating key issues both methods have. In this blogpost we’ll cover the main ideas and unique properties of FM models starting from the basics…
Enhancing Jupyter with Widgets with Trevor Manz - creator of anywidget
In this (first!) episode of Sample Space we talk to Trevor Manz, the creator of anywidget. It's a (neat!) tool to help you build more interactive notebooks by giving you tools to apply just enough JavaScript to get bidirectional communication working in your favorite notebook environment. That means that Python can talk to widgets, but also that widgets can talk to Python. There's a lot to like about these widgets and we're doing a proper deep dive in this first episode…
How I got into deep learning
I ran an education company, Dataquest, for 8 years. Last year, I got the itch to start building again. Deep learning was always interesting to me, but I knew very little about it. I set out to fix that problem. Since then, I’ve trained dozens of models (several state of the art for open source), built 2 libraries that have 5k+ Github stars, and recently accepted an offer from answer.ai, a research lab started by Jeremy Howard. I say this to establish the very rough outline of my learning journey. In this post, I’m going to cover more detail about how I learned deep learning. Hopefully it helps on your journey…
Instructor is All You Need: My Review of the Instructor LLM Library
With numerous LLM libraries available, such as Langchain, CrewAI, Haystack, and Marvin AI, the question arises: what makes Instructor stand out as an LLM library? Find out in this blog post…
8 Billion Files [Reddit]
How would you process 8 billion files sitting in an S3 bucket? Each file is about 300 bytes. They're txt files that will need a few small transformations. I need to:
Perform some minor transformations on values
Load into a permanent queryable datastore
Each file holds a single check-in update for an IoT device and holds about 20 key value pairs that look like this…
Red-Teaming Language Models with DSPy
We spend a lot of time thinking about automated red-teaming. At its core, this is really an autoprompting problem: how does one search the combinatorially infinite space of language for an adversarial prompt?…One way to go about this problem is via DSPy, a new framework out of Stanford NLP used for structuring (i.e. programming) and optimizing LLM systems. DSPy introduces a systematic methodology that separates the flow of programs into modules from the parameters (LLM prompts and weights) of each step. This separation allows for more structured and efficient optimization. The framework also features optimizers, which are algorithms capable of tuning prompts and/or the weights of LLM calls, given a specific metric you aim to maximize…
Securing Canada’s AI advantage
The Prime Minister, Justin Trudeau, today announced a $2.4 billion package of measures from the upcoming Budget 2024 to secure Canada’s AI advantage. These investments will accelerate job growth in Canada’s AI sector and beyond, boost productivity by helping researchers and businesses develop and adopt AI, and ensure this is done responsibly…These measures include…
Blogposts Track ICLR 2024: Announcing Accepted Blogposts
This year, 22 outstanding blog posts have been accepted for publication, each offering valuable perspectives on previously published papers. These posts, selected through a rigorous double-blind review process, cover a diverse range of topics and underscore the depth and breadth of contemporary machine learning research. They not only contribute to the ongoing dialogue within the community but also highlight the creative and pedagogical talent of their authors…
What Computer Vision Can Tell Us About the Natural World
Michaela Alksne, a PhD student at the Scripps Institution of Oceanography, investigates how to detect and classify the social and foraging calls of the Northeastern Pacific blue whale. Her new approach is to use a large dataset of spectrograms, or visual representations of their sound waves. Last year, a three-week intensive workshop at Caltech, called Computer Vision Methods for Ecology (CV4Ecology), gave Alksne the computer-vision (CV) tools—methods by which machines “see” images to reproduce human analytical abilities through artificial intelligence—she needed to move that project forward…
Narwhals - Lightweight and extensible compatibility layer between Polars, pandas, cuDF, Modin, and more
Extremely lightweight and extensible compatibility layer between Polars, pandas, modin, and cuDF (and more!)…Seamlessly support all, without depending on any!
✅ Just use a subset of the Polars API, no need to learn anything new
✅ No dependencies (not even Polars), keep your library lightweight
✅ Separate lazy and eager APIs
✅ Use Polars Expressions
✅ 100% branch coverage, tested against pandas and Polars nightly builds!
3Blue1Brown: Attention in transformers, visually explained
Demystifying attention, the key mechanism inside transformers and LLMs…
Building an AI Coach to Help Tame My Monkey Mind
ChatGPT. Over the past year, I’ve been using it as my therapist and coach. It’s been surprisingly helpful and meets all the needs above, plus it’s available 24/7. And I don’t have to worry about being judged (hopefully our AI overlords will be benevolent!) So far, every “session” has been insightful and I almost always leave with a lighter heart. Then, I saw this tweet from Swyx. I was initially skeptical of voice as a modality—Siri and her ilk never quite worked for me. Nonetheless, I was inspired enough to give it a shot. Thus, I built my own AI coach that I could talk to during walks…
Training & Resources
Comprehensive materials for busy scientists and engineers embarking on Quantum Machine Learning - advanced algorithms, optimization techniques, and beyond!..
Hugging Face Diffusion Models Course
In this free course, you will:
👩🎓 Study the theory behind diffusion models
🧨 Learn how to generate images and audio with the popular 🤗 Diffusers library
🏋️♂️ Train your own diffusion models from scratch
📻 Fine-tune existing diffusion models on new datasets
🗺 Explore conditional generation and guidance
🧑🔬 Create your own custom diffusion model pipelines
Randomized Numerical Linear Algebra
This is a follow up post on my video ‘Is the Future of Linear Algebra.. Randomized?’, which explains Randomized Numerical Linear Algebra (‘Rand-NLA’), the methods of using randomization to speed up linear algebra computations. It was spurred by a paper that convinced me it's a promising path through the decades-old operations-complexity wall of NLA algorithms. This post is to answer outstanding questions and add updates after the video is released…
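To give a flavor of what "using randomization to speed up linear algebra" can mean, here is a small sketch of one classic randomized method, Hutchinson's trace estimator. It is our own choice of illustration and is not taken from the post or the video; the matrix, the function name `hutchinson_trace`, and the sample count are all hypothetical. The idea is to estimate the trace of a matrix using only matrix-vector products with random ±1 vectors, without ever reading the diagonal.

```python
import random

def hutchinson_trace(matvec, n, num_samples=2000, seed=0):
    """Estimate trace(A) using only products A @ z with random sign vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher vector: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        az = matvec(z)
        # z^T A z equals trace(A) in expectation.
        total += sum(zi * azi for zi, azi in zip(z, az))
    return total / num_samples

# Small symmetric matrix, made up for this illustration.
A = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]

def matvec(v):
    """Dense matrix-vector product; stands in for an operator we can only apply."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

exact = sum(A[i][i] for i in range(len(A)))
estimate = hutchinson_trace(matvec, n=len(A))

print(f"exact trace:    {exact}")
print(f"estimated trace: {estimate:.3f}")
```

On a 3×3 matrix this is slower than just summing the diagonal; the approach pays off when the matrix is huge or only available through matrix-vector products.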
Last Week's Newsletter's 3 Most Clicked Links
Better way to query a large (15TB) dataset that does not cost $40,000
Stanford CS 25 Transformers Course (Open to Everybody | Starts Tomorrow)
* Based on unique clicks.
** Find last week's issue #541 here.
Cutting Room Floor
AI-Powered Search: Embedding-Based Retrieval and Retrieval-Augmented Generation (RAG)
Waterloo CS 886 Recent Advances on Foundation Models winter 2024
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,500 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian