Data Science Weekly - Issue 564
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #564
September 12, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
What is the goal of interpretability in AI? Resilience and Interpretability
My original talk was going to be, “What is this interpretability business for?” What are we studying this for? I think that there was a comment earlier, I forgot which speaker said it. They said, maybe we're doing too much interpretability work, not sure what it's for. So I thought that would be a good topic to talk about…How about this: what's needed for resilience in AI? I think what we really need is to think about what's needed for resilience. Turns out these two topics are related. So let's think about resilience…
How to Test Machine Learning Systems
Testing machine learning is hard because it’s probabilistic by nature, and must account for diverse data and dynamic real-world conditions…You should start with a basic CI pipeline. Focus on the most valuable tests for your use case: Syntax Testing, Data Creation Testing, Model Creation Testing, E2E Testing, and Artifact Testing. Most of the time the most valuable test is E2E Testing…To understand what value each kind of test brings we define the following table:…
Visualize your machine learning model
Mycelium is a library for creating graph visualizations of machine learning models or any other directed acyclic graphs. It also powers the graph viewer of the Talaria model visualization and optimization system…
A Sponsor Message
Quadratic - analyze anything, host anywhere
With Quadratic, combine the spreadsheets your organization asks for with the code that matches your team’s code-driven workflows.
Powered by code, you can build anything in Quadratic spreadsheets with Python, JavaScript, or SQL, all approachable with the power of AI.
Use the data tool that actually aligns with how your team works with data, from ad-hoc to end-to-end analytics, all in a familiar spreadsheet.
Level up your team’s analytics with Quadratic today.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
The Tensor Cookbook
What are Tensor Diagrams? Machine learning involves a lot of tensor manipulation, and it's easy to lose track of the bigger picture when manipulating high-dimensional data using notation designed for vectors and matrices. It turns out all the trouble with tensors disappears when you instead represent them using graphs…This book aims to standardize the notation for tensor diagrams by rewriting the classical "Matrix Cookbook" using this notation. Tensor diagrams are better than alternative notation like Index Notation (einsum) because…
The “Who Does What” Guide To Enterprise Data Quality
Every organization is organized around data slightly differently. I’ve seen organizations with 15,000 employees centralize ownership of all critical data while organizations half their size decide to completely federate data ownership across business domains. For the purposes of this article, I’ll be referencing the most common enterprise architecture which is a hybrid of the two. This is the aspiration for most data teams, and it also features many cross-team responsibilities that make it particularly complex and worth discussing. Just keep in mind what follows is AN answer, not THE answer…
SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL
Inspired by a pattern that works well in other modern data languages, we added piped data flow syntax to SQL. The results are transformative - SQL becomes a flexible language that’s easier to learn, use and extend, while still leveraging the existing SQL ecosystem and existing userbase. Improving SQL from within allows incrementally adopting new features, without migrations and without learning a new language, making this a more productive approach to improve on standard SQL…
How to figures
Interaction design is a largely visual discipline. So too then are HCI (Human-computer interaction) papers. We need to craft and tell visual stories. This slide deck provides a collection of different types of visuals common to HCI papers. Let’s learn from the best. For the most part, I’ve focused on positive examples rather than negative ones (in part, out of sensitivity to authors); however, we can learn from both…
A timeline of R's first 30 years
August 2023 marked the thirtieth anniversary of the first public release of the R programming language. To celebrate this, and to show how far the language has evolved across those three decades, the timeline below shows some landmark events, packages and papers…
Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and Machine Learning
Denoising, the process of reducing random fluctuations in a signal to emphasize essential patterns, has been a fundamental problem of interest since the dawn of modern scientific inquiry. Recent denoising techniques, particularly in imaging, have achieved remarkable success, nearing theoretical limits by some measures. Yet, despite tens of thousands of research papers, the wide-ranging applications of denoising beyond noise removal have not been fully recognized. This is partly due to the vast and diverse literature, making a clear overview challenging…This paper aims to address this gap. We present a comprehensive perspective on denoisers, their structure, and desired properties…
You're always (always!) dealing with many (many!) tables - with Madelon Hulsebos
When you are working on a data pipeline for ML ... you are never dealing with a single table. It always demands different tables for different reasons that all have to be mashed together in order to have something that you can learn from. But if that is the case, why do we spend so much time talking about ML pipelines that only work on a single table?…
Tutorial on Diffusion Models for Imaging and Vision
The goal of this tutorial is to discuss the essential ideas underlying the diffusion models. The target audience of this tutorial includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems…
Open Source LLM Tools Leaderboard (Updated every 12 hours)
When a new repo is indexed, changes in stars in the last day/week default to 0. Full analysis: What I learned from looking at 900 most popular open source AI tools…
CUDA MODE Lecture Notes
My notes from the CUDA MODE reading group lectures run by Andreas Kopf and Mark Saroufim…
Most Data Quality Initiatives Fail Before They Start. Here’s Why.
Over the past few years, advances in the cloud and metadata management have made organizing silly amounts of data possible. Data engineering processes are starting to trend towards the level of maturity and rigor of more longstanding engineering disciplines. And of course, AI has the potential to streamline everything…While this problem isn’t – and probably will never be – completely solved, I have seen organizations adopt best practices that are the difference between initiative success…and having another kick-off meeting 12 months later.
Here are 4 key lessons for building data quality scorecards:
Know what data matters
Measure the machine
Get your carrots and sticks right
Automate evaluation and discovery
Building an Advanced RAG System With Self-Querying Retrieval
In this tutorial, we will look into some scenarios where vector search alone is inadequate and see how to improve them using a technique called self-querying retrieval.
Specifically, in this tutorial, we will cover the following:
Extracting metadata filters from natural language
Combining metadata filtering with vector search
Getting structured outputs from LLMs
Building a RAG system with self-querying retrieval…
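To make the idea concrete, here is a minimal sketch of self-querying retrieval in plain Python. It is not the tutorial's code: the toy corpus, the regex-based extract_filters (a stand-in for the LLM step that returns structured output), and the bag-of-words embed function are all assumptions made for illustration. The shape is the part that carries over: extract metadata filters from the query, apply them, then rank what remains by vector similarity.

```python
import math
import re
from collections import Counter

# Toy corpus (made up for illustration): text plus metadata per document.
DOCS = [
    {"text": "space adventure with robots", "year": 2019, "genre": "sci-fi"},
    {"text": "robots learn to love", "year": 2021, "genre": "romance"},
    {"text": "space robots fight aliens", "year": 2022, "genre": "sci-fi"},
]

GENRES = ("sci-fi", "romance")


def extract_filters(query):
    """Stand-in for the LLM step: turn natural language into structured filters."""
    filters = {}
    match = re.search(r"after (\d{4})", query)
    if match:
        filters["after_year"] = int(match.group(1))
    for genre in GENRES:
        if genre in query.lower():
            filters["genre"] = genre
    return filters


def embed(text):
    """Hypothetical embedding: word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z\-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def matches(doc, filters):
    """Apply the extracted metadata filters to one document."""
    if "after_year" in filters and doc["year"] <= filters["after_year"]:
        return False
    if "genre" in filters and doc["genre"] != filters["genre"]:
        return False
    return True


def self_query_search(query, docs=DOCS, top_k=2):
    """Filter on metadata first, then rank the survivors by similarity."""
    filters = extract_filters(query)
    candidates = [doc for doc in docs if matches(doc, filters)]
    query_vec = embed(query)
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query_vec, embed(doc["text"])),
        reverse=True,
    )
    return filters, ranked[:top_k]


filters, results = self_query_search(
    "sci-fi movies about space robots released after 2020"
)
print(filters)
print([doc["text"] for doc in results])
```

Vector search alone would likely rank the 2019 title highly here because it shares the words "space" and "robots"; the metadata filter is what removes it.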
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #563 here.
Cutting Room Floor
Transforming Your Research with Generative AI Tutorial Series (U-Mich)
Harley - polars helper methods that will make you more productive
Monkey Business: a dataset of large LLM sample collections for math and code tasks
OpenCellID Data Exploration with Pandas (accelerated by cudf.pandas)
Whenever you're ready, 3 ways we can help:
Need to learn something for your work? Reply to this email to find out how we can work 1-on-1 with you to speed up your learning.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian