Data Science Weekly - Issue 549
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #549
May 30, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
The Danger Zone in Data Science - Why mediocre ML is so dangerous to the business
When I was on Uber’s Marketplace team, we would (semi) joke that we were lurching from crisis to crisis. Our dozens of core machine learning products directly controlled billions of company dollars — targeted promotions, surge pricing, driver incentives, ETAs, pool matching, upfront rider fares, subscription upsells, the list goes on. We lived in paranoia that these were fundamentally broken in a way that would sink the business…Every week brought a new potential Data Science Disaster…
Predicted Probabilities - Understanding logistic regression using predicted probabilities
Logistic regression is used to model the relationship between a dependent categorical, binary variable (e.g., pass or fail) and one or more independent continuous or categorical variables…These models can be rather simple, but results for logistic regressions can be difficult to interpret, especially for non-technical audiences…As an alternative, predicted probabilities provide a more straightforward approach to presenting the results of a logistic regression. The purpose of this tutorial is to demonstrate how to obtain and present predicted probabilities from a logistic regression and plot them in a format easily interpretable for all audiences…
The Afterlives of Shakespeare and Company in Online Social Readership
The growth of social reading platforms such as Goodreads and LibraryThing enables us to analyze reading activity at very large scale and in remarkable detail. But twenty-first century systems give us a perspective only on contemporary readers. Meanwhile, the digitization of the lending library records of Shakespeare and Company (SC) provides a window into the reading activity of an earlier, smaller community in interwar Paris. In this article, we explore the extent to which we can make comparisons between the SC and Goodreads communities…
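As a quick illustration of the idea behind the Predicted Probabilities pick above: a logistic regression's coefficients live on the log-odds scale, and passing the linear predictor through the inverse logit turns them into probabilities that are easier to present. The sketch below is not taken from the tutorial; the coefficient values and variable names (`hours_studied`, `attended_review`) are made up for illustration.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale for a pass/fail model.
# These numbers are invented for illustration; they do not come from the tutorial.
INTERCEPT = -4.0
COEFS = {"hours_studied": 0.8, "attended_review": 1.2}


def predicted_probability(intercept, coefs, values):
    """Turn a logistic regression's linear predictor (log-odds) into a probability."""
    log_odds = intercept + sum(coefs[name] * values[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-log_odds))


def probability_table(hours_range, attended):
    """Predicted pass probability for each value of hours_studied, holding attendance fixed."""
    rows = []
    for hours in hours_range:
        values = {"hours_studied": hours, "attended_review": attended}
        rows.append((hours, predicted_probability(INTERCEPT, COEFS, values)))
    return rows


if __name__ == "__main__":
    for attended in (0, 1):
        print(f"attended_review = {attended}")
        for hours, prob in probability_table(range(0, 11, 2), attended):
            print(f"  hours_studied = {hours:2d}  P(pass) = {prob:.2f}")
```

A table or plot of these probabilities over a grid of predictor values is usually easier for a non-technical audience to read than raw coefficients or odds ratios.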
A Message from this week's Sponsor:
Online Data Science Programs from Drexel University
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Don't worry about LLMs
This is a near-transcript of the talk I gave at PyCon Italia 2024 in May in Florence…After working with LLMs for the past year, what I’ve found is that the new engineering systems we’re building around these LLMs are a lot like the old ones. Once we cut away the hype, what we’re usually left with are plain engineering and machine learning problems…
PsyTeachR
The psyTeachR team at the University of Glasgow School of Psychology and Neuroscience has successfully made the transition to teaching reproducible research using R across all undergraduate and postgraduate levels. Our curriculum now emphasizes essential ‘data science’ graduate skills that have been overlooked in traditional approaches to teaching, including programming skills, data visualisation, data wrangling and reproducible reports…This website contains our open materials for teaching reproducible research…
Regression using LLMs
The llm-regression package demonstrates how LLMs can be used to solve classical regression problems, and exposes these capabilities for you to experiment with…
I Built a Reusable Dashboard for Read the Docs Traffic Analytics Using Vizro-AI (In less than 50 lines of code)
In this article, I’ll explain how I built a dashboard to visualize the traffic data for some documentation I maintain as a technical writer. I have few design skills and limited Python experience, so I needed a simple, low-code approach to show the impact and usage of the documentation I maintain. This turned out to be an open-source solution: Vizro as a template for a low-code dashboard, and Vizro-AI to build the individual charts with generative AI…
What We Learned from a Year of Building with LLMs (Part I)
Our goal is to make this a practical guide to building successful products around LLMs, drawing from our own experiences and pointing to examples from around the industry. We’ve spent the past year getting our hands dirty and gaining valuable lessons, often the hard way. While we don’t claim to speak for the entire industry, here we share some advice and lessons for anyone building products with LLMs…This work is organized into three sections: tactical, operational, and strategic…We share best practices and common pitfalls around prompting, setting up retrieval-augmented generation, applying flow engineering, and evaluation and monitoring. Whether you’re a practitioner building with LLMs or a hacker working on weekend projects, this section was written for you. Look out for the operational and strategic sections in the coming weeks…
How I run a software book club
I've been running software book clubs almost continuously since last summer, about 12 months ago. We read through Designing Data-Intensive Applications, Database Internals, Systems Performance, and we just started Understanding Software Dynamics…This post is for folks who are interested in running their own book club. None of these ideas are novel. I co-opted the best parts I saw from other people running similar things. And hopefully you'll improve on my experience too, should you try…
The Past, Present, and Future of Data Quality Management: Understanding Testing, Monitoring, and Data Observability in 2024
Like everything in data engineering, data quality management is evolving at lightning speed…The meteoric rise of data and AI in the enterprise has made data quality a zero-day risk for modern businesses, and THE problem to solve for data teams. With so much overlapping terminology, it’s not always clear how it all fits together, or if it fits together. But contrary to what some might argue, data quality monitoring, data testing, and data observability aren’t contradictory or even alternative approaches to data quality management; they’re complementary elements of a single solution…In this piece, I’ll dive into the specifics of these three methodologies, where they perform best, where they fall short, and how you can optimize your data quality practice to drive data trust in 2024…
Moving towards KDearestNeighbors with Leland McInnes - creator of UMAP
Leland McInnes is known for a lot of packages. There's UMAP, but also PyNNDescent and HDBScan. Recently he's also been working on tools to help visualize clusters of data and he's also cooking up something new that's related to nearest neighbor algorithms. This interview touches all of these topics…
Sources of Uncertainty in Machine Learning -- A Statisticians' View
Machine Learning and Deep Learning have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it becomes clear, that beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are miscellaneous and can not always be decomposed into aleatoric and epistemic…
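For readers new to the two terms, one common textbook way to separate them (an illustration only, not necessarily the formalization the paper uses) applies the law of total variance to a model with uncertain parameters $\theta$. The labels under the braces are the usual interpretation and are an assumption here:

```latex
\[
\underbrace{\operatorname{Var}(y \mid x)}_{\text{total}}
  = \underbrace{\mathbb{E}_{\theta}\!\left[\operatorname{Var}(y \mid x, \theta)\right]}_{\text{aleatoric (noise in the data)}}
  + \underbrace{\operatorname{Var}_{\theta}\!\left(\mathbb{E}[y \mid x, \theta]\right)}_{\text{epistemic (uncertainty about the model)}}
\]
```

The paper's point is that real sources of uncertainty do not always split this cleanly into the two terms.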
KL is All You Need
Modern machine learning is a sea of initialisms: VAE, VIB, VDM, BBB, VB, etc. But, the more time I spend working in this field the more I come to appreciate that the core of essentially all modern machine learning methods is a single universal objective: Kullback-Leibler (KL) divergence minimization. Even better, there is a very simple universal recipe you can follow to re-derive most of the named objectives out there. Understand KL, understand the recipe, and you'll understand all of these methods and be well on your way to deriving your own…
Cookiecutter Data Science Version 2
The original Cookiecutter Data Science (CCDS) was published over 8 years ago. The goal was, as the tagline states “a logical, reasonably standardized but flexible project structure for data science.” That version, now affectionately called V1, has been a workhorse for a long time, and got the job done for many projects while being mostly unchanged…That said, in the past 5 years, a lot has changed in data science tooling and MLOps. Cookiecutter V2 is designed to embrace these changes and look to the future. We’ll keep our Unix-like philosophy: pick a tool that does each single job well and then chain those together into a workflow. We want to be able to swap different tools in and out as they develop and mature…
How to automate your reporting with Quarto Dashboards and Posit Connect
Isabella Velásquez dives into the practical side of lightweight dashboards made with Quarto, the next-generation R Markdown, and Posit Connect, our premier publishing platform. You’ll learn how to build and automate Quarto Dashboards with Posit Connect. We'll showcase a Python example, but the same principles apply to R, Julia, and Observable…
A Message from this week's other Sponsor:
Join Us to Unlock Generative AI for the Enterprise
Generative AI is poised to transform the world – IF data privacy solutions can keep up. Join us in San Francisco for Confidential Computing Summit, two eye-opening days on June 5 & 6 that will bring together top business and technology leaders to evaluate the latest solutions in secure and trustworthy AI, explore confidential data use cases, and get you up-to-speed on what’s now and what’s next. Get $200 off with promo code DSW —> https://bit.ly/4bvJvnk
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Training & Resources
How to build a terrible RAG system
If you've seen any of my work, you know that the main message I have for anyone building a RAG system is to think of it primarily as a recommendation system. Today, I want to introduce the concept of inverted thinking to address how we should approach the challenge of creating an exceptional system…
Awesome Robot Social Navigation
This repo keeps track of the historical and recent advances in robot social navigation/crowd navigation/navigation in dynamic or human environments…
Awesome-Mamba-Collection
Welcome to Awesome Mamba Resources! This repository is a curated collection of papers, tutorials, videos, and other valuable resources related to Mamba. Whether you're a beginner or an experienced user, this collection aims to provide a comprehensive reference for all things Mamba. Explore the latest research papers, dive into helpful tutorials, and discover insightful videos to enhance your understanding and proficiency in Mamba. Join us in this open collaboration to foster knowledge sharing and empower the Mamba community. Let's embark on an exciting journey with Mamba!…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #548 here.
Cutting Room Floor
We aren’t running out of training data, we are running out of open training data
The landscapemetrics and motif packages for measuring landscape patterns and processes
Implementing MVCC and major SQL transaction isolation levels
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian