Data Science Weekly - Issue 550
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #550
June 06, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
How to be a Curious Consumer of COVID-19 Modeling: 7 Data Science Lessons from ‘Effectiveness of COVID-19 shelter-in-place orders varied by state’
We have tens of thousands of empirical COVID-19 papers from the social sciences…Methodologically weak but seemingly sophisticated techniques are routinely pressed into service to provide answers where the data on hand cannot possibly, mathematically, provide them…I return again to the question of how we can promote more careful and curious analysis. Specifically, I draw 7 data science lessons for what to do and not do with COVID-19 modeling, illustrated with a simple empirical paper on the causal effects of government lockdowns on individual mobility (Feyman et al. 2020). It is my hope the reader can take away practical tips for how they can begin to challenge and interrogate seemingly sophisticated modeling, especially modeling that is obfuscated, vague, or claims more than the underlying evidence can support…
Can an emerging field called ‘neural systems understanding’ explain the brain?
This mashup of neuroscience, artificial intelligence and even linguistics and philosophy of mind aims to crack the deep question of what “understanding” is, however un-brain-like its models may be…
A neural algorithm for a fundamental computing problem
Similarity search—for example, identifying similar images in a database or similar documents on the web—is a fundamental computing problem faced by large-scale information retrieval systems. We discovered that the fruit fly olfactory circuit solves this problem with a variant of a computer science algorithm (called locality-sensitive hashing). The fly circuit assigns similar neural activity patterns to similar odors, so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly algorithm, however, uses three computational strategies that depart from traditional approaches. These strategies can be translated to improve the performance of computational similarity searches. This perspective helps illuminate the logic supporting an important sensory function and provides a conceptually new algorithm for solving a fundamental computational problem…
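If you want a concrete feel for what locality-sensitive hashing does, here is a minimal random-hyperplane LSH sketch in Python. It is a generic illustration, not the fly circuit's sparse-expansion variant described in the paper, and all the names and data in it are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_bits = 50, 32
planes = rng.normal(size=(d, n_bits))       # shared random hyperplanes

def signature(X):
    """Random-hyperplane LSH: nearby vectors tend to share most signature bits."""
    return np.atleast_2d(X) @ planes > 0

X = rng.normal(size=(1000, d))              # e.g. image or document embeddings
query = X[0] + 0.05 * rng.normal(size=d)    # a slightly perturbed copy of item 0

db_sigs = signature(X)
q_sig = signature(query)

# Candidate neighbours are the items whose signatures agree on the most bits.
hamming = (db_sigs != q_sig).sum(axis=1)
print(hamming.argsort()[:5])                # item 0 should appear at or near the top
```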
A Message from this week's Sponsor:
Join the industry’s leading AI conference - free passes available!
Don’t miss the AI conference of the year! Join 4500+ attendees, 350+ speakers and 150+ AI exhibitors at Ai4 2024, North America's largest AI industry event — taking place in Las Vegas on August 12-14. Enjoy dedicated content & unbeatable networking for both business & technical leaders from every major industry and job function.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
DSPy + Nuxt Project - Extract Information
Today we're going to build a simple resume information extractor using DSPy, Nuxt 3, and FastAPI. This will give you a good idea of how to get started simply, with the ability to go further…
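As a rough sketch of the DSPy side of such an extractor (the field names are illustrative, and the exact language-model configuration depends on your DSPy version):

```python
import dspy

# Assumption: a language model has already been configured for DSPy,
# e.g. via dspy.configure(lm=...) in recent releases; the exact call varies by version.

class ExtractResume(dspy.Signature):
    """Pull structured fields out of raw resume text."""
    resume_text: str = dspy.InputField()
    name: str = dspy.OutputField()
    email: str = dspy.OutputField()
    skills: str = dspy.OutputField(desc="comma-separated list of skills")

extract = dspy.Predict(ExtractResume)
result = extract(resume_text="Jane Doe / jane@example.com / Python, SQL, FastAPI")
print(result.name, result.email, result.skills)
```

A FastAPI endpoint would then just wrap the `extract` call, with the Nuxt front end posting resume text to it.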
Writing fast string ufuncs for NumPy 2.0
After a huge amount of work from many people, NumPy 2.0.0 will soon be released, the first NumPy major release since 2006! Among the many new features, several changes to both the Python API and the C API, and a great deal of documentation improvements, there was also a lot of work on improving the performance of string operations. In this blog post, we'll go through the timeline of the changes, trying to learn more about NumPy ufuncs in the process…
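If you just want to see the user-facing side of those string ufuncs, a small sketch (assuming NumPy 2.0, which exposes them under the np.strings namespace; the older np.char wrappers remain available):

```python
import numpy as np

names = np.array(["ada", "grace", "katherine"])

# In NumPy 2.0 these operations are backed by fast string ufuncs.
print(np.strings.upper(names))       # ['ADA' 'GRACE' 'KATHERINE']
print(np.strings.str_len(names))     # [3 5 9]
print(np.strings.find(names, "a"))   # index of the first 'a' in each element
```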
Logarithms and Heteroskedasticity
A log transform is neither necessary nor sufficient to fix heteroskedasticity…Here’s another installment in Data Q&A: Answering the real questions with Python…Here’s a question from the Reddit statistics forum. Is it correct to use logarithmic transformation in order to mitigate heteroskedasticity?…
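A quick simulation makes the "neither necessary nor sufficient" claim concrete; this is my own illustrative sketch, not code from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 5_000)

# Case A: multiplicative noise (constant coefficient of variation).
# Raw data are heteroskedastic; a log transform stabilizes the variance.
y_a = 2 * x * np.exp(rng.normal(0, 0.3, x.size))

# Case B: Poisson-like counts (variance proportional to the mean).
# Raw data are heteroskedastic too, but a log transform over-corrects:
# after logging, the spread shrinks as x grows instead of becoming constant.
y_b = rng.poisson(20 * x) + 1.0

def spread_by_half(pred, y):
    """Residual std around a straight-line fit, in the lower vs upper half of pred."""
    slope, intercept = np.polyfit(pred, y, 1)
    resid = y - (slope * pred + intercept)
    lo = pred < np.median(pred)
    return resid[lo].std(), resid[~lo].std()

print("A raw:", spread_by_half(x, y_a))
print("A log:", spread_by_half(np.log(x), np.log(y_a)))
print("B raw:", spread_by_half(x, y_b))
print("B log:", spread_by_half(np.log(x), np.log(y_b)))
```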
Summarization and the Evolution of LLMs
Many of the core ideas surrounding LLMs—self-supervised pre-training, the transformer, learning from human feedback, and more—have roots in natural language processing research from years before. These concepts are not new, but rather an accumulation of ideas from over a decade of relevant research. As a result, fundamental research on core problems in natural language processing (e.g., machine translation, summarization, question answering and more) is incredibly important! In this overview, we’ll demonstrate this point by focusing upon the problem of (abstractive) text summarization, which has had a heavy influence on the evolution of LLM research over time…
Monotonic, and better, boosting
There are these moments when you actually know something about the relationship between what you're trying to predict and the data that you have. One of these situations could involve monotonicity, which demands that when a variable 'x' goes up, 'y' must go down. Or vice versa. If you're dealing with such a situation, there's some good news, because the boosting implementations of scikit-learn support configuring this! The video gives an example to help motivate these constraints…
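In scikit-learn this is the `monotonic_cst` argument of the histogram-based gradient boosting estimators; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2_000, 2))
# True relationship: y increases with feature 0 and decreases with feature 1.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.3, size=2_000)

# monotonic_cst: 1 = prediction must increase with the feature,
#               -1 = must decrease, 0 = unconstrained.
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1])
model.fit(X, y)
print(model.score(X, y))
```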
The Problem with Lying is Keeping Track of All the Lies
Or why clear consistency guarantees are how to stay sane when programming distributed systems…“The real difficulty with lying is that you have to keep track of all the lies that you’ve told, and to whom” is a quote I once read that I can’t definitively source (it’s… inconsistently attributed to Mark Twain). It’s stuck with me because it captures the logic as to why it’s so hard to be productive as a programmer in a world of weak isolation models…
Llama 3 took almost 8 million GPU hours [Reddit Discussion]
If you assume like 14 days of training, you need around 25k H100s, assuming full utilization. Wonder if at any point in the future hardware will get good enough so that we could do this on a single GPU…
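The back-of-the-envelope arithmetic behind that estimate:

```python
h100s = 25_000                       # GPUs, assuming full utilization
gpu_hours = h100s * 14 * 24          # 14 days of training, 24 hours a day
print(f"{gpu_hours:,} GPU-hours")    # 8,400,000 -- in line with the ~8M figure quoted
```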
F*ck You, Show Me The Prompt.
Quickly understand inscrutable LLM frameworks by intercepting API calls…There are many libraries that aim to make the output of your LLMs better by re-writing or constructing the prompt for you. These libraries purport to make the output of your LLMs: safer (ex: guardrails), deterministic (ex: guidance), structured (ex: instructor), resilient (ex: langchain), … or even optimized for an arbitrary metric (ex: DSPy)…In this blog post, I’ll show you how you can intercept API calls w/ prompts for any tool, without having to fumble through docs or read source code. I’ll show you how to set up and operate mitmproxy with examples from the LLM tools I previously mentioned…
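The basic trick, before any tool-specific setup: run mitmproxy locally (it listens on port 8080 by default) and route your Python process's HTTPS traffic through it. A minimal sketch, assuming mitmproxy is already running and has generated its CA certificate under ~/.mitmproxy/:

```python
import os
import requests

# Route outbound HTTPS through the local mitmproxy instance.
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8080"
ca_cert = os.path.expanduser("~/.mitmproxy/mitmproxy-ca-cert.pem")

# Any requests-based call made after this point shows up in the mitmproxy UI,
# prompts included. (Libraries that build their own HTTP clients may need the
# proxy and CA bundle passed explicitly.)
resp = requests.get("https://httpbin.org/get", verify=ca_cert)
print(resp.status_code)
```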
Causal Inference Books/Resources for Industry
Hey, this is a question for the Causal Data Scientists out there: what are the books or resources that really helped you understand Causal Inference and how it can be used at your job? Open to courses, books, YouTube videos, etc. Whatever has helped you the most. Really want to learn more about this space and how it can be used…
Building Uber a $6 million table component
When I joined Uber in 2019, the Base Web design system, only about a year old at the time, had a solitary table component that was one of the worst tables I’d ever seen in my life. It was a fairly simple grid with neither row nor column delineation, and had no functionality past the ability to display data in a columnar grid that lived in a window with scroll enabled…I was so irritated by this stupid, useless table component I wrote up a manifesto on why it sucked and sent it to Jeff Jura, Uber’s head of Design Systems at the time…
The radius of statistical efficiency
Classical results in asymptotic statistics show that the Fisher information matrix controls the difficulty of estimating a statistical model from observed data. In this work, we introduce a companion measure of robustness of an estimation problem: the radius of statistical efficiency (RSE) is the size of the smallest perturbation to the problem data that renders the Fisher information matrix singular…
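Read literally, the definition in that abstract can be formalized roughly as follows (my notation, not necessarily the paper's):

```latex
\operatorname{RSE} \;=\; \inf\Big\{\, \|\Delta\| \;:\; \mathcal{I}\big(\theta;\ \mathcal{D} + \Delta\big)\ \text{is singular} \,\Big\}
```

where the infimum runs over perturbations Δ of the problem data 𝒟 and 𝓘 is the Fisher information matrix; a larger RSE means the estimation problem sits farther from degeneracy.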
Doing Stuff with AI: Opinionated Midyear Edition
Every six months or so, I write a guide to doing stuff with AI. A lot has changed since the last guide, while a few important things have stayed the same. It is time for an update. This is usually a serious endeavor, but, heeding the advice of Allie Miller, I wanted to start with a different entry point into AI: fun…
Training & Resources
smolar - tiny multidimensional array implementation in C similar to numpy, but only one file
A tiny multidimensional array implementation in C similar to numpy, but only one file…I had wanted to dive into the trenches of implementing multidimensional arrays for quite some time. Finally, taking inspiration from numpy, I decided to give it a go. I wanted to implement everything from the ground up, and hence C was the perfect choice for it…
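For a sense of the core data structure such a project has to implement: a numpy-style N-dimensional array is typically a flat buffer plus shape and stride metadata. A tiny Python illustration of the idea (not smolar's actual code):

```python
def strides_for(shape):
    """Element strides for a row-major (C-order) layout, e.g. (2, 3, 4) -> (12, 4, 1)."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

def flat_index(index, strides):
    """Map a multidimensional index onto a position in the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))

shape = (2, 3, 4)
buffer = list(range(2 * 3 * 4))                 # 24 elements stored contiguously
strides = strides_for(shape)                    # (12, 4, 1)
print(buffer[flat_index((1, 2, 3), strides)])   # 23, the last element
```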
Generative AI Handbook: A Roadmap for Learning Resources
This document aims to serve as a handbook for learning the key concepts underlying modern artificial intelligence systems. Given the speed of recent development in AI, there really isn’t a good textbook-style source for getting up-to-speed on the latest-and-greatest innovations in LLMs or other generative models, yet there is an abundance of great explainer resources (blog posts, videos, etc.) for these topics scattered across the internet. My goal is to organize the “best” of these resources into a textbook-style presentation, which can serve as a roadmap for filling in the prerequisites towards individual AI-related learning goals. My hope is that this will be a “living document”, to be updated as new innovations and paradigms inevitably emerge, and ideally also a document that can benefit from community input and contribution…
Cleaning Medical Data in R
Materials from a workshop given at the R/Medicine 2023 Conference. My co-presenters included Shannon Pileggi and Peter Higgins…
Last Week's Newsletter's 3 Most Clicked Links
The Danger Zone in Data Science - Why mediocre ML is so dangerous to the business
Predicted Probabilities - Understanding logistic regression using predicted probabilities
* Based on unique clicks.
** Find last week's issue #549 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian