Data Science Weekly - Issue 568
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #568
October 10, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
The State of AI Report 2024
Welcome to the State of AI Report 2024! Our seventh installment is our biggest and most comprehensive yet, covering everything you need to know about research, industry, safety and politics. This open-access report is our contribution to the AI ecosystem…
Diffusion Models From Scratch | Score-Based Generative Models Explained
In this video we are looking at Diffusion Models from a different angle, namely through Score-Based Generative Models, which arguably can be considered as the broader family of diffusion models. Personally, this approach has helped me so much in getting a better intuition for diffusion models and how to visualize the idea and especially connect different approaches like DDPM, DDIM or EDM to one another…
Poker Theory and Analytics
This (MIT IAP) course takes a broad-based look at poker theory and applications of poker analytics to investment management and trading…
A Sponsor Message
Get easy-to-use business intelligence for your startup
Metabase’s intuitive BI tools empower your team to effortlessly report and derive insights from your data. Compatible with your existing data stack, Metabase offers both self-hosted and cloud-hosted (SOC 2 Type II compliant) options. In just minutes, most teams connect to their database or data warehouse and start building dashboards—no SQL required. With a free trial and super affordable plans, it's the go-to choice for venture-backed startups and over 50,000 organizations of all sizes. Empower your entire team with Metabase. Read more.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Stein’s Paradox - Why the Sample Mean Isn’t Always the Best
Averaging is one of the most fundamental tools in statistics, second only to counting. While its simplicity might make it seem intuitive, averaging plays a central role in many mathematical concepts because of its robust properties. Major results in probability, such as the Law of Large Numbers and the Central Limit Theorem, emphasize that averaging isn’t just convenient; it’s often optimal for estimating parameters. Core statistical methods, like Maximum Likelihood Estimators and Minimum Variance Unbiased Estimators (MVUE), reinforce this notion. However, this long-held belief was upended in 1956 when Charles Stein made a breakthrough that challenged over 150 years of estimation theory…
Interviewing Andrew Trask on how language models should store (and access) information
Andrew Trask is one of the bright spots in engaging with AI policy for me in the last year. He is a passionate idealist, trying to create a future for AI that enables privacy, academic research, and government involvement in a rapidly transforming ecosystem. Trask is a leader of the OpenMined organization facilitating researcher access to non-public data and AIs, a senior research scientist at Google DeepMind, a PhD student at the University of Oxford, an author and educator on Deep Learning…
Mastering Realtime Data: How I Topped the Leaderboard at PyData
I joined the first conference day and learned about the PyData 2024 Challenge, a strategic challenge running during the two days of conference. To participate in the challenge you have to distribute 100 armies across 10 castles. Each castle has increasing points from 1 to 10, making strategic distribution essential to outscore other participants. This post is about my strategy beating this challenge and more importantly, the lessons I learned along the way…
A guide to passing the A/B test interview question in tech companies [Reddit]
I'm a Sr. Analytics Data Scientist at a large tech firm (not FAANG) and I conduct about ~3 interviews per week. I wanted to share my advice on how to pass A/B test interview questions as this is an area I commonly see candidates get dinged. Hope it helps. Product analytics and data scientist interviews at tech companies often include an A/B testing component. Here is my framework on how to answer A/B testing interview questions. Please note that this is not necessarily a guide to design a good A/B test. Rather, it is a guide to help you convince an interviewer that you know how to design A/B tests…
Understanding NVIDIA GPU Performance: Utilization vs. Saturation
GPU performance metrics reported by tools like nvidia-smi may be misleading. This blog delves into the underlying issue to provide a deeper understanding…
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
We introduce Tutor CoPilot, a novel Human-AI approach that leverages a model of expert thinking to provide expert-like guidance to tutors as they tutor. This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities…
Data Engineering Whitepapers
A curated list of influential whitepapers in the field of data engineering…
How the GapEncoder works
The GapEncoder is an estimator from the skrub library that can do feature generation and topic modelling at the same time. Being able to do both is great for utility, but it also comes with some benefits for accuracy…
Postprocessing is coming to tidymodels
We’re bristling with elation to share about a set of upcoming features for postprocessing with tidymodels. Postprocessors refine predictions outputted from machine learning models to improve predictive performance or better satisfy distributional limitations. The developmental versions of many tidymodels core packages include changes to support postprocessors, and we’re ready to share about our work and hear the community’s thoughts on our progress so far…
Sales Forecasting for Thousands of MSKUs [Reddit Discussion]
I have to create a solution for forecasting for thousands of different MSKUs at location level. Data : After a final cross join, For each MSKU I have a 36 monthly data points. (Not necessarily all will be populated, many monthly sales values could be 0)…
Second edition of Geocomputation with R is complete
Geocomputation with R is a book about geographic data analysis, visualization, and modeling in a reproducible manner. It is aimed at people who want to learn how to use R to work with geographic data, so the book is designed to be accessible to people with no prior experience in spatial data processing in R… We are excited to announce that the second edition of Geocomputation with R is (almost) complete. It took us about three years to update and improve the book. This blog post summarizes the process and lists things we added and changed…
Hybrid full-text search and vector search with SQLite
You can use SQLite's builtin full-text search (FTS5) extension and semantic search with sqlite-vec to create "hybrid search" in your applications. You can combine results using different methods like keyword-first, re-ranking by "semantics", and reciprocal rank fusion. Best of all, since it's all in SQLite, experiments and prototypes are cheap and easy, no 3rd party services required!…
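For readers curious about the last of those combination methods, here is a minimal sketch of reciprocal rank fusion in plain Python. It is not the linked article's code: the function name, the constant k=60 (a commonly used default, assumed here), and the example id lists are all hypothetical stand-ins for what an FTS5 query and a vector search might return.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result ids, best match first, from two separate searches.
fts_hits = [3, 1, 7]   # keyword (full-text) search
vec_hits = [1, 9, 3]   # semantic (vector) search

fused = reciprocal_rank_fusion([fts_hits, vec_hits])
print(fused)
```

Documents that rank well in both lists (ids 1 and 3 here) rise to the top, while documents found by only one search still appear further down.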
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #567 here.
Cutting Room Floor
Multivariable Logistic Regression in R: The Ultimate Masterclass
Interactive Graph: Understanding Mean and Standard Deviation
The Burden of Demonstrating Statistical Validity of Clusters
The PyData 2024 Challenge - Winner did some clever hacking
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,400 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian