Data Science Weekly - Issue 570

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Oct 24, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

How to run data science projects
In this article, I will outline my mental model for running a science project. Specifically, I’m referring to data or applied science projects, drawing from my experience of over 9 years at AWS and Amazon. You might argue that in agile environments like startups or smaller companies, the approach could differ, but aside from an additional layer of hierarchy, I don’t anticipate significant deviations…

When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
Through extensive analysis of 4 popular families of LLMs and 2,410 factual questions, we demonstrate that adding personas in system prompts does not improve model performance across a range of questions compared to the control setting where no persona is added…

Gael Varoquaux - Skrub: Less data wrangling, more machine learning
(Filmed at dotAI on October 18, 2024 in Paris)…In data science, the glory is in the AI, the machine learning, but the hard work is often cleaning, wrangling, preparing the data. This is particularly true when working on data tables, as opposed to text or images that have more invariants across tasks. I will present how to reduce this burden, with the young library "skrub", as well as ongoing research…

A Sponsor Message

Get easy-to-use business intelligence for your startup

Metabase’s intuitive BI tools empower your team to effortlessly report and derive insights from your data. Compatible with your existing data stack, Metabase offers both self-hosted and cloud-hosted (SOC 2 Type II compliant) options. In just minutes, most teams connect to their database or data warehouse and start building dashboards—no SQL required. With a free trial and super affordable plans, it's the go-to choice for venture-backed startups and over 50,000 organizations of all sizes. Empower your entire team with Metabase. Read more.

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Understanding Gaussians
The Gaussian distribution, or normal distribution, is a key subject in statistics, machine learning, physics, and pretty much any other field that deals with data and probability. It’s one of those subjects, like π or Bayes’ rule, that is so fundamental that people treat it like an icon…
A primer on ML in antibody engineering
I increasingly see that a fair bit of the progress in ML-based protein design methods is occurring in antibody engineering, which is the process of creating antibodies tailor-made to bind to specific things in the body…I've recently been taking the time to understand this field better and made a post about it to cement what I've learned. To note, this is not a machine-learning post in a traditional sense. I won't be going into the exact details of how antibody ML methods work, but rather setting up the contextual knowledge needed to understand these methods at all and going through a few antibody ML papers…
No learning without randomness
Randomness is not a nuisance, but an elemental mechanism in machine learning…We call it noise, volatility, nuisance, variance, non-determinism, and uncertainty. We have to work hard to keep randomness under control: we try to keep our code reproducible with random seeds, we have to deal with missing values, and sometimes we struggle because the data is not such a good random sample.
But there’s another side to randomness…In this post, we explore the role of randomness in helping machines learn…
Your eCommerce product performance reports are probably misleading you
Why single metrics in isolation fall short and how Weighted Composite Scoring can transform your business insights…A Weighted Composite Score combines multiple metrics into a single, insightful metric that provides a comprehensive view of each product’s value across various dimensions. Think of it like your final grade in school — each subject may be assessed on a different scale, but ultimately they are combined into one overall score. This composite score can also be weighted to emphasize specific metrics, allowing you to align with particular business goals such as prioritizing profitability over growth or reducing return rates…
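A minimal sketch of the idea, not the article's own code: each metric is min-max normalized to the 0-1 range, metrics where lower is better are flipped, and the results are combined with weights. The product names, metric names, and weights below are hypothetical, and min-max scaling is just one of several reasonable normalization choices.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the 0-1 range (all 0.0 if they are constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_composite_scores(products, weights, lower_is_better=()):
    """Return {product name: composite score} built from normalized metrics."""
    names = list(products)
    scores = {name: 0.0 for name in names}
    total_weight = sum(weights.values())
    for metric, weight in weights.items():
        normalized = min_max_normalize([products[n][metric] for n in names])
        for name, value in zip(names, normalized):
            if metric in lower_is_better:
                value = 1.0 - value  # flip so that a low return rate scores high
            scores[name] += (weight / total_weight) * value
    return scores


# Hypothetical products and metrics, invented for illustration only.
products = {
    "mug":    {"revenue": 1000.0, "margin": 0.40, "return_rate": 0.02},
    "lamp":   {"revenue": 3000.0, "margin": 0.20, "return_rate": 0.10},
    "poster": {"revenue": 2000.0, "margin": 0.30, "return_rate": 0.06},
}
# Hypothetical weights that favor profitability (margin) over growth (revenue).
weights = {"revenue": 0.2, "margin": 0.5, "return_rate": 0.3}

scores = weighted_composite_scores(products, weights, lower_is_better={"return_rate"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Changing the weights is how the score gets aligned with a business goal: shifting weight from margin to revenue, for example, would push the high-revenue product up the ranking.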
Lost in Transformation: The Horror and Wonder of Logit
The logit model for binary outcomes is poorly understood because it involves thinking in non-linear spaces. In this post, I illustrate what logit is using maps and why people who resist Logit are a bit like flat earthers…
Nested unit tests with testthat
The testthat package is the most widely used tool for unit testing in R. However, many users may not be aware of the possibility to nest test blocks within each other. In this post, I demonstrate how this underused feature provides a great way to structure and manage your unit tests…
Sampling with SQL
Sampling is one of the most powerful tools you can wield to extract meaning from large datasets. It lets you reduce a massive pile of data into a small yet representative dataset that’s fast and easy to use. If you know how to take samples using SQL, the ubiquitous query language, you’ll be able to take samples anywhere. No dataset will be beyond your reach!…In this post, we’ll look at some clever algorithms for taking samples. These algorithms are fast and easily translated into SQL…
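As a rough illustration (not necessarily one of the algorithms from the post), the simplest SQL sampling pattern gives every row a random sort key and keeps the first k rows, which yields a uniform sample without replacement. The sketch below runs it against an in-memory SQLite database from Python; the `events` table and its columns are hypothetical.

```python
import sqlite3

# Build a small hypothetical table to sample from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO events (id, value) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 10001)],
)

# Uniform sample without replacement: sort rows by a random key, keep the first k.
sample_size = 100
sample = conn.execute(
    "SELECT id, value FROM events ORDER BY RANDOM() LIMIT ?",
    (sample_size,),
).fetchall()

sampled_ids = [row[0] for row in sample]
print(len(sampled_ids))
```

This sorts the whole table, so it is best suited to modest data sizes. The random function is spelled differently across engines (for example `random()` in PostgreSQL and `RAND()` in MySQL).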
Sequence models for Contextual Recommendations at Instacart
In this blog post, we describe how we built a centralized contextual retrieval system that powers diverse recommendation surfaces, even though their end goals and ranking layers are different…Having a common retrieval system across both ads and organic surfaces has lowered our maintenance costs and allowed us to deprecate many legacy ad hoc retrieval systems. Using in-session contextual signals, we built a BERT-like language model to power sequence recommendations for this system…
My NumPy Year: Creating a DType for the Next Generation of Scientific Computing
This project was a mix of challenges and learning as I navigated the CPython C API and worked closely with the NumPy community. I want to share a behind-the-scenes look at my work on introducing a new string DType in NumPy 2.0…In this post, I’ll walk you through the technical process, key design decisions, and the ups and downs I faced. Plus, you’ll find tips on tackling mental blocks and insights into becoming a maintainer…
Cleaning sample data in standardized way
In the previous two posts of this series we reviewed how to standardize the steps in our data cleaning process to produce consistent datasets across the field of education research, as well as practices we can implement to make our data cleaning workflow more reproducible and reliable. In this final post of the series, I attempt to answer the question, “What does this process look like when implemented in the real world?”…
Generate Simulated Dataset for Linear Model in R
Researchers usually generate a simulated dataset that follows the model’s assumptions. This simulated dataset can be used as a benchmark for the model or as a replacement for a real-world dataset in the modeling process, where the simulated dataset is more cost-effective than the real-world one. This article will explain how to generate a simulated dataset for a linear model using R…
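To make the idea concrete, here is a small sketch in Python rather than R, so it is not the article's code. It draws data from a simple linear model with Gaussian noise and then checks that ordinary least squares roughly recovers the chosen coefficients. The function names and parameter values are hypothetical.

```python
import random


def simulate_linear_data(n, intercept, slope, noise_sd, seed=0):
    """Draw (x, y) pairs from y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_simple_ols(xs, ys):
    """Closed-form ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical "true" coefficients chosen for the simulation.
xs, ys = simulate_linear_data(n=2000, intercept=2.0, slope=3.0, noise_sd=1.0, seed=42)
est_intercept, est_slope = fit_simple_ols(xs, ys)
print(est_intercept, est_slope)
```

Because the true coefficients are known, the fitted values can be compared against them directly, which is what makes simulated data useful as a benchmark.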
Autodiff Puzzles - a series of self-contained puzzles for learning about derivatives in tensor libraries
Your goal in these puzzles is to implement the derivatives for each core function…In each case the function takes in a tensor x and returns a tensor f(x), so your job is to compute df(x)_o/dx_i for all indices o and i. If you get discouraged, just remember, you did this in high school (it just had way less indexing)…
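For a flavor of the task, here is a sketch with plain Python lists instead of a tensor library. The function is a hypothetical example (elementwise square), not one taken from the puzzles. Its Jacobian df(x)_o/dx_i is 2 * x_o when o equals i and 0 otherwise, and a central finite-difference check confirms the hand-derived answer.

```python
def f(x):
    """Hypothetical example function: elementwise square."""
    return [v * v for v in x]


def analytic_jacobian(x):
    """J[o][i] = df(x)_o / dx_i = 2 * x_o if o == i, else 0."""
    n = len(x)
    return [[2.0 * x[o] if o == i else 0.0 for i in range(n)] for o in range(n)]


def numerical_jacobian(func, x, eps=1e-6):
    """Approximate the Jacobian with central finite differences."""
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for o in range(n_out):
            jac[o][i] = (f_plus[o] - f_minus[o]) / (2 * eps)
    return jac


x = [1.0, -2.0, 3.0]
J = analytic_jacobian(x)
J_num = numerical_jacobian(f, x)
max_err = max(abs(J[o][i] - J_num[o][i]) for o in range(3) for i in range(3))
print(max_err < 1e-6)
```

The same check works for any function from a list to a list, so it can serve as a quick sanity test on a derivative worked out by hand.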

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #569 here.

Cutting Room Floor

Whenever you're ready, 3 ways we can help:

Learning something for your job? Hit reply, let us know, and we’ll help you.
Looking to get a job? Check out our “Get A Data Science Job” Course. It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,900 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
