Data Science Weekly - Issue 534

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Feb 16, 2024

Issue #534
February 15, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)

And now…let's dive into some interesting links from this week.

Editor's Picks

(Almost) Every infrastructure decision I endorse or regret after 4 years running infrastructure at a startup
I’ve led infrastructure at a startup for the past 4 years that has had to scale quickly. From the beginning I made some core decisions that the company has had to stick to, for better or worse, these past four years. This post will list some of the major decisions made and if I endorse them for your startup, or if I regret them and advise you to pick something else…

Why Probabilistic Linkage is More Accurate than Fuzzy Matching For Data Deduplication
This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model as used in Splink, a free software package for record linkage at scale. It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table…
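For readers new to the Fellegi-Sunter approach, here is a minimal sketch of how per-field evidence can be combined into a match probability. It does not use the Splink API; the field names, the m/u probabilities, and the prior are all hypothetical values chosen for illustration, and the fields are assumed to be conditionally independent.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.95, "u": 0.005},
    "postcode": {"m": 0.8, "u": 0.001},
}


def match_weight(m, u, agrees):
    """Log2 Bayes factor contributed by a single field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_probability(agreements, prior=0.001):
    """Add each field's match weight to the prior log-odds, then convert to a probability."""
    log_odds = math.log2(prior / (1 - prior))
    for field, agrees in agreements.items():
        params = FIELD_PARAMS[field]
        log_odds += match_weight(params["m"], params["u"], agrees)
    odds = 2 ** log_odds
    return odds / (1 + odds)


all_agree = match_probability({"first_name": True, "surname": True, "postcode": True})
none_agree = match_probability({"first_name": False, "surname": False, "postcode": False})
```

Agreement on a rare value such as a postcode (low u) moves the score far more than agreement on a common one. A single fuzzy-match threshold does not capture that difference.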

Data science with impact
I was recently asked to give a talk at No. 10 Downing Street on the topic of data science with impact and, in this post, I’m going to share some of what I said in that talk. The context for being asked is that the folks in 10DS, the Downing Street data team, are perhaps the most obsessed with having impact of any data science team I’ve met–so even though they’re the real experts on this topic, they’re very sensibly reaching out to others to see if there is anything extra they can learn…

A Message from this week's Sponsor:

Magical tools for working with data

Building a Big Picture Data Team at StubHub

See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex.

The challenges she faced may sound familiar:

Unclear SMEs meant questions went to multiple people
Without SLAs, answer times were too long
Lack of data modeling & source-of-truth metrics generated varying results
Lack of discoverability & reproducibility cost time, efficiency and accuracy
Static reporting reserved interactivity for rare occasions

Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights…

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Thinking about High-Quality Human Data
High-quality data is the fuel for modern deep learning model training. Most of the task-specific labeled data comes from human annotation, such as classification tasks or RLHF labeling (which can be constructed in a classification format) for LLM alignment training. Lots of ML techniques in the post can help with data quality, but fundamentally human data collection involves attention to detail and careful execution. The community knows the value of high quality data, but somehow we have this subtle impression that “Everyone wants to do the model work, not the data work”…
Visual vs text based programming, which is better?
Visual programming tools (also called ‘no-code’ or ‘low-code’) have been getting a lot of press recently. This, in turn, has generated a lot of discussion about whether visual or text based programming (coding) is ‘best’. As someone who uses text programming (C++) to create a visual programming data wrangling tool (Easy Data Transform) I have some skin in this game and have thought about it quite a bit…
A minimizer Far, Far Away
A few recent Arxiv papers and some recent conversations during my lectures made me realize that some optimization people might not be fully aware of important details on SGD when used on functions where the minimizer can be arbitrarily far from the initialization or even in the case when the minimizer does not exist. So, let’s talk about it…
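As a small illustration of the "minimizer does not exist" case, here is a sketch using plain (deterministic) gradient descent, not SGD, on the one-dimensional logistic loss f(x) = log(1 + exp(-x)). The function, step size, and iteration count are assumptions chosen for illustration and are not taken from the post.

```python
import math


def logistic_loss(x):
    # f(x) = log(1 + exp(-x)): positive everywhere, infimum 0, never attained.
    return math.log1p(math.exp(-x))


def grad(x):
    # f'(x) = -1 / (1 + exp(x)), which is negative for every x.
    return -1.0 / (1.0 + math.exp(x))


def gradient_descent(x0=0.0, step=1.0, iters=1000):
    """Run fixed-step gradient descent and return every iterate."""
    x = x0
    trajectory = [x]
    for _ in range(iters):
        x -= step * grad(x)
        trajectory.append(x)
    return trajectory


trajectory = gradient_descent()
```

The loss keeps shrinking while the iterates keep moving to the right and never settle, so any analysis that assumes a bounded distance to a minimizer does not apply here.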
word2vec Parameter Learning Explained
As an increasing number of researchers would like to experiment with word2vec or similar techniques, I notice a lack of material that comprehensively explains the parameter learning process of word embedding models in detail, which prevents researchers who are not experts in neural networks from understanding how such models work. This note provides detailed derivations and explanations of the parameter update equations of the word2vec models, including the original continuous bag-of-words (CBOW) and skip-gram (SG) models, as well as advanced optimization techniques, including hierarchical softmax and negative sampling…
Tidier.jl - Meta-package for data analysis in Julia, modeled after the R tidyverse.
Tidier.jl is a data analysis package inspired by R's tidyverse and crafted specifically for Julia. Tidier.jl is a meta-package in that its functionality comes from a series of smaller packages. Installing and using Tidier.jl brings the combined functionality of each of these packages to your fingertips…
Making my bookshelves clickable
I built a script that takes in an image of a bookshelf and makes each book clickable. When you click on a book, you are taken to the Google Books page associated with the book. You do not need to manually annotate any book or map each book to its title. This happens automatically. You can try a demo of a clickable bookshelf on GitHub…
Which AI/ML fields are growing under the radar? [Reddit]
LLMs and diffusion models are currently stealing the limelight. I was curious to know which other fields in AI/ML are people excited about, and especially fields which are seeing rapid industrial adoption. From my own perspective I noticed that computer vision/machine vision is in demand by many in the industry/manufacturing space, and to me this seems to be the most mature industrial use of machine learning. Close behind is data-driven signal processing which seems to be requested by aerospace type companies for their radar software. I know that graph neural networks are used by Facebook/Amazon and others, but don't know to what extent. I know that there is a lot of stuff happening with reinforcement learning, especially in robotics, but that is far removed from my area of expertise. There are moreover many people in many industries using both deep learning and more classical machine learning to find optimal layouts for SoC and other such problems in the silicon industry. I would be interested to hear from others who are doing AI/ML outside of LLMs/diffusion models. What are you excited about? Where do you see growth happening?…
SORA: Creating video from text
Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt…
Automated Unit Test Improvement using Large Language Models at Meta
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers…
Making Sense of Voting Systems
Interactively explore how different voting systems can yield different outcomes…I've created this simulation tool. It's designed to clarify the complex world of voting systems, from the familiar Plurality Voting system to the Ranked Choice system that's gaining popularity in parts of the country, and even lesser-known methods like the Borda Count and Condorcet Method. There are many more systems than I've included in this tool, each with its own mechanics, strengths, and weaknesses…
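To see how the same ballots can produce different winners, here is a minimal sketch of three of the systems named above. It is not the linked tool's code; the ballots are hypothetical and the function names are our own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical ranked ballots, most preferred candidate first (illustrative only).
BALLOTS = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)


def plurality(ballots):
    """Winner is the candidate with the most first-place votes."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]


def borda(ballots):
    """Each ballot gives n-1 points to its first choice, n-2 to its second, and so on."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for rank, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - rank
    return scores.most_common(1)[0][0]


def condorcet(ballots):
    """Winner beats every other candidate head-to-head; returns None if no one does."""
    candidates = sorted(set(ballots[0]))
    wins = Counter()
    for x, y in combinations(candidates, 2):
        x_pref = sum(b.index(x) < b.index(y) for b in ballots)
        y_pref = len(ballots) - x_pref
        if x_pref > y_pref:
            wins[x] += 1
        elif y_pref > x_pref:
            wins[y] += 1
    for candidate in candidates:
        if wins[candidate] == len(candidates) - 1:
            return candidate
    return None


results = {
    "plurality": plurality(BALLOTS),
    "borda": borda(BALLOTS),
    "condorcet": condorcet(BALLOTS),
}
```

With these nine ballots, A has the most first-place votes, yet B wins both the Borda Count and every head-to-head comparison.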
Deep Scientific Search: Solved
Precise, thorough, fully automated literature search on ArXiv, covering CS, ML, math, and physics…Built an automated system to run a deep search of ArXiv and carefully find all the precise papers that exist on a complex topic. It's different from simple RAG because it searches, classifies, and adapts based on relevant papers it uncovers, and then continues until it finds every paper on a topic (trying to mimic the human research process). Benchmarked 10x higher accuracy and total retrieval compared to Google Scholar for a median search (whitepaper on website). Also knows when it is complete, and misses virtually nothing (< 3% or so, once it's converged)…
Thoughts on the 2024 AI Job Market
5 years ago, around the time I finished my PhD, if you wanted to work on cutting-edge natural language processing (NLP), your choice was relatively limited. Recently, I decided to go on the job market again, which has become much more diverse. In this post, I want to highlight some macro trends that I observed and the reasons that I joined my new company, Cohere, which may be helpful in guiding your own job search. Note: This post…is written from my perspective as a Europe-based researcher focused on NLP. If you are interested in AI companies but have a different skill, some of these thoughts should still be relevant to you…

Training & Resources

Large Language Models: A Survey
In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions…
Machine Learning Engineering Open Book
This is an open collection of methodologies, tools and step by step instructions to help with successful training of large language models and multi-modal models. This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs. This repo is an ongoing brain dump of my experiences training Large Language Models (LLM) (and VLMs); a lot of the know-how I acquired while training the open-source BLOOM-176B model in 2022 and IDEFICS-80B multi-modal model in 2023…
Stanford CS25 - Transformers United
In this seminar, we examine the details of how transformers work, and dive deep into the different kinds of transformers and how they're applied in different fields. We do this through a combination of instructor lectures, guest lectures, and classroom discussions. We will invite people at the forefront of transformers research across different domains for guest lectures…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #533 here.

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

PS. If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)
