Data Science Weekly - Issue 555

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Jul 11, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Extrinsic Hallucinations in LLMs
Hallucination in large language models usually refers to the model generating unfaithful, fabricated, inconsistent, or nonsensical content. As a term, hallucination has been somewhat generalized to cases when the model makes mistakes. Here, I would like to narrow down the problem of hallucination to cases where the model output is fabricated and not grounded by either the provided context or world knowledge…There are two types of hallucination:
1. In-context hallucination: The model output should be consistent with the source content in context.
2. Extrinsic hallucination: The model output should be grounded by the pre-training dataset. However, given the size of the pre-training dataset, it is too expensive to retrieve and identify conflicts per generation…
This post focuses on extrinsic hallucination. To avoid hallucination, LLMs need to be (1) factual and (2) acknowledge not knowing the answer when applicable…

Book: Alice’s Adventures in a differentiable wonderland
Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming…I overview the basics of optimizing a function via automatic differentiation…The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures…
The Science of Visual Data Communication: What Works
Effectively designed data visualizations allow viewers to use their powerful visual systems to understand patterns in data across science, education, health, and public policy. But ineffectively designed visualizations can cause confusion, misunderstanding, or even distrust—especially among viewers with low graphical literacy. We review research-backed guidelines for creating effective and intuitive visualizations oriented toward communicating data to students, coworkers, and the general public…

A Message from this week's Sponsor:

Magical tools for working with data

Building a Big Picture Data Team at StubHub

See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex.

The challenges she faced may sound familiar:

Unclear SMEs meant questions went to multiple people
Without SLAs, answer times were too long
Lack of data modeling & source-of-truth metrics generated varying results
Lack of discoverability & reproducibility cost time, efficiency and accuracy
Static reporting reserved interactivity for rare occasions

Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights…

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Motivation Research Using Labeling Functions
We developed a new methodology capable of representing ill-defined concepts such as motivation. The methodology also supports discovery of reliable relations between motivation and performance. We used the methodology to investigate the motivation of 150k developers in their real work in GitHub…
Use-cases for inverted PCA
A lot of people know of PCA for its ability to reduce the dimensionality of a dataset. It can turn a wide dataset into a thin one, while hopefully losing only a limited amount of information. But what about doing it the other way around as well? Can you turn the thin representation into a wide one again? And if so, what might be a use-case for that?…
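To make the "wide to thin and back again" idea concrete, here is a minimal sketch in plain Python. It is not the article's code: the toy dataset, the single retained component, and the use of power iteration to find that component are all assumptions made for illustration. In practice you would more likely reach for a library (scikit-learn's PCA, for example, exposes an `inverse_transform` method).

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "wide" dataset: 3 columns that mostly move together.
wide = []
for _ in range(50):
    t = rng.uniform(-1.0, 1.0)
    noise = [rng.uniform(-0.01, 0.01) for _ in range(3)]
    wide.append([t + noise[0], 2 * t + noise[1], -t + noise[2]])

n, d = len(wide), len(wide[0])

# Center each column; PCA works on mean-centered data.
means = [sum(row[j] for row in wide) / n for j in range(d)]
centered = [[row[j] - means[j] for j in range(d)] for row in wide]

# Sample covariance matrix (d x d).
cov = [
    [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
    for i in range(d)
]

# Power iteration: repeatedly multiply by the covariance matrix and
# renormalize to approximate the first principal component.
component = [1.0] * d
for _ in range(100):
    w = [sum(cov[i][j] * component[j] for j in range(d)) for i in range(d)]
    length = sum(x * x for x in w) ** 0.5
    component = [x / length for x in w]

# Forward step (wide -> thin): one score per row.
thin = [sum(r[j] * component[j] for j in range(d)) for r in centered]

# Inverse step (thin -> wide): scale the component by each score
# and add the column means back.
rebuilt = [[means[j] + s * component[j] for j in range(d)] for s in thin]

max_error = max(
    abs(a - b) for row, back in zip(wide, rebuilt) for a, b in zip(row, back)
)
print(f"largest reconstruction error: {max_error:.4f}")
```

Because the toy columns are nearly collinear, the rebuilt rows land very close to the originals. On real data the gap between a row and its reconstruction is whatever the dropped components carried.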
Copyright is the wrong hill to die on
The generative AI copyright wars took a new turn recently, when the Recording Industry Association of America (RIAA) announced that it was suing music generation services Suno and Udio for massive infringement of copyright…We believe that history suggests that some kind of compromise is likely to be forced on this debate, if not by the courts, then by politics…Without taking a position on fair use arguments, we ultimately believe that copyright will prove to be a futile hill for the industry to die on…Based on how cases so far have progressed, total victory looks increasingly unlikely, so model builders should begin to invest their energies into working out the best possible shape for a compromise…
The Illustrated AlphaFold
A visual walkthrough of the AlphaFold3 architecture, with more details and diagrams than you were probably looking for…Do you want to know how AlphaFold3 works? It has one of the most intimidating transformer-based architectures, so to make it approachable, we made a visual walkthrough inspired by Jay Alammar's Illustrated Transformer!…
How to Interview and Hire ML/AI Engineers
In this write-up, Jason and I will share a few things we’ve learned about interviewing candidates for machine learning (ML) and AI roles. First, we’ll discuss what technical and non-technical qualities to assess. Then, we’ll share how to calibrate phone screens, and run the interview loop and debrief. Finally, we’ll wrap up with some tips for interviewers and hiring managers, as well as our opinionated take on some traits of a good hire. (And like all our writing online, opinions our own.)…
How Has Music Changed Since the 1950s? A Statistical Analysis
Our analysis will track changes in song composition for popular music over the last seven decades. We'll define popular songs as works listed on The Billboard Hot 100 and utilize Spotify's database of song attributes to track changes in music design…Our analysis makes use of the following features from Kaggle’s Billboard and Spotify datasets:
- Billboard Top 100:
  - Song tenure on charts
  - Number of artists on the charts
- Spotify Song Attributes:
  - Song duration
  - Danceability
  - Instrumentalness
  - Speechiness
  - Valence (Song Positivity)…
Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c
In this post we are reproducing GPT-2 in llm.c. This is "the GPT-2", the full, 1558M parameter version that was introduced in OpenAI's blog post “Better Language Models and their Implications” on February 14, 2019…In 2019, training GPT-2 was an involved project from an entire team and considered a big model run but, ~5 years later, due to improvements in compute (H100 GPUs), software (CUDA, cuBLAS, cuDNN, FlashAttention) and data (e.g. the FineWeb-Edu dataset), we can reproduce this model on a single 8XH100 node in 24 hours, and for $672, which is quite incredible…
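As a quick sanity check on the headline numbers, the snippet below works out the hourly rate they imply. It assumes the $672 covers nothing but renting all 8 GPUs for the full 24 hours; the variable names are ours, not the post's.

```python
# Figures quoted in the post.
total_cost_usd = 672
gpus_per_node = 8
training_hours = 24

# Total GPU-hours consumed by the run.
gpu_hours = gpus_per_node * training_hours

# Implied rental rates, assuming the quoted cost is rental only.
cost_per_gpu_hour = total_cost_usd / gpu_hours
cost_per_node_hour = total_cost_usd / training_hours

print(f"{gpu_hours} GPU-hours")
print(f"${cost_per_gpu_hour:.2f} per GPU-hour")
print(f"${cost_per_node_hour:.2f} per node-hour")
```

That works out to 192 GPU-hours at $3.50 per H100-hour, or $28 per hour for the whole node.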
If you had 3 hours before work every morning to learn data engineering, how would you spend your time? [Reddit Discussion]
Based on what you know now, if you had 3 hours before work every morning to learn data engineering - how would you spend your time?…
Data Science Tools Reimagined: From arXiv Search to Scikit-learn
In this clip from the "Vanishing Gradients" podcast, Vincent Warmerdam, a data scientist and educator, demonstrates several innovative data science projects and tools. The discussion covers:
- Data Science Fiction Book: A new project featuring short essays on data science topics
- arXiv Front Page Project: A GitHub repository using actions for daily article scraping from arXiv, with classifiers to detect topics of interest.
- Neural Search for arXiv Papers: Using sentence transformers and Matryoshka embeddings to search 600k CS abstracts, with prompting techniques and semi-supervised learning.
- Rethinking Scikit-learn Pipelines: Introduction to the "Playtime" library, offering a new syntax for declaring pipelines and simplifying feature engineering…
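For readers new to the Matryoshka embeddings mentioned in the list above: they are trained so that a prefix of the vector still works as a smaller embedding, which lets you trade a little accuracy for cheaper search. Here is a minimal sketch of the truncate-and-renormalize step. The vectors and names are made up for illustration and do not come from the podcast or from any real model.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale them to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 6-dimensional embeddings (real ones have hundreds of dims).
query = [0.9, 0.3, 0.1, -0.05, 0.02, 0.01]
docs = {
    "doc_a": [0.8, 0.4, 0.0, 0.10, -0.03, 0.02],
    "doc_b": [-0.2, 0.9, 0.3, 0.05, 0.01, -0.04],
}

# Rank the documents using shorter and shorter prefixes.
best_by_dims = {}
for dims in (6, 4, 2):
    q = truncate_and_normalize(query, dims)
    scores = {
        name: cosine(q, truncate_and_normalize(vec, dims))
        for name, vec in docs.items()
    }
    best_by_dims[dims] = max(scores, key=scores.get)
    print(dims, {name: round(score, 3) for name, score in scores.items()})
```

In this toy case the same document ranks first at every prefix length, which is the behavior Matryoshka-style training aims for. With ordinary embeddings, cutting the vector short gives no such guarantee.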
What is the biggest lesson you learned building AI applications? [Reddit Discussion]
I’m starting to work on an AI application around learning and developmental disorders. If you could build your app/service again what would you do or not do the next time?…
Lecture Notes on Computational Cosmology
I taught an introduction to Computational Cosmology, my research domain, in Fall 2023. Computational Cosmology is the science of extracting physics from observational data in cosmology. It involves a mixture of theory, statistics and programming. If you are a student interested in doing research with me, these are the basics you should know (or learn) about. The level of these lecture notes should be suitable for beginning graduate students or advanced undergraduates…
Self-Taught Data Engineers! What's been the biggest 💡moment for you? [Reddit Discussion]
All my self-taught data engineers who have held a data engineering position at a company - what has been the biggest insight you've gained so far in your career?…

Training & Resources

Physics-based Deep Learning Book
This document contains a practical and comprehensive introduction to everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks to quickly get started. Beyond standard supervised learning from data, we’ll look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, training algorithms tailored to physics problems, as well as reinforcement learning and uncertainty modeling…
Berkeley’s CS 194/294-267 Understanding Large Language Models: Foundations and Safety
Generative AI and Large Language Models (LLMs) including ChatGPT have ushered the world into a new era with rich new capabilities for wide-ranging application domains. At the same time, there is little understanding of how these new capabilities emerge, their limitations and potential risks. In this class, we will introduce foundations of LLMs, study methods for better understanding LLMs, discuss scaling laws and emergence, and explore the risks and challenges with these technologies and how we can build towards safe and beneficial AI…
Gemma 2 - Improving Open Language Models at a Practical Size
In this post, we take a deep dive into the architectural components of Gemma 2 such as Grouped Query Attention, Sliding Window Attention, RoPE Embeddings, Logit soft-capping & Model-merging!…

Whenever you're ready, 2 ways we can help:

1. Looking to get a job? Check out our “Get A Data Science Job” Course
   It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
2. Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter