Data Science Weekly - Issue 517

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Oct 19, 2023

Issue #517
October 10 2023

Hello!

Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

If you don’t find this email useful, please unsubscribe here.

Is this newsletter helpful to your job?

Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)

And now…let's dive into some interesting links from this week.

Editor's Picks

Kalman Filter For Dummies
When I started doing my homework for Optimal Filtering for Signal Processing class, I said to myself: "How hard can it be?" Soon I realized that it was a fatal mistake…this article is the result of my couple of days' work and reflects the slow learning curves of a "mathematically challenged" person…If you're humble enough to admit that you don't understand this stuff completely, you'll find this material very enlightening.

Understanding Moments
Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail…

Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
LoRA is one of the most widely used, parameter-efficient fine-tuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it…

A Message from this week's Sponsor:

Magical tools for working with data

Hex is a collaborative workspace for data science and analytics. Now data teams can run their queries, notebooks, and interactive reports — all in one place.

Hex has Magical AI tools that can generate queries and code, create visualizations, and even kickstart a whole analysis, all from natural language prompts, allowing teams to accelerate work and focus on what matters.

Join hundreds of data teams like Notion, AllTrails, Loom, Brex, and Algolia using Hex every day to make their work more impactful. Sign up today at hex.tech/datascienceweekly to get a 30-day free trial of the Hex Team plan!

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

News from PyTorch Conference 2023
Hello from the PyTorch Conference in San Francisco! We’re thrilled that we’re able to bring together leading researchers, developers, and academic communities to further the education and advancement of end-to-end machine learning framework…The PyTorch team has been hard at work this year to bring innovative releases that further enhance the AI and ML community…Read on for all of the news and happenings coming out of PyTorch Conference 2023!

Deep Learning Ultra
Open source Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in PyTorch, OpenCV (compiled for GPU), TensorFlow 2 for GPU, PyG and NVIDIA RAPIDS, running on CUDA 12.1…

Storytelling vs. Exploring, With Global Happiness Data: A meta analysis on data visualization
I recently came across a piece that examines the difference between "storytelling" and "exploration" in the context of data visualization. The piece, by Amanda Makulec, observes that the two serve fundamentally different goals, but are often conflated in the industry…I wanted to explore (or I guess maybe tell a story about...) this distinction a bit more, as well as a few other factors that I think about when designing a visualization. My goal is to codify some of the broad decisions that go into designing a visualization, and to outline how those decisions can be translated into specific design choices…

LLM domination on job descriptions [Reddit]
Can anyone explain why many companies are asking for LLM experience for data scientist roles? It wasn't there like 6-8 months ago, now around 70% of the job descriptions are asking for that and it goes like Python, SQL and LLM. Looks a bit weird to be honest. What are they doing, creating their own ChatGPT?…

Why We’re Building an Open-Source Universal Translator
We’re building a small, unconnected box with a built-in display that can automatically translate between dozens of different languages. You can see it in the video above, and we’ve got working demos to share if you’re interested in trying it out. The form factor means it can be left in-place on a hotel front desk, brought to a meeting, placed in front of a TV, or anywhere you need continuous translation. The people we’ve shown this to have already asked to take them home for visiting relatives, colleagues, or themselves when traveling…

Interactive Demonstration of Ridge Regression and Intro to Hyperparameter Tuning
In the Fall of 2019 my students requested a demonstration to show the value of ridge regression. I wrote this interactive demonstration to show cases in which the use of a regularization coefficient, a hyperparameter that reduces the model flexibility / sensitivity to training data (reduces model variance), improves the prediction accuracy…

Multimodality and Large Multimodal Models (LMMs)
This post covers multimodal systems in general, including LMMs. It consists of 3 parts.
- Part 1 covers the context for multimodality, including why multimodal, different data modalities, and types of multimodal tasks.
- Part 2 discusses the fundamentals of a multimodal system, using the examples of CLIP, which lays the foundation for many future multimodal systems, and Flamingo, whose impressive performance gave rise to LMMs.
- Part 3 discusses some active research areas for LMMs, including generating multimodal outputs and adapters for more efficient multimodal training, covering newer multimodal systems such as BLIP-2, LLaVA, LLaMA-Adapter V2, LAVIN, etc.

A year ago, I left a French tenured prof. position to join Samsung AI.
A year ago, I left a French tenured prof. position to join Samsung AI. I received a lot of requests asking for my view on this move. Here is a document describing a few things that, I hope, will be useful to the ones facing the academia/industry dilemma…

Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal
In many settings, interventions may be more effective for some individuals than others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use "nudges" to encourage students to renew their financial-aid applications before a non-binding deadline…

GenSim: Generating Robotic Simulation Tasks via Large Language Models
We propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding ability…Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks…

You've just joined a new company who do everything in Excel, but.... [Reddit]
There's a senior manager who's keen to modernize their approach to data, but doesn't know what they want. What are you asking for / putting in place?…

Understanding cirrus clouds using explainable machine learning
Cirrus clouds are key modulators of Earth’s climate. Their dependencies on meteorological and aerosol conditions are among the largest uncertainties in global climate models. This work uses 3 years of satellite and reanalysis data to study the link between cirrus drivers and cloud properties. We use a gradient-boosted machine learning model and a long short-term memory network with an attention layer to predict the ice water content and ice crystal number concentration…

Tool Sponsor:

Is your A/B testing system reliable?

There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.

Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:

- Work with the most accurate and up-to-date metrics, completely in your own data warehouse
- Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
- Save time in setting up, running, and analyzing experiments
- Help your non-data teams to self-serve experiment setup and analysis

Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.

Download the white paper to see if you have all seven, and if you don't, what you could be missing.

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Jobs

Data Science Intern: Performance Control & Digitalization

More than 90% of automotive innovations are based on electronics and software.

We, the BMW Group, offer you an interesting and varied internship in data science for Performance Control & Digitalization. To take our operations to the next level, the BMW Group – Performance Control & Digitalization department is looking for a Data Science Intern to contribute to the Supply Chain Innovations Think Tank of BMW Group and continue BMW’s leadership in supply chain management. The goal of the team will be to research emerging technologies including Data Science (ML, AI, BI etc.).

Location is Munich. Apply here

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

RLHF Papers
Some notes for browsing the list.
- All papers are about LLMs unless tagged with a specific category, such as control for robotics / simulated agents or multimodal, or more than one!
- Limited notes and summaries will be added as I go
- This is a resource for keeping up with stuff that is added today and recently; I will not add all historic work, even things from earlier in 2023. For prominent past papers, see here…

Deep Learning Course
You can find here slides, recordings, and a virtual machine for François Fleuret's deep-learning courses 14x050 of the University of Geneva, Switzerland…This course is a thorough introduction to deep-learning, with examples in the PyTorch framework:
- machine learning objectives and main challenges,
- tensor operations,
- automatic differentiation, gradient descent,
- deep-learning specific techniques,
- generative, recurrent, attention models…

SAT Solvers I: Introduction and applications
This tutorial concerns the Boolean satisfiability or SAT problem. We are given a formula containing binary variables that are connected by logical relations such as OR and AND. We aim to establish whether there is any way to set these variables so that the formula evaluates to TRUE…Algorithms that are applied to this problem are known as SAT solvers. The tutorial is divided into three parts. In part I, we introduce Boolean logic and the SAT problem. We discuss how to transform SAT problems into a standard form that is amenable to algorithmic manipulation. We categorize types of SAT solvers and present two naïve algorithms. We introduce several SAT constructions, which can be thought of as common sub-routines for SAT problems. Finally, we present some applications; the Boolean satisfiability problem may seem abstract, but as we shall see it has many practical uses…
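
To make the problem statement above concrete, here is a minimal brute-force sketch in Python. It is our own illustration, not code from the tutorial, and not necessarily one of the naïve algorithms it presents. It assumes the formula is already in conjunctive normal form (an AND of OR-clauses) and uses a DIMACS-style integer encoding; the function name `brute_force_sat` and the two example formulas are hypothetical.

```python
from itertools import product


def brute_force_sat(clauses, num_vars):
    """Return a satisfying assignment as a dict, or None if none exists.

    clauses: a CNF formula as a list of clauses. Each clause is a list of
    nonzero ints, where k means "variable k" and -k means "NOT variable k".
    This encoding is an assumption made for the sketch.
    """
    # Try every one of the 2**num_vars possible TRUE/FALSE assignments.
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: value for i, value in enumerate(values)}
        # The formula is TRUE when every clause has at least one true literal.
        if all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return assignment
    return None


# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
satisfiable = [[1, 2], [-1, 3], [-2, -3]]

# x1 AND NOT x1: no assignment can make this TRUE
unsatisfiable = [[1], [-1]]

print(brute_force_sat(satisfiable, 3))
print(brute_force_sat(unsatisfiable, 1))
```

The search visits up to 2^n assignments for n variables, so it is only practical for very small formulas.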

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #516 here.

Whenever you're ready, 3 ways we can help you:

- Looking to hire? Hit reply to this email and let us know.
- Looking to get a job? Check out our “Get A Data Science Job” Course
  A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
- Promote yourself to ~60,000 subscribers by sponsoring this newsletter.

Thank you for joining us this week :)

All our best,
Hannah & Sebastian
