Data Science Weekly - Issue 541
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #541
April 04, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Governing AI Agents
This Article makes three contributions. First, it uses agency law and theory to identify and characterize problems arising from AI agents, including issues of information asymmetry, discretionary authority, and loyalty. Second, it illustrates the limitations of conventional solutions to agency problems: incentive design, monitoring, and enforcement might not be effective for governing AI agents that make uninterpretable decisions and operate at unprecedented speed and scale. Third, the Article explores the implications of agency law and theory for designing and regulating AI agents, arguing that new technical and legal infrastructure is needed to support governance principles of inclusivity, visibility, and liability…
JS-Torch - PyTorch in JavaScript
A JavaScript library like PyTorch, built from scratch…JS-Torch is a Deep Learning JavaScript library built from scratch to closely follow PyTorch's syntax. It contains a fully functional Tensor object, which can track gradients, Deep Learning layers and functions, and an Automatic Differentiation engine. Feel free to try out the Web Demo!…
Data acquisition strategies for AI-first start-ups
A detailed guide to data acquisition for AI-first start-ups, in collaboration with our friend @muellerfreitag, Director of Product Management at Qualcomm…
A Message from this week's Sponsor:
New York R Conference | In-Person & Virtual
Come celebrate the 10th anniversary of the New York R Conference happening May 16th & 17th, with interactive workshops on May 15th! Take a trip down memory lane as we look back on the past nine years. You’ll hear from all-time greats such as Hadley Wickham (Posit) & Andrew Gelman (Columbia University), and discover fresh new voices we're adding to the mix, like Zhangjun Zhou (Macy's) & Walker Harrison (New York Yankees).
Use promo code DSW20 for 20% off your ticket!
Don't miss out on this amazing opportunity to acquire knowledge, expand your expertise and network with fellow data science professionals!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Stanford CS 25 Transformers Course (Open to Everybody | Starts Tomorrow)
We are opening the course through Zoom to the public. Lectures start tomorrow (Thursdays), 4:30-5:50pm PDT, at Zoom link. Course website: https://web.stanford.edu/class/cs25/ Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!…
Splats + Maps Test - Trying out Splats + 3D Tiles
Combine @googlemaps 3d tiles and @LumaLabsAI gaussian splat scans via @aframevr / @threejs for high fidelity at tiny scale AND zoomable out to city scale, all in the browser. 🙏 @glitch hosting, @playcanvas splat editing, @NYTimesRD 3dtiles component…
Using LESS Data to Tune Models - Data Selection in the Era of LLMs
We introduce the motivation for data selection and describe how the criteria for “good” data depend heavily on the setting. One can either try to identify representative datapoints for the in-domain setting or relevant ones for the transfer setting.
Our algorithm, LESS, effectively selects relevant data to induce capabilities in the instruction tuning setting. LESS identifies 5% of the dataset that induces stronger performance than training on the full dataset.
We conduct an in-depth analysis of prior works on data selection for various settings and provide insights into their technical details, strengths, and limitations.
We conclude by identifying trends in data selection in the era of LLMs.
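As a rough illustration of the selection step (a toy top-k-by-similarity sketch, not the paper's actual LESS algorithm — real LESS scores examples with low-rank gradient features from a warmup-trained model, which this toy replaces with plain lists):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (plain lists).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_top_fraction(grad_feats, target_grad, frac=0.05):
    """Keep the `frac` of examples whose gradient features best align
    with a target-task gradient; return their indices, sorted."""
    scored = sorted(((cosine(g, target_grad), i)
                     for i, g in enumerate(grad_feats)), reverse=True)
    k = max(1, int(len(grad_feats) * frac))
    return sorted(i for _, i in scored[:k])
```

With frac=0.05 this mirrors the "5% of the dataset" setting from the summary above.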
Pipelines for convenience, *and* safety
There are many reasons to appreciate pipelines in scikit-learn. They are relatively easy to declare, they are expressive ... but they are also typically safer! This is an underappreciated aspect of pipelines, which is why this video is all about exploring what might go wrong if you don't use pipelines…
How to interpret and report nonlinear effects from Generalized Additive Models
Generalized Additive Models (GAMs) are flexible tools that replace one or more predictors in a Generalized Linear Model (GLM) with smooth functions of predictors. These are helpful for learning arbitrarily complex, nonlinear relationships between predictors and conditional responses without needing a priori expectations about the shapes of these relationships. Rather, they are learned using penalized smoothing splines…
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Separation of compute and storage has become the de-facto standard in the data industry for batch processing. The addition of tiered storage to open source Apache Kafka is the first step in bringing true separation of compute and storage to the streaming world. In this talk, we'll discuss in technical detail how to take the concept of tiered storage to its logical extreme by building an Apache Kafka protocol compatible system that has zero local disks…
Better way to query a large (15TB) dataset that does not cost $40,000 [Reddit]
Our data analyst has come up with a list of 10M customer ids that they are interested in, and wants to pull all the transactions of these customers. This list of customer ids is stored as a 200MB CSV file on S3 as well. Currently, they are running an AWS Glue job where they are essentially loading the large dataset from the AWS Glue catalog and the small customer id list cut into smaller batches, and doing an inner join to get the outputs…However, doing this will run a bill close to $40,000 based on our calculation. What would be a better way to do this?…
How to guess a gradient
How much can you say about the gradient of a neural network without computing a loss or knowing the label? This may sound like a strange question: surely the answer is "very little." However, in this paper, we show that gradients are more structured than previously thought. Gradients lie in a predictable low-dimensional subspace which depends on the network architecture and incoming features. Exploiting this structure can significantly improve gradient-free optimization schemes based on directional derivatives, which have struggled to scale beyond small networks trained on toy datasets…
Running OCR against PDFs and images directly in your browser
I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?…So I built a new tool! tools.simonwillison.net/ocr provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app. Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone…
Dwell Time Analysis with Computer Vision - Real-Time Stream Processing
Learn how to use computer vision to analyze wait times and optimize processes. This tutorial covers object detection, tracking, and calculating time spent in designated zones. Use these techniques to improve customer experience in retail, traffic management, or other scenarios…
A Course in Exploratory Data Analysis
This book contains the lecture notes for a course on Exploratory Data Analysis that I taught for many years at Bowling Green State University. I started teaching this course using John Tukey’s EDA book, but there were several issues. First, it seemed students had difficulties with Tukey’s particular writing style. Second, the book does not use any technology. So the plan was to write up lecture notes covering many of the ideas from Tukey’s text and to supplement the course with software. Originally, I illustrated the methods using EDA commands from Minitab, but later I focused on using functions from the statistical system R…
Fine-tuning Mistral on your own data 🤙
In this notebook and tutorial, we will fine-tune the Mistral 7B model - which outperforms Llama 2 13B on all tested benchmarks - on your own data!…This tutorial will use QLoRA, a fine-tuning method that combines quantization and LoRA. For more information about what those are and how they work, see this post. In this notebook, we will load the large model in 4bit using bitsandbytes and use LoRA to train using the PEFT library from Hugging Face 🤗….
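To build intuition for the "4bit" part (an illustrative symmetric 4-bit scheme in plain Python; bitsandbytes' actual NF4 quantization uses a different, non-uniform codebook with per-block scales):

```python
# Symmetric 4-bit quantization sketch: store each weight as a signed int4
# code (-8..7) plus one float scale per tensor, cutting weight memory ~8x
# versus float32 at the cost of a rounding error bounded by scale/2.
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to code 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    # Recover approximate float weights from int4 codes and the shared scale.
    return [c * scale for c in codes]
```

LoRA then sidesteps the frozen, quantized base weights entirely by training small low-rank adapter matrices in higher precision.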
Training & Resources
Here we’ll discuss:
The advantages (and disadvantages) of Mamba (🐍) vs Transformers (🤖),
Analogies and intuitions for thinking about Mamba, and
What Mamba means for Interpretability, AI Safety and Applications…
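A minimal sketch of the recurrent view behind those bullets (a fixed-coefficient linear state-space model in pure Python, for intuition only — Mamba's key twist is making the a, b, c parameters input-dependent, which this toy omits):

```python
# Toy linear state-space model (SSM) scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# The hidden state h is a fixed-size summary of the whole prefix, so each new
# token costs O(1) work and memory -- unlike a Transformer, whose attention
# looks back over a KV cache that grows with sequence length.
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update: decay old state, mix in new input
        ys.append(c * h)    # readout
    return ys
```

An impulse input decays geometrically through the state, which is why plain (non-selective) SSMs behave like fixed learned convolutions over the input.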
Ask HN: Most efficient way to fine-tune an LLM in 2024?
In Apr 2024 what is the most efficient way to fine-tune an LLM?…In particular we are trying to understand performance vs. cost trade-offs. We don't have a budget to train from scratch…We are working with a proprietary data set on the order of 100M tokens and are looking to fine-tune a general purpose language model and also create task-specific models based on the same corpus…
What is a complete list of the usual assumptions for linear regression?
What are the usual assumptions for linear regression? Do they include:
a linear relationship between the independent and dependent variable
independent errors
normal distribution of errors
homoscedasticity
Are there any others?…
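In matrix form, the classical (Gauss-Markov plus normality) assumptions listed above are usually written as:

```latex
y = X\beta + \varepsilon, \qquad
\mathbb{E}[\varepsilon \mid X] = 0, \qquad
\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I, \qquad
\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I)
```

The zero-conditional-mean condition covers linearity/correct specification, the covariance σ²I bundles homoscedasticity with uncorrelated errors, and normality is needed only for exact finite-sample inference, not for OLS to be unbiased. One commonly listed assumption missing from the question is that X has full column rank (no perfect multicollinearity).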
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #540 here.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian