Data Science Weekly - Issue 529

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Jan 12, 2024

Issue #529
January 11, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)

And now…let's dive into some interesting links from this week.

Editor's Picks

Budgeting with ChatGPT
Coming up with a budget is hard. Tracking how well you're sticking to that budget is even harder. In this blog post I'll show you how I use the ChatGPT API to track, categorize, and monitor my spending. The best part: it only took me about 2 hours to figure out and set up and now I'll share all the code I use so you can set it up in even less time for yourself…

How well-structured should your data code be?
In this post we talk about the inherent tradeoff between moving quickly and breaking things. While it is geared towards those in the data space, it can be read and enjoyed by anyone who writes code or builds systems and feels constantly under pressure to move quickly. We use the framing of a Data Scientist as someone who works in a setting where they prototype ML models, data transformations, etc…
A Philosophical Introduction to Language Models
Large language models like GPT-4 have achieved remarkable proficiency in a broad spectrum of language-based tasks, some of which are traditionally associated with hallmarks of human intelligence. This has prompted ongoing disagreements about the extent to which we can meaningfully ascribe any kind of linguistic or cognitive competence to language models. Such questions have deep philosophical roots, echoing longstanding debates about the status of artificial neural networks as cognitive models. This article -- the first part of two companion papers -- serves both as a primer on language models for philosophers, and as an opinionated survey of their significance in relation to classic debates in the philosophy of cognitive science, artificial intelligence, and linguistics…

A Message from this week's Sponsor:

Is your A/B testing system reliable?

There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.

Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:

Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams self-serve experiment setup and analysis

Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.

Download the white paper to see if you have all seven, and if you don't, what you could be missing.

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Let's talk about joins
Working with data would be so much simpler if we always only had one dataset to work with. However, in the world of research, we often have multiple datasets, collected from different instruments, participants, or time periods, and our research questions typically require these data to be linked in some way. This blog post reviews the various ways we may consider combining our data…
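
A quick sketch from us (not taken from the post) of the two joins you will probably reach for first. The datasets, the column names, and the `join` helper are all hypothetical; in practice you would likely use your data library's merge function instead.

```python
# Two small hypothetical datasets that share a participant id.
surveys = [
    {"id": 1, "score": 10},
    {"id": 2, "score": 7},
    {"id": 3, "score": 9},
]
visits = [
    {"id": 1, "site": "A"},
    {"id": 3, "site": "B"},
    {"id": 4, "site": "C"},
]


def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is "inner" or "left"."""
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")
    # Index the right-hand rows by key so each lookup is cheap.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Columns that only the right-hand side contributes.
    right_cols = sorted({c for row in right for c in row if c != key})
    out = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            for match in matches:
                out.append({**row, **match})
        elif how == "left":
            # Keep the unmatched left row and fill the right columns with None.
            out.append({**row, **{c: None for c in right_cols}})
    return out


inner = join(surveys, visits, "id", how="inner")
left = join(surveys, visits, "id", how="left")
print(inner)
print(left)
```

The inner join keeps only ids present in both datasets, while the left join keeps every survey row and marks missing visit data with `None`. Which one you want depends on whether unmatched rows are meaningful for your research question.
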
DataMapPlot
Creating beautiful plots of data maps. DataMapPlot is a small library designed to help you make beautiful data map plots for inclusion in presentations, posters and papers. The focus is on producing static plots that are great looking with as little work for you as possible. All you need to do is label clusters of points in the data map and DataMapPlot will take care of the rest. While this involves automating most of the aesthetic choices, the library provides a wide variety of ways to customize the resulting plot to your needs…
Unraveling spectral properties of kernel matrices
Since my early PhD years, I have plotted and studied eigenvalues of kernel matrices. In the simplest setting, take independent and identically distributed (i.i.d.) data, such as in the cube below in 2 dimensions, take your favorite kernels, such as the Gaussian or Abel kernels, plot eigenvalues in decreasing order, and see what happens. The goal of this sequence of blog posts is to explain what we observe…
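
If you want to try the experiment described above yourself, here is a minimal stdlib-only sketch. The sample size, the bandwidth, and the use of power iteration with deflation are our assumptions, not the author's setup, and we print the leading eigenvalues instead of plotting them.

```python
import math
import random


def gaussian_kernel_matrix(points, bandwidth):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = math.exp(-d2 / (2.0 * bandwidth ** 2))
    return K


def top_eigenvalues(K, k, iters=500, seed=0):
    """Approximate the k largest eigenvalues of a symmetric PSD matrix
    using power iteration with deflation."""
    n = len(K)
    A = [row[:] for row in K]  # work on a copy
    rng = random.Random(seed)
    found = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the (unit-norm) vector estimates the eigenvalue.
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Av[i] for i in range(n))
        found.append(lam)
        # Deflate: remove the component we just found.
        for i in range(n):
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    # Deflation is approximate, so sort defensively before returning.
    return sorted(found, reverse=True)


# i.i.d. uniform samples on the unit square (an assumed toy setup).
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(30)]
K = gaussian_kernel_matrix(points, bandwidth=0.2)
eigs = top_eigenvalues(K, k=5)
print([round(e, 3) for e in eigs])
```

For a real study you would use a proper symmetric eigensolver from a numerical library; the point here is only to make the "build the kernel matrix, look at the eigenvalues in decreasing order" recipe concrete.
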
Leveraging Big Data to Understand Women’s Mobility in Buenos Aires
While the travelers’ gender has not been a central consideration driving urban mobility planning, increasing evidence points to gender-differentiated mobility preferences and behaviors. This paper explores this topic in the context of the Buenos Aires Metropolitan Area, aiming to identify policy relevant differences between the mobility of women and men. It does so by leveraging mobile phone–based data, combined with existing household travel survey data and an original large-scale interception survey implemented in late 2021 and early 2022. The paper provides descriptive analysis of key spatial and temporal mobility patterns as well as implements statistical analysis to identify whether gender represented a key determinant of mode choice in the context of the pandemic…
So, Mamba vs. Transformers... is the hype real? [Reddit]
Heard all the buzz about Mamba, the new kid on the sequence modeling block. Supposedly it's faster, handles longer sequences better, and even outperforms Transformers on some tasks. But is it really a throne-stealer or just another flash in the pan?…To the AI aficionados out there, is Mamba just the next shiny toy, or a genuine paradigm shift in sequence modeling? Will it dethrone the mighty Transformer, or coexist as a specialized tool? Let's hear your thoughts!…
Data Engineering Vault
Welcome to the Data Engineering Vault, an integral part of my Second Brain. This is more than a mere collection of terms; it’s a curated network of data engineering knowledge, designed to facilitate exploration and discovery. Here, you’ll find 100+ interconnected terms, each serving as a gateway to deeper insights. Similar to a digital garden, this network allows you to weave through concepts, uncovering connections and expanding your understanding with each click. I invite you to dive in and explore the rich landscape of data engineering…
An Intuitive Guide to Self-Attention in GPT: The Venetian Masquerade
Imagine each player (word or token) in the game trying to find their partner. The tools at their disposal? A list of attributes they are looking for (queries), their own attributes (keys), and the secret information they want to exchange (values). This blog post is about unpacking this analogy to understand the intuition behind self-attention, stepping away from daunting equations and diving into a narrative that resonates with our experiences – much like uncovering the mystery in a game of “Inkognito.” Let’s embark on this exploratory quest, and by the end, I promise the concept of self-attention will seem less like a cryptic enigma and more like an old friend from a board game night…
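
For readers who like to see the queries, keys, and values as numbers, here is a small sketch from us (not from the post) of single-head scaled dot-product attention with the causal mask that GPT-style models use. The three toy tokens and their vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Each token mixes the values of itself and earlier tokens,
    weighted by how well its query matches their keys."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: token t may only look at tokens 0..t.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        out = [
            sum(w[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three hypothetical tokens with 2-dimensional queries, keys, and values.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_self_attention(queries, keys, values)
for t, w in enumerate(weights):
    print(t, [round(x, 3) for x in w])
```

In the masquerade analogy, each row of `weights` is how much attention one player pays to every player who arrived before them (and to themselves), and each row of `outputs` is the blend of secrets they walk away with.
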
Language Modeling Reading List (to Start Your Paper Club)
Some friends and I started a weekly paper club to read and discuss fundamental papers in language modeling. By pooling together our shared knowledge, experience, and questions, we learned more as a group than we could have individually. To encourage others to do the same, here’s the list of papers we covered, and a one-sentence summary for each. I’ll update this list with new papers as we discuss them. (Also, why and how to read papers.)…
Toward a Grand Unified Theory of Accelerations in Optimization and Machine Learning
Momentum-based acceleration of first-order optimization methods, first introduced by Nesterov, has been foundational to the theory and practice of large-scale optimization and machine learning. However, finding a fundamental understanding of such acceleration remains a long-standing open problem. In the past few years, several new acceleration mechanisms, distinct from Nesterov’s, have been discovered, and the similarities and dissimilarities among these new acceleration phenomena hint at a promising avenue of attack for the open problem. In this talk, we discuss the envisioned goal of developing a mathematical theory unifying the collection of acceleration mechanisms and the challenges that are to be overcome…
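
To make "momentum-based acceleration" concrete, here is a toy comparison from us, not from the talk: plain gradient descent versus one common constant-momentum form of Nesterov's method on an ill-conditioned quadratic. The objective, step size, starting point, and iteration count are all our assumptions.

```python
import math

# f(x) = 0.5 * (MU * x0^2 + L * x1^2), an ill-conditioned toy quadratic.
MU, L = 1.0, 100.0


def f(x):
    return 0.5 * (MU * x[0] ** 2 + L * x[1] ** 2)


def grad(x):
    return [MU * x[0], L * x[1]]


def gradient_descent(x0, steps):
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - (1.0 / L) * gi for xi, gi in zip(x, g)]
    return x


def nesterov(x0, steps):
    # Constant momentum for strongly convex problems:
    # beta = (sqrt(L) - sqrt(MU)) / (sqrt(L) + sqrt(MU)).
    beta = (math.sqrt(L) - math.sqrt(MU)) / (math.sqrt(L) + math.sqrt(MU))
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        # Look ahead along the previous direction, then take a gradient step there.
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        g = grad(y)
        x_prev, x = x, [yi - (1.0 / L) * gi for yi, gi in zip(y, g)]
    return x


start = [1.0, 1.0]
f_gd = f(gradient_descent(start, 100))
f_nag = f(nesterov(start, 100))
print("gradient descent:", f_gd)
print("nesterov:", f_nag)
```

With the same step size and the same number of gradient evaluations, the accelerated iterates end up far closer to the minimum here. Explaining why that happens in general is the open problem the talk is about.
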
Transformers From Scratch
In this blog we’re going to walk through creating and training a transformer from scratch. We’ll go through each foundational element step by step and explain what is happening along the way. This blog is written in a Jupyter notebook which you can download and use to run the code yourself as you follow along. Running the code as you follow along and changing it to see how the output changes will help you learn the concepts better than reading alone…
Data Science Interactive Python Demonstrations
Walk-throughs of my interactive Python dashboards to teach fundamental concepts from data science, data analytics and machine learning…
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
By iteratively applying different persuasion techniques in our taxonomy, we successfully jailbreak advanced aligned LLMs, including Llama 2-7b Chat, GPT-3.5, and GPT-4 — achieving an astonishing 92% attack success rate, notably without any specified optimization. Now, you might think that such a high success rate is the peak of our findings, but there's more. In a surprising twist, we found that more advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs). What's more, adaptive defenses crafted to neutralize these PAPs also provide effective protection against a spectrum of other attacks (e.g., GCG, Masterkey, or PAIR)…

Jobs (removed)

We have removed the “Jobs” section of the newsletter as we don’t think it adds much value to you.

Hit reply and let us know what section you want us to start including :)

A recap of exciting stories from the 20-30 subreddits in related spaces?
Highlights of interesting AI tools from the 20-30 newsletters in related spaces?
TikTok / Instagram / Etc. Posts from Data/AI Influencers
leave this section empty - the email is already long enough as it is :)
others?

Sponsor

New ML Challenge – Build a decentralized credit score in Web3

Spectral, a new marketplace for verifiable machine intelligence that leverages zkML to ensure accuracy, verification, and IP protection for modelers, has launched its first-ever model-building challenge for data scientists to help address societal issues by leveraging open source to produce high-performing ML models. The models built in this challenge will have massive implications for the crypto industry as we know it. A $100k bounty is on the line, as well as an 85% revenue share for the models participants build. Engineers can sign up now and expect more challenges on the way in early 2024.

Apply here

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

Harvard’s Gov 50 - Data Science for the Social Sciences
Learning to use data to explore the social, political, and economic world…
Doing Meta-Analysis with R: A Hands-On Guide
This book serves as an accessible introduction into how meta-analyses can be conducted in R. Essential steps for meta-analysis are covered, including pooling of outcome measures, forest plots, heterogeneity diagnostics, subgroup analyses, meta-regression, methods to control for publication bias, risk of bias assessments and plotting tools. Advanced, but highly relevant topics such as network meta-analysis, multi-/three-level meta-analyses, Bayesian meta-analysis approaches, and SEM meta-analysis are also covered…
Biostatistics for Biomedical Research
The book is aimed at exposing biomedical researchers to modern biostatistical methods and statistical graphics, highlighting those methods that make fewer assumptions, including nonparametric statistics and robust statistical measures. In addition to covering traditional estimation and inferential techniques, the course contrasts those with the Bayesian approach, and also includes several components that have been increasingly important in the past few years, such as challenges of high-dimensional data analysis, modeling for observational treatment comparisons, analysis of differential treatment effect (heterogeneity of treatment effect), statistical methods for biomarker research, medical diagnostic research, and methods for reproducible research…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #528 here.

Cutting Room Floor

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~60,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

All our best,
Hannah & Sebastian

P.S. Was today’s newsletter helpful to your job?

Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
