Data Science Weekly - Issue 526
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
December 21, 2023
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Editor's Picks
I can't stop using ChatGPT and I hate it
I'm trying to learn various topics like Machine Learning and Robotics etc., and I'm kinda a beginner in programming. For any topic and any language, my first instinct is to…go to ChatGPT,
write down whatever I need my code to do,
copy paste the code
if it doesn't give out good results, ask ChatGPT to fix whatever it's done wrong
repeat until I get a satisfactory result
I hate it, but I don't know what else to do…
Deep Multimodal Fusion for Surgical Feedback Classification
Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: “Anatomic,” “Technical,” “Procedural,” “Praise,” and “Visual Aid.” We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale…
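As a rough sketch of what this setup can look like (our illustration, not the authors' code; every dimension and name below is hypothetical), a late-fusion, multi-label classifier over pre-extracted text, audio, and video features is only a few lines of PyTorch:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Project each modality into a shared space, concatenate, and emit
    one logit per feedback category (multi-label, not multi-class)."""
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512,
                 hidden_dim=256, n_labels=5):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(3 * hidden_dim, n_labels))

    def forward(self, text, audio, video):
        fused = torch.cat([self.text_proj(text), self.audio_proj(audio),
                           self.video_proj(video)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
labels = torch.randint(0, 2, (4, 5)).float()   # a clip can carry several labels
loss = nn.BCEWithLogitsLoss()(logits, labels)  # independent sigmoid per category
```

Multi-label matters here because one utterance can be, say, both “Technical” and “Praise” at once.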
The Effect of Generative AI on the Human-Tool Interface
I am excited by how tech-based tools, which are the true catalysts for knowledge-based work, can be up-leveled in ways not possible pre-generative AI. To understand why, we first need to understand how humans do work and how tools fit in (what I call the Human-Tool Interface). Then, we will be able to explicitly call out what parts of this Human-Tool Interface are affected by generative AI, and how this will enable the entire Human-Tool Interface to transform in a way not possible beforehand…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions: teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Using sequences of life-events to predict human lives
Here we represent human lives in a way that shares structural similarity to language, and we exploit this similarity to adapt natural language processing techniques to examine the evolution and predictability of human lives based on detailed event sequences…We create embeddings of life-events in a single vector space, showing that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions…
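To make the language analogy concrete, here is a toy version of the idea (our sketch, not the paper's model; vocabulary, dimensions, and data are made up): encode each life-event as a token, embed the sequence, and predict an outcome from a pooled representation:

```python
import torch
import torch.nn as nn

class LifeEventModel(nn.Module):
    """Treat a life as a token sequence: embed each event, encode with a
    transformer, and predict an outcome from the pooled representation."""
    def __init__(self, vocab_size=10_000, d_model=128, n_outcomes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_outcomes)

    def forward(self, event_ids):            # (batch, seq_len) of event tokens
        h = self.encoder(self.embed(event_ids))
        return self.head(h.mean(dim=1))      # pool over the life sequence

events = torch.randint(0, 10_000, (8, 64))  # 8 synthetic lives, 64 events each
logits = LifeEventModel()(events)
```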
mamba-minimal - Simple implementation of Mamba SSM in one file of PyTorch
Simple, minimal implementation of Mamba in one file of PyTorch. Featuring:
Equivalent numerical output as official implementation for both forward and backward pass
Simplified, readable, annotated code…
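For orientation before reading the file: the recurrence at the heart of any state-space model looks like the toy below (single channel, fixed parameters; Mamba's twist is making B and C input-dependent and computing the scan efficiently on GPU). This is our sketch of the general idea, not code from the repo:

```python
import torch

def ssm_scan(A, B, C, x):
    """h_t = A * h_{t-1} + B * x_t,  y_t = C . h_t, scanned over a sequence.
    A, B, C: (d_state,) with diagonal A; x: (seq_len,). Toy single-channel case."""
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t        # state update
        ys.append((C * h).sum())   # readout
    return torch.stack(ys)

y = ssm_scan(0.9 * torch.rand(16), torch.randn(16), torch.randn(16), torch.randn(32))
```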
15min History of Reinforcement Learning and Human Feedback
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization…In this [video], we illustrate the complex history of optimizing preferences, and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF's foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function…
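For concreteness, the preference model discussed in the video is typically trained with a Bradley-Terry style objective: push the preferred response's score above the rejected one's. A minimal sketch (ours, not the video's):

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen, r_rejected):
    """Maximize P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# In practice these scores come from a reward model run on (prompt, response):
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
```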
Support vector machines dominate my prediction modeling nearly every time [Reddit]
Whenever I build a stacking ensemble (be it for classification or regression), a support vector machine nearly always has the lowest error. Quite often, its error will even be lower or equivalent to the entire ensemble with averaged predictions from various models (LDA, GLMs, trees/random forests, KNN, splines, etc.). Yet, I rarely see SVMs used by other people. Is this just because you strip away interpretation for prediction accuracy with SVMs? Is anyone else experiencing this, or am I just having dumb luck with SVMs?…
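If you want to test the poster's experience yourself, a stacking ensemble with an SVM base learner is a few lines of scikit-learn (illustrative setup; the scaler matters because SVMs are scale-sensitive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(),  # blends the base predictions
)
stack.fit(X, y)
```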
Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots
Reinforcement learning solely from an agent's self-generated data is often believed to be infeasible for learning on real robots, due to the amount of data needed. However, if done right, agents learning from real data can be surprisingly efficient through re-using previously collected sub-optimal data. In this paper, we demonstrate how the increased understanding of off-policy learning methods and their embedding in an iterative online/offline scheme ("collect and infer") can drastically improve data efficiency by using all the collected experience, which empowers learning from real robot experience only…
NLP Research in the Era of LLMs: 5 Key Research Directions Without Much Compute
In this newsletter, I first argue why the current state of research is not as bleak—rather the opposite! I will then highlight five research directions that are important for the field and do not require much compute…
A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K and V attention matrices while balancing the register pressure and shared memory utilization…
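The "online-softmax" being fused is worth seeing in isolation: a single streaming pass that maintains a running max and a running denominator, rescaling whenever the max improves. That is what lets softmax interleave with the tiled GEMMs. A plain-NumPy sketch of the numerics (not the CUDA kernel):

```python
import numpy as np

def online_softmax(scores):
    """One-pass softmax with a running max m and running denominator d."""
    m, d = -np.inf, 0.0
    for s in scores:
        m_new = max(m, s)
        d = d * np.exp(m - m_new) + np.exp(s - m_new)  # rescale the old sum
        m = m_new
    return np.exp(scores - m) / d

x = np.array([1.0, 3.0, 2.0])
assert np.allclose(online_softmax(x), np.exp(x) / np.exp(x).sum())
```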
Bayesian copula estimation: Describing correlated joint distributions
When we deal with multiple variables (e.g. a and b) we often want to describe the joint distribution P(a,b) parametrically. If we are lucky, then this joint distribution might be ‘simple’ in some way. For example, it could be that a and b are statistically independent, in which case we can break down the joint distribution into P(a,b)=P(a)P(b) and so we just need to find appropriate parametric descriptions for P(a) and P(b). Even if this is not appropriate, it may be that P(a,b) could be described well by a simple multivariate distribution, such as a multivariate normal distribution for example…However, very often when we deal with real datasets, there is complex correlational structure in P(a,b), meaning that these two previous approaches are not available to us. So alternative methods are required…
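The post goes on to do full Bayesian estimation; as a quick taste of the machinery, here is just the sampling direction of a Gaussian copula: draw correlated normals, push them through the normal CDF to get correlated uniforms, then through any inverse CDFs you like (the marginals below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.7], [0.7, 1.0]])     # the dependence structure
z = rng.multivariate_normal([0, 0], corr, size=10_000)
u = stats.norm.cdf(z)                         # correlated uniforms

a = stats.gamma(a=2.0).ppf(u[:, 0])           # arbitrary marginal for a
b = stats.beta(a=2.0, b=5.0).ppf(u[:, 1])     # arbitrary marginal for b
print(np.corrcoef(a, b)[0, 1])                # dependence survives the transform
```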
The Rise, Fall, and (Slight) Rise of DVDs: A Statistical Analysis
Recently, DVDs have experienced a minor revival, less as an enduring consumer good and more as a symbol of self-expression. If that sounds hokey or illogical, it's because it is. People (myself included) have chosen to forego convenience and economic practicality in order to fulfill an emotional desire, and so far, I am loving it…So today, we'll explore the rise and fall of physical media, the changing economics of movie ownership, and the minor resurgence of a dying technology…
How I interview data engineers [Reddit]
I am a Head of data (& analytics) engineering at a Fintech company and have interviewed hundreds of candidates. What I have outlined in my blog post (below) would, obviously, not apply to every interview you may have, but I believe there are many things people don't usually discuss. Please go wild with any questions you may have…
Understanding GPU Memory 1: Visualizing All Allocations over Time
During your time with PyTorch on GPUs, you may be familiar with this common error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 401.56 MiB is free.
In this series, we show how to use memory tooling, including the Memory Snapshot, the Memory Profiler, and the Reference Cycle Detector to debug out of memory errors and improve memory usage…
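The Memory Snapshot workflow from the post boils down to a record/dump pair of calls (underscore-prefixed and subject to change between PyTorch versions), roughly:

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)  # start recording

# ...run the training step(s) you want to inspect, e.g.:
model = torch.nn.Linear(4096, 4096).cuda()
model(torch.randn(64, 4096, device="cuda")).sum().backward()

torch.cuda.memory._dump_snapshot("snapshot.pickle")       # save the timeline
torch.cuda.memory._record_memory_history(enabled=None)    # stop recording
```

The resulting pickle can then be dragged into the interactive viewer at https://pytorch.org/memory_viz.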
Structured Hierarchical Retrieval
Doing RAG well over multiple documents is hard. A general framework: given a user query, first select the relevant documents, then select the content inside them. But selecting the documents can be tough - how can we dynamically select documents based on different properties depending on the user query? In this notebook, we show you our multi-document RAG architecture…
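Stripped of any particular library, the architecture is two-stage retrieval. A self-contained toy sketch (word-overlap scoring stands in for embeddings; all names here are ours, not the notebook's API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def score(query: str, text: str) -> float:
    """Toy relevance: word overlap. A real system would compare embeddings."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / (len(q) or 1)

def hierarchical_retrieve(query, doc_summaries, chunks, top_docs=2, top_chunks=3):
    # Stage 1: rank whole documents by their summaries/metadata.
    kept = sorted(doc_summaries, key=lambda d: score(query, doc_summaries[d]),
                  reverse=True)[:top_docs]
    # Stage 2: rank chunks, but only within the documents that survived.
    pool = [c for c in chunks if c.doc_id in kept]
    return sorted(pool, key=lambda c: score(query, c.text), reverse=True)[:top_chunks]
```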
Jobs
Data Scientist – BCG X
We are BCG X.
BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities.
Our BCG X teams own the full analytics value chain end to end: framing new business challenges, designing innovative algorithms, implementing and deploying scalable solutions, and enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom builds to industry-specific, leading-edge AI software solutions.
Our Data Scientists and Senior Data Scientists join our rapidly growing team to apply data science methods and analytics to real-world business situations across industries and drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Sponsor
New ML Challenge – Build a decentralized credit score in Web3
Spectral, a new marketplace for verifiable machine intelligence that leverages zkML to ensure accuracy, verification, and IP protection for modelers, has launched its first-ever model-building challenge, inviting data scientists to help address societal issues by producing high-performing, open-source ML models. The models built from this challenge will have massive implications for the crypto industry as we know it. A $100k bounty is on the line, as well as an 85% revenue share for the models they build. Engineers can sign up now and can expect more challenges in early 2024.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Which YouTube channels should a sophomore follow for ML? There's a lot of information out there, and only a small percentage seems not hyped up…
Generative AI exists because of the transformer
This is how it works…
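If you want the transformer's core operation in code form before watching: scaled dot-product attention, in a few lines of NumPy:

```python
import numpy as np

def attention(Q, K, V):
    """Each query mixes the rows of V, weighted by softmax(QK^T / sqrt(d))."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

Q, K, V = (np.random.randn(4, 8) for _ in range(3))
out = attention(Q, K, V)   # (4, 8): one mixed value vector per query
```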
Deep Generative Models (Cornell Tech CS 6785, Spring 2023)
YouTube Playlist of lectures…
Last Week's Newsletter's 3 Most Clicked Links
What are 2023's top innovations in ML/AI outside of LLM stuff? [Reddit]
What is your most and least favorite thing about Jupyter notebooks? [Reddit]
* Based on unique clicks.
** Find last week's issue #525 here.
Cutting Room Floor
AI as the New Member of the Engineering Team: Crafting an End-to-End AI Application with AI
A single computational objective drives specialization of streams in visual cortex
Designing Guiding Principles for NLP for Healthcare: A Case Study of Maternal Health
Discovery of a structural class of antibiotics with explainable deep learning
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
Whenever you're ready, 3 ways we can help:
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week! :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.