Data Science Weekly - Issue 525
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #525
December 14, 2023
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Editor's Picks
Statistical Rethinking (2024 Edition)
This course teaches data analysis, but it focuses on scientific models. The unfortunate truth about data is that nothing much can be done with it until we say what caused it. We will prioritize conceptual, causal models and precise questions about those models. We will use Bayesian data analysis to connect scientific models to evidence. And we will learn powerful computational tools for coping with high-dimension, imperfect data of the kind that biologists and social scientists face…
The AI trust crisis
Dropbox added some new AI features. In the past couple of days, these have attracted a firestorm of criticism… The key issue here is that people are worried that their private files on Dropbox are being passed to OpenAI to use as training data for their models, a claim that is strenuously denied by Dropbox… When it comes to data privacy and AI, a “moderately OK job” is a failing grade. Especially if you hold as much of people’s private data as Dropbox does! Two details in particular seem really important…
Our First Netflix Data Engineering Summit
Earlier this summer Netflix held our first-ever Data Engineering Forum. Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community! You can find each of the talks below with a short description of each, or you can go straight to the playlist on YouTube here…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Teaching HCI Foundations and Frontiers
How do we ground students not just in the practice of human-computer interaction, but also in the theories and big ideas that animate the field? When there’s a vast library of theories, visions, architectures, and critiques, not every design idea has to start from the drawing board. I’ve recently redeveloped our advanced HCI course for undergraduates and graduate students to weave our major HCI theories into modern topics…
Smoking Causes Cancer
In the preface of Probably Overthinking It, I wrote:
Sometimes interpreting data is easy. For example, one of the reasons we know that smoking causes lung cancer is that when only 20% of the population smoked, 80% of people with lung cancer were smokers. If you are a doctor who treats patients with lung cancer, it does not take long to notice numbers like that.
When I re-read that paragraph recently, it occurred to me that interpreting those numbers might not be as easy as I thought. To find out, I ran a Twitter poll. Here are the results…
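For readers who want to work through those two numbers themselves, here is a minimal sketch of one way to read them. It is our own illustration, not the method used in the linked post. It assumes the 20% and 80% figures describe the same population, and the function name `relative_risk` is ours. By Bayes' rule, the unknown base rate of lung cancer cancels out of the ratio, so the two percentages are enough to compare smokers with non-smokers.

```python
def relative_risk(p_smoker: float, p_smoker_given_cancer: float) -> float:
    """Return P(cancer | smoker) / P(cancer | non-smoker).

    Bayes' rule gives P(cancer | smoker) = P(smoker | cancer) * P(cancer) / P(smoker),
    and likewise for non-smokers, so P(cancer) cancels in the ratio.
    """
    # How over-represented smokers are among patients, relative to the population.
    lift_smokers = p_smoker_given_cancer / p_smoker
    # How under-represented non-smokers are among patients.
    lift_nonsmokers = (1 - p_smoker_given_cancer) / (1 - p_smoker)
    return lift_smokers / lift_nonsmokers


# Figures quoted in the blurb: 20% of the population smoked,
# 80% of people with lung cancer were smokers.
rr = relative_risk(p_smoker=0.20, p_smoker_given_cancer=0.80)
print(f"Relative risk for smokers vs. non-smokers: about {rr:.0f}x")
```

Under those assumptions the ratio works out to (0.8 / 0.2) / (0.2 / 0.8), or roughly 16. This is an association between smoking and lung cancer, and by itself it says nothing about cause.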
What is your most and least favorite thing about Jupyter notebooks? [Reddit]
Curious about what people's experience with notebooks has been like!…
How Databases Store and Retrieve Data
Have you ever found yourself executing a database query and wondered how the database engine stores and fetches the data? Or maybe in a review, someone pointed out, “This query is going to require a table scan because it’s not using an indexed column,” and you thought, “What does that even mean?” Well then, this post might be for you!…
DataLab: Data processing and analysis software for scientific and industrial applications
DataLab is a generic signal and image processing software based on Python scientific libraries (such as NumPy, SciPy or scikit-image) and Qt graphical user interfaces (thanks to the powerful PlotPyStack - mostly the guidata and PlotPy libraries). DataLab is available as a stand-alone application (see for example our all-in-one Windows installer) or as an addon to your Python-Qt application thanks to advanced automation and embedding features…
Is a masters while working full-time as an ML engineer worth it? [Reddit]
I’m currently a first-year graduate from undergraduate college working full-time as an ML engineer. My company offers to pay for a graduate degree and I’m considering getting a master’s. First question: for those who are ML or data scientists, what were your degrees in? Second question: how much of this will suck? Is taking two classes per semester reasonable? And are online universities considered okay? Third question: how much of a pay bump did you see after getting your master’s?…
Creating Christmas cards with R
The programming language R is capable of creating a wide variety of geometric shapes that can be used to construct high quality graphics, including festive images. In this tutorial, Nicola Rennie walks us through the process of using R packages to create and send Christmas cards with R…
VonGoom: A Novel Approach for Data Poisoning in Large Language Models
Overview: We introduce VonGoom (Vectorized Offending Neurons - Guided Obfuscated Objectives in large-language Models), a novel approach for poisoning attacks targeting LLMs during training. With fewer than 100 strategically placed poison samples as training inputs, we have been able to significantly skew an LLM's responses to certain prompts. Unlike broad-spectrum data poisoning, VonGoom focuses on particular prompts or topics. Our method involves crafting text inputs that are seemingly benign but contain subtle manipulations designed to mislead the model during training and disturb learned weights…
A cry for help: Early detection of brain injury in newborns
Since the 1960s, neonatal clinicians have known that newborns suffering from certain neurological conditions exhibit altered crying patterns, such as the high-pitched cry in birth asphyxia. Despite an annual burden of over 1.5 million infant deaths and disabilities, early detection of neonatal brain injuries due to asphyxia remains a challenge, particularly in developing countries where the majority of births are not attended by a trained physician. Here, we report on the first inter-continental clinical study to demonstrate that neonatal brain injury can be reliably determined from recorded infant cries using an AI algorithm we call Roseline…
Unlimiformer: Long-Range Transformers with Unlimited Length Input
Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores… We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time…
What are 2023's top innovations in ML/AI outside of LLM stuff? [Reddit]
What really caught your eye so far this year? Both high profile applications but also research innovations which may shape the field for decades to come…
Language Models as Zero-Shot Trajectory Generators
Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access to only object detection and segmentation vision models…
Jobs
Data Scientist – BCG X
We are BCG X.
BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities.
Our BCG X teams own the full analytics value-chain end to end: framing new business challenges, designing innovative algorithms, implementing and deploying scalable solutions, and enabling colleagues and clients to fully embrace AI. Our product offerings span from fully custom builds to industry-specific, leading-edge AI software solutions.
Our Data Scientists and Senior Data Scientists are part of our rapidly growing team, applying data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, and on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Ask HN: Daily practices for building AI/ML skills?
Say I have around 1 hour daily allocated to developing AI/ML skills.
What, in your opinion, is the best way to invest the time/energy?
1. Build small projects (build what?)
2. Read blogs/newsletters (which ones?)
3. Take courses (which courses?)
4. Read textbooks (which books?)
6. Kaggle competitions
7. Participate in AI/ML forums/communities
8. A combination of the above (if possible, share time % allocation/weightage)
I'm asking this in general to help good SE people build up capabilities in ML…
The Little Book of Deep Learning
This book is a short introduction to deep learning for readers with a STEM background, originally designed to be read on a phone screen. It is distributed under a non-commercial Creative Commons license and was downloaded nearly 250'000 times in the month following its public release…
Reflections from 2023 on databases and developer tools – purposely not all about LLMs [Twitter / X]
These are half-baked observations from my year, so I’m curious what others think! 🧵…
Last Week's Newsletter's 3 Most Clicked Links
What opinion about data science would you defend like this? [Reddit]
Which Movies Are The Most Polarizing? A Statistical Analysis
* Based on unique clicks.
** Find last week's issue #524 here.
Cutting Room Floor
FunSearch: Making new discoveries in mathematical sciences using Large Language Models
Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations
Whenever you're ready, 3 ways we can help:
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week! :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.
FYI - This article "VonGoom: A Novel Approach for Data Poisoning in Large Language Models" is part of an ongoing art project by artist Sterling Crispin. Vice did an interview on a prior piece Crispin did => https://www.vice.com/en/article/88xk7b/del-complex-ai-training-barge. Special thanks to S. Carter for sharing the Vice article with us.