Data Science Weekly - Issue 554

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Jul 04, 2024


Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

Exorcising myself of the Primer
If you want to make an educational technologist’s eyes sparkle, just mention “The Young Lady’s Illustrated Primer”. It’s a futuristic interactive schoolbook, described in Neal Stephenson’s The Diamond Age, where it lifts a young girl out of poverty and into sovereign power. It’s my field’s most canonical vision of a wildly powerful learning environment. If you ask a technologist interested in learning what they dream of achieving, most will answer: “building the Primer.”…

Mastering AI Department Reorganizations: Lessons from the Trenches
Do’s and Don’ts after five years of Data Science department reorgs…During the past five years, I’ve served as the VP of Data Science, AI, and Research at two publicly traded companies. In both roles, AI was central to the company’s core product. This provided significant resources and the opportunity to lead substantial departments, comprising 40–50 data scientists, including 2–3 group leaders and 6–8 team leads. One of the greatest challenges in this role has been structuring the department to enhance effectiveness, streamline value, and clarify roles and responsibilities. Today, I’ll share some best practices I’ve gathered through six departmental reorganizations…

From bare metal to a 70B model: infrastructure set-up and scripts
In the span of a few months, with a small team of researchers and engineers, we trained a 70B parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks….we’re sharing an end-to-end guide for setting up the required infrastructure: from bringing up the initial cluster and installing the OS, to automatically recovering from errors encountered during training…In each step, we detail the challenges we encountered and how we resolved them. Along with our learnings, we’re releasing many of the infrastructure scripts we developed to ensure healthy hosts, so that other teams can more easily create stable infrastructure for their own model training…

A Message from this week's Sponsor:

Magical tools for working with data

Building a Big Picture Data Team at StubHub

See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex.

The challenges she faced may sound familiar:

Unclear SMEs meant questions went to multiple people
Without SLAs, answer times were too long
Lack of data modeling & source-of-truth metrics generated varying results
Lack of discoverability & reproducibility cost time, efficiency and accuracy
Static reporting reserved interactivity for rare occasions

Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights…

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Why PhDs whiff the onsite, and how to find a diamond in the rough
Lessons from hiring 100+ data scientists…Throughout my (Duncan’s) career, I’ve interviewed or advised hundreds of new PhDs trying to break into data science in tech. One year at Uber, I even hired five new Harvard economics PhDs — about 20% of the graduating class…What to look for during data science interviews: It’s important to remember that the point of the interview isn’t to find someone good at interviewing — rather to find someone who will be good at the job. Most PhDs won’t be good at interviewing! So over the years, I’ve identified a few red flags — and green ones — to look for…
How to create Dashboard in Python from PostgreSQL
Accessing a database in a terminal is not the best solution for everyone. Many times, we would like to share data in a more accessible way, for example as a web app with an interactive dashboard. However, building a dashboard from scratch might be hard and might require using many different technologies. Don't worry. There is an easy way to solve this problem with Python only. In this article, you will learn a Pythonic way to:
- connect with your PostgreSQL database,
- send SQL queries from Python,
- fetch results from PostgreSQL as Python variables,
- create an interactive dashboard with Python only,
- deploy a dashboard as a web app.
Are you curious? Let's start!…
What are issues in AI/ML that no one seems to talk about? [Reddit Discussion]
I was curious if anyone in their practical, personal, or research experience has come across any unpopular or novel concerns that usually aren’t included in the AI discourse, but stuck with you for whatever reason…On the flip side, are there even issues that are frequently discussed but perhaps are grossly underestimated?…
torch.compile, the missing manual [Google doc]
You are here because you want to use torch.compile to make your PyTorch model run faster…torch.compile is a complex and relatively new piece of software, and so you are likely to have growing pains. This manual is all about how to resolve problems that may arise when working with torch.compile, including both bugs in PyTorch itself, as well as fundamentally difficult problems that require some care from the user…This manual’s focus is for technical end users who don’t know much about PyTorch’s internals, but do understand their model and are willing to interact with PyTorch developers via GitHub…
Sliding Window Attention
In this post, we take a deep dive into Sliding Window Attention, which allows transformers to have long context lengths. We do this with the help of animations and also implement it from scratch in PyTorch code…
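As a rough illustration of the idea (not the linked post's PyTorch implementation), a causal sliding-window mask can be sketched in plain Python. The function name `sliding_window_mask` and its `window` parameter are our own, and we assume the causal variant in which each token attends to itself and the `window - 1` tokens before it.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumption: causal sliding window, so position i sees only
    positions i - window + 1 .. i (clipped at the start of the sequence).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)

# Print the mask: "x" marks an allowed (query, key) pair, "." a blocked one.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` positions, so the number of attended pairs grows linearly with sequence length instead of quadratically.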
What type of inference is planning?
Multiple types of inference are available for probabilistic graphical models, e.g., marginal, maximum-a-posteriori, and even marginal maximum-a-posteriori. Which one do researchers mean when they talk about "planning as inference"?...In this work we use the variational framework to show that all commonly used types of inference correspond to different weightings of the entropy terms in the variational problem, and that planning corresponds exactly to a different set of weights. This means that all the tricks of variational inference are readily applicable to planning…
Diving into R with Isabella Velasquez: Perspectives from R-Ladies Seattle
Isabella Velasquez, co-organizer of R-Ladies Seattle, recently spoke with the R Consortium about her journey with R and the group’s recent activities. Isabella started as a beginner but has become a key figure in the R community thanks to the supportive and collaborative learning environment…
Regrets and Regression
Here’s a question from the Reddit statistics forum.
I want to write a research article that has regression analysis in it. I normalized my independent variables and want to include in the article the results of a statistical test showing that all variables are normal. I normalized using the scale function in R and some custom normalization functions I found online but whatever I do the new data fails the Shapiro Wilkinson and KS test on some columns? What to do?
There might be a few points of confusion here…
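One point holds regardless of what the linked post goes on to say: standardizing a variable (subtracting the mean and dividing by the standard deviation) is a linear rescaling, so it cannot change the shape of the distribution. The sketch below is ours, uses made-up numbers, and checks this with the standard library only; `standardize` and `skewness` are hypothetical helper names, not functions from R or from the article.

```python
import statistics


def standardize(xs):
    """Z-score the values: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)  # population SD; R's scale() uses the sample SD
    return [(x - mean) / sd for x in xs]


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


# Illustrative right-skewed data (invented for this example).
raw = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
scaled = standardize(raw)

print("skewness before scaling:", round(skewness(raw), 6))
print("skewness after scaling: ", round(skewness(scaled), 6))
```

The two printed values match, so a column that fails a normality test before scaling will still fail it afterwards.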
A Machine Learning Approach to Regime Modeling
Financial markets have the tendency to change their behavior over time, which can create regimes, or periods of fairly persistent market conditions. Investors often look to discern the current market regime, looking out for any changes to it and how those might affect the individual components of their portfolio’s asset allocation. Modeling various market regimes can be an effective tool, as it can enable macroeconomically aware investment decision-making and better management of tail risks…In this Street View, we present a machine learning-based approach to regime modeling, display the historical results of that model, discuss its output for today’s environment, and conclude with practical use cases of this analysis for allocators…
How to pivot to a Machine Learning engineer? [HN Discussion]
Software engineer here with 10+ YOE building data (mildly) intensive applications: mainly back-end development experience (from legacy to modern/cloud-native applications, brownfields and greenfields).
(1) is it wise to do this transition?
(2) has anyone else here in HN done it?
(3) how can I do it if my job has no ML in it?
Is there an ML engineering practice that isn't focused on building models but more on managing/deploying/scaling models? i.e. can I avoid learning all the maths underneath?…
Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms…
So You Want to Write a Stream Processor? Beware of the Duck Syndrome.
With this post, I pull back the curtains on what goes into building a reliable event-driven application by shining a light on the internals of Kafka Streams and why we think it’s a foundational technology for writing such apps — so foundational that we’re betting our company on it…

Training & Resources

Introduction to Probability for Computing [Free Book]
This book gives an introduction to probability as it is used in computer science theory and practice, drawing on applications and current research developments as motivation and context. This is not a typical counting and combinatorics book, but rather it is a book centered on distributions and how to work with them. Every topic is driven by what computer science students need to know…
A course on Spatial Data Science
The course Spatial Data Science in Python introduces data science and computational analysis using open source tools written in the Python programming language…The course supports students with little prior knowledge of core competencies in Spatial Data Science (SDS). It includes:
- Advancing their statistical and numerical literacy.
- Introducing basic principles of programming for data science and state-of-the-art computational tools for SDS.
- Presenting a comprehensive overview of the main methodologies available to the Spatial Data Scientist and their intuition on how and when they can be applied.
- Focusing on real-world applications of these techniques in a geographical and applied context…
A friendly introduction to Principal Component Analysis
Principal component analysis (PCA) is probably the most magical linear method in data science. Unfortunately, while it's always good to have a sense of wonder about mathematics, if a method seems too magical it usually means that there is something left to understand. After years of almost, but not quite fully understanding PCA, here is my attempt to explain it fully, hopefully leaving some of the magic intact…


Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
