Data Science Weekly - Issue 484

Curated news, articles and jobs related to Data Science.

Mar 02, 2023

Welcome to another issue of Data Science Weekly!

We want to give a warm welcome to the 300+ people who joined us since our last newsletter. To everyone else reading this, your support means a great deal to us. We hope you enjoy reading the curated links below as much as we did.

This week’s experiment is having an open comment section for everyone! So stop on by and share whatever’s on your mind. :) We’ll see you in there!

And now, let's dive in:

Editor's Picks

Meaningful metrics: How data sharpened the focus of product teams
How do you decide on the metrics that matter? And how do you advocate for an organization to adopt new metrics? And what happens if existing metrics stop moving? Our Data Science team developed a growth framework that helped to grow DAUs by 4x since 2019. Let’s explore the path that led us to that framework (the Growth Model), the tangible impact it’s had on our business, and how we’re thinking of evolving the framework to take us into a new phase of growth…

Data science maturity and the cloud
I regularly speak to data scientists who are frustrated in their roles because the tech in their organization simply does not give them the ability to do their job in the best way possible; or, even worse, they do not have the agency to do their job well. Data science, and data scientists, need the right conditions to flourish. So, if you’re looking at your own organization’s data science offering, what are the key things you should be able to do? And how can we ensure that data scientists have them? Let’s take a look at how to check an organization’s data science maturity…

Feature Selection And Feature Importance: How Are They Related?
How are feature selection and feature importance related? This is a question I came upon often when doing research, but it’s also a practical question when doing machine learning…

We sponsored the newsletter ourselves!

You should consider becoming a paid subscriber!

https://datascienceweekly.substack.com/subscribe

What you’ll get as a subscriber above and beyond this weekly link roundup:

Office hours covering: careers, job-hopping, getting started, working with recruiting agencies
Study groups: an invite-only Discord server with various chat rooms dedicated to different data science / data engineering / ML / AI learning materials, where we and/or “teaching assistants” will be there to help guide you and answer your questions
Q&A’s with various companies whose tools you are already using
and more! (People have asked us to produce a “theory”-style weekly newsletter as well as a “tool”-only weekly newsletter, so maybe we’ll put that together for you as well)

If you’re in a professional organization, you can expense the subscription as a tax deduction. You can also expense the subscription out of your learning, professional development, or training budget.

So please consider becoming a paid subscriber here:
https://datascienceweekly.substack.com/subscribe

Data Science Articles & Videos

Why didn't DeepMind build GPT3?
Trying to answer the question “Why didn’t DeepMind initiate and deliver GPT3?” is one way of shedding light on this puzzle. I say specifically GPT3 because that was the significant innovation — we’ve been following a playbook since then, and most of the perceived advantage of OpenAI stems primarily from how fast they ship, and their appetite for it, not from the pace of discovery. As someone professionally interested in how you build extraordinary scientific teams, there are three things that strike me quite profoundly about GPT3…

Founded Upon an Error
A recent post on Reddit asks, “Why was Bayes’ Theory not accepted/popular historically until the late 20th century?” Great question! As always, there are many answers to a question like this, and the good people of Reddit provide several. But the first and most popular answer is, in my humble opinion, wrong. The story goes something like this…

Snowblowing is NP-complete
The recent winter storm left a lot of snow on my driveway. A lot. My driveway is the perfect place for huge snowdrifts to form. A tweet of my shoveling resulted in the discovery of The Snowblower Problem by Esther M. Arkin, Michael A. Bender, Joseph S. B. Mitchell, and Valentin Polishchuk…The Snowblower Problem (SBP) answers the following question: “How does one optimally use a snowblower to clear a given polygonal region?”…The snowblower problem is like the Traveling Salesman Problem…

Algorithmic Black Swans
Organizations building AI systems do not bear the costs of diffuse societal harms and have limited incentive to install adequate safeguards. Meanwhile, regulatory proposals such as the White House AI Bill of Rights and the European Union AI Act primarily target the immediate risks from AI, rather than broader, longer-term risks. To fill this governance gap, this Article offers a roadmap for “algorithmic preparedness” — a set of five forward-looking principles to guide the development of regulations that confront the prospect of algorithmic black swans and mitigate the harms they pose to society…

Long commutes show structural inequality in cities, and bad health outcomes
During President Biden’s State of the Union address, he spoke a lot about rebuilding our highways and railroads to improve the infrastructure of America. But if we don’t address the inequalities in the ways we use those roads and trains, and which communities are most in need of support, we risk embedding structural challenges in the very fabric of our cities…Research from Raj Chetty at Harvard on 5 different social factors found that shorter commute times were the strongest predictor of upward mobility. In fact, investments in public transportation have been shown to reduce local inequality and drive down local crime…

The Significance of A/B Testing and Power Analysis in Fraud Detection
In this post, I will focus on the fraud detection domain, specifically on cases where we constantly retrain and replace multiple models…In the first part, I will present possible approaches for collecting data to compare models and discuss their advantages and disadvantages in the context of fraud detection. I will show that A/B testing has powerful advantages over other options…In the second part, I will address cases where we cannot afford to perform A/B testing for all model replacements…

I tested how well ChatGPT can pull data out of messy PDFs (and here’s a script so you can too)
I spent about a week getting familiarized with two datasets and doing all of the preprocessing. Once it’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it…But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up…

Which programming language is required to land a data job at Meta (Facebook) [Reddit Discussion]
A look at "Data Analyst", "Data Engineer", "Data Scientist", "Research Scientist", and "ML Engineer" job postings...

Exploring Data Distributions with an Interactive Ridge Plot
Ridge plots are particularly helpful for identifying differences in distributions between multiple groups or variables…We'll start with an overview of what ridge plots are and why they're useful, then dive into the technical details of building an interactive ridge plot…Along the way, we'll discuss different use cases for interactive ridge plots and the benefits they can offer for data analysis and decision-making. By the end of this post, you'll have a solid understanding of how to create an interactive ridge plot and how to apply it to your own data analysis projects…

Training Deep Networks with Data Parallelism in JAX
One of the main challenges in training large neural networks, whether they are LLMs or VLMs, is that they are too large to fit on a single GPU. To address this issue, their training can be parallelized across multiple GPUs. This means either parallelizing the data or model to distribute computation across several devices. In this post, we'll cover batch splitting, also known as data parallelism, and show how to use JAX's pmap function to parallelize computations across multiple devices…

What should you use ChatGPT for?
So I’ve been trying to understand the hype. I’m interested in what its impact is on the ML systems I’ll be building over the next ten years. And, as a writer and Extremely Online Person, I’m thinking about how it could change how I create and navigate content online…

Building a Bloom filter
In this post, we will explore the Bloom filter — a data structure that is ingenious in its simplicity and elegant in its design. We will delve into the underlying principles of Bloom filters, understand their benefits and drawbacks, and walk you through the process of implementing a Bloom filter using Python. Bloom filters are a popular topic in computer science, particularly in systems design and optimization, and in technical interviews. Understanding this data structure and how to implement it can be an excellent way to showcase your skills as a data engineer…
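
For readers who want a feel for the idea before clicking through, here is a minimal sketch of a Bloom filter in Python. It is not the linked post's implementation: the class name, the use of SHA-256 with double hashing, and the default 1% false-positive target are our own assumptions, and the sizing uses the standard textbook formulas.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter sketch (hypothetical names, not from the linked post)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.size = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter(expected_items=1000)
for i in range(1000):
    seen.add(f"user-{i}")
```

A membership check such as `"user-42" in seen` never returns a false negative for an added item. An item that was never added can occasionally be reported as present, which is the trade-off that buys the small memory footprint.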

Jobs

Software Developer Job Opportunity at Observable, Inc

SALARY AND HOURS: $107,640 - $150,000 per year; 40 hours per week.

EXPERIENCE AND REQUIREMENTS: Bachelor's degree in Computer Science.

DESCRIPTION OF DUTIES:

Design, develop, test, deploy, maintain and improve software
Write code for Observable’s product and platform, create reliable and sustainable systems, and develop prototypes quickly
Write unit and integration tests to ensure the software is functioning correctly and securely
Deploy and release software at a regular cadence
Support and improve the software through on-call and support tasks
Communicate and interact with users to understand their requirements and respond to their issues
Collaborate on projects with designers, engineers and product managers

Apply here

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

DiffusionFastForward
A course on diffusion generative models in a fast forward mode…There are three elements integrated into this project: a) 💻 Code, b) 💡 Notes (in notes directory), and c) 📺 Video Course (to be released on YouTube)…

Soundscapes: Creating Visuals with Meyda+Shaders
Lately, I have found some very inspiring 3D pieces, most of them combined with sound. I wanted to create some personal pieces (still work in progress) and share here some insights from my experience. With two basic examples (Experience 1 and Experience 2) I explain the whole process from setting up the audio to retrieving the data and visualizing it.

Do Right Joins even matter? [Reddit Discussion]
This is one of those out-of-the-blue thoughts that you get randomly. I am an expert in sql with many years of experience, but have yet to use Right Joins lol. Is there any specific reason or use-case for this type of join?…
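
One way to see why many people never reach for a right join: swapping the table order turns it into a left join with the same result. The sketch below is our own illustration, not taken from the Reddit thread. The `customers` and `orders` tables are made up, and it uses Python's built-in `sqlite3` module. SQLite only added `RIGHT JOIN` in version 3.39, so that query runs only when the bundled library is new enough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0);
    """
)

# Keep every customer, with NULL totals for customers that have no orders.
left_sql = """
    SELECT c.name, o.total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
"""
left_rows = conn.execute(left_sql).fetchall()

# The same question phrased as a right join, with the table order swapped.
right_rows = None
if sqlite3.sqlite_version_info >= (3, 39, 0):  # RIGHT JOIN needs SQLite 3.39+
    right_sql = """
        SELECT c.name, o.total
        FROM orders AS o
        RIGHT JOIN customers AS c ON o.customer_id = c.id
        ORDER BY c.id
    """
    right_rows = conn.execute(right_sql).fetchall()

print(left_rows)
print(right_rows)
```

When both queries run, they return the same rows, which suggests the choice between the two is largely about which table you prefer to read first in the query.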

Last Week's Newsletter's 3 Most Clicked Links

PyGWalker: Turn your pandas dataframe into a Tableau-style User Interface for visual analysis

The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape

Cleaning sample data in standardized way

* Based on unique clicks.
** Find last week's issue #483 here.

Cutting Room Floor

Have an awesome week!

All our best,
Hannah & Sebastian

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues :)
