Data Science Weekly - Issue 584
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #584
January 30, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
How to Build a Data Dashboard Prototype with Generative AI
This article is a tutorial that shows how to build a data dashboard to visualize book reading data taken from goodreads.com. It uses a low-code approach to prototype the dashboard using natural language prompts to an open source tool, which generates Plotly charts that can be added to a template dashboard…
What the F*** is a VAE?
Part of my role as CTO of Remade AI is working on diffusion models, the new generation of which are Latent Diffusion Models (LDMs). These models operate in a compressed latent space and VAEs are the things that do the "compressing" and "decompressing." An interesting byproduct of working with such models is that my cofounders have often found me in the office at 2am staring at the screen muttering to myself "What the F*** is a VAE?" This blog post is my attempt to answer that question!…
The Short Case for Nvidia Stock
Even though I've thought the valuation was just too rich for my blood for the past year or so, a confluence of recent developments has caused me to flip a bit to my usual instinct, which is to be a bit more contrarian in outlook and to question the consensus when it seems to be more than priced in. The saying "what the wise man believes in the beginning, the fool believes in the end" became famous for a good reason…
What’s on your mind
This Week’s Poll:
Twitter/X or Bluesky or Mastodon or Discord or Tiktok/Instagram?
[Take this quick 5-second poll →]
We’ll share the results next week!
Data Science Articles & Videos
Optimization algorithms for your career
Most people in science and engineering professions are very familiar with the concept of optimization…We are also familiar with the traps of local versus global optimization and how to protect our code and models against those traps. However, we frequently fail to realize that the same concepts can be applied to our careers…
3 Steps to AI-Ready Data
Before we can take a swim in the AI soup, we need to be able to see where we’re going. Preparing enterprise-ready data for GenAI use cases requires creating a map of our data that is complete, easy to understand, and accurate. In other words, to be successful in the GenAI arms race (or with any data product for that matter) we need our data to reflect the world around us…
Two Bites of Data Science in K
Here are two digestible examples of data analysis using K. Both cases are afternoon-scale projects I’ve done in the last few weeks…
* The Most Common Consonants Following r and l in English
* “No bowler with as many wickets has a better average”
Things to consider when modifying your data systems
The question of what to do, of when to move to a data warehouse, or when to update the schema is a messy, complex question that can only be answered by "it depends". But since I hate such unsatisfying answers, here's an attempt at jotting down what one process for figuring things out looks like…
A personal history of the tidyverse
This article summarises almost 20 years of package development encompassing over 500 releases of 26 packages. That means this write-up is necessarily abbreviated and when coupled with my fallible memory, that means I’ve almost certainly forgotten some important details…
XGBoost is All You Need, Part 2 - What is Tabular Data?
Today I’d like to take a deeper dive into tabular data. This is the kind of data that XGBoost is primarily designed to handle. It also happens to be the most common form of data used by the Data Science practitioners, analysts, and pretty much anyone who is dealing with any kind of business data. If you’ve ever used Excel, or even created an itemized shopping list, then you have used tabular data. So let’s dig in to see what this data is all about…
The Illustrated DeepSeek-R1
DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress…In this post, we’ll see how it was built…
Python Rgonomics - 2025 Update
Switching languages is about switching mindsets - not just syntax. New developments in python data science toolings, like polars and seaborn’s object interface, can capture the ‘feel’ that converts from R/tidyverse love while opening the door to truly pythonic workflows. (Updated from 2025 for new tools)…
AI is Creating a Generation of Illiterate Programmers
A couple of days ago, Cursor went down during the ChatGPT outage. I stared at my terminal facing those red error messages that I hate to see. An AWS error glared back at me. I didn’t want to figure it out without AI’s help. After 12 years of coding, I’d somehow become worse at my own craft. And this isn’t hyperbole—this is the new reality for software developers…
Designing and Deploying Internal Quarto Templates
Most of the content in the Quarto docs focuses on creating templates for public use, shared in a GitHub repo, but this talk focused on internal templates to share within an organization. The talk covered why you should bother making a Quarto template, how you can go about designing one, and how you can deploy it via a function in an internal package…
How Meta discovers data flows via lineage at scale
In this blog, we will delve into an early stage in Privacy Aware Infrastructure (PAI) implementation: data lineage. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the source asset), to another (the sink asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app…
Where is the standard ML/DL? Are we all shifting to prompting ChatGPT? [Reddit Discussion]
I am working at a consulting company and while so far all the focus has been on cool projects involving setting up ML/DL models, lately all the focus has shifted to GenAI. As a data scientist/machine learning engineer who tackled difficult problems of data and models, for the past 3 months I have been editing the same prompt file, saying things differently to make ChatGPT understand me. Is this the new reality? Or should I change my environment? Please tell me there are standard ML projects…
Taming Complexity with a Simple Data Stack
We overbuilt our data stack at ATM. At the origins of the company, we had a very different set of reporting needs than we do today. Initially, we were solely focused on building an affiliate advertising business in the consumer space. We later added a consumer fintech offering, offering free investment accounts. Much of our analysis efforts were focused on optimizing growth & engagement for the advertising business and laying out a B2B play packaging our investment experience…
Last Week's Newsletter's 3 Most Clicked Links
Are there any ways to earn a little extra money on the side as a data scientist?
Modern Polars - A side-by-side comparison of the Polars and Pandas libraries.
* Based on unique clicks.
** Find last week's issue #583 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~66,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian