Data Science Weekly - Issue 591
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #591
March 20, 2025
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
How to Implement a Cosine Similarity Function in TypeScript for Vector Comparison
To see how an AI can tell that the word “cat” is similar to “kitten,” you need to understand cosine similarity. In short, with the help of embeddings, we can represent words as vectors in a high-dimensional space. If the word “cat” is represented as the vector [1, 0, 0], the word “kitten” might be represented as [1, 0, 1]. Now, we can use cosine similarity to measure the similarity between the two vectors. In this blog post, we will break down the concept of cosine similarity and implement it in TypeScript…
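For a quick feel of the idea before you click through, here is a minimal sketch of our own (not the code from the linked post). It assumes plain number arrays as vectors, and the function name cosineSimilarity is just our choice. Cosine similarity is the dot product of the two vectors divided by the product of their lengths.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Illustrative sketch only; names and error handling are our assumptions.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    // The angle to a zero vector is undefined, so refuse rather than return NaN.
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The toy vectors from the blurb above.
const cat = [1, 0, 0];
const kitten = [1, 0, 1];
console.log(cosineSimilarity(cat, kitten).toFixed(4)); // prints "0.7071"
```

A score of 1 means the vectors point in the same direction and 0 means they are orthogonal, so 0.7071 (a 45-degree angle) reads as "fairly similar" for these toy vectors.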
Exporting INE Datasets - Part 2
How do you run a script that generates 500GB worth of data for free? Well, you probably can’t…What you can do instead is optimize the different steps to make the process more efficient. Let’s go through some of the things I did so it can run on GitHub Actions for free!
From Data to Viz
Data To Viz categorizes chart types based on data format, making it easy to choose the best visualization for your project. It’s the go-to resource for selecting the right chart to showcase your results!…
Data Science Articles & Videos
The-Logic-Band-Methodology
The Logic Band is a novel architecture inspired by combining neuroscience knowledge with data science. The resulting enhancement, which is designed to fit into any neural network, improves model performance by enabling the artificial intelligence to locate and evaluate complex feature relationships. The enhanced architecture is designed in a way that produces exceptional improvement without significant computational resource cost. The following describes the theory and functionality in detail, while offering a new avenue of growth in the advancement of Artificial Intelligence. The conception of the idea through to proof of concept and architecture design throughout various neural networks…
Advice on building a data team [Reddit]
I’m currently the “chief” (i.e., only) data scientist at a maturing startup. The CEO has asked me to put together a proposal for expanding our data team. For the past 3 years I’ve been doing everything from data engineering to model development and MLOps…We’re getting to the point where we are ready to hire and grow our team, but I have no experience with transitioning from a solo IC to a team leader. Has anybody else made this transition in a startup? Any advice on how to build a team?…
ConnectorX: Accelerating Data Loading From Databases to Dataframes
Data is often stored in a database management system (DBMS) but dataframe libraries are widely used among data scientists. An important but challenging problem is how to bridge the gap between databases and dataframes. To solve this problem, we present ConnectorX, a client library that enables fast and memory-efficient data loading from various databases to different dataframes…
Python Project Starter Repository
This repository serves as a template demonstrating Python best practices for research projects. It includes:
An example Python program (reading in data and plotting)
Command-line argument parsing (argparse)
Code style checking, aka "linting" (with ruff)
Static type checking (with mypy)
Pre-commit hooks that run these checks automatically (with pre-commit)
Testing (with pytest)
Continuous Integration (with GitHub Actions)
Package management (with pip and pyproject.toml)
An open source license (MIT)…
Building a Pipeline for Automating Case Study Classification
This post outlines my journey in building an automated classification pipeline capable of distinguishing theoretical discussions from concrete, production-grade generative AI (GenAI) implementations. I'll share the challenges faced, approaches tested, and practical insights gained along the way…
When will AI systems be able to carry out long projects independently?
In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months…
Automatically tuning thresholds
You can get a lot of traction in your data science career if you are good at picking the right threshold. A few years ago this would have involved manual labor, but these days you can also use new tools in scikit-learn that can automate the selection. This video explains how it works…
Advice on Upskilling
Since summer 2024 I’ve written a ton of scattered pieces on upskilling. I recently cleaned them up and pulled them all together into this little booklet. It’s still somewhat of a work in progress and I’ll be continually extending and refining it in the future, but I feel like it does a decent job of bringing together a lot of previously scattered writing into some sort of more cohesive whole…
What’s a good imputation to predict with missing values?
Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims at clarifying if and when investing in advanced imputation methods yields significantly better predictions…
The fundamental theorem of finite games
In every finite two-player game of perfect information, either one player has a winning strategy or both players have drawing strategies…
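One way to get a feel for why this holds is backward induction: label every final position with its result, then work back up the game tree, letting each player pick the best outcome available to them. The sketch below is our own illustration, not taken from the linked post. The Position type, the Outcome encoding (1, 0, -1 from the first player's point of view), and the solve function are all hypothetical names.

```typescript
// Outcome from the first player's perspective: 1 = win, 0 = draw, -1 = loss.
type Outcome = 1 | 0 | -1;

// A hypothetical game-tree node: terminal positions carry a result,
// non-terminal positions list the positions reachable in one move.
interface Position {
  result?: Outcome;
  moves?: Position[];
}

// Backward induction: the value of a position is the best child value
// for whoever is to move (max for the first player, min for the second).
function solve(pos: Position, firstPlayerToMove: boolean): Outcome {
  if (!pos.moves || pos.moves.length === 0) {
    if (pos.result === undefined) {
      throw new Error("Terminal position needs a result");
    }
    return pos.result;
  }
  const values = pos.moves.map((next) => solve(next, !firstPlayerToMove));
  const best = firstPlayerToMove ? Math.max(...values) : Math.min(...values);
  return best as Outcome;
}

// Tiny made-up game: the first player picks a branch, then the second player
// picks the final result within that branch.
const game: Position = {
  moves: [
    { moves: [{ result: -1 }, { result: 0 }] },
    { moves: [{ result: 0 }, { result: 1 }] },
  ],
};
console.log(solve(game, true)); // prints 0
```

Because the tree is finite, the recursion always ends with exactly one of three labels at the root: 1 (the first player can force a win), -1 (the second player can), or 0 (both can force at least a draw). Here the root evaluates to 0, the drawing case.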
fasttransform: Reversible Pipelines Made Simple
Introducing fasttransform, a Python library that makes data transformations reversible and extensible through the power of multiple dispatch…If you’ve ever trained a machine learning model, you know what comes next: the frustrating journey of trying to understand what your model actually saw. You dig through layers of transformations - normalizations, resizes, augmentations - only to realize you’ll need to write inverse functions just to see your data again. It’s so painful that many of us skip it altogether, debugging our models based on abstract numbers rather than actual data…
Using Ordering for Better Plans in Apache DataFusion
In this blog post, we explain when an ordering requirement of an operator is satisfied by its input data. This analysis is essential for order-based optimizations and is often more complex than one might initially think…
OpenTimes: Free travel times between U.S. Census geographies
Today I’m launching OpenTimes, a free database of pre-computed, point-to-point travel times between major U.S. Census geographies. In addition to letting you visualize travel isochrones, OpenTimes also lets you download massive amounts of travel time data for free and with no limits…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #590 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~67,400 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian