Data Science Weekly - Issue 540
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #540
March 28, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
sql is syntactic sugar for relational algebra
This idea is particularly sticky because it was more or less true 50 years ago, and it's a passable mental model to use when learning sql. But it's an inadequate mental model for building new sql frontends, designing new query languages, or writing tools like ORMs that abstract over sql. Before we get into that, we first have to figure out what 'syntactic sugar' means…
25 YC companies that have trained their own AI models (Twitter/X Thread)
(0/25) Here's a list of 25 YC companies that have trained their own AI models. Reading through these will give you a good sense of what the near future will look like…
My binary vector search is better than your FP32 vectors
Within the field of vector search, an intriguing development has arisen: binary vector search. This approach shows promise in tackling the long-standing issue of memory consumption by achieving a remarkable 30x reduction. However, a critical aspect that sparks debate is its effect on accuracy. We believe that using binary vector search, along with specific optimization techniques, can maintain similar accuracy. To provide clarity on this subject, we showcase a series of experiments that will demonstrate the effects and implications of this approach…
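For readers who want to see the core trick before clicking through, here is a minimal NumPy sketch (ours, not the article's code) of the usual binary-quantization recipe: keep only the sign bit of each dimension, search by Hamming distance, then rescore a small shortlist with the original FP32 vectors.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 1024)).astype(np.float32)   # FP32 embeddings
query = rng.normal(size=1024).astype(np.float32)

# 1 bit per dimension instead of 32: this is where the memory saving comes from
doc_bits = np.packbits(docs > 0, axis=1)
query_bits = np.packbits(query > 0)

# Coarse pass: Hamming distance via XOR plus bit count
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:100]

# Fine pass: rescore the shortlist with the full-precision vectors
scores = docs[candidates] @ query
top10 = candidates[np.argsort(-scores)[:10]]

The full-precision rescoring pass is the kind of "specific optimization technique" that lets a binary index recover most of the FP32 accuracy.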
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, and if they're not all working properly, your company may not be making the right decisions: teams aren't shipping features that actually help customers, the org is leaving money on the table, and you're likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
How to Use GitHub Actions to Automate Data Scraping
In this blog, we explore how to automate a data scraping process in the cloud using GitHub Actions. For this case study, we will demonstrate how to automate a Python script which fetches a paginated bulk dataset from the Gateway to Research API (though the techniques used can be applied to a variety of data processing tasks). Using GitHub Actions enables this to be done entirely in the cloud and configured to happen at regular intervals (in this case every day)…
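As a flavor of the scraping half, here is a hedged Python sketch of a paginated pull; the endpoint URL, parameter names, and JSON structure are illustrative guesses, so check the Gateway to Research API docs before reusing it. On the GitHub Actions side, a workflow with an on: schedule cron trigger is what runs such a script daily.

import requests

# Illustrative paginated fetch; the real endpoint and parameter names
# may differ from the Gateway to Research API -- treat this as a template.
BASE = "https://gtr.ukri.org/gtr/api/projects"
records, page = [], 1
while True:
    resp = requests.get(
        BASE,
        params={"p": page, "s": 100},              # page number and page size
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    batch = resp.json().get("project", [])         # assumed response shape
    if not batch:                                  # empty page means we're done
        break
    records.extend(batch)
    page += 1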
Surya - OCR, layout analysis, and line detection in 90+ languages
Surya is a document OCR toolkit that does:
Accurate OCR in 90+ languages
Line-level text detection in any language
Layout analysis (table, image, header, etc detection) in any language
It works on a range of documents (see usage and benchmarks for more details).
Please recommend ways to make ML Interview better for candidates [Reddit]
There was a thread that complained about ML interviews being exhaustive. As a hiring manager, I'd like recommendations on making ML interviews better.
What are some things to avoid?
What are some good steps to include?…
Causal ML Book
An introduction to the emerging fusion of machine learning and causal inference. The book introduces ideas from classical structural equation models (SEMs) and their modern AI equivalents, directed acyclic graphs (DAGs) and structural causal models (SCMs), and presents Double/Debiased Machine Learning methods for doing inference in such models using modern predictive tools…
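The "double" in Double ML is easy to see in code. Here is a minimal, illustrative partialling-out sketch (ours, not the book's): predict both the outcome and the treatment from the confounders with cross-fitted ML models, then regress residual on residual.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))                        # confounders
d = X[:, 0] + rng.normal(size=2_000)                   # treatment depends on X
y = 0.5 * d + X[:, 0] ** 2 + rng.normal(size=2_000)    # true effect is 0.5

# Cross-fitted nuisance predictions keep overfitting out of the residuals
y_res = y - cross_val_predict(RandomForestRegressor(n_estimators=100), X, y, cv=5)
d_res = d - cross_val_predict(RandomForestRegressor(n_estimators=100), X, d, cv=5)

theta = (d_res @ y_res) / (d_res @ d_res)              # residual-on-residual slope
print(theta)                                           # close to 0.5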
SQL — Cursors and Their Usage in Databases
An introduction to working with the cursor in PostgreSQL and MS SQL Server…I’m introducing you to a great tool inside any SQL dialect — the cursor. To put it simply, the cursor allows us to consume the results while the database engine is generating them, thus opening up many possible solutions. So, let’s start with the basics…
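In Python, the same idea surfaces as a server-side cursor. A minimal psycopg2 sketch (connection string and table name are placeholders): a named cursor streams rows from PostgreSQL as the engine produces them instead of materializing the whole result set client-side.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")      # placeholder credentials
with conn, conn.cursor(name="big_scan") as cur:     # named = server-side cursor
    cur.itersize = 10_000                           # rows per server round trip
    cur.execute("SELECT * FROM events ORDER BY created_at")
    n = 0
    for row in cur:                                 # consume incrementally
        n += 1                                      # per-row processing goes here
print(n)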
The finicky effects of measurement error: When does the bias go down? When does the bias go up?
So… the other day there was a most interesting discussion over on Twitter/X about the effects of measurement error in linear regression coefficients…I wanted to add my 2 cents to this as well (I mean, I am a psychometrician after all). But the more I thought about it, the more I realized that it would take a lot more than 2 or 3 posts to explain everything I wanted to explain. Which is why I’m writing this in blog format now…
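The textbook case the discussion starts from is easy to reproduce: with classical, uncorrelated error in a single predictor, the OLS slope is attenuated by the reliability ratio lambda = var(x) / (var(x) + var(error)); the post explores when that simple intuition breaks down. A quick simulation of the simple case (ours, not the author's):

import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 1.0
x = rng.normal(size=n)                         # true predictor, variance 1
y = beta * x + rng.normal(size=n)
x_obs = x + rng.normal(size=n)                 # classical error, variance 1

b_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
lam = 1.0 / (1.0 + 1.0)                        # reliability ratio
print(b_hat, lam * beta)                       # both close to 0.5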
The roll, yaw and pitch of strawberries
My arxiv frontpage found another fun article about a new benchmark. This one is about orientation estimation of strawberries. It turns out that this is super useful for fruit picking robots…
Tutorial on Diffusion Models for Imaging and Vision
The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image generation and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that overcomes some shortcomings of previous approaches. The goal of this tutorial is to discuss the essential ideas underlying diffusion models. The target audience of this tutorial includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems…
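If you want one formula to hold onto before reading: the forward (noising) process of a DDPM-style model lets you jump to any timestep t in closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A tiny illustrative NumPy version (schedule values are the common defaults, not from the tutorial):

import numpy as np

T = 1_000
betas = np.linspace(1e-4, 0.02, T)             # a common linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Sample x_t from the forward process in one step."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps                             # the network learns to predict eps

Training fits a network to predict eps from (x_t, t); generation then runs the process in reverse, starting from pure noise.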
SQLite Schema Diagram Generator
A properly normalized database can wind up with a lot of small tables connected by a complex network of foreign key references. Like a real-world city, it's pretty easy to find your way around once you're familiar, but when you first arrive it really helps to have a map. Lots of database management tools include some kind of schema diagram view, either automatically generated or manually editable so you can get the layout just right. But it's usually part of a much bigger suite of tools, and sometimes I don't want to install a tool, I just want to get a basic overview quickly…
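SQLite itself exposes everything such a tool needs. A minimal sketch (file name is a placeholder, and this is our illustration rather than the linked generator): walk sqlite_master and the foreign_key_list pragma, and emit Graphviz DOT.

import sqlite3

def schema_dot(db_path):
    """Emit a Graphviz DOT graph of tables and their foreign-key edges."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    lines = ["digraph schema {", "  node [shape=box];"]
    for t in tables:
        lines.append(f'  "{t}";')
        # Each pragma row: (id, seq, referenced_table, from_col, to_col, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{t}")'):
            lines.append(f'  "{t}" -> "{fk[2]}" [label="{fk[3]}"];')
    lines.append("}")
    return "\n".join(lines)

print(schema_dot("mydb.sqlite"))   # render with: dot -Tsvg > schema.svg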
Thoughts on Academia and Industry in Machine Learning Research
A recent conversation with Jay Shah on his podcast made me think more about career choices and the question of “academia vs. industry” after completing a PhD. Since finishing my PhD, I have also had this conversation with many other researchers — and before finishing my PhD, I asked recent graduates about it myself. So, in this article, I want to share some of my thoughts…
RAFT: Adapting Language Model to Domain Specific RAG
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based prompting or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in an "open-book", in-domain setting.
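The training-data recipe is the heart of the paper: each example pairs a question with its oracle document plus sampled distractors, and in a fraction of examples the oracle is deliberately dropped so the model also learns to answer from memory. A hedged sketch of that assembly step (function and field names are ours, not the authors' code):

import random

def make_raft_example(question, oracle_doc, corpus, answer_cot,
                      n_distractors=3, p_drop_oracle=0.2):
    docs = random.sample(corpus, n_distractors)    # distractor documents
    if random.random() > p_drop_oracle:            # sometimes hide the oracle
        docs.append(oracle_doc)
    random.shuffle(docs)
    context = "\n\n".join(docs)
    # The target is a chain-of-thought answer that cites the oracle document
    return {"prompt": f"{context}\n\nQuestion: {question}",
            "completion": answer_cot}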
Building Open Data Portals in 2024
If you look into the Open Data ecosystem, you’ll realize the tools and approaches used there are different from the ones we use on the data teams of small to medium companies. I’m sure these differences exist for many reasons, the main one being that open data is not a technological problem to begin with. That said, I wanted to explore the idea of building open data portals using some of the tools, libraries, frameworks, and ideas I’ve been using and loving in the last few years…
Training & Resources
UNets are a famous architecture for image segmentation. With their hierarchical structure they have a wide receptive field. Similar to multigrid methods, we will use them in this video to solve the Poisson equation in the Equinox deep learning framework…
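For context on what the network is being asked to do, here is the discretized target in plain NumPy (the video itself works in JAX/Equinox; this sketch is ours): a learned solver maps the right-hand side f to a field u whose 5-point Laplacian residual is zero.

import numpy as np

def poisson_residual(u, f, h):
    """Residual of the discretized Poisson equation, lap(u) - f, on the interior."""
    lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
           - 4.0 * u[1:-1, 1:-1]) / h**2
    return lap - f[1:-1, 1:-1]     # zero where u solves the discretized PDE

A UNet trained on (f, u) pairs plays a role similar to a multigrid cycle: it corrects the solution across several spatial resolutions at once.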
Video lectures, Cornell CS 1110 Introduction to Computing using Python
This page contains all of the pre-recorded video lessons posted so far, with the most recent posting listed first. It is, for all intents and purposes, the textbook of this course. You are expected to keep up with these videos. Each lab and Zoom session will indicate the videos that we will have expected you to watch…
A little guide to building Large Language Models in 2024
A little guide through all you need to know to train a good-performance large language model in 2024. This is an introductory talk with links to references for further reading. It is the first video of a two-part series:
- Video 1 (this video): covers all the concepts needed to train a good-performance LLM in 2024
- Video 2 (next video): hands-on, applying all these concepts with code examples
This video is adapted from a talk I gave in 2024 at an AI/ML winter school for graduate students. When I shared the slides online, people kept asking for a recording of the unrecorded class, so I decided to spend a morning recording it to share it more widely along with the slides…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #539 here.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian