Data Science Weekly - Issue 580
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #580
January 03, 2025
Happy New Year!
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
6 Reasons to Be (Cautiously) Optimistic About Movies in 2025: A Statistical Analysis
It's unlikely that we'll ever replicate "peak cinema" of the 1990s, but that doesn't mean we live in a "dying culture" or an "artistic hellscape wrought by late-stage capitalism." So today, we'll quantify trends that movie enthusiasts might consider "bright spots" or "not that bad." Enjoy a brief respite from doomsday prophecies of cinema's impending extinction…
Bare Necessities of Data Management
There are so many data management practices that can help you better organize your project, yet a team’s ability to “do it all” is really limited by factors such as funding, timing, team size, and expertise. Therefore, it is important for teams to consider what practices are feasible as well as which ones will give them the largest return on investment…I think there is a list of core practices that should be implemented early on, before data collection begins, in order for your project to be successful. This blog post will review those practices…
Things we learned about LLMs in 2024
A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments…
Sponsor Message
Quadratic - analyze anything, host anywhere
With Quadratic, combine the spreadsheets your organization asks for with the code that matches your team’s code-driven workflows.
Powered by code, you can build anything in Quadratic spreadsheets with Python, JavaScript, or SQL, all approachable with the power of AI.
Use the data tool that actually aligns with how your team works with data, from ad-hoc to end-to-end analytics, all in a familiar spreadsheet.
Level up your team’s analytics with Quadratic today.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Best Data Visualization Projects of 2024
Many datasets were analyzed and many charts were made this year. If I liked a project, it was on FlowingData. But only a handful can be the best. These are my favorite data visualization projects from 2024…
A Gentle Introduction to Using a Vector Database
In which we learn how to build a simple vector database using Pinecone and OpenAI embeddings, and discover it was way easier than we might have expected…
Quick recap on the state of reasoning
This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question on if language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It’s somewhat rambly, but asks the high level questions on reasoning that I haven’t written about yet and is a good summary of my coverage on o1’s implementation and the RFT API…
Is it too late for me as 32 years old female with completely zero background jump into data engineering? [Reddit Discussion]
I’ve enrolled in a Python & AI Fundamentals course, even though I have no background in IT. My only experience has been in customer service, and I have a significant gap in my employment history. I’m feeling uncertain about this decision, but I know that starting somewhere is the only way to find out if this path is right for me…Anyone can share their experience or any advice? Please helpp, really appreciate it!…
Three-Sided Testing to Establish Practical Significance: A Tutorial
Researchers may want to know whether an observed statistical relationship is either meaningfully negative, meaningfully positive, or small enough to be considered practically equivalent to zero. Such a question can not be addressed with standard null hypothesis significance testing, nor with standard equivalence testing. Three-sided testing (TST) is a procedure to address such questions, by simultaneously testing whether an estimated relationship is significantly below, within, or above predetermined smallest effect sizes of interest…In this paper, we give a non-technical introduction to TST, provide commands for conducting TST in both R and Jamovi, and provide a Shiny app for easy implementation…
Multiple Regression with StatsModels - Look at predictions, not parameters
In this chapter we’ll get farther into regression, including multiple regression and one of my all-time favorite tools, logistic regression. These tools will allow us to explore relationships among sets of variables. As an example, we will use data from the General Social Survey (GSS) to explore the relationship between education, sex, age, and income…
Let's Build a Simple Database
Writing a sqlite clone from scratch in C…
Linear Algebra: Essence & Form
I am about to teach a new applied linear algebra course at @PennEngineers, meant for datasci/ML/AI. to support the class, i've written a book…
Every month I send my team at Google a few paper recommendations. For this end-of-year blog post, I went through all my monthly emails, picked my favorite articles, and I grouped them into categories. In each category I kept them ordered by publication date, so you may get a sense of progress in each of them…
Gradient boosting machines, a tutorial
This article gives a tutorial introduction into the methodology of gradient boosting methods with a strong focus on machine learning aspects of modeling. A theoretical information is complemented with descriptive examples and illustrations which cover all the stages of the gradient boosting model design. Considerations on handling the model complexity are discussed. Three practical examples of gradient boosting applications are presented and comprehensively analyzed.
Multi-modal Catalog Attribute Extraction Platform at Instacart
Attribute creation is a process of sourcing attribute information for products in Instacart’s catalog; for example, sourcing the attribute “sheet_count” for all our napkins products. It’s a critical foundation for a broad range of features that enhance the customer shopping experience across our platforms…To effectively support the attribute drive user experiences at Instacart, our attribute creation system needs to meet two core objectives: (1) generate highly accurate attribute data, and (2) scale to accommodate millions of products with thousands of diverse attributes…
Imputation Use Cases [Reddit Discussion]
I’m wondering how and why people use this technique. I learned about it early on in my career and have avoided it entirely after trying it a few times. If people could provide examples of how they’ve used this in a real life situation it would be very helpful…I personally think it’s highly problematic in nearly every situation for a variety of reasons. The most important reason for me is that nulls are often very meaningful. Also I think it introduces unnecessary bias into the data itself. So why and when do people use this?…
Stream processing, LSMs and leaky abstractions with Chris Riccomini
In this episode, we chat with Chris Riccomini about the evolution of stream processing and the challenges in building applications on streaming systems. We also chat about leaky abstractions, good and bad API designs, what Chris loves and hates about Rust and finally about his exciting new project that involves object storage and LSMs…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #579 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~65,230 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian