Data Science Weekly - Issue 537

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Mar 07, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

Editor's Picks

What's the Greatest Year in Oscar History? A Statistical Analysis
The Oscars once stood as a reflection of popular culture and artistic excellence, and, over time, Hollywood and its biggest night have gone astray. So today, we'll evaluate ninety-five years of Academy history to understand when The Oscars thrived as well as the periods that appealed to various demographics.
To assess the quality of different Oscar years, we'll employ a diverse set of perspectives, highlighting memorable nominees based on the selections of various groups, including:
- Rankings from online databases: the people's choice.
- "Best of" lists from movie critics: the intelligentsia's choice.
- Box office success: the choices of The Invisible Hand….

Building a GPU Cluster at home
incomplete list of things to consider when building a GPU server for home…

Training great LLMs entirely from ground up in the wilderness as a startup
Given that we’ve successfully trained pretty strong multimodal language models at Reka, many people have been particularly curious about the experiences of building infrastructure and training large language & multimodal models from scratch from a completely clean slate…I complain a lot about external (outside Google) infrastructure and code on my social media, leading people to really be curious about what are the things I miss and what I hate/love in the wilderness…So here’s a post (finally). This blogpost sheds light on the challenges and lessons learned. I hope this post will be interesting and/or educational for many…

A Message from this week's Sponsor:

Join 9,000 AI builders at GenAI Productionize 2024

Register for GenAI Productionize 2024 to learn how to build GenAI apps from top GenAI experts from LinkedIn, Google, Coinbase, Roblox, Comcast, and more.

Explore the emerging enterprise GenAI stack, including practical architecture and tool insights for governance, evaluation, and monitoring.

Don't miss this industry-first virtual summit. Free registration!

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Would a FAANG company just use a t-test for A/B Testing? [Reddit]
Currently doing a job application test for a FAANG company that asks me to conduct an A/B test and cruelly says:
There are several equally acceptable stats approaches, so we’re interested to see your approach.
Which is, like -- well, I was going to just do a T-Test, but now I'm terrified to try something that simple.
Any thoughts on what methods a FAANG company would expect on an A/B test like this?…
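For readers wondering what the "simple" option in this thread looks like in practice, here is a minimal sketch of Welch's unequal-variance t-test using only the Python standard library. The function name `welch_t_test` and the sample numbers are our own illustrations, not anything from the Reddit post, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t statistic, Welch-Satterthwaite df, approximate two-sided p-value)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    # Assumption: normal approximation to the t distribution (large samples only)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


# Hypothetical per-user metric values for control and treatment
control = [1, 2, 3, 4, 5]
treatment = [2, 3, 4, 5, 6]
t, df, p = welch_t_test(control, treatment)
print(f"t = {t:.3f}, df = {df:.1f}, p ~ {p:.3f}")
```

With samples this small, an exact t-distribution p-value (for example from a statistics library) would be the safer choice; the sketch is only meant to show how little machinery the basic test needs.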
GGUF, the long way around
We’ve gone on a whirlwind adventure to build up our intuition of how machine learning models work, what artifacts they produce, how the machine learning artifact storage story has changed over the past couple years, and finally ended up in GGUF’s documentation to better understand the log that is presented to us when we perform local inference on artifacts in GGUF. Hope this is helpful, and good luck!…
The State of Competitive Machine Learning - 2023 Edition
We summarise the state of the competitive landscape and analyse the 300+ competitions that took place in 2023. Plus a deep dive analysis of 60+ winning solutions to figure out the best strategies to win at competitive ML…
How do calculators compute sine?
Sine, one of the fundamental trigonometric functions, plays a crucial role in various fields, including mathematics, physics, engineering, and computer science. Its calculation is not trivial, especially when it comes to implementing it in electronic calculators, where efficiency and accuracy are paramount. In previous entries of the series, we looked into how calculators solve equations and how they calculate square roots. In this blog post, we’ll delve into the intricate process of calculating the sine function, starting from simple approximations to more sophisticated methods…
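As a taste of the "simple approximations" end of that spectrum, here is a minimal sketch of a truncated Taylor series with range reduction. It is our own illustration, not necessarily the method the linked post describes or what any calculator firmware actually runs; the function name `sine_taylor` and the default of 10 terms are assumptions made for the example.

```python
import math


def sine_taylor(x, terms=10):
    """Approximate sin(x) with a truncated Taylor series around 0."""
    # Range reduction: fold x into [-pi, pi] so the series converges quickly
    x = math.remainder(x, 2 * math.pi)
    term = x    # first term of the series: x^1 / 1!
    total = x
    for n in range(1, terms):
        # Each term is the previous one times -x^2 / ((2n)(2n + 1))
        term *= -x * x / ((2 * n) * (2 * n + 1))
        total += term
    return total


for angle in (0.5, math.pi / 2, 10.0):
    print(angle, sine_taylor(angle), math.sin(angle))
```

The range reduction step does most of the work here: without it, large inputs would need many more terms and would lose precision to cancellation.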
The paradox of diffusion distillation
In this blog post, let’s take a closer look at the various ways in which the number of sampling steps required to get good results from diffusion models can be reduced. We will focus on various forms of distillation in particular: this is the practice of training a new model (the student) by supervising it with the predictions of another model (the teacher). Various distillation methods for diffusion models have produced extremely compelling results…I intended this to be relatively high-level when I started writing, but since distillation of diffusion models is a bit of a niche subject, I could not avoid explaining certain things in detail, so it turned into a deep dive…
A Large-Scale Investigation of Everyday Moral Dilemmas
Questions of right and wrong are central to daily life, yet how people experience everyday moral dilemmas remains uncertain. We combined state-of-the-art tools in machine learning with survey-based methods in psychology to analyze a massive online English-language repository of everyday moral dilemmas. In 369,161 descriptions (“posts”) and 11M evaluations (“comments”) of moral dilemmas extracted from Reddit’s “Am I the Asshole?” forum (AITA), users described a wide variety of everyday dilemmas, ranging from broken promises to privacy violations…
What python data visualization package are you using in 2024? [Reddit]
I've almost always used seaborn in the past 5 years as a data scientist. Looking to upgrade to something new/better to use!…
Levels of Complexity: RAG Applications
This post is a comprehensive guide to understanding and implementing RAG applications across different levels of complexity. Whether you're a beginner eager to learn the basics or an experienced developer looking to deepen your expertise, you'll find valuable insights and practical knowledge to help you on your journey. Let's embark on this exciting exploration together and unlock the full potential of RAG applications…
Why are there so many ETL tools when we have SQL and Python? [Reddit]
I've been wondering why there are so many ETL tools out there when we already have Python and SQL. What do these tools offer that Python and SQL don't? Would love to hear your thoughts and experiences on this…And yes, as a junior I’m completely open to the idea I’m wrong about this😂…
Automated Statistical Model Discovery with Language Models
Statistical model discovery involves a challenging search over a vast space of models subject to domain-specific modeling constraints. Efficiently searching over this space requires human expertise in modeling and the problem domain. Motivated by the domain knowledge and programming capabilities of large language models (LMs), we introduce a method for language model driven automated statistical model discovery. We cast our automated procedure within the framework of Box's Loop: the LM iterates between proposing statistical models represented as probabilistic programs, acting as a modeler, and critiquing those models, acting as a domain expert. By leveraging LMs, we do not have to define a domain-specific language of models or design a handcrafted search procedure, key restrictions of previous systems…
R2R: Production-ready RAG systems
A framework for rapid development and deployment of production-ready RAG systems…R2R was conceived to bridge the gap between experimental RAG models and robust, production-ready systems. Our semi-opinionated framework cuts through the complexity, offering a straightforward path to deploy, adapt, and maintain RAG pipelines in production. We prioritize simplicity and practicality, aiming to set a new industry benchmark for ease of use and effectiveness…
Recognizing protected and anthropogenic patterns in landscapes using interpretable machine learning and satellite imagery
The accurate and comprehensive mapping of land cover has become a central task in modern environmental research, with increasing emphasis on machine learning approaches. However, a clear technical definition of the land cover class is a prerequisite for learning and applying a machine learning model. One of the challenging classes is naturalness and human influence, yet mapping it is important due to its critical role in biodiversity conservation, habitat assessment, and climate change monitoring. We present an interpretable machine learning approach to map patterns related to territorial protected and anthropogenic areas as proxies of naturalness and human influence using satellite imagery…

Training & Resources

PandasAI - Chat with your Pandas DataFrame
Chat with your data (SQL, CSV, pandas, polars, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG…
Native RAG on Apple Silicon Mac with MLX
Chat with your data natively on Apple Silicon using MLX Framework…
A Survey on Data Selection for Language Models
Knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #536 here.

Whenever you're ready, 2 ways we can help:

Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian

Data Science Weekly Newsletter
