Data Science Weekly - Issue 536
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #536
February 29, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Editor's Picks
Intro to DSPy: Goodbye Prompting, Hello Programming!
How the DSPy framework solves the fragility problem in LLM-based applications by replacing prompting with programming and compiling…
Beyond Transformers: Structured State Space Sequence Models
A new paradigm is rapidly evolving within the realm of sequence modeling that presents a marked advancement over Transformer architectures. This new approach demonstrates superior capability compared to Transformers in accurately modeling extensive sequence lengths and contextual dependencies. It exhibits an order of magnitude improvement in computational efficiency during training and inference…
Modern Dimensionality Reduction [Reddit]
I’m familiar with the more classical techniques of dimensionality reduction like SVD, PCA, and factor analysis. But are there any modern techniques or maybe some tricks that people have learned over the years that they would like to share. For context, this would be for tabular data. Thanks!…
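As a shared reference point for the classical baseline the poster mentions, here is a minimal numpy sketch of PCA via SVD on synthetic tabular data (the data and variable names are illustrative; modern alternatives like UMAP or t-SNE live in third-party packages):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy tabular data: 100 rows, 10 features.
X = rng.normal(size=(100, 10))

# Classical PCA via SVD: center, decompose, project onto top-k components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T  # (100, 2) projection onto the top-2 components

# Fraction of variance captured by each component.
explained = (S**2) / (S**2).sum()
```

This is exactly the SVD/PCA connection: the projection `Xc @ Vt[:k].T` equals `U[:, :k] * S[:k]`, and the squared singular values give the explained-variance ratios.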
A Message from this week's Sponsor:
Learn from GenAI experts at GenAI Productionize 2024
Learn from top GenAI experts at GenAI Productionize 2024 – the one and only event on productionizing enterprise GenAI!
See how Coinbase, LinkedIn, Comcast, Procter & Gamble, Roblox, Databricks, JPMorgan Chase, Fidelity, and more get their GenAI apps into production, including practical strategies for governance, evaluation, and monitoring.
Register for GenAI Productionize 2024 to learn:
- How organizations have successfully adopted enterprise GenAI
- Practical hands-on architecture and tool insights from leading GenAI builders
- The emerging enterprise GenAI stack, from orchestration to evaluation, inference, retrieval and more
Free Registration!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Mamba No. 5 (A Little Bit Of...)
In this post, I attempt to provide a walkthrough of the essence of the Mamba state space model architecture, occasionally sacrificing some rigor for intuition and overall pedagogical friendliness. I don’t assume readers have any familiarity with state space models, but I do assume some familiarity with machine learning and mathematical notation…
ggplot2 101 - Getting started with data visualization
This is material adapted from my class on Data Visualization, taught in the Introduction to R course for my postgraduate program. The aim was not to present everything about ggplot2, just the most important parts that can be used in scientific articles. As this may be useful to more people, I decided to make it available to everyone here. Feel free to share it with others!…
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective…
Shape Suffixes — Good Coding Style
Variable names should be concise and informative. For a tensor, nothing is more informative than how many dimensions it has, and what those dimensions represent…
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. STORM models the pre-writing stage by (1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline…
How fast can we process a CSV file
If you care about performance, you may want to avoid CSV files. But since our data sources, like our family, are often not something we get to choose, in this blog post we'll see how to process a CSV file as fast as possible…Since I've been in the pandas core development team for more than 5 years, and a pandas user for much longer than that, I'm very biased toward thinking that the de facto standard for many people to process a CSV file is something like this…While I think this is still a reasonable choice today, in this blog post I will show many other options, with a particular focus on speed…
Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers
Despite their remarkable effectiveness and broad application, the drivers of success underlying ensembles of trees are still not fully understood. In this paper, we highlight how interpreting tree ensembles as adaptive and self-regularizing smoothers can provide new intuition and deeper insight to this topic. We use this perspective to show that, when studied as smoothers, randomized tree ensembles not only make predictions that are quantifiably more smooth than the predictions of the individual trees they consist of, but also further regulate their smoothness at test-time based on the dissimilarity between testing and training inputs…
Favorite SQL patterns? [Reddit]
What are the SQL patterns you use on a regular basis and why?…
Levels of Complexity: RAG Applications
This post is a comprehensive guide to understanding and implementing RAG applications across different levels of complexity. Whether you're a beginner eager to learn the basics or an experienced developer looking to deepen your expertise, you'll find valuable insights and practical knowledge to help you on your journey. Let's embark on this exciting exploration together and unlock the full potential of RAG applications…
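At the simplest rung of that complexity ladder, RAG is just retrieve-then-stuff. A minimal stdlib-only sketch, where the bag-of-words retriever and the `llm` stub are illustrative placeholders rather than anything from the post:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query; keep the top k.
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer(query: str, docs: list[str], llm) -> str:
    # Stuff the retrieved context into the prompt -- the core RAG move.
    context = "\n".join(retrieve(query, docs))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

docs = ["Polars is a fast dataframe library.", "Mamba is a state space model."]
# Stub "model" that just echoes the retrieved context line.
print(answer("what is polars", docs, llm=lambda p: p.splitlines()[1]))
```

A real application would swap the retriever for embeddings plus a vector store and the stub for an actual model call; the shape of the pipeline stays the same.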
MOSAIC - Scalable, interactive data visualization
Mosaic is a framework for linking data visualizations, tables, input widgets, and other data-driven components while leveraging a database for scalable processing. With Mosaic, you can interactively visualize and explore millions and even billions of data points. A key idea is that interface components – Mosaic clients – publish their data needs as queries managed by a central coordinator. The coordinator may further optimize queries before issuing them to a backing data source such as DuckDB…
Polars for Data Science
The goal of the project is to reduce dependencies, improve code organization, simplify data pipelines, and overall facilitate analysis of various kinds of tabular data that a data scientist may encounter. It is a package built around your favorite Polars dataframe…
Mamba: The Easy Way
Today, basically any language model you can name is a Transformer model. OpenAI’s ChatGPT, Google’s Gemini, and GitHub’s Copilot are all powered by Transformers, to name a few. However, Transformers suffer from a fundamental flaw: they are powered by Attention, which scales quadratically with sequence length…Many models have attempted to solve this problem, but few have done as well as Mamba. Published two months ago by Albert Gu and Tri Dao, Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length…The prospect of an accurate linear-time language model has gotten many excited about the future of language model architectures…In this blog post, I’ll try to explain how Mamba works in a way that should be fairly straightforward, especially if you’ve studied a little computer science before. Let’s get started!…
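To make the quadratic Attention bottleneck concrete, here is plain softmax attention in numpy: the score matrix alone is n-by-n, so time and memory grow quadratically with sequence length, whereas a Mamba-style state space scan carries fixed-size state per step. A sketch with toy shapes, not Mamba itself:

```python
import numpy as np

def attention(Q, K, V):
    """Plain softmax attention. The scores matrix is (n, n),
    which is the quadratic bottleneck in sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                      # (n, d) output

n, d = 8, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = attention(Q, K, V)
```

Doubling n quadruples the size of `scores`; that is exactly the cost linear-time architectures like Mamba avoid.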
Educational Sponsor
R in 3 Months is here to help you finally learn R
R in 3 Months is the program to help you achieve your goal of **finally** learning R.
R in 3 Months is more than just a course: it combines high-quality course material with personalized feedback and a supportive community.
Over 300 people from around the world have participated in R in 3 Months. Now it's your turn. The next cohort starts March 14 and you can get $50 off if you sign up by March 1 and use coupon code DSWEEKLY.
Training & Resources
DSPy is a super exciting new framework for developing LLM programs! Pioneered by frameworks such as LangChain and LlamaIndex, chaining together LLM calls lets us build much more powerful systems! This means that the output of one call to an LLM is the input to the next, and so on. We can think of chains as programs, with each LLM call analogous to a function that takes text as input and produces text as output. DSPy offers a new programming model, inspired by PyTorch, that gives you a massive amount of control over these LLM programs…
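The chaining idea is easy to sketch without any framework at all. Everything below, including `fake_llm`, is a hypothetical stand-in, not DSPy's API:

```python
# A chain is just function composition over text. A real program
# would replace fake_llm with an actual model call.
def fake_llm(prompt: str) -> str:
    return prompt.upper()  # placeholder "model"

def summarize(text: str) -> str:
    return fake_llm(f"Summarize: {text}")

def translate(text: str) -> str:
    return fake_llm(f"Translate: {text}")

def chain(text, *steps):
    # The output of one call is the input to the next.
    for step in steps:
        text = step(text)
    return text

result = chain("hello world", summarize, translate)
```

Frameworks like DSPy add the interesting parts on top of this skeleton: declared signatures for each step and compilation that optimizes the prompts automatically.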
Understanding Deep Learning Book (free, fully available via PDF)
The title of this book is “Understanding Deep Learning” to distinguish it from volumes that cover coding and other practical aspects. This text is primarily about the ideas that underlie deep learning. The first part of the book introduces deep learning models and discusses how to train them, measure their performance, and improve this performance. The next part considers architectures that are specialized to images, text, and graph data. These chapters require only introductory linear algebra, calculus, and probability and should be accessible to any second-year undergraduate in a quantitative discipline. Subsequent parts of the book tackle generative models and reinforcement learning. These chapters require more knowledge of probability and calculus and target more advanced students…
Causal ML Book (free, fully available via PDF)
An introduction to the emerging fusion of machine learning and causal inference. The book introduces ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclic graphs (DAGs) and structural causal models (SCMs), and presents Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools…
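The partialling-out idea behind Double/Debiased ML can be sketched in its simplest form: residualize both outcome and treatment on the confounders, then regress residual on residual. This is the linear special case (OLS in place of flexible ML learners, no cross-fitting), on simulated data with a true effect of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                                  # confounders
T = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)      # treatment depends on X
Y = 2.0 * T + X @ np.array([1.0, 0.3, -0.5]) + rng.normal(size=n)  # true effect = 2.0

def residualize(target, X):
    # OLS residuals of target on X (with intercept).
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ beta

# Partial out X from both Y and T, then regress residual on residual.
yr = residualize(Y, X)
tr = residualize(T, X)
effect = (tr @ yr) / (tr @ tr)  # should be close to 2.0
```

The book's Double ML recipe keeps this structure but replaces the two OLS steps with arbitrary ML predictors and adds cross-fitting to avoid overfitting bias.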
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #535 here.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian
PS. If you like what you read, consider becoming a paid member here: https://datascienceweekly.substack.com/subscribe :)