Data Science Weekly - Issue 539
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #539
March 21, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration
Data storytelling is powerful for communicating data insights, but it requires diverse skills and considerable effort from human creators. Recent research has widely explored the potential for artificial intelligence (AI) to support and augment humans in data storytelling… This paper investigated existing tools with a framework from two perspectives: the stages in the storytelling workflow where a tool serves, including analysis, planning, implementation, and communication, and the roles of humans and AI in each stage, such as creators, assistants, optimizers, and reviewers…
Python Rgonomics
Switching languages is about switching mindsets - not just syntax. New developments in Python data science tooling, like polars and seaborn’s object interface, can capture the ‘feel’ that converts from R/tidyverse love while opening the door to truly pythonic workflows…
Unraveling the History of Technological Skepticism
Technological advancements have always been met with a mix of skepticism and fear. From the telephone disrupting face-to-face communication to calculators diminishing mental arithmetic skills, each new technology has faced resistance. Even the written word was once believed to weaken human memory…
A Message from this week's Sponsor:
Magical tools for working with data
Building a Big Picture Data Team at StubHub
See how Meghana Reddy, Head of Data at StubHub, built a data team that delivers business insights accurately and quickly with the help of Snowflake and Hex.
The challenges she faced may sound familiar:
Unclear SMEs meant questions went to multiple people
Without SLAs, answer times were too long
Lack of data modeling & source-of-truth metrics generated varying results
Lack of discoverability & reproducibility cost time, efficiency and accuracy
Static reporting reserved interactivity for rare occasions
Register now to hear how Meghana and the StubHub data team tackled these challenges with Snowflake and Hex. And watch Meghana demo StubHub’s data apps that increase quality and speed to insights.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Introducing RewardBench: The First Benchmark for Reward Models (of the LLM Variety)
Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present REWARDBENCH, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models…
How do I future proof my career as a Data Engineer? [Reddit]
AI at this point is inevitable and it’s become quite clear to me that the roles and responsibilities of a data engineer today will significantly change as AI tools become more commonplace. At this point it’s all speculative but my questions are A) what does the data engineer of tomorrow look like and B) how can I adapt to a changing landscape and essentially future proof my career…
SQL — order of query execution
Important to know before optimizing query performance…To maximize your query’s speed on any SQL engine, it’s essential to have an understanding of the SQL execution order. Even though you can work without this knowledge, I recommend reading this article to gain a quick understanding of it…
The power of predicate pushdown
In our last article we provided an overview of the inner workings of Polars. In this blog we will dive into the query optimizer and explain how one of the most important optimization rules works: predicate pushdown. The idea is simple yet effective: apply any filters you have as close to the data source as possible. This avoids unnecessary computation for data that will be thrown away and reduces the amount of data being read from disk or sent over the network. Note that predicate pushdown (and all other optimizations) happens under the hood automatically…
Fundamental Components of Deep Learning: A category-theoretic approach
Deep learning, despite its remarkable achievements, is still a young field. Like the early stages of many scientific disciplines, it is marked by the discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform and compositional mathematical foundation. From the intricacies of the implementation of backpropagation, through a growing zoo of neural network architectures, to the new and poorly understood phenomena such as double descent, scaling laws or in-context learning, there are few unifying principles in deep learning. This thesis develops a novel mathematical foundation for deep learning based on the language of category theory. We develop a new framework that is a) end-to-end, b) uniform, and c) not merely descriptive, but prescriptive, meaning it is amenable to direct implementation in programming languages with sufficient features…
Code Reviews - All about code reviews from the author to the reviewer
Code reviews are one of the activities you will spend the most time doing in your software engineering career. Unfortunately, it is not something software engineers learn in schools or are specifically mentored about. Below is a detailed guide on how to conduct a code review from both the perspective of the author and the reviewer…
The Ink Splotch Effect: A Case Study on ChatGPT as a Co-Creative Game Designer
This paper studies how large language models (LLMs) can act as effective, high-level creative collaborators and “muses” for game design. We model the design of this study after the exercises artists use by looking at amorphous ink splotches for creative inspiration. Our goal is to determine whether AI-assistance can improve, hinder, or provide an alternative quality to games when compared to the creative intents implemented by human designers…Three prototype games are designed across 3 different genres: (1) a minimalist base game, (2) a game with features and game feel elements added by a human game designer, and (3) a game with features and feel elements directly implemented from prompted outputs of the LLM, ChatGPT…
Basketball by the numbers: Insights from Paradime's NBA Data Modeling Challenge
I recently wrapped the "NBA Data Modeling Challenge" where 100+ participants used SQL + dbt to generate insights from... you guessed it... historical NBA Data. In this blog, I highlight some of my favorite, participant-generated insights!…
Ben Schmidt of Nomic talks WebGL and WebGPU for AI embedding visualization
Paul was joined by Ben Schmidt of Nomic to talk WebGL, WebGPU, and the challenges of visualizing large-scale AI embeddings in the browser…
Open-Sora: Democratizing Efficient Video Production for All
We present Open-Sora, an initiative dedicated to efficiently producing high-quality video and making the model, tools and contents accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video production. With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation…
BloombergGPT: A Large Language Model for Finance
In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks…
Applied Machine Learning for Tabular Data
Welcome! This is a work in progress. We want to create a practical guide to developing quality predictive models from tabular data…The book takes a holistic view of the predictive modeling process and focuses on a few areas that are usually left out of similar works. For example, the effectiveness of the model can be driven by how the predictors are represented. Because of this, we tightly couple feature engineering methods with machine learning models. Also, quite a lot of work happens after we have determined our best model and created the final fit. These post-modeling activities are an important part of the model development process and will be described in detail…
Training & Resources
Here, we'll cover the derivations from scratch to provide a rigorous understanding of the core ideas behind diffusion. What assumptions are we making? What properties arise as a result? A reference [codebase] is written from scratch, which provides minimalist re-production of the MNIST example below. It clocks in at under 500 lines of code. Each page takes up to an hour to read thoroughly. Approximately a lecture each…
Scalax: scaling utilities for JAX (or scale and relax)
Scalax is a collection of utilities for helping developers to easily scale up JAX-based machine learning models. The main idea of scalax is pretty simple: users write model and training code for a single GPU/TPU, and rely on scalax to automatically scale it up to hundreds of GPUs/TPUs. This is made possible by the JAX jit compiler, and scalax provides a set of utilities to help the users obtain the sharding annotations required by the jit compiler. Because scalax wraps around the jit compiler, existing JAX code can be easily scaled up using scalax with minimal changes…
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models
Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that “it's all been solved.” Not surprisingly, this has, in turn, made many NLP researchers -- especially those at the beginning of their careers -- worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs…
Last Week's Newsletter's 3 Most Clicked Links
LLM Evaluation Metrics: Everything You Need for LLM Evaluation
How are simple token predictors able to do so much stuff these days?
What I learned from looking at 900 most popular open source AI tools
* Based on unique clicks.
** Find last week's issue #538 here.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian