Data Science Weekly - Issue 530

Curated news, articles and jobs related to Data Science, AI, & Machine Learning

Jan 19, 2024

Issue #530
January 18, 2024

Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :) (you get extra links each week!)

And now…let's dive into some interesting links from this week.

Editor's Picks

Friends Don't Let Friends Make Bad Graphs
Friends don't let friends make certain types of data visualization - What are they, and why are they bad…This is an opinionated essay about good and bad practices in data visualization. Examples and explanations are below…

The Hard Truth about Artificial Intelligence in Healthcare: Clinical Effectiveness is Everything, not Flashy Tech
Building machine learning/artificial intelligence medical devices (MAMDs) is much like bringing a new drug to market. First, both must be developed “in the lab.” Then, rigorous testing for efficacy and safety is conducted. Finally, physicians must be convinced to use the product and payers to reimburse it. Along this path, the vast majority of ostensibly promising drugs fail because the bar for commercial success is set high. This bar is no different for MAMDs. Yet, in the public discourse, too much weight is given to the technological sophistication of MAMDs instead of what is needed for successful implementation…
What Was Watched on Netflix in 2023? A Statistical Analysis
An investigation into Netflix viewership activity in 2023…Netflix published viewership statistics from January to June 2023, a massive data dump covering over 18,000 titles with a minimum watch time of 50,000 hours…If you want to know what 247,150,000 households watched on television in 2023, this is the dataset for you. So today, we'll review five major takeaways from half a year of Netflix viewership and what this tells us about streaming's past, present, and future…

A Message from this week's Sponsor:

New Infrastructure to Build Knowledgeable AI

Learn how Pinecone's new serverless vector database helps Notion, Gong, and CS DISCO optimize their AI infrastructure from our VP of R&D, Ram Sriharsha:

Up to 50x lower costs because of the separation of reads, writes, and storage
O(s) fresh results with vector clustering over blob storage
Fast search without sacrificing recall powered by industry-first indexing and retrieval algorithms
Powerful performance with a multi-tenant compute layer
Zero configuration or ongoing management

Read the technical deep dive to understand how it was built and the unique considerations that needed to be made.

* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area. Our framework categorises these studies by their approach to solving the approximate nearest neighbor search (ANNS) problem: hash-based, tree-based, graph-based and quantization-based. Then we present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models and provide new possibilities…
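For readers new to the area, here is a minimal sketch of what a hash-based approach to ANNS can look like, using random-hyperplane locality-sensitive hashing. It is our own illustration, not code from the survey, and the function names (make_hyperplanes, signature, build_index, candidates) are hypothetical.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; each one contributes one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, planes):
    """Hash a vector to a bit tuple: which side of each hyperplane it falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )


def build_index(vectors, planes):
    """Group vector ids by signature so similar directions share a bucket."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, planes), []).append(idx)
    return buckets


def candidates(query, planes, buckets):
    """Return ids in the query's bucket; only these need an exact comparison."""
    return buckets.get(signature(query, planes), [])
```

Vectors pointing in similar directions tend to land in the same bucket, so a query only needs exact distance checks against its bucket rather than the whole collection. Practical systems typically use several hash tables to recover neighbors that a single table misses.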
Chess Transformers - Teaching transformers to play chess
Chess Transformers is a library for training transformer models to play chess by learning from human games…
Machine Learning for Big Code and Naturalness
Research on machine learning for source code…Search across all paper titles, abstracts, authors by using the search field. Please consider contributing by updating the information of existing papers or adding new work…
What do you think about Yann LeCun's controversial opinions about ML? [Reddit]
Yann LeCun has some controversial opinions about ML, and he's not shy about sharing them. He wrote a position paper called "A Path towards Autonomous Machine Intelligence" a while ago. Since then, he also gave a bunch of talks about this…
Six not-so-basic base R functions
R is known for its versatility and extensive collection of packages. As of the publishing of this post, there are over 23 thousand packages on R-universe. But what if I told you that you could do some pretty amazing things without loading any packages at all?…There’s a lot of love for base R, and I am excited to pile on. In this blog post, we will explore a few of my favorite “not-so-basic” (i.e., maybe new to you!) base R functions. Click ‘Run code’ in order to see them in action, made possible by webR and the quarto-webr extension!…
The Perfect Way to Smooth Your Noisy Data
Insanely fast and reliable smoothing and interpolation with the Whittaker-Eilers method…Real-world data is never clean. Whether you’re carrying out a survey, measuring rainfall or receiving GPS signals from space, noisy data is ever present. Dealing with such data is the main part of a data scientist’s job. It’s not all glamorous machine learning models and AI; it’s cleaning data in an attempt to extract as much meaningful information as possible. If you’re currently looking at a graph that has way too many squiggles to be useful, well, I have the solution you’re looking for…
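As a rough idea of what the Whittaker-Eilers method does, here is a minimal dense NumPy sketch of the textbook formulation: find z minimizing |y - z|² + λ|Dz|², where D takes finite differences, which reduces to solving (I + λDᵀD)z = y. This is our own illustration rather than the article's code; the function name whittaker_smooth and the default values are assumptions, and it skips both the sparse solvers that make the method fast and the interpolation of missing points.

```python
import numpy as np


def whittaker_smooth(y, lam=100.0, order=2):
    """Smooth evenly spaced samples y with a Whittaker-Eilers penalty.

    lam trades fidelity for smoothness; order is the difference order
    being penalized (2 penalizes curvature).
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # D has shape (n - order, n) and satisfies D @ y == np.diff(y, n=order).
    D = np.diff(np.eye(n), n=order, axis=0)
    # Normal equations of the penalized least-squares problem.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

Larger lam values give smoother curves. With order=2, a perfectly straight line passes through unchanged because its second differences are already zero.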
Causal and trustworthy machine learning: methods and applications
This work focuses on the intersection of machine learning and causal inference and the way in which the two fields can enhance each other by sharing ideas: utilizing machine learning techniques for the computation of causal quantities, the use of ideas from causal inference for invariant predictions under unseen treatment regimes, and the exploration of topics in trustworthy machine learning, including interpretability and fairness, with a causal lens. In each one of the presented works, we grappled with the strength of assumptions needed to utilize causal inference techniques and relax portions of them when possible…
AI for Economists: Prompts & Resources
This page contains example prompts and responses intended to showcase how generative AI, namely LLMs like GPT-4, can benefit economists. Example prompts are shown from six domains: ideation and feedback; writing; background research; coding; data analysis; and mathematical derivations…
ARIMA vs Prophet vs LSTM for Time Series Prediction
In this post, we will discuss three popular approaches to learning from time-series data:
1) The classic ARIMA framework for time series prediction
2) Facebook’s in-house model Prophet, which is specifically designed for learning from business time series
3) The LSTM model, a powerful recurrent neural network approach that has been used to achieve the best-known results for many problems on sequential data…
Iterative ‘mapping’ in R
In my consulting work, I’m commonly asked to build out maps, charts, or reports for a large number of cities or regions at once. The goal here is often to allow for rapid exploration / iteration, so a basic map template might be fine. Doing this for a few cities one-by-one isn’t a problem, but it quickly gets tedious when you have dozens, if not hundreds, of visuals to produce – and keeping all the results organized can be a pain…
🩺 Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
We propose a framework that decodes specific information from a representation within an LLM by “patching” it into the inference pass on a different prompt that has been designed to encourage the extraction of that information. A "Patchscope" is a configuration of our framework that can be viewed as an inspection tool geared towards a particular objective. For example, this figure shows a simple Patchscope for decoding what is encoded in the representation of "CEO" in the source prompt (left). We patch a target prompt (right) comprised of few-shot demonstrations of token repetitions, which encourages decoding the token identity given a hidden representation…
tinytable
tinytable is a small but powerful R package to draw HTML, LaTeX, PDF, Markdown, and Typst tables. The user interface is minimalist, but it gives users access to powerful frameworks to create endlessly customizable tables…

Training & Resources

RLHF learning resources in 2024 - A list for beginners and wannabe experts and everyone in between
I’ve put a lot of effort into sharing information on Reinforcement Learning from Human Feedback (RLHF). I figured I would categorize it all in one place for people who come to me or Interconnects to learn about the topic…
einx - Tensor Operations in Einstein-Inspired Notation
einx is a Python library that allows formulating many tensor operations as concise expressions using Einstein notation. It is inspired by einops, but follows a novel and unique design:
- Fully composable and powerful Einstein expressions with []-notation.
- Support for many tensor operations (einx.{sum|max|where|add|dot|flip|get_at|...}) with NumPy-like naming.
- Easy integration and mixing with existing code. Supports tensor frameworks NumPy, PyTorch, TensorFlow and JAX.
- Just-in-time compilation of all operations into regular Python functions using Python's exec().
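If compiling into regular Python functions with exec() sounds abstract, here is a minimal standard-library sketch of the general pattern: generate source code for a specialized function, exec() it once, and cache the result. This illustrates the technique only and is not einx's actual implementation; compile_scale_and_shift is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def compile_scale_and_shift(scale, shift):
    """Hypothetical helper: build a function specialized to fixed constants."""
    # The constants are baked into the generated source as literals.
    source = (
        "def specialized(values):\n"
        f"    return [v * {scale!r} + {shift!r} for v in values]\n"
    )
    namespace = {}
    exec(source, namespace)  # defines `specialized` inside namespace
    return namespace["specialized"]


double_plus_one = compile_scale_and_shift(2, 1)
print(double_plus_one([1, 2, 3]))  # [3, 5, 7]
```

Because the compiled function is cached per argument combination, the cost of generating and executing the source is paid only on the first call; later calls run an ordinary Python function.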
Rembg - A tool to remove image backgrounds
Rembg is a tool to remove the background from images…

Last Week's Newsletter's 3 Most Clicked Links

* Based on unique clicks.
** Find last week's issue #529 here.

Thank you for joining us this week! :)

All our best,
Hannah & Sebastian

P.S. A new thing for paid subscribers => Even more links are below!
