Issue #482
February 16, 2023
Editor's Picks
State of the Art of Visual Analytics for eXplainable Deep Learning
This survey aims to (i) systematically report the contributions of Visual Analytics for eXplainable Deep Learning; (ii) spot gaps and challenges; (iii) serve as an anthology of visual analytics solutions ready to be exploited and put into operation by the Deep Learning community (architects, trainers, and end users); and (iv) demonstrate the degree of maturity, ease of integration, and results for specific domains…
ChatGPT and AI’s Impact on Data Visualization Work
If ChatGPT and other AI tools like it continue to improve, what impact will they have on DataViz work in the future? The ChatGPT interface already has the potential to revolutionise the way we retrieve information from the internet. So how disruptive would AI tools be to the field of data visualisation? Here I would like to speculate on what the future might hold…
Symbolic Discovery of Optimization Algorithms
We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum…
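The abstract's claim about memory efficiency follows from the shape of the update: Lion stores a single momentum buffer and applies only the sign of an interpolated gradient, whereas Adam tracks both first and second moments. A minimal NumPy sketch of the update rule as reported in the paper (hyperparameter defaults here are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

def lion_update(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step. Unlike Adam, only one momentum buffer `m` is stored."""
    # Interpolate momentum and current gradient, then keep only the sign.
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + wd * theta)  # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad          # refresh the momentum buffer
    return theta, m
```

Because the update is a sign vector, every parameter moves by the same magnitude `lr` per step, which is part of why Lion's learning rate is typically chosen smaller than Adam's.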
A Message from this week's Sponsor:
Qdrant open-source vector search engine launches managed cloud platform
Qdrant is a robust vector similarity search engine with advanced filtering support. It is written in Rust, which ensures stability and high performance, as proven by benchmarks.
The managed cloud platform is now fully available for business use, allowing companies of any size to benefit from Qdrant's cutting-edge features without handling its deployment and maintenance.
The Qdrant cloud platform can be accessed through the website.
Data Science Articles & Videos
Understanding Large Language Models -- A Transformative Reading List
In just half a decade, large language models – transformers – have almost completely changed the field of natural language processing… Since transformers have had such a big impact on everyone’s research agenda, I wanted to flesh out a short reading list for machine learning researchers and practitioners getting started. The list below is meant to be read mostly chronologically, and I am focusing entirely on academic research papers…
Creating a data cleaning workflow
Despite our best intentions, many of us (myself included) end up cleaning data in a somewhat haphazard way, which creates more work after the fact: we have to organize our messy work so that others can understand what we did. So how do we create data cleaning workflows that are standardized, reproducible, and produce reliable data? In this second post of the data cleaning workflow series, I share some of my ideas…
One Queue or Two
Suppose you are designing the checkout area for a new store. There is room for two checkout counters and a waiting area for customers. You can make two lines, one for each counter, or one line that serves both counters. In theory, you might expect a single line to be better, but it has some practical drawbacks: in order to maintain a single line, you would have to install rope barriers, and customers might be put off by what seems to be a longer line, even if it moves faster. So you’d like to check whether the single line is really better and by how much. Simulation can help answer this question. The following figure shows the three scenarios I simulated...
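A simulation of the kind the post describes can be surprisingly short. This sketch is my own toy version, not the author's code: it compares the mean wait when customers feed two counters from one shared line versus committing to one of two separate lines at random, with exponential interarrival and service times (all rates here are illustrative assumptions):

```python
import random

def simulate(n=10_000, arrival_rate=1.8, service_rate=1.0,
             single_line=True, seed=0):
    """Mean customer wait at a two-counter checkout (toy sketch)."""
    rng = random.Random(seed)
    t = 0.0
    free = [0.0, 0.0]  # time at which each counter next becomes free
    waits = []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)      # next customer arrives
        service = rng.expovariate(service_rate)
        if single_line:
            i = 0 if free[0] <= free[1] else 1  # head of line takes first free counter
        else:
            i = rng.randrange(2)                # customer commits to a random line
        start = max(t, free[i])
        waits.append(start - t)                 # time spent waiting before service
        free[i] = start + service
    return sum(waits) / len(waits)
```

Running both variants at high utilization shows the single shared line producing a noticeably lower mean wait, matching the queueing-theory intuition that pooling servers avoids the case where one counter idles while the other's line is long.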
Online Rumsey Downloadable Maps Reach 120,000
The David Rumsey Map Collection online database has grown to over 120,000 maps and related images. Below are over 500 highlights of maps added between 2017 and 2023…
Five disturbing impossibility theorems
One would be disturbed by the number of basic and fundamental goals that are now known to be out of reach [in statistics]. And not just out of reach given present techniques or datasets: provably, irretrievably out of reach given any technique and any experimental design. Some examples…
Is Seattle a 15-minute city? It depends on where you want to walk
I made an interactive map showing every single block in Seattle where you can live within walking distance of daily necessities:…
Announcing the launch of the Medical AI Research Center (MedARC)
We are proud to announce the launch of Medical AI Research Center (MedARC), a novel, open, and collaborative approach to research dedicated to advancing the field of AI applied to healthcare…
Ask-Me-Anything: Open & Reproducible Data Science
In the spirit of Love Data Week 2023, I asked people to send in questions they have on Open and Reproducible Data Science (AMA). Here are my answers…
Meta-Task-Generator - Automatically generating meta-tasks
A small program that automatically generates simple meta-reinforcement-learning tasks from a parametrized space. The parametrization is expressive enough to include bandit tasks, the Harlow task, the two-step task, T-mazes, and other meta-tasks…
Google Research, 2022 & beyond: Robotics
An undercurrent this past year has been the exploration of how large, generalist models, like PaLM, can work alongside other approaches to surface capabilities allowing robots to learn from a breadth of human knowledge and allowing people to engage with robots more naturally. As we do this, we’re transforming robot learning into a scalable data problem so that we can scale learning of generalized low-level skills, like manipulation. In this blog post, we’ll review key learnings and themes from our explorations in 2022…
A Catalog of Big Visions for Biology
My goal here is to begin to compile a list of the grand visions or dreams for the future of biology. If we think 200 years into the future, what do we think will be possible? I have grouped them into six categories, but the list will likely be expanded over time as I learn of more. Please send me references, ideas, and criticisms, and I will try to keep the list updated…
Favorite matplotlib configuration setting for beautiful scientific charts? [Twitter Thread]
What is your favorite matplotlib configuration setting for beautiful scientific charts? Links welcome to open source or examples of charts you love.…
Jobs
Data Scientist / Machine Learning Engineer - Epsilon - NYC
Epsilon's Strategy and Insights Data Sciences team is looking for a talented team player for a Data Scientist / Machine Learning Engineer role. You are an expert, mentor, and advocate. You have a strong machine learning and deep learning background and are passionate about transforming data into ML models. You welcome the challenge of data science and are proficient in Python, Spark MLlib, TensorFlow, Keras, ML algorithms, Deep Neural Networks, and Big Data. You must be self-driven, take initiative, and want to work in a dynamic, busy, and innovative group...
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch
In this article, we are going to understand how self-attention works from scratch. This means we will code it ourselves one step at a time. Since its introduction via the original transformer paper (Attention Is All You Need), self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in the field of Natural Language Processing (NLP). Since self-attention is now everywhere, it’s important to understand how it works…
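The core computation the article walks through fits in a few lines. A minimal single-head sketch of scaled dot-product self-attention (the projection matrices `Wq`, `Wk`, `Wv` are assumed given; in a real model they are learned):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention on a sequence X of shape (T, d_in)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys, rows sum to 1
    return weights @ V                              # each output row is a weighted mix of values
```

Each output position is a convex combination of all value vectors, weighted by how well that position's query matches every key, which is the "every token attends to every token" behavior that makes the mechanism so flexible.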
Transformer models: an introduction and catalog
In the past few years, we have seen the meteoric appearance of dozens of models of the Transformer family, all of which have funny, but not self-explanatory, names. The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects of, and innovations in, Transformer models…
Mctx: MCTS-in-JAX
Mctx is a library with a JAX-native implementation of Monte Carlo tree search (MCTS) algorithms such as AlphaZero, MuZero, and Gumbel MuZero. To speed up computation, the implementation fully supports JIT compilation. Search algorithms in Mctx are defined for, and operate on, batches of inputs in parallel. This makes the most of the accelerators and enables the algorithms to work with large learned environment models parameterized by deep neural networks…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #481 here.
Cutting Room Floor
MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library
DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
Notes I’ve taken about model serving - my journey led me to explore ML compilers
Introducing sqlite-vss: A SQLite Extension for Vector Search
Have an awesome week!
All our best,
Hannah & Sebastian
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.