Data Science Weekly - Issue 532
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
February 01, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you find what you read meaningful, consider subscribing to support more writing. The membership program funds the free newsletter: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Editor's Picks
Forecast Evaluation for Data Scientists: Common Pitfalls and Best Practices
The field of forecasting has mainly been fostered by statisticians and econometricians; consequently, the related concepts are not mainstream knowledge among general ML practitioners. The different forms of non-stationarity associated with time series challenge the capabilities of data-driven ML models. Nevertheless, recent trends in the domain have demonstrated that, with massive amounts of time series available, ML and DL techniques are quite competent in time series forecasting when the related pitfalls are properly handled. Therefore, in this work we provide a tutorial-like compilation of the details of one of the most important steps in the overall forecasting process, namely the evaluation. This way, we intend to impart the information associated with forecast evaluation to fit the context of ML, as a means of bridging the knowledge gap between traditional methods of forecasting and current state-of-the-art ML techniques…
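Two habits the forecasting literature keeps recommending are easy to sketch in code: split the series chronologically (never shuffle), and judge errors with a scale-free measure such as MASE against a naive baseline. The snippet below is an illustrative sketch of that workflow, not code from the paper:

import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    # Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    # MAE of the seasonal-naive forecast with period m.
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_test - y_pred)) / naive_mae

y = np.sin(np.arange(120) / 6) + np.random.default_rng(0).normal(0, 0.1, 120)
y_train, y_test = y[:100], y[100:]             # chronological split, never shuffled

y_naive = np.repeat(y_train[-1], len(y_test))  # last-value baseline forecast
print("MASE:", mase(y_train, y_test, y_naive)) # values < 1 beat the naive method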
What is the dumbest thing you have seen in data science?
The dumbest thing I have ever seen in data science: someone built an elaborate Tableau dashboard that took months to create, with tons of calculated fields and crazy logic, and the director it was built for then asked the data scientist on the project to create a Python script that would take pictures of the charts in the dashboard and send them out weekly in an email. This was all automated…What is the dumbest thing you have seen?…
What distinguishes production-grade data pipelines from amateur setups?
What do amateurs usually not do well?…
A Message from this week's Sponsor:
New Infrastructure to Build Knowledgeable AI
Learn how Pinecone's new serverless vector database helps Notion, Gong, and CS DISCO optimize their AI infrastructure from our VP of R&D, Ram Sriharsha:
Up to 50x lower costs because of the separation of reads, writes, and storage
O(s) fresh results with vector clustering over blob storage
Fast search without sacrificing recall powered by industry-first indexing and retrieval algorithms
Powerful performance with a multi-tenant compute layer
Zero configuration or ongoing management
Read the technical deep dive to understand how it was built and the unique design considerations behind it.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
MambaByte: Token-free Selective State Space Model
Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments indicate the computational efficiency of MambaByte compared to other byte-level models. We also find MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers. Furthermore, owing to linear scaling in length, MambaByte benefits from fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling…
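A quick way to see the tradeoff the abstract describes: encoding text as raw UTF-8 bytes removes the tokenizer (the vocabulary is always 256) but stretches the sequence. This toy snippet is illustrative only, not the MambaByte code:

text = "Token-free models read raw bytes."

byte_ids = list(text.encode("utf-8"))   # one id per byte, vocabulary of 256
print(len(text.split()), "words ->", len(byte_ids), "byte-level tokens")
# A subword tokenizer would emit far fewer ids, drawn from a vocabulary of
# tens of thousands -- the bias that byte-level training avoids.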
Searle's Chinese Room: Slow Motion Intelligence
Imagine if the only books ever written were children's books. People would think books in general were a joke. I think the situation with computers and algorithms today is similar: people don't understand the ridiculous potential power of an algorithm because they only have experience with the "children's algorithms" that are running on their PC today. Take John Searle's famous Chinese room thought experiment, which goes like this…
“Keeping the polynomial monster under control”
In the previous post we saw that the Bernstein polynomials can be used to fit a high-degree polynomial curve with ease, without its shape going out of control. In this post we’ll look at the Bernstein polynomials in more depth, both experimentally and theoretically. First, we will explore the Bernstein polynomials…
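For readers who want to poke at this themselves, here is a minimal sketch (assumed, not the author's code) of fitting a noisy curve by least squares in the Bernstein basis B_{k,n}(x) = C(n,k) x^k (1-x)^(n-k), which stays well-behaved even at high degree:

import numpy as np
from math import comb

def bernstein_design(x, n):
    # Design matrix of the Bernstein basis polynomials B_{k,n}(x) on [0, 1].
    return np.stack([comb(n, k) * x**k * (1 - x)**(n - k)
                     for k in range(n + 1)], axis=1)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = np.sin(8 * x) + rng.normal(0, 0.05, x.size)

B = bernstein_design(x, n=20)                 # degree 20, still tame
coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares fit
print("max abs residual:", np.max(np.abs(y - B @ coef)))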
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
There are two common ways in which developers incorporate proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and fine-tuning. RAG augments the prompt with the external data, while fine-tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4…
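The RAG half of that comparison fits in a few lines. This is a hypothetical sketch, with `embed` standing in for any embedding model (none of it is from the paper): retrieve the most similar documents and prepend them to the prompt; fine-tuning would instead bake the knowledge into the model's weights and leave the prompt unchanged.

import numpy as np

def retrieve(question, docs, embed, k=2):
    # Rank documents by cosine similarity to the question embedding.
    q = embed(question)
    sims = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)))
            for d in docs]
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_prompt(question, docs, embed):
    context = "\n".join(retrieve(question, docs, embed))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"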
Getting Started With CUDA for Python Programmers
I used to find writing CUDA code rather terrifying. But then I discovered a couple of tricks that actually make it quite accessible. In this video I introduce CUDA in a way that will be accessible to Python folks, & I even show how to do it all for free in Colab!…
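One accessible route from Python (an assumption here; the video has its own walkthrough) is Numba's CUDA support, which compiles a decorated Python function into a GPU kernel and runs on a free Colab GPU:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:          # guard threads past the end of the array
        out[i] = a[i] + b[i]

a = np.arange(1_000_000, dtype=np.float32)
b = 2 * a
out = np.zeros_like(a)

threads = 256
blocks = (a.size + threads - 1) // threads  # ceil-divide so every element is covered
add_kernel[blocks, threads](a, b, out)      # Numba copies the arrays to and from the GPU
assert np.allclose(out, a + b)              # requires an NVIDIA GPU to run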
Mixed-input matrix multiplication performance optimizations
In this blog, we focus on mapping mixed-input matrix multiplication onto the NVIDIA Ampere architecture. We present software techniques addressing data type conversion and layout conformance to map mixed-input matrix multiplication efficiently onto hardware-supported data types and layouts. Our results show that the overhead of additional work in software is minimal and enables performance close to the peak hardware capabilities. The software techniques described here are released in the open-source NVIDIA/CUTLASS repository…
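For intuition about what "mixed-input" means, here is a hedged PyTorch sketch (the blog's actual contribution is fused CUTLASS kernels on Ampere, not this): activations in bfloat16, weights stored as quantized int8, converted and rescaled to a hardware-supported type before the matmul.

import torch

x = torch.randn(16, 64, dtype=torch.bfloat16)       # bf16 activations
w_fp = torch.randn(64, 32)
scale = w_fp.abs().max() / 127.0                    # per-tensor quantization scale
w_int8 = (w_fp / scale).round().clamp(-127, 127).to(torch.int8)

# The software-side data type conversion: int8 -> bf16, then undo the scale.
w_bf16 = w_int8.to(torch.bfloat16) * scale.to(torch.bfloat16)
y = x @ w_bf16
print(y.shape, y.dtype)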
Databases Are Falling Apart: Database Disassembly and Its Implications
Why are engineers taking databases apart and putting them back together, again?…In this post, I discuss the history of database disassembly, the industry’s current state, where we’re heading, and the implications of this trend. I find it instructive to look at disassembly through the lens of two elephant-themed projects: Apache Hadoop and PostgreSQL. Though Hadoop and PostgreSQL are from different parts of the data stack, both have influenced modern disassembly efforts. Let’s start with Hadoop…
Using DuckDB-WASM for in-browser Data Engineering
Rapid prototyping SQL Queries & Data Visualizations…One of the first things that came to my mind once I learned about the existence of DuckDB-WASM was that it could be used to create an online SQL Playground, where people could interactively run queries, show their results, and also visualize them. DuckDB-WASM sits at its core, providing the storage layer, the query engine, and much more…
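DuckDB-WASM itself is driven from JavaScript in the browser, but the Python bindings run the same SQL against the same engine, so a rough stand-in for the playground workflow looks like this (illustrative; not code from the post):

import duckdb

duckdb.sql("CREATE TABLE events AS SELECT range AS id, range % 3 AS kind FROM range(9)")
print(duckdb.sql("SELECT kind, count(*) AS n FROM events GROUP BY kind ORDER BY kind"))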
Prompt Design and Engineering: Introduction and Advanced Methods
Prompt design and engineering has become an important discipline in just the past few months. In this paper, we provide an introduction to the main concepts and design approaches. We also cover more advanced techniques, all the way to those needed to design LLM-based agents. We finish by providing a list of existing tools for prompt engineering…
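As a taste of the simplest technique such papers cover, few-shot prompting packs labeled examples into the prompt ahead of the query. The template below is a generic sketch, not one from the paper:

EXAMPLES = [("The movie was a delight.", "positive"),
            ("I want my money back.", "negative")]

def few_shot_prompt(text):
    # Render each example as a (review, sentiment) pair, then append the query.
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return f"{shots}\nReview: {text}\nSentiment:"

print(few_shot_prompt("Two hours I will never get back."))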
float8_experimental - library for accelerating training with float8 in native PyTorch
This is an early version of a library for accelerating training with float8 in native PyTorch according to the recipes laid out in https://arxiv.org/pdf/2209.05433.pdf. The codebase strives to stay small, easily hackable, and debuggable with native PyTorch tooling. torch.compile is supported out of the box. With torch.compile on, initial results show throughput speedups of up to 1.2x on small scale (8 GPUs) LLaMa pretraining jobs…
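The library's own API may differ from this, but the core recipe in the linked paper is amax-based per-tensor scaling before the cast. A sketch using the float8 dtypes available in recent PyTorch builds:

import torch

def to_float8(x, dtype=torch.float8_e4m3fn):
    # Scale so the tensor's largest magnitude lands near the float8 max (448 for e4m3fn).
    f8_max = torch.finfo(dtype).max
    scale = f8_max / x.abs().max().clamp(min=1e-12)
    x_f8 = (x * scale).clamp(-f8_max, f8_max).to(dtype)
    return x_f8, scale                     # keep the scale to undo the cast later

x = torch.randn(4, 4)
x_f8, scale = to_float8(x)
print("max roundtrip error:", (x - x_f8.to(torch.float32) / scale).abs().max().item())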
Simon Willison interview: AI software still needs the human touch
Simon Willison, a veteran open source developer who co-created the Django framework and built the more recent Datasette tool, has become one of the more influential observers of AI software recently. His writing and public speaking about the utility and problems of large language models have attracted a wide audience thanks to his ability to explain the subject matter in an accessible way. The Register interviewed Willison, and he shares some thoughts on AI, software development, intellectual property, and related matters…
Building Your Own Product Copilot: Challenges, Opportunities, and Needs
In this work, we present the findings of an interview study with 26 professional software engineers responsible for building product copilots at various companies. From our interviews, we found pain points at every step of the engineering process and the challenges that strained existing development practices. We then conducted group brainstorming sessions to collaborate on opportunities and tool designs for the broader software engineering community…
Training & Resources
Forecasting: Principles and Practice Book (free, online)
This textbook is intended to provide a comprehensive introduction to forecasting methods and to present enough information about each method for readers to be able to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details…
Cookbook Polars for R
Welcome to the Polars cookbook for R users. The goal of the cookbook is to provide solutions to common tasks and problems in using Polars with R, letting R users quickly map the syntax of their usual packages to the Polars equivalent. It is structured around side-by-side comparisons between polars, base R, dplyr, tidyr, and data.table…
Self-supervised Learning: Generative or Contrastive
In this survey, we look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further investigate related theoretical analysis work to clarify how self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning. An outline slide for the survey is provided…
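The "contrastive" category reduces to a compact objective. Here is an illustrative InfoNCE-style loss (a common instance of that family, not code from the survey): embeddings of two augmented views of the same samples are pulled together, all other pairs pushed apart.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    # z1, z2: (N, d) embeddings of two views of the same N samples.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau             # (N, N) cosine similarities
    targets = torch.arange(z1.size(0))   # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(info_nce(z1, z2).item())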
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #531 here.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian
P.S. If you found what you read meaningful, consider subscribing to support more writing. The membership program funds the free newsletter: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2024 DataScienceWeekly.org, All rights reserved.