Data Science Weekly - Issue 551
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
June 13, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Updating Beliefs on Bayesian Education: Reflections on Research and Teaching Projects
Example Projects and How They Shaped My Current Beliefs on Bayesian Education…In a few weeks, I will become an “Associate Professor of Teaching”. For this talk, I wanted to share my own Bayesian journey and focus on some of the projects I worked on…
📽️ New 4 hour (lol) video lecture: "Let’s reproduce GPT-2 (124M)"
Andrej Karpathy starts with an empty file and ends up with a GPT-2 (124M) model: first we build the GPT-2 network, then we optimize it to train very fast, then we set up the training run optimization and hyperparameters by referencing the GPT-2 and GPT-3 papers, then we bring up model evaluation, and then cross our fingers and go to sleep…
Home-Cooked Software and Barefoot Developers
The emerging golden age of home-cooked software, barefoot developers, and why the local-first community should help build it…
A Message from this week's Sponsor:
Join the industry’s leading AI conference - free passes available!
Don’t miss the AI conference of the year! Join 5000+ attendees, 350+ speakers and 150+ AI exhibitors at Ai4, North America's largest AI industry conference, taking place in Las Vegas on August 12-14. Enjoy dedicated content & unbeatable networking for both business & technical leaders from every major industry and job function.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
🧵 AI-powered Jupyter Notebook built using React 🧵
Thread is a Jupyter Notebook that combines the experience of OpenAI's code interpreter with the familiar development environment of a Python notebook. With Thread, you can use natural language to generate cells, edit code, ask questions or fix errors, all while being able to edit or re-run code as you would in a regular Jupyter Notebook. Best of all, Thread runs locally, and can be used for free with your own API key…
Jason Wei & Hyung Won Chung of OpenAI
Two-part talk: Intuitions on Language Models (Jason)…Jason will talk about some basic intuitions on language models, inspired by manual examination of data…Shaping the Future of AI from the History of Transformer (Hyung Won)…I will provide a highly-opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute…
What mishap have you done because you were good in ML but not the best in statistics? [Reddit Discussion]
I feel like there are many people who are good in ML but not necessarily good in statistics. I am curious about the possible trade-offs of not having a good statistics foundation…
Can LLMs invent better ways to train LLMs?
Earlier this year, Sakana AI started leveraging evolutionary algorithms to develop better ways to train foundation models like LLMs. In a recent paper, we have also used LLMs to act as better evolutionary algorithms! Given these surprising results, we began to ask ourselves: Can we also use LLMs to come up with a much better algorithm to train LLMs themselves? We playfully term this self-referential improvement process LLM² (‘LLM-squared’) as an homage to previous fundamental work in meta-learning. As a significant step towards this goal, we’re excited to release our report, Discovering Preference Optimization Algorithms with and for Large Language Models…
jax-diffusion-transformer
Implementation of Diffusion Transformer (DiT) in JAX…
PostgreSQL and Pgvector: Now Faster Than Pinecone, 75% Cheaper, and 100% Open Source
Introducing pgvectorscale, a new open-source extension that makes PostgreSQL an even better database for AI applications. Pgvectorscale builds upon pgvector to unlock large-scale, high-performance AI use cases previously only achievable with specialized vector databases like Pinecone…
The Unreasonable Effectiveness of Human Feedback
This post presents quantitative results showing how human feedback allows Foyle to assist with building and operating Foyle. In 79% of cases, Foyle provided the correct answer, whereas ChatGPT alone would lack sufficient context to achieve the intent. Furthermore, the LLM API calls cost less than $0.002 per intent whereas a recursive, agentic approach could easily cost $2-$10…
Can children (4-8 years old) strategically decide what learning activity to practice when they are free to choose? [PDF link downloads]
Yes…[interesting research paper]
How Bad Is the Data Environment where you work? [Reddit Discussion]
I just want to know if data and its processes are as shocking elsewhere as they are where I work. I have bridging tables that don't bridge. I have tables with no keys. I have tables with an incomprehensible soup of abbreviations as names. I have columns with the same business name in different databases that have different values and both are incorrect. So many corners have been cut that this environment is a circle. Is it this bad everywhere or is it better where you work? Edit: Please share horror stories, the ones I see so far are hilarious and are making me feel better😅…
An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries [PDF]
We aim to assess the energy usage of Pandas, a widely-used Python data manipulation library, and Polars, a Rust-based library known for its performance. The study aims to provide insights for data scientists by identifying scenarios where one library outperforms the other in terms of energy usage, while exploring the possible correlations between energy and performance metrics…
Incorporating time-varying seasonality in forecast models
Seasonality is very common in real-world time series. Many series vary in periodic, regular ways. For example, ice cream sales tend to be higher in warmer holiday months, while counts of migratory birds fluctuate strongly around the annual migration cycle. Because of how pervasive seasonality is, many time series and forecasting methods have been developed specifically to deal with this feature…The purpose of this brief post is to highlight one strategy for capturing seasonality, and time-varying seasonal patterns, in Dynamic Generalized Additive Models…
Enhancing Code Completion for Rust in Cody
Although most LLMs are trained on corpora that include several programming languages, we often observe differential performance across languages, especially languages like Rust that are not well represented in popular training datasets. In this post, we share early results from our efforts to improve the performance of LLMs for code completion in such languages…
Training & Resources
The Prompt Report: A Systematic Survey of Prompting Techniques
While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area's nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting…
A User’s Guide to Statistical Inference and Regression
Quantitative research involves a host of choices about the model to use, variables to include, tuning parameters to set, assumptions to make, and so on. Without a deep understanding of statistics, you may find these choices bewildering and confusing, and you may simply (and possibly erroneously) yield to the default settings of your statistical software. The goal of this book is to give you the foundation to make methodological choices for your specific application with knowledge and with confidence…
Language models on the command-line
Handout for a talk I gave about LLM and CLI tools…Notes for a talk I gave at Mastering LLMs: A Conference For Developers & Data Scientists…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #550 here.
Cutting Room Floor
NAACL 24 Tutorial: Explanations in the Era of Large Language Models
Purdue CHE 597 Computational Optimization spring 2024, by Can Li
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian