Data Science Weekly - Issue 562
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #562
August 29, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Predicting the Future of Distributed Systems
There are significant changes happening in distributed systems. Object storage is becoming the database, tools for transactional processing and analytical processing are becoming one and the same, and there are new programming models that promise some combination of superior security, portability, management of application state, or simplification. These changes will influence how systems are operated in addition to how they are programmed…
Current William Guy Lecturers 2024-25
The theme for this academic year (1 August 2024–31 July 2025) is Statistics in plain sight. The lecturers will be inspiring the next generation about the importance of statistics and data to the world around us – where the role of statistics is crucial but may not be immediately apparent. The talks will encourage young people to dig deeper and be curious about the building blocks behind their everyday activities and the topics they encounter in daily life…Find out more about the 2024-25 lecturers, how to contact them, and watch their talks below…
Introduction to Mechanistic Interpretability
Mechanistic Interpretability is an emerging field that seeks to understand the internal reasoning processes of trained neural networks and gain insight into how and why they produce the outputs that they do. AI researchers currently have very little understanding of what is happening inside state-of-the-art models.[1] Current frontier models are extremely large – and extremely complicated. They might contain billions, or even trillions of parameters, spread across over 100 layers. Though we control the data that is inputted into a network and can observe its outputs, what happens in the intervening layers remains largely unknown. This is the ‘black box’ that mechanistic interpretability aims to see inside…
A Sponsor Message
Dagster Deep Dive - Building a True Data Platform: Beyond the Modern Data Stack
We know how to build a basic Modern Data Stack, but how do we make these systems maintainable?
Evolve from simply moving data around to creating a data platform that's scalable, reliable, and built to last.
During Dagster's next webinar in our Deep Dive series, we'll cover:
How to build scalable, maintainable data platforms
Practical methods for ensuring data quality and governance
Strategies to maximize data insights and discoverability
Save your spot now for Dagster's Deep Dive on Sept 3 at 9 AM PST.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
JPMorgan's Python training for business analysts and traders
This course is designed to be an introduction to numerical computing and data visualization in Python. It is not designed to be a complete course in Computer Science or programming, but rather a motivational demonstration of how relatively complex topics can be accessible even to those without formal programming backgrounds…
Everything I Learned from AI Consulting
By the end of this post, you will understand how to transition from a task-oriented contractor to a high-value consultant. You'll learn strategies to dramatically increase your earning potential by aligning your services with client outcomes, structuring compelling proposals, and pricing based on value rather than time…
Anthropic's Prompt Engineering Interactive Tutorial
This course is structured to allow you many chances to practice writing and troubleshooting prompts yourself. The course is broken up into 9 chapters with accompanying exercises, as well as an appendix of even more advanced methods. It is intended for you to work through the course in chapter order. Each lesson has an "Example Playground" area at the bottom where you are free to experiment with the examples in the lesson and see for yourself how changing prompts can change Claude's responses…
Convex Optimization in the Age of LLMs
In 1948, at the meeting of the Econometric Society at the University of Wisconsin, Madison, George Dantzig first presented his formulation of Linear Programming. Dantzig proposed something radical at the time. Posing problems as goal maximization wasn’t particularly novel, nor were simple algorithms for finding extreme points. But no one had an algorithm for maximizing a goal subject to a complex model constraining potential solutions. Dantzig found he could model all sorts of important planning programs–whether it be the transportation of goods, assignment of people to tasks, or computing shortest paths–as maximizing linear functions subject to linear inequality constraints…
Would love to hear some success stories for people who recently got into DS roles without any prior experience. [Reddit Discussion]
Would be great if you could provide a little background (formal education, certs, prior work experience, age, region of the world). Need some motivation (and need to spend less time on the r/layoffs sub) Edit: especially interested in hearing success stories from older applicants who’ve successfully made career transitions (35yo and up)…
Reproducible Data Science in R: Writing better functions
This article builds on the momentum of the previous blog post, heavily drawing upon Tidy design principles to improve our functional programming skills. With a little time invested in how we write functions, we can save ourselves and our collaborators a lot of time trying to decipher our code by focusing on intuitive and user-centered design. There are a lot of ways to do this, but this article contains a couple of the most straightforward and important aspects of writing useful and usable functions…
MiniTorch is a diy teaching library for machine learning engineers
MiniTorch is a diy teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code…
JAX and Equinox - An Introduction
This is a brief introduction to some key ideas in JAX and Equinox (mostly on the latter)…Today, everyone loves JAX for its speed, but quite a few people struggle with its functional approach, especially when considering high-level frameworks such as Haiku or Equinox. To ease the process, this is a short guided tour of some of the key concepts in JAX and Equinox, following a realistic implementation of an activation function…
Large Language Models Understand and Can be Enhanced by Emotional Stimuli
In this paper, we take the first step towards exploring the ability of LLMs to understand emotional stimuli. To this end, we first conduct automatic experiments on 45 tasks using various LLMs, including Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4. Our tasks span deterministic and generative applications that represent comprehensive evaluation scenarios. Our automatic experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts (which we call "EmotionPrompt" that combines the original prompt with emotional stimuli)…
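For readers who want to try the idea, here is a minimal sketch of what "combining the original prompt with emotional stimuli" could look like in practice. The function name `emotion_prompt` and the stimulus sentence are our own illustrative assumptions, not code or wording taken from the paper.

```python
def emotion_prompt(original_prompt: str, stimulus: str) -> str:
    """Return the original prompt followed by an emotional stimulus sentence.

    Hypothetical helper for illustration; the paper's exact templates may differ.
    """
    # Trim stray whitespace so the two parts join with a single space.
    return f"{original_prompt.strip()} {stimulus.strip()}"


# Illustrative stimulus sentence (an assumption, chosen for this example).
STIMULUS = "This is very important to my career."

prompt = emotion_prompt(
    "Determine whether the following movie review is positive or negative.",
    STIMULUS,
)
print(prompt)
```

You could then send both the plain prompt and the combined prompt to the same model and compare the answers on your own task.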
archives.design
A digital archive of graphic design related items that are available on the Internet Archives…
Fine-tuning Best Practices Series Introduction and Chapter 1: Training Data
Hi there friend! We’re going to be releasing a series of 3 recorded conversations with Kyle Corbitt, OpenPipe Co-Founder & CEO, on the topic of LLM fine-tuning best practices. Each conversation will be accompanied by a related article. Today's conversation and article will focus on the first chapter (Training Data) listed in the series below:
Chapter 1: Training Data - We’ll explore how to choose the best data, common methods for collecting it, and common methods for shaping it (automated, relabeling, human shaping)…
Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress
The use of synthetic data has played a critical role in recent state-of-the-art breakthroughs. However, overly relying on a single oracle teacher model to generate data has been shown to lead to model collapse and invite propagation of biases. These limitations are particularly evident in multilingual settings, where the absence of a universally effective teacher model that excels across all languages presents significant challenges. In this work, we address these extreme differences by introducing "multilingual arbitrage", which capitalizes on performance variations between multiple models for a given language…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #561 here.
Cutting Room Floor
Why is Python the most widely used language for machine learning if it's so slow?
DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude
Neural and Non-Neural AI, Reasoning, Transformers, and LSTMs - interview with Jürgen Schmidhuber
Whenever you're ready, 3 ways we can help:
Learning something for your work? We can help you a) learn faster, b) learn deeper, and c) learn better. Reply to this email to find out how we can work 1-on-1 with you to speed up your learning.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian