Data Science Weekly - Issue 516
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #516
October 12 2023
Poll Results!
So a bunch of you voted, and it was a 50/50 split on whether we should expand to two issues a week. We had tons of great suggestions - thanks to everyone who wrote in or commented.
This week and next we'll stick to a single issue per week while we go back through all of the positive and negative feedback and figure out what makes the most sense for you.
Thanks for taking the time to share your opinions :)
Hello!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you don’t find this email useful, please unsubscribe here.
Is this newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week
Editor's Picks
CS/HCI PhD Opportunity Tracker (2024)
This page collects computer science (CS) and human computer interaction (HCI) PhD recruitment posts and opportunities to help answer "who is taking PhD students this year" and "who is recruiting a student in {X} this year"…By collecting posts, we can help students find interdisciplinary PhD positions in computing that might not otherwise show up on their radar…Information is collected manually from the Internet…
The Hard Economics of Selling Web Data
As already seen, it would make enormous sense to buy pre-scraped data instead of building new code from scratch. Yet many past efforts to sell datasets didn't catch on. Why is that? Why do companies hire or commission external consultants to scrape rather than search for pre-scraped data? Why is build preferred over buy? Selling data independently can be hard, as the unit economics pull against it. But things look different when we understand the market…

Neuroscience for machine learners
This is a freely available online course on neuroscience for people with a machine learning background. The aim is to bring together these two fields that have a shared goal in understanding intelligent processes. Rather than pushing for “neuroscience-inspired” ideas in machine learning, the idea is to broaden the conceptions of both fields to incorporate elements of the other in the hope that this will lead to new, creative thinking…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, and if they're not all working properly, your company may not be making the right decisions. That means teams aren't shipping features that actually help customers, the org is leaving money on the table, and you're likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Why data integration will never be fully solved, and what Fivetran, Airbyte, Singer, dlt and CloudQuery do about it
If you ask a data engineer what is the most frustrating and error-prone part of their job, chances are they'll say data ingestion. Moving data from A to B is one of the most mundane and time-consuming tasks of any platform team…This post covers how Fivetran, Airbyte, Singer, dltHub, and CloudQuery approached data integration. Even though we'd argue that this problem will never be fully solved by an integration vendor, there is a lot we can do to make the ingestion process more reliable, maintainable, and cost-effective…
ML Engineer Here - Tell me what you wish to learn and I'll do my best to curate the best resources for you 💪 [Reddit]
Tell me what you wish to learn and I'll do my best to curate the best resources for you…
The AI research job market sh*t show (and my experience)
It's pretty wild that everyone is so interested in the where's and when's of researchers in AI these days. It's becoming a bit like transfer news in our favorite sports leagues. But it's more than just drama and gossip. Zoomed in, it's a leading indicator of which companies are going to gain ground and which will fall behind. Zoomed out, it's a way to measure the consolidation vs. dispersion of talent in AI…

Welcome to State of AI Report 2023
For much of the last year, it's felt like Large Language Models (LLMs) have been the only game in town…However, this is the State of AI, not the State of LLMs, and the report dives into progress in other areas of the field - from breakthroughs in navigation and weather prediction through to self-driving cars and music generation. This has been one of the most exciting years to produce this report, and we believe it will have something for everyone - from AI research through to politics…

"Programming Distributed Systems" by Mae Milano [YouTube]
In this talk, I'll show how to use ideas from programming languages to make programming at scale easier, without sacrificing performance, correctness, or expressive power in the process. We'll see how slight tweaks to modern imperative programming languages can provably eliminate common errors due to replica consistency or concurrency---with little to no programmer effort. We'll see how new language designs can unlock new systems designs, yielding both more comprehensible protocols and better performance. And we'll conclude by imagining together the role that a new cloud-centric programming language could play in the next generation of distributed programs…
Artificial General Intelligence Is Already Here
Today's frontier models perform competently even on novel tasks they were not trained for, crossing a threshold that previous generations of AI and supervised deep learning systems never managed…Decades from now, they will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer. The ENIAC could be programmed with sequential, looping, and conditional instructions, giving it a general-purpose applicability that its predecessors, such as the Differential Analyzer, lacked…Today's computers far exceed ENIAC's speed, memory, reliability, and ease of use, and in the same way, tomorrow's frontier AI will improve on today's. But the key property of generality? It has already been achieved…

How do data scientist managers manage data scientists? [Reddit]
As a data science manager, how do you manage your team? Specifically, how do you manage your DSs' career growth and promotion opportunities? Imagine you have a team of 5 DSs: 2 DS1s, 2 DS2s, and 1 DS3, where DSX denotes a Data Scientist at level X. What is your measure of success - promotions, completed projects, revenue contribution, etc.? How does a DSX become a DSX+1?…
Everything about Distributed Training and Efficient Fine-tuning
A deep dive into distributed training and efficient fine-tuning - DeepSpeed ZeRO, FSDP, and practical guidelines and gotchas for multi-GPU and multi-node training…I wanted to write this post to focus on the nitty-gritty details of distributed training strategies, specifically DeepSpeed and FSDP, along with a summary of different efficient fine-tuning methods, with a special focus on multi-GPU and multi-node training. The trend right now is clear: we're going to be using more and more compute, and thus more GPUs with bigger models. So understanding these topics is important, especially when you're trying to up your game from a home server with a couple of 3090s to, say, a GCP container with 8x A100 80GBs…
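The core idea behind DeepSpeed ZeRO's first stage - sharding optimizer state across data-parallel ranks and all-gathering the updated parameters - can be sketched in a few lines of NumPy. This is a toy single-process simulation, not DeepSpeed's actual API; all names (`zero_step`, `momentum_shards`, etc.) are illustrative:

```python
import numpy as np

# Toy simulation of ZeRO stage-1 optimizer-state sharding with
# SGD + momentum, assuming 2 data-parallel "ranks".

rng = np.random.default_rng(0)
n_params, world_size, lr, beta = 8, 2, 0.1, 0.9

params = rng.normal(size=n_params)  # full parameters, replicated on every rank
# Each rank keeps momentum only for its own 1/world_size slice of the model.
momentum_shards = [np.zeros(n_params // world_size) for _ in range(world_size)]

def zero_step(params, grads_per_rank):
    # 1. All-reduce: average gradients across ranks (plain data parallelism).
    avg_grad = np.mean(grads_per_rank, axis=0)
    shard_size = len(params) // world_size
    new_shards = []
    for rank in range(world_size):
        # 2. Each rank updates optimizer state and params for its shard only.
        lo, hi = rank * shard_size, (rank + 1) * shard_size
        momentum_shards[rank] = beta * momentum_shards[rank] + avg_grad[lo:hi]
        new_shards.append(params[lo:hi] - lr * momentum_shards[rank])
    # 3. All-gather: every rank reassembles the full updated parameter vector.
    return np.concatenate(new_shards)

grads = [rng.normal(size=n_params) for _ in range(world_size)]
updated = zero_step(params, grads)
```

The update is numerically identical to unsharded SGD with momentum; the win is that each rank stores only its slice of the optimizer state, which is where stage 1's memory savings come from.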
A Tour of Video Understanding Use Cases
With the explosive growth of online video content, the need to understand and analyze videos has become increasingly crucial. Enter video understanding, a fascinating field that harnesses the power of artificial intelligence and machine learning to decipher the rich visual information embedded in videos. In a previous article, we looked at the evolution of video understanding from an academic perspective. In this article, we will embark on a captivating tour of video understanding and explore its diverse range of use cases…
What data engineering stack do the small players have here? [Reddit]
For a few weeks I've been lurking this subreddit, and most posts are aimed at bigger stacks - big data lakes, etc. What do the small players here use for their ETL pipelines? The most recent popular tools seem like overkill for small tasks…
Retrieval Augmented Generation at scale — Building a distributed system for synchronizing and ingesting billions of text embeddings
Getting a Retrieval Augmented Generation (RAG) application started is pretty straightforward. The problem comes when you try to scale it and make it production-ready. In this blog we go into some of the technical and architectural details of how we do this…specifically, how we did it for a pipeline syncing 1 billion vectors…First off, what exactly is RAG?…
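At its core, the "retrieval" in RAG is a nearest-neighbor search over text embeddings: embed the query, score it against the document vectors, and hand the top matches to the LLM as context. A minimal sketch with NumPy, using random toy vectors in place of a real embedding model (the `top_k` helper and the document texts are made up for illustration):

```python
import numpy as np

# Minimal sketch of the retrieval step in RAG. Real systems replace the
# random vectors below with embeddings from a model, and the brute-force
# scan with an approximate nearest-neighbor index once counts get large.

def top_k(query_vec, doc_vecs, k=2):
    # Cosine similarity between the query and every document vector.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]  # indices of the k closest documents

docs = ["scaling vector sync", "ingesting embeddings", "cooking recipes"]
rng = np.random.default_rng(42)
doc_vecs = rng.normal(size=(3, 64))
query_vec = doc_vecs[0] + 0.01 * rng.normal(size=64)  # query "near" doc 0

idx = top_k(query_vec, doc_vecs)
print([docs[i] for i in idx])  # doc 0 should rank first
```

The scaling pain the article describes is exactly what happens when this brute-force scan and the pipeline that keeps `doc_vecs` fresh have to handle a billion vectors instead of three.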
Exploratory Data Analysis for Humanities Data
In the spring of 2023, I co-taught a course called Literature as Data…One of the goals of the course was to try to teach enough computing to a mostly non-technical and definitely not computer-experienced population that they could use computers to do an interesting and new (to them) exploration of some dataset that they found intriguing…After much discussion, we decided that for the programming aspects, we would devote half of each class meeting to a "studio" where I would lead the students through hands-on computing exercises on some small dataset, followed by having the students do similar exercises on larger datasets outside the class…
Jobs
Data Science Intern:
Performance Control & Digitalization
More than 90% of automotive innovations are based on electronics and software.
We, the BMW Group, offer you an interesting and varied internship in data science for Performance Control & Digitalization. To take our operations to the next level, the BMW Group's Performance Control & Digitalization department is looking for a Data Science Intern to contribute to the Supply Chain Innovations Think Tank of the BMW Group and continue BMW's leadership in supply chain management. The team's goal is to research emerging technologies, including Data Science (ML, AI, BI, etc.).
Location is Munich. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Why does the Least Squares Regression Line use square residuals and not the residuals themselves? [Reddit]
Why is the LSRL defined by the line that minimizes the squared residuals of a scatter plot? Why not just the residuals themselves? The smaller the square is, the smaller the residual will be. So why can't it be the "Least Residuals Regression Line"?…

Why Do We Need Weight Decay in Modern Deep Learning?
Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory…

Let's reverse engineer Disney's adorable, lifelike robot! [Twitter/X]
I couldn't find a white paper, but this is how I think it's trained: 1. The emotional behaviors are curated by Disney animation artists, keyframe by keyframe...2. Reinforcement learning (RL) is a great tool for training low-level robot controllers...3. Enter Adversarial Motion Prior (AMP): a technique that learns human preference by training a classifier on what we consider "emotional & cute"...4. Add lots of data augmentation to make the controller robust to physical disturbances…
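On the least-squares question above, a quick numeric check shows why raw residuals can't define the line: positive and negative misses cancel, and for the least-squares fit the raw residuals always sum to exactly zero, so "least residuals" is degenerate. Squaring makes every miss count regardless of sign. A small sketch with made-up data:

```python
import numpy as np

# Why squared residuals? Raw residuals cancel: for the OLS line they sum
# to zero, and shifting the line can drive the raw sum to minus infinity.

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.0, 3.5, 6.0])

slope, intercept = np.polyfit(x, y, 1)  # ordinary least squares fit
residuals = y - (slope * x + intercept)

print(round(residuals.sum(), 10))    # 0.0 - raw residuals cancel exactly
print(round((residuals ** 2).sum(), 4))  # 1.6 - squared residuals stay positive
```

(Minimizing the *absolute* residuals is a legitimate alternative - that's least absolute deviations regression - but it has no closed-form solution and isn't unique, which is part of why squares won historically.)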
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #515 here.
Cutting Room Floor
OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
My lecture notes on regularization techniques in machine learning
Whenever you're ready, 3 ways we can help you:
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course: A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
Is this newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.