Data Science Weekly - Issue 560
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #560
August 15, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
- The Turing Test and our shifting conceptions of intelligence 
 The Turing Test was proposed by Turing to combat the widespread intuition that computers, by virtue of their mechanical nature, cannot think, even in principle. Turing’s point was that if a computer seems indistinguishable from a human (aside from its appearance and other physical characteristics), why shouldn’t we consider it to be a thinking entity? Why should we restrict “thinking” status only to humans (or more generally, entities made of biological cells)? As the computer scientist Scott Aaronson described it, Turing’s proposal is “a plea against meat chauvinism.”…
- Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation 
 We will be coding the PaliGemma Vision Language Model from scratch while explaining all the concepts behind it: Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax); Vision Transformer model; Contrastive learning (CLIP, SigLip); Numerical stability of the Softmax and the Cross Entropy Loss; Rotary Positional Embedding; Multi-Head Attention; Grouped Query Attention; Normalization layers (Batch, Layer and RMS); KV-Cache (prefilling and token generation); Attention masks (causal and non-causal); Weight tying; Top-P Sampling and Temperature; and much more!…
- Can astrologers use astrological charts to understand people's character and lives? Our new study put astrologers to the test 
 Astrology is very popular — both Gallup and YouGov report that about 25% of Americans believe that the position of the stars and planets can affect people's lives, with an additional 20% of people reporting being uncertain about astrology’s legitimacy…Previously, we tested whether facts about a person's life can be predicted using their astrological sun signs (such as Pisces, Aries, etc.). A number of astrologers criticized this work…Inspired by these critiques, we enlisted the help of six astrologers, and with their feedback and guidance, we designed a new test to see whether astrologers can truly gain insights about people from entire astrological charts…
A Sponsor Message
Bridging AI, Vector Embeddings and the Data Lakehouse - Live Webinar
Join NielsenIQ and Onehouse to explore the crucial role of vector embeddings in AI.
Discover how Onehouse makes it more cost-efficient, simple, and scalable to generate and manage vector embeddings directly from your data lake, amidst rising vector database costs.
Live webinar. Aug 27, 2024 | 10 am PT
Can't make it? Register anyway to receive the recording!
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
- How I Use "AI" 
 In this post, I want to try and ground the conversation. I'm not going to make any arguments about what the future holds. I just want to provide a list of 50 conversations that I (a programmer and research scientist studying machine learning) have had with different large language models to meaningfully improve my ability to perform research and help me work on random coding side projects…
- Decayed estimators for time series 
 If you are working on a time series then you are typically interested in predicting the near future…So does it make sense to look very far back into the past? Maybe not! This is something we can tune with sample weights, which many scikit-learn models support…
- Open-endedness is all we’ll need - On “Agentic AI” 
 Following the explosion of interest in generative AI, we’ve seen a succession of new buzzwords. We started with the old ‘ChatGPT for x’, moved to genAI more broadly, and then to multimodality. However, the new word of the quarter appears to be ‘agentic AI’…Agentic systems offer the promise of agents - a software system that takes action in an environment and gets feedback - autonomously performing multi-step tasks, making decisions based on a combination of user input and environmental factors…This week, we’re asking how far along this road we are, looking at some of the more promising research threads, and assessing where value might be found in the meantime…
- What's your Mlops stack [Reddit Discussion] 
 I'm an experienced software engineer but I have only dabbled in mlops. There are so many tools in this space with a decent amount of overlap…What combination of tools do you use in your company? I'm looking for specific brands here so I can do some research / learning…
- Dealing with rejection (in distributed systems) 
 I want to focus on a problem that we confronted when building WarpStream: backpressure…backpressure is a really simple concept. When the system is nearing overload, it should start “saying no” by slowing down or rejecting requests. This applies pressure back toward the client (hence the term) and prevents the system from tipping over into catastrophic failure…
- Step-by-Step Diffusion: An Elementary Tutorial 
 We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms…
- An Empirical Study of Applying AI on the Edge for Earth Observation 
 In this work we evaluate the benefits of using an image product processed on board [on a satellite] with respect to an on-ground solution. The experiments are carried out using iquaflow, our open-source framework originally created to provide a set of tools to assess image quality by means of conducting learning-based tasks, such as detection, segmentation, or classification…
- Coloured text in {ggplot2}: {ggtext} vs {marquee} 
 When you use colour to denote the values of a variable in a visualisation, it’s very common to add a legend showing how the colours map to different values. If you create your charts using {ggplot2}, a legend is added automatically when you add colour or fill within the aesthetic mapping. One problem with these legends is that they take up a lot of space - space where we could be plotting data instead! An alternative to using a traditional legend is using coloured text within a subtitle or annotation…
- This repo aims at providing a collection of efficient Triton-based implementations for state-of-the-art linear attention models… 
- Analysis-Ready, Cloud Optimized ERA5 
 ARCO-ERA5 dataset: ERA5 is the fifth generation of ECMWF's Atmospheric Reanalysis. It spans atmospheric, land, and ocean variables. ERA5 is an hourly dataset with global coverage at 30km resolution (~0.28° x 0.28°), ranging from 1979 to the present…All packaged into a single cloud-friendly Zarr file, and loadable with a single line of Python code…
- Free APIs for personal projects [Reddit Discussion] 
 I'm learning data engineering and wanted to get more practice with pulling data via an API and using an orchestrator to consistently get it stored in a db…Just wanted to get some ideas from the community on fun datasets. Google gives the standard (and somewhat boring) gov data, housing data, weather etc…
- Seven basic rules for causal inference 
 In this blog post I will describe seven basic rules that govern the relationship between causal mechanisms in the real world and associations/correlations we can observe in data. To make each rule as easy as possible to understand, I will describe each rule both in words and in causal graph and logic terms, and I will offer some very simple simulation R code for each rule to demonstrate how it works in practice. These seven rules represent basic building blocks of causal inference. Most causal analysis procedures involve one or more of these rules to some extent…
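A quick illustration for the "Decayed estimators for time series" link above: the blurb mentions tuning how far back a model looks by using sample weights. The sketch below is our own minimal, standard-library example, not code from the linked post. The names `decay_weights`, `weighted_mean`, and the `half_life` parameter are hypothetical, and the estimator is just a weighted mean. With scikit-learn, a list like `weights` could instead be passed as `sample_weight` to the `fit` method of the estimators that accept it.

```python
def decay_weights(n, half_life):
    """Return one weight per observation, oldest first.

    The newest observation gets weight 1.0 and the weight halves for
    every `half_life` steps further back in time (hypothetical scheme).
    """
    return [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]


def weighted_mean(values, weights):
    """Weighted average: recent points count more when weights decay."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy series whose level shifts from 10 to 20 near the end.
series = [10.0, 10.0, 10.0, 20.0, 20.0]
weights = decay_weights(len(series), half_life=1.0)

plain = sum(series) / len(series)          # treats all history equally
decayed = weighted_mean(series, weights)   # leans toward recent values

print(f"plain mean:   {plain:.2f}")
print(f"decayed mean: {decayed:.2f}")
```

A shorter `half_life` forgets the past faster; a very long one approaches the plain mean.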
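A quick illustration for the "Dealing with rejection (in distributed systems)" link above: the blurb describes backpressure as a system "saying no" when it nears overload. The sketch below is a minimal single-process example of that idea using Python's standard `queue` module. It is not WarpStream's implementation; the `BoundedIntake` class and its method names are hypothetical.

```python
import queue


class BoundedIntake:
    """Accept requests until a fixed capacity is reached, then reject."""

    def __init__(self, capacity):
        # A bounded queue: put_nowait raises queue.Full once it is at capacity.
        self._pending = queue.Queue(maxsize=capacity)

    def submit(self, request):
        """Return True if accepted, False if rejected because we are full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # "Saying no" pushes the pressure back toward the caller.
            return False

    def process_one(self):
        """Remove and return the oldest pending request, or None if idle."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


intake = BoundedIntake(capacity=2)
results = [intake.submit(f"req-{i}") for i in range(4)]
print(results)  # the first two fit, the rest are rejected

first = intake.process_one()                 # draining frees one slot
accepted_after_drain = intake.submit("req-4")
print(first, accepted_after_drain)
```

A rejected caller would typically back off and retry later, which is what slows the overall request rate.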
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #559 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
- Looking to get a job? Check out our “Get A Data Science Job” Course 
 It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
- Promote yourself/organization to ~63,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate. 
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian


