Data Science Weekly - Issue 565
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #565
September 19, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Outlier Detection in R: Hampel Filter for time series
… or the method you probably never heard of. Maybe I am wrong but this method is the most popular and at the same time highly underestimated. So we are going to fix this gap today. In the industry of outlier detection, there are still many tips and tricks. Just like we dissected Grubbs’ Test and the Tukey Method, it’s time to see how the Hampel Filter can help us clean our data…
Jensen’s Inequality As An Intuition Tool - Practice in distinguishing linear vs non-linear phenomena
Here’s where we’re going:
Why I found Jensen’s Inequality interesting
The conditions and statement of the inequality
An example that affects us all
Spotting Jensen’s in the wild…
Code review for statisticians, data scientists & modelers
Software developers have some really good approaches to code review. Here’s a data scientist’s plea to listen to the software developers!…
A Sponsor Message
Quadratic - analyze anything, host anywhere
With Quadratic, combine the spreadsheets your organization asks for with the code that matches your team’s code-driven workflows.
Powered by code, you can build anything in Quadratic spreadsheets with Python, JavaScript, or SQL, all approachable with the power of AI.
Use the data tool that actually aligns with how your team works with data, from ad-hoc to end-to-end analytics, all in a familiar spreadsheet.
Level up your team’s analytics with Quadratic today.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
DataMesh: How Uber laid the foundations for the data lake cloud migration
Uber’s batch data platform is used by over 10,000 active internal users, ranging from data scientists, city operations, and business analysts to engineers…In this blog, we delve into the details of how Uber laid the foundations for the batch data cloud migration by incorporating key data mesh principles…
Effective Online Surveys: An Introduction to Best Practices with Andrew Miles
This first hour of Andrew Miles's "Designing Effective Online Surveys" seminar provides an overview of the course and begins discussing key considerations for creating surveys that yield high-quality data, such as question formulation…This seminar offers a practical guide to web-based survey design, covering the latest research and best practices…We'll provide practical, easy-to-apply strategies for navigating challenges and discuss the advantages and disadvantages of using online samples (including design-based and post-hoc approaches to addressing some of these challenges)…
What makes working with data so hard for ML? [Reddit Discussion]
I’ve been speaking to a couple of my colleagues who are data scientists and the overarching response I get when I ask what’s the hardest part of their job, almost everyone says it’s having data in the right shape? What makes this so hard and what has your experience been like when building your own models? Do you currently have any tools that aid with this and do you really think it’s a genuine problem?…
An In-Depth Guide to Contrastive Learning: Techniques, Models, and Applications
In self-supervised learning, we partition the data into positive and negative samples - similar to binary (supervised) classification - by treating the object under consideration as a positive example and all the other samples as negative…In contrastive methods, we take different samples of the same data, like different views of the same image and try to maximize their similarity scores, while trying to minimize them for the other samples/images…Contrastive learning centers around a simple concept of choosing a representation that maximizes the similarities between positive data pairs, while minimizing for negative pairs…
Building RAG with Postgres
Postgres is a powerful tool for implementing Retrieval-Augmented Generation (RAG) systems…By diving deep into a technology you’re already familiar with, you can experience a significant productivity boost. As the saying goes, “stick with the tools you know.” Using Postgres for RAG allows you to reason about the system more easily, cutting through the hype and focusing on building something great. Now, let’s explore how we can build a RAG system using Postgres. We’ll go through each component step by step, from data ingestion to response generation, and see how Postgres fits into the overall architecture…
Bagging vs Boosting - Ensemble Learning In Machine Learning Explained
In this video I cover the Bagging (Bootstrap Aggregating) and Boosting ensemble learning algorithms that are commonly used across machine learning. I present how both Bagging and Boosting work, and discuss their similarities and differences…
How to Explain Things - Guidelines for Effective Scientific Communication
For every 1 unit of work time, you create 100 units of explaining time…how well you communicate will have a huge impact on your career…
What is Entropy?
This short book is an elementary course on entropy, leading up to a calculation of the entropy of hydrogen gas at standard temperature and pressure. Topics covered include information, Shannon entropy and Gibbs entropy, the principle of maximum entropy, the Boltzmann distribution, temperature and coolness, the relation between entropy, expected energy and temperature, the equipartition theorem, the partition function, the relation between expected energy, free energy and entropy, the entropy of a classical harmonic oscillator, the entropy of a classical particle in a box, and the entropy of a classical ideal gas…
rerankers: A Lightweight Python Library to Unify Ranking Methods
Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, all with different implementation methods. To mitigate this, we propose rerankers, a Python library which provides a simple, easy-to-use interface to all commonly used re-ranking approaches…
A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models
In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics and generative models…
Which SQL trick, method, or function do you wish you had learned earlier? [Reddit Discussion]
In my case, I wish I had started to use CTEs sooner in my career, this is so helpful when going back to SQL queries from years ago!!…
Learning Theory from First Principles, to appear in Fall 2024 at MIT Press (Final Draft)
The goal of this textbook is to present old and recent results in learning theory for the most widely used learning architectures. Doing so, a few principles are laid out to understand the overfitting and underfitting phenomena, as well as a systematic exposition of the three types of components in their analysis: estimation, approximation, and optimization errors. Moreover, the goal is not only to show that learning methods can learn given sufficient amounts of data but also to understand how quickly (or slowly) they learn, with a particular eye toward adaptivity to specific structures that make learning faster (such as smoothness of the prediction functions or dependence on low-dimensional subspaces)…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #564 here.
Cutting Room Floor
Improving LLM Reasoning using SElf-generated data: RL and Verifiers
Whistles, songs, boings, and biotwangs: Recognizing whale vocalizations with AI
Whenever you're ready, 3 ways we can help:
Need to learn something for your work? Reply to this email to find out how we can work 1-on-1 with you to speed up your learning.
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~63,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian