Data Science Weekly - Issue 546
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #546
May 09, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
Interpretable time-series modeling using Gaussian Processes
I love time-series analysis, so we are going to dive deep into it here. Another thing I love is Gaussian processes (GP), so why not combine the two?…GPs are an insanely powerful tool that can model an absurd range of data (including continuous and discrete) in an interpretable way that gives the modeler a large degree of control over how the technique learns from data. This makes GPs an invaluable tool in any applied statistician/machine learning practitioner’s toolkit…Throughout this post, we are going to build a basic GP from scratch to highlight the intuition and mechanics underlying the modeling process…
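For readers who want a feel for the mechanics before clicking through, here is a minimal sketch of a zero-mean GP posterior with a squared-exponential kernel. It is not the linked post's code: the function names, the kernel choice, the hyperparameter values, and the toy sine data are all our own assumptions, and it sticks to the standard library so it runs anywhere.

```python
import math

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(x_train, y_train, x_test, noise=1e-6, **kernel_args):
    """Posterior mean and variance of a zero-mean GP at each test input."""
    # Training covariance with observation noise added on the diagonal.
    K = [[rbf_kernel(a, b, **kernel_args) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    alpha = solve(K, y_train)  # K^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [rbf_kernel(xs, a, **kernel_args) for a in x_train]
        v = solve(K, k_star)  # K^{-1} k_*
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        prior_var = rbf_kernel(xs, xs, **kernel_args)
        variances.append(prior_var - sum(k * vi for k, vi in zip(k_star, v)))
    return means, variances

# Toy data (an assumption for illustration): four noiseless samples of sin(x).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]

# One test point on top of the data, one far away from it.
means, variances = gp_posterior(x_train, y_train, [1.0, 10.0])
```

At a training input the posterior mean sits on the observed value and the variance collapses towards the noise level; far from the data, the prediction falls back to the prior (mean near 0, variance near the kernel variance). That fallback is a large part of why GPs are considered interpretable.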
The part of PostgreSQL we hate the most
We want to discuss one major thing that sucks about PostgreSQL: how PostgreSQL implements multi-version concurrency control (MVCC)…In this article, we’ll dive into MVCC: what it is, how PostgreSQL does it, and why it is terrible…A systems engineer has to make several design decisions when building a DBMS that supports MVCC. At a high level, it comes down to the following:
How to store updates to existing rows.
How to find the correct version of a row for a query at runtime.
How to remove expired versions that are no longer visible.
These decisions are not mutually exclusive. In the case of PostgreSQL, it’s how they decided to handle the first question in the 1980s that caused problems with the other two that we still have to deal with today…
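To make the three design decisions concrete, here is a toy sketch of an append-only version store. It is a deliberate simplification and not PostgreSQL's implementation: the class and method names are hypothetical, every transaction is assumed to commit instantly, and a snapshot is reduced to a single transaction id. The `xmin`/`xmax` field names are borrowed from PostgreSQL's tuple headers only to hint at the connection.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that superseded it, if any

class ToyMVCCTable:
    """Hypothetical in-memory table; every write appends a full new row version."""

    def __init__(self):
        self.versions = {}  # key -> list of RowVersion, oldest first
        self.next_txid = 1

    def write(self, key, value):
        """Decision 1: store an update as a new version, never in place."""
        txid = self.next_txid
        self.next_txid += 1
        chain = self.versions.setdefault(key, [])
        if chain:
            chain[-1].xmax = txid  # mark the previous version as superseded
        chain.append(RowVersion(value, xmin=txid))
        return txid

    def read(self, key, snapshot):
        """Decision 2: walk the chain to find the version visible to a snapshot."""
        for v in self.versions.get(key, []):
            if v.xmin <= snapshot and (v.xmax is None or v.xmax > snapshot):
                return v.value
        return None

    def vacuum(self, oldest_active_snapshot):
        """Decision 3: drop versions that no active snapshot can still see."""
        removed = 0
        for key, chain in self.versions.items():
            live = [v for v in chain
                    if v.xmax is None or v.xmax > oldest_active_snapshot]
            removed += len(chain) - len(live)
            self.versions[key] = live
        return removed

table = ToyMVCCTable()
t1 = table.write("a", "v1")
t2 = table.write("a", "v2")
t3 = table.write("a", "v3")

seen_by_t1 = table.read("a", snapshot=t1)       # the version t1 created
removed = table.vacuum(oldest_active_snapshot=t2)
seen_by_t2 = table.read("a", snapshot=t2)       # still readable after vacuum
```

Even in this toy, the coupling the article complains about is visible: because every update appends a whole new version, reads have to walk the chain and some cleanup pass has to run to stop it growing without bound.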
A first attempt at DSPy Agents from scratch
The goal of this tutorial is to guide you through the process of building a simple agent application using DSPy. Since DSPy is not an agent framework, this is really more of a "what does DSPy give you and what doesn't it give you" kind of post…By the end of this post, you'll have an understanding of the key concepts, such as Plans, Workers, and Tools, and how they might work together to create a functional agent system…The key takeaways you'll learn from this post include:
How you might structure Agents in DSPy
How DSPy is really just Python - and you should think about it as such.
By the end of this tutorial, you'll at least have a perspective for building your own agent applications using DSPy, and you'll be equipped with the knowledge to explore further extensions and optimizations…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
There are seven core components of an A/B testing stack, and if they’re not all working properly, your company may not be making the right decisions. That means teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams to self-serve experiment set up and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
CHI 2024 Papers explorer
Explore the CHI 2024 papers using sentence embeddings. Enter a query and see the most relevant papers according to their title and abstract. In the scatterplot, two papers should be closer together if they have similar contents. You can also drag on the chart to filter…The system works by computing sentence embeddings on the papers' text, then applying dimensionality reduction on the embeddings to compute the scatterplot. You can even change the dimensionality reduction and embeddings algorithm and parameters…
On Singular value decomposition and its use in recommendation systems
In this post we will learn what Singular Value Decomposition is and how it can be used in recommender systems. SVD (Singular Value Decomposition) is one of the most beautiful equations of mathematics and a highlight of linear algebra. You can think of it as a data reduction tool. It’s an algorithm that you need to learn if you want to make money using linear algebra. In this post I will only assume you can calculate eigenvalues and eigenvectors :)…
When duckdb meets dplyr!
I like DuckDB 🦆. I am excited to see that it is now possible to use it with dplyr using the fantastic duckplyr package which gives us another way to bridge dplyr with DuckDB…In this short post, I will show how duckplyr can be used to query parquet files hosted on an S3 bucket. I will use the duckplyr_df_from_parquet() function to read the data and then use dplyr verbs to summarize the data…
I dislike Azure & 'low-code' software, is all Data Engineering like this? [Reddit]
I hate my workflow as a Data Engineer at my current company. Everything we use is Microsoft/Azure. Everything is super locked down. ADF is a nightmare... I wish I could just write and deploy code in containers but I'm stuck trying to shove cubes into triangle holes…It's torture to have such a slow workflow, no modern features, and we don't even have Git integration on our notebooks. Are all data engineer jobs like this? I have been thinking lately I must move to SWE so I don't lose my mind…I am at my wits' end... is DE just not for me?…
Why tree gradients give you a boost
Gradient boosted trees are a powerful machine learning technique that appears a lot in practice. There's a lot to like about their performance but there are plenty of other details under the hood that are worth having a deeper look at. That's why we're doing a long series of videos about these models, starting with this first one that's all about the intuition…
ChuXin: 1.6B Technical Report
In this report, we present ChuXin, an entirely open-source language model with a size of 1.6 billion parameters. Unlike the majority of works that only open-sourced the model weights and architecture, we have made everything needed to train a model available, including the training data, the training process, and the evaluation code…
OptPDE: Discovering Novel Integrable Systems via AI-Human Collaboration
Integrable partial differential equation (PDE) systems are of great interest in natural science, but are exceedingly rare and difficult to discover. To solve this, we introduce OptPDE, a first-of-its-kind machine learning approach that Optimizes PDEs' coefficients to maximize their number of conserved quantities, nCQ, and thus discover new integrable systems. We discover four families of integrable PDEs, one of which was previously known, and three of which have at least one conserved quantity but are new to the literature to the best of our knowledge…
Online Tutorial: AlphaFold - A practical guide
This tutorial is aimed at researchers who are interested in using AlphaFold2 to predict protein structures and integrate these predictions into their projects. An undergraduate-level knowledge of protein structure and structural biology would be an advantage. The content of this course provides an understanding of the fundamental concepts behind AlphaFold2, how users can run protein predictions and how AlphaFold2 has been used to enhance research…
Temporal autocorrelation in GAMs and the mvgam package
Generalized Additive Models (GAMs) are flexible tools that have found particular application in the analysis of time series data…Given the many ways that GAMs can model temporal data, it is tempting to extrapolate from their smooth functions to produce out of sample forecasts. Here I inspect how smoothing splines behave when extrapolating outside the training data to examine whether this can be useful in practice. I also discuss some of the pitfalls that tend to arise when trying to fit smoothing splines to temporally autocorrelated data. I then give a few solutions, most notably by making use of Bayesian Dynamic GAMs in the {mvgam} R package…
Empowering Biomedical Discovery with AI Agents
A long-standing ambition for biomedical AI is the development of AI systems that can make major scientific discoveries with the potential to be worthy of a Nobel Prize, fulfilling the Nobel Turing Challenge…While the concept of an “AI scientist” is aspirational, advances in agent-based AI pave the way to the development of AI agents as conversable systems eventually capable of skeptical learning and reasoning that coordinate LLMs, ML tools, experimental platforms, or even combinations of them…Rather than taking humans out of the discovery process, biomedical AI agents can combine human creativity and expertise with AI’s ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks…
High-level tools to simplify visualization in Python
HoloViz provides a set of Python packages that make viz easier, more accurate, and more powerful: Panel for making apps and dashboards for your plots from any supported plotting library, hvPlot to quickly generate interactive plots from your data, HoloViews to help you make all of your data instantly visualizable, GeoViews to extend HoloViews for geographic data, Datashader for rendering even the largest datasets, Lumen to build data-driven dashboards from a simple YAML specification, Param to create declarative user-configurable objects, and Colorcet for perceptually uniform colormaps…
Ten years of neuroscience at Google yields maps of human brain
Marking ten years of connectomics research at Google, we are releasing a publication in Science about a reconstruction at the synaptic level of a small piece of the human brain. We discuss the reconstruction process and dataset, and we present several new neuron structures discovered in the data…
Training & Resources
Video lectures, Harvard Economics 2355 Deep Learning for Economics spring 2023, by Melissa Dell
A vast number of important economic questions remain unanswered, in substantial part because the data required to examine them has traditionally been inaccessible. For example, much historical data remains trapped in hard copy. More broadly, information that could elucidate important questions is scattered throughout text, or contained in scans, photographs, videos, or audio files. This course will provide an introduction to deep learning-based methods and other data science tools that can process such sources on a massive scale. The course will cover natural language processing, computer vision, and multimodal methods. Topics in NLP include neural language modeling, topic and sentiment classification, text retrieval, named entity recognition, dependency parsing, and knowledge intensive NLP. Topics in computer vision include convolutional neural networks, vision transformers, object detection, document layout analysis, image classification, image retrieval, GANs, and OCR…
A gentle introduction to DSPy
We'll start by explaining what DSPy is and the key benefits it offers. Then, we'll walk through a simple example of using DSPy to translate text to "Grug speak." Along the way, we'll highlight important concepts like building datasets, defining prompts, and zero-shot prompting. Finally, we'll show you how to measure and optimize prompts automatically. We'll show how DSPy makes this process more data-driven and systematic, allowing us to fine-tune our models to achieve better results. By the end of this post, you'll have a solid understanding of how DSPy works and how to approach building basic DSPy programs…
Guest lectures on data science with R and Python in the web browser with WASM in Stanford University's STATS 352
These lectures delve into the world of dynamic interactions available through interactive documents by exploring the integration of web-based versions of R and Python within the Quarto framework. The dynamic capabilities of the Quarto publishing framework, coupled with in-browser versions of leading data science language distributions based on WebAssembly, offer a unique platform for real-time code execution, fostering interactive experiences in data analysis and scientific computing…
Last Week's Newsletter's 3 Most Clicked Links
What's the most practical thing you have done with ai? [Reddit]
How to Beat Proprietary LLMs With Smaller Open Source Models
* Based on unique clicks.
** Find last week's issue #545 here.
Cutting Room Floor
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~61,500 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian