Data Science Weekly - Issue 552
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #552
June 20, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
notebooks are McDonalds of code
You can come to McDonalds and order a salad, but you won't. Same with notebooks, you can write NASA-production-grade software in a notebook, but most likely you won't. Notebooks make you lazy, and encourage bad practices…
Lessons Learned from Scaling to Multi-Terabyte Datasets
This post is meant to guide you through some of the lessons I’ve learned while working with multi-terabyte datasets…I’ve divided this post into two sections: scaling on single machines and multi-machine scaling. The goal is to maximize your available resources and reach your goals as quickly as possible…
Beyond the Basics of Retrieval for Augmenting Generation
LLMs are powerful, but have limitations: their knowledge is fixed in their weights, and their context window is limited. Worse: when they don’t know something, they might just make it up. RAG, for Retrieval Augmented Generation, has emerged as a way to mitigate both of those problems. However, implementing RAG effectively is more complex than it seems. The nitty gritty parts of what makes good retrieval good are rarely talked about: No, cosine similarity is, in fact, not all you need. In this workshop, we explore what helps build a robust RAG pipeline, and how simple insights from retrieval research can greatly improve your RAG efforts. We’ll cover key topics like BM25, re-ranking, indexing, domain specificity, evaluation beyond LGTM@few, and filtering. Be prepared for a whole new crowd of incredibly useful buzzwords to enter your vocabulary…
A Message from this week's Sponsor:
Online Data Science Programs from Drexel University
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Measuring data rot: An analysis of the continued availability of shared data from a Single University
To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. 2166 supplemental data links were harvested from the university’s institutional repository and web scraped…
Making a bignum library for fun
What happens when numbers get too big for a computer to work with?…I've always wanted to know how these bignum libraries work, so this is my adventure in learning about them…Since Python supports them, I thought I'd just go learn from CPython's implementation. It can be found in longobject.c. Nearly 6000 lines and 100 functions…So, I built it and wrote up what I learned!
A Long Guide to Giving a Short Academic Talk
I once asked a junior professor how—given the need to publish, teach, and meet with students—he had time to keep up with all of the new scholarship coming out each month. His response: he didn’t. He said he kept up with new work primarily by attending talks, workshops, and conferences…I share this long-winded windup that has little to do with talks for a reason—because this was the moment I realized that if you want others to read your work, you cannot simply publish it and assume others will find (and cite) it. You need to sell it…
Advice from senior Data Engineers to junior DEs [Reddit Discussion]
Fellow Senior DEs of this sub,
If you would like to give advice to junior DEs, what would it be?
Looking back, what mistakes do you think you should have avoided when you were beginners?
What do you think is the best way to advance up the DE ladder in a short amount of time?
How can one start their DE journey when there are so many resources and tools out there?
What tools should one master?
What kind of projects should one work on in the beginning to clear their concepts?
Any guidance of yours that could help junior DEs immensely will be appreciated!…
Advanced Retrieval-Augmented Generation Techniques
In this talk, I will present some of the latest advances in retrieval-augmented generation (RAG) techniques, which combine the strengths of both retrieval-based and generative approaches for chatbot development…I will cover the following topics: 1. Hybrid search with vector databases…2. Query generation using LLMs…3. Automatically excluding irrelevant search results…4. Re-ranking…5. Chunking Techniques…I will demonstrate the effectiveness of these advanced techniques in the RAG workflow…
Bayesian transition models for ordinal longitudinal outcomes
Ordinal longitudinal outcomes are becoming common in clinical research, particularly in the context of COVID-19 clinical trials. These outcomes are information-rich and can increase the statistical efficiency of a study when analyzed in a principled manner. We present Bayesian ordinal transition models as a flexible modeling framework to analyze ordinal longitudinal outcomes. We develop the theory from first principles and provide an application using data from the Adaptive COVID-19 Treatment Trial (ACTT-1) with code examples in R. We advocate that researchers use ordinal transition models to analyze ordinal longitudinal outcomes when appropriate alongside standard methods such as time-to-event modeling…
How Narwhals and scikit-lego came together to achieve dataframe-agnosticism
Recently, because Polars has been gaining traction, the library maintainers have been looking for a simple way to support multiple dataframe implementations…The project still wants to support pandas, but it would be a shame if newer dataframes couldn't be supported. Enter Narwhals. As of scikit-lego 0.9.0, you can also pass dataframes from Polars, Modin, cuDF, and in principle a whole host of other dataframes…But, how does it work? Why did they use Narwhals? Why don't they just convert to pandas internally on the user's behalf?…We'll start by answering these questions, and will end with some hopes for the future…
A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges
Recent advances in large language models (LLMs) have unlocked novel opportunities for machine learning applications in the financial domain…In this survey, we explore the application of LLMs on various financial tasks, focusing on their potential to transform traditional practices and drive innovation. We provide a discussion of the progress and advantages of LLMs in financial contexts, analyzing their advanced technologies as well as prospective capabilities in contextual understanding, transfer learning flexibility, complex emotion detection, etc. We then highlight this survey for categorizing the existing literature into key application areas, including linguistic tasks, sentiment analysis, financial time series, financial reasoning, agent-based modeling, and other applications…
Why Your SSD (Probably) Sucks and What Your Database Can Do About It (When are SSDs slow?)
Database system developers have a complicated relationship with storage devices: They can store terabytes of data cheaply, and everything is still there after a system crash. On the other hand, storage can be a spoilsport by being slow when it matters most…This blog post shows
how SSDs are used in database systems,
where SSDs have limitations,
and how to get around them…
How to Use AI to Create Role-Play Scenarios for Your Students
We’ve found GPT-4 class models particularly effective in creating role-play scenarios. This could be a student in a negotiation class taking on the role of a seller in a high-stakes negotiation or a student in an entrepreneurship class acting as a startup founder pitching a business idea…Here we explain how you can create a role-play scenario with generative AI, using our negotiation prompt as an example. We share guidance on how to take our prompt and adapt it for your class, along with instructions on how to introduce this exercise in your own classroom…
Data science "volunteering"? [Reddit Discussion]
Uncommon question here. I would like to do some volunteering but am quite bad with human interactions. Does there exist something (idk site, platform) in which you can do ethical data science activity for a good cause?…
Patterns for Building LLM-based Systems & Products
This write-up is about practical patterns for integrating large language models (LLMs) into systems & products. We’ll build on academic research, industry resources, and practitioner know-how, and distill them into key ideas and practices.
There are seven key patterns. They’re also organized along the spectrum of improving performance vs. reducing cost/risk, and closer to the data vs. closer to the user.
Evals: To measure performance
RAG: To add recent, external knowledge
Fine-tuning: To get better at specific tasks
Caching: To reduce latency & cost
Guardrails: To ensure output quality
Defensive UX: To anticipate & manage errors gracefully
Collect user feedback: To build our data flywheel…
Training & Resources
Introduction to Software Development Tooling
We are teaching this course as a series of four modules. The first, Command Line, will give you an overview of the operating system and tools available to you to make your life as a software engineer easier. The second, VCS, will teach you about how to version control your software with Git, so that you may maintain a history of your changes and collaborate with others. The third, Build, will teach you about how to reliably build your software with Make. The fourth and final module, Correctness, will introduce you to testing and other tools for ensuring that your software meets quality standards…
[Free Course] Reactive Web Dashboards with Shiny Course
There are dozens of ways to build web applications in Python. You can use Django or FastAPI with a JavaScript front-end, or build a simple dashboard using a tool like Streamlit. However, almost all of the Python web app frameworks are event-driven and require you to manually manage callback functions and application state…Shiny uses transparent reactive programming to let you build efficient dashboards and applications without the headaches. Shiny automatically detects the relationships between application components and uses those relationships to minimally re-render the app. In this course you will learn how to use Shiny to build a simple dashboard, and also the essential concepts which will allow you to build more ambitious applications in Shiny…
Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #551 here.
Cutting Room Floor
Data package for the data sets from the book "A Handbook of Small Data Sets" by David Hand (1994)
Milos Makes Maps - I paint the world with R and teach you how to unleash your inner map artist
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian