Data Science Weekly - Issue 578
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #578
December 19, 2024
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
Editor's Picks
- Quick software tips for new ML researchers 
 Just a quick set of tips for new ML researchers working in Python who are likely self-taught and haven't had a mentor to guide them on best practices. They're small, easy, and will generally improve your productivity and level of professionalism. I wrote this up for a class intended to teach DL to engineers without an ML or software background and thought I'd share it…
- Archetypes of LLM apps 
 I recently returned from a trip to San Francisco. While there, I presented to the innovation group of a large insurance company about how startups are applying AI. This post shares that presentation. In addition to the written version, I've recorded an audio version of it here, too…
- Beyond Jupyter 
 Beyond Jupyter is a collection of self-study materials on software design, with a specific focus on machine learning applications, which demonstrates how sound software design can accelerate both development and experimentation…
Sponsor Message
Quadratic - analyze anything, host anywhere
With Quadratic, combine the spreadsheets your organization asks for with the code that matches your team’s code-driven workflows.
Powered by code, you can build anything in Quadratic spreadsheets with Python, JavaScript, or SQL, all approachable with the power of AI.
Use the data tool that actually aligns with how your team works with data, from ad-hoc to end-to-end analytics, all in a familiar spreadsheet.
Level up your team’s analytics with Quadratic today
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
- History of Residuals and a Word of Caution 
 A residual in deep learning is y := f(x) + x. People on Twitter recently started reporting some success with y := λ1 f(x) + λ2 x for a new residual connection between Transformer's value activations, where λ1/λ2 are learnable scalars initialized to 0.5. That's cool! But it also reminds me of some lore…
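 To make the two formulas in this blurb concrete, here is a minimal sketch in plain Python. It is not taken from the linked post: the `double` function is a hypothetical stand-in for a layer f, and λ1/λ2 are fixed floats here, whereas the blurb describes them as learnable scalars initialized to 0.5.

```python
def residual(f, x):
    """Classic residual connection: y = f(x) + x, applied elementwise."""
    return [fx + xi for fx, xi in zip(f(x), x)]


def weighted_residual(f, x, lam1=0.5, lam2=0.5):
    """Weighted variant: y = lam1 * f(x) + lam2 * x.

    Assumption: lam1 and lam2 are plain floats in this sketch. In a real
    model they would be trainable parameters updated by the optimizer.
    """
    return [lam1 * fx + lam2 * xi for fx, xi in zip(f(x), x)]


def double(x):
    """Hypothetical stand-in for a layer f; it just doubles its input."""
    return [2.0 * xi for xi in x]


x = [1.0, -2.0, 3.0]
y_plain = residual(double, x)              # 2x + x = 3x
y_weighted = weighted_residual(double, x)  # 0.5 * 2x + 0.5 * x = 1.5x
print(y_plain, y_weighted)
```

 With both scalars at 0.5, the weighted form starts out as half the plain residual output, so the two only diverge in shape once the scalars are trained.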
- What is the purpose of CI/CD and what would happen if we didn’t use it? [Reddit Discussion] 
 I'm currently learning about CI/CD, and I’m still trying to understand its practical benefits. From what I know, it helps detect errors in the code more quickly for example. However, I'm unsure of the value of using CI/CD if I’m already committing changes to GitHub. I know this might sound stupid, but as a beginner, I’d really appreciate any examples that could help clarify its usefulness…
- Implementing the Raft distributed consensus algorithm 
 Raft is a relatively new algorithm (2014), but it's already being used quite a bit in industry…The best known example is probably Kubernetes, which relies on Raft through the etcd distributed key-value store. The goal of this series of posts is to describe a fully functioning and rigorously tested implementation of Raft, and provide some intuition for how Raft works along the way…
- When precision equals recall 
 Precision can actually be equal to recall. For balanced datasets it can even be pretty common! But understanding when this happens may also help you understand both metrics a bit more…
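 As a quick illustration (ours, not the linked post's): precision is TP / (TP + FP) and recall is TP / (TP + FN), so whenever there is at least one true positive, the two are equal exactly when the number of false positives matches the number of false negatives. The labels below are made-up toy data, and `precision_recall` is a hypothetical helper written for this sketch.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, fp, fn


# Toy, balanced example: one false positive and one false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision, recall, fp, fn = precision_recall(y_true, y_pred)
print(precision, recall, fp, fn)
```

 Here TP = 2, FP = 1, and FN = 1, so both metrics come out to 2/3.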
- S3 Is the New SFTP 
 Modern data processing is centralizing around data lakehouses using S3, Apache Iceberg, Apache Parquet, and data lake query engines such as DuckDB and Trino. SaaS vendors can upload Parquet files to a shared S3 bucket managed by an Iceberg catalog. Customers can then query the files simply by adding an external table to their existing data warehouse or by querying the data directly using a data lake query engine…
- Pipelines & Prompt Optimization with DSPy 
 It’s true: you spend much, much less time prompting when you use DSPy to build LLM-powered applications. Because you let DSPy handle that bit for you. There’s something really clean and freeing about ceding the details and nuance of the prompt back to an LLM. Let’s quickly walk through how DSPy handles prompting for you and step through a simple categorization task as an example…
- AI Engineering Resources 
 During the process of writing AI Engineering [book], I went through many papers, case studies, blog posts, repos, tools, etc. The book itself has 1200+ reference links and I've been tracking 1000+ generative AI GitHub repos. This document contains the resources I found the most helpful to understand different areas…
- Multimodal data with LanceDB on TalkPython [Podcast]
 LanceDB is a developer-friendly, open source database for AI. It's used by well-known companies such as Midjourney and Character.ai. We have Chang She, the CEO and cofounder of LanceDB on to give us a look at the concept of multi-modal data and how you can use LanceDB in your own Python apps…
- What projects are you working on and what is the benefit of your efforts? [Reddit Discussion]
 I would really like to hear what you guys are working on, challenges you’re facing and how your project is helping your company. Let’s hear it…
- Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL 
 The rapid growth of Earth observation data in cloud storage, which will continue to grow exponentially as companies like SpaceX drive down rocket launch prices, has pushed us to think about how we access and analyze satellite imagery… With major space agencies like ESA and NASA adopting Cloud-Optimized GeoTIFFs (COGs) as their standard format, we're seeing unprecedented volumes of data becoming available through public cloud buckets. This accessibility brings new challenges around efficient data access… In this article, we introduce an alternative approach to cloud-based raster data access, building upon the foundational work of GDAL and Rasterio…
- A Comprehensive Guide to Evaluating LLM-Generated Knowledge Graphs 
 Creating Knowledge Graphs using Large Language Models (LLMs) is powerful but requires robust evaluation. This guide shows you how to:
   - Build a production-ready evaluation pipeline for LLM-driven knowledge graphs
   - Implement automated quality checks using LLMs as judges
   - Set up comprehensive tracing and monitoring
   - Use industry-standard metrics to measure graph quality
   - Handle entity resolution and graph integration…
- Late data arrival [Reddit Discussion] 
 I get asked this very often in interviews… How do you handle late arrival of data? E.g., if a batch job has already run and late data arrives for that job, how do you handle scenarios like this? How do you guys handle this in your products?…
- A Breakdown of Romania's Engines of Development, County by County 
 This interactive material offers a detailed perspective on Romania’s economic and social evolution over the past 30 years. By analyzing 16 key indicators—from demography to education, from economy to health—we show how Romania has transformed in all the aspects that matter in a citizen’s life. Thus, through interactive maps and charts, we have constructed the most detailed analysis of Romania to date. This helps us understand not only the transformations that have brought us here, for better or worse, but also the challenges that will define the coming decades…
Last Week's Newsletter's 3 Most Clicked Links
- Is studying Data Science still worth it? [Reddit Discussion] 
- Taming LLMs - A Practical Guide to LLM Pitfalls with Open Source Software 
- LLM Agents in Production: Architectures, Challenges, and Best Practices 
* Based on unique clicks.
** Find last week's issue #577 here.
Whenever you're ready, 3 ways we can help:
- Learning something for your job? Reply and we’ll make a mini-course for you for free. 
- Looking to get a job? Check out our “Get A Data Science Job” Course 
 It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
- Promote yourself/organization to ~64,300 subscribers by sponsoring this newsletter. 35-45% weekly open rate. 
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian


