Issue #480
February 02, 2023
DSW Newsletter 2.0
Hi Friends!
This newsletter, which we’ve been mailing out since November 28, 2013, is now hosted on Substack.
Here’s the deal:
The Thursday “Data Science Weekly” newsletter will not change and it will always remain free to anyone who wants to sign up.
There is now a paid subscription option. Subscribers will get access to:
Chats with recruitment agencies to get a sense of the who/what/where/why of data science hiring
Career chats & advice
Study groups
Resume feedback
Data Science Office Hours with practitioners
Ability to comment and join special discussion threads
Q&A with Podcast guests
Show & Tells where you can share your projects and get feedback (if you want)
We switched to Substack specifically for the ability to do the discussion threads and comments, so we’re excited to start having a more interactive newsletter.
The Thursday newsletter means a great deal to us; we love doing it and want to keep doing it indefinitely.
Thank you to everyone who has read the newsletter over the years and shared it far and wide.
If you’re new - welcome again! You’re in for a treat :)
If you’re not signed up, you can subscribe now, here:
Thanks again, and here's to another ~500 issues!
All our best,
Hannah & Sebastian
Editor's Picks
Data scientists work alone and that's bad
I found myself at my first data science job where I thought I knew everything after a single year. I saw how the systems had been built and thought, “Psshhh, I can do that”…I left to be the first data scientist at a seed-stage startup. I built some things and felt smart...The company was smarter, though, since they hired a boss for me...
Should You Measure the Value of a Data Team?
Data teams are sometimes asked to prove their ROI to senior leadership to justify a budget for new hires, tools, projects, or process changes…Often the reason for this ROI question isn’t rooted in a lack of proper metrics but rather in a lack of trust and relationships with stakeholders…This post summarizes key arguments from several blog posts, podcasts, and discussions in data communities, as well as from my own experience…
Have researchers given up on traditional machine learning methods? [Reddit Discussion]
I feel that most of the time when people talk about machine learning in the world today, they are referring to deep learning, but is this the same in the academic world? Have people who have been studying traditional methods switched to neural networks? I know that many researchers are excited about deep learning, but I am wondering what they think about other methods…
A Message from this week's Sponsor:
Tell us how MLOps is more than just tools and you could win $100
At Toloka, we’re exploring what MLOps culture looks like across the industry at the start of 2023.
A huge variety of tools are available for ML development, but the culture and practices still have some catching up to do.
How do you see MLOps evolving this year?
Share your thoughts in our 5-minute survey. We’ll follow up to share the research results and pick a random winner for a $100 Amazon certificate!
Data Science Articles & Videos
JupySQL: Better SQL in Jupyter
We forked ipython-sql (pip install jupysql) and are actively developing it to bring a modern SQL experience to Jupyter! We’ve already built some great features, such as SQL query composition and plotting for large-scale datasets!…
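If you haven't used ipython-sql-style magics before, here is a minimal notebook sketch of the basic workflow, assuming jupysql is installed and using an in-memory SQLite database as a stand-in (the jupysql-specific query-composition and plotting helpers mentioned above are covered in the project's own docs):

    # Jupyter notebook cells, shown together for brevity; assumes `pip install jupysql`.
    %load_ext sql                                                   # load the SQL magic
    %sql sqlite://                                                  # connect to an in-memory SQLite database
    %sql CREATE TABLE scores (name TEXT, score INTEGER)
    %sql INSERT INTO scores VALUES ('a', 10), ('b', 20), ('c', 30)

    %%sql
    SELECT name, score
    FROM scores
    WHERE score > 10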
Blueprints for recommender system architectures: 10th anniversary edition
Ten years ago, we published a post in the Netflix tech blog explaining our three-tier architectural approach to building recommender systems (see below). A lot has happened in the last 10 years in the recommender systems space for sure. That’s why, when a few months back I designed a Recsys course for Sphere, I thought it would be a great opportunity to revisit the blueprint…In this post I summarize 4 existing architectural blueprints, and present a new one that, in my opinion, encompasses all the previous ones…
Comparing Different Automatic Image Augmentation Methods in PyTorch
One of the best ways to reduce overfitting is to collect more (good-quality) data. However, collecting more data is not always feasible and can be very expensive. A related technique is data augmentation: generating new data records or features from existing data, expanding the dataset without collecting more data…This article compares four automatic image augmentation techniques in PyTorch: AutoAugment, RandAugment, AugMix, and TrivialAugment…
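As a quick sketch of what plugging one of these policies into a training pipeline looks like (assuming a recent torchvision, roughly 0.13 or later, where all four policies ship as transforms; note torchvision's class for TrivialAugment is TrivialAugmentWide):

    # Sketch: swapping automatic augmentation policies in a torchvision pipeline.
    # Assumes torchvision >= 0.13, where all four policies are available.
    from torchvision import transforms

    augmenters = {
        "autoaugment": transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
        "randaugment": transforms.RandAugment(num_ops=2, magnitude=9),
        "augmix": transforms.AugMix(),
        "trivialaugment": transforms.TrivialAugmentWide(),
    }

    def make_train_transform(name: str) -> transforms.Compose:
        """Build a training transform that plugs in one automatic augmentation policy."""
        return transforms.Compose([
            augmenters[name],                  # policy chosen from the table above
            transforms.ToTensor(),             # PIL image -> float tensor in [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                                 std=[0.229, 0.224, 0.225]),
        ])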
Using Computer Vision To Destroy My Childhood High Score in a DS Game
Generating pre-labeled data with Matplotlib, optimizing code for real-time performance, and training an object detection model to control a DS emulator and become an expert in playing the Super Mario 64 DS mini-game, “Wanted!”…
When to Build vs. Buy Your Data Warehouse (5 Key Factors)
Nishith Agarwal, Head of Data & ML Platforms at Lyra Health and creator of Apache Hudi, draws on his experiences at Uber and Lyra Health to share how his 5 considerations—cost, complexity, expertise, time to value, and competitive advantage—impact the decision to build vs. buy the data warehouse, data lake, and data lakehouse layers of your data stack…
Data Science meets classic poetry: a reinterpretation of Dante's Divine Comedy from the point of view of ML lifecycle
I used ChatGPT to augment my skills and reframe the most famous parts of the Divina Commedia in a Data Science and ML lifecycle style: a) Inferno -> Data, b) Purgatorio -> Modelling, c) Paradiso -> Production…
BirdFlow: Learning seasonal bird movements from eBird data
Large-scale monitoring of seasonal animal movement is integral to science, conservation and outreach. However, gathering representative movement data across entire species ranges is frequently intractable. Citizen science databases collect millions of animal observations throughout the year, but it is challenging to infer individual movement behavior solely from observational data...We present BirdFlow, a probabilistic modeling framework that draws on citizen science data from the eBird database to model the population flows of migratory birds…
Vincent Warmerdam: Calmcode, Explosion, Data Science | Learning From Machine Learning #2 [YouTube Video]
Learning from Machine Learning, a podcast that explores more than just algorithms and data: Life lessons from the experts…In this episode we welcome Vincent Warmerdam, creator of calmcode and machine learning engineer at Explosion (the company behind spaCy), to discuss data science, models, and much more…
Beyond Pandas — working with big(ger) data more efficiently using Polars and Parquet
As data scientists/engineers, we often deal with large datasets that can be challenging to work with. Pandas, a popular Python package for data manipulation, is great for small- to medium-sized datasets, but it can become slow and resource-intensive when working with larger datasets. In this article, we will discuss how using the Python package Polars and the Parquet file format can help improve the efficiency and scalability of your data workflow…
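As a rough sketch of the pattern described above (the file path and column names are hypothetical), the key difference is that Polars can scan Parquet lazily and only materialize what the query actually needs:

    # Sketch: eager pandas read vs. lazy Polars query over Parquet.
    # File path and column names (user_id, amount) are hypothetical; pandas'
    # read_parquet needs pyarrow or fastparquet installed.
    import pandas as pd
    import polars as pl

    # pandas: loads the whole file into memory before filtering and aggregating
    pdf = pd.read_parquet("events.parquet")
    pandas_result = pdf[pdf["amount"] > 0].groupby("user_id")["amount"].sum()

    # Polars: scan_parquet builds a lazy query plan; only the needed columns and
    # row groups are read when .collect() runs (use .groupby on older Polars releases)
    polars_result = (
        pl.scan_parquet("events.parquet")
        .filter(pl.col("amount") > 0)
        .group_by("user_id")
        .agg(pl.col("amount").sum())
        .collect()
    )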
SQL should be your default choice for data engineering pipelines
I became a data scientist and loved pandas and dplyr for their expressiveness and power. As a data engineer, I dabbled in PySpark. Most recently, I’ve returned to SQL for data manipulation…These alternative tools were developed to address deficiencies in SQL, and they are undoubtedly better in certain respects. But overall, I’m convinced that SQL is better in most circumstances, especially when working in a team or on long-term projects…This post makes the case for SQL. I’ll then suggest when other tools may be preferable, and finish by mentioning some future directions and new libraries to keep an eye on.
Simpson's Paradox and Existential Terror
I learned the truth about Simpson’s: it’s not Simpson’s, there’s no paradox, the accidental umbrella term can confuse different concepts, and it sounds science-y but hides the breakdown of the fact-value dichotomy when you get into actually doing “science before statistics.”…This post summarizes what I learned falling down this rabbit hole, so you don’t have to…
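If you haven't seen the reversal in the flesh, here is a tiny pandas illustration using textbook-style numbers: treatment A has the higher success rate within each severity stratum, yet treatment B comes out ahead once the strata are pooled:

    # Illustrative numbers only: A wins within each stratum, B wins pooled.
    import pandas as pd

    df = pd.DataFrame({
        "treatment": ["A", "A", "B", "B"],
        "severity":  ["mild", "severe", "mild", "severe"],
        "successes": [81, 192, 234, 55],
        "patients":  [87, 263, 270, 80],
    })

    # Per-stratum success rates: A is higher for both mild and severe cases
    per_stratum = df.assign(rate=df["successes"] / df["patients"])
    print(per_stratum[["treatment", "severity", "rate"]])

    # Pooled success rates: B comes out ahead once severity is ignored
    pooled = df.groupby("treatment")[["successes", "patients"]].sum()
    print(pooled["successes"] / pooled["patients"])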
Data Cleaning Plan
Data cleaning or data wrangling is the process of organizing and transforming raw data into a dataset that can be easily accessed and analyzed. A data cleaning plan is a written proposal outlining how you plan to transform your raw data into clean, usable data. It is different from a code file or even a pseudocode file in that a data cleaning plan contains no code or syntax…An example of a very simplified cleaning plan…
Tool*
Sync customer data from your warehouse to any SaaS tool with Hightouch
Hightouch is the leading Data Activation platform, powered by Reverse ETL. Sync customer data from your warehouse into the tools your business teams rely on.
Get started for free at app.hightouch.io, or book a demo to see how it can work for your team.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Data Scientist / Machine Learning Engineer - Epsilon - NYC
Epsilon's Strategy and Insights Data Sciences team is looking for a talented team player in a Data Scientist/Machine Learning Engineer role. You are an expert, mentor, and advocate. You have a strong machine learning and deep learning background and are passionate about transforming data into ML models. You welcome the challenge of data science and are proficient in Python, Spark MLlib, TensorFlow, Keras, ML algorithms, deep neural networks, and big data. You must be self-driven, take initiative, and want to work in a dynamic, busy, and innovative group...
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Learning Observable
Our definitive course for getting up and running on Observable.
My lab is moving to #JuliaLang, and I’ll be putting together some R => Julia tips for our lab and others who are interested. [Twitter Thread]
Here are a few starter facts…Julia draws inspiration from a number of languages, but the influence of R on Julia is clear…
Just know stuff. (Or, how to achieve success in a machine learning PhD)
Quite a few folks messaged me – mostly on Twitter or Mastodon – asking for advice on how to achieve success in a machine learning PhD. Each time my answer is: Just Know Stuff…Now, I don’t think “Just Know Stuff” is a terribly controversial opinion – undergraduate classes are largely based around imparting knowledge; the first year of a new PhD’s life is usually spent reading up on the literature – but from the number of questions I get, it would seem that this is something worth restating…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's newsletter here.
Cutting Room Floor
Large Language Models Can Be Easily Distracted by Irrelevant Context
Giant Steps: The Fractal Structure of John Coltrane’s Iconic Solo
French researchers converted their tax code into computer code
Career chat with Zhuang-Fang NaNa Yi (senior machine learning engineer)
Talk and demo to US Congress as part of an AI Caucus education session
Have an awesome week!
All our best,
Hannah & Sebastian
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.