Data Science Weekly - Issue 481

Curated news, articles and jobs related to Data Science.

Data Science Weekly

Feb 09, 2023

Editor's Picks

On The Road: An Interactive Map of Jack Kerouac's journey
And this was really the way that my whole road experience began and the things that were to come are too fantastic not to tell…

Big Data is Dead
The world in 2023 looks different from when the Big Data alarm bells started going off. The data cataclysm that had been predicted hasn’t come to pass. Data sizes may have gotten marginally larger, but hardware has gotten bigger at an even faster rate. Vendors are still pushing their ability to scale, but practitioners are starting to wonder how any of that relates to their real world problems…

Practicing AI research
I’m not a particularly experienced researcher, but I’ve worked with some talented collaborators and spent a fair amount of time thinking about how to do research, so I thought I might write about how I go about it…My perspective is this: doing research is a skill that can be learned through practice, much like sports or music…The way I decompose research is into four skills: (1) idea conception and selection, (2) experiment design and execution, (3) writing the paper, and (4) maximizing impact. In other words, what differentiates good and bad researchers is these four skills…

A Message from this week's Sponsor:

Pinecone vector database

The Pinecone vector database makes it easy to build high-performance vector search applications. Developer-friendly, fully managed, and easily scalable without infrastructure hassles.

Use Pinecone to build semantic search, object recognition, recommendations, anomaly detection, and other vector-based functionality into your applications.

Data Science Articles & Videos

Your guide to AI: February 2023
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering key developments in AI research, industry, geopolitics and startups during January 2023. This one is a monster one…

What is the hype around duckDB? [Reddit Discussion]
Now that we have moved to data lakes and data lakehouse-based architecture, why should I care about an OLAP DB at this point in the data ecosystem? I seem to have missed the memo on this one lol…

NLP Tips and Tricks [Video]
A lovely collection of practical advice on, e.g., UMAP visualization, entity deduping, string grouping, spaCy, and more...

How Data Analytics can benefit your business: A guidebook
Data Analytics can transform your business in three fundamental ways: by tracking performance you can get an accurate picture of your entire organization, through on-demand analytics you can problem solve specific issues using data-based insights curated by your data analytics team, and using end-to-end analytics you can proactively build data solutions that anticipate issues and provide not just a snapshot, but deep understanding…This playbook provides an in-depth analysis of these advantages and sketches out how they could impact your business…

Reflecting On The Past 6 Years Of the Data Engineering Podcast
This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes…

Copilot Internals
GitHub Copilot has been incredibly useful to me…I was curious about how it worked, so I decided to take a look at the source code…In this post, I try to answer specific questions about the internals of Copilot, while also describing some interesting observations I made as I combed through the code. I will provide pointers to the relevant code for almost everything I talk about, so that interested folks can take a look at the code themselves…

SQL, Malloy, and the Art of the Renaissance
This post is the third in a series comparing SQL with a promising new query language called Malloy. I think Malloy represents a leap forward in how we work with data, and in the following paragraphs, I'll attempt to draw a connection to another time in history of rapid technological advancement: the Renaissance in Western Europe…

Relative representations enable zero-shot latent space communication
Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations…we empirically observe that, under the same data and modeling choices, distinct latent spaces typically differ by an unknown quasi-isometric transformation: that is, in each space, the distances between the encodings do not change…In this work, we propose to adopt pairwise similarities as an alternative data representation, that can be used to enforce the desired invariance without any additional training…
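
The idea of representing each sample by its similarities to a fixed set of anchor points can be illustrated with a small sketch. This is not the paper's code: it assumes cosine similarity as the pairwise measure, uses made-up 2-D embeddings, and the helper names (`relative_representation`, `rotate_and_scale`) are hypothetical. The second "latent space" is simply the first one rotated and rescaled, standing in for the unknown transformation between two training runs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(embedding, anchors):
    """Describe an embedding by its similarity to each anchor embedding."""
    return [cosine(embedding, anchor) for anchor in anchors]


def rotate_and_scale(vec, angle, scale):
    """Apply a 2-D rotation and uniform rescaling (an angle-preserving map)."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return [scale * (c * x - s * y), scale * (s * x + c * y)]


# Toy latent space A: one sample and three anchors (illustrative values only).
sample_a = [1.0, 2.0]
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy latent space B: the same points after an arbitrary rotation and rescaling.
angle, scale = 0.7, 3.0
sample_b = rotate_and_scale(sample_a, angle, scale)
anchors_b = [rotate_and_scale(a, angle, scale) for a in anchors_a]

rel_a = relative_representation(sample_a, anchors_a)
rel_b = relative_representation(sample_b, anchors_b)

print("relative representation in space A:", rel_a)
print("relative representation in space B:", rel_b)
```

The absolute coordinates of the two spaces differ, but the two relative representations agree up to floating-point error, which is the kind of invariance the abstract describes obtaining without additional training.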

Prodigy OpenAI recipes
This repository contains example code on how to combine zero- and few-shot learning with a small annotation effort to obtain a high-quality dataset with maximum efficiency. Specifically, we use large language models available from OpenAI to provide us with an initial set of predictions, then spin up a Prodigy instance on our local machine to go through these predictions and curate them. This allows us to obtain a gold-standard dataset pretty quickly, and train a smaller, supervised model that fits our exact needs and use-case…

Data as a Product vs. Data as a Service
There are two broad mandates that data teams tend to get formed with (I’m being overly simplistic on purpose): 1) Provide data to the company 2) Provide insights to the company…These might sound similar — and they’re certainly both important — but they necessitate completely different skillsets…In fact, I’m going to argue that the conflation of these two objectives is exactly what kills good data talent and confounds the hiring process…

Understanding the Self-Attention Mechanism of Large Language Models From Scratch
In this article, we are going to understand how self-attention works from scratch. This means we will code it ourselves one step at a time…Since its introduction via the original transformer paper (Attention Is All You Need), self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in the field of Natural Language Processing (NLP). Since self-attention is now everywhere, it’s important to understand how it works…
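
For readers who want a feel for the mechanism before opening the article, here is a minimal sketch of single-head scaled dot-product self-attention as defined in Attention Is All You Need. It is not the article's code: it uses plain Python lists instead of a tensor library, and the toy inputs and identity projection matrices are assumptions chosen only to keep the numbers easy to follow.

```python
import math


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is a (tokens x d_model) matrix; w_q, w_k, w_v project it to
    queries, keys and values. Returns (outputs, attention_weights).
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)  # token-to-token similarity
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: three tokens with 2-D embeddings (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, not learned weights

outputs, weights = self_attention(x, identity, identity, identity)

for token_index, row in enumerate(weights):
    print("token", token_index, "attends with weights", row)
```

Each row of `weights` sums to one, and each output row is the corresponding weighted average of the value vectors. In a trained model the three projection matrices are learned rather than fixed to the identity.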

Unleashing ML Innovation at Spotify with Ray
Our current platform experience is heavily weighted towards a single user journey: an ML engineer using TensorFlow/TFX for supervised learning production applications…To better support our target market of a broader range of constituents, we need to lower the barrier to entry and embrace more diverse ML tooling while maintaining scalability and performance in end-to-end ML workflows…Introducing Ray…After extensive prototyping and investigation, we believe Ray addresses those needs. Ray is an open-source, unified framework for scaling AI and Python applications…

Tool*

Sync customer data from your warehouse to any SaaS tool with Hightouch

Hightouch is the leading Data Activation platform, powered by Reverse ETL. Sync customer data from your warehouse into the tools your business teams rely on.

Get started for free at app.hightouch.io, or book a demo to see how it can work for your team.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist / Machine Learning Engineer - Epsilon - NYC

Epsilon Strategy and Insights, Data Sciences team is looking for a talented team player in a Data Scientist/Machine Learning Engineer role. You are an expert, mentor and advocate. You have a strong machine learning and deep learning background and are passionate about transforming data into ML models. You welcome the challenge of data science and are proficient in Python, Spark MLlib, TensorFlow, Keras, ML algorithms and Deep Neural Networks, Big Data. You must be self-driven, take initiative and want to work in a dynamic, busy and innovative group...

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

MIT 6.S098: Intro to Applied Convex Optimization
Just finished teaching a 4 week convex optimization course during MIT's IAP. All lecture notes are available online…

Beginner's Crash Course to Elastic Stack
In this season, we are building a full stack JavaScript app with Elasticsearch and visualizing data with Kibana Lens…

“Machine Learning Design Patterns” Book Notes
This book is all about patterns for doing ML. It's broken up into several key parts, building and serving. Both of these are intertwined, so it makes sense to read through the whole thing; there are many good pieces of advice from seasoned professionals. The parts you can safely ignore relate to anything where they specifically use GCP. The other issue with the book is that it's very heavily focused on deep learning cases. Not all modeling problems require these. Regardless, let's dive in. I've included the stuff that was relevant to me in the notes…

Last Week's Newsletter's 3 Most Clicked Links

Data scientists work alone and that's bad

Have researchers given up on traditional machine learning methods? [Reddit Discussion]

Should You Measure the Value of a Data Team?

* Based on unique clicks.
** Find last week's newsletter here.

Cutting Room Floor

Have an awesome week!

All our best,
Hannah & Sebastian

Follow us on Twitter

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
