[in case you missed it] Data Science Weekly - Issue 376
Issue #376 Feb 04 2021
Editor Picks
"Everyone wants to do the model work, not the data work":
Data Cascades in High-Stakes AI
Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI. In this paper, we report on data practices in high-stakes AI, from interviews with 53 AI practitioners in India, East and West African countries, and USA. We define, identify, and present empirical evidence on Data Cascades---compounding events causing negative, downstream effects from data issues---triggered by conventional AI/ML practices that undervalue data quality...
2011: DanNet triggers deep CNN revolution
In 2021, we are celebrating the 10-year anniversary of DanNet...the first pure deep convolutional neural network (CNN) to win computer vision contests...From 2011 to 2012 it won every contest it entered, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012), driven by a very fast implementation based on graphics processing units (GPUs). Remarkably, already in 2011, DanNet achieved the first superhuman performance in a vision challenge, although compute was still 100 times more expensive than today. In July 2012, our CVPR paper on DanNet hit the computer vision community. The similar AlexNet (citing DanNet) joined the party in Dec 2012...Today, a decade after DanNet, everybody is using fast deep CNNs for computer vision...
Data science notebooks | 2020 Review
We dug into the data to learn more about the current state of a vital part of the data science ecosystem: the notebooks...This article consists of 5 key sections: 1) First, we explore key stats and trends of Jupyter notebooks on GitHub, 2) Second, we double down on popular Python libraries and show you what libraries to add to your toolkit for plotting, ML, NLP, and other use cases, 3) Third, we look at search trends from Google & YouTube, 4) Data sources and how you can build on top of them, and 5) Conclusion & ideas for future research...
A Message from this week's Sponsor:
Become a Data Scientist Without Paying a Dime
One Week Left to Apply for TDI’s Spring Data Science Fellowship
Did you know that you can attend The Data Incubator’s Fellowship program without paying a dime until you land a job and are earning over a certain threshold?
Apply now to work with expert, live instructors and our career services team to help you land your next data science job—maybe with one of our exciting hiring partners.
Attend full-time for 8 weeks or part-time for 20 weeks, without paying anything until you’re working.
Applications close February 12.
Apply Now.
Data Science Articles & Videos
Machine Learning Street Talk #40: Adversarial Examples with Dr. Nicholas Carlini, Dr. Wieland Brendel, and Florian Tramèr [YouTube Video]
Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. There's good reason to believe neural networks look at very different features than we would have expected...Adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans...Adversarial examples don't just affect deep learning models. A cottage industry has sprung up around Threat Modeling in AI and ML Systems and their dependencies. Joining us this evening are some of the current leading researchers in adversarial examples...
Reinforcement Learning At Facebook with Jason Gauci [Podcast Episode]
Jason has worked on YouTube recommendations. He was an early contributor to TensorFlow, the open-source machine learning platform. His thesis work was cited by DeepMind...But what I find so fascinating with Jason is he recognized this problem that was being solved the wrong way and set out to find a solution to it...The problem was making recommendations, like on Amazon, people who bought this book might like that book. He didn’t exactly know how to solve the problem, but he knew it could be done better...So that’s the show today. Jason is going to share his story...
Machine learning accelerated computational fluid dynamics (CFD) [PDF]
Numerical simulation of fluids plays an essential role in modeling many physical phenomena, such as weather, climate, aerodynamics and plasma physics. Fluids are well described by the Navier-Stokes equations, but solving these equations at scale remains daunting, limited by the computational cost of resolving the smallest spatiotemporal features. This leads to unfavorable trade-offs between accuracy and tractability. Here we use end-to-end deep learning to improve approximations inside computational fluid dynamics for modeling two-dimensional turbulent flows...
A Complete Guide to Revenue Cohort Analysis
A Cohort Analysis is an extremely useful tool that allows you to gather insights pertaining to customer churn, lifetime value, product engagement, stickiness, and more...Cohort analyses are especially useful for improving user onboardings, product development, and marketing tactics. What makes cohort analyses so powerful is that they’re essentially a 3-dimensional visualization, where you can compare a value/metric across different segments over time...By the end of this article, you’ll learn how to create something like this...
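To make the "value across segments over time" idea concrete, here is a minimal sketch of a revenue cohort table in plain Python. It is not taken from the linked article: the order data, the `revenue_cohorts` helper, and the choice to define a cohort by a customer's first purchase month are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical orders: (customer_id, month_index, revenue).
orders = [
    ("a", 0, 100.0),
    ("a", 1, 50.0),
    ("b", 0, 80.0),
    ("b", 2, 40.0),
    ("c", 1, 60.0),
    ("c", 2, 30.0),
]


def revenue_cohorts(orders):
    """Sum revenue by (cohort month, months since first purchase)."""
    # Assumption: a customer's cohort is the month of their first order.
    first_month = {}
    for customer, month, _ in orders:
        first_month[customer] = min(month, first_month.get(customer, month))

    table = defaultdict(lambda: defaultdict(float))
    for customer, month, revenue in orders:
        cohort = first_month[customer]
        table[cohort][month - cohort] += revenue

    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {cohort: dict(periods) for cohort, periods in table.items()}


cohorts = revenue_cohorts(orders)
```

Each row of `cohorts` is one segment (the cohort), each column is time since joining, and each cell is the metric, which gives the three dimensions the blurb describes.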
Creating Master Data at Scale with AI
The Data Exchange Podcast: Sonal Goyal and Ben Lorica on master data, data preparation, entity resolution, data fusion, and more...
Data engineering Course Notes Draft (26 Pages) from Stanford's Machine Learning Systems Design Class [Google Doc]
Data systems, in and of themselves, are beasts. If you haven’t spent years and years digging through literature, it’s very easy to get lost in acronyms. There are many challenges and possible solutions—if you look into the data stack for different tech companies it seems like each is doing their own thing...In this lecture, we’ll cover the basics of data engineering...
Desirable streets: Where do people prefer to walk?
Every trip has a shortest route, from a to b...But on average, pedestrians choose to walk around 10% farther than their shortest path...Why don't people always take the shortest route?...
Data Observability: Building Data Quality Monitors Using SQL
In this article series, we walk through how you can create your own data observability monitors from scratch, mapping to five key pillars of data health. Part 1 of this series was adapted from Barr Moses and Ryan Kearns’ O’Reilly training, Managing Data Downtime: Applying Observability to Your Data Pipelines, the industry’s first-ever course on data observability...
spaCy 3.0 release notes
v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more...
A Data Scientist’s Guide to Lazy Evaluation with Dask
Dask is an open-source framework that enables parallelization of Python code...We talk a lot about Dask and parallel computing, but sometimes we don’t do enough to explain the concepts that make it possible. Read on to learn how lazy evaluation works, how Dask uses it, and how it makes parallelization not only possible but easy!...
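As a rough illustration of lazy evaluation in general, the sketch below defers function calls until a result is requested. The `Lazy` class is a hypothetical stand-in written for this example; it is not Dask's API and it runs tasks serially, whereas a real scheduler could run independent tasks in parallel.

```python
class Lazy:
    """Hypothetical deferred task: stores a call instead of running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any lazy arguments first, then call the function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records when real work actually happens


def inc(x):
    calls.append(("inc", x))
    return x + 1


def add(x, y):
    calls.append(("add", x, y))
    return x + y


# Building the task graph runs nothing yet.
total = Lazy(add, Lazy(inc, 1), Lazy(inc, 2))
calls_before_compute = len(calls)

# Work happens only when the result is requested.
result = total.compute()
```

Because the two `inc` tasks do not depend on each other, a scheduler that sees the whole graph before running it is free to execute them at the same time, which is the link between laziness and parallelism.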
Training*
Quick Question For You: Do you want a Data Science job?
After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:
Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Data Scientist - Apple Pay Analytics - NYC
You will play a key role improving the Apple Pay product experience. As a member of the analytics team you will be supporting a product function. You will partner with business owners, understand goals, craft KPIs and measure ongoing performance. You will initially engage with the product and engineering teams in ensuring that we have the appropriate instrumentation in place to deliver on these metrics. You will subsequently use advanced statistical, ML and analytical techniques to analyze product performance and identify key insights that inform product improvements and business strategy. The role requires a high degree of independence, ownership and collaboration working cross functionally across all levels of a highly matrixed organization...
Want to post a job here? Email us for details >> team@datascienceweekly.org
Training & Resources
Introduction to ExPanDaR: Exploring panel data with R
Exploring panel data with R: An introduction to the ExPanDaR package...
Weave.jl - Scientific Reports Using Julia
This is the documentation of Weave.jl. Weave is a scientific report generator/literate programming tool for Julia. It resembles Pweave, knitr, R Markdown, and Sweave...
Full Stack Deep Learning - Lecture #1
In this video, we discuss the fundamentals of deep learning. We will cover artificial neural networks, the universal approximation theorem, three major types of learning problems, the empirical risk minimization problem, the idea behind gradient descent, the practice of back-propagation, the core neural architectures, and the rise of GPUs...
Books
Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian