[in case you missed it] Data Science Weekly - Issue 374
Issue #374 Jan 21 2021
Editor Picks
Controlled Experiments - Why Bother?
I spent some time earlier this year orchestrating a massive experiment for Firefox. We launched a bunch of new features with Firefox 80 and we wanted to understand whether these new features improved our metrics...In the process, I ended up talking with a bunch of Firefox engineers and explaining why we need to run a controlled experiment. There were a few questions that got repeated a lot, so I figure it's worth answering them here...This article is targeted at new data scientists or engineers interested in data...
Deep Learning in the Sciences
In this episode of the Data Exchange Podcast I speak with Bharath (“Bart”) Ramsundar, author and open source developer. While in graduate school, Bart created DeepChem, an open source project that aims to democratize deep learning for science. DeepChem historically was developed for researchers in the life sciences, so the working examples in its tutorials draw from areas like chemistry and bioinformatics...
AI in Drug Discovery 2020 - A Highly Opinionated Literature Review
In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020...This list reflects a few interesting trends I saw this year...a) More of a practical focus on active learning, b) Efforts to address model uncertainty, as well as the admission that it's a very difficult problem, c) The (re)emergence of molecular representations that incorporate 3D structure, d) Several interesting strategies for data augmentation, e) Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem, and f) The application of generative models to more practical objectives (e.g. not LogP and QED)...
A Message from this week's Sponsor:
New Year, New Career
Jumpstart Your Career When You Apply to TDI’s Spring Data Science Fellowship Program
With The Data Incubator’s data science fellowship program, you’ll work closely with our expert instructors to master the in-demand data skills and programs you need to conquer the business world.
Our career service team will help you land a great job in data. And with our income sharing agreements, you won’t pay a cent in tuition until you get that job.
Attend full-time or part-time. Applications close on February 12.
Apply Now.
Data Science Articles & Videos
Machine Learning Models are Missing Contracts
Why pretrained machine learning models are often unusable and irreproducible — and what we can do about it...A useful approach to designing software is through contracts. For every function in your codebase, you start by writing its contract: clearly specifying what inputs are expected and valid for that function (the precondition), and what the function will do (the postcondition) when provided an appropriate input...
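To make the precondition/postcondition idea concrete, here is a minimal sketch in Python. It is not taken from the article: the function name `normalize`, the assumed input range of 0 to 255, and the choice to raise `ValueError` on bad input are all illustrative assumptions.

```python
from typing import List, Sequence


def normalize(pixels: Sequence[float]) -> List[float]:
    """Scale raw pixel intensities to the range [0, 1].

    Precondition: `pixels` is non-empty and every value lies in [0, 255].
    Postcondition: the result has the same length as the input and
    every value lies in [0, 1].
    """
    # Precondition checks: reject inputs the function was not designed for.
    if len(pixels) == 0:
        raise ValueError("pixels must be non-empty")
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel values must lie in [0, 255]")

    scaled = [p / 255.0 for p in pixels]

    # Postcondition checks: confirm the promise made to the caller.
    assert len(scaled) == len(pixels)
    assert all(0.0 <= s <= 1.0 for s in scaled)
    return scaled
```

A caller who passes already-normalized or out-of-range data gets an immediate error rather than silently wrong predictions downstream, which is the kind of failure a written contract is meant to surface.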
A Visual History of Interpretation for Image Recognition
In this piece, we provide an overview of the interpretation methods invented for image recognition, discuss their tradeoffs and provide examples and code to try them out yourself...
Making sense of sensory input
This paper attempts to answer a central question in unsupervised learning: what does it mean to “make sense” of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the causal theory – objects, properties, and laws – must be integrated into a coherent whole. On our account, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis...
Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation
We focus on studying the discrepancy of neural networks during the training process that has arisen purely from randomizations. We ask the following questions: besides this small deviation in test accuracies, do the neural networks trained from different random initializations actually learn very different functions? If so, where does the discrepancy come from? How do we reduce such discrepancy and make the neural network more stable or even better? These questions turn out to be quite nontrivial, and they relate to the mysteries of three techniques widely used in deep learning...
Predicting drive failure & an introduction to machine learning
We’ve all had a hard drive fail on us, and often it’s as sudden as booting your machine and realizing you can’t access a bunch of your files. It’s not a fun experience. It’s especially not fun when you have an entire data center full of drives that are all important to keeping your business running. What if we could predict when one of those drives would fail, and get ahead of it by preemptively replacing the hardware before the data is lost? This is where the history of predictive drive failure...begins...
ZeRO-Offload: Democratizing Billion-Scale Model Training
Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to CPU...
Worried about your firm’s AI ethics? These startups are here to help:
A growing ecosystem of “responsible AI” ventures promises to help organizations monitor and fix their AI models...
How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired
When Facebook users scroll through their News Feed, they find all kinds of content...many users who are blind or visually impaired (BVI) can also experience that imagery, provided it’s tagged properly with alternative text...Unfortunately, many photos are posted without alt text, so in 2016 we introduced a new technology called automatic alternative text (AAT)...The latest iteration of AAT ...makes it possible to include information about the positional location and relative size of elements in a photo. So instead of describing the contents of a photo as “May be an image of 5 people,” we can specify that there are two people in the center of the photo and three others scattered toward the fringes, implying that the two in the center are the focus...
A retrospective of NeurIPS 2020
An incredible 23,000 people virtually attended the 2020 Conference on Neural Information Processing Systems, a highly regarded machine learning conference. Here you will find my personal, quite random, and definitely incomplete retrospective. Some of my favourite topics included model understanding, model compression, training bag of tricks, self-supervised learning for audio, a walk through the world of BERT, and indigenous in AI...
Prostate Cancer can be precisely diagnosed using a urine test with artificial intelligence
Prostate cancer is one of the most common cancers among men. Patients are determined to have prostate cancer primarily based on PSA, a cancer factor in blood. However, as diagnostic accuracy is as low as 30%, a considerable number of patients undergo additional invasive biopsy and thus suffer from resultant side effects, such as bleeding and pain...The Korea Institute of Science and Technology (KIST) announced that the collaborative research...for diagnosing prostate cancer from urine within only 20 minutes with almost 100% accuracy. The research team developed this technique by introducing a smart AI analysis method to an electrical-signal-based ultrasensitive biosensor...
Data Platform*
GET PAID TO SOURCE DATA
We’re tired of seeing data scientists not getting paid for collecting data!
DoltHub is a platform for data collaboration that wants to pay you to source data. We recently launched a $10,000 bounty to collect the best open dataset of hospital prices! Get paid for every row you contribute: submit 20% of the dataset, get a $2,000 reward.
DoltHub makes data collaboration easy. Dolt databases can be forked, cloned, and merged just like Git repositories. That means multiple people can work on the same dataset without stomping on each other's changes.
Please see the link to the bounty here or join our Discord here.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Data Scientist - Apple Pay Analytics - NYC
You will play a key role improving the Apple Pay product experience. As a member of the analytics team you will be supporting a product function. You will partner with business owners, understand goals, craft KPIs and measure ongoing performance. You will initially engage with the product and engineering teams in ensuring that we have the appropriate instrumentation in place to deliver on these metrics. You will subsequently use advanced statistical, ML and analytical techniques to analyze product performance and identify key insights that inform product improvements and business strategy. The role requires a high degree of independence, ownership and collaboration working cross functionally across all levels of a highly matrixed organization...
Want to post a job here? Email us for details >> team@datascienceweekly.org
Training & Resources
ML Theory with bad drawings
This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what is the theory of machine learning. I decided to write this (very rough!) blog post mainly to organize my own thoughts...
Book Review: Deep Learning With PyTorch
After its release in August 2020, Deep Learning with PyTorch has been sitting on my shelf before I finally got a chance to read it during this winter break. It turned out to be the perfect easy-going reading material for a bit of productivity after the relaxing holidays. As promised last week, here are my thoughts...
SVM Classifier and RBF Kernel — How to Make Better Models in Python
A complete explanation of the inner workings of Support Vector Machines (SVM) and Radial Basis Function (RBF) kernel...The story covers the following topics: a) The category of algorithms that SVM classification belongs to, b) An explanation of how the algorithm works, c) What are kernels, and how are they used in SVM?, and d) A closer look into RBF kernel with Python examples and graphs...
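As a small illustration of the kernel discussed in that article, the sketch below computes the standard RBF similarity, K(x, y) = exp(-gamma * ||x - y||^2), in plain Python. It is not the article's code: the function name `rbf_kernel` and the default `gamma=1.0` are assumptions chosen for the example.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Return exp(-gamma * ||x - y||^2), the RBF similarity of x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    # Similarity is 1 for identical points and decays toward 0 with distance.
    return math.exp(-gamma * squared_distance)
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood; a smaller `gamma` gives smoother decision boundaries.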
Books
Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian