Data Science Weekly - Issue 198
Issue #198, Sept 7, 2017
Editor Picks
How I replicated an $86 million project in 57 lines of code
When an experiment with existing open source technology does a “good enough” job...
Meet Michelangelo: Uber's Machine Learning Platform
Uber Engineering is committed to developing technologies that create seamless, impactful experiences for our customers. We are increasingly investing in artificial intelligence (AI) and machine learning (ML) to fulfill this vision. At Uber, our contribution to this space is Michelangelo, an internal ML-as-a-service platform that democratizes machine learning and makes scaling AI to meet the needs of business as easy as requesting a ride...
Hurricane How-To
When projections showed Hurricane Harvey could bring a record setting amount of rain to Houston, the graphics desk at the New York Times started exploring ways of showing the rainfall. After a couple of days of scrambling, we managed to make a map showing both the accumulation and rate of rainfall. Getting this done on deadline required mashing together a couple of different web technologies, like SVG and canvas, with a grab bag of open source libraries. This post describes some of the tricks and techniques we used to bring everything together...
A Message from this week's Sponsor:
Insurance AI & Analytics Europe, October 9-10, London
Europe’s largest AI event dedicated to insurance unites 300 attendees from analytics, pricing, marketing, claims and underwriting. 80% of the audience are from insurance, including AIG, Aviva, Allianz, Helvetia, XL Catlin, Tryg, Chubb, Generali, Groupama, Admiral, AXA, SCOR, Zurich. Attend to find out how advanced analytics, automation and AI drive operational efficiencies, greater profitability and improved customer loyalty. Plus, explore AI in the specific areas of product design, customer experience, underwriting, pricing, fraud and claims.
Data Science Articles & Videos
What we learned from three years of interviews with data journalists, web developers and interactive editors at leading digital newsrooms
Over the last three years, Storybench, a website from Northeastern University’s School of Journalism’s Media Innovation graduate program, has interviewed 72 data journalists, web developers, interactive graphics editors, and project managers from around the world to provide an “under the hood” look at the ingredients and best practices that go into today’s most compelling digital storytelling projects...
How to Regulate Artificial Intelligence
The technology entrepreneur Elon Musk recently urged the nation’s governors to regulate artificial intelligence “before it’s too late.” Mr. Musk insists that artificial intelligence represents an “existential threat to humanity,” an alarmist view that confuses A.I. science with science fiction. Nevertheless, even A.I. researchers like me recognize that there are valid concerns about its impact on weapons, jobs and privacy. It’s natural to ask whether we should develop A.I. at all...
Hiring a data scientist
We recently needed to backfill a data analyst position at the Wikimedia Foundation. If you’ve hired for this type of position in the past, you know that this is no easy task. Based on our successful hiring process, we’d like to share what we learned, and how we drew on existing resources to synthesize a better approach to interviewing and hiring a new member of our team...
How To Fix a Toilet: And Other Things We Couldn't Do Without Search
I can't fix a toilet. Or a sink. I know how to write an essay, how to visualize a dataset, how to draw a diagram, how to make liquid olives ... but I can't fix anything around the house. My father-in-law was quite disappointed. So I search how to do everything around the house. To my relief, I am not alone...
Automated Crowdturfing Attacks and Defenses in Online Review Systems
In this paper, we identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks or RNNs) to automate the generation of fake online reviews for products and services. Not only are these attacks cheap and therefore more scalable, but they can control rate of content output to eliminate the signature burstiness that makes crowdsourced campaigns easy to detect...
7 ways to view correlation
Correlation is a fundamental statistical concept that measures the linear association between two variables. There are multiple ways to think about correlation: geometrically, algebraically, with matrices, with vectors, with regression, and more...
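As a rough illustration of the "multiple ways" idea, the sketch below computes the Pearson correlation of one small made-up data set three ways: algebraically (covariance over the product of standard deviations), geometrically (cosine of the angle between the mean-centered vectors), and via regression (the signed geometric mean of the two regression slopes). The function names and the sample data are assumptions chosen for this example and are not taken from the linked article.

```python
import math

def centered(v):
    """Return v with its mean subtracted from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def corr_algebraic(x, y):
    # Sample covariance divided by the product of sample standard deviations.
    n = len(x)
    xc, yc = centered(x), centered(y)
    cov = dot(xc, yc) / (n - 1)
    sx = math.sqrt(dot(xc, xc) / (n - 1))
    sy = math.sqrt(dot(yc, yc) / (n - 1))
    return cov / (sx * sy)

def corr_geometric(x, y):
    # Cosine of the angle between the two mean-centered vectors.
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / (math.sqrt(dot(xc, xc)) * math.sqrt(dot(yc, yc)))

def corr_regression(x, y):
    # The slope of y on x times the slope of x on y equals r squared;
    # the sign of r matches the sign of either slope.
    xc, yc = centered(x), centered(y)
    slope_y_on_x = dot(xc, yc) / dot(xc, xc)
    slope_x_on_y = dot(xc, yc) / dot(yc, yc)
    return math.copysign(math.sqrt(slope_y_on_x * slope_x_on_y), slope_y_on_x)

# Hypothetical sample data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_alg = corr_algebraic(x, y)
r_geo = corr_geometric(x, y)
r_reg = corr_regression(x, y)
print(r_alg, r_geo, r_reg)
```

All three functions should agree up to floating-point rounding, which is the point: they are different views of the same quantity.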
The curious connection between warehouse maps, movie recommendations, and structural biology
Here at Stitch Fix, we work on many fun and interesting areas of Data Science. One of the more unusual ones is drawing maps, specifically internal layouts of warehouses. These maps are extremely useful for simulating and optimising operational processes. In this post, we’ll explore how we are combining ideas from recommender systems and structural biology to automatically draw layouts and track when they change...
What metrics should be used for evaluating a model on an imbalanced data set? (precision + recall or ROC=TPR+FPR)
I always found the subject of metrics somewhat confusing, especially when the data set is imbalanced (as happens so often in our usual problems). To clarify things, I decided to test a few simple examples of imbalanced data sets with different types of metrics and see which reflects model performance more correctly: ROC curve metrics (TPR and FPR) or precision and recall...
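To make the comparison concrete, here is a minimal sketch that computes precision, recall (TPR) and FPR from a confusion matrix on a small imbalanced data set. The labels and predictions are hypothetical numbers invented for this example (10 positives among 100 samples), and the helper names are assumptions; none of it comes from the linked article.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "precision": safe_div(tp, tp + fp),  # of predicted positives, share that are real
        "recall": safe_div(tp, tp + fn),     # TPR: share of real positives that were found
        "fpr": safe_div(fp, fp + tn),        # share of real negatives that were flagged
    }

# Hypothetical imbalanced data: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
# Hypothetical model output: finds 8 of the 10 positives, but also flags 12 negatives.
y_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78

m = metrics(y_true, y_pred)
print(m)
```

With these made-up numbers the FPR is about 0.13, which looks low because negatives dominate the denominator, while precision is only 0.4: more than half of the flagged items are false alarms. That gap is the kind of difference between the two metric families that shows up on imbalanced data.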
Jobs
Data Scientist - School of Media and Journalism, University of North Carolina - Chapel Hill, NC
The School of Media and Journalism at the University of North Carolina at Chapel Hill seeks a data scientist to work with students, faculty and industry in the Reese News Lab. This is a full-time, 12-month non-faculty position that is funded by a three-year grant from the Knight Foundation.
This is an exciting opportunity to apply data analysis skills to media innovation and the challenge of getting the right information to the right people at the right time -- especially information that citizens need to hold powerful people accountable and better understand the complex world we all share...
Training & Resources
Databases Using R
If you want to use dbplyr with SQL Server check out this guide...
Companion Jupyter notebooks for the book "Deep Learning with Python"
Jupyter notebooks for the code samples of the book "Deep Learning with Python"...
My Neural Network isn't working! What should I do?
So you're developing the next great breakthrough in deep learning but you've hit an unfortunate setback: your neural network isn't working and you have no idea what to do. You go to your boss/supervisor but they don't know either - they are just as new to all of this as you - so what now? Well, luckily for you, I'm here with a list of all the things you've probably done wrong, compiled from my own experiences implementing neural networks and supervising other students with their projects...
Books
Keeping Up with the Quants: Your Guide to Understanding and Using Analytics
"Perfect for professionals who want a better grounding in understanding and applying data results to business goals..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Looking to hire a Data Scientist? Find an awesome one among our readers! Email us for details on how to post your job :) - All the best, Hannah & Sebastian