Data Science Weekly - Issue 140
Issue #140, July 28, 2016
Editor Picks
How I built a Slack bot to help me find an apartment in San Francisco
I moved from Boston to the Bay Area a few months ago. Priya (my girlfriend) and I heard all sorts of horror stories about the rental market. The fact that searching for “How to find an apartment in San Francisco” on Google yields dozens of pages of advice is a good indicator that apartment hunting is a painful process...
Approaching (Almost) Any Machine Learning Problem
An average data scientist deals with loads of data daily. Some say over 60-70% of the time is spent on data cleaning, munging and bringing data into a suitable format so that machine learning models can be applied to it. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post come as a result of over a hundred machine learning competitions that I’ve taken part in...
Entropy Explained, With Sheep
From melting ice cubes to a mystery about time. Let's start with a puzzle...
A Message from this week's Sponsor:
Where science and policy change the world. And You.
Apply your knowledge & skills to federal policy via the AAAS Science & Technology Policy Fellowships, a year-long professional development opportunity for doctoral-level data scientists to serve in the federal government in Washington, D.C.
STPF fosters a career-enhancing network of science leaders who understand policymaking & contribute to society...
Data Science Articles & Videos
Researchers use neural networks to turn face sketches into photos
We all have a soft spot for Prisma, the app that turns smartphone photos into stylized artwork. But the reverse process — transforming artwork into pictures — is no less fascinating. And it’s not far from becoming real, researchers in the Netherlands said...
Fly the Frustrating Skies: Sentiment Analysis of the U.S. Airline Industry
For this project, I compared the performance of various sentiment analyzers in order to identify the most effective strategy for identifying customer complaints directed towards the Twitter accounts of 6 major U.S. airlines (United, American Airlines, U.S. Airways, Southwest, Delta, and Virgin America)...
Applying Data Science To The Supreme Court: Topic Modeling Over Time With NMF (and A D3.js Bonus)
The Supreme Court is arguably the most important branch of government for guiding our future, but it's incredibly difficult for the average American to get a grasp of what's happening. I decided that a good start in closing this gap would be to model topics over time and create an interactive visualization for anyone with an interest and an internet connection to utilize to educate themselves...
NYC Subway Math
I started tracking all subway trains one day and completely forgot about it. Several weeks later I had a 3 GB data dump full of all the arrivals for 1, 2, 3, 4, 5, 6, L, SI and GC (the latter two being Staten Island railway and Grand Central Shuttle). Let’s do some cool stuff with this data!...
The Skynet Salesman
Operations covers a broad range of problems and can involve things like optimizing shipping, allocating items to warehouses, coordinating processes to ensure that our products arrive on time, or optimizing the internal workings of a warehouse. One of the canonical questions in operations is the traveling salesman problem (TSP). In its simplest form, we have a busy salesperson who must visit a set number of locations once...
Language modeling a billion words
In this Torch blog post, we use noise contrastive estimation (NCE) [2] to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset...
Why I’m Not a Fan of R-Squared
R² answers the question: “does my model perform better than a constant model?” But we often would like to answer a very different question: “does my model perform worse than the true model?”...
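To make the "constant model" comparison concrete, here is a minimal sketch of the standard coefficient of determination in plain Python. It is not taken from the linked post; the function name r_squared and the sample values are illustrative assumptions. The constant model is taken to be "always predict the mean of the observed values", so a score of 0 means no better than that baseline and a negative score means worse.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    SS_tot is the squared error of the constant model that always
    predicts the mean of y_true, so the score measures improvement
    over that baseline. (Illustrative helper, not from the post.)
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observations for illustration only.
y = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y, y)                      # model reproduces the data
constant = r_squared(y, [2.5, 2.5, 2.5, 2.5])  # the mean-only baseline
worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])     # worse than the baseline
```

Note that the score says nothing about how close the model is to the true data-generating process, which is the second question the excerpt raises.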
Degrees Of Separation On A Tree Algorithm
Imagine that you are like me and looking for an algorithm to compute degrees of separation along a hierarchical tree like the ones pictured below. Your tree could represent any data; let’s pretend it’s a company org chart with node A as the CEO, nodes B and I as execs, and so on. The distance, or degrees of separation, between any two nodes is the number of links along the shortest path that separates them...
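One common way to compute this distance, sketched below, is to walk from each node up to the root and stop at the first shared ancestor. This is not necessarily the algorithm the linked post arrives at. The parent-map representation, the function names, and every node below A, B and I are assumptions made up for illustration.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def degrees_of_separation(parent, a, b):
    """Number of links on the shortest path between a and b in a tree.

    `parent` maps each node to its parent (the root maps to None).
    The shortest path runs through the lowest common ancestor, so the
    distance is (steps from a to it) + (steps from b to it).
    """
    steps_from_a = {n: d for d, n in enumerate(path_to_root(parent, a))}
    for steps_from_b, n in enumerate(path_to_root(parent, b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


# Hypothetical org chart: A is the CEO, B and I are execs,
# C and D report to B, J reports to I.
org = {"A": None, "B": "A", "I": "A", "C": "B", "D": "B", "J": "I"}

same_team = degrees_of_separation(org, "C", "D")    # via B
across_orgs = degrees_of_separation(org, "C", "J")  # via A
```

Each query costs time proportional to the depth of the tree, which is usually fine for an org chart; heavier workloads typically precompute depths or use a dedicated lowest-common-ancestor structure.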
Jobs
Applied Data Scientist - Ancestry - Lehi, UT
AncestryDNA is the world's largest consumer genomics database, providing consumers insights into their ancestral origins. The service enables customers not only to uncover their ethnic mix and rich family stories, but also to discover distant relatives with a common ancestral match and help solve the toughest family mysteries. The Data Science team is looking for an experienced Data Scientist who has a passion for building data products and data systems...
Training & Resources
Scipy Lecture Notes - One document to learn numerics, science, and data with Python
Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. The different chapters each correspond to a 1 to 2 hours course with increasing level of expertise, from beginner to expert...
Generalized linear models, abridged.
This note grew out of our own desire to better understand the numerics of generalized linear models. We highlight aspects of GLM implementations that we find particularly interesting. We present some reference implementations stripped down to illuminate core ideas; often with just a few lines of code...
googleVis: Interface between R and the Google Chart Tools
The googleVis package provides an interface between R and the Google Charts API. It allows users to create web pages with interactive charts based on R data frames...
Books
Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference
Master Bayesian Inference through Practical Examples and Computation–Without Advanced Mathematical Analysis...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :)
All the best, Hannah & Sebastian