Data Science Weekly - Issue 45
Issue #45 Oct 2 2014
Editor Picks
How BuzzFeed Thinks About Data Science
Jonah Peretti hired BuzzFeed’s first data scientist in 2010, to predict when and how articles would go viral on the Internet. It’s a hard problem. We are still thinking about this same question today, but our canvas has changed. BuzzFeed now covers news, politics, business, tech, entertainment, food, international coverage and much more, reaching over 150 million unique visitors a month. The data science team has evolved alongside it...
Using Machine Learning and NodeJS to detect the gender of Instagram Users
The goal of this article is to provide a very practical guide to deploying a machine learning solution at scale...
Teach statistics before calculus! [TED Talk]
Someone always asks the math teacher, "Am I going to use calculus in real life?" And for most of us, says Arthur Benjamin, the answer is no. He offers a bold proposal on how to make math education relevant in the digital age...
Data Science Articles & Videos
How big data is unfair
As we’re on the cusp of using machine learning for rendering basically all kinds of consequential decisions about human beings in domains such as education, employment, advertising, health care and policing, it is important to understand why machine learning is not, by default, fair or just in any meaningful way...
Exercise to detect Algorithmically Generated Domain Names
In this notebook we're going to use some great Python modules to explore, understand and classify domains as being 'legit' or having a high probability of being generated by a DGA (Domain Generation Algorithm)...
Integrating Kafka & Spark Streaming: Code Examples & State of the Game
Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization...
Looking Beyond the Visible Scene
Can you find the nearest McDonald's faster than a DeepLearning computer?...
What I Learned From The Kaggle Criteo Data Science Odyssey
Each Kaggle challenge is like an odyssey: you start with nothing, you don’t know how it will end, and when it’s finished, it leaves you with good memories...
Artificial General Intelligence that plays Atari video games: How did DeepMind do it?
Last December, an article named “Playing Atari with Deep Reinforcement Learning” was uploaded to arXiv by employees of a small AI company called DeepMind. Two months later Google bought DeepMind for 500 million euros, and this article is almost the only thing we know about the company. Currently our team is trying to replicate their artificial mind, and in this post we describe its inner workings...
Gilt and Preemptive Shipping: A Q&A with Our Chief Data Scientist
The Gilt tech team doesn’t need an in-house psychic to help us predict which customers will buy products we’ve never sold before. Instead, we rely on the data wizardry performed by our Principal Data Scientist, Igor Elbert, who has been helping us to refine our product performance predictions (say that three times fast) by using various machine learning and predictive modeling techniques...
In Chicago, food inspectors are guided by big data
In Chicago, just 32 food inspectors — called sanitarians — are responsible for auditing the city’s more than 15,000 restaurants. Traditionally, sanitarians are assigned beats, or groups of restaurants, that they inspect a few times a year, depending on a restaurant’s assessed risk level: How complex a restaurant’s menu items are, and how likely ingredients are to trigger food poisoning. Today, the city is experimenting with a new technology to guide where those inspections should occur, based on factors such as current weather, nearby construction and past health code violations...
Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA
The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radial basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via RBF kernel principal component analysis (kPCA)...
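For readers who want a feel for the technique before clicking through, here is a minimal sketch of RBF kernel PCA using only NumPy. It is not the linked article's code: the function name `rbf_kernel_pca`, the `gamma` value, and the toy two-ring dataset are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """Project the rows of X onto the top kernel principal components.

    Hypothetical helper for illustration. Returns (projections, eigenvalues).
    """
    # Pairwise squared Euclidean distances between all rows of X.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off

    # Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    top = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[top], eigvecs[:, top]

    # Training-point projections: unit eigenvector scaled by sqrt(eigenvalue).
    projections = eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
    return projections, eigvals


# Toy data (an assumption): two concentric rings, not linearly separable in 2-D.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z, top_eigvals = rbf_kernel_pca(X, gamma=1.0, n_components=2)
print(Z.shape)
```

The value of `gamma` controls the kernel width and usually needs tuning per dataset; the value above is only a placeholder.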
META: Most Read Data Science Weekly Articles, Q3 2014
The most clicked articles on the newsletter in the last quarter...
Jobs
Data Scientist, Tradesy - Santa Monica, CA
Tradesy is a new kind of peer-to-peer marketplace that addresses the pain-points associated with selling on sites like eBay and Craigslist. Our mission is to make it simple and delightful for anyone to sell the unused or underused goods cluttering their closets. We have millions of passionate members, a product that people love, and an office with an ocean view in sunny Santa Monica. We're backed by some of the best investors around, including KPCB and Sir Richard Branson. But enough about us, let's talk about you! You're a Data Scientist who pushes the limits of what is possible with data. You have a high level of ownership over your work and absolutely hate to be micromanaged...
Training & Resources
How to install Theano on Amazon EC2 GPU instances for deep learning
Hopefully these steps will help you get your deep learning models up and running on AWS...
Hacker's guide to Neural Networks
I've worked on Deep Learning for a few years as part of my research and among several of my related pet projects is ConvNetJS - a Javascript library for training Neural Networks. Javascript allows one to nicely visualize what's going on and to play around with the various hyperparameter settings, but I still regularly hear from people who ask for a more thorough treatment of the topic. This article (which I plan to slowly expand out to lengths of a few book chapters) is my humble attempt. It's on web instead of PDF because all books should be, and eventually it will hopefully include animations/demos etc...
Bayesian models in R
There are many ways to run general Bayesian calculations in or from R. To get an idea of how they compare, I decided to run a number of calculations through all of them...
Books
Thoughtful Machine Learning: A Test-Driven Approach (JUST RELEASED!)
Learn how to apply test-driven development (TDD) to machine-learning algorithms, and catch mistakes that could sink your analysis...
"In this practical guide, author Matthew Kirk takes you through the principles of TDD and machine learning, and shows you how to apply TDD to several machine-learning algorithms, including Naive Bayesian classifiers and Neural Networks..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Enjoyed the newsletter? Please forward it to friends and peers - we'd love to have them onboard too :-) - All the best, Hannah & Sebastian