Data Science Weekly

Jul 23, 2015

Issue #87, July 23, 2015

Editor Picks

Two examples of why machine learning is becoming the most powerful way to increase revenue
From recommendations and personalization to ads and e-commerce, companies like Google, Facebook, Amazon, Netflix, and LinkedIn have been increasing revenue and engagement with machine learning for years. The success stories that follow show how we’re leveling the playing field by helping product teams and publishers leverage the same technology as these tech giants, without the need to build it in-house...

Exploring the shapes of stories using Python and sentiment APIs
Using two hacks and a multinomial logistic regression model of n-grams with TF-IDF features, a pre-trained sentiment model can score the long-range sentiment of the text of stories, books, and movies. The models do a reasonable job of summarizing the “shapes of stories” directly from text...

Cute robot politely shows self-awareness
A robot allegedly demonstrates basic self-awareness while attempting to solve a classic logic puzzle...

Want to be a Data Scientist, but don't know where to start?
Learn essential Data Science skills in SlideRule's Intro to Data Science Workshop. In this online bootcamp, you'll learn R, data wrangling, analytics and visualization by working on real projects, with 1-on-1 mentorship from expert Data Scientists from LinkedIn, Glassdoor, Trulia and Stripe.

Spots are limited; registration ends in 48 hours!

Deriving the Reddit Formula
A few things about Reddit's hot formula have always bothered me. The formula has obviously been a success when it comes to setting the Internet on fire, but I have to wonder...
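
For readers who have not seen it, here is a minimal sketch of the hot ranking as it is commonly described from Reddit's open-sourced code. The function name `hot`, the type hints, and the sample timestamps are our own assumptions for illustration; the epoch offset and the 45000-second divisor are the constants usually quoted, so check them against the linked article before relying on them.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# Epoch offset usually quoted from Reddit's open-sourced ranking code
# (a moment in December 2005); treat it as an assumption here.
REDDIT_EPOCH_SECONDS = 1134028003


def hot(ups: int, downs: int, posted: datetime) -> float:
    """Rank = log10 of the net score plus a bonus that grows with submission time."""
    score = ups - downs
    order = log10(max(abs(score), 1))          # 10x more net votes adds 1 point
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = posted.timestamp() - REDDIT_EPOCH_SECONDS
    return round(sign * order + seconds / 45000, 7)  # 45000 s newer adds 1 point


# Illustrative timestamps only.
posted = datetime(2015, 7, 23, tzinfo=timezone.utc)
later = posted + timedelta(seconds=45000)
```

Under this form, a post with 10 net upvotes submitted 12.5 hours (45000 seconds) later scores the same as a post with 100 net upvotes, which is the kind of trade-off the article examines.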

Is there a simple algorithm for intelligence?
The question I explore here is whether there is a simple set of principles which can be used to explain intelligence. In particular, and more concretely, is there a simple algorithm for intelligence?...

Inferring Networks of Substitutable and Complementary Products
Here we develop a method to infer networks of substitutable and complementary products. We formulate this as a supervised link prediction task, where we learn the semantics of substitutes and complements from data associated with products...

Split Testing for Geniuses
You are sitting at a slot machine with two levers, labeled A and B. When you pull a lever, sometimes a dollar comes out of the slot and sometimes not. The casino tells you that each lever has a fixed chance of giving you a dollar (its success rate) but, of course, they don’t tell you what it is. Since you don’t have any way of distinguishing them to start, you pull lever A and a dollar comes out (Yippee!). What do you do next?...
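
One standard answer to the two-lever question is Thompson sampling, sketched below. This is not necessarily the approach the linked article takes. The function names, the uniform Beta(1, 1) prior, and the success rates in the demo are all assumptions made for illustration.

```python
import random


def thompson_pick(wins, losses, rng):
    """Sample each lever's Beta posterior (uniform prior) and pull the best draw."""
    draws = {
        arm: rng.betavariate(wins[arm] + 1, losses[arm] + 1)
        for arm in wins
    }
    return max(draws, key=draws.get)


def simulate(true_rates, pulls, seed=0):
    """Play `pulls` rounds against levers with hidden success rates."""
    rng = random.Random(seed)
    wins = {arm: 0 for arm in true_rates}
    losses = {arm: 0 for arm in true_rates}
    for _ in range(pulls):
        arm = thompson_pick(wins, losses, rng)
        if rng.random() < true_rates[arm]:   # the lever pays out a dollar
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


# Made-up success rates; in the real problem these are unknown to the player.
wins, losses = simulate({"A": 0.2, "B": 0.8}, pulls=2000)
pulls_per_arm = {arm: wins[arm] + losses[arm] for arm in wins}
```

The appeal of this design is that exploration tapers off on its own: as evidence accumulates, the weaker lever's posterior draws rarely win, so it is pulled less and less often.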

Kaggle Competition Tips & Summaries
Over the years, I’ve participated in a few Kaggle competitions and written a bit about my experiences. This page contains pointers to all my posts, and will be updated if/when I participate in more competitions....

Data Science at Agari: Forwarder Classification
Among the challenges that our engineering team faces is classifying an email-sending entity as a forwarder. At Agari, we are primarily interested in the authentication of emails from originating senders...

Machine learning to predict San Francisco crime
In today’s post, we document our submission to the recent Kaggle competition aimed at predicting the category of San Francisco crimes, given only their time and location of occurrence. As a reminder, Kaggle is a site where one can compete with other data scientists on various data challenges. We took this competition as an opportunity to explore the Naive Bayes algorithm. With the few steps discussed below, we were able to quickly move from the middle of the pack to the top 33% on the competition leaderboard, all the while continuing with this simple model!...
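
To give a flavor of the technique, here is a minimal sketch of a categorical Naive Bayes classifier with Laplace smoothing. It is not the authors' submission. The class name, the two features (time of day and a coarse area), and the toy records are all made up for illustration and are not drawn from the competition data.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
        self.values = defaultdict(set)              # feature index -> distinct values seen
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())

        def log_posterior(label):
            # log prior plus the sum of smoothed per-feature log likelihoods
            lp = log(self.class_counts[label] / total)
            for i, value in enumerate(row):
                count = self.feature_counts[(i, label)][value]
                denom = self.class_counts[label] + self.alpha * len(self.values[i])
                lp += log((count + self.alpha) / denom)
            return lp

        return max(self.class_counts, key=log_posterior)


# Toy records (time of day, area); invented for this example.
rows = [("night", "north"), ("night", "north"), ("day", "south"),
        ("day", "south"), ("night", "south")]
labels = ["ASSAULT", "ASSAULT", "THEFT", "THEFT", "ASSAULT"]
model = CategoricalNaiveBayes().fit(rows, labels)
```

Calling `model.predict(("day", "south"))` then returns whichever category has the highest smoothed posterior for that time and area.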

Deepdream: Avoiding Kitsch
Yes yes, #deepdream. But as Memo Akten and others point out, this is going to kitsch as rapidly as Walter Keane and lolcats unless we can find a way to stop the massive firehose of repetitive #puppyslug that has been opened by a few websites letting us upload selfies...

Data Scientist - City of New York - NYC
The successful candidate will serve as a Data Scientist reporting to the Mayor’s Office of Criminal Justice. Responsibilities will include: Gather and convert data into insights to guide policy development and evaluation; work with City agencies to integrate data sets and develop data-informed strategies; utilize programming languages such as SAS, SQL, R, SPSS, Python; develop new approaches to collecting data not presently incorporated into City systems; work closely with operations, policy, and analytic teams to establish and validate models and approaches; and perform special projects and initiatives as assigned...

Pyxley: Python Powered Dashboards
We have written a Python package, called Pyxley, not only to help simplify the development of web applications, but also to provide a way to easily incorporate custom JavaScript for maximum flexibility...

Ibis on Impala: Python at Scale for Data Science
This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale...

Discovering Statistics Using R
Recommended by several readers of the newsletter...

"The book is a great overview of statistics concepts and provides a gentle, yet comprehensive, introduction to the R language..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.

P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :) - All the best, Hannah & Sebastian