Data Science Weekly

Issue #38, Aug 14, 2014

So You Wanna Try Deep Learning?
I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one-file gist that does all the basics, so that you can play around with it yourself...

Scholar Octopus
Fun hack: I took 7200 papers from 34 CV/ML conferences, and laid them out with t-SNE based on bigram tfidf. Explore...
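
For readers curious what "bigram tfidf" features look like, here is a minimal sketch in plain Python. It is not the code behind Scholar Octopus: the toy documents, the function names, and the particular weighting (raw count times log(N / document frequency)) are assumptions made for illustration. Vectors like these are what a tool such as t-SNE would then lay out in 2D.

```python
import math
from collections import Counter

def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def bigram_tfidf(docs):
    """Return one {bigram: weight} dict per document.

    Weight = raw count in the document * log(N / document frequency).
    This is one common tf-idf variant, chosen here for simplicity.
    """
    counts = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each bigram appears.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    return [
        {bg: tf * math.log(n_docs / doc_freq[bg]) for bg, tf in c.items()}
        for c in counts
    ]

# Hypothetical paper titles standing in for real documents.
docs = [
    "deep neural networks for image classification",
    "convolutional neural networks for object detection",
    "sparse coding for image classification",
]
weights = bigram_tfidf(docs)
```

Bigrams unique to one document (such as "deep neural") get the highest weight, while bigrams shared across documents (such as "neural networks") are discounted.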

Is HBase’s slow and steady approach winning the NoSQL race?
In the world of NoSQL databases, the products that have dominated the conversation are MongoDB and DataStax Enterprise, a leading distribution of Apache Cassandra. But a couple of headlines this week bring into focus a perhaps less-splashy, though rather tenacious player: Apache HBase, which is included with most major Hadoop distributions...

Building a Production Machine Learning Infrastructure
Josh Wills, Director of Data Science at Cloudera, has a gift for making fairly complicated technology explanations very digestible to the novice and intermediate techie. What I most love about this video is how clearly Josh explains the issue of taking machine learning analytics built on a large set of data records (see: individuals) and making it work in a production environment on one individual (think eCommerce)...

Machine predicts heart attacks 4 hours before doctors
When someone shouts "Code Blue!" in a hospital, it usually means a patient needs immediate help. An algorithm may be able to make that call 4 hours earlier to head off dangerous situations...

Using scikit-learn Pipelines and FeatureUnions
Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions...

An Empirical Analysis of Stop-and-Frisk in New York City
Between 2006 and 2012, the New York City Police Department made roughly four million stops as part of the city’s controversial stop-and-frisk program. We empirically study two aspects of the program by analyzing a large public dataset released by the police department that records all documented stops in the city...

Interfaces, Efficiency and Big Data
The recording of John Chambers' keynote presentation from the useR! 2014 conference, Interfaces, Efficiency and Big Data, is now available for viewing thanks to Data Science LA...

The Top 5 Questions A Data Scientist Should Ask During a Job Interview
The data science job market is hot and an incredible number of companies, large and small, are advertising a desperate need for talent. Before jumping on the first 6-figure offer you get, it would be wise to ask the penetrating questions below to make sure that the seemingly golden opportunity in front of you isn’t actually pyrite...

The Question to Ask Before Hiring a Data Scientist
When hiring data scientists, there’s nothing more frustrating than making the wrong hire. Data scientists are in notoriously high demand, hard to attract, and command large salaries — compounding the cost of a mistake...

Visualizing product relationships in a Market Basket analysis
I came up with this technique to visualize and explain market basket analysis in a very simple visualization. This was the core thought behind this technique: algorithms used in text mining can be leveraged to create relationship plots in a market basket analysis...

Data analytics recipes for PyToolz
We demonstrate data workflows with Python data structures and PyToolz. We also introduce join, a new operation in toolz...
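
To give a flavour of what a join over plain Python sequences involves, here is a minimal hash-join sketch using only the standard library. It is not toolz's implementation or API: the function name `hash_join` and the sample records are hypothetical, made up for illustration.

```python
from collections import defaultdict

def hash_join(left, left_key, right, right_key):
    """Yield (left_item, right_item) pairs whose keys match (inner join)."""
    # Index the left sequence by key so each right item is matched quickly.
    index = defaultdict(list)
    for item in left:
        index[left_key(item)].append(item)
    # Stream over the right sequence and emit every matching pair.
    for right_item in right:
        for left_item in index.get(right_key(right_item), []):
            yield (left_item, right_item)

# Hypothetical sample data: (id, name) and (name, fruit) records.
people = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [("alice", "apple"), ("bob", "banana"), ("alice", "cherry")]

pairs = list(hash_join(people, lambda p: p[1], orders, lambda o: o[0]))
```

Only the left sequence is held in memory; the right one is consumed lazily, which is the usual reason to prefer a hash join when one side is much larger than the other.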

Q&A: OkCupid’s Co-Founder on the Myth of the Data Scientist ‘Unicorn’
Christian Rudder, co-founder and president of IAC/InterActiveCorp.’s OkCupid, caused a stir recently when he responded to the Facebook news feed controversy with some no-regrets comments about the dating website’s own experiments on its customers...

Data Scientist - zulily - Seattle
zulily is seeking an intellectually curious, collaborative data expert to work as an acquisition-focused data scientist and statistician. As a zulily Data Scientist, you will use statistical analysis and machine learning to better understand how users engage with zulily, and you will use that information to build models that inform our retention and acquisition practices, recommender systems, and optimize content. You should have a strong background in statistics and probability, machine learning, and working with large datasets. Additionally, you should have knowledge of and experience in online marketing practices and metrics...

Data Science at the Command Line - Webcast
We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data. The command line, although invented decades ago, is an amazing environment for performing such data science tasks...

Machine Learning Theory: An Introductory Primer
This article introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic...

Data Analysis Using Regression and Multilevel/Hierarchical Models
Comprehensive manual in accessible style...

"Andrew Gelman is a top researcher in Bayesian statistics as well as an excellent writer. He has written an excellent text on Bayesian data analysis that uses the Markov Chain Monte Carlo methods for dealing with hierarchical linear models. This book starts out on an introductory level covering a wide variety of statistical modeling problems including logistic regression and multilevel logistic regression, generalized linear models and causal inference..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S. Enjoyed the newsletter? Please forward it to friends and peers - we'd love to have them onboard too :-) - All the best, Hannah & Sebastian