Data Science Weekly - Issue 80
Issue #80, June 4, 2015
Editor Picks
The Unknown Perils of Mining Wikipedia
If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus (also a Wikipedia recommendation API derived from it)...
Using Amazon Machine Learning to Predict the Weather
Amazon recently launched their Machine Learning service, so I thought I’d take it for a spin. Machine Learning (ML) is all about predicting future data based on patterns in existing data. As an experiment I wanted to see if machine learning would be able to predict tomorrow's weather based on weather observations...
Virtual Eyes Train Deep Learning Algorithm to Recognize Gaze Direction
Gaze estimation is a classic problem of machine vision, which can now be solved by one computer training another...
Data Science Articles & Videos
Recurrent Neural Shady
Inspired by a recent blog post from Andrej Karpathy, I trained a character-by-character Recurrent Neural Network model on Eminem lyrics. Then, using the trained model, I let it generate its own Shady lyrics by sampling from the learned distribution (one character at a time). Adding the result to some background music made it quite amusing...
China unveils world's first facial recognition ATM machine
China has unveiled the world's first facial recognition ATM, which will not allow users to withdraw cash unless their face matches their IDs. The machine was created by Tsinghua University and Hangzhou-based technology company Tzekwan...
Extending the NFL Season to a Million Games
If time, money, and brain damage weren’t an issue, how would you decide if Nick Foles or Sam Bradford is better? You’d take a generic team, plug in Nick Foles, and play a million games. Obviously we can’t do this in real life, but we can in computers. We just need to be able to simulate football games. That sounds great, but how do we do it?...
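To give a flavor of the "play a million games" idea, here is a deliberately tiny sketch. It is not the article's simulator: it assumes each game is an independent coin flip with a fixed win probability, and the two probabilities below are made-up placeholders rather than estimates for any real quarterback.

```python
import random


def simulate_win_rate(win_prob, n_games, rng):
    """Fraction of simulated games won, treating each game as an
    independent coin flip (a strong simplifying assumption)."""
    wins = sum(1 for _ in range(n_games) if rng.random() < win_prob)
    return wins / n_games


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical per-game win probabilities, invented for illustration only.
rate_a = simulate_win_rate(0.52, 1_000_000, rng)
rate_b = simulate_win_rate(0.50, 1_000_000, rng)

print(f"QB A win rate: {rate_a:.4f}")
print(f"QB B win rate: {rate_b:.4f}")
```

With a million games per player, the sampling noise in each win rate is small enough that even a two-point gap stands out. A realistic simulator would model drives and plays rather than whole games.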
Interview with Chris Wiggins, Chief Data Scientist, New York Times
Video of the interview...
So, You Need a Statistically Significant Sample?
Although a commonly used phrase, there is no such thing as a "statistically significant sample" – it’s the result that can be statistically significant, not the sample. Word-mincing aside, for any study that requires sampling – e.g. surveys and A/B tests – making sure we have enough data to ensure confidence in results is absolutely critical...
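As a rough illustration of "enough data" for an A/B test, the sketch below uses the standard normal-approximation formula for comparing two proportions. It is not taken from the linked post; the function name, the baseline and target conversion rates, and the default significance level and power are all assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed in each arm of a two-sided test
    comparing two conversion rates p1 and p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical scenario: detect a lift from a 10% to a 12% conversion rate.
n = sample_size_per_group(0.10, 0.12)
print(f"Users needed per group: {n}")
```

The required sample grows with the inverse square of the difference you want to detect, so halving the expected lift roughly quadruples the number of users needed.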
Prediction intervals for Random Forests
An aspect that is important but often overlooked in applied machine learning is intervals for predictions, be it confidence or prediction intervals. For classification tasks, beginning practitioners quite often conflate probability with confidence...
My aversion to pipes
At the risk of coming across as even more of a curmudgeonly old fart than people already think I am, I really do dislike the current vogue in R that is the pipe family of binary operators...
How to Evaluate Machine Learning Models, Part 4: Hyperparameter Tuning
In the realm of machine learning, hyperparameter tuning is a “meta” learning task. It happens to be one of my favorite subjects because it can appear like black magic, yet its secrets are not impenetrable. In this post, I'll walk through what hyperparameter tuning is, why it's hard, and what kinds of smart tuning methods are being developed to do something about it...
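For readers new to the topic, here is a minimal sketch of random search, a common tuning baseline (not necessarily one of the methods the post covers). The `validation_loss` function is a hypothetical stand-in for "train a model and score it on held-out data", and the hyperparameter names and search ranges are assumptions made for the example.

```python
import math
import random


def validation_loss(learning_rate, reg_strength):
    """Hypothetical stand-in for training a model and scoring it on
    held-out data; lowest at learning_rate=1e-2, reg_strength=1e-3."""
    return (math.log10(learning_rate) + 2) ** 2 + (math.log10(reg_strength) + 3) ** 2


def random_search(n_trials, rng):
    """Sample hyperparameters on a log scale and keep the best trial."""
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, 0),
            "reg_strength": 10 ** rng.uniform(-6, 0),
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(200, random.Random(0))
print(best_params, best_loss)
```

Sampling on a log scale is the main design choice here: values such as learning rates often matter by order of magnitude, so uniform sampling on the raw scale would spend nearly all trials near the top of the range.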
Scientists dismissed "hot streaks" in sports for decades. They were wrong.
Most sports fans and athletes believe in hot streaks. A basketball player who has hit several shots in a row, the thinking goes, has a greater chance of hitting the next one, due to a "hot hand." Now, however, it's starting to look like the hot hand might be real after all...
Jobs
Snr Data Scientist - Nike - Portland, OR
Nike does more than outfit the world's best athletes. We are a place to explore potential, obliterate boundaries, and push out the edges of what can be. Nike’s Global Consumer Knowledge Center of Excellence is responsible for building and deepening a holistic view of Nike’s consumers through data and analytics. We are looking for a senior statistician to work across Nike’s consumer facing businesses to define and implement measurement strategies, instrument and analyze consumer behavior, and inform Nike’s global strategy...
Training & Resources
Kaggle R Tutorial on Machine Learning
Always wanted to compete in a Kaggle competition but not sure you have the right skillset? This interactive tutorial by Kaggle and DataCamp on Machine Learning offers the solution...
Top 20 Python Machine Learning Open Source Projects
We examine the top Python machine learning open source projects on GitHub, both in terms of contributors and commits, and identify the most popular and most active ones...
Out-of-core Learning and Model Persistence using scikit-learn
When we are applying machine learning algorithms to real-world applications, our computer hardware often still constitutes the major bottleneck of the learning process. Of course, we all have access to supercomputers, Amazon EC2, Apache Spark, etc. However, out-of-core learning via Stochastic Gradient Descent can still be attractive if we want to update our model on-the-fly ("online-learning"), and in this notebook, I want to provide some examples of how we can implement an "out-of-core" approach using scikit-learn...
Books
Python: Learn Python in One Day and Learn It Well
Clear theory and a project to work through at the end...
"I am a novice to programming and decided to learn Python as I'm told it is one of the easiest languages to learn. I read a few books on Python and this is definitely one of the best. The author is able to explain difficult concepts clearly, and the project at the end definitely helped my learning..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :)
All the best, Hannah & Sebastian