Data Science Weekly - Issue 33
Issue #33, July 10, 2014
Editor Picks
Understanding Random Forests: From Theory to Practice
The goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability...
Frequentism and Bayesianism: What's the Big Deal?
Statistical analysis comes in two main flavors: frequentist and Bayesian. The subtle differences between the two can lead to widely divergent approaches to common data analysis tasks. After a brief discussion of the philosophical distinctions between the views, I’ll utilize well-known Python libraries to demonstrate how this philosophy affects practical approaches to several common analysis tasks...
Feature Learning Escapades
My summer internship work at Google has turned into a CVPR 2014 Oral titled "Large-scale Video Classification with Convolutional Neural Networks". Politically correct, professional, and carefully crafted scientific exposition in the paper and during my oral presentation at CVPR last week is one thing, but I thought this blog might be a nice medium to also give a more informal and personal account of the story behind the paper and how it fits into a larger context...
Data Science Articles & Videos
Project Jupyter
Launching Project Jupyter, the evolution of the language-agnostic parts of IPython into an open platform for interactive computation, research and education...
Building a Data Pipeline from Scratch - Joe Crobak, Project Florida
Big data processing with Apache Hadoop, Spark, Storm and friends is all the rage right now. But getting started with one of these systems requires an enormous amount of infrastructure, and there are an overwhelming number of decisions to be made...
How to become a Data Scientist
Ryan Orban of Zipfian Academy and Dennis O'Brien of Idle Games talk about becoming a data scientist...
Modeling Musical Influence with Topic Models
The role of musical influence has long been debated by scholars and critics in the humanities, but never in a data-driven way. In this work we approach the question of influence by applying topic-modeling tools to a dataset of 24941 songs by 9222 artists from the years 1922 to 2010...
Replicating 538's plot styles in Matplotlib
After pulling a few graphs locally, sampling colors, and crowd-sourcing the fonts used, I was able to come pretty close to replicating the style in Matplotlib styles. Here's an example (my figure dropped into an article on FiveThirtyEight.com)...
An Exhaustive List of Google's Ranking Factors
[...] scoured the internet to find all of the mentions of Google ranking factors and created an all-inclusive infographic listing and categorizing 200 factors that make up Google's ranking algorithm...
Google Cofounder Sergey Brin: We Will Make Machines That 'Can Reason, Think, And Do Things Better Than We Can'
Google cofounders Larry Page and Sergey Brin sat for an interview with venture capitalist Vinod Khosla. During the interview, Brin was asked about machine learning and artificial intelligence. He says that so far, we haven't come close to replicating human intelligence. However, he thinks it's only a matter of time before that changes...
Elo ratings (part 4)
We’ve almost arrived at the end of the ratings and rankings tutorials. I’ll do one more post on Markov ratings, then a couple of posts on ensemble ratings, and then it’ll almost be time for the season. This week I’ll be talking about Elo ratings. Originally used to rate and rank chess players, Elo ratings are now used in a number of sports, including by Jeff Sagarin for USA Today. They’re a very simple and elegant way to create ratings...
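For readers who have not met Elo ratings before, here is a minimal sketch of the textbook update rule. It is not taken from the linked post: the K-factor of 32 and the 400-point scale are the conventional chess defaults and are assumptions here, and the function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    # Logistic curve on the rating difference; 400 is the conventional scale.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed 32 here) controls how far a single result moves a rating.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two evenly rated teams: the winner gains k / 2 points.
    print(update_ratings(1500.0, 1500.0, 1.0))
```

An upset (a low-rated side beating a high-rated one) moves both ratings further than an expected result does, because the gap between the actual and expected score is larger.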
META: Data Science Most Read Articles
Most read articles from the Data Science Weekly Newsletter by Quarter - updated to reflect the latest quarter of newsletters...
Jobs
Data Scientist - Predictive Modeler, Humana Inc - Dallas, TX
Humana needs your analytical skills to help us tell a compelling story about healthcare today. As a Data Scientist/Predictive Modeler, you will conduct cutting-edge analyses that drive Humana’s Behavioral Health strategy and operations. You will collect and analyze large amounts of data from disparate sources to develop and implement sophisticated predictive models that will help improve health outcomes. You will be responsible for presenting technical information to a broad audience. You will use claims, business, consumer and other data sources to develop innovative solutions that ultimately lead to improved clinical outcomes for our members...
Training & Resources
36 Excellent Data Visualization Tools
Data is always useful, but it is hard to comprehend when it is not presented clearly. This is where data visualization comes in...
SciPy 2014 Videos
SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference allows participants from all types of organizations to showcase their latest projects, learn from skilled users and developers, and collaborate on code development...
Machine Learning and Pattern Classification
A collection of tutorials and examples for solving and understanding machine learning and pattern classification tasks...
Books
Love and Math: The Heart of Hidden Reality
A New York Times Science Bestseller...
"An interesting combination of an autobiography and the latest developments in the unification of branches of mathematics and their potential application in theoretical physics."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Did you enjoy the newsletter? Do you have friends/colleagues who might like it too? If so, please forward it along - we would love to have them on board :)