Data Science Weekly - Issue 32
Issue #32 July 3 2014
Editor Picks
Visualizing Algorithms
Algorithms are a fascinating use case for visualization. To visualize an algorithm, we don’t merely fit data to a chart; there is no primary dataset. Instead there are logical rules that describe behavior. This may be why algorithm visualizations are so unusual, as designers experiment with novel forms to better communicate. This is reason enough to study them...
The Ideal Length of Everything Online, Backed by Research
Every so often when I’m tweeting or emailing, I’ll think: Should I really be writing so much? Curious, I dug around and found some answers for the ideal lengths of tweets and titles and everything in between...
How Warby Parker Supercharged Its Data Science Tools
For e-commerce companies, customer behavior is a treasure trove of helpful information. Here's how Warby Parker uses it...
Data Science Articles & Videos
Principal Component Analysis (PCA) vs. Ordinary Least Squares (OLS) - A Visual Explanation
If you have wondered what makes OLS and PCA different, open up an R session and play along...
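If you would rather play along in Python than R, here is a minimal sketch of the usual distinction, assuming two-dimensional data: OLS minimizes vertical distances to the line, while the first principal component minimizes perpendicular distances. The function names and the sample data are our own illustration, not taken from the linked post.

```python
import math


def centered_moments(xs, ys):
    """Return the centered sums of squares and cross-products (Sxx, Syy, Sxy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def ols_slope(xs, ys):
    """Slope of the y-on-x regression line (minimizes vertical errors)."""
    sxx, _, sxy = centered_moments(xs, ys)
    return sxy / sxx


def pca_slope(xs, ys):
    """Slope of the first principal axis (minimizes perpendicular errors).

    Uses the closed-form leading eigenvector of the 2x2 covariance matrix.
    """
    sxx, syy, sxy = centered_moments(xs, ys)
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; the slope is undefined here")
    diff = syy - sxx
    return (diff + math.hypot(diff, 2 * sxy)) / (2 * sxy)


if __name__ == "__main__":
    # Illustrative data only.
    xs = [1, 2, 3, 4, 5]
    ys = [1.0, 3.0, 2.0, 5.0, 4.0]
    print("OLS slope:", ols_slope(xs, ys))
    print("PCA slope:", pca_slope(xs, ys))
```

On noisy data the PCA line comes out at least as steep as the OLS line, and the two coincide only when the points fall exactly on a line.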
Sibyl: A System for Large Scale Machine Learning at Google
Last week, at the IEEE/IFIP International Conference on Dependable Systems and Networks, Google Software Engineer Tushar Chandra gave a keynote address outlining the systems aspects of Sibyl, a supervised machine learning system that is used for solving a variety of prediction challenges, such as YouTube video recommendations...
New Beginnings in Facial Recognition
Developments in neural networks and deep learning are bringing great improvements in facial recognition, which could have exciting (and scary) applications on platforms like Google Glass...
What Data-Driven Soccer Could Look Like During the Next World Cup
The world of sport is already incredibly data-driven. Imagine what soccer could look like if Silicon Valley gets its big data hands on the game...
The Hidden Markov, The Goal-Keeper and The World Cup
It’s really hard to believe that Spain, the previous World Cup champion, could not get past the group stage. I feel sorry for Casillas [goalkeeper], but Spain losing 1:5 to the Netherlands was really unexpected. So, we suggest using a Hidden Markov Model to discuss the performance of Casillas in the match...
The Kaggle Higgs Challenge – Beat the benchmarks with scikit-learn
This post is intended as a quick-start guide to getting a competitive score in the Higgs Boson Machine Learning Challenge, using just a bit of python and scikit-learn...
Big data's dirty problem
Inaccuracies, misspellings, and obsolete information make achieving the big data utopia a slog for businesses and researchers...
Clustering Documents & Gaussian Data with Dirichlet Process Mixture Models
In this post we will try to link the theory with the practice by introducing two Dirichlet Process Mixture Models (DPMMs): the Dirichlet Multivariate Normal Mixture Model, which can be used to cluster Gaussian data, and the Dirichlet-Multinomial Mixture Model, which is used to cluster documents...
NIH Associate Director for Data Science - The Importance of “Data to the Biomedicine Enterprise”
During his keynote speech at Big Data in Biomedicine 2014, Philip Bourne, PhD, the first permanent associate director for data science at the National Institutes of Health, shared how the federal agency hopes to capitalize on big data to accelerate biomedicine discovery, address scientific questions with potential societal benefit and promote open science...
Jobs
Data Scientist - TrueCar Inc - Santa Barbara, CA
We’re looking for a Data Scientist who will be responsible for exploratory data analysis, data mining and building predictive algorithms of various automotive industry data sets. This person will analyze current modeling methodologies with a focus on possible improvements to prediction accuracy. As an Analyst in ALG’s Analytics Team, this role will also spend a significant amount of time exploring large datasets and building new predictive algorithms for a variety of automotive valuation projects...
Training & Resources
Markov Chains Explained
A Markov chain is a probabilistic process that relies only on the current state to predict the next state...
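To make that idea concrete, here is a minimal sketch of a two-state chain in Python. The weather states, the transition probabilities and the function names are hypothetical and chosen only for illustration; they do not come from the linked tutorial.

```python
import random

# Hypothetical transition table: each row gives the probability of moving
# from the current state to each possible next state. Rows sum to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Sample the next state using only the current state's row."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return a path of steps + 1 states, beginning at start."""
    rng = random.Random(seed)  # seeded so repeated runs give the same path
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


if __name__ == "__main__":
    print(simulate("sunny", 10))
```

Note that next_state never looks at the earlier history of the path, only at the most recent state; that is the defining property of the chain.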
Pandas CookBook
This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to add to this documentation. This is a great First Pull Request (to add interesting links and/or put short code inline for existing links)...
Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses
In this short tutorial I want to provide a brief overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses...
Books
Building probabilistic graphical models with Python
Just released...
"If you are a data scientist who knows about machine learning and want to enhance your knowledge of graphical models, such as Bayes network, in order to use them to solve real-world problems using Python libraries, this book is for you. This book is intended for those who have some Python and machine learning experience, or are exploring the machine learning field."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Did you enjoy the newsletter? Do you have friends/colleagues who might like it too? If so, please forward it along - we would love to have them onboard :)