Data Science Weekly - Issue 29
Issue #29, June 12, 2014
Editor Picks
Getting started in Data Science: My thoughts [Trey Causey]
There's no denying that 'data scientist' is a hot job title to have right now, and for good reason. It's a tremendously fun and challenging field to be in, and despite all of the often undeserved hoopla that surrounds it, data scientists are doing some pretty amazing things. So it's no surprise that many people are clamoring to find out how to become data scientists. As I run a blog that attempts to teach some basic data science using sports analytics, I often get email asking how one gets started in data science and/or how quickly one can learn the prerequisites for being a data scientist. Instead of replying to these all the time, I thought I'd write my thoughts up here...
Has Uber led to a decrease in Seattle DUIs?
Using a technique called regression discontinuity, an Uber blog post claimed they'd reduced DUI arrests by about 10% on average. The post bugged me though, because there is not a lot of detail on the methods, and regression discontinuity is the sort of research design that is very much dependent on specification. In this post I replicate the study and walk through what regression discontinuity is and why it can be a very effective research design. Ultimately I think it’s plausible that Uber did in fact reduce DUIs in Seattle, but the story is a bit more complex than the blog post lets on...
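For readers new to the technique, here is a minimal sketch of a sharp regression discontinuity estimate. It is not the method or data from either post: the numbers are synthetic and noise-free, and the names `fit_line`, `rd_estimate`, `cutoff` and `bandwidth` are our own. The idea is to fit a separate line on each side of the cutoff (here, a hypothetical launch day) and read the effect off the gap between the two fits at the cutoff.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(xs, ys, cutoff, bandwidth):
    """Jump in y at `cutoff`, from separate linear fits within +/- bandwidth."""
    # Center x on the cutoff so each intercept is the fitted value at the cutoff.
    left = [(x - cutoff, y) for x, y in zip(xs, ys)
            if cutoff - bandwidth <= x < cutoff]
    right = [(x - cutoff, y) for x, y in zip(xs, ys)
             if cutoff <= x <= cutoff + bandwidth]
    a_left, _ = fit_line([x for x, _ in left], [y for _, y in left])
    a_right, _ = fit_line([x for x, _ in right], [y for _, y in right])
    return a_right - a_left


# Synthetic example (assumed numbers): a slow upward trend in daily arrests,
# with a drop of 2 per day starting at a hypothetical launch on day 30.
days = list(range(60))
arrests = [20 + 0.1 * d - (2.0 if d >= 30 else 0.0) for d in days]

effect = rd_estimate(days, arrests, cutoff=30, bandwidth=15)
print(round(effect, 6))  # recovers the built-in drop of -2.0
```

The bandwidth and the choice of a linear fit are exactly the kind of specification decisions the excerpt warns about: on real, noisy data, changing either can move the estimate.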
Statistical Shortcomings in Standard Math Libraries (And How To Fix Them)
Ever wonder why statisticians flock to specialized languages? It’s not hard to figure out. They’re not just flocking; in part, they’re fleeing. The C standard math library, and by extension nearly every standard math library out there, lacks even the most basic functionality for doing statistical analysis...
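As one illustration of the gap (our own example, not taken from the article): C99's math library provides `erf` and `erfc` but no normal distribution function, so even the normal CDF has to be assembled by hand. Python's `math` module exposes the same error functions, and a sketch might look like this, with `normal_cdf` being a name we made up:

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma), built from the complementary error function."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / (sigma * math.sqrt(2.0))
    # Phi(x) = 0.5 * erfc(-z); using erfc avoids cancellation in the lower tail.
    return 0.5 * math.erfc(-z)


print(normal_cdf(0.0))             # 0.5 by symmetry
print(round(normal_cdf(1.96), 3))  # about 0.975
```

Quantiles, t-distributions and similar functions need considerably more work than this, which is the sort of shortcoming the article's title points at.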
Data Science Articles & Videos
Data Science vs The Hunch: What happens when the figures contradict your gut instinct?
Despite the widespread adoption of analytics and scientific testing by businesses around the world, the management sixth sense continues to flourish...
A Statistical Model of the 2014 World Cup
Today we introduce our statistical model for predicting the outcome of the 2014 World Cup. At a very high level, our approach is as follows...
What Big Data Visualization Analytics can learn from Radiology
As I research part III of the “What Healthcare can learn from Wall Street” series, which is probably going to turn into a Part III, Part IV, and Part V, I was thinking about visualization tools in big data and how to use them to analyze large data sets rapidly (relatively) by a human (or a deep unsupervised learning type algorithm) – and it came to me that we radiologists have been doing this for years...
Sentiment Classification Using scikit-learn (Ryan Rosario talk)
Facebook produces millions of pieces of text content every day. In this talk we discuss a system based on scikit-learn and the Python scientific computing ecosystem that describes and models positive and negative sentiment of user generated content on Facebook...
Open-sourcing Haxl, a library for Haskell
Today we’re open-sourcing Haxl, a Haskell library that simplifies access to remote data, such as databases or web-based services. Haxl is a layer that sits between the application code and one or more “data sources”—APIs for fetching remote data...
Frequentism and Bayesianism II: When Results Differ
While it is easy to show that the two approaches are often equivalent for simple problems, it is also true that they can diverge greatly for more complicated problems. I've found that in practice, this divergence makes itself most clear in two different situations...
Finding Entity Names in Google's Knowledge Graph
I wrote about this patent because it describes how data janitors might use anchor text pointed to a page about an entity to help find other names for that entity. I wrote about it because it does a great job of showing how this knowledge graph kind of fact extraction differs from the web crawling and indexing that we often talk about when we talk about the indexing of things found on the web...
Self-Teaching Neural Network
What is this? My experiments creating a self-teaching neural network (nn) using genetic algorithms in JavaScript. What the hell is that? A neural network can be thought of as a computational model of a biological brain, made up of a network of neurons that receive input and create outputs...
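To make the idea concrete, here is a toy sketch of evolving a network's weights with a genetic algorithm rather than training them by gradient descent. It is our own illustration in Python, not the author's JavaScript code: the "network" is a single neuron, the task (logical OR), the population size, and the mutation scale are all assumptions, and the names `predict`, `fitness` and `evolve` are hypothetical.

```python
import random

# Truth table for logical OR: (inputs, expected output).
CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def predict(weights, inputs):
    """Single neuron with a step activation; weights = [w1, w2, bias]."""
    w1, w2, bias = weights
    return 1 if w1 * inputs[0] + w2 * inputs[1] + bias > 0 else 0


def fitness(weights):
    """Number of truth-table rows the neuron gets right (0 to 4)."""
    return sum(predict(weights, x) == target for x, target in CASES)


def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Elitism: the top quarter survives unchanged and breeds the rest.
        parents = population[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Crossover picks each weight from one parent, then mutates it slightly.
            child = [rng.choice(pair) + rng.gauss(0, 0.2) for pair in zip(mother, father)]
            children.append(child)
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(fitness(best), history)
```

Because the fittest individuals are carried over unchanged, the best score in `history` can never go down from one generation to the next; whether it reaches a perfect 4 depends on the random seed and settings.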
The Colors of Chemistry
This notebook documents my exploration of color theory and its applications to photochemistry. It also shows off the functionality of several Julia packages: Color.jl for color theory and colorimetry, SIUnits.jl for unitful computations, and Gadfly.jl for graph plotting...
Jobs
Data Scientist - Dow Jones - New York, NY
When you join Dow Jones, you become part of one of the most dynamic, creative and savvy news and information companies in the world. The Data Scientist is an integral member of the Dow Jones Data Science and Engineering team. The role will support the data science and engineering strategy at Dow Jones...
Training & Resources
Data Science Stack Exchange
Data Science Stack Exchange is a question and answer site for Data science professionals, Machine Learning specialists, and those interested in learning more about the field. It's 100% free, no registration required...
The Data Science Clock
Learning Data Science isn’t easy – even just working out what you need to learn about is tricky. This is what made Swami Chandrasekaran come up with his Curriculum via Metromap (well worth a look!). Inspired by that, I’ve created a Data Science Clock...
Numeric Matrix Manipulation: The Cheat Sheet for MATLAB, Python NumPy, R, and Julia
At its core, this article is about a simple cheat sheet for basic operations on numeric matrices, which can be very useful if you are working and experimenting with some of the most popular languages that are used for scientific computing, statistics, and data analysis...
Books
Practical Machine Learning: Innovations in Recommendation
Just released! ... And free on Kindle :)
"Building a simple but powerful recommendation system is much easier than you think. This report explains innovations that make machine learning practical for business production settings — and demonstrates how even a small-scale development team can design an effective large-scale recommender. The style of the report makes this subject approachable for all levels of expertise."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Did you enjoy the newsletter? Do you have friends/colleagues who might like it too? If so, please forward it along - we would love to have them onboard :)