Data Science Weekly - Issue 107
Issue #107, December 10, 2015
Editor Picks
Wikipedia-Mining Algorithm Reveals World’s Most Influential Universities
An algorithm’s list of the most influential universities contains some surprising entries...
Many rules of statistics are wrong
There are two kinds of people who violate the rules of statistical inference: people who don't know them and people who don't agree with them. I'm the second kind...
Hidden Technical Debt in Machine Learning Systems
Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems...
A Message from this week's Sponsor:
The Power of Three
97% of developers will fail you. We are the other 3%. Get access to power.
Data Science Articles & Videos
Google and Facebook Race to Solve the Ancient Game of Go with AI
Over the last 20 years, machines have topped the best humans at so many games of intellectual skill, we now assume computers can beat us at just about anything. But Go, the Eastern version of chess in which two players compete with polished stones on a 19-by-19-line grid, remains the exception...
Analyzing San Francisco Crime Data to Determine When Arrests Occur
The SF OpenData portal is a good source for detailed statistics about San Francisco. One of the most popular datasets on the portal is the SFPD Incidents dataset, which contains a tabular list of 1,842,050 reports (at time of writing) from 2003 to present. For this article, I’m going to do something different and illustrate the data processing step-by-step, both as a teaching tool, and to show that I am not using vague methodology to generate a narratively-convenient conclusion...
The math of humor: Why 'snunkoople' makes you squeal
A mathematical formula that predicts which words people find funny might not get its own Comedy Central special, but it does offer insights into the human mind....
Visualising your hiking trails and photos with My Tracks, R and Leaflet
After a hiking vacation, it is nice to have some sort of visual record. While there are likely professional solutions to record and visualise your trails, as a recreational hiker you can already get a lot of mileage from your smartphone in combination with the R data-analysis ecosystem...
Learning to Generate Chairs, Tables and Cars with Convolutional Networks
We train generative 'up-convolutional' neural networks which are able to generate images of objects given object style, viewpoint, and color. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of different models, interpolate between given views to generate the missing ones, extrapolate views, and invent new objects not present in the training set by recombining training instances, or even two different object classes...
Where are the Opportunities for Machine Learning Start-Ups?
Machine Learning and AI are fast becoming ubiquitous in data-driven businesses, that is to say, an awful lot of businesses. Here I choose a few areas where it’s possible that big corporations haven’t already eaten everybody’s lunch...
Exploring Virtual Reality Data Visualization with Gear VR
With the release of the Gear VR virtual reality headset by Samsung and Oculus, it feels like the future is here. It’s easy to see how a number of industries are going to be disrupted by this new media format over the next few years... But what about data science? The applications are much less clear than in entertainment and marketing, but it’s likely that virtual reality will enable some interesting new data visualizations that 2D images, even interactive ones, don’t provide...
How Much Memory Does A Data Scientist Need?
Recently, I discovered an interesting blog post, “Big RAM is eating big data – Size of datasets used for analytics”, from Szilard Pafka. He says that “Big RAM is eating big data”. This phrase means that memory sizes are growing much faster than the data sets that typical data scientists process. So, data scientists do not need as much data as the industry offers them. Would you agree?...
Beyond the Venn diagram
Identifying the essential skills for data scientists....
Do I need to be an open-source contributor to be taken seriously by a Data Science Hiring Manager?
A response to a reader question on the role of open-source contributions in hiring: they are basically just one potential "proof point"...
Jobs
Data Scientist - eBay - San Jose, CA
At eBay, our systems scale to billions of transactions per day, and we run our sites 24x7 with 99.99% reliability. We pride ourselves on being the leader in cloud computing, Big Data, search, and many other leading-edge technologies. We are seeking highly talented, creative, and passionate applied researchers to help us create the most relevant recommendations, machine translation and search experiences...
Training & Resources
NIPS 2015 papers
This year's NIPS 2015 papers in nice LDA format...
Pretty Tensor - Fluent Neural Networks in TensorFlow
Pretty Tensor provides a high level builder API for TensorFlow. It provides thin wrappers on Tensors so that you can easily build multi-layer neural networks...
What's the best database for an analyst?
Which database is best? The answer, obviously, depends on what you want to use it for...
Books
Mastering Predictive Analytics with R
Well-written and full of good examples...
"As a data scientist lead, I found this to be a book that is full of clear explanations on the important topics you need to master predictive modeling... The code samples provided with the book are very well organized and make it trivial to pick up and execute examples from anywhere in the book. I'm currently recommending this to everyone who joins my team..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :) - All the best, Hannah & Sebastian