Data Science Weekly - Issue 39
Issue #39 Aug 21 2014
Editor Picks
Data Carpentry
The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry...
When A Machine Learning Algorithm Studied Fine Art Paintings,
It Saw Things Art Historians Had Never Noticed
AI reveals previously unrecognised influences between great artists...
Supervised Machine Learning: A Review of Classification Techniques
This paper describes various supervised machine learning classification techniques. Of course, a single article cannot be a complete review of all supervised machine learning classification algorithms (also known as induction classification algorithms), yet we hope that the references cited will cover the major theoretical issues, guiding the researcher in interesting research directions and suggesting possible bias combinations that have yet to be explored...
Data Science Articles & Videos
Machine Learning, Heart Rate Variability & Highway Congestion
Using a machine learning algorithm, I had determined my baseline for relaxation and stress. When I began, I used 11 features to train the algorithm, which means the algorithm drew on 11 sources of data to try to predict a relaxed or stressed state. The WEKA software can show which of the data sources is most useful in determining the outcome, and that allowed me to narrow the features to three...
How A Computer Algorithm Predicted West Africa’s Ebola Outbreak
Before It Was Announced
The Ebola outbreak in West Africa is focusing a spotlight on a unique online tool run by experts in Boston that flagged a “mystery hemorrhagic fever” in forested areas of southeastern Guinea nine days before the World Health Organization formally announced the epidemic...
Train an artificial neural network with a constraint solver
In this article I discuss a different way to "train" an artificial neural network. This technique is based on translating a neural network into a constraint problem that can be solved by a constraint solver...
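To give a feel for the idea, here is a minimal sketch that is not taken from the linked article. It treats the nine weights and biases of a tiny 2-2-1 network with step activations as variables over the small domain {-1, 0, 1}, treats each XOR training case as a constraint, and finds a satisfying assignment by exhaustive search. The network shape, the domain, and the names `forward` and `solve` are assumptions made for illustration; a real constraint solver would prune the search rather than enumerate it.

```python
from itertools import product

# Training cases act as constraints: the network must map each input to its label.
XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Hypothetical finite domain for every weight and bias.
DOMAIN = (-1, 0, 1)


def step(value):
    """Threshold activation: fires only for strictly positive input."""
    return 1 if value > 0 else 0


def forward(params, inputs):
    """Evaluate a 2-2-1 network whose nine parameters are given as a tuple."""
    w11, w12, b1, w21, w22, b2, v1, v2, c = params
    x1, x2 = inputs
    h1 = step(w11 * x1 + w12 * x2 + b1)
    h2 = step(w21 * x1 + w22 * x2 + b2)
    return step(v1 * h1 + v2 * h2 + c)


def solve():
    """Return the first assignment that satisfies every training constraint."""
    for params in product(DOMAIN, repeat=9):
        if all(forward(params, x) == y for x, y in XOR_CASES):
            return params
    return None


solution = solve()
print(solution)
```

The search space here is only 3^9 assignments, so brute force is enough; the point is that "training" becomes a question of satisfiability rather than gradient descent.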
A Convolutional Neural Network for Modelling Sentences
We describe a convolutional architecture, dubbed the Dynamic Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of sentences...
The Future of Datacenter Architecture:
Podcast with CEO and Co-Founder Florian Leibert
A few weeks ago Mesosphere’s CEO and Co-Founder Florian Leibert met with Maxta’s CEO Yoram Novick and Andreessen Horowitz’s Peter Levine to talk about the future of datacenter architecture. The team at Andreessen Horowitz recorded the session and produced this 20-minute podcast...
Democratizing Data Science
In this paper we propose ways to "democratize data science": that is, to allocate the power of data science to society's greatest needs...
A Periodic Table To Help You Choose The Best Type Of Visualization
If you want to find the best type of visualization to represent the material you’re working with, this fancy schmancy chart will be right up your alley. It offers a periodic-table-full of visualization options to choose from...
Exploratory analysis and predictive models of how Chicago's neighborhoods interact with the City's 311 service requests.
Through the City of Chicago's 311 system, every Chicagoan can ask for city services, from graffiti removal to pothole filling to abandoned car removal. The 311 data these service requests produce reflect - albeit imperfectly - the needs of the city and its inhabitants. We want to investigate how patterns of service requests are related to the social and economic makeup of Chicago's neighborhoods...
Which cities get the most sleep?
People in Melbourne sleep the most, people in Tokyo sleep the least, and Americans just need more sleep overall. Those are some of the findings from a vast new dataset released to The Wall Street Journal by Jawbone, the makers of the UP, a digitized wristband that tracks how its users move and sleep...
Streamline your cross-validation and classification workflow
The Pipeline class in scikit-learn's pipeline module offers a convenient way to execute a chain of normalization and pre-processing steps, as well as predictors and classifiers on a dataset...
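As a rough illustration of the chaining described above (not code from the linked post), the sketch below puts a scaler and a classifier into one `Pipeline` and cross-validates the whole chain. The iris dataset, the choice of `StandardScaler` and `LogisticRegression`, and the step names are assumptions; it also assumes a recent scikit-learn release where `cross_val_score` lives in `sklearn.model_selection`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only as a stand-in.
X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; the last step is the classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=200)),
])

# The scaler is re-fit on the training part of every fold, so no
# information from the held-out part leaks into pre-processing.
scores = cross_val_score(pipeline, X, y, cv=5)
mean_accuracy = scores.mean()
print(mean_accuracy)
```

Passing the pipeline, rather than a pre-scaled matrix, to `cross_val_score` is the main design point: pre-processing is learned inside each fold instead of on the full dataset.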
Jobs
Data Science Product Manager - Booking.com - Amsterdam
Booking.com is looking for an E-Commerce Product Owner with substantial experience in a data science related discipline, preferably recommender engines or other personalization methods. We are looking for someone to extend our already highly impactful teams with industry experience and a different perspective... you will be responsible for creating products and features improving our customers' experience using our data, with a strong focus on driving conversion and customer loyalty...
Training & Resources
Theano Tutorial
Theano is a software package which allows you to write symbolic code and compile it onto different architectures (in particular, CPU and GPU). It's especially good for machine learning techniques which are CPU-intensive and benefit from parallelization (e.g. large neural networks). This tutorial will cover the basic principles of Theano, including some common mental blocks which come up...
Baby steps in Python – Exploratory analysis in Python (using Pandas)
Pandas is one of the most useful data analysis libraries in Python (I know the name sounds weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem...
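For a sense of what a first exploratory pass can look like, here is a minimal sketch. It does not use the tutorial's Kaggle data: the small in-memory table and its column names (`age`, `fare`, `group`) are made up for illustration, and in practice the frame would more likely come from `pd.read_csv` on a downloaded file.

```python
import pandas as pd

# Hypothetical toy data standing in for a competition dataset.
df = pd.DataFrame({
    "age": [22, 38, 26, 35, None, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Summary statistics (count, mean, quartiles, ...) for the numeric columns.
summary = df.describe()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Average fare within each category.
mean_fare_by_group = df.groupby("group")["fare"].mean()

print(summary)
print(missing_per_column)
print(mean_fare_by_group)
```

These three calls cover the usual opening questions: what the distributions look like, where values are missing, and how a numeric column differs across categories.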
How-to: Use IPython Notebook with Apache Spark
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL...
Books
Nine Algorithms That Changed the Future:
The Ingenious Ideas That Drive Today's Computers
An easy-to-read introduction to algorithms...
"A terrific book if you are interested in understanding how these algorithms work. The author is superb at explaining the core ideas in clear, understandable terms. You don't need to be a computer geek to follow this book. All you need is a desire to understand. I wish I had had more teachers like this guy when I was in school. I am truly impressed with his ability to explain..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Enjoyed the newsletter? Please forward it to friends and peers - we'd love to have them onboard too :-) - All the best, Hannah & Sebastian