Data Science Weekly

Aug 20, 2015

Issue #91, August 20, 2015

Editor Picks

The State of Artificial Intelligence in Six Visuals
We cover many emerging markets in the startup ecosystem. Previously, we published posts that summarized Financial Technology, Internet of Things, Bitcoin, and MarTech in six visuals. This week, we do the same with Artificial Intelligence (AI). At this time, we are tracking 855 AI companies across 13 categories, with a combined funding amount of $8.75 billion...

Eigenstyle: Principal Component Analysis and Fashion
Any set of images can be broken down with Principal Component Analysis. This has been done pretty successfully with faces. Here we’ll take a look at style. Our dataset is 807 pictures of dresses from Amazon...

Deep learning for assisting the process of music composition (part 1)
This is part 1 of my explorations of using deep learning for assisting the process of music composition. In this part, I look at some almost-winning output of a model trained by deep learning methods on over 23,000 folk tunes, and make improvements to produce a session-ready piece...

Mining Administrative Data to Spur Urban Revitalization
After decades of urban investment dominated by sprawl and outward growth, municipal governments in the United States are responsible for the upkeep of urban neighborhoods that have not received sufficient resources or maintenance in many years. One of city governments' biggest challenges is to revitalize decaying neighborhoods given only limited resources. In this paper, we apply data science techniques to administrative data to help the City of Memphis, Tennessee improve distressed neighborhoods...

The reusable holdout: Preserving validity in adaptive data analysis
We present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time...

Deep Convolutional Networks on Graph-Structured Data
In this paper we consider the general question of how to construct deep architectures with small learning complexity on general non-Euclidean domains, which are typically unknown and need to be estimated from the data. In particular, we develop an extension of Spectral Networks which incorporates a Graph Estimation procedure, that we test on large-scale classification problems, matching or improving over Dropout Networks with far fewer parameters to estimate...

Adventures in Data Mining - What we talk about when we talk about space heaters
I thought maybe I could save on the gas bill by getting space heaters for just a couple rooms...I faced the problem with a vision for what might be useful to a consumer: a web app containing interactive scatterplot visualizations of the product features...Overall, the problem can then be broken down into two sub-tasks: 1) Identify the relevant product “features” or aspects that people need to know about (label the axes), and 2) Determine consumer attitudes to each feature or aspect of a product as implied by the reviews (score each product along the axes)...

Data Science on Firesquads: Classifying Emails with Naive Bayes
This year, for our Firesquad rotation, we on the Data Science squad wanted to help automate the classification of support emails. The short-term goal was to reduce the time Coach Relations needs to spend when answering emails. Longer term, this tool could allow us to automatically detect patterns and raise alarms when specific support requests are occurring at an abnormal rate...

The Effects of Hyperparameters on SGD Training of Neural Networks
The performance of neural network classifiers is determined by a number of hyperparameters, including learning rate, batch size, and depth. A number of attempts have been made to explore these parameters in the literature, and at times, to develop methods for optimizing them. However, exploration of parameter spaces has often been limited. In this note, I report the results of large scale experiments exploring these different parameters and their interactions...

With Discovery, 3 Scientists Chip Away At An Unsolvable Math Problem
Armed with an algorithm, McLoud-Mann, along with her husband, Casey Mann, and David Von Derau — all of the University of Washington, Bothell — had been trying to help unravel one of math's long-standing unanswered questions. How many shapes are able to "tile the plane" — meaning the shapes can fit together perfectly to cover any flat surface without overlapping or leaving any gaps...

Watch a Boston Dynamics humanoid robot wander around outside
Boston Dynamics, which Google bought in 2013, has begun testing one of its humanoid robots — those that are designed to function like humans — out in the wild...

Data Scientist - Remind - San Francisco, CA
Remind helps teachers, students, and parents engage in safe, simple communication. With more than 25 million users, we are one of the fastest-growing companies in edtech. We are hiring data scientists who are energized about solving the communication challenges that teachers face every day.

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
In this post you will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data...

Denoising Dirty Documents: Part 1
So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition...

Out-of-Core Dataframes in Python: Dask and OpenStreetMap
In this post, I'll take a look at how dask can be useful when looking at a large dataset: the full extracted points of interest from OpenStreetMap. We will use Dask to manipulate and explore the data, and also see the use of matplotlib's Basemap toolkit to visualize the results on a map...

The History of Statistics: The Measurement of Uncertainty before 1900
Comprehensive history of statistics from its beginnings around 1700 to its emergence as a distinct and mature discipline around 1900...

"This book is THE definitive work on the early development of statistics. Obviously written by a man in love with his subject. Bernoulli, de Moivre, Bayes, Laplace, Gauss, Quetelet, Lexis, Galton, Edgeworth and Pearson all but come alive. I particularly enjoyed the reproductions of first sources included that you would otherwise have to travel to Paris to see..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.

P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :) - All the best, Hannah & Sebastian