Data Science Weekly

Sep 01, 2016

Issue #145

Editor Picks

Y'all have a Texas accent? Siri (and the world) might be slowly killing it
Voice recognition tools such as Apple’s Siri still struggle to understand regional quirks and accents, and users are adapting the way they speak to compensate...

Facebook is trying to get rid of bias in Trending news by getting rid of humans
Facebook will no longer employ humans to write descriptions for items in its Trending section, which attracted controversy over allegations of political bias in May...

The Three Faces of Bayes
As I’ve read more outside the fields of machine learning and natural language processing — from psychometrics and environmental biology to hackers who dabble in data science — I’ve noticed three distinct uses of the term “Bayesian.”...I’ll present the three main uses of “Bayesian” as I understand them, all through the lens of a naïve Bayes classifier. I hope you find it useful and interesting!...

Harness the business power of big data.

How far could you go with the right experience and education? Find out at Capitol Technology University. Earn your PhD in Management & Decision Sciences in as little as three years, in convenient online classes. Banking, healthcare, energy and business all rely on insightful analysis, and business analytics spending will grow to $89.6 billion in 2018. This is a tremendous opportunity, and Capitol’s PhD program will prepare you for it. Learn more now!

An Exclusive Look at How AI and Machine Learning Work at Apple
An exclusive inside look at how artificial intelligence and machine learning work at Apple...

Majority Of Mathematicians Hail From Just 24 Scientific ‘families’
Most of the world’s mathematicians fall into just 24 scientific 'families', one of which dates back to the fifteenth century. The insight comes from an analysis of the Mathematics Genealogy Project (MGP), which aims to connect all mathematicians, living and dead, into family trees on the basis of teacher–pupil lineages, in particular who an individual's doctoral adviser was...

The sneaky math that made the lottery more alluring — and harder to win
As lottery jackpots climbed to new highs earlier this year, including the Powerball’s historic $1.5 billion jackpot in January, followers of Salil Mehta’s blog began writing him to ask if there were strategies they could use to improve their odds...

Debugging Machine Learning
I've been thinking, mostly in the context of teaching, about how to specifically teach debugging of machine learning. Personally I find it very helpful to break things down in terms of the usual error terms: Bayes error, approximation error, estimation error, and optimization error. I've generally found that trying to isolate errors to one of these pieces, and then debugging that piece in particular has been useful...For instance, my general debugging strategy involves steps like the following...

Infrastructure For Deep Learning
In this post, we'll [OpenAI] share how deep learning research usually proceeds, describe the infrastructure choices we've made to support it, and open-source kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. We hope you find this post useful in building your own deep learning infrastructure...

Large scale matrix multiplication with pyspark (or — how to match two large datasets of company names)
Spark and pyspark have wonderful support for reliable distribution and parallelization of programs as well as support for many basic algebraic operations and machine learning algorithms...In this post we describe the motivation and means of performing name-by-name matching of two large datasets of company names using Spark....
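
To make the idea concrete, here is a minimal single-machine sketch of name matching expressed as a matrix product. It is not the post's Spark code: the character-trigram features and the helper names (`trigram_vector`, `similarity_matrix`, `best_matches`) are assumptions chosen for illustration. Each name becomes a normalized sparse vector, and the similarity table is the product of one dataset's vectors with the transpose of the other's.

```python
import math
from collections import Counter


def trigram_vector(name):
    """Map a name to an L2-normalized sparse vector of character trigrams."""
    text = f"  {name.lower().strip()} "  # pad so short names still yield trigrams
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def similarity_matrix(names_a, names_b):
    """Compute A @ B.T over the sparse rows, i.e. pairwise cosine similarity."""
    vecs_a = [trigram_vector(n) for n in names_a]
    vecs_b = [trigram_vector(n) for n in names_b]
    return [
        [sum(w * vb.get(gram, 0.0) for gram, w in va.items()) for vb in vecs_b]
        for va in vecs_a
    ]


def best_matches(names_a, names_b):
    """Pair each name in names_a with its most similar name in names_b."""
    sims = similarity_matrix(names_a, names_b)
    return [
        (a, names_b[max(range(len(names_b)), key=row.__getitem__)])
        for a, row in zip(names_a, sims)
    ]


matches = best_matches(
    ["Acme Corp", "Globex Inc"],
    ["Globex Incorporated", "ACME Corporation"],
)
```

At the scale the post describes, the nested loops would be replaced by a distributed sparse matrix product, which is where Spark comes in.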

Learning From Imbalanced Classes
If you’re fresh from a machine learning course, chances are most of the datasets you used were fairly easy. Among other things, when you built classifiers, the example classes were balanced, meaning there were approximately the same number of examples of each class...But when you start looking at real, uncleaned data one of the first things you notice is that it’s a lot noisier and imbalanced...where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases...If you deal with such problems and want practical advice on how to address them, read on...
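
A small sketch of why imbalance matters, using made-up labels (95 negatives, 5 positives). A classifier that always predicts the majority class looks accurate yet finds no positive cases. The inverse-frequency class weights shown are one common remedy, not necessarily the one the linked article recommends, and the function names are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    total = sum(t == positive for t in y_true)
    return hits / total if total else 0.0


# Hypothetical imbalanced dataset: 95 uninteresting cases, 5 interesting ones.
y_true = [0] * 95 + [1] * 5
y_always_negative = [0] * 100  # a "classifier" that ignores its input

acc = accuracy(y_true, y_always_negative)   # high, despite being useless
rec = recall(y_true, y_always_negative)     # no positives are found
weights = balanced_class_weights(y_true)    # rare class gets the larger weight
```

Weights like these can be applied to the per-example loss so that mistakes on the rare class cost more during training.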

Densely Connected Convolutional Networks
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion...The DenseNet obtains significant improvements over the state-of-the-art on all five highly competitive object recognition benchmark tasks (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN)...
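
The connectivity pattern can be sketched with a toy dense block. This assumes, as in the DenseNet paper, that "connected" means each layer receives the concatenation of all earlier feature maps. The layers here are hypothetical stand-ins (a ReLU of the input mean), not real convolutions, and the names are illustrative.

```python
def make_toy_layer(growth_rate, seen_input_sizes):
    """Build a stand-in layer that emits `growth_rate` features (not a real convolution)."""
    def layer(inputs):
        seen_input_sizes.append(len(inputs))   # record the width of the concatenated input
        mean = sum(inputs) / len(inputs)
        return [max(0.0, mean)] * growth_rate  # ReLU of the mean, repeated
    return layer


def dense_block(x, layers):
    """Feed every layer the concatenation of the input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for feature in features for v in feature]
        features.append(layer(concatenated))
    return [v for feature in features for v in feature]


def direct_connections(num_layers):
    """Direct connections in a dense block with num_layers layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


sizes = []
layers = [make_toy_layer(growth_rate=2, seen_input_sizes=sizes) for _ in range(3)]
output = dense_block([1.0, 2.0, 3.0], layers)
```

Each layer's input grows by the growth rate, which is why individual layers can stay narrow while later layers still see every earlier feature.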

Data Scientist - WeWork - NYC
We are seeking a mid-level data scientist to join WeWork’s Data Science team. This sits within the Data Team, a centralized, relatively small, collaborative, and tightly-knit team whose job is to support analysts and decision makers across the organization, and to support you. We want you to focus on shipping machine learning models and making a difference to the business...

Deep Learning Part 1: Comparison of Symbolic Deep Learning Frameworks
Deep learning is an emerging field of research with applications across multiple domains. I try to show how a transfer learning and fine-tuning strategy leads to re-usability of the same Convolutional Neural Network model in different disjoint domains. Applying the fine-tuned model across these domains is where its value comes from...In this blog (Part 1), I describe and compare the commonly used open-source deep learning frameworks. I dive deep into the pros and cons of each framework, and discuss why I chose Theano for my work...

Conda: Myths and Misconceptions
In the four years since its initial release, many words have been spilt introducing conda and espousing its merits, but one thing I have consistently noticed is the number of misconceptions that seem to remain in the (often fervent) discussions surrounding this tool. I hope in this post to do a small part in putting these myths and misconceptions to rest...

PyFlux: An open source time series library for the Python Programming Language
Built upon the NumPy/SciPy/Pandas libraries, PyFlux allows for easy application of a vast array of time series methods and inference capabilities...

Python Pocket Reference
Updated for both Python 3.4 and 2.7, this convenient pocket guide is the perfect on-the-job quick reference. You’ll find concise, need-to-know information on Python types and statements, special method names, built-in functions and exceptions, commonly used standard library modules, and other prominent Python tools...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S. Interested in reaching fellow readers of this newsletter? Consider sponsoring! Email us for details :)

All the best,
Hannah & Sebastian