Data Science Weekly - Issue 42
Issue #42, Sept 11, 2014
Editor Picks
Recursive Deep Learning for NLP and Computer Vision
As the amount of unstructured text data that humanity produces overall and on the Internet grows, so does the need to intelligently process it and extract different types of knowledge from it. My research goal in this thesis is to develop learning models that can automatically induce representations of human language, in particular its structure and meaning in order to solve multiple higher level language tasks...
Large-scale graph partitioning with Apache Giraph
At the end of last year, we talked about the graph processing system Apache Giraph and our work to make it run at Facebook’s massive scale. Today, we’d like to present one of the use cases in which Giraph enabled computations that were previously difficult to process without incurring high latency...
Using Neo4j for Document Classification
Graphs are a perfect solution to organize information and to determine the relatedness of content. In this webinar, Neo4j Developer Evangelist Kenny Bastani will discuss using Neo4j to perform document classification. He will demonstrate how to build a scalable architecture for classifying natural language text using a graph-based algorithm called Hierarchical Pattern Recognition...
Data Science Articles & Videos
Why Twitter Should Not Algorithmically Curate the Timeline
Twitter’s CFO made some headlines recently by suggesting that Twitter was going to tweak the reverse chronology of the feed and introduce algorithmic curation...
Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size
The p value obtained from a significance test provides no information about the magnitude or importance of the underlying phenomenon. Therefore, additional reporting of effect size is often recommended. Effect sizes are theoretically independent from sample size. Yet this may not hold true empirically: non-independence could indicate publication bias...
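To make the diagnostic idea in this abstract concrete, here is a minimal simulation sketch (not the paper's actual method or data). It assumes a hypothetical true effect of d = 0.2, per-group sample sizes drawn uniformly from 10 to 200, the rough approximation SE = sqrt(2/n) for a standardized mean difference, and a crude one-sided filter that "publishes" a study only if d/SE > 1.96. All function names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def simulate_studies(n_studies=2000, true_d=0.2, seed=0):
    """Simulate (sample size, observed effect, published?) for each study.

    Assumptions (hypothetical, for illustration only):
    - every study estimates the same true effect `true_d`
    - the observed effect is normal around `true_d` with SE = sqrt(2 / n)
    - a study is "published" only if d / SE > 1.96
    """
    rng = random.Random(seed)
    studies = []
    for _ in range(n_studies):
        n = rng.randint(10, 200)      # per-group sample size
        se = math.sqrt(2.0 / n)       # approximate standard error of d
        d = rng.gauss(true_d, se)     # observed effect size
        published = d / se > 1.96     # crude significance filter
        studies.append((n, d, published))
    return studies


def correlation_n_vs_d(studies, published_only=False):
    """Correlation between sample size and observed effect size."""
    rows = [(n, d) for n, d, pub in studies if pub or not published_only]
    ns = [n for n, _ in rows]
    ds = [d for _, d in rows]
    return pearson(ns, ds)


if __name__ == "__main__":
    studies = simulate_studies()
    print("all studies:      r =", round(correlation_n_vs_d(studies), 3))
    print("'published' only: r =", round(correlation_n_vs_d(studies, True), 3))
```

Under these assumptions, the correlation across all simulated studies should sit near zero, while the correlation within the "published" subset should be clearly negative: small studies only clear the significance filter when they happen to overestimate the effect. That is the pattern the abstract describes as a possible sign of publication bias.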
The Revolutionary Technique That Quietly Changed Machine Vision Forever
Machines are now almost as good as humans at object recognition, and the turning point occurred in 2012, say computer scientists...
New to Machine Learning? Avoid these three mistakes
Modern machine learning (i.e. not the theoretical statistical learning that emerged in the 70s) is very much an evolving field and, despite its many successes, we are still learning what exactly ML can do for data practitioners. I gave a talk on this topic earlier this fall at Northwestern University and I wanted to share these cautionary tales with a wider audience...
Design and analysis of experiments in networks: Reducing bias from interference
In this work, we evaluate methods for designing and analyzing randomized experiments that aim to reduce this bias and thereby reduce overall error...
The Science of Crawl (Part 1): Deduplication of Web Content
We've come to discover that building a functional crawler can be done relatively cheaply, but building a robust crawler requires overcoming a few technical challenges. In this series of blog posts, we will walk through a few of these technical challenges including content deduplication, link prioritization, feature extraction and re-crawl estimation...
Data science: how is it different to statistics?
Recently, there has been much hand-wringing about the role of statistics in data science. In this and future columns, I’ll discuss both the threat and opportunity of data science. I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant...
NYT Data Scientist Chris Wiggins on the way we create and consume content
“At The New York Times, we produce a lot of content every day, but we also have a lot of data about the way people engage with that content,” Wiggins says. “[The Times] wanted to build out a data science function not only to curate and make available those data, but to learn from those data. In particular, the thing that the New York Times is interested in learning is: what makes for a good long-term relationship with a reader?”...
Long Memory and the Nile: Herodotus, Hurst and H
The ancient Egyptians were a people with long memories. So, it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile...
Accelerate Machine Learning with the cuDNN Deep Neural Network Library
Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits...
Jobs
Data Scientist - CIA - Washington DC
Do you have a passion for creating data-driven solutions to the world's most difficult problems? The CIA needs technically savvy specialists to organize and interpret Big Data to inform US decision makers, drive successful operations, and shape CIA technology and resource investments. The CIA is looking for individuals from diverse educational backgrounds to fill the role of data scientist. If you have experience in data analytics, computer science, mathematics, statistics, economics, operations research, computational social science, quantitative finance, engineering or other data analysis fields, consider a career as a Data Scientist at CIA...
Training & Resources
Libraries for Large-scale Linear Classification on Distributed Environments
MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR...
Deep Learning through Examples
Detailed talk...
The Automatic Statistician
Welcome to fully automatic data analysis...
Books
The Art of R Programming: A Tour of Statistical Software Design
Builds up from basic knowledge to more complex examples...
"There are hundreds of R books, but this is the best one to address the core problem of learning to *program* in R..."
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S. Enjoyed the newsletter? Please forward it to friends and peers - we'd love to have them onboard too :-) - All the best, Hannah & Sebastian