Data Science Weekly - Issue 394
June 10, 2021
Editor Picks
Why We Should End the Data Economy
The data economy depends on violating our right to privacy on a massive scale, collecting as much personal data as possible for profit...
A graph placement methodology for fast chip design
Chip floorplanning is the engineering task of designing the physical layout of a computer chip. Despite five decades of research, chip floorplanning has defied automation, requiring months of intense effort by physical design engineers to produce manufacturable layouts. Here we present a deep reinforcement learning approach to chip floorplanning. In under six hours, our method automatically generates chip floorplans that are superior or comparable to those produced by humans in all key metrics, including power consumption, performance and chip area...
Attack of the Robot Authors! - The NaNoGenMo 2020 Roundup
Despite everything, November still arrived in 2020, and with it a new crop of “novels” written by computer programs for National Novel Generation Month...NaNoGenMo is an event where programmers write a computer program that outputs a “novel” — a document of at least 50,000 words...I hope to show the variety of entrants from 2020, but I also encourage people to check out the main Issues page on GitHub. There are more novels than just those covered here, and each entry has commentary from the author plus links to the source code to learn more...
A Message from this week's Sponsor:
Online Data Science Programs from Drexel University
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization, using leading industry technology that you can apply in your career. Learn more.
Data Science Articles & Videos
Launching the NetHack Challenge at NeurIPS 2021
NetHack is frequently referred to as one of the hardest games in the world...To this end, Facebook AI open-sourced the NetHack Learning Environment (NLE) last year. This year, as part of a NeurIPS 2021 competition, we are proud to launch the NetHack Challenge—the most accessible grand challenge for AI research—with our partner and co-organizer AIcrowd...
Graph-Based Deep Learning for Medical Diagnosis and Analysis: Past, Present and Future
In this survey, we thoroughly review the different types of graph architectures and their applications in healthcare. We provide an overview of these methods in a systematic manner, organized by their domain of application including functional connectivity, anatomical structure and electrical-based analysis. We also outline the limitations of existing techniques and discuss potential directions for future research...
Data Cascades in Machine Learning
In “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”, published at the 2021 ACM CHI Conference, we study and validate downstream effects from data issues that result in technical debt over time (defined as "data cascades"). Specifically, we illustrate the phenomenon of data cascades with the data practices and challenges of ML practitioners...This work is the first that we know of to formalize, measure, and discuss data cascades in ML as applied to real-world projects. We further discuss the opportunity presented by a collective re-imagining of ML data as a high priority, including rewarding ML data work and workers, recognizing the scientific empiricism in ML data research, improving the visibility of data pipelines, and improving data equity around the world...
Automated Machine Learning using PyCaret
Automate your machine learning workflows with less than ten lines of code...
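The core of what AutoML tools like PyCaret automate is fitting many candidate models under one shared evaluation protocol and ranking them. The sketch below illustrates that loop with scikit-learn rather than PyCaret itself (the candidate list and dataset are our own illustrative choices, not PyCaret's API):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Candidate models, all scored with the same 5-fold cross-validation --
# the comparison that a PyCaret-style compare step hides behind one call.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

PyCaret's value is wrapping this pattern (plus preprocessing and tuning) so the user writes only a setup call and a compare call.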
Model Monitoring Enables Robust Machine Learning Applications
Key features of ML monitoring solutions, why companies need a holistic MLOps platform that includes model monitoring, and challenges companies face in making that happen...
New AI supercomputer will help create the largest-ever 3D map of the universe
The newest big-name supercomputer might help solve some of astrophysics' most important questions. VentureBeat reports the National Energy Research Scientific Computing Center has officially dedicated Perlmutter, billed as one of the fastest supercomputers for AI, and it will start by helping to build the largest-ever 3D map of the visible universe to study the dark energy accelerating the cosmos' expansion...The machine will process data from the Dark Energy Spectroscopic Instrument to guide observations...
Measuring the Algorithmic Efficiency of Neural Networks
Three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training. Algorithmic progress has traditionally been more difficult to quantify than compute and data. In this work, we argue that algorithmic progress has an aspect that is both straightforward to measure and interesting: reductions over time in the compute needed to reach past capabilities. We show that the number of floating-point operations required to train a classifier to AlexNet-level performance on ImageNet has decreased by a factor of 44x between 2012 and 2019. This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years. By contrast, Moore's Law would only have yielded an 11x cost improvement...
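The 44x figure and the 16-month doubling time are two views of the same number, and a quick back-of-the-envelope check confirms they are consistent:

```python
import math

factor = 44                # reduction in training compute, 2012 -> 2019
months = 7 * 12            # 84 months elapsed
doublings = math.log2(factor)            # ~5.46 halvings of compute cost
months_per_doubling = months / doublings
print(round(months_per_doubling, 1))     # ~15.4, i.e. roughly 16 months
```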
Examining Infant Relation Categorization Through Deep Neural Networks
Categorizing spatial relations is central to the development of visual understanding and spatial cognition, with roots in the first few months of life. Quinn (2003) reviews two findings in infant relation categorization: categorizing one object as above/below another precedes categorizing an object as between other objects, and categorizing relations over specific objects predates abstract relations over varying objects. We model these phenomena with deep neural networks, including contemporary architectures specialized for relational learning and vision models pretrained on baby headcam footage (Sullivan et al., 2020). Across two computational experiments, we can account for most of the developmental findings, suggesting these models are useful for studying the computational mechanisms of infant categorization...
What the Heck is a Data Mesh?!
I got sucked into a data mesh Twitter thread this weekend (it’s worth a read if you haven’t seen it). Data meshes have clearly struck a nerve. Some don’t understand them, while others believe they’re a bad idea. Yet, “Demystifying Data Mesh” and “Putting Data Mesh to Work” articles abound... Zhamak Dehghani identifies four data mesh principles: a) Domain-oriented decentralized data ownership and architecture, b) Data as a product, c) Self-serve data infrastructure as a platform, d) Federated computational governance. I believe that putting these principles on equal footing creates confusion...
Open Sesame: How to prevent 2FA user drop-off with Magic Links
Think about the login step of the user’s journey. This step is critical for keeping your customers’ data secure, but that concern often stands in tension with a seamless experience...In this post I would like to discuss one of the tools we can use to achieve both...
Training*
Quick Question For You: Do you want a Data Science job?
After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:
Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore).
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate.
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!
Click here to learn more ...
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Senior Data Scientist - WarnerMedia - New York, NY
WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.
Reporting to the Sr. Manager, Data Science, this role will help develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.
Want to post a job here? Email us for details >> team@datascienceweekly.org
Training & Resources
Tabular Data: Deep Learning is Not All You Need
A key element of AutoML systems is setting the types of models that will be used for each type of task. For classification and regression problems with tabular data, the use of tree ensemble models (like XGBoost) is usually recommended. However, several deep learning models for tabular data have recently been proposed, claiming to outperform XGBoost for some use-cases. In this paper, we explore whether these deep models should be a recommended option for tabular data, by rigorously comparing the new deep models to XGBoost on a variety of datasets...
New Courses: Machine Learning Engineering for Production
By the end of this course, you'll be ready to design and deploy an ML production system end-to-end. You'll understand project scoping, data needs, modeling strategies, and deployment requirements. You’ll know how to optimize your data, models, and infrastructure to manage costs. You'll know how to validate the integrity of your data to get it ready for production use, and then prototype, develop, and deploy your machine learning models, monitor the outcomes, and update the datasets and retrain the models continuously...
Uncertainty Quantification 360
Uncertainty quantification (UQ) gives AI the ability to express that it is unsure, adding critical transparency for the safe deployment and use of AI. This extensible open source toolkit can help you estimate, communicate, and use uncertainty in machine learning model predictions through an AI application lifecycle. We invite you to use it and improve it...
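One common way a model can "express that it is unsure" is to predict an interval rather than a point. The sketch below uses scikit-learn's quantile loss to build a 90% prediction interval — a generic illustration of the kind of estimate UQ toolkits produce, not UQ360's own API:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# Fit the 5th and 95th percentiles to bracket a 90% prediction interval.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                  random_state=0).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                  random_state=0).fit(X, y)

X_test = rng.uniform(0, 10, size=(200, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(scale=0.3, size=200)
lo, hi = lower.predict(X_test), upper.predict(X_test)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(round(coverage, 2))  # empirical fraction of points inside the interval
```

Reporting that coverage alongside predictions is exactly the "communicate uncertainty" step the toolkit is built around.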
Books
Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...
For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian