[in case you missed it] Data Science Weekly - Issue 413

Oct 24, 2021

Issue #413 October 21 2021

Editor Picks

The philosophical and musical failings of “Beethoven X: The AI Project”
When VAN asked me to do a review of an artificial-intelligence-created realization of Beethoven’s Tenth Symphony called “Beethoven X: The AI Project,” which is based on the skimpy sketches he left when he died, I more or less groaned in my reply. “Not for me,” I said. “I know pretty much what I’ll think about it, and my review could get snarky.” “If so, that would be all right with us,” VAN said. “Well, OK,” I groaned back. So here I am and here goes...At the end of the symphony I found myself more philosophical than annoyed. I’ll start with that...

MIT's "The Missing Semester of Your CS Education" Class
Classes teach you all about advanced topics within CS, from operating systems to machine learning, but there’s one critical subject that’s rarely covered, and is instead left to students to figure out on their own: proficiency with their tools. We’ll teach you how to master the command-line, use a powerful text editor, use fancy features of version control systems, and much more!...

Predicting Spreadsheet Formulas from Semi-structured Contexts
We describe a new model that learns to automatically generate formulas based on the rich context around a target cell. When a user starts writing a formula with the “=” sign in a target cell, the system generates possible relevant formulas for that cell by learning patterns of formulas in historical spreadsheets....

A Message from this week's Sponsor:

Kickstart Your New Career with a Data Science & Analytics Bootcamp

Don’t miss your chance to join a Data Scientist-led, online Metis bootcamp plus get career support until you’re hired. Bootcamps are starting soon! Ready to take your data science or analytics career to the next level? Learn more about the Metis Online Data Science & Analytics Bootcamps.

Data Science Articles & Videos

Explaining in Style: Training a GAN to explain a classifier in StyleSpace
We propose a training procedure for a StyleGAN, which incorporates the classifier model, in order to learn a classifier-specific StyleSpace. Explanatory attributes are then selected from this space. These can be used to visualize the effect of changing multiple attributes per image, thus providing image-specific explanations. We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be modified in different ways to change its classifier output. Our results show that the method finds attributes that align well with semantic ones, generate meaningful image-specific explanations, and are human-interpretable as measured in user-studies....

Challenges in Detoxifying Language Models
Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the safety of generated text is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity...We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores for texts generated by models with strong toxicity reduction interventions...

Tesla officially launches its insurance using ‘real-time driving behavior,’ starting in Texas
CEO Elon Musk announced that Tesla would expand its insurance product to Texas and add the real-time driving behavior aspect...Electrek was able to confirm that the product is now live in Texas and Tesla owners can apply for quotes...Unlike the product offered by Tesla in California, this new policy in Texas includes the real-time data...

Composability in Julia: Implementing Deep Equilibrium Models via Neural ODEs
In this blog post we will show how to easily, efficiently, and robustly use steady state nonlinear solvers with neural networks in Julia. We will showcase the relationship between steady states and ODEs, thus making a connection between the methods for Deep Equilibrium Models (DEQs) and Neural ODEs...
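
As a rough illustration of the steady-state connection described above, here is a minimal sketch in plain Python (not Julia, and not code from the post). The one-dimensional `layer` function, its weight `w`, the input `x`, and the step sizes are all made up for the example. It compares direct fixed-point iteration of z = f(z, x) with explicit Euler integration of dz/dt = f(z, x) - z, which should settle at the same point when the layer is a contraction.

```python
import math


def layer(z, x, w=0.5):
    # Hypothetical one-dimensional "layer"; |w| < 1 keeps it a contraction,
    # so a unique fixed point z* = layer(z*, x) exists.
    return math.tanh(w * z + x)


def fixed_point(x, steps=100):
    # Equilibrium view: iterate z <- f(z, x) until it stops changing.
    z = 0.0
    for _ in range(steps):
        z = layer(z, x)
    return z


def ode_steady_state(x, step=0.5, steps=200):
    # ODE view: explicit Euler on dz/dt = f(z, x) - z.
    # The derivative is zero exactly where z = f(z, x).
    z = 0.0
    for _ in range(steps):
        z += step * (layer(z, x) - z)
    return z


if __name__ == "__main__":
    print(fixed_point(1.0), ode_steady_state(1.0))
```

With a step size of 1 the Euler update reduces to the plain fixed-point iteration, which is one way to see why the two views agree.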

Comprehensive List of Kaggle Solutions and Ideas
This is a list of almost all available solutions and ideas shared by top performers in past Kaggle competitions. This list gets updated as soon as a new competition finishes...

ETL Pipelines with Airflow: the Good, the Bad and the Ugly
In this article, we review how to use Airflow ETL operators to transfer data from Postgres to BigQuery with the ETL and ELT paradigms. Then, we share some challenges you may encounter when attempting to load data incrementally with Airflow DAGs. Finally, we argue why Airflow ETL operators won’t be able to cover the long tail of integrations for your business data...

Considerations Before Pushing Machine Learning Models to Production
As a data scientist, I see daily the challenges that come with putting AI-based solutions into production. These challenges are numerous and cover a variety of aspects: modeling and system design, data engineering, resource management, SLA, etc...I don’t claim mastery in any of those fields. I do however know that implementing some software engineering principles and using the right tools helped me a lot in making my work reproducible and ready for production...In this article, I’ll share with you 7 of the considerations I have in mind before productionizing my models....

How to Train Large Deep Learning Models as a Startup
Training large deep learning models is expensive and slow. Yet, startups are all about iterating fast. In this post, we share the lessons we've learned over the past few years...

Generative art resources in R
An extremely incomplete (and probably biased) list of resources to help an aspiring generative artist get started making pretty pictures in R...

Who is a Data Scientist in 2021?
Every year we publish a study on 1,001 data scientist profiles. The information is collected from public LinkedIn profiles, assuming that the information posted on the social media platform is an unbiased estimator of their resume...This research allows us to gain insights, with a reasonable degree of certainty, about who is a data scientist in 2021. We present only aggregate data to highlight important trends that can be useful to anyone who wants to break into the field, as well as to organizations looking to hire data scientists....

Tools*

Create AI-powered search and recommendation apps with Pinecone

Pinecone is a fully managed vector database that makes it easy to add vector search to production applications. It combines state-of-the-art vector search libraries, advanced features such as filtering, and distributed infrastructure to provide high performance and reliability at any scale. Get started now — it's free!

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Entry Level Data Scientist: 2022 - IBM - Multiple Locations

As a Data Scientist at IBM, you will help transform our clients’ data into tangible business value by analyzing information, communicating outcomes and collaborating on product development. Work with Best in Class open source and visual tools, along with the most flexible and scalable deployment options. Whether it’s investigating patient trends or weather patterns, you will work to solve real world problems for the industries transforming how we live.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Carnegie Mellon University 10721: Philosophical Foundations of Machine Intelligence
What is this field? What are its normative aims? What are its modes of inquiry? What are (and have been) its intellectual and ideological commitments? What foundational questions is it in dialogue with, and what foundational obstacles obstruct its progress? Finally: What are our responsibilities as researchers & practitioners deploying this technology?...

SHAP: Explain Any Machine Learning Model in Python
Imagine you are trying to train a machine learning model to predict whether an ad is clicked by a particular person. After receiving some information about a person, the model predicts that a person will not click on an ad...But why does the model predict that? How much does each feature contribute to the prediction? Wouldn’t it be nice if you could see a plot indicating how much each feature contributes to the prediction?...That is when the Shapley value comes in handy...
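
To make "how much each feature contributes" concrete, here is a minimal sketch of an exact Shapley value calculation in plain Python. It is not the SHAP library's API and is only practical for a handful of features, since it enumerates every feature ordering. The `toy_click_model`, its coefficients, and the baseline input are invented for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction, relative to a baseline input.

    For every ordering of the features, switch them from the baseline value
    to the actual value one at a time and record each marginal change.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            output = predict(current)
            totals[i] += output - previous
            previous = output
    return [t / len(orderings) for t in totals]


def toy_click_model(features):
    # Hypothetical scoring function with one interaction term.
    age, time_on_site, past_clicks = features
    return (0.1 * age + 0.5 * time_on_site + 2.0 * past_clicks
            + 0.3 * time_on_site * past_clicks)


if __name__ == "__main__":
    person = [30, 4, 1]
    reference = [40, 2, 0]
    print(shapley_values(toy_click_model, person, reference))
```

The contributions always add up to the difference between the prediction for the person and the prediction for the baseline, which is what makes them readable as a per-feature breakdown.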

Random Forests Algorithm explained with a real-life example and some Python code
Random Forests is a Machine Learning algorithm that tackles one of the biggest problems with Decision Trees: variance...Even though Decision Trees is simple and flexible, it is a greedy algorithm. It focuses on optimizing for the node split at hand, rather than taking into account how that split impacts the entire tree. A greedy approach makes Decision Trees run faster, but makes it prone to overfitting...An overfit tree is highly optimized for predicting the values in the training dataset, resulting in a learning model with high variance. How you calculate variance in a Decision Tree depends on the problem you’re solving...
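
The greedy step mentioned above can be sketched as follows. This is a minimal illustration, not code from the article: it picks a single threshold on one feature by minimizing the weighted variance of the targets on each side, which is one common criterion for regression problems. The function name `best_split` and the sample data are made up, and no forest is built here.

```python
from statistics import pvariance


def best_split(xs, ys):
    """Pick the threshold on one feature that minimizes weighted target variance.

    Only the split at hand is scored; nothing looks ahead to how the
    choice affects splits further down the tree, which is the greedy part.
    """
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * pvariance(left)
                 + len(right) * pvariance(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_threshold, best_score


if __name__ == "__main__":
    xs = [1, 2, 3, 10, 11, 12]
    ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    print(best_split(xs, ys))
```

A full tree would apply this choice recursively to each side, and a random forest would average many such trees trained on resampled data.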

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian

Data Science Weekly Newsletter
