Data Science Weekly

Jul 01, 2021

Issue #397 July 01 2021

Editor Picks

Alien Dreams: An Emerging Art Scene
In recent months there has been a bit of an explosion in the AI generated art scene...Ever since OpenAI released the weights and code for their CLIP model, various hackers, artists, researchers, and deep learning enthusiasts have figured out how to utilize CLIP as an effective “natural language steering wheel” for various generative models, allowing artists to create all sorts of interesting visual art merely by inputting some text – a caption, a poem, a lyric, a word – to one of these models...

Biases in AI Systems - A survey for practitioners
The broad goal of this [ACM] article is to educate nondomain experts and practitioners such as ML developers about various types of biases that can occur across the different stages of the AI pipeline and suggest checklists for mitigating bias...As this article is directed at aiding ML developers, the focus is not on the design of fair AI algorithms but rather on practical aspects that can be followed to limit and test for bias during problem formulation, data creation, data analysis, and evaluation...

Introduction to Machine Learning Interviews Book [Free]
This book is the result of the collective wisdom of many people who have sat on both sides of the interview table and who have spent a lot of time thinking about the hiring process. It was written with candidates in mind, but hiring managers who saw the early drafts told me that they found it helpful to learn how other companies are hiring, and to rethink their own process....The book consists of two parts. The first part provides an overview of the machine learning interview process, what types of machine learning roles are available, what skills each role requires, what kinds of questions are often asked, and how to prepare for them...The second part consists of over 200 knowledge questions, each noted with its level of difficulty -- interviews for more senior roles should expect harder questions -- that cover important concepts and common misconceptions in machine learning...

A Message from this week's Sponsor:

Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.

The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)
Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate
Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more ...

Data Science Articles & Videos

Your AI pair programmer
With GitHub Copilot, get suggestions for whole lines or entire functions right inside your editor...

Artificial Intelligence for us and on our terms with Rohit Prasad [Video]
Hear from Amazon VP and Alexa chief scientist Rohit Prasad on why we’re at a crossroads in AI and what we must do to ensure AI advances on our terms...Featuring: Rohit Prasad, Vice President and Head Scientist, Alexa AI, Amazon...

Semantic Search: Measuring Meaning From Jaccard to Bert
Similarity search is a complex topic and there are countless techniques for building effective search engines...In this article, we’ll cover a few of the most interesting — and powerful — of these techniques — focusing specifically on semantic search. We’ll learn how they work, what they’re good at, and how we can implement them ourselves...
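
For readers new to the starting point named in the title, here is a minimal sketch of Jaccard similarity on word sets. It is not taken from the linked article; the function name and the whitespace tokenization are assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings.

    Hypothetical helper: lowercases and splits on whitespace, which is
    a simplifying assumption rather than the linked article's method.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    # Size of the intersection divided by size of the union.
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(jaccard("the cat sat", "the cat ran"))  # 2 shared words out of 4 distinct
```

Because it only counts shared tokens, this measure sees no similarity between texts that use different words for the same idea, which is the gap embedding-based methods such as BERT aim to close.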

Bio Eats World: Evolving Embodied Intelligence
Embodied intelligence posits that the body, or the physical form, plays an active and significant role in shaping an agent’s mind and cognitive capacities...Our [podcast] guests, Li Fei-Fei and Surya Ganguli of the Stanford Institute for Human-Centered AI, set out to develop what they call an “evolutionary playground” to explore the development of embodied intelligence in AI and its connection with the environment and with learning using in silico experiments. They discuss with a16z general partner Vijay Pande and host Lauren Richardson how they created a suite of virtual environments in which agents evolve through a process that mimics aspects of Darwinian evolution...

Removing the 80% Blind Spot: How to Make Unstructured Organizational Data Available for Decision Making
In this article, I’ll describe how we turned unstructured data into useful information with the help of pre-trained machine learning models and semantic search. The specific use-case is about unstructured profile descriptions of our employees and how we can start asking arbitrary questions about the co-workers (e.g. who can I talk to about puppies?)...

Double Machine Learning for causal inference
This post tries to explain, briefly yet comprehensively enough, what Double Machine Learning is and how it works. For this purpose, we will cover the topic from its theoretical foundations to a typical example of application in causal inference...

Bird Sound Classifier on the Edge
The project attempts to recognize different bird calls by continuously listening through the onboard mic of the Nano 33 BLE Sense. Each call it hears is passed to the model, which classifies it as one of the birds it was trained on...If no bird calls are heard, the audio is classified as background noise, since background noise was also included during training. This project can be helpful for people who are interested in birding and would like to understand the habitat or patterns of the bird calls...

Habitat 2.0: Training home assistant robots with faster simulation and new benchmarks
Today, we are announcing Habitat 2.0, a next-generation simulation platform that lets AI researchers teach machines not only to navigate through photo-realistic 3D virtual environments but also to interact with objects just as they would in an actual kitchen, dining room, or other commonly used space....

Deep Learning on photorealistic synthetic data
By the end of this article, we want to be able to detect the occurrence and position of a cup in real photographs (and maybe even do segmentation and pose estimation) without hand-annotating even a single training datum...

Group thousands of similar spreadsheet text cells in seconds
For this tutorial, I’m going to use this dataset of U.S. Department of Labor wage theft investigations. It contains every DOL investigation of employers due to minimum wage or overtime violations from 1984 through 2018...we’ll cover: a) Building a Document Term Matrix with TF-IDF and N-Grams, b) Using cosine similarity to calculate proximity between strings, c) Using a hash table to convert our findings to a “groups” column in our spreadsheet...
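
The three steps listed above can be sketched in plain Python. This is an illustrative approximation, not the tutorial's code: the function names, the character trigram choice, the smoothed IDF formula, the 0.6 threshold, and the sample employer names are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a padded, lowercased string into character n-grams."""
    text = " " + text.lower().strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Step a: build L2-normalized TF-IDF vectors (as dicts) over n-grams."""
    counts = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumed weighting) so no n-gram gets zero weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Step b: cosine similarity of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def group_names(names, threshold=0.6):
    """Step c: map each name to a group label using a dict as the hash table."""
    vectors = tfidf_vectors(names)
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(name, name)  # an unmatched name starts its own group
        for j in range(i + 1, len(names)):
            other = names[j]
            if other not in groups and cosine(vectors[i], vectors[j]) >= threshold:
                groups[other] = groups[name]
    return groups


if __name__ == "__main__":
    # Made-up sample names, not rows from the DOL dataset.
    sample = ["Walmart Inc", "Walmart Inc.", "Target Corp"]
    for name, label in group_names(sample).items():
        print(f"{name!r} -> {label!r}")
```

The pairwise loop here is quadratic in the number of names, so grouping thousands of cells quickly would need sparse matrix operations instead; the sketch only shows how the pieces fit together.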

Training*

Sharpen your data skills by solving 3 questions per week – for free

Get data science interview questions frequently asked at top companies every Monday, Wednesday & Friday. Solve the problem before receiving the solution the next morning. Check your work and sharpen your skills! Join our free newsletter.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Scientist - WarnerMedia - New York, NY

WarnerMedia is a leading media and entertainment company that creates and distributes premium and popular content from a diverse array of talented storytellers and journalists to global audiences through its consumer brands including: HBO, HBO Max, Warner Bros., TNT, TBS, truTV, CNN, DC Entertainment, New Line, Cartoon Network, Adult Swim, Turner Classic Movies and others.

Reporting to the Sr. Manager, Data Science, this role will help to develop the predictive insights and prescriptive capabilities behind CNN’s emerging products, transforming first- and third-party data into quantitative findings, visualizations, and automation.

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Meerkat: DataPanels for Machine Learning
This blog post introduces Meerkat, a new Python library for wrangling complex datasets across stages of the ML lifecycle...Meerkat is motivated by a few trends in ML: dataset manipulation, model evaluation, and multi-modal datasets that combine multiple, complex data types...

Introducing Lightning Flash — From Deep Learning Baseline To Research in a Flash
Flash is a collection of tasks for fast prototyping, baselining and fine-tuning scalable Deep Learning models, built on PyTorch Lightning. Whether you are new to deep learning, or an experienced researcher, Flash offers a seamless experience from baseline experiments to state-of-the-art research. It allows you to build models without being overwhelmed by all the details, and then seamlessly override and experiment with Lightning for full flexibility. Continue reading to learn how to use Flash tasks to get state-of-the-art results in a flash...

Getting Started with Decision Tree Classifiers
Decision trees are a rule-based approach to classification and regression problems. They use the values in each feature to split the dataset to a point where all data points that have the same class are grouped together. However, there’s a clear trade-off between interpretability and performance....

Books

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
