Data Science Weekly - Issue 483
Curated news, articles and jobs related to Data Science.
February 23, 2023
The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape
The annual MAD landscape is our attempt at making sense of this vibrant space. Its general philosophy has been to open source work that we would do anyway, and start a conversation with the community. So, here we are again, in 2023. This is our ninth annual landscape and “state of the union” of the data and AI ecosystem…This annual state of the union post is organized in four parts: Part I: The Landscape, Part II: Market trends: Financings, M&A and IPOs, Part III: Trends in data infrastructure, and Part IV: Trends in ML/AI…
Turn your Pandas dataframe into a Tableau-style UI for visual analysis
I've just made a plugin which turns your pandas dataframe into a Tableau-style component. It allows you to explore the dataframe with an easy drag-and-drop UI. You can use PyGWalker in Jupyter, Google Colab, or even Kaggle Notebook to easily explore your data and generate interactive visualizations. PyGWalker (pronounced like "Pig Walker", just for fun) is named as an abbreviation of "Python binding of Graphic Walker"…
Cleaning sample data in standardized way
In the previous two posts of this series we reviewed how to standardize the steps in our data cleaning process to produce consistent datasets across the field of education research, as well as practices we can implement to make our data cleaning workflow more reproducible and reliable. In this final post of the series, I attempt to answer the question, “What does this process look like when implemented in the real world?”. To tackle this question I created a very simple sample dataset based on the following fictitious scenario…
A Message from this week's Sponsor:
Pinecone vector database
The Pinecone vector database makes it easy to build high-performance vector search applications. Developer-friendly, fully managed, and easily scalable without infrastructure hassles.
Use Pinecone to build semantic search, object recognition, recommendations, anomaly detection, and other vector-based functionality into your applications.
Data Science Articles & Videos
I'm creating animations and instructional videos about neural networks: Convolution, Padding, Stride, Groups, Depthwise, and Depthwise-Separable, and more…
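As a rough companion to the terms in that list, here is a minimal sketch of the arithmetic behind them: how padding and stride change the output size, and how groups and depthwise-separable layers change the weight count. The function names are hypothetical, and the sketch assumes square kernels, no dilation, and no bias terms; it is not taken from the videos.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial dimension (no dilation assumed)."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a square-kernel convolution, ignoring bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out


def depthwise_separable_params(c_in, c_out, kernel):
    """Depthwise conv (groups == c_in) followed by a 1x1 pointwise conv."""
    depthwise = conv_params(c_in, c_in, kernel, groups=c_in)
    pointwise = conv_params(c_in, c_out, kernel=1)
    return depthwise + pointwise


if __name__ == "__main__":
    print(conv_output_size(32, kernel=3, stride=2, padding=1))
    print(conv_params(64, 128, kernel=3))
    print(depthwise_separable_params(64, 128, kernel=3))
```

With 64 input channels, 128 output channels and a 3x3 kernel, the separable version needs far fewer weights than the standard convolution, which is the usual motivation for it.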
Early thoughts on regulating generative AI like ChatGPT
The sense of authenticity will make generative AI appealing for malicious use where the truth is less important than the message it advances, such as disinformation campaigns and online harassment. It is also why an early commercial application of generative AI is to create marketing content, where the strict accuracy of the writing simply isn’t very important. However, when the media website CNET started using generative models for writing financial articles, where the truth is quite important, the articles were discovered to have many errors. These two examples offer a glimpse into two separate sources of risk from generative AI which warrant separate consideration, and likely, distinct policy interventions…
Spreadsheet Risk Management
Today I learned that there is a conference on Spreadsheet Risk Management, which is hosted by the European Spreadsheet Risk Interest Group. According to the conference paper list, this event has been active since 2000, and it's been a source of research on the potential dangers of Excel. Some of the listed articles show interesting research too…
Papers on the UX of AI programming assistants
This is a list of research papers investigating the user experience of AI-powered programming assistants (e.g., Copilot). I started the list because I was finding it difficult to keep up with the massive surge of papers recently…
Infrastructure in '23
For the third year running, I set aside some time at the beginning of the year to share what I believe to be the most dynamic and important areas of innovation in infrastructure…
Treatment Plan Builder
It conducts smart literature searches and extracts treatments to help cancer researchers develop treatment plans. Determining the best cancer treatment is time-consuming and complex, requiring researchers to attempt the impossible task of staying on top of and synthesizing thousands of medical papers…
Collaborating with data scientists in your team [Reddit Discussion]
Of course, implementing better code practices such as commenting and better naming would be a good thing, but this idea is often met with the "We don't have time to do that, we have worse problems and priorities"…So here's my question: Are there any other data scientists that have the same issues as me? How do you approach this problem? I know collaboration is and will always be a source of friction, and I'm not trying to find a way to remove all of it, but at least adding some oil in the process would greatly help…
Natural Language Processing: From Prototype to Production [Free Webinar, Fri Feb 24]
In this fireside chat, Ines joins Hugo Bowne-Anderson to discuss what NLP in production actually looks like, including patterns, trends, challenges, use cases, and more…After attending, you’ll know: a) What NLP in production actually means, b) Trends and patterns in NLP use cases across industries, c) What skills data scientists and ML engineers need to build end-to-end NLP systems, d) What all the hype around large language models such as GPT-3 is about, and how they can deliver real value in the world of NLP, and more!…
The technology behind GitHub’s new code search
So, how does it work? The short answer is that we built our own search engine from scratch, in Rust, specifically for the domain of code search. We call this search engine Blackbird, but before I explain how it works, I think it helps to understand our motivation a little bit. At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren’t there plenty of existing, open source solutions out there already? Why build something new?…
Things you wish you knew before you started training on the cloud? [Reddit Discussion]
I really like training in the cloud for some reason, and it feels satisfying. However, here are a couple of things I wish I had known beforehand to get things started (see below)...Now what about you all?...
A New Way to Predict Probability Distributions: Exploring multi-quantile regression with Catboost
This article will explore two examples using the multi-quantile loss function on synthetic data. While these examples aren’t necessarily reflective of real-world datasets, they will help us understand how well this loss function quantifies uncertainty by predicting the quantiles of a target distribution…
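For readers who have not met the loss before, here is a minimal sketch of the pinball (quantile) loss that multi-quantile regression builds on. It is plain Python, not CatBoost's implementation; the function names, the toy data, and the choice to average across quantile levels are assumptions made for illustration.

```python
from statistics import mean


def pinball_loss(y_true, y_pred, alpha):
    """Pinball (quantile) loss for a single quantile level alpha in (0, 1)."""
    losses = []
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
        losses.append(alpha * diff if diff >= 0 else (alpha - 1) * diff)
    return mean(losses)


def multi_quantile_loss(y_true, quantile_preds, alphas):
    """Average pinball loss over several quantile levels (an assumed reduction).

    quantile_preds[i][j] is the prediction for sample i at level alphas[j].
    """
    per_level = [
        pinball_loss(y_true, [row[j] for row in quantile_preds], alpha)
        for j, alpha in enumerate(alphas)
    ]
    return mean(per_level)


# Toy data: targets 1..9 with the same quantile predictions for every sample.
y = [1, 2, 3, 4, 5, 6, 7, 8, 9]
alphas = [0.1, 0.5, 0.9]
spread = [[1, 5, 9]] * len(y)     # roughly the 10%, 50% and 90% quantiles
collapsed = [[5, 5, 5]] * len(y)  # the median reused for every level

if __name__ == "__main__":
    print(multi_quantile_loss(y, spread, alphas))
    print(multi_quantile_loss(y, collapsed, alphas))
```

Predictions that spread out to match the tails of the target score a lower loss than predictions that collapse onto the median, which is why minimizing this loss pushes a model to estimate the quantiles rather than a single central value.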
Down In the Sewers - Tracking Viruses with WastewaterSCAN
As more COVID-19 testing happens at home rather than through a lab, reported data from lab tests has become less available and less reliable. But everyone who lives or works within a sewershed contributes waste to the treatment plant that services that area, and pathogens can be detected with a high degree of accuracy through wastewater solids collected at wastewater treatment plants. The CDC’s National Wastewater Surveillance System, which at times has included data from over 1,200 wastewater treatment plants, is a powerful example of using wastewater surveillance to measure COVID-19 rates…
Software Developer Job Opportunity at Observable, Inc
SALARY AND HOURS: $107,640 - $150,000 per year; 40 hours per week.
EXPERIENCE AND REQUIREMENTS: Bachelor's degree in Computer Science.
DESCRIPTION OF DUTIES:
• Design, develop, test, deploy, maintain and improve software.
• Write code for Observable’s product and platform, create reliable and sustainable systems, and develop prototypes quickly.
• Write unit and integration tests to ensure the software is functioning correctly and securely.
• Deploy and release software at a regular cadence.
• Support and improve the software through on call and support tasks.
• Communicate and interact with users to understand their requirements and respond to
• Collaborate on projects with designers, engineers and product managers.
Want to post a job here? Email us for details --> firstname.lastname@example.org
Training & Resources
Introduction to Data-Centric AI Class
This is the first-ever course on DCAI. This class covers algorithms to find and fix common issues in ML data and to construct better datasets, concentrating on data used in supervised learning tasks like classification. All material taught in this course is highly practical, focused on impactful aspects of real-world ML applications, rather than mathematical details of how particular models work. You can take this course to learn practical techniques not covered in most ML classes, which will help mitigate the “garbage in, garbage out” problem that plagues many real-world ML applications…
Data Visualization Fundamentals and Best Practices [Free]
When do you use a bar chart over a line chart? What are area charts good for? What's wrong with pie charts? Learn about how these different types of data visualization work, and how they're used, in Observable's first data visualization course! Attend lectures (or watch them later), ask questions, and once you've completed a small assignment at the end, you'll earn a certificate…
WiDS Stanford Conference [Online]
The 2023 conference will feature keynotes, technical talks, panel discussions, networking, and more. March 8, 2023. All are welcome…
Qdrant open-source vector search engine launches managed cloud platform
Qdrant is a robust vector similarity search engine with advanced filtering support. It is written in Rust for stability and high performance, as shown in benchmarks.
The managed cloud platform is now fully available for business use, allowing companies of any size to benefit from Qdrant's cutting-edge features without handling its deployment and maintenance.
The Qdrant cloud platform can be accessed through the website.
* Sponsored post
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #482 here.
Cutting Room Floor
Suppressing quantum errors by scaling a surface code logical qubit
Datacast episode 109: Developer productivity, real-time data infrastructure, and the fat-tailed nature of enterprise software with Nnamdi Iregbulem
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
Scaling data-driven robotics with reward sketching and batch reinforcement learning
Understanding Vision Transformers (ViTs): Hidden properties, insights, and robustness of their representations
Hattie Zhou: Lottery Tickets and Algorithmic Reasoning in LLMs
Have an awesome week!
All our best,
Hannah & Sebastian
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.