Data Science Weekly - Issue 461
Issue #461 September 22 2022
Editor's Picks
Growing a Career in NLP with Primer’s Amy Heineike
How does Amy apply her curious spirit to her work in NLP? She has been working in NLP for 7 years, on the team building Primer’s core code and applications. That means she has seen the models completely evolve. And one of the core challenges is still figuring out how to use data to draw interesting new conclusions. At this point, a lot of people realize NLP is out there, she says, “but it’s still quite hard to figure out how to make it useful.”...
Curating R-Ladies' Twitter Account - A Fun Ride!
I had the incredible pleasure (and honor) of curating R-Ladies' Twitter account this week. To make it short: It’s been a blast and a fantastic experience that I can only recommend!...If you are interested, there are multiple posts, all of which I read beforehand, about what it is like to be a curator...But let’s start from the beginning...
Productizing Large Language Models
At Replit we have deployed transformer-based language models of all sizes: ~100M parameter models for search and spam, 1-10B models for a code autocomplete product we call GhostWriter, and 100B+ models for features that require a higher reasoning ability. In this post we'll talk about what we've learned about building and hosting large language models...
A Message from this week's Sponsor:
Data Maturity Assessment
You might be data fluent, but what about the rest of your organization? Partner with team members and business stakeholders to complete Pragmatic Institute’s complimentary Data Maturity Assessment so you can measure your organization’s overall data maturity.
By discovering where your organization falls in the data maturity continuum, you can start taking steps to leverage data more strategically.
Take Assessment.
Data Science Articles & Videos
Introducing Whisper
We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition...
The Most Popular SQL Transforms - Analyzing the usage of SQL Generator
I have referred to the SQL Generator website as a helpful tool for analysts to generate complicated SQL quickly. This week, I got access to the data so we can look at which are the most popular transformations...
Brain Imaging Generation with Latent Diffusion Models
Diffusion models recently have caught the attention of the computer vision community by producing photorealistic synthetic images. In this study, we explore using Latent Diffusion Models to generate synthetic images from high-resolution 3D brain images. We used T1w MRI images from the UK Biobank dataset (N=31,740) to train our models to learn about the probabilistic distribution of brain images, conditioned on covariables, such as age, sex, and brain structure volumes. We found that our models created realistic data, and we could use the conditioning variables to control the data generation effectively...
Applying for a Master's in AI & ML
My story of a successful application to the US universities for Computer Science and Machine Learning-related degrees...
Previously this mailing list has been used for the Bayes course. Moving forward, we will continue to provide updates to the course on this mailing list. In addition, we will share announcements on Bayesian news, upcoming PyMC events, blogs, releases and more...
Presenting results for multinomial logistic regression: a marginal approach using propensity scores
My goal here is to generate a data set to illustrate how difficult it might be to interpret the parameter estimates from a multinomial model. And then I lay out a relatively simple solution that allows us to easily convert from the odds scale to the probability scale so we can more easily see the effect of the exposure on the outcome...
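For readers who want a feel for the odds-to-probability conversion mentioned above, here is a minimal Python sketch. It is not the article's code and does not implement the propensity score weighting; the coefficients are made-up numbers and the function name is hypothetical. It only shows how log-odds relative to a reference category turn into probabilities that sum to one.

```python
import math

def category_probabilities(log_odds_vs_reference):
    """Convert log-odds relative to a reference category into probabilities.

    The reference category has log-odds 0 by definition, so it is prepended.
    """
    scores = [0.0] + list(log_odds_vs_reference)
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-category outcome (category 0 is the reference).
# These values are invented for illustration, all on the log-odds scale.
intercepts = [0.5, -0.2]        # categories 1 and 2 vs. reference, unexposed
exposure_effects = [0.8, 0.3]   # change in log-odds when exposed

unexposed = category_probabilities(intercepts)
exposed = category_probabilities(
    [a + b for a, b in zip(intercepts, exposure_effects)]
)

for k, (p0, p1) in enumerate(zip(unexposed, exposed)):
    print(f"category {k}: unexposed={p0:.3f} exposed={p1:.3f} diff={p1 - p0:+.3f}")
```

With these invented numbers, category 2 has a positive exposure coefficient yet its probability goes down under exposure, because category 1 gains more. That is the kind of interpretation trap the probability scale makes visible.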
Generative AI: A Creative New World
The fields that generative AI addresses—knowledge work and creative work—comprise billions of workers. Generative AI can make these workers at least 10% more efficient and/or creative: they become not only faster and more efficient, but more capable than before. Therefore, Generative AI has the potential to generate trillions of dollars of economic value...
Perspectives on knowledge acquisition & mobilization with neural net - Hugo Larochelle - CoLLAs 2022 [Video]
In this talk, I’ll share my thoughts on the state of progress in designing AI systems with neural networks. I’ll frame a perspective that views our success as relying on two separate and equally critical steps, that I refer to as neural knowledge acquisition and neural knowledge mobilization. Then I’ll describe my own research journey from that point of view using various examples, discuss lessons learned and highlight what I think are the opportunities and challenges ahead...
survex: model-agnostic explainability for survival analysis
In this blog, we’d like to cover how model explainability can help make informed choices when working with survival models by showcasing the capabilities of the survex R package...
How to build TRUST in Machine Learning, the sane way
Building trust in machine learning is tough. Loss of trust is possibly the biggest risk that a business can ever face ☠. Unfortunately, people tend to discuss this topic in a very superficial and buzzwordy manner...In this post, I will present why it is difficult to build trust in machine learning projects. To gain the most business value from the model, we want stakeholders to trust it. We want to provide defensive mechanisms to avoid problems impacting stakeholders and to build developers’ trust in the product...
How to incorporate biological insights into network models and why it matters
Here, we argue that building biologically realistic network models is crucial to establishing causal relationships between neurons, synapses, circuits, and behavior. More specifically, we advocate for network models that consider the connectivity structure and the recorded activity dynamics while evaluating task performance...
SQLite: Past, Present, and Future
SQLite is the most widely deployed database engine (or likely even software of any type) in existence. It is found in nearly every smartphone (iOS and Android), computer, web browser, television, and automobile. There are likely over one trillion SQLite databases in active use...
Tool*
DataQA is a no-code tool for model error and quality analysis
Assessing the quality of a model is more than just looking at a few metrics; problems can often be hidden in biases or underperforming segments that are important to the business.
DataQA enables data science teams to accelerate their model QA with an intuitive no-code platform. With it, teams can quickly inspect model performance visually across different segments of the data. DataQA keeps non-technical domain experts involved in the process, replacing the need to send emails and spreadsheets.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
Data Scientist - Success Academy Charter Schools, Inc - NYC
This new Data Scientist role will be a key contributor to our mission of driving innovation across the organization. Reporting to the Leader of Enterprise Analytics, this role will be responsible for working with stakeholders in various functions to understand areas of opportunity, developing analytical solutions ranging from dashboards to sophisticated mathematical models, and helping functional teams adopt those solutions. This role will be part of a highly collaborative team of professionals with a wide range of skills including data science, data engineering, business analysis, and project management....
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Linear Regression
A Visual Introduction To (Almost) Everything You Should Know...
Understanding the Snowflake Query Optimizer
The job of a query optimizer is to reduce the cost of queries without changing what they do. Optimizers cleverly manipulate the underlying data pipelines of a query to eliminate work, pare down expensive operations, and optimally re-arrange tasks...there are three types of optimizations you need to know about: scan reduction (limiting the volume of data read), query rewriting (reorganizing a query to reduce cost), and join optimization (the NP-hard problem of optimally executing a join)...In this post, I'll share a reference of the most common optimizations you might expect to see when working with Snowflake...
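To make "scan reduction" a little more concrete, here is a toy Python sketch of one common form of it: skipping whole chunks of data whose stored min/max range cannot match a filter. This is an illustration under our own assumptions, not Snowflake's implementation; the Partition class and scan_with_pruning function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    """A chunk of stored values plus min/max metadata (hypothetical layout)."""
    min_value: int
    max_value: int
    rows: list

def scan_with_pruning(partitions, low, high):
    """Return values with low <= value <= high and the number of partitions read.

    Partitions whose [min_value, max_value] range cannot overlap the filter
    are skipped without looking at their rows.
    """
    matched = []
    partitions_read = 0
    for p in partitions:
        if p.max_value < low or p.min_value > high:
            continue  # pruned: metadata alone proves no row can match
        partitions_read += 1
        matched.extend(r for r in p.rows if low <= r <= high)
    return matched, partitions_read

partitions = [
    Partition(1, 10, [1, 4, 10]),
    Partition(11, 20, [11, 15, 20]),
    Partition(21, 30, [21, 27, 30]),
]

rows, read = scan_with_pruning(partitions, 12, 18)
print(f"matched={rows}, partitions read={read} of {len(partitions)}")
```

The result is the same as a full scan, but only one of the three partitions is actually read, which is the point of the optimization: less data touched, same answer.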
The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning [Video]
The concept of word embeddings is a central one in natural language processing (NLP). It's a method of representing words numerically, as lists of numbers that capture their meaning. Word2vec is an algorithm (a couple of algorithms, actually) for creating word vectors, which helped popularize this concept. In this video, Jay takes you on a guided tour of The Illustrated Word2Vec, an article explaining the method and how it came to be developed...
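As a small illustration of "words as lists of numbers", here is a Python sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are hand-written for this example and are not learned by word2vec; real embeddings are trained on text and typically have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy, hand-written vectors (an assumption for illustration only).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print(f"king vs queen: {king_queen:.3f}")
print(f"king vs apple: {king_apple:.3f}")
```

Words whose vectors point in similar directions score close to 1, so in this toy setup "king" lands much nearer to "queen" than to "apple".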
What you’re up to – notes from DSW readers
Fill out the form below to appear here :) ...
* To share your projects and updates, share the details here.
** Want to chat with one of the above people? Hit reply and let us know :)
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's newsletter here.
Cutting Room Floor
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian