Data Science Weekly - Issue 515
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #515
October 05 2023
Help!?
We’re thinking of adding a second issue of the newsletter per week because there’s so much going on that we’ve been excluding a bunch of great posts. (This week I had 100+ tabs open to select the articles for this edition)
New delivery dates would be Tuesday and Friday.
What do you think?
Hello!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If you don’t find this email useful, please unsubscribe here.
Is this newsletter helpful to your job? Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week
Editor's Picks
My F100 company analyzed why our good data scientists are good and here's the recap [Reddit]
A small team of internal researchers inside the company spent time investigating which data scientists performed the best, which performed the worst, and what factors played into this. The top 3 indicators of a high-performing data scientist were…
Beyond Hypothesis Tests & P-values: Iterative Refinement in Science and Business
The trajectory of scientific and business methodologies has been marked by evolving frameworks, each seeking to address the multifaceted challenges of their respective domains. While the principle of falsification, introduced by Karl Popper, set the stage, it became increasingly evident that more adaptable and iterative methodologies were needed to navigate the complexities of the modern era. George E.P. Box’s insights on the synergy between theory and practice offered a beacon. Building upon this, Devezer and Buzbas have further refined this vision for today’s intricate challenges. Yet, there remains a pressing need to popularize and integrate these principles, especially within the business community…
The art of data: Empowering art institutions with data and analytics
To help art institutions get started on the journey to increase their leverage of technology, we collaborated with seven leading US art institutions to gain insights into how to strengthen their data and analytics practices…Our work included the creation of an easy-to-use, objective, and scalable dashboard designed to inform institutional strategy, improve business operations, and establish the proper use of data and analytics within each organization…Building data and analytics capabilities: Five-step framework…
A Message from this week's Sponsor:
Build full-stack, private applications with the power of zero-knowledge on Aleo
Aleo is a Layer-1 blockchain built from the ground up with zero-knowledge woven into every layer of the tech stack.
Why build with us?
Aleo harnesses the power of zero-knowledge to deliver both privacy and scalability. By handling computation off-chain, it opens the door to a new era of private, scalable dapps.
You don’t need a Ph.D. in cryptography to use zero-knowledge. With our language, Leo, zero-knowledge circuits are automatically generated from your program, allowing any developer to access zero-knowledge proofs.
Development integrates seamlessly on the web with the Aleo SDK. Manage accounts, deploy programs, and integrate with the Aleo network right in your browser.
Know a problem that can be solved with zk? Get paid for your good ideas with Aleo's Ignition grants. Grants start at $3,000 for simple applications.
Start building > https://aleo.org/grants/
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
Story(line) Visualizations
I’ve noticed there are many research papers regarding storyline visualizations. Storyline visualizations all stem from this sketch of Randall Munroe’s movie narrative charts…Various researchers have incrementally pushed this idea forwards, algorithmically generating the chart, optimizing the layout of the chart, etc. Here’s a few…But why storylines? What’s the point of adopting Randall’s sketch and automating it? What’s the storyline do well, and why use it?…
Challenges in evaluating AI systems
What many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations…In this post, we discuss some of the challenges that we have encountered in developing AI evaluations. In order of less-challenging to more-challenging, we discuss:
Multiple choice evaluations
Third-party evaluation frameworks like BIG-bench and HELM
Using crowdworkers to measure how helpful or harmful our models are
Using domain experts to red team for national security-relevant threats
Using generative AI to develop evaluations for generative AI
Working with a non-profit organization to audit our models for dangerous capabilities
We conclude with a few policy recommendations that can help address these challenges…
torch2jax - Run PyTorch in JAX 🤝
Run PyTorch in JAX…Mix-and-match PyTorch and JAX code with seamless, end-to-end autodiff, use JAX classics like jit, grad, and vmap on PyTorch code, and run PyTorch models on TPUs…torch2jax uses abstract interpretation (aka tracing) to move JAX values through PyTorch code. As a result, you get a JAX-native computation graph that follows exactly your PyTorch code, down to the last epsilon…
Machine UnLearning for Harry Potter
This paper, “Who's Harry Potter? Approximate Unlearning in LLMs” shared an interesting idea. The goal is to have an LLM “unlearn” some of its knowledge. In the case of the paper they’re interested in removing knowledge from the popular Harry Potter books. The paper shares two techniques: Technique 1: Reverse Finetuning and Technique 2: Reverse Anchoring…
A Beginner's Guide to Sequence Analytics in SQL
Sequences: they’re all around us. More specifically, they’re in your data warehouse with timestamps, payloads, and mysterious columns from Segment. Many of the real, course-changing insights that data teams dream of are hidden deep inside these elusive event streams. This post will help you find them, using your favorite neighborhood query language. For the purposes of this journey, imagine you’re a Data Scientist at Netflix. You and your team want to better understand what your watch funnel looks like – what does a successful session entail? – as well as understand how users interact with important features like search…
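The session analysis the post describes can be sketched with SQL window functions; here is a minimal, self-contained version using Python's built-in sqlite3 (the events table, its columns, and the 30-minute session gap are all hypothetical, not taken from the post):

```python
# Sketch: sessionizing an event stream with SQL window functions.
# Table name, columns, and the 1800 s gap threshold are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, event TEXT, ts INTEGER);
INSERT INTO events VALUES
  ('u1', 'search', 0),    ('u1', 'play', 40),
  ('u1', 'search', 2500), ('u1', 'play', 2530),
  ('u2', 'search', 10);
""")

# A new session starts when more than 1800 s pass since the user's
# previous event; a running sum of those breakpoints numbers the sessions.
rows = conn.execute("""
WITH gaps AS (
  SELECT user_id, event, ts,
         CASE WHEN ts - LAG(ts) OVER
                (PARTITION BY user_id ORDER BY ts) > 1800
              THEN 1 ELSE 0 END AS new_session
  FROM events
),
sessions AS (
  SELECT user_id, ts,
         SUM(new_session) OVER
           (PARTITION BY user_id ORDER BY ts) AS session_id
  FROM gaps
)
SELECT user_id, session_id, COUNT(*) AS n_events
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
""").fetchall()

for row in rows:
    print(row)  # e.g. ('u1', 0, 2): user u1's first session had 2 events
```

The LAG/SUM-over-partition pattern is the standard trick for turning a flat event log into sessions without a self-join; it needs SQLite 3.25+ (bundled with any recent Python).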
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
A 166-page report from Microsoft qualitatively exploring GPT-4V capabilities and usage. Describes visual+text prompting techniques, few-shot learning, reasoning, etc…
Worst Data Engineering Mistake you've seen? [Reddit]
I started work at a company that just got Databricks and did not understand how it worked. So, they set everything to run on their private clusters with all-purpose compute (3x the price) with auto-terminate turned off because they were OK with things running over the weekend. Finance made them stop using Databricks after two months lol…I'm sure people have f*cked up worse. What is the worst you’ve experienced?…
Computational Power and AI
In this article we answer the following questions: What is compute and why does it matter? How is the demand for compute shaping AI development? What kind of hardware is involved? What are the components of compute hardware? What does the supply chain for AI hardware look like? What does the market for data centers look like? How can demand for compute be addressed? How are governments responding? What are the policy implications?…
Trials of developing OPT-175B (Episode 77 of the Stanford MLSys Seminar “Foundation Models Limited Series” with Susan Zhang) [Video]
LLM development at scale is an extraordinarily resource-intensive process, requiring compute resources that many do not have access to. The experimentation process will also appear rather haphazard in comparison, given limited compute-time to fully ablate all architectural / hyper-parameter choices…In this talk, we will walk through the development lifecycle of OPT-175B, covering infrastructure and training convergence challenges faced at scale, along with methods of addressing these issues going forward…
ONE book to recommend to social science phds who want to build probability and stats skills? [Twitter/X discussion]
You get to recommend ONE book to social science PhDs who want to build probability and stats skills. What is it?…
Natural language processing for African languages
In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts…
What size is that correlation?
Suppose you find a correlation of 0.36. How would you characterize it? I posed this question to the stalwart few still floating on the wreckage of Twitter, and here are the responses…Whether a correlation is big or small, important or not, and useful or not, depends on the context, of course. But to be more specific, it depends on whether you are trying to predict, explain, or decide. And what you report should follow…
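One way to make the post's predict/explain/decide distinction concrete: for prediction, the relevant quantity is r squared, so a correlation of 0.36 explains only about 13% of the variance. A quick back-of-the-envelope check (the numbers below are just this arithmetic, not from the post):

```python
# An r of 0.36 sounds moderate, but for prediction the relevant
# quantity is r**2: the share of variance in y explained by x.
r = 0.36
print(f"variance explained: {r**2:.1%}")  # 13.0%

# Equivalently, the RMSE of the best linear prediction shrinks only by
# a factor of sqrt(1 - r**2) relative to simply predicting the mean.
shrink = (1 - r**2) ** 0.5
print(f"RMSE reduction factor: {shrink:.3f}")  # 0.933, i.e. ~6.7% better
```

The same r can thus be "useful" for explanation (a real, nonzero relationship) while being nearly worthless for individual-level prediction, which is exactly the context-dependence the post argues for.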
Jobs
Data Science Intern:
Performance Control & Digitalization
More than 90% of automotive innovations are based on electronics and software.
We, the BMW Group, offer you an interesting and varied internship in data science for Performance Control & Digitalization. To take our operations to the next level, the BMW Group – Performance Control & Digitalization department is looking for a Data science intern to contribute to the Supply Chain Innovations Think Tank of BMW Group and continue BMW’s leadership in supply chain management. The goal of the team will be to research emerging technologies including Data Science (ML, AI, BI etc.).
Location is Munich. Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
TorchGeo is a PyTorch domain library, similar to torchvision, providing datasets, samplers, transforms, and pre-trained models specific to geospatial data. The goal of this library is to make it simple:
for machine learning experts to work with geospatial data, and
for remote sensing experts to explore machine learning solutions…
Annotated Forest Plots using ggplot2
This post contains a short R code walkthrough to make annotated forest plots like the one shown above. There are packages to make plots like these such as forester, forestplot, and ggforestplot, but sometimes I still prefer to make my own. The big picture of this is that we’ll be making three separate ggplot2 objects and putting them together with patchwork. You could also use packages like cowplot, gridarrange or ggarrange to put the intermediate plot objects together. You can skip to the end to see the full code…
The medium is the message: R programmers as content creators
I had a blast speaking at [at]cascadiarconf about how #RStats users are content creators! Making [at]quarto_pub slides was a joy, as usual. The recording is lost to time but I have a (mostly) verbatim script 😀…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #514 here.
Cutting Room Floor
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Whenever you're ready, 2 ways we can help you:
Get A Data Science Job Course: A comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio, and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
Was this edition helpful to your job? Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.
Rather than two newsletters per week, can you split them into LLM newsletters and a non LLM email? Or some other segmentation. Either way, I appreciate the work aggregating the links