Data Science Weekly - Issue 527
Curated news, articles and jobs related to Data Science, AI, & Machine Learning
Issue #527
December 28, 2023
Hello!
Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
If this newsletter is helpful to your job, please become a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
And now…let's dive into some interesting links from this week.
Editor's Picks
Freakonomics’ Steven Levitt On The Secret To Making Tough Choices
On this special episode, we sat down with Levitt during the inaugural UCPN Podcast Festival to talk about the legacy of Freakonomics. Almost 20 years later, he told our audience how he views himself as a “data scientist” and not just an economist, what he’s learned about using a coin flip to make hard decisions in life, and why he thinks he may have found the “holy grail” of solving crime…
2023 in AI - A roundup of what happened in AI this year
In episode 104 of The Gradient Podcast, Daniel Bashir speaks to Nathan Benaich. Nathan is Founder and General Partner at Air Street Capital, a VC firm focused on investing in AI-first technology and life sciences companies. Nathan runs a number of communities focused on AI, including the Research and Applied AI Summit, leads Spinout.fyi to improve the creation of university spinouts, and co-authors the State of AI Report…
Anti-hype LLM reading list
Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought…
A Message from this week's Sponsor:
Is your A/B testing system reliable?
An A/B testing stack has seven core components, and if they’re not all working properly, your company may not be making the right decisions: teams aren’t shipping features that actually help customers, the org is leaving money on the table, and you’re likely doing lots of manual work to validate experiment results.
Now imagine using a reliable experimentation system like the ones developed by Airbnb, Netflix, and many other experimentation-driven companies. You’d be able to:
Work with the most accurate and up-to-date metrics, completely in your own data warehouse
Leverage business metrics (revenue, activations, and more) over shallow metrics (clicks, signups, etc.) easily and quickly
Save time in setting up, running, and analyzing experiments
Help your non-data teams self-serve experiment setup and analysis
Most successful tech companies rely on the same seven building blocks of an experimentation platform, and we put together a white paper that walks through these core components and the opportunities and failure modes they each present.
Download the white paper to see if you have all seven, and if you don't, what you could be missing.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Science Articles & Videos
I formally modeled Dreidel for no good reason
Traditionally this game is played by kids, as it's an excellent way to convince them that gambling is boring as all hell. In 2015 Ben Blatt simulated the game and found that for P=4 and N=10, the average game takes 860 spins to finish. But this was based on a mere 50,000 simulated games! We can do better than a mere simulation: we can mathematically model the game and get exact numbers. So we're busting out PRISM, a probabilistic model checker…
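If you'd rather poke at this yourself before diving into PRISM, here's a plain-Python Monte Carlo sketch in the spirit of Blatt's original simulation. The ante rule and the "round up on Hey" payout are assumptions — house rules vary — so the mean it prints will only roughly track the 860-spin figure.

```python
import random

def play_dreidel(players=4, tokens_each=10, max_spins=1_000_000):
    """Simulate one dreidel game; return how many spins it took.

    Assumed house rules (variants differ): everyone still in antes 1 when
    the pot is empty; Nun = nothing, Gimel = take the pot, Hey = take half
    (rounded up), Shin = pay 1 in; you're out when you have no tokens; the
    game ends when only one player still has tokens.
    """
    tokens = [tokens_each] * players
    pot, spins, turn = 0, 0, 0
    while sum(t > 0 for t in tokens) > 1 and spins < max_spins:
        if pot == 0:                       # refill the pot with an ante round
            for i in range(players):
                if tokens[i] > 0:
                    tokens[i] -= 1
                    pot += 1
        if tokens[turn] == 0:              # skip players who are out
            turn = (turn + 1) % players
            continue
        spins += 1
        face = random.choice("NGHS")
        if face == "G":                    # gimel: take the whole pot
            tokens[turn], pot = tokens[turn] + pot, 0
        elif face == "H":                  # hey: take half, rounded up
            take = (pot + 1) // 2
            tokens[turn] += take
            pot -= take
        elif face == "S":                  # shin: pay one into the pot
            tokens[turn] -= 1
            pot += 1
        turn = (turn + 1) % players
    return spins

games = [play_dreidel() for _ in range(2_000)]
print(sum(games) / len(games))             # rough mean spins per game
```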
Interconnects year in review: 2023
In 2023, Interconnects (the newsletter) has grown by more than 10x in subscribers (8k+ now) and even more in viewership (crossed 100k in a month a few times), so statistically, most of you didn't get to read all the posts. This post is a good entry point if you want to review any of the year's core themes as you prepare for more of the same craziness in 2024…There were 55 posts this year (and 1 interview), 5 of which were paid. The core topics, which I detail below, were:
RLHF capabilities and understanding (8 posts)
Open LLM ecosystem progress (7 posts)
LLM techniques (6 posts)
Model releases (5 posts)
Moats (3 posts), state of ML industry (3 posts), and preference/reward models (2 posts)
Other AI topics (the remainder)
This post is a great tool for those looking to go deeper…
Is everyone in data science a mathematician? [Reddit]
I come from a computer science background, and I was discussing this with a friend who comes from a math background. He told me that if a person doesn’t know why we use KL divergence instead of other divergence metrics, or why we divide by the square root of d in the softmax in the attention paper, we shouldn’t hire them. I didn’t know the answer, fell into an existential crisis, and have had a kind of imposter syndrome ever since. We are also working together on a project, so now I question everything I do. Wanted to know your thoughts on that…
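For what it's worth, the sqrt(d) question has a concrete, checkable answer: if query and key components have unit variance, their dot product over d dimensions has variance d, so unscaled logits grow like sqrt(d) and the softmax saturates into a near-one-hot vector with vanishing gradients. Dividing by sqrt(d) keeps logit variance near 1. A quick NumPy check (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Dot products of unit-variance random queries and keys have std ~ sqrt(d).
q = rng.standard_normal((10_000, d))
k = rng.standard_normal((10_000, d))
scores = (q * k).sum(axis=1)
print(scores.std(), np.sqrt(d))            # both ~22.6
print((scores / np.sqrt(d)).std())         # ~1.0 after scaling

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = rng.standard_normal(8) * np.sqrt(d)   # magnitude of unscaled scores
print(softmax(logits).round(3))                # nearly one-hot -> tiny gradients
print(softmax(logits / np.sqrt(d)).round(3))   # much smoother attention weights
```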
Neural Spline Fields for Burst Image Fusion and Layer Separation
In this work, we use burst image stacks for layer separation. We represent a burst of images with a two-layer alpha-composited image plus flow model constructed with neural spline fields – networks trained to map input coordinates to spline control points. We directly input an unstabilized burst of full-resolution 12-megapixel RAW images into our model with no post-processing. Through test-time optimization it jointly fuses these frames into a high-fidelity reconstruction while, with the help of parallax from natural hand tremor, separating the scene into transmission and obstruction layers. By discarding this obstruction layer, we demonstrate how we can remove occlusions, suppress reflections, and even erase photographer-cast shadows, outperforming learned single-image and multi-view obstruction removal methods…
Understanding the Data Quality Maturity Curve: What Does Your Data Quality Really Need?
In this piece, we examine the Data Quality Maturity Curve—a representation of how data quality works itself out at different stages of your organizational and analytical maturity—to offer some experienced perspective on where you should be at each point in your data quality journey…Before we can really understand the tooling equation, it’s important to know what we’re solving for. So, what is data quality exactly? Data quality is defined by how accurate, reliable, complete, discoverable, trustworthy, and actionable a specific dataset is for a given use case…
How do you explain, to a non-programmer, why it's hard to replace programmers with AI? [Reddit]
To me it seems that AI is best at creative writing and absolutely dogshit at programming; it can't even get complex-enough SQL right no matter how much you try to correct it and feed it output, let alone production code. And since it's all just probability, this isn't something I see being fixed in the near future. So from my perspective, the last job to be replaced will be programming. But for some reason popular media has convinced everyone that programming is a dead profession that is currently being given away to robots…
Mindstorms in Natural Language-Based Societies of Mind
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of mind consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving…
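The paper's framing is architectural, so a toy sketch may help fix the idea: a "society" is just a set of agents that read a shared natural-language transcript and append replies. The lambda agents below are hypothetical stand-ins for LLM or expert-model calls, not the paper's actual implementation.

```python
def mindstorm(agents, task, rounds=3):
    """Run a toy 'mindstorm': every agent sees the shared transcript and replies.

    `agents` maps an agent name to any callable from transcript text to a
    reply string -- in a real NLSOM these would wrap LLMs or other experts.
    """
    transcript = [f"TASK: {task}"]
    for _ in range(rounds):
        for name, agent in agents.items():
            reply = agent("\n".join(transcript))
            transcript.append(f"{name}: {reply}")
    return transcript

# Adding an agent is just adding a dict entry -- the modularity the paper highlights.
demo = {
    "optimist": lambda t: "One idea: " + t.splitlines()[0].lower(),
    "critic": lambda t: f"Weakest point so far: line {len(t.splitlines())}.",
}
print("\n".join(mindstorm(demo, "Caption this image of a dog on a skateboard.")))
```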
Do OpenAI executives understand the math? [Reddit]
Question: How well do OpenAI executives understand the math? Sam Altman dropped out of Stanford after 2 years as a CS major to run Loopt (Wikipedia). Greg Brockman dropped out of Harvard first, and then out of MIT, to join Stripe (Wikipedia). Mira Murati has a BS in mechanical engineering from a low-ranking university (not on the same level as Stanford or MIT). Ilya Sutskever got a PhD under Geoffrey Hinton; he's clearly a scientist. Now, Sam and Greg are obviously smart, since they got into top colleges. But do they understand the math behind AI? Or do they just focus on the business side and delegate the math to Ilya?…
I, Andrej Karpathy, love reading technology prediction documents because the benefit of hindsight is training data for the future prediction task. Here, 64 years ago, Licklider imagines computing as a fundamentally intelligence-amplification tool…
Custom colour scales for {ggplot2} - R-Ladies Cambridge
There are many colour palette packages in R that are compatible with {ggplot2}, but sometimes you might want to use your own choice of colours. For example, you might want them to match company or university branding guidelines. In this talk, I'll show you how to make your own custom colour scale functions to add your own colours to {ggplot2} graphics – no more copying and pasting `scale_colour_manual()`!…
How do you version control a dataset?
I’ve got a dataset that I want to watch changes to. It’s got 4 numeric columns and 3 categorical columns. The 3 categories in combination designate an entity, and the 4 numeric values are measurements of the entity. I want to version control the entire thing such that I can see every single change that ever occurs, like with Git. Are logs the best way to do this, or is there a more mature way?…
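Mature options here include data-versioning tools such as DVC, which store dataset snapshots alongside Git, but for a table this small a diffable snapshot log is easy to roll by hand. A minimal pandas sketch, with hypothetical column names standing in for the 3 categoricals and 4 measurements:

```python
import pandas as pd

KEY = ["site", "device", "batch"]      # the 3 categoricals: together, one entity
VALS = ["v1", "v2", "v3", "v4"]        # the 4 numeric measurements

def diff_snapshots(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """One row per entity whose measurements changed between two snapshots.

    Entities present in only one snapshot show up too (the other side is NaN).
    """
    merged = old.set_index(KEY).join(new.set_index(KEY),
                                     lsuffix="_old", rsuffix="_new", how="outer")
    old_vals = merged[[f"{c}_old" for c in VALS]].to_numpy()
    new_vals = merged[[f"{c}_new" for c in VALS]].to_numpy()
    changed = (old_vals != new_vals).any(axis=1)  # NaN != NaN, so adds/removes count
    return merged[changed]

# Commit each day's snapshot as a CSV in Git, then diff consecutive snapshots
# to reconstruct a complete, replayable change history.
```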
What kind of research can you do if you are GPU poor?
My college doesn't have many compute resources. What kind of work can I do in ML?…
Jobs
Data Scientist – BCG X
We are BCG X.
BCG X is the tech build & design unit of BCG. Turbocharging BCG’s deep industry and functional expertise, BCG X brings together advanced tech knowledge and ambitious entrepreneurship to help organizations enable innovation at scale. With nearly 3,000 technologists, scientists, programmers, engineers, and human-centered designers located across 80+ cities, BCG X builds and designs platforms and software to address the world’s most important challenges and opportunities.
Our BCG X teams own the full analytics value chain end to end: framing new business challenges, designing innovative algorithms, implementing and deploying scalable solutions, and enabling colleagues and clients to fully embrace AI. Our product offerings range from fully custom builds to industry-specific, leading-edge AI software solutions.
Our Data Scientists and Senior Data Scientists are part of a rapidly growing team that applies data science methods and analytics to real-world business situations across industries to drive significant business impact. You'll have the chance to partner with clients in a variety of BCG regions and industries, on key topics like climate change, enabling them to design, build, and deploy new and innovative solutions.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Sponsor
New ML Challenge – Build a decentralized credit score in Web3
Spectral, a new marketplace for verifiable machine intelligence that leverages zkML to ensure accuracy, verification, and IP protection for modelers, has launched its first-ever model-building challenge for data scientists: help address societal issues by producing high-performing, open-source ML models. The models built in this challenge will have massive implications for the crypto industry as we know it. A $100k bounty is on the line, as well as an 85% revenue share for the models built. Engineers can sign up now, and more challenges are on the way for early 2024.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
Why can't I transform a distribution by deducting one from all counts?
Suppose I have records of the number of fish each fisherman caught from a particular lake within the year. The distribution peaks at count = 1 (i.e., most fishermen caught just one fish from the lake in the year), tapers off after that, and has a long right tail (a very small number of fishermen caught over 100 fish).
Such data could plausibly fit either a Poisson distribution or a negative binomial distribution. However, both of these distributions have a non-zero probability at count = 0, whereas in our data, fishermen who caught no fish were not captured as data points.
Why is it not correct to transform our original data by just deducting 1 from all counts, and therefore shifting our distribution to the left by 1 such that there is now a non-zero probability at count = 0?…
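The short answer: the data is zero-truncated, not shifted. If X ~ Poisson(λ), then X − 1 has mean λ − 1 but variance λ, and no Poisson distribution has mean ≠ variance, so subtracting one changes the distribution family rather than recovering it. The standard fix is to fit a zero-truncated model; for the truncated Poisson, E[X | X > 0] = λ / (1 − e^(−λ)), which you can invert numerically. A quick check:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
lam = 1.2

# Simulate the observed data: Poisson catches, but zero-catch fishermen unrecorded.
x = rng.poisson(lam, 200_000)
x = x[x > 0]

shifted = x - 1.0
print(shifted.mean(), shifted.var())    # mean != variance, so not Poisson

# Fit the zero-truncated Poisson instead: solve E[X | X>0] = lam / (1 - exp(-lam)).
m = x.mean()
lam_hat = brentq(lambda l: l / (1 - np.exp(-l)) - m, 1e-9, 50.0)
print(lam_hat)                           # ~1.2, the true rate, recovered
```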
Deep Learning Framework From Scratch
I built this documented and unit-tested educational deep learning framework, using only NumPy…
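In the same spirit, the core trick such frameworks are built on fits in a few lines: each layer caches its input on the forward pass and returns input gradients on the backward pass. A minimal sketch of one dense layer (our own illustration, not code from the linked repo):

```python
import numpy as np

class Dense:
    """A fully connected layer with manual forward/backward passes and SGD."""
    def __init__(self, n_in, n_out, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_out)) * np.sqrt(2 / n_in)
        self.b = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        grad_W = self.x.T @ grad_out     # dL/dW
        grad_b = grad_out.sum(axis=0)    # dL/db
        grad_in = grad_out @ self.W.T    # dL/dx, passed to the previous layer
        self.W -= self.lr * grad_W       # plain SGD update
        self.b -= self.lr * grad_b
        return grad_in

layer = Dense(3, 2)
y = layer.forward(np.ones((4, 3)))
layer.backward(np.ones((4, 2)))          # one SGD step from upstream gradients
```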
Probability reference book for data science professionals
I want to rehash my understanding of the fundamentals of probability theory. I'm trying to find a probability book that meets the following criteria:
- Motivates the different concepts and does not just show definition-proof-definition-proof
- Is complete, covering all the fundamentals of probability
- Is not overtly pro- or anti-Bayesian
- Is suitable for a data science professional with a mathematics undergrad who hasn't studied in several years
- Is suitable for self-study (problem solutions must be available in the book or online)
I skimmed Probability by Shiryaev, but it was a bit too definition-proof for my liking and too abstract. I am looking for a dedicated probability book and will then move on to a dedicated statistics reference book…
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's issue #526 here.
Cutting Room Floor
Whenever you're ready, 3 ways we can help:
Looking to hire? Hit reply to this email and let us know.
Looking to get a job? Check out our “Get A Data Science Job” Course
A comprehensive course that teaches you everything related to getting a data science job, based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to put together a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.
Promote yourself to ~60,000 subscribers by sponsoring this newsletter.
Thank you for joining us this week! :)
All our best,
Hannah & Sebastian
P.S. Was today’s newsletter helpful to your job?
Consider becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.