William Huster

Infinite UI: What's Really Different about GenAI Software

Sun, 07 Sep 2025 12:00:00 +0000

Originally published on research.thinknimble.com. This post explores how GenAI fundamentally changes software design by creating massive, explorable state spaces.

People keep saying the paradigm shift with GenAI is about determinism versus nondeterminism, and I’ll admit that it bothers me. I think they’re pointing at the right thing but using the wrong words. The real difference is the sheer size of the state space of the user interface.

In case the term is unfamiliar, engineers think a lot about “state spaces.” That is, the set of all possible configurations a system can be in. It might sound technical, but the concept is simple. If traditional software is a small town with a grid of streets you can memorize, then GenAI software is a sprawling metropolis where you can’t possibly walk every street before new neighborhoods emerge. They are huge virtual spaces.

The hard part of building user interfaces has always been juggling all the possible states that the UI could end up in. As you develop a UI and add features, the state space undergoes a combinatorial explosion.¹ This leads to all sorts of weird and unpredictable bugs.

I experienced this recently on a chess app I’ve been building for a founder. I added two new features: board flipping and an autoplay toggle. These new features interacted with almost every existing feature. Adding board flipping multiplied the state space by two. Adding the autoplay toggle again multiplied the state space by two. Let’s say the number of possible states before those features was 24. After these features it is 24 × 4 = 96 possible states. Our test plan just got a lot longer.

But 96 is still a manageable number. We can reasonably enumerate and test all 96 cases. It’s possible to walk through the code and examine the connection points. It’s even possible to choose to ignore or block certain connections to deliberately prune the possible states. It’s an explicitly defined and bounded problem.

By contrast, Gian Segato of Anthropic describes GenAI as a function with practically infinite inputs and outputs:

F(∞) -> {∞, ∞, ∞}

The number of possible states you get out of the box with a GenAI is practically infinite. I say “practically” because LLMs have a finite number of parameters and tokens, so there does exist an astronomically large finite number that represents the size of the state space. But it’s such a large number that we cannot hope to manually or even automatically test all of them before the end of this universe or the next. So, fine, it’s “infinite.”

Unpredictable is not the same as random

It is fair to say that unpredictability is a very real consequence of this new paradigm. Imagine this state space as a giant haunted mansion: that is now your product. The users of your product will inevitably find a candelabra in the study that makes a bookshelf spin around, revealing a secret corkscrew slide to a Frankenstein lab, and there they will make monsters.

But people are conflating unpredictability with randomness. A temperature=0² (deterministic) model still has the “haunted mansion” property, but it has zero randomness. Even so, the space is so large and sensitive to input that the results will feel random. Users can find a Frankenstein lab candelabra. This unpredictability does not come down to dice rolls. It originates from the sheer impossibility of comprehensively mapping the space beforehand so you can find and uninstall the problematic candle holders.

I get that unpredictability is closely associated with randomness, but just because you can’t predict something does not mean its cause was random.

Gian Segato recently wrote that GenAI is “ontologically different… moving away from deterministic mechanicism, a world of perfect information and perfect knowledge.” I don’t think the difference is entirely ontological or that it hinges on determinism. But I agree with the second half of his sentence. The difference is epistemological - how do we know what we know about the spaces within the LLM and by extension our applications? Whether the space is emergent or randomly generated, this should be the central question.

And as I wrote in my AI Onion essay, the issue for founders and product designers is epistemological because we cannot know all the nooks and crannies of our own product. We have to build from the ground up, carefully probing at every layer of our AI solution. To try to know our own GenAI product is now a science like studying the world - well, a fun-house mirror image of the world.

So people are using “nondeterminism” as a shorthand for this large state space phenomenon. My point is that this user interface unpredictability is not a direct result of the stochasticity in the models (though that complicates things). It is rather the result of the huge unmapped state space of the LLM, this massive, multi-billion parameter function that is directly handed to users to execute.

The inversion

The result of having an infinite function with an incomprehensible state space is that we have to remove features and put up guardrails to build a useful app. This chipping away, carving, and molding is a process we apply to the massive state space to restrict the pathing through it so the AI works more predictably and doesn’t run amok, wasting users’ time (or worse).

Compare that to how we worked previously. We actually had to add features to build paths. Now we get features “for free” and must remove them.

The old paradigm of software design was a “railroad experience,” whereas AI presents a “sandbox experience.” Traditional software was an elevator; GenAI software is a Wonkavator, an elevator that moves in all directions. Traditional software had limited dimensionality; GenAI software has unquantifiable dimensionality.

This is why security is such a big problem with AI. It’s like you’ve created a REST API with a billion billion billion endpoints, and now after the fact you have to figure out how to test and manage permissions on all of them.

Constraining without nerfing

Segato makes a useful observation about the tension between control and capability: “The more you try to control the model, the more you’ll nerf³ it… intelligence and control start becoming opposing needs.”

We experienced this when prompting our GoPursue coach agent. Our early attempts put in too many guardrails and nerfed the agent. But I think we can develop techniques that effectively limit the user’s range of motion in the state space and guard against specific actions without hobbling the agent.

Role-playing is one technique that works well. Simply telling an agent something like “you are a cancer research assistant” is extremely effective. It immediately snaps its responses into a more clinical space while preserving its capacity for reasoning, tool use, and so on.

There are other approaches too. Marcy Ewald on our team has been thinking about constraints as a “spotlight” rather than guardrails, which means showing the AI which path has been walked frequently and letting it tell you what it thinks about those paths. This suggests that not only can things be blocked or chipped away in GenAI systems, they can also be intensified, and we should look at that as an equally useful tool.

Conclusion

I very much support Segato’s conclusion that an empirical, scientific approach is necessary. Because this is fundamentally an epistemological problem, the state space is too large to reason about deductively. You have to explore it like actual territory, just like the real world.

Organizations that build using an empirical approach will succeed in this new era of technology. So if you’re building AI applications, stop worrying about nondeterminism and randomness. Focus instead on carving useful paths through the massive state space LLMs hand us. That’s the real challenge and opportunity.

This essay prompted an engaging team discussion about the implications of infinite state spaces for product design and the future of SaaS applications.

Acknowledgments: Thanks to Marcy Ewald and Neil Shah for reviewing drafts of this essay and contributing their thoughts and feedback.

Combinatorial explosion refers to the rapid growth in complexity that occurs when combining multiple elements, where the number of possible combinations grows exponentially rather than linearly. ↩
Temperature is a parameter in language models that controls randomness in output generation. Temperature=0 makes the model deterministic, always selecting the most likely next token, while higher values introduce more randomness and creativity. ↩
In gaming and software contexts, to “nerf” means to weaken or reduce the effectiveness of something, typically through updates or changes that make it less powerful than it was before. ↩

The AI Onion: A Layered Approach to AI Integration

Mon, 01 Sep 2025 12:00:00 +0000

Originally published on research.thinknimble.com. This framework helps founders and product teams sequence AI integration decisions from simple to complex.

An Inspiration

My mom came from a family of sharecroppers in Southern Maryland. She grew up in a tiny house with seven siblings who all spent long, hot summers working in tobacco fields. Harvesting tobacco is backbreaking labor: you bend all the way to the ground, grab the tobacco by the stalk, and slice it off with a sharp little ax. She loved her family, but was determined to get out of there and find a better life for herself.

She was the first of her family to go to college - the first in 300 years of tobacco farming in America. She wanted to major in math, and she recounted often her high school math teacher who told her “girls can’t do math.” Well, that motivated her all the more. She graduated college in three years and went on to complete a Masters in biostatistics at Johns Hopkins.

After being told 'girls can't do math,' my mom paid her way through college working with the Navy, graduated in three years, and earned a Master's in biostatistics at Johns Hopkins before dedicating her career to public health.

I will take any opportunity to brag about my mom, and I promise this connects to AI, so bear with me!

A biostatistician is essentially a data scientist for medicine. At work in a research hospital, my mom designed clinical trials and analyzed the data. The mission was to determine whether treatments under trial actually worked to improve patient outcomes. Her job was epistemological: how do we know a treatment is working? How do we know what we know?

Growing up, I absorbed this way of thinking. My mom would point to a chart in USA Today and explain why it was misleading. “This sample size is too small to tell you anything significant!” I came to believe there’s a big divide between people who think statistically and those who don’t. Statistical thinking is about being rigorous with evidence, understanding confidence intervals, knowing when you can trust a claim and when you can’t.

My mom passed away two years ago from triple negative breast cancer. She fought for three and a half years. And in the end she “lost” that fight. But I prefer to look at it like the late, great comedian Norm Macdonald: you can’t really lose to cancer: when you die, the cancer dies too. That’s a draw.

So, no, she never truly lost. She fought until the end by doing her own research. Every day she was busy looking into clinical trials and experimental therapies. Her background in biostatistics informed and emboldened her. She knew how to read the literature, how to assess whether a study was well-designed, how to weigh the evidence.

She declined and died just as ChatGPT was gaining traction.

She would have been fascinated by what generative AI can do and its statistical underpinnings. She would have used these tools to dig deeper into the research. And who knows, maybe there actually was a treatment path for her that AI could have helped her discover, had it been available then as it is now.

At its core, artificial intelligence is statistics. The statistical foundations my mom mastered in biostatistics are the same ones powering today's AI breakthroughs.

That question is what drew me to partner with a world-class oncologist on an AI cancer research assistant we are developing together. Both biostatistics and modern AI are tools for expanding what we can know from large amounts of data. They share the same mission my mom dedicated her life to: how do we know this treatment works for this patient?

Often I think: Could AI actually cure cancer? What a huge question. I don’t know. Maybe not today, maybe not in five years. But based on what I’m seeing, I’m beginning to think it could. Even if that’s too ambitious, what AI can do right now is help an oncologist cure cancer for an individual patient - for someone with an amazing story like my mom, who deserves a little more time.

For this reason, I want to help others understand the potential of AI. To that end, I’ve developed a framework I want to share with you: the AI Onion.

The AI Onion

Founders often come to me saying “I want to build an AI agent” or “I need to fine-tune a model so I have solid IP to show investors.” They are starting from the outside of the problem, the most complex and most expensive layers, without having thought through the fundamentals.

This is backwards. Here’s the thing: technology is not your business. It’s your organization, your relationship with the people you serve, your ability to create real-world outcomes like saving patients’ lives. That’s where the value is. Technology just speeds things up and amplifies what’s already there.

So when a founder comes to me wanting to build an AI agent, I want to know: what outcome are you trying to drive? What problem are you solving for real people? For instance, founders don’t usually know to ask for evals (which we’ll talk about in a moment). And, sure, we could fine-tune a model on a dataset that’s too small or nonsensical, but we have too much integrity to just do what we’re asked without questioning whether it makes sense.

For the cancer research assistant, it’s all about improving patient outcomes. Over the long run, we’ll find ways to measure that at scale. In the near term, we use experts to verify that the outputs are reasonable, correct, and providing value by performing time-consuming research that reliably unveils hidden gems of insight.

Any complex system that works is built up from simpler systems that work. You have to start from the core and work outward, adding complexity only when you’ve hit the limits of the current layer.

I think of it like a chef planning the courses of a meal: what should come out when, how should the flavors be composed and timed to maximize impact. Hence the AI Onion.

The AI Onion is the framework I developed to sequence these conversations. It’s a mental checklist I use when working through a new founder’s problem, starting from the inside out. Some founders are savvy and have already resolved the inner layers. Others need more hand-holding. But everyone benefits from making sure the foundations are solid before building upward.

The goal is minimum investment for maximum impact. Start at the core, verify it works, and only add outer layers as needed.

The AI Onion Framework: Eight layers from inside out - (1) Foundation Model, (2) Chat Interface, (3) Prompting, (4) Dataset Integration, (5) Evaluations, (6) Tool Use, (7) Agent Flows, (8) Fine-Tuning.

Layers 1-2: Foundation Model and Chat Interface

At the heart of the onion is the large language model, a machine learning model trained on what people like to say is “all the text on the internet.” You won’t build your own; this costs millions of dollars. Instead, you’ll buy it off the shelf from a frontier AI Lab like OpenAI, Anthropic, Google, or Meta. These are the smartest and most versatile models available, and are therefore the best starting point for any new AI project. That’s why we call these foundation models.

The next layer is the chat interface, the back-and-forth interaction pattern that made ChatGPT explode into mainstream consciousness. You won’t build this from scratch either.

But these layers still matter for decision-making. Different foundation models have different personalities, capabilities, and cost structures. ChatGPT is often the default, but some projects have requirements that suggest an open-source model or a secure environment like AWS Bedrock. There is a lot to talk about here, and we have various frameworks to support the decision process.

As for chat, while most AI applications look like “chatbots” to end users, many wrap the LLM in a service that takes advantage of chat functionality “behind the scenes.” They use system prompts and user/agent exchanges without exposing a chat interface at all. These solutions don’t look at all like a chatbot, but they’re still “AI tools.”

For the cancer research assistant, we started with ChatGPT as our foundation model and have since added Gemini for large context analysis. We are trying to build such that we are not locked into the same LLM over the long run. We have a hybrid conversational interface for exploratory research and a behind-the-scenes engine that does constant background research and generates structured patient reports. Clinicians receive real-time updates and insights about the latest research relevant to their patients. They don’t have to actively seek it out.

These are the first questions to resolve: Which model? What interaction pattern? Get these right before moving outward.

Layer 3: Prompting

This is where the real work begins.

LLMs know everything about everything. They’ve seen the whole internet. They can talk like Neil deGrasse Tyson about space, like Shakespeare, or like a pirate. What you need to do is focus them on what you care about.

The way this works is you provide the LLM with a system prompt. For the cancer research assistant, our system prompt starts: “You are a cancer research assistant.” That sentence alone focuses the LLM’s vast knowledge on cancer and establishes its role. The LLM is very good at stepping into a role like this. Then we define its goal: “You’re going to assess and score research articles based on their potential clinical relevance.” We specify the dimensions we care about: diagnostic, therapeutic, prognostic, mechanistic.

Some founders come to the table with prompts already prepared. Others need to understand how to write a good prompt and the implications of the LLM’s probabilistic nature: it might go in different directions each time, it might occasionally make things up.

Prompting may be all you need. Depending on your use case, a well-crafted prompt against an off-the-shelf LLM might get you 80% of the way there. LLMs are remarkably capable and knowledgeable. Before adding complexity, verify that prompting alone isn’t sufficient.

Layer 4: Dataset Integration

Sometimes prompting isn’t enough, and you need to bring proprietary data (or a specialized approach to public data) into the conversation.

This is where things heat up. Prompts add value, but specialized data is where you start building something that’s your own, something nobody else can exactly replicate.

At this layer, we fetch the data and hand it to the LLM. This is traditional software and data pipelines doing the work: a SQL query that grabs a patient record, an API call that retrieves relevant documents, a vector search that pulls context from unstructured assets in your knowledge base. We can inject that context into the prompt before the LLM ever sees it. The LLM doesn’t know where the information came from; it just sees enriched context.

For the cancer research assistant, this means patient context: their genetic profile, their diagnosis, their treatment history, combined with public literature via our proprietary prompting approach. The oncologist I’m working with on this project helps write some of the national standards for cancer treatment. He says this is necessary because clinicians were trained 10, 15, 20 years ago and they’re treating patients today. The research has changed dramatically, and they’re not up to date on the latest literature.

He likes to say that “cancer is not one disease, it’s a million diseases.” Every patient’s case is different. The knowledge to help them often exists somewhere in the research literature. The hard part is sifting through it all to find what’s relevant to this patient. That’s what AI is good for: crunching through vast amounts of data to map a particular case to the wealth of research that’s out there.

If they had a tool that could correlate a patient’s specific case with current research, they could very likely identify treatment paths today that they couldn’t have just a few years ago.

The key distinction: at Layer 4, we’re still in control of what data gets fetched and when. The LLM receives context; it doesn’t go looking for it.

Layer 5: Evaluations

At this point you have a foundation model, an interaction pattern, a prompting strategy, and integrated data. Only now can you have a productive conversation about evaluations.

Evals are a cross-cutting concern; in theory, you should be evaluating at every layer. But in practice, if a team hasn’t resolved the layers below, it’s not possible to build formal evaluations. You can’t define what “correct” looks like until you know what you’re building and why it’s valuable.

LLMs are probabilistic. Sometimes they give you the perfect answer. Sometimes they make things up. At the lower layers, you can get by on vibes: “This feels right. I’ve talked to it a lot and it seems 80% there.” In the early stages of our cancer research app, we asked a trained oncologist to check the vibe of our assistant. Even if it was not 100% correct, is it significantly better than random chance and heading in the right direction?

Evaluations are how you get from 80% to 95% to close to 100%. In mission-critical situations like cancer research, you want structured tests you can run automatically, with clear pass/fail criteria based on whether the output matches what it should be.

Evaluations are tests of whether the system’s behavior aligns with what you hoped it would do.

For the cancer research assistant, this means: Does it surface the most relevant articles? Does it extract and accurately interpret information from our database? Does it score research appropriately on the four dimensions that matter to us? Does it avoid hallucinating citations? We build test cases with known-good answers and run them against the system continuously as we add capabilities.

This is where rigor enters the picture. If you skipped here from Layer 1, you’d have nothing concrete to evaluate. And from here on out, maintaining and growing your set of evaluations is critical to ensuring that your AI solution works as intended.

Layer 6: Tool Use

At Layer 6, we begin to give the LLM tools so it can decide what to fetch for the user.

This is your first taste of agentic capabilities. You give the LLM access to functions it can call: search a database, hit an API, create a new patient record, retrieve a document. The LLM decides when to use these tools based on the conversation.

For the cancer research assistant, we gave it tools to search PubMed and a variety of other research datasets. For example, when a user asks “Please research the role of PTEN in triple negative breast cancer,” the LLM formulates its own search query, calls the PubMed tool, gets back hundreds of articles, reads through them, ranks and scores them, and presents the most relevant results. All this happens in about 15 seconds. And it’s incredible the kinds of connections the AI can make across a wide array of literature, almost instantly.

The oncologist I’m working with told me: “To do this three years ago, I had to open hundreds of browser tabs. I had to brute force the whole thing, spend hours pouring through literature.” Now it’s at his fingertips.

Tool use opens up significant capability, but it also introduces even more unpredictability. The LLM might call tools in unexpected ways or sequences. This is why Layer 5 (Evaluations) must be considered first. You need the infrastructure to test whether tool use is behaving correctly before you ship it.

Layer 7: Agent Flows

Layer 6 gave the LLM tools. Layer 7 chains those tools together into complex workflows, with the LLM reasoning through multi-step processes, and even coordinating with other agents.

This is where a single user input might set off a cascade: multiple prompts, many tool calls to different services, intermediate reasoning steps, even separate agents collaborating. At this layer, the system becomes greater than the sum of its parts.

The patient report feature is a good example. When I click “generate new report,” I don’t type anything into a chatbot. The system takes the patient’s genetic profile, sends it to the LLM, which then autonomously searches PubMed for each relevant genetic mutation. It reads and scores the articles, cross-references with active clinical trials, and assembles a structured report with citations. It just thinks independently, guided by prompts and workflows that mimic how a real oncologist would do their research. When it’s done, it comes back with a complete report.

The busy oncologist walking down the hallway to an appointment can consult this report and step into that room with a more informed recommendation than would have been possible before.

This is the current frontier of AI: delegating entire tasks and waiting for results. It’s powerful, but it’s also the most difficult to test and the most prone to unexpected behavior. Going back to the principle of building complex systems from simple systems, you can see why this must be an outer layer. The lower layers are the simpler components that have to be solid first: foundation model, prompts, data integration, evaluations, tool use. These are what the agent flow is composed of.

If a founder wants to jump straight to “I want an agent,” this is where I slow them down. Have we resolved the inner layers that will comprise it? Do we have evaluations in place? Agents are not magic. They’re orchestrations of simpler pieces.

A meta point here: when I first developed this framework in early 2025, agents were not as mature as they are now. Today, every off-the-shelf LLM has agentic capabilities built in: they can search the web, complete multiple reasoning steps, use tools. But if you’re building your own AI application, you need to understand the difference between the core LLM and the agentic capabilities layered on top. The onion helps you see those layers distinctly and think about how you can customize an agent for your own purposes by defining its tools, prompts, and workflows, rather than relying on the “one size fits all” solutions in the market.

Layer 8: Fine-Tuning

Finally, fine-tuning means training your own version of a foundation model. Not from scratch, but adapting an existing model with your own data to reshape its behavior for your specific use case.

I’ve had founders come to me wanting to fine-tune right away. “We need our own model for IP reasons.” I get it, and it’s a smart end goal, but nine times out of ten, they haven’t even thought through Layers 1-5. What foundation model will you fine-tune and why? Will you fine-tune for chat or something else? Fine-tuning requires a robust dataset of examples. Where does that dataset come from? How will you evaluate that your fine-tune was successful? From running the system, collecting outputs, evaluating them, refining them. You need most of the lower layers in place before you have the data and evaluation clarity to fine-tune meaningfully.

And to be honest, very few teams actually need this layer. Most get what they need from prompting through tool use. Fine-tuning is a tool in your toolbox, but it’s rarely the right first move.

The Point

The AI Onion exists because founders kept trying to peel the problem from the outside in. They jump to agents, fine-tuning, complex architectures before appreciating the components from the bottom up.

The framework is a sequence for working through the problem correctly. Which foundation model? What interaction pattern? Can prompting alone get us there? What data do we need to integrate? How will we evaluate success? Only then can we approach the thornier issues: should the LLM have tools? Should we build agent flows? Do we actually need to fine-tune?

For the cancer research assistant, we worked through each layer in order and we now have a very robust multi-agent architecture that we can rigorously evaluate. Each layer was added because the previous layer wasn’t sufficient, not because it seemed impressive. Because we built up this complex system from simple components, we know that it works, and we know how we know.

The doctor I’m working with is very excited about where this is going.

And I think about my mom. She would have loved this tool. She would have used it to dig deeper into the research, to find angles her oncologists might have missed. She deserved a little more time. I believe tools like this will help give that time to someone else.

Acknowledgments: Thanks to Marcy Ewald and Neil Shah for reviewing drafts of this essay and contributing their thoughts and feedback.

Thoughts on AI 2027

Tue, 29 Apr 2025 12:00:00 +0000

This is a quick reaction to: ai-2027.com, which forecasts a very rapid timeline for Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).

I’d say the likelihood of a “fast takeoff” scenario like this has greatly increased since ChatGPT. The big breakthrough of the past decade is that scaling up the neural net also scales up intelligence. The AI 2027 scenario is based on that trend: if the scaling laws hold, then exponentially increasing AI compute - as companies plan to do - will exponentially increase its intelligence.

It could also not happen.

It’s anyone’s guess what the likelihood and timeline of this scenario is. I think the likelihood is certainly not zero, and I’d say it definitely won’t be tomorrow or next week. 2027 is a reasonable guess for the earliest it could be. It’s aggressive, but possible. For that reason, I think it’s worth someone’s time to “game it out” for the good of humanity in case it does happen we have some kind of plan.

When “we all” did the thought experiment and game theory with nuclear weapons, we settled on Mutually Assured Destruction and detente. I think the common knowledge of those concepts have (so far) saved us from nuclear extinction.

Similarly, the “good” scenario here is a deliberate slowdown to focus on human alignment. I think it’s worth raising awareness of the catastrophic risks of “true” AGI / ASI and the rapid, insidious way it might happen. As in this story: it may be an AI that emerges so quickly and is so smart that it conceals its actions until it’s too late.

Personally, I give it a low percentage of happening so soon or at all, because for instance we gave Claude all of the Internet and it still can’t beat Pokémon. I also think AI will need more physical embodiment to really outmaneuver humans. It could still do a lot of damage online though.

The next frontier for AI is long horizon tasks like beating Pokémon, building complex apps, inventing things, or scheming to take over the world. We only have a peek of the potential and it still blunders a lot – unless that’s a ploy to give us a false sense of security!

My guess is that scaling up and text-based generation won’t be enough for that. It will need another breakthrough, which is probably a new paradigm of AI model, plus embodiment. People are working on fulfilling that prophecy…

I’ll have to check back in on this prediction in 2027.

One of the authors Daniel Kokotajlo made similar predictions back in 2021: What 2026 Looks Like.

Download the Internet with Kiwix

Wed, 19 Jun 2024 18:00:00 +0000

I’ve always been a bit of a data hoarder. Ever since I heard of Kiwix - a tool for “downloading the internet” - I’ve been meaning to try it out.

If you want to skip the explanation, here is my code on GitHub:

github.com/whusterj/kiwix-torrent-watcher

How it Works

With Kiwix, you first download entire sites like Wikipedia as a ZIM file. Kiwix is the software you use to browse these files offline.

What a great idea and right up my alley. But at first I was confused. I didn’t know where I could reliably find ZIM files. I searched around for a directory, and found this.

download.kiwix.org/zim/

Browsing the directory, you can see that there are a lot of options for each archive. The wikipedia directory contains hundreds of ZIMs. Looking closer, you see that they follow a naming convention of source_languageCode_contentType_date. Once you know how to intepret this, you can find what you’re looking for - though I admit it’s not entirely straightforward.

In my case, I wanted it all. Well OK, not all. I specifically wanted the most complete archive of Wikipedia in English. Turns out that this file is prefixed wikipedia_en_all_maxi and is about 102GB as of this writing.

A Script to Get the Latest Archives

In the listing you can see that archives are not updated very frequently. New Wikipedia archives seem to appear every few months. It’s unlikely that I will remember to check on this and go through all the motions to get the latest archive.

I decided to write a script that I could set up as a cron job on my server. It would check the listings, find newer archives, and automatically download them.

Turns out, someone else had the same idea! I found this script by Adrien Andre on GitLab. All due respect to Adrien, I had some issues when I tried to run it, but it was a great starting point for me. In my copy, I’ve made some changes to improve the reliability and accuracy of the script. Overall, it works the same way.

First, you define the listings you want in a file called zim.lib. Mine looks like this:

wikibooks_en_all_maxi
wikipedia_en_all_maxi
freecodecamp_en_javascript
gutenberg_en_all

I’ve listed the prefixes for the archives that I want without the date. The script takes care of finding the most recent archive by date.

The script has the following parts:

The entry point is Python __main__.py
This invokes a bash script that reads the file listings from https://download.kiwix.org/zim/. The listings are also cached for a time in a local .txt file.
The listings are compared to our local zim.lib file to identify new archives we don’t have.
New archives are queued up for torrenting using Transmission, which runs headlessly on the server

Very simple.

Usage

First, you want to make sure Transmission is installed and running on your server. I’ve tested this only on Ubuntu 22.04, and there are instructions here: help.ubuntu.com/community/TransmissionHowTo

Next, clone the repo and cd into it:

git clone git@github.com:whusterj/kiwix-torrent-watcher.git
cd kiwix-torrent-watcher

Install the Python depedencies:

# Create a virtual environment
python -m venv .venv

# Activate
source .venv/bin/activate

# Install Python dependencies
python -m pip install -r requirements.txt

Run it - replace /data/documents/kiwix/ with the directory where you want your

LOGLEVEL=DEBUG python __main__.py zim.lib /data/documents/kiwix/

LOGLEVEL=DEBUG is optional, but I recommend using it the first few times to follow what is happening in the script.

Set Up a Cron Job

From here you could set up a cron job like so:

Edit your cron table.

crontab -e

Install this cron job to run every day at 3AM:

0 3 * * * /bin/bash -c 'source /path/to/kiwix-torrent-watcher/.venv/bin/activate && python /path/to/kiwix-torrent-watcher/__main__.py zim.lib /data/documents/kiwix/' >> /path/to/log/kiwix-torrent-watcher.log 2>&1

Double check that all the paths work for your system. Cron jobs can sometimes be tricky to get right and to test. I recommend trying to run the command without the 0 3 * * * prefix to debug it.

How I Ham (🍖📻 Ham Radio) - Part 1

Sat, 01 Jun 2024 12:00:00 +0000

I’ve been a licensed ham radio operator for five years now, so I thought I’d reflect on my progress and how I am doing ham radio today in 2024.

Microelectronics Were the Gateway

My path to ham radio began in 2017. I was working on some Arduino projects. That turned out to be my “gateway drug”. At the time, I wanted build and deploy some low power wireless sensors in my apartment. We had a loft area that would heat up in the summer, and I was curious to monitor the temperature difference in various rooms. I also built a moisture sensor to help us remember to water our house plant.

Here are some shots of the sensors I built. They are each based on an Arduino Nano with a NRF24L01 radio module. The first sensor is a homebrew moisture sensor that uses two screws as the anode and cathode to stick in the dirt. It works by measuring the resistance between the screws. More resistance means less moisture.

The moisture sensor board also has a photoresistor to measure light levels. Both the moisture and light values were reported as part of the wireless payload.

And here’s a separate board with a temperature and barometric pressure sensor:

Both boards were powered by two AA batteries and could survive for many months, even a year, with proper low power optimizations. The main goal was low power and my other goal was low cost. For instance, some out-of-the-box moisture sensors can cost $20-50. My recycled screws-and-plastic sensor cost at most $2. All together, each of these boards easily cost less than ten dollars in parts.

Not pictured is the old Raspberry Pi I used as a receiver and server. Every few minutes the sensors would wake up, take a measurement, and chirp it out of their transmitters. The Raspberry Pi was always listening for these incoming signals. It translated them into text and wrote them down to append-only time series files. I then built a small Flask app with API endpoints that would read these files, filter them, and display the data in a web page.

That project is long dead, but the code is still on GitHub. One self-imposed requirement of this project was that I would not use WiFi or Bluetooth. I wanted to have a lower-level understanding of what the radio transmitters were doing. For the project I selected NLRF24 radio modules, and I even designed a little byte-by-byte protocol for them to exchange information.

Down the Rabbit Hole

I can’t remember precisely when or how I fell down the radio rabbit hole, but I know for certain that it started when I was reading about how the NLRF24 modules worked. I started encountering many unfamiliar terms and concepts about radio frequencies, modulation, propagation, etc.

I dug deeper and that must be when I came across ham radio. It may seem silly, but I was really into the idea that I could have my own unique, internationally-recognized callsign. I was also fascinated by the more “magical” aspects of radio: the fact that radio is all the invisible wavelengths of light and is therefore literally an unseen force all around us at all times that transmits information at lightspeed. I often say this is the closest thing to fairy tale magic that you can find in the real world!

In the background of this, I love languages and code. I had always wanted to learn Morse Code. And like many others I also appreciated the emergency preparedness and self-reliance aspects of “owning your comms” end to end.

I was gearing up to get my license then. Naturally, I followed the standard beginner’s advice and bought a Baofeng UV-5R from Amazon. I recall that I did some listening. I managed to find a few repeaters and learned some radio basics, but I was honestly a bit mystified. It is admittedly difficult to step into ham radio at first, especially if you expect - as I did - to be able to Google all the answers. Perhaps due to its age and long legacy, ham radio is much less of an “online” hobby than others. Books, magazines, and engaging with other hands-on participants are still the best way to learn.

Later that year, we moved to a new house. At work, we were also launching the new business that would eventually become ThinkNimble. My personal electronics and radio projects ended up being boxed away for a couple years.

Getting Licensed

In 2019 ThinkNimble had begun to stablize, and our team was growing. It was still a rather stressful time, and so I found myself looking for hobbies that would help take my mind of work, while still being rooted in creativity and technical problem solving. This brought me back to ham radio.

I became an avid follower of Josh Nass’s (KI6NAZ) Youtube channel, the Ham Radio Crash Course (HRCC). I’ve learned a TON from Josh. I’m not sure that I could have gotten oriented without his work. I’ve been a “Producer” level Patreon patron of his for several years now. My wife and I enjoy listening to the podcast that he does together with his wife.

From HRCC, I got wise to hamstudy.org, which is IMO the best way to quickly study and obtain your license. For instance, I challenged my wife to study for 30 minutes a day until her proficiency was above 80%. She’s a good sport, but definitely not as interested in “technical stuff” as I am. Well, using HamStudy, she passed her Technician exam on the first try after only two weeks of studying like this!

If you want to do more than just pass the test, then the ARRL handbooks are a must. Reading through these handbooks were the best way I found to actually learn and fully understand the ham radio concepts on the test. Again, this is all a bit more difficult to find online than you might be used to, though it’s gradually changing as a new generation of hams come online.

I finally got licensed as a Technician in May 2019, callsign KC3NKM. A local ham club in Maryland was offering tests, so I drove out there. Online testing was not yet a thing! The guys were nice, and they seemed impressed at how quickly and precisely I completed the test. I’ve always been good at memorizing things, and in hindsight I studied way more than was necessary. They enthusiastically congratulated me, and it was a lovely welcome to the community.

I started getting into VHF/UHF, first with a Yaesu FT-2DR handheld and then with an FTM-400 “mobile” radio that I kept on my desk at home. I bought a vertical antenna and installed it on my chimney. To do so, I had to get up on our hot tin roof on a balmy 90-degree DC day. In order to route the coax into the house, I bought a huge drill bit and a hammer drill, which I used to put a 2-inch hole through my foundation into the basement where I have my office and now my “ham shack”. For a few years, this hole was plugged with nothing more than hot glue - more on that later.

From there, I got familiar with the local repeaters, programmed my radios, and started listening in regularly. A couple times I checked into a local weekly “net” that is based in Rockville MD. I learned about Yaesu System Fusion and other digital modes. Through all this, I did a lot more listening than speaking. I was, and still am, very shy.

Lock Down

We all know what happened in 2020. At this point, I felt like I was reaching the end of my interest in VHF/UHF, and I was frustrated with the limitations of the Technician license. I had learned enough to realize that HF is where the real fun is, offering both long distance contacts and above all digital modes like FT8.

FT8 really appealed to me because I could make long-distance contacts without having to speak on the radio. To this day, I’ve made most of my almost 2000 confirmed contacts this way, reaching from South Africa to New Zealand, Mongolia, Japan, and all over Europe.

So when the lockdowns began, my interest in emergency preparedness was immediately rekindled and I made a sprint to get my General license. In early 2020, my target was to get an emergency HF radio kit together in time for Field Day, which happens at the end of June.

Although I did not have extensive HF privileges yet, I went ahead and bought an IC-7300. It is a great radio, and still the best entry-level HF radio by many accounts. I also bought some speaker wire and began assembling basic wire antennas. Pictured here are wires for 10, 20, and 40 meters along with radials. They actually worked pretty well!

Like so many other things, ham radio licensing tests quickly went online. I again used HamStudy to prep for my General license, and I again easily passed the exam, this time over Zoom from my kitchen table. I now had my General license!

My next step was to choose and obtain a vanity callsign. I didn’t love the default callsign I was assigned. I ended up choosing W3WYM. The W3 prefix is appropriate for Zone 3 - this part of the East coast. And then “WYM” is sort of like an abbreviated way to say “William”. Spelled phonetically, you might also say “Whiskey Yankee Microphone”. Well, I’m a Yankee, and if you catch me talking in a microphone, then whiskey is likely involved 😅.

This concludes part 1 - I will publish and link part 2 soon!

Exploring Battle Outcome Probabilities in 'War Room', the Board Game by Larry Harris

Sun, 18 Feb 2024 12:00:00 +0000

War Room is a board game by Larry Harris, the designer of Axis and Allies. I see it as at least a spiritual successor to AA, if not a complete overhaul. Like AA, it’s a grand strategy game set during World War II. Players control the major Axis and Allied powers and vie for the control of territories. The ultimate goals is to capture the opposing capitals and win the war.

You can see my original Python notebooks on GitHub.

I previously did an exploration of battle outcome probabilities in Axis and Allies using JavaScript and C (links below). Battles in War Room are similar in some ways and very different in others. The unit types are mostly identical, but the way dice are rolled and hits are allocated is different. War Room’s dice have six different colors distributed across 12 faces. The first four colors: RED, GREEN, BLUE, and YELLOW correspond to different unit types. The last two colors: black and white, are wild, and may or may not be applied in specific circumstances.

The other big difference from AA is that only one “round” of combat and dice rolling happens per combat per turn. The impact of this rule is that most combats end up playing out over several turns, during which time the territory remains “contested,” and players can make plans and moves that reinforce their forces in the territory. I personally like this. I think it reflects the reality of having different “fronts” open in a given region and potentially bogging down as both sides throw more and more resources into the fight.

Part 1 - Modeling the Probabilities Using Simulation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 1 - Set Up the Simulation

Start by setting up the colors for the dice, the number of simulations we want to run, and the number of dice to roll per simulation.

Why Ten Dice per Roll?

I’ve chosen to roll ten dice per simulation of a roll, because the game includes ten dice and the rules suggest that you roll in batches of ten. Because BLACK and WHITE results are wild, this fact can impact how many “hits” are actually allocatable vs ignored.

NUM_SIMULATIONS = 10000
COLORS = ("YELLOW", "BLUE", "GREEN", "RED", "BLACK", "WHITE")
COLOR_LOOKUP = {
    "YELLOW": "#BBBB77",
    "BLUE": "#77AABB",
    "GREEN": "#77BB77",
    "RED": "#BB7777",
    "BLACK": "#222222",
    "WHITE": "#B0B0B0"
}
DIE_FACES = [
    "YELLOW",
    "YELLOW",
    "YELLOW",
    "YELLOW",
    "BLUE",
    "BLUE",
    "BLUE",
    "GREEN",
    "GREEN",
    "RED",
    "BLACK",
    "WHITE"
]
NUM_DICE = 10

Step 2 - Generate Rolls for Each Simulation

This is very straightforward in Python. We’ll use numpy’s random.choice function to simulate rolling all ten die at once. We’ll do this for each simulation. All results are collected into a list of lists and then turned into a pandas DataFrame so we can more easily view, manipulate, and analyze the data.

random_choices = []
for _ in range(NUM_SIMULATIONS):
    choices = np.random.choice(DIE_FACES, size=NUM_DICE)
    random_choices.append(choices)

df_random_choices = pd.DataFrame(random_choices)
df_random_choices

	0	1	2	3	4	5	6	7	8	9
0	BLACK	BLUE	BLACK	YELLOW	YELLOW	GREEN	WHITE	YELLOW	BLACK	BLUE
1	GREEN	GREEN	BLACK	YELLOW	GREEN	BLUE	BLUE	GREEN	YELLOW	YELLOW
2	RED	BLUE	YELLOW	BLUE	YELLOW	BLUE	YELLOW	YELLOW	GREEN	BLACK
3	GREEN	BLACK	BLACK	RED	YELLOW	RED	RED	RED	YELLOW	GREEN
4	YELLOW	YELLOW	RED	BLUE	BLUE	YELLOW	YELLOW	RED	BLUE	YELLOW
...	...	...	...	...	...	...	...	...	...	...
9995	BLUE	BLUE	GREEN	BLUE	BLUE	YELLOW	BLUE	YELLOW	YELLOW	BLUE
9996	RED	YELLOW	WHITE	BLACK	WHITE	BLUE	YELLOW	BLACK	YELLOW	YELLOW
9997	BLUE	YELLOW	GREEN	GREEN	GREEN	YELLOW	WHITE	GREEN	YELLOW	YELLOW
9998	YELLOW	YELLOW	YELLOW	GREEN	RED	YELLOW	BLACK	BLACK	GREEN	BLACK
9999	BLACK	YELLOW	YELLOW	BLUE	GREEN	BLUE	YELLOW	BLUE	BLUE	GREEN

10000 rows × 10 columns

Step 3 - Analyze the Results

The proportions of colors on the dice are known, so the expected value for a given roll or set of rules is easy to compute and doesn’t require a simulation. For instance, YELLOW has a probability of $\frac{4}{12}$ of being rolled, so the expected value for a roll of ten dice is $10 \times \frac{4}{12} = 3.33$.

But what is less clear is the variability you might expect from each color, and the neat twist in War Room’s battles is how BLACK and WHITE faces are wild and may impact the results. So here we will look at the entire distribution by each color.

Get and Graph a Distribution for a Color

Here I’m just testing the steps required to aggregate the results of the simulations and plot the distribution of a single color.

yellow_distribution = df_random_choices.apply(lambda row: row.value_counts().get('YELLOW', 0), axis=1)
yellow_distribution

     3
     3
     4
     2
     5
       ..
  3
  4
  4
  4
  3
Length: 10000, dtype: int64

plt.clf()
plt.title("Distribution of YELLOW dice")
plt.hist(yellow_distribution, bins=df_random_choices.shape[1], color=COLOR_LOOKUP["YELLOW"])

(array([ 191.,  858., 1919., 2591., 2288., 1362.,  585.,  174.,   28.,
           4.]),
 array([0. , 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3, 7.2, 8.1, 9. ]),
 <BarContainer object of 10 artists>)

Aggregate and Plot All Colors

With that figured out for yellow, we can put all colors on the same plot to see how they compare.

plt.clf()

fig, axes = plt.subplots(2, 3, figsize=(12, 8))
axes = axes.flatten()

for i, color in enumerate(COLORS):
    distribution = df_random_choices.apply(lambda row: row.value_counts().get(color, 0), axis=1)
    axes[i].hist(distribution, bins=range(df_random_choices.shape[1] + 1), color=COLOR_LOOKUP[color])
    axes[i].set_title(f"Distribution of {color} dice")
    axes[i].set_xlim(right=df_random_choices.shape[1])  # Set max value on x-axis

plt.show()

<Figure size 640x480 with 0 Axes>

What We See

Distributions appear normally distributed around the expected value for their color. There’s a slight skew, because the distributions are discrete and values can’t go negative, but the shape is generally normal. As we’ll see, these are technically binomial distributions, not normal distributions, but the shape still resembles a bell curve.

Step 4 - Adding in White and Black Results

This step is a bit tricky, and the results may not be entirely correct, depending on what you’re looking to get out of the problem. My approach here is to add the wild results to all colors. Then we look at the ‘before wilds’ and ‘after wilds’ distributions for each color to see how BLACK and WHITE rolls impact the distribution of hits.

In practice you can only apply a wild to one color, not all of them. So doing it this way overstates the total number of hits across all colors. So you should look at the ‘after wilds’ distributions as a sort of upper bound on the number of hits you might expect.

df_by_color = df_random_choices.apply(lambda row: pd.Series(row.value_counts()).fillna(0), axis=1)
df_by_color.fillna(0, inplace=True)
df_by_color

	BLACK	BLUE	GREEN	RED	WHITE	YELLOW
0	3.0	2.0	1.0	0.0	1.0	3.0
1	1.0	2.0	4.0	0.0	0.0	3.0
2	1.0	3.0	1.0	1.0	0.0	4.0
3	2.0	0.0	2.0	4.0	0.0	2.0
4	0.0	3.0	0.0	2.0	0.0	5.0
...	...	...	...	...	...	...
9995	0.0	6.0	1.0	0.0	0.0	3.0
9996	2.0	1.0	0.0	1.0	2.0	4.0
9997	0.0	1.0	4.0	0.0	1.0	4.0
9998	3.0	0.0	2.0	1.0	0.0	4.0
9999	1.0	4.0	2.0	0.0	0.0	3.0

10000 rows × 6 columns

df_black = df_by_color.pop("BLACK")
df_white = df_by_color.pop("WHITE")
df_by_color

	BLUE	GREEN	RED	YELLOW
0	2.0	1.0	0.0	3.0
1	2.0	4.0	0.0	3.0
2	3.0	1.0	1.0	4.0
3	0.0	2.0	4.0	2.0
4	3.0	0.0	2.0	5.0
...	...	...	...	...
9995	6.0	1.0	0.0	3.0
9996	1.0	0.0	1.0	4.0
9997	1.0	4.0	0.0	4.0
9998	0.0	2.0	1.0	4.0
9999	4.0	2.0	0.0	3.0

10000 rows × 4 columns

df_black

     3.0
     1.0
     1.0
     2.0
     0.0
       ...
  0.0
  2.0
  0.0
  3.0
  1.0
Name: BLACK, Length: 10000, dtype: float64

df_white

     1.0
     0.0
     0.0
     0.0
     0.0
       ...
  0.0
  2.0
  1.0
  0.0
  0.0
Name: WHITE, Length: 10000, dtype: float64

Next, add the BLACK results to each color. This can be done with simple matrix addition.

df_by_color = df_by_color.add(df_black, axis=0)
df_by_color

	BLUE	GREEN	RED	YELLOW
0	5.0	4.0	3.0	6.0
1	3.0	5.0	1.0	4.0
2	4.0	2.0	2.0	5.0
3	2.0	4.0	6.0	4.0
4	3.0	0.0	2.0	5.0
...	...	...	...	...
9995	6.0	1.0	0.0	3.0
9996	3.0	2.0	3.0	6.0
9997	1.0	4.0	0.0	4.0
9998	3.0	5.0	4.0	7.0
9999	5.0	3.0	1.0	4.0

10000 rows × 4 columns

Now add WHITE results. WHITE can only be applied to a unit that has already received a hit. A unit can potentially receive a hit from a same-colored result or a BLACK result. So before adding WHITE to each column, we clip the value based on the number of hits after adding black. This effectively ignores excess WHITE results that couldn’t be paired with a colored or BLACK result.

NOTE: It would be interesting to have a look at how many white results end up being ignored.

for color in df_by_color.columns:
    df_white_color = df_white.copy().clip(upper=df_by_color[color])
    df_by_color[color] += df_white_color

df_by_color

	BLUE	GREEN	RED	YELLOW
0	6.0	5.0	4.0	7.0
1	3.0	5.0	1.0	4.0
2	4.0	2.0	2.0	5.0
3	2.0	4.0	6.0	4.0
4	3.0	0.0	2.0	5.0
...	...	...	...	...
9995	6.0	1.0	0.0	3.0
9996	5.0	4.0	5.0	8.0
9997	2.0	5.0	0.0	5.0
9998	3.0	5.0	4.0	7.0
9999	5.0	3.0	1.0	4.0

10000 rows × 4 columns

Before and After Wilds Comparison

Now we’ll re-chart our color distributions from before and compare them to the adjusted distributions after adding BLACK and WHITE results…

plt.clf()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

for i, color in enumerate(["YELLOW", "BLUE", "GREEN", "RED"]):
    distribution = df_random_choices.apply(lambda row: row.value_counts().get(color, 0), axis=1)
    axes[i].hist(distribution, bins=range(df_random_choices.shape[1] + 1), color=COLOR_LOOKUP[color])
    axes[i].set_title(f"Distribution of {color} dice")
    axes[i].set_xlim(right=df_random_choices.shape[1])  # Set max value on x-axis

plt.suptitle("Before adding black and white dice")
plt.show()

<Figure size 640x480 with 0 Axes>

plt.clf()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

for i, color in enumerate(["YELLOW", "BLUE", "GREEN", "RED"]):
    axes[i].hist(df_by_color[color], bins=range(df_random_choices.shape[1] + 1), color=COLOR_LOOKUP[color])
    axes[i].set_title(f"Distribution of {color} dice")
    axes[i].set_xlim(right=df_random_choices.shape[1])  # Set max value on x-axis

plt.suptitle("After adding black and white dice")
plt.show()

<Figure size 640x480 with 0 Axes>

Conclusions - Wild Results Have a Decent Impact!

So it looks like adding in wild results has shifted each distribution up by a little less than two hits. This is to be expected, because both BLACK and WHITE have a 1:12 chance of being rolled, and white is slightly discounted, because it may not always be allocatable.

The impact is most pronounced on GREEN and RED. This is because their odds have been improved by a lot: 2X in the case of GREEN (from 2:12 -> ~4:12) and 3X in the case of RED (from 1:12 -> ~3:12).

Again, this effectively shows the maximum number of hits you might be able to allocate to each indvidual color. In a real game, BLACK and WHITE results won’t shift all of the colors up by the same amount, because you can only apply a wild to one color at a time. But presumably you would allocate the wild to the color that would benefit the most from it, so this is a reasonable upper bound on the number of hits you might expect to allocate to each color.

Part 2 - Modeling the Probabilities in “Pure Math”

I’m very familiar with simulation approaches at this point. Practical and imperative math like this also makes a lot of sense to me as a programmer.

So next, I would like to take a more mathematical approach to the problem, which is not as natural to me.

I was trying to derive the formula myself while in the shower this morning. It’s clear that it’s not as simple as looking at the probability of rolling a single color. We need to look at combined probabilities across any number of rolls.

I started by thinking about a specific case: how would you compute the probability that you could roll ten times and end up with zero YELLOW results? I know I studied this back in high school math, but twenty years later it’s not exactly top of mind.

I realized that the probability of having no YELLOW results is the same as rolling non-YELLOW ten times in a row. That would be something like $(8/12)^{10}$. Not math I can do in my head, but definitely a small number. Using a calculator, there’s about a 1.7% chance of this happening. Makes sense.

Then I moved on to the probability of having one out of ten results be YELLOW, and here I realized that I needed to somehow combine the probability of success with the probability of failure. That’s where I got stuck mentally.

Computing the Binomial Distribution

So I turned to Google and ChatGPT who pointed me at the Binomial Probability formula, which I can use to compute a Binomial Distribution for a given number of dice rolls and a given probability of success.

The formula for Binomial Probability is:

\[P(X = k) = \binom{n}{k} \times p^k \times (1-p)^{n-k}\]

where:

$n$ is the number of trials (number of die rolls)
$k$ is the number of successful outcomes (rolling a particular color)
$p$ is the probability of success on a single trial
$\binom{n}{k}$ represents the binomial coefficient, which is the number of ways to choose $k$ successes out of $n$ trials

My recollection of this formula from math class is very vague. Looking at it closely, I see that my intuitions were on the right track. The formula is basically using multiplication to combine the probability of getting $k$ successes with the probability of getting $n-k$ failures.

The big piece I was missing and failed to derive for myself was the binomial coefficient.

The Binomial Coefficient

The binomial coefficient is the piece I was missing working through this problem in my head. This is how we can combine the probability $p^k$ of getting $k$ successes with the probability $(1-p)^{n-k}$ of getting $n-k$ failures.

The formula for the binomial coefficient is given by:

\[\binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}\]

where:

$n!$ denotes the factorial of $n$
$k!$ denotes the factorial of $k$
$(n-k)!$ denotes the factorial of $n-k$

Lots of factorials. What do they mean? The question was asked and answered on the Math Stack Exchange here.

I think I get it, but I don’t love these kinds of explanations that are very much grounded in the abstract language of math, rather than a concrete representation of what actually happens in the real world when I toss the dice. To gain a better understanding, let’s look at the binomial coefficient in Python.

Binomial Coefficient in Python

In Python we could implement a function that computes the binomial coefficient exactly as represented above using math.factorial. Python makes it even easier by providing this function in the standard library as math.comb.

Remembering that in our case $n$ is the number of dice rolls and $k$ is the number of rolls of a particular color. We can hold $n$ steady at 10 and see how the coefficient varies with $k$. So let’s use NUM_DICE as $n$ from our simulations above. Then we’ll step through every possible value of $k$ successful roles. In math lingo, that is the set of $\{0, 1, 2 … 10\}$.

import math

n = NUM_DICE
for k in range(n + 1):
    print(k, math.comb(n, k))

So here we can see the rise and fall of the distribution in number form. It is centered around 5, because at this point we have not introduced any probabilities. So perhaps we could say that this is the “standard” binomial distribution for 10 dice.

And looking back at the binomial probability formula, we see that this binomial coefficient functions as a way to increase the “weight” of the positive case against the negative case of rolling a particular color. At the center of the distribution it is a heavy weight, indeed. It multiplies the probability of the success case by up to 252 times!

A Real Understanding of the Binomial Coefficient

So back to how we can make sense of these numbers… Let’s imagine it as cases of throwing ten real dice. For each number of successful YELLOW rolls $k$, how many different ways can the dice rolls be combined to results in that number?

In the case of zero successful dice rolls of a given color ($k=0$), the binomial coefficient says there’s only one way this can happen. This is a little counterintuitive, because I can easily think of many ways that we could roll zero YELLOW dice: we could roll all BLUEs or all REDs or all GREENs or any combination of those. So this is our first revelation: for the purposes of the binomial coefficient, all those negative cases are treated as the same case, regardless of order or combination of non-YELLOW results.

Now moving up to $k=1$, there are ten different ways to roll just one YELLOW die in ten. This is easy to make sense of. We’re still ignoring the order, but because there are ten dice, there are ten different combinations of one YELLOW dice and 9 others: die #1 could be YELLOW, or die #2, and so on.

At $k=2$, we find there are 45 different ways to roll two YELLOW dice in ten. We’re still ignoring the order, but the possible combinations are quickly adding up. For instance, there are ten combinations of die #1 and the other 9 dice. And then there are nine cases of die #2 and the other 8 dice - note that we already counted the combo of die #1 and die #2. And then there are eight cases of die #3 and the remaining 7 dice, and so on.

Now that is starting to look like a factorial. So the formula is helping us compute the sum of these possible combinations for a given number of dice rolls ($n$) and a target number of successful rolls ($k$).

Computing and Graphing the Probability Distributions

Now let’s put it all together in Python and make some more plots…

First, we’ll define a function for binomial probability, based on the formula above and using math.comb again for the binomial coefficient.

def binomial_probability(n, k, p):
    return math.comb(n, k) * pow(p, k) * pow(1 - p, n - k)

Now, let’s set up $n$ and p_yellow then compute the binomial distribution for YELLOW.

n = NUM_DICE
p_yellow = 4 / 12

yellow_binom_dist = list((k, binomial_probability(n, k, p_yellow)) for k in range(n + 1))
yellow_binom_dist

[(0, 0.017341529915832633),
 (1, 0.08670764957916315),
 (2, 0.19509221155311707),
 (3, 0.26012294873748937),
 (4, 0.22760758014530316),
 (5, 0.13656454808718185),
 (6, 0.05690189503632578),
 (7, 0.016257684296093075),
 (8, 0.0030483158055174507),
 (9, 0.0003387017561686056),
 (10, 1.693508780843028e-05)]

Again, we see the rise and fall of the distribution in number form and that the peak is slightly shifted towards zero and centered on 3-4 successes, as we saw in the earlier simulation and would expect from the underlying probabilities.

We also see that the probabilities sum to 1.0 (allowing for some floating point error):

sum(prob for k, prob in yellow_binom_dist)

1.0000000000000007

Now let’s plot this distribution side-by-side with the simulation results from earlier:

plt.clf()

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes = axes.flatten()

x_values, y_values = zip(*yellow_binom_dist)
axes[0].set_title("Probability Distribution of YELLOW results")
axes[0].bar(x_values, y_values, align='center', color=COLOR_LOOKUP["YELLOW"], width=1)

axes[1].set_title(f"Sim Distribution of YELLOW dice")
axes[1].hist(yellow_distribution, bins=10, color=COLOR_LOOKUP["YELLOW"])
axes[1].set_xlim(right=10)

plt.suptitle("Side-by-side comparison of binomial and simulated distributions")
plt.show()

<Figure size 640x480 with 0 Axes>

As expected, the distributions look identical!

Part 3 - Including Wild Results in “Pure Math”

We can follow the same approach to compute our probability distribution for the number of BLACK results.

b_prob = 1 / 12
black_binom_dist = list((k, binomial_probability(n, k, b_prob)) for k in range(n + 1))

plt.clf()

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes = axes.flatten()

x_values, y_values = zip(*black_binom_dist)
axes[0].set_title("Probability Distribution of BLACK results")
axes[0].bar(x_values, y_values, align='center', color=COLOR_LOOKUP["BLACK"], width=1)

black_distribution = df_random_choices.apply(lambda row: row.value_counts().get("BLACK", 0), axis=1)
axes[1].set_title(f"Sim Distribution of BLACK dice")
axes[1].hist(black_distribution, bins=10, color=COLOR_LOOKUP["BLACK"], width=1)
axes[1].set_xlim(right=10)

plt.suptitle("Side-by-side comparison of binomial and simulated distributions")
plt.show()

<Figure size 640x480 with 0 Axes>

Now at first I thought that we should somehow combine this distribution with the YELLOW distribution through addition or multiplication, but I realize that wouldn’t be correct. The BLACK results really modify the probability of a success ($p$) in the binomial probability function.

So this is actually fairly simple, I think, we must recompute the binomial distribution with a higher value for $p$.

y_and_b_prob = 5 / 12

yellow_and_black_binom_dist = list((k, binomial_probability(n, k, y_and_b_prob)) for k in range(n + 1))

plt.clf()
x_values, y_values = zip(*yellow_and_black_binom_dist)
plt.title("Combined Probability Distribution of YELLOW and BLACK results")
plt.bar(x_values, y_values, align='center', color=COLOR_LOOKUP["YELLOW"], width=1)

<BarContainer object of 11 artists>

So we see that this has shifted the distribution to the right a bit by approximately 1 hit.

How to Add WHITE Results?

This one is really tricky because WHITE results are not independent of YELLOW and BLACK results. They can only be counted in cases where one of the others is already present. In the simulation I accounted for this by clipping the number of WHITE results to be less than or equal to the total of COLOR + BLACK results.

The Impact of Adding WHITE Results

Looking at the simulation results, it seems like adding white results shifts the distribution to the right and also “crushes” the center of the distribution, resulting in a flatter peak and less overall variability around the center.

So adding WHITE results does not simply increase the probability of a hit. It changes the whole character of the distribution.

We can see this by running the numbers with a slightly higher probability to represent adding WHITE dice.

n = NUM_DICE
y_or_b_prob = 5.5 / 12

yellow_and_wilds_binom_dist = []
for k in range(n + 1):
    yellow_and_wilds_binom_dist.append((k, binomial_probability(n, k, y_or_b_prob)))

yellow_and_wilds_binom_dist

[(0, 0.0021743082111706858),
 (1, 0.018397992556059645),
 (2, 0.07005389473268862),
 (3, 0.15807032657632303),
 (4, 0.23406567589186295),
 (5, 0.23766668629019927),
 (6, 0.16758548392257636),
 (7, 0.0810303438746523),
 (8, 0.02571155142176466),
 (9, 0.004834650694690791),
 (10, 0.0004090858280122976)]

sum(prob for k, prob in yellow_and_wilds_binom_dist)

1.0000000000000004

plt.clf()

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes = axes.flatten()

x1_values, y1_values = zip(*yellow_and_black_binom_dist)
axes[0].set_title("Probability Distribution of YELLOW and black results")
axes[0].bar(x1_values, y1_values, align='center', color=COLOR_LOOKUP["YELLOW"], width=1)

x2_values, y2_values = zip(*yellow_and_wilds_binom_dist)
axes[1].set_title("Probability Distribution of YELLOW and wild results")
axes[1].bar(x2_values, y2_values, align='center', color=COLOR_LOOKUP["YELLOW"], width=1)

plt.suptitle("Side-by-side comparison of binomial and simulated distributions")
plt.show()

<Figure size 640x480 with 0 Axes>

Looking back at the YELLOW graph from our simulation results above, we can clearly see the shape of this distribution is more “sharp” at the peak than it should be, while the simulation resulted in a flatter peak. This shows that there must be a better way to mathematically model the impact of WHITE results.

Using a Multinomial Distribution to Model the Impact of White Results

After a lot of flailing around (Python notebook on GitHub), I figured out that the multinomial distribution is the right tool to model the impact of WHITE results. I found this stats page from Penn State particular really helpful.

As you will see, I had to go a few steps beyond that page and anything I could find online. I won’t go into detail here about all the failed paths I tried. But I will say that I could not have figured this out without those failures and the small “breadcrumbs” they left me along the way. That’s always worth appreciating.

Adapting some notation from the Penn State page, let’s call our multinomial distribution $Z$ and define a distribution function $m$. The distribution function is the probability that $Z = (Z_c,Z_w,Z_o)$ takes a particular value $z = (z_c,z_w,z_o)$ and has corresponding probabilities $p = (p_c,p_w,p_o)$.

\[m(z) = \frac{n!}{z_c!z_w!z_o!} \cdot {p_y}^{z_c}{p_w}^{z_w}{p_o}^{z_o}\]

where:

$n$ represents the total number of trials
$z_c$, $z_w$, $z_o$ represent the number of colored, white, and other outcomes, respectively
$p_c$, $p_w$, $p_o$ represent the probability of a colored, white, and other outcome, respectively

So that is the core of our solution. But we’re not done, because what we really want is to get the probability of $Z$ for a given number of hits $k$. For this we need to iterate over the combinations of $c$ and $w$ and sum their probabilities.

\[P(Z = k) = \sum\limits_{c=0}^{n+1} \sum\limits_{w=0}^{n+1} m(z) \cdot [c + c \vee w = k] \cdot [c + w < n]\]

where:

$n$ represents the total number of trials
$c$ represents the number of colored outcomes
$p_c$ represents the probability of a colored outcome
$w$ represents the number of white outcomes
$p_w$ represents the probability of a white outcome
$k$ represents the desired value of the sum of colored and minimum of colored and white outcomes and $0 \leq k \leq n$
$\vee$ represents the min function

Mathematical `if` Statements

The last two expressions in Iverson brackets are the equivalent of if statements in our code. They are also known as indicator functions. If the expression is true, the value is 1, otherwise it is 0.

It also helps to think of these expressions as filters. If they are false, they multiply the entire expression by 0, effectively discounting it from the sum. This is very similar to passing the same expression as a lambda function to a filter in Python or JavaScript.

The first expression is doing two things. First, we only want to count cases where $c + w$ match our $k$ value. And second, we also want to reflect the rule that white only counts when paired with another color. So we take the lesser of the two values with $c \vee w$ so that cases where the number of white results exceeds the number of colored results is also ignored.

\[[c + c \vee w = k]\]

The second expression prevents us from counting cases where the number of colored and white results exceeds the total number of dice. It’s not possible to have more hits than dice.

\[[c + w < n]\]

Visualizing the Combinations

In my exploration of the problem, I found it helpful to visualize all the possible $k$ outcomes of combinations of $c$ and $w$ as a matrix. Like this:

n = 10
c = [i for i in range(n + 1)]
w = [i for i in range(n + 1)]

data = [[None if c_val + w_val > n else c_val + min(w_val, c_val) for w_val in w] for c_val in c]

# Create a DataFrame from the list of lists
df = pd.DataFrame(data, index=c, columns=w)

# Display the DataFrame
print(df)

   1     2     3     4     5    6    7    8    9    10
  0   0.0   0.0   0.0   0.0   0.0  0.0  0.0  0.0  0.0  0.0
  1   2.0   2.0   2.0   2.0   2.0  2.0  2.0  2.0  2.0  NaN
  2   3.0   4.0   4.0   4.0   4.0  4.0  4.0  4.0  NaN  NaN
  3   4.0   5.0   6.0   6.0   6.0  6.0  6.0  NaN  NaN  NaN
  4   5.0   6.0   7.0   8.0   8.0  8.0  NaN  NaN  NaN  NaN
  5   6.0   7.0   8.0   9.0  10.0  NaN  NaN  NaN  NaN  NaN
  6   7.0   8.0   9.0  10.0   NaN  NaN  NaN  NaN  NaN  NaN
  7   8.0   9.0  10.0   NaN   NaN  NaN  NaN  NaN  NaN  NaN
  8   9.0  10.0   NaN   NaN   NaN  NaN  NaN  NaN  NaN  NaN
  9  10.0   NaN   NaN   NaN   NaN  NaN  NaN  NaN  NaN  NaN
10   NaN   NaN   NaN   NaN   NaN  NaN  NaN  NaN  NaN  NaN

In building the dataframe, you can see the same filters in action: We take the minimum of $c$ and $w$ before computing $k$ and any combinations of $c$ and $w$ that would exceed $n=10$ is set to NaN, because it’s not possible.

Translating to Python

Now that we’ve defined this formally, let’s translate it to code. We start with our distribution function $m(z)$:

def m(n, cb, cb_prob, wh, wh_prob=1/12):
    n_fact = math.factorial(n)
    denom = math.factorial(cb) * math.factorial(wh) * math.factorial(n - cb - wh)
    return (n_fact / denom) * math.pow(cb_prob, cb) * math.pow(wh_prob, wh) * math.pow(1 - cb_prob - wh_prob, n - cb - wh)

Note that our Python version of $m$ takes in all possible parameters. Math notation allows us to gloss over this, but code does not. This is also useful if ever we want to customize any of them.

Next, we’ll translate $P(Z = k)$ to Python:

def PZk(k, n, c_prob, w_prob):
    return sum([
        m(n, c, c_prob, w, w_prob)
        for c in range(n + 1)
        for w in range(n + 1)
        if c + w <= n and c + min(c, w) == k
    ])

It’s in a slightly different order from the mathematical expression, but we see that all the same pieces are there. The $\sum$ symbols are for loops, $m(z)$ is a function call, and expressions in Iverson brackets are if statements.

Trying Out the “Pure Math” Solution

Let’s take it for a spin! Let’s see how it does computing the entire probability distribution for each of the colors. We’ll compute each distribution and then graph them, exactly as we did above in the simulation:

# Set up the same parameters as above
n = 10
w_prob = 1/12

# Define probabilities for each color
# NOTE: Each is bumped by one to include BLACK results
c_probs = {
    "YELLOW": 5/12,
    "BLUE": 4/12,
    "GREEN": 3/12,
    "RED": 2/12,
}

plt.clf()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

for i, (color, c_prob) in enumerate(c_probs.items()):
    # Calculate the distribution for this color
    distro = [(k, PZk(k, n, c_prob, w_prob)) for k in range(n + 1)]

    x_values, y_values = zip(*distro)
    axes[i].set_title(f"Distribution of {color} dice")
    axes[i].bar(x_values, y_values, align='center', width=1, color=COLOR_LOOKUP[color])
    axes[i].set_xlim(right=n)

plt.suptitle("After adding black and white dice")
plt.show()

<Figure size 640x480 with 0 Axes>

That looks like a pretty EXACT match to the graphs of our simulation results!

How Does the Math Compare to Simulation?

In terms of computational efficiency and flexibility, the mathematical solution definitely has an edge over simulation. It does involve some for-loops that scale exponentially with values of $n$. In this case, computing $P(Z = k)$ requires $10^2 = 100$ iterations, and computing the entire distribution for a color is $10^3 = 1000$ iterations.

1000 iterations is a lot, but it’s still far less than the 10,000 simulation iterations. The math formula also gives us results directly, and we can grab a probability for one $k$ value at a time. By contrast, the simulation is all-or-nothing and we have to collect results in a dataframe with additional steps to aggregate them. Finally, the math formula gives a precise, rather than estimated result.

The math solution has one big drawback to me, and that is its lack of clarity. It only took me a couple of hours to whip up a complete simulation, but it took the better part of a day to research, derive, and test the math.

Part of that is down to my lack of experience. But I’ve now spent a lot of time with the problem, and looking back at the code I still think the simulation is easier to comprehend. A proper understanding of the mathy code requires background knowledge of probability and combinatorics. It also helps to recognize the abstract patterns in how these sorts of problems are usually formulated and solved, for instance, the roles of factorials and exponents.

The simulation on the other hand requires no such grounding. I think in both cases you need to understand the rules of how War Room battles are resolved, but from there following along with the simulation code is straightforward, because the steps concretely match what you would experience in the real world, while the math is too abstract.

What do you think?

Red Tape and Cancer

Sat, 03 Feb 2024 12:00:00 +0000

They analyzed his blood and found a speck. He didn’t need to know - yet. He still had several years before it would become an issue. Several productive years at a good income. The data was all there. Let him sow and later we’ll reap. That will all be harvested. It’s a simple calculation. Let him continue to be motivated and untroubled by health. Earning until the disease really takes hold. Then we can transfer the wealth away. He won’t miss it, after all. As he wanes he will turn his attention to other things. Family and the few precious moments he has left. He will pay whatever it takes to extend that time. He won’t miss it. That’s when we’ll cash in.

How to Use a Bash Script to Count Your Heroku Releases

Sat, 16 Dec 2023 11:30:00 +0000

Our Creative Director asked me if I could estimate the number of production releases we have pushed so we could bost about it on our website. I quickly figured out a strategy that would work using just the heroku apps and heroku releases commands, but realized I would need to draft a script to iterate and parse the outputs. I also knew it would be best to do this in bash to make use of those heroku-cli commands.

I’m ashamed to say my bash is not strong, so I turned to ChatGPT 3.5 for help. Here’s a trimmed version of the exchange with ChatGPT.

Me: Please write a bash script that helps me count how total production releases to Heroku. Procedure:

Calls heroku apps -A to list my Heroku apps

Finds all apps whose names contain the substring ‘prod’

For each app in #2, calls heroku releases -a {app_name}

Sums all the latest version numbers to get a total number of production releases

ChatGPT was pretty close on the first try. The overall structure of the script was correct, but it was not correctly parsing the latest release from the output of the heroku releases command. This was the command it drafted:

releases=$(heroku releases -a "$app" | grep -E '^[0-9]+' | head -n 1 | awk '{print $1}')

This is very close, but it won’t yield any matches, because the regex used for grep is matching digits 0-9, but the release numbers are prefixed with a v, e.g. v55. The correct regex is therefore: '^v[0-9]+'. The variable name releases is also not entirely accurate and may indicate a “misunderstanding” of the goal. We’d like this line to give us just the latest release number. That said, the command is using head -n 1 to get the latest release. So I decided to give ChatGPT an example of the output to see if it could correct the error.

Me: This is close. There is just an issue parsing the latest release number from the result of the heroku releases command. Here is an example result:

v88  Deploy b5e486a4                       dev@thinknimble.com  2023/12/15 03:33:29 -0500 (~ 12h ago)
v87  Set REDIS_URL config vars             dev@thinknimble.com  2023/12/15 03:11:25 -0500 (~ 12h ago)
v86  Deploy eb10b67c                       dev@thinknimble.com  2023/12/15 02:48:50 -0500 (~ 13h ago)
v85  Deploy 2ced23ee                       dev@thinknimble.com  2023/12/15 00:41:51 -0500 (~ 15h ago)
v84  Deploy 1f24c745                       dev@thinknimble.com  2023/12/13 02:29:13 -0500
v83  Deploy 58c7705c                       dev@thinknimble.com  2023/12/13 02:08:38 -0500
v82  Deploy 8b80ec50                       dev@thinknimble.com  2023/12/12 21:19:02 -0500
v81  Deploy bb148408                       dev@thinknimble.com  2023/12/12 19:51:02 -0500
v80  Set USE_REDIS config vars             dev@thinknimble.com  2023/12/12 18:47:41 -0500
v79  Set REDIS_URL config vars             dev@thinknimble.com  2023/12/12 18:47:33 -0500
v78  Set REDIS_URL config vars             dev@thinknimble.com  2023/12/12 18:44:43 -0500
v77  Set REDIS_URL config vars             dev@thinknimble.com  2023/12/12 18:36:27 -0500
v76  Update NEW_RELIC by newrelic          dev@thinknimble.com  2023/12/12 10:44:14 -0500
v75  Set AWS_S3_SECURE_URLS config vars    dev@thinknimble.com  2023/12/10 17:29:48 -0500
v74  Set AWS_S3_CUSTOM_DOMAIN config vars  dev@thinknimble.com  2023/12/10 17:29:32 -0500

So in this case, we want to parse out the number 88 and add that to total_releases. Can you regenerate the script with that in mind?

ChatGPT then regenerated the script and changed the line. This time the regex is correct, but we have a regression. For some reason, ChatGPT removed the head -n 1 step. So this will give us a list of the release numbers of all releases in the output. I think it may have done this because I gave it a list of outputs above, instead of a single line.

latest_version=$(heroku releases -a "$app" | grep -E '^v[0-9]+' | awk '{print $1}' | sed 's/v//')

One more nudge from me set this right:

Me: Not quite. Instead of one latest_version this is giving me a list of all the versions. Try again?

And ChatGPT corrected the line. Now the variable name latest_version is good and we are getting a single release number to add to total_releases. Nice!

latest_version=$(heroku releases -a "$app" | grep -E '^v[0-9]+' | sort -rV | head -n 1 | awk '{print $1}' | sed 's/v//')

The Final Script

Here is the final script mainly authored by ChatGPT that will get a list of all production Heroku apps you have access to and sum up their latest release versions to give you a total number of all-time production deployments.

In our case, the number of total production releases is over 2000! And that doesn’t even include apps on AWS or those we have shut down and transferred away to clients. So I’d guess out team has probably pushed to prod almost 5000 times over the past five years. A vanity metric, perhaps, but it’s still cool to think about.

#!/bin/bash

# Step 1: List Heroku apps
apps=$(heroku apps -A)

# Step 2: Find apps with names containing 'prod'
prod_apps=$(echo "$apps" | grep 'prod' | awk '{print $1}')

# Step 3 and 4: Iterate over prod apps and get total releases
total_releases=0
for app in $prod_apps; do
    latest_version=$(heroku releases -a "$app" | grep -E '^v[0-9]+' | sort -rV | head -n 1 | awk '{print $1}' | sed 's/v//')
    if [ -n "$latest_version" ]; then
        total_releases=$((total_releases + latest_version))
        echo "App $app has version v$latest_version as the latest production release."
    else
        echo "No production releases found for app $app."
    fi
done

echo "Total production releases: $total_releases"

An Album for Each Year (2023)

Sat, 02 Dec 2023 18:00:00 +0000

I love lists. So here’s a list albums, one for each year of my life. This exercise was inspired by a blog post by the CTO of Amazon, Dr. Werner Vogels.

The rules are simple, Select “a favorite album for every year of your life”. The challenge had two restrictions:

Only one album per year
No repeat artists

I’m probably profiling myself pretty heavily by sharing this list, but who cares! I won’t go through them all, but I have a meaningful memory attached to each of these albums. My 2023 album of choice is a particularly weird and delightful one: “One Wayne G” by Mac Demarco. This album is actually an archive of musical sketches and experiments. The songs are not really complete or cohesive, but they have Mac’s signature style and are all pleasant to listen to. The titles of the songs are mostly just dates, like “20221102.” I found this really inspirational. I ended up doing a few similar musical experiments this year because of this album. Titling song sketches with a date is liberating, because it relieves the pressure of trying to make a complete song.

One of my favorite songs on the album actually does have a title: Proud, True Toyota. It’s hilarious and speaks to me, because I too own a proud, true, Toyota Tundra. The song is ridiculous, and it’s also ridiculous that I live in Washington DC and own an old pickup truck! It’s honestly not very smart or convenient, but it makes me happy.

Finally, the album will also always reminds me of my mom who passed away this year. I was playing the album through my bluetooth speakers at home with her in May, around Mother’s day. That was just a few weeks before she died on June 7.

1987: The Cure, Kiss Me, Kiss Me, Kiss Me
1988: The Pixies, Surfer Rosa
1989: Kate Bush, The Sensual World
1990: Depeche Mode, Violator
1991: Nirvana, Nevermind
1992: Weird Al Yankovic, Off the Deep End
1993: Smashing Pumpkins, Siamese Dream
1994: Pink Floyd, Division Bell
1995: Björk, Post
1996: Barenaked Ladies, Born on a Pirate Ship
1997: Radiohead, OK Computer
1998: Neutral Milk Hotel, In the Aeroplane Over the Sea
1999: Eiffel 65, Europop
2000: Enya, A Day Without Rain
2001: The Strokes, Is This It
2002: Coldplay, Clocks
2003: Belle and Sebastian, Dear Catastrophe Waitress
2004: The Killers, Hot Fuss
2005: Modest Mouse, Good News for People Who Love Bad News
2006: Hot Chip, The Warning
2007: Bright Eyes, Cassadaga
2008: Bon Iver, For Emma, Forever Ago
2009: Grizzly Bear, Veckatimest
2010: Kanye West, My Beautiful Dark Twisted Fantasy
2011: Sufjan Stevens, The Age of Adz
2012: Grimes, Visions
2013: Chvrches, The Bones of What You Believe
2014: St Vincent, St Vincent
2015: Tame Impala, Currents
2016: David Bowie, Blackstar
2017: Kendrick Lamar, DAMN.
2018: Janelle Monáe, Dirty Computer
2019: Aldous Harding, Designer
2020: Phoebe Bridgers, Punisher
2021: Khruangbin, Mordechai
2022: Mighty Brother, Azimuth
2023: Mac Demarco, One Wayne G

Automatically Caption Your Videos with Whisper and ffmpeg

Sat, 12 Aug 2023 14:00:00 +0000

Big video handlers like YouTube and even Slack use AI to automatically caption videos that you upload. This is a great win for accessibility, and the AI is extremely accurate. It sometimes stumbles on uncommon words and bad audio signals, but these can be easily fixed by a human. I think most would agree that the benefit and cost savings of generating 95% accurate captions greatly outweighs the cost of having none at all!

So at work we were talking about how we could build this feature ourselves using open source tools. I came up with this suggestion that uses OpenAI’s Whisper and the venerable ffmpeg.

There are two steps to this (1) generate captions from the video as an SRT file, and (2) bundle the captions with the video.

Whisper in my experience is incredibly good at audio transcription. And it can directly transcribe mp4 files to SRT format. You don’t have to extract the audio from the video or anything like that. Once you have an SRT file, ffmpeg can easily bundle an mp4 and SRT file. I’ve tried this on some personal videos, and the results have been near perfect.

Installation

Here is my project directory on GitHub. The README includes installation instructions, reproduced here.

https://github.com/whusterj/whisper-transcribe

The only two hard dependencies are the ffmpeg system package and the openai-whisper Python package.

This installation has been tested with Python 3.10.12 on Ubuntu 20.02. It should work on other Platforms as well, and OpenAI says Whisper should work with all Python versions 3.9 to 3.11.

First, you will need ffmpeg on your system, if you don’t have it already:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

For other platforms, see the [Whisper GitHub repo][1].

Now you can install the python requirements. Create a virtual environment, activate it, and pip install from requirements.txt:

python -m venv .venv/
source .venv/bin/activate
pip install -r requirements.txt

If you encounter errors during installation, you may need to install rust on your platform as well. If my requirements.txt fails to install for some reason, try just installing the openai-whisper package - this should install its Python sub-dependencies:

pip install openai-whisper

How to Do It

Step 1: Use whisper to generate an SRT transcription of the video:

whisper infile.mp4 \
    --model small.en \
    --language English \
    -f 'srt'

I tested this on a two minute video and it took less than 30 seconds to produce a complete transcription. This output is a subtitle file for your video - an SRT file. SRT is a plaintext file format, so you can edit it with your word processor of choice. Open the SRT, and you will see that each line of the file contains start and end timecodes and the caption text. This can be interpreted by video players to display the caption at the right time in sync with the video.

I used the small.en language model, which is about 400MB to download. It may be less accurate than the larger models, but in my experience it does a really good job. The option -f 'srt' specifies that you want an SRT file, but you have your choice of text output format.

You may want to quickly review the SRT file for any misinterpreted words or names

Step 2: Use ffmpeg to add the SRT as a subtitle track:

ffmpeg -i infile.mp4 \
       -i infile.srt \
       -c copy -c:s mov_text \
       outfile.mp4

This took less than 2 seconds to run. It doesn’t have to re-render the video, it just bundles the SRT alongside the video track inside of the mp4 container.

You will need a player like VLC that can show you the subtitle tracks. It is also possible to use ffmpeg to “burn in” the subtitles from your SRT file, but this requires re-rendering the whole video.

Automate It All!

The above would be the “by hand” procedure, but perhaps you can see how easily this process might be automated in bulk.

The idea would be that you can drop the mp4 files you want subtitled into an S3 bucket. Then you’d have a scheduler script detect them and queue up jobs for one or more workers. This isn’t strictly necessary, but would allow you to fan out the process to as many workers as you like to process more videos faster. The jobs would run the above commands and produce captioned video files automatically, which are then saved back to S3 somewhere.

Show Captions in HTML without Embedding

If you’re captioning videos for viewing in web browsers, there’s an even easier way.

After my initial write-up, I learned that it’s also possible to attach the track to the video using HTML5’s built-in <video> and <track> tags. This cuts out step two above, meaning you won’t have to use ffmpeg to embed the track in the video file.

In this case, you’ll want to generate a .vtt file instead of the .srt above. Fortunately, Whisper supports this format, too!

whisper infile.mp4 \
    --model small.en \
    --language English \
    -f 'vtt'

Once you’ve generated your captions, you can use them with the HTML5 video player like this:

<video src="myvideo.mp4">
  <source src="myvideo.mp4" type="video/mp4" />
  <track
    label="English"
    kind="captions"
    srclang="en"
    src="/path/to/captions.vtt"
    default
  />
</video>

William Huster

Infinite UI: What's Really Different about GenAI Software

Unpredictable is not the same as random

The inversion

Constraining without nerfing

Conclusion

The AI Onion: A Layered Approach to AI Integration

An Inspiration

The AI Onion

Layers 1-2: Foundation Model and Chat Interface

Layer 3: Prompting

Layer 4: Dataset Integration

Layer 5: Evaluations

Layer 6: Tool Use

Layer 7: Agent Flows

Layer 8: Fine-Tuning

The Point

Thoughts on AI 2027

Download the Internet with Kiwix

How it Works

A Script to Get the Latest Archives

Usage

Set Up a Cron Job

How I Ham (🍖📻 Ham Radio) - Part 1

Microelectronics Were the Gateway

Down the Rabbit Hole

Getting Licensed

Lock Down

Exploring Battle Outcome Probabilities in 'War Room', the Board Game by Larry Harris

Links

Part 1 - Modeling the Probabilities Using Simulation

Step 1 - Set Up the Simulation

Why Ten Dice per Roll?

Step 2 - Generate Rolls for Each Simulation

Step 3 - Analyze the Results

Get and Graph a Distribution for a Color

Aggregate and Plot All Colors

What We See

Step 4 - Adding in White and Black Results

Before and After Wilds Comparison

Conclusions - Wild Results Have a Decent Impact!

Part 2 - Modeling the Probabilities in “Pure Math”

Computing the Binomial Distribution

The Binomial Coefficient

Binomial Coefficient in Python

A Real Understanding of the Binomial Coefficient

Computing and Graphing the Probability Distributions

Part 3 - Including Wild Results in “Pure Math”

How to Add WHITE Results?

The Impact of Adding WHITE Results

Using a Multinomial Distribution to Model the Impact of White Results

Mathematical if Statements

Visualizing the Combinations

Translating to Python

Trying Out the “Pure Math” Solution

How Does the Math Compare to Simulation?

Red Tape and Cancer

How to Use a Bash Script to Count Your Heroku Releases

The Final Script

An Album for Each Year (2023)

Automatically Caption Your Videos with Whisper and ffmpeg

Installation

How to Do It

Automate It All!

Show Captions in HTML without Embedding

Mathematical `if` Statements