There are lovely libraries near my house. Three, all in different directions. I have spent hours in each, especially before my kids were in school. This is a story about the one called the Douglass-Truth Library, the surprising way it got that name, and a computer simulation inspired by that surprise. Maybe it will surprise you, too.
The Douglass-Truth Library. Its name stands out from its peers. They have names like “Beacon Hill Library” and “Central Library”. Most Seattle libraries are named to match their neighborhoods.
Not the Douglass-Truth Library. We just celebrated the 50th anniversary of its renaming. It was renamed in 1975 to honor the abolitionists Frederick Douglass and Sojourner Truth. And how were these particular leaders selected for this honor? A community vote!
Back in the early 1970s, community leaders in a neighborhood that had become predominantly African American thought their library deserved a name that reflected the community. They organized a ballot listing ten distinguished Black Americans and invited the neighborhood to vote.
The ballot clearly says “vote for one”.
How did the library get named after two?
A tie vote! Out of a ten-person race.
That is so unlikely. Is it so unlikely? How unlikely?
When I slow down to think about it, I think it must depend on how many votes there were. One hundred voters tying does not seem likely, but it seems more likely than ten thousand voters ending in a tie. How many ballots were cast? I could not find any documentation of this. But when I was at the renaming anniversary party last December, someone remembered: around 2,000 people voted.
So now can we figure out how unlikely?
Almost. I used to love simplifying probability equations and making approximations for these sorts of formulas, and if I were writing this blog 25 years ago, it would be all about Stirling’s approximation and how with pencil and paper we could puzzle this out. Stirling showed that tie probability scales like 1/√n, and that means 2,000 voters doesn’t give you 1-in-2,000 odds; it’s more like 1/45. Ties are rarer with more voters, but not vanishingly rare.
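For the curious, the pencil-and-paper version in the simplest setting of two equally popular candidates and n voters looks roughly like this (an illustration of where the 1/√n scaling comes from, not the exact calculation for a ten-candidate ballot):

P(\text{tie}) = \binom{n}{n/2}\, 2^{-n} \approx \sqrt{\frac{2}{\pi n}} \qquad \text{via Stirling's approximation } n! \approx \sqrt{2\pi n}\,(n/e)^n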
Lately, though, I’ve become more of a computer-simulation type. Pencil and paper is fine, but random number sampling is even more fun.
From here, I’ll give just a sketch, so you can have fun the way you like to have fun.
Assumptions: maybe the simplest way to start is if everyone is equally liked and voters are simply choosing randomly (no strategic behaviors). But would it be more likely to have a first-place tie if two front runners were each getting nearly 50% of the vote?
Simulation: With simplifying assumptions, I can simulate one person’s vote with np.random.choice(candidate_list, p=vote_probabilities), but it is faster to simulate everyone’s vote simultaneously. Then I need to tally the votes and see if there is a tie for first place.
return tallies.iloc[0] == tallies.iloc[1]  # True if tie for first
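Here is a minimal sketch of how that might come together (my own illustrative version; the function name, trial count, and probabilities are invented for the example, and the tallying line above is the heart of it):

import numpy as np
import pandas as pd

def tie_probability(n_voters=2000, n_candidates=10,
                    vote_probabilities=None, n_trials=10_000, seed=0):
    """Estimate the chance of a first-place tie by repeated simulation."""
    rng = np.random.default_rng(seed)
    n_ties = 0
    for _ in range(n_trials):
        # simulate everyone's vote at once; p=None means all candidates equally liked
        votes = rng.choice(n_candidates, size=n_voters, p=vote_probabilities)
        tallies = pd.Series(votes).value_counts()  # sorted from most to fewest votes
        if tallies.iloc[0] == tallies.iloc[1]:  # True if tie for first
            n_ties += 1
    return n_ties / n_trials

print(tie_probability())  # all ten equally liked, 2,000 voters
# pass vote_probabilities=[...] to model two front runners or a clear favorite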
Results: If all ten candidates are equally liked, there’s roughly a 1 in 20 chance of a first-place tie (with 2,000 voters). The chance drops as turnout grows, but stays above 3% even with 5,000 voters.
If there are two front runners, it is more like one in 100, and if one candidate has a comfortable majority, then a tie is extremely rare.
Conclusion: The math is fun. The computer simulation is fun. But the real breakthrough happened 50 years ago when the votes came in tied and some genius invented a new option: name the library after both heroes.
A hands-on guide to catching simulation bugs with automated tests
TL;DR
What you’ll learn: Write rigorous tests for randomized algorithms without arbitrary thresholds.
What you’ll get: A failing test that catches a subtle directional bias bug with Bayes factor = 10⁷⁹ (decisive evidence).
The approach: Run simulations many times, count outcomes, and validate the proportions using Bayesian hypothesis testing. The heart of this Fuzzy Checking Pattern is the fuzzy_assert_proportion() method shown below.
The Problem: How Do You Test Randomized Algorithms?
Imagine you’re using computer simulation in your research, like an agent-based model or a Monte Carlo simulation. You run your code and it produces output. Then you run it again and get different output. That’s expected! It’s random. But here’s the challenge:
How do you know your “random” behavior is actually correct?
Traditional unit tests fail:
# This doesn't work - random walks vary!
assert result == 42
# Neither does this - what threshold?
assert 40 <= result <= 44
Instead of asking “is this close enough?” (with arbitrary thresholds), ask: “What’s the evidence ratio for bug vs. no-bug?”
We’ll do this with the Bayes factor:
Bayes factor > 100 = “decisive” evidence of a bug → Test FAILS
Bayes factor < 0.1 = “substantial” evidence of correctness → Test PASSES
Bayes factor between 0.1 and 100 = “inconclusive” → WARNING (need more data)
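To make the idea concrete, here is a toy Bayes-factor calculation (an illustration of the concept only, not the internal model that vivarium_testing_utils uses; the counts are made up):

import numpy as np
from scipy.stats import binom

# Suppose only 30 of 1000 walks exited at the left edge, where we expected 25%.
k, n, expected_p = 30, 1000, 0.25

# Likelihood of the data if there is no bug (left exits really happen 25% of the time)
likelihood_no_bug = binom.pmf(k, n, expected_p)

# Likelihood if there is a bug (the true proportion could be anything,
# so average the likelihood over a uniform grid of possible proportions)
p_grid = np.linspace(0.001, 0.999, 999)
likelihood_bug = binom.pmf(k, n, p_grid).mean()

bayes_factor = likelihood_bug / likelihood_no_bug
print(f"Bayes factor (bug vs. no bug): {bayes_factor:.2e}")  # astronomically large

A factor that large is far past the “decisive” threshold above, which is exactly the kind of signal we want a test to raise when a bug is real.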
The Bug We’ll Catch
We’ll test a simple 2D random walk simulation. The walker starts at the center of a grid and takes random steps (left, right, up, down) until it reaches an edge.
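The bug lives in the list of allowed moves passed to the walk. Reconstructing from the description in this post, the correct and buggy move lists look roughly like this (for illustration; your variable names may differ):

# Correct version: one step in each of the four directions
moves = [[-1, 0], [1, 0], [0, -1], [0, 1]]

# Buggy version (reconstructed from the description below)
buggy_moves = [[-1, 0], [1, 0], [0, -1], [0, -1]]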
Can you spot it? [0, -1] appears twice (up), and [0, 1] (down) is missing!
Impact: The walker can move left (25%), right (25%), up (50%), but never down (0%). This creates a bias that’s hard to spot with traditional asserts but shows up dramatically in statistical tests.
I have separated my simulation code (random_walk.py) from my automated testing code (test_random_walk.py). I recommend this for you, too. The simulation does your science; the tests check whether your science has bugs.
1. Write Your Simulation (in random_walk.py)
The fill_grid(grid, moves) function simulates a random walk and returns where it ended:
import random

def fill_grid(grid, moves):
    """Fill grid with random walk starting from center."""
    center = grid.size // 2      # grid.size is the side length of the square grid
    size_1 = grid.size - 1
    x, y = center, center
    num = 0
    # keep stepping until the walker reaches any edge of the grid
    while (x != 0) and (y != 0) and (x != size_1) and (y != size_1):
        grid[x, y] += 1
        num += 1
        m = random.choice(moves)
        x += m[0]
        y += m[1]
    return num, x, y  # Steps taken and final position
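To run this on its own, you need a grid object with a .size attribute (the side length) and grid[x, y] indexing. Here is a minimal stand-in plus a usage example (the Grid class below is my own placeholder, not the one from the original code):

import numpy as np

class Grid:
    """Minimal stand-in grid: .size is the side length, cells are ints at [x, y]."""
    def __init__(self, size):
        self.size = size
        self._cells = np.zeros((size, size), dtype=int)

    def __getitem__(self, key):
        return self._cells[key]

    def __setitem__(self, key, value):
        self._cells[key] = value

grid = Grid(21)
steps, exit_x, exit_y = fill_grid(grid, [[-1, 0], [1, 0], [0, -1], [0, 1]])
print(f"{steps} steps, exited at ({exit_x}, {exit_y})")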
2. Write Your Tests (in test_random_walk.py)
The fuzzy_assert_proportion() method performs Bayesian hypothesis testing to validate that observed proportions match expectations:
from vivarium_testing_utils import FuzzyChecker

# Example: Validate that ~25% of walks exit at left edge
FuzzyChecker().fuzzy_assert_proportion(
    observed_numerator=254,       # 254 walks exited left
    observed_denominator=1000,    # Out of 1000 total walks
    target_proportion=0.25,       # We expect 25%
)
The buggy random walk exits left only 3% of the time (expected 25%):
pytest test_random_walk.py::test_buggy_version_catches_exit_bias -v
AssertionError: buggy_left_exit_proportion value 0.03 is significantly less than expected, bayes factor = 1.37e+79
That’s astronomically decisive evidence of a bug. The buggy version can move up twice but never down, so 94.6% of walks exit at the top edge, dramatically reducing other exits.
The test file includes:
One test for the correct version (validates exit edge proportions)
One test for the buggy version (demonstrates catching the bug)
Examples of using fuzzy_assert_proportion() with exit locations
Run the tests with:
pytest test_random_walk.py
Adapting This for Your Own Work
Ready to use fuzzy checking in your own spatial simulations? Here’s how:
Step 1: Install the Package
pip install vivarium_testing_utils pytest
Step 2: Identify Statistical Properties
What should your simulation do in aggregate?
Agent-based models: “30% of agents should be in state A”
Monte Carlo: “Average outcome should be between X and Y”
Spatial sims: “Density should be symmetric when flipped horizontally or vertically”
Step 3: Write Fuzzy Tests
import pytest
from vivarium_testing_utils import FuzzyChecker

@pytest.fixture(scope="session")
def fuzzy_checker():
    checker = FuzzyChecker()
    yield checker
    checker.save_diagnostic_output("./diagnostics")  # this pattern saves a csv for further inspection

def test_my_property(fuzzy_checker):
    # Run your simulation many times
    successes = 0
    total = 0
    for seed in range(1000):
        result = my_simulation(seed)
        if condition_met(result):
            successes += 1
        total += 1
    # Validate with Bayesian inference
    fuzzy_checker.fuzzy_assert_proportion(
        observed_numerator=successes,
        observed_denominator=total,
        target_proportion=0.30,  # Your expected proportion
        name="my_property_validation",
    )
Step 4: Tune Sample Sizes
Small samples (n < 100): Tests might be inconclusive
Medium samples (n = 100-1000): Good for most properties
Large samples (n > 1000): High power to detect subtle bugs
If you get “inconclusive” warnings, increase your number of simulation runs.
A Challenge: Can You Find the Bug With Even Simpler Observations?
The tests above observe exit locations. But can you detect the bug using only the grid visit counts?
This is perhaps Greg Wilson’s original challenge: find a statistical property of the grid itself that differs between correct and buggy versions.
Some ideas to explore:
Does the distribution of visits differ between quadrants?
Are edge cells visited at different rates?
Does the center-to-edge gradient change?
What about the variance in visit counts?
Can you detect the bias without even tracking final positions?
Try implementing a test that catches the bug using only the grid object after the walk (fill_grid fills it in place).
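Here is one possible starting point, as a sketch under my own assumptions: it reuses the stand-in Grid class and the reconstructed moves lists from above, and it treats individual cell visits as if they were independent (they are not quite, since visits within one walk are correlated), so read its Bayes factor as a rough guide.

import random

def test_vertical_visit_symmetry(fuzzy_checker):
    # With an unbiased walker, visits above and below the center row should
    # balance out, so about half of the off-center visits land on the "up" side.
    size = 21
    center = size // 2
    up_visits = 0
    off_center_visits = 0
    for seed in range(1000):
        random.seed(seed)
        grid = Grid(size)
        fill_grid(grid, moves)  # swap in buggy_moves to watch this test fail
        for x in range(size):
            for y in range(size):
                if y < center:  # the "up" side, since [0, -1] decreases y
                    up_visits += int(grid[x, y])
                if y != center:
                    off_center_visits += int(grid[x, y])
    fuzzy_checker.fuzzy_assert_proportion(
        observed_numerator=up_visits,
        observed_denominator=off_center_visits,
        target_proportion=0.5,
        name="vertical_visit_symmetry",
    )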
Additional Challenges: Deepen Your Understanding
Ready to experiment? Try these exercises to build intuition about fuzzy checking:
1. Sample Size Exploration
Reduce num_runs from 1000 to 100 in the directional balance test.
What happens to the Bayes factors?
Do tests become inconclusive?
How many runs do you need for decisive evidence?
2. Create a Subtle Bug
Modify the moves list to this alternative buggy version: [[-1, 0], [1, 0], [1, 0], [0, -1], [0, 1]] (right appears twice here instead of up; since the list still contains all four directions, the walk has some chance to exit from each side).
Does fuzzy checking catch this subtler bias?
How does the Bayes factor compare to the up/down bug?
What does this reveal about detection power?
3. Validate New Properties
Write a new test that validates:
The center cell is visited most often
The walk forms a roughly circular distribution
The total path length scales as (grid size)²
Hint: For the center cell test, compare grid[center, center] to the average of edge cells using fuzzy_assert_proportion().
Testing stochastic simulations doesn’t have to rely on arbitrary thresholds or manual inspection. Bayesian fuzzy checking provides a rigorous, principled approach:
✅ Quantifies evidence with Bayes factors
✅ Expresses uncertainty naturally
✅ Catches subtle bugs that traditional tests may miss
The vivarium_testing_utils package makes this approach accessible with a simple, clean API. Whether you’re testing random walks, agent-based models, or Monte Carlo simulations, fuzzy checking helps you validate statistical properties with confidence.
What About More Complex Simulations?
This tutorial used a simple random walk where tracking direction counts was straightforward. But what about more complex spatial processes?
Greg Wilson’s blog post includes another example: invasion percolation, where a “filled” region grows by randomly selecting neighboring cells to fill next. The grid patterns are much more complex than a random walk.
How would you test that? What statistical properties would you validate? How would you instrument the code to observe the right quantities?
This tutorial was created to demonstrate practical statistical validation for spatial simulations. The fuzzy checking methodology was developed at IHME for validation and verification of complex health simulations.
I’m still settling back into blogging as a custom, so perhaps that is why it has taken me six months to think of announcing our new Python package here! Without further ado, let me introduce you to pseudopeople.
It is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution methods or other data science algorithms at scale.
To see it for yourself, here is a three-line quickstart, suitable for use in Google Colab or a Jupyter notebook:
!pip install pseudopeople
import pseudopeople as psp
psp.generate_decennial_census()
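That last call returns a pandas DataFrame of simulated census responses (so far as I can attest from the versions I have used), so a natural next step is to poke at it with the usual tools:

df = psp.generate_decennial_census()
df.head()  # simulated people, with realistic noise in names, ages, and addresses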
It has been the thing that kept me too busy to blog for the last year. But it did generate some aesthetically pleasing figures for test purposes, as well as some population health results of interest. More details to come.