Brandon Rohrer

Artisanal Language Model v6: Sparse second-order Markov model

Sat, 20 Jun 2026 06:36:00 EDT

After some trial and error with a few sparse representation concepts, I settled on one that takes advantage of one of Python's strengths: dictionaries. The sparse second-order model keeps two running dictionaries: one full of bigram counts and one for trigram counts.

Representation

A bigram is a token pair. For the token sequence AKFLS, the bigrams are AK, KF, FL, LS. And a trigram is a triple. The same sequence would contain trigrams AKF, KFL, FLS. (More generally, n-grams have n tokens.)

Dictionaries tracking these have bi/tri-grams as keys and counts as values. Any n-grams not yet observed will not be in the dictionary at all, which is what allows it to be sparse.

bigram_dict = {
    "AK": 1,
    "KF": 1,
    "FL": 1,
    "LS": 1,
}
trigram_dict = {
    "AKF": 1,
    "KFL": 1,
    "FLS": 1,
}

To save some more space in memory, instead of using the token text itself, I took the list of token ids, which are integers, and converted it to a tuple. There are ways to shrink this further, but they start to get more convoluted and I'll wait until I bump up against the limits of this scheme before I wrestle with them.
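As a rough sketch of the counting step, assuming token_ids is a plain list of integer token ids. The function name count_ngrams and the sample ids are made up for illustration and are not taken from the project code.

```python
def count_ngrams(token_ids, n):
    """Count every n-token window in token_ids, keyed by a tuple of ids."""
    counts = {}
    for i in range(len(token_ids) - n + 1):
        key = tuple(token_ids[i : i + n])
        # Unseen n-grams never get a key, which keeps the dictionary sparse
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical ids standing in for the tokens A, K, F, L, S
token_ids = [0, 10, 5, 11, 18]
bigram_dict = count_ngrams(token_ids, 2)
trigram_dict = count_ngrams(token_ids, 3)
```

Tuples are hashable, so they can serve directly as dictionary keys, and the same function covers both the bigram and trigram case.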

Implementation

In order to calculate the likelihood of a next token with this formula

P(ABC | AB) = count(ABC) / count(AB)

the count(ABC) and count(AB) can be looked up directly in their respective dictionaries. The pseudo-Python has two dictionary lookups to estimate the likelihood of F following AK.

likelihood = trigram_dict["AKF"] / bigram_dict["AK"]

The full code adds in default values of 0 for any missing dictionary keys and some small values for numerical stability.
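Here is one way the defaults and the stabilizing value might fit together. This is a sketch under assumptions: the function name, the epsilon value, and the choice to add epsilon to the denominator are all mine, and the full code may handle these differently.

```python
def next_token_likelihood(trigram_dict, bigram_dict, prefix, token, epsilon=1e-12):
    """Estimate P(token | prefix) from counts; prefix is a tuple of two token ids."""
    # Missing keys default to a count of 0
    trigram_count = trigram_dict.get(prefix + (token,), 0)
    bigram_count = bigram_dict.get(prefix, 0)
    # epsilon (an assumed value) keeps the division defined for unseen prefixes
    return trigram_count / (bigram_count + epsilon)


bigram_dict = {(0, 10): 4}
trigram_dict = {(0, 10, 5): 3, (0, 10, 7): 1}
seen = next_token_likelihood(trigram_dict, bigram_dict, (0, 10), 5)
unseen = next_token_likelihood(trigram_dict, bigram_dict, (2, 2), 5)
```

With these counts, seen comes out very close to 3/4 and unseen comes out as zero rather than raising a division error.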

Training

A fun thing about using dictionaries is that they are fast in Python. Training happens almost as fast as the data can be read from disk. A dictionary read is a hash lookup which screams, even with large dictionaries. It's O(1) for you computer science nerds.

I also noticed on closer inspection that my earlier implementation re-calculated the transition probabilities after loading each book. This was unnecessary, and constituted the bulk of the training computation. Skipping this step streamlined the training considerably.

Training for the sparse model ended up quite fast and is so space efficient that it allowed me to go back to using the full 20,000-element tokenizer.

Performance

Unfortunately, moving to the larger tokenizer meant that tokens were longer and three-token sequences were even less likely to re-occur. This means that the proofreader was surprised by a lot of things. The recall was perfect. Every single error got called out as an error. But the precision dropped to an abysmal 4 percent. That means for every mistake it identified correctly, it found 24 others that were not actually mistakes.

version	model	capitalization	punctuation	spelling
01	random	(9) 16	(7) 12	(8) 16
02	fomm_00	(7) 98	(7) 98	(7) 98
03	somm_00	(6) 100	(6) 94	(6) 98
04	fomm_01	(8) 100	(8) 98	(8) 98
05	somm_01	(8) 100	(7) 90	(7) 96
06	somm_02	(4) 100	(4) 100	(4) 100

moar data

The solution to this is to give the model a richer training data set. It has so little experience at this point, that everything is novel. The model needs to cram a lot more text into its representation to be able to better separate correct prose from incorrect.

The temptation here is to indiscriminately scrape a truckload of data and assume that the model will learn the right patterns from it. That is antithetical to the Artisanal approach. This is a point where I get to walk the walk and figure out a way to thoughtfully expand our data set.

I relied heavily on this helpful collection of Project Gutenberg tools which helped me download a lot of raw text versions and compiled a csv of their metadata. This gave me a foothold for reading in the information and selecting the most relevant texts. Using a Python script I whittled the list of 80,000 books down to a subset that

are text, not audio
in English
have "Science fiction" as a subject

That left roughly 3,000 titles. I also removed the titles I had set aside as evaluation texts--Alice in Wonderland, Frankenstein, and Call of Cthulhu. Because my download of the archive is only partial so far, that left me with 416 titles in the new training corpus. (It's obnoxious to put this much data in a repo, so you'll have to take my word for it.)
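The filtering step can be sketched like this. Everything specific here is an assumption: the column names (id, title, language, type, subjects), the sample rows, and the function name select_books are invented for illustration, and the real metadata csv may be laid out differently.

```python
import csv
import io

# Made-up rows; the real metadata file and its column names may differ.
sample_csv = """id,title,language,type,subjects
PG1,Some Space Novel,en,Text,Science fiction; Adventure
PG2,Un Roman,fr,Text,Science fiction
PG3,Spoken Story,en,Sound,Science fiction
PG4,Garden Notes,en,Text,Gardening
PG5,Frankenstein,en,Text,Science fiction; Horror
"""

held_out_titles = {"Frankenstein"}  # evaluation texts stay out of training


def select_books(rows, held_out):
    """Return ids of English science fiction texts not reserved for evals."""
    selected = []
    for row in rows:
        if row["type"] != "Text":
            continue  # skip audio
        if row["language"] != "en":
            continue
        if "science fiction" not in row["subjects"].lower():
            continue
        if row["title"] in held_out:
            continue
        selected.append(row["id"])
    return selected


rows = list(csv.DictReader(io.StringIO(sample_csv)))
training_ids = select_books(rows, held_out_titles)
```

Only the first sample row survives all four checks. Keeping the held-out titles in one explicit set makes it harder to leak evaluation text into training by accident.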

Also, in the spirit of investing in data quality, I preprocessed the texts to remove the Project Gutenberg headers and footers, ensuring that the models would be focused on learning the underlying data and not fixating on artifacts.
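A minimal sketch of that preprocessing, assuming each file marks the body with lines beginning "*** START OF" and "*** END OF", as many Project Gutenberg plain-text files do. Files with other marker styles would need extra handling, and the function name is hypothetical.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the START and END marker lines, if present."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end]).strip()


raw = "\n".join([
    "The Project Gutenberg eBook of An Example",
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
    "",
    "It was a dark and stormy night.",
    "",
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
    "Updated editions will replace the previous one.",
])
body = strip_gutenberg_boilerplate(raw)
```

If neither marker is found, the text is returned unchanged apart from trimmed whitespace, so a file in an unexpected format is never silently emptied.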

After retraining the tokenizer, the first-order model (version 07), and the second-order model (version 08) on the larger training corpus, the evals show some changes.

version	capitalization	punctuation	spelling
05	(8) 100	(7) 90	(7) 96
06	(4) 100	(4) 100	(4) 100
07	(10) 96	(9) 94	(10) 100
08	(6) 100	(6) 98	(6) 100

Performance values are shown as: (precision %) recall %

Comparing version 08 to version 06, the second-order model shows a tiny bump in precision. Progress, yes, but not very much.

Comparing version 07 to version 05, the first-order model shows a small bump in precision, but it's still pretty low. About one in ten flagged errors is real. The recall is still ridiculously high. Punctuation again has the lowest recall numbers, illustrating how difficult it is and how much context it takes to punctuate well.

The high recall and low precision numbers indicate that the models are too specific. Much of what they see is brand-new to them because they have never seen those particular sequences before. The larger the tokenizer alphabet, the longer the character strings represented by tokens, and the rarer each will seem to be. A good way to illustrate this is by training a set of second-order models using tokenizers with smaller dictionaries.

Dictionary size

Shrinking the dictionary size used with the second-order model does indeed bring up precision at the cost of some recall. This sequence shows dictionary sizes of 10k, 5k, 2k, 1k.

version	capitalization	punctuation	spelling
09	(6) 100	(6) 98	(6) 100
10	(7) 100	(7) 94	(7) 100
12	(11) 98	(9) 88	(10) 100
13	(17) 92	(12) 74	(16) 100

These show how precision and recall can be traded off against each other to a certain extent by adjusting the alphabet size. But they are still too low to be useful. (I fear the answer is LOTS more data.)

Sparsify first-order model

The second order sparse model worked so well that I went back and sparsified the first-order Markov model too. Comparing the two shows that the performance is identical.

version	capitalization	punctuation	spelling
07	(10) 96	(9) 94	(10) 100
11	(10) 96	(9) 94	(10) 100

Actually when I first ran these, the performance was NOT identical. It was a head scratcher for me until I found a bug where I was setting the error detection threshold in two different places and the two did not agree. It's another illustration of how testing and double checking things that should obviously work is a great way to surface subtle bugs in your code. And there is always another bug.

Reporting

At this point the reporting table is growing large enough to merit splitting into a performance table and a model summary table. Also, the variants are proliferating, so it's helpful to keep track of the characteristics of each model so that patterns in what works, and doesn't work, can start to emerge.

Here's what a sample of the model table and eval table look like

version	model	description	tokenizer	alphabet	books	error cutoff
09	somm_04	sparse 2nd-order Markov	06	10k	416	1e-08
10	somm_05	sparse 2nd-order Markov	07	5k	416	1e-08
11	fomm_03	sparse 1st-order Markov	05	20k	416	0.0005
12	somm_06	sparse 2nd-order Markov	08	2k	416	1e-08
13	somm_07	sparse 2nd-order Markov	09	1k	416	1e-08

version	capitalization	punctuation	spelling
09	(6) 100	(6) 98	(6) 100
10	(7) 100	(7) 94	(7) 100
11	(10) 96	(9) 94	(10) 100
12	(11) 98	(9) 88	(10) 100
13	(17) 92	(12) 74	(16) 100

Registry

In fact the work is also reaching the point where pulling out information about each model and how it was trained can be confusing and error-prone. It is for me far beyond the point where I can just hold it all in my head. Tracking model versions, variants, experiments, and their characteristics is the job of a Registry. There are vendors ready to sell you ML Registry products that have a lot of convenient features, but for a project of this size all we need is a file where these things are written down.

I created a few version tracking files. There is now

Having separate files for each, distributed across the repo nearby the artifacts they are tracking, greatly simplified and robustified the process of retrieving and collating information to make the reporting charts. It was also a helpful data modeling exercise, thinking through which tidbits belong where. For instance, whether error_threshold is a characteristic of a Markov model or a proofreader.

These "registries" lack all of the convenience tooling for automatic generation and population. I have to manually create each entry. But the record they provide has already proven quite useful.

New eval: Grammar

I will continue to eat my vegetables and add a new eval at each stage. This time it is grammar. Things like verb tense, pronoun-antecedent agreement, and preposition choice. As I went about injecting errors into sample text, I found it hard to distinguish between purely grammatical errors and errors of word choice, another category I planned to create an eval for. As a result the grammar eval also includes many examples of erroneous word choice.

Nothing jumped out to me about the models' performance on the grammar eval. The models are all hypersensitive and fire off false positives far too often to be anything approaching useful. But after this round of machinations, there is a stable of 13 models and four evals, summarized thus:

version	capitalization	grammar	punctuation	spelling
01	(10) 18	(12) 16	(12) 24	(6) 12
02	(7) 98	(8) 100	(7) 98	(7) 98
03	(6) 100	(7) 96	(6) 94	(6) 98
04	(8) 100	(10) 100	(8) 98	(8) 98
05	(8) 100	(8) 94	(7) 90	(7) 96
06	(4) 100	(5) 100	(4) 100	(4) 100
07	(10) 96	(12) 98	(9) 94	(10) 100
08	(6) 100	(6) 100	(6) 98	(6) 100
09	(6) 100	(7) 100	(6) 98	(6) 100
10	(7) 100	(8) 100	(7) 94	(7) 100
11	(10) 96	(12) 98	(9) 94	(10) 100
12	(11) 98	(11) 92	(9) 88	(10) 100
13	(17) 92	(15) 78	(12) 74	(16) 100

Performance values are shown as: (precision %) recall %

version	model	description	tokenizer	alphabet	books	error cutoff
01	random_00	random baseline	None	0k	0	0.05
02	fomm_00	1st-order Markov	00	20k	10	0.0005
03	somm_00	2nd-order Markov	04	1k	10	0.0005
04	fomm_01	1st-order Markov	00	20k	20	0.0005
05	somm_01	2nd-order Markov	04	1k	20	0.0005
06	somm_02	sparse 2nd-order Markov	00	20k	20	0.0005
07	fomm_02	1st-order Markov	05	20k	416	0.0005
08	somm_03	sparse 2nd-order Markov	05	20k	416	1e-08
09	somm_04	sparse 2nd-order Markov	06	10k	416	1e-08
10	somm_05	sparse 2nd-order Markov	07	5k	416	1e-08
11	fomm_03	sparse 1st-order Markov	05	20k	416	0.0005
12	somm_06	sparse 2nd-order Markov	08	2k	416	1e-08
13	somm_07	sparse 2nd-order Markov	09	1k	416	1e-08

Next steps

The round of improvements pushed the Artisanal Language Model for proofreading ahead a little, but there is still much to do. The next obvious step is to get a lot more data, but in a way that is consistent with the goals of ALMs. And after that, explore the next round of models.

Artisanal Language Model v3: Second-order Markov model

Sat, 06 Jun 2026 04:36:00 EDT

The most straightforward way to extend the first-order Markov model is to add an order and make it a second-order model. This means that instead of matching token pairs it matches token triples. It's like using the two most recent words to predict the third. It's surprisingly effective. How would you complete these phrases?

sugar and ...
big bad ...
PB and ...
Guns and ...

You probably said "big bad wolf" (possibly "big bad bear" or "big bad dude") but it probably wasn't "big bad artichoke" and I'm certain it wasn't "big bad consider". The pattern set by two words is stronger than the pattern set by one. Tokens aren't quite the same thing. They are sometimes just parts of words and can include symbols and punctuation. But the idea still holds.

Similar to first-order models, the occurrences of all three-token sequences are counted up. To get from counts to probabilities, the sequence count is divided by total two-token prefix count. The probability that the next token after A and B will be C is given by

P(ABC | AB) = count(ABC) / count(AB)

In pseudo-Python the count array is initialized as a three-dimensional array of zeros

sequence_counts = zeros(dictionary_size, dictionary_size, dictionary_size)

and the counts are collected as the code walks the length of token ids that make up the full tokenized training corpus.

for i_pair in range(len(token_ids) - 2):
    sequence_counts[
        token_ids[i_pair],
        token_ids[i_pair + 1],
        token_ids[i_pair + 2]
    ] += 1

To get the probabilities, the counts are summed across all final tokens (axis 2) to get the total count of occurrences of the two-token prefix of the sequence. Then the count of each sequence is divided by the count of its prefix.

sequence_probabilities = sequence_counts / sum(sequence_counts, axis=2)
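For a runnable version of the pseudo-Python above, here is a dependency-free sketch using nested lists and a tiny dictionary. The function name and the sample token ids are invented for illustration. An array library such as NumPy would be the natural choice at real sizes; there the sum over axis 2 would need to keep its summed axis (keepdims=True) so that the division lines up each sequence with its own prefix.

```python
def second_order_probabilities(token_ids, dictionary_size):
    """Dense second-order transition probabilities as nested lists."""
    n = dictionary_size
    # counts[a][b][c] is the number of times the sequence a, b, c occurs
    counts = [[[0] * n for _ in range(n)] for _ in range(n)]
    for i in range(len(token_ids) - 2):
        a, b, c = token_ids[i : i + 3]
        counts[a][b][c] += 1

    probabilities = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for a in range(n):
        for b in range(n):
            prefix_count = sum(counts[a][b])  # sum over the final token
            if prefix_count == 0:
                continue  # unseen prefix: leave its probabilities at zero
            for c in range(n):
                probabilities[a][b][c] = counts[a][b][c] / prefix_count
    return probabilities


token_ids = [0, 1, 2, 0, 1, 0, 1, 2]
probabilities = second_order_probabilities(token_ids, dictionary_size=3)
```

In this toy corpus the prefix 0, 1 occurs three times and is followed by 2 twice, so probabilities[0][1][2] is 2/3. Skipping unseen prefixes sidesteps a divide-by-zero that the one-line pseudo-Python glosses over.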

Memory management

Working with a three-token sequence means that the count and probability arrays become three-dimensional. That means that the total number of parameters jumps from the square of the tokenizer dictionary size to its cube. What was a manageable 20,000 squared (400 million) is now 20,000 cubed (8 trillion), and my laptop doesn't have the 32 TB of RAM to handle that many parameters.

With some trial and error I found that a dictionary size of 1,000, resulting in a 1 billion element array, is something I can handle with a handful of GB of memory. That means that the representation will be quite different than it would be for a 20,000-element dictionary tokenizer. But we can still run it on the evals and see how it stacks up.
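The arithmetic behind those numbers can be checked in a few lines. The 4 bytes per element is my assumption (32-bit values); it is the size that makes 8 trillion elements come out to 32 TB.

```python
BYTES_PER_ELEMENT = 4  # assumption: 32-bit values, which matches the 32 TB figure


def dense_size(dictionary_size, order):
    """Element count and bytes for a dense Markov count array of a given order."""
    elements = dictionary_size ** (order + 1)
    return elements, elements * BYTES_PER_ELEMENT


for dictionary_size, order in [(20_000, 1), (20_000, 2), (1_000, 2)]:
    elements, n_bytes = dense_size(dictionary_size, order)
    print(
        f"n={dictionary_size:>6,} order={order}: "
        f"{elements:.1e} elements, {n_bytes / 1e9:,.1f} GB"
    )
```

Each added order multiplies the element count by the dictionary size, which is why dropping from 20,000 to 1,000 tokens buys back a factor of 8,000 for the second-order array.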

Performance

model	name	capitalization	spelling
random	proof_01	(5) 8	(8) 16
FOMM_00	proof_02	(7) 98	(7) 98
SOMM_00	proof_03	(6) 100	(6) 98

Even with 1/20th the dictionary, the second-order model performs on par with the first-order model. The 98% and 100% recall, coupled with the 6% precision, indicates that both models are under-trained. Most sequences they encounter are entirely novel to them, and so get tagged as errors. They lack the experience (training data) to competently distinguish correct from incorrect writing.

Doubling the training data

Currently, the models are all training on ten novels. It's telling to create new versions of the models that have been trained on an additional ten novels and compare the results.

I added the next ten most popular novels from the Project Gutenberg Science Fiction and Fantasy category:

The King in Yellow by Robert W. Chambers
Thuvia, Maid of Mars by Edgar Rice Burroughs
A Honeymoon in Space by George Chetwynd Griffith
The Misplaced Battleship by Harry Harrison
Gulliver's Travels by Jonathan Swift
A Connecticut Yankee in King Arthur's Court by Mark Twain
A Journey to the Center of the Earth by Jules Verne
A Midsummer Night's Dream by William Shakespeare
The Sex Life of Gods by M. E. Knerr
The Crack of Doom by Robert Cromie

and I added a volume to the evaluation data collection

The Call of Cthulhu by H. P. Lovecraft

Some of these I have never heard of before, unlike the first batch, all of which I was familiar with.

The resulting first-order model with double training data I numbered version 4 and assigned the second-order model version number 5.

The results on the doubled training data were underwhelming. Recall mostly held constant, and precision ticked up by a single percentage point. More training data is useful with these models, but it appears that it is going to take a metric ton of the stuff to make a dent in performance using these models.

model	name	capitalization	spelling
random	proof_01	(8) 12	(10) 16
FOMM_00	proof_02	(7) 98	(7) 98
SOMM_00	proof_03	(6) 100	(6) 98
FOMM_01	proof_04	(8) 100	(8) 98
SOMM_01	proof_05	(8) 100	(7) 96

Adding in an eval for punctuation

Writing evals is somewhat tedious so I'm spreading out the work, adding them in gradually. Here's what the updated results look like with punctuation added in.

model	name	capitalization	punctuation	spelling
random	proof_01	(12) 26	(14) 20	(13) 24
FOMM_00	proof_02	(7) 98	(7) 98	(7) 98
SOMM_00	proof_03	(6) 100	(6) 94	(6) 98
FOMM_01	proof_04	(8) 100	(8) 98	(8) 98
SOMM_01	proof_05	(8) 100	(7) 90	(7) 96

The precision numbers for punctuation are consistent with the others, showing a huge number of false positives consistent with an under-trained language model. But the recall numbers are lower for the second-order Markov models, which also have a smaller tokenizer dictionary and shorter tokens. This suggests that some inaccuracies in punctuation only become clear when more context is taken into account. A missing comma, or an extra one, requires understanding the flow of a sentence. I'll get to test this in future versions that incorporate more context, more tokens.

Adding a heartbeat

One thing about working with models and data sets that strain the capacity of your hardware is that they can be sloooow. A psychological trick for not driving yourself insane is to add a heartbeat to the code. Something, anything really, that prints to the terminal and gives a sense of continued progress. For instance, while training models on 20 novels, I started printing the name of each novel as it came up. Having something happen every second or two is all you need. It's also a useful indicator for when something has stopped working entirely. It can save you waiting for 90 minutes before you decide that the code has hung.
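A heartbeat can be as small as one print per book. In this sketch the function names are hypothetical and train_one_book stands in for whatever does the real work.

```python
import time


def train_on_books(book_names, train_one_book):
    """Train on each book in turn, printing a heartbeat line per book."""
    start = time.monotonic()
    for i, name in enumerate(book_names, start=1):
        train_one_book(name)
        elapsed = time.monotonic() - start
        # flush=True so the line appears immediately even if output is buffered
        print(f"[{i}/{len(book_names)}] {name} ({elapsed:.1f} s elapsed)", flush=True)


trained = []
# list.append stands in for the real training function
train_on_books(["dracula.txt", "time_machine.txt"], trained.append)
```

Including the running count and elapsed time makes it easy to spot both a hang and a step that is slower than expected.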

Next steps

It feels like this is progress, but the proofreading performance gains haven't hit yet. Time to try some new things.

While second-order models seemed to be an improvement over first-order, the practical limitations of storing an O(n^3) array limited the tokenizer dictionary size. There are other representations to try that are more space efficient.
Extending to third- and higher-order Markov models gives more context, allowing detection of more subtle errors.
It appears that the amount of training data is laughably low. It's time to think seriously about scaling it up. Doing this without becoming an indiscriminate data vacuum will require careful thought.

Artisanal Language Model v2: First-order Markov model

Thu, 28 May 2026 06:36:00 EDT

It would be tempting to jump straight to a transformer-based language model at this point, but that is a trap. There are a lot of small architectural decisions to work out yet, and there are some wrinkles that still need to be smoothed. Starting with simpler models helps to work through those much more quickly. In simpler models there are far fewer places for bugs to hide. Also, we would be robbing ourselves of the satisfaction of seeing just how weak the simple models are and how much performance we are buying by using more computationally expensive models.

How to train a Markov model

The next simplest model I can think of is a first-order Markov model (a.k.a. Markov chain). It is the first baby step in the journey toward sophistication. The concept behind it is that each token can be used to predict what the next token is likely to be. There is a deeper dive into the mechanics as part of this post on transformers but the most important nugget to understand is that knowing the current token gives a probability for each token that follows.

For example, let's say the word jellybean appears frequently in the training corpus, and that there are two tokens involved, one for jelly and one for bean. In the future when the model encounters the token jelly it will assign a high probability to the following token being bean. If it has also seen jellyroll, jelly jar, and jellybelly, then it will also assign some probability to roll, _jar, and belly. But the more often a sequence is observed, the more likely that sequence is predicted to be.

For brevity, I'll use short token names A, B, C, etc. instead of the text they represent, jelly, bean, belly, etc. Markov models can be learned in a straightforward way. Any time a two-token sequence is observed, note it. Keep a running count of how many times it occurs in the training corpus. For sequences starting with A, the counts might end up looking like AA=2, AB=17, AC=0, AD=5, AE=1. In this example, the token A occurred a total of 25 times in the training data. 17 of those were followed by token B, 5 of those were followed by token D, etc.

To get from counts to probabilities, divide sequence count by total token count. The probability that the next token after A will be B is given by

P(AB|A) = count(AB) / count(A) = 17 / 25 = 68%
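The worked example translates directly into a few lines of Python, using the counts given above. The variable names are mine.

```python
# Counts of two-token sequences starting with A, from the example above
pair_counts = {"AA": 2, "AB": 17, "AC": 0, "AD": 5, "AE": 1}

# Total occurrences of A as the first token of a pair
count_a = sum(pair_counts.values())

# Divide each sequence count by the total to get its probability
probabilities = {pair: count / count_a for pair, count in pair_counts.items()}
```

The probabilities for all the sequences starting with A add up to one, and AC, which was never observed, gets a probability of exactly zero.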

Implementation

The bulk of the computing consists of crawling through the text and noting each time every token pair occurs. This is most efficiently done in a single pass, and incrementing the count of each pair as it occurs. A straightforward way to track these is in a two-dimensional array, where each row represents the first token in the pair and each column represents the second. This implies that the array will need as many rows and columns as there are tokens in the dictionary. For a dictionary size n, the count array will have n^2 elements. For n = 1000, this works out to one million elements. This won't even make your RAM break a sweat. Even for n = 10,000, this works out to 100 million elements, which is less than 1GB of memory. At the n = 100,000 point (ten billion elements), you start to need a beefed up workstation and to be thoughtful about how you operate on the array. Luckily the whole promise of ALMs is that small is beautiful.

As of this writing the proofreading ALM is working with a dictionary size of 20,000, which gives the first-order Markov model 400 million elements. Nothing burdensome. Serialized with pickle, this saves to a 4.8GB file, and that's without pulling out any tricks for compression or sparse coding.

You can follow along in the Python implementation.

In pseudo-Python the count array is initialized as zeros

pair_counts = zeros(dictionary_size, dictionary_size)

and the counts are collected as the code walks the length of token ids that make up the full tokenized training corpus.

for i_pair in range(len(token_ids) - 1):
    pair_counts[token_ids[i_pair], token_ids[i_pair + 1]] += 1

To get the probabilities, the counts are summed across all columns (axis 1) to get the total count of occurrences of the first token of the pair. Then the count of each pair is divided by the count of its first token.

pair_probabilities = pair_counts / sum(pair_counts, axis=1)

In the actual implementation, a probability floor is added in, just so the predicted probability never comes out to be zero. Systems can behave strangely when something occurs that the model believed to be impossible.
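One way a probability floor could be applied is sketched below with plain nested lists. The floor value of 1e-6 and the function name are assumptions for illustration; the post does not state what the actual implementation uses or exactly where it applies the floor.

```python
PROBABILITY_FLOOR = 1e-6  # hypothetical value; the real one is not stated


def first_order_probabilities(token_ids, dictionary_size, floor=PROBABILITY_FLOOR):
    """Dense first-order transition probabilities, never lower than floor."""
    n = dictionary_size
    pair_counts = [[0] * n for _ in range(n)]
    for i in range(len(token_ids) - 1):
        pair_counts[token_ids[i]][token_ids[i + 1]] += 1

    pair_probabilities = []
    for row in pair_counts:
        row_total = sum(row)  # occurrences of the first token of the pair
        if row_total == 0:
            # First token never seen: every transition gets the floor
            pair_probabilities.append([floor] * n)
        else:
            pair_probabilities.append([max(count / row_total, floor) for count in row])
    return pair_probabilities


pair_probabilities = first_order_probabilities([0, 1, 0, 2, 0, 1], dictionary_size=3)
```

A side effect of flooring this way is that rows no longer sum to exactly one. For flagging unlikely tokens against a cutoff that is probably harmless, but it would matter if the values were used as a true distribution.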

Token pairs are also referred to as transitions. This is a Markov-related term, as Markov models originated to describe transitions from one state to the next.

Spelling performance

Now that the model is trained, the burning question is how it compares to the random baseline on the spelling evals.

model	name	precision	recall
random	proof_01	0.10	0.22
first-order Markov model	proof_02	0.07	0.98

Because the random baseline is random, it varies from run to run, and this is just one sample run, but this gives a rough sense of how the first-order Markov Model compares. Precision is lower, but not dramatically so. Recall, however, is dramatically higher. This means that the first-order Markov model calls a lot of false alarms, but it finds nearly all of the actual errors in the process. That shows progress! And room for improvement.

Adding in Capitalization errors

Now is a good time to add another dimension to the evaluation. I added a capitalization evaluation dataset, structured similarly to the spelling evaluation set, with 50 errors scattered throughout 5 paragraphs that the model was not trained on. Both models' performance on these was very similar to the spelling eval.

model	name	precision	recall
random	proof_01	0.09	0.16
first-order Markov model	proof_02	0.07	0.98

Automatically generating a table

Having two models and two evals, it's time to up the reporting game. Keeping in mind that this will eventually need to extend to 6 evals and an unknown number of models, some table reformatting helps keep it compact. Also, because manually compiling a larger table will be tedious and error prone, the report_results() function in the eval_proofreading.py module got an upgrade, so that it generates the table in markdown format automatically when uv run report_results.py is run.
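To give a feel for what generating such a table involves, here is a small sketch. It is not the project's report_results() function; the name format_results_table and the shape of the results dictionary are assumptions, and the numbers are illustrative values taken from the separate spelling and capitalization tables earlier in this post.

```python
def format_results_table(results, evals):
    """Render results as a markdown table of '(precision %) recall %' cells.

    results maps model name -> eval name -> (precision, recall) as fractions.
    """
    header = "| model | " + " | ".join(evals) + " |"
    divider = "|" + " --- |" * (len(evals) + 1)
    rows = [header, divider]
    for model, scores in results.items():
        cells = []
        for eval_name in evals:
            precision, recall = scores[eval_name]
            cells.append(f"({round(precision * 100)}) {round(recall * 100)}")
        rows.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


results = {
    "random": {"capitalization": (0.09, 0.16), "spelling": (0.10, 0.22)},
    "FOMM": {"capitalization": (0.07, 0.98), "spelling": (0.07, 0.98)},
}
table = format_results_table(results, ["capitalization", "spelling"])
print(table)
```

Because the column list is passed in, adding a new eval only means adding one name to that list and one entry per model.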

Here is everything in one place.

model	name	capitalization	spelling
random	proof_01	(9) 16	(12) 22
FOMM	proof_02	(7) 98	(7) 98

where the results for each model on each eval are shown as (precision %) recall %. It's worth noting that false negatives (failing to report an error) are worse than false positives (flagging something that isn't an error). That implies that having a high recall is more important than having a high precision, if we're forced to choose. That said, single-digit precision means that more than 9 in 10 reported errors are not actually errors. The model is currently The Boy Who Cried Wolf. It reports so many errors incorrectly that it is not useful.

Next steps

While the FOMM's recall numbers are excellent, the precision is abysmal. It is spraying error detections machine gun-style, hitting the targets, but also hitting everything else in the process. To get precision higher there are a few things to try.

A more sophisticated model. It's possible a second-order Markov model will be an improvement.
A larger training data set. Inspection of the model shows that the reason so many errors are flagged is that there are a lot of token pairings that the model has never run across in the training corpus. Ten books is a hilariously small training set for a language model. Doubling this and seeing what happens will be an interesting experiment.
Adding more evals. Now that the structure for the evals seems to be stabilizing it makes sense to invest the time in creating evaluation data sets for a few more of them. Each eval category is qualitatively different and may give new insights into the strengths and weaknesses of each model.

Artisanal Language Models: Build an end-to-end v0 prototype

Thu, 30 Apr 2026 08:36:00 EDT

When building an initial prototype, its only purpose is to exist, to hold space, to help us think through how everything fits together. The worse the performance, the better. It serves as a rock bottom baseline against which we can measure all future improvements.

That means that it isn’t important to think about making the individual pieces work well, but it is important to think about how they fit together and make sure they work together smoothly. It’s more about defining structure and interfaces than anything else.

The process

It wasn’t pretty. Directories got created, renamed, moved, moved back, and deleted. Functions that were in one module got moved to another. Modules got combined and split. The layout of the tests changed three times and I’m still not entirely happy with it. During the process, I realized how little I understood what a well structured package should look like and took a detour to write myself this guide. This is all somewhat normal for creating a new project, but it’s important to note that the project wasn’t born looking this way. It evolved.

The structure

The resulting project repository is built around the package code at src/proofread_eng_scifi. Outside the main package code, there is a data directory with a copy of the complete training data and an eval directory with the evaluation code. At the moment, all of the tests are low level unit tests, and are scattered throughout the package directories to be as close as possible to the modules they are testing. For a small project like this, it’s not a burden to include them in the package distribution. Saving the extra step of navigating through the directory tree to find them is a nice convenience.

The package code itself consists of proofreading modules at the top level (proof_00.py, proof_01.py) which provide entry points to running the proofreader, and subdirectories for tests and models. I decided to go with a clunky, explicitly numbered versioning approach: filename_01, filename_02, etc. This makes it straightforward to manage which versions of which components are being used. This approach would get messy if it were a large codebase but again, since it’s so small, we can get away with it.

The models directory currently contains the tokenizer model and the random language model. The tokenizer, discussed previously, breaks the text into discrete pieces and numbers them. It also has numbered versions as well as a short script to train each version.

The random language model provides an extremely rough baseline against which future models can be measured. It randomly "estimates" the probability of each token occurring. These are all nonsense estimates, but they have the right shape to be used in our version 01 proofreader prototype.

Returning to the proofreader modules, version 01 is the one that uses the random language model. It uses a command line interface where the name of the text file to be proofread is supplied, for example

uv run proof_01.py to_be_proofread.txt

Then it returns a list of detected errors together with their character positions in the original text file. This again illustrates the "start with something horrible and make it better later" approach. It’s clearly inadequate from a user interface perspective, but it does give some scaffold from which to hang the rest of the prototype.

The proofreader code itself imports the pre-trained tokenizer, as well as the random language model, reads in the file requested, tokenizes it, and gets a probability assigned to each token. The proofreader then implements an arbitrary cutoff of 0.05, and any token with a probability below this is tagged as an error. It also includes some report metrics like how many errors were detected and how many errors per thousand characters, and all of these are displayed in the console.
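The core of that flow can be sketched in a few lines. Several things here are assumptions: the function names, that the random model draws uniformly between 0 and 1, and that the tokens concatenate back to the original text so character positions can be tracked by summing token lengths.

```python
import random

ERROR_CUTOFF = 0.05  # the arbitrary cutoff described above


def random_token_probabilities(n_tokens, rng):
    """Stand-in for the random language model: one nonsense probability per token."""
    return [rng.random() for _ in range(n_tokens)]


def flag_errors(tokens, probabilities, cutoff=ERROR_CUTOFF):
    """Return (start, end, token) for every token whose probability is below cutoff."""
    errors = []
    position = 0
    for token, probability in zip(tokens, probabilities):
        if probability < cutoff:
            errors.append((position, position + len(token), token))
        position += len(token)
    return errors


text = "The quick brown fox"
tokens = ["The", " quick", " brown", " fox"]  # pretend tokenizer output
probabilities = random_token_probabilities(len(tokens), random.Random(0))
errors = flag_errors(tokens, probabilities)
errors_per_thousand_chars = 1000 * len(errors) / len(text)
```

Keeping the cutoff as a single named constant that is passed as a default argument means there is only one place to change it.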

With important files included, here's what the current structure looks like:

proofread-eng-scifi
├── LICENSE
├── README.md
├── data
│   ├── alices_adventures_in_wonderland.txt
│   ├── dorian_gray.txt
│   ├── dracula.txt
│   ├── frankenstein.txt
│   ├── hound_of_baskervilles.txt
│   ├── legend_of_sleepy_hollow.txt
│   ├── lost_world.txt
│   ├── ozma_of_oz.txt
│   ├── seven_dials.txt
│   ├── time_machine.txt
│   ├── twenty_thousand_leagues.txt
│   └── war_of_worlds.txt
├── evals
│   ├── eval_proofreading.py
│   ├── spelling.py
│   └── test_evaluation_data.py
├── pyproject.toml
├── src
│   └── proofread_eng_scifi
│       ├── models
│       │   ├── random
│       │   │   ├── lm_random_00.py
│       │   │   └── test_lm_random_00.py
│       │   └── tokenizer
│       │       ├── model_versions
│       │       │   ├── tokenizer_00.model
│       │       │   └── tokenizer_00.vocab
│       │       ├── test_tokenizer_training.py
│       │       ├── tokenizer_tools.py
│       │       └── train_tokenizer_00.py
│       ├── proof_00.py
│       ├── proof_01.py
│       └── tests
│           ├── test_proof_00.py
│           └── test_proof_01.py
└── uv.lock

This will continue to evolve, but here is a link to the commit captured in the directory tree above.

Integration with the evals

This end-to-end prototype now has just enough substance to it to let us run the evals against it. The spelling eval described here can call the proofreader through a function and run each of its five mistake-ridden paragraphs against it. Because the evals have a ground truth key for the actual mistakes, it can compare the detected mistakes and find where the hits and misses were. Any ground truth mistake that was overlapped by at least one detection is considered detected, a true positive. Similarly any ground truth mistake that has no overlaps is considered missed, a false negative. And any detected mistake that doesn’t overlap a ground truth mistake is considered a false positive. These then get combined to calculate precision and recall, the metrics of choice for the evals created.
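The overlap rules above can be sketched as follows. This is one plausible reading rather than the eval code itself: spans are assumed to be half-open (start, end) character ranges, and the function names and sample spans are invented.

```python
def overlaps(a, b):
    """True when half-open character spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score_detections(truth_spans, detected_spans):
    """Return (precision, recall) using the overlap rules described above."""
    # A ground truth mistake overlapped by any detection is a true positive
    true_positives = sum(
        any(overlaps(truth, det) for det in detected_spans) for truth in truth_spans
    )
    # A detection that overlaps no ground truth mistake is a false positive
    false_positives = sum(
        not any(overlaps(det, truth) for truth in truth_spans) for det in detected_spans
    )
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / len(truth_spans) if truth_spans else 0.0
    return precision, recall


truth = [(10, 15), (40, 44), (80, 86)]
detected = [(12, 14), (20, 25), (43, 50), (60, 62), (90, 95)]
precision, recall = score_detections(truth, detected)
```

Here two of the three true mistakes are hit and three detections land on nothing, giving a precision of 2/5 and a recall of 2/3. Counting true positives per ground truth mistake means several detections on the same mistake are not rewarded twice.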

It’s worth noting that neither precision nor recall is zero. Precision is typically 8-12% and recall is typically 15-25%. The random baseline sometimes gets it right.

Random baseline

For most models, a safe place to start a baseline is a random number generator. Instead of making predictions that are intelligent in any way, just roll the dice each time. This is where we started with our spellchecker. For some tasks and performance metrics this will give near zero, but for others it could be much higher. For example, a random baseline on a true/false quiz would be about 50%. Knowing where the performance floor lies is helpful when it's time to interpret the performance numbers of future models.
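To make the true/false example concrete, here is a small simulation of a coin-flipping quiz taker. It is only a sketch: the function name, quiz length, and seed are made up for illustration and are not part of the project.

```python
import random


def random_baseline_accuracy(n_questions, seed=0):
    """Fraction of true/false questions that a coin flip gets right."""
    rng = random.Random(seed)
    # The answer key and the guesses are both drawn at random.
    answers = [rng.choice([True, False]) for _ in range(n_questions)]
    guesses = [rng.choice([True, False]) for _ in range(n_questions)]
    n_correct = sum(a == g for a, g in zip(answers, guesses))
    return n_correct / n_questions


accuracy = random_baseline_accuracy(10_000)
print(f"random baseline accuracy: {accuracy:.3f}")
```

With this many questions the result should land close to 0.5, which is the floor any real model on that quiz would need to beat.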

Consistency

Consistent style makes code easier to read and helps bugs to be more visible. I aim to keep my code mostly consistent and mostly in line with the recommendations of PEP 8, but if I have good reason to deviate I don't hesitate. For example, in most places I put test files in the same directory as the modules they test, but in src/proofread_eng_scifi I opted to put them in their own subdirectory. Because of how I foresee the project evolving, this made the most sense to me.

I've seen some wildly verbose and incomprehensible code that resulted from blindly keeping to consistent conventions. As Ralph Waldo Emerson notes in his seminal essays on coding standards, "A foolish consistency is the hobgoblin of little minds". This roughly translates to "use common sense". In particular, think about who is going to be running and reading your code, and try to make their lives as easy as possible. Future you will thank you.

Next steps

Now that there is a skeleton end-to-end prototype in place, the next steps will be to start improving it piece by piece and re-evaluating it after each step. There is plenty of room for improvement. The evals are incomplete, the language model is laughably bad, the tokenizer has options to experiment with, and the UI needs a whole lot of polish.

Things are just starting to get interesting.

Python packaging with uv

Fri, 17 Apr 2026 06:36:00 EDT

For a long time, every time I needed to create a new project or Python package I copied an existing repository. It worked pretty well, but never perfectly. It always felt like I was wearing someone else’s shoes. And then when I went to make changes, I realized quickly how little I understood about how projects, builds, and distributions work.

Here are some questions that I’ve had and some answers I've found. They focus on the uv toolset of environment management and packaging. This post doesn’t say anything about setuptools, even though setuptools, setup.py, and setup.cfg are still present in a lot of well-built projects, especially pre-2025 ones.

I expect this list of questions to grow and evolve over time as I learn. My primary audience is a clueless future me, but I hope it helps you too.

How do I name projects? Packages? Modules? Repositories?

As the saying goes, naming things is one of the hardest problems in computer science. This is particularly true when it comes to packaging. There are a lot of different entities to be named, and it’s unclear sometimes what names belong to which. There is a repository name, a top level directory name, a project name, and a package name. All of these can be different. For extra confusion, they can get confused with module names and function names as well.

The Python interpreter and build tools have no problem knowing whether a particular name is supposed to reference the project or the package. They determine this from context. For human brains, especially the ones new to packaging, this can be a lot to keep track of. To save yourself unnecessary hassle, a good trick is to use the same name for all these things. A brief, memorable, all-lowercase name is ideal. The one way in which these names will differ is in how they handle multiword names. For project, repository, and top level directory names, separate words with a hyphen, as in my-amazing-tool. For Python packages and modules separate multiple words with an underscore, for example my_amazing_tool. This keeps things consistent with the conventions of the various tools and communities.
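The hyphen-versus-underscore convention is mechanical enough to write down. This is a minimal sketch with a hypothetical helper name, assuming the project name is already a brief, hyphen-separated name like the example above.

```python
def package_name_from_project(project_name):
    """Turn a hyphenated project name into an importable package name."""
    return project_name.strip().lower().replace("-", "_")


project_name = "my-amazing-tool"
package_name = package_name_from_project(project_name)
print(package_name)

# A package name has to be a valid Python identifier to be importable.
assert package_name.isidentifier()
```

The isidentifier() check is a cheap way to catch names that would never work in an import statement, such as ones that start with a digit.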

But don’t worry too much if you feel the need to deviate from this. Plenty of smart people have differing opinions and it’s a matter of convention only. The PEP 8 recommendation is to give packages single-word names, without underscores, but this can be challenging to do in a readable way. Everyone ignores this.

If you plan to distribute it publicly on https://pypi.org, check it first to make sure the package name isn't already taken.

What are wheels and sdists?

An sdist is a source distribution and a wheel is a binary distribution. I have no idea why it's called a wheel. Source is the code itself—.py files and their supporting cast. It comes in a single .tar.gz archive which has to be unzipped with tar -xvf before it can be properly read. Detail on sdists here.

The wheel is the compiled version of the source, containing only the files needed to actually run the code. Because compilation is platform specific, a wheel is tied to a particular operating system, processor architecture, and Python version. A single project can have many wheels if it's meant to be run on many platforms and architectures. The big caveat here is that Python files don't get pre-built into binaries. The local Python interpreter does that at runtime. So if it's a purely Python package, then one wheel is usually sufficient for all platforms and OS's. Detail on wheels here.
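The platform information shows up in the wheel's filename, which follows the pattern name-version-pythontag-abitag-platformtag.whl, with an optional build tag after the version. The sketch below splits a filename into those parts. The filenames are hypothetical and the helper is mine, not part of uv or hatchling.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "build_tag": parts[2] if len(parts) == 6 else None,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
        # Pure Python wheels are not tied to an ABI or a platform.
        "is_pure_python": abi_tag == "none" and platform_tag == "any",
    }


pure = parse_wheel_filename("mypackage-0.1.0-py3-none-any.whl")
compiled = parse_wheel_filename(
    "mypackage-0.1.0-cp312-cp312-manylinux_2_17_x86_64.whl"
)
print(pure["is_pure_python"], compiled["platform_tag"])
```

A wheel tagged py3-none-any is the "one wheel for all platforms" case. The second filename is the kind you would see for a package with compiled extensions.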

What are build tools and why do they matter?

Build tools do the work of taking the original files and the information from pyproject.toml and using them as ingredients for building the sdist and the wheel.

There are two parts to this, a build frontend and a build backend. For the purposes of this post, the frontend is uv build. It does some gathering and interpretation of the files and prepares them for the next step. pip and build are other popular build frontends.

There are a few common build backend tools, including hatchling, setuptools, and uv's own uv_build. Here's a short guide for choosing between them, but when in doubt hatchling is a good choice. The backend needs to be called out in pyproject.toml like this

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

Here are some examples for the other backends as well.

Why have a `src` directory?

There are two common patterns for organizing projects: flat and src layouts.

A flat layout is intuitive. The package code sits at the top level of the project and is more straightforward to access. But because of how imports work it's easy to lose track of whether your other code, like tests, is referencing the working copy of your code in the project or a previously installed version of the package. It can result in maddening bugs.

A src layout alleviates this. Because the package sits one level lower, it's not so easily reachable for direct import. Any import would need an installed version of the package. Using an editable install makes sure that your most recent changes to the code are what is run.

The src layout gives the benefit of protecting us from ourselves a bit more. And it comes at the cost of a slightly more complex file structure and an extra step to ensure an editable install.
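When in doubt about which copy of a package is being picked up, it's possible to ask the import system directly. This is a small sketch using the standard library's importlib. The helper name is made up, and mypackage stands in for whatever your package is called.

```python
import importlib.util


def import_origin(package_name):
    """Return the file a top-level import would load, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin


# A standard library package, to show what the result looks like.
print(import_origin("json"))

# For your own package, a path under src/ suggests the editable
# install is in use. None means it isn't importable from here.
print(import_origin("mypackage"))
```

If the printed path for your package points somewhere inside site-packages instead of your project's src directory, the code being run is probably a previously installed copy and not the working copy.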

How do I make a package visible across the project?

To make a package visible from other locations in a project that are outside the package directories, for instance in tests/, the most reliable way is to do an editable install. From within the top level directory of the project run

uv pip install -e .

Modules outside your package shouldn't need __init__.py files in each directory. But now they will be able to import mypackage and go to town.

Note that it's also totally valid to include tests/ within the package. In that case they will very much need their __init__.py files. More detail in pytest best practices.

How do I add other file types to the package? And how do I access them from the code?

The easiest way is to include them under the package directory tree. By default, hatchling includes in the sdist any non-Python files under the mypackage directory that are not in .gitignore. This behavior can be arbitrarily modified for both types of build targets, sdists and wheels. Examples here. They can be instructed to include files from outside the project as well.

From within the code, these files can be accessed by their absolute path. The __file__ attribute gives the path of a given module. It can then be modified to point to the data instead. For example for this structure

myproject
├── pyproject.toml
└── src
    └── mypackage
        ├── __init__.py
        ├── mymodule.py
        └── data
            └── mydata.json

within mymodule

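import os

# __file__ is the path of this module, mymodule.py.
# Its directory is the package directory, which also holds data/.
mymodule_abspath = os.path.abspath(__file__)
mypackage_abspath = os.path.dirname(mymodule_abspath)
mydata_abspath = os.path.join(mypackage_abspath, 'data')
mydata_absfilename = os.path.join(mydata_abspath, 'mydata.json')
with open(mydata_absfilename, 'rt') as f:
    mydata = f.read()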

What goes into `pyproject.toml`?

While pyproject.toml files are powerful and flexible and can be quite long, a minimal pyproject.toml contains just some basic project and build system information, like this.

[build-system]
requires = ['hatchling']
build-backend = "hatchling.build"

[project]
name = 'myproject'
version = '0.1.0'

There are some other commonly used fields for the project table, including description, license, authors, and keywords. The dependencies field will list other packages that yours depends on. The classifiers field gives a set of standard tags that can help humans and software tools alike make better use of your package. The full list of classifiers is lengthy, but some especially helpful ones are

Development Status
Intended Audience
Topic
Programming Language

Resources

These are the references that I find most useful when answering these questions.

Artisanal Language Models: Define a task and write evals

Tue, 14 Apr 2026 08:36:00 EDT

A defining feature of an ALM is that it is purpose built for a well-defined task. So far the task has only been described in general terms: proofreading English prose of a novel draft.

It’s time to get more specific about what this ALM will do, what its inputs and outputs will be, so that I can actually start building it. It’s also time to define some tests and performance measures. I’ll need them when I have something running and I make a change; I’ll need some way to measure whether it got better.

Inputs

After the ALM has been trained, I’ll want to feed it text to proofread. To start with I’ll plan to do this through the simplest and clunkiest way I can think of: passing it the path to a text file containing the text to be proofread. In an actual professional product built for users, this is probably not ideal, but it’s a nice generalizable front door that slicker user interfaces can be built around in the future.

Outputs

After the proofreader has done its job and identified segments that might need correction, those segments will be reported as positions in the input text. Specifically, when the text file is read in as a string, the indices of the starting and ending characters will be used to tag segments for inspection and correction. The position of each suspected error will be reported as a pair of indices. This collection of start/finish pairs will be the output of the proofreader.

There are a lot of other things that could be enhanced about this to provide a good user experience, and I leave the door open to add those later. For instance, the text file could be modified to include special characters marking the suspect segments. Or a fancier graphical UI could simply highlight the potential errors or underline them, as is common in word processors. An even fancier extension could propose corrections and offer the user a single keystroke way to select from a number of suggestions. But all of these could be built on top of a pair of start and end markers for each proofreading note.
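As a rough illustration of building on top of start/end pairs, the sketch below wraps each flagged segment in marker characters. It assumes the end index is inclusive and that segments don't overlap. The function name and the double square bracket markers are my own choices for the example.

```python
def mark_segments(text, segments, open_mark="[[", close_mark="]]"):
    """Wrap each (first_char, last_char) segment of text in markers.

    Indices are assumed to be inclusive on both ends.
    """
    marked = text
    # Work from the end of the text backward so that the indices
    # of earlier segments stay valid after each insertion.
    for first_char, last_char in sorted(segments, reverse=True):
        marked = (
            marked[:first_char]
            + open_mark
            + marked[first_char : last_char + 1]
            + close_mark
            + marked[last_char + 1 :]
        )
    return marked


text = "Those icey climes fill my imagniation."
segments = [(6, 9), (26, 36)]
marked = mark_segments(text, segments)
print(marked)
```

The same pairs could just as easily drive highlighting in a graphical interface, which is the appeal of keeping the proofreader's output this plain.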

Evaluation

I’ll need a way to evaluate the progress. If I try to enhance my proofreading model, how will I know if it worked? The variety of all possible text to be proofread is so large, it’s impossible to test every variation exhaustively. The best I can hope to do is come up with a representative sample.

This falls somewhere in between the traditional software engineering practice of testing, where the notion of right and wrong answers and what a function must do is fairly clear cut, and benchmarking, which is a consensus-driven measure for comparing solutions in a broadly recognized way. It’s what has come to be known in LLM development as evaluations or, more affectionately, evals. Evals are a reasonable sample of the space in which a language model is expected to work. They lack the recognition and respect of a full-blown benchmark, and also lack the rigor and confidence of carefully designed tests. But despite having the worst of both worlds, evals operate in a space that we cannot ignore, and for which there is no better solution that I know of.

In practice, evals are organic. They grow to cover new use cases and new failure modes during the development process. But it’s helpful to think through a reasonable initial set. For proofreading there are several classes of errors that are important to cover.

Spelling. Febuary. Febrewary. Februry. Fabuwary.
Word choice. Catching when it should be "imply" and when it should be "infer". To/two/too. There/their/they're.
Punctuation. Appropriate sentence termination. Comma usage. Quotation mark usage.
White space. Extra spaces. Spaces around punctuation. Weird indents.
Capitalization.
Grammar. Verb tense. Pronoun-antecedent agreement. Preposition choice.

These aren’t exhaustive, but there’s no need for evals to be exhaustive in order to be useful. I will almost certainly add more later as I discover new categories that aren’t getting picked up well. But they are a good place to start.

Creating evals for each category

In practice, to test how well a given language model performs in each of these areas, I’ll need to create an evaluation data set. For each of the areas above, I’ll pull five paragraphs arbitrarily from an evaluation text (Frankenstein by Mary Wollstonecraft (Godwin) Shelley) and throw 10 errors into the text of a given type. Having five paragraphs full of spelling errors gives a total of 50 spelling errors to detect. Each paragraph will come with its own answer key, the beginning and end of each word or phrase containing the error. After the proofreading model processes the paragraph, the errors it detects will be compared against the ground truth.

A ground truth error that is overlapped by at least one model-detected error is considered detected (true positive). This is not quite the same thing as the following.
A model-detected error that overlaps at least one ground truth error. This is considered an accurate detection, but there may be several of these per ground truth error. I can't use this as the true positive count because it could result in inflated counts.
A model-detected error that doesn’t overlap a ground truth error will be considered a false positive.
A ground truth error that is not overlapped by at least one model-detected error will be considered a false negative.

Recall will be defined as the total number of model-detected ground truth errors (true positives) over the total number of ground truth errors (true positives plus false negatives).

Precision will be the total number of model-detected ground truth errors (true positives) divided by the total number of true positives plus false positives. When there are no true positives or false positives, precision will be undefined.
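The counting rules above can be sketched in a few lines. This is not the project's eval code, just a minimal illustration. It assumes errors are (first_char, last_char) pairs with inclusive ends, the function names are hypothetical, and the spans in the example are invented. Recall is true positives over true positives plus false negatives, and precision is true positives over true positives plus false positives.

```python
def overlaps(a, b):
    """True if two inclusive (first_char, last_char) spans share a character."""
    return a[0] <= b[1] and b[0] <= a[1]


def score_detections(truth_spans, detected_spans):
    """Count hits and misses, then compute precision and recall."""
    # Each ground truth error counts once, however many detections hit it.
    true_positives = sum(
        any(overlaps(t, d) for d in detected_spans) for t in truth_spans
    )
    false_negatives = len(truth_spans) - true_positives
    false_positives = sum(
        not any(overlaps(d, t) for t in truth_spans) for d in detected_spans
    )

    # Both ratios are undefined (None) when their denominator is zero.
    recall = true_positives / len(truth_spans) if truth_spans else None
    n_flagged = true_positives + false_positives
    precision = true_positives / n_flagged if n_flagged > 0 else None

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
    }


truth = [(10, 14), (30, 35), (50, 58)]
detected = [(9, 11), (13, 16), (40, 44), (55, 60)]
scores = score_detections(truth, detected)
print(scores)
```

In this example the first ground truth error is hit by two detections but still counts as a single true positive, which is the inflation the counting rules are meant to avoid.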

Creating the evaluation dataset

Putting this into computer-readable form required creating a Python script with some error-ridden example text and the locations of the errors. I created the initial set of evals for spelling errors, but held off on creating evals for the other error types (punctuation, capitalization, etc.) for now. By the time you read this, there is a good chance it will already have evolved. If that's the case, you can find the latest version here. The evaluation dataset is a list of dicts, each of which contains a paragraph of text taken from a different chapter of Frankenstein which I modified to contain ten spelling mistakes. Each dict also contains a list of ten dicts, one per mistake, each containing

the misspelled word
the index of its first and last character
the corrected version of the word

Here's a snippet of the result

evaluation_dataset = [
    {
        'source': 'Frankenstein',
        'chapter': 'L1',
        'paragraph': """
I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icey climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagniation as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpettual splendour. There—for with your leave, my sister, I will put
...
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine.
                """,
        'mistakes': [
            {
                'first_char': 322,
                'last_char': 325,
                'wrong_text': 'icey',
                'correct_text': 'icy',
            },
            {
                'first_char': 526,
                'last_char': 536,
                'wrong_text': 'imagniation',
                'correct_text': 'imagination',
            },
            {
                'first_char': 678,
                'last_char': 687,
                'wrong_text': 'perpettual',
                'correct_text': 'perpetual',
            },
            ...
        ]
    },
]
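Hand-counted character indices are easy to get wrong, so a consistency check is worth having. The sketch below assumes the layout shown in the snippet, with last_char inclusive, and confirms that each pair of indices really does point at its wrong_text. The function name and the toy entry are invented for the example. I'm not claiming this is what the project's own tests do.

```python
def find_index_mismatches(entry):
    """Return (wrong_text, text_found) pairs where the indices are off."""
    paragraph = entry["paragraph"]
    mismatches = []
    for mistake in entry["mistakes"]:
        # last_char is treated as inclusive, hence the + 1.
        found = paragraph[mistake["first_char"] : mistake["last_char"] + 1]
        if found != mistake["wrong_text"]:
            mismatches.append((mistake["wrong_text"], found))
    return mismatches


# A toy entry in the same shape as the evaluation dataset.
toy_entry = {
    "source": "toy example",
    "paragraph": "those icey climes",
    "mistakes": [
        {
            "first_char": 6,
            "last_char": 9,
            "wrong_text": "icey",
            "correct_text": "icy",
        },
    ],
}

# An empty list means every pair of indices lines up with its text.
print(find_index_mismatches(toy_entry))
```

Running a check like this over every entry catches off-by-one slips before they quietly distort the precision and recall numbers.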

End-to-end prototyping

It’s fair to ask why I don’t go ahead and complete the evaluation datasets for the other types of errors. It seems logical to completely finish this step before moving on to the next. We can imagine this as a breadth-first solution to the problem, thoroughly working through one stage of development, putting some polish on it before moving to the next. When this work is spread across teams, this is called waterfall style development. One team completes a whole stage of the project like design or backend support before passing it on to the next.

The alternative to this is a depth-first development strategy. Building a bare bones end-to-end solution and then adding breadth and sophistication to it in subsequent passes. Starting with an end-to-end prototype means leaving a lot of things incomplete in the first pass. It means creating something that you would be embarrassed to show to your friends. If you’re working across multiple teams, it means a whole lot more communication up front.

In theory, both of these approaches are valid and will produce a good result in similar timeframes. But in practice, they don’t. The waterfall approach assumes that all of the work done at each stage gets to remain in its final form. In fact, every additional stage teaches us things we didn’t know about what needed to come before. This requires a lot of rework on stages that we had previously thought were complete. In an end-to-end prototyping approach this rework happens quickly. The whole project has a lot less momentum and can pivot more gracefully. It is more agile.

This lesson can take a long time to learn, and in some cases, it is in managers' interest to ignore it, depending on the incentives of the organization. But since I am all of the engineering teams and all of the levels of management for this project I get to decide: We’re going to start with a lightweight end-to-end prototype.

So now that the spelling evals are done, on to the next stage: building a model to detect misspelled words.

Blog Highlights

Sun, 12 Apr 2026 06:36:00 EDT

I revamped my blog to include a highlights reel. It was a big unfriendly wall of links. Now it starts with a small unfriendly wall of links.

Tutorials, projects, code, and thoughts collected into topic groups I've generously called Book Projects.

Highlights

New Releases

Most Visited

Most Loved

I'm most proud of
