<rss version="2.0">
<channel>
<title>Brandon Rohrer</title>
<link>https://www.brandonrohrer.com</link>
<description>Brandon Rohrer's blog</description>


  <item>
    <title>
    Artisanal Language Model v6: Sparse second-order Markov model
    </title>
    <link>
    https://brandonrohrer.org/alms_somm_sparse.html
    </link>
    <pubDate>
    Sat, 20 Jun 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_somm_sparse.html
    </guid>
    <description><![CDATA[


<p>
After some trial and error with a few sparse representation concepts,
I settled on one that takes advantage of one of Python's strengths:
dictionaries. The sparse second-order model keeps two running
dictionaries: one full of bigram counts and one for trigram counts.
</p>

<h2><a id="Representation"></a><a href="#Representation">Representation</a></h2>

<p>
A bigram is a token pair. For the token sequence AKFLS, the bigrams are
AK, KF, FL, LS. And a trigram is a triple. The same sequence would
contain trigrams AKF, KFL, FLS. (More generally, n-grams have n tokens.)
</p>

<p>
Dictionaries tracking these have bi/tri-grams as keys and counts as
values. Any n-grams not yet observed will not be in the dictionary at all,
which is what allows it to be sparse.
</p>

<p>
<pre>
bigram_dict = {
    "AK": 1,
    "KF": 1,
    "FL": 1,
    "LS": 1,
}
trigram_dict = {
    "AKF": 1,
    "KFL": 1,
    "FLS": 1,
}
</pre>
</p>

<p>
To save some more space in memory, instead of using the token text itself,
I took the list of token ids, which are integers, and converted it to
a tuple. There are ways to shrink this further, but they start to get
more convoluted and I'll wait until I bump up against the limits of this
scheme before I wrestle with them.
</p>
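<p>
As a rough sketch of this representation (not the code from the repo),
the counting step might look like the following. The function name
<code>count_ngrams</code> and the token ids are made up for illustration, and
<code>Counter</code> is a stand-in: it is a dictionary subclass that treats
missing keys as zero.
</p>

```python
from collections import Counter


def count_ngrams(token_ids, n):
    """Count every run of n consecutive token ids, keyed by tuple."""
    counts = Counter()
    for i in range(len(token_ids) - n + 1):
        counts[tuple(token_ids[i:i + n])] += 1
    return counts


# Hypothetical integer ids standing in for the tokens A, K, F, L, S.
token_ids = [0, 10, 5, 11, 18]
bigram_counts = count_ngrams(token_ids, 2)
trigram_counts = count_ngrams(token_ids, 3)
```

<p>
Only the four bigrams and three trigrams that actually occur end up as keys,
which is where the sparsity comes from.
</p>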

<h2><a id="Implementation"></a><a href="#Implementation">Implementation</a></h2>

<p>
In order to calculate the likelihood of a next token with this formula
</p>

<p>
<code>P(ABC | AB) = count(ABC) / count(AB)</code>
</p>

<p>
the count(ABC) and count(AB) can be looked up directly in their respective
dictionaries. The pseudo-Python has two dictionary lookups to estimate
the likelihood of F following AK.
</p>

<p>
<pre>
likelihood = trigram_dict["AKF"] / bigram_dict["AK"]
</pre>
</p>

<p>
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/commit/2655998848b58179545c73e5144ce51f23ce5b22/src/proofread_eng_scifi/models/somm/somm_sparse.py#L40">The full code</a>
adds in default values of 0 for any missing dictionary keys
and some small values for numerical stability.
</p>
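<p>
A minimal sketch of what those defaults might look like, not a copy of the
linked code. The name <code>next_token_likelihood</code> and the value of
<code>EPSILON</code> are assumptions chosen for illustration.
</p>

```python
# Assumed value for illustration; the linked code may use something different.
EPSILON = 1e-12


def next_token_likelihood(trigram_dict, bigram_dict, prefix, token):
    """Estimate P(token | prefix) as count(prefix + token) / count(prefix)."""
    # Missing keys default to a count of zero.
    trigram_count = trigram_dict.get(prefix + token, 0)
    bigram_count = bigram_dict.get(prefix, 0)
    # EPSILON keeps the denominator nonzero for unseen prefixes.
    return trigram_count / (bigram_count + EPSILON)


bigram_dict = {"AK": 1, "KF": 1, "FL": 1, "LS": 1}
trigram_dict = {"AKF": 1, "KFL": 1, "FLS": 1}
seen = next_token_likelihood(trigram_dict, bigram_dict, "AK", "F")
unseen = next_token_likelihood(trigram_dict, bigram_dict, "ZZ", "F")
```

<p>
With these toy counts, <code>seen</code> comes out just under 1 and
<code>unseen</code> is 0 rather than a <code>KeyError</code> or a division by zero.
</p>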

<h2><a id="Training"></a><a href="#Training">Training</a></h2>

<p>
A fun thing about using dictionaries is that they are fast in Python.
Training happens almost as fast as the data can be read from disk.
A dictionary read is a hash lookup which screams, even with large
dictionaries. It's O(1) on average for you computer science nerds.
</p>

<p>
I also noticed on closer inspection that my earlier implementation re-calculated
the transition probabilities after loading each book. This was unnecessary,
and constituted the bulk of the training computation.
Skipping this step streamlined the training considerably.
</p>

<p>
Training for the sparse model ended up quite fast, and it is so space efficient
that it allowed me to go back to using the full 20,000-element tokenizer.
</p>

<h2><a id="Performance"></a><a href="#Performance">Performance</a></h2>

<p>
Unfortunately, moving to the larger tokenizer meant that tokens were longer
and three-token sequences were even less likely to re-occur. This means
that the proofreader was surprised by a lot of things. The recall was perfect.
Every single error got called out as an error. But the precision dropped to
an abysmal 4 percent. That means for every mistake it identified correctly,
it found 24 others that were not actually mistakes.
</p>
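<p>
The arithmetic behind that "24 others" can be checked with a short sketch.
The counts below are hypothetical, picked only to reproduce a 4 percent
precision with perfect recall. They are not the actual eval counts.
</p>

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions between 0 and 1."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 50 injected errors all caught, plus 1200 false alarms.
precision, recall = precision_recall(50, 1200, 0)
false_alarms_per_hit = 1200 / 50
```

<p>
Here precision is 50 / 1250, or 4 percent, recall is 100 percent, and there
are 24 false alarms for every real error.
</p>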

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>model</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>01</td>
      <td>random</td>
      <td>(9) 16</td>
      <td>(7) 12</td>
      <td>(8) 16</td>
    </tr>
    <tr>
      <td>02</td>
      <td>fomm_00</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
    <tr>
      <td>03</td>
      <td>somm_00</td>
      <td>(6) 100</td>
      <td>(6) 94</td>
      <td>(6) 98</td>
    </tr>
    <tr>
      <td>04</td>
      <td>fomm_01</td>
      <td>(8) 100</td>
      <td>(8) 98</td>
      <td>(8) 98</td>
    </tr>
    <tr>
      <td>05</td>
      <td>somm_01</td>
      <td>(8) 100</td>
      <td>(7) 90</td>
      <td>(7) 96</td>
    </tr>
    <tr>
      <td>06</td>
      <td>somm_02</td>
      <td>(4) 100</td>
      <td>(4) 100</td>
      <td>(4) 100</td>
    </tr>
 </tbody>
</table>


<h2><a id="moar-data"></a><a href="#moar-data">moar data</a></h2>

<p>
The solution to this is to give the model a richer training data set.
It has so little experience at this point, that everything is novel.
The model needs to cram a lot more text into its representation to be able
to better separate correct prose from incorrect.
</p>

<p>
The temptation here is to indiscriminately scrape a truckload of data
and assume that the model will learn the right patterns from it. That is
antithetical to the Artisanal approach. This is a point where I get to
walk the walk and figure out a way to <em>thoughtfully</em> expand our data set.
</p>

<p>
I relied heavily on this
<a href="https://github.com/pgcorpus/gutenberg">helpful collection of Project Gutenberg tools</a>
which helped me download a lot of raw text versions and compiled a csv
of their metadata. This gave me a foothold for reading in the information
and selecting the most relevant texts. Using
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/scripts/create_dataset_400.py">a Python script</a>
I whittled the list of 80,000 books down to a subset that
</p>

<ul>
<li> are text, not audio</li>
<li> in English</li>
<li> have "Science fiction" as a subject</li>
</ul>

<p>
That left a little over 3,000 titles. I also removed the titles I had
set aside as evaluation texts--Alice in Wonderland, Frankenstein,
and Call of Cthulhu. Due to the fact that my download of the archive
is only partial so far, that left me with 416 titles in the new training
corpus. (It's obnoxious to put this much data
in a repo, so you'll have to take my
word for it.)
</p>
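<p>
The filtering steps above can be sketched roughly like this. It is not the
linked script. The column names (<code>title</code>, <code>type</code>,
<code>language</code>, <code>subjects</code>), the way their values are
formatted, and the sample rows are all assumptions that would need checking
against the real metadata csv.
</p>

```python
import csv
import io

# Assumed column names and values; check them against the real metadata csv.
SAMPLE_METADATA = """title,type,language,subjects
Book One,Text,['en'],"{'Science fiction', 'Space'}"
Book Two,Sound,['en'],{'Science fiction'}
Book Three,Text,['fr'],{'Science fiction'}
Book Four,Text,['en'],{'Cooking'}
Frankenstein,Text,['en'],"{'Science fiction', 'Horror'}"
"""

# Titles held out for evaluation, so they must stay out of training.
EVAL_TITLES = {"Frankenstein"}


def select_titles(csv_file):
    """Keep English text entries tagged as science fiction."""
    selected = []
    for row in csv.DictReader(csv_file):
        if row["type"] != "Text":
            continue
        if "'en'" not in row["language"]:
            continue
        if "Science fiction" not in row["subjects"]:
            continue
        if row["title"] in EVAL_TITLES:
            continue
        selected.append(row["title"])
    return selected


training_titles = select_titles(io.StringIO(SAMPLE_METADATA))
```

<p>
Only "Book One" survives all four checks in this toy table. Reading from a
real file would swap the <code>io.StringIO</code> for an <code>open()</code> call.
</p>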

<p>
Also, in the spirit of investing in data quality, I preprocessed the texts
to remove the Project Gutenberg headers and footers, ensuring that
the models would be focused on learning the underlying data and not
fixating on artifacts.
</p>
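<p>
One way to do this kind of preprocessing, sketched under assumptions, is to
keep only the text between the "*** START OF ..." and "*** END OF ..." marker
lines that Project Gutenberg files carry. The function name and the patterns
are illustrative, and the marker wording varies between files, so this is a
loose guess rather than a complete rule.
</p>

```python
import re

# Loose patterns for the marker lines; real files vary in their exact wording.
START_MARKER = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the markers, or all of it if none are found."""
    start = START_MARKER.search(raw_text)
    end = END_MARKER.search(raw_text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()


sample = (
    "License preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It was a dark and stormy night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
```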

<p>
After retraining the tokenizer, the first-order model (version 07),
and the second-order model (version 08) on the larger training corpus,
the evals show some changes.
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>05</td>
      <td>(8) 100</td>
      <td>(7) 90</td>
      <td>(7) 96</td>
    </tr>
    <tr>
      <td>06</td>
      <td>(4) 100</td>
      <td>(4) 100</td>
      <td>(4) 100</td>
    </tr>
    <tr>
      <td>07</td>
      <td>(10) 96</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>08</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
      <td>(6) 100</td>
    </tr>
 </tbody>
</table>


<p>
Performance values are shown as: (precision %) recall %
</p>

<p>
Comparing version 08 to version 06, the second-order model shows a
tiny bump in precision. Progress, yes, but not very much.
</p>

<p>
Comparing version 07 to version 05, the first-order model shows a
small bump in precision, but it's still pretty low. About one in ten
flagged errors is real. The
recall is still ridiculously high.
Punctuation again has the lowest
recall numbers, illustrating how difficult it is and how much context
it takes to punctuate well.
</p>

<p>
The high recall and low precision numbers
indicate that the models are too specific. Much of what they see is brand-new
to them because they have never seen those particular sequences
before. The larger the tokenizer alphabet, the
longer the character strings represented by tokens, the rarer they will
each seem to be. A good way to illustrate this is by training
a set of second-order models using tokenizers with smaller dictionaries.
</p>

<h2><a id="Dictionary-size"></a><a href="#Dictionary-size">Dictionary size</a></h2>

<p>
Shrinking the dictionary size used with the second-order model does
indeed bring up precision at the cost of some recall. This sequence
shows dictionary sizes of 10k, 5k, 2k, 1k.
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>09</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
      <td>(6) 100</td>
    </tr>
    <tr>
      <td>10</td>
      <td>(7) 100</td>
      <td>(7) 94</td>
      <td>(7) 100</td>
    </tr>
    <tr>
      <td>12</td>
      <td>(11) 98</td>
      <td>(9) 88</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>13</td>
      <td>(17) 92</td>
      <td>(12) 74</td>
      <td>(16) 100</td>
    </tr>
 </tbody>
</table>


<p>
These show how precision and recall can be traded off against each other
to a certain extent by adjusting the alphabet size. But they are still
too low to be useful. (I fear the answer is LOTS more data.)
</p>

<h2><a id="Sparsify-first-order-model"></a><a href="#Sparsify-first-order-model">Sparsify first-order model</a></h2>

<p>
The second order sparse model worked so well that I went back
and sparsified the first-order Markov model too.
Comparing the two shows that the performance is identical.
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>07</td>
      <td>(10) 96</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>11</td>
      <td>(10) 96</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
 </tbody>
</table>


<p>
Actually when I first ran these, the performance was NOT identical.
It was a head scratcher for me until I found a bug where I was setting
the error detection threshold in two different places and the two did
not agree. It's another illustration of how testing and double checking
things that should obviously work is a great way to surface subtle bugs
in your code. And there is <em>always</em> another bug.
</p>

<h2><a id="Reporting"></a><a href="#Reporting">Reporting</a></h2>

<p>
At this point the reporting table is growing large enough to merit splitting
into a performance table and a model summary table. Also, the variants
are proliferating, so it's helpful to keep track of the characteristics
of each model so that patterns in what works, and doesn't work, can
start to emerge.
</p>

<p>
Here's what a sample of the model table and eval table look like
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>model</th>
      <th>description</th>
      <th>tokenizer</th>
      <th>alphabet</th>
      <th>books</th>
      <th>error cutoff</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>09</td>
      <td>somm_04</td>
      <td>sparse 2nd-order Markov</td>
      <td>06</td>
      <td>10k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>10</td>
      <td>somm_05</td>
      <td>sparse 2nd-order Markov</td>
      <td>07</td>
      <td>5k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>11</td>
      <td>fomm_03</td>
      <td>sparse 1st-order Markov</td>
      <td>05</td>
      <td>20k</td>
      <td>416</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>12</td>
      <td>somm_06</td>
      <td>sparse 2nd-order Markov</td>
      <td>08</td>
      <td>2k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>13</td>
      <td>somm_07</td>
      <td>sparse 2nd-order Markov</td>
      <td>09</td>
      <td>1k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
 </tbody>
</table>


<table>
  <thead>
    <tr>
      <th>version</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>09</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
      <td>(6) 100</td>
    </tr>
    <tr>
      <td>10</td>
      <td>(7) 100</td>
      <td>(7) 94</td>
      <td>(7) 100</td>
    </tr>
    <tr>
      <td>11</td>
      <td>(10) 96</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>12</td>
      <td>(11) 98</td>
      <td>(9) 88</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>13</td>
      <td>(17) 92</td>
      <td>(12) 74</td>
      <td>(16) 100</td>
    </tr>
 </tbody>
</table>


<h2><a id="Registry"></a><a href="#Registry">Registry</a></h2>

<p>
In fact the work is also reaching the point where pulling out information about
each model and how it was trained can be confusing and error-prone.
It is for me far beyond the point where I can just hold it all in my head.
Tracking model versions, variants, experiments, and their characteristics
is the job of a Registry. There are vendors ready to sell you ML Registry
products that have a lot of convenient features, but for a project of this
size all we need is a file where these things are written down.
</p>

<p>
I created a few version tracking files. There is now
</p>

<ul>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/registry.py">one for proofreader versions</a></li>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/models/tokenizer/registry.py">one for tokenizer versions</a></li>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/models/fomm/registry.py">one for first order Markov model versions</a></li>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/models/somm/registry.py">one for second order Markov model versions</a></li>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/models/random/registry.py">one for random model versions</a> and</li>
<li> <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/data_registry.py">one for training data corpuses</a></li>
</ul>

<p>
Having separate files for each, distributed across the repo near the
artifacts they are tracking, greatly simplified and robustified
the process of retrieving and collating information to make the
reporting charts. It was also a helpful data modeling exercise,
thinking through which tidbits belong where. For instance, whether
<code>error_threshold</code> is a characteristic of a Markov model or a proofreader.
</p>

<p>
These "registries" lack all of the convenience tooling for automatic
generation and population. I have to manually create each entry.
But the record they provide has already proven quite useful.
</p>
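<p>
To give a flavor of how small such a file can be, here is a hypothetical
sketch of two registries and a helper that collates a reporting row from
them. The layout, field names, and the <code>describe</code> function are
invented for illustration and are not taken from the repo. The values come
from the tables above. Putting <code>error_threshold</code> on the proofreader
is just one possible answer to the question raised earlier.
</p>

```python
# Hypothetical registry layout; the field names are illustrative only.
SOMM_REGISTRY = {
    "somm_04": {
        "description": "sparse 2nd-order Markov",
        "tokenizer": "06",
        "n_books": 416,
    },
    "somm_05": {
        "description": "sparse 2nd-order Markov",
        "tokenizer": "07",
        "n_books": 416,
    },
}

PROOFREADER_REGISTRY = {
    "09": {"model": "somm_04", "error_threshold": 1e-8},
    "10": {"model": "somm_05", "error_threshold": 1e-8},
}


def describe(version):
    """Collate one reporting row from the two registries."""
    proofreader = PROOFREADER_REGISTRY[version]
    model_name = proofreader["model"]
    model = SOMM_REGISTRY[model_name]
    return {"version": version, **proofreader, **model}


row = describe("09")
```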

<h2><a id="New-eval:-Grammar"></a><a href="#New-eval:-Grammar">New eval: Grammar</a></h2>

<p>
I will continue to eat my vegetables and add a new eval at each stage.
This time it is grammar. Things like verb tense, pronoun-antecedent
agreement, and preposition choice. As I went about injecting errors
into sample text, I found it hard to distinguish between purely
grammatical errors and errors of word choice, another category I planned
to create an eval for. As a result the grammar eval also includes
many examples of erroneous word choice.
</p>

<p>
Nothing jumped out to me about the models' performance on the grammar eval.
The models are all hypersensitive and fire off false positives far
too often to be anything approaching useful. But after this round of
machinations, there is a stable of 13 models and four evals, summarized
thus:
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>capitalization</th>
      <th>grammar</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>01</td>
      <td>(10) 18</td>
      <td>(12) 16</td>
      <td>(12) 24</td>
      <td>(6) 12</td>
    </tr>
    <tr>
      <td>02</td>
      <td>(7) 98</td>
      <td>(8) 100</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
    <tr>
      <td>03</td>
      <td>(6) 100</td>
      <td>(7) 96</td>
      <td>(6) 94</td>
      <td>(6) 98</td>
    </tr>
    <tr>
      <td>04</td>
      <td>(8) 100</td>
      <td>(10) 100</td>
      <td>(8) 98</td>
      <td>(8) 98</td>
    </tr>
    <tr>
      <td>05</td>
      <td>(8) 100</td>
      <td>(8) 94</td>
      <td>(7) 90</td>
      <td>(7) 96</td>
    </tr>
    <tr>
      <td>06</td>
      <td>(4) 100</td>
      <td>(5) 100</td>
      <td>(4) 100</td>
      <td>(4) 100</td>
    </tr>
    <tr>
      <td>07</td>
      <td>(10) 96</td>
      <td>(12) 98</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>08</td>
      <td>(6) 100</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
      <td>(6) 100</td>
    </tr>
    <tr>
      <td>09</td>
      <td>(6) 100</td>
      <td>(7) 100</td>
      <td>(6) 98</td>
      <td>(6) 100</td>
    </tr>
    <tr>
      <td>10</td>
      <td>(7) 100</td>
      <td>(8) 100</td>
      <td>(7) 94</td>
      <td>(7) 100</td>
    </tr>
    <tr>
      <td>11</td>
      <td>(10) 96</td>
      <td>(12) 98</td>
      <td>(9) 94</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>12</td>
      <td>(11) 98</td>
      <td>(11) 92</td>
      <td>(9) 88</td>
      <td>(10) 100</td>
    </tr>
    <tr>
      <td>13</td>
      <td>(17) 92</td>
      <td>(15) 78</td>
      <td>(12) 74</td>
      <td>(16) 100</td>
    </tr>
 </tbody>
</table>


<p>
Performance values are shown as: (precision %) recall %
</p>

<table>
  <thead>
    <tr>
      <th>version</th>
      <th>model</th>
      <th>description</th>
      <th>tokenizer</th>
      <th>alphabet</th>
      <th>books</th>
      <th>error cutoff</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>01</td>
      <td>random_00</td>
      <td>random baseline</td>
      <td>None</td>
      <td>0k</td>
      <td>0</td>
      <td>0.05</td>
    </tr>
    <tr>
      <td>02</td>
      <td>fomm_00</td>
      <td>1st-order Markov</td>
      <td>00</td>
      <td>20k</td>
      <td>10</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>03</td>
      <td>somm_00</td>
      <td>2nd-order Markov</td>
      <td>04</td>
      <td>1k</td>
      <td>10</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>04</td>
      <td>fomm_01</td>
      <td>1st-order Markov</td>
      <td>00</td>
      <td>20k</td>
      <td>20</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>05</td>
      <td>somm_01</td>
      <td>2nd-order Markov</td>
      <td>04</td>
      <td>1k</td>
      <td>20</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>06</td>
      <td>somm_02</td>
      <td>sparse 2nd-order Markov</td>
      <td>00</td>
      <td>20k</td>
      <td>20</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>07</td>
      <td>fomm_02</td>
      <td>1st-order Markov</td>
      <td>05</td>
      <td>20k</td>
      <td>416</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>08</td>
      <td>somm_03</td>
      <td>sparse 2nd-order Markov</td>
      <td>05</td>
      <td>20k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>09</td>
      <td>somm_04</td>
      <td>sparse 2nd-order Markov</td>
      <td>06</td>
      <td>10k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>10</td>
      <td>somm_05</td>
      <td>sparse 2nd-order Markov</td>
      <td>07</td>
      <td>5k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>11</td>
      <td>fomm_03</td>
      <td>sparse 1st-order Markov</td>
      <td>05</td>
      <td>20k</td>
      <td>416</td>
      <td>0.0005</td>
    </tr>
    <tr>
      <td>12</td>
      <td>somm_06</td>
      <td>sparse 2nd-order Markov</td>
      <td>08</td>
      <td>2k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
    <tr>
      <td>13</td>
      <td>somm_07</td>
      <td>sparse 2nd-order Markov</td>
      <td>09</td>
      <td>1k</td>
      <td>416</td>
      <td>1e-08</td>
    </tr>
 </tbody>
</table>


<h2><a id="Next-steps"></a><a href="#Next-steps">Next steps</a></h2>

<p>
This round of improvements pushed the Artisanal Language Model for
proofreading ahead a little, but there is still much to do. The next
obvious step is to get a lot more data, but in a way that is consistent
with the goals of ALMs. And after that explore the next round of models.
</p>
    ]]></description>
  </item>


  <item>
    <title>
    Artisanal Language Model v3: Second-order Markov model
    </title>
    <link>
    https://brandonrohrer.org/alms_somm.html
    </link>
    <pubDate>
    Sat, 06 Jun 2026 04:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_somm.html
    </guid>
    <description><![CDATA[
            

<p>
The most straightforward way to extend the first-order Markov model is
to add an order and make it a second-order model. This means that instead
of matching token pairs it matches token triples. It's like using the
two most recent words to predict the third. It's surprisingly effective.
How would you complete these phrases?
</p>

<ul>
<li> sugar and ...</li>
<li> big bad ...</li>
<li> PB and ...</li>
<li> Guns and ...</li>
</ul>

<p>
You probably said "big bad wolf" (possibly "big bad bear" or
"big bad dude") but it probably wasn't "big bad artichoke" and
I'm certain it wasn't "big bad consider". The pattern set by two words
is stronger than the pattern set by one. Tokens aren't quite the same thing.
They are sometimes just parts of words and can include symbols and punctuation.
But the idea still holds.
</p>

<p>
Similar to first-order models, the occurrences of all three-token sequences
are counted up.
To get from counts to probabilities, the sequence count is divided by total
two-token prefix count. The probability that the next token after A and B
will be C is given by
</p>

<p>
P(ABC | AB) = count(ABC) / count(AB)
</p>

<p>
In pseudo-Python the count array is initialized as a three-dimensional array
of zeros
</p>

<p>
<pre>
sequence_counts = zeros(dictionary_size, dictionary_size, dictionary_size)
</pre>
</p>

<p>
and the counts are collected as the code walks the length of token ids
that make up the full tokenized training corpus.
</p>

<p>
<pre>
for i_pair in range(len(token_ids) - 2):
    sequence_counts[
        token_ids[i_pair],
        token_ids[i_pair + 1],
        token_ids[i_pair + 2]
    ] += 1
</pre>
</p>

<p>
To get the probabilities, the counts are summed across all final tokens (axis 2)
to get the total count of occurrences of the two-token prefix of the sequence.
Then the count of each sequence is divided by the count of its prefix.
</p>

<p>
<pre>
# keepdims=True keeps the summed axis so the division lines up with the final token
sequence_probabilities = sequence_counts / sum(sequence_counts, axis=2, keepdims=True)
</pre>
</p>
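<p>
For anyone who wants to run the steps above, here is one way they might map
onto NumPy, using a toy dictionary of four tokens. This is a sketch under
assumptions, not the project code. The token ids are made up, and the handling
of unseen prefixes (leaving their probabilities at zero) is a choice made
here for illustration.
</p>

```python
import numpy as np

dictionary_size = 4  # tiny, so the cubic array stays small
token_ids = [0, 1, 2, 0, 1, 3, 0, 1, 2]

sequence_counts = np.zeros((dictionary_size,) * 3)
for i in range(len(token_ids) - 2):
    sequence_counts[token_ids[i], token_ids[i + 1], token_ids[i + 2]] += 1

# keepdims=True keeps the prefix totals as shape (n, n, 1) so they
# broadcast across the final-token axis.
prefix_counts = sequence_counts.sum(axis=2, keepdims=True)

# Unseen prefixes have a total of zero; leave their probabilities at zero.
sequence_probabilities = np.divide(
    sequence_counts,
    prefix_counts,
    out=np.zeros_like(sequence_counts),
    where=prefix_counts > 0,
)
```

<p>
In this toy sequence the prefix (0, 1) appears three times, followed twice
by 2 and once by 3, so those two probabilities come out as 2/3 and 1/3.
</p>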

<h2><a id="Memory-management"></a><a href="#Memory-management">Memory management</a></h2>

<p>
Working with a three-token sequence means that the count and probability
arrays become three-dimensional. That means that the total number of
parameters jumps from the square of the tokenizer dictionary size
to its cube. What was a manageable 20,000 squared (400 million) is now
20,000 cubed (8 trillion), and my laptop doesn't have the 32 TB of RAM to handle
that many parameters.
</p>

<p>
With some trial and error I found that a dictionary size of 1,000, resulting
in a 1 billion element array, is something I can handle with a handful
of GB of memory. That means that the representation will be quite different
than it would be for a 20,000-element dictionary tokenizer. But we can
still run it on the evals and see how it stacks up.
</p>
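<p>
The memory arithmetic is easy to reproduce. The sketch below assumes 4 bytes
per array element, which is what the 32 TB figure for 8 trillion elements
implies. The function name is made up for illustration.
</p>

```python
BYTES_PER_ELEMENT = 4  # assumes 32-bit values, which matches the 32 TB figure


def dense_array_bytes(dictionary_size, order):
    """Bytes needed for a dense count array of an order-th order model."""
    n_elements = dictionary_size ** (order + 1)
    return n_elements * BYTES_PER_ELEMENT


first_order_20k = dense_array_bytes(20_000, 1)   # 1.6e9 bytes, 1.6 GB
second_order_20k = dense_array_bytes(20_000, 2)  # 3.2e13 bytes, 32 TB
second_order_1k = dense_array_bytes(1_000, 2)    # 4e9 bytes, 4 GB
```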

<h2><a id="Performance"></a><a href="#Performance">Performance</a></h2>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>capitalization</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>(5) 8</td>
      <td>(8) 16</td>
    </tr>
    <tr>
      <td>FOMM_00</td>
      <td>proof_02</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
    <tr>
      <td>SOMM_00</td>
      <td>proof_03</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
    </tr>
 </tbody>
</table>


<p>
Even with 1/20th the dictionary, the second-order model performs
on par with the first-order model. The 98% and 100% recall, coupled with
the 6% precision, indicates that both the models are under-trained.
Most sequences they encounter are entirely novel to them, and so get
tagged as errors. They lack the experience (training data)
to competently distinguish correct from incorrect writing.
</p>

<h2><a id="Doubling-the-training-data"></a><a href="#Doubling-the-training-data">Doubling the training data</a></h2>

<p>
Currently, the models are all training on ten novels. It's telling
to create new versions of the models that have been trained on an
additional ten novels and compare the results.
</p>

<p>
I added the next ten most popular novels from the Project Gutenberg
<a href="https://www.gutenberg.org/ebooks/bookshelf/638">Science Fiction and Fantasy</a>
category:
</p>

<ul>
<li> The King in Yellow by Robert W. Chambers</li>
<li> Thuvia, Maid of Mars by Edgar Rice Burroughs</li>
<li> A Honeymoon in Space by George Chetwynd Griffith</li>
<li> The Misplaced Battleship by Henry Harrison</li>
<li> Gulliver's Travels by Jonathan Swift</li>
<li> A Connecticut Yankee in King Arthur's Court by Mark Twain</li>
<li> A Journey to the Center of the Earth by Jules Verne</li>
<li> A Midsummer Night's Dream by William Shakespeare</li>
<li> The Sex Life of Gods by M. E. Knerr</li>
<li> The Crack of Doom by Robert Cromie</li>
</ul>

<p>
and I added a volume to the evaluation data collection
</p>

<ul>
<li> The Call of Cthulhu by H. P. Lovecraft</li>
</ul>

<p>
Some of these I have never heard of before, unlike the first batch, all of
which I was familiar with.
</p>

<p>
The resulting first-order model with double training data I numbered
version 4 and assigned the second-order model version number 5.
</p>

<p>
The results on the doubled training data were underwhelming.
Recall mostly held constant, and precision ticked up by a single percentage
point. More training data is useful with these models, but it appears that it
is going to take a metric ton of the stuff to make a dent in performance
using these models.
</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>capitalization</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>(8) 12</td>
      <td>(10) 16</td>
    </tr>
    <tr>
      <td>FOMM_00</td>
      <td>proof_02</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
    <tr>
      <td>SOMM_00</td>
      <td>proof_03</td>
      <td>(6) 100</td>
      <td>(6) 98</td>
    </tr>
    <tr>
      <td>FOMM_01</td>
      <td>proof_04</td>
      <td>(8) 100</td>
      <td>(8) 98</td>
    </tr>
    <tr>
      <td>SOMM_01</td>
      <td>proof_05</td>
      <td>(8) 100</td>
      <td>(7) 96</td>
    </tr>
 </tbody>
</table>


<h2><a id="Adding-in-an-eval-for-punctuation"></a><a href="#Adding-in-an-eval-for-punctuation">Adding in an eval for punctuation</a></h2>

<p>
Writing evals is somewhat tedious so I'm spreading out the work,
adding them in gradually. Here's what the updated results look like
with punctuation added in.
</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>capitalization</th>
      <th>punctuation</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>(12) 26</td>
      <td>(14) 20</td>
      <td>(13) 24</td>
    </tr>
    <tr>
      <td>FOMM_00</td>
      <td>proof_02</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
    <tr>
      <td>SOMM_00</td>
      <td>proof_03</td>
      <td>(6) 100</td>
      <td>(6) 94</td>
      <td>(6) 98</td>
    </tr>
    <tr>
      <td>FOMM_01</td>
      <td>proof_04</td>
      <td>(8) 100</td>
      <td>(8) 98</td>
      <td>(8) 98</td>
    </tr>
    <tr>
      <td>SOMM_01</td>
      <td>proof_05</td>
      <td>(8) 100</td>
      <td>(7) 90</td>
      <td>(7) 96</td>
    </tr>
 </tbody>
</table>


<p>
The precision numbers for punctuation are consistent with the others,
showing a huge number of false positives consistent with an under-trained
language model. But the recall numbers are lower for the second-order
Markov models, which also have a smaller tokenizer dictionary and shorter
tokens. This suggests that some inaccuracies in punctuation only
become clear when more context is taken into account. A missing comma,
or an extra one, requires understanding the flow of a sentence.
I'll get to test this
in future versions that incorporate more context, more tokens.
</p>

<h2><a id="Adding-a-heartbeat"></a><a href="#Adding-a-heartbeat">Adding a heartbeat</a></h2>

<p>
One thing about working with models and data sets that strain the capacity
of your hardware is that they can be sloooow. A psychological trick for
not driving yourself insane is to add a heartbeat to the code. Something,
anything really, that prints to the terminal and gives a sense of continued
progress. For instance, while training models on 20 novels,
I started printing the name of each novel as it came up. Having something
happen every second or two is all you need. It's also a useful indicator
for when something has stopped working entirely. It can save you waiting for 90
minutes before you decide that the code has hung.
</p>
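<p>
A heartbeat can be as small as one <code>print</code> per book. The sketch
below is illustrative only. The function name, the elapsed-time readout, and
the stand-in training step are assumptions, not the project's training loop.
</p>

```python
import time


def train_on_books(book_titles, train_one_book):
    """Train on each book in turn, printing a heartbeat line per book."""
    start = time.monotonic()
    for i_book, title in enumerate(book_titles, start=1):
        train_one_book(title)
        elapsed = time.monotonic() - start
        # flush=True makes the line show up right away, even when piped.
        print(f"[{i_book}/{len(book_titles)}] {title} ({elapsed:.1f} s)", flush=True)
    return len(book_titles)


# Stand-in for the real training step, which is not shown here.
trained = []
n_done = train_on_books(["The King in Yellow", "Gulliver's Travels"], trained.append)
```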

<h2><a id="Next-steps"></a><a href="#Next-steps">Next steps</a></h2>

<p>
It feels like this is progress, but the proofreading performance gains
haven't hit yet. Time to try some new things.
</p>

<ul>
<li> While second-order models seemed to be an improvement over first-order,
the practical limitations of storing an O(n^3) array limited the
tokenizer dictionary size. There are other representations to try that are
more space efficient.</li>
<li> Extending to third- and higher-order Markov models gives more context,
allowing detection of more subtle errors.</li>
<li> It appears that the amount of training data is laughably low. It's
time to think seriously about scaling it up. Doing this without
becoming an indiscriminate data vacuum will require careful thought.</li>
</ul>
    ]]></description>
  </item>


  <item>
    <title>
    Artisanal Language Model v2: First-order Markov model
    </title>
    <link>
    https://brandonrohrer.org/alms_fomm.html
    </link>
    <pubDate>
    Thu, 28 May 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_fomm.html
    </guid>
    <description><![CDATA[

<p>
It would be tempting to jump straight to a transformer-based language
model at this point, but that is a trap. There are a lot of small architectural
decisions to work out yet, and there are some wrinkles that still need
to be smoothed. Starting with simpler models helps to work through those
much more quickly. In simpler models
there are far fewer places for bugs to hide. Also,
we would be robbing ourselves of the satisfaction of seeing just how weak
the simple models are and how much performance we are buying by using
more computationally expensive models.
</p>

<h2><a id="How-to-train-a-Markov-model"></a><a href="#How-to-train-a-Markov-model">How to train a Markov model</a></h2>

<p>
The next simplest model I can think of is a first-order Markov model
(a.k.a. <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chain</a>).
It is the first baby step in the journey toward sophistication. The concept
behind it is that each token can be used to predict what the next token is likely
to be. There is a deeper dive into the mechanics as part of
<a href="https://brandonrohrer.at/transformers.html#markov_chain">this post on transformers</a>
but the most important nugget to understand is that knowing the current token
gives a probability for each token that follows.
</p>

<p>
For example, let's say the word <em>jellybean</em> appears frequently in the training
corpus, and that there are two tokens involved, one for <em>jelly</em> and one
for <em>bean</em>. In the future when the model encounters the token <em>jelly</em>
it will assign a high probability to the following token being <em>bean</em>.
If it has also seen <em>jellyroll</em>, <em>jelly jar</em>, and <em>jellybelly</em>, then it will
also assign some probability to <em>roll</em>, <em>_jar</em>, and <em>belly</em>. But the more
often a sequence is observed, the more likely that sequence is predicted to be.
</p>

<p>
For brevity, I'll use short token names A, B, C, etc. instead of the text they
represent, <em>jelly</em>, <em>bean</em>, <em>belly</em>, etc. Markov models can be learned in a
straightforward way. Any time a two-token sequence is observed, note it.
Keep a running count of how many times it occurs in the training corpus.
For sequences starting with A, the counts might end up looking like AA=2,
AB=17, AC=0, AD=5, AE=1. In this example, the token A occurred a total of
25 times in the training data. 17 of those were followed by token B,
5 of those were followed by token D, etc.
</p>

<p>
To get from counts to probabilities, divide each pair's count by the count
of its first token. The probability that the next token after A will be B is
given by
</p>

<p>
P(AB|A) = count(AB) / count(A) = 17 / 25 = 68%
</p>
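
<p>
As a quick check of the arithmetic, here is a small sketch that reproduces
the example. The counts are the ones listed above. The variable names are
mine, chosen for illustration only.
</p>

```python
# Counts of pairs starting with token A, taken from the example above.
pair_counts_from_a = {"AA": 2, "AB": 17, "AC": 0, "AD": 5, "AE": 1}

# Every occurrence of A that has a successor starts exactly one pair,
# so the pair counts add up to the count of A.
count_a = sum(pair_counts_from_a.values())

# Divide each pair count by the count of its first token.
probabilities = {
    pair: count / count_a for pair, count in pair_counts_from_a.items()
}

print(count_a)
print(probabilities["AB"])
```

<p>
The probabilities for all the pairs starting with A add up to one, which is
a handy sanity check.
</p>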

<h2><a id="Implementation"></a><a href="#Implementation">Implementation</a></h2>

<p>
The bulk of the computing consists of crawling through the text and noting each
time every token pair occurs. This is most efficiently done in a single pass,
and incrementing the count of each pair as it occurs. A straightforward
way to track these is in a two-dimensional array, where each row
represents the first token in the pair and each column represents the second.
This implies that the array will need as many rows and columns as there
are tokens in the dictionary. For a dictionary size <em>n</em>, the count
array will have <em>n^2</em> elements. For <em>n</em> = 1000, this works out to one
million elements. This won't even make your RAM break a sweat. Even for
<em>n</em> = 10,000, this works out to 100 million elements, which is less than
1GB of memory. At the <em>n</em> = 100,000 point (ten billion elements), you
start to need a beefed up workstation and to be thoughtful about how
you operate on the array. Luckily the whole promise of ALMs is that small
is beautiful.
</p>
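
<p>
The memory estimates can be sketched in a few lines. This assumes each count
takes 8 bytes (a 64-bit number), which is my assumption and not something the
implementation dictates. A smaller data type would shrink everything
proportionally. The function name is made up for this example.
</p>

```python
BYTES_PER_COUNT = 8  # assumption: 64-bit counts; 32-bit counts would halve this


def count_array_gigabytes(dictionary_size, bytes_per_count=BYTES_PER_COUNT):
    """Approximate size of an n-by-n count array in gigabytes (10^9 bytes)."""
    n_elements = dictionary_size ** 2
    return n_elements * bytes_per_count / 1e9


for n in (1_000, 10_000, 100_000):
    print(f"n = {n:,}: {n ** 2:,} elements, {count_array_gigabytes(n):g} GB")
```

<p>
Under that assumption, <em>n</em> = 10,000 comes to about 0.8 GB and
<em>n</em> = 100,000 comes to about 80 GB.
</p>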

<p>
As of this writing the proofreading ALM is working with
a dictionary size of 20,000, which gives the first-order Markov model
400 million elements. Nothing burdensome.
Serialized with <code>pickle</code>, this saves to a 4.8GB file, and that's without
pulling out any tricks for compression or sparse coding.
</p>

<p>
You can follow along in
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi/models/fomm/fomm.py">the Python implementation</a>.
</p>

<p>
In pseudo-Python the count array is initialized as zeros
</p>

<p>
<pre>
pair_counts = zeros(dictionary_size, dictionary_size)
</pre>
</p>

<p>
and the counts are collected as the code walks the list of token ids
that make up the full tokenized training corpus.
</p>

<p>
<pre>
for i_pair in range(len(token_ids) - 1):
    pair_counts[token_ids[i_pair], token_ids[i_pair + 1]] += 1
</pre>
</p>

<p>
To get the probabilities, the counts are summed across all columns (axis 1)
to get the total count of occurrences of the first token of the pair.
Then the count of each pair is divided by the count of its first token.
</p>

<p>
<pre>
pair_probabilities = pair_counts / sum(pair_counts, axis=1, keepdims=True)
</pre>
</p>

<p>
In the actual implementation, a probability floor is added in, just so the
predicted probability never comes out to be zero. Systems can behave
strangely when something occurs that the model believed to be impossible.
</p>
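
<p>
Pulling the pieces together, here is a minimal runnable sketch using NumPy.
It is not the actual implementation linked above. The toy corpus, the floor
value, and the choice to apply the floor with <code>np.maximum</code> are all
assumptions made for illustration.
</p>

```python
import numpy as np

dictionary_size = 4                    # toy vocabulary, for illustration only
token_ids = [0, 1, 2, 1, 2, 3, 0, 1]   # toy tokenized corpus
PROBABILITY_FLOOR = 1e-6               # assumed value; the real floor may differ

# Count each adjacent token pair in a single pass.
pair_counts = np.zeros((dictionary_size, dictionary_size))
for i_pair in range(len(token_ids) - 1):
    pair_counts[token_ids[i_pair], token_ids[i_pair + 1]] += 1

# Row sums give how often each token appeared as the first of a pair.
# keepdims=True keeps the shape (n, 1), so each row is divided by its own total.
first_token_counts = pair_counts.sum(axis=1, keepdims=True)

# Guard against dividing by zero for tokens that never start a pair.
pair_probabilities = pair_counts / np.maximum(first_token_counts, 1)

# Apply the floor so that no transition is treated as strictly impossible.
pair_probabilities = np.maximum(pair_probabilities, PROBABILITY_FLOOR)

print(pair_probabilities)
```

<p>
One side effect of flooring this way is that rows no longer sum to exactly
one. For a tiny floor the difference is negligible, but renormalizing each
row afterward is an option if it matters.
</p>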

<p>
Token pairs are also referred to as transitions. This is a Markov-related
term, as Markov models originated to describe transitions from one
state to the next.
</p>

<h2><a id="Spelling-performance"></a><a href="#Spelling-performance">Spelling performance</a></h2>

<p>
Now that the model is trained, the burning question is how it compares
to the random baseline on the spelling evals.
</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>precision</th>
      <th>recall</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>0.10</td>
      <td>0.22</td>
    </tr>
    <tr>
      <td>first-order Markov model</td>
      <td>proof_02</td>
      <td>0.07</td>
      <td>0.98</td>
    </tr>
 </tbody>
</table>


<p>
Because the random baseline is random, it varies from run to run, and this
is just one sample run, but this gives a rough sense of how the first-order
Markov model compares. Precision is lower, but not dramatically so.
Recall, however, is dramatically higher. This means that the first-order
Markov model calls a lot of false alarms, but it finds nearly all of
the actual errors in the process. That shows progress! And room for
improvement.
</p>

<h2><a id="Adding-in-Capitalization-errors"></a><a href="#Adding-in-Capitalization-errors">Adding in Capitalization errors</a></h2>

<p>
Now is a good time to add another dimension to the evaluation.
I added a
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/evals/capitalization.py">capitalization evaluation dataset</a>,
structured similarly to the spelling evaluation set, with 50 errors
scattered throughout 5 paragraphs that the model was not trained on.
Both models' performance on these was very similar to their performance on the spelling eval.
</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>precision</th>
      <th>recall</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>0.09</td>
      <td>0.16</td>
    </tr>
    <tr>
      <td>first-order Markov model</td>
      <td>proof_02</td>
      <td>0.07</td>
      <td>0.98</td>
    </tr>
 </tbody>
</table>


<h2><a id="Automatically-generating-a-table"></a><a href="#Automatically-generating-a-table">Automatically generating a table</a></h2>

<p>
Having two models and two evals, it's time to up the reporting game.
Keeping in mind that this will eventually need to extend to 6 evals
and an unknown number of models, some table reformatting helps
keep it compact. Also, because manually compiling a larger table
will be tedious and error prone, the <code>report_results()</code> function in
the <a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/evals/eval_proofreading.py">eval_proofreading.py</a>
module got an upgrade, so that it generates the table in markdown format
automatically when <code>uv run report_results.py</code> is run.
</p>

<p>
Here is everything in one place.
</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th>name</th>
      <th>capitalization</th>
      <th>spelling</th>

    </tr>
  </thead>
  <tbody>
    <tr>
      <td>random</td>
      <td>proof_01</td>
      <td>(9) 16</td>
      <td>(12) 22</td>
    </tr>
    <tr>
      <td>FOMM</td>
      <td>proof_02</td>
      <td>(7) 98</td>
      <td>(7) 98</td>
    </tr>
 </tbody>
</table>


<p>
where the results for each model on each eval are shown as
<code>(precision %) recall %</code>. It's worth noting that false negatives
(failing to report an error) are worse than false positives (flagging
something that isn't an error). That implies that having a high recall
is more important than having a high precision, if we're forced to choose.
That said, single-digit precision means that more than 9 in 10
reported errors are not actually errors. The model is currently The Boy
Who Cried Wolf. It reports so many errors incorrectly that it is
not useful.
</p>
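
<p>
To give a feel for what generating such a table involves, here is a rough
sketch. It is not the <code>report_results()</code> function from the
repository. The function names and the shape of the <code>results</code>
data are hypothetical, and the numbers are the FOMM results reported above.
</p>

```python
def format_cell(precision, recall):
    """Render one result as '(precision %) recall %' in whole percents."""
    return f"({round(precision * 100)}) {round(recall * 100)}"


def results_to_markdown(results, eval_names):
    """Build a markdown table with one row per model and one column per eval."""
    header = "| model | name | " + " | ".join(eval_names) + " |"
    divider = "|" + " --- |" * (2 + len(eval_names))
    rows = [header, divider]
    for model, name, scores in results:
        cells = [format_cell(*scores[eval_name]) for eval_name in eval_names]
        rows.append(f"| {model} | {name} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Hypothetical data structure: (model, name, {eval: (precision, recall)}).
results = [
    ("FOMM", "proof_02",
     {"capitalization": (0.07, 0.98), "spelling": (0.07, 0.98)}),
]
table = results_to_markdown(results, ["capitalization", "spelling"])
print(table)
```

<p>
Adding a model means adding a row to <code>results</code>, and adding an eval
means adding a name to the list, which is what keeps the table manageable as
both grow.
</p>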

<h2><a id="Next-steps"></a><a href="#Next-steps">Next steps</a></h2>

<p>
While the FOMM's recall numbers are excellent, the precision is abysmal.
It is spraying error detections machine gun-style, hitting the targets,
but also hitting everything else in the process. To get precision
higher there are a few things to try.
</p>

<ul>
<li> A more sophisticated model. It's possible a second-order Markov model
will be an improvement.</li>
<li> A larger training data set. Inspection of the model shows that the reason
so many errors are flagged is that there are a lot of token pairings that
the model has never run across in the training corpus. Ten books is a
hilariously small training set for a language model. Doubling this
and seeing what happens will be an interesting experiment.</li>
<li> Adding more evals. Now that the structure for the evals seems to be
stabilizing it makes sense to invest the time in creating evaluation
data sets for a few more of them. Each eval category is qualitatively
different and may give new insights into the strengths and weaknesses
of each model.</li>
</ul>

    ]]></description>
  </item>


  <item>
    <title>
    Artisanal Language Models: Build an end-to-end v0 prototype
    </title>
    <link>
    https://brandonrohrer.org/alms_end_to_end.html
    </link>
    <pubDate>
    Thu, 30 Apr 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_end_to_end.html
    </guid>
    <description><![CDATA[
            


<p>
When building an initial prototype, its only purpose is to exist,
to hold space, to help us think through how everything fits together.
The worse the performance, the better. It serves as a rock bottom baseline
against which we can measure all future improvements.
</p>

<p>
That means that it isn’t important to think about making the individual
pieces work well, but it <em>is</em> important to think about how they fit
together and make sure they work together smoothly. It’s more about
defining structure and interfaces than anything else.
</p>

<h2><a id="The-process"></a><a href="#The-process">The process</a></h2>

<p>
It wasn’t pretty. Directories got created, renamed, moved, moved back,
and deleted. Functions that were in one module got moved to another.
Modules got combined and split. The layout of the tests changed three times
and I’m still not entirely happy with it. During the process, I realized
how little I understood what a well structured package should look like
and took a detour to write myself
<a href="https://brandonrohrer.org/python_packaging.html">this guide</a>.
This is all somewhat normal for creating a new project,
but it’s important to note that the
project wasn’t born
<a href="https://codeberg.org/brohrer/proofread-eng-scifi">looking this way</a>.
It evolved.
</p>

<p>
<img alt="Project structure for proofread-eng-scifi" src="https://raw.githubusercontent.com/brohrer/blog_images/refs/heads/main/alms_task/project_structure.png">
</p>

<h2><a id="The-structure"></a><a href="#The-structure">The structure</a></h2>

<p>
The resulting project repository is built around
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/branch/main/src/proofread_eng_scifi">the package code</a>
at <code>src/proofread_eng_scifi</code>.
Outside the main package code, there is a <code>data</code> directory with a copy of
the complete training data and an <code>evals</code> directory with the evaluation code.
At the moment, all of the tests are low level unit tests, and are scattered
throughout the package directories to be as close as possible to the modules
they are testing. For a small project like this, it’s not a burden to
include them in the package distribution. Saving the extra step of
navigating through the directory tree to find them is a nice convenience.
</p>

<p>
The package code itself consists of proofreading modules at the top level
(<code>proof_00.py</code>, <code>proof_01.py</code>) which provide entry points to running
the proofreader, and subdirectories for tests and models. I decided to go
with a clunky, explicitly numbered versioning approach:
<code>filename_01</code>, <code>filename_02</code>, etc. This makes it
straightforward to manage which versions of which components are being used.
This approach would get messy if it were a large codebase but again, since
it’s so small, we can get away with it.
</p>

<p>
The <code>models</code> directory currently contains the <code>tokenizer</code> model and the
<code>random</code> language model. The tokenizer,
<a href="https://brandonrohrer.org/alms_tokenizer.html">discussed previously</a>,
breaks the text into discrete pieces and numbers them. It also has numbered
versions as well as a short script to train each version.
</p>

<p>
The random language model provides an extremely rough baseline against which
future models can be measured. It randomly "estimates" the probability of
each token occurring. These are all nonsense estimates, but they have
the right shape to be used in our version 01 proofreader prototype.
</p>
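
<p>
A random language model of this kind can be sketched in a few lines. This is
my own minimal illustration, not the code in <code>lm_random_00.py</code>.
The function name and the optional <code>seed</code> argument are
assumptions.
</p>

```python
import random


def random_token_probabilities(token_ids, seed=None):
    """Return one made-up probability in [0, 1) for each token id."""
    rng = random.Random(seed)
    return [rng.random() for _ in token_ids]


print(random_token_probabilities([5, 9, 2], seed=0))
```

<p>
The output has one number per token, which is all the proofreader needs in
order to be wired up and exercised.
</p>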

<p>
Returning to the proofreader modules, version 01 is the one that uses the
random language model. It uses a command line interface where the name
of the text file to be proofread is supplied, for example
</p>

<p>
<pre>
uv run proof_01.py to_be_proofread.txt
</pre>
</p>

<p>
Then it returns a list of detected errors together with their
character positions in the original text file. This again illustrates
the "start with something horrible and make it better later" approach.
It’s clearly inadequate from a user interface perspective, but it does give
some scaffold from which to hang the rest of the prototype.
</p>

<p>
The proofreader code itself imports the pre-trained tokenizer, as well as
the random language model, reads in the file requested, tokenizes it,
and gets a probability assigned to each token. The proofreader then
implements an arbitrary cutoff of 0.05, and any token with a probability
below this is tagged as an error. It also includes some report metrics
like how many errors were detected and how many errors per thousand
characters, and all of these are displayed in the console.
</p>
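
<p>
The thresholding step can be sketched like this. The 0.05 cutoff comes from
the description above. Everything else, including the function names and the
idea of carrying a character span alongside each token, is an assumption
made for the example.
</p>

```python
CUTOFF = 0.05  # the arbitrary cutoff described above


def flag_errors(token_spans, probabilities, cutoff=CUTOFF):
    """Return the (start, end) span of every token whose probability is below the cutoff."""
    return [
        span for span, p in zip(token_spans, probabilities) if cutoff > p
    ]


def errors_per_thousand_characters(n_errors, n_characters):
    """Error density, scaled to a thousand characters of text."""
    return 1000 * n_errors / n_characters


text = "The quick brwn fox"
token_spans = [(0, 3), (4, 9), (10, 14), (15, 18)]  # hypothetical token positions
probabilities = [0.4, 0.2, 0.01, 0.3]               # hypothetical model output

flagged = flag_errors(token_spans, probabilities)
for start, end in flagged:
    print(start, end, repr(text[start:end]))
print(errors_per_thousand_characters(len(flagged), len(text)))
```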

<p>
With important files included, here's what the current structure looks like:
</p>

<p>
<pre>
proofread-eng-scifi
├── LICENSE
├── README.md
├── data
│   ├── alices_adventures_in_wonderland.txt
│   ├── dorian_gray.txt
│   ├── dracula.txt
│   ├── frankenstein.txt
│   ├── hound_of_baskervilles.txt
│   ├── legend_of_sleepy_hollow.txt
│   ├── lost_world.txt
│   ├── ozma_of_oz.txt
│   ├── seven_dials.txt
│   ├── time_machine.txt
│   ├── twenty_thousand_leagues.txt
│   └── war_of_worlds.txt
├── evals
│   ├── eval_proofreading.py
│   ├── spelling.py
│   └── test_evaluation_data.py
├── pyproject.toml
├── src
│   └── proofread_eng_scifi
│       ├── models
│       │   ├── random
│       │   │   ├── lm_random_00.py
│       │   │   └── test_lm_random_00.py
│       │   └── tokenizer
│       │       ├── model_versions
│       │       │   ├── tokenizer_00.model
│       │       │   └── tokenizer_00.vocab
│       │       ├── test_tokenizer_training.py
│       │       ├── tokenizer_tools.py
│       │       └── train_tokenizer_00.py
│       ├── proof_00.py
│       ├── proof_01.py
│       └── tests
│           ├── test_proof_00.py
│           └── test_proof_01.py
└── uv.lock
</pre>
</p>

<p>
This will continue to evolve, but
<a href="https://codeberg.org/brohrer/proofread-eng-scifi/src/commit/37d0d4874a20f2aed7046a16b16205ff836d3fb1">here is a link</a>
to the commit captured in the directory tree above.
</p>

<h2><a id="Integration-with-the-evals"></a><a href="#Integration-with-the-evals">Integration with the evals</a></h2>

<p>
This end-to-end prototype now has just enough substance to it to let us
run the evals against it. The spelling eval
<a href="alms_task.html">described here</a> can call the proofreader through a
function and run each of its five mistake-ridden paragraphs against it.
Because the evals have a ground truth key for the actual mistakes, it can
compare the detected mistakes and find where the hits and misses were.
Any ground truth mistake that was overlapped by at least one detection is
considered detected, a <em>true positive</em>. Similarly any ground truth mistake
that has no overlaps is considered missed, a <em>false negative</em>. And any
detected mistake that doesn’t overlap a ground truth mistake is considered
a <em>false positive</em>. These then get combined to calculate
precision and recall, the metrics of choice for the evals created.
</p>
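
<p>
Here is a small sketch of that scoring logic, following the definitions
above. It is not the code from <code>eval_proofreading.py</code>. It assumes
mistakes and detections are <code>(start, end)</code> character spans with an
exclusive end, and the function names are hypothetical.
</p>

```python
def overlaps(span_a, span_b):
    """True if two (start, end) spans share at least one character."""
    return span_a[1] > span_b[0] and span_b[1] > span_a[0]


def precision_and_recall(ground_truth, detections):
    # A ground truth mistake overlapped by any detection is a true positive.
    true_positives = sum(
        any(overlaps(truth, d) for d in detections) for truth in ground_truth
    )
    # A detection that overlaps no ground truth mistake is a false positive.
    false_positives = sum(
        not any(overlaps(d, truth) for truth in ground_truth)
        for d in detections
    )
    n_reported = true_positives + false_positives
    precision = true_positives / n_reported if n_reported else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


ground_truth = [(10, 14), (30, 35)]
detections = [(12, 13), (20, 22), (50, 55)]
print(precision_and_recall(ground_truth, detections))
```

<p>
In this toy case one of the two mistakes is found and two of the three
detections are false alarms, giving a precision of one third and a recall of
one half.
</p>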

<p>
It’s worth noting that neither precision nor recall is zero.
Precision is typically 8-12% and recall is typically 15-25%.
The random baseline sometimes gets it right.
</p>

<h2><a id="Random-baseline"></a><a href="#Random-baseline">Random baseline</a></h2>

<p>
For most models, a safe place to start a baseline is a random number generator.
Instead of making predictions that are intelligent in any way, just roll
the dice each time. This is where we started with our spellchecker.
For some tasks and performance metrics this will give near zero, but for
others it could be much higher. For example, a random baseline on a true/false
quiz would be about 50%. Knowing where the performance floor lies is
helpful when it's time to interpret the performance numbers of future models.
</p>

<h2><a id="Consistency"></a><a href="#Consistency">Consistency</a></h2>

<p>
Consistent style makes code easier to read and helps bugs to be more visible.
I aim to keep my code mostly consistent and mostly in line with
the recommendations of <a href="https://peps.python.org/pep-0008/">PEP 8</a>,
but if I have good reason to deviate I don't hesitate. For example,
in most places I put test files in the same directory as the modules they
test, but in <code>src/proofread_eng_scifi</code> I opted to put them in their
own subdirectory. Because of how I foresee the project evolving, this
made the most sense to me.
</p>

<p>
I've seen some
wildly verbose and incomprehensible code that resulted from blindly
keeping to consistent conventions.
As Ralph Waldo Emerson notes in his seminal essays on coding standards,
"A foolish consistency is the hobgoblin of little minds". This
roughly translates to "use common sense". In particular, think about
who is going to be running and reading your code, and try to make their
lives as easy as possible. Future you will thank you.
</p>

<h2><a id="Next-steps"></a><a href="#Next-steps">Next steps</a></h2>

<p>
Now that there is a skeleton end-to-end prototype in place, the next steps
will be to start improving it piece by piece and re-evaluating it
after each step. There is plenty of room for improvement.
The evals are incomplete, the language model is laughably
bad, the tokenizer has options to experiment with, and the UI needs
a whole lot of polish.
</p>

<p>
Things are just starting to get interesting.
</p>
    ]]></description>
  </item>

  <item>
    <title>
    Python packaging with uv
    </title>
    <link>
    https://brandonrohrer.org/python_packaging.html
    </link>
    <pubDate>
    Fri, 17 Apr 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/python_packaging.html
    </guid>
    <description><![CDATA[


<p>
For a long time, every time I needed to create a new project or
Python package I copied an existing repository. It worked pretty well, but
never perfectly. It always felt like I was wearing someone else’s shoes.
And then when I went to make changes, I realized quickly how little I
understood about how projects, builds, and distributions work.
</p>

<p>
Here are some questions that I’ve had and some answers I've found.
They focus on the uv toolset of environment management and packaging.
This post doesn’t say anything about setuptools, even though setuptools,
setup.py, and setup.cfg
are still present in a lot of well-built projects, especially pre-2025 ones.
</p>

<p>
I expect this list of questions to grow and evolve over time as I learn.
My primary audience is a clueless future me, but I hope it helps you too.
</p>

<h2><a id="How-do-I-name-projects?-Packages?-Modules?-Repositories?"></a><a href="#How-do-I-name-projects?-Packages?-Modules?-Repositories?">How do I name projects? Packages? Modules? Repositories?</a></h2>

<p>
As the saying goes, naming things is one of the hardest problems in
computer science. This is particularly true when it comes to packaging.
There are a lot of different entities to be named, and it’s unclear sometimes
what names belong to which. There is a repository name, a top level directory
name, a project name, and a package name. All of these can be different.
To make matters worse, they can be mistaken for module names and function
names as well.
</p>

<p>
The Python interpreter and build tools have no problem knowing whether
a particular name is supposed to reference the project or the package.
They determine this from context. For human brains, especially the ones
new to packaging,
this can be a lot to keep track of. To save yourself unnecessary hassle,
a good trick is to use the same name for all these things. A brief,
memorable, all-lowercase name is ideal. The one way in which these names
will differ is in how they handle multiword names. For project, repository,
and top level directory names, separate words with a hyphen, as in
<code>my-amazing-tool</code>. For Python
packages and modules separate multiple words with an underscore, for example
<code>my_amazing_tool</code>. This keeps
things consistent with the conventions of the various tools and communities.
</p>

<p>
But don’t worry too much if you feel the need to deviate from this.
Plenty of smart people have differing opinions and it’s a matter
of convention only.
The <a href="https://peps.python.org/pep-0008/#package-and-module-names">PEP 8 recommendation</a>
is to give packages single-word names, without underscores, but this can
be challenging to do in a readable way.  Everyone ignores this.
</p>

<p>
If you plan to distribute it publicly on <a href="https://pypi.org">PyPI</a>,
check it first to make sure the package name isn't already taken.
</p>

<h2><a id="What-are-wheels-and-sdists?"></a><a href="#What-are-wheels-and-sdists?">What are wheels and sdists?</a></h2>

<p>
An sdist is a <em>source distribution</em> and a wheel is a <em>binary distribution</em>.
I have no idea why it's called a wheel. Source is the code itself&mdash;<code>.py</code>
files and their supporting cast. It comes in a single <code>.tar.gz</code> archive
which has to be unzipped with <code>tar -xvf</code> before it can be properly read.
<a href="https://packaging.python.org/en/latest/discussions/package-formats/#what-is-a-source-distribution">Detail on sdists here.</a>
</p>

<p>
The wheel is the compiled version of the source, containing
only the files needed to actually run the code. Because compilation
is platform specific, a wheel is tied to a particular operating system,
processor architecture, and Python version. A single project can have many
wheels if it's meant to be run on many platforms and architectures.
The big caveat here is that Python files don't get pre-built into binaries.
The local Python interpreter does that at runtime. So if it's a purely Python
package, then one wheel is usually sufficient for all platforms and OS's.
<a href="https://packaging.python.org/en/latest/discussions/package-formats/#what-is-a-wheel">Detail on wheels here.</a>
</p>

<h2><a id="What-are-build-tools-and-why-do-they-matter?"></a><a href="#What-are-build-tools-and-why-do-they-matter?">What are build tools and why do they matter?</a></h2>

<p>
Build tools do the work of taking the original files and the information
from <code>pyproject.toml</code> and using them as ingredients for building
the sdist and the wheel.
</p>

<p>
There are two parts to this, a build frontend and a build backend. For the
purposes of this post, the frontend is
<a href="https://docs.astral.sh/uv/concepts/projects/build/#using-uv-build">uv build</a>.
It does some gathering and interpretation of the files and prepares them
for the next step. pip and build are other popular build frontends.
</p>

<p>
There are a few common build backend tools, including hatchling, setuptools,
and uv's own uv_build.
<a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/#choosing-a-backend">Here's a short guide</a>
for choosing between them, but when in doubt hatchling is a good choice.
The backend needs to be called out in pyproject.toml
<a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/#how-the-frontend-finds-the-backend">like this</a>:
</p>

<p>
<pre>
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
</pre>
</p>

<p>
<a href="https://packaging.python.org/en/latest/tutorials/packaging-projects/#choosing-a-build-backend">Here are some examples</a>
for the other backends as well.
</p>

<h2>Why have a <code>src</code> directory?</h2>

<p>
There are two common patterns for organizing projects:
<a href="https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/#src-layout-vs-flat-layout">flat and src layouts</a>.
</p>

<p>
A flat layout is intuitive. The package code sits at the top level of the
project and is more straightforward to access. But because of how imports work
it's easy to lose track of whether your other code, like tests, is
referencing the working copy of your code in the project or a previously
installed version of the package. It can result in maddening bugs.
</p>

<p>
A src layout alleviates this. Because the package sits one level lower,
it's not so easily reachable for direct import. Any import would need
an installed version of the package. Using an editable install makes
sure that your most recent changes to the code are what is run.
</p>

<p>
The src layout gives the benefit of protecting us from ourselves
a bit more. And it comes at the cost of a slightly more complex file
structure and an extra step to ensure an editable install.
</p>
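
<p>
When it's unclear which copy of a package is being imported, one way to find
out is to ask Python where it would load the package from. This is a small
sketch using the standard library's <code>importlib.util.find_spec</code>.
The helper name <code>import_origin</code> is made up, and <code>json</code>
stands in for your own package.
</p>

```python
import importlib.util


def import_origin(package_name):
    """Return the file that importing package_name would load, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    return None if spec is None else spec.origin


# Substitute the name of your own package for "json".
print(import_origin("json"))
```

<p>
With an editable install of a src layout project, the path reported for your
package should typically point into the project's <code>src</code> directory
rather than into <code>site-packages</code>.
</p>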

<h2><a id="How-do-I-make-a-package-visible-across-the-project?"></a><a href="#How-do-I-make-a-package-visible-across-the-project?">How do I make a package visible across the project?</a></h2>

<p>
To make a package visible from other locations in a project
that are outside the package directories, for instance in <code>tests/</code>,
the most reliable way is to do an editable install.
From within the top level directory of the project run
</p>

<p>
<code>uv pip install -e .</code>
</p>

<p>
Modules outside your package shouldn't need <code>__init__.py</code> files
in each directory. But now they will be able to <code>import mypackage</code>
and go to town.
</p>

<p>
Note that it's also totally valid to include <code>tests/</code> within the
package. In that case they will very much need their <code>__init__.py</code>
files. More detail in
<a href="https://docs.pytest.org/en/latest/explanation/goodpractices.html">pytest best practices</a>.
</p>

<h2><a id="How-do-I-add-other-file-types-to-the-package?-And-how-do-I-access-them-from-the-code?"></a><a href="#How-do-I-add-other-file-types-to-the-package?-And-how-do-I-access-them-from-the-code?">How do I add other file types to the package? And how do I access them from the code?</a></h2>

<p>
The easiest way is to include them under the package directory tree.
By default, hatchling includes in the sdist any non-Python files under
the <code>mypackage</code> directory that are not in
<code>.gitignore</code>. This behavior can be arbitrarily modified for both
types of build targets, sdists and wheels.
<a href="https://hatch.pypa.io/1.16/config/build/#file-selection">Examples here.</a>
They can be instructed to include files from outside the project as well.
</p>

<p>
From within the code, these files can be accessed by their absolute path.
The <code>__file__</code> attribute gives the absolute path of a given module.
It can then be modified to point to the data instead. For example for this
structure
</p>

<p>
<pre>
myproject
├── pyproject.toml
└── src
    └── mypackage
        ├── __init__.py
        ├── mymodule.py
        └── data
            └── mydata.json
</pre>
</p>

<p>
within <code>mymodule</code>
</p>

<p>
<pre>import os<br>
# __file__ is this module's own path, so take its directory first
mypackage_abspath = os.path.dirname(os.path.abspath(__file__))
mydata_abspath = os.path.join(mypackage_abspath, 'data')
mydata_absfilename = os.path.join(mydata_abspath, 'mydata.json')
with open(mydata_absfilename, 'rt') as f:
    mydata = f.read()
</pre>
</p>
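
<p>
The same idea can be written with <code>pathlib</code>, which some find
easier to read. This is a sketch, not a recommendation over the
<code>os.path</code> version. The helper name <code>data_file_path</code> is
hypothetical, and the example path is made up to mirror the structure above.
</p>

```python
from pathlib import Path


def data_file_path(module_file, filename):
    """Build the path to a file in the data directory next to a module."""
    package_dir = Path(module_file).resolve().parent
    return package_dir / "data" / filename


# Inside mymodule.py this would be data_file_path(__file__, "mydata.json").
example = data_file_path(
    "/tmp/myproject/src/mypackage/mymodule.py", "mydata.json"
)
print(example)
```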

<h2>What goes into <code>pyproject.toml</code>?</h2>

<p>
While <code>pyproject.toml</code> files are powerful and flexible and can be quite long,
<a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html">a minimal pyproject.toml</a>
contains just some basic project and build system information, like this.
</p>

<p>
<pre>
[build-system]
requires = ['hatchling']
build-backend = "hatchling.build"<br>
[project]
name = 'myproject'
version = '0.1.0'
</pre>
</p>

<p>
There are some other
<a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html#optional-fields-to-include-in-the-project-table">commonly used fields</a>
for the <code>project</code> table, including description, license, authors, and
keywords. The <code>dependencies</code> field lists other packages that yours depends on.
The classifiers field gives a set of standard tags that can help humans
and software tools alike make better use of your package.
<a href="https://pypi.org/classifiers/">The full list of classifiers</a> is lengthy,
but some especially helpful ones are
</p>

<ul>
<li> Development Status</li>
<li> Intended Audience</li>
<li> Topic</li>
<li> Programming Language</li>
</ul>

<h2><a id="Resources"></a><a href="#Resources">Resources</a></h2>

<p>
These are the references that I find most useful when answering these
questions.
</p>

<ul>
<li> <a href="https://packaging.python.org/en/latest/">python.org packaging</a></li>
<li> <a href="https://docs.astral.sh/uv/concepts/projects/">uv project configuration</a></li>
<li> <a href="https://pydevtools.com/handbook/explanation/what-is-a-build-frontend/">build frontends</a></li>
<li> <a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/">build backends</a></li>
<li> <a href="https://hatch.pypa.io/1.16/config/build/">hatch configuration</a></li>
<li> <a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html">pyproject.toml configuration</a></li>
</ul>
    ]]></description>
  </item>

  <item>
    <title>
    Artisanal Language Models: Define a task and write evals
    </title>
    <link>
    https://brandonrohrer.org/alms_task.html
    </link>
    <pubDate>
    Tue, 14 Apr 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_task.html
    </guid>
    <description><![CDATA[
<p>
A defining feature of an ALM is that it is purpose built for a well-defined
task. So far the task has only been described in general terms: proofreading
English prose of a novel draft.
</p>

<p>
It’s time to get more specific about what this ALM will do, what
its inputs and outputs will be, so that I can actually start building it.
It’s also time to define some tests and performance measures.
I’ll need them when I have something running and I make a change;
I’ll need some way to measure whether it got better.
</p>

<h2><a id="Inputs"></a><a href="#Inputs">Inputs</a></h2>

<p>
After the ALM has been trained, I’ll want to feed it text to proofread.
To start with I’ll plan to do this through the simplest and clunkiest way
I can think of: passing it the path to a text file containing the text
to be proofread. In an actual professional product built for users,
this is probably not ideal, but it’s a nice generalizable front door
that slicker user interfaces can be built around in the future.
</p>

<h2><a id="Outputs"></a><a href="#Outputs">Outputs</a></h2>

<p>
After the proofreader has done its job and identified segments that might
need correction, those segments will be reported as positions in
the input text. Specifically, when the text file is read in as a string,
the indices of each segment's starting and ending characters will be used
to tag it for inspection and correction.
Each suspected error will be reported as a pair of indices.
This collection of start/finish pairs will be the output of the proofreader.
</p>
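
<p>
As a rough sketch of that front door, here is one way the path-in,
pairs-out interface could look. The names <code>proofread_file</code> and
<code>find_suspect_segments</code> are hypothetical, the detector is a placeholder
that only flags runs of repeated spaces, and the indices are assumed to be
zero-based and inclusive on both ends.
</p>

<p>
<pre>
import re
import tempfile
from pathlib import Path


def find_suspect_segments(text):
    """Placeholder detector (an assumption, not the real model).

    Flags runs of two or more spaces and returns them as
    (first_char, last_char) pairs, zero-based and inclusive.
    """
    return [(m.start(), m.end() - 1) for m in re.finditer(r" {2,}", text)]


def proofread_file(path):
    """Read a text file in as one string and return the suspect segments."""
    text = Path(path).read_text(encoding="utf-8")
    return find_suspect_segments(text)


# Write a small sample file, then hand its path to the proofreader.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "draft.txt"
    sample_path.write_text("It was a dark  and stormy night.", encoding="utf-8")
    segments = proofread_file(sample_path)

print(segments)  # [(13, 14)]
</pre>
</p>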

<p>
There are a lot of other things that could be enhanced about this to provide
a good user experience, and I leave the door open to add those later.
For instance, the text file could be modified to include special characters
marking the suspect segments. Or a fancier graphical UI could simply
highlight the potential errors or underline them, as is common in
word processors. An even fancier extension could propose corrections
and offer the user a single keystroke way to select from a number
of suggestions. But all of these could be built on top of a pair of
start and end markers for each proofreading note.
</p>
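
<p>
To illustrate how one of those extensions could sit on top of the pairs,
here is a minimal sketch of the marker idea. The function name
<code>mark_segments</code> and the square bracket markers are made up for
illustration, and it assumes the segments don't overlap and use
zero-based, inclusive indices.
</p>

<p>
<pre>
def mark_segments(text, segments, opener="[", closer="]"):
    """Wrap each inclusive (first_char, last_char) segment in markers."""
    marked = text
    # Work from the end of the text backward so that inserting markers
    # doesn't shift the indices of segments that haven't been handled yet.
    for first_char, last_char in sorted(segments, reverse=True):
        marked = (
            marked[:first_char]
            + opener
            + marked[first_char:last_char + 1]
            + closer
            + marked[last_char + 1:]
        )
    return marked


text = "I recieved teh letter."
segments = [(2, 9), (11, 13)]
marked_text = mark_segments(text, segments)
print(marked_text)  # I [recieved] [teh] letter.
</pre>
</p>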

<h2><a id="Evaluation"></a><a href="#Evaluation">Evaluation</a></h2>

<p>
I’ll need a way to evaluate the progress. If I try to enhance my
proofreading model, how will I know if it worked? The variety
of all possible text to be proofread is so large, it’s impossible to test
every variation exhaustively. The best I can hope to do is come up with
a representative sample.
</p>

<p>
This falls somewhere in between the traditional software engineering practice
of testing, where the notion of right and wrong answers and what a function
must do is fairly clear-cut, and benchmarking, which is a
consensus-driven measure for comparing solutions in a broadly recognized way.
It’s what has come to be known in LLM development as evaluations or,
more affectionately, <strong>evals</strong>.
Evals are a reasonable sample of the space in which a language model
is expected to work. They lack the recognition and respect of
a full-blown benchmark, and also lack the rigor and confidence of carefully
designed tests. But despite having the worst of both worlds,
evals operate in a space that we cannot ignore, and for which there is
no better solution that I know of.
</p>

<p>
In practice, evals are organic. They grow to cover new use cases
and new failure modes during the development process. But it’s helpful
to think through a reasonable initial set. For proofreading there
are several classes of errors that are important to cover.
</p>

<ul>
<li> <strong>Spelling</strong>. Febuary. Febrewary. Februry. Fabuwary.</li>
<li> <strong>Word choice</strong>. Catching when it should be "imply" and when it should be "infer".
To/two/too. There/their/they're.</li>
<li> <strong>Punctuation</strong>. Appropriate sentence termination. Comma usage.
Quotation mark usage.</li>
<li> <strong>White space</strong>. Extra spaces. Spaces around punctuation. Weird indents.</li>
<li> <strong>Capitalization</strong>.</li>
<li> <strong>Grammar</strong>. Verb tense. Pronoun-antecedent agreement.
Preposition choice.</li>
</ul>

<p>
These aren’t exhaustive, but there’s no need for evals to be exhaustive
in order to be useful. I will almost certainly add more later as I
discover new categories that aren’t getting picked up well.
But they are a good place to start.
</p>

<h2><a id="Creating-evals-for-each-category"></a><a href="#Creating-evals-for-each-category">Creating evals for each category</a></h2>

<p>
In practice, to test how well a given language model performs in each of
these areas, I’ll need to create an evaluation data set. For each of the
areas above, I’ll pull five paragraphs arbitrarily from an evaluation text
(Frankenstein by Mary Wollstonecraft (Godwin) Shelley) and throw 10 errors
of a given type into each paragraph. Having five paragraphs full of spelling errors
gives a total of 50 spelling errors to detect. Each paragraph will come
with its own answer key, the beginning and end of each word or phrase
containing the error. After the proofreading model processes the paragraph,
the errors it detects will be compared against the ground truth.
</p>

<ul>
<li> A ground truth error that is overlapped by at least one model-detected error
is considered detected (<strong>true positive</strong>). This is not quite the same thing as</li>
<li> A model-detected error that overlaps at least
one ground truth error. This is considered an accurate detection, but there
may be several of these per ground truth error. I can't use this as the
true positive count because it could result in inflated counts.</li>
<li> A model-detected
error that doesn’t overlap a ground truth error will be considered
a <strong>false positive</strong>.</li>
<li> A ground truth error that is not overlapped by at least
one model-detected error will be considered a <strong>false negative</strong>.</li>
</ul>

<p>
<img alt="Examples of true positives, false positives, and false negatives.
" src="https://raw.githubusercontent.com/brohrer/blog_images/refs/heads/main/alms_task/errors_pos_neg.png">
</p>

<p>
<strong>Recall</strong> will be defined as the total number of model-detected ground truth
errors (true positives) over the total number of ground truth errors
(true positives plus false negatives).
</p>

<p>
<strong>Accuracy</strong> will be the total number of model-detected ground truth errors
(true positives) divided by the total number of true positives
plus false positives. (Elsewhere this ratio is usually called precision.)
When there are no true positives or false positives,
accuracy will be undefined.
</p>
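
<p>
The counting rules and the two ratios can be sketched in a few lines of
Python. This is one possible reading, not the project's actual scoring
code: the function names are hypothetical, spans are assumed to be
inclusive (first_char, last_char) pairs, and returning <code>None</code>
for an undefined ratio (including recall when there are no ground truth
errors) is my own assumption.
</p>

<p>
<pre>
def overlaps(span_a, span_b):
    """True if two inclusive (first, last) spans share any character."""
    return min(span_a[1], span_b[1]) >= max(span_a[0], span_b[0])


def score(truth_spans, detected_spans):
    """Count true/false positives and false negatives, then the ratios."""
    # A ground truth error counts once, however many detections overlap it.
    true_pos = sum(
        any(overlaps(truth, found) for found in detected_spans)
        for truth in truth_spans
    )
    false_neg = len(truth_spans) - true_pos
    # A detection that touches no ground truth error is a false positive.
    false_pos = sum(
        not any(overlaps(found, truth) for truth in truth_spans)
        for found in detected_spans
    )
    recall = true_pos / len(truth_spans) if truth_spans else None
    flagged = true_pos + false_pos
    accuracy = true_pos / flagged if flagged else None
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "recall": recall,
        "accuracy": accuracy,
    }


truth = [(10, 14), (30, 35), (50, 58)]
detected = [(9, 12), (13, 16), (40, 44), (52, 52)]
result = score(truth, detected)
print(result)
</pre>
</p>

<p>
In this made-up example the first ground truth error is hit by two
detections but still counts as a single true positive, giving two true
positives, one false positive, and one false negative. Recall and accuracy
both come out to 2/3.
</p>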

<h2><a id="Creating-the-evaluation-dataset"></a><a href="#Creating-the-evaluation-dataset">Creating the evaluation dataset</a></h2>

<p>
Putting this into computer-readable form required creating a Python script
with some error-ridden example text and the locations of the errors.
I created
<a href="https://codeberg.org/brohrer/alms/src/commit/57294820c5035a0be140d54e7f07b6a470c31c7e/data/eval/spelling.py">the initial set of evals for spelling errors</a>,
but held off on creating evals for the other error types (punctuation,
capitalization, etc.) for now.
By the time you read this, there is a good chance it will already have evolved.
If that's the case, you can find
<a href="https://codeberg.org/brohrer/alms/src/branch/main/data/eval/spelling.py">the latest version here</a>.
The evaluation dataset is a list of dicts, each of which contains a paragraph
of text taken from a different chapter of Frankenstein, which I modified to
contain ten spelling mistakes. Each also contains a list of ten
dicts, one per mistake, each containing
</p>

<ul>
<li> the misspelled word</li>
<li> the index of its first and last character</li>
<li> the corrected version of the word</li>
</ul>

<p>
Here's a snippet of the result:
</p>

<p>
<pre>
evaluation_dataset = [
    {
        'source': 'Frankenstein',
        'chapter': 'L1',
        'paragraph': """
I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icey climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagniation as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpettual splendour. There—for with your leave, my sister, I will put
...
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine.
                """,
        'mistakes': [
            {
                'first_char': 322,
                'last_char': 325,
                'wrong_text': 'icey',
                'correct_text': 'icy',
            },
            {
                'first_char': 526,
                'last_char': 536,
                'wrong_text': 'imagniation',
                'correct_text': 'imagination',
            },
            {
                'first_char': 678,
                'last_char': 687,
                'wrong_text': 'perpettual',
                'correct_text': 'perpetual',
            },
            ...
        ]
    },
]
</pre>
</p>
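
<p>
Hand-counted character positions are easy to get off by one, so a small
consistency check on the answer key can help. The sketch below uses a
short made-up entry rather than the real dataset, assumes
<code>last_char</code> is inclusive, and the function name
<code>check_answer_key</code> is hypothetical.
</p>

<p>
<pre>
def check_answer_key(entry):
    """Return the mistakes whose indices don't point at their wrong_text."""
    text = entry["paragraph"]
    mismatches = []
    for mistake in entry["mistakes"]:
        # last_char is treated as inclusive, so the slice ends one past it.
        found = text[mistake["first_char"]:mistake["last_char"] + 1]
        if found != mistake["wrong_text"]:
            mismatches.append((mistake["wrong_text"], found))
    return mismatches


# A short made-up entry in the same shape as the evaluation dataset.
sample_entry = {
    "paragraph": "The icey wind carried a perpettual chill.",
    "mistakes": [
        {
            "first_char": 4,
            "last_char": 7,
            "wrong_text": "icey",
            "correct_text": "icy",
        },
        {
            "first_char": 24,
            "last_char": 33,
            "wrong_text": "perpettual",
            "correct_text": "perpetual",
        },
    ],
}

mismatches = check_answer_key(sample_entry)
print(mismatches)  # [] when every index pair matches its wrong_text
</pre>
</p>

<p>
A check like this also surfaces quieter slips, such as whether leading
whitespace in a triple-quoted paragraph was counted toward the indices.
</p>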

<h2><a id="End-to-end-prototyping"></a><a href="#End-to-end-prototyping">End-to-end prototyping</a></h2>

<p>
It’s fair to ask why I don’t go ahead and complete the evaluation datasets
for the other types of errors. It seems logical to completely finish this
step before moving on to the next. We can imagine this as a breadth-first
solution to the problem, thoroughly working through one stage of development,
putting some polish on it before moving to the next. When this work is
spread across teams, this is called waterfall style development. One team
completes a whole stage of the project like design or backend support
before passing it on to the next.
</p>

<p>
The alternative to this is a depth-first development strategy:
building a bare-bones end-to-end solution and then adding breadth and
sophistication to it in subsequent passes. Starting with an end-to-end
prototype means leaving a lot of things incomplete in the first pass.
It means creating something that you would be embarrassed to show to
your friends. If you’re working across multiple teams, it means a whole lot
more communication up front.
</p>

<p>
In theory, both of these approaches are valid and will produce a good result
in similar timeframes. But in practice, they don’t. The waterfall approach
assumes that all of the work done at each stage gets to remain in its
final form. In fact, every additional stage teaches us things we didn’t
know about what needed to come before. This requires a lot of rework on
stages that we had previously thought were complete. In an end-to-end
prototyping approach this rework happens quickly. The whole project has
a lot less momentum and can pivot more gracefully. It is more agile.
</p>

<p>
This lesson can take a long time to learn, and in some cases, it is in
managers' interest to ignore it, depending on the incentives of
the organization. But since I am all of the engineering teams and all
of the levels of management for this project, I get to decide:
We’re going to start with a lightweight end-to-end prototype.
</p>

<p>
So now that the spelling evals are done, on to the next stage: building a
model to detect misspelled words.
</p>

    ]]></description>
  </item>

  <item>
    <title>
    Blog Highlights
    </title>
    <link>
    https://brandonrohrer.org/blog.html#highlights
    </link>
    <pubDate>
    Sun, 12 Apr 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/blog.html#highlights
    </guid>
    <description><![CDATA[
        <p>
          I revamped my blog to include a highlights reel.
          It was a big unfriendly wall of links.
          Now it starts with a small unfriendly wall of links.
        </p>
        <p>
          Tutorials, projects, code, and thoughts
          collected into topic groups I've generously called Book Projects.
        </p>

        <h3>Highlights</h3>
        <p>
          <strong>New Releases</strong>
        </p>
        <ul>
          <li>
            <a href="ds_roles.html">
              Being a Staff+ Data Scientist in 2026
            </a>
          </li>
          <li>
            <a href="alms_tokenizer.html">
              Build a custom tokenizer
            </a>
          </li>
          <li>
            <a href="alms.html">
              Artisanal Language Models
            </a>
          </li>
        </ul>

        <p>
          <strong>Most Visited</strong>
        </p>
        <ul>
          <li>
            <a href="transformers.html">
              Transformers from scratch
            </a>
          </li>
          <li>
            <a href="ssh_at_home.html">
              Setting up an ssh server</a>
          </li>
          <li>
            <a href="convert_rgb_to_grayscale.html">
              How to convert RGB color images to grayscale
            </a>
          </li>
          <li>
            <a href="convolution_one_d.html">
              Convolution in one dimension
            </a>
          </li>
        </ul>

        <p>
          <strong>Most Loved</strong>
        </p>
        <ul>
          <li>
            <a href="professional_path.html">
              Choose your professional path
            </a>
          </li>
          <li>
            <a href="org_response.html">
              What to do when a leader does something wrong
            </a>
          </li>
          <li>
            <a href="microsuffering.html">
              On microsuffering
            </a>
          </li>
        </ul>

        <p>
          <strong>I'm most proud of</strong>
        </p>
        <ul>
          <li>
            <a href="pendulum.html">
               Solving an easy reinforcement learning problem on hard mode:
               Inverting a pendulum
            </a>
          </li>
          <li>
            <a href="cartographer">
              Naive Cartographer: A Markov Decision Process Learner
            </a>
          </li>
          <li>
            <a href="ziptie">
              Ziptie: Learning Useful Features
            </a>
          </li>
        </ul>
    ]]></description>
  </item>

</channel>
</rss>
