<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://proofsandintuitions.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://proofsandintuitions.net/" rel="alternate" type="text/html" /><updated>2026-05-20T19:52:41+00:00</updated><id>https://proofsandintuitions.net/feed.xml</id><title type="html">Proofs and Intuitions</title><subtitle>A blog about mathematics, computing, formal verification, and the ideas behind them</subtitle><author><name>Ilya Sergey</name></author><entry><title type="html">On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications</title><link href="https://proofsandintuitions.net/2026/05/18/property-based-testing-specifications/" rel="alternate" type="text/html" title="On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications" /><published>2026-05-18T00:00:00+00:00</published><updated>2026-05-18T00:00:00+00:00</updated><id>https://proofsandintuitions.net/2026/05/18/property-based-testing-specifications</id><content type="html" xml:base="https://proofsandintuitions.net/2026/05/18/property-based-testing-specifications/"><![CDATA[<p>In this post, we show that property-based testing (PBT) is surprisingly effective for validating LLM-synthesised specifications of Lean programs: it is a cheap alternative to symbolic proofs, which helped to detect underspecification in 10% of the specs in state-of-the-art benchmarks for verified code generation.</p>

<!--more-->

<h2 id="getting-program-specifications-that-are-just-right">Getting Program Specifications that are “Just Right”</h2>

<p>Formal program verification and program synthesis are only as reliable as the specifications used to validate the programs in question. A specification is a mathematical contract between a programmer and a machine: it captures <em>what</em> a program is supposed to do, and the verifier (or the synthesiser) ensures the contract is respected when producing the implementation and the proof. If the contract is flawed, the result is meaningless—a program verified against a wrong spec does what the specification states, but not necessarily what the user wants.</p>

<p>It is easy to write a contract that is useless on purpose. In <a href="https://en.wikipedia.org/wiki/Hoare_logic">Hoare logic</a>, a specification takes the form</p>

\[\{P\}\ \texttt{program}\ \{Q\},\]

<p>read as: <em>if the precondition $P$ holds of the inputs, then after <code class="language-plaintext highlighter-rouge">program</code> runs, the postcondition $Q$ holds of the outputs.</em> This is, for instance, the style of specification used by <a href="https://github.com/verse-lab/velvet">Velvet</a>—a program verifier embedded in Lean that we discussed in <a href="/2026/01/21/multi-modal-verification-velvet/">one of our previous posts</a>. Setting $P \equiv \bot$ (i.e., $\mathit{false}$) makes the triple hold vacuously for <em>any</em> program: there are no inputs satisfying the precondition, so there is nothing to check, and even completely degenerate code is certified. Symmetrically, setting $Q \equiv \top$ ($\mathit{true}$) means that every possible output satisfies the postcondition, so once again any program will pass. The first specification’s precondition is too <strong>strong</strong> (it rules out every input), and the second specification’s postcondition is too <strong>weak</strong> (it rules out no output, making verification and synthesis useless, since such a program can return <em>anything</em>). Neither tells us anything about <em>what</em> the program should do.</p>
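
<p>Both degenerate cases can be replayed in a few lines of Lean. The sketch below is only an illustrative model of a contract for pure functions, not Velvet’s actual encoding; the names <code class="language-plaintext highlighter-rouge">Satisfies</code>, <code class="language-plaintext highlighter-rouge">satisfies_of_false_pre</code>, and <code class="language-plaintext highlighter-rouge">satisfies_of_true_post</code> are hypothetical.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical model: the pure function `f` meets the contract {pre} f {post}.
def Satisfies {α β : Type} (pre : α → Prop) (post : α → β → Prop) (f : α → β) : Prop :=
  ∀ x, pre x → post x (f x)

-- P ≡ ⊥: no input satisfies the precondition, so every `f` is accepted.
theorem satisfies_of_false_pre {α β : Type} (post : α → β → Prop) (f : α → β) :
    Satisfies (fun _ =&gt; False) post f := by
  intro _ h
  exact False.elim h

-- Q ≡ ⊤: every output satisfies the postcondition, so every `f` is accepted.
theorem satisfies_of_true_post {α β : Type} (pre : α → Prop) (f : α → β) :
    Satisfies pre (fun _ _ =&gt; True) f := by
  intro _ _
  exact True.intro
</code></pre></div></div>

<p>Neither theorem mentions anything about <code class="language-plaintext highlighter-rouge">f</code> beyond its type, which is exactly why such contracts carry no information.</p>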

<p>The interesting middle ground—a specification that is <em>just right</em>—is an open research problem. A good specification must be <strong>precise enough</strong> to pin down the programmer’s intent, but <strong>not so precise</strong> that it inadvertently re-states the program itself. Consider a textbook task: <em>sort a list of integers in ascending order</em>. A specification that says “the output is the list produced by merge sort applied to the input” is, technically, a specification, but it has fused the <em>what</em> with the <em>how</em>: it commits to an algorithm and inherits all of its incidental details, defeating the purpose of having a specification in the first place. A good specification of sorting, by contrast, demands only two things of the result: it is in ascending order, and it is a permutation of the input. Whether the implementation runs merge sort, quicksort, <a href="https://en.wikipedia.org/wiki/Gnome_sort">gnome sort</a>, <a href="https://en.wikipedia.org/wiki/Bogosort">bogosort</a>, or <a href="https://www.reddit.com/r/programminghorror/comments/lgsd18/i_present_sleepsort/">sleepsort</a> is now irrelevant—the specification abstracts over all of them. Producing this kind of clean, intent-capturing formal statement from an informal English description is the part that humans are, even today, still better at than machines.</p>
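
<p>As a rough illustration, the “two things” demanded of a sorted result fit in a single Lean definition. This is a sketch that assumes a recent Lean 4 toolchain in which <code class="language-plaintext highlighter-rouge">List.Pairwise</code> and <code class="language-plaintext highlighter-rouge">List.Perm</code> are available without extra imports; the name <code class="language-plaintext highlighter-rouge">SortSpec</code> is hypothetical.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical name: a declarative sorting contract over lists of integers.
-- It names no algorithm; it only relates the output to the input.
def SortSpec (input output : List Int) : Prop :=
  List.Pairwise (· ≤ ·) output ∧   -- the output is in ascending order
  List.Perm output input            -- the output is a permutation of the input
</code></pre></div></div>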

<p>Hence the conundrum. The most reliable way to write a “just right” specification is to put a human expert in the loop: someone who reads the formal statement, compares it to the informal intent, and corrects it. But reading formal specifications fluently requires considerable training—precisely the kind of cognitive overhead that has kept formal methods out of the mainstream. If we want certified programs to become widely accessible, we need to dramatically reduce the human effort required to write <em>and</em> validate specifications, without simply passing control to a large language model and hoping for the best.</p>

<p>Our goal, thus, is to produce formal specifications with <strong>minimal user involvement</strong>, and to identify principles for generating specifications that are close enough to a human’s intent to drive the synthesis of <em>certified programs</em>—i.e., programs that come bundled with their formal specifications and machine-checkable correctness proofs. This problem has only become more urgent: in a world where more and more of the code we run is produced by LLMs, we cannot afford to also delegate the synthesis of specifications to LLMs without a reliable, automated way to validate that the resulting specs are <em>just right</em>. The rest of this post discusses two such validation techniques and the trade-offs between them.</p>

<h2 id="symbolic-specification-testing">Symbolic Specification Testing</h2>

<p>One natural approach is to validate an LLM-synthesised specification by <em>proving</em> it on a handful of representative inputs—using a verifier and an SMT solver in the loop. This idea has been actively explored over the past year by Shuvendu Lahiri, who frames the problem as <a href="https://risemsr.github.io/blog/2026-03-05-shuvendu-intent-formalization/"><em>intent formalisation</em></a>: closing the gap between informal natural-language intent and a machine-checkable specification.</p>

<p>Take the sorting task from earlier and imagine that an LLM proposes the following candidate postcondition: <em>“<code class="language-plaintext highlighter-rouge">result</code> is in ascending order, and every element of <code class="language-plaintext highlighter-rouge">arr</code> also occurs in <code class="language-plaintext highlighter-rouge">result</code>”</em>. It looks plausible, but it is silently too <strong>weak</strong>: it lets the implementation add elements that were never in <code class="language-plaintext highlighter-rouge">arr</code>. The <a href="https://risemsr.github.io/blog/2026-04-16-spotting-specs/"><em>Small Proof-Oriented Tests</em></a> (SPOTs) methodology of Nik Swamy and Shuvendu Lahiri catches defects like this by writing tiny <em>verified</em> test cases against the spec. A SPOT for our sorting task looks like:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let arr    := #[3, 1, 2]
let result := sort(arr)
assert result == #[1, 2, 3]
</code></pre></div></div>

<p>and is handed to the verifier with a single demand: prove that the assertion holds <em>using only the specification of <code class="language-plaintext highlighter-rouge">sort</code></em>, without running the code. If the spec is precise enough to force <code class="language-plaintext highlighter-rouge">sort(#[3, 1, 2])</code> to equal <code class="language-plaintext highlighter-rouge">#[1, 2, 3]</code>, the proof goes through, and the SPOT becomes a small, machine-checked theorem about the spec. If the spec is too weak, the proof fails: our buggy postcondition admits not only $\texttt{#[1,2,3]}$ but also $\texttt{#[1,2,3,10]}$, $\texttt{#[-7,1,2,3]}$, and infinitely many other ascending lists that contain $1$, $2$, and $3$, so the verifier has no way to derive the asserted equality. The failure pinpoints exactly the weakness in the postcondition.</p>
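
<p>To see concretely why the equality cannot be derived, one can render the weak postcondition as a Boolean function and evaluate it on the outputs mentioned above. This is a minimal sketch over lists rather than arrays; <code class="language-plaintext highlighter-rouge">isAscending</code> and <code class="language-plaintext highlighter-rouge">weakPost</code> are hypothetical names, and this is not how a SPOT is actually checked (a SPOT is proved, not executed).</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Boolean rendering of "the list is in ascending order".
def isAscending : List Int → Bool
  | x :: y :: rest =&gt; decide (x ≤ y) &amp;&amp; isAscending (y :: rest)
  | _ =&gt; true

-- The weak candidate: ascending, and every element of `arr` occurs in `result`.
def weakPost (arr result : List Int) : Bool :=
  isAscending result &amp;&amp; arr.all (fun x =&gt; result.contains x)

#eval weakPost [3, 1, 2] [1, 2, 3]      -- true: the intended output is accepted
#eval weakPost [3, 1, 2] [1, 2, 3, 10]  -- true: so is an output with an extra element
#eval weakPost [3, 1, 2] [-7, 1, 2, 3]  -- true
</code></pre></div></div>

<p>Since several distinct outputs satisfy the postcondition for the same input, no proof can conclude that the result equals one particular list.</p>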

<p>This approach works remarkably well in <em>auto-active</em> verifiers such as <a href="https://dafny.org/">Dafny</a>, <a href="https://github.com/verus-lang/verus">Verus</a>, and <a href="https://fstar-lang.org/">F*</a>, where industrial-grade SMT solvers discharge the proof obligations a SPOT generates—typically over arithmetic, bitvectors, fixed-size arrays, and list or string equality—in milliseconds.</p>

<h2 id="reframing-validation-soundness-and-uniqueness">Reframing Validation: Soundness and Uniqueness</h2>

<p>We tried to use SPOTs in Lean too, and the result was disappointing. The reason is that Lean is a <em>foundational</em> proof assistant, with a logic much richer
than what auto-active verifiers expose, and a correspondingly smaller fraction
of its proof obligations fall inside what SMT solvers can discharge
automatically. Many of the assertions a SPOT generates—equalities between
recursively defined functions, arithmetic that needs induction, list
manipulations—sit just outside
the solver’s automation. To close such a proof in Lean one usually has to fall
back on an interactive proof script, or, increasingly, on an LLM-driven proof
search agent, such as <a href="https://aristotle.harmonic.fun/">Aristotle</a>. Both are orders of magnitude slower per obligation than
discharging the same goal in Dafny, and each attempt also burns through a
non-trivial number of LLM tokens. A budget that comfortably validates dozens of
SPOTs in Dafny barely covers three or four in Lean.</p>

<p>To make the symbolic style work in Lean, then, we need to ask less of the verifier per test case. Rather than treating a SPOT as a single, all-or-nothing proof obligation, we can decompose it into three strictly weaker checks—one that captures whether the input is one the spec claims to handle, one that captures whether the spec <em>accepts</em> the intended output, and one that captures whether it <em>forbids</em> the unintended ones.</p>

<p>Concretely, given a test case $(i, o)$—an input $i$ paired with its intended output $o$—we ask three separate questions about a candidate specification, given by a precondition $\mathit{pre}$ and a postcondition $\mathit{post}$:</p>

<ul>
  <li><strong>Admissibility</strong>: $\mathit{pre}(i)$ holds. Intuitively, the chosen test input is one the spec actually claims to handle. A failure here means the test case lies outside the precondition—we would be checking the spec on inputs it has explicitly opted out of, and any verdict would be uninformative.</li>
  <li><strong>Soundness</strong>: $\mathit{post}(i, o)$ holds. Intuitively, the spec accepts the intended output. A failure here means the spec is too strong—it rejects something the informal intent says is correct.</li>
  <li><strong>Uniqueness</strong>: $\forall o' \neq o,\ \neg\mathit{post}(i, o')$ holds. Intuitively, the spec rejects every alternative output. A failure here means the spec is too weak—it admits outputs the informal intent does not.</li>
</ul>

<p>All three properties can be discharged by a symbolic verifier—this is essentially what a SPOT does, just packaged as a single obligation rather than three. More importantly for us, however, each of them can also be effectively <em>invalidated</em> by testing, without attempting a proof at all: a single counterexample is enough to refute any one of these properties and flag the spec. The next sections explain how to do so.</p>
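
<p>The three checks, and the observation that one witness suffices to refute uniqueness, can be written down directly in Lean. The following is a sketch under our own naming: <code class="language-plaintext highlighter-rouge">Spec</code>, <code class="language-plaintext highlighter-rouge">Admissible</code>, <code class="language-plaintext highlighter-rouge">SoundFor</code>, <code class="language-plaintext highlighter-rouge">UniqueFor</code>, and <code class="language-plaintext highlighter-rouge">not_unique_of_witness</code> are all hypothetical and do not come from any library.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical packaging of a candidate specification.
structure Spec (α β : Type) where
  pre  : α → Prop
  post : α → β → Prop

-- The three checks for a test case (i, o).
def Admissible {α β : Type} (s : Spec α β) (i : α) : Prop := s.pre i
def SoundFor {α β : Type} (s : Spec α β) (i : α) (o : β) : Prop := s.post i o
def UniqueFor {α β : Type} (s : Spec α β) (i : α) (o : β) : Prop :=
  ∀ o', o' ≠ o → ¬ s.post i o'

-- A single accepted alternative output refutes uniqueness; no proof search is needed.
theorem not_unique_of_witness {α β : Type} (s : Spec α β) (i : α) (o o' : β)
    (hne : o' ≠ o) (hpost : s.post i o') : ¬ UniqueFor s i o :=
  fun h =&gt; h o' hne hpost
</code></pre></div></div>

<p>The hypotheses <code class="language-plaintext highlighter-rouge">hne</code> and <code class="language-plaintext highlighter-rouge">hpost</code> are exactly what a tester has to exhibit: an output different from the intended one that the postcondition still accepts.</p>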

<h2 id="property-based-testing">Property-Based Testing</h2>

<p>To turn the invalidation strategy from the previous section into actual code, we need a tool that can draw candidate inputs (and candidate alternative outputs $o'$) automatically and check a Boolean property on each. That tool is <em>property-based testing</em> (PBT): a technique for checking that an object satisfies a property by drawing random inputs from a generator, evaluating the property on each, and reporting any concrete counterexample it finds. The idea was introduced in 2000 in the <a href="https://dl.acm.org/doi/10.1145/351240.351266">seminal paper on QuickCheck for Haskell</a>, and has since been re-implemented in essentially every modern programming language, including Lean.</p>

<p>In the twenty-five years since, PBT has proven dramatically effective across a wide range of domains: it has uncovered bugs in <a href="https://dl.acm.org/doi/10.1145/3110259">production compilers</a>, driven testing of <a href="https://dl.acm.org/doi/10.1145/3597503.3639581">financial software</a>, surfaced subtle defects in <a href="https://dl.acm.org/doi/10.1145/2951913.2951927">computational geometry algorithms</a> and in <a href="https://dl.acm.org/doi/10.1145/3547653">smart contract runtimes</a>. What makes PBT so effective is that a precise formal property is, all by itself, an automatic refutation engine: any concrete input that violates it counts as a bug, and finding one requires only that the generator stumbles onto a witness.</p>

<p>For Lean 4, this role is played by <a href="https://github.com/leanprover-community/plausible">Plausible</a>, a community-maintained property-based testing library in the QuickCheck tradition. Given a theorem statement, Plausible generates random inputs from typeclass-derived generators and tries to refute the goal by exhibiting a counterexample.<sup id="fnref:decidable" role="doc-noteref"><a href="#fn:decidable" class="footnote" rel="footnote">1</a></sup></p>
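
<p>As a minimal illustration, assuming the Plausible package is available as a dependency of the project, its <code class="language-plaintext highlighter-rouge">plausible</code> tactic can be pointed at a deliberately false statement. The example is ours and is not taken from the pipeline described in this post.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Plausible

-- A deliberately false claim: reversing a list never changes it.
-- Instead of closing the goal, `plausible` is expected to fail and report
-- a concrete list on which the two sides differ.
example (xs : List Nat) : xs.reverse = xs := by
  plausible
</code></pre></div></div>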

<p>To see Plausible in action, consider an insertion-sort implementation written in <a href="/2026/01/21/multi-modal-verification-velvet/">Velvet</a>. The Velvet method comes with a postcondition that we <em>believe</em> expresses what sorting means:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">method</span> <span class="n">insertionSort</span> (<span class="n">mut</span> <span class="n">arr</span>: <span class="n">Array</span> <span class="n">Int</span>) <span class="n">return</span> (<span class="n">u</span>: <span class="n">Unit</span>)
  <span class="n">require</span> <span class="mi">1</span> <span class="o">≤</span> <span class="n">arr</span><span class="o">.</span><span class="n">size</span>
  <span class="n">ensures</span> <span class="o">∀</span> <span class="n">i</span> <span class="n">j</span>, <span class="mi">0</span> <span class="o">≤</span> <span class="n">i</span> <span class="o">∧</span> <span class="n">i</span> <span class="o">≤</span> <span class="n">j</span> <span class="o">∧</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">arr</span><span class="o">.</span><span class="n">size</span> <span class="o">→</span> <span class="n">arr</span>[<span class="n">i</span>]<span class="o">!</span> <span class="o">≤</span> <span class="n">arr</span>[<span class="n">j</span>]<span class="o">!</span>
  <span class="n">ensures</span> <span class="n">arr</span><span class="o">.</span><span class="n">toMultiset</span> <span class="o">=</span> <span class="n">arrOld</span><span class="o">.</span><span class="n">toMultiset</span>
  <span class="n">do</span><span class="cd">
    -- implementation elided</span>
</code></pre></div></div>

<p>The two <code class="language-plaintext highlighter-rouge">ensures</code> clauses say, respectively, that the array is in ascending order after the method runs, and that its multiset of elements is unchanged—i.e., the result is a permutation of the input. Without ever proving this method correct, we can already use Plausible to <em>test</em> the implementation against the postcondition:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">let</span> <span class="n">g</span> : <span class="n">Plausible</span><span class="o">.</span><span class="n">Gen</span> (<span class="n">_</span> <span class="o">×</span> <span class="n">Bool</span>) := <span class="n">do</span>
  <span class="n">let</span> <span class="n">arr</span> <span class="err">←</span> <span class="n">Plausible</span><span class="o">.</span><span class="n">SampleableExt</span><span class="o">.</span><span class="n">interpSample</span> (<span class="n">Array</span> <span class="n">Int</span>)
  <span class="n">let</span> <span class="n">res</span> := <span class="n">insertionSortTester</span> <span class="n">arr</span>
  <span class="n">pure</span> (<span class="n">arr</span>, <span class="n">res</span>)

<span class="n">for</span> <span class="n">_</span> <span class="n">in</span> [<span class="mi">0</span> : <span class="mi">500</span>] <span class="n">do</span>
  <span class="n">let</span> <span class="n">res</span> <span class="err">←</span> <span class="n">Plausible</span><span class="o">.</span><span class="n">Gen</span><span class="o">.</span><span class="n">run</span> <span class="n">g</span> <span class="mi">10</span>
  <span class="n">unless</span> <span class="n">res</span><span class="o">.2</span> <span class="n">do</span>
    <span class="n">IO</span><span class="o">.</span><span class="n">println</span> <span class="n">s</span><span class="o">!</span><span class="s">"postcondition violated for input {res.1}"</span>
    <span class="n">break</span>
</code></pre></div></div>

<p>The helper <code class="language-plaintext highlighter-rouge">insertionSortTester</code> is auto-derived from the method’s signature:<sup id="fnref:tester" role="doc-noteref"><a href="#fn:tester" class="footnote" rel="footnote">2</a></sup> it runs <code class="language-plaintext highlighter-rouge">insertionSort</code> on the sampled array and evaluates the postcondition on the resulting state. The loop is <em>trying to invalidate the postcondition</em> by drawing 500 fresh inputs and looking for one that breaks it. If none is found, we have strong evidence—though not a <em>proof</em>—that the implementation respects the spec on the distribution of inputs the generator explores; if one is found, the offending array is a concrete witness of a bug. This is the canonical PBT trade-off against a symbolic correctness proof: dramatically cheaper to run, at the cost of giving up the universal guarantee that a proof would provide.</p>
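
<p>To make “run the method, then evaluate the postcondition” concrete, here is a hand-written analogue over plain lists. It is only a sketch of the idea and not the code Velvet generates; every name in it (<code class="language-plaintext highlighter-rouge">insertSorted</code>, <code class="language-plaintext highlighter-rouge">insertionSortList</code>, <code class="language-plaintext highlighter-rouge">ascending</code>, <code class="language-plaintext highlighter-rouge">sameElements</code>, <code class="language-plaintext highlighter-rouge">insertionSortListTester</code>) is hypothetical.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Insert `x` into an already ascending list.
def insertSorted (x : Int) : List Int → List Int
  | [] =&gt; [x]
  | y :: ys =&gt; if x ≤ y then x :: y :: ys else y :: insertSorted x ys

-- A plain functional insertion sort, standing in for the Velvet method.
def insertionSortList (xs : List Int) : List Int :=
  xs.foldr insertSorted []

-- First `ensures` clause: every adjacent pair is in ascending order.
def ascending (l : List Int) : Bool :=
  (l.zip l.tail).all (fun p =&gt; decide (p.1 ≤ p.2))

-- Second `ensures` clause: both lists contain each element equally often.
def sameElements (xs ys : List Int) : Bool :=
  (xs ++ ys).all (fun x =&gt; xs.count x == ys.count x)

-- Analogue of the tester: run the sort, then evaluate the postcondition.
def insertionSortListTester (xs : List Int) : Bool :=
  let out := insertionSortList xs
  ascending out &amp;&amp; sameElements out xs

#eval insertionSortListTester [3, 1, 2]  -- true
#eval insertionSortListTester []         -- true
</code></pre></div></div>

<p>A random generator then only has to feed inputs to such a Boolean function and stop at the first <code class="language-plaintext highlighter-rouge">false</code>.</p>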

<p>Notice, though, what we are testing here: we are checking the <em>implementation</em> of <code class="language-plaintext highlighter-rouge">insertionSort</code> against a specification we already trust. The next step—the one this whole post has been building towards—is to flip the script, and use the very same machinery to validate the specification itself.</p>

<h2 id="catching-a-bad-spec-with-pbt">Catching a Bad Spec with PBT</h2>

<p>Let’s put PBT and the soundness/uniqueness pair to work on a concrete LLM-synthesised specification. We stay with the sorting task: <em>given a list of integers, produce one in ascending order</em>. Asked to write a Lean specification for this problem, an LLM might propose:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">precondition</span> (<span class="n">arr</span> : <span class="n">List</span> <span class="n">Int</span>) : <span class="kt">Prop</span> :=
  <span class="n">True</span>

<span class="k">def</span> <span class="n">postcondition</span> (<span class="n">arr</span> : <span class="n">List</span> <span class="n">Int</span>) (<span class="n">result</span> : <span class="n">List</span> <span class="n">Int</span>) : <span class="kt">Prop</span> :=
  (<span class="o">∀</span> <span class="n">i</span> <span class="n">j</span>, <span class="mi">0</span> <span class="o">≤</span> <span class="n">i</span> <span class="o">∧</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">j</span> <span class="o">∧</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">result</span><span class="o">.</span><span class="n">length</span> <span class="o">→</span> <span class="n">result</span>[<span class="n">i</span>]<span class="o">!</span> <span class="o">≤</span> <span class="n">result</span>[<span class="n">j</span>]<span class="o">!</span>) <span class="o">∧</span>
  (<span class="o">∀</span> <span class="n">i</span>, <span class="mi">0</span> <span class="o">≤</span> <span class="n">i</span> <span class="o">∧</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">arr</span><span class="o">.</span><span class="n">length</span> <span class="o">→</span> <span class="n">arr</span><span class="o">.</span><span class="n">count</span> <span class="n">arr</span>[<span class="n">i</span>]<span class="o">!</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">count</span> <span class="n">arr</span>[<span class="n">i</span>]<span class="o">!</span>)
</code></pre></div></div>

<p>In plain English, the postcondition says that <code class="language-plaintext highlighter-rouge">result</code> is in ascending order and that every element of <code class="language-plaintext highlighter-rouge">arr</code> also appears in <code class="language-plaintext highlighter-rouge">result</code>. This is silently too weak: it never forbids <code class="language-plaintext highlighter-rouge">result</code> from containing extra elements that were never in <code class="language-plaintext highlighter-rouge">arr</code>.</p>

<p>To validate this spec, we run the three checks introduced earlier, this time using PBT. <em>Admissibility</em> is trivial here: the precondition is <code class="language-plaintext highlighter-rouge">True</code>, so every input we pick passes. For <em>soundness</em>, we choose a few small test cases $(i, o)$ where $o$ is the intended sort of $i$<sup id="fnref:testcases" role="doc-noteref"><a href="#fn:testcases" class="footnote" rel="footnote">3</a></sup>—say $(\texttt{#[]}, \texttt{#[]})$ and $(\texttt{#[3,1,2]}, \texttt{#[1,2,3]})$—and check that <code class="language-plaintext highlighter-rouge">postcondition i o</code> holds. It does, on every case we try: the spec accepts the intended outputs, so soundness is not the problem here.</p>

<p>For <em>uniqueness</em>, Plausible randomly samples alternative outputs $o' \neq o$ for each test case and checks whether <code class="language-plaintext highlighter-rouge">postcondition i o'</code> still holds. On the very first test case $i = \texttt{#[]}$, the generator quickly stumbles onto $o' = \texttt{#[0]}$: the array is trivially in ascending order and trivially contains every element of <code class="language-plaintext highlighter-rouge">arr</code> (because <code class="language-plaintext highlighter-rouge">arr</code> is empty), so the postcondition is satisfied. Plausible reports the counterexample, and the spec is flagged as too weak—exactly as the reframing predicted.</p>

<h2 id="whats-hard-to-test-and-how-we-cope">What’s Hard to Test (and How We Cope)</h2>

<p>PBT thrives when the property under test is a universally quantified Boolean predicate one can evaluate directly on each sampled input. It struggles when the property contains an unbounded <em>existential</em> quantifier: refuting $\exists x,\, P(x)$ means establishing $\forall x,\, \neg P(x)$, which testing alone cannot do. We found two simple patterns that let us tackle the awkward cases without falling back to a full symbolic proof.</p>

<p>First, in program specifications, existentials are almost always implicit and
<em>bounded</em> by the surrounding context—an index into <code class="language-plaintext highlighter-rouge">arr</code> lives in $[0,\,
\texttt{arr.size})$, not in all of $\mathbb{N}$. We infer such bounds
heuristically from the property’s structure, prove that they indeed hold for the
respective program variables (such as array indices) with Lean tactics like
<code class="language-plaintext highlighter-rouge">grind</code> and <code class="language-plaintext highlighter-rouge">omega</code>, and then enumerate the existential variable over the
resulting finite range.</p>
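
<p>The enumeration step itself is straightforward once a bound is known. The sketch below covers only that last step, for the property “some index of <code class="language-plaintext highlighter-rouge">arr</code> holds <code class="language-plaintext highlighter-rouge">x</code>”; the bound inference and its justification with <code class="language-plaintext highlighter-rouge">grind</code> or <code class="language-plaintext highlighter-rouge">omega</code> are not shown, and <code class="language-plaintext highlighter-rouge">existsIndexOf</code> is a hypothetical name.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Property with a bounded existential: ∃ i, i &lt; arr.size ∧ arr[i]! = x.
-- Because `i` ranges over [0, arr.size), the witness search is a finite scan.
def existsIndexOf (arr : Array Int) (x : Int) : Bool :=
  (List.range arr.size).any (fun i =&gt; arr[i]! == x)

#eval existsIndexOf #[4, 8, 15] 8   -- true
#eval existsIndexOf #[4, 8, 15] 7   -- false
</code></pre></div></div>

<p>With the existential turned into a finite scan, the property is again a Boolean function of the sampled input, which is the shape PBT handles well.</p>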

<p>Second, sampling alternative outputs $o' \neq o$ uniformly at random is usually quite ineffective: the odds of landing on a meaningful counterexample in this case are astronomically small. Instead, we take a page from the <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzz testing</a> book and use <em>mutation-based sampling</em>: we perturb the intended output by flipping a Boolean, adding $\pm 1$ to an integer, or inserting an element into a list or deleting one from it. The rationale is empirical: bad outputs tend to look much like good ones, which makes such “small-scale” mutations much more effective than blind random search in a large space of possible outputs.</p>
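
<p>A deterministic toy version of mutation-based sampling for list outputs might look as follows. It is a sketch rather than our actual generator: it enumerates all single-step mutants instead of sampling them, always inserts the arbitrary value <code class="language-plaintext highlighter-rouge">0</code>, and uses hypothetical names throughout. The postcondition <code class="language-plaintext highlighter-rouge">weakSortPost</code> is a Boolean rendering of the weak sorting spec discussed above.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- All lists obtained from the input by deleting one element.
def deletions : List Int → List (List Int)
  | [] =&gt; []
  | x :: xs =&gt; xs :: (deletions xs).map (x :: ·)

-- All lists obtained from the input by inserting `v` at one position.
def insertions (v : Int) : List Int → List (List Int)
  | [] =&gt; [[v]]
  | x :: xs =&gt; (v :: x :: xs) :: (insertions v xs).map (x :: ·)

-- All lists obtained from the input by adding `d` to one element.
def bumps (d : Int) : List Int → List (List Int)
  | [] =&gt; []
  | x :: xs =&gt; ((x + d) :: xs) :: (bumps d xs).map (x :: ·)

-- Small perturbations of the intended output `o`, excluding `o` itself.
def mutants (o : List Int) : List (List Int) :=
  (deletions o ++ insertions 0 o ++ bumps 1 o ++ bumps (-1) o).filter (· != o)

-- Mutants that a Boolean postcondition still accepts for input `i`.
def uniquenessWitnesses (post : List Int → List Int → Bool)
    (i o : List Int) : List (List Int) :=
  (mutants o).filter (post i)

-- Weak sorting spec: ascending, and every element of `arr` keeps its count.
def weakSortPost (arr result : List Int) : Bool :=
  (result.zip result.tail).all (fun p =&gt; decide (p.1 ≤ p.2)) &amp;&amp;
  arr.all (fun x =&gt; arr.count x == result.count x)

#eval uniquenessWitnesses weakSortPost [] []  -- [[0]]
</code></pre></div></div>

<p>A non-empty result is a list of concrete counterexamples to uniqueness; for the empty input, the single mutant <code class="language-plaintext highlighter-rouge">[0]</code> already exposes the weakness.</p>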

<p>With these two adaptations in place, one might wonder whether the testing-based approach to validating formal specifications actually pays off in the wild. The rest of the post discusses the experience of using a PBT-based spec-validation pipeline on two state-of-the-art benchmarks of specifications for Lean programs.</p>

<h2 id="catching-specification-bugs-in-verified-coding-benchmarks">Catching Specification Bugs in Verified Coding Benchmarks</h2>

<p>Our hypothesis going in was as follows: if PBT-based spec validation works as advertised, it should be able to find underspecified specifications even in published, human-written Lean benchmarks that have already been vetted by their authors as reference examples.</p>

<p>To put this to the test, we ran our pipeline on two state-of-the-art benchmarks for Lean specification synthesis: <a href="https://arxiv.org/abs/2505.23135">VERINA</a> and <a href="https://arxiv.org/abs/2505.13938">CLEVER</a>. Both benchmark suites provide natural-language problem descriptions, formal specifications, and a handful of test cases; some problems also include a reference implementation. When one was available, we used it to compute the intended outputs from precondition-satisfying inputs. CLEVER’s specifications do not separate preconditions from postconditions, so we wrote a small script to do that conversion; due to formatting issues, only 104 of CLEVER’s specifications converted cleanly, and those are the ones we tested.</p>

<p>Across 188 problems from VERINA and 104 from CLEVER, PBT flagged 13 underspecified specifications in the former and 18 in the latter—about 10% of everything we tested.</p>
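
<p>Spelled out, the arithmetic behind the headline figure is:</p>

\[\frac{13 + 18}{188 + 104} = \frac{31}{292} \approx 10.6\%,\]

<p>with per-benchmark rates of $13/188 \approx 6.9\%$ for VERINA and $18/104 \approx 17.3\%$ for CLEVER.</p>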

<p>We reported these findings to the benchmark authors. The VERINA team has acknowledged the bugs we surfaced and <a href="https://github.com/sunblaze-ucb/verina/commit/75c46698739ae00e81c46e973971e1bc61eaf461">patched</a> nearly all of them (one is still under review); the CLEVER team has acknowledged all 18 issues, though fixes have not yet shipped at the time of writing.</p>

<p>Let us now look at three of the bugs we found.</p>

<h3 id="example-1-forgotten-range-constraints">Example 1: Forgotten Range Constraints</h3>

<p>The first comes from VERINA’s Basic 46. The task is to find the last position of a given element in a sorted array of integers, returning <code class="language-plaintext highlighter-rouge">-1</code> if the element is absent. VERINA’s specification is:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">lastPosition_precond</span> (<span class="n">arr</span> : <span class="n">Array</span> <span class="n">Int</span>) (<span class="n">elem</span> : <span class="n">Int</span>) : <span class="kt">Prop</span> :=
  <span class="n">List</span><span class="o">.</span><span class="n">Pairwise</span> (<span class="err">·</span> <span class="o">≤</span> <span class="err">·</span>) <span class="n">arr</span><span class="o">.</span><span class="n">toList</span>

<span class="k">def</span> <span class="n">lastPosition_postcond</span>
  (<span class="n">arr</span> : <span class="n">Array</span> <span class="n">Int</span>) (<span class="n">elem</span> : <span class="n">Int</span>) (<span class="n">result</span> : <span class="n">Int</span>)
  (<span class="n">h_precond</span> : <span class="n">lastPosition_precond</span> <span class="n">arr</span> <span class="n">elem</span>) :=
    (<span class="n">result</span> <span class="o">≥</span> <span class="mi">0</span> <span class="o">→</span>
      <span class="n">arr</span>[<span class="n">result</span><span class="o">.</span><span class="n">toNat</span>]<span class="o">!</span> <span class="o">=</span> <span class="n">elem</span> <span class="o">∧</span>
      (<span class="n">arr</span><span class="o">.</span><span class="n">toList</span><span class="o">.</span><span class="n">drop</span> (<span class="n">result</span><span class="o">.</span><span class="n">toNat</span> <span class="o">+</span> <span class="mi">1</span>))<span class="o">.</span><span class="n">all</span> (<span class="err">·</span> <span class="o">≠</span> <span class="n">elem</span>)) <span class="o">∧</span>
    (<span class="n">result</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="o">→</span> <span class="n">arr</span><span class="o">.</span><span class="n">toList</span><span class="o">.</span><span class="n">all</span> (<span class="err">·</span> <span class="o">≠</span> <span class="n">elem</span>))
</code></pre></div></div>

<p>At a glance this looks fine: when <code class="language-plaintext highlighter-rouge">result ≥ 0</code>, the <code class="language-plaintext highlighter-rouge">result</code>-th element equals <code class="language-plaintext highlighter-rouge">elem</code> and no element after it does; when <code class="language-plaintext highlighter-rouge">result = -1</code>, no element of the array equals <code class="language-plaintext highlighter-rouge">elem</code>.</p>

<p>PBT quickly finds the problem. With <code class="language-plaintext highlighter-rouge">arr = #[]</code> and <code class="language-plaintext highlighter-rouge">elem = 0</code>, the postcondition accepts both <code class="language-plaintext highlighter-rouge">-1</code> and <code class="language-plaintext highlighter-rouge">0</code>, even though only <code class="language-plaintext highlighter-rouge">-1</code> is correct. The reason: when <code class="language-plaintext highlighter-rouge">result = 0</code>, the lookup <code class="language-plaintext highlighter-rouge">arr[result.toNat]!</code> is out of bounds, and Lean’s accessor silently returns the default <code class="language-plaintext highlighter-rouge">Int</code>, which is <code class="language-plaintext highlighter-rouge">0</code>, and happens to equal <code class="language-plaintext highlighter-rouge">elem</code>. The postcondition is satisfied by accident.
Worse, the postcondition also accepts <code class="language-plaintext highlighter-rouge">-2</code>: nothing in the spec restricts <code class="language-plaintext highlighter-rouge">result</code> to be in <code class="language-plaintext highlighter-rouge">[-1, arr.size)</code>, so anything below <code class="language-plaintext highlighter-rouge">-1</code> is unconstrained.</p>

<p>What looks like an aesthetic problem is actually an exploit waiting to happen. A motivated attacker—or just a lazy LLM optimising for the cheapest implementation the verifier will accept—can ship code that returns <code class="language-plaintext highlighter-rouge">0</code> whenever the input array is empty, or uses any negative integer below <code class="language-plaintext highlighter-rouge">-1</code> as a “not found” sentinel. The verifier, reading only the postcondition, will stamp this program as correct; the user, trusting that stamp, will never notice that the “verified” program disagrees with the natural-language task it was supposed to solve.</p>

<p>Adding the explicit constraint <code class="language-plaintext highlighter-rouge">-1 ≤ result &lt; arr.size</code> fixes the spec. The general lesson: when writing a specification, always constrain the domain of every output variable.</p>
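
<p>The effect of the range constraint can be sketched as an executable check. This is a minimal illustration and not the benchmark's actual repaired spec: the name <code class="language-plaintext highlighter-rouge">inRange</code> is hypothetical, and it covers only the bounds part of the postcondition.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical helper (not part of the benchmark): the bounds part of the
-- repaired postcondition, -1 ≤ result &lt; arr.size, as a Boolean check.
def inRange (arr : Array Int) (result : Int) : Bool :=
  decide (-1 ≤ result) &amp;&amp; decide (result &lt; (arr.size : Int))

#eval (default : Int)    -- 0: the value an out-of-bounds `!` lookup falls back to
#eval inRange #[] (-1)   -- true:  the only legitimate answer for an empty array
#eval inRange #[] 0      -- false: the accidental witness is now rejected
#eval inRange #[] (-2)   -- false: so is any sentinel below -1
</code></pre></div></div>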

<h3 id="example-2-silently-truncated-subtraction">Example 2: Silently Truncated Subtraction</h3>

<p>The second example comes from CLEVER’s problem 79. The task is to convert a number in decimal to binary and wrap the result with <code class="language-plaintext highlighter-rouge">"db"</code> at both ends—so the desired output for <code class="language-plaintext highlighter-rouge">32</code> is <code class="language-plaintext highlighter-rouge">"db100000db"</code>. CLEVER’s specification reads:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">problem_spec</span>
  (<span class="n">implementation</span> : <span class="n">Nat</span> <span class="o">→</span> <span class="n">String</span>)
  (<span class="n">decimal</span> : <span class="n">Nat</span>) :=
  <span class="n">let</span> <span class="n">spec</span> (<span class="n">result</span> : <span class="n">String</span>) :=
    <span class="mi">4</span> <span class="o">&lt;</span> <span class="n">result</span><span class="o">.</span><span class="n">length</span> <span class="o">∧</span>
    <span class="n">result</span><span class="o">.</span><span class="n">drop</span> (<span class="n">result</span><span class="o">.</span><span class="n">length</span> <span class="o">-</span> <span class="mi">2</span>) <span class="o">=</span> <span class="s">"db"</span> <span class="o">∧</span>
    <span class="n">result</span><span class="o">.</span><span class="n">take</span> <span class="mi">2</span> <span class="o">=</span> <span class="s">"db"</span> <span class="o">∧</span>
    <span class="n">let</span> <span class="n">resultTrimmed</span> :=
      (<span class="n">result</span><span class="o">.</span><span class="n">toList</span><span class="o">.</span><span class="n">drop</span> <span class="mi">2</span>)<span class="o">.</span><span class="n">dropLast</span><span class="o">.</span><span class="n">dropLast</span><span class="o">.</span><span class="n">map</span>
        (<span class="k">fun</span> <span class="n">c</span> <span class="o">=&gt;</span> <span class="n">c</span><span class="o">.</span><span class="n">toNat</span> <span class="o">-</span> <span class="err">'</span><span class="mi">0</span><span class="err">'</span><span class="o">.</span><span class="n">toNat</span>)
    <span class="n">decimal</span> <span class="o">=</span> <span class="n">Nat</span><span class="o">.</span><span class="n">ofDigits</span> <span class="mi">2</span> <span class="n">resultTrimmed</span><span class="o">.</span><span class="n">reverse</span>
  <span class="o">∃</span> <span class="n">result</span>, <span class="n">implementation</span> <span class="n">decimal</span> <span class="o">=</span> <span class="n">result</span> <span class="o">∧</span>
  <span class="n">spec</span> <span class="n">result</span>
</code></pre></div></div>

<p>The first three conjuncts say that <code class="language-plaintext highlighter-rouge">result</code> is longer than four characters and both starts and ends with <code class="language-plaintext highlighter-rouge">"db"</code>. The next definition strips the wrappers and turns each remaining character into a digit by subtracting the Unicode code point of <code class="language-plaintext highlighter-rouge">'0'</code>. The last conjunct says that the resulting digits, read in base 2, equal the input.</p>

<p>The bug hides in that subtraction. Lean’s <code class="language-plaintext highlighter-rouge">Nat</code> subtraction is truncated: any negative result clamps to <code class="language-plaintext highlighter-rouge">0</code>. So <em>any</em> character whose Unicode value is $\le$ that of <code class="language-plaintext highlighter-rouge">'0'</code>—including <code class="language-plaintext highlighter-rouge">'/'</code>—silently maps to <code class="language-plaintext highlighter-rouge">0</code>. When the input is <code class="language-plaintext highlighter-rouge">0</code>, the postcondition therefore accepts both <code class="language-plaintext highlighter-rouge">"db0db"</code> and <code class="language-plaintext highlighter-rouge">"db/db"</code>.</p>
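
<p>The truncation is easy to reproduce in isolation. The sketch below uses only core Lean and replays the digit-extraction step of the spec on both strings. The helper names <code class="language-plaintext highlighter-rouge">digitsOf</code> and <code class="language-plaintext highlighter-rouge">onlyBinaryDigits</code> are ours, and the final guard is just one possible repair, assumed here for illustration and not taken from CLEVER.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical helper: the digit-extraction step from the spec above.
def digitsOf (result : String) : List Nat :=
  (result.toList.drop 2).dropLast.dropLast.map
    (fun c =&gt; c.toNat - '0'.toNat)

#eval '/'.toNat - '0'.toNat   -- 0: 47 - 48 is truncated on Nat
#eval digitsOf "db0db"        -- [0]
#eval digitsOf "db/db"        -- [0]: indistinguishable from the line above

-- One possible guard (an assumption, not CLEVER's fix): every payload
-- character must be a binary digit.
def onlyBinaryDigits (result : String) : Bool :=
  (result.toList.drop 2).dropLast.dropLast.all
    (fun c =&gt; c == '0' || c == '1')

#eval onlyBinaryDigits "db0db"  -- true
#eval onlyBinaryDigits "db/db"  -- false
</code></pre></div></div>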

<p>A random generator would essentially never produce a string starting and ending with <code class="language-plaintext highlighter-rouge">"db"</code> by chance—the structural constraint is too tight. Mutation-based sampling, on the other hand, perturbs the expected output one character at a time and immediately exposes the bug. Mutation also surfaced a different underspecification in VERINA’s Basic 97 (an in-place update of an array element to <code class="language-plaintext highlighter-rouge">60</code>): the postcondition does not require the output to have the same length as the input, and PBT caught this by appending an extra element to the expected result.</p>

<h3 id="example-3-catching-implementation-bugs">Example 3: Catching Implementation Bugs</h3>

<p>The third example, from CLEVER’s problem 9, is a bonus: we were hunting for spec bugs, but this one is an <em>implementation</em> bug. The task is to compute the running maximum of a list of integers, and the reference implementation is:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">implementation</span> (<span class="n">numbers</span> : <span class="n">List</span> <span class="n">Int</span>) : <span class="n">List</span> <span class="n">Int</span> :=
  <span class="n">let</span> <span class="n">rec</span> <span class="n">rolling_max</span>
          (<span class="n">numbers</span> : <span class="n">List</span> <span class="n">Int</span>)
          (<span class="n">results</span> : <span class="n">List</span> <span class="n">Int</span>)
          (<span class="n">acc</span> : <span class="n">Int</span>) : <span class="n">List</span> <span class="n">Int</span> :=
            <span class="k">match</span> <span class="n">numbers</span> <span class="k">with</span>
            <span class="o">|</span> [] <span class="o">=&gt;</span> <span class="n">results</span>
            <span class="o">|</span> <span class="n">n</span> :: <span class="n">ns</span> <span class="o">=&gt;</span>
              <span class="n">let</span> <span class="n">new_acc</span> := <span class="n">max</span> <span class="n">acc</span> <span class="n">n</span>
              <span class="n">let</span> <span class="n">new_results</span> := <span class="n">results</span> <span class="o">++</span> [<span class="n">new_acc</span>]
              <span class="n">rolling_max</span> <span class="n">ns</span> <span class="n">new_results</span> <span class="n">new_acc</span>
  <span class="n">rolling_max</span> <span class="n">numbers</span> [] <span class="mi">0</span>
</code></pre></div></div>

<p>The accumulator <code class="language-plaintext highlighter-rouge">acc</code> is initialised to <code class="language-plaintext highlighter-rouge">0</code>, which gives the wrong answer whenever the first element of the input is negative. PBT generated such an input within a handful of samples, the implementation’s output failed the spec, and the bug came out for free.</p>
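
<p>The failure and one possible repair can be sketched compactly. The names <code class="language-plaintext highlighter-rouge">rollingMaxFrom</code>, <code class="language-plaintext highlighter-rouge">rollingMaxBuggy</code>, and <code class="language-plaintext highlighter-rouge">rollingMaxFixed</code> are ours rather than CLEVER's: the buggy variant mirrors the reference implementation's choice of seeding the accumulator with <code class="language-plaintext highlighter-rouge">0</code>, while the repaired variant seeds it with the first element of the input.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Running maximum of a list, starting from a given accumulator.
def rollingMaxFrom (acc : Int) : List Int → List Int
  | [] =&gt; []
  | n :: ns =&gt;
    let acc' := max acc n
    acc' :: rollingMaxFrom acc' ns

-- Mirrors the reference implementation: the accumulator starts at 0.
def rollingMaxBuggy (numbers : List Int) : List Int :=
  rollingMaxFrom 0 numbers

-- One possible repair: start from the first element instead.
def rollingMaxFixed : List Int → List Int
  | [] =&gt; []
  | n :: ns =&gt; n :: rollingMaxFrom n ns

#eval rollingMaxBuggy [-3, -1, -2]  -- [0, 0, 0]
#eval rollingMaxFixed [-3, -1, -2]  -- [-3, -1, -1]
</code></pre></div></div>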

<h2 id="limitations-and-conclusions">Limitations and Conclusions</h2>

<p>In the traditional formal verification setting—where we want to check that a <em>program</em> meets a specification we trust—testing is undoubtedly a much weaker tool than a formal proof: a test suite can never quite rule out an unseen bug, while a verifier gives a guarantee that no amount of testing can match. When the artefact under scrutiny is the <em>specification</em> itself, however, the picture becomes much subtler. The underlying problem—<em>does this formal statement faithfully capture what the human actually meant?</em>—is inherently non-formal, and even SPOT-style symbolic validation does not give a 100% guarantee: it is only as good as the test cases one chooses. Once absolute certainty is no longer on the table for any method, PBT-based validation becomes a genuinely worthy alternative in the design space, with the added benefit of being much cheaper than anything that involves proofs.</p>

<p>That said, our method leaves room for several improvements, which we discuss next.</p>

<p>First, the uniqueness property simply does not hold for every task. A clean example is quicksort-style partitioning, where the relative order within each partition is irrelevant to correctness and the spec legitimately admits many distinct outputs for the same input. Our approach can, however, be extended by replacing uniqueness with other, more permissive universally-quantified meta-properties of specifications—for instance, “any two accepted outputs are related by some user-supplied equivalence”. Identifying useful such properties and integrating them into a single PBT-driven validation pipeline is an exciting direction for future work.</p>
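
<p>Stated in Lean, the two meta-properties differ in a single place. The definitions below are a hypothetical formulation for illustration, not the ones used in our tooling: <code class="language-plaintext highlighter-rouge">post x y</code> is read as “the spec accepts output <code class="language-plaintext highlighter-rouge">y</code> for input <code class="language-plaintext highlighter-rouge">x</code>”, and <code class="language-plaintext highlighter-rouge">equiv</code> stands for the user-supplied equivalence.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Uniqueness: any two accepted outputs for the same input coincide.
def UniqueOutput {α β : Type} (post : α → β → Prop) : Prop :=
  ∀ x y₁ y₂, post x y₁ → post x y₂ → y₁ = y₂

-- Relaxed variant: any two accepted outputs are related by `equiv`.
def UniqueUpTo {α β : Type} (equiv : β → β → Prop) (post : α → β → Prop) : Prop :=
  ∀ x y₁ y₂, post x y₁ → post x y₂ → equiv y₁ y₂

-- Plain uniqueness is the special case where the equivalence is equality.
example {α β : Type} (post : α → β → Prop) (h : UniqueOutput post) :
    UniqueUpTo (fun a b =&gt; a = b) post :=
  fun x y₁ y₂ h₁ h₂ =&gt; h x y₁ y₂ h₁ h₂
</code></pre></div></div>

<p>For a partitioning task, <code class="language-plaintext highlighter-rouge">equiv</code> could, for example, relate two outputs whose partitions are permutations of each other.</p>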

<p>Second, a good specification should capture the relation between inputs and outputs, rather than encode a particular algorithm in disguise. PBT can flag specs that are too weak or too strong, but it has nothing to say about specs that are simply <em>too operational</em>. We believe this dimension can be addressed by restricting the specification language itself (e.g., to a fragment built from a fixed set of computational primitives), and we view this, too, as a promising future direction.</p>

<p>More broadly, the most interesting story here is probably not “testing <em>versus</em> proof” but “testing <em>and</em> proofs”: combining lightweight randomised validation with heavyweight deductive methods opens up an intriguing design space for cutting the cost of formal verification and verified code generation, using each technique where it is the most appropriate.<sup id="fnref:testdefs" role="doc-noteref"><a href="#fn:testdefs" class="footnote" rel="footnote">4</a></sup> Readers curious about how specification testing fits into the larger picture of verifiable code generation may enjoy <a href="https://arxiv.org/abs/2604.16584">our recent paper on LeetProof</a>, where PBT-based specification validation works alongside SMT and agentic proof search in a single end-to-end pipeline.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:decidable" role="doc-endnote">
      <p>There is a subtlety here: Plausible can only refute propositions
that are <em>decidable</em>—i.e., come with a procedure that returns a Boolean
for every concrete input. In practice, Lean’s type class resolution
synthesises the required <code class="language-plaintext highlighter-rouge">Decidable</code> instance automatically for most
propositions one encounters in program specifications: equalities over
built-in types, comparisons, finitely-quantified statements, and Boolean
combinations of all of the above. <a href="#fnref:decidable" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:tester" role="doc-endnote">
      <p>The details of how Velvet turns a method signature into an
executable tester are described in <a href="https://verse-lab.org/papers/velvet-cav26.pdf">our CAV’26
paper</a>. <a href="#fnref:tester" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:testcases" role="doc-endnote">
      <p>In practice, these input–output pairs can be drafted by an LLM
and then sanity-checked by a human—the same assumption SPOTs make about
their concrete witnesses. <a href="#fnref:testcases" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:testdefs" role="doc-endnote">
      <p>The same applies beyond specifications for synthesised programs:
testing the definitions that themselves appear inside our theorems—e.g.,
a programming language semantics in a soundness statement—is
just as valuable, since a flawed definition can make even a true theorem
useless. See <a href="/2026/03/18/move-borrow-checker-lean/">our earlier post on mechanising the Move borrow checker in
Lean</a> for a workflow
of this kind. <a href="#fnref:testdefs" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Yueyang Feng</name></author><category term="lean" /><category term="specification" /><category term="testing" /><category term="smt" /><category term="ai" /><summary type="html"><![CDATA[In this post, we show that property-based testing (PBT) is surprisingly effective for validating LLM-synthesised specifications of Lean programs: it is a cheap alternative to symbolic proofs, which helped to detect underspecification in 10% of the specs in state-of-the-art benchmarks for verified code generation.]]></summary></entry><entry><title type="html">Verifying Move Borrow Checker in Lean: an Experiment in AI-Assisted PL Metatheory</title><link href="https://proofsandintuitions.net/2026/03/18/move-borrow-checker-lean/" rel="alternate" type="text/html" title="Verifying Move Borrow Checker in Lean: an Experiment in AI-Assisted PL Metatheory" /><published>2026-03-18T00:00:00+00:00</published><updated>2026-03-18T00:00:00+00:00</updated><id>https://proofsandintuitions.net/2026/03/18/move-borrow-checker-lean</id><content type="html" xml:base="https://proofsandintuitions.net/2026/03/18/move-borrow-checker-lean/"><![CDATA[<p>I formalised and proved the correctness of <a href="https://www.sui.io/move">Move</a>’s new
borrow checker in Lean: 39,000 lines of mechanised metatheory, produced in under
a month with the help of an AI coding assistant. This post tells the story of
how it went and what it means for the future of PL research.</p>

<!--more-->

<blockquote>
  <p><strong>Reading guide.</strong> This is a long post. Depending on your background, you may want to skip ahead:</p>
  <ul>
    <li>If you are new to programming language research and curious about it, keep reading.</li>
    <li>If you already know what PL metatheory is, jump to <a href="#move-programming-language-and-its-borrow-checker">Move and its borrow checker</a>.</li>
    <li>For the step-by-step mechanisation story with AI, start at <a href="#encoding-move-typing-rules-in-lean">Encoding typing rules in Lean</a>.</li>
    <li>For the anecdotes on using AI “in anger”, see <a href="#soundness-proof-the-labours-of-claude">Soundness Proof: the Labours of Claude</a>.</li>
    <li>For numbers, see <a href="#some-statistics">Some statistics</a>.</li>
    <li>For the big picture, skip to <a href="#what-this-all-means-for-the-future-of-pl-research">What this all means for PL research</a>.</li>
  </ul>
</blockquote>

<p>The Programming Languages (PL) research community was one of the earliest
adopters of interactive theorem provers, such as Rocq, Agda, and
Isabelle/HOL, as a key technology for gaining trust in the produced
formal models and proofs of their properties. The famous <a href="https://www.seas.upenn.edu/~plclub/poplmark/">POPLmark
Challenge</a>, a set of benchmarks
aimed at identifying the main hurdles when stating and proving theorems about
programming languages, turned 20 last year.</p>

<p>Proving properties of a PL artefact, such as an <a href="https://compcert.org/">optimising
compiler</a> or a <a href="https://dl.acm.org/doi/10.1145/3485516">type system
proposal</a>, has traditionally been
considered a significant challenge, and, while not insurmountable by trained
human provers, it has become almost a necessary requirement to publish a
research paper at one of the top-tier PL conferences. Some perceive this trend
as a rite of passage. I believe a healthier way is to think of machine-assisted
proofs about programming languages as a way to sharpen one’s definitions and
statements: it is widely recognised that ugly definitions usually result in
significantly more laborious proofs.</p>

<p>The sheer amount of human time spent by PL researchers over the past two decades
formalising their results in provers, such as Rocq, is so staggering that
studying the proof patterns to discover more concise ways to engineer
machine-checked proofs became a research direction of its own with quite a few
high-profile publications (<a href="https://dl.acm.org/doi/abs/10.1145/2544174.2500587">Example
1</a>, <a href="https://dl.acm.org/doi/full/10.1145/3674646">Example
2</a>, <a href="https://dl.acm.org/doi/abs/10.1145/3372885.3373817">Example
3</a>) and even entire
<a href="https://popl26.sigplan.org/home/CPP-2026">academic conferences</a> dedicated to
this topic alone. With the modern trend of applying frontier AI models to
facilitate construction of machine-checked proofs of mathematical theorems
(predominantly, in Lean) with the help of systems such as Harmonic’s
<a href="https://aristotle.harmonic.fun/">Aristotle</a> and <a href="https://axiommath.ai/">Axiom
Prover</a>, it is only a matter of time before these
advances facilitate proofs of theorems about programming languages. In this
blogpost, I describe one such experiment.</p>

<h2 id="what-is-pl-metatheory-and-why-it-needs-mechanised-proofs">What is PL Metatheory and why does it need mechanised proofs?</h2>

<p>Let us set up some terminology first. When talking about formal verification of
programs, it is important to remember that this is only meaningful when we have
a particular <em>specification</em> in mind, which describes the properties of the
program of interest that always hold (we have talked about this at length in
<a href="/2026/01/21/multi-modal-verification-velvet/">one of the previous posts</a>).</p>

<p>But what if, instead of properties of a program written in a certain language,
we are interested in <em>properties of a programming language</em> itself? If you are
wondering what these properties might be, think of your favourite programming
language and things that you believe <em>cannot</em> happen when running programs in
it. For instance, any implementation of Python guarantees <em>memory safety</em>: it
would be really surprising for one to observe a dangling pointer or a buffer
overflow when running a Python program—Python’s garbage collector ensures that
this does not happen at run time.</p>

<p>Some languages go even further and provide similar guarantees without a garbage
collector whatsoever, reasoning purely from the syntax of the programs using
mechanisms known as <em>type systems</em>. A particularly prominent example of an
interesting yet practical type system is that of Rust: through the mechanisms
of borrows and lifetimes it <em>guarantees</em> that, if a program without <code class="language-plaintext highlighter-rouge">unsafe</code>
blocks is accepted by the Rust compiler, it is free from use-after-free errors,
dangling pointers, and data races—all being artefacts of so-called <em>unsafe
pointer aliasing</em>. But why should we believe that these guarantees do indeed
hold in reality when we run our compiled Rust programs? Well, this is where the
math behind PL theory steps in: ideally, a programming language designer must
<em>prove</em> a <em>Type Soundness Theorem</em> that connects the fact that a program is
accepted by a type checker with the absence of certain runtime behaviours—the
famous <em>Well-Typed Programs Don’t Go Wrong</em> mantra coined by <a href="https://en.wikipedia.org/wiki/Robin_Milner">Robin
Milner</a> in his <a href="https://www.sciencedirect.com/science/article/pii/0022000078900144">1978
paper</a>.<sup id="fnref:types" role="doc-noteref"><a href="#fn:types" class="footnote" rel="footnote">1</a></sup></p>

<p>For any interesting programming language, the type soundness theorem is
surprisingly non-trivial to state. First, it requires a precise description of
the type system itself, defining how exactly it “analyses” a program to
determine whether it should be accepted or rejected. Second, one needs to define
a <em>semantics</em> of the language’s runtime behaviour, and the errors that are
considered “preventable” by the type system (for instance, almost no practical
type systems promise to catch errors such as divisions by zero or out-of-memory
errors). Finally, the statement of the theorem should say something about the
validity of the memory state in which we are going to run a program that has
been accepted by a type checker.</p>

<p>Of the components of a Type Soundness Theorem statement, only the type system
itself is <em>not</em> trusted. In contrast, the definition of the runtime semantics of
the language is almost always taken for granted (akin to the “laws of nature”),
while the properties of the initial state are something that is left for the
loader to take care of, so feasibility of such a requirement is not questioned
by the theorem itself. What the soundness theorem does deliver is the proof that
the type system does <em>not</em> accept programs that would result in a preventable
error at run time.</p>

<p>The study of definitions of semantics, type systems, program optimisations, and
their interactions with erroneous or harmful runtime behaviours is what is
typically called <em>PL metatheory</em>. Type soundness theorems are among the most
common statements studied and proven in PL metatheory, and this is, in a
nutshell, what we, PL researchers, do for a living.</p>

<p>Why, then, does PL metatheory call for mechanised proofs? For any remotely
interesting language, a type soundness proof amounts to a massive case analysis
over all possible language constructs, coupled with different cases for how the
runtime semantics can treat its state or individual language commands. It is not
uncommon for such proofs to span <a href="https://arxiv.org/pdf/1903.00982">dozens of
pages</a> of English prose and, unlike classical
mathematical proofs, they are very rarely intellectually rewarding: each case
follows a similar pattern, yet every single one must be checked. It is easy to
make a mistake, and even a trivial error can render the entire point of a type
system proposal false. This is why the PL community has been adopting proof assistants
since the early 2000s. The first reason was to gain trust in the results. The second
was to tame the tedium using clever proof engineering techniques. That said,
when approached by a human prover, the amount of mechanised metatheory required
for a top-tier conference paper typically still amounts to about 6–10
person-months of work. As I will try to argue below, this is about to change
with AI.</p>

<h2 id="move-programming-language-and-its-borrow-checker">Move programming language and its borrow checker</h2>

<p>Over the past few months, I have been collaborating with the developers from
<a href="https://www.mystenlabs.com/">Mysten Labs</a> on designing a new type system for
the <a href="https://www.sui.io/move">Move</a> language for smart contracts. Move is
deployed on the <a href="https://sui.io/">Sui</a> and <a href="https://aptosfoundation.org/">Aptos</a>
blockchains. Like Rust, Move enforces an <em>ownership discipline</em>: every value has
a unique owner, and references (called <em>borrows</em>) must not outlive the values
they borrow. The borrow checker (a special type-based static analysis run by the
compiler) rejects programs that could create dangling references, aliased
mutations, or use-after-move errors. Unlike Rust, however, Move does not allow
references inside data structures (so, no <a href="https://rust-unofficial.github.io/too-many-lists/">linked
lists</a>): every reference is
an <em>access path</em> rooted in a local variable and descending through a sequence of
field names. This restriction eliminates the need for lifetime annotations
entirely: the type system can track the provenance of references using just their
paths.</p>

<p>The key idea behind the new borrow checker design is to track reachability
between references using <em>regular expressions</em> over field paths. For example,
when a reference <code class="language-plaintext highlighter-rouge">w</code> borrows a deep path <code class="language-plaintext highlighter-rouge">p.x.f</code>, the type system registers the
regex <code class="language-plaintext highlighter-rouge">x · f</code> as the path from the parent reference to <code class="language-plaintext highlighter-rouge">w</code>. When another
reference <code class="language-plaintext highlighter-rouge">u</code> later re-borrows <code class="language-plaintext highlighter-rouge">p.x</code>, the type checker computes the <a href="https://en.wikipedia.org/wiki/Brzozowski_derivative"><em>Brzozowski
derivative</em></a> (i.e.,
“stripping” the shared prefix <code class="language-plaintext highlighter-rouge">x</code>) to discover whether <code class="language-plaintext highlighter-rouge">u</code> can reach <code class="language-plaintext highlighter-rouge">w</code>. If the
resulting regex is non-empty, the two references overlap in memory, so writing
through <code class="language-plaintext highlighter-rouge">u</code> would invalidate <code class="language-plaintext highlighter-rouge">w</code>.</p>

<p>Consider the following program, written in MoveIR, an LLVM-like intermediate
language of Move with minimalistic syntax and explicit borrow/move/copy
annotations, which makes it a better target for mechanisation than plain Move:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct S { f: u64 }
struct P { x: Self.S }
t() {
    let p: Self.P;
    let w: &amp;u64;
    let u: &amp;mut Self.S;
  label b0:
    p = P { x: S { f: 0 } };
    w = &amp;copy(&amp;p).P::x.S::f;   // deep borrow of p.x.f
    u = &amp;mut p.P::x;           // re-borrow of p.x
    *move(u) = S { f: 1 };     // typing error here!
    return;
}
</code></pre></div></div>

<p>The diagram below shows the <em>reachability graph</em> that the type checker maintains
for this program. Each node \(\rho\) is an <em>abstract reference</em>—a symbolic
representation for the respective run-time memory locations. The distinguished
\(\rho_0\) represents the stack frame root; \(\rho_p\), \(\rho_w\), and
\(\rho_u\) correspond to the variables <code class="language-plaintext highlighter-rouge">p</code>, <code class="language-plaintext highlighter-rouge">w</code>, and <code class="language-plaintext highlighter-rouge">u</code>. An edge labelled with
a regex records which field paths can be traversed from one location to reach
another.</p>

<p><img src="/assets/images/move-nested-aliasing.svg" alt="Reachability graph for the nested aliasing
example" /></p>

<p>Here, <code class="language-plaintext highlighter-rouge">w</code> borrows the deep path <code class="language-plaintext highlighter-rouge">p.x.f</code>, while <code class="language-plaintext highlighter-rouge">u</code> re-borrows <code class="language-plaintext highlighter-rouge">p.x</code>. The
derivative \(\delta(\mathtt{x} \cdot \mathtt{f},\, \mathtt{x}) = \mathtt{f}
\neq \emptyset\) reveals that \(\rho_u\) can reach \(\rho_w\) via field <code class="language-plaintext highlighter-rouge">f</code>
(shown as the blue edge). The write on
the last line triggers the safety check, which fails: overwriting the entire
<code class="language-plaintext highlighter-rouge">S</code> behind <code class="language-plaintext highlighter-rouge">u</code> would invalidate the deep borrow <code class="language-plaintext highlighter-rouge">w</code>, creating a dangling
reference. This is precisely the kind of subtle aliasing that the regex-based
approach catches through a clean, decidable mechanism.</p>
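
<p>The derivative computation itself is small enough to sketch in a few lines of Lean. The definitions below are a textbook Brzozowski derivative over a toy regex type of our own (<code class="language-plaintext highlighter-rouge">Re</code>, <code class="language-plaintext highlighter-rouge">nullable</code>, <code class="language-plaintext highlighter-rouge">deriv</code>, and <code class="language-plaintext highlighter-rouge">isEmpty</code> are hypothetical names), not the definitions from the actual formalisation, and they perform no simplification of the resulting regex.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Toy regular expressions over field names (hypothetical, for illustration).
inductive Re where
  | empty                 -- matches no path
  | eps                   -- matches the empty path
  | field (f : String)    -- a single field step
  | seq (r s : Re)        -- r · s
  | alt (r s : Re)        -- r + s
  | star (r : Re)         -- r*

-- Does the regex accept the empty path?
def nullable : Re → Bool
  | Re.empty   =&gt; false
  | Re.eps     =&gt; true
  | Re.field _ =&gt; false
  | Re.seq r s =&gt; nullable r &amp;&amp; nullable s
  | Re.alt r s =&gt; nullable r || nullable s
  | Re.star _  =&gt; true

-- Brzozowski derivative: the paths that remain after consuming field `f`.
def deriv (f : String) : Re → Re
  | Re.empty   =&gt; Re.empty
  | Re.eps     =&gt; Re.empty
  | Re.field g =&gt; if f == g then Re.eps else Re.empty
  | Re.seq r s =&gt;
    if nullable r then Re.alt (Re.seq (deriv f r) s) (deriv f s)
    else Re.seq (deriv f r) s
  | Re.alt r s =&gt; Re.alt (deriv f r) (deriv f s)
  | Re.star r  =&gt; Re.seq (deriv f r) (Re.star r)

-- Is the language of the regex empty?
def isEmpty : Re → Bool
  | Re.empty   =&gt; true
  | Re.eps     =&gt; false
  | Re.field _ =&gt; false
  | Re.seq r s =&gt; isEmpty r || isEmpty s
  | Re.alt r s =&gt; isEmpty r &amp;&amp; isEmpty s
  | Re.star _  =&gt; false

-- δ(x · f, x) is non-empty, so a reference to p.x can reach p.x.f.
#eval isEmpty (deriv "x" (Re.seq (Re.field "x") (Re.field "f")))  -- false
-- A different first field leaves nothing behind: no overlap.
#eval isEmpty (deriv "y" (Re.seq (Re.field "x") (Re.field "f")))  -- true
</code></pre></div></div>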

<p>This design has been fully implemented in a branch of the Sui blockchain client,
where it superseded the <a href="https://arxiv.org/abs/2205.05181">original</a> (much more
complicated and non-formalised) borrow analyser while maintaining full backwards
compatibility. But for a blockchain language, implementation alone is not
enough: the designers want <em>ironclad guarantees</em> that the borrow checker is
correct, in the sense of the Type Soundness Theorem explained above. A bug in
the borrow checker’s logic could allow an attacker to exploit a dangling
reference, potentially compromising on-chain funds.</p>

<p>This is where our mechanisation effort comes in. We wanted to formalise the new
type system in Lean and prove it sound with respect to a reasonable semantics of
MoveIR. Our ambition was to cover as much of Move as possible, not just a “toy
subset” (as customary in proof-of-concept academic prototypes), and to get
confidence that the formalisation faithfully represents the actual deployed
implementation.</p>

<h2 id="encoding-move-typing-rules-in-lean">Encoding Move typing rules in Lean</h2>

<p>When PL researchers present a type system, they typically write it down as a
collection of <em>inference rules</em> in a logical notation. For instance, the typing
rule for writes through a mutable reference looks like this:</p>

\[\text{T-WriteRef} \quad \frac{\displaystyle \Sigma(a) = \mathsf{ref}(\tau, \rho, \mathsf{M}) \quad \Sigma(b) = \mathsf{basic}(\tau) \quad \mathsf{check\_outbound}(\Pi, \rho) \atop \displaystyle \Lambda;\; \mathcal{E}[\Sigma := \Sigma \setminus \{a, b\},\; \Pi := \Pi] \vdash \mathit{cont} : \overline{T}}{\Lambda;\; \mathcal{E} \vdash {*}a := b;\; \mathit{cont} : \overline{T}}\]

<p>Reading bottom-to-top: to type-check a write statement \(*a := b\), the rule
requires that \(a\) holds a mutable reference of type \(\tau\), that \(b\) holds
a basic value of the <em>same</em> type \(\tau\), and, crucially, that
\(\mathsf{check\_outbound}\) passes, meaning no existing borrow extends beyond
\(\rho\) (otherwise the write would create a dangling reference). After the
write, both <em>sites</em> are removed from \(\Sigma\). Here, a <em>site</em> is a named
temporary slot in the stack frame that holds the value produced by an expression
before it is consumed by the next operation (think of SSA registers in LLVM),
and \(\Sigma\) is the <em>site environment</em>—a map from site names to their types
that the type checker maintains as it walks through the program. This is not
what the compiler implements directly: the inference rule casts type checking as
<em>proof construction</em> in a domain-specific logic. A program that type-checks is
one for which a proof tree can be assembled from such rules.</p>

<p>The first step in our mechanisation was to encode MoveIR’s syntax and all its
typing rules in Lean. I wrote these definitions by hand. Here is a simplified
version of the $\text{T-WriteRef}$ rule in Lean:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">inductive</span> <span class="n">typecheck_stmt</span> : <span class="n">LabelEnv</span> <span class="o">→</span> <span class="n">TypeEnv</span> <span class="o">→</span> <span class="n">Stmt</span> <span class="o">→</span> <span class="n">List</span> <span class="n">MoveType</span> <span class="o">→</span> <span class="kt">Prop</span> <span class="n">where</span>
  <span class="o">...</span>
  <span class="o">|</span> <span class="n">write_ref</span> : <span class="o">∀</span> <span class="err">Λ</span> <span class="err">𝓔</span> <span class="n">a</span> <span class="n">b</span> <span class="err">τ</span> <span class="err">ρ</span> <span class="n">cont</span> <span class="n">Ts</span>,
      <span class="err">𝓔</span><span class="o">.</span><span class="err">Σ</span>(<span class="n">a</span>) <span class="o">=</span> <span class="o">.</span><span class="n">ref</span> <span class="err">τ</span> <span class="err">ρ</span> <span class="o">.</span><span class="n">mut</span> <span class="o">→</span>
      <span class="err">𝓔</span><span class="o">.</span><span class="err">Σ</span>(<span class="n">b</span>) <span class="o">=</span> <span class="o">.</span><span class="n">basic</span> <span class="err">τ</span> <span class="o">→</span>
      <span class="n">check_outbound</span> <span class="err">𝓔</span><span class="o">.Π</span> <span class="err">ρ</span> <span class="o">→</span>
      <span class="n">typecheck_stmt</span> <span class="err">Λ</span> (<span class="err">𝓔</span>[<span class="err">Σ</span> := <span class="err">𝓔</span><span class="o">.</span><span class="err">Σ</span> <span class="err">\</span> <span class="err">{</span><span class="n">a</span>, <span class="n">b</span><span class="err">}</span>]) <span class="n">cont</span> <span class="n">Ts</span> <span class="o">→</span>
      <span class="n">typecheck_stmt</span> <span class="err">Λ</span> <span class="err">𝓔</span> (<span class="o">*</span><span class="n">a</span> := <span class="n">b</span><span class="o">;</span> <span class="n">cont</span>) <span class="n">Ts</span>
  <span class="o">...</span>
</code></pre></div></div>

<p>Each premise of the inference rule becomes a hypothesis in the Lean
constructor: the lookups check site types, <code class="language-plaintext highlighter-rouge">check_outbound</code> enforces the
borrow safety condition, and the recursive <code class="language-plaintext highlighter-rouge">typecheck_stmt</code> call types the
continuation under the updated environment.</p>

<p>With this encoding of the MoveIR type system in place, we can already <em>validate</em>
how faithful our Lean model is: take a concrete program from Move’s test suite,
translate it to MoveIR, and try to <em>prove</em> that the typing judgement holds—I
will call such proofs <em>conformance proofs</em>. This was the point at which I
started using Claude Code (Opus 4.5) to construct these proofs. To my surprise,
the AI could handle tiny MoveIR programs (4–5 lines of code) successfully,
assembling the proof tree step by step. But for anything larger, the approach
quickly became impractical: each proof step required instantiating the right
rule constructor, providing witnesses for existential variables, and discharging
side conditions about the path environment, all of which grew rapidly with
program size. This is where I started to take the agenda of conformance proofs
seriously, but in a very different form: more on that next.</p>

<h2 id="from-conformance-proofs-to-tests-via-algorithmic-type-checking">From conformance proofs to tests via algorithmic type checking</h2>

<p>To solve the efficiency bottleneck with conformance proofs for the type system
(AI was slow and unreliable), I resorted to a more traditional approach:
implementing an actual <em>executable</em> type checker in Lean and running it on the
same tests as the production Move implementation, making sure that it accepts
and rejects the same programs. Instead of constructing proof trees, the
algorithmic checker is a plain recursive Lean function that returns an updated
type environment on success or fails with <code class="language-plaintext highlighter-rouge">none</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">check_stmt</span> (<span class="n">lenv</span> : <span class="n">LabelEnv</span>) (<span class="n">env</span> : <span class="n">TypeEnv</span>) (<span class="n">s</span> : <span class="n">Stmt</span>)
    (<span class="n">retTypes</span> : <span class="n">List</span> <span class="n">ParamType</span>) : <span class="n">Option</span> <span class="n">TypeEnv</span>
</code></pre></div></div>

<p>Luckily, there was no shortage of tests in the Move compiler’s test suite, so
it did not take long to vibe-code a parser from MoveIR text into our Lean
representation and start running the tests.<sup id="fnref:parser" role="doc-noteref"><a href="#fn:parser" class="footnote" rel="footnote">2</a></sup></p>

<p>But what is the relationship between the <em>relational</em> type checker from before
(the inductive <code class="language-plaintext highlighter-rouge">typecheck_stmt</code> with inference rules) and the new <em>algorithmic</em>
one (the executable <code class="language-plaintext highlighter-rouge">check_stmt</code>)? After all, we could have introduced a bug in
the algorithmic version that makes it accept programs the relational rules would
reject, or vice versa. To close this gap, we proved (entirely by AI, of course)
the first important theorem of our development: the <em>soundness of the
algorithmic type checker</em>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">check_stmt_sound</span> (<span class="n">lenv</span> : <span class="n">LabelEnv</span>) (<span class="n">env</span> : <span class="n">TypeEnv</span>)
    (<span class="n">s</span> : <span class="n">Stmt</span>) (<span class="n">retTypes</span> : <span class="n">List</span> <span class="n">ParamType</span>) :
    <span class="o">∀</span> <span class="n">env</span><span class="err">'</span>, <span class="n">check_stmt</span> <span class="n">lenv</span> <span class="n">env</span> <span class="n">s</span> <span class="n">retTypes</span> <span class="o">=</span> <span class="n">some</span> <span class="n">env</span><span class="err">'</span> <span class="o">→</span>
    <span class="n">typecheck_stmt</span> <span class="n">lenv</span> <span class="n">env</span> <span class="n">s</span> <span class="n">retTypes</span>
</code></pre></div></div>

<p>This theorem says: whenever the algorithmic checker accepts a program (returns
<code class="language-plaintext highlighter-rouge">some env'</code>), the relational typing judgement holds. In other words, the
executable checker is a <em>sound decision procedure</em> for the type system: it never
accepts a program that the inference rules would reject. Every successful run of
<code class="language-plaintext highlighter-rouge">check_stmt</code> effectively produces a <em>certificate</em> that the relational typing
derivation exists.<sup id="fnref:completeness" role="doc-noteref"><a href="#fn:completeness" class="footnote" rel="footnote">3</a></sup></p>
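<p>To make the relational/algorithmic split concrete, here is a minimal
self-contained sketch on a toy expression language. Everything in it (the
<code class="language-plaintext highlighter-rouge">Toy</code> namespace, <code class="language-plaintext highlighter-rouge">Ty</code>, <code class="language-plaintext highlighter-rouge">Expr</code>, <code class="language-plaintext highlighter-rouge">HasType</code>, and <code class="language-plaintext highlighter-rouge">check</code>) is invented for
illustration and is not taken from the MoveIR development; only the shape of the
soundness statement is meant to match.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>namespace Toy

inductive Ty where
  | nat | bool
  deriving DecidableEq

inductive Expr where
  | num : Nat → Expr
  | tt  : Expr
  | add : Expr → Expr → Expr

-- Relational checker: one constructor per inference rule.
inductive HasType : Expr → Ty → Prop where
  | num : ∀ n, HasType (Expr.num n) Ty.nat
  | tt  : HasType Expr.tt Ty.bool
  | add : ∀ e₁ e₂, HasType e₁ Ty.nat → HasType e₂ Ty.nat →
      HasType (Expr.add e₁ e₂) Ty.nat

-- Algorithmic checker: a plain recursive function returning an Option.
def check : Expr → Option Ty
  | .num _ =&gt; some .nat
  | .tt =&gt; some .bool
  | .add e₁ e₂ =&gt;
    if check e₁ = some .nat ∧ check e₂ = some .nat then some .nat else none

-- Soundness: whatever `check` accepts has a typing derivation.
theorem check_sound : ∀ e τ, check e = some τ → HasType e τ := by
  intro e
  induction e with
  | num n =&gt;
    intro τ h
    simp only [check] at h
    cases h
    exact HasType.num n
  | tt =&gt;
    intro τ h
    simp only [check] at h
    cases h
    exact HasType.tt
  | add e₁ e₂ ih₁ ih₂ =&gt;
    intro τ h
    simp only [check] at h
    by_cases hc : check e₁ = some Ty.nat ∧ check e₂ = some Ty.nat
    · rw [if_pos hc] at h
      cases h
      exact HasType.add e₁ e₂ (ih₁ _ hc.1) (ih₂ _ hc.2)
    · rw [if_neg hc] at h
      cases h

-- A successful run of the checker yields a derivation for a concrete program.
example : HasType (Expr.add (Expr.num 1) (Expr.num 2)) Ty.nat :=
  check_sound (Expr.add (Expr.num 1) (Expr.num 2)) Ty.nat rfl

end Toy
</code></pre></div></div>

<p>The final <code class="language-plaintext highlighter-rouge">example</code> is the toy analogue of a conformance test: the proof
obligation is discharged by evaluating <code class="language-plaintext highlighter-rouge">check</code>, and the soundness theorem turns
that evaluation into a derivation in the relational system.</p>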

<p>Once we had the parser, the algorithmic checker, and the soundness proof in
place, we ran 156 conformance tests on programs drawn from Move’s own type
checker test suite. This turned out to be one of the most valuable components of
the entire development. In the later stages, we frequently needed to revise our
encoding of the typing rules—for instance, when extending the type system with
additional features—and it was the conformance tests that gave us trust that
we were still formalising the right thing.<sup id="fnref:cedar" role="doc-noteref"><a href="#fn:cedar" class="footnote" rel="footnote">4</a></sup></p>

<p>All this, of course, does not yet mean that a Move program accepted by the type
checker is actually free of dangling pointers. The algorithmic soundness theorem
only connects the executable checker to the inference rules; it says nothing
about runtime behaviour. Next, we had to provide the remaining components for
the much-desired Type Soundness Theorem and prove it.</p>

<h2 id="runtime-semantics-and-type-soundness-theorem">Runtime semantics and Type Soundness Theorem</h2>

<p>To state a Type Soundness Theorem, we need a precise definition of what it means
to <em>run</em> a program. In our Lean development, the runtime semantics takes the
form of a <a href="https://dl.acm.org/doi/10.1145/800194.805852"><em>definitional
interpreter</em></a>: a recursive
function <code class="language-plaintext highlighter-rouge">run</code> that executes a program for at most <code class="language-plaintext highlighter-rouge">fuel</code> steps, returning
either a final result or an error:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">run</span> (<span class="n">fuel</span> : <span class="n">Nat</span>) (<span class="n">state</span> : <span class="n">ExecState</span>) : <span class="n">ExecState</span> :=
  <span class="k">match</span> <span class="n">fuel</span> <span class="k">with</span>
  <span class="o">|</span> <span class="mi">0</span> <span class="o">=&gt;</span> <span class="o">.</span><span class="n">error</span> <span class="o">.</span><span class="n">outOfFuel</span>
  <span class="o">|</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">=&gt;</span>
    <span class="k">match</span> <span class="n">state</span> <span class="k">with</span>
    <span class="o">|</span> <span class="o">.</span><span class="n">running</span> <span class="n">_</span> <span class="o">=&gt;</span> <span class="n">run</span> <span class="n">n</span> (<span class="n">step</span> <span class="n">state</span>)
    <span class="o">|</span> <span class="n">other</span> <span class="o">=&gt;</span> <span class="n">other</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">fuel</code> parameter is a standard technique for defining potentially
non-terminating computations in proof assistants: instead of proving
termination, we give the interpreter a budget of steps. When the budget runs
out, it returns <code class="language-plaintext highlighter-rouge">outOfFuel</code> rather than looping forever. The soundness theorem
will quantify over <em>all</em> fuel values, so this is not a limitation: if a bug
exists, it would manifest at some finite step count.</p>

<p>The <code class="language-plaintext highlighter-rouge">step</code> function performs a single transition from one “virtual machine”
state (variables, sites, stack, heap) to another. When something goes wrong at
runtime, it produces an error state. Not all errors are created equal, though.
Some are <em>preventable</em>: reading through a dangling reference, accessing a moved
value, or encountering a type mismatch at a write. These are exactly the bugs
the borrow checker exists to catch. Others are <em>acceptable</em>: division by zero,
running out of fuel, or an explicit abort. No reasonable type system promises to
prevent those. The type soundness theorem is exactly what guarantees that a
program accepted by the type checker never reaches any preventable error.</p>
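<p>As a rough sketch of how such a classification can be written down, consider
the following. Apart from <code class="language-plaintext highlighter-rouge">outOfFuel</code> and <code class="language-plaintext highlighter-rouge">isAcceptable</code>, which appear in the
snippets of this post, the constructor names are hypothetical and the real
error type has more cases.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inductive RuntimeError where
  -- preventable: the type system is supposed to rule these out
  | danglingRef | useAfterMove | typeMismatch
  -- acceptable: no reasonable type system promises to prevent these
  | divByZero | outOfFuel | abort (code : Nat)

def RuntimeError.isAcceptable : RuntimeError → Bool
  | .divByZero | .outOfFuel | .abort _ =&gt; true
  | _ =&gt; false

example : RuntimeError.isAcceptable .danglingRef = false := rfl
example : RuntimeError.isAcceptable (.abort 7) = true := rfl
</code></pre></div></div>

<p>Keeping the classification as a Boolean function means that a soundness
statement can quantify over every error <code class="language-plaintext highlighter-rouge">e</code> with <code class="language-plaintext highlighter-rouge">¬e.isAcceptable</code> instead of
listing the preventable errors one by one.</p>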

<p>The runtime semantics is one of the <em>trusted</em> components of the formalisation:
if we get it wrong, the theorem might be true but vacuous. That said, this
particular semantics, despite being AI-generated, was relatively
straightforward—a standard small-step interpreter over a heap—so I mostly
validated it by reading the code. As additional, albeit lightweight, assurance, I
vibe-coded a number of simple “litmus” tests, checking that programs I believed
should crash due to a dangling pointer or a use-after-move error did indeed
crash under our semantics.<sup id="fnref:runtime-conformance" role="doc-noteref"><a href="#fn:runtime-conformance" class="footnote" rel="footnote">5</a></sup></p>

<p>With the semantics in hand, we can finally state the Type Soundness Theorem.
Here is a slightly simplified version from our Lean development:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">type_soundness</span> (<span class="n">f</span> : <span class="n">FunDef</span>) (<span class="n">lenv</span> : <span class="n">LabelEnv</span>)
    (<span class="n">enumEnv</span> : <span class="n">EnumEnv</span>) (<span class="n">funEnv</span> : <span class="n">AssocMap</span> <span class="n">Id</span> <span class="n">FunDef</span>)
    (<span class="n">args</span> : <span class="n">List</span> <span class="n">Value</span>) (<span class="n">heap</span> : <span class="n">Heap</span>)
    (<span class="n">htyped</span> : <span class="n">typecheck_fun</span> <span class="n">f</span> <span class="n">lenv</span> <span class="n">enumEnv</span>)
    (<span class="n">hfunEnv</span> : <span class="o">∀</span> <span class="n">fname</span> <span class="n">fdef</span>, <span class="n">lookup</span> <span class="n">funEnv</span> <span class="n">fname</span> <span class="o">=</span> <span class="n">some</span> <span class="n">fdef</span> <span class="o">→</span>
               <span class="n">FunTypeSafe</span> <span class="n">fdef</span> <span class="n">funEnv</span> <span class="n">enumEnv</span>)
    (<span class="n">ha</span> : <span class="n">SoundnessAssumptions</span> <span class="n">f</span> <span class="n">lenv</span> <span class="n">enumEnv</span> <span class="n">funEnv</span> <span class="n">heap</span> <span class="n">args</span>)
    (<span class="n">e</span> : <span class="n">RuntimeError</span>) (<span class="n">hna</span> : <span class="o">¬</span><span class="n">e</span><span class="o">.</span><span class="n">isAcceptable</span>) :
    <span class="o">∀</span> <span class="n">n</span>, <span class="n">run</span> <span class="n">n</span> (<span class="n">initState</span> <span class="n">f</span> <span class="n">funEnv</span> <span class="n">args</span> <span class="n">heap</span>) <span class="o">≠</span> <span class="o">.</span><span class="n">error</span> <span class="n">e</span>
</code></pre></div></div>

<p>Reading this bottom-to-top: for <em>any</em> non-acceptable error <code class="language-plaintext highlighter-rouge">e</code> and <em>any</em> fuel
budget <code class="language-plaintext highlighter-rouge">n</code>, running the function <code class="language-plaintext highlighter-rouge">f</code> never produces that error. The premises
require three things. First, <code class="language-plaintext highlighter-rouge">htyped</code>: the function type-checks under our
relational type system. Second, <code class="language-plaintext highlighter-rouge">hfunEnv</code>: every function that <code class="language-plaintext highlighter-rouge">f</code> might call is
itself type-safe (this gives us modular reasoning—we check one function at a
time). Third, <code class="language-plaintext highlighter-rouge">ha</code>: a record called <code class="language-plaintext highlighter-rouge">SoundnessAssumptions</code> that bundles 23
well-formedness preconditions on the initial state—things like “argument types
match parameter declarations”, “the heap contains no dangling locations”, “label
environments are complete”, and structural invariants on enum definitions. These
assumptions on the initial state are yet another point where the formalisation
could become vacuous: if the 23 preconditions are mutually contradictory, the
theorem would be true but useless, because there would simply be no valid inputs
to run the function on! We will discuss how to address this concern shortly.</p>
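<p>The vacuity risk is easy to reproduce in miniature. In the sketch below, both
assumption bundles are made up for illustration and have nothing to do with the
actual 23 preconditions: the first one is contradictory, so an absurd
“theorem” follows from it, while the second one is shown to be satisfiable by
exhibiting a concrete witness.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical bundle whose two clauses can never hold together.
structure BadAssumptions (n : Nat) : Prop where
  small : n &lt; 3
  large : 5 &lt; n

-- Under contradictory premises any conclusion is provable, even an absurd one.
theorem vacuous (n : Nat) (h : BadAssumptions n) : n = 0 ∧ n = 1 :=
  absurd (Nat.lt_trans h.large h.small) (by decide)

-- Hypothetical bundle that is consistent.
structure OkAssumptions (n : Nat) : Prop where
  lower : 2 &lt; n
  upper : n &lt; 6

-- One concrete input satisfying the bundle rules out vacuity.
example : ∃ n, OkAssumptions n := ⟨4, ⟨by decide, by decide⟩⟩
</code></pre></div></div>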

<h2 id="soundness-proof-the-labours-of-claude">Soundness Proof: the Labours of Claude</h2>

<p>Everything described so far—encoding the type system, building the
algorithmic checker, writing the parser, running conformance tests—was, in
retrospect, the easy part. The real struggle was proving the Type Soundness
Theorem itself.</p>

<p>The standard approach, introduced by <a href="https://www.sciencedirect.com/science/article/pii/S0890540184710935">Wright and
Felleisen</a>
in 1994, decomposes type soundness into two lemmas: <em>progress</em> and
<em>preservation</em>.<sup id="fnref:stronger" role="doc-noteref"><a href="#fn:stronger" class="footnote" rel="footnote">6</a></sup> Progress says: if the current state is well-typed,
the machine can take a step not resulting in a preventable error. Preservation
says: if the current state is well-typed and the machine takes a step, the
resulting state is <em>also</em> well-typed. Together, they give an inductive argument:
the initial state is well-typed (by the soundness assumptions), each step
preserves well-typedness (preservation), and no well-typed state can crash
(progress), so the machine never reaches a preventable error, no matter how many
steps it takes. Progress was straightforward: for each typing rule, the relevant
premises guarantee that the corresponding runtime operation succeeds. The real
beast was preservation.</p>
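<p>The inductive argument itself fits in a few lines once the language-specific
details are abstracted away. The sketch below is deliberately simplified: <code class="language-plaintext highlighter-rouge">runN</code>,
<code class="language-plaintext highlighter-rouge">WellTyped</code>, and <code class="language-plaintext highlighter-rouge">Bad</code> are placeholders rather than definitions from our
development, the step function is assumed to be total, and progress is
collapsed into “a well-typed state is not a bad state”.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Iterate a (hypothetical) total step function n times.
def runN {State : Type} (step : State → State) : Nat → State → State
  | 0, s =&gt; s
  | n + 1, s =&gt; runN step n (step s)

theorem safe_of_progress_preservation {State : Type}
    (step : State → State) (WellTyped Bad : State → Prop)
    (preservation : ∀ s, WellTyped s → WellTyped (step s))
    (progress : ∀ s, WellTyped s → ¬ Bad s) :
    ∀ n s, WellTyped s → ¬ Bad (runN step n s) := by
  intro n
  induction n with
  | zero =&gt;
    -- No steps taken: the well-typed start state is itself not bad.
    intro s h
    exact progress s h
  | succ n ih =&gt;
    -- One step preserves well-typedness; the rest follows by induction.
    intro s h
    exact ih (step s) (preservation s h)
</code></pre></div></div>

<p>All the difficulty of a real proof hides in establishing the two hypotheses for
a concrete <code class="language-plaintext highlighter-rouge">step</code> and a concrete notion of well-typedness.</p>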

<h3 id="preservation-proof-an-exercise-in-invariant-inference">Preservation proof: an exercise in invariant inference</h3>

<p>The crux of preservation is defining what a “well-typed state” means for a
running machine. Remember, the type system deals with <em>syntactic</em> entities such
as types, type environments, abstract references, and regexes, while the runtime
operates on heap locations, actual values, and reference chains. The two worlds
need to be connected. A <em>well-typed state invariant</em> is precisely this bridge: a
predicate that relates the concrete machine state to the abstract type
environment, asserting that every promise made by the type checker is backed by
reality in the heap. In our case, it is a Lean record with 35 fields. For
instance, some clauses say that every abstract reference \(\rho\) tracked by the
type checker maps to a live heap location, that the regex paths between
references faithfully reflect actual pointer chains in the heap (so the
reachability graph from the diagrams above is adequate), and that the
<code class="language-plaintext highlighter-rouge">check_outbound</code> condition used by $\text{T-WriteRef}$ is always satisfied for
mutable references.</p>

<p>It is nearly impossible to predict all 35 (or more) invariant fields from the
start. The way it works in practice is: you attempt the proof for a particular
statement kind, get stuck because the invariant is too weak, strengthen it with
a new clause, and then propagate the change to all other cases. This is very
similar to inferring <a href="/2026/01/21/multi-modal-verification-velvet/">loop invariants in imperative programs</a> or inductive invariants for
<a href="/2026/02/09/distributed-verification-veil/">distributed systems</a>:
you just iterate until the invariant is strong enough for the proof to be
constructed and checked by Lean.</p>

<h3 id="where-ai-did-great-and-where-it-didnt">Where AI did great, and where it didn’t</h3>

<p>Despite MoveIR being a relatively small language, it has over 20 distinct
statement forms (borrow, move, copy, field borrow, write, freeze, call, return,
jump, branch, pack, unpack, and their vector/enum variants), each with its own
semantic step rule. The preservation proof contains 153 lemmas (mostly
conjectured by AI), collectively showing that each of these steps preserves the
35-field invariant. Every time I strengthened the invariant with a new clause,
dozens of lemmas across 30+ files needed updating. This is where Claude (Opus
4.5, and later 4.6) was invaluable. The AI excelled at <em>proof repair</em>:
propagating a change through a large proof landscape, applying the same pattern
to case after case. It also handled routine preservation cases—where the
environment update is a straightforward pass-through—with high reliability.</p>

<p>But it was not all a walk in the park. At one point, Claude got stuck trying to
prove preservation for jumps to LLVM-style labelled blocks, going in circles for
hours. I had to step in and recognise that we needed a separate fact: a
<em>weakening lemma</em>. Weakening says that if a statement type-checks under a “more
restrictive” type environment (one that tracks more paths between references),
it also type-checks under a “less restrictive” one. This is what makes
control-flow joins sound: the checker types the continuation under a target
environment and verifies that each branch’s post-environment subsumes it. The
weakening lemma proof turned out to be substantial on its own: about 7,200 lines
of Lean.</p>

<p>At some point I considered switching to Harmonic’s
<a href="https://aristotle.harmonic.fun/">Aristotle</a> prover for the more difficult
lemmas, but ultimately decided against it. Claude Code made it very easy to keep
tight control over the entire development: I could refactor proof structures
on the fly, ask “why are you proving this?” mid-proof, and steer the
decomposition into lemmas interactively. Aristotle is better suited for
one-shotting complex standalone mathematical statements (or refuting them),
which is quite different from the iterative, gradual workflow I needed when
prototyping Move metatheory.</p>

<h3 id="the-dragon-function-calls">The dragon: function calls</h3>

<p>The preservation proof totals about 10,300 lines of Lean. Roughly half of that
complexity lives in just two cases: the call and return commands. The typing rule
for calls is the most involved in the entire system: it moves from reasoning
about variables allocated in a single stack frame to reasoning about the entire
call stack, which requires a bunch of additional invariants on saved frames.
Additional complexity comes from the fact that our call rule supports relatively
accurate inter-procedural tracking of borrows across call boundaries. The
details are beyond the scope of this post, but the upshot is: this was the
hardest case to dispatch. Proving that all 35 invariant fields are preserved
by this operation was, by far, the most difficult part of the entire development.
While Claude wrote the actual proof code, it took quite a few high-level hints
from me: choosing the right case splits and cutting off unproductive proof
attempts early. The call rule alone took nearly two days of Claude running
almost non-stop (which forced me to upgrade to the most expensive plan). When it
finally went through, it felt like defeating the dragon—the rest was mopping
up.</p>

<h2 id="last-wrinkles-soundness-assumptions-and-type-system-extensions">Last wrinkles: soundness assumptions and type system extensions</h2>

<p>Let us go back to the Type Soundness Theorem and the 23 <code class="language-plaintext highlighter-rouge">SoundnessAssumptions</code>
that worried us earlier. Recall the concern: if these preconditions are mutually
contradictory, the theorem is vacuously true, as there would be no valid inputs
to run the function on, and we would have proven nothing useful.</p>

<p>The fix follows the same pattern that served us well throughout this
development: make it executable and test it! Specifically, I vibe-coded a
decidable checker that collapses all 23 preconditions <em>plus</em> the type-checking
judgement into a single total Boolean function:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">SoundnessAssumptions</span><span class="o">.</span><span class="n">checkDecidable</span>
    (<span class="n">f</span> : <span class="n">FunDef</span>) (<span class="n">lenvDec</span> : <span class="n">LabelEnvDec</span>) (<span class="n">enumEnv</span> : <span class="n">EnumEnv</span>)
    (<span class="n">funEnv</span> : <span class="n">AssocMap</span> <span class="n">Id</span> <span class="n">FunDef</span>) (<span class="n">fte</span> : <span class="n">FunTypingEnv</span>)
    (<span class="n">heap</span> : <span class="n">Heap</span>) (<span class="n">args</span> : <span class="n">List</span> <span class="n">Value</span>) : <span class="n">Bool</span>
</code></pre></div></div>

<p>I then made Claude prove that whenever this Boolean check returns <code class="language-plaintext highlighter-rouge">true</code>, the
full <code class="language-plaintext highlighter-rouge">SoundnessAssumptions</code> record holds—and therefore the type soundness
theorem applies. For each of our ~30 runtime test programs, I instructed Claude
to instantiate the decidable checker with concrete inputs and let Lean’s
evaluator discharge the precondition, producing a <em>per-execution safety certificate</em>: a machine-checked proof that
this specific run cannot produce any preventable error.</p>
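<p>The pattern can be shown on a two-clause toy precondition. This is only a
sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">Pre</code>, <code class="language-plaintext highlighter-rouge">Pre.check</code>, and <code class="language-plaintext highlighter-rouge">Pre.of_check</code> are
hypothetical stand-ins for the <code class="language-plaintext highlighter-rouge">SoundnessAssumptions</code> record, its Boolean
checker, and the theorem relating the two.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical propositional bundle of preconditions.
structure Pre (a b c : Nat) : Prop where
  lo : a &lt; b
  hi : b &lt; c

-- Boolean checker collapsing both clauses into one total function.
def Pre.check (a b c : Nat) : Bool :=
  decide (a &lt; b) &amp;&amp; decide (b &lt; c)

-- A `true` answer from the checker implies the propositional bundle.
theorem Pre.of_check {a b c : Nat} (h : Pre.check a b c = true) : Pre a b c := by
  unfold Pre.check at h
  rw [Bool.and_eq_true] at h
  exact ⟨of_decide_eq_true h.1, of_decide_eq_true h.2⟩

-- Per-input certificate: evaluation of the checker discharges the precondition.
example : Pre 1 2 3 := Pre.of_check rfl
</code></pre></div></div>

<p>The last line is the interesting one: for concrete inputs, the proof obligation
<code class="language-plaintext highlighter-rouge">Pre.check 1 2 3 = true</code> is closed by computation alone, so no manual proof is
needed per test program.</p>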

<h3 id="stretch-goals-vectors-and-enums">Stretch goals: vectors and enums</h3>

<p>The core type system covers MoveIR’s structs, references, and control flow. Two
important features, vectors and multi-variant enums, were initially omitted but
added later as “stretch goals.”</p>

<p>Vectors were relatively straightforward to support, as they are conceptually
very similar to records that our initial formalisation already supported.  The
vector extension took about one day to vibe-formalise and touched 30+ files, but
the modifications were uniform enough that Claude could apply the same pattern
repeatedly without errors. As before, we have tested the extended algorithmic
type checker extensively.</p>

<p>Enums were harder. The challenge was choosing the right runtime representation.
The preservation proof requires that two values of the same type always have
identical field structure (so that overwriting one with the other preserves all
existing pointer paths). For structs this is trivially true, but enum variants
carry different fields. Claude proposed several encodings that broke this
property; the fix required me to provide some insight: represent every enum
value as a flat record carrying fields for <em>all</em> variants at once, with inactive
ones filled by type-specific default values. This made the structural property
hold unconditionally, and the enum extension went through in about five days.</p>
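<p>To illustrate the idea of the flat encoding, here is a small sketch for a
made-up two-variant enum. The type <code class="language-plaintext highlighter-rouge">ShapeFlat</code>, its fields, the numeric tag, and
the choice of <code class="language-plaintext highlighter-rouge">0</code> as the default value are all assumptions for the sake of the
example; the representation in the actual development is more involved.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical enum: Shape = Circle { radius } | Rect { width, height }.
-- Every value carries the fields of all variants at once.
structure ShapeFlat where
  tag    : Nat  -- 0 = Circle, 1 = Rect
  radius : Nat  -- meaningful only for Circle
  width  : Nat  -- meaningful only for Rect
  height : Nat  -- meaningful only for Rect

-- Inactive fields are filled with a default value (0 here).
def mkCircle (r : Nat) : ShapeFlat :=
  { tag := 0, radius := r, width := 0, height := 0 }

def mkRect (w h : Nat) : ShapeFlat :=
  { tag := 1, radius := 0, width := w, height := h }

-- The field slots that a heap model would see.
def ShapeFlat.slots (s : ShapeFlat) : List Nat :=
  [s.tag, s.radius, s.width, s.height]

-- Both variants occupy the same slots, so overwriting one with the other
-- does not change the field structure.
example (r w h : Nat) :
    (mkCircle r).slots.length = (mkRect w h).slots.length := rfl
</code></pre></div></div>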

<h2 id="some-statistics">Some statistics</h2>

<p>Everyone loves numbers, so here they are.</p>

<p>The entire Lean development comprises approximately <strong>39,000</strong> non-blank,
non-comment lines of code, across 267 commits, with zero axioms or <code class="language-plaintext highlighter-rouge">sorry</code>
declarations. To the best of my knowledge, this is the second largest PL
metatheory mechanisation in Lean to date (after the <a href="https://github.com/cedar-policy/cedar-spec">Cedar
specification</a>, which is around
70,000 LOC) and, very likely, the largest one accomplished predominantly using
AI. The soundness proofs dominate the codebase at ~23,000 LOC (59%), with the
preservation proof alone weighing in at ~10,300 lines and the weakening lemma at
~7,200. The typing rules (both relational and algorithmic, with the soundness
theorem) account for ~5,800 LOC, the parser and translator for ~1,300, and the
test suite for ~6,100. The vector extension added ~3,000 LOC in one day; enums
required ~5,000 LOC over five.</p>

<p>As a relatively large Lean development with diverse proof patterns, this
codebase might serve as an interesting benchmark for AI-assisted theorem proving
in Lean. I plan to open-source the full codebase in the coming weeks; feel free
to get in touch if you would like early access for research purposes.</p>

<p>In terms of calendar time, the active development phase spanned 27 working days
(some of them were on weekends, duh!). The chart below shows the daily commit
intensity:</p>

<p><img src="/assets/images/move-commit-chart.svg" alt="Daily commit intensity during the active development
phase" /></p>

<p>The algorithmic type checker and its soundness proof took two days and
kicked off my AI-powered metatheory journey. Parser, macros, and testing
harnesses took five. The soundness proofs consumed 13 days, with peaks at the
weakening proof (22 commits in a single day) and the call rule (17 commits).</p>

<p>While this post covers the “interesting” PL-theory parts of the formalisation,
in my estimate, the vast majority of the generated proofs are horribly boring:
wrangling lists, threading facts through semantic rules, and proving
excruciatingly mind-numbing lemmas about runtime state representation. Having
done quite a few mechanisations in the past (by hand, in Rocq), I would estimate
this effort at about 5–6 months of my full-time engagement without AI (the
mechanisation described here was done concurrently with all my other
duties as a faculty member). That is, of course, assuming I would not have gone
insane from dealing with the tedious parts first. I am very happy that I didn’t have to do
those proofs “old style” (and, probably, never will have to again).</p>

<h2 id="what-this-all-means-for-the-future-of-pl-research">What does this all mean for the future of PL research?</h2>

<p>Congratulations on making it to the end of this post. Let me reward your
patience with some of my thoughts on what all this might mean for PL research.</p>

<p>I have read somewhere recently that one reason skilled programmers love
generative AI is that it lets them focus on the creative parts of computing
while the machine handles the tedium.<sup id="fnref:opposite" role="doc-noteref"><a href="#fn:opposite" class="footnote" rel="footnote">7</a></sup> I think the same dynamic
applies to PL researchers. The creative work—designing logics and type
systems, choosing the right semantics, figuring out non-standard proof
strategies—still belongs to humans. The tedium—threading a new invariant clause
through 153 lemmas, or proving that removing a key from a key-value map
preserves some property of the remaining entries—is exactly what AI handles
well. This is a good trade.</p>

<p>I personally know some established PL academics who actively resist mechanising
their metatheory proofs, because the overhead slows down their research. I
believe that objection is no longer that convincing. While I probably shouldn’t
generalise from a single data point of this experience, I would estimate that,
given a solid type system idea and a well-known proof technique (progress
and preservation, in my case, which every PL researcher learns in a graduate
programme), a single researcher can go from inception to a mechanised result at
the level of a top-tier conference submission in about one month. To put it
differently (and I apologise if this offends anyone), the effort of writing a
POPL/PLDI paper will soon be <a href="https://iclrpoints.com/">comparable</a> to the effort
of writing an ICML/ICLR paper.</p>

<p>This does not mean that PL researchers will be out of a job any time soon. In
another experiment (which I will save for a different post), I tried to tackle a
far less conventional published result that relies on a non-standard proof
technique, which was only briefly sketched in the respective manuscript. After
about a week of fairly involved “vibe-mathing” with Claude Opus 4.6, my
AI-assisted attempts produced no usable result (there are still dozens of
<code class="language-plaintext highlighter-rouge">sorry</code>s in that project and I have no good idea how provable they are). In the
experiment described in this post, the model clearly benefited from the fact
that the progress-and-preservation technique is 32 years old and has numerous
mechanised instances in the easily accessible training data. Novel proof
strategies remain beyond AI’s reach (for now).</p>

<p>But let us forget about academic publications for a moment and think bigger.
Over the past two decades, the PL community has done a tremendous job distilling
reusable “proof harnesses” for challenges like <a href="https://dl.acm.org/doi/10.1145/3018610.3018620">type
soundness</a>, <a href="https://dl.acm.org/doi/10.1145/3341689">compiler
correctness</a>, and <a href="https://dl.acm.org/doi/10.1145/3371072">program
logics</a>. With AI-assisted theorem
proving, we can hope to see many more real-world languages formalised
end-to-end—not just <a href="https://dl.acm.org/doi/10.1145/3704858">Wasm</a> and
<a href="https://dl.acm.org/doi/10.1145/3649835">Cedar</a>. Programming language design has
a reputation for moving slowly, in part because of its insistence on rigour.
AI-assisted mechanisation lets us keep the rigour and lose the slowness. We can
finally <em>move fast</em> and <em>break nothing</em>.</p>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>I am grateful to Todd Nowacki and Sam Blackshear for many discussions on the
design and implementation of Move’s borrow checker, and for their patience in
answering my endless questions about corner cases. Thanks to the members of the
<a href="https://verse-lab.org/">VERSE lab</a> and participants of the <a href="https://ifip-wg28.github.io/">IFIP WG
2.8</a> meeting in March 2026 for their comments on
the presentation that preceded this post.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:types" role="doc-endnote">
      <p>Does this mean that, in practice, PL designers prove such a theorem for
every new programming language proposal? Well, of course not: real
programming languages are large beasts with complicated syntax and unclear
semantics, so errors in the type system design (not even type checker
implementation) often go <a href="https://dl.acm.org/doi/10.1145/2983990.2984004">unnoticed for
decades</a>, making academic
researchers very happy when they discover them. <a href="#fnref:types" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:parser" role="doc-endnote">
      <p>The parser itself was validated by 37 alpha-equivalence tests
comparing the output of the translation against hand-written reference
ASTs. <a href="#fnref:parser" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:completeness" role="doc-endnote">
      <p>The converse property—<em>completeness</em>—would say that the
algorithmic checker accepts every program the relational rules accept. We
have not proved completeness, but this is less important for our goals:
soundness guarantees that no unsafe program slips through, while a
completeness gap would only cause the checker to reject some safe programs,
which would be caught by the conformance tests. <a href="#fnref:completeness" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:cedar" role="doc-endnote">
      <p>This approach is not entirely novel. The formalisation of the AWS
<a href="https://dl.acm.org/doi/10.1145/3649835">Cedar authorisation language</a>
follows a similar pattern: a Lean implementation of Cedar’s evaluator is
tested against its production version in Rust, ensuring that the two agree
on a shared test suite. <a href="#fnref:cedar" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:runtime-conformance" role="doc-endnote">
      <p>A more rigorous approach would be to instrument the Move
runtime to emit intermediate machine states, parse them into our Lean
formalisation, and check that our definitional interpreter passes through
corresponding states at each step. In the interest of time, I have not
carried out this exercise so far. <a href="#fnref:runtime-conformance" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:stronger" role="doc-endnote">
      <p>More powerful techniques have been developed since 1994, notably,
<a href="https://dl.acm.org/doi/10.1145/3676954">logical relations</a>, that can handle
features like higher-order state and recursive types. We did not need them
here: Move’s first-order setting is well within the reach of classical
progress-and-preservation. <a href="#fnref:stronger" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:opposite" role="doc-endnote">
      <p>The opposite, apparently, is true for creative professions:
writers and artists. <a href="#fnref:opposite" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ilya Sergey</name></author><category term="lean" /><category term="move" /><category term="types" /><category term="verification" /><category term="ai" /><summary type="html"><![CDATA[I formalised and proved the correctness of Move’s new borrow checker in Lean: 39,000 lines of mechanised metatheory, produced in under a month with the help of an AI coding assistant. This post tells the story of how it went and what it means for the future of PL research.]]></summary></entry><entry><title type="html">Verifying Distributed Protocols in Veil</title><link href="https://proofsandintuitions.net/2026/02/09/distributed-verification-veil/" rel="alternate" type="text/html" title="Verifying Distributed Protocols in Veil" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://proofsandintuitions.net/2026/02/09/distributed-verification-veil</id><content type="html" xml:base="https://proofsandintuitions.net/2026/02/09/distributed-verification-veil/"><![CDATA[<p>In this post, we discuss how to formalise, test, and prove the correctness of a
classic distributed protocol by combining model checking, automated deductive
verification, and AI-powered invariant inference in <a href="https://veil.dev/">Veil</a>,
a new auto-active <a href="https://lean-lang.org/">Lean</a>-based verifier for distributed
protocols.</p>

<!--more-->

<h2 id="introduction">Introduction</h2>

<p>A famous quote by <a href="https://en.wikipedia.org/wiki/Leslie_Lamport">Leslie Lamport</a> states:</p>

<blockquote>
  <p>“A distributed system is one in which the failure of a computer you didn’t
even know existed can render your own computer unusable.”</p>
</blockquote>

<p>While <em>implementations</em> of distributed systems are typically very large programs
of tens of thousands of lines of code, at the heart of each such system is a
<em>distributed protocol</em>—an algorithm responsible for enabling all parts of the
system to communicate with each other so that the system delivers correct
results to its users. That said, even though descriptions of distributed
protocols are usually one to two orders of magnitude smaller than their
respective implementations, getting those protocols right is still a rather
challenging endeavour, due to their inherently concurrent nature and a large
number of “moving parts”, especially in the presence of possible faults.</p>

<p>The complicated nature of distributed protocols has made them a popular subject
for formal modelling and validation using computer-assisted tools, such as
<a href="https://github.com/tlaplus">TLA+</a>, <a href="https://github.com/p-org/P">P</a>, and
<a href="https://quint-lang.org/">Quint</a>. TLA+, designed by Lamport himself, is perhaps
the most popular such framework today. It is used widely by
<a href="https://brooker.co.za/blog/2022/07/29/getting-into-tla.html">designers</a> and
<a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">implementers</a>
of distributed protocols in industry to quickly prototype protocol designs and
validate their properties by exhaustively testing them on a fixed set of
parameters—an approach known as <em>model checking</em>.</p>

<p>A significant shortcoming of all these tools when it comes to gaining trust in
distributed protocol design is their rather rudimentary support for
machine-checked <em>formal proofs</em> of distributed protocol correctness. That is,
these tools are excellent at finding bugs in protocol designs, but make it
challenging to prove that a distributed algorithm <em>never</em> violates its
correctness specification.<sup id="fnref:tlaps" role="doc-noteref"><a href="#fn:tlaps" class="footnote" rel="footnote">1</a></sup></p>

<p>To address these shortcomings, we developed <a href="https://veil.dev/">Veil</a>—a
“one-stop shop” framework for prototyping, model checking, and verifying
distributed protocols, with ultimate correctness guarantees. Similar to our
other verifier <a href="/2026/01/21/multi-modal-verification-velvet/">Velvet</a>, Veil is built on top of the <a href="https://lean-lang.org/">Lean</a> proof assistant,
as a library, so it naturally benefits from Lean’s expressive specifications,
rich collection of data types and mathematical theorems, and the toolset for
proof scripting and automation. Furthermore, we strove to replicate the best
parts of TLA+ in Veil, namely, its ability to state protocol specifications at a
high level of abstraction (i.e., without spelling out each insignificant
implementation detail) combined with the ability to model-check such
specifications, quickly discovering “shallow” bugs, before focusing on formal
correctness proofs.</p>

<p>In the rest of this post, we will discuss how to encode, model check, and
semi-automatically verify a classic consensus protocol in Veil. In future
posts in this series, we will focus on more advanced Veil features, such as
combining automated and interactive Lean proofs about distributed protocol
correctness.</p>

<h2 id="a-classic-example-dolev-strong-floodset-protocol">A Classic Example: Dolev-Strong Floodset Protocol</h2>

<p>To showcase Veil, we will use it to model the Dolev-Strong Floodset protocol, a
classic distributed algorithm allowing $N$ nodes to reach agreement, i.e.,
select the same value uniformly, by exchanging messages with each other.
Notably, this is a fault-tolerant protocol: it allows for up to $t &lt; N$ nodes to fail
during the protocol’s execution, meaning that they will stop sending messages.</p>

<p>The Floodset protocol ensures agreement between non-faulty nodes under an
assumption of <em>synchrony</em>, meaning that the nodes communicate in discrete,
numbered rounds, and all messages sent in a given round are guaranteed to be
delivered before the next round starts. This is a strong assumption, as it
allows a node $n$ in the protocol to infer that, if it has not heard from a node
$m$ in a certain round, that is because the node $m$ is faulty, and not because
the network is slow.<sup id="fnref:async" role="doc-noteref"><a href="#fn:async" class="footnote" rel="footnote">2</a></sup></p>

<p>The protocol’s logic is rather simple. It starts with each node choosing a value
from a certain finite set of totally ordered values (the need for ordering will
become clear soon). Next, they communicate in $t + 1$ rounds. In each round,
each non-faulty node broadcasts (i.e., sends to everyone in the network) all
values it is aware of. Initially, this is just the value it has chosen at the
start, but this set will grow as it “hears” from other nodes. After $t + 1$
rounds, each node selects the <em>smallest</em> value from those it is aware of,
according to the ordering, and concludes that this is the value that all nodes
should agree on.</p>

<p>Let’s try to convince ourselves that this protocol works, in the sense that
after $t + 1$ rounds of communication, despite the possible failures, all nodes
that survived choose <em>the same</em> value.</p>

<p>First, let us notice that if there are no failures, just one round of
communication between all nodes is sufficient. Indeed, if everyone hears from
everyone, every node will be aware of every value any node (including itself)
has chosen by the end of this round. Given that all of them exercise the same
rule of choice—just choosing the smallest value in the set—they will all
choose the same result.</p>
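
<p>The fault-free case can be illustrated with a minimal sketch in plain Lean 4
(not Veil syntax). The names <code class="language-plaintext highlighter-rouge">initialValues</code>, <code class="language-plaintext highlighter-rouge">seenAfterRound</code>, and
<code class="language-plaintext highlighter-rouge">pickMin</code> are hypothetical and introduced only for this example, and the
values are assumed to be natural numbers:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical setup: the i-th list entry is the initial value of node i
def initialValues : List Nat := [7, 3, 9]

-- After one fault-free round, every node has seen every initial value
def seenAfterRound (vals : List Nat) : List (List Nat) :=
  vals.map (fun _ =&gt; vals)

-- Decision rule: pick the smallest seen value
-- (the empty case cannot occur here, since a node always sees its own value)
def pickMin (seen : List Nat) : Nat :=
  match seen with
  | [] =&gt; 0
  | v :: vs =&gt; vs.foldl (fun acc x =&gt; min acc x) v

#eval (seenAfterRound initialValues).map pickMin  -- [3, 3, 3]
</code></pre></div></div>

<p>Since all nodes end up with the same set and apply the same deterministic
rule, their decisions coincide.</p>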

<p>Things get trickier if there are failures. The problem is that a faulty node can
fail in the middle of its broadcast, so some of the nodes will receive its
messages, and some will not. So, by the end of the round, different alive nodes
might be aware of different sets of values. To make things worse, multiple nodes might
fail during the same round, increasing the amount of chaos in the system even
further.</p>

<p>So why does the protocol work in the end? Notice that to reach agreement, we
only need one fault-free round during the entire execution, as by the end of
this round all alive nodes will have “synchronised” with each other on all the data in the
system. Since we assumed at most $t$ faults and there are $t + 1$ rounds, by the
<a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeonhole
principle</a> there will be at
least one fault-free round, no matter how the faults are spread across the rounds, and in
that round everyone will get everyone’s data. And even if
some nodes fail after that, it will not change this knowledge (although it’s
possible that alive nodes will agree on a value proposed by some node that has
failed at some point).</p>
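
<p>The counting step can be spelled out as follows, writing $f_r$ for the number
of nodes that crash in round $r$ (a symbol introduced here for illustration
only). The fault bound says that $f_1 + f_2 + \dots + f_{t+1} \le t$. If every
round had at least one crash, i.e., $f_r \ge 1$ for all $r$, then the sum would
be at least $t + 1$, contradicting the bound. Hence, $f_r = 0$ for at least one
round $r$.</p>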

<p>Let us now go ahead and encode the protocol in Veil, proving that this intuition
is, in fact, correct.</p>

<h2 id="encoding-floodset-protocol-in-veil">Encoding Floodset Protocol in Veil</h2>

<blockquote>
  <p><strong>Disclaimer:</strong> <a href="https://veil.dev/">Veil</a> is a research prototype and is
currently under active development by <a href="http://verse-lab.org/">VERSE lab</a>. Its
performance as an automated verifier might vary on different case studies, and
some of its UI aspects (e.g., syntactic error highlighting) will be improved
in the future. Please get in touch with us if you are planning to use it, and
feel free to submit bug reports to the <a href="https://github.com/verse-lab/veil/">GitHub
repository</a>.</p>
</blockquote>

<p>The protocol model accompanying the rest of this post can be found in <a href="https://github.com/verse-lab/veil/blob/veil-2.0-preview/Examples/Synchronous/FloodSet.lean">this
file</a>.
If you don’t want to compile Veil, you can try its <a href="https://try.veil.dev/">web
version</a> (although be prepared for it to be rather slow).</p>

<h3 id="representing-state-space">Representing State Space</h3>

<p>To encode a distributed protocol in Veil, we start by creating a new Lean
file (let’s call it <code class="language-plaintext highlighter-rouge">FloodSet.lean</code>) with the command that imports Veil as a
Lean library and makes a new Veil module, where we will put all our definitions
and properties:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="n">Veil</span>

<span class="n">veil</span> <span class="n">module</span> <span class="n">FloodSet</span><span class="cd">

-- The protocol definition and its properties will be here</span>

<span class="k">end</span> <span class="n">FloodSet</span>
</code></pre></div></div>

<p>Next, we will define the state of our protocol. Unlike actual implementations of
distributed systems, where each node runs its own code, formal descriptions of
distributed protocols typically model the system <em>holistically</em>: the state
captures the information held by all nodes at once, and transitions describe how
a single node’s action updates this global state. This style greatly simplifies
formal encoding of the protocol’s logic while retaining all its characteristic
intricacies. Readers with experience in TLA+ or Quint will find Veil’s encoding
style very familiar.</p>

<p>We first declare the types of the node identities and the values they exchange:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">-- Abstract types for nodes and values</span>
<span class="n">type</span> <span class="n">node</span>
<span class="n">type</span> <span class="n">value</span><span class="cd">

-- Values must be totally ordered (for picking the minimum)</span>
<span class="n">instantiate</span> <span class="n">val_ord</span> : <span class="n">TotalOrder</span> <span class="n">value</span>
<span class="k">open</span> <span class="n">TotalOrder</span>
</code></pre></div></div>

<p>At this point, it is not important what the elements of <code class="language-plaintext highlighter-rouge">node</code> and <code class="language-plaintext highlighter-rouge">value</code> are
exactly: they can be integer numbers, strings, etc. Knowing their structure
is not necessary for modelling the protocol’s logic, and leaving it abstract,
perhaps surprisingly, makes the protocol easier to verify, so we deliberately
omit these details. The only
thing that we need to postulate now is that elements of <code class="language-plaintext highlighter-rouge">value</code> are totally
ordered, which is what is done by the line <code class="language-plaintext highlighter-rouge">instantiate val_ord : TotalOrder
value</code>.<sup id="fnref:class" role="doc-noteref"><a href="#fn:class" class="footnote" rel="footnote">3</a></sup></p>

<p>Next, we proceed to the components of the state space of the protocol,
which one can think of as a record with several fields:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">immutable</span> <span class="n">individual</span> <span class="n">t</span> : <span class="n">Nat</span>

<span class="n">individual</span> <span class="n">round</span> : <span class="n">Nat</span>
<span class="n">individual</span> <span class="n">crashCount</span> : <span class="n">Nat</span>

<span class="n">function</span> <span class="n">initialValue</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">value</span>
<span class="n">relation</span> <span class="n">seen</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">value</span> <span class="o">→</span> <span class="n">Bool</span>
<span class="n">relation</span> <span class="n">decision</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">value</span> <span class="o">→</span> <span class="n">Bool</span>
<span class="n">relation</span> <span class="n">alive</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">Bool</span>
<span class="n">relation</span> <span class="n">crashedInThisRound</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">Bool</span>

<span class="n">#gen_state</span>
</code></pre></div></div>

<p>Scalar state components (i.e., non-functions) in Veil are declared using the
<code class="language-plaintext highlighter-rouge">individual</code> keyword.<sup id="fnref:ivy1" role="doc-noteref"><a href="#fn:ivy1" class="footnote" rel="footnote">4</a></sup> Components that remain constant throughout the
entire protocol’s execution, such as the maximal allowed number of faulty nodes
<code class="language-plaintext highlighter-rouge">t</code>, can be marked as <code class="language-plaintext highlighter-rouge">immutable</code>. This is not strictly necessary, but it
improves the performance of model checking and helps with simplifying the proofs
by signalling to Veil that it does not need to keep track of changes in those
values. Since both the round number (<code class="language-plaintext highlighter-rouge">round</code>) and the actual number of crashed
nodes (<code class="language-plaintext highlighter-rouge">crashCount</code>) will increase during the system’s execution, those are not
marked as immutable.</p>

<p>Non-scalar components in Veil model <em>many-to-one</em> and <em>many-to-many</em> relations
between values in the system. For instance, the <code class="language-plaintext highlighter-rouge">initialValue</code> function will
represent the values chosen by nodes at the beginning of the protocol. The
binary relation <code class="language-plaintext highlighter-rouge">seen</code> captures the values each node is aware of at any point.
That is, <code class="language-plaintext highlighter-rouge">seen n v = true</code> means that the specific node <code class="language-plaintext highlighter-rouge">n</code> is aware of the
value <code class="language-plaintext highlighter-rouge">v</code>. In a similar vein, <code class="language-plaintext highlighter-rouge">decision</code> encodes which nodes have decided on
which values. Even though each node will eventually select at most one value, it
is much more convenient to “over-provision” the definition, allowing for the
possibility of a node choosing multiple values (don’t worry, we will later verify that this
never happens). Finally, the two unary relations <code class="language-plaintext highlighter-rouge">alive</code> and <code class="language-plaintext highlighter-rouge">crashedInThisRound</code>
capture the nodes alive at any moment of the protocol’s execution and the nodes
that have crashed since the start of the current round, respectively.</p>

<p>Given these declarations, the Veil command <code class="language-plaintext highlighter-rouge">#gen_state</code> produces an internal
representation of the protocol’s state, so we can use it to describe how
exactly the algorithm works.</p>

<h3 id="initial-states">Initial States</h3>

<p>Now comes the most interesting part: modelling the protocol’s logic, i.e.,
describing how exactly it starts and runs. Let’s start by describing the set of
possible initial states of the protocol:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">-- Initial state: each node has exactly one proposal value</span>
<span class="n">after_init</span> <span class="err">{</span>
  <span class="n">initialValue</span> := <span class="o">*</span>
  <span class="n">seen</span> <span class="n">N</span> <span class="n">V</span> := <span class="n">initialValue</span> <span class="n">N</span> <span class="o">==</span> <span class="n">V</span>

  <span class="n">alive</span> <span class="n">N</span> := <span class="n">true</span>
  <span class="n">decision</span> <span class="n">N</span> <span class="n">V</span> := <span class="n">false</span>
  <span class="n">round</span> := <span class="mi">0</span>
  <span class="n">crashCount</span> := <span class="mi">0</span>
  <span class="n">crashedInThisRound</span> <span class="n">N</span> := <span class="n">false</span>
<span class="err">}</span>
</code></pre></div></div>

<p>There is quite a bit to unpack in this definition.</p>

<p>Let us start with the line <code class="language-plaintext highlighter-rouge">initialValue := *</code>. Remember that <code class="language-plaintext highlighter-rouge">initialValue</code> is
a function of type <code class="language-plaintext highlighter-rouge">node → value</code> defining which value each node chooses at the
beginning. It is not important what <em>exact</em> value each node chooses, and, in
fact, we are interested in checking all possible combinations. This Veil syntax
allows us to achieve exactly this: picking a definition of <code class="language-plaintext highlighter-rouge">initialValue</code>
<em>arbitrarily</em>. The power of this feature will become clearer once we start
model-checking (i.e., exhaustively testing) and verifying the correctness of the
protocol. In the former case, it will ensure that we run the check for <em>every
possible definition</em> of <code class="language-plaintext highlighter-rouge">initialValue</code>. In the latter, it will allow us to
prove that the protocol’s correctness holds <em>regardless</em> of which definition of
<code class="language-plaintext highlighter-rouge">initialValue</code> it starts with.</p>

<p>The second line <code class="language-plaintext highlighter-rouge">seen N V := initialValue N == V</code> shows another important
feature of Veil, inspired by the <a href="https://github.com/kenmcmil/ivy">Ivy</a>
verifier—so-called <em>iterated assignments</em>. Whenever we use a capitalised
variable, e.g., <code class="language-plaintext highlighter-rouge">N</code> in the code of Veil actions (and later, protocol
properties), it should be read as “for any <code class="language-plaintext highlighter-rouge">N</code> in the respective set”. More
specifically, here we define the Boolean value of <code class="language-plaintext highlighter-rouge">seen N V</code> for any pair <code class="language-plaintext highlighter-rouge">(N,
V)</code> as the value of the expression <code class="language-plaintext highlighter-rouge">initialValue N == V</code>. In other words, if,
for a given <code class="language-plaintext highlighter-rouge">N</code> and <code class="language-plaintext highlighter-rouge">V</code>, <code class="language-plaintext highlighter-rouge">initialValue N</code> returns <code class="language-plaintext highlighter-rouge">V</code>, then <code class="language-plaintext highlighter-rouge">seen N V</code> is set to
<code class="language-plaintext highlighter-rouge">true</code>, otherwise it’s set to <code class="language-plaintext highlighter-rouge">false</code>. This syntax might be a bit weird when you
see it for the first time, but soon it becomes very natural, and you might appreciate
its conciseness and elegance.</p>
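
<p>One way to read an iterated assignment is as a pointwise function definition.
The following is a minimal sketch in plain Lean 4 (not Veil syntax), assuming,
purely for illustration, three nodes and natural-number values; the concrete
<code class="language-plaintext highlighter-rouge">Node</code>, <code class="language-plaintext highlighter-rouge">Value</code>, and <code class="language-plaintext highlighter-rouge">initialValue</code> below are hypothetical stand-ins
for the abstract declarations in the Veil model:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical concrete stand-ins for Veil's abstract `node` and `value` types
abbrev Node := Fin 3
abbrev Value := Nat

-- One particular choice of initial values: node i proposes 10 * (i + 1)
def initialValue : Node → Value := fun n =&gt; 10 * (n.val + 1)

-- Pointwise reading of the iterated assignment `seen N V := initialValue N == V`
def seen : Node → Value → Bool := fun n v =&gt; initialValue n == v

#eval seen 0 10  -- true
#eval seen 0 20  -- false
</code></pre></div></div>

<p>Here the capitalised <code class="language-plaintext highlighter-rouge">N</code> and <code class="language-plaintext highlighter-rouge">V</code> of the Veil assignment correspond to the
bound variables <code class="language-plaintext highlighter-rouge">n</code> and <code class="language-plaintext highlighter-rouge">v</code> of the function.</p>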

<p>With the syntax explained, the rest of the definition should be clear: every
node <code class="language-plaintext highlighter-rouge">N</code> is <code class="language-plaintext highlighter-rouge">alive</code> at the start of the execution, and has not decided on any
value (<code class="language-plaintext highlighter-rouge">decision N V := false</code>). We start from round <code class="language-plaintext highlighter-rouge">0</code>, with <code class="language-plaintext highlighter-rouge">0</code> crashed nodes
and no node crashed in the initial round.</p>

<h3 id="protocol-actions">Protocol Actions</h3>

<p>To model executions of distributed protocols, we must embrace their
non-deterministic nature: in essence, at any point, one of many things can
happen in the system, and we do not always have control over the exact order of
these events. That said, we should also be able to express which event might or
might not happen given the state of the system so far. Let us see how to
accommodate these requirements and express distributed events using the
mechanism of Veil actions.</p>

<p>For example, this is an action that describes an event of a node failing:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">-- Crash one alive node (can happen multiple times per round, up to t total)</span>
<span class="n">action</span> <span class="n">crash</span> (<span class="n">n</span> : <span class="n">node</span>) <span class="err">{</span>
  <span class="n">require</span> <span class="n">round</span> <span class="o">&lt;</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="n">require</span> <span class="n">crashCount</span> <span class="o">&lt;</span> <span class="n">t</span>
  <span class="n">require</span> <span class="n">alive</span> <span class="n">n</span>

  <span class="n">alive</span> <span class="n">n</span> := <span class="n">false</span>
  <span class="n">crashedInThisRound</span> <span class="n">n</span> := <span class="n">true</span>
  <span class="n">crashCount</span> := <span class="n">crashCount</span> <span class="o">+</span> <span class="mi">1</span>
<span class="err">}</span>
</code></pre></div></div>

<p>This definition states that an action <code class="language-plaintext highlighter-rouge">crash</code> can be executed at any point for
<em>any</em> node <code class="language-plaintext highlighter-rouge">n</code> given that the following side conditions are satisfied:</p>

<ul>
  <li>The current value of <code class="language-plaintext highlighter-rouge">round</code> is less than <code class="language-plaintext highlighter-rouge">t + 1</code></li>
  <li>The value of <code class="language-plaintext highlighter-rouge">crashCount</code> is less than <code class="language-plaintext highlighter-rouge">t</code></li>
  <li>The node <code class="language-plaintext highlighter-rouge">n</code> is <code class="language-plaintext highlighter-rouge">alive</code></li>
</ul>

<p>The combination of these requirements will prevent us from crashing more than
<code class="language-plaintext highlighter-rouge">t</code> nodes, doing so after the end of the protocol’s main “run”, and crashing the
nodes that have already failed.</p>

<p>What follows is the “operational” logic of the action. It first marks the node
<code class="language-plaintext highlighter-rouge">n</code> as failed (<code class="language-plaintext highlighter-rouge">alive n := false</code>). Then, it records the fact that <code class="language-plaintext highlighter-rouge">n</code> has
crashed in the currently ongoing round (<code class="language-plaintext highlighter-rouge">crashedInThisRound n := true</code>). Finally,
it increases the total counter of crashed nodes.</p>
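
<p>To make the interplay of guards and updates concrete, here is a minimal sketch
of the same action as an ordinary Lean 4 function (not Veil syntax, and not how
Veil represents actions internally). It assumes three nodes; the names
<code class="language-plaintext highlighter-rouge">Node</code>, <code class="language-plaintext highlighter-rouge">St</code>, and <code class="language-plaintext highlighter-rouge">s0</code> are hypothetical, and a failed <code class="language-plaintext highlighter-rouge">require</code> is modelled
by returning <code class="language-plaintext highlighter-rouge">none</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical concrete node type and a record with the relevant state fields
abbrev Node := Fin 3

structure St where
  t : Nat
  round : Nat
  crashCount : Nat
  alive : Node → Bool
  crashedInThisRound : Node → Bool

-- `none` means that some `require` guard failed, so the action is disabled
def crash (s : St) (n : Node) : Option St :=
  if s.round &lt; s.t + 1 ∧ s.crashCount &lt; s.t ∧ s.alive n = true then
    some { s with
      alive := fun m =&gt; if m = n then false else s.alive m,
      crashedInThisRound := fun m =&gt; if m = n then true else s.crashedInThisRound m,
      crashCount := s.crashCount + 1 }
  else
    none

-- Initial state tolerating a single fault
def s0 : St :=
  { t := 1, round := 0, crashCount := 0,
    alive := fun _ =&gt; true, crashedInThisRound := fun _ =&gt; false }

#eval (crash s0 0).map (·.crashCount)                     -- some 1
#eval ((crash s0 0).bind (crash · 1)).map (·.crashCount)  -- none
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">t = 1</code>, the first crash succeeds, while the second one is rejected by
the guard on <code class="language-plaintext highlighter-rouge">crashCount</code>.</p>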

<p>It is important to note that, despite the concurrent nature of distributed
protocol executions, there are no “data races” between actions in Veil:
actions are always assumed to be executed <em>atomically, one after another</em>. What
remains non-deterministic is the <em>order</em> in which they might be scheduled for
execution. Veil’s machinery for model checking and verification
accounts for that, making sure that every possible ordering is
covered.</p>

<p>Next, let us describe the most important part of the Floodset algorithm:
synchronously advancing rounds, making nodes exchange data with each other. This
is done by the <code class="language-plaintext highlighter-rouge">advanceRound</code> action:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">action</span> <span class="n">advanceRound</span> <span class="err">{</span>
  <span class="n">require</span> <span class="n">round</span> <span class="o">&lt;</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>

  <span class="n">let</span> <span class="n">deadToAliveDelivery</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">node</span> <span class="o">→</span> <span class="n">Bool</span> :<span class="o">|</span> <span class="n">true</span>

  <span class="n">seen</span> <span class="n">N</span> <span class="n">V</span> := <span class="n">seen</span> <span class="n">N</span> <span class="n">V</span> <span class="o">||</span>
    <span class="n">alive</span> <span class="n">N</span> <span class="o">&amp;&amp;</span>
      <span class="n">decide</span> ((<span class="o">∃</span> <span class="n">m</span>, <span class="n">alive</span> <span class="n">m</span> <span class="o">∧</span> <span class="n">seen</span> <span class="n">m</span> <span class="n">V</span>) <span class="o">∨</span>
              (<span class="o">∃</span> <span class="n">d</span>, <span class="n">crashedInThisRound</span> <span class="n">d</span> <span class="o">∧</span> <span class="n">deadToAliveDelivery</span> <span class="n">d</span> <span class="n">N</span> <span class="o">∧</span> <span class="n">seen</span> <span class="n">d</span> <span class="n">V</span>))

  <span class="n">crashedInThisRound</span> <span class="n">N</span> := <span class="n">false</span>
  <span class="n">round</span> := <span class="n">round</span> <span class="o">+</span> <span class="mi">1</span>
<span class="err">}</span>
</code></pre></div></div>

<p>The most interesting part of the <code class="language-plaintext highlighter-rouge">advanceRound</code> action is how it propagates the
information about seen values between nodes by updating the <code class="language-plaintext highlighter-rouge">seen</code> relation.
Notice that, due to our synchrony assumption, there is no need to model the
message-passing explicitly, since the delays between sending and receiving
messages are not observable in our system: effectively, every message reaches
its destination within a single round, or is lost forever. That’s why we can
update <code class="language-plaintext highlighter-rouge">seen N V</code> for any node <code class="language-plaintext highlighter-rouge">N</code> and value <code class="language-plaintext highlighter-rouge">V</code> in a single iterated
assignment. The assignment constructs the new relation by taking the union of
the old one (via Boolean disjunction <code class="language-plaintext highlighter-rouge">||</code>) with the newly received data. This
data is only relevant for nodes that are still alive (hence the conjunction with
<code class="language-plaintext highlighter-rouge">alive N</code>). A new value <code class="language-plaintext highlighter-rouge">V</code> may come either from the <code class="language-plaintext highlighter-rouge">seen</code>-set of some
presently alive node <code class="language-plaintext highlighter-rouge">m</code> (<code class="language-plaintext highlighter-rouge">∃ m, alive m ∧ seen m V</code>) or from the <code class="language-plaintext highlighter-rouge">seen</code>-set of
some recently crashed node <code class="language-plaintext highlighter-rouge">d</code>.</p>
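
<p>The update of <code class="language-plaintext highlighter-rouge">seen</code> can be sketched as an ordinary Lean 4 function (again
plain Lean, not Veil syntax). The sketch assumes three nodes and natural-number
values, replaces the existential quantifiers with <code class="language-plaintext highlighter-rouge">List.any</code> over an explicit
list of nodes, and takes the delivery relation as an explicit parameter
<code class="language-plaintext highlighter-rouge">deliver</code> instead of choosing it non-deterministically; all names ending in
<code class="language-plaintext highlighter-rouge">Ex</code> are hypothetical example data:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbrev Node := Fin 3
abbrev Value := Nat

-- Explicit enumeration of all nodes, used instead of `∃`
def nodes : List Node := [0, 1, 2]

-- New `seen` relation after one round, given the old one
def advanceSeen (alive crashed : Node → Bool) (deliver : Node → Node → Bool)
    (seen : Node → Value → Bool) : Node → Value → Bool :=
  fun n v =&gt;
    seen n v ||
      (alive n &amp;&amp;
        (nodes.any (fun m =&gt; alive m &amp;&amp; seen m v) ||
         nodes.any (fun d =&gt; crashed d &amp;&amp; deliver d n &amp;&amp; seen d v)))

-- Example: node 2 crashed mid-broadcast, and its messages reached only node 0
def aliveEx : Node → Bool := fun n =&gt; n != 2
def crashedEx : Node → Bool := fun n =&gt; n == 2
def deliverEx : Node → Node → Bool := fun _ n =&gt; n == 0
def seenEx : Node → Value → Bool := fun n v =&gt; 10 * (n.val + 1) == v

#eval advanceSeen aliveEx crashedEx deliverEx seenEx 0 30  -- true
#eval advanceSeen aliveEx crashedEx deliverEx seenEx 1 30  -- false
</code></pre></div></div>

<p>In this example, node 0 learns the value 30 of the crashed node 2, whereas
node 1 does not, so the two alive nodes end the round with different sets of
seen values.</p>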

<p>The latter aspect of the model deserves a short discussion. What we want to
express here is that <em>some</em> of the alive nodes might hear from <em>some</em> of the
failed nodes. We don’t want to fix in advance which pairs of failed/alive nodes
these are, so we model the situation by non-deterministically choosing this
many-to-many relation, represented as a function <code class="language-plaintext highlighter-rouge">deadToAliveDelivery</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">let</span> <span class="n">deadToAliveDelivery</span> : <span class="n">node</span> <span class="o">→</span> <span class="n">node</span> <span class="o">→</span> <span class="n">Bool</span> :<span class="o">|</span> <span class="n">true</span>
</code></pre></div></div>

<p>This syntax effectively tells Veil to assign to <code class="language-plaintext highlighter-rouge">deadToAliveDelivery</code> an arbitrary
function of type <code class="language-plaintext highlighter-rouge">node → node → Bool</code> that satisfies the predicate <code class="language-plaintext highlighter-rouge">true</code>
(in other words, the choice of the function is unrestricted).<sup id="fnref:hilbert" role="doc-noteref"><a href="#fn:hilbert" class="footnote" rel="footnote">5</a></sup> Getting
back to the last line of the iterated assignment to <code class="language-plaintext highlighter-rouge">seen</code> in <code class="language-plaintext highlighter-rouge">advanceRound</code>, we
can see that, for a fixed choice of <code class="language-plaintext highlighter-rouge">deadToAliveDelivery</code>, a value <code class="language-plaintext highlighter-rouge">V</code> from the
<code class="language-plaintext highlighter-rouge">seen</code>-set of a node <code class="language-plaintext highlighter-rouge">d</code> that failed in this round is delivered only to those
nodes <code class="language-plaintext highlighter-rouge">N</code> for which <code class="language-plaintext highlighter-rouge">deadToAliveDelivery d N == true</code>.
The call to Lean’s <code class="language-plaintext highlighter-rouge">decide</code> function is a slightly annoying necessity: it
converts Lean’s native propositions (of type <code class="language-plaintext highlighter-rouge">Prop</code>) to Booleans, and it is
required because of existentially quantified statements such as <code class="language-plaintext highlighter-rouge">(∃ m, alive m ∧
seen m V)</code>.</p>
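
<p>As a small illustration in plain Lean (independent of Veil, and using simple
arithmetic facts rather than the protocol’s relations), <code class="language-plaintext highlighter-rouge">decide</code> turns a decidable
proposition into a Boolean, which can then be combined with Boolean operators
such as <code class="language-plaintext highlighter-rouge">||</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- `2 ≤ 3` is a proposition (of type `Prop`), not a Boolean.
#check (2 ≤ 3 : Prop)

-- `decide` evaluates a decidable proposition to a `Bool` ...
example : decide (2 ≤ 3) = true := by decide

-- ... which can then be mixed with Boolean operators.
example : (decide (3 ≤ 2) || decide (2 ≤ 3)) = true := by decide
</code></pre></div></div>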

<p>The remainder of the <code class="language-plaintext highlighter-rouge">advanceRound</code> action “resets” <code class="language-plaintext highlighter-rouge">crashedInThisRound</code> to
prevent the nodes that crashed in this round from affecting the outcome of the
next round (any node that crashes via the <code class="language-plaintext highlighter-rouge">crash</code> action from this point on
can only affect the outcome of the <em>next</em> round). Finally, it advances the
round number.</p>

<p>The last action of the protocol is <code class="language-plaintext highlighter-rouge">nodeDecide</code>, which allows any alive node to
decide on the consensus value in round <code class="language-plaintext highlighter-rouge">t + 1</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">action</span> <span class="n">nodeDecide</span> (<span class="n">n</span> : <span class="n">node</span>) <span class="err">{</span>
  <span class="n">require</span> <span class="n">round</span> <span class="o">=</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="n">require</span> <span class="n">alive</span> <span class="n">n</span>

  <span class="n">let</span> <span class="n">v</span> :<span class="o">|</span> <span class="n">seen</span> <span class="n">n</span> <span class="n">v</span>
  <span class="k">assume</span> <span class="o">∀</span> <span class="n">w</span>, <span class="n">seen</span> <span class="n">n</span> <span class="n">w</span> <span class="o">→</span> <span class="n">le</span> <span class="n">v</span> <span class="n">w</span>

  <span class="n">decision</span> <span class="n">n</span> <span class="n">V</span> := <span class="n">V</span> <span class="o">==</span> <span class="n">v</span>
<span class="err">}</span>
</code></pre></div></div>

<p>Once again, we use the constrained non-deterministic choice operator to pick the
value <code class="language-plaintext highlighter-rouge">v</code> such that it’s in the <code class="language-plaintext highlighter-rouge">seen</code>-set for the node <code class="language-plaintext highlighter-rouge">n</code>. We additionally
constrain it, via Veil’s <code class="language-plaintext highlighter-rouge">assume</code> statement, to be the smallest possible value
amongst those seen by <code class="language-plaintext highlighter-rouge">n</code>. At the end, we set the decision of <code class="language-plaintext highlighter-rouge">n</code> to only
contain the chosen value <code class="language-plaintext highlighter-rouge">v</code> that (a) it has seen and (b) is the smallest amongst
all values seen by <code class="language-plaintext highlighter-rouge">n</code>.</p>
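
<p>Since the order on values is total, and in particular antisymmetric, these two
constraints pin down <code class="language-plaintext highlighter-rouge">v</code> uniquely, so the non-deterministic choice is
deterministic in effect. Here is a minimal sketch of this argument in plain Lean.
The names <code class="language-plaintext highlighter-rouge">seenByN</code> and <code class="language-plaintext highlighter-rouge">antisymm</code> are made up for this illustration, and <code class="language-plaintext highlighter-rouge">le</code> is
modelled as a <code class="language-plaintext highlighter-rouge">Prop</code>-valued relation, which is an assumption rather than Veil’s actual encoding:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>example {value : Type} (le : value → value → Prop)
    -- assumed: `le` is antisymmetric, as in any total order
    (antisymm : ∀ a b, le a b → le b a → a = b)
    -- `seenByN w` stands for `seen n w` with the node `n` fixed
    (seenByN : value → Prop) (v v' : value)
    (hv : seenByN v) (hmin : ∀ w, seenByN w → le v w)
    (hv' : seenByN v') (hmin' : ∀ w, seenByN w → le v' w) :
    v = v' :=
  -- each candidate is below the other, so they coincide
  antisymm v v' (hmin v' hv') (hmin' v hv)
</code></pre></div></div>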

<h2 id="specifying-protocol-safety-properties">Specifying Protocol Safety Properties</h2>

<p>Now, with the definition of the protocol at hand, let us try to ensure that it
indeed does what it’s supposed to do: makes all alive nodes uniformly decide on
exactly one of their initial values. In Veil, we can encode this specification
using the following two statements:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">safety</span> [<span class="n">agreement</span>]
  <span class="o">∀</span> <span class="n">n1</span> <span class="n">n2</span> <span class="n">v1</span> <span class="n">v2</span>, <span class="n">decision</span> <span class="n">n1</span> <span class="n">v1</span> <span class="o">∧</span> <span class="n">decision</span> <span class="n">n2</span> <span class="n">v2</span> <span class="o">→</span> <span class="n">v1</span> <span class="o">=</span> <span class="n">v2</span>

<span class="n">safety</span> [<span class="n">validity</span>]
  <span class="o">∀</span> <span class="n">n</span> <span class="n">v</span>, <span class="n">decision</span> <span class="n">n</span> <span class="n">v</span> <span class="o">→</span> (<span class="o">∃</span> <span class="n">m</span>, <span class="n">initialValue</span> <span class="n">m</span> <span class="o">=</span> <span class="n">v</span>)
</code></pre></div></div>

<p>The keyword <code class="language-plaintext highlighter-rouge">safety</code> indicates a property of a protocol that must not be
violated by any state that is either (a) amongst its initial states (defined via
<code class="language-plaintext highlighter-rouge">after_init</code>) or (b) <em>reachable</em> from an initial state by executing a
sequence of one or more actions: <code class="language-plaintext highlighter-rouge">crash</code>, <code class="language-plaintext highlighter-rouge">advanceRound</code>, or <code class="language-plaintext highlighter-rouge">nodeDecide</code>.
Property names are optional.</p>

<p>Here, <code class="language-plaintext highlighter-rouge">agreement</code> states that any two values <code class="language-plaintext highlighter-rouge">v1</code> and <code class="language-plaintext highlighter-rouge">v2</code> that some nodes <code class="language-plaintext highlighter-rouge">n1</code>
and <code class="language-plaintext highlighter-rouge">n2</code> decide upon are, in fact, the same. Notice that this property also
ensures that each node decides on at most one value (take <code class="language-plaintext highlighter-rouge">n1 = n2</code>).
We are not concerned here with whether a node is alive or crashed (in fact, a
crashed node will never get to make a decision, as per the precondition of the
<code class="language-plaintext highlighter-rouge">nodeDecide</code> action). The second property, <code class="language-plaintext highlighter-rouge">validity</code>, states that any decided value
originates from some node’s initial value choice.</p>
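
<p>The observation that <code class="language-plaintext highlighter-rouge">agreement</code> also covers the case of a single node can be
checked in plain Lean by instantiating both nodes with the same <code class="language-plaintext highlighter-rouge">n</code>. In this
sketch, <code class="language-plaintext highlighter-rouge">decision</code> is modelled as an abstract <code class="language-plaintext highlighter-rouge">Prop</code>-valued relation, which is a
simplifying assumption made for the illustration:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>example {node value : Type} (decision : node → value → Prop)
    (agreement : ∀ n1 n2 v1 v2, decision n1 v1 ∧ decision n2 v2 → v1 = v2) :
    -- a single node never holds two different decisions
    ∀ n v1 v2, decision n v1 → decision n v2 → v1 = v2 :=
  fun n v1 v2 h1 h2 =&gt; agreement n n v1 v2 ⟨h1, h2⟩
</code></pre></div></div>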

<h2 id="catching-bugs-in-specification-with-model-checking">Catching Bugs in Specification with Model Checking</h2>

<p>To check that both <code class="language-plaintext highlighter-rouge">agreement</code> and <code class="language-plaintext highlighter-rouge">validity</code> do indeed hold true for our
encoding of Floodset, we are going to use <strong>Lace</strong>, Veil’s model checker.
Lace works similarly to TLC, the model checker for TLA+. Given concrete finite
sets representing core data types of the protocol, it simply <em>runs</em> the model,
starting by enumerating all of its initial states, and then by applying to the
already reached states any enabled actions (i.e., actions whose
<code class="language-plaintext highlighter-rouge">require</code>-statements are satisfied by those states), until this process exhausts
the entire state space of the protocol, or is interrupted.</p>
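
<p>The exploration loop described above can be sketched in a few lines of plain
Lean. This is only a toy illustration of explicit-state search, not Lace’s
actual implementation: the names <code class="language-plaintext highlighter-rouge">explore</code>, <code class="language-plaintext highlighter-rouge">next</code>, and <code class="language-plaintext highlighter-rouge">safe</code>, as well as the
<code class="language-plaintext highlighter-rouge">fuel</code> bound that stands for “interrupted”, are all assumptions made for this example.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/-- Breadth-first exploration of a transition system. Returns a reachable
state that violates `safe`, if one is found before the fuel runs out. -/
def explore {σ : Type} [BEq σ] (next : σ → List σ) (safe : σ → Bool) :
    Nat → List σ → List σ → Option σ
  | 0, _, _ =&gt; none                   -- out of fuel: exploration interrupted
  | _ + 1, _, [] =&gt; none              -- frontier exhausted: no violation found
  | fuel + 1, visited, s :: frontier =&gt;
    if !(safe s) then some s          -- found a property-violating state
    else if visited.contains s then   -- already explored: skip it
      explore next safe fuel visited frontier
    else                              -- new state: enqueue its successors
      explore next safe fuel (s :: visited) (frontier ++ next s)

-- A toy system: a counter modulo 4 that starts at 0; the "bad" state is 3.
#eval explore (fun (n : Nat) =&gt; [(n + 1) % 4]) (fun n =&gt; n != 3) 100 [] [0]
</code></pre></div></div>

<p>In this toy system, the search visits the states 0, 1, and 2 before reaching
the violating state 3. A real model checker additionally records how each state
was reached, so that it can report a trace and not just the offending state.</p>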

<p>We can run Lace directly from a Lean file with the following command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#model_check { node := Fin 3, value := Fin 2 } { t := 1 }
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Fin n</code> is Lean’s native data type that represents the finite collection
containing <code class="language-plaintext highlighter-rouge">{0, ..., n - 1}</code>. We use it to specify that the set of nodes is the
finite set <code class="language-plaintext highlighter-rouge">{0, 1, 2}</code>, and the set of values to choose from is <code class="language-plaintext highlighter-rouge">{0, 1}</code>. We
also allow for at most one crash by setting the immutable protocol parameter <code class="language-plaintext highlighter-rouge">t</code>
to <code class="language-plaintext highlighter-rouge">1</code>. After a few seconds, the command produces the following output shown in
the Lean InfoView buffer of VSCode:</p>

<p><img src="/assets/images/veil-model-checking-01.png" alt="Lean InfoView showing the results of successful model
checking" /> <em>VSCode with Lean 4: testing
properties of the Veil implementation of the Floodset algorithm. The InfoView
panel on the right shows statistics for the execution.</em></p>

<p>In particular, Lace has explored 54904 states of the protocol, of which 362 were
distinct, and the execution <em>diameter</em>, i.e., the maximal length of an execution
trace, was 6 actions (finding such a scenario is a good exercise). The
large number of states is explained by the combinatorial explosion of
possible executions of the protocol: since our encoding relies on
non-deterministic choice (e.g., by setting <code class="language-plaintext highlighter-rouge">initialValue := *</code>), the model
checking algorithm has to enumerate <em>all</em> possible outcomes of such choices.</p>

<p>Model checking is tremendously useful for debugging the logic of the protocol
and its properties, as it can quickly discover bugs that are relatively
“shallow”—i.e., that manifest at small diameters. For instance, if we introduce a
bug by commenting out the line <code class="language-plaintext highlighter-rouge">assume ∀ w, seen n w → le v w</code> in the action
<code class="language-plaintext highlighter-rouge">nodeDecide</code> and re-run the model checker, Lace will show the following report:</p>

<p><img src="/assets/images/veil-model-checking-02.png" alt="A result of a failed run of the Lace model checker for a buggy Veil model" /></p>

<p>In the state that violates the property, the relation <code class="language-plaintext highlighter-rouge">decision</code> stores two
pairs of values, namely, <code class="language-plaintext highlighter-rouge">(0, 0)</code> and <code class="language-plaintext highlighter-rouge">(1, 1)</code>, meaning that the node <code class="language-plaintext highlighter-rouge">0</code>
has chosen the value <code class="language-plaintext highlighter-rouge">0</code>, and the node <code class="language-plaintext highlighter-rouge">1</code> has chosen <code class="language-plaintext highlighter-rouge">1</code>, which clearly
violates the agreement property. Notice that the model checker not only reports
the violation of the <code class="language-plaintext highlighter-rouge">agreement</code> property—it also constructs a concrete
execution trace, demonstrating how a property-violating state can be reached
from an initial one.</p>

<p>In addition to discovering violations of safety properties (such as
<code class="language-plaintext highlighter-rouge">agreement</code>), a model checker can also be “tricked” into discovering
reachability bugs—scenarios in which a protocol does not do anything useful.
To see how to do that, let us uncomment the line <code class="language-plaintext highlighter-rouge">assume ∀ w, seen n w → le v w</code>
of <code class="language-plaintext highlighter-rouge">nodeDecide</code> and comment out the line <code class="language-plaintext highlighter-rouge">decision n V := V == v</code>. Clearly, now,
no node ever makes a decision, which means that both <code class="language-plaintext highlighter-rouge">agreement</code> and <code class="language-plaintext highlighter-rouge">validity</code>
always hold true, as the premises of their implications are always false. This
can be confirmed by re-running Lace, which reports no errors.</p>

<p>One can check whether the protocol does, in fact, reach “interesting” states by
stating a property that we <em>want</em> to be violated, such as the following one:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">safety</span> [<span class="n">no_decision</span>] <span class="o">∀</span> <span class="n">n</span> <span class="n">v</span>, <span class="o">¬</span><span class="n">decision</span> <span class="n">n</span> <span class="n">v</span>
</code></pre></div></div>

<p>The property <code class="language-plaintext highlighter-rouge">no_decision</code> simply states that no node ever decides on any value.
This is clearly not what we want, so an execution of a well-functioning protocol
should have a state in which <code class="language-plaintext highlighter-rouge">no_decision</code> becomes false. However, if we rerun
the model checker on Floodset in its current form, no violations will be
reported. This means we’ve just discovered a reachability bug. By uncommenting
<code class="language-plaintext highlighter-rouge">decision n V := V == v</code>, and thus restoring the protocol to its original
definition, we make it so that <code class="language-plaintext highlighter-rouge">no_decision</code> is now violated, which results in the following
report, demonstrating a trace that does lead to a node making a decision:</p>

<p><img src="/assets/images/veil-model-checking-03.png" alt="Using the model checker to prove state reachability" /></p>

<h2 id="formal-safety-proof-and-inductive-invariants">Formal Safety Proof and Inductive Invariants</h2>

<p>So far, we have managed to check that no execution of our model of the Floodset
protocol violates the desired safety properties for <em>some</em> fixed values
of its parameters: <code class="language-plaintext highlighter-rouge">node</code>, <code class="language-plaintext highlighter-rouge">value</code>, and <code class="language-plaintext highlighter-rouge">t</code>. This gives some certainty that we
have got the protocol right, but it does not serve as a proof of this fact. What
we want to ensure is that no execution violates the safety properties for <em>any</em>
values of <code class="language-plaintext highlighter-rouge">node</code>, <code class="language-plaintext highlighter-rouge">value</code>, and <code class="language-plaintext highlighter-rouge">t</code>.</p>

<p>This statement can be phrased as a formal theorem, which can then be proven
by <em>induction</em> on the length of the execution. Specifically, we can derive that
the desired safety properties (e.g., <code class="language-plaintext highlighter-rouge">agreement</code> and <code class="language-plaintext highlighter-rouge">validity</code>) hold for any
state of the system if we prove the following two facts:</p>

<ol>
  <li>These properties are true for any initial state, and</li>
  <li>If they hold true for a state <code class="language-plaintext highlighter-rouge">s</code>, and a state <code class="language-plaintext highlighter-rouge">s'</code> can be produced from <code class="language-plaintext highlighter-rouge">s</code>
by applying one of the protocol’s actions, then these properties also hold
true for <code class="language-plaintext highlighter-rouge">s'</code>.</li>
</ol>
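
<p>The two proof obligations above can be stated generically in plain Lean. The
following is a minimal sketch, not the theorems that Veil actually generates:
<code class="language-plaintext highlighter-rouge">Reachable</code>, <code class="language-plaintext highlighter-rouge">init</code>, <code class="language-plaintext highlighter-rouge">step</code>, and <code class="language-plaintext highlighter-rouge">inv</code> are hypothetical names for the set of
reachable states, the initial-state predicate, the transition relation, and the
property of interest.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/-- States reachable from an initial state by zero or more steps. -/
inductive Reachable {σ : Type} (init : σ → Prop) (step : σ → σ → Prop) : σ → Prop
  | base {s : σ} : init s → Reachable init step s
  | next {s s' : σ} : Reachable init step s → step s s' → Reachable init step s'

theorem inv_of_inductive {σ : Type} {init : σ → Prop} {step : σ → σ → Prop}
    (inv : σ → Prop)
    (hinit : ∀ s, init s → inv s)                 -- obligation (1)
    (hstep : ∀ s s', inv s → step s s' → inv s')  -- obligation (2)
    {s : σ} (h : Reachable init step s) : inv s := by
  induction h with
  | base hi =&gt; exact hinit _ hi
  | next _ hs ih =&gt; exact hstep _ _ ih hs
</code></pre></div></div>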

<p>You can recognise part (1) as the <em>base case</em> of a proof by induction, while (2) is the
<em>induction step</em>. Veil provides a convenient way to assemble Lean theorems
corresponding to the statements (1) and (2) for a specific protocol, out of
the protocol’s description and its stated safety properties. We can then attempt
to prove those properties automatically by typing the following command:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">#check_invariants</span>
</code></pre></div></div>

<p>As a result, for this example, Veil generates 12 Lean theorems, out of which 10
are proven automatically, while two are disproven. Let’s take a closer look at
the generated report shown in the Lean InfoView:</p>

<p><img src="/assets/images/veil-verification-01.png" alt="A counterexample to induction generated by Veil" /></p>

<p>The report states that, for the action <code class="language-plaintext highlighter-rouge">nodeDecide</code>, Veil has discovered a pair of
states, a pre-state and a post-state, such that all the safety properties (also
sometimes called <em>invariants</em>) hold for the pre-state, but the post-state
violates <code class="language-plaintext highlighter-rouge">agreement</code>. This example refutes the statement of the induction step,
but does not necessarily mean that the properties don’t hold for any reachable
state of the protocol—after all, we had strong evidence for them being true
given by the model checker. So why does the proof not go through?</p>

<p>It turns out that the pre-state of the generated counterexample is, in fact, <em>not</em>
reachable in any of the concrete runs of the system. It is generated simply
because this is a state that satisfies both our properties, as required by the
premise of the induction step, which does not talk about actual state
reachability. Such spurious counterexamples, known as counterexamples to
induction (CTI), can be eliminated by stating more properties that we
believe hold true over the system. For instance, by examining the provided
counterexample, we can notice that, according to it, the node <code class="language-plaintext highlighter-rouge">1</code> has decided on
the value <code class="language-plaintext highlighter-rouge">1</code> (<code class="language-plaintext highlighter-rouge">decision ↦ (1, 1)</code>), even though it hasn’t “seen” it (the <code class="language-plaintext highlighter-rouge">seen</code>
relation only contains the pair <code class="language-plaintext highlighter-rouge">(0, 0)</code>). This is clearly not something that
could happen during a concrete execution, so we can add the corresponding
property to be included in our induction hypothesis:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">invariant</span> [<span class="n">seen_decision</span>]
  <span class="o">∀</span> <span class="n">n</span> <span class="n">v</span>, <span class="n">decision</span> <span class="n">n</span> <span class="n">v</span> <span class="o">→</span> <span class="n">seen</span> <span class="n">n</span> <span class="n">v</span>
</code></pre></div></div>

<p>The property <code class="language-plaintext highlighter-rouge">seen_decision</code> states that any value that becomes some node’s decision must have been seen by this node. Sadly, running <code class="language-plaintext highlighter-rouge">#check_invariants</code> fails again, but now with a different CTI:</p>

<p><img src="/assets/images/veil-verification-02.png" alt="Another counterexample to induction generated by Veil" /></p>

<p>After looking at the counterexample, we can notice that the value of
<code class="language-plaintext highlighter-rouge">crashCount</code> in the pre-state is <code class="language-plaintext highlighter-rouge">0</code>, while both nodes <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">1</code> are marked as
crashed in this round, and yet alive at the same time. Again, this is not
something that could happen in a valid run of the system. We can try to
eliminate this outcome by adding the following property to the set of invariants
we use in our proof by induction:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">invariant</span> [<span class="n">crashed_not_alive</span>]
  <span class="o">∀</span> <span class="n">n</span>, <span class="n">crashedInThisRound</span> <span class="n">n</span> <span class="o">→</span> <span class="o">¬</span> <span class="n">alive</span> <span class="n">n</span>
</code></pre></div></div>

<p>And again, this generates another counterexample, different this time. What we
are doing here is called <em>inductive invariant inference</em>, and, as we’ve
discussed in the <a href="/2026/01/21/multi-modal-verification-velvet/">previous post</a>, no algorithm is guaranteed
to always find a set of inductive invariants that establishes a given set of
properties, even if those properties <em>do</em> hold.</p>

<p>Luckily, with a bit of understanding of how the protocol works and by analysing
the counterexamples to induction produced by Veil, we can eventually provide a
sufficient set of invariants for the system so that all generated theorems are proven,
thus delivering the correctness proof for the protocol with regard to our
initial properties. These invariants can be found in the <a href="https://github.com/verse-lab/veil/blob/veil-2.0-preview/Examples/Synchronous/FloodSet.lean">Floodset
implementation</a>
available on GitHub.</p>
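
<p>The loop of stating a property, getting a counterexample to induction, and
strengthening the invariant can be illustrated on a toy system unrelated to
Floodset. In this sketch, which is an assumption-laden example rather than
anything produced by Veil, the state is a natural number that starts at
<code class="language-plaintext highlighter-rouge">0</code>, every step adds <code class="language-plaintext highlighter-rouge">2</code>, and the safety property says that the state is never
<code class="language-plaintext highlighter-rouge">3</code>. The proofs assume a recent Lean 4 toolchain in which the <code class="language-plaintext highlighter-rouge">omega</code> tactic is available.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A counterexample to induction: the (unreachable) pre-state 1 satisfies
-- the property, but its successor 1 + 2 violates it.
example : (1 : Nat) ≠ 3 ∧ ¬ ((1 + 2 : Nat) ≠ 3) := by decide

-- Strengthening the property with "the state is even" rules out such
-- pre-states. The strengthened invariant holds in the initial state ...
example : (0 : Nat) % 2 = 0 ∧ (0 : Nat) ≠ 3 := by decide

-- ... and is preserved by every step, so it is inductive.
example : ∀ s : Nat, s % 2 = 0 ∧ s ≠ 3 → (s + 2) % 2 = 0 ∧ s + 2 ≠ 3 := by
  intro s ⟨heven, _⟩
  constructor &lt;;&gt; omega
</code></pre></div></div>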

<h2 id="ai-powered-inference-of-inductive-invariants">AI-Powered Inference of Inductive Invariants</h2>

<p>I know you’ve been waiting for this part!</p>

<p>Indeed, the process of inferring inductive invariants manually is quite tedious,
so having some automation here would go a long way. In the past, several
academic efforts in the systems research community attempted to automate
invariant inference by applying traditional symbolic enumerative techniques
(examples of such frameworks are
<a href="https://dl.acm.org/doi/10.1145/3341301.3359651">I4</a>,
<a href="https://www.usenix.org/system/files/nsdi21-hance.pdf">SWISS</a>, and
<a href="https://www.usenix.org/system/files/osdi22-yao.pdf">DuoAI</a>).</p>

<p>Nowadays, we have Claude Code, so we can just use it with the feedback in the
form of counterexamples provided by Veil. In fact, most of the auxiliary
invariants required to verify the agreement and validity of the accompanying
Floodset protocol
<a href="https://github.com/verse-lab/veil/blob/veil-2.0-preview/Examples/Synchronous/FloodSet.lean#L145">model</a>
have been obtained by Claude Code automatically within just a couple of minutes
from the prompt “Infer the invariants necessary for <code class="language-plaintext highlighter-rouge">#check_invariants</code> to
succeed. Use counterexamples provided by the verifier.”</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>There are quite a few aspects of Veil we haven’t discussed yet, but I hope to
elaborate on them in future posts.</p>

<p>For example, even though Veil is just a Lean library, allowing any of its
theorems to be proven in Lean directly, we didn’t get to use Lean proof mode and
tactics in this tutorial. This is because we got quite lucky with our encoding
of Floodset, which allowed for its inductive proof to be obtained fully
automatically with the help of SMT solvers (Veil uses <code class="language-plaintext highlighter-rouge">cvc5</code> as the default one),
invoked by <code class="language-plaintext highlighter-rouge">#check_invariants</code> under the hood. In one of the next posts in this
series, we will discuss protocol formalisations in Veil whose verification
requires a combination of interactive and automated proofs in Lean.</p>

<p>If you are interested in learning more about Veil’s capabilities as a verifier,
you should check out this <a href="https://verse-lab.org/papers/veil-cav25.pdf">CAV’25
paper</a>. You can also find some
takeaways from implementing Veil on top of Lean in <a href="https://verse-lab.org/papers/veil-dafny26.pdf">this
paper</a> presented at the recent
Dafny workshop.</p>

<p>Even though Claude Code was incredibly effective at inferring inductive invariants
for proving correctness of a protocol, it was not as reliable as a mechanism for
protocol auto-formalisation. This is not due to it getting Veil’s syntax wrong:
contrary to my expectations, it managed to guess the meaning of all its unusual
syntactic constructions correctly, or quickly corrected all its mistakes with
the help of the compiler errors. The main problem is that the protocol definitions it
produced in Veil were missing crucial parts, making them vacuously correct but
also useless. For example, one of the first versions it produced modelled
all crashes in the same action as the broadcast without accounting for partial
message delivery, so the resulting protocol would reach agreement
immediately after the first round. Issues like these were discovered with
multiple runs of Lace to test reachability properties—as discussed above.</p>

<p>Furthermore, Claude Code was prone to overcomplicate things, for instance, by
introducing state components that were not necessary for modelling the essence
of the protocol, such as a relation representing in-flight messages.</p>

<p>While these shortcomings might be eliminated in future models, my experience so
far suggests that a human expert remains crucial for
getting a faithful and concise formal specification of a system at the right
level of abstraction.</p>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>I’m grateful to Seth Gilbert who suggested taking a look at the Floodset
consensus protocol as a case study for Veil, and to George Pîrlea
and Qiyuan Zhao for their comments on this post.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:tlaps" role="doc-endnote">
      <p>TLA+ comes with
<a href="https://proofs.tlapl.us/doc/web/content/Home.html">TLAPS</a> (TLA+ Proof
System) for writing deductive proofs in a bespoke tactic language, but it’s
a separate tool. To the best of our knowledge, it’s not commonly used
(although, if you are using it, please get in touch with us—we are curious to
learn about your proofs!). It is also not as expressive as modern proof
assistants, such as Rocq or Lean. Quint allows for a <a href="https://quint-lang.org/docs/checking-properties#inductive-invariants">form of automated
correctness
proof</a>
that is not guaranteed to always succeed, even for correct protocols, due
to the limitations of its underlying solvers. <a href="#fnref:tlaps" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:async" role="doc-endnote">
      <p>Other famous consensus protocols, such as Paxos or Raft, do not assume
synchrony and work under a weaker assumption of asynchronous communication,
where one cannot tell the difference between a slow and a “dead” participant,
as the messages can take arbitrarily long to deliver. That is, they
guarantee correctness without any timing assumptions, but, in practice, they
also rely on time-outs to make progress. <a href="#fnref:async" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:class" role="doc-endnote">
      <p>For those who think in terms of Lean (or Haskell), the <code class="language-plaintext highlighter-rouge">instantiate</code>
command simply imposes the <code class="language-plaintext highlighter-rouge">TotalOrder</code> type class constraint on the
elements of the abstract type <code class="language-plaintext highlighter-rouge">value</code>. <a href="#fnref:class" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ivy1" role="doc-endnote">
      <p>This notation is inspired by the syntax of the <a href="https://github.com/kenmcmil/ivy">Ivy</a> verifier for distributed protocols. <a href="#fnref:ivy1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:hilbert" role="doc-endnote">
      <p>Veil syntax <code class="language-plaintext highlighter-rouge">let x :| p x</code> allows one to constrain the values of a
non-deterministically picked <code class="language-plaintext highlighter-rouge">x</code> to only those that satisfy the Boolean predicate <code class="language-plaintext highlighter-rouge">p</code>.
This operator, known as <a href="https://en.wikipedia.org/wiki/Epsilon_calculus">Hilbert’s epsilon
operator</a>, is very
convenient for modelling constrained non-deterministic choice. <a href="#fnref:hilbert" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ilya Sergey</name></author><category term="verification" /><category term="distributed systems" /><category term="veil" /><category term="model checking" /><category term="lean" /><summary type="html"><![CDATA[In this post, we discuss how to formalise, test, and prove the correctness of a classic distributed protocol by combining model checking, automated deductive verification, and AI-powered invariant inference in Veil, a new auto-active Lean-based verifier for distributed protocols.]]></summary></entry><entry><title type="html">Multi-Modal Program Verification in Velvet</title><link href="https://proofsandintuitions.net/2026/01/21/multi-modal-verification-velvet/" rel="alternate" type="text/html" title="Multi-Modal Program Verification in Velvet" /><published>2026-01-21T00:00:00+00:00</published><updated>2026-01-21T00:00:00+00:00</updated><id>https://proofsandintuitions.net/2026/01/21/multi-modal-verification-velvet</id><content type="html" xml:base="https://proofsandintuitions.net/2026/01/21/multi-modal-verification-velvet/"><![CDATA[<p>In this post, we will show how to specify and verify imperative programs in Lean
4 using
<a href="https://github.com/verse-lab/velvet">Velvet</a>—an
embedded verifier, which relies on a combination of automated symbolic and
AI-assisted theorem proving techniques.</p>

<!--more-->

<blockquote>
  <p><strong>Disclaimer:</strong> Velvet is currently under active development by <a href="https://verse-lab.org/">VERSE
Lab</a>, as we are working to improve its expressivity
and performance. It’s likely that its codebase will soon change substantially.
Nevertheless, the code linked in this post will remain accessible.</p>
</blockquote>

<p>Formal program verification is about telling <em>what</em> a program should do without
telling <em>how</em> it should do it and then mathematically proving that the program
indeed does exactly that. The <em>what</em> is given by a <em>program specification</em>—a
logical statement that describes the assumptions on the program’s input and the
properties that should hold true about its outcomes.<sup id="fnref:specs" role="doc-noteref"><a href="#fn:specs" class="footnote" rel="footnote">1</a></sup></p>

<h2 id="getting-started-specifying-and-verifying-functional-programs">Getting Started: Specifying and Verifying Functional Programs</h2>

<p>The <a href="https://lean-lang.org/">Lean 4</a> theorem prover allows one to write a
program and also to state its specification in the form of a mathematical theorem.
For instance, the following code fragment shows a function <code class="language-plaintext highlighter-rouge">append</code> that
concatenates two lists of integers and a theorem <code class="language-plaintext highlighter-rouge">append_assoc</code> that states and
proves the function’s associativity (meaning that the result of concatenating
three lists does not depend on the order in which we perform concatenation—only
on the position of the arguments).</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">append</span> (<span class="n">xs</span> <span class="n">ys</span> : <span class="n">List</span> <span class="n">Int</span>) : <span class="n">List</span> <span class="n">Int</span> :=
  <span class="k">match</span> <span class="n">xs</span> <span class="k">with</span>
  <span class="o">|</span> []      <span class="o">=&gt;</span> <span class="n">ys</span>
  <span class="o">|</span> <span class="n">x</span> :: <span class="n">xs</span> <span class="o">=&gt;</span> <span class="n">x</span> :: <span class="n">append</span> <span class="n">xs</span> <span class="n">ys</span>

<span class="k">theorem</span> <span class="n">append_assoc</span> (<span class="n">xs</span> <span class="n">ys</span> <span class="n">zs</span> : <span class="n">List</span> <span class="n">Int</span>) : 
    <span class="n">append</span> (<span class="n">append</span> <span class="n">xs</span> <span class="n">ys</span>) <span class="n">zs</span> <span class="o">=</span> <span class="n">append</span> <span class="n">xs</span> (<span class="n">append</span> <span class="n">ys</span> <span class="n">zs</span>) := <span class="k">by</span>
  <span class="n">induction</span> <span class="n">xs</span> <span class="k">with</span>
  <span class="o">|</span> <span class="n">nil</span> <span class="o">=&gt;</span> <span class="n">rfl</span>
  <span class="o">|</span> <span class="n">cons</span> <span class="n">x</span> <span class="n">xs</span> <span class="n">ih</span> <span class="o">=&gt;</span> <span class="n">simp</span> [<span class="n">append</span>, <span class="n">ih</span>]  
</code></pre></div></div>

<p>The proof of <code class="language-plaintext highlighter-rouge">append_assoc</code> is done by <em>induction</em> on the shape of the first
argument of the <code class="language-plaintext highlighter-rouge">append</code> function (i.e., the list “on the left side” of the
concatenation), and its details are not that important for now: it suffices to say
that it closely mimics the paper-and-pencil argument, which establishes the
validity of the theorem’s statement by considering the two ways in which a list can
be constructed: as an empty list, or as an element prepended to another list.</p>
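<p>For readers who would like to see the two cases spelled out, here is a more verbose variant of the same proof. It is only a sketch: the name <code class="language-plaintext highlighter-rouge">append_assoc'</code> is ours, and the exact tactic calls may need small adjustments depending on the Lean version.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>theorem append_assoc' (xs ys zs : List Int) :
    append (append xs ys) zs = append xs (append ys zs) := by
  induction xs with
  | nil =>
    -- Base case: `append [] l` computes to `l`, so both sides are `append ys zs`.
    rfl
  | cons x xs ih =>
    -- Inductive step: unfold one layer of `append` on each side ...
    simp only [append]
    -- ... and rewrite the tail using the induction hypothesis `ih`.
    rw [ih]
</code></pre></div></div>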

<p>The program <code class="language-plaintext highlighter-rouge">append</code> is written in a <em>functional</em> style, in which the result of
a program is determined solely by its parameters, and all functions always
terminate. Thanks to these restrictions, functional programs are known to be
particularly well-suited for formal verification, with simple theorems featuring
relatively natural proofs—just as we’ve seen above. Unfortunately, such
mathematically “pure” functional programming makes it non-trivial to express
so-called <em>imperative</em> features of common programming languages (and even
pseudocode languages used in common textbooks for presenting algorithms):
potentially non-terminating loops, exceptions, mutable variables, and
randomness.</p>

<p>Velvet addresses this gap by providing support for imperative programming within
Lean’s verification ecosystem. It is not the only existing verifier for
imperative programs—many great tools exist to do exactly that, including
<a href="https://dafny.org/">Dafny</a>, <a href="https://github.com/verus-lang/verus">Verus</a>, and
<a href="https://github.com/viperproject/prusti-dev">Prusti</a>, to mention just a few.
None of those tools, however, allow one to use Lean as a way to orchestrate
their verification, which is a unique feature of Velvet—by virtue of it being
<em>embedded</em> in Lean. To put it differently, Velvet is a Lean library rather than
a standalone tool.</p>

<h2 id="imperative-programming-in-velvet">Imperative Programming in Velvet</h2>

<blockquote>
  <p>The code accompanying the rest of the post can be found in <a href="https://github.com/verse-lab/velvet/blob/master/Velvet/Examples/IsNonPrime.lean">this
file</a>.</p>
</blockquote>

<p>Let’s start our tour of Velvet by implementing in it a simple program that
checks whether a given natural number is not prime:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">method</span> <span class="n">IsNonPrime</span> (<span class="n">n</span>: <span class="n">Nat</span>) <span class="n">return</span> (<span class="n">result</span>: <span class="n">Bool</span>)
  <span class="n">do</span>
    <span class="n">if</span> <span class="n">n</span> <span class="o">≤</span> <span class="mi">1</span> <span class="n">then</span>
      <span class="n">return</span> <span class="n">true</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">i</span>: <span class="n">Nat</span> := <span class="mi">2</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">ret</span>: <span class="n">Bool</span> := <span class="n">false</span>
    <span class="n">while</span> <span class="n">i</span> <span class="o">*</span> <span class="n">i</span> <span class="o">≤</span> <span class="n">n</span>
    <span class="n">invariant</span> <span class="n">true</span>
    <span class="n">do</span>
      <span class="n">if</span> <span class="n">n</span> <span class="err">%</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">then</span>
        <span class="n">ret</span> := <span class="n">true</span>
      <span class="n">i</span> := <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">return</span> <span class="n">ret</span>
</code></pre></div></div>

<p>Most of the code of <code class="language-plaintext highlighter-rouge">IsNonPrime</code> should be self-explanatory, so let us
highlight only the unusual parts. First, its result is explicitly named in its
signature (as <code class="language-plaintext highlighter-rouge">result</code>, but any name distinct from those of the
parameters will do) for reasons that will soon become clear. As in Python,
the scoping of the code is determined by its indentation, and the <code class="language-plaintext highlighter-rouge">do</code> keyword starts
a new code block. All local variables (introduced using <code class="language-plaintext highlighter-rouge">let</code>) and function
parameters are immutable by default unless explicitly marked as <code class="language-plaintext highlighter-rouge">mut</code>. In the
<code class="language-plaintext highlighter-rouge">while</code>-loop, right after the condition we can see the <code class="language-plaintext highlighter-rouge">invariant true</code>
annotation. For now, it serves no particular purpose except for making the
parser happy, so let’s not worry about it and just think of it as a piece of
unavoidable boilerplate.</p>

<p>We can immediately test our function by running it and checking its result in VSCode’s Lean InfoView. For instance, executing</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">#eval</span> (<span class="n">IsNonPrime</span> <span class="mi">42</span>)<span class="o">.</span><span class="n">extract</span>
</code></pre></div></div>

<p>results in <code class="language-plaintext highlighter-rouge">true</code> (42 is indeed non-prime), while running</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">#eval</span> (<span class="n">IsNonPrime</span> <span class="mi">239</span>)<span class="o">.</span><span class="n">extract</span>
</code></pre></div></div>

<p>produces the result <code class="language-plaintext highlighter-rouge">false</code>, just as expected.</p>
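<p>A few more inputs are worth trying, especially the boundary cases. The expected results in the comments are read off the code of <code class="language-plaintext highlighter-rouge">IsNonPrime</code> above rather than taken from an actual run, so treat them as a sketch:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#eval (IsNonPrime 0).extract   -- true: the early `n ≤ 1` branch is taken
#eval (IsNonPrime 2).extract   -- false: 2 * 2 &gt; 2, so the loop body never runs
#eval (IsNonPrime 9).extract   -- true: the divisor 3 is found when i = 3
#eval (IsNonPrime 13).extract  -- false: neither 2 nor 3 divides 13, and 4 * 4 &gt; 13
</code></pre></div></div>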

<h2 id="specifying-a-program-in-velvet">Specifying a Program in Velvet</h2>

<p>As the next step, we move from tests, which can only show that a program behaves
as expected on specific inputs, to <em>formal specifications</em> stating that a program
<em>always</em> does what it’s supposed to do. Despite its tiny size and simplicity,
<code class="language-plaintext highlighter-rouge">IsNonPrime</code> is surprisingly tricky to specify formally, as it relies on the
definition of primality of natural numbers. Thanks to the rich vocabulary of
Lean specifications and programming mechanisms, one can define primality in
multiple different ways. We are going to do it by first writing a mathematical
function that returns the number of divisors of a natural number <code class="language-plaintext highlighter-rouge">n</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">countDivisors</span> (<span class="n">n</span>: <span class="n">Nat</span>) : <span class="n">Nat</span> :=
  (<span class="n">List</span><span class="o">.</span><span class="n">range</span> (<span class="n">n</span> <span class="o">+</span> <span class="mi">1</span>))<span class="o">.</span><span class="n">filter</span> (<span class="k">fun</span> <span class="n">d</span> <span class="o">=&gt;</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">∧</span> <span class="n">n</span> <span class="err">%</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">0</span>) <span class="o">|&gt;.</span><span class="n">length</span>
</code></pre></div></div>

<p>In plain words, <code class="language-plaintext highlighter-rouge">countDivisors</code> first constructs a list of numbers from <code class="language-plaintext highlighter-rouge">0</code> to
<code class="language-plaintext highlighter-rouge">n</code>, then keeps only those that are strictly positive and are divisors of <code class="language-plaintext highlighter-rouge">n</code>;
finally, it returns the size of the resulting list. Indeed, for any prime number
<code class="language-plaintext highlighter-rouge">countDivisors</code> should return 2: counting 1 and the number itself, i.e., only
the <em>trivial</em> divisors. We are going to adopt this as a definition of a prime
number, defined in Lean as the following predicate:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="n">isPrime</span> (<span class="n">n</span>: <span class="n">Nat</span>) : <span class="kt">Prop</span> :=
  <span class="n">n</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">∧</span> <span class="n">countDivisors</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">2</span>
</code></pre></div></div>
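<p>As a quick sanity check of these two definitions, one can evaluate <code class="language-plaintext highlighter-rouge">countDivisors</code> on a few inputs. Since <code class="language-plaintext highlighter-rouge">isPrime</code> is a <code class="language-plaintext highlighter-rouge">Prop</code> rather than a <code class="language-plaintext highlighter-rouge">Bool</code>, the sketch below unfolds it by hand and relies on Lean’s <code class="language-plaintext highlighter-rouge">decide</code> to turn the resulting decidable statement into a Boolean:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#eval countDivisors 7    -- 2 (divisors: 1, 7)
#eval countDivisors 12   -- 6 (divisors: 1, 2, 3, 4, 6, 12)
#eval countDivisors 1    -- 1 (divisor: 1), so 1 is not prime
#eval countDivisors 0    -- 0 (the only candidate, 0, is rejected by `d &gt; 0`)

-- The body of `isPrime`, written out for specific numbers:
#eval decide (7 &gt; 1 ∧ countDivisors 7 = 2)     -- true
#eval decide (12 &gt; 1 ∧ countDivisors 12 = 2)   -- false
</code></pre></div></div>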

<p>We are now ready to ascribe a formal specification to <code class="language-plaintext highlighter-rouge">IsNonPrime</code>, which we can
do by adding a logical statement that starts with <code class="language-plaintext highlighter-rouge">ensures</code> right after the
function’s signature:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">method</span> <span class="n">IsNonPrime</span> (<span class="n">n</span>: <span class="n">Nat</span>) <span class="n">return</span> (<span class="n">result</span>: <span class="n">Bool</span>)
  <span class="n">ensures</span> <span class="n">result</span> <span class="o">↔</span> <span class="o">¬</span><span class="n">isPrime</span> <span class="n">n</span>
  <span class="n">do</span>
    <span class="n">if</span> <span class="n">n</span> <span class="o">≤</span> <span class="mi">1</span> <span class="n">then</span>
      <span class="n">return</span> <span class="n">true</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">i</span>: <span class="n">Nat</span> := <span class="mi">2</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">ret</span>: <span class="n">Bool</span> := <span class="n">false</span>
    <span class="n">while</span> <span class="n">i</span> <span class="o">*</span> <span class="n">i</span> <span class="o">≤</span> <span class="n">n</span>
    <span class="n">invariant</span> <span class="n">true</span>
    <span class="n">do</span>
      <span class="n">if</span> <span class="n">n</span> <span class="err">%</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">then</span>
        <span class="n">ret</span> := <span class="n">true</span>
      <span class="n">i</span> := <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">return</span> <span class="n">ret</span>
</code></pre></div></div>

<p>The added statement is commonly called a <em>postcondition</em>: it postulates what
should hold true about the program’s result at the end of its execution (and
that’s why we had to give an explicit name to the result!).<sup id="fnref:total" role="doc-noteref"><a href="#fn:total" class="footnote" rel="footnote">2</a></sup> In our
example, the postcondition asserts that the function should return <code class="language-plaintext highlighter-rouge">true</code> <em>if
and only if</em> (denoted as <code class="language-plaintext highlighter-rouge">↔</code> in Lean) its argument <code class="language-plaintext highlighter-rouge">n</code> is <em>not</em> prime (<code class="language-plaintext highlighter-rouge">¬isPrime n</code>).</p>

<p>Let us now go ahead and try to verify that the desired property does indeed hold
for any input <code class="language-plaintext highlighter-rouge">n</code> passed to <code class="language-plaintext highlighter-rouge">IsNonPrime</code>. This can be done by adding the
following command to the file after the function definition:<sup id="fnref:loom" role="doc-noteref"><a href="#fn:loom" class="footnote" rel="footnote">3</a></sup></p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prove_correct</span> <span class="n">IsNonPrime</span> <span class="k">by</span>
  <span class="n">loom_solve</span><span class="o">!</span>
</code></pre></div></div>

<p>Sadly, this does not immediately verify our program. Running the <code class="language-plaintext highlighter-rouge">loom_solve!</code> proof
tactic, however, provides us with a very helpful piece of information: a Lean
proof context shown in VSCode InfoView, which states what exactly could not be
proven about the program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n i : ℕ
result : Bool
if_neg : 1 &lt; n
done_1 : n &lt; i * i
⊢ result = true ↔ ¬isPrime n
</code></pre></div></div>

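<p>In short, this is what remains to be proven: when the program’s input <code class="language-plaintext highlighter-rouge">n</code> is
strictly larger than 1 (indicated by the hypothesis <code class="language-plaintext highlighter-rouge">if_neg</code>) and the loop has
terminated because <code class="language-plaintext highlighter-rouge">n &lt; i * i</code> (the hypothesis <code class="language-plaintext highlighter-rouge">done_1</code>), the returned
result matches our specification.</p>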

<p>What about the case when <code class="language-plaintext highlighter-rouge">n</code> is 0 or 1? In fact, this case, corresponding to
taking the <code class="language-plaintext highlighter-rouge">then</code>-branch in the program’s body, was indeed proven by
<code class="language-plaintext highlighter-rouge">loom_solve!</code>, and this happened automatically! To see this, let us step back
and ask Velvet for all facts that need to be proven to establish the desired
postcondition of <code class="language-plaintext highlighter-rouge">IsNonPrime</code>. This can be done by adding the following line right above <code class="language-plaintext highlighter-rouge">prove_correct ...</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set_option</span> <span class="n">loom</span><span class="o">.</span><span class="n">solver</span> <span class="s">"custom"</span>
</code></pre></div></div>

<p>Doing so makes <code class="language-plaintext highlighter-rouge">prove_correct</code> produce the following collection of Lean
statements:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Goal 1
n : ℕ
if_pos : n ≤ 1
⊢ ¬isPrime n

-- Goal 2
n : ℕ
if_neg : ¬n ≤ 1
i : ℕ
ret : Bool
if_neg_1 : ¬i * i ≤ n
⊢ ¬i * i ≤ n

-- Goal 3
n i : ℕ
ret : Bool
if_neg : 1 &lt; n
done_1 : n &lt; i * i
⊢ ret = true ↔ ¬isPrime n
</code></pre></div></div>

<p>We have already seen the third statement, so where did the first two come from?
Looking closely, one can notice that the first one corresponds to the fact that
needs to be proven to justify <code class="language-plaintext highlighter-rouge">IsNonPrime</code> returning <code class="language-plaintext highlighter-rouge">true</code> for <code class="language-plaintext highlighter-rouge">n ≤ 1</code>. The
correctness of this statement follows immediately from the definition of
<code class="language-plaintext highlighter-rouge">isPrime</code>, which we have defined above. The second goal comes from the
requirement, imposed by Velvet, to show that the loop condition (in this case,
<code class="language-plaintext highlighter-rouge">i * i ≤ n</code>) is indeed negated at the end of the loop execution, and it holds
trivially. Both goals are proven by <code class="language-plaintext highlighter-rouge">loom_solve!</code> automatically using a
combination of SMT solvers (such as Z3 and cvc5) and Lean’s own proof automation
(most prominently, the <code class="language-plaintext highlighter-rouge">grind</code> and <code class="language-plaintext highlighter-rouge">aesop</code> tactics).</p>
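<p>To get a feeling for how simple the first two goals are, here is how one might discharge them by hand. This is our own sketch, stated as standalone Lean examples; it is not what <code class="language-plaintext highlighter-rouge">loom_solve!</code> does internally, and it assumes the definition of <code class="language-plaintext highlighter-rouge">isPrime</code> given above.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Goal 1: `n ≤ 1` contradicts the first conjunct of `isPrime n`.
example (n : Nat) (if_pos : n ≤ 1) : ¬isPrime n := by
  intro h
  unfold isPrime at h        -- h : n &gt; 1 ∧ countDivisors n = 2
  obtain ⟨hgt, _⟩ := h       -- hgt : n &gt; 1
  omega                      -- `n ≤ 1` and `n &gt; 1` cannot both hold

-- Goal 2: the conclusion is literally one of the hypotheses.
example (n i : Nat) (if_neg_1 : ¬i * i ≤ n) : ¬i * i ≤ n := by
  exact if_neg_1
</code></pre></div></div>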

<h2 id="multi-modal-verification">Multi-Modal Verification</h2>

<p>The demonstrated ability to automatically prove some of the facts required to
verify a program, while leaving others open for an interactive proof, is one
of the key advantages of Velvet compared to state-of-the-art program verifiers,
which typically provide only an automated or only an interactive verification
mode. The former suffer from the lack of debugging information when a proof
fails, while the latter make most of the proofs (even of trivial facts)
extremely laborious. Velvet combines the strengths of both modes, thus
providing a <em>multi-modal</em> verification experience.</p>

<h2 id="dealing-with-unbounded-executions-loop-invariants">Dealing with Unbounded Executions: Loop Invariants</h2>

<p>To prove the only interesting statement about <code class="language-plaintext highlighter-rouge">IsNonPrime</code>, we will have to do a
bit more work and provide so-called <em>loop invariants</em>—assertions such that (1)
they hold right before the start of the loop’s execution, (2) if they hold true
at the start of a loop iteration, they also hold true at the end of it, and (3)
taken together with the negated loop condition, they allow deriving the program’s postcondition. From the
mathematician’s perspective, loop invariants are very similar to an induction
hypothesis, which also needs to hold in the base case and must be
re-established for the induction step. In this case, (1) is the equivalent of
proving the base case, (2) is the induction step, and (3) is using the statement
proven by induction (i.e., the invariant) to prove the desired fact. So, if you
are familiar with proofs by induction, you can simply think of a combination of
inductive loop invariants as an induction hypothesis necessary to prove that
certain facts are true about our program’s state no matter how many iterations
the loop makes.</p>
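<p>To make requirements (1) and (2) concrete, consider the simple candidate invariant <code class="language-plaintext highlighter-rouge">2 ≤ i</code> for the loop of <code class="language-plaintext highlighter-rouge">IsNonPrime</code>, in which <code class="language-plaintext highlighter-rouge">i</code> starts at 2 and is incremented by one in every iteration. The obligations below are a rough sketch written as standalone Lean examples; the goals Velvet actually generates carry more hypotheses (for instance, the loop condition) and may be shaped differently.</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- (1) Initialisation, the "base case": the invariant holds for the initial value i = 2.
example : 2 ≤ 2 := Nat.le_refl 2

-- (2) Preservation, the "induction step": if the invariant holds before an
-- iteration, it still holds after the assignment `i := i + 1`.
example (i : Nat) (inv : 2 ≤ i) : 2 ≤ i + 1 := by omega
</code></pre></div></div>

<p>On its own, this invariant is far too weak to satisfy requirement (3), which is why more informative invariants are needed.</p>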

<p>Sadly, there is no algorithm to reliably infer loop invariants for any program
and its provable postcondition—a fact that directly follows from <a href="https://en.wikipedia.org/wiki/Rice%27s_theorem"><em>Rice’s
Theorem</em></a>, which states that any
non-trivial semantic property of programs (e.g., the existence of their loop
invariants for a given postcondition) is undecidable. However, invariants can
frequently be conjectured by using our understanding of what the loop does
(yes, you can also try to do it using your favourite AI system) and then
verified using formal proof techniques to satisfy the requirements (1)-(3)
listed above. This is what we are going to do for now. Let us change <code class="language-plaintext highlighter-rouge">invariant
true</code> in the program, so it will look as follows:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">method</span> <span class="n">IsNonPrime</span> (<span class="n">n</span>: <span class="n">Nat</span>) <span class="n">return</span> (<span class="n">result</span>: <span class="n">Bool</span>)
  <span class="n">ensures</span> <span class="n">result</span> <span class="o">↔</span> <span class="o">¬</span><span class="n">isPrime</span> <span class="n">n</span>
  <span class="n">do</span>
    <span class="n">if</span> <span class="n">n</span> <span class="o">≤</span> <span class="mi">1</span> <span class="n">then</span>
      <span class="n">return</span> <span class="n">true</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">i</span>: <span class="n">Nat</span> := <span class="mi">2</span>
    <span class="n">let</span> <span class="n">mut</span> <span class="n">ret</span>: <span class="n">Bool</span> := <span class="n">false</span>
    <span class="n">while</span> <span class="n">i</span> <span class="o">*</span> <span class="n">i</span> <span class="o">≤</span> <span class="n">n</span>
    <span class="n">invariant</span> <span class="mi">2</span> <span class="o">≤</span> <span class="n">i</span>
    <span class="n">invariant</span> (<span class="n">ret</span> <span class="o">=</span> <span class="n">false</span> <span class="o">↔</span> <span class="o">∀</span> <span class="n">d</span>, <span class="mi">2</span> <span class="o">≤</span> <span class="n">d</span> <span class="o">∧</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="n">i</span> <span class="o">→</span> <span class="n">n</span> <span class="err">%</span> <span class="n">d</span> <span class="o">≠</span> <span class="mi">0</span>)
    <span class="n">invariant</span> (<span class="n">i</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">*</span> (<span class="n">i</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">≤</span> <span class="n">n</span>
    <span class="n">do</span>
      <span class="n">if</span> <span class="n">n</span> <span class="err">%</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">then</span>
        <span class="n">ret</span> := <span class="n">true</span>
      <span class="n">i</span> := <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">return</span> <span class="n">ret</span>
</code></pre></div></div>

<p>The first invariant states that at the start and at the end of each loop
iteration, the variable <code class="language-plaintext highlighter-rouge">i</code> is at least <code class="language-plaintext highlighter-rouge">2</code>. The second one is the most
interesting: it states that the variable <code class="language-plaintext highlighter-rouge">ret</code> is <code class="language-plaintext highlighter-rouge">false</code> if and only if there
are no divisors of <code class="language-plaintext highlighter-rouge">n</code> between <code class="language-plaintext highlighter-rouge">2</code> (inclusive) and <code class="language-plaintext highlighter-rouge">i</code> (exclusive). Finally, the third invariant simply
states that the square of <code class="language-plaintext highlighter-rouge">i - 1</code> is not larger than <code class="language-plaintext highlighter-rouge">n</code>. Because of the
requirement (2), adding these invariants made the job of our verifier quite a
bit harder: now we need to prove that each of these invariants is preserved by a
loop iteration. As a result, if we once again add</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set_option</span> <span class="n">loom</span><span class="o">.</span><span class="n">solver</span> <span class="s">"custom"</span>
</code></pre></div></div>

<p>in front of our proof script (starting with <code class="language-plaintext highlighter-rouge">prove_correct IsNonPrime by</code>), we
will see that <code class="language-plaintext highlighter-rouge">loom_solve!</code> leaves a whopping 15 facts to prove! The good news
is that most of them are no match for Lean’s proof automation, and can be solved
without any involvement required from our side. So let us comment out the option
<code class="language-plaintext highlighter-rouge">set_option loom.solver "custom"</code> and run this script instead:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prove_correct</span> <span class="n">IsNonPrime</span> <span class="k">by</span>
  <span class="n">loom_solve</span><span class="o">;</span> <span class="n">try</span> <span class="n">simp_all</span>
</code></pre></div></div>

<p>Now it leaves us with just a single fact to prove, which looks strikingly
similar to the one we have struggled with before:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="n">i</span> : <span class="o">ℕ</span>
<span class="n">ret</span> : <span class="n">Bool</span>
<span class="n">i_1</span> : <span class="o">ℕ</span>
<span class="n">ret_1</span> : <span class="n">Bool</span>
<span class="n">if_neg</span> : <span class="mi">1</span> <span class="o">&lt;</span> <span class="n">n</span>
<span class="n">invariant_1</span> : <span class="mi">2</span> <span class="o">≤</span> <span class="n">i_1</span>
<span class="n">invariant_2</span> : <span class="n">ret_1</span> <span class="o">=</span> <span class="n">false</span> <span class="o">↔</span> <span class="o">∀</span> (<span class="n">d</span> : <span class="o">ℕ</span>), <span class="mi">2</span> <span class="o">≤</span> <span class="n">d</span> <span class="o">→</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="n">i_1</span> <span class="o">→</span> <span class="o">¬</span><span class="n">n</span> <span class="err">%</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">invariant_3</span> : (<span class="n">i_1</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">*</span> (<span class="n">i_1</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">≤</span> <span class="n">n</span>
<span class="n">done_1</span> : <span class="n">n</span> <span class="o">&lt;</span> <span class="n">i_1</span> <span class="o">*</span> <span class="n">i_1</span>
<span class="n">i_2</span> : <span class="n">i</span> <span class="o">=</span> <span class="n">i_1</span> <span class="o">∧</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">ret_1</span>
<span class="err">⊢</span> <span class="n">ret_1</span> <span class="o">=</span> <span class="n">true</span> <span class="o">↔</span> <span class="o">¬</span><span class="n">isPrime</span> <span class="n">n</span>
</code></pre></div></div>

<p>This time, however, we are in a much better situation, as, thanks to our (or
rather Velvet’s and Lean’s) hard work, we have the three invariants
(<code class="language-plaintext highlighter-rouge">invariant_1</code> through <code class="language-plaintext highlighter-rouge">invariant_3</code>) available to us as assumptions, meaning we can use them to
verify the much-desired fact about the function’s result.</p>

<h2 id="unleashing-ai-powered-proof-automation">Unleashing AI-Powered Proof Automation</h2>

<p>For now, the only thing that stands between us and proving correctness of
<code class="language-plaintext highlighter-rouge">IsNonPrime</code> with regard to its specification is the statement above that,
roughly, states that a number <code class="language-plaintext highlighter-rouge">n</code> is prime if and only if it has no
divisors between 2 and its integer square root (which is <code class="language-plaintext highlighter-rouge">i_1 - 1</code>, by <code class="language-plaintext highlighter-rouge">invariant_3</code> and <code class="language-plaintext highlighter-rouge">done_1</code>). While
somewhat obvious from our understanding of the mathematics of division, this fact is
far from trivial to prove, mostly because we talk about enumerating all potential
divisors not between 2 and $n$ but between 2 and $\sqrt{n}$. Since this requires
number-theoretic reasoning beyond what SMT solvers or Lean’s <code class="language-plaintext highlighter-rouge">grind</code> tactic
handle well, it’s time to bring in AI-powered proof automation.</p>
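<p>The heart of the square-root argument is a small arithmetic fact: in any factorisation of <code class="language-plaintext highlighter-rouge">n</code> into two factors, the square of the smaller factor does not exceed <code class="language-plaintext highlighter-rouge">n</code>. Combined with the hypothesis <code class="language-plaintext highlighter-rouge">n &lt; i_1 * i_1</code>, it implies that the smaller factor is below <code class="language-plaintext highlighter-rouge">i_1</code>, i.e., within the range covered by the second invariant. The statement below is our own illustration of this step and is not part of the Velvet development:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- If n = a * b with a ≤ b, then the smaller factor satisfies a * a ≤ n.
example (a b n : Nat) (hab : a ≤ b) (hn : n = a * b) : a * a ≤ n := by
  rw [hn]                                    -- goal: a * a ≤ a * b
  exact Nat.mul_le_mul (Nat.le_refl a) hab   -- multiply `a ≤ a` and `a ≤ b`
</code></pre></div></div>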

<p>First, let us “hoist” the statement we want to prove as a separate
theorem called <code class="language-plaintext highlighter-rouge">remaining_goal</code> (any name will do), whose proof is omitted for
now via Lean’s <code class="language-plaintext highlighter-rouge">sorry</code> keyword:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">remaining_goal</span>
(<span class="n">n</span> : <span class="o">ℕ</span>)
(<span class="n">i</span> : <span class="o">ℕ</span>)
(<span class="n">ret</span> : <span class="n">Bool</span>)
(<span class="n">i_1</span> : <span class="o">ℕ</span>)
(<span class="n">ret_1</span> : <span class="n">Bool</span>)
(<span class="n">if_neg</span> : <span class="mi">1</span> <span class="o">&lt;</span> <span class="n">n</span>)
(<span class="n">invariant_1</span> : <span class="mi">2</span> <span class="o">≤</span> <span class="n">i_1</span>)
(<span class="n">invariant_2</span> : <span class="n">ret_1</span> <span class="o">=</span> <span class="n">false</span> <span class="o">↔</span> <span class="o">∀</span> (<span class="n">d</span> : <span class="o">ℕ</span>), <span class="mi">2</span> <span class="o">≤</span> <span class="n">d</span> <span class="o">→</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="n">i_1</span> <span class="o">→</span> <span class="o">¬</span><span class="n">n</span> <span class="err">%</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">0</span>)
(<span class="n">invariant_3</span> : (<span class="n">i_1</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">*</span> (<span class="n">i_1</span> <span class="o">-</span> <span class="mi">1</span>) <span class="o">≤</span> <span class="n">n</span>)
(<span class="n">done_1</span> : <span class="n">n</span> <span class="o">&lt;</span> <span class="n">i_1</span> <span class="o">*</span> <span class="n">i_1</span>)
(<span class="n">i_2</span> : <span class="n">i</span> <span class="o">=</span> <span class="n">i_1</span> <span class="o">∧</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">ret_1</span>)
: <span class="n">ret_1</span> <span class="o">=</span> <span class="n">true</span> <span class="o">↔</span> <span class="o">¬</span><span class="n">isPrime</span> <span class="n">n</span> :=
  <span class="k">by</span> <span class="n">sorry</span>
</code></pre></div></div>

<p>With this theorem, the verification of <code class="language-plaintext highlighter-rouge">IsNonPrime</code> can be accomplished simply
via the following script, which makes use of <code class="language-plaintext highlighter-rouge">remaining_goal</code>:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prove_correct</span> <span class="n">IsNonPrime</span> <span class="k">by</span>
  <span class="n">loom_solve</span><span class="o">;</span> <span class="n">try</span> <span class="n">simp_all</span>
  <span class="n">apply</span> (<span class="n">remaining_goal</span> <span class="n">n</span> <span class="n">i</span> <span class="n">ret</span> <span class="n">i_1</span> <span class="n">ret_1</span> <span class="n">if_neg</span> <span class="n">invariant_1</span> <span class="n">invariant_2</span> <span class="n">invariant_3</span> <span class="n">done_1</span> <span class="n">i_2</span>)
</code></pre></div></div>
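
<p>The pattern used here (stating the leftover obligation as a standalone theorem and then applying it with all of its arguments spelled out) can be sketched in plain Lean, independently of Velvet. The theorem <code class="language-plaintext highlighter-rouge">leftover</code> and its hypotheses below are hypothetical stand-ins chosen purely for illustration; they are not part of the Velvet development:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical stand-in for a goal that automation could not discharge.
-- The proof is left as `sorry`, so Lean accepts the file with a warning.
theorem leftover (n i : Nat) (h_inv : 2 ≤ i) (h_done : n &lt; i) :
    1 ≤ i ∧ n &lt; i := by
  sorry

-- The client proof passes the variables and hypotheses explicitly.
example (n i : Nat) (h_inv : 2 ≤ i) (h_done : n &lt; i) : 1 ≤ i ∧ n &lt; i := by
  apply leftover n i h_inv h_done
</code></pre></div></div>

<p>Once the <code class="language-plaintext highlighter-rouge">sorry</code> is replaced by an actual proof, the client proof does not need to change.</p>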

<p>Thanks to the multi-modal nature of Velvet, we could prove <code class="language-plaintext highlighter-rouge">remaining_goal</code>
manually by writing a Lean proof script, as its statement is, in fact, true.
Here, however, we will take advantage of existing AI-based proof automation
systems.</p>

<p>In this experiment, which took place in early January 2026, I tried using
<a href="https://claude.com/product/claude-code">Claude Code</a> and
<a href="https://harmonic.fun/">Harmonic</a>’s Aristotle. In the case of Claude Code, I simply
asked it to complete all the proofs in the file, making sure there were no <code class="language-plaintext highlighter-rouge">sorry</code>s
and that the file compiled. For Aristotle, I constructed a file that only
contained the theorem and the definitions required by it, quickly stripping out
Velvet-specific libraries, as those are not necessary for the proof at
this point and might prevent the system from accepting the file. Each of the two
systems was able to complete the proof in approximately 20 minutes.</p>

<p>The resulting proofs of this theorem are highly non-idiomatic, and I won’t be
posting them here to spare the reader’s eyes. In case you’re still curious, the
Lean development accompanying this post, including the proof produced by
Aristotle, can be found at <a href="https://github.com/verse-lab/velvet/blob/master/Velvet/Examples/IsNonPrime.lean">this
link</a>.
The repository also contains a number of other curious examples in Velvet,
including a highly non-trivial correctness proof of a <a href="https://github.com/verse-lab/velvet/blob/master/Velvet/Examples/MemAlloc.lean">memory
allocator</a>,
which was offered as a problem in the recent <a href="https://chinasoft.ccf.org.cn/#competition/theorem-proving">Theorem Proving
Competition</a> held in
Wuhan in November 2025.</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>In this post, we’ve had a brief introduction to the principles of specification
and verification of imperative programs in Velvet—a multi-modal program
verifier implemented as a library on top of Lean. Velvet combines automated and
interactive reasoning modes, attempting to discharge the majority of proof
obligations needed for program correctness using existing symbolic automation
techniques (such as SMT solvers and Lean’s <code class="language-plaintext highlighter-rouge">grind</code> tactic), leaving “complex”
facts to be proven by other means, such as writing a proof script manually. In
the latter case, present-day AI systems, such as Claude Code and Aristotle,
prove to be quite capable of completing the proofs involving mathematical
statements with close to no human intervention. This combination of symbolic
automation, interactive theorem proving, and AI assistance hints at a future
where rigorous program verification becomes accessible to a much wider audience
of non-experts.</p>

<h2 id="further-reading">Further Reading</h2>

<p>To learn more about Velvet’s capabilities, check out <a href="https://verse-lab.org/papers/velvet-dafny26.pdf">this short
paper</a>, which provides a
tutorial-style introduction to its features. If you are curious about semantic
foundations for engineering multi-modal verifiers on top of Lean and are
comfortable with programming language meta-theory, you might be interested in
checking out our recently published <a href="https://verse-lab.org/papers/loom-popl26.pdf">POPL’26 paper</a>.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:specs" role="doc-endnote">
      <p>There are, indeed, multiple styles of describing a desired
program’s behaviour. For instance, a specification can be a standalone
mathematical theorem that describes certain properties of a program, such as
associativity of the <code class="language-plaintext highlighter-rouge">String.concat</code> operation. To keep things simple,
in this post, we follow one of the most widely accepted styles,
<a href="https://en.wikipedia.org/wiki/Hoare_logic">Floyd-Hoare Logic</a>,
which specifies programs using pre- and postconditions. <a href="#fnref:specs" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:total" role="doc-endnote">
      <p>The postcondition on its own does not require a program to always
terminate. In fact, in a commonly-accepted way to reason about programs with
loops, known as <em>partial correctness</em>, a never-terminating program would
satisfy <em>any</em> postcondition, which might be a bit counter-intuitive. A
stronger notion of <em>total correctness</em> guarantees that a program always
terminates on a suitably constrained class of inputs, at the expense of
proving more facts about the program. In this post, we will stick with
partial correctness, leaving total correctness for future discussions. <a href="#fnref:total" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:loom" role="doc-endnote">
      <p>Many of the tactics used for writing proofs in Velvet have the <code class="language-plaintext highlighter-rouge">loom_</code>
prefix. This is because Velvet is, in fact, built on top of a more general
Lean library called <a href="https://github.com/verse-lab/loom">Loom</a>. You can read
more about Loom in <a href="https://verse-lab.org/papers/loom-popl26.pdf">this
paper</a>. <a href="#fnref:loom" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ilya Sergey</name></author><category term="programming" /><category term="verification" /><category term="lean" /><category term="velvet" /><category term="smt" /><category term="ai" /><summary type="html"><![CDATA[In this post, we will show how to specify and verify imperative programs in Lean 4 using Velvet—an embedded verifier, which relies on a combination of automated symbolic and AI-assisted theorem proving techniques.]]></summary></entry><entry><title type="html">Hello, World!</title><link href="https://proofsandintuitions.net/2026/01/10/hello-world/" rel="alternate" type="text/html" title="Hello, World!" /><published>2026-01-10T00:00:00+00:00</published><updated>2026-01-10T00:00:00+00:00</updated><id>https://proofsandintuitions.net/2026/01/10/hello-world</id><content type="html" xml:base="https://proofsandintuitions.net/2026/01/10/hello-world/"><![CDATA[<p>Welcome to <em>Proofs and Intuitions</em>! This is a blog about mathematics, formal verification, and the ideas that connect them.</p>

<!--more-->

<h2 id="a-bit-of-mathematics">A Bit of Mathematics</h2>

<p>Let’s start with something beautiful. The most famous equation in physics is
probably Einstein’s mass-energy equivalence: $E = mc^2$. But in pure
mathematics, Euler’s identity takes the crown:</p>

\[e^{i\pi} + 1 = 0\]

<p>This single equation connects five fundamental constants: $e$, $i$, $\pi$, $1$, and $0$. Truly remarkable!</p>
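
<p>One standard way to see why the identity holds is via Euler’s formula, $e^{i\theta} = \cos\theta + i\sin\theta$. Setting $\theta = \pi$ gives:</p>

\[e^{i\pi} = \cos\pi + i\sin\pi = -1 + i \cdot 0 = -1\]

<p>Adding $1$ to both sides yields the identity above.</p>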

<p>Here’s another classic: the quadratic formula. For any equation of the form $ax^2 + bx + c = 0$ with $a \neq 0$, the solutions are:</p>

\[x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]
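
<p>As a quick worked example (the coefficients are chosen here purely for illustration), take $x^2 - 3x + 2 = 0$, so that $a = 1$, $b = -3$, and $c = 2$:</p>

\[x = \frac{3 \pm \sqrt{(-3)^2 - 4 \cdot 1 \cdot 2}}{2 \cdot 1} = \frac{3 \pm \sqrt{1}}{2} = \frac{3 \pm 1}{2}\]

<p>This gives $x = 1$ and $x = 2$, which agrees with the factorisation $(x - 1)(x - 2) = x^2 - 3x + 2$.</p>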

<h2 id="lean-4-theorem-proving">Lean 4: Theorem Proving</h2>

<p>One of the exciting developments in modern mathematics is the use of <em>proof assistants</em> like Lean. These tools allow us to write mathematical proofs that can be mechanically verified by a computer.</p>

<p><img src="/assets/images/lean-gcd-vscode.png" alt="Proving the GCD algorithm correct in Lean with VSCode" />
<em>VSCode with Lean 4: proving correctness of the Euclidean GCD algorithm. The InfoView panel on the right shows the current proof state with three goals remaining.</em></p>

<p>Here’s a simple example. In Lean 4, we can define natural number addition and prove basic properties. For instance, we can express that <code class="language-plaintext highlighter-rouge">0 + n = n</code> using the <code class="language-plaintext highlighter-rouge">Nat.zero_add</code> theorem.</p>

<p>A simple inline reference: the term <code class="language-plaintext highlighter-rouge">Nat.succ n</code> represents the successor of <code class="language-plaintext highlighter-rouge">n</code>, i.e., <code class="language-plaintext highlighter-rouge">n + 1</code>.</p>
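
<p>Both facts can be checked directly in Lean. The following is a minimal sketch that relies only on Lean 4’s core library; the two <code class="language-plaintext highlighter-rouge">example</code> declarations are added here for illustration:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- `Nat.zero_add` is the library proof that `0 + n = n`.
example (n : Nat) : 0 + n = n := Nat.zero_add n

-- `Nat.succ n` and `n + 1` are equal by definition, so `rfl` suffices.
example (n : Nat) : Nat.succ n = n + 1 := rfl
</code></pre></div></div>

<p>Note the asymmetry: <code class="language-plaintext highlighter-rouge">n + 0 = n</code> holds by definition, whereas <code class="language-plaintext highlighter-rouge">0 + n = n</code> needs an inductive proof, because addition on <code class="language-plaintext highlighter-rouge">Nat</code> is defined by recursion on its second argument.</p>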

<p>Here’s a small Lean 4 proof that addition is commutative:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">add_comm</span> (<span class="n">n</span> <span class="n">m</span> : <span class="n">Nat</span>) : <span class="n">n</span> <span class="o">+</span> <span class="n">m</span> <span class="o">=</span> <span class="n">m</span> <span class="o">+</span> <span class="n">n</span> := <span class="k">by</span>
  <span class="n">induction</span> <span class="n">n</span> <span class="k">with</span>
  <span class="o">|</span> <span class="n">zero</span> <span class="o">=&gt;</span> <span class="n">simp</span> [<span class="n">Nat</span><span class="o">.</span><span class="n">zero_add</span>, <span class="n">Nat</span><span class="o">.</span><span class="n">add_zero</span>]
  <span class="o">|</span> <span class="n">succ</span> <span class="n">n</span> <span class="n">ih</span> <span class="o">=&gt;</span> <span class="n">simp</span> [<span class="n">Nat</span><span class="o">.</span><span class="n">succ_add</span>, <span class="n">Nat</span><span class="o">.</span><span class="n">add_succ</span>, <span class="n">ih</span>]
</code></pre></div></div>

<p>And here’s a proof that demonstrates the associativity of addition:</p>

<div class="language-lean highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">theorem</span> <span class="n">add_assoc</span> (<span class="n">a</span> <span class="n">b</span> <span class="n">c</span> : <span class="n">Nat</span>) : (<span class="n">a</span> <span class="o">+</span> <span class="n">b</span>) <span class="o">+</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> (<span class="n">b</span> <span class="o">+</span> <span class="n">c</span>) := <span class="k">by</span>
  <span class="n">induction</span> <span class="n">a</span> <span class="k">with</span>
  <span class="o">|</span> <span class="n">zero</span> <span class="o">=&gt;</span> <span class="n">simp</span> [<span class="n">Nat</span><span class="o">.</span><span class="n">zero_add</span>]
  <span class="o">|</span> <span class="n">succ</span> <span class="n">a</span> <span class="n">ih</span> <span class="o">=&gt;</span> <span class="n">simp</span> [<span class="n">Nat</span><span class="o">.</span><span class="n">succ_add</span>, <span class="n">ih</span>]
</code></pre></div></div>

<h2 id="the-joy-of-discovery">The Joy of Discovery</h2>

<p>There’s a special feeling when a proof finally clicks—when the pieces fall into place and you see <em>why</em> something must be true, not just <em>that</em> it is true.</p>

<p><img src="https://media.giphy.com/media/BmmfETghGOPrW/giphy.gif" alt="Math is beautiful" /></p>

<p>That moment of clarity is what this blog is about. We’ll explore proofs, develop intuitions, and hopefully have some fun along the way.</p>

<p>Stay tuned for more!</p>]]></content><author><name>Ilya Sergey</name></author><category term="hello-world" /><category term="math" /><category term="lean" /><summary type="html"><![CDATA[Welcome to Proofs and Intuitions! This is a blog about mathematics, formal verification, and the ideas that connect them.]]></summary></entry></feed>