Proofs and Intuitions

On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications

2026-05-18T00:00:00+00:00

In this post, we show that property-based testing (PBT) is surprisingly effective for validating LLM-synthesised specifications of Lean programs: it is a cheap alternative to symbolic proofs, and it helped us detect underspecification in about 10% of the specs in state-of-the-art benchmarks for verified code generation.

Getting Program Specifications that are “Just Right”

Formal program verification and program synthesis are only as reliable as the specifications used to validate the programs in question. A specification is a mathematical contract between a programmer and a machine: it captures what a program is supposed to do, and the verifier (or the synthesiser) ensures the contract is respected when producing the implementation and the proof. If the contract is flawed, the result is meaningless—a program verified against a wrong spec does what the specification states, but not necessarily what the user wants.

It is easy to write a contract that is useless on purpose. In Hoare logic, a specification takes the form

\[\{P\}\ \texttt{program}\ \{Q\},\]

read as: if the precondition $P$ holds of the inputs, then after the program runs, the postcondition $Q$ holds of the outputs. This is, for instance, the style of specification used by Velvet, a program verifier embedded in Lean that we discussed in one of our previous posts. Setting $P \equiv \bot$ (i.e., $\mathit{false}$) makes the triple hold vacuously for any program: there are no inputs satisfying the precondition, so there is nothing to check, and even completely degenerate code is certified. Symmetrically, setting $Q \equiv \top$ ($\mathit{true}$) means that every possible output satisfies the postcondition, so once again any program will pass. The first specification’s precondition is too strong (it rules out every input), and the second specification’s postcondition is too weak (it rules out no output, so a program that returns anything at all is accepted, which makes verification and synthesis useless). Neither tells us anything about what the program should do.
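To make the two degenerate cases concrete, here is a minimal Lean sketch for pure functions. The definition `Triple` and the theorem names are our own illustrative choices (this is not how Velvet encodes triples); the point is only that both theorems go through for an arbitrary `f`.

```lean
-- Illustrative only: a Hoare-style triple for a pure function `f`.
-- `Triple` is a hypothetical name, not part of Velvet or Lean's library.
def Triple {α β : Type} (P : α → Prop) (f : α → β) (Q : α → β → Prop) : Prop :=
  ∀ x, P x → Q x (f x)

-- Precondition `False`: no input qualifies, so any `f` is "verified".
theorem triple_false_pre {α β : Type} (f : α → β) (Q : α → β → Prop) :
    Triple (fun _ => False) f Q :=
  fun _ h => False.elim h

-- Postcondition `True`: every output qualifies, so any `f` is "verified".
theorem triple_true_post {α β : Type} (f : α → β) (P : α → Prop) :
    Triple P f (fun _ _ => True) :=
  fun _ _ => trivial
```

Neither proof ever inspects `f`, which is exactly why such specifications carry no information about the program.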

The interesting middle ground—a specification that is just right—is an open research problem. A good specification must be precise enough to pin down the programmer’s intent, but not so precise that it inadvertently re-states the program itself. Consider a textbook task: sort a list of integers in ascending order. A specification that says “the output is the list produced by merge sort applied to the input” is, technically, a specification, but it has fused the what with the how: it commits to an algorithm and inherits all of its incidental details, defeating the purpose of having a specification in the first place. A good specification of sorting, by contrast, demands only two things of the result: it is in ascending order, and it is a permutation of the input. Whether the implementation runs merge sort, quicksort, gnome sort, bogosort, or sleepsort is now irrelevant—the specification abstracts over all of them. Producing this kind of clean, intent-capturing formal statement from an informal English description is the part that humans are, even today, still better at than machines.
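As a sketch, the "just right" sorting specification described above can be written in Lean in two lines. The name `sortSpec` is ours, and we assume a recent Lean 4 toolchain in which `List.Pairwise` and `List.Perm` are available without extra imports.

```lean
-- Illustrative sorting specification: says *what* the result is, not *how*
-- it is computed. `sortSpec` is a hypothetical name.
def sortSpec (arr result : List Int) : Prop :=
  List.Pairwise (· ≤ ·) result ∧  -- the result is in ascending order
  List.Perm result arr            -- the result is a permutation of the input
```

Nothing in this statement mentions merge sort, quicksort, or any other algorithm, so any correct implementation satisfies it.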

Hence the conundrum. The most reliable way to write a “just right” specification is to put a human expert in the loop: someone who reads the formal statement, compares it to the informal intent, and corrects it. But reading formal specifications fluently requires considerable training—precisely the kind of cognitive overhead that has kept formal methods out of the mainstream. If we want certified programs to become widely accessible, we need to dramatically reduce the human effort required to write and validate specifications, without simply passing control to a large language model and hoping for the best.

Our goal, thus, is to produce formal specifications with minimal user involvement, and to identify principles for generating specifications that are close enough to a human’s intent to drive the synthesis of certified programs—i.e., programs that come bundled with their formal specifications and machine-checkable correctness proofs. This problem has only become more urgent: in a world where more and more of the code we run is produced by LLMs, we cannot afford to also delegate the synthesis of specifications to LLMs without a reliable, automated way to validate that the resulting specs are just right. The rest of this post discusses two such validation techniques and the trade-offs between them.

Symbolic Specification Testing

One natural approach is to validate an LLM-synthesised specification by proving it on a handful of representative inputs—using a verifier and an SMT solver in the loop. This idea has been actively explored over the past year by Shuvendu Lahiri, who frames the problem as intent formalisation: closing the gap between informal natural-language intent and a machine-checkable specification.

Take the sorting task from earlier and imagine that an LLM proposes the following candidate postcondition: “result is in ascending order, and every element of arr also occurs in result”. It looks plausible, but it is silently too weak: it lets the implementation add elements that were never in arr. The Small Proof-Oriented Tests (SPOTs) methodology of Nik Swamy and Shuvendu Lahiri catches defects like this by writing tiny verified test cases against the spec. A SPOT for our sorting task looks like:

let arr    := #[3, 1, 2]
let result := sort(arr)
assert result == #[1, 2, 3]

and is handed to the verifier with a single demand: prove that the assertion holds using only the specification of sort, without running the code. If the spec is precise enough to force sort(#[3, 1, 2]) to equal #[1, 2, 3], the proof goes through, and the SPOT becomes a small, machine-checked theorem about the spec. If the spec is too weak, the proof fails: our buggy postcondition admits not only $\texttt{#[1,2,3]}$ but also $\texttt{#[1,2,3,10]}$, $\texttt{#[-7,1,2,3]}$, and infinitely many other ascending lists that contain $1$, $2$, and $3$, so the verifier has no way to derive the asserted equality. The failure pinpoints exactly the weakness in the postcondition.
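The weakness is easy to observe directly if we write the candidate postcondition as an executable Boolean function. The sketch below is ours (the names `ascending` and `weakPost` are hypothetical) and it evaluates the postcondition rather than proving anything, so it is not a SPOT itself; it only shows why the SPOT proof cannot succeed.

```lean
-- Illustrative helper: adjacent elements are in non-decreasing order.
def ascending (xs : List Int) : Bool :=
  (xs.zip xs.tail).all fun (a, b) => decide (a ≤ b)

-- The candidate postcondition from the text, as a Boolean check:
-- `result` is ascending and every element of `arr` occurs in `result`.
def weakPost (arr result : List Int) : Bool :=
  ascending result && arr.all (fun x => result.contains x)

#eval weakPost [3, 1, 2] [1, 2, 3]      -- true: the intended output
#eval weakPost [3, 1, 2] [1, 2, 3, 10]  -- true: an extra element is accepted
#eval weakPost [3, 1, 2] [-7, 1, 2, 3]  -- true: so is this one
#eval weakPost [3, 1, 2] [3, 1, 2]      -- false: not ascending
```

Since several distinct outputs satisfy the postcondition for the same input, no verifier can derive `result == #[1, 2, 3]` from it.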

This approach works remarkably well in auto-active verifiers such as Dafny, Verus, and F*, where industrial-grade SMT solvers discharge the proof obligations a SPOT generates—typically over arithmetic, bitvectors, fixed-size arrays, and list or string equality—in milliseconds.

Reframing Validation: Soundness and Uniqueness

We tried to use SPOTs in Lean too, and the result was disappointing. The reason is that Lean is a foundational proof assistant, with a logic much richer than what auto-active verifiers expose, and a correspondingly smaller fraction of its proof obligations falls inside what SMT solvers can discharge automatically. Many of the assertions a SPOT generates (equalities between recursively defined functions, arithmetic that needs induction, list manipulations) sit just outside the solver’s automation. To close such a proof in Lean one usually has to fall back on an interactive proof script, or, increasingly, on an LLM-driven proof search agent, such as Aristotle. Both are orders of magnitude slower per obligation than discharging the same goal in Dafny, and each attempt also burns through a non-trivial number of LLM tokens. A budget that comfortably validates dozens of SPOTs in Dafny barely covers three or four in Lean.

To make the symbolic style work in Lean, then, we need to ask less of the verifier per test case. Rather than treating a SPOT as a single, all-or-nothing proof obligation, we can decompose it into three strictly weaker checks—one that captures whether the input is one the spec claims to handle, one that captures whether the spec accepts the intended output, and one that captures whether it forbids the unintended ones.

Concretely, given a test case $(i, o)$—an input $i$ paired with its intended output $o$—we ask three separate questions about a candidate specification, given by a precondition $\mathit{pre}$ and a postcondition $\mathit{post}$:

Admissibility: $\mathit{pre}(i)$ holds. Intuitively, the chosen test input is one the spec actually claims to handle. A failure here means the test case lies outside the precondition—we would be checking the spec on inputs it has explicitly opted out of, and any verdict would be uninformative.
Soundness: $\mathit{post}(i, o)$ holds. Intuitively, the spec accepts the intended output. A failure here means the spec is too strong—it rejects something the informal intent says is correct.
Uniqueness: $\forall o' \neq o,\ \neg\mathit{post}(i, o')$ holds. Intuitively, the spec rejects every alternative output. A failure here means the spec is too weak: it admits outputs the informal intent does not.
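Stated in Lean, the three checks are short definitions over a specification and a test case. The structure `Spec` and the three names below are our own illustrative packaging of the definitions above, not an API from any library.

```lean
-- Illustrative packaging of a specification as a pre/postcondition pair.
structure Spec (I O : Type) where
  pre  : I → Prop
  post : I → O → Prop

-- The test input is one the spec claims to handle.
def Admissible {I O : Type} (s : Spec I O) (i : I) : Prop :=
  s.pre i

-- The spec accepts the intended output.
def Sound {I O : Type} (s : Spec I O) (i : I) (o : O) : Prop :=
  s.post i o

-- The spec rejects every output other than the intended one.
def Unique {I O : Type} (s : Spec I O) (i : I) (o : O) : Prop :=
  ∀ o', o' ≠ o → ¬ s.post i o'
```

Only `Unique` contains a quantifier, which is why it is the property where a search for counterexamples has the most to offer.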

All three properties can be discharged by a symbolic verifier—this is essentially what a SPOT does, just packaged as a single obligation rather than three. More importantly for us, however, each of them can also be effectively invalidated by testing, without attempting a proof at all: a single counterexample is enough to refute any one of these properties and flag the spec. The next sections explain how to do so.

Property-Based Testing

To turn the invalidation strategy from the previous section into actual code, we need a tool that can draw candidate inputs (and candidate alternative outputs $o'$) automatically and check a Boolean property on each. That tool is property-based testing (PBT): a technique for checking that an object satisfies a property by drawing random inputs from a generator, evaluating the property on each, and reporting any concrete counterexample it finds. The idea was introduced in 2000 in the seminal paper on QuickCheck for Haskell, and has since been re-implemented in essentially every modern programming language, including Lean.

In the twenty-five years since, PBT has proven dramatically effective across a wide range of domains: it has uncovered bugs in production compilers, driven testing of financial software, surfaced subtle defects in computational geometry algorithms and in smart contract runtimes. What makes PBT so effective is that a precise formal property is, all by itself, an automatic refutation engine: any concrete input that violates it counts as a bug, and finding one requires only that the generator stumbles onto a witness.

Plausible is a property-based testing library for Lean 4 in the QuickCheck tradition. Given a theorem statement, Plausible generates random inputs from typeclass-derived generators and tries to refute the goal by exhibiting a counterexample.¹
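As a first taste, here is a deliberately false statement handed to Plausible's tactic. This is a minimal sketch that assumes the Plausible package is available as a dependency of the project; the statement itself is ours.

```lean
import Plausible

-- A false claim: reversing a list leaves it unchanged.
-- The `plausible` tactic does not prove the goal; it samples lists and is
-- expected to fail with an error that reports a concrete counterexample.
example (xs : List Nat) : xs.reverse = xs := by
  plausible
```

The reported list varies from run to run, since the inputs are drawn at random.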

To see Plausible in action, consider an insertion-sort implementation written in Velvet. The Velvet method comes with a postcondition that we believe expresses what sorting means:

method insertionSort (mut arr: Array Int) return (u: Unit)
  require 1 ≤ arr.size
  ensures ∀ i j, 0 ≤ i ∧ i ≤ j ∧ j < arr.size → arr[i]! ≤ arr[j]!
  ensures arr.toMultiset = arrOld.toMultiset
  do
    -- implementation elided

The two ensures clauses say, respectively, that the array is in ascending order after the method runs, and that its multiset of elements is unchanged—i.e., the result is a permutation of the input. Without ever proving this method correct, we can already use Plausible to test the implementation against the postcondition:

let g : Plausible.Gen (_ × Bool) := do
  let arr ← Plausible.SampleableExt.interpSample (Array Int)
  let res := insertionSortTester arr
  pure (arr, res)

for _ in [0 : 500] do
  let res ← Plausible.Gen.run g 10
  unless res.2 do
    IO.println s!"postcondition violated for input {res.1}"
    break

The helper insertionSortTester is auto-derived from the method’s signature:² it runs insertionSort on the sampled array and evaluates the postcondition on the resulting state. The loop tries to invalidate the postcondition by drawing 500 fresh inputs and looking for one that breaks it. If none is found, we have strong evidence (though not a proof) that the implementation respects the spec on the distribution of inputs the generator explores; if one is found, the offending array is a concrete witness of a bug. This is the canonical PBT trade-off against a symbolic correctness proof: dramatically cheaper to run, at the cost of giving up the universal guarantee that a proof would provide.

Notice, though, what we are testing here: we are checking the implementation of insertionSort against a specification we already trust. The next step—the one this whole post has been building towards—is to flip the script, and use the very same machinery to validate the specification itself.

Catching a Bad Spec with PBT

Let’s put PBT and the soundness/uniqueness pair to work on a concrete LLM-synthesised specification. We stay with the sorting task: given a list of integers, produce one in ascending order. Asked to write a Lean specification for this problem, an LLM might propose:

def precondition (arr : List Int) : Prop :=
  True

def postcondition (arr : List Int) (result : List Int) : Prop :=
  (∀ i j : Nat, 0 ≤ i ∧ i < j ∧ j < result.length → result[i]! ≤ result[j]!) ∧
  (∀ i : Nat, i < arr.length → arr.count arr[i]! = result.count arr[i]!)

In plain English, the postcondition says that result is in ascending order and that every element of arr also appears in result. This is silently too weak: it never forbids result from containing extra elements that were never in arr.

To validate this spec, we run the three checks introduced earlier: admissibility, soundness, and uniqueness. Admissibility is trivial here: the precondition is True, so every input we pick passes. For soundness, we choose a few small test cases $(i, o)$ where $o$ is the intended sort of $i$³ (say, $(\texttt{#[]}, \texttt{#[]})$ and $(\texttt{#[3,1,2]}, \texttt{#[1,2,3]})$) and check that postcondition i o holds. It does, on every case we try: the spec accepts the intended outputs, so soundness is not the problem here.

For uniqueness, Plausible randomly samples alternative outputs $o' \neq o$ for each test case and checks whether postcondition i o' is still accepted. On the very first test case $i = \texttt{#[]}$, the generator quickly stumbles onto $o' = \texttt{#[0]}$: the array is trivially in ascending order and trivially contains every element of arr (because arr is empty), so the postcondition is satisfied. Plausible reports the counterexample, and the spec is flagged as too weak, exactly as the reframing predicted.
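The counterexample can be replayed by hand. The sketch below is an executable Boolean rendering of the postcondition, written by us for illustration (the names `ascending` and `postconditionB` are hypothetical, and this is not the tester our pipeline derives); it works on lists, so the arrays from the text become list literals.

```lean
-- Illustrative helper: adjacent elements are in non-decreasing order.
def ascending (xs : List Int) : Bool :=
  (xs.zip xs.tail).all fun (a, b) => decide (a ≤ b)

-- Boolean version of the candidate postcondition: `result` is ascending and
-- every element of `arr` occurs in `result` as often as it does in `arr`.
def postconditionB (arr result : List Int) : Bool :=
  ascending result && arr.all (fun x => arr.count x == result.count x)

#eval postconditionB [] []                -- true: soundness holds
#eval postconditionB [3, 1, 2] [1, 2, 3]  -- true: soundness holds
#eval postconditionB [] [0]               -- true: uniqueness is violated
```

The last line is the refutation: an output different from the intended one is accepted for the same input.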

What’s Hard to Test (and How We Cope)

PBT thrives when the property under test is a universally-quantified Boolean predicate one can evaluate directly on each sampled input. It struggles when the property contains an unbounded existential quantifier: refuting $\exists x,\, P(x)$ means establishing $\forall x,\, \neg P(x)$, which testing alone cannot do. We found two simple patterns that let us tackle the awkward cases without falling back to a full symbolic proof.

First, in program specifications, existentials are almost always implicit and bounded by the surrounding context—an index into arr lives in $[0,\, \texttt{arr.size})$, not in all of $\mathbb{N}$. We infer such bounds heuristically from the property’s structure, prove that they indeed hold for the respective program variables (such as array indices) with Lean tactics like grind and omega, and then enumerate the existential variable over the resulting finite range.
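Once a bound is known, the existential becomes a finite search. The following sketch shows the idea for the property "some index of arr holds x"; the function name is ours, and the bound is written by hand here rather than inferred and proved as in our pipeline.

```lean
-- Illustrative: `∃ i, i < arr.size ∧ arr[i]! = x`, checked by enumerating
-- the index over the finite range [0, arr.size).
def existsIndex (arr : Array Int) (x : Int) : Bool :=
  (List.range arr.size).any (fun i => arr[i]! == x)

#eval existsIndex #[3, 1, 2] 2  -- true
#eval existsIndex #[3, 1, 2] 7  -- false
```

With the quantifier bounded, a negative answer is a genuine refutation rather than a failure to find a witness.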

Second, sampling alternative outputs $o' \neq o$ uniformly at random is usually quite ineffective: the odds of landing on a meaningful counterexample in this case are astronomically small. Instead, we take a page from the fuzz testing book and use mutation-based sampling: we perturb the intended output by flipping a Boolean, adding $\pm 1$ to an integer, or inserting/deleting an element from a list. This works because bad outputs tend to look much like good ones, and this empirical fact makes such “small-scale” mutation much more effective than blind random search in a large space of possible outputs.
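A minimal sketch of mutation-based sampling for list-valued outputs is given below. It enumerates all one-step mutants deterministically instead of sampling them, and every name in it is ours; the real pipeline's mutators are not shown in this post.

```lean
-- Illustrative one-step mutants of an intended output `o`:
-- add 1 or subtract 1 at each position, delete one element, insert a 0.
def mutants (o : List Int) : List (List Int) :=
  let idx := List.range o.length
  let plus := idx.map (fun k => o.set k (o[k]! + 1))
  let minus := idx.map (fun k => o.set k (o[k]! - 1))
  let deletions := idx.map (fun k => o.eraseIdx k)
  let insertions :=
    (List.range (o.length + 1)).map (fun k => o.take k ++ (0 :: o.drop k))
  plus ++ minus ++ deletions ++ insertions

-- Look for a mutant, different from `o`, that the postcondition accepts.
def refuteUniqueness (post : List Int → List Int → Bool)
    (i o : List Int) : Option (List Int) :=
  (mutants o).find? (fun o' => o' != o && post i o')

-- A deliberately weak postcondition: every element of `arr` occurs in `r`.
#eval refuteUniqueness
  (fun arr r => arr.all (fun x => r.contains x)) [3, 1, 2] [1, 2, 3]
-- some [0, 1, 2, 3]
```

All mutants stay within one edit of the intended output, which is where underspecified postconditions tend to leak.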

With these two adaptations in place, one might wonder whether the testing-based approach to validating formal specifications actually pays off in the wild. The rest of the post discusses the experience of using a PBT-based spec-validation pipeline on two state-of-the-art benchmarks of specifications for Lean programs.

Catching Specification Bugs in Verified Coding Benchmarks

Our hypothesis going in was as follows: if PBT-based spec validation works as advertised, it should be able to find underspecified specifications even in published, human-written Lean benchmarks that have already been vetted by their authors as reference examples.

To put this to the test, we ran our pipeline on two state-of-the-art benchmarks for Lean specification synthesis: VERINA and CLEVER. Both benchmark suites provide natural-language problem descriptions, formal specifications, and a handful of test cases; some problems also include a reference implementation. When one was available, we used it to compute the intended outputs from precondition-satisfying inputs. CLEVER’s specifications do not separate preconditions from postconditions, so we wrote a small script to do that conversion; due to formatting issues, only 104 of CLEVER’s specifications converted cleanly, and those are the ones we tested.

Across 188 problems from VERINA and 104 from CLEVER, PBT flagged 13 underspecified specifications in the former and 18 in the latter—about 10% of everything we tested.

We reported these findings to the benchmark authors. The VERINA team has acknowledged the bugs we surfaced and patched nearly all of them (one is still under review); the CLEVER team has acknowledged all 18 issues, though fixes have not yet shipped at the time of writing.

Let us now look at three of the bugs we found.

Example 1: Forgotten Range Constraints

The first comes from VERINA’s Basic 46. The task is to find the last position of a given element in a sorted array of integers, returning -1 if the element is absent. VERINA’s specification is:

def lastPosition_precond (arr : Array Int) (elem : Int) : Prop :=
  List.Pairwise (· ≤ ·) arr.toList

def lastPosition_postcond
  (arr : Array Int) (elem : Int) (result : Int)
  (h_precond : lastPosition_precond arr elem) :=
    (result ≥ 0 →
      arr[result.toNat]! = elem ∧
      (arr.toList.drop (result.toNat + 1)).all (· ≠ elem)) ∧
    (result = -1 → arr.toList.all (· ≠ elem))

At a glance this looks fine: when result ≥ 0, the result-th element equals elem and no element after it does; when result = -1, no element of the array equals elem.

PBT quickly finds the problem. With arr = #[] and elem = 0, the postcondition accepts both -1 and 0, even though only -1 is correct. The reason: when result = 0, the lookup arr[result.toNat]! is out of bounds, and Lean’s accessor silently returns the default Int, which is 0, and happens to equal elem. The postcondition is satisfied by accident. Worse, the postcondition also accepts -2: nothing in the spec restricts result to be in [-1, arr.size), so anything below -1 is unconstrained.

What looks like an aesthetic problem is actually an exploit waiting to happen. A motivated attacker—or just a lazy LLM optimising for the cheapest implementation the verifier will accept—can ship code that returns 0 whenever the input array is empty, or uses any negative integer below -1 as a “not found” sentinel. The verifier, reading only the postcondition, will stamp this program as correct; the user, trusting that stamp, will never notice that the “verified” program disagrees with the natural-language task it was supposed to solve.

Adding the explicit constraint -1 ≤ result < arr.size fixes the spec. The general lesson: when writing a specification, always constrain the domain of every output variable.
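The accidental acceptance can be reproduced with a Boolean rendering of the postcondition. The sketch below is ours: the names are hypothetical, the precondition argument is dropped for brevity, and the "fixed" variant simply adds the range constraint suggested above.

```lean
-- Illustrative Boolean version of the VERINA postcondition.
def lastPosPost (arr : Array Int) (elem result : Int) : Bool :=
  (if result ≥ 0 then
      arr[result.toNat]! == elem &&
        (arr.toList.drop (result.toNat + 1)).all (· != elem)
    else true) &&
  (if result = -1 then arr.toList.all (· != elem) else true)

-- The same check with the range constraint -1 ≤ result < arr.size added.
def lastPosPostFixed (arr : Array Int) (elem result : Int) : Bool :=
  decide (-1 ≤ result) && decide (result < (arr.size : Int)) &&
    lastPosPost arr elem result

#eval lastPosPost #[] 0 (-1)       -- true: the intended answer
#eval lastPosPost #[] 0 0          -- true: out-of-bounds read yields default 0
#eval lastPosPost #[] 0 (-2)       -- true: both implications are vacuous
#eval lastPosPostFixed #[] 0 (-1)  -- true
#eval lastPosPostFixed #[] 0 0     -- false
#eval lastPosPostFixed #[] 0 (-2)  -- false
```

When evaluated, the second line may additionally print an out-of-bounds message for the `arr[0]!` access; the returned value is still the default element.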

Example 2: Silently Truncated Subtraction

The second example comes from CLEVER’s problem 79. The task is to convert a number in decimal to binary and wrap the result with "db" at both ends—so the desired output for 32 is "db100000db". CLEVER’s specification reads:

def problem_spec
  (implementation : Nat → String)
  (decimal : Nat) :=
  let spec (result : String) :=
    4 < result.length ∧
    result.drop (result.length - 2) = "db" ∧
    result.take 2 = "db" ∧
    let resultTrimmed :=
      (result.toList.drop 2).dropLast.dropLast.map
        (fun c => c.toNat - '0'.toNat)
    decimal = Nat.ofDigits 2 resultTrimmed.reverse
  ∃ result, implementation decimal = result ∧
  spec result

The first three conjuncts say that result is longer than four characters and that it starts and ends with "db". The next line strips the wrappers and turns each remaining character into a digit by subtracting '0'’s Unicode value. The last line says the resulting digits, read as base 2, equal the input.

The bug hides in that subtraction. Lean’s Nat subtraction is truncated: any negative result clamps to 0. So any character whose Unicode value is $\le$ that of '0'—including '/'—silently maps to 0. When the input is 0, the postcondition therefore accepts both "db0db" and "db/db".
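The clamping is easy to observe with a small decoder that mirrors the arithmetic of the spec. The function `decodeBody` is our own illustration: it replaces `Nat.ofDigits` on the reversed list with an equivalent left fold, so that the sketch needs no extra library.

```lean
-- Illustrative: strip the "db" wrappers, map characters to digits with
-- truncated Nat subtraction, and read the digits as a base-2 number.
def decodeBody (result : String) : Nat :=
  let digits := (result.toList.drop 2).dropLast.dropLast.map
    (fun c => c.toNat - '0'.toNat)
  digits.foldl (fun (acc d : Nat) => 2 * acc + d) 0

#eval '/'.toNat - '0'.toNat    -- 0: 47 - 48 clamps to 0 in Nat
#eval decodeBody "db0db"       -- 0
#eval decodeBody "db/db"       -- 0: indistinguishable from "db0db"
#eval decodeBody "db100000db"  -- 32
```

A check such as `'0' ≤ c ∧ c ≤ '1'` on every body character would be one way to rule out the stray outputs; we mention it only as an illustration, not as the fix adopted by the benchmark.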

A random generator would essentially never produce a string starting and ending with "db" by chance—the structural constraint is too tight. Mutation-based sampling, on the other hand, perturbs the expected output one character at a time and immediately exposes the bug. Mutation also surfaced a different underspecification in VERINA’s Basic 97 (an in-place update of an array element to 60): the postcondition does not require the output to have the same length as the input, and PBT caught this by appending an extra element to the expected result.

Example 3: Catching Implementation Bugs

The third example, from CLEVER’s problem 9, is a bonus: we were hunting for spec bugs, but this one is an implementation bug. The task is to compute the running maximum of a list of integers, and the reference implementation is:

def implementation (numbers : List Int) : List Int :=
  let rec rolling_max
          (numbers : List Int)
          (results : List Int)
          (acc : Int) : List Int :=
            match numbers with
            | [] => results
            | n :: ns =>
              let new_acc := max acc n
              let new_results := results ++ [new_acc]
              rolling_max ns new_results new_acc
  rolling_max numbers [] 0

The accumulator acc is initialised to 0, which gives the wrong answer whenever the first element of the input is negative. PBT generated such an input within a handful of samples, the implementation’s output failed the spec, and the bug came out for free.
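The failure is easy to reproduce on a two-element input. The sketch below uses our own simplified definitions (hypothetical names, and a direct recursion instead of the accumulator list), with `buggy` starting from 0 as the reference implementation does and `fixed` starting from the first element.

```lean
-- Illustrative running maximum, seeded with an initial accumulator.
def runningMaxFrom (acc : Int) : List Int → List Int
  | [] => []
  | n :: ns =>
    let m := max acc n
    m :: runningMaxFrom m ns

-- Seeded with 0, as in the reference implementation.
def buggy (numbers : List Int) : List Int :=
  runningMaxFrom 0 numbers

-- Seeded with the first element of the input instead.
def fixed : List Int → List Int
  | [] => []
  | n :: ns => n :: runningMaxFrom n ns

#eval buggy [-3, -1]  -- [0, 0]
#eval fixed [-3, -1]  -- [-3, -1]
```

Any input whose leading elements are all negative triggers the same discrepancy.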

Limitations and Conclusions

In the traditional formal verification setting—where we want to check that a program meets a specification we trust—testing is undoubtedly a much weaker tool than a formal proof: a test suite can never quite rule out an unseen bug, while a verifier gives a guarantee that no amount of testing can match. When the artefact under scrutiny is the specification itself, however, the picture becomes much subtler. The underlying problem—does this formal statement faithfully capture what the human actually meant?—is inherently non-formal, and even SPOT-style symbolic validation does not give a 100% guarantee: it is only as good as the test cases one chooses. Once absolute certainty is no longer on the table for any method, PBT-based validation becomes a genuinely worthy alternative in the design space, with the added benefit of being much cheaper than anything that involves proofs.

That said, our method has several opportunities for future improvements, which we discuss next.

First, the uniqueness property simply does not hold for every task. A clean example is quicksort-style partitioning, where the relative order within each partition is irrelevant to correctness and the spec legitimately admits many distinct outputs for the same input. Our approach can, however, be extended by replacing uniqueness with other, more permissive universally-quantified meta-properties of specifications—for instance, “any two accepted outputs are related by some user-supplied equivalence”. Identifying useful such properties and integrating them into a single PBT-driven validation pipeline is an exciting direction for future work.
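One possible shape for such a relaxed meta-property is sketched below. The name `UniqueUpTo` and its formulation are ours, offered as an assumption about how the idea could be stated rather than as a part of the current pipeline.

```lean
-- Illustrative: any two outputs accepted for input `i` are related by a
-- user-supplied equivalence `equiv`.
def UniqueUpTo {I O : Type} (post : I → O → Prop)
    (equiv : O → O → Prop) (i : I) : Prop :=
  ∀ o₁ o₂, post i o₁ → post i o₂ → equiv o₁ o₂
```

Instantiating `equiv` with equality says that at most one output is accepted, which together with soundness gives back the uniqueness check; for partitioning, `equiv` could instead relate outputs that agree on the contents of each partition.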

Second, a good specification should capture the relation between inputs and outputs, rather than encode a particular algorithm in disguise. PBT can flag specs that are too weak or too strong, but it has nothing to say about specs that are simply too operational. We believe this dimension can be addressed by restricting the specification language itself (e.g., to a fragment built from a particular set of computational primitives), and we view this, too, as a promising future direction.

More broadly, the most interesting story here is probably not “testing versus proof” but “testing and proofs”: combining lightweight randomised validation with heavyweight deductive methods opens up an intriguing design space for cutting the cost of formal verification and verified code generation, using each technique where it is the most appropriate.⁴ Readers curious about how specification testing fits into the larger picture of verifiable code generation may enjoy our recent paper on LeetProof, where PBT-based specification validation works alongside SMT and agentic proof search in a single end-to-end pipeline.

¹ There is a subtlety here: Plausible can only refute propositions that are decidable, i.e., that come with a procedure that returns a Boolean for every concrete input. In practice, Lean’s type class resolution synthesises the required Decidable instance automatically for most propositions one encounters in program specifications: equalities over built-in types, comparisons, finitely-quantified statements, and Boolean combinations of all of the above.
² The details of how Velvet turns a method signature into an executable tester are described in our CAV’26 paper.
³ In practice, these input–output pairs can be drafted by an LLM and then sanity-checked by a human, which is the same assumption SPOTs make about their concrete witnesses.
⁴ The same applies beyond specifications for synthesised programs: testing the definitions that themselves appear inside our theorems (e.g., a programming language semantics in a soundness statement) is just as valuable, since a flawed definition can make even a true theorem useless. See our earlier post on mechanising the Move borrow checker in Lean for a workflow of this kind.

Verifying Move Borrow Checker in Lean: an Experiment in AI-Assisted PL Metatheory

2026-03-18T00:00:00+00:00

I formalised and proved the correctness of Move’s new borrow checker in Lean: 39,000 lines of mechanised metatheory, produced in under a month with the help of an AI coding assistant. This post tells the story of how it went and what it means for the future of PL research.

Reading guide. This is a long post. Depending on your background, you may want to skip ahead:

If you are new to programming language research and curious about it, keep reading.

If you already know what PL metatheory is, jump to Move and its borrow checker.

For the step-by-step mechanisation story with AI, start at Encoding typing rules in Lean.

For the anecdotes on using AI “in anger”, see Soundness Proof: the Labours of Claude.

For numbers, see Some statistics.

For the big picture, skip to What this all means for PL research.

The Programming Languages (PL) research community was one of the earliest adopters of interactive theorem provers, such as Rocq, Agda, and Isabelle/HOL, as a key technology for gaining trust in the produced formal models and proofs of their properties. The famous POPLmark Challenge, a set of benchmarks aimed at identifying the main hurdles when stating and proving theorems about programming languages, turned 20 last year.

Proving properties of a PL artefact, such as an optimising compiler or a type system proposal, has traditionally been considered a significant challenge, and, while not insurmountable by trained human provers, it has become almost a necessary requirement to publish a research paper at one of the top-tier PL conferences. Some perceive this trend as a rite of passage. I believe a healthier way is to think of machine-assisted proofs about programming languages as a way to sharpen one’s definitions and statements: it is widely recognised that ugly definitions usually result in significantly more laborious proofs.

The sheer amount of human time spent by PL researchers over the past two decades formalising their results in provers, such as Rocq, is so staggering that studying the proof patterns to discover more concise ways to engineer machine-checked proofs became a research direction of its own with quite a few high-profile publications (Example 1, Example 2, Example 3) and even entire academic conferences dedicated to this topic alone. With the modern trend of applying frontier AI models to facilitate construction of machine-checked proofs of mathematical theorems (predominantly, in Lean) with the help of systems such as Harmonic’s Aristotle and Axiom Prover, it is only a matter of time before these advances facilitate proofs of theorems about programming languages. In this blog post, I describe one such experiment.

What is PL Metatheory and why does it need mechanised proofs?

Let us set up some terminology first. When talking about formal verification of programs, it is important to remember that this is only meaningful when we have a particular specification in mind, which describes the properties of the program of interest that always hold (we have talked about this at length in one of the previous posts).

But what if, instead of properties of a program written in a certain language, we are interested in properties of a programming language itself? If you are wondering what these properties might be, think of your favourite programming language and things that you believe cannot happen when running programs in it. For instance, any implementation of Python guarantees memory safety: it would be really surprising for one to observe a dangling pointer or a buffer overflow when running a Python program—Python’s garbage collector ensures that this does not happen at run time.

Some languages go even further and provide similar guarantees without a garbage collector whatsoever, reasoning purely from the syntax of the programs using mechanisms known as type systems. A particularly prominent example of an interesting yet practical type system is that of Rust: through the mechanisms of borrows and lifetimes it guarantees that, if a program without unsafe blocks is accepted by the Rust compiler, it is free from use-after-free errors, dangling pointers, and data races, all of which are artefacts of so-called unsafe pointer aliasing. But why should we believe that these guarantees do indeed hold in reality when we run our compiled Rust programs? Well, this is where the math behind PL theory steps in: ideally, a programming language designer must prove a Type Soundness Theorem that connects the fact that a program is accepted by a type checker with the absence of certain runtime behaviours. This is the famous Well-Typed Programs Don’t Go Wrong mantra coined by Robin Milner in his 1978 paper.¹

For any interesting programming language, the type soundness theorem is surprisingly non-trivial to state. First, it requires a precise description of the type system itself, defining how exactly it “analyses” a program to determine whether it should be accepted or rejected. Second, one needs to define a semantics of the language’s runtime behaviour, and the errors that are considered “preventable” by the type system (for instance, almost no practical type systems promise to catch errors such as divisions by zero or out-of-memory errors). Finally, the statement of the theorem should say something about the validity of the memory state in which we are going to run a program that has been accepted by a type checker.

Of the components of a Type Soundness Theorem statement, only the type system itself is not trusted. In contrast, the definition of the runtime semantics of the language is almost always taken for granted (akin to the “laws of nature”), while the properties of the initial state are something that is left for the loader to take care of, so the feasibility of such a requirement is not questioned by the theorem itself. What the soundness theorem does deliver is the proof that the type system does not accept programs that would result in a preventable error at run time.

The study of definitions of semantics, type systems, program optimisations, and their interactions with erroneous or harmful runtime behaviours is what is typically called PL metatheory. Type soundness theorems are among the most common statements studied and proven in PL metatheory, and this is, in a nutshell, what we, PL researchers, do for a living.

Why, then, does PL metatheory call for mechanised proofs? For any remotely interesting language, a type soundness proof amounts to a massive case analysis over all possible language constructs, coupled with different cases for how the runtime semantics can treat its state or individual language commands. It is not uncommon for such proofs to span dozens of pages of English prose and, unlike classical mathematical proofs, they are very rarely intellectually rewarding: each case follows a similar pattern, yet every single one must be checked. It is easy to make a mistake, and even a trivial error can render the entire point of a type system proposal false. This is why the PL community has been adopting proof assistants since the early 2000s. The first reason was to gain trust in the results. The second was to tame the tedium using clever proof engineering techniques. That said, when approached by a human prover, the amount of mechanised metatheory required for a top-tier conference paper typically still amounts to about 6–10 person-months of work. As I will try to argue below, this is about to change with AI.

Move programming language and its borrow checker

Over the past few months, I have been collaborating with the developers from Mysten Labs on designing a new type system for the Move language for smart contracts. Move is deployed on the Sui and Aptos blockchains. Like Rust, Move enforces an ownership discipline: every value has a unique owner, and references (called borrows) must not outlive the values they borrow. The borrow checker (a special type-based static analysis run by the compiler) rejects programs that could create dangling references, aliased mutations, or use-after-move errors. Unlike Rust, however, Move does not allow references inside data structures (so, no linked lists): every reference is an access path rooted in a local variable and descending through a sequence of field names. This restriction eliminates the need for lifetime annotations entirely: the type system can track the reference provenance using just their paths.

The key idea behind the new borrow checker design is to track reachability between references using regular expressions over field paths. For example, when a reference w borrows a deep path p.x.f, the type system registers the regex x · f as the path from the parent reference to w. When another reference u later re-borrows p.x, the type checker computes the Brzozowski derivative (i.e., “stripping” the shared prefix x) to discover whether u can reach w. If the resulting regex is non-empty, the two references overlap in memory, so writing through u would invalidate w.

Consider the following program, written in MoveIR, an LLVM-like intermediate language of Move with minimalistic syntax and explicit borrow/move/copy annotations, which makes it a better target for mechanisation than plain Move:

struct S { f: u64 }
struct P { x: Self.S }
t() {
    let p: Self.P;
    let w: &u64;
    let u: &mut Self.S;
  label b0:
    p = P { x: S { f: 0 } };
    w = &copy(&p).P::x.S::f;   // deep borrow of p.x.f
    u = &mut p.P::x;           // re-borrow of p.x
    *move(u) = S { f: 1 };     // typing error here!
    return;
}

The diagram below shows the reachability graph that the type checker maintains for this program. Each node $\rho$ is an abstract reference—a symbolic representation for the respective run-time memory locations. The distinguished $\rho_0$ represents the stack frame root; $\rho_p$, $\rho_w$, and $\rho_u$ correspond to the variables p, w, and u. An edge labelled with a regex records which field paths can be traversed from one location to reach another.

Here, w borrows the deep path p.x.f, while u re-borrows p.x. The derivative $\delta(\mathtt{x} \cdot \mathtt{f},\, \mathtt{x}) = \mathtt{f} \neq \emptyset$ reveals that $\rho_u$ can reach $\rho_w$ via field f (shown as the blue edge). The write on the last line triggers the safety check, which fails: overwriting the entire S behind u would invalidate the deep borrow w, creating a dangling reference. This is precisely the kind of subtle aliasing that the regex-based approach catches through a clean, decidable mechanism.
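
To make the derivative computation concrete, here is a minimal, self-contained Lean sketch of Brzozowski derivatives over field-path regexes. The type `Rx` and the functions `nullable`, `deriv`, and `isEmpty` are hypothetical names chosen for illustration; they are not taken from the actual formalisation, which may represent and normalise regexes differently.

```lean
/-- Regular expressions over field names (a hypothetical, minimal encoding). -/
inductive Rx where
  | empty : Rx              -- ∅: matches no path
  | eps   : Rx              -- ε: matches the empty path
  | field : String → Rx     -- a single field name
  | seq   : Rx → Rx → Rx    -- r · s
  | alt   : Rx → Rx → Rx    -- r + s
  | star  : Rx → Rx         -- r*

namespace Rx

/-- Does `r` accept the empty path? -/
def nullable : Rx → Bool
  | .empty => false
  | .eps => true
  | .field _ => false
  | .seq r s => nullable r && nullable s
  | .alt r s => nullable r || nullable s
  | .star _ => true

/-- Brzozowski derivative: the paths that remain after stripping a leading field `f`. -/
def deriv (f : String) : Rx → Rx
  | .empty => .empty
  | .eps => .empty
  | .field g => if f == g then .eps else .empty
  | .seq r s =>
    let d : Rx := .seq (deriv f r) s
    if nullable r then .alt d (deriv f s) else d
  | .alt r s => .alt (deriv f r) (deriv f s)
  | .star r => .seq (deriv f r) (.star r)

/-- Does `r` denote the empty set of paths? -/
def isEmpty : Rx → Bool
  | .empty => true
  | .eps => false
  | .field _ => false
  | .seq r s => isEmpty r || isEmpty s
  | .alt r s => isEmpty r && isEmpty s
  | .star _ => false

end Rx

-- `w` sits at path x · f below `p`, and `u` re-borrows `p.x`.
#eval (Rx.deriv "x" (.seq (.field "x") (.field "f"))).isEmpty  -- false: u can reach w
-- A re-borrow through a hypothetical sibling field `y` would not overlap.
#eval (Rx.deriv "y" (.seq (.field "x") (.field "f"))).isEmpty  -- true
```

In this sketch the first derivative comes out as `ε · f` rather than the syntactically simplified `f`; the two denote the same set of paths, which is all the emptiness test needs.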

This design has been fully implemented in a branch of the Sui blockchain client, where it superseded the original (much more complicated and non-formalised) borrow analyser while maintaining full backwards compatibility. But for a blockchain language, implementation alone is not enough: the designers want ironclad guarantees that the borrow checker is correct, in the sense of the Type Soundness Theorem explained above. A bug in the borrow checker’s logic could allow an attacker to exploit a dangling reference, potentially compromising on-chain funds.

This is where our mechanisation effort comes in. We wanted to formalise the new type system in Lean and prove it sound with respect to reasonable semantics of MoveIR. Our ambition was to cover as much of Move as possible, not just a “toy subset” (as customary in proof-of-concept academic prototypes), and to get confidence that the formalisation faithfully represents the actual deployed implementation.

Encoding Move typing rules in Lean

When PL researchers present a type system, they typically write it down as a collection of inference rules in a logical notation. For instance, the typing rule for writes through a mutable reference looks like this:

\[\text{T-WriteRef} \quad \frac{\displaystyle \Sigma(a) = \mathsf{ref}(\tau, \rho, \mathsf{M}) \quad \Sigma(b) = \mathsf{basic}(\tau) \quad \mathsf{check\_outbound}(\Pi, \rho) \atop \displaystyle \Lambda;\; \mathcal{E}[\Sigma := \Sigma \setminus \{a, b\},\; \Pi := \Pi] \vdash \mathit{cont} : \overline{T}}{\Lambda;\; \mathcal{E} \vdash {*}a := b;\; \mathit{cont} : \overline{T}}\]

Reading bottom-to-top: to type-check a write statement $*a := b$, the rule requires that $a$ holds a mutable reference of type $\tau$, that $b$ holds a basic value of the same type $\tau$, and, crucially, that $\mathsf{check\_outbound}$ passes, meaning no existing borrow extends beyond $\rho$ (otherwise the write would create a dangling reference). After the write, both sites are removed from $\Sigma$. Here, a site is a named temporary slot in the stack frame that holds the value produced by an expression before it is consumed by the next operation (think of SSA registers in LLVM), and $\Sigma$ is the site environment—a map from site names to their types that the type checker maintains as it walks through the program. This is not what the compiler implements directly: the inference rule casts type checking as proof construction in a domain-specific logic. A program that type-checks is one for which a proof tree can be assembled from such rules.

The first step in our mechanisation was to encode MoveIR’s syntax and all its typing rules in Lean. I wrote these definitions by hand. Here is a simplified version of the $\text{T-WriteRef}$ rule in Lean:

inductive typecheck_stmt : LabelEnv → TypeEnv → Stmt → List MoveType → Prop where
  ...
  | write_ref : ∀ Λ 𝓔 a b τ ρ cont Ts,
      𝓔.Σ(a) = .ref τ ρ .mut →
      𝓔.Σ(b) = .basic τ →
      check_outbound 𝓔.Π ρ →
      typecheck_stmt Λ (𝓔[Σ := 𝓔.Σ \ {a, b}]) cont Ts →
      typecheck_stmt Λ 𝓔 (*a := b; cont) Ts
  ...

Each premise of the inference rule becomes a hypothesis in the Lean constructor: the lookups check site types, check_outbound enforces the borrow safety condition, and the recursive typecheck_stmt call types the continuation under the updated environment.

With this encoding of the MoveIR type system in place, we can already validate how faithful our Lean model is: take a concrete program from Move’s test suite, translate it to MoveIR, and try to prove that the typing judgement holds. I will call such proofs conformance proofs. This was the point at which I started using Claude Code (Opus 4.5) to construct these proofs. To my surprise, the AI could handle tiny MoveIR programs (4–5 lines of code) successfully, assembling the proof tree step by step. But for anything larger, the approach quickly became impractical: each proof step required instantiating the right rule constructor, providing witnesses for existential variables, and discharging side conditions about the path environment, all of which grew rapidly with program size. This is where I started to take the agenda of conformance proofs seriously, but in a very different form: more on that next.

From conformance proofs to tests via algorithmic type checking

To solve the efficiency bottleneck with conformance proofs for the type system (AI was slow and unreliable), I resorted to a more traditional approach: implementing an actual executable type checker in Lean and running it on the same tests as the production Move implementation, making sure that it accepts and rejects the same programs. Instead of constructing proof trees, the algorithmic checker is a plain recursive Lean function that returns an updated type environment on success or fails with none:

def check_stmt (lenv : LabelEnv) (env : TypeEnv) (s : Stmt)
    (retTypes : List ParamType) : Option TypeEnv

Luckily, there was no shortage of tests in the Move compiler’s test suite, so it did not take long to vibe-code a parser from MoveIR text into our Lean representation and start running the tests.²

But what is the relationship between the relational type checker from before (the inductive typecheck_stmt with inference rules) and the new algorithmic one (the executable check_stmt)? After all, we could have introduced a bug in the algorithmic version that makes it accept programs the relational rules would reject, or vice versa. To close this gap, we proved (entirely by AI, of course) the first important theorem of our development: the soundness of the algorithmic type checker:

theorem check_stmt_sound (lenv : LabelEnv) (env : TypeEnv)
    (s : Stmt) (retTypes : List ParamType) :
    ∀ env', check_stmt lenv env s retTypes = some env' →
    typecheck_stmt lenv env s retTypes

This theorem says: whenever the algorithmic checker accepts a program (returns some env'), the relational typing judgement holds. In other words, the executable checker is a sound decision procedure for the type system: it never accepts a program that the inference rules would reject. Every successful run of check_stmt effectively produces a certificate that the relational typing derivation exists.³
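
The relationship between a relational type system and its executable checker can be illustrated on a deliberately tiny language. The following Lean sketch is an assumption-laden toy, unrelated to MoveIR: the names `Ty`, `Tm`, `HasType`, and `check` are invented for this example. It defines typing rules as an inductive relation, a recursive checker, and a soundness theorem of the same shape as the one discussed here.

```lean
/-- Types of a toy expression language (hypothetical, unrelated to MoveIR). -/
inductive Ty where
  | nat | bool
  deriving DecidableEq

/-- Terms: numeric literals, the constant `true`, and addition. -/
inductive Tm where
  | lit : Nat → Tm
  | tt  : Tm
  | add : Tm → Tm → Tm

/-- Relational typing rules, one constructor per inference rule. -/
inductive HasType : Tm → Ty → Prop where
  | lit {n : Nat} : HasType (Tm.lit n) Ty.nat
  | tt : HasType Tm.tt Ty.bool
  | add {a b : Tm} :
      HasType a Ty.nat → HasType b Ty.nat → HasType (Tm.add a b) Ty.nat

/-- Algorithmic checker: an ordinary recursive function returning `none` on failure. -/
def check : Tm → Option Ty
  | .lit _ => some Ty.nat
  | .tt => some Ty.bool
  | .add a b =>
    if check a = some Ty.nat ∧ check b = some Ty.nat then some Ty.nat else none

/-- Soundness: whatever the checker accepts is derivable from the rules. -/
theorem check_sound (e : Tm) : ∀ τ, check e = some τ → HasType e τ := by
  induction e with
  | lit n =>
    intro τ h
    simp only [check] at h
    cases h
    exact HasType.lit
  | tt =>
    intro τ h
    simp only [check] at h
    cases h
    exact HasType.tt
  | add a b iha ihb =>
    intro τ h
    simp only [check] at h
    split at h
    · rename_i hab
      cases h
      exact HasType.add (iha _ hab.1) (ihb _ hab.2)
    · cases h

-- A successful run of the checker yields a typing derivation for free.
example : HasType (Tm.add (Tm.lit 1) (Tm.lit 2)) Ty.nat :=
  check_sound _ _ rfl
```

The final `example` is the toy analogue of a conformance test doubling as a certificate: the checker is evaluated on a concrete term, and the soundness theorem converts that run into a proof of the relational judgement.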

Once we had the parser, the algorithmic checker, and the soundness proof in place, we ran 156 conformance tests on programs drawn from Move’s own type checker test suite. This turned out to be one of the most valuable components of the entire development. In the later stages, we frequently needed to revise our encoding of the typing rules (for instance, when extending the type system with additional features), and it was the conformance tests that gave us trust that we were still formalising the right thing.⁴

All this, of course, does not yet mean that a Move program accepted by the type checker is actually free of dangling pointers. The algorithmic soundness theorem only connects the executable checker to the inference rules; it says nothing about runtime behaviour. Next, we had to provide the remaining components for the much desired Type Soundness Theorem and prove it.

Runtime semantics and Type Soundness Theorem

To state a Type Soundness Theorem, we need a precise definition of what it means to run a program. In our Lean development, the runtime semantics takes the form of a definitional interpreter: a recursive function run that executes a program for at most fuel steps, returning either a final result or an error:

def run (fuel : Nat) (state : ExecState) : ExecState :=
  match fuel with
  | 0 => .error .outOfFuel
  | n + 1 =>
    match state with
    | .running _ => run n (step state)
    | other => other

The fuel parameter is a standard technique for defining potentially non-terminating computations in proof assistants: instead of proving termination, we give the interpreter a budget of steps. When the budget runs out, it returns outOfFuel rather than looping forever. The soundness theorem will quantify over all fuel values, so this is not a limitation: if a bug exists, it would manifest at some finite step count.

The step function performs a single transition from one “virtual machine” state (variables, sites, stack, heap) to another. When something goes wrong at runtime, it produces an error state. Not all errors are created equal, though. Some are preventable: reading through a dangling reference, accessing a moved value, or encountering a type mismatch at a write. These are exactly the bugs the borrow checker exists to catch. Others are acceptable: division by zero, running out of fuel, or an explicit abort. No reasonable type system promises to prevent those. The type soundness theorem is exactly what guarantees that a program accepted by the type checker never reaches any preventable error.
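
The split between preventable and acceptable errors, together with the fuel-bounded interpreter, can be sketched in a self-contained way. The machine below is a hypothetical toy (a countdown, not the MoveIR virtual machine), and the constructor names of `RuntimeError` are illustrative guesses based on the errors listed above rather than the names used in the actual development.

```lean
/-- Runtime errors of a toy machine (constructor names are illustrative). -/
inductive RuntimeError where
  | danglingRef | useAfterMove | typeMismatch   -- preventable
  | divByZero | outOfFuel | abort               -- acceptable

/-- Errors that no type system is expected to rule out. -/
def RuntimeError.isAcceptable : RuntimeError → Bool
  | .divByZero | .outOfFuel | .abort => true
  | _ => false

/-- Toy machine state: a countdown that is still running, finished, or failed. -/
inductive ExecState where
  | running : Nat → ExecState
  | done : ExecState
  | error : RuntimeError → ExecState

/-- One transition of the toy machine. -/
def step : ExecState → ExecState
  | .running 0 => .done
  | .running (n + 1) => .running n
  | s => s

/-- Fuel-bounded execution, following the shape of `run` above. -/
def run (fuel : Nat) (state : ExecState) : ExecState :=
  match fuel with
  | 0 => .error .outOfFuel
  | n + 1 =>
    match state with
    | .running _ => run n (step state)
    | other => other

-- Enough fuel: the countdown terminates normally.
example : run 10 (.running 3) = .done := rfl
-- Too little fuel: the result is an error, but an acceptable one.
example : run 2 (.running 3) = .error .outOfFuel := rfl
example : RuntimeError.isAcceptable .outOfFuel = true := rfl
```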

The runtime semantics is one of the trusted components of the formalisation: if we get it wrong, the theorem might be true but vacuous. That said, this particular semantics, despite being AI-generated, was relatively straightforward (a standard small-step interpreter over a heap), so I mostly validated it by reading the code. As additional, albeit lightweight, assurance, I vibe-coded a number of simple “litmus” tests, checking that programs I believed should crash due to a dangling pointer or a use-after-move error did indeed crash under our semantics.⁵

With the semantics in hand, we can finally state the Type Soundness Theorem. Here is a slightly simplified version from our Lean development:

theorem type_soundness (f : FunDef) (lenv : LabelEnv)
    (enumEnv : EnumEnv) (funEnv : AssocMap Id FunDef)
    (args : List Value) (heap : Heap)
    (htyped : typecheck_fun f lenv enumEnv)
    (hfunEnv : ∀ fname fdef, lookup funEnv fname = some fdef →
               FunTypeSafe fdef funEnv enumEnv)
    (ha : SoundnessAssumptions f lenv enumEnv funEnv heap args)
    (e : RuntimeError) (hna : ¬e.isAcceptable) :
    ∀ n, run n (initState f funEnv args heap) ≠ .error e

Reading this bottom-to-top: for any non-acceptable error e and any fuel budget n, running the function f never produces that error. The premises require three things. First, htyped: the function type-checks under our relational type system. Second, hfunEnv: every function that f might call is itself type-safe (this gives us modular reasoning—we check one function at a time). Third, ha: a record called SoundnessAssumptions that bundles 23 well-formedness preconditions on the initial state—things like “argument types match parameter declarations”, “the heap contains no dangling locations”, “label environments are complete”, and structural invariants on enum definitions. These assumptions on the initial state are yet another point where the formalisation could become vacuous: if the 23 preconditions are mutually contradictory, the theorem would be true but useless: there would simply be no valid inputs to run the function on! We will discuss how to address this concern shortly.

Soundness Proof: the Labours of Claude

Everything described so far—encoding the type system, building the algorithmic checker, writing the parser, running conformance tests—was, in retrospect, the easy part. The real struggle was proving the Type Soundness Theorem itself.

The standard approach, introduced by Wright and Felleisen in 1994, decomposes type soundness into two lemmas: progress and preservation.⁶ Progress says: if the current state is well-typed, the machine can take a step not resulting in a preventable error. Preservation says: if the current state is well-typed and the machine takes a step, the resulting state is also well-typed. Together, they give an inductive argument: the initial state is well-typed (by the soundness assumptions), each step preserves well-typedness (preservation), and no well-typed state can crash (progress), so the machine never reaches a preventable error, no matter how many steps it takes. Progress was straightforward: for each typing rule, the relevant premises guarantee that the corresponding runtime operation succeeds. The real beast was preservation.
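
The inductive argument can be written down abstractly in a few lines of Lean. This is a simplified sketch under stated assumptions, not the statement from our development: `S`, `step`, `WT`, `Bad`, and `iter` are hypothetical names, progress is phrased as "a well-typed state is not a preventable-error state", and preservation is unconditional.

```lean
/-- Run `step` exactly `n` times from state `s` (a hypothetical helper). -/
def iter {S : Type} (step : S → S) : Nat → S → S
  | 0, s => s
  | n + 1, s => iter step n (step s)

/-- Progress plus preservation imply that no bad state is ever reached. -/
theorem no_preventable_error {S : Type} (step : S → S) (WT Bad : S → Prop)
    (progress : ∀ s, WT s → ¬ Bad s)
    (preservation : ∀ s, WT s → WT (step s)) :
    ∀ (n : Nat) (s : S), WT s → ¬ Bad (iter step n s) := by
  intro n
  induction n with
  | zero =>
    -- Zero steps: the state itself is well-typed, hence not bad.
    intro s hs
    exact progress s hs
  | succ n ih =>
    -- One step keeps the state well-typed; the induction hypothesis does the rest.
    intro s hs
    exact ih (step s) (preservation s hs)
```

All the real difficulty hides in choosing `WT` so that both hypotheses become provable.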

Preservation proof: an exercise in invariant inference

The crux of preservation is defining what a “well-typed state” means for a running machine. Remember, the type system deals with syntactic entities such as types, type environments, abstract references, and regexes, while the runtime operates on heap locations, actual values, and reference chains. The two worlds need to be connected. A well-typed state invariant is precisely this bridge: a predicate that relates the concrete machine state to the abstract type environment, asserting that every promise made by the type checker is backed by reality in the heap. In our case, it is a Lean record with 35 fields. For instance, some clauses say that every abstract reference $\rho$ tracked by the type checker maps to a live heap location, that the regex paths between references faithfully reflect actual pointer chains in the heap (so the reachability graph from the diagrams above is adequate), and that the check_outbound condition used by $\text{T-WriteRef}$ is always satisfied for mutable references.

It is nearly impossible to predict all 35 (or more) invariant fields from the start. The way it works in practice is: you attempt the proof for a particular statement kind, get stuck because the invariant is too weak, strengthen it with a new clause, and then propagate the change to all other cases. This is very similar to inferring loop invariants in imperative programs or inductive invariants for distributed systems: you just iterate until the invariant is strong enough for the proof to go through in Lean.

Where AI did great, and where it didn’t

Despite MoveIR being a relatively small language, it has over 20 distinct statement forms (borrow, move, copy, field borrow, write, freeze, call, return, jump, branch, pack, unpack, and their vector/enum variants), each with its own semantic step rule. The preservation proof contains 153 lemmas (mostly conjectured by AI), collectively showing that each of these steps preserves the 35-field invariant. Every time I strengthened the invariant with a new clause, dozens of lemmas across 30+ files needed updating. This is where Claude (Opus 4.5, and later 4.6) was invaluable. The AI excelled at proof repair: propagating a change through a large proof landscape, applying the same pattern to case after case. It also handled routine preservation cases—where the environment update is a straightforward pass-through—with high reliability.

But it was not all a walk in the park. At one point, Claude got stuck trying to prove preservation for jumps to LLVM-style labelled blocks, going in circles for hours. I had to step in and recognise that we needed a separate fact: a weakening lemma. Weakening says that if a statement type-checks under a “more restrictive” type environment (one that tracks more paths between references), it also type-checks under a “less restrictive” one. This is what makes control-flow joins sound: the checker types the continuation under a target environment and verifies that each branch’s post-environment subsumes it. The weakening lemma proof turned out to be substantial on its own: about 7,200 lines of Lean.

At some point I considered switching to Harmonic’s Aristotle prover for the more difficult lemmas, but ultimately decided against it. Claude Code made it very easy to keep the tight control over the entire development: I could refactor proof structures on the fly, ask “why are you proving this?” mid-proof, and steer the decomposition into lemmas interactively. Aristotle is better suited for one-shotting complex standalone mathematical statements (or refuting them), which is quite different from the iterative, gradual workflow I needed when prototyping Move metatheory.

The dragon: function calls

The preservation proof totals about 10,300 lines of Lean. Roughly half of that complexity lives in just two cases: the call and return commands. The typing rule for calls is the most involved in the entire system: it moves from reasoning about variables allocated in a single stack frame to reasoning about the entire call stack, which requires a bunch of additional invariants on saved frames. Additional complexity comes from the fact that our call rule supports relatively accurate inter-procedural tracking of borrows across call boundaries. The details are beyond the scope of this post, but the upshot is: this was the hardest case to dispatch. Proving that all 35 invariant fields are preserved by this operation was, by far, the most difficult part of the entire development. While Claude wrote the actual proof code, it took quite a few high-level hints from me: choosing the right case splits and cutting off unproductive proof attempts early. The call rule alone took nearly two days of Claude running almost non-stop (which forced me to upgrade to the most expensive plan). When it finally went through, it felt like defeating the dragon; the rest was mopping up.

Last wrinkles: soundness assumptions and type system extensions

Let us go back to the Type Soundness Theorem and the 23 SoundnessAssumptions that worried us earlier. Recall the concern: if these preconditions are mutually contradictory, the theorem is vacuously true, as there would be no valid inputs to run the function on, and we would have proven nothing useful.

The fix follows the same pattern that served us well throughout this development: make it executable and test it! Specifically, I vibe-coded a decidable checker that collapses all 23 preconditions plus the type-checking judgement into a single total Boolean function:

def SoundnessAssumptions.checkDecidable
    (f : FunDef) (lenvDec : LabelEnvDec) (enumEnv : EnumEnv)
    (funEnv : AssocMap Id FunDef) (fte : FunTypingEnv)
    (heap : Heap) (args : List Value) : Bool

I then made Claude prove that whenever this Boolean check returns true, the full SoundnessAssumptions record holds—and therefore the type soundness theorem applies. For each of our ~30 runtime test programs, I instructed Claude to instantiate the decidable checker with concrete inputs and let Lean’s evaluator discharge the precondition, producing a per-execution safety certificate: a machine-checked proof that this specific run cannot produce any preventable error.
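
The pattern of a Boolean checker, a theorem reflecting it into the propositional record, and a certificate for a concrete input can be sketched on a two-field toy. Everything here is hypothetical: `Assumptions`, its fields, and `of_check` merely stand in for the real 23-field record and its decidable checker.

```lean
/-- A toy bundle of preconditions (stand-in for the real 23-field record). -/
structure Assumptions (lo hi : Nat) : Prop where
  ordered : lo ≤ hi
  small   : hi < 100

/-- A total Boolean function that decides all preconditions at once. -/
def Assumptions.check (lo hi : Nat) : Bool :=
  decide (lo ≤ hi) && decide (hi < 100)

/-- If the Boolean check succeeds, the propositional record holds. -/
theorem Assumptions.of_check {lo hi : Nat}
    (h : Assumptions.check lo hi = true) : Assumptions lo hi := by
  simp [Assumptions.check] at h
  exact ⟨h.1, h.2⟩

-- For a concrete input, Lean evaluates the check and yields a certificate.
-- Its mere existence shows that the preconditions are not contradictory.
example : Assumptions 3 42 := Assumptions.of_check (by decide)
```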

Stretch goals: vectors and enums

The core type system covers MoveIR’s structs, references, and control flow. Two important features, vectors and multi-variant enums, were initially omitted but added later as “stretch goals.”

Vectors were relatively straightforward to support, as they are conceptually very similar to records that our initial formalisation already supported. The vector extension took about one day to vibe-formalise and touched 30+ files, but the modifications were uniform enough that Claude could apply the same pattern repeatedly without errors. As before, we have tested the extended algorithmic type checker extensively.

Enums were harder. The challenge was choosing the right runtime representation. The preservation proof requires that two values of the same type always have identical field structure (so that overwriting one with the other preserves all existing pointer paths). For structs this is trivially true, but enum variants carry different fields. Claude proposed several encodings that broke this property; the fix required me to provide some insight: represent every enum value as a flat record carrying fields for all variants at once, with inactive ones filled by type-specific default values. This made the structural property hold unconditionally, and the enum extension went through in about five days.
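
As an illustration of the flat encoding, consider a hypothetical two-variant enum. The `tag` field below is an assumption of this sketch (some discriminator is needed to tell which variant is active); the names `Shape`, `FlatShape`, and `flatten` are likewise invented and do not come from the formalisation.

```lean
/-- A toy enum with two variants carrying different fields. -/
inductive Shape where
  | circle (r : Nat)
  | rect (w h : Nat)

/-- Flat runtime representation: the fields of all variants, always present. -/
structure FlatShape where
  tag : Nat   -- 0 = circle, 1 = rect (assumed discriminator)
  r : Nat
  w : Nat
  h : Nat

/-- Inactive fields are filled with the default value of their type (0 for `Nat`). -/
def Shape.flatten : Shape → FlatShape
  | .circle r => { tag := 0, r := r, w := 0, h := 0 }
  | .rect w h => { tag := 1, r := 0, w := w, h := h }

-- Both variants expose the same field structure, so a path such as `.w`
-- stays meaningful when one variant is overwritten by the other.
example : (Shape.flatten (.circle 2)).w = 0 := rfl
example : (Shape.flatten (.rect 3 4)).w = 3 := rfl
```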

Some statistics

Everyone loves numbers, so here they are.

The entire Lean development comprises approximately 39,000 non-blank, non-comment lines of code, across 267 commits, with zero axioms or sorry declarations. To the best of my knowledge, this is the second largest PL metatheory mechanisation in Lean to date (after the Cedar specification, which is around 70,000 LOC) and, very likely, the largest one accomplished predominantly using AI. The soundness proofs dominate the codebase at ~23,000 LOC (59%), with the preservation proof alone weighing in at ~10,300 lines and the weakening lemma at ~7,200. The typing rules (both relational and algorithmic, with the soundness theorem) account for ~5,800 LOC, the parser and translator for ~1,300, and the test suite for ~6,100. The vector extension added ~3,000 LOC in one day; enums required ~5,000 LOC over five.

As a relatively large Lean development with diverse proof patterns, this codebase might serve as an interesting benchmark for AI-assisted theorem proving in Lean. I plan to open-source the full codebase in the coming weeks; feel free to get in touch if you would like early access for research purposes.

In terms of calendar time, the active development phase spanned 27 working days (some of them were on weekends, duh!). The chart below shows the daily commit intensity:

The algorithmic type checker and its soundness proof took two days and kicked off my AI-powered metatheory journey. Parser, macros, and testing harnesses took five. The soundness proofs consumed 13 days, with peaks at the weakening proof (22 commits in a single day) and the call rule (17 commits).

While this post covers the “interesting” PL-theory parts of the formalisation, in my estimate, the vast majority of the generated proofs are horribly boring: wrangling lists, threading facts through semantic rules, and proving excruciatingly mind-numbing lemmas about runtime state representation. Having done quite a few mechanisations in the past (by hand, in Rocq), I would estimate this effort at about 5–6 months of my full-time engagement without AI (I carried out the mechanisation described here concurrently with all my other duties as a faculty member). That is, of course, assuming I would not have gone insane from dealing with the tedious parts first. I am very happy that I didn’t have to do those proofs “old style” (and, probably, never will have to again).

What does this all mean for the future of PL research?

Congratulations on making it to the end of this post. Let me reward your patience with some of my thoughts on what all this might mean for PL research.

I have read somewhere recently that one reason skilled programmers love generative AI is that it lets them focus on the creative parts of computing while the machine handles the tedium.⁷ I think the same dynamic applies to PL researchers. The creative work (designing logics and type systems, choosing the right semantics, figuring out non-standard proof strategies) still belongs to humans. The tedium (threading a new invariant clause through 153 lemmas, or proving that removing a key from a key-value map preserves some property of the remaining entries) is exactly what AI handles well. This is a good trade.

I personally know some established PL academics who actively resist mechanising their metatheory proofs, because the overhead slows down their research. I believe that objection is no longer that convincing. While I probably shouldn’t generalise from a single data point of this experience, I would estimate that for a solid type system idea and using a well-known proof technique (progress and preservation, in my case, which every PL researcher learns in a graduate program), a single researcher can go from inception to a mechanised result at the level of a top-tier conference submission in about one month. To put it differently (and I apologise if this offends anyone), the effort of writing a POPL/PLDI paper will soon be comparable to the effort of writing an ICML/ICLR paper.

This does not mean that PL researchers will be out of a job any time soon. In another experiment (which I will save for a different post), I tried to tackle a far less conventional published result that relies on a non-standard proof technique, which was only briefly sketched in the respective manuscript. After about a week of fairly involved “vibe-mathing” with Claude Opus 4.6, my AI-assisted attempts produced no usable result (there are still dozens of sorrys in that project and I have no good idea how provable they are). In the experiment described in this post, the model clearly benefited from the fact that the progress-and-preservation technique is 32 years old and has numerous mechanised instances in the easily accessible training data. Novel proof strategies remain beyond AI’s reach (for now).

But let us forget about academic publications for a moment and think bigger. Over the past two decades, the PL community has done a tremendous job distilling reusable “proof harnesses” for challenges like type soundness, compiler correctness, and program logics. With AI-assisted theorem proving, we can hope to see many more real-world languages formalised end-to-end—not just Wasm and Cedar. Programming language design has a reputation for moving slowly, in part because of its insistence on rigour. AI-assisted mechanisation lets us keep the rigour and lose the slowness. We can finally move fast and break nothing.

Acknowledgements

I am grateful to Todd Nowacki and Sam Blackshear for many discussions on the design and implementation of Move’s borrow checker, and for their patience in answering my endless questions about corner cases. Thanks to the members of the VERSE lab and participants of the IFIP WG 2.8 meeting in March 2026 for their comments on the presentation that preceded this post.

¹ Does it mean that in practice PL designers prove such a theorem for every new programming language proposal? Well, of course not: real programming languages are large beasts with complicated syntax and unclear semantics, so errors in the type system design (not even type checker implementation) often go unnoticed for decades, making academic researchers very happy when they discover them.
² The parser itself was validated by 37 alpha-equivalence tests comparing the output of the translation against hand-written reference ASTs.
³ The converse property, completeness, would say that the algorithmic checker accepts every program the relational rules accept. We have not proved completeness, but this is less important for our goals: soundness guarantees that no unsafe program slips through, while a completeness gap would only cause the checker to reject some safe programs, which would be caught by the conformance tests.
⁴ This approach is not entirely novel. The formalisation of the AWS Cedar authorisation language follows a similar pattern: a Lean implementation of Cedar’s evaluator is tested against its production version in Rust, ensuring that the two agree on a shared test suite.
⁵ A more rigorous approach would be to instrument the Move runtime to emit intermediate machine states, parse them into our Lean formalisation, and check that our definitional interpreter passes through corresponding states at each step. In the interest of time, I have not carried out this exercise so far.
⁶ More powerful techniques have been developed since 1994, notably logical relations, which can handle features like higher-order state and recursive types. We did not need them here: Move’s first-order setting is well within the reach of classical progress-and-preservation.
⁷ The opposite, apparently, is true for creative professions: writers and artists.

Verifying Distributed Protocols in Veil

2026-02-09T00:00:00+00:00

In this post, we discuss how to formalise, test, and prove the correctness of a classic distributed protocol by combining model checking, automated deductive verification, and AI-powered invariant inference in Veil, a new auto-active Lean-based verifier for distributed protocols.

Introduction

A famous quote by Leslie Lamport states:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

While implementations of distributed systems are typically very large programs of tens of thousands of lines of code, at the heart of each such system is a distributed protocol: an algorithm responsible for enabling all parts of the system to communicate with each other so that the system delivers correct results to its users. Even though descriptions of distributed protocols are usually one to two orders of magnitude smaller than their respective implementations, getting those protocols right is still a challenging endeavour, due to their inherently concurrent nature and the large number of “moving parts”, especially in the presence of possible faults.

The complicated nature of distributed protocols has made them a popular subject for formal modelling and validation using computer-assisted tools, such as TLA+, P, and Quint. TLA+, designed by Lamport himself, is perhaps the most popular such framework today. It is used widely by designers and implementers of distributed protocols in industry to quickly prototype protocol designs and validate their properties by exhaustively testing them on a fixed set of parameters—an approach known as model checking.

A significant shortcoming of all these tools when it comes to gaining trust in distributed protocol design is their rather rudimentary support for machine-checked formal proofs of distributed protocol correctness. That is, these tools are excellent at finding bugs in protocol designs, but make it challenging to prove that a distributed algorithm never violates its correctness specification.¹

To address these shortcomings, we developed Veil—a “one-stop shop” framework for prototyping, model checking, and verifying distributed protocols, with ultimate correctness guarantees. Similar to our other verifier Velvet, Veil is built on top of the Lean proof assistant, as a library, so it naturally benefits from Lean’s expressive specifications, rich collection of data types and mathematical theorems, and the toolset for proof scripting and automation. Furthermore, we strove to replicate the best parts of TLA+ in Veil, namely, its ability to state protocol specifications at a high level of abstraction (i.e., without spelling out each insignificant implementation detail) combined with the ability to model-check such specifications, quickly discovering “shallow” bugs, before focusing on formal correctness proofs.

In the rest of this post, we will discuss how to encode, model check, and semi-automatically verify a classic consensus protocol in Veil. In future posts in this series, we will focus on more advanced Veil features, such as combining automated and interactive Lean proofs about distributed protocol correctness.

A Classic Example: Dolev-Strong Floodset Protocol

To showcase Veil, we will use it to model the Dolev-Strong Floodset protocol, a classic distributed algorithm allowing $N$ nodes to reach agreement, i.e., select the same value uniformly, by exchanging messages with each other. Notably, this is a fault-tolerant protocol: it tolerates up to $t$ node failures (for some fixed $t < N$) during the protocol’s execution, where a failed node simply stops sending messages.

The Floodset protocol ensures agreement between non-faulty nodes under an assumption of synchrony, meaning that the nodes communicate in discrete, numbered rounds, and all messages sent in a given round are guaranteed to be delivered before the next round starts. This is a strong assumption, as it allows a node $n$ in the protocol to infer that, if it has not heard from a node $m$ in a certain round, that is because the node $m$ is faulty, and not because the network is slow.²

The protocol’s logic is rather simple. It starts with each node choosing a value from a certain finite set of totally ordered values (the need for ordering will become clear soon). Next, they communicate in $t + 1$ rounds. In each round, each non-faulty node broadcasts (i.e., sends to everyone in the network) all values it is aware of. Initially, this is just the value it has chosen at the start, but this set will grow as it “hears” from other nodes. After $t + 1$ rounds, each node selects the smallest value from those it is aware of, according to the ordering, and concludes that this is the value that all nodes should agree on.

Let’s try to convince ourselves that this protocol works, in the sense that after $t + 1$ rounds of communication, despite the possible failures, all nodes that survived choose the same value.

First, let us notice that if there are no failures, just one round of communication between all nodes is sufficient. Indeed, if everyone hears from everyone, every node will be aware of every value any node (including itself) has chosen by the end of this round. Given that all of them exercise the same rule of choice—just choosing the smallest value in the set—they will all choose the same result.

Things get trickier if there are failures. The problem is that a node can fail in the middle of its broadcast, so some nodes will receive its message and some will not. As a result, by the end of the round, different alive nodes might be aware of different sets of values. To make things worse, multiple nodes might fail during the same round, adding even more chaos to the system.

So why does the protocol work in the end? Notice that to reach agreement, we only need one fault-free round during the entire execution: by the end of such a round, all alive nodes will have “synchronised” with each other on all the data in the system. Since we assumed at most $t$ faults and the protocol runs for $t + 1$ rounds, even if the faults are spread out so that one node fails per round, by the pigeonhole principle at least one round must be fault-free, and in that round everyone gets everyone’s data. Even if some nodes fail after that, the knowledge of the alive nodes will not change, since a failing node can only deliver values that they already have (although it is possible that the alive nodes will agree on a value proposed by a node that has failed at some point).
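The pigeonhole argument can be sanity-checked by brute force on a tiny instance. Below is a minimal Python sketch, independent of Veil and not part of the accompanying model; the names run_floodset, single_crash_schedules, and agreement_holds are hypothetical, and the sketch assumes at most one crash ($t = 1$) with an arbitrary set of receivers for the crashing node’s last broadcast.

```python
from itertools import product

def run_floodset(initial, crashes, rounds):
    """Run Floodset for `rounds` rounds and return {alive node: decided value}.

    initial: list with the initial value of each node.
    crashes: dict mapping a round index to a list of (node, receivers) pairs,
             where `receivers` is the set of nodes that still get the
             crashing node's final, partial broadcast.
    """
    n = len(initial)
    seen = [{v} for v in initial]
    alive = set(range(n))
    for r in range(rounds):
        crashing = dict(crashes.get(r, []))  # node -> receivers of its last broadcast
        alive -= set(crashing)
        snapshot = [set(s) for s in seen]    # what everyone knew when the round began
        for node in alive:
            for sender in alive:             # alive senders reach every alive node
                seen[node] |= snapshot[sender]
            for sender, receivers in crashing.items():
                if node in receivers:        # partial delivery from a crashing node
                    seen[node] |= snapshot[sender]
    return {node: min(seen[node]) for node in alive}

def single_crash_schedules(n, rounds):
    """Yield every schedule with at most one crash (assumption: t = 1)."""
    yield {}
    for r in range(rounds):
        for c in range(n):
            others = [m for m in range(n) if m != c]
            for mask in product([False, True], repeat=len(others)):
                receivers = {m for m, bit in zip(others, mask) if bit}
                yield {r: [(c, receivers)]}

def agreement_holds(n, rounds, values=(0, 1)):
    """Check agreement for all initial values and all single-crash schedules."""
    for initial in product(values, repeat=n):
        for crashes in single_crash_schedules(n, rounds):
            decisions = run_floodset(list(initial), crashes, rounds)
            if len(set(decisions.values())) > 1:
                return False
    return True

print(agreement_holds(3, rounds=2))  # t + 1 = 2 rounds
print(agreement_holds(3, rounds=1))  # only t = 1 round
```

With three nodes and one allowed crash, two rounds are enough for agreement, while a single round is not: a node that crashes mid-broadcast can leave the two survivors with different sets of values.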

Let us now go ahead and encode the protocol in Veil, proving that this intuition is, in fact, true.

Encoding Floodset Protocol in Veil

Disclaimer: Veil is a research prototype and is currently under active development by the VERSE lab. Its performance as an automated verifier might vary across case studies, and some of its UI aspects (e.g., syntactic error highlighting) will be improved in the future. Please get in touch with us if you are planning to use it, and feel free to submit bug reports to the GitHub repository.

The protocol model accompanying the rest of this post can be found in this file. If you don’t want to compile Veil, you can try its web version (although be prepared that it’s not very fast).

Representing State Space

To encode a distributed protocol in Veil, we start by creating a new Lean file (let’s call it FloodSet.lean) with the command that imports Veil as a Lean library and makes a new Veil module, where we will put all our definitions and properties:

import Veil

veil module FloodSet

-- The protocol definition and its properties will be here

end FloodSet

Next, we will define the state of our protocol. Unlike actual implementations of distributed systems, where each node runs its own code, formal descriptions of distributed protocols typically model the system holistically: the state captures the information held by all nodes at once, and transitions describe how a single node’s action updates this global state. This style greatly simplifies formal encoding of the protocol’s logic while retaining all its characteristic intricacies. Readers with experience in TLA+ or Quint will find Veil’s encoding style very familiar.

We first declare the types of the node identities and the values they exchange:

-- Abstract types for nodes and values
type node
type value

-- Values must be totally ordered (for picking the minimum)
instantiate val_ord : TotalOrder value
open TotalOrder

At this point, it is not important what the elements of node and value are exactly: they can be integer numbers, strings, etc. Knowing their structure is not necessary for modelling the protocol’s logic, and, perhaps surprisingly, leaving it abstract also makes the protocol easier to verify, so we deliberately omit these details. The only thing we need to postulate is that the elements of value are totally ordered, which is what the line instantiate val_ord : TotalOrder value does.³

Next, we proceed to the components of the state space of the protocol, which one can think of as a record with several fields:

immutable individual t : Nat

individual round : Nat
individual crashCount : Nat

function initialValue : node → value
relation seen : node → value → Bool
relation decision : node → value → Bool
relation alive : node → Bool
relation crashedInThisRound : node → Bool

#gen_state

Scalar state components (i.e., non-functions) in Veil are declared using the individual keyword.⁴ Components that remain constant throughout the protocol’s entire execution, such as the maximal allowed number of faulty nodes t, can be marked as immutable. This is not strictly necessary, but it improves the performance of model checking and simplifies the proofs by signalling to Veil that it does not need to track changes in those values. Since both the round number (round) and the actual number of crashed nodes (crashCount) will increase during the system’s execution, those are not marked as immutable.

Non-scalar components in Veil model many-to-one and many-to-many relations between values in the system. For instance, the initialValue function represents the values chosen by the nodes at the beginning of the protocol. The binary relation seen captures the values each node is aware of at any point: seen n v = true means that the node n is aware of the value v. In a similar vein, decision encodes which nodes have decided on which values. Even though each node will eventually select at most one value, it is much more convenient to use a relation, which in principle allows a node to choose multiple values; don’t worry, we will later verify that this never happens. Finally, the two unary relations alive and crashedInThisRound capture the nodes alive at any moment of the protocol’s execution and the nodes that have crashed since the start of the current round, respectively.

Given these declarations, Veil command #gen_state produces an internal representation of the protocol’s state, so we can use it to describe how exactly the algorithm works.

Initial States

Now comes the most interesting part: modelling the protocol’s logic, i.e., describing how exactly it starts and runs. Let’s start by describing the set of possible initial states of the protocol:

-- Initial state: each node has exactly one proposal value
after_init {
  initialValue := *
  seen N V := initialValue N == V

  alive N := true
  decision N V := false
  round := 0
  crashCount := 0
  crashedInThisRound N := false
}

There is quite a bit to unfold in this definition.

Let us start with the line initialValue := *. Remember that initialValue is a function of type node → value defining which value each node chooses at the beginning. It is not important what exact value each node chooses, and, in fact, we are interested in checking all possible combinations. This Veil syntax allows us to achieve exactly this: picking a definition of initialValue arbitrarily. The power of this feature will become clearer once we start model-checking (i.e., exhaustively testing) and verifying the correctness of the protocol. In the former case, it will ensure that we run the check for every possible definition of initialValue. In the latter, it will allow us to prove that the protocol’s correctness holds regardless of which definition of initialValue it starts with.
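To get a feeling for what “all possible definitions of initialValue” means for a model checker, here is a small Python sketch (not Veil code; the helper all_initial_value_functions is a hypothetical name) that enumerates every total function from a finite set of nodes to a finite set of values.

```python
from itertools import product

def all_initial_value_functions(nodes, values):
    """Yield every total function nodes -> values, represented as a dict."""
    for choice in product(values, repeat=len(nodes)):
        yield dict(zip(nodes, choice))

nodes = [0, 1, 2]
values = [0, 1]
functions = list(all_initial_value_functions(nodes, values))
print(len(functions))  # 8, i.e. 2 ** 3 candidate definitions
```

The number of candidates grows as $|\mathit{value}|^{|\mathit{node}|}$, which is one reason why exhaustive checking is only feasible for small parameter values.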

The second line, seen N V := initialValue N == V, shows another important feature of Veil, inspired by the Ivy verifier: so-called iterated assignments. Whenever we use a capitalised variable, e.g., N, in the code of Veil actions (and later, in protocol properties), it should be read as “for any N in the respective set”. More specifically, here we define the Boolean value of seen N V for every pair (N, V) as the value of the expression initialValue N == V. In other words, if, for a given N and V, initialValue N returns V, then seen N V is set to true; otherwise it is set to false. This syntax might look a bit weird when you see it for the first time, but it soon becomes very natural, and you might come to appreciate its conciseness and elegance.
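For readers more at home with mainstream languages, an iterated assignment behaves much like a comprehension over all combinations of the capitalised variables. The Python sketch below is only an analogy under that reading, with a hypothetical concrete choice of initial_value; it is not how Veil represents relations internally.

```python
nodes = [0, 1, 2]
values = [0, 1]
initial_value = {0: 1, 1: 0, 2: 1}  # one arbitrary choice, for illustration only

# seen N V := initialValue N == V
seen = {(n, v): initial_value[n] == v for n in nodes for v in values}

# alive N := true
alive = {n: True for n in nodes}

print(sorted(pair for pair, flag in seen.items() if flag))
```

Each node ends up related to exactly one value, namely its own initial one.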

With the syntax explained, the rest of the definition should be clear: every node N is alive at the start of the execution, and has not decided on any value (decision N V := false). We start from the round 0, with 0 crashed nodes and no node crashed in the initial round.

Protocol Actions

To model an execution of distributed protocols, we must embrace their non-deterministic nature: in essence, at any point, one of many things can happen in the system, and we do not always have control over the exact order of these events. That said, we should also be able to express which event might or might not happen given the state of the system so far. Let us see how to accommodate these requirements and express distributed events using the mechanism of Veil actions.

For example, this is an action that describes an event of a node failing:

-- Crash one alive node (can happen multiple times per round, up to t total)
action crash (n : node) {
  require round < t + 1
  require crashCount < t
  require alive n

  alive n := false
  crashedInThisRound n := true
  crashCount := crashCount + 1
}

This definition states that an action crash can be executed at any point for any node n given that the following side conditions are satisfied:

The current value of round is less than t + 1
The value of crashCount is less than t
The node n is alive

The combination of these requirements will prevent us from crashing more than t nodes, doing so after the end of the protocol’s main “run”, and crashing the nodes that have already failed.

What follows is the “operational” logic of the action. It first marks the node n as failed (alive n := false). Then, it records the fact that n has crashed in the currently ongoing round (crashedInThisRound n := true). Finally, it increases the total counter of crashed nodes.
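One way to read a Veil action is as a guarded transition: a partial function from states to states that is defined only when all require-conditions hold. The Python sketch below mirrors crash under that reading; the State record and the convention of returning None for a disabled action are our assumptions for illustration, and only the fields relevant to crash are included.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class State:
    t: int
    round: int
    crash_count: int
    alive: FrozenSet[int]
    crashed_in_this_round: FrozenSet[int]

def crash(s: State, n: int) -> Optional[State]:
    """Return the successor state, or None if some `require` guard fails."""
    if not (s.round < s.t + 1):     # require round < t + 1
        return None
    if not (s.crash_count < s.t):   # require crashCount < t
        return None
    if n not in s.alive:            # require alive n
        return None
    return replace(
        s,
        alive=s.alive - {n},
        crashed_in_this_round=s.crashed_in_this_round | {n},
        crash_count=s.crash_count + 1,
    )

s0 = State(t=1, round=0, crash_count=0,
           alive=frozenset({0, 1, 2}), crashed_in_this_round=frozenset())
s1 = crash(s0, 0)
print(s1)
print(crash(s1, 1))  # None: the crash budget t = 1 is already used up
```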

It is important to note that, despite the concurrent nature of distributed protocol executions, there are no “data races” between actions in Veil: each action is assumed to execute atomically, and actions run one after another. What remains non-deterministic is the order in which they are scheduled. Veil’s machinery for model checking and verification takes care of this by considering every possible ordering.

Next, let us describe the most important part of the Floodset algorithm: synchronously advancing rounds, making nodes exchange data with each other. This is done by the advanceRound action:

action advanceRound {
  require round < t + 1

  let deadToAliveDelivery : node → node → Bool :| true

  seen N V := seen N V ||
    alive N &&
      decide ((∃ m, alive m ∧ seen m V) ∨
              (∃ d, crashedInThisRound d ∧ deadToAliveDelivery d N ∧ seen d V))

  crashedInThisRound N := false
  round := round + 1
}

The most interesting part of the advanceRound action is how it propagates the information about seen values between nodes by updating the seen relation. Notice that, due to our synchrony assumption, there is no need to model the message-passing explicitly, since the delays between sending and receiving messages are not observable in our system: effectively, every message reaches its destination within a single round, or is lost forever. That’s why we can update seen N V for any node N and value V in a single iterated assignment. The assignment constructs the new relation by taking the union of the old one (via Boolean disjunction ||) with the newly received data. This data is only relevant for nodes that are still alive (hence the conjunction with alive N). A new value V may come either from the seen-set of some presently alive node m (∃ m, alive m ∧ seen m V) or from the seen-set of some recently crashed node d.

The latter aspect of the model deserves a small discussion. What we want to express here is that some of the alive nodes might hear from some of the failed nodes. We do not want to fix in advance for which pairs of failed and alive nodes this is the case, so we model this many-to-many relation as a non-deterministically chosen function deadToAliveDelivery:

let deadToAliveDelivery : node → node → Bool :| true

This syntax tells Veil to assign to deadToAliveDelivery an arbitrary function of type node → node → Bool that satisfies the predicate true (in other words, the choice of the function is unrestricted).⁵ Getting back to the last line of the iterated assignment to seen in advanceRound, we can see that, for a fixed choice of deadToAliveDelivery, a value V from the seen-set of a node d that failed in this round is delivered to an alive node N only if deadToAliveDelivery d N == true. The call to Lean’s decide function is a slightly annoying necessity: it converts Lean’s native propositions (of type Prop) to Booleans, and is needed because of existentially quantified statements such as (∃ m, alive m ∧ seen m V).

The remainder of the advanceRound action “resets” crashedInThisRound, so that the nodes that crashed in this round cannot affect the outcome of later rounds (any node crashed via the crash action from this point on can only affect the outcome of the next round). Finally, it advances the round number.
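The update of seen can be mimicked in a few lines of Python. This is a sketch of our reading of advanceRound rather than Veil’s actual semantics: seen-sets are stored per node, one fixed choice of deadToAliveDelivery is passed in as a set of (d, n) pairs, and the helper all_deliveries (a hypothetical name) enumerates every such choice, which is what a model checker would have to explore.

```python
from itertools import product

def advance_round(seen, alive, crashed, delivery):
    """Compute the new seen-sets after one synchronous round.

    seen:     dict node -> frozenset of values the node is aware of
    alive:    set of alive nodes
    crashed:  set of nodes that crashed during this round
    delivery: set of (d, n) pairs, meaning crashed node d still reached node n
    """
    from_alive = frozenset().union(*(seen[m] for m in alive))
    new_seen = {}
    for n, old in seen.items():
        if n in alive:
            from_dead = frozenset().union(
                *(seen[d] for d in crashed if (d, n) in delivery))
            new_seen[n] = old | from_alive | from_dead
        else:
            new_seen[n] = old  # crashed nodes learn nothing new
    return new_seen

def all_deliveries(crashed, alive):
    """Yield every possible delivery relation between crashed and alive nodes."""
    pairs = [(d, n) for d in sorted(crashed) for n in sorted(alive)]
    for mask in product([False, True], repeat=len(pairs)):
        yield {p for p, bit in zip(pairs, mask) if bit}

seen = {0: frozenset({0}), 1: frozenset({1}), 2: frozenset({1})}
out = advance_round(seen, alive={1, 2}, crashed={0}, delivery={(0, 1)})
print(out)
```

In this run, node 0 crashed while reaching only node 1, so nodes 1 and 2 end the round with different seen-sets; this is exactly the situation that the extra round is there to repair.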

The last action of the protocol is nodeDecide, which allows any alive node to select the value of the consensus in the round t + 1:

action nodeDecide (n : node) {
  require round = t + 1
  require alive n

  let v :| seen n v
  assume ∀ w, seen n w → le v w

  decision n V := V == v
}

Once again, we use the constrained non-deterministic choice operator to pick the value v such that it’s in the seen-set for the node n. We additionally constrain it, via Veil’s assume statement, to be the smallest possible value amongst those seen by n. At the end, we set the decision of n to only contain the chosen value v that (a) it has seen and (b) is the smallest amongst all values seen by n.
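The combination of a constrained choice and an assume can be thought of as filtering the set of admissible picks. The following Python sketch (with the hypothetical helper possible_decisions, and integers standing in for the totally ordered value type) computes which values nodeDecide could choose, with and without the minimality constraint.

```python
def possible_decisions(seen_n, enforce_minimum=True):
    """Values v admitted by `let v :| seen n v`, optionally filtered by the
    `assume` that v is below every value seen by the node."""
    return {v for v in seen_n
            if not enforce_minimum or all(v <= w for w in seen_n)}

print(possible_decisions({2, 0, 1}))                         # only the minimum
print(possible_decisions({2, 0, 1}, enforce_minimum=False))  # any seen value
```

With the constraint in place there is exactly one admissible value for a non-empty seen-set; without it, two nodes with identical seen-sets are free to pick different values.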

Specifying Protocol Safety Properties

Now, with the definition of the protocol at hand, let us try to ensure that it indeed does what it’s supposed to do: makes all alive nodes uniformly decide on exactly one of their initial values. In Veil, we can encode this specification using the following two statements:

safety [agreement]
  ∀ n1 n2 v1 v2, decision n1 v1 ∧ decision n2 v2 → v1 = v2

safety [validity]
  ∀ n v, decision n v → (∃ m, initialValue m = v)

The keyword safety indicates a property of a protocol that must not be violated by any state that is either (a) amongst its initial states (defined via after_init) or (b) reachable from an initial state by executing a sequence of one or more actions: crash, advanceRound, or nodeDecide. The property names in square brackets are optional.

Here, agreement states that any two values v1 and v2 that some nodes n1 and n2 decide upon are, in fact, the same. Notice that this property also ensures that each node decides on at most one value (take n1 = n2). We are not concerned here with whether the node is alive or crashed (in fact, a crashed node never gets to make a decision, as per the require alive n guard of the nodeDecide action). The second property, validity, states that any decided value originates from some node’s initial value choice.

Catching Bugs in Specification with Model Checking

To check that both agreement and validity do indeed hold for our encoding of Floodset, we are going to use Lace, Veil’s model checker. Lace works similarly to TLC, the model checker for TLA+. Given concrete finite sets representing the core data types of the protocol, it simply runs the model: it first enumerates all initial states, and then repeatedly applies every enabled action (i.e., every action whose require-statements are satisfied) to the states reached so far, until this process either exhausts the entire state space of the protocol or is interrupted.
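The general idea of explicit-state model checking fits in a couple of dozen lines. The sketch below is a generic breadth-first search in Python over a toy transition system; it is not how Lace is implemented, and the function model_check and the toy successors relation are made up for illustration.

```python
from collections import deque

def model_check(initial_states, successors, invariant):
    """Breadth-first exploration of the reachable states.

    Returns None if `invariant` holds in every reachable state, and
    otherwise a shortest trace from an initial state to a violating one.
    """
    parent = {}
    queue = deque()
    for s in initial_states:
        if s not in parent:
            parent[s] = None
            queue.append(s)
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []
            while s is not None:  # walk back to the initial state
                trace.append(s)
                s = parent[s]
            return list(reversed(trace))
        for nxt in successors(s):
            if nxt not in parent:  # visit each distinct state once
                parent[nxt] = s
                queue.append(nxt)
    return None

# Toy system: a counter that can grow by 1 or 2 while it is below 5.
def successors(s):
    return [s + 1, s + 2] if s < 5 else []

print(model_check([0], successors, lambda s: s != 5))  # a violating trace
print(model_check([0], successors, lambda s: s != 7))  # None: 7 is unreachable
```

Because the search is breadth-first, the returned trace is among the shortest ones, which is what makes “shallow” bugs quick to find.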

We can run Lace directly from a Lean file with the following command:

#model_check { node := Fin 3, value := Fin 2 } { t := 1 }

Fin n is Lean’s native data type that represents the finite collection containing {0, ..., n - 1}. We use it to specify that the set of nodes is the finite set {0, 1, 2}, and the set of values to choose from is {0, 1}. We also allow for at most one crash by setting the immutable protocol parameter t to 1. After a few seconds, the command produces the following output shown in the Lean InfoView buffer of VSCode:

VSCode with Lean 4: testing properties of the Veil implementation of the Floodset algorithm. The InfoView panel on the right shows statistics for the execution.

In particular, Lace has explored 54904 states of the protocol, of which 362 were distinct, and the execution diameter, i.e., the maximal length of an execution trace, was 6 actions (finding such a trace is left as an exercise). The large number of explored states is explained by the combinatorial explosion of possible executions of the protocol: since our encoding relies on non-deterministic choice (e.g., by setting initialValue := *), the model checking algorithm has to enumerate all possible outcomes of such choices.

Model checking is tremendously useful for debugging the logic of the protocol and its properties, as it can quickly discover bugs that are relatively “shallow”—i.e., that manifest at small diameters. For instance, if we introduce a bug by commenting out the line assume ∀ w, seen n w → le v w in the action nodeDecide and re-run the model checker, Lace will show the following report:

In the state that violates the property, the relation decision stores two pairs of values, namely, (0, 0) and (1, 1), meaning that the node 0 has chosen the value 0, and the node 1 has chosen 1, which clearly violates the agreement property. Notice that the model checker not only reports the violation of the agreement property—it also constructs a concrete execution trace, demonstrating how a property-violating state can be reached from an initial one.

In addition to discovering violations of safety properties (such as agreement), a model checker can also be “tricked” into discovering reachability bugs—scenarios in which a protocol does not do anything useful. To see how to do that, let us uncomment the line assume ∀ w, seen n w → le v w of nodeDecide and comment out the line decision n V := V == v. Clearly, now, no node ever makes a decision, which means that both agreement and validity always hold true, as the premises of their implications are always false. This can be confirmed by re-running Lace, which reports no errors.
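The vacuity at play here is easy to reproduce outside of Veil. In the Python sketch below, agreement and validity are re-stated over a decision relation represented as a set of (node, value) pairs (a representation chosen for this illustration only); both hold trivially when the relation is empty.

```python
def agreement(decision):
    """Any two decided values coincide."""
    return all(v1 == v2 for (_, v1) in decision for (_, v2) in decision)

def validity(decision, initial_value):
    """Every decided value is some node's initial value."""
    return all(v in initial_value.values() for (_, v) in decision)

initial_value = {0: 0, 1: 1, 2: 1}

print(agreement(set()), validity(set(), initial_value))  # both vacuously true
print(agreement({(0, 0), (1, 1)}))                       # a genuine violation
```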

One can check whether the protocol does, in fact, reach “interesting” states by stating a property that we want to be violated, such as the following one:

safety [no_decision] ∀ n v, ¬decision n v

The property no_decision simply states that no node ever decides on any value. This is clearly not what we want, so an execution of a well-functioning protocol should reach a state in which no_decision is false. However, if we re-run the model checker on Floodset in its current form, no violations will be reported. This means we’ve just discovered a reachability bug. By uncommenting decision n V := V == v, thus restoring the protocol to its original form, we make it so that no_decision is now violated, which results in the following report, demonstrating a trace that does lead to a node making a decision:

Formal Safety Proof and Inductive Invariants

So far, we have managed to check that no execution of our model of the Floodset protocol violates the desired safety properties for some fixed values of its parameters: node, value, and t. This gives some certainty that we have got the protocol right, but it does not serve as a proof of this fact. What we want to ensure is that no execution violates the safety properties for any values of node, value, and t.

This statement can be phrased as a formal theorem, which can then be proven by induction on the length of the execution. Specifically, we can derive that the desired safety properties (e.g., agreement and validity) hold for any state of the system if we prove the following two facts:

(1) These properties are true for any initial state, and
(2) If they hold true for a state s, and a state s' can be produced from s by applying one of the protocol’s actions, then these properties also hold true for s'.
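In symbols, writing $\mathit{Init}$ for the predicate describing the initial states, $\mathit{Tr}(s, s')$ for “$s'$ is obtained from $s$ by one action”, and $\mathit{Inv}$ for the conjunction of the properties (these names are introduced here only for illustration and are not Veil syntax), the two proof obligations read roughly as follows:

\[\forall s.\ \mathit{Init}(s) \Rightarrow \mathit{Inv}(s) \qquad\text{and}\qquad \forall s\, s'.\ \mathit{Inv}(s) \land \mathit{Tr}(s, s') \Rightarrow \mathit{Inv}(s').\]

Together they imply that $\mathit{Inv}$ holds in every reachable state. Note that the second obligation quantifies over all states $s$ satisfying $\mathit{Inv}$, not only over the reachable ones.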

You can recognise part (1) as a base of a proof by induction, while (2) is the induction step. Veil provides a convenient way to assemble Lean theorems corresponding to the statements (1) and (2) for a specific protocol, out of the protocol’s description and its stated safety properties. We can then attempt to prove those properties automatically by typing the following command:

#check_invariants

As a result, for this example, Veil generates 12 Lean theorems, out of which 10 are proven automatically, while two are disproven. Let’s take a closer look at the generated report shown in the Lean InfoView:

What it states is that for the action nodeDecide, it has discovered a pair of states, a pre-state and a post-state, such that all the safety properties (also sometimes called invariants) hold for the pre-state, but the post-state violates agreement. This example refutes the statement of the induction step, but it does not necessarily mean that some reachable state of the protocol violates the properties. After all, the model checker gave us strong evidence that they hold. So why does the proof not go through?

It turns out that the pre-state of the generated counterexample is, in fact, not reachable in any of the concrete runs of the system. It is generated simply because this is a state that satisfies both our properties, as required by the premise of the induction step, which does not talk about actual state reachability. Such spurious counterexamples, known as counterexamples to induction (CTI), can be eliminated by stating more properties that we believe hold true over the system. For instance, by examining the provided counterexample, we can notice that, according to it, the node 1 has decided on the value 1 (decision ↦ (1, 1)), even though it hasn’t “seen” it (the seen relation only contains the pair (0, 0)). This is clearly not something that could happen during a concrete execution, so we can add the corresponding property to be included in our induction hypothesis:

invariant [seen_decision]
  ∀ n v, decision n v → seen n v

The property seen_decision states that any value that becomes some node’s decision must have been seen by this node. Sadly, running #check_invariants fails again, but now with a different CTI:

After looking at the counterexample, we can notice that the value of crashCount in the pre-state is 0, while both nodes 0 and 1 are marked as crashed in this round, and yet alive at the same time. Again, this is not something that would happen after a valid run in the system. We can try to eliminate this outcome by adding the following property to the set of invariants we use in our proof by induction:

invariant [crashed_not_alive]
  ∀ n, crashedInThisRound n → ¬ alive n

And again, this generates yet another counterexample. What we are doing here is called inductive invariant inference, and, as we’ve discussed in the previous post, there is no algorithm guaranteed to always find a set of inductive invariants that establishes a given set of properties, even when those properties do hold.

Luckily, with a bit of understanding of how the protocol works and by analysing the counterexamples to induction produced by Veil, we can eventually provide a sufficient set of invariants for the system so that all generated theorems are proven, thus delivering the correctness proof for the protocol with regard to our initial properties. These invariants can be found in the Floodset implementation available on GitHub.
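The interplay between safety properties, counterexamples to induction, and strengthening invariants can be replayed on a toy system small enough to check by brute force. The Python sketch below has nothing to do with Floodset or with how Veil discharges its theorems (Veil delegates this to SMT solvers); the counter system and the helper find_cti are hypothetical and serve only to illustrate the concept.

```python
STATES = range(8)  # all states, reachable or not

def is_init(x):
    return x == 0

def step(x):
    """The only action: add 2, as long as we stay within bounds."""
    return [x + 2] if x + 2 < 8 else []

def find_cti(invariant):
    """Return a pair (pre, post) such that `invariant` holds for pre but not
    for post, or None if the invariant is preserved by every step."""
    for pre in STATES:
        if invariant(pre):
            for post in step(pre):
                if not invariant(post):
                    return (pre, post)
    return None

def safety(x):
    return x != 5                      # true in all reachable states 0, 2, 4, 6

def strengthened(x):
    return safety(x) and x % 2 == 0    # adds "x is even" as an auxiliary invariant

print(find_cti(safety))        # a CTI whose pre-state is unreachable
print(find_cti(strengthened))  # None: the strengthened invariant is inductive
print(all(strengthened(x) for x in STATES if is_init(x)))  # base case
```

The property x != 5 holds in every reachable state, yet it is not inductive on its own: the unreachable pre-state 3 satisfies it and steps to 5. Adding the auxiliary invariant that x is even rules this pre-state out, just as seen_decision and crashed_not_alive rule out the spurious pre-states above.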

AI-Powered Inference of Inductive Invariants

I know you’ve been waiting for this part!

Indeed, the process of inferring inductive invariants manually is quite tedious, so having some automation here would go a long way. In the past, several academic efforts in the systems research community attempted to automate invariant inference by applying traditional symbolic enumerative techniques (examples of such frameworks are I4, SWISS, and DuoAI).

Nowadays, we have Claude Code, so we can just use it with the feedback in the form of counterexamples provided by Veil. In fact, most of the auxiliary invariants required to verify the agreement and validity of the accompanying Floodset protocol model have been obtained by Claude Code automatically within just a couple of minutes from a prompt “Infer the invariants necessary for #check_invariants to succeed. Use counterexamples provided by the verifier.”

Concluding Remarks

There are quite a few aspects of Veil we haven’t discussed yet, but I hope to elaborate on them in future posts.

For example, even though Veil is just a Lean library, allowing any of its theorems to be proven in Lean directly, we didn’t get to use Lean proof mode and tactics in this tutorial. This is because we got quite lucky with our encoding of Floodset, which allowed for its inductive proof to be obtained fully automatically with the help of SMT solvers (Veil uses cvc5 as the default one), invoked by #check_invariants under the hood. In one of the next posts in this series, we will discuss protocol formalisations in Veil whose verification requires a combination of interactive and automated proofs in Lean.

If you are interested in learning more about Veil’s capabilities as a verifier, you should check out this CAV’25 paper. You can also find some takeaways from implementing Veil on top of Lean in this paper presented at the recent Dafny workshop.

Even though Claude Code was incredibly effective at inferring inductive invariants for proving correctness of a protocol, it was not as reliable as a mechanism for protocol auto-formalisation. This is not due to it getting Veil’s syntax wrong: contrary to my expectations, it managed to guess the meaning of all its unusual syntactic constructions correctly, or quickly corrected all its mistakes with the help of the compiler errors. The main problem is that the protocol definitions it produced in Veil were missing crucial parts, making them vacuously correct but also useless. For example, one of the first versions it produced modelled all crashes in the same action as the broadcast without accounting for partial message delivery, so the resulting protocol would reach agreement immediately after the first round. Issues like these were discovered with multiple runs of Lace to test reachability properties—as discussed above.

Furthermore, Claude Code was prone to overcomplicating things, for instance, by introducing state components that were not necessary for modelling the essence of the protocol, such as a relation representing in-flight messages.

While these shortcomings might be eliminated in future models, my experience so far still demonstrates that the presence of the human expert is crucial for getting a faithful and concise formal specification of a system at the right level of abstraction.

Acknowledgements

I’m grateful to Seth Gilbert, who suggested taking a look at the Floodset consensus protocol as a case study for Veil, and to George Pîrlea and Qiyuan Zhao for their comments on this post.

TLA+ comes with TLAPS (TLA+ Proof System) for writing deductive proofs in a bespoke tactic language, but it’s a separate tool. To the best of our knowledge, it’s not commonly used (although, if you are using it, please get in touch with us: we are curious to learn about your proofs!). It is also not as expressive as modern proof assistants, such as Rocq or Lean. Quint allows for a form of automated correctness proof that is not guaranteed to always succeed, even for correct protocols, due to the limitations of its underlying solvers.
Other famous consensus protocols, such as Paxos or Raft, do not assume synchrony and work under a weaker assumption of asynchronous communication, where one cannot tell the difference between a slow and a “dead” participant, as the messages can take arbitrarily long to deliver. That is, they guarantee safety without any timing assumptions, but, in practice, they also rely on time-outs to make progress.
For those who think in terms of Lean (or Haskell), the instantiate command simply imposes the TotalOrder type class constraint on the elements of the abstract type value.
This notation is inspired by the syntax of the Ivy verifier for distributed protocols.
Veil syntax let x :| p x allows one to constrain the values of a non-deterministically picked x to only those that satisfy the Boolean predicate p. This operator, known as Hilbert’s epsilon operator, is very convenient for modelling constrained non-deterministic choice.

Multi-Modal Program Verification in Velvet

2026-01-21T00:00:00+00:00

In this post, we will show how to specify and verify imperative programs in Lean 4 using Velvet—an embedded verifier, which relies on a combination of automated symbolic and AI-assisted theorem proving techniques.

Disclaimer: Velvet is currently under active development by VERSE Lab, as we are working to improve its expressivity and performance. It’s likely that its codebase will soon change substantially. Nevertheless, the code linked in this post will remain accessible.

Formal program verification is about telling what a program should do without telling how it should do it and then mathematically proving that the program indeed does exactly that. The what is given by a program specification—a logical statement that describes the assumptions on the program’s input and the properties that should hold true about its outcomes.¹

Getting Started: Specifying and Verifying Functional Programs

The Lean 4 theorem prover allows one to write a program and also to state its specification in the form of a mathematical theorem. For instance, the following code fragment shows a function append that concatenates two lists of integers and a theorem append_assoc that states and proves the function’s associativity (meaning that the result of concatenating three lists does not depend on the order in which we perform concatenation—only on the position of the arguments).

def append (xs ys : List Int) : List Int :=
  match xs with
  | []      => ys
  | x :: xs => x :: append xs ys

theorem append_assoc (xs ys zs : List Int) : 
    append (append xs ys) zs = append xs (append ys zs) := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [append, ih]  

The proof of append_assoc is done by induction on the shape of the first argument of the append function (i.e., the list “on the left side” of the concatenation), and its details are not that important for now: it suffices to say that it closely mimics the paper-and-pencil argument that argues for the validity of the theorem’s statement by considering two different ways lists can be constructed.
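As a small companion sketch (not part of the original development), the same induction pattern also shows that the empty list is a right identity for append. The definition is repeated inside a hypothetical namespace AppendSketch so that the snippet stands alone; the theorem name append_nil is likewise an assumption of this sketch.

```lean
namespace AppendSketch

-- Same definition as above, repeated so the snippet is self-contained.
def append (xs ys : List Int) : List Int :=
  match xs with
  | []      => ys
  | x :: xs => x :: append xs ys

#eval append [1, 2] [3, 4]  -- [1, 2, 3, 4]

-- Right identity, proved by the same induction on the first argument:
-- the nil case holds by computation, the cons case uses the hypothesis ih.
theorem append_nil (xs : List Int) : append xs [] = xs := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [append, ih]

end AppendSketch
```

Both cases mirror the two ways a list can be constructed, which is the same shape as the paper-and-pencil argument mentioned above.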

The program append is written in a functional style, in which the result of a program is determined solely by its parameters, and all functions always terminate. Thanks to these restrictions, functional programs are known to be particularly well-suited for formal verification, with simple theorems featuring relatively natural proofs—just as we’ve seen above. Unfortunately, such mathematically “pure” functional programming makes it non-trivial to express so-called imperative features of common programming languages (and even pseudocode languages used in common textbooks for presenting algorithms): potentially non-terminating loops, exceptions, mutable variables, and randomness.

Velvet addresses this gap by providing support for imperative programming within Lean’s verification ecosystem. It is not the only existing verifier for imperative programs—many great tools exist to do exactly that, including Dafny, Verus, and Prusti, to mention just a few. None of those tools, however, allow one to use Lean as a way to orchestrate their verification, which is a unique feature of Velvet—by virtue of it being embedded in Lean. To put it differently, Velvet is a Lean library rather than a standalone tool.

Imperative Programming in Velvet

The code accompanying the rest of the post can be found in this file.

Let’s start our tour of Velvet by using it to implement a simple program that checks whether a given natural number is not prime:

method IsNonPrime (n: Nat) return (result: Bool)
  do
    if n ≤ 1 then
      return true
    let mut i: Nat := 2
    let mut ret: Bool := false
    while i * i ≤ n
    invariant true
    do
      if n % i = 0 then
        ret := true
      i := i + 1
    return ret

Most of the code of IsNonPrime should be self-explanatory, so let us highlight only the unusual parts. First, its result is explicitly named in its signature (as result, but any name that differs from the parameter names will do) for reasons that will soon become clear. Similarly to Python, the scoping of the code is determined by its indentation, and the do keyword starts a new code block. All local variables (introduced using let) and function parameters are immutable by default unless explicitly marked as mut. In the while-loop, right after the condition, we can see the invariant true annotation. For now, it serves no particular purpose except for making the parser happy, so let’s not worry about it and just think of it as a piece of unavoidable boilerplate.

We can immediately test our function by running it and checking its result in VSCode’s Lean InfoView. For instance, executing

#eval (IsNonPrime 42).extract

results in true (42 is indeed non-prime), while running

#eval (IsNonPrime 239).extract

produces the result false, just as expected.

Specifying a Program in Velvet

As the next step, we move from tests, which can only show that a program behaves as expected on specific inputs, to formal specifications stating that a program always does what it’s supposed to do. Despite its tiny size and simplicity, IsNonPrime is surprisingly tricky to specify formally, as it relies on the definition of primality of natural numbers. Thanks to the rich vocabulary of Lean specifications and programming mechanisms, one can define primality in multiple different ways. We are going to do it by first writing a mathematical function that returns the number of divisors of a natural number n:

def countDivisors (n: Nat) : Nat :=
  (List.range (n + 1)).filter (fun d => d > 0 ∧ n % d = 0) |>.length

In plain words, countDivisors first constructs a list of numbers from 0 to n, then keeps only those that are strictly positive and are divisors of n; finally, it returns the size of the resulting list. Indeed, for any prime number countDivisors should return 2: counting 1 and the number itself, i.e., only the trivial divisors. We are going to adopt this as a definition of a prime number, defined in Lean as the following predicate:

def isPrime (n: Nat) : Prop :=
  n > 1 ∧ countDivisors n = 2
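To get a feel for this definition, it can be evaluated on a few small inputs. The sketch below repeats countDivisors inside a hypothetical namespace DivisorSketch so that it runs on its own in plain Lean 4, and uses decide to turn the (decidable) primality statement into a Boolean.

```lean
namespace DivisorSketch

-- Same definition as above, repeated so the snippet is self-contained.
def countDivisors (n : Nat) : Nat :=
  (List.range (n + 1)).filter (fun d => d > 0 ∧ n % d = 0) |>.length

#eval countDivisors 7   -- 2  (divisors: 1, 7)
#eval countDivisors 12  -- 6  (divisors: 1, 2, 3, 4, 6, 12)
#eval countDivisors 1   -- 1  (only 1 itself)

-- The body of isPrime, instantiated at concrete numbers and decided.
#eval decide (7 > 1 ∧ countDivisors 7 = 2)    -- true
#eval decide (12 > 1 ∧ countDivisors 12 = 2)  -- false

end DivisorSketch
```

Note that the conjunct n > 1 is implied by the divisor count being 2 (0 and 1 have zero and one divisors, respectively), but stating it explicitly makes the small cases easy to discharge.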

We are now ready to ascribe a formal specification to IsNonPrime, which we can do by adding a logical statement that starts with ensures right after the function’s signature:

method IsNonPrime (n: Nat) return (result: Bool)
  ensures result ↔ ¬isPrime n
  do
    if n ≤ 1 then
      return true
    let mut i: Nat := 2
    let mut ret: Bool := false
    while i * i ≤ n
    invariant true
    do
      if n % i = 0 then
        ret := true
      i := i + 1
    return ret

The added statement is commonly called a postcondition: it postulates what should hold true about the program’s result at the end of its execution (and that’s why we had to give an explicit name to the result!).² In our example, the postcondition asserts that the function should return true if and only if (denoted as ↔ in Lean) its input n is not prime (¬isPrime n).
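Before attempting a proof, a postcondition like this one can be sanity-checked by brute force. The following is a minimal sketch in plain Lean 4, not Velvet code: isNonPrimeModel is a hypothetical pure function that mimics what the loop computes, and specNonPrime is a Boolean rendering of ¬isPrime n. All names and the namespace are assumptions made for illustration.

```lean
namespace SpecSketch

def countDivisors (n : Nat) : Nat :=
  (List.range (n + 1)).filter (fun d => d > 0 ∧ n % d = 0) |>.length

-- Boolean rendering of the right-hand side of the postcondition, ¬isPrime n.
def specNonPrime (n : Nat) : Bool :=
  decide (¬(n > 1 ∧ countDivisors n = 2))

-- Hypothetical pure model of the loop: the result is true when some i
-- with 2 ≤ i and i * i ≤ n divides n (or when n ≤ 1).
def isNonPrimeModel (n : Nat) : Bool :=
  if n ≤ 1 then true
  else (List.range (n + 1)).any
    (fun i => decide (2 ≤ i) && decide (i * i ≤ n) && n % i == 0)

-- Compare the model against the specification on all inputs below 200.
#eval (List.range 200).all (fun n => isNonPrimeModel n == specNonPrime n)  -- true

end SpecSketch
```

Such a check only covers finitely many inputs, so it can reveal a wrong specification or a wrong program but cannot replace the proof that follows.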

Let us now go ahead and try to verify that the desired property does indeed hold for any input n passed to IsNonPrime. This can be done by adding the following command to the file after the function definition:³

prove_correct IsNonPrime by
  loom_solve!

Sadly, this does not immediately verify our program. Running the loom_solve! proof tactic, however, provides us with a very helpful piece of information: a Lean proof context shown in VSCode InfoView, which states what exactly could not be proven about the program:

n i : ℕ
result : Bool
if_neg : 1 < n
done_1 : n < i * i
⊢ result = true ↔ ¬isPrime n

In short, this is what remains to be proven: when the program’s input n is strictly larger than 1 (indicated by the hypothesis if_neg), the returned result matches our specification.

What about the case when n is 0 or 1? In fact, this case, corresponding to taking the then-branch in the program’s body, was indeed proven by loom_solve!, and this happened automatically! To see this, let us step back and ask Velvet for all facts that need to be proven to establish the desired postcondition of IsNonPrime. This can be done by adding the following line right above prove_correct ...:

set_option loom.solver "custom"

Doing so makes prove_correct produce the following collection of Lean statements:

-- Goal 1
n : ℕ
if_pos : n ≤ 1
⊢ ¬isPrime n

-- Goal 2
n : ℕ
if_neg : ¬n ≤ 1
i : ℕ
ret : Bool
if_neg_1 : ¬i * i ≤ n
⊢ ¬i * i ≤ n

-- Goal 3
n i : ℕ
ret : Bool
if_neg : 1 < n
done_1 : n < i * i
⊢ ret = true ↔ ¬isPrime n

We have already seen the third statement, so where did the first two come from? Looking closely, one can notice that the first one corresponds to the fact that needs to be proven to justify IsNonPrime returning true for n ≤ 1. The correctness of this statement follows immediately from the definition of isPrime, which we have defined above. The second goal comes from the requirement, imposed by Velvet, to show that the loop condition (in this case, i * i ≤ n) is indeed negated at the end of the loop execution, and it holds trivially. Both goals are proven by loom_solve! automatically using a combination of SMT solvers (such as Z3 and cvc5) and Lean’s own proof automation (most prominently, the grind and aesop tactics).
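To illustrate why the first goal is considered easy, here is a hand-written proof of it in plain Lean 4. This is only a sketch under stated assumptions: the definitions are repeated inside a hypothetical namespace GoalSketch, Nat is used in place of ℕ, and the theorem name goal_one is made up; it is not the proof that loom_solve! actually finds.

```lean
namespace GoalSketch

def countDivisors (n : Nat) : Nat :=
  (List.range (n + 1)).filter (fun d => d > 0 ∧ n % d = 0) |>.length

def isPrime (n : Nat) : Prop :=
  n > 1 ∧ countDivisors n = 2

-- Goal 1: an input n ≤ 1 is never prime.
theorem goal_one (n : Nat) (if_pos : n ≤ 1) : ¬isPrime n := by
  intro h
  unfold isPrime at h
  -- Only the first conjunct is needed: n > 1 contradicts n ≤ 1.
  obtain ⟨hgt, _⟩ := h
  omega

end GoalSketch
```

The proof never looks inside countDivisors, which is why linear-arithmetic automation is enough here.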

The demonstrated ability to automatically prove some of the facts required to verify a program, while leaving others open for an interactive proof, is one of the key advantages of Velvet compared to state-of-the-art program verifiers, which typically provide only an automated or only an interactive verification mode. The former suffer from the lack of debugging information when a proof fails, while the latter make most of the proofs (even of trivial facts) extremely laborious. Velvet combines the strengths of both modes, thus providing a multi-modal verification experience.

Dealing with Unbounded Executions: Loop Invariants

To prove the only interesting statement about IsNonPrime, we will have to do a bit more work and provide so-called loop invariants: assertions such that (1) they hold right before the start of the loop’s execution, (2) if they hold true at the start of a loop iteration, they also hold true at the end of it, and (3) when combined with the negation of the loop condition, they allow deriving the program’s postcondition. From the mathematician’s perspective, loop invariants are very similar to an induction hypothesis, which also needs to hold in the base case and must be re-established for the induction step. In this case, (1) is the equivalent of proving the base case, (2) is the induction step, and (3) is using the statement proven by induction (i.e., the invariant) to prove the desired fact. So, if you are familiar with proofs by induction, you can simply think of a combination of inductive loop invariants as an induction hypothesis necessary to prove that certain facts are true about our program’s state no matter how many iterations the loop makes.
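The analogy with induction can be made precise in a few lines of plain Lean 4. The sketch below is an illustration, not part of Velvet: a loop is modelled as a hypothetical function iter that applies a state transformer step a given number of times, and the theorem inv_iter shows that a property preserved by every step holds after any number of steps.

```lean
namespace InductionSketch

-- Apply `step` to the state `s` exactly k times.
def iter {σ : Type} (step : σ → σ) : Nat → σ → σ
  | 0, s => s
  | k + 1, s => iter step k (step s)

-- Requirement (2), preservation by one step, lifts to any number of steps
-- by induction on k; requirement (1) is the hypothesis `Inv s` on the start state.
theorem inv_iter {σ : Type} (step : σ → σ) (Inv : σ → Prop)
    (hstep : ∀ s, Inv s → Inv (step s)) :
    ∀ (k : Nat) (s : σ), Inv s → Inv (iter step k s) := by
  intro k
  induction k with
  | zero =>
    intro s h0
    exact h0
  | succ k ih =>
    intro s h
    exact ih (step s) (hstep s h)

end InductionSketch
```

Requirement (3) is then an ordinary implication: from the invariant at the exit state and the failed loop condition, derive the postcondition.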

Sadly, there is no algorithm to reliably infer loop invariants for any program and its provable postcondition, a fact that directly follows from Rice’s Theorem, which states that any non-trivial semantic property of programs (e.g., the existence of their loop invariants for a given postcondition) is undecidable. However, invariants can frequently be conjectured by using our understanding of what the loop does (yes, you can also try to do it using your favourite AI system) and then verified using formal proof techniques to satisfy the requirements (1)-(3) listed above. This is what we are going to do for now. Let us change invariant true in the program, so it will look as follows:

method IsNonPrime (n: Nat) return (result: Bool)
  ensures result ↔ ¬isPrime n
  do
    if n ≤ 1 then
      return true
    let mut i: Nat := 2
    let mut ret: Bool := false
    while i * i ≤ n
    invariant 2 ≤ i
    invariant (ret = false ↔ ∀ d, 2 ≤ d ∧ d < i → n % d ≠ 0)
    invariant (i - 1) * (i - 1) ≤ n
    do
      if n % i = 0 then
        ret := true
      i := i + 1
    return ret

The first invariant states that at the start and at the end of each loop iteration, the variable i is at least 2. The second one is the most interesting: it states that the variable ret is false if and only if n has no divisor d with 2 ≤ d < i. Finally, the third invariant simply states that the square of i - 1 is not larger than n. Because of the requirement (2), adding these invariants made the job of our verifier quite a bit harder: now we need to prove that each of these invariants is preserved by a loop iteration. As a result, if we once again add

set_option loom.solver "custom"

in front of our proof script (starting with prove_correct IsNonPrime by), we will see that loom_solve! leaves a whopping 15 facts to prove! The good news is that most of them are no match for Lean’s proof automation, and can be solved without any involvement required from our side. So let us comment out the option set_option loom.solver "custom" and run this script instead:

prove_correct IsNonPrime by
  loom_solve; try simp_all

Now it leaves us with just a single fact to prove, which looks strikingly similar to the one we have struggled with before:

n i : ℕ
ret : Bool
i_1 : ℕ
ret_1 : Bool
if_neg : 1 < n
invariant_1 : 2 ≤ i_1
invariant_2 : ret_1 = false ↔ ∀ (d : ℕ), 2 ≤ d → d < i_1 → ¬n % d = 0
invariant_3 : (i_1 - 1) * (i_1 - 1) ≤ n
done_1 : n < i_1 * i_1
i_2 : i = i_1 ∧ ret = ret_1
⊢ ret_1 = true ↔ ¬isPrime n

This time, however, we are in a much better situation, as, thanks to our (or rather Velvet’s and Lean’s) hard work, we have the three invariants (invariant_1, invariant_2, and invariant_3) available to us as assumptions, meaning we can use them to verify the much-desired fact about the function’s result.
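As an aside, candidate invariants like the three used here can be sanity-checked on concrete runs before (or alongside) proving them. The following plain Lean 4 sketch simulates the loop as a pure function and evaluates a Boolean rendering of the invariants at every loop head. The names step, run, and invOk, as well as the fuel parameter bounding the number of iterations, are assumptions of this sketch and not part of Velvet.

```lean
namespace InvariantSketch

-- One iteration of the loop body, acting on the pair (i, ret).
def step (n : Nat) (s : Nat × Bool) : Nat × Bool :=
  (s.1 + 1, if n % s.1 = 0 then true else s.2)

-- States observed at the loop head, including the exit state;
-- `fuel` bounds the number of iterations.
def run (n : Nat) : Nat → Nat × Bool → List (Nat × Bool)
  | 0, s => [s]
  | fuel + 1, s =>
    if s.1 * s.1 ≤ n then s :: run n fuel (step n s) else [s]

-- Boolean rendering of the three invariants for the state (i, ret).
def invOk (n : Nat) (s : Nat × Bool) : Bool :=
  let noDivisorBelow := (List.range s.1).all (fun d => decide (d < 2) || n % d != 0)
  decide (2 ≤ s.1) && ((!s.2) == noDivisorBelow) && decide ((s.1 - 1) * (s.1 - 1) ≤ n)

#eval run 35 35 (2, false)
-- [(2, false), (3, false), (4, false), (5, false), (6, true)]

#eval (run 35 35 (2, false)).all (invOk 35)  -- true

-- The same check for every n between 2 and 299.
#eval (List.range 300).all
  (fun n => decide (n ≤ 1) || (run n n (2, false)).all (invOk n))  -- true

end InvariantSketch
```

A failing check would point to a wrong invariant early on; a passing one says nothing about inputs outside the tested range, which is what the proof is for.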

Unleashing AI-Powered Proof Automation

For now, the only thing that stands between us and proving the correctness of IsNonPrime with regard to its specification is the statement above, which, roughly, says that a number n > 1 is prime if and only if it has no divisors between 2 and its integer square root (i_1 - 1, as invariant_3 and done_1 together imply). While somewhat obvious from our understanding of the mathematics of division, this fact is far from trivial, mostly because we talk about enumerating all potential divisors not between 2 and $n$ but between 2 and $\sqrt{n}$. Since this requires number-theoretic reasoning beyond what SMT solvers or Lean’s grind tactic handle well, it’s time to bring in AI-powered proof automation.
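The mathematical core of this statement can be sketched on paper as follows (an informal argument, not the proof produced by the AI systems). Suppose $n > 1$ is not prime, so that $n = a \cdot b$ for some $1 < a \le b < n$. Then

\[a^2 \le a \cdot b = n,\]

so the smaller factor $a$ satisfies $2 \le a \le \lfloor\sqrt{n}\rfloor$. The hypotheses invariant_3 and done_1 state that

\[(i_1 - 1)^2 \le n < i_1^2,\]

i.e., $i_1 - 1 = \lfloor\sqrt{n}\rfloor$, hence $2 \le a < i_1$ and invariant_2 forces ret_1 to be true. Conversely, any divisor $d$ of $n$ with $2 \le d < i_1$ satisfies $d < d^2 \le (i_1 - 1)^2 \le n$, so it is a divisor different from both $1$ and $n$, and $n$ is not prime. The remaining (and tedious) part of the formal proof is connecting this argument to the list-based definition of countDivisors.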

First, let us “hoist” the statement we are willing to prove as a separate theorem called remaining_goal (any name will do), whose proof is omitted for now via Lean’s sorry keyword:

theorem remaining_goal
(n : ℕ)
(i : ℕ)
(ret : Bool)
(i_1 : ℕ)
(ret_1 : Bool)
(if_neg : 1 < n)
(invariant_1 : 2 ≤ i_1)
(invariant_2 : ret_1 = false ↔ ∀ (d : ℕ), 2 ≤ d → d < i_1 → ¬n % d = 0)
(invariant_3 : (i_1 - 1) * (i_1 - 1) ≤ n)
(done_1 : n < i_1 * i_1)
(i_2 : i = i_1 ∧ ret = ret_1)
: ret_1 = true ↔ ¬isPrime n :=
  by sorry

With this theorem, the verification of IsNonPrime can be accomplished simply via the following script, which makes use of remaining_goal:

prove_correct IsNonPrime by
  loom_solve; try simp_all
  apply (remaining_goal n i ret i_1 ret_1 if_neg invariant_1 invariant_2 invariant_3 done_1 i_2)

Thanks to the multi-modal nature of Velvet, we could prove remaining_goal manually by writing a Lean proof script, as its statement is, in fact, true. Here, however, we will take advantage of existing AI-based proof automation systems.

In this experiment, which took place in early January 2026, I tried both Claude Code and Harmonic’s Aristotle. In the case of Claude Code, I simply asked it to complete all the proofs in the file, making sure there are no sorrys and that the file compiles. For Aristotle, I constructed a file that only contained the theorem and the definitions required by it, doing a quick clean-up from Velvet’s specific libraries, as those are not necessary for the proof at this point and might prevent the system from accepting the file. Each of the two systems was able to accomplish the proof in approximately 20 minutes.

The resulting proofs of this theorem are highly non-idiomatic and I won’t be posting them here to spare the reader’s eyes. In case you’re still curious, the Lean development accompanying this post, including the proof produced by Aristotle, can be found at this link. The repository also contains a number of other curious examples in Velvet, including a highly non-trivial correctness proof of a memory allocator, which was offered as a problem in the recent Theorem Proving Competition held in Wuhan in November 2025.

Concluding Remarks

In this post, we’ve had a brief introduction to the principles of specification and verification of imperative programs in Velvet—a multi-modal program verifier implemented as a library on top of Lean. Velvet combines automated and interactive reasoning modes, attempting to discharge the majority of proof obligations needed for program correctness using existing symbolic automation techniques (such as SMT solvers and Lean’s grind tactic), leaving “complex” facts to be proven by other means, such as writing a proof script manually. In the latter case, present-day AI systems, such as Claude Code and Aristotle, prove to be quite capable of completing the proofs involving mathematical statements with close to no human intervention. This combination of symbolic automation, interactive theorem proving, and AI assistance hints at a future where rigorous program verification becomes accessible to a much wider audience of non-experts.

Hello, World!

2026-01-10T00:00:00+00:00

Welcome to Proofs and Intuitions! This is a blog about mathematics, formal verification, and the ideas that connect them.

A Bit of Mathematics

Let’s start with something beautiful. The most famous equation in physics is probably Einstein’s mass-energy equivalence: $E = mc^2$. But in pure mathematics, Euler’s identity takes the crown:

\[e^{i\pi} + 1 = 0\]

This single equation connects five fundamental constants: $e$, $i$, $\pi$, $1$, and $0$. Truly remarkable!
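For readers wondering where the identity comes from, it is a special case of Euler’s formula, which relates the complex exponential to the trigonometric functions:

\[e^{i\theta} = \cos\theta + i\sin\theta.\]

Setting $\theta = \pi$ gives

\[e^{i\pi} = \cos\pi + i\sin\pi = -1 + i \cdot 0 = -1,\]

and adding $1$ to both sides yields the identity above.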

Here’s another classic: the quadratic formula. For any equation of the form $ax^2 + bx + c = 0$ with $a \neq 0$, the solutions are:

\[x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]
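As a quick worked example, take $x^2 - 5x + 6 = 0$, that is, $a = 1$, $b = -5$, and $c = 6$:

\[x = \frac{5 \pm \sqrt{(-5)^2 - 4 \cdot 1 \cdot 6}}{2 \cdot 1} = \frac{5 \pm \sqrt{1}}{2},\]

so the two solutions are $x = 3$ and $x = 2$, which can be checked by substitution: $9 - 15 + 6 = 0$ and $4 - 10 + 6 = 0$.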

Lean 4: Theorem Proving

One of the exciting developments in modern mathematics is the use of proof assistants like Lean. These tools allow us to write mathematical proofs that can be mechanically verified by a computer.

VSCode with Lean 4: proving correctness of the Euclidean GCD algorithm. The InfoView panel on the right shows the current proof state with three goals remaining.

Here’s a simple example. In Lean 4, we can define natural number addition and prove basic properties. For instance, we can express that 0 + n = n using the Nat.zero_add theorem.

A simple inline reference: the term Nat.succ n represents the successor of n, i.e., n + 1.
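Both facts can be stated and checked directly; the following is a minimal sketch in plain Lean 4 that relies only on the core library.

```lean
-- Nat.zero_add is a library theorem: 0 + n = n for every natural number n.
example (n : Nat) : 0 + n = n := Nat.zero_add n

-- Nat.succ n and n + 1 are equal by definitional unfolding, so rfl suffices.
example (n : Nat) : Nat.succ n = n + 1 := rfl

#eval Nat.succ 41  -- 42
```

The second fact holds by computation because addition on Nat is defined by recursion on its second argument, whereas 0 + n = n is not definitional and needs the theorem (itself proved by induction).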

Here’s a small Lean 4 proof that addition is commutative:

theorem add_comm (n m : Nat) : n + m = m + n := by
  induction n with
  | zero => simp [Nat.zero_add, Nat.add_zero]
  | succ n ih => simp [Nat.succ_add, Nat.add_succ, ih]

And here’s a proof that demonstrates the associativity of addition:

theorem add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
  induction a with
  | zero => simp [Nat.zero_add]
  | succ a ih => simp [Nat.succ_add, ih]

The Joy of Discovery

There’s a special feeling when a proof finally clicks—when the pieces fall into place and you see why something must be true, not just that it is true.

That moment of clarity is what this blog is about. We’ll explore proofs, develop intuitions, and hopefully have some fun along the way.

Stay tuned for more!
