Increasingly, I see people who’ve been struggling with LLM-based coding assistants reaching the conclusion that what’s needed is “better” specifications.
If you were to ask me what might make a specification “better”, I’d probably say:
- Less ambiguous – less open to multiple valid interpretations
- More complete – fewer gaps where expected system behaviour and other properties are left undefined
- More consistent – fewer contradictions (e.g., Requirement #1: “Users can opt in to notifications”, Requirement #77: “By default, notifications must be on”)
Of these three factors, ambiguity is top of my list. It can mask contradictions and paper over gaps. When requirements are ambiguous, that takes us into physicist Wolfgang Pauli’s “not even wrong” territory.
It’s hard to know what the software’s supposed to do, and hard to know when it’s not doing it. This is why so many testers tell me that a large part of their job is figuring out what the requirements were in the first place. (Pro tip: bring them into those discussions.)
An ideal software specification therefore has no ambiguity. It’s not open to multiple interpretations. This enables us to spot gaps and inconsistencies more easily. But more importantly, it enables us to know with certainty when the software doesn’t conform to the specification.
We can never know, of course, that it always conforms to the specification. That would require infinite testing in most cases. But it only needs one test to refute it – and that requires the specification to be refutable.
So I guess when I talk about a “better” specification, I’m talking mostly about refutability.
“Precise”. You Keep Using That Word.
Refutability requires precision. And this is where our natural languages let us down. Try as we might to articulate rules in “precise English” or “precise French” or “precise Cantonese”, these languages haven’t evolved for precision.
Language entropy – the tendency of natural language statements to have multiple valid interpretations, and therefore uncertain meaning – is pretty inescapable.
For completely unambiguous statements, we need a formal language – a language with precisely-defined syntax – with formal semantics that precisely define how that syntax is to be interpreted. Statements made with these can have one – and only one – interpretation. It’s possible to know with certainty when an example contradicts it.
Computer programmers are very familiar with these formal systems. Programming languages are formal languages, and compilers and interpreters endow them with formal semantics – with precise meaning.
I half-joke, when product managers and software designers ask me where they can find good examples of complete software specifications, that they should look on GitHub. It’s full of them.
It’s only half a joke because it’s literally true that program source code is a program specification, not an actual program. It expresses all of the rules of a program in a formal language, which is then translated into lower-level formal languages like x86 assembly language or machine code. These in turn are interpreted into even lower-level representations, until eventually they’re interpreted by the machine itself – the ultimate arbiter of meaning.
It’s turtles all the way down, and given a specific stack of turtles, meaning – hardware failures notwithstanding – is completely predictable. The same source code, compiled by the same compiler, executed by the same CPU, will produce the same observable behaviour.
So we have a specification that’s refutable and predictable. The same rules will produce the same behaviour every time, and we can know with certainty when examples break the rules.
But, of course, a computer program does what it does. It will always conform to its program specification, expressed in Java or Python or – okay, maybe not JavaScript – or Go. That doesn’t mean it’s the right program.
So we need to take a step back from the program. Sure, it does what it does. But what is it supposed to do?
Remember those turtles? Well, it would be a mistake to believe the program source code is at the top of the stack. To meaningfully test whether we wrote the right program, we need another formal specification (and I use those words advisedly) that describes the desired properties of the program without being part of the program itself.
Let’s think of a simple example. If I have a program that withdraws money from a bank account, and my customer and I agree that withdrawal amounts must be more than zero, and the account needs to have sufficient funds to cover them, we might specify that withdrawals should only happen when that’s true.
In informal language, a precondition of any withdrawal is that the amount must be greater than zero, and the balance must be greater than or equal to the amount being withdrawn. If the withdraw function is invoked when that condition isn’t met, the program is wrong.
To remove any ambiguity, I would wish to express that in a formal language. I could do it in a programming language. I could insert an assertion at the start of the withdraw function that checks the condition and, e.g., throws an exception if it’s not satisfied, or halts execution during testing and reports an error.
e.g., “defensive programming” in Python (we can talk in another blog post about what terrible UX design this is – yes, UX design. In the code. Bazinga!):
```python
def withdraw(self, amount):
    if amount <= 0:
        raise InvalidAmountError()
    if self.balance < amount:
        raise InsufficientFundsError()
    self.balance -= amount
```
e.g., using inline assertions that are checked during testing:
```python
def withdraw(self, amount):
    assert amount > 0
    assert self.balance >= amount
    self.balance -= amount
```
These approaches are fine, but they’re not a great way to establish what those rules are with our customer in the first place. Are we going to sit down with them and start writing code to capture the requirements?
In the late 1980s, formal languages started to appear specifically with the aim of creating precise external specifications of correct behaviour that aren’t part of the code at all.
The first I used was Z. Z was a notation founded on predicate logic and set theory. Here’s an artist’s impression of a Z specification that ChatGPT hallucinated for me.

Not the most customer-friendly of notations. Other formal specification languages attempted to be more “business-friendly”, like the Object Constraint Language:
```
context BankAccount::withdraw(amount: Real)
pre:  amount > 0
pre:  balance >= amount
post: balance = balance@pre - amount
```
These OCL constraints were designed to extend UML models to make their meaning more precise. I remember being told that it was designed to be used by business people. I found that naivety endearing.
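The same pre- and postconditions can also be checked mechanically in code. Here’s a rough design-by-contract style sketch in Python – the `contract` decorator and its parameters are illustrative, not any particular contracts library, and it’s hard-wired to `withdraw`’s signature for brevity:

```python
import functools

def contract(pre=None, post=None):
    """Check a precondition before the call and a postcondition after it.

    `pre` receives the function's arguments; `post` additionally receives
    a snapshot of the balance captured before the call (OCL's balance@pre).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, amount):
            assert pre is None or pre(self, amount), "precondition violated"
            balance_at_pre = self.balance  # corresponds to OCL's balance@pre
            result = fn(self, amount)
            assert post is None or post(self, amount, balance_at_pre), \
                "postcondition violated"
            return result
        return wrapper
    return decorator

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    @contract(
        pre=lambda self, amount: amount > 0 and self.balance >= amount,
        post=lambda self, amount, balance_at_pre:
            self.balance == balance_at_pre - amount,
    )
    def withdraw(self, amount):
        self.balance -= amount
```

A general-purpose version would capture arbitrary arguments, but the point is the same: the contract lives alongside the code, not in the customer’s language.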
To cut a long story short, while formal specification certainly found a home in the niche of high-integrity and critical systems engineering, that same snow never settled on the plains of business and requirements analysis and everyday software development. We were expecting business stakeholders to become programmers. That rarely works out.
But for a time, I used formal specifications – luckily, my customers were electronics engineers and not marketing executives, so most already had programming experience.
Tests As Specifications
We’d firm up a specification using a combination of Z and the Object Modeling Technique (UML wasn’t a thing then) describing precisely what a feature or a function needed to do.
Then I’d analyse that specification and choose test examples.
```
BankAccount::withdraw

Example #1: invalid amount
  amount = 0
  Outcome: throws InvalidAmountError

Example #2: valid amount and sufficient funds
  amount = 50.0
  balance = 50.0
  Outcome: balance = 0.0

Example #3: insufficient funds
  amount = 50.01
  balance = 50.0
  Outcome: throws InsufficientFundsError
```
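Those examples translate directly into executable tests. A minimal sketch in plain Python, assuming a `BankAccount` class with the exception types from the earlier examples:

```python
class InvalidAmountError(Exception):
    pass

class InsufficientFundsError(Exception):
    pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        # Precondition: amount > 0
        if amount <= 0:
            raise InvalidAmountError()
        # Precondition: balance >= amount
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

# Example #1: invalid amount -> throws InvalidAmountError
try:
    BankAccount(balance=50.0).withdraw(0)
    assert False, "expected InvalidAmountError"
except InvalidAmountError:
    pass

# Example #2: valid amount and sufficient funds -> balance = 0.0
account = BankAccount(balance=50.0)
account.withdraw(50.0)
assert account.balance == 0.0

# Example #3: insufficient funds -> throws InsufficientFundsError
try:
    BankAccount(balance=50.0).withdraw(50.01)
    assert False, "expected InsufficientFundsError"
except InsufficientFundsError:
    pass
```

Each test refutes the specification if it fails – which is exactly the property we wanted.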
It turned out that business stakeholders can much more easily understand specific examples than general rules expressed in formal languages. So we flipped the script, and explored examples first, and then generalised them to a formal specification.
It was when I first started learning about “test-first design”, one of the practices of the earliest documented versions of Extreme Programming, that the lightbulb moment came.
If we’ve got tests, do we need the formal specifications at all? Maybe we could cut out the middle-man and go straight to the tests?
This often works well – exploring the precise meaning of requirements using test examples – with non-programming stakeholders.
And many people are discovering that including test examples in our prompts helps LLMs match more accurately by reducing the search space of code patterns. It turns out that models are trained on code samples that have been paired with usage examples (tests, basically), so including examples in the prompt gives them more to match on.
So, if you were to ask me what might make a specification for LLM code generation “better”, I’d definitely say “tests”. (And there was you thinking it was the LLM’s job to dream up tests.)
Visualising The Gaps
That helps reduce ambiguity and the risk of misinterpretation, but what of completeness and consistency?
This is where some kind of generalisation is really needed, but it doesn’t have to take us down the Z or OCL road. What we really need is a way to visualise the state space of the problem.
One simple technique I’ve used to good effect is a decision table. This helps me to see how the rules of a function or an action map to different outcomes.

Here, I’ve laid out all the possible combinations of conditions and mapped them to specific outcomes. There’s one simplification we can make – if the amount isn’t greater than zero, we don’t care if the account has sufficient funds.

That maps exactly on to my three original test cases, so I’m confident they’re a complete description of this withdraw function.
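One way to keep the decision table and the tests in lockstep is to drive the tests directly from the table’s rows. A sketch in plain Python, assuming the guard-clause `withdraw` from earlier – the table layout here is my own, not a fixed format:

```python
class InvalidAmountError(Exception):
    pass

class InsufficientFundsError(Exception):
    pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise InvalidAmountError()
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

# One row per decision table rule: (amount, starting balance, expected outcome)
DECISION_TABLE = [
    (0,     50.0, InvalidAmountError),      # amount not > 0 (funds irrelevant)
    (50.0,  50.0, 0.0),                     # amount > 0, sufficient funds
    (50.01, 50.0, InsufficientFundsError),  # amount > 0, insufficient funds
]

for amount, balance, expected in DECISION_TABLE:
    account = BankAccount(balance)
    if isinstance(expected, type) and issubclass(expected, Exception):
        # The rule maps to an error outcome
        try:
            account.withdraw(amount)
            raised = None
        except Exception as e:
            raised = type(e)
        assert raised is expected, f"expected {expected.__name__}"
    else:
        # The rule maps to a new balance
        account.withdraw(amount)
        assert account.balance == expected
```

If a new rule turns up in conversation with the customer, it becomes a new row – and a new test – in one move.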
Mapping it out like this and exploring test cases encourages us to clarify exactly what the customer expects to happen. When the amount is greater than the balance, exactly what should the software do? It forces us and our customers to consider details that probably wouldn’t have come up otherwise.
Other tools we can use to visualise system behaviour and rules include Venn diagrams (have we tested every part of the diagram?), state transition diagrams and state transition tables (have we tested every transition from every state?), logic flow diagrams (have we tested every branch and every path?), and good old-fashioned truth tables – the top half of a decision table.
Isn’t This Testing?
“But, Jason, this sounds awfully like what testers do!”
Yup 🙂
Tests are to specifications what experiments are to hypotheses.
If I say “It should throw an error when the account holder tries to withdraw more than their balance” before any code’s been written to do that, I’m specifying what should happen. Hypothesis.
If I try to withdraw £100 from an account with a balance of £99, then that’s a test of whether the software satisfies its specification. It’s a test of what does happen. Experiment.
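That experiment, written down – a minimal sketch assuming the withdraw behaviour specified above:

```python
class InsufficientFundsError(Exception):
    pass

class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount

# Hypothesis: withdrawing more than the balance throws an error.
# Experiment: try to withdraw £100 from an account holding £99.
account = BankAccount(balance=99.0)
try:
    account.withdraw(100.0)
    outcome = "no error"
except InsufficientFundsError:
    outcome = "InsufficientFundsError"

assert outcome == "InsufficientFundsError"
assert account.balance == 99.0  # the failed withdrawal left the balance untouched
```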
This is why I strongly recommend teams bring testing experts into requirements discussions. You’re far more likely to get a complete specification when someone in the room is thinking “Ah, but what if A and B, but not C?”
You can, of course, learn to think more like a tester. I did, so it can’t be that hard.
But there’s really no substitute for someone with deep and wide testing experience in the room.
If a function or a feature is straightforward, we can probably figure out what test cases we’d need to cover in our heads. My initial guesses at tests for the withdraw function were pretty good, it turned out.
But when they’re not straightforward, or when the scenario’s high risk, I’ve found these techniques very valuable.
As a bottom line, I’ve found that tests of some kind are table stakes. They’re the least I’ll include in my specification.
Shared Language
Another thing I’ve found that helps to minimise misinterpretations is establishing a shared model of the concepts we’re talking about in our specifications.
In a training exercise I run often, pairs are asked to use Test-Driven Development to create a simple online retail program. They’re given a set of requirements expressed in plain English and the idea is that they agree tests with the customer (one of them plays that role) to pin down what they think the requirements mean.
e.g.
• Add item – add an item to an order. An order item has a product and a quantity. There must be sufficient stock of that product to fulfil the order
• Total including shipping – calculate the total amount payable for the order, including shipping to the address
• Confirm – when an order is confirmed, the stock levels of every product in the items are adjusted by the item quantity, and then the order is added to the sales history.
A couple of years back, I changed the exercise by giving them a “walking skeleton” – essentially a “Hello, world!” project for their tech stack with a dummy test and a CI build script set up and ready to go – to get them started.
And in that project I added a bare-bones domain model – just classes, fields and relationships – that modeled the concepts used in the requirements.
In UML, it looked something like this.

Before I added a domain model, pairs would come up with distinctly different interpretations of the requirements.
With the addition of a domain model, 90% of pairs would land on pretty much the same interpretation. Such is the power of a shared conceptual model of what it is we’re actually talking about.
It doesn’t need to be code or a UML diagram – but some expression, in a form we can hopefully all understand, of the concepts in our requirements and how they’re related evidently cuts out a lot of misunderstandings.
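In code, such a bare-bones model really can be just classes, fields and relationships. A sketch of the retail exercise’s concepts in Python – the class and field names here are illustrative, not the exact model from my training materials:

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    name: str
    price: float
    stock: int

@dataclass
class Address:
    lines: list[str]

@dataclass
class OrderItem:
    # An order item has a product and a quantity
    product: Product
    quantity: int

@dataclass
class Order:
    # Shipping is calculated to this address
    address: Address
    items: list[OrderItem] = field(default_factory=list)

@dataclass
class SalesHistory:
    # Confirmed orders end up here
    orders: list[Order] = field(default_factory=list)
```

No behaviour at all yet – and that’s the point. It pins down the vocabulary before anyone argues about logic.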
Precision In UX & UI Design
And, of course, if we’re trying to describe a user interface, pictures can really help there. Wireframes and mock-ups are great, but if we’re trying to describe dynamic behaviour – what happens when I click that button? – I highly recommend storyboards.
A storyboard is just a sequence of snapshots of the UI in specific test scenarios that illustrates what happens with each user interaction. Here’s a great example.

It’s another way of visualising a test case, just from the user’s perspective. In that sense, it can be a powerful tool in user experience design, helping stakeholders to come to a shared understanding of the user’s journey, and potentially revealing problems with the design early.
Precision != BDUF
Before anybody jumps in with accusations of Big Design Up-Front (BDUF), a quick reminder that I would never suggest trying to specify everything, then implement it, then test it, then merge and release it in one pass. I trust you know me better than that.
When clarity’s needed, I have a pretty well-stocked toolbox of techniques for providing it, as and when it’s needed in a highly iterative process delivering working software in thin slices – one feature at a time, one scenario at a time, one outcome at a time, and one example at a time. Solving one problem at a time in tight feedback loops.
Taking small steps with continuous feedback and opportunities to steer is highly compatible with working with LLM-based coding assistants. It’s actually kind of essential, really. Folks talking about specifying, say, a whole feature “precisely” and then leaving the agent(s) to get on with it are… Well, you probably know what I think. I’ve seen those trains come off the rails so many times.
And with each step, I stay on-task. I’ll rarely, for example, model domain concepts that aren’t involved in the test cases I’m working on. I’m not one of these “First, I model ALL THE THINGS, then I think about the user’s goals” guys.
And using tests as specifications goes hand-in-glove with a test-driven approach to development, which you may have heard I’m quite partial to.
Believe it or not, agility and precision are completely compatible. How precise you’re being, and the size of the steps you’re taking that end in user feedback from working software, are orthogonal concerns. If you look in the original XP books, you’ll even find – gasp! – UML diagrams.
Hopefully you get some ideas about the kinds of things we can include in a specification to make it more precise, more complete and more consistent.
But at the very least, you might begin to rethink just how good your current specifications actually are.
Prompts Aren’t Code and LLMs Aren’t Compilers
One final thought. The formal systems of computer programming – programming languages, compilers, machine code and so on – and the “turtles” in an LLM-based stack are very different.
Prompts – even expressed in formal languages – aren’t code, and LLMs aren’t compilers. They will rarely produce the exact same output given the exact same input. It’s a category mistake to believe otherwise.
This means that no matter how precise our inputs are, they will not be processed precisely or predictably. Expect surprises.
But less ambiguity will – and I’ve tested this a lot – reduce the number of surprises. And refutability gives us a way to spot the brown M&Ms in the output more easily.
It’s easier to know when the model got it wrong.