If speculative ideas can not be tested, they’re not science; they don’t even rise to the level of being wrong.
Wolfgang Pauli
When we interact with a language model, we’re communicating in natural language. And communicating in natural language is a lossy process.
There’s the meaning I intended, and then there’s the meaning the model interprets, and they’re often not the same thing.
Many bad things have happened in the world because the receiver misinterpreted the intent of the sender. So it’s important to know with high confidence if we’ve grabbed the wrong end of the stick.
The inherent ambiguity of natural languages works against our desire to make our meaning clear.
In real-world communication, a simple technique to uncover misunderstandings is to test interpretations to see if they satisfy the original intent.
Including a test in an instruction given to an LLM serves two useful purposes:
- It restricts pattern-matching to outputs that also match the test, and not just the natural language instruction. Coding models are actually trained on code samples paired with tests of some kind, and more recently test execution has been used as a reward function in reinforcement learning. LLMs are sort of built for tests.
- It potentially gives us a direct way to check if the output doesn’t satisfy the intent. If our success criteria are turned into executable tests – e.g. unit tests – then we can run them against the output and get immediate feedback.
Imagine we want our LLM to generate code to add items to an online shopping basket. I regularly see prompts that look something like this.
Please generate a Python function for adding items to a shopping
basket. It should take product and quantity as parameters.
But the devil’s in the detail. What exactly are we expecting to happen when the function adds the item? How will we know if it doesn’t happen the way we intended?
I’ve been providing BDD-style tests in my contexts, along the lines of:
Given an empty basket,
And the customer has selected the product with ID 811 and stock of 3
When the customer adds the product to the basket with quantity 2
Then a new order item is added to the basket with product 811 and quantity 2
And 2 of product 811’s stock are put on hold, leaving available stock of 1
This gives the LLM much more to go on regarding the expected behaviour – the precise intent – of adding an item to the basket.
And it can be directly translated into unit tests:
class AddToBasket(unittest.TestCase):
    def test_order_item_is_added(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)

        item = basket[0]
        self.assertEqual(item.product, product)
        self.assertEqual(item.quantity, 2)

    def test_stock_put_on_hold(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)

        self.assertEqual(product.hold, 2)
        self.assertEqual(product.available_stock(), 1)
(NB: In my workflow, I’d tackle one test at a time – we’ll cover that in the final two letters in CRESS.)
Provided the executable tests the LLM generates match the intent – and it’s really important to check that they do – any implementation it generates will need to pass them.
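To make that concrete, here is a minimal sketch of one implementation that would satisfy the two tests above. It is an illustration only, not the code any particular model produced. The names Product, add_to_basket, hold and available_stock come from the tests; OrderItem is a hypothetical name, since the tests only require that a basket entry exposes product and quantity.

```python
from dataclasses import dataclass


@dataclass
class Product:
    id: int
    stock: int
    hold: int = 0  # units reserved by baskets but not yet purchased

    def available_stock(self) -> int:
        # Stock that can still be added to a basket
        return self.stock - self.hold


@dataclass
class OrderItem:  # hypothetical name; the tests only need .product and .quantity
    product: Product
    quantity: int


def add_to_basket(basket: list, product: Product, quantity: int) -> None:
    # Record the order item, then put the requested units on hold
    basket.append(OrderItem(product=product, quantity=quantity))
    product.hold += quantity
```

Note what the sketch leaves out: the scenario says nothing about requesting more than the available stock, or adding a product that is already in the basket, so neither is handled here. Those gaps are exactly where further examples would be needed.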
If the implementation doesn’t pass the tests, or the tests don’t match the intent, I revert the changes, flush the context (see “C is for Current”) and try again – perhaps adding further clarification to the context, like additional tests, if needed.
Does this really make a difference? It certainly does. I conducted closed-loop experiments where I tasked Claude Code – using Opus 4.6 – with implementing a set of features for a small, but non-trivial, system.
I’d written my own reference implementation with tests that used a simple API that didn’t reveal any internal design details. I preserved the API and moved the tests to where Claude couldn’t see them, leaving just my instructions and the API for it to work with.
When Claude had finished, I moved the tests back into the project and ran them, scoring each pass by the % of tests passing.
I didn’t intervene until Claude said it was done. (In real life, I don’t use it this way, of course.)
In one version of the experiment, I provided BDD-style examples in the prompt. In another, I just gave Claude the basic feature descriptions. In both versions, Claude was instructed to generate its own tests from its interpretation of the requirements.
In a single pass, measured by % of tests passing, the difference was big.

Over multiple passes, feeding back test results after each, the difference got even bigger.

With test examples provided, the agent has explicit success criteria to converge on. Without them, it just goes around in circles, literally aimlessly. Poor little Ralph!
One final thought: not all interactions with an AI coding tool will be about adding or changing functionality. What if the task is a refactoring?
Well, hopefully your refactorings have goals – they’re done with intent to improve the design.
In my TDD workflow, at every green light – whenever the tests are passing again – I perform a mini code review on the changes. I might, for example, run a linter over the diff. Let’s say one of my code quality checks – just another kind of test – is for functions or methods that have a cyclomatic complexity > 5.
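A check like that can be sketched in a few lines with Python’s ast module. This is a rough approximation of cyclomatic complexity that counts common branching constructs, not a substitute for a real linter, and the names cyclomatic_complexity, functions_over_limit and SAMPLE are all hypothetical.

```python
import ast

# Node types that each add one branch. This is an approximation:
# it ignores some constructs, such as match statements.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    complexity = 1  # one path through a function with no branches
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two decision points
            complexity += len(node.values) - 1
    return complexity


def functions_over_limit(source: str, limit: int = 5) -> dict:
    """Return {function name: complexity} for functions exceeding the limit."""
    offenders = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            cc = cyclomatic_complexity(node)
            if cc > limit:
                offenders[node.name] = cc
    return offenders


SAMPLE = '''
def simple(x):
    return x + 1

def tangled(a, b, c):
    if a:
        for i in range(b):
            if i % 2 and c:
                while c > 0:
                    c -= 1
            elif i == 3:
                pass
    return c
'''

offenders = functions_over_limit(SAMPLE)
```

For the sample source, only tangled is reported. An empty result is the green light; anything else is a failing check that can be fed back or acted on.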
If the LLM changes a function and pushes its CC to 6, I now have a failing test. I could revert and feed that back in another pass, but giving an LLM two objectives in the same interaction reduces the odds of either being satisfied, so we could be here all day throwing the dice over and over again.
Or I could ask the LLM to refactor the function, and then run the check again to see if the restructured version is within limits.
However I choose to handle it, the important thing is that I have a clear way to know when it hasn’t worked.

