Faster Feedback -> Better Outcomes

The impact of feedback loops like testing in software development can be as profound as it is widely misunderstood.

Movie-making had a similar problem up until the 1960s. Crew shoots a take during the day. Director has to wait until the film’s processed so they can watch “the dailies” to check for any mistakes nobody noticed at the time – like an extra using an iPhone in what’s supposed to be 1889 – and to see if the shot actually works dramatically, comedically etc.

If they wanted to fix it, back in the day, that could mean rebuilding the set, or transporting everyone – cast, crew, equipment, costumes, props etc – back to the location. Remounting shots is a big deal.

In 1960, comic actor and director Jerry Lewis started using “video-assist” while working on The Bellboy. Takes were captured simultaneously on film and on video, so the director could check each shot in “video village” immediately after the take. If a joke wasn’t working, they could see straight away and adjust for the next take. By the mid-60s, the technology had been refined using a beam splitter to ensure the video captured was showing exactly what the film camera was recording. WYSIWYG.

It made a big difference. When we move the feedback much closer to the action and the myriad decisions made in just a single shot, fixing problems gets much quicker and much, much cheaper. So – unsurprisingly – more problems get fixed.

Cinephiles like myself may have noticed a tangible leap in the quality of films being made during the 1960s and early 1970s, as this technology became mainstream.

In software development, we have our equivalents of “video-assist” – techniques we can use to bring the feedback much closer to the decision, making mistakes much quicker and cheaper to fix.

A good example is developer testing. Instead of making a whole bunch of changes to the code and then testing all of them, we make one change and immediately run to our equivalent of “video village” – a unit test suite, for example – to check for problems.

Teams that rely on downstream testing are doing the equivalent of waiting to see the dailies. When problems are caught, fixing them becomes a bigger deal. Likely as not, the developers have moved on. The set’s been struck, so to speak, and remounting those shots is a bigger deal.

What other examples can you think of where we move feedback closer to the decision in software development?

Feedbackmaxxing

You know the TV gameshow Play Your Cards Right? Contestants are shown a sequence – in two rows – of giant playing cards presented face-down. The host turns over the first card. The contestant then has to guess if the next card is higher or lower than that one.

They move across the board, guessing and then revealing one card at a time until either the contestant guesses wrong or they complete the sequence and win the game.

Now imagine a version of that where they don’t turn the cards over until the contestant has guessed higher or lower for the entire sequence.

“That’s just silly, Jason.”

You’re absolutely right. It is silly. Very silly. The odds of winning the game would be so remote that we’d probably never see it happen.

So why are you developing software that way?

Be honest now – you are.

You don’t turn the cards over one at a time. You make a whole bunch of guesses about what the users or the business really need. Then you make a whole bunch of design decisions that may or may not be the right decisions. Then you make a whole bunch of changes to the code that may or may not work. And only then do you turn the cards over to see if all those many guesses were good guesses.

Every decision, and every change to the code, carries uncertainty. And that uncertainty compounds with every subsequent decision or change. If we have a 90% chance of getting one right, we have an 81% chance of getting two right, a 35% chance of getting ten right, and a 0.003% chance of getting 100 right. The more uncertainty accumulates, the longer we spend driving in the dark with the lights off.
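The arithmetic is easy to check for yourself. Here's a minimal sketch, assuming every decision is independent and has the same 90% chance of being right. That's a simplification, since real decisions build on each other. The function name is purely illustrative.

```python
def chance_all_right(p_each: float, n_decisions: int) -> float:
    """Probability that n independent decisions are ALL right,
    given each one has probability p_each of being right."""
    return p_each ** n_decisions


# Reproduce the figures quoted above (assumed: 90% per decision).
for n in (1, 2, 10, 100):
    print(f"{n:>3} decisions: {chance_all_right(0.9, n):.3%}")
```

Try it with 0.99 instead of 0.9: the curve is gentler, but it still only goes one way.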

These decisions and these changes don’t exist in isolation. One decision is often a consequence of an earlier decision – another junction along the way of the path we chose. One change to the code will constrain our choice of future changes.

If we take a wrong turn with any decision or any change (which is just another decision, really), how long can we afford to waste heading down the wrong road? How long will it take and how much will it cost to get back on the right road?

The further we go before we get a meaningful answer, the bigger the wasted time and effort, and the more it will cost to correct.

And this is where sunk cost enters the chat. When the cost of correcting a mistake is too high, teams will tend to choose to live with the mistake. Waddayagonnado?

And that’s how you make software, that is.

A smarter way is to turn the cards over as they’re being played. Test your guesses against reality as soon as possible, so the next guess is less likely to be a stop on the wrong road.

If you guessed wrong, no problemo. Correcting your mistake is quick and cheap. You don’t have to undo 100 decisions that followed, then make 100 new ones.

So a critical metric in software development is how long it takes for us to test our decisions after they’ve been made. That feedback latency needs to be as low as possible.

I’m now calling this approach feedbackmaxxing, because that’s how we talk these days apparently.

Feedbackmaxxing is maximising feedback frequency while minimising feedback latency across the entire software development system.

This is about two variables we can control in our development process:

  • Batch Size – how many decisions need feedback (e.g., from testing, from code review, from users) at a time?
  • Feedback Frequency – how often do we get that feedback?

The bigger the batches, the longer it takes to get feedback. The smaller the batches, the sooner we learn what works and what doesn’t.

The smart players work in small batches – they solve one problem at a time – and engineer their feedback loops to be very fast.

Software development cycles are loops within loops. We have that outer loop – will a reminder to reorder a prescription reduce missed doses? And we have the inner loop – did that change I just made to the code work? Did it break anything that was depending on it?

The smart players know something about how to optimise nested loops, too. They know that to speed up the outer loop – the real-world user feedback from working releases – you focus your attention on the innermost loop.

How long does it take to build and test the software? If the answer is an hour, you have a big problem. Your choices are not great – you can either test one change at a time, and spend most of your day waiting for feedback. Or – and this is the most popular choice – you make a lot of changes, and then test them, in the mistaken belief this will save you time. “I’m too busy building on top of broken code for testing!”

The other systemic effect that large batches have is that – because they take longer to get feedback on (reviewing a 5-line diff vs. a 500-line diff, for example) – changes tend to end up sitting in queues waiting their turn.

Make the batches bigger, the queues get larger, and delays get longer. The more decisions we make before testing them, the slower we get overall.

The evidence at this point is overwhelming that AI code generation speeds developers up, but slows teams down. We’ve been maxxing the wrong thing.

Large Language Models can make a lot of decisions – e.g., a lot of changes to our code – very, very quickly. It comes as no surprise that data from studying work queues across thousands of teams shows diffs getting bigger and bigger, queues getting larger and larger, and lead times for getting changes into production getting longer and longer.

In the most meaningful sense, feedback latency isn’t the time elapsed after a decision’s been made before we get feedback, but the number of subsequent decisions made that are a consequence of it – how many miles did we carry on down that road. Lightning fast code generation doesn’t help us here. If anything, it probably makes latency worse – we’re much further down potentially the wrong road driving a Maserati than if we’d walked.
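To put a rough number on "miles down the wrong road", here's a back-of-an-envelope sketch. It assumes exactly one wrong decision per batch, equally likely at any position, and that every later decision in the batch builds on it (a worst case). The function name and the model are my illustration, not measured data.

```python
def mean_decisions_built_on_mistake(batch_size: int) -> float:
    """Average number of later decisions in a batch that follow a single
    wrong decision, if the mistake is equally likely at any position."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A mistake at position i (0-based) is followed by batch_size - 1 - i decisions.
    later_decisions = [batch_size - 1 - position for position in range(batch_size)]
    return sum(later_decisions) / batch_size


for size in (1, 5, 50, 500):
    print(f"batch of {size:>3}: {mean_decisions_built_on_mistake(size)} decisions to revisit")
```

With a batch of one, there's nothing to undo. In this model, the distance travelled down the wrong road grows in step with the batch size, however fast the code was generated.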

“Ah, but Jason, we can just get the agent to regenerate the software again from the original specs.” U-huh? Tell me you’ve never tried that on anything non-trivial without telling me you’ve never tried that on anything non-trivial.

“Aha! But we can just get the agent to make the changes we need.” This is where the peak-end rule bites on the backside. Ask users, for example, for feedback on a single design choice, and you’ll get specific, meaningful, useful thoughts. Ask them for feedback on 50 choices, and they’ll talk about the one or two things that stood out, and the last thing they saw. (See also: code reviews – “Looks good to me”).

And then there’s the established fact that LLMs are good at generating code that they’re bad at modifying later. And the more complex the code base is, the worse they get. I wish you the best of luck with that!

You are drinking from a code-generating firehose, and it’s getting out of control.

The answer to your AI-generated woes is feedbackmaxxing. Ask one question at a time. Get an answer as soon as possible. Test continuously. Review continuously. Integrate continuously. Get real-world feedback continuously.

A lot of people struggle to picture what that looks like.

Once you’ve seen it, though, your journey to Feedbackmaxxville (twinned with Gas Town) can begin.

Talking of which…

Extending The Horizon Of Agent Autonomy Is A Testing Problem

I’ve talked before about how improbable long-horizon autonomous agentic workflows are. Every step is a throw of a weighted dice, and with each additional step the probability of success goes down.
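The weighted-dice picture can be turned into a rough horizon estimate. This is a sketch that assumes every step succeeds independently with the same probability, which real agent runs won't honour exactly. The function name and the 50% target are illustrative assumptions.

```python
def max_horizon(p_step: float, target: float) -> int:
    """Largest number of consecutive steps for which the chance of
    zero errors stays at or above `target`, assuming independent steps."""
    if not 0.0 < p_step < 1.0:
        raise ValueError("p_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    p_all_ok = 1.0
    # Keep adding steps while one more step would still meet the target.
    while p_all_ok * p_step >= target:
        p_all_ok *= p_step
        steps += 1
    return steps


for p in (0.9, 0.99, 0.999):
    print(f"p(step ok) = {p}: better-than-evens horizon is {max_horizon(p, 0.5)} steps")
```

On these assumptions, even a 99%-reliable step only buys you a few dozen steps before it's a coin toss whether something has gone wrong.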

On top of this, decisions have dependencies, and this means that errors can compound downstream. Take a wrong turn at step N, and step N+1, N+2, N+3 could well build on that mistake.

The other side of the equation is verification. Mistakes aren’t a problem if they’re caught before they compound.

So we now have two components: the probability of an error, and the probable number of subsequent steps before the error’s detected.

More bluntly, if the agent f***ed up, how soon would we/it know?

The E in my CRESS principles for context engineering stands for “Empirical” – input contexts should be grounded in observed reality, not unverified model output. I visualise raw model output as being like untreated sewage. Yes, there’s water in it. But it’s not safe for the model to drink.

To make it safe, it needs to be tested against reality and potentially debugged and refactored, or flushed down the drain if it’s too far gone.

I don’t know about you, but I’d think twice about drinking water from the tap if I knew that the only testing it had been through was someone holding it up to the light and pronouncing “Looks good to me”.

This takes us into the wonderful world of test assurance, and into territory that will be alien to the vast majority of software teams. The longer the autonomous horizon, the higher the assurance needs to be.

I’m seeing lots of folks (finally!) discovering the value of mutation testing – a technique for testing your tests by deliberately introducing errors into the code and seeing if any tests fail – in agentic workflows. And there’s no doubting this helps close the gaps that errors can leak through from one step to the next.

But the kind of full autonomy Anthropic and others claim will soon be upon us requires degrees of assurance that go way beyond even that needed for safety-critical systems into uncharted territory.

Now, personally, I think testing and verification in software has left a lot to be desired for many decades. Almost none of you have ever gotten in the ballpark of what I consider to be good enough, even for line-of-business applications, let alone safety-critical ones.

But even if we could drag ourselves into that ballpark, I know from experience that high-integrity software engineering still requires acres of human judgement and learning that LLMs will likely never be capable of.

But it might extend the agentic horizon from, say, N steps to 1.1 N steps before we need to course correct. And that could be the key to squeezing out more net value from the technology – maybe our lead times shrink from L to 0.9 L?

The fun part is that I know for a fact – having tried for nearly 30 years to get teams interested in upping the integrity of their products – that 99% will not want to hear that the answer is MORE RIGOUR.

(And, of course, we’re just talking about one kind of testing here. When we add in other qualities of software, like maintainability, performance and security – you can probably see why I consider full autonomy a Fool’s Errand.)

CRESS Principles for Context Engineering – E is for Empirical

Most commercial LLMs – that is to say, the ones with expensive lawyers – display a disclaimer along the lines of “<LLM> can make mistakes. Check important info.”

They’re not kidding. Every token of text an LLM generates should be considered suspect, and when fidelity matters, we really should check the output thoroughly.

In programming, fidelity matters. I appreciate that’s the kind of heresy that can get you sacked in Silicon Valley these days, where YOLO – mostly driven by FOMO – dominates.

But in banks and retail chains and hospitals and payroll it really does still matter – which is why, on high-risk systems, applying LLM-generated code changes directly is effectively banned in many organisations.

And it’s a two-way street. If we want more trustworthy output from an LLM, we need it to have more trustworthy input – both in training and in inference.

As users, we don’t have any control over the quality of the data an LLM is trained on, but we do have control over the quality of the data we give it in day-to-day use. Here’s another mnemonic: GIGO – Garbage In, Garbage Out.

Whenever possible, we want the context that the model is pattern-matching on to be grounded in observed reality, rather than in the model’s own output.

  • The code as it really is right now
  • The real requirements we agreed with the customer or product owner
  • The real customer acceptance tests
  • Actual test run results
  • Actual linter reports
  • Actual mutation testing results
  • Actual user feedback

And so on.

The uncomfortable truth is that the moment Claude Opus or GPT-5 or Gemini starts acting on its own output – e.g., its own planning or reasoning or generated code – the context starts drifting from reality. And the further we let the generated context run, the more they compound on their errors – eating their own fiction and producing even wilder flights of fancy. They have no model of the real world to compare it to.

Ditto where context “compression” and LLM-generated summaries are concerned – they’re notoriously unreliable narrators. That architecture.md file that Claude generated for you? Be very skeptical that it’s an accurate picture of the real architecture. Research finds that LLM-generated context files can mislead models.

The practical upshot of this is an information flow where our inputs are wherever possible grounded in observed reality, and LLM output can only become part of that observed reality after it’s been thoroughly tested against it.

And, yes, I’m implying that we shouldn’t rely on LLMs to mark their own homework, because they don’t have access to any kind of real-world model until we give it to them. In short, when an LLM tells you it’s raining, go outside and look.

To use an analogy, LLM waste water needs to be made clean and safe to drink before feeding it back into the LLM in future interactions. This often requires expert intervention, and often requires that the output be rejected outright if it’s too far from acceptable (e.g., if it fails the unit tests).

As with the C in CRESS – contexts should be Current – the implication is that contexts should be short-lived. Otherwise they start to fill up with generated content that hasn’t been verified and – as the ground shifts beneath the model’s feet with each applied change to the code – they drift further from the underlying reality.

The E in CRESS also works with the R – contexts should be refutable. In order for model outputs to be fed back into model inputs, they should pass through a quality gate that enables us to know with high confidence if they don’t satisfy our intent.

CRESS Principles for Context Engineering – R is for Refutable

If speculative ideas can not be tested, they’re not science; they don’t even rise to the level of being wrong.

Wolfgang Pauli

When we interact with a language model, we’re communicating in natural language. And communicating in natural language is a lossy process.

There’s what I intended it to mean, and then there’s the meaning the model interprets, and they’re often not the same thing.

Many bad things have happened in the world because the receiver misinterpreted the intent of the sender. So it’s important to know with high confidence if we’ve grabbed the wrong end of the stick.

The inherent ambiguity of natural languages works against our desire to make our meaning clear.

In real-world communication, a simple technique to uncover misunderstandings is to test interpretations to see if they satisfy the original intent.

Including a test in an instruction given to an LLM serves two useful purposes:

  1. It restricts pattern-matching to outputs that match the test as well as the natural language instruction. Coding models are actually trained by pairing code samples with tests of some kind, and more recently test execution has been used as a reward function in reinforcement learning. LLMs are sort of built for tests.
  2. It potentially gives us a direct way to check if the output doesn’t satisfy the intent. If our success criteria are turned into executable tests – e.g. unit tests – then we can run them against the output and get immediate feedback.

Imagine we want our LLM to generate code to add items to an online shopping basket. I regularly see prompts that look something like this.

Please generate a Python function for adding items to a shopping
basket. It should take product and quantity as parameters.

But the devil’s in the detail. What exactly are we expecting to happen when the function adds the item? How will we know if it doesn’t happen the way we intended?

I’ve been providing BDD-style tests in my contexts, along the lines of:

Given an empty basket,
And the customer has selected the product with ID 811 and stock of 3
When the customer adds the product to the basket with quantity 2
Then a new order item is added to the basket with product 811 and quantity 2
And 2 of product 811’s stock are put on hold, leaving available stock of 1

This gives the LLM much more to go on regarding the expected behaviour – the precise intent – of adding an item to the basket.

And it can be directly translated into unit tests:

import unittest

# Product and add_to_basket are assumed to come from the module under test.


class AddToBasket(unittest.TestCase):
    def test_order_item_is_added(self):
        basket = []
        product = Product(id=811, stock=3)
        add_to_basket(basket, product, quantity=2)
        item = basket[0]
        self.assertEqual(item.product, product)
        self.assertEqual(item.quantity, 2)

    def test_stock_put_on_hold(self):
        basket = []
        product = Product(id=811, stock=3)
        add_to_basket(basket, product, quantity=2)
        self.assertEqual(product.hold, 2)
        self.assertEqual(product.available_stock(), 1)

(NB: In my workflow, I’d tackle one test at a time – we’ll cover that in the final two letters in CRESS.)

Provided the executable tests the LLM generates match the intent – and it’s really important to check that they do – any implementation it generates will need to pass them.
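For illustration, here's one minimal implementation that would satisfy the two tests above. It's a sketch, not the code Claude or I actually produced: the OrderItem class, the dataclass layout and the way stock is held are all my assumptions, inferred only from what the tests touch.

```python
from dataclasses import dataclass


@dataclass
class Product:
    id: int
    stock: int
    hold: int = 0  # units reserved by baskets but not yet sold

    def available_stock(self) -> int:
        return self.stock - self.hold


@dataclass
class OrderItem:  # hypothetical name; the tests only need .product and .quantity
    product: Product
    quantity: int


def add_to_basket(basket: list, product: Product, quantity: int) -> None:
    """Put `quantity` units on hold and record the order item in the basket."""
    product.hold += quantity
    basket.append(OrderItem(product, quantity))


basket = []
product = Product(id=811, stock=3)
add_to_basket(basket, product, quantity=2)
print(basket[0].quantity, product.hold, product.available_stock())
```

Notice what the tests don't say: nothing here stops a customer holding more stock than exists. If that matters, it needs its own example.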

If the implementation doesn’t pass the tests, or the tests don’t match the intent, I revert the changes, flush the context (see “C is for Current“) and try again – perhaps adding further clarification to the context, like additional tests, if needed.

Does this really make a difference? It certainly does. I conducted closed-loop experiments where I tasked Claude Code – using Opus 4.6 – to implement a set of features for a small, but non-trivial, system.

I’d written my own reference implementation with tests that used a simple API that didn’t reveal any internal design details. I preserved the API and moved the tests to where Claude couldn’t see them, leaving just my instructions and the API for it to work with.

When Claude had finished, I moved the tests back into the project and ran them, scoring each pass by the % of tests passing.
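Scoring a pass this way needs very little machinery. Here's a sketch using Python's unittest, with two stand-in checks in place of the hidden suite. The function name and the stand-in checks are illustrative, not my actual experiment harness, and skipped tests are ignored for simplicity.

```python
import unittest


def percent_passing(suite: unittest.TestSuite) -> float:
    """Run the suite and return the percentage of tests that passed."""
    result = unittest.TestResult()
    suite.run(result)
    if result.testsRun == 0:
        return 0.0
    not_passed = len(result.failures) + len(result.errors)
    return 100.0 * (result.testsRun - not_passed) / result.testsRun


# Stand-ins for a hidden acceptance suite: one check passes, one fails.
def check_passes():
    assert 1 + 1 == 2


def check_fails():
    assert 1 + 1 == 3


hidden_suite = unittest.TestSuite(
    [unittest.FunctionTestCase(check_passes), unittest.FunctionTestCase(check_fails)]
)
print(f"{percent_passing(hidden_suite):.0f}% of tests passing")
```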

I didn’t intervene until Claude said it was done. (In real life, I don’t use it this way, of course.)

In one version of the experiment, I provided BDD-style examples in the prompt. In another, I just gave Claude the basic feature descriptions. In both versions, Claude was instructed to generate its own tests from its interpretation of the requirements.

In a single pass, measured by % of tests passing, the difference was big.

Over multiple passes, feeding back test results after each, the difference got even bigger.

With test examples provided, the agent has explicit success criteria to converge on. Without them, it just goes around in circles, literally aimlessly. Poor little Ralph!

One final thought: not all interactions with an AI coding tool will be about adding or changing functionality. What if the task is a refactoring?

Well, hopefully your refactorings have goals – they’re done with intent to improve the design.

In my TDD workflow, at every green light – whenever the tests are passing again – I perform a mini code review on the changes. I might, for example, run a linter over the diff. Let’s say one of my code quality checks – just another kind of test – is for functions or methods that have a cyclomatic complexity > 5.

If the LLM changes a function and makes CC = 6, I now have a failing test. I could revert and feed that back in another pass (and giving an LLM two objectives in the same interaction reduces the odds of either being satisfied, so we could be here all day throwing the dice over and over again).

Or I could ask the LLM to refactor the function, and then run the check again to see if the restructured version is within limits.

However I choose to handle it, importantly I have a clear way to know when it hasn’t worked.
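A complexity check like that can be a few lines of code. The sketch below approximates cyclomatic complexity by counting branch points in Python's abstract syntax tree. It's a rough stand-in for a real linter rule: it ignores match statements and counts nested functions towards their parent. The names and the sample source are illustrative.

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_complexity(func: ast.AST) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):  # each extra and/or adds a path
            score += len(node.values) - 1
        elif isinstance(node, ast.comprehension):  # the loop plus any filters
            score += 1 + len(node.ifs)
    return score


def functions_over_limit(source: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return (name, complexity) for every function whose complexity exceeds limit."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            complexity = approx_complexity(node)
            if complexity > limit:
                found.append((node.name, complexity))
    return found


SAMPLE = '''
def simple(x):
    return x + 1

def branchy(x):
    if x == 1:
        return "a"
    if x == 2:
        return "b"
    if x == 3:
        return "c"
    if x == 4:
        return "d"
    if x == 5:
        return "e"
    return "z"
'''

print(functions_over_limit(SAMPLE))  # [('branchy', 6)]
```

An empty list is the green light. Anything else is a failing test, with the function's name on it.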

The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that catch the boo-boos before they get deployed at the very latest. Ideally, catching the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far after a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time-out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.
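Here's the idea boiled down to a toy, using Python's ast module to apply a single mutation operator (every + becomes a -). Real tools like PIT and Stryker are far more sophisticated. This sketch doesn't handle time-outs, and all the names in it are hypothetical.

```python
import ast

SOURCE = """
def total_price(unit_price, quantity, shipping):
    return unit_price * quantity + shipping
"""


class SwapAddForSub(ast.NodeTransformer):
    """Mutation operator: turn every + into a -."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load(tree):
    """Compile and execute a syntax tree, returning the names it defines."""
    namespace = {}
    exec(compile(tree, "<code under test>", "exec"), namespace)
    return namespace


def all_pass(checks, namespace):
    for check in checks:
        try:
            check(namespace)
        except AssertionError:
            return False
    return True


def weak_check(ns):
    # Shipping of 0 can't tell + from -, so this can't catch the mutant.
    assert ns["total_price"](10, 2, 0) == 20


def strong_check(ns):
    assert ns["total_price"](10, 2, 5) == 25


def mutant_killed(checks) -> bool:
    assert all_pass(checks, load(ast.parse(SOURCE))), "checks must pass on unmutated code"
    mutant = ast.fix_missing_locations(SwapAddForSub().visit(ast.parse(SOURCE)))
    return not all_pass(checks, load(mutant))


print(mutant_killed([weak_check]))                # False: the mutant survives
print(mutant_killed([weak_check, strong_check]))  # True: the mutant is killed
```

The weak check exercises the line that was mutated, so code coverage would look fine. It just can't tell the difference between working code and broken code.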

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can be used to incrementally do this on code that’s changed, so you don’t need to run it again and again against code that isn’t changing, making it practical to do within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues relating to maintainability or security etc is code reviews. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

  1. It catches problems straight away, before I start pouring more cement on them
  2. It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
  3. It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.
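To make that concrete, here's a sketch of a single rule written against Python's own abstract syntax tree: flag any function with too many parameters. The class name, the limit of 4 and the report format are my assumptions, not any particular linter's API. Note that it counts self on methods, too.

```python
import ast


class TooManyParameters(ast.NodeVisitor):
    """Rule: report functions that declare more than max_params parameters."""

    def __init__(self, max_params: int = 4):
        self.max_params = max_params
        self.violations = []  # (line number, function name, parameter count)

    def visit_FunctionDef(self, node):
        args = node.args
        count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
        if count > self.max_params:
            self.violations.append((node.lineno, node.name, count))
        self.generic_visit(node)  # keep walking into nested functions

    visit_AsyncFunctionDef = visit_FunctionDef


def lint(source: str, max_params: int = 4):
    checker = TooManyParameters(max_params)
    checker.visit(ast.parse(source))
    return checker.violations


SAMPLE = "def f(a, b, c, d, e):\n    pass\n\ndef g(a):\n    pass\n"
print(lint(SAMPLE))  # [(1, 'f', 5)]
```

Each rule is small, fast and completely systematic, which is exactly what you want from something you'll run many times an hour.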

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique: commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!
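The same test-the-gate trick can be sketched in Python (my experiment was on Java source, so this is an analogue, not what I ran). We seed an unused import into clean code and check that a simple, deterministic detector reports it. The detector is deliberately naive: it ignores star imports, `__all__` and names used only inside strings. All names are illustrative.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in the module."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a" unless aliased
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


CLEAN = "import os\n\ndef cwd_name():\n    return os.path.basename(os.getcwd())\n"
SEEDED = "import json\n" + CLEAN  # the crime: an import nothing uses

print(unused_imports(CLEAN))   # []
print(unused_imports(SEEDED))  # ['json']
```

If the gate reports the seeded import every single time, you can start to trust it. If it only reports it some of the time, it isn't a gate.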

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that comes up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.

With, say, a Large Language Model, the exact same input fed to the exact same model – its internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.
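The dice make the point nicely in code. This sketch throws two simulated dice and reports the share of 7s for different numbers of throws. The seed and function names are arbitrary choices of mine, and it's a simulation of dice, not of any model.

```python
import random
from collections import Counter


def throw_two_dice(rng: random.Random) -> int:
    return rng.randint(1, 6) + rng.randint(1, 6)


def share_of_sevens(throws: int, seed: int) -> float:
    """Fraction of throws that total 7, for a reproducible sequence of throws."""
    rng = random.Random(seed)
    totals = Counter(throw_two_dice(rng) for _ in range(throws))
    return totals[7] / throws


for throws in (1, 10, 100, 100_000):
    print(f"{throws:>7} throws: {share_of_sevens(throws, seed=1):.1%} sevens")
```

One throw can only ever tell you 0% or 100%. It takes a lot of throws before the figure settles near the true 1 in 6, and the same goes for claims about what "works" with a probabilistic tool.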

I’ve been using small-scale, closed-loop experiments – once it’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scoring on how many tests passed).

But these are small-scale. I might run them 10x, because I simply can’t afford the tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.

And that’s why I’ve leaned more on large-scale peer-reviewed studies of related work (the same phenomena, but in different problem domains) and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fuelled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Ready, Fire, Aim!

I teach Test-Driven Development. You may have heard.

And having taught TDD for some quarter of a century now, I’ve heard, as you can probably imagine, every reason under the Sun for not doing TDD. (And some more reasons under the Moon.)

“It won’t work with our tech stack” is one of the most common, and one of the most easily addressed. I’ve done and seen done TDD on all of the tech stacks, at all levels of abstraction from 4GLs down through assembly language to the hardware design itself. If you can invoke it and get an output, you can automatically test it. And if you can automatically test it, you can write that test first.

(Typically, what they really mean is that the architecture of the framework(s) they’re using doesn’t make unit testing easy. That’s about separation of concerns, though, and usually work-aroundable.)

The second most common reason I hear is perhaps the more puzzling: “But how can I write tests first if I don’t know what the code’s supposed to do?”

The implication here is that developers are writing solution code without a clear idea of what they expect it to do – that they’re retrofitting intent to implementations.

I find that hard to imagine. When I write code, I “hear the tune” in my head, so to speak. The intended meaning is clear to me. When I run it, my understanding might turn out to be wrong. But there is an expectation of what the code will do: I think it’s going to do X.

My best guess is that we all kind of sort of have those inner expectations when we write code. The code has meaning to us, even if we turn out to have understood it wrong when we run it.

So I could perhaps rephrase “How can I write tests first if I don’t know what the code’s supposed to do?” to articulate what might actually be happening:

“How do I express what I want the code to do before I’ve seen that code?”

Take this example of code that calculates the total of items in a shopping basket:

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

When I write this code, in my head – often subconsciously – I have expectations about what it’s going to do. I start by declaring a sum of zero, because an empty basket will have a total of zero.

Then, for every item in the basket, I add that item’s price multiplied by its quantity to the sum.

So, in my head, there’s an expectation that if the basket had one item with a quantity of one, the total would equal just the price of that item.

If that item had a quantity of two, then the total would be the price multiplied by two.

If there were two items, the total would be the sum of price times quantity of both items.

And so on.

You’ll notice that my thinking isn’t very abstract. I’m thinking more with examples than with symbols.

  • No items.
  • One item with quantity of one.
  • One item with quantity of two.
  • Two items.

If you asked me to write unit tests for the total function, these examples might form the basis of them.

A test-driven approach just flips the script. I start by listing examples of what I expect the function to do, and then – one example at a time – I write a failing test, write the simplest code to pass the test, and then refactor if I need to before moving on to the next example.

    def test_total_of_empty_basket(self):
        items = []
        basket = Basket(items)
        
        self.assertEqual(0.0, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        return 0.0

What I’m doing – and this is part of the art of Test-Driven Development – is externalising the subconscious expectations I would no doubt have as I write the total function’s implementation.

Importantly, I’m not doing it in the abstract – “the total of the basket is the sum of price times quantity for all of its items”.

I’m using concrete examples, like the total of an empty basket, or the total of a single item of quantity one.

“But, Jason, surely it’s six of one and half-a-dozen of the other whether we write the tests first or write the implementation first. Why does it matter?”

The psychology of it is very interesting. You may have heard life coaches and business gurus tell their audience to visualise their goal – picture themselves in their perfect home, or sipping champagne on their yacht, or making that acceptance speech, or destabilising western democracy. It’s good to have goals.

When we set out with a clear goal, we’re much more likely to achieve it. It’s a self-fulfilling prophecy.

We make outcomes visible and concrete by adding key details – how many bedrooms does your perfect home have? How big is the yacht? Which Oscar did you win? How little regulation will be applied to your business dealings?

What should the total of a basket with no items be? What should the total of a basket with a single item with price 9.99 and quantity 1 be?

    def test_total_of_single_item(self):
        items = [
            Item(9.99, 1),
        ]
        basket = Basket(items)

        self.assertEqual(9.99, basket.total())

We precisely describe the “what” – the desired properties of the outcome – and work our way backwards directly to the “how”. What would be the simplest way of achieving that outcome?

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            return self.items[0].price
        return 0.0

Then we move on to the next outcome – the next example:

    def test_total_of_item_with_quantity_of_2(self):
        items = [
            Item(9.99, 2)
        ]
        basket = Basket(items)

        self.assertEqual(19.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            item = self.items[0]
            return item.price * item.quantity
        return 0.0

And then our final example:

    def test_total_of_two_items(self):
        items = [
            Item(9.99, 1),
            Item(5.99, 1)
        ]
        basket = Basket(items)

        self.assertEqual(15.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

If we enforce that items must have a price >= 0.0 and an integer quantity > 0, this code should cover any list of items, including an empty list, with any price and any quantity.
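The `Item` class itself isn't shown in these examples, so here's one hedged sketch of how those rules might be enforced. The dataclass, the field names and the error messages are my assumptions, chosen to match the `Item(9.99, 1)` calls in the tests:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical Item; the field names follow the test examples.
    price: float
    quantity: int

    def __post_init__(self):
        if self.price < 0.0:
            raise ValueError("price must be >= 0.0")
        # bool is a subclass of int, so rule it out explicitly.
        if isinstance(self.quantity, bool) or not isinstance(self.quantity, int):
            raise ValueError("quantity must be an integer")
        if self.quantity <= 0:
            raise ValueError("quantity must be > 0")
```

Rejecting bad items at construction means the total function never has to check them, which keeps the basket code as simple as the tests allowed it to be.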

And our unit tests cover every outcome. If I were to break this code so that, say, an empty basket causes an error to be thrown, one of these tests would fail. I’d know straight away that I’d broken it.

This is another self-fulfilling prophecy of starting with the outcome and working directly backwards to the simplest way of achieving it – we end up with the code we need, and only the code we need, and we end up with tests that give us high assurance after every change that those outcomes are still being satisfied.

Which means that if I were to refactor the design of the total function:

    def total(self):
        return sum(
                map(lambda item: item.subtotal(), self.items))

I can do that with high confidence.
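For anyone who wants to try that refactoring, here's a self-contained sketch that puts the four tests and the refactored function in one file. The `Item` class and its `subtotal` method aren't shown in the examples, so their shape here is an assumption, as are the test class name and the `run_tests` helper:

```python
import unittest


class Item:
    # Hypothetical Item; subtotal() is assumed by the refactored total().
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity

    def subtotal(self):
        return self.price * self.quantity


class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        return sum(map(lambda item: item.subtotal(), self.items))


class BasketTotalTest(unittest.TestCase):
    def test_total_of_empty_basket(self):
        self.assertEqual(0.0, Basket([]).total())

    def test_total_of_single_item(self):
        self.assertEqual(9.99, Basket([Item(9.99, 1)]).total())

    def test_total_of_item_with_quantity_of_2(self):
        self.assertEqual(19.98, Basket([Item(9.99, 2)]).total())

    def test_total_of_two_items(self):
        items = [Item(9.99, 1), Item(5.99, 1)]
        self.assertEqual(15.98, Basket(items).total())


def run_tests():
    """Run the four examples and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(BasketTotalTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` after each small change is the “video village” moment: if the refactoring broke any of the four outcomes, I'd know within a second or two.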

If I write the code and then write tests for it, several things tend to happen:

  • I may end up with code I didn’t actually need, and miss code I did need
  • I may well miss important cases, because unit tests? Such a chore when the work’s already done! I just wanna ship it!
  • It’s not safe to refactor the new code without those tests, so I have to leave that until the end, and – well, yeah. Refactoring? Such a chore! etc etc etc.
  • The tests I choose – the “what” – are now being driven by my design – the “how”. I’m asking “What test do I need to cover that branch?” and not “What branch do I need to pass that test?”

And finally, there’s the issue of design methodology. Any effective software design methodology is usually usage-driven. We don’t start by asking “What does this feature do?” We start by asking “How will this feature be used?”

What the feature does is a consequence of how it will be used. We don’t build stuff and then start looking for use cases for it. Well, I don’t, anyway.

In a test-driven approach, my tests are the first users of the total function. That’s what my tests are about – user outcomes. I’m thinking about the design from the user’s – the external – perspective and driving the design of my code from the outside in.

I’m not thinking “How am I going to test this total function?” I’m thinking “How will the user know the total cost of the basket?” and my tests reveal the need for a total function. I use it in the test, and that tells me I need it.

“Test-driven”. In case you were wondering what that meant.

When we design code from the user’s perspective, we’re far more likely to end up with useful code. And when we design code with tests playing the role of the user, we’re far more likely to end up with code that works.

One final question: if I find myself asking “What is this function supposed to do?”, is that a cue for me to start writing code in the hope that somebody will find a use for it?

Or is that my cue to go and speak to someone who understands the user’s needs?

The AI-Ready Software Developer #20 – It’s The Bottlenecks, Stupid!

For many years now, cycling has been consistently the fastest way to get around central London. Faster than taking the tube. Faster than taking the train. Faster than taking the bus. Faster than taking a cab. Faster than taking your car.

All of these other modes of transport are, in theory, faster than a bike. But the bike will tend to get there first, not because it’s the fastest vehicle, but because it’s subject to the fewest constraints.

Cars, cabs, trains and buses move not at the top speed of the vehicle, but at the speed of the system.

And, of course, when we measure their journey speed at an average 9 mph, we don’t see them crawling along steadily at that pace.

“Travelling” in London is really mostly waiting. Waiting at junctions. Waiting at traffic lights. Waiting to turn. Waiting for the bus to pull out. Waiting on rail platforms. Waiting at tube stations. Waiting for the pedestrian to cross. Waiting for that van to unload.

Cyclists spend significantly less time waiting, and that makes them faster across town overall.

Similarly, development teams that can produce code much faster, but work in a system with real constraints – lots of waiting – will tend to be outperformed overall by teams who might produce code significantly slower, but who are less constrained – spend less time waiting.
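The arithmetic is simple enough to sketch. The figures below are entirely hypothetical, chosen only to illustrate the point, and the `lead_time` function is my own simplification that treats delivery time as time spent coding plus time spent blocked:

```python
def lead_time(coding_hours, waiting_hours):
    """Elapsed time to deliver a change: time working plus time blocked."""
    return coding_hours + waiting_hours


# Hypothetical figures, chosen only to illustrate the arithmetic.
fast_coders = lead_time(coding_hours=1, waiting_hours=40)
slow_coders = lead_time(coding_hours=8, waiting_hours=4)
```

In this made-up case, the team that codes eight times slower delivers in 12 hours against 41, because the waiting dominates the total.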

What are developers waiting for? What are the traffic lights, junctions and pedestrian crossings in our work?

If I submit a Pull Request, I’m waiting for it to be reviewed. If I send my code for testing, I’m waiting for the results. If I don’t have SQL skills, and I need a new column in the database, I’m waiting for the DBA to add it for me. If I need someone on another team to make a change to their API, more waiting. If I pick up a feature request that needs clarifying, I’m waiting for the customer or the product owner to shed some light. If I need my manager to raise a request for a laptop, then that’s just yet more waiting.

Teams with handovers, sign-offs and other blocking activities in their development process will tend to be outperformed by teams who spend less time waiting, regardless of the raw coding power available to them.

Teams who treat activities like testing, code review, customer interaction and merging as “phases” in their process will tend to be outperformed by teams who do them continuously, regardless of how many LOC or tokens per minute they’re capable of generating.

This isn’t conjecture. The best available evidence is pretty clear. Teams who’ve addressed the bottlenecks in their system are getting there sooner – and in better shape – than teams who haven’t. With or without “AI”.

The teams who collaborate with customers every day – many times a day – outperform teams who have limited, infrequent access.

The teams who design, test, review, refactor and integrate continuously outperform teams who do them in phases.

The teams with wider skillsets outperform highly-specialised teams.

The teams working in cohesive and loosely-coupled enterprise architectures outperform teams working in distributed monoliths.

The teams with more autonomy outperform teams working in command-and-control hierarchies.

None of these things comes with your Claude Code plan. You can’t buy them. You can’t install them. But you can learn them.

And if you’re ticking none of those boxes, and you still think a code-generating supercar is going to make things better, I have a Bugatti Chiron Sport you might be interested in buying. Perfect for the school run!