One handy thing about living in London, if you enjoy stand-up comedy, is that so many comedians test new material here in small venues – often playing “works in progress” to audiences of just a few dozen.
Stewart Lee famously iterates his show over many, many performances at the Leicester Square Theatre before he takes it on tour to bigger venues and has it recorded for TV and DVD.
Here’s the thing: comedy requires feedback. Immediate feedback. Not an aggregate report at the end of the show, but in-the-moment feedback about how a joke’s landing. In big theatres, audiences can become that faceless aggregate, but in 100-seater venues, every data point has a face.
And that matters. It matters when you can see the faces and hear the responses from your audience. Because now each one of them matters, and that’s a very different kind of feedback to being told that “27% thought that the routine about Prince Andrew went on a bit too long” after they’ve all gone home.
I hear developers all the time complaining that there are just too many users to get that kind of feedback-with-a-face. I say that’s a choice – like skipping the warm-up gigs at Old Rope at the Comedy Store and taking your show straight to the O2 Arena.
It’s worth cultivating small audiences to test new material on. Sure, you don’t get to see the aggregate trends – only big audiences can give you that. But you can see their faces, and you know immediately if the jokes aren’t landing. And if you’re going to die on stage, it’s preferable not to do it in front of 20,000 paying punters.
One final thought: I’ve observed our industry morph from one where the data points had faces and individual users’ experiences mattered to one where we only play the proverbial stadiums, and we only see the trends, not the faces.
This, I suspect – while not a direct cause – has been an enabler of “enshittification”. It’s much easier to do that to a faceless aggregate.
The impact of feedback loops like testing in software development can be as profound as it is widely misunderstood.
Movie-making had a similar problem up until the 1960s. Crew shoots a take during the day. Director has to wait until the film’s processed so they can watch “the dailies” to check for any mistakes nobody noticed at the time – like an extra using an iPhone in what’s supposed to be 1889 – and to see if the shot actually works dramatically, comedically etc.
If they wanted to fix it, back in the day, that could mean rebuilding the set, or transporting everyone – cast, crew, equipment, costumes, props etc – back to the location. Remounting shots is a big deal.
In 1960, comic actor and director Jerry Lewis started using “video-assist” while working on The Bellboy. Takes were captured simultaneously on film and on video, so the director can check each shot in “video village” immediately after the take. If a joke’s not working, they can see straight away and adjust for the next take. By the mid-60s, the technology had been refined using a beam splitter to ensure the video captured was showing exactly what the film camera was recording. WYSIWYG.
It made a big difference. When we move the feedback much closer to the action and the myriad decisions made in just a single shot, fixing problems gets much quicker and much, much cheaper. So – unsurprisingly – more problems get fixed.
Cinephiles like myself may have noticed a tangible leap in the quality of films being made during the 1960s and early 1970s, as this technology became mainstream.
In software development, we have our equivalents of “video-assist” – techniques we can use to bring the feedback much closer to the decision, so that mistakes become much quicker and cheaper to fix.
A good example is developer testing. Instead of making a whole bunch of changes to the code and then testing all of them, we make one change and immediately run to our equivalent of “video village” – a unit test suite, for example – to check for problems.
Teams that rely on downstream testing are doing the equivalent of waiting to see the dailies. When problems are caught, fixing them becomes a bigger deal. Likely as not, the developers have moved on. The set’s been struck, so to speak, and remounting those shots is a bigger deal.
What other examples can you think of where we move feedback closer to the decision in software development?
You know the TV gameshow Play Your Cards Right? Contestants are shown a sequence – in two rows – of giant playing cards presented face-down. The host turns over the first card. The contestant then has to guess if the next card is higher or lower than that one.
They move across the board, guessing and then revealing one card at a time until either the contestant guesses wrong or they complete the sequence and win the game.
Now imagine a version of that where they don’t turn the cards over until the contestant has guessed higher or lower for the entire sequence.
“That’s just silly, Jason.”
You’re absolutely right. It is silly. Very silly. The odds of winning the game would be so remote that we’d probably never see it happen.
So why are you developing software that way?
Be honest now – you are.
You don’t turn the cards over one at a time. You make a whole bunch of guesses about what the users or the business really needs. Then you make a whole bunch of design decisions that may or may not be the right decisions. Then you make a whole bunch of changes to the code that may or may not work. And only then do you turn the cards over to see if all those many guesses were good guesses.
Every decision, and every change to the code, carries uncertainty. And that uncertainty compounds with every subsequent decision or change. If we have a 90% chance of getting one right, we have an 81% chance of getting two right, a 35% chance of getting ten right, and 0.003% chance of getting 100 right. The more uncertainty accumulates, the longer we spend driving in the dark with the lights off.
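The arithmetic behind those percentages is easy to check. Here’s a minimal sketch, assuming for simplicity that every decision has the same 90% chance of being right and that the decisions are independent of each other (real decisions aren’t, which only makes things worse). The function name is mine, purely for illustration.

```python
def chance_all_right(p_single: float, n_decisions: int) -> float:
    """Probability that n decisions in a row are all right.

    Assumes each decision has the same chance of being right and that
    they are independent of each other (a deliberate simplification).
    """
    return p_single ** n_decisions


# Print the chance of an unbroken run of good decisions.
for n in (1, 2, 10, 100):
    print(f"{n:>3} decisions: {chance_all_right(0.9, n):.4%}")
```

Run it with a different starting probability and the shape stays the same: the curve falls off a cliff long before you reach 100 untested decisions.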
These decisions and these changes don’t exist in isolation. One decision is often a consequence of an earlier decision – another junction along the way of the path we chose. One change to the code will constrain our choice of future changes.
If we take a wrong turn with any decision or any change (which is just another decision, really), how long can we afford to waste heading down the wrong road? How long will it take and how much will it cost to get back on the right road?
The further we go before we get a meaningful answer, the bigger the wasted time and effort, and the more it will cost to correct.
And this is where sunk cost enters the chat. When the cost of correcting a mistake is too high, teams will tend to choose to live with the mistake. Waddayagonnado?
And that’s how you make software, that is.
A smarter way is to turn the cards over as they’re being played. Test your guesses against reality as soon as possible, so the next guess is less likely to be a stop on the wrong road.
If you guessed wrong, no problemo. Correcting your mistake is quick and cheap. You don’t have to undo 100 decisions that followed, then make 100 new ones.
So a critical metric in software development is how long it takes for us to test our decisions after they’ve been made. That feedback latency needs to be as low as possible.
I’m now calling this approach feedbackmaxxing, because that’s how we talk these days apparently.
Feedbackmaxxing is maximising feedback frequency while minimising feedback latency across the entire software development system.
This is about two variables we can control in our development process:
Batch Size – how many decisions need feedback (e.g., from testing, from code review, from users) at a time?
Feedback Frequency – how often do we get that feedback?
The bigger the batches, the longer it takes to get feedback. The smaller the batches, the sooner we learn what works and what doesn’t.
The smart players work in small batches – they solve one problem at a time – and engineer their feedback loops to be very fast.
Software development cycles are loops within loops. We have that outer loop – will a reminder to reorder a prescription reduce missed doses? And we have the inner loop – did that change I just made to the code work? Did it break anything that was depending on it?
The smart players know something about how to optimise nested loops, too. They know that to speed up the outer loop – the real-world user feedback from working releases – you focus your attention on the innermost loop.
How long does it take to build and test the software? If the answer is an hour, you have a big problem. Your choices are not great – you can either test one change at a time, and spend most of your day waiting for feedback. Or – and this is the most popular choice – you make a lot of changes, and then test them, in the mistaken belief this will save you time. “I’m too busy building on top of broken code for testing!”
The other systemic effect of large batches is that – because they take longer to get feedback on (reviewing a 5-line diff vs. a 500-line diff, for example) – changes tend to end up sitting in queues waiting their turn.
Make the batches bigger, the queues get larger, and delays get longer. The more decisions we make before testing them, the slower we get overall.
Large Language Models can make a lot of decisions – e.g., a lot of changes to our code – very, very quickly. It comes as no surprise that data from studying work queues across thousands of teams shows diffs getting bigger and bigger, queues getting larger and larger, and lead times for getting changes into production getting longer and longer.
In the most meaningful sense, feedback latency isn’t the time elapsed after a decision’s been made before we get feedback, but the number of subsequent decisions made that are a consequence of it – how many miles did we carry on down that road. Lightning fast code generation doesn’t help us here. If anything, it probably makes latency worse – we’re much further down potentially the wrong road driving a Maserati than if we’d walked.
“Ah, but Jason, we can just get the agent to regenerate the software again from the original specs.” U-huh? Tell me you’ve never tried that on anything non-trivial without telling me you’ve never tried that on anything non-trivial.
“Aha! But we can just get the agent to make the changes we need.” This is where the peak-end rule bites on the backside. Ask users, for example, for feedback on a single design choice, and you’ll get specific, meaningful, useful thoughts. Ask them for feedback on 50 choices, and they’ll talk about the one or two things that stood out, and the last thing they saw. (See also: code reviews – “Looks good to me”).
You are drinking from a code-generating firehose, and it’s getting out of control.
The answer to your AI-generated woes is feedbackmaxxing. Ask one question at a time. Get an answer as soon as possible. Test continuously. Review continuously. Integrate continuously. Get real-world feedback continuously.
A lot of people struggle to picture what that looks like.
Once you’ve seen it, though, your journey to Feedbackmaxxville (twinned with Gas Town) can begin.
“The key to AI-assisted and agentic software development is <insert thing you were selling before>”
The Big Design Up-Front folks say the key is better specifications. The plan-driven folks say it’s better plans. The architects say it’s better architecture. The product managers say it’s better product management. The command-and-control folks say it’s better agent orchestration. The test automators say it’s better test suites. The folks selling static analysis tools say it’s better automated code reviews. The folks selling the models say… well, we know what they say. MORE TOKENS!!!
It’s true that I’m also claiming that the key to AI-assisted software development is something I just happen to specialise in – development practices that work in small batches and rapid feedback loops.
The difference is that the data’s led me back here, just like it led me to it in the first place.
The only thing that AI code generation has really changed is the speed at which code’s generated and the amount of code that needs designing, testing, reviewing, refactoring and integrating.
Data collected on thousands of teams by the DevOps Research & Assessment group shows code being created faster, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.
Data collected on millions of CI workflows by CircleCI shows code being created faster on developer branches, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.
Data collected on thousands of teams by Faros shows code being created faster on developer branches, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.
The problem is what it always was – phase-gated development processes that try to handle design, testing, review, refactoring, merging and releasing large batches of changes.
You can’t specify your way out of it. You can’t architect your way out of it. You can’t automate your way out of it (because judgement will always be needed – Actual Intelligence). You can’t product manage or type-check or DDD or team topology your way out of it.
That’s not to say these things bring no value. They all do.
But batch sizes and feedback loops hold the biggest leverage here, by orders of magnitude. They always did and they always will.
But who wants to hear about taking smaller steps, right? That’s just boring stuff from the 1990s.
The hardest lesson I had to learn as a software developer in the early part of my career was that what felt “productive” to me locally – uninterrupted time, code getting created fast etc – often turned out to be a bad sign for overall team outcomes.
Taking interruptions as an example, I mistook concurrency for parallelism in the way I worked. Concurrency is about communication and coordination, and when a bunch of devs are working on different aspects of the same problem at the same time, communication and coordination become the primary activities.
Coding interrupts communication and coordination. CODING IS THE INTERRUPTION.
The more time I spend coding and not communicating, the further I drift from the rest of the team.
We talk a good game about building shared understanding and aligning teams, but then we do everything we can to minimise that in the pursuit of individual “productivity”.
Of course, there are ways we can go about writing code that maintain communication and coordination – but your boss might not like them. (Sing along if you know the words – “But that’s 2 developers doing the work of 1!”)
Cliffs Notes version: what feels productive to you is often counter-productive for the team.
The fix? Keep one eye on the horizon, not both eyes on your feet. Effective teams need a high level of situational awareness – being cognizant of what’s going on around you.
And we can work in ways that make us more interruptible. But your boss might not like them, either. (“One little test at a time? It looks so slow!”)
I’ve found so many times that having clear end goals – unambiguously articulated, and ideally measurable for meaningful feedback – has counteracted this illusion of individual productivity.
Indeed, a little like Douglas Harding’s Headless Way, after a while you realise that there is no individual productivity in software development. There is only the team.
Of course – as has happened so many times before – most teams are running in the exact opposite direction with AI-assisted coding, pursuing this seductive illusion of “individual productivity” at the expense of team outcomes.
And – just as all those times before – the solutions that actually work are known to a small few.
For the small percentage of engineering orgs who’d genuinely like to be shipping more reliable software and be more responsive to the needs of their business and their users – it’s a niche, I know – I’m running a public 3-day online Code Craft workshop on July 7-9.
If you’re a developer, twist your manager’s arm – especially if they’re expecting you to be more productive using tools like Claude Code and Copilot.
If you’re an engineering leader, this is the real AI-assisted software engineering training your teams need – and, funnily enough, it’s mostly about software engineering and only a little bit about AI. It’s about making teams AI-ready.
It’s 6x half-day modules that give developers a practical, hands-on introduction to the foundational technical practices that enable teams to accelerate release cycles, shrink lead times and improve release reliability – with and without AI.
On top of this, decisions have dependencies, and this means that errors can compound downstream. Take a wrong turn at step N, and step N+1, N+2, N+3 could well build on that mistake.
The other side of the equation is verification. Mistakes aren’t a problem if they’re caught before they compound.
So we now have two components: the probability of an error, and the probable number of subsequent steps before the error’s detected.
More bluntly, if the agent f***ed up, how soon would we/it know?
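To put rough numbers on those two components, here’s a sketch of a toy model. The assumptions are mine and should be read loudly: each step has the same independent chance of going wrong, nothing is checked until the end of the run, and every step from the first error onwards is wasted because it builds on the mistake. The function name and the 20% error rate are illustrative, not measurements.

```python
def expected_wasted_steps(p_error: float, horizon: int) -> float:
    """Expected number of steps thrown away in a run of `horizon` steps.

    Toy model (all assumptions): each step fails independently with
    probability p_error, verification only happens after the last step,
    and everything from the first bad step onwards has to be redone.
    """
    total = 0.0
    for i in range(1, horizon + 1):
        # Chance that steps 1..i-1 were fine and step i is the first error.
        p_first_error_here = (1 - p_error) ** (i - 1) * p_error
        # Step i and every step after it are wasted.
        total += p_first_error_here * (horizon - i + 1)
    return total


# Compare checking after every step with checking after longer runs.
for horizon in (1, 10, 100):
    wasted = expected_wasted_steps(0.2, horizon)
    print(f"verify every {horizon:>3} steps: "
          f"{wasted:6.2f} wasted on average ({wasted / horizon:.0%} of the run)")
```

Under this model, the error rate per step never changes. The only thing that changes is how long the error goes undetected, and that alone decides how much of the work ends up in the bin.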
The E in my CRESS principles for context engineering stands for “Empirical” – input contexts should be grounded in observed reality, not unverified model output. I visualise raw model output as being like untreated sewage. Yes, there’s water in it. But it’s not safe for the model to drink.
To make it safe, it needs to be tested against reality and potentially debugged and refactored, or flushed down the drain if it’s too far gone.
I don’t know about you, but I’d think twice about drinking water from the tap if I knew that the only testing it had been through was someone holding it up to the light and pronouncing “Looks good to me”.
This takes us into the wonderful world of test assurance, and into territory that will be alien to the vast majority of software teams. The longer the autonomous horizon, the higher the assurance needs to be.
I’m seeing lots of folks (finally!) discovering the value of mutation testing – a technique for testing your tests by deliberately introducing errors into the code and seeing if the tests fail – in agentic workflows. And there’s no doubting this helps close the gaps that errors can leak through from one step to the next.
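If you’ve not seen mutation testing before, here’s a hand-rolled sketch of the idea. Real mutation testing tools generate and run the mutants for you; in this illustration the mutant is written by hand, and every name in it is hypothetical.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    """The original code under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """A mutant: '>=' has been deliberately changed to '>'."""
    return age > 18


def weak_tests(fn: Callable[[int], bool]) -> bool:
    """Returns True if the tests pass. Never checks the boundary at 18."""
    return fn(30) is True and fn(5) is False


def strong_tests(fn: Callable[[int], bool]) -> bool:
    """Same tests, plus a check on the boundary value."""
    return weak_tests(fn) and fn(18) is True


def mutant_killed(tests, original, mutant) -> bool:
    """A mutant is 'killed' if the tests pass on the original but fail on the mutant."""
    assert tests(original), "the tests must pass on the original code first"
    return not tests(mutant)


print(mutant_killed(weak_tests, is_adult, is_adult_mutant))    # False: the mutant survives
print(mutant_killed(strong_tests, is_adult, is_adult_mutant))  # True: the mutant is killed
```

A surviving mutant is a gap in the tests: an error of exactly that kind could be introduced at one step and carried into the next without anything going red.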
But the kind of full autonomy Anthropic and others claim will soon be upon us requires degrees of assurance that go way beyond even that needed for safety-critical systems into uncharted territory.
Now, personally, I think testing and verification in software has left a lot to be desired for many decades. Almost none of you have ever gotten in the ballpark of what I consider to be good enough, even for line-of-business applications, let alone safety-critical ones.
But even if we could drag ourselves into that ballpark, I know from experience that high-integrity software engineering still requires acres of human judgement and learning that LLMs will likely never be capable of.
But it might extend the agentic horizon from, say, N steps to 1.1 N steps before we need to course correct. And that could be the key to squeezing out more net value from the technology – maybe our lead times shrink from L to 0.9 L?
The fun part is that I know for a fact – having tried for nearly 30 years to get teams interested in upping the integrity of their products – that 99% will not want to hear that the answer is MORE RIGOUR.
(And, of course, we’re just talking about one kind of testing here. When we add in other qualities of software, like maintainability, performance and security – you can probably see why I consider full autonomy a Fool’s Errand.)
Since I wrote about my CRESS principles for context engineering – contexts should be Current, Refutable, Empirical, Small & Specific – I’ve been thinking about how that applies to my AI-assisted software development workflow.
You also won’t be surprised to hear that I work in small steps, solving one problem at a time. (Though you might be surprised at how small a step I mean by a “small step”).
You probably won’t be surprised that I run my tests after every change to the code. And you probably won’t be surprised that I’m in the habit of committing changes when I see the tests pass, or that I often revert changes when tests fail.
Nor will you be very surprised that I review the code after each small change, and not after a whole bunch of changes. I’ll look at the code carefully, perhaps run a linter to check for low-level problems that are easy to miss.
This has been my workflow for nearly 3 decades. And so you probably won’t be surprised to learn that it’s still my workflow in 2026, whether I’m using AI tools or not.
I’ve experimented extensively with automating the parts where I normally judge results and make decisions, and I’ve seen many others trying to do the same.
I went on a journey from me essentially orchestrating every small step, to a single agent, to multiple concurrent agents working without intervention for longer and longer.
And I saw just how impossible long-horizon, fully autonomous agentic workflows are. And I do mean impossible. A single step it might get right 80% of the time. 2 in a row? 10 in a row? 100 in a row? Forget it. It might not fall at the first hurdle, but it will fall soon enough.
So I walked it back to a single agent – a basic Ralph loop – and then back even further to me essentially being the agent. I am Ralph.
I see more and more people who’ve spent lots of time on the same journey, and they too have reached a stage where they’re making their harnesses simpler and simpler, stripping out everything that they’ve discovered isn’t helping – and, in many instances, probably making things worse. I expect to meet some of them at “I am Ralph” soon.
If I visualise my workflow as a conversation between me and the agent, and between the agent and the model, there’s pretty much a one-to-one mapping between the steps in the process and my interventions.
I asked ChatGPT to try and visualise how this might look in a test-driven workflow, with continuous testing, inspection and refactoring and continuous integration.
What it came up with is close – in spirit, at least. Except I wouldn’t ask Claude to perform a refactoring that my IDE has a shortcut for. If you find yourself asking AI to do something you can do quicker and better, arguably you’ve lost the plot.
Also not mentioned in the diagram is automated code inspection – static analysis and that sort of thing – which I would have used multiple times in this workflow.
And, most importantly, the agent doesn’t decide the next step. I do. Always. But ChatGPT refused to let go of that one.
Note how context is being created fresh for each step, and being flushed after each step. As soon as changes are applied to the code – having been tested first – the context is now stale. New balls, please!
It also means that the agent isn’t dragging context from earlier steps behind it, keeping context small and task-specific, dramatically reducing the risk of effects like attention dilution, context rot and probability collapse, and improving model predictions.
This kind of workflow is far more token and compute-efficient for the models, too.
Join me on Saturday May 23rd at 9:50 BST with other self-funding learners to get hands on with the micro-cycles and small steps of Test-Driven Development.
You know how it is when folks agree on something, but in their heads they have very different pictures of what it is they think they’re agreeing on?
I get that a lot when I talk about working in “small steps”. They nod enthusiastically and we all agree that small steps are a good thing.
And then I look at the size of their commits. Or they look at the size of mine. And now we don’t agree. We don’t agree at all.
Aside from being a classic example of where “Don’t tell me, show me” can aid in communication, it’s generally useful to contrast and compare our place in a distribution, and maybe recalibrate our expectations.
To give you an idea, pay close attention to how little code I change before I run my tests and – if they pass – commit those changes, before making the next change in this demonstration of refactoring.
One of the common challenges I face as a teacher is getting developers to move forward by putting one sure foot in front of the other, instead of trying to do it in risky leaps and bounds.
One activity in particular where this friction occurs is refactoring. I watch people hack away at swathes of code, making dozens of changes, before I say “Shall we run our tests now?”
More often than not, the tests fail. Every change to the code carries the risk of breaking it, and that’s true whether we make them one at a time or 100 at a time before testing them.
But when we make them one at a time, if we break the software we know exactly which change broke it. Fixing it is a doddle usually. And if we can’t fix it, we can just roll back the change with a simple Ctrl-Z or git reset --hard (if we’re in the habit of committing whenever we see the tests pass after a change).
If we make 100 changes and one or more of them breaks the software – which is now almost certain – then we have a much bigger problem. Was it the first change? Or maybe the last change? Or the 48th change? Into the debugger we go – probably for quite some time. And if we can’t fix it, undoing it is a lot of work lost.
The discipline of refactoring is in reshaping code one single atomic, tested change at a time instead of hacking away at it, making a whole bunch of changes. We rename a function, and run the tests. And if the tests pass, we might commit that before making another change.
No matter the scale of the restructuring we plan to do, we do it one small, easily-reversible change at a time. We put one foot in front of the other.
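The rhythm of one change, run the tests, then commit or roll back can be sketched in a few lines. This is an illustration rather than a tool I’m recommending: the function names are made up, and the test command (pytest here) is an assumption you’d swap for whatever runs your own suite.

```python
import subprocess
from typing import Callable, Sequence


def run(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def small_step(
    tests_pass: Callable[[], bool],
    commit: Callable[[], object],
    revert: Callable[[], object],
) -> str:
    """After ONE change: commit if the tests pass, otherwise roll it back."""
    if tests_pass():
        commit()
        return "committed"
    revert()
    return "reverted"


def git_step(test_cmd=("python", "-m", "pytest", "-q"), message="small step") -> str:
    """The same loop wired up to git. The test command is an assumption."""
    return small_step(
        tests_pass=lambda: run(test_cmd),
        # 'git commit -am' commits changes to files git already tracks.
        commit=lambda: run(["git", "commit", "-am", message]),
        # 'git reset --hard' throws away uncommitted changes to tracked files.
        revert=lambda: run(["git", "reset", "--hard"]),
    )
```

Calling git_step() after every single change only works if the tests are fast and the changes are tiny, which is rather the point.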
And each change has one specific objective – make the intent of a function easier to understand, break down a complicated IF block into something simpler, decouple business logic from an API call, and so on. Each change solves one problem, working in rapid micro-cycles with continuous testing and code review, instead of turning them into serious bottlenecks later.
This is beneficial when we’re working with AI coding tools, as multiple large-scale studies show very clearly the negative impact of downstream bottlenecks in development.
And it’s also helpful when we’re using LLMs to generate code for us. The more we ask a model to do, the less likely it is to do it successfully. (See “S is for Small”)
If we move forward in small steps, solving one problem at a time in tight feedback cycles, then our contexts can be about one specific thing – write this specific failing test, write the simplest code to pass that test, review the code that’s changed for this specific smell, do this specific refactoring.
And don’t include any information in the context that isn’t needed for that specific task.
If the task is to move a method from one class to another, we don’t need to give the model a summary of our architecture or of our coding standards or anything else unrelated.
It just needs a single instruction, and the code affected by the refactoring, and perhaps an example – maybe in a reusable context file – that illustrates the mechanics of that refactoring.
And it needs a way to test that the refactoring achieved the goal. So if the goal was to eliminate Feature Envy, then we can test for that smell afterwards in the code that’s changed.
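As a sketch of what a small, task-specific context might look like in practice, here’s one way to assemble it from just those ingredients: the instruction, the affected code and an optional example. Everything here is an assumption for illustration, including the section labels, the function name and the size limit. The word count is only a crude stand-in for a real token count.

```python
from typing import Optional


def build_context(
    instruction: str,
    affected_code: str,
    example: Optional[str] = None,
    max_words: int = 300,  # arbitrary illustrative budget, not a recommendation
) -> str:
    """Assemble a context containing only what one specific task needs."""
    parts = [
        f"TASK:\n{instruction.strip()}",
        f"CODE:\n{affected_code.strip()}",
    ]
    if example:
        parts.append(f"EXAMPLE:\n{example.strip()}")
    context = "\n\n".join(parts)

    # Crude size check: whitespace-separated words, NOT real model tokens.
    word_count = len(context.split())
    if word_count > max_words:
        raise ValueError(
            f"context is {word_count} words; shrink the blast radius of the change"
        )
    return context


print(build_context(
    instruction="Move the method total() from Order to Invoice.",
    affected_code="class Order:\n    def total(self): ...\n\nclass Invoice: ...",
))
```

Note what the function has no parameter for: architecture summaries, coding standards or anything else unrelated to the task. If the context won’t fit in the budget, that’s a signal to make the step smaller, not the budget bigger.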
This means that – provided the “blast radius” of the change is small – the context for this interaction with the model will be well within effective limits.
Any information included in the context that has no relation to the task at hand will just water down the model’s attention and reduce the probability of successful completion.
I conducted a closed-loop experiment where I asked Claude Opus 4.6 to execute a coding task, and then – with the help of GPT-5.2, arguably the best model if waffle is what you’re after – added more and more irrelevant information to the prompt. The task remained the same, but we buried it under increasing amounts of distractions – including a fictional set of coding standards and an architecture summary.
Each variation was attempted 10 times, so I could measure how many times out of 10 the task was successfully completed.
Long story short – the more extraneous or irrelevant information, the worse the model performs in specific tasks.
The experiments I’ve done, backed up by larger independent studies into the effect of context size on model performance, have also forced me to recalibrate what I mean by a “small context”. Forget the maximum advertised context limit for your model. Accuracy degrades rapidly with even just a few hundred tokens.
So for each interaction, contexts need to be fresh, task-specific and only contain the minimum information needed for that task.