refactoring – Codemanship's Blog

Feedbackmaxxing

You know the TV gameshow Play Your Cards Right? Contestants are shown a sequence – in two rows – of giant playing cards presented face-down. The host turns over the first card. The contestant then has to guess if the next card is higher or lower than that one.

They move across the board, guessing and then revealing one card at a time until either the contestant guesses wrong or they complete the sequence and win the game.

Now imagine a version of that where they don’t turn the cards over until the contestant has guessed higher or lower for the entire sequence.

“That’s just silly, Jason.”

You’re absolutely right. It is silly. Very silly. The odds of winning the game would be so remote that we’d probably never see it happen.

So why are you developing software that way?

Be honest now – you are.

You don’t turn the cards over one at a time. You make a whole bunch of guesses about what the users or the business really needs. Then you make a whole bunch of design decisions that may or may not be the right decisions. Then you make a whole bunch of changes to the code that may or may not work. And only then do you turn the cards over to see if all those many guesses were good guesses.

Every decision, and every change to the code, carries uncertainty. And that uncertainty compounds with every subsequent decision or change. If we have a 90% chance of getting one right, we have an 81% chance of getting two right, a 35% chance of getting ten right, and a 0.003% chance of getting 100 right. The more uncertainty accumulates, the longer we spend driving in the dark with the lights off.
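Those figures are easy to check. Here is a minimal sketch, assuming every decision is independent and has the same chance of being right (a simplification, since real decisions depend on each other); the function name is purely illustrative.

```python
def chance_all_right(p_single: float, decisions: int) -> float:
    """Probability that every one of `decisions` independent guesses is right."""
    return p_single ** decisions


if __name__ == "__main__":
    # Same 90% chance per decision, with more and more decisions left untested.
    for n in (1, 2, 10, 100):
        print(f"{n:>3} decisions: {chance_all_right(0.9, n):.3%}")
```

Changing `0.9` to a less generous figure shows how quickly the odds collapse as the number of untested decisions grows.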

These decisions and these changes don’t exist in isolation. One decision is often a consequence of an earlier decision – another junction along the way of the path we chose. One change to the code will constrain our choice of future changes.

If we take a wrong turn with any decision or any change (which is just another decision, really), how long can we afford to waste heading down the wrong road? How long will it take and how much will it cost to get back on the right road?

The further we go before we get a meaningful answer, the bigger the wasted time and effort, and the more it will cost to correct.

And this is where sunk cost enters the chat. When the cost of correcting a mistake is too high, teams will tend to choose to live with the mistake. Waddayagonnado?

And that’s how you make software, that is.

A smarter way is to turn the cards over as they’re being played. Test your guesses against reality as soon as possible, so the next guess is less likely to be a stop on the wrong road.

If you guessed wrong, no problemo. Correcting your mistake is quick and cheap. You don’t have to undo 100 decisions that followed, then make 100 new ones.

So a critical metric in software development is how long it takes for us to test our decisions after they’ve been made. That feedback latency needs to be as low as possible.

I’m now calling this approach feedbackmaxxing, because that’s how we talk these days apparently.

Feedbackmaxxing is maximising feedback frequency while minimising feedback latency across the entire software development system.

This is about two variables we can control in our development process:

Batch Size – how many decisions need feedback (e.g., from testing, from code review, from users) at a time?
Feedback Frequency – how often do we get that feedback?

The bigger the batches, the longer it takes to get feedback. The smaller the batches, the sooner we learn what works and what doesn’t.

The smart players work in small batches – they solve one problem at a time – and engineer their feedback loops to be very fast.

Software development cycles are loops within loops. We have that outer loop – will a reminder to reorder a prescription reduce missed doses? And we have the inner loop – did that change I just made to the code work? Did it break anything that was depending on it?

The smart players know something about how to optimise nested loops, too. They know that to speed up the outer loop – the real-world user feedback from working releases – you focus your attention on the innermost loop.

How long does it take to build and test the software? If the answer is an hour, you have a big problem. Your choices are not great – you can either test one change at a time, and spend most of your day waiting for feedback. Or – and this is the most popular choice – you make a lot of changes, and then test them, in the mistaken belief this will save you time. “I’m too busy building on top of broken code for testing!”

The other systemic effect of large batches is that – because they take longer to get feedback on (reviewing a 5-line diff vs. a 500-line diff, for example) – changes tend to end up sitting in queues waiting their turn.

Make the batches bigger, the queues get larger, and delays get longer. The more decisions we make before testing them, the slower we get overall.

The evidence at this point is overwhelming that AI code generation speeds developers up, but slows teams down. We’ve been maxxing the wrong thing.

Large Language Models can make a lot of decisions – e.g., a lot of changes to our code – very, very quickly. It comes as no surprise that data from studying work queues across thousands of teams shows diffs getting bigger and bigger, queues getting larger and larger, and lead times for getting changes into production getting longer and longer.

In the most meaningful sense, feedback latency isn’t the time elapsed after a decision’s been made before we get feedback, but the number of subsequent decisions made that are a consequence of it – how many miles did we carry on down that road. Lightning fast code generation doesn’t help us here. If anything, it probably makes latency worse – we’re much further down potentially the wrong road driving a Maserati than if we’d walked.

“Ah, but Jason, we can just get the agent to regenerate the software again from the original specs.” U-huh? Tell me you’ve never tried that on anything non-trivial without telling me you’ve never tried that on anything non-trivial.

“Aha! But we can just get the agent to make the changes we need.” This is where the peak-end rule bites on the backside. Ask users, for example, for feedback on a single design choice, and you’ll get specific, meaningful, useful thoughts. Ask them for feedback on 50 choices, and they’ll talk about the one or two things that stood out, and the last thing they saw. (See also: code reviews – “Looks good to me”).

And then there’s the established fact that LLMs are good at generating code that they’re bad at modifying later. And the more complex the code base is, the worse they get. I wish you the best of luck with that!

You are drinking from a code-generating firehose, and it’s getting out of control.

The answer to your AI-generated woes is feedbackmaxxing. Ask one question at a time. Get an answer as soon as possible. Test continuously. Review continuously. Integrate continuously. Get real-world feedback continuously.

A lot of people struggle to picture what that looks like.

Once you’ve seen it, though, your journey to Feedbackmaxxville (twinned with Gas Town) can begin.

Talking of which…

What If The Real Key To AI Coding Is Old-Fashioned & Boring?

“The key to AI-assisted and agentic software development is <insert thing you were selling before>”

The Big Design Up-Front folks say the key is better specifications. The plan-driven folks say it’s better plans. The architects say it’s better architecture. The product managers say it’s better product management. The command-and-control folks say it’s better agent orchestration. The test automators say it’s better test suites. The folks selling static analysis tools say it’s better automated code reviews. The folks selling the models say… well, we know what they say. MORE TOKENS!!!

It’s true that I’m also claiming that the key to AI-assisted software development is something I just happen to specialise in – development practices that work in small batches and rapid feedback loops.

The difference is that the data’s led me back here, just like it led me to it in the first place.

The only thing that AI code generation has really changed is the speed at which code’s generated and the amount of code that needs designing, testing, reviewing, refactoring and integrating.

Data collected on thousands of teams by the DevOps Research & Assessment group shows code being created faster, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.

Data collected on millions of CI workflows by CircleCI shows code being created faster on developer branches, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.

Data collected on thousands of teams by Faros shows code being created faster on developer branches, only to end up languishing in queues waiting for user feedback, design decisions, testing, review and merging to the release branch. Net effect – slower delivery and less stable releases.

The problem is what it always was – phase-gated development processes that try to handle design, testing, review, refactoring, merging and releasing large batches of changes.

You can’t specify your way out of it. You can’t architect your way out of it. You can’t automate your way out of it (because judgement will always be needed – Actual Intelligence). You can’t product manage or type-check or DDD or team topology your way out of it.

That’s not to say these things bring no value. They all do.

But batch sizes and feedback loops hold the biggest leverage here, by orders of magnitude. They always did and they always will.

But who wants to hear about taking smaller steps, right? That’s just boring stuff from the 1990s.

Would it help if I called it “feedbackmaxxing”?

Public Code Craft Training – July 7-9

For the small percentage of engineering orgs who’d genuinely like to be shipping more reliable software and be more responsive to the needs of their business and their users – it’s a niche, I know – I’m running a public 3-day online Code Craft workshop on July 7-9.

If you’re a developer, twist your manager’s arm – especially if they’re expecting you to be more productive using tools like Claude Code and Copilot.

If you’re an engineering leader, this is the real AI-assisted software engineering training your teams need – and, funnily enough, it’s mostly about software engineering and only a little bit about AI. It’s about making teams AI-ready.

It’s 6x half-day modules that give developers a practical, hands-on introduction to the foundational technical practices that enable teams to accelerate release cycles, shrink lead times and improve release reliability – with and without AI.

Specification By Example
Test-Driven Development
Refactoring
Design Principles
Continuous Delivery
Code Craft & AI – grounded in hard data, includes how to apply CRESS principles for context engineering to AI-assisted workflows

To learn more and register, visit https://codemanship.co.uk/codecraft.html

Places are limited.

I Am Ralph – CRESS Principles in Practice

Since I wrote about my CRESS principles for context engineering – contexts should be Current, Refutable, Empirical, Small & Specific – I’ve been thinking about how that applies to my AI-assisted software development workflow.

You won’t be surprised – if you follow this blog or know me professionally at all – to hear that I drive design and development with tests.

You also won’t be surprised to hear that I work in small steps, solving one problem at a time. (Though you might be surprised at how small a step I mean by a “small step”).

You probably won’t be surprised that I run my tests after every change to the code. And you probably won’t be surprised that I’m in the habit of committing changes when I see the tests pass, or that I often revert changes when tests fail.

Nor will you be very surprised that I review the code after each small change, and not after a whole bunch of changes. I’ll look at the code carefully, perhaps run a linter to check for low-level problems that are easy to miss.

This has been my workflow for nearly 3 decades. And so you probably won’t be surprised to learn that it’s still my workflow in 2026, whether I’m using AI tools or not.

I’ve experimented extensively with automating the parts where I normally judge results and make decisions, and I’ve seen many others trying to do the same.

I went on a journey from me essentially orchestrating every small step, to a single agent, to multiple concurrent agents working without intervention for longer and longer.

And I saw just how impossible long-horizon, fully autonomous agentic workflows are. And I do mean impossible. A single step it might get right 80% of the time. 2 in a row? 10 in a row? 100 in a row? Forget it. It might not fall at the first hurdle, but it will fall soon enough.

So I walked it back to a single agent – a basic Ralph loop – and then back even further to me essentially being the agent. I am Ralph.

I see more and more people who’ve spent lots of time on the same journey, and they too have reached a stage where they’re making their harnesses simpler and simpler, stripping out everything that they’ve discovered isn’t helping – and, in many instances, probably making things worse. I expect to meet some of them at “I am Ralph” soon.

If I visualise my workflow as a conversation between me and the agent, and between the agent and the model, there’s pretty much a one-to-one mapping between the steps in the process and my interventions.

I asked ChatGPT to try and visualise how this might look in a test-driven workflow, with continuous testing, inspection and refactoring and continuous integration.

What it came up with is close – in spirit, at least. Except I wouldn’t ask Claude to perform a refactoring that my IDE has a shortcut for. If you find yourself asking AI to do something you can do quicker and better, arguably you’ve lost the plot.

Also not mentioned in the diagram is automated code inspection – static analysis and that sort of thing – which I would have used multiple times in this workflow.

And, most importantly, the agent doesn’t decide the next step. I do. Always. But ChatGPT refused to let go of that one.

Note how context is being created fresh for each step, and being flushed after each step. As soon as changes are applied to the code – having been tested first – the context is now stale. New balls, please!

It also means that the agent isn’t dragging context from earlier steps behind it, keeping context small and task-specific, dramatically reducing the risk of effects like attention dilution, context rot and probability collapse, and improving model predictions.

This kind of workflow is far more token and compute-efficient for the models, too.

Calibrating Your Steps – How Small is “Small”?

Join me on Saturday May 23rd at 9:50 BST with other self-funding learners to get hands on with the micro-cycles and small steps of Test-Driven Development.

You know how it is when folks agree on something, but in their heads they have very different pictures of what it is they think they’re agreeing on?

I get that a lot when I talk about working in “small steps”. They nod enthusiastically and we all agree that small steps are a good thing.

And then I look at the size of their commits. Or they look at the size of mine. And now we don’t agree. We don’t agree at all.

Aside from being a classic example of where “Don’t tell me, show me” can aid in communication, it’s generally useful to contrast and compare our place in a distribution, and maybe recalibrate our expectations.

To give you an idea, pay close attention to how little code I change before I run my tests and – if they pass – commit those changes, before making the next change in this demonstration of refactoring.

CRESS Principles for Context Engineering – S is for Specific

One of the common challenges I face as a teacher is getting developers to move forward by putting one sure foot in front of the other, instead of trying to do it in risky leaps and bounds.

One activity in particular where this friction occurs is refactoring. I watch people hack away at swathes of code, making dozens of changes, before I say “Shall we run our tests now?”

More often than not, the tests fail. Every change to the code carries the risk of breaking it, and that’s true whether we make them one at a time or 100 at a time before testing them.

But when we make them one at a time, if we break the software we know exactly which change broke it. Fixing it is a doddle usually. And if we can’t fix it, we can just roll back the change with a simple Ctrl-Z or git reset --hard (if we’re in the habit of committing whenever we see the tests pass after a change).

If we make 100 changes and one or more of them breaks the software – which is now almost certain – then we have a much bigger problem. Was it the first change? Or maybe the last change? Or the 48th change? Into the debugger we go – probably for quite some time. And if we can’t fix it, undoing it is a lot of work lost.

The discipline of refactoring is in reshaping code one single atomic, tested change at a time instead of hacking away at it, making a whole bunch of changes. We rename a function, and run the tests. And if the tests pass, we might commit that before making another change.

No matter the scale of the restructuring we plan to do, we do it one small, easily-reversible change at a time. We put one foot in front of the other.
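The test-then-commit-or-revert rhythm described above can be sketched as a small helper that is called after every single change. This is an illustration rather than a prescribed tool: the `checkpoint` function, the injected `runner` and the use of `pytest` in the usage note are all assumptions. Note that `git commit -am` only picks up changes to tracked files, and `git reset --hard` leaves untracked files in place.

```python
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code (0 means success).
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command for real."""
    return subprocess.run(cmd).returncode


def checkpoint(test_cmd: Sequence[str], message: str, runner: Runner = run) -> bool:
    """Call after ONE small change: commit if the tests pass, revert if not."""
    if runner(test_cmd) == 0:
        # Tests are green, so bank the change before making the next one.
        runner(["git", "commit", "-am", message])
        return True
    # Tests are red, so throw away the one change that broke them.
    runner(["git", "reset", "--hard"])
    return False
```

A call might look like `checkpoint(["pytest", "-q"], "Rename calc() to total_price()")`. Because only one change sits between checkpoints, a failure never leaves more than one suspect.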

And each change has one specific objective – make the intent of a function easier to understand, break down a complicated IF block into something simpler, decouple business logic from an API call, and so on. Each change solves one problem, working in rapid micro-cycles with continuous testing and code review, instead of turning them into serious bottlenecks later.

This is beneficial when we’re working with AI coding tools, as multiple large-scale studies show very clearly the negative impact of downstream bottlenecks in development.

And it’s also helpful when we’re using LLMs to generate code for us. The more we ask a model to do, the less likely it is to do it successfully. (See “S is for Small”)

If we move forward in small steps, solving one problem at a time in tight feedback cycles, then our contexts can be about one specific thing – write this specific failing test, write the simplest code to pass that test, review the code that’s changed for this specific smell, do this specific refactoring.

And don’t include any information in the context that isn’t needed for that specific task.

If the task is to move a method from one class to another, we don’t need to give the model a summary of our architecture or of our coding standards or anything else unrelated.

It just needs a single instruction, and the code affected by the refactoring, and perhaps an example – maybe in a reusable context file – that illustrates the mechanics of that refactoring.

And it needs a way to test that the refactoring achieved the goal. So if the goal was to eliminate Feature Envy, then we can test for that smell afterwards in the code that’s changed.
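As a rough illustration of how little such a context needs to hold, here is a minimal sketch that assembles one from scratch for a single refactoring. The `build_context` function and its layout are hypothetical, not a format any particular tool requires.

```python
from typing import Optional


def build_context(instruction: str, affected_code: str,
                  example: Optional[str] = None) -> str:
    """Assemble a fresh, task-specific context: one instruction, the code it
    touches, and optionally one worked example. Nothing else goes in."""
    parts = [f"Task: {instruction}", f"Code:\n{affected_code}"]
    if example:
        parts.append(f"Example of this refactoring:\n{example}")
    return "\n\n".join(parts)
```

Nothing is carried over between calls, so each step starts from a clean context containing only what that step needs. The check that the refactoring achieved its goal (the tests, plus a targeted inspection for the smell) stays outside the prompt.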

This means that – provided the “blast radius” of the change is small – the context for this interaction with the model will be well within effective limits.

Any information included in the context that has no relation to the task at hand will just water down the model’s attention and reduce the probability of successful completion.

I conducted a closed-loop experiment where I asked Claude Opus 4.6 to execute a coding task, and then – with the help of GPT-5.2, arguably the best model if waffle is what you’re after – added more and more irrelevant information to the prompt. The task remained the same, but we buried it under increasing amounts of distractions – including a fictional set of coding standards and an architecture summary.

Each variation was attempted 10 times, so I could measure how many times out of 10 the task was successfully completed.

Long story short – the more extraneous or irrelevant information, the worse the model performs on specific tasks.
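The shape of an experiment like this can be sketched in a few lines. This is not the harness that was actually used: `attempt` is a hypothetical callable that would send the prompt to a model, test the result and report pass or fail, and `pad_prompt` simply stacks distracting passages in front of an unchanged task.

```python
from typing import Callable


def pad_prompt(task: str, distractions: list[str], level: int) -> str:
    """Bury the unchanged task under the first `level` distracting passages."""
    return "\n\n".join(distractions[:level] + [task])


def success_rate(attempt: Callable[[str], bool], prompt: str, runs: int = 10) -> float:
    """Run the same prompt `runs` times and return the fraction that succeeded."""
    passed = sum(1 for _ in range(runs) if attempt(prompt))
    return passed / runs
```

Repeating `success_rate` for each `level` gives a success rate per amount of padding, which is the comparison that matters here.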

The experiments I’ve done, backed up by larger independent studies into the effect of context size on model performance, have also forced me to recalibrate what I mean by a “small context”. Forget the maximum advertised context limit for your model. Accuracy degrades rapidly with even just a few hundred tokens.

So for each interaction, contexts need to be fresh, task-specific and only contain the minimum information needed for that task.

Essential Code Craft – The Roadmap

Some of you may have noticed that I’ve been running out-of-hours training workshops for self-funding learners recently, under the banner of Essential Code Craft.

In a way, this is a return to the early days of Codemanship when I ran regular weekend workshops – priced for individual pockets – that were mostly attended by developers investing in their own skills and career development.

Many of those people are now CTOs and heads of engineering, and I’ve been fortunate – and grateful – that quite a few have brought me in to provide the same kind of training for their teams.

But with senior engineering leaders now very distracted by the code-generating firehose – and while I wait for them to realise that nothing’s actually changed as far as software engineering fundamentals are concerned – I’m pivoting back to self-funders.

So far – just as it was way back when – the first two workshops filled up quickly. While the boss might not be thinking about investing in their developers at the moment, it seems a lot of developers are looking to invest in themselves.

And this is exactly the moment to do it. While a gazillion developers hunt for magic incantations to make a probabilistic next-token predictor act like something other than a probabilistic next-token predictor, the people who’ve done their homework already know: better results with AI coding tools have very little to do with the tools, and almost everything to do with the processes around them.

And it’s a double-win. The practices that produce the best outcomes with AI are the exact same practices that produce the best outcomes without AI.

The key to being effective with AI is being effective without it.

And here’s the hedge, but only for the informed gamblers – developer hiring is rising again, but the demographic of these new hires is changing. Employers are favouring senior developers with significant pre-LLM experience.

I, and a few others, predicted this would happen. Demand would be highest for people who can do the things AI coding tools can’t – like, well, understand code. I mean really understand it. Not “LGTM” understanding. Deep comprehension of programs.

Not only that, but for all kinds of good reasons – economic, environmental, energy, ethical, geopolitical – the future of hyperscale LLMs is by no means predictable. Folks grappling with reduced token limits and rapidly degrading performance with Anthropic’s newest models will hopefully have figured out by now that building workflows that depend heavily on hyperscale LLMs is building on quicksand.

Who are Acme Megacorp gonna hire – the dev who sits on their hands because they’re waiting for their token limit to reset, or the dev who can just carry on at roughly the same overall pace of delivery?

And we should be under no illusions: teams who’ve mastered the fundamentals of software delivery are routinely outperforming teams who haven’t – with or without AI. AI is clearly not the differentiator.

So, whether you’re going to apply these disciplines with Claude Code or Codex, or with IntelliJ or VS Code, they still matter – arguably more than ever.

And what are these disciplines? What is Essential Code Craft?

Specification By Example – build shared understanding and pin down requirements with testable specifications
Test-Driven Development – rapidly iterate working software designs with short delivery lead times and reliable releases
Continuous Integration – keep teams more in sync with their changes, merging and testing them many times a day to ensure a working, shippable-at-any-time product
Continuous Collaboration – keep teams on the same page by continuously communicating with practices like pair programming and teaming
Refactoring – reshape code to make change easier, while keeping it working and shippable at all times
Modular Design – optimise software architecture to localise the “blast radius” and minimise the cost of changes, while making rapid testing and smarter reuse easier
Continuous Inspection – minimise the bottleneck and the “LGTM” effect of downstream code review by making it a continuous and highly automated process
Continuous Delivery – combine these fundamentals in a delivery process that can get the proverbial peas from the farmer’s field to the kitchen table through rapid, reliable integration, build and deployment pipelines
Continuous Improvement – build development capability in an evidence-based way, learning what really works and what doesn’t as you build skills, automate tools and workflows, and explore and experiment with your approach (and that’s where I come in!)

Workshops on Specification By Example and Test-Driven Development are already live and taking registrations. If there’s demand, more will follow.

The roadmap is to build a set of repeating individual workshops, rotating monthly, that will eventually cover all of these disciplines – some explicitly, some implicitly like Continuous Integration and pair programming, which will be an integral part of most workshops.

Self-funders can pick and choose which to attend, and my hope is that they’ll be a bit like Pokemon cards – gotta collect ’em all!

Keep an eye on the Codemanship Ticket Tailor box office for details of upcoming workshops.

Also, details of new workshop times will be posted here first, so subscribe to this blog if you’d like to be kept in the loop for future workshops.

Engineering Leaders: Your AI Adoption Doesn’t Start With AI

In the past few months, I’ve been hearing from more and more teams that the use of AI coding tools is being strongly encouraged in their organisations.

I’ve also been hearing that this mandate often comes with high expectations about the productivity gains leaders expect this technology to bring. But this narrative is rapidly giving way to frustration when these gains fail to materialise.

The best data we have shows that a minority of development teams are reporting modest gains – in the order of 5%-15% – in outcomes like delivery lead times and throughput. The rest appear to be experiencing negative impacts, with lead times growing and the stability of releases getting worse.

The 2025 DevOps Research & Assessment State of AI-assisted Software Development report makes it clear that the teams reporting gains were already high-performing or elite by DORA’s classification, releasing frequently, with short lead times and with far fewer fires in production to put out.

As the report puts it, this is not about tools or technology – and certainly not about AI. It’s about the engineering capability of the team and the surrounding organisation.

It’s about the system.

Teams who design, test, review, refactor, merge and release in bigger batches are overwhelmed by what DORA describes as “downstream chaos” when AI code generation makes those batches even bigger. Queues and delays get longer, and more problems leak into releases.

Teams who design, test, review, refactor, merge and release continuously in small batches tend to get a boost from AI.

In this respect, the team’s ranking within those DORA performance classifications is a reasonably good predictor of the impact on outcomes when AI coding assistants are introduced.

The DORA website helpfully has a “quick check” diagnostic questionnaire that can give you a sense of where your team sits in their performance bands.

(Answer as accurately as you can. Perception and aspiration aren’t capability.)

The overall result is usefully colour-coded. Red is bad, blue is good. Average is Meh. Yep, Meh is a colour.

If your team’s overall performance is in the purple or red, AI code generation’s likely to make things worse.

If your team’s performance is comfortably in the blue, they may well get a little boost. (You can abandon any hopes of 2x, 5x or 10x productivity gains. At the level of team outcomes, that’s pure fiction.)

The upshot of all this is that before you even think about attaching a code-generating firehose to your development process, you need to make sure the team’s already performing at a blue level.

If they’re not, then they’ll need to shrink their batch sizes – take smaller steps, basically – and accelerate their design, test, review, refactor and merge feedback loops.

Before you adopt AI, you need to be AI-ready.

Many teams go in the opposite direction, tackling whole features in a single step – specifying everything, letting the AI generate all the code, testing it after-the-fact, reviewing the code in larger change-sets (“LGTM”), doing large-scale refactorings using AI, and integrating the whole shebang in one big bucketful of changes.

Heavy AI users like Microsoft and Amazon Web Services have kindly been giving us a large-scale demonstration of where that leads – more bugs, more outages, and significant reputational damage.

A smaller percentage of teams are learning that what worked well before AI works even better with it. Micro-iterative practices like Test-Driven Development, Continuous Integration, Continuous Inspection, and real refactoring (one small change at a time) are not just compatible with AI-assisted development, they’re essential for avoiding the “downstream chaos” DORA finds in the purple-to-red teams.

And while many focus on the automation aspects of Continuous Delivery – and a lot of automation is required to accelerate the feedback loops – by far the biggest barrier to pushing teams into the blue is skills.

Yes. SKILLS.

Skills that most developers, regardless of their level of experience, don’t have. The vast majority of developers have never even seen practices like TDD, refactoring and CI being performed for real.

That’s certainly because real practitioners are pretty rare, so they’re unlikely to bump into one. But much of this is because of their famously steep learning curves. TDD, for example, takes months of regular practice to be able to use it on real production systems.

And, as someone who’s been practicing TDD and teaching it for more than 25 years, I know it requires ongoing mindful practice to maintain the habits that make it work. Use it or lose it!

An experienced guide can be incredibly valuable in that journey. It’s unrealistic to expect developers new to these practices to figure it all out for themselves.

Maybe you’re lucky to have some of the 1% of software developers – yes, it really is that few – who can actually do this stuff for real. Or even one of the 0.1% who has had a lot of experience helping developers learn them. (Just because they can do it, it doesn’t necessarily follow that they can teach it.)

This is why companies like mine exist. With high-quality training and mentoring from someone who not only has many thousands of hours of practice, but also thousands of hours of experience teaching these skills, the journey can be rapidly accelerated.

I made all the mistakes so that you don’t have to.

And now for the good news: when you build this development capability, release cycles and lead times speed up and reliability actually improves – whether you’re using AI or not.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.