Codemanship's Blog – Page 12 – Musings And Mutterings By Jason Gorman

The AI-Ready Software Developer #7 – Commit On Green, Revert On Red

Imagine you’re walking a tightrope tied to the peaks of two mountains. When you reach the middle, it’s a long way to safety – forwards or backwards – and a long way down if you fall.

Changing code’s a bit like walking a tightrope. Every step we take risks a fall, and the more changes we make, the more likely we are to experience catastrophe.

Now imagine the same rope, but this time it’s tied to wooden posts just a few feet apart and a few feet tall. The risk of a fall with each step remains the same, but you’re never far from safety, and if you do fall, it’s no big deal. You can just climb back up and carry on from the last post you reached.

If “safety” in terms of software means code we’re confident works, and that we could therefore ship if we wanted to, then we want those safe points to be close together and “low to the ground” – easy to climb back on at the last safe point and try again if we fall.

In practice, this means getting into the habit of committing our changes whenever we see all the tests pass. Provided they’re good tests, of course. Another thing LLM coding assistants are notorious for is generating meaningless, “weak” tests – or even commenting them out when they fail. Gosh, I wonder where they learned that? Monkey see, monkey do. My advice? Test your tests!
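One way to picture “test your tests”, as a minimal sketch rather than a real mutation-testing tool: break the code on purpose and see whether the test notices. Every name here (`ADD`, `MUTANT`, `killsMutant`) is made up for illustration.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

class TestYourTests {
    // The code under test, plus a deliberately broken copy (a "mutant").
    static final IntBinaryOperator ADD = (a, b) -> a + b;
    static final IntBinaryOperator MUTANT = (a, b) -> a - b;

    // A weak test: it runs the code, but its one check can't tell + from -.
    static final Predicate<IntBinaryOperator> WEAK_TEST =
            add -> add.applyAsInt(2, 0) == 2;

    // A stronger test: the expected value only holds for real addition.
    static final Predicate<IntBinaryOperator> STRONG_TEST =
            add -> add.applyAsInt(2, 3) == 5;

    // A test "kills" the mutant if it passes on the original and fails on the mutant.
    static boolean killsMutant(Predicate<IntBinaryOperator> test) {
        return test.test(ADD) && !test.test(MUTANT);
    }

    public static void main(String[] args) {
        System.out.println("weak test kills mutant:   " + killsMutant(WEAK_TEST));
        System.out.println("strong test kills mutant: " + killsMutant(STRONG_TEST));
    }
}
```

A test that stays green when the code is broken isn't marking a safe point at all, so a commit made on the strength of it is not a post you can climb back onto.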

This works hand-in-hand with working in short feedback loops, solving one problem at a time and testing continuously. The bigger the feedback loops, the more changes between safe points, the further apart the wooden posts get, and the bigger the drop if we fall.

And the bigger the sunk cost.

If I change one line of code and tests fail, it’s no big deal to figure out which change broke the code. I can usually see the problem and fix it quickly. If not, I can revert to the previous working commit and try again with very little time lost.

If I change 100 lines of code and tests fail… Well, now I have to figure out which of those 100 changes broke it, and if I can’t, that’s a lot of time lost with a reset. In this situation, we’ll naturally be unwilling to cut our losses.

LLMs can generate a lot of changes very quickly, and because they understand nothing, each change is significantly more likely to break the software.

And models can’t distinguish between working code and broken code. It’s all just context to a language model. Ideally, we don’t want the broken stuff figuring in its machinations, so it’s important to remove broken code as soon as it appears, so the model is building on solid ground whenever possible.

The easy way to do that is a hard reset back to the previous working commit. This usually means resetting the context, too. Otherwise, we can send the model into a “doom loop” where it keeps trying to fix the problem, but actually makes things worse with each attempt, contaminating the context for subsequent passes.

Some “AI” coding assistant users report success with a “three strikes and out” policy. If the tests fail, the model is given two more attempts to fix any problems, before a hard reset. But I’ve been finding that a “zero tolerance” approach works well for me. I revert the code, adapt the prompt – often looking for a smaller intermediate step – and ask the model to try again.
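The two policies above differ only in how many failing test runs are tolerated before reverting. A minimal sketch of that decision, assuming a hypothetical `decide` function and `strikes` parameter, with the git commands shown only as comments:

```java
class CommitOnGreen {
    enum Action { COMMIT, RETRY, REVERT }

    /**
     * testsPassed: did the whole suite pass after the latest change?
     * failedRuns:  consecutive failing runs since the last commit, including this one
     * strikes:     failing runs tolerated before reverting
     *              (1 = "zero tolerance", 3 = "three strikes and out")
     */
    static Action decide(boolean testsPassed, int failedRuns, int strikes) {
        if (testsPassed) {
            return Action.COMMIT;   // e.g. git add -A && git commit -m "..."
        }
        if (failedRuns >= strikes) {
            return Action.REVERT;   // e.g. git reset --hard HEAD
        }
        return Action.RETRY;        // let the model have another go at a fix
    }

    public static void main(String[] args) {
        System.out.println(decide(false, 1, 1)); // zero tolerance, first red run
        System.out.println(decide(false, 1, 3)); // three strikes, first red run
    }
}
```

Note that `git reset --hard` only restores tracked files. Anything new the model created stays on disk until it is removed, for example with `git clean -fd`.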

(And, yes, I do have a policy on how many attempts I’ll allow before I write the code myself. We’ll also be talking about when to grab the wheel in a future post.)

LLMs work better on a clean slate.

The AI-Ready Software Developer #6 – Continuous Refactoring

Finally, we get to the “R” word.

Our software works. We know, because we’ve been testing it continuously. And we’ve reviewed the code at every step, looking for areas that might need clarifying, looking for duplication that might need consolidating and abstracting, looking for modules that do or know too much, and/or are tightly coupled to other parts of the code, and looking for unnecessary complexity and redundancy that needs pruning.

Basically, looking for things that are going to make the code harder to change.

But when we find problems in the code, what are we gonna do about them?

Refactoring to the rescue! At least, if you know what the word means and know how to do it.

This is where some of the previous AI-ready dev practices conjoin. Refactoring, done safely, entails individual atomic rewrites – restructurings, if you like – of the code that bring it back to working.

Think of your source code as a database, and a refactoring – yes, it’s a noun, too – as a transaction against that database. It’s an all-or-nothing change that preserves the semantics of the data. The code means what it did before the transaction.

For example, if I want to rename a function, I can’t just change the name. I have to update all of the places in the code that the function’s called. That whole “Rename” refactoring has to complete for us to get back to code that works.

How do we know it works again? We run our fast automated tests. I knew they’d come in handy.

So, one atomic refactoring at a time – Rename a function, Extract a block of code into a new function, Move a function to a different module, and so on (refactorings are a bit like the moves of chess) – we restructure our code, testing after every refactoring to make sure we’re not wandering off the path of working, potentially shippable code into the deep dark forest of untested, un-shippable software. (You test before you ship, right?)
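To make one of those moves concrete, here is a minimal sketch of an Extract Function refactoring on an invented order-total calculation. The domain, the names and the 10% bulk discount are assumptions for illustration only. The before and after versions are kept side by side so the behaviour-preserving claim can be checked directly.

```java
class ExtractFunctionExample {
    // Before: the discount rule is buried inside the total calculation.
    static int totalBefore(int quantity, int unitPricePence) {
        int subtotal = quantity * unitPricePence;
        int discount = quantity >= 10 ? subtotal / 10 : 0;
        return subtotal - discount;
    }

    // After Extract Function: same rule, now with a name of its own.
    static int totalAfter(int quantity, int unitPricePence) {
        int subtotal = quantity * unitPricePence;
        return subtotal - bulkDiscount(quantity, subtotal);
    }

    static int bulkDiscount(int quantity, int subtotal) {
        return quantity >= 10 ? subtotal / 10 : 0;
    }

    // The "transaction" only counts as complete if behaviour is unchanged.
    static boolean behaviourPreserved() {
        for (int quantity = 0; quantity <= 20; quantity++) {
            for (int price = 0; price <= 1000; price += 111) {
                if (totalBefore(quantity, price) != totalAfter(quantity, price)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

In real code the old version wouldn't survive the refactoring, of course. The existing automated tests play the role of `behaviourPreserved()`.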

We solve one problem at a time. We test continuously. And, as we’ll explore in the next post, we bank every successful step using version control.

Refactoring by hand is a discipline that takes time to learn. IDEs that automate refactorings can help greatly in applying that discipline, which is why I remain a JetBrains user for any programming language their tools support, because they take refactoring more seriously than most.

Refactoring using LLMs, as has been the case with every other practice covered in this series, is really no different.

We get the best – least worst – results when we ask the model to perform one refactoring at a time.

We get better results when we clarify refactorings using examples (I’ve been growing a little library of Markdown files explaining the mechanics of each refactoring).

It’s essential that we test continuously while we’re refactoring.

And it’s essential to review the code again to see the effect of each refactoring.

And finally, if our code lacks reasonably good separation of concerns, even relatively simple restructurings are more likely to go bad – wider “blast radius”, bringing in too much context, throwing the model out of its training data distribution.

This is why it’s very important not to let the problems build up over hours or days (or weeks), thinking we can do a “big refactoring” later. (There’s no such thing as a “big refactoring”, BTW – that’s very much in the deep, dark forest.)

Never forget that LLMs are really good at generating code they’re really bad at modifying. Keep on top of every “code smell” the model spits out. Don’t let the cruft build up, because your LLM will very quickly be walking through a minefield of its own creation.

Personally, whenever possible I do refactoring using the automated tools built in to my IDEs. I’ll take predictable over powerful every time.

The AI-Ready Software Developer #5 – Continuous Inspection

So, we’re working in small steps, solving one problem at a time. We’re clarifying with examples to reduce the risk of models grabbing the wrong end of the prompt stick. We’re cleanly separating concerns to localise the “blast radius” of LLM-generated changes. And we’re continuously testing to get immediate feedback when the model breaks stuff.

Once we’re satisfied that the software’s still working, this might be a good opportunity to take a step back and examine the code that it generated or changed.

Most important, after making sure it works, is whether the code makes sense to us. Read it. (No, seriously, READ IT!) Can you understand what it’s doing? Can you understand why it’s doing it that way?

Code comprehensibility is a complex topic, since it’s a function not just of the code, but of what we can comprehend. (We’ll get to that in a later post.)

For now, suffice to say that every line of code that you don’t understand – or haven’t read – that makes it into the product adds something I’m calling “comprehension debt”. When you have to change code that you don’t understand, you’ll see what I mean.

So read the LLM’s code. Try to understand what it does and why. See if you can predict what it will do in specific test cases.

Another problem coding assistants are notorious for creating is duplication. They’re plagiarism machines – monkey see, monkey do. Bits of duplication here and there aren’t a problem. But the same code, or the same concept, repeated over and over definitely is. The Rule of Three can be helpful here.

And don’t forget that the real role of duplication in a design process is to signpost opportunities for reuse – to point us towards genuinely useful abstractions.

Of course, if you remove the duplication and it makes the code harder to understand, maybe put it back (or look for a better abstraction – one that “says what it does on the tin”).

Look, too, for problems with modular design. LLMs are really bad at modular design. Probably because their training data mostly consists of examples with low or no modularity, like Stack Overflow answers.

Yep. LLMs are really good at generating code that they’re really bad at modifying later.

Keep on top of the separation of concerns in your code, or they will lead you out into deep water and leave you to drown.

Look for modules that have multiple distinct reasons to change, and/or depend on too many other modules. Look for Feature Envy. And look for Primitive Obsession. Oh, boy, look for that!
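For anyone who hasn't met the term, Primitive Obsession means representing domain concepts with bare primitives and strings. A minimal sketch of what it looks like and one common remedy, using an invented `Money` type (a Java 16+ record, not something from this post):

```java
import java.math.BigDecimal;

class PrimitiveObsessionExample {
    // Primitive-obsessed: nothing stops a caller adding pounds to francs.
    static double addPrimitive(double amountA, String currencyA,
                               double amountB, String currencyB) {
        return amountA + amountB; // currencies are silently ignored
    }

    // A small value type makes the rule explicit and hard to bypass.
    record Money(BigDecimal amount, String currency) {
        Money plus(Money other) {
            if (!currency.equals(other.currency)) {
                throw new IllegalArgumentException(
                        "Cannot add " + currency + " to " + other.currency);
            }
            return new Money(amount.add(other.amount), currency);
        }
    }
}
```

The type also gives a reader, human or model, a name to pattern-match on, instead of a `double` that could mean anything.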

And, lest we forget, look for code that isn’t being used and isn’t needed. LLMs aren’t noted for sticking to the brief. They will generate code you didn’t ask for.

And for the many low-level issues models may introduce – unused imports, visibilities higher than needed, data that could be immutable, and all that malarkey – I’m in the habit of running a linter with every code review. They can scan large amounts of code for dozens of issues very quickly, exhaustively, and deterministically.

DO NOT ask the model to mark its own homework. It misses tons, and it cheats.

“But that sounds like a lot, Jason.”

Not really. If you’re taking small steps, the amount of new or changed code will be just a few lines. If it hurts, do it more often!

The AI-Ready Software Developer #4 – Continuous Testing

Now, where were we? Ah, yes.

So, we’re working in small steps with our LLM, solving one problem at a time, which makes it easier for the model to pay attention to important details (just like in real life).

We’re keeping our contexts small, and making them more specific by clarifying with examples to reduce the risk of misinterpretation (just like in real life).

And we’re cleanly separating the different concerns in our architecture to limit the “blast radius” when the model changes code, reducing the risk of boo-boos (just like in real life), and keeping diffs smaller. (More about that in a future post – for now, smaller diffs == gooderer).

When we apply all three of these practices together, it opens a door: we can test more often.

Those examples we used to clarify our requirements can become tests we can perform after the model has done that work to check that it did what we told it to.

We could perform these tests ourselves by running the software or by accessing the code directly at the command line in Read-Evaluate-Print loops (REPL). Or, if a UI is involved, we could run it and click the buttons ourselves.

I highly recommend seeing it work with your own eyes at least once. Trust no one, Agent Mulder!

But what about code that was working that the model has since changed? As the software grows, manually retreading dozens, hundreds, maybe thousands of tests to make sure we’re obeying Gorman’s First Law of Software Development:

“Thou shalt not break shit that was working”

– is going to take a lot of time. Every time we add a new test – one problem at a time, remember? – we have to repeat the existing ones, so getting to n tests means 1 + 2 + … + n = n(n + 1)/2 test runs. Our development process becomes O(n²) complex, where n is the number of tests.

Automation to the rescue! If we find ourselves performing the same test over and over, we can write code to perform it for us. Or we can get the LLM to write it for us (be careful here – triple-check every test the model writes! Been burned by that multiple times.)

And this is where clean separation of concerns turns into a superhero.

If the code that, say, calculates mortgage repayments is buried inside the module that generates the Repayments web page, and which also directly accesses an external web service to get interest rates, then you’ll have little choice but to test through the browser (or something pretending to be the browser).

But if there’s a separate MortgageCalculator module that does this work, and that module is decoupled from the code that fetches interest rates, a test can be automated directly against it that will run very reliably and very fast – milliseconds instead of seconds. Thousands of those kinds of “unit” tests could run in seconds instead of hours.
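A minimal sketch of that kind of decoupling. `MortgageCalculator` is the module named above; everything else is an assumption for illustration: the `InterestRateSource` interface, a repayment mortgage, and the standard level-payment formula.

```java
class MortgageExample {
    // Hypothetical seam: anything that can supply an annual rate, e.g. 0.06 for 6%.
    interface InterestRateSource {
        double annualRate();
    }

    static class MortgageCalculator {
        private final InterestRateSource rates;

        MortgageCalculator(InterestRateSource rates) {
            this.rates = rates;
        }

        // Level-payment formula: P * r / (1 - (1 + r)^-n), with monthly rate r.
        double monthlyRepayment(double principal, int months) {
            double r = rates.annualRate() / 12.0;
            if (r == 0.0) {
                return principal / months; // interest-free edge case
            }
            return principal * r / (1.0 - Math.pow(1.0 + r, -months));
        }
    }

    public static void main(String[] args) {
        // In a test, a lambda stands in for the external web service.
        MortgageCalculator calculator = new MortgageCalculator(() -> 0.06);
        System.out.printf("%.2f%n", calculator.monthlyRepayment(100_000, 360));
    }
}
```

Because the rate source is passed in, a test never touches the network or the browser, which is what makes running thousands of these in seconds plausible.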

Which means comprehensively retesting your software after every small step, giving you an instant heads-up if the LLM broke anything (AND IT WILL), becomes completely practical.

Once again, you won’t be surprised to learn that this is very good news whether we’re using “AI” or not. Many teams consider it essential.

The AI-Ready Software Developer #3 – One Problem At A Time

Calling back to my analogy of LLM context being like “cognitive load”, it’s now well understood why more context isn’t necessarily a good thing.

Additional context that clarifies (e.g., with examples) can help reduce misinterpretations by models. But we must bear in mind that the effective maximum context sizes LLMs can handle – even the hyper-scale “frontier” models – are orders of magnitude smaller than advertised. In studies, model accuracy was seen to degrade rapidly with contexts of as few as 100 tokens.

The quadratic time complexity and memory overhead of self-attention, coupled with very real effects like attention dilution and the now infamous “lost in the middle” phenomenon, means that the more you ask an LLM to focus on, the less it can focus on any of it. (Here’s a great blog post that explains these things very nicely.)

Perhaps most damaging is the tendency for larger contexts to go outside the model’s training data distribution. Beans on toast is one thing. A 5-course banquet is quite another, and coding LLMs were trained on a power-law distribution of mostly beans on toast examples (e.g., Stack Overflow answers, tiny GitHub repos) and vanishingly few 5-course banquets. When our inputs take LLMs outside their “comfort zones” like this, predictions happen with very low confidence. Basically, the model has to “guess”.

Imagine it this way: on the TV game show Who Wants To Be A Millionaire, when a contestant gets stuck on a question, they can Ask The Audience. Audience members vote for which of the four possible answers they think is right.

If the question is “What’s the capital city of France?”, the statistical distribution of audience votes might go:

A. Berlin – 4%
B. Paris – 85%
C. Rennes – 8%
D. Brussels – 3%

The contestant can answer “Paris” with high confidence. This is a low entropy prediction.

But if the question is “Which microscopic mechanism gives rise to superconductivity in semiconductors?”, there’s a high probability that almost nobody in the audience knows. The question goes outside the audience’s training distribution (quite literally).

So they have to guess, and the distribution might look more like 23%, 25%, 26%, 24%. This is a high entropy, low confidence prediction. Our answer is much more likely to be wrong.
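The “low entropy” and “high entropy” labels can be checked with a few lines of arithmetic. This sketch computes Shannon entropy in bits for the two vote distributions above; the class and method names are invented.

```java
class AudienceEntropy {
    // Shannon entropy in bits: -sum(p * log2(p)) over the vote shares.
    static double entropyBits(double... shares) {
        double bits = 0.0;
        for (double p : shares) {
            if (p > 0.0) { // a share of zero contributes nothing
                bits -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.printf("capital of France: %.2f bits%n",
                entropyBits(0.04, 0.85, 0.08, 0.03));
        System.out.printf("superconductivity: %.2f bits%n",
                entropyBits(0.23, 0.25, 0.26, 0.24));
    }
}
```

The Paris vote comes out at roughly 0.83 bits and the near-uniform guess at roughly 1.99 bits, just short of the 2-bit maximum for four equally likely answers.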

This effect is very visible in niche problem domains like semiconductor physics, or rarely-used programming languages that are underrepresented in the training data. But it also happens when we ask the model to pay attention to more and more things.

It’s harder to see in complex problems like code and poetry, but becomes very visible in simple, predictable problems, like chess.

With each new move, the probability of a match to the corpus of game transcripts the model was trained on reduces by an order of magnitude – games become less and less probable the more they go on, and model predictions become less and less confident.

This “probability collapse” makes asking the model to pay attention to many things in a single context very problematic. And it’s the reason why large, general context files containing dozens of rules and “guardrails” (e.g., CLAUDE.md) demonstrably don’t work much of the time. (Though whether many of us notice that is another question.)

So, while we seek to make our inputs more specific (e.g., clarifying with examples), we also seek to make them as small as we can. Modularity helps in that respect, and so too does asking for less in each interaction. Ideally, from my own experiments, solving one problem with each prompt.

And – quelle coïncidence! – this works with humans, too! When we try to spin many plates at the same time, the typical result is broken plates.

Micro-iterative practices like TDD and refactoring enable us to make forward progress not in risky leaps, but safely putting one foot in front of the other. The net effect is very similar – fewer broken plates and less time wasted sweeping up.

In software development team productivity, batch size is one of the most powerful levers we have. And, unsurprisingly, if you want to reduce delays at the checkout, bigger baskets aren’t the answer.

It also gives us many more opportunities for steering as we go.

The AI-Ready Software Developer #2 – Clarifying With Examples

In communication studies at school, we were taught a simple way to gauge how well we’d understood what someone had told us: reflect it back with an example.

“So what you’re saying is that if, for example, I had a pension pot of £250,000, I could take £62,500 tax-free and invest the rest in an annuity that pays £10,500 a year?”

If we’ve misunderstood, the numbers won’t add up – so to speak – and the other person can more easily see the discrepancy.

This can be a very powerful way to test our understanding of software requirements. By agreeing examples with our customers, we can pin down a more precise meaning and catch discrepancies before they leak into the code.

When the customer says that “tax should be added to the order total”, we can clarify exactly what they mean by that by exploring real-world examples.

A customer in the UK, say, where sales tax is 20%, who places an order with a net total of £100.00, will pay tax of £20.00, bringing the gross total to £120.00.

A customer in Switzerland, where sales tax is 8.1%, on an order with net total 100.00 CHF, would pay a gross total of 108.10 CHF.

We clarify with these examples that we’re talking about sales tax (not income tax or property stamp duty or window tax), and that it’s applied in the country where the order’s being placed. Those details could easily have gotten lost in translation had we not agreed these examples.
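The two agreed examples translate almost word for word into executable checks. A minimal sketch, assuming a hypothetical `SalesTaxExamples` class, a hard-coded rate table and half-up rounding to two decimal places (none of which the customer has actually specified):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Map;

class SalesTaxExamples {
    // Rates taken from the two examples; a real system would look these up.
    // Unknown countries are not handled in this sketch.
    static final Map<String, BigDecimal> SALES_TAX_RATE = Map.of(
            "UK", new BigDecimal("0.20"),
            "Switzerland", new BigDecimal("0.081"));

    static BigDecimal salesTax(String country, BigDecimal netTotal) {
        return netTotal.multiply(SALES_TAX_RATE.get(country))
                .setScale(2, RoundingMode.HALF_UP);
    }

    static BigDecimal grossTotal(String country, BigDecimal netTotal) {
        return netTotal.add(salesTax(country, netTotal));
    }

    public static void main(String[] args) {
        BigDecimal net = new BigDecimal("100.00");
        System.out.println(grossTotal("UK", net));          // 120.00
        System.out.println(grossTotal("Switzerland", net)); // 108.10
    }
}
```

The identifiers deliberately reuse the words from the conversation (country, net total, sales tax, gross total), so the code and the examples describe the problem in the same terms.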

The other thing we’ve done here is flesh out a shared vocabulary for describing this problem: customer, country, order, net total, sales tax, gross total and so on.

Establishing a shared vocabulary to describe the problems we want to solve, and using it consistently in all our communications – including code – is very valuable whether we’re talking to a customer, talking to a team mate, talking to a compiler, or indeed, talking to a language model. When we use different terms to describe the same concept, that’s when misunderstandings become much more likely.

For now, though, suffice to say that when we go away and write the code for this piece of functionality, there’s a much lower probability of us misunderstanding what our customer meant.

And – would you believe it? – it turns out this works with LLMs, too!

Large Language Models match patterns – relationships between tokens – in our input to patterns in the model so they can predict what comes next. Semantic ambiguity’s a real problem here. The exact same input could potentially be interpreted in multiple different ways.

But when it comes to generating code, ambiguity’s highly undesirable. Our job as programmers is to produce a single interpretation, expressed in machine-executable code. Ideally, the one the customer meant.

Now, here’s a funny thing: it turns out that the code examples LLMs like Claude Opus and GPT-5 are trained on are often paired with usage examples in some form. These could be in docstrings, in main() methods, or actual unit tests.

Providing a usage example with our prompt therefore helps to narrow the search, significantly reducing the range of possible interpretations and increasing the confidence of the model’s predictions.

From personal experience, and watching many teams, I’ve seen it make quite a big difference to the usefulness of LLM-generated code.

The AI-Ready Software Developer #1 – Separation of Concerns

Can we talk about separation of concerns and cognitive load?

One thing about LLM coding assistants that’s very interesting is how they tend to crap out on code that has poor separation of concerns.

Despite some pretty darn big advertised maximum context sizes (e.g., GPT-5 has 400K tokens), the effective maximum context size – beyond which accuracy degrades rapidly, and downstream rework multiplies – is orders of magnitude smaller. Studies have found it to be in the order of 100 – 1000 tokens, even for the hyperscale “frontier” models.

Coding assistants like Claude Code and Codex use static dependency analysis to determine what source files need to be added to the context for a particular task.

If you ask it to make a change to a 1,000-line source file that has 20 direct dependencies on other source files, that’s a lot of context.

If you ask it to make a change to a 100-line source file with 2 direct dependencies on other files, the context is much smaller. Changes with a smaller “blast radius” are less likely to go wrong.

I think of LLM context as being analogous to cognitive load: in order to understand Module A, what else do I need to understand?

Higher modularity tends to reduce cognitive load when it’s done effectively. If I can understand the contents of Module A, I shouldn’t need to understand the contents of any of its dependencies. To reverse an old marketing slogan, each dependency “Says what it does on the tin”, so to speak.

A useful test of code comprehensibility is to ask people to predict what a method or function will do in a specific test case without letting them see the implementations of any other methods or functions it uses. What these dependencies are doing should be obvious from the outside, and the details of how they do it should be irrelevant.

And it turns out that’s good advice when working with LLMs, too. Signposting dependencies clearly (e.g., with intent-revealing names and/or type information) helps models pattern-match more accurately – they don’t need to “guess”. And from experiment I’ve seen it reduce context size – “cognitive load” – on many occasions, producing fewer “hallucinations” in the output.

In languages like C# and Java, we don’t get much of a choice over whether we provide type information (though watch out for those implied types!)

In dynamically-typed languages, I’ve found I need to be more careful. If, for example, a dependency name doesn’t correspond to its type in my Python code, I’ll add a type hint to give the model more to go on.
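The same idea in a statically-typed setting, as a small sketch. It assumes “implied types” refers to local type inference such as Java's `var`, and the `RateFeed` and `InterestRate` names are invented.

```java
class SignpostingExample {
    record InterestRate(double annualPercent) {}

    interface RateFeed {
        InterestRate latest();
    }

    static double monthlyFraction(RateFeed feed) {
        // var x = feed.latest();
        // With var, the type is inferred: correct, but a reader (or a model)
        // has to go and look up latest() to know what x is.

        // Explicit type plus intent-revealing name: nothing to guess.
        InterestRate rate = feed.latest();
        return rate.annualPercent() / 100.0 / 12.0;
    }
}
```

The Python equivalent of the explicit declaration would be a type hint on the parameter or variable, as described above.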

One final thought: LLMs are famously good at generating code they’re bad at modifying. I routinely see larger source files with lots of dependencies being spat out by Claude, GPT-5, Llama etc. So you need to keep on top of your modular design.

(I see some folks posting that they get the model to review and break modules down once a day. IME, generated code can be tripping the model up before lunch, so I’d recommend refactoring more often than that. Indeed, I recommend refactoring continuously.)

Wax On, Wax Off

If you’ve seen the original Karate Kid movie, you’ll remember how Mr Miyagi starts Daniel LaRusso’s training by getting him to wax his car over and over again.

“Wax on, right hand. Wax off, left hand. Wax on, wax off. Breathe in through nose, out the mouth. Don’t forget to breathe, very important.”

LaRusso can’t see what waxing cars has to do with karate.

It’s not until he begins his formal training that he realises that his repeated “wax on, wax off” motion has been building the kind of muscle memory he’ll need to be effective at karate.

As I update Codemanship training for 2026, there’s going to be a significant element of “wax on, wax off”, and no doubt many will be asking “What has this got to do with AI-assisted coding?”

It turns out, everything.

While many dev teams are using “AI” coding assistants these days, it’s clear that only a minority are effective with them. For the large majority, these tools actually slow them down.

It’s also becoming very clear that the few who are effective using them are only so for reasons that really have nothing to do with “AI”.

It turns out that the principles and practices that have served teams well for thirty or more years are exactly the same principles and practices that are required to stop a code-generating firehose overwhelming your development processes and making the bottlenecks worse.

The key to being effective with “AI” coding assistants is being effective without them.

Wax on, wax off.

Is Your Development Team “AI-Ready”?

From the 2025 DORA State of AI-Assisted Software Development report:

“In 2025, the central question for technology leaders is no longer if they should adopt AI, but how to realize its value. DORA’s research includes more than 100 hours of qualitative data and survey responses from nearly 5,000 technology professionals from around the world. The research reveals a critical truth: AI’s primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.

The greatest returns on AI investment come not from the tools themselves, but from a strategic focus on the underlying organizational system: the quality of the internal platform, the clarity of workflows, and the alignment of teams. Without this foundation, AI creates localized pockets of productivity that are often lost to downstream chaos.”

It’s worth bearing in mind, too, that DORA respondents are a self-selecting group. The real percentage of low-performing teams, outside of that bubble, for whom “AI” coding assistants have a negative impact on software delivery is likely much higher.

But the message is clear: attach a code-generating firehose to a dysfunctional development process with blockers, bottlenecks and quality leaks, and you’ll make things WORSE.

So, we can theorise that there’s such a thing as an “AI-ready” software development team, and that being “AI-ready” has got very little to do with AI, and almost everything to do with the team and the way they already work:

Small batch sizes
Rapid iterations & continuous feedback at all levels
Low/no ambiguity specifications
Continuous design & architecture
Highly modular code
Continuous testing
Continuous code review
Continuous refactoring
Continuous integration
High levels of automation (otherwise “continuous” is unworkable)
High autonomy
Cohesive cross-functional team make-up

None of this comes in the monthly Claude Code plan. You can’t buy it. You can’t install it.

I know from helping many organisations build these capabilities that it takes a significant ongoing (never-ending) investment in people, in teams (the real product), in skills (it’s mostly a skills thing), in tooling and in automation. Typically about 20-25% of your entire development budget.

And that’s why most organisations won’t do it.

But enjoy the code-generating firehose 🙂

TDD Under The Microscope #3 – One Outcome Per Test

Continuing my series of posts attempting to (pseudo-)formalise the practice of Test-Driven Development, I want to look at a “rule” of TDD that’s actually helpful whether you’re going test-first or test-after.

It goes by several names, including “One Reason To Fail” and “One Question Per Test”. In Kent Beck’s Test Desiderata it’s listed as “Specific”:

If a test fails, the cause of the failure should be obvious.

From a testing perspective, this is very good advice. A test should be about one thing. If a test’s about many things, then when it fails we may well end up in the debugger trying to figure out which of them has gone wrong.

A good unit test pinpoints that immediately.

In practice, this usually means that a test only asserts one thing. A test like:

@Test
void buyingCd(){
   CD cd = new CD("Thriller", "Michael Jackson", 9.99, 10);
   Card card = mock(Card.class);
   
   cd.buy(2, card);

   assertEquals(8, cd.getStock());
   verify(card).charge(19.98);
}

is really about two outcomes. To make it specific, this could be two tests:

@Test
void whenCdIsBoughtStockIsAdjusted(){
   CD cd = new CD("Thriller", "Michael Jackson", 9.99, 10);
   Card card = mock(Card.class);
   
   cd.buy(2, card);

   assertEquals(8, cd.getStock());
}

@Test
void whenCdIsBoughtCardIsCharged(){
   CD cd = new CD("Thriller", "Michael Jackson", 9.99, 10);
   Card card = mock(Card.class);
   
   cd.buy(2, card);

   verify(card).charge(19.98);
}

Each test is about a specific behaviour – a specific combination of a set-up, an action and one outcome.

In a test-driven approach to development, we aim to flesh out our software one feature at a time, one scenario (set-up + action) at a time, and one outcome at a time.

So the other major benefit of our tests being about one outcome is that it helps us to work in these micro feedback loops, effectively putting one foot in front of the other.

Formalising this declaratively is actually quite simple. When an assertion is added to the syntax tree inside the body of a test, there must not already be any assertions.

In pseudo-code:

test.assertions.count == 0
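As a rough sketch of how that precondition might be enforced, here is a toy model in Java. `TestNode` is a hypothetical, much-simplified stand-in for a test method's syntax tree, not the API of any real linter or parser.

```java
import java.util.ArrayList;
import java.util.List;

class OneOutcomeRule {
    // Hypothetical stand-in for the syntax tree of one test method.
    static class TestNode {
        final String name;
        final List<String> assertions = new ArrayList<>();

        TestNode(String name) {
            this.name = name;
        }
    }

    // The rule from the pseudo-code: test.assertions.count == 0
    static boolean mayAddAssertion(TestNode test) {
        return test.assertions.isEmpty();
    }

    // Returns true if the assertion was accepted, false if it would be flagged.
    static boolean addAssertion(TestNode test, String assertion) {
        if (!mayAddAssertion(test)) {
            return false;
        }
        test.assertions.add(assertion);
        return true;
    }
}
```

Applied to the first `buyingCd` test above, the `assertEquals` would be accepted and the `verify` would be flagged, which is the nudge towards splitting it in two.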

When we combine this rule with the others so far:

We lock programmers into a workflow where the only way to make solution code happen is to have it in the call stack of a test, to write the test assertion first and work backwards – because “usage-driven” – and to assert only one thing in each test.

My vision of a real-time “workflow linter” would be triggered by syntax tree events and, like a traditional linter, output an error or a warning (I haven’t decided on the details yet) if the programmer:

Declares something that isn’t needed for a test yet
Writes anything other than the test assertion first
Writes more than one assertion in the same test

I’d say it’s already starting to look pretty test-driven. Next up will be a very important habit in TDD: See The Test Fail.