The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that – at the very latest – catch the boo-boos before they get deployed. Ideally, we catch the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far after a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.
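As a toy illustration – with a hypothetical `add_vat` function and a hand-rolled mutant rather than anything tool-generated – here’s a mutant surviving a weak test and being killed by a stronger one. Note that the weak test still executes every line, i.e., it gives 100% line coverage:

```python
def add_vat(net):
    """Original code: add 20% VAT to a net price."""
    return net + net * 0.20

def add_vat_mutant(net):
    """Mutant: the + has been flipped to a -."""
    return net - net * 0.20

def weak_test(fn):
    # Executes every line (full coverage!) but pins nothing down,
    # because 0 + 0 and 0 - 0 are indistinguishable.
    return fn(0) == 0

def strong_test(fn):
    # Pins down the actual behaviour.
    return fn(100) == 120.0

assert weak_test(add_vat) and weak_test(add_vat_mutant)          # mutant survives
assert strong_test(add_vat) and not strong_test(add_vat_mutant)  # mutant killed
```

The surviving mutant is the gap: the weak test would wave the burglar straight past.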

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.
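As a back-of-envelope sketch of that “mutation coverage” number (the counts here are invented; most tools treat a timed-out mutant as killed):

```python
# Toy mutation score calculation -- invented numbers, not real tool output.
results = {"killed": 42, "survived": 6, "timed_out": 2}

# A timed-out mutant counts as killed: the test suite (eventually) noticed.
killed = results["killed"] + results["timed_out"]
total = sum(results.values())

score = 100 * killed / total
print(f"mutation coverage: {score:.0f}%  ({killed} of {total} mutants killed)")
```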

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can run incrementally – only mutating code that’s changed – so you don’t need to run them again and again against code that isn’t changing. That makes mutation testing practical within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to catching issues of maintainability, security and so on is code review. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

  1. It catches problems straight away, before I start pouring more cement on them
  2. It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
  3. It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.
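A minimal sketch of one such rule, using Python’s standard-library `ast` module. The threshold and the sample code are made up for illustration; real linters apply dozens of rules in one walk:

```python
import ast

MAX_PARAMS = 3  # arbitrary threshold for this sketch

source = """
def ok(a, b):
    pass

def too_many(a, b, c, d, e):
    pass
"""

def check_param_counts(code, max_params=MAX_PARAMS):
    """Walk the AST and flag any function with too many parameters."""
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            count = len(node.args.args)
            if count > max_params:
                issues.append(
                    f"line {node.lineno}: {node.name}() has "
                    f"{count} parameters (max {max_params})"
                )
    return issues

print(check_param_counts(source))
```

The rule runs in milliseconds, which is what makes checking it on every change practical.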

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique; commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!
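Here’s a toy version of that experiment – plant the unused import, then check the detector actually spots it. This detector is deliberately naive (real linters also handle `from` imports, aliases and plenty of edge cases), but it illustrates testing the gate itself:

```python
import ast

clean = "import os\nprint(os.getcwd())\n"
refuctored = "import json\n" + clean  # the planted crime: json is never used

def unused_imports(code):
    """Naive unused-import detector: imported names minus referenced names."""
    tree = ast.parse(code)
    imported = {
        alias.asname or alias.name.split(".")[0]
        for node in ast.walk(tree) if isinstance(node, ast.Import)
        for alias in node.names
    }
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return imported - used

assert unused_imports(clean) == set()          # no crime, no alarm
assert unused_imports(refuctored) == {"json"}  # crime detected
```

A deterministic check like this catches the planted crime every single time – which is exactly what an LLM reviewer couldn’t promise.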

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another on the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.

As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with their customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for those practices in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Rely On AI And Get Left Behind

LLM-based code generation comes with multiple potential gotchas for a business.

More code hitting process bottlenecks like testing, review and integration makes most teams slower.

More problems being generated by text prediction engines that simply don’t understand what the requirements or the code means makes releases less stable.

These are effects we’re seeing today. Most dev teams are shipping less reliable software, and shipping it later thanks to “AI”. Good work, everyone!

But I suspect the biggest risk is creeping up on us without us necessarily realising.

The more teams offload their thinking to these tools, the less they understand the code their business relies on. And the less they understand it, the less able they are to fix the problems that “AI” can’t fix. Outages last longer, and fixes are less sticky.

Mean Time To Failure gets shorter, and Mean Time To Recovery gets longer. More fires, taking longer to put out. Just ask the folks at AWS!

“AI”-assisted programming’s a bit like pedal-assist on electric bicycles. It makes progress feel easy, but we might not realise the impact that’s having on our coding “muscles” – our ability to comprehend and reason about code.

That is, until we run out of juice and have to pedal unaided. That’s when it becomes obvious just how much of our Code Fu we’ve lost as we’ve come to rely on that assistance more and more. Increasingly, I hear developers say “I’ve hit my token limit, so I’m blocked.”

Working in the low-gravity environment of “AI” coding, we may not notice the decline in our own abilities. And the more they decline, the less able we are to notice – until we find ourselves back in 1g and are suddenly unable to walk.

I’m seeing teams where most of the developers now have a hard time working on their code unaided by “AI”. It’s taking them much longer to wrap their heads around it, and they’re missing more and more problems.

Shipping code faster than we can understand it creates comprehension debt – extra time needed to build that understanding when the tool can’t do what we need it to do (and the need for that is very probably never going away).

Relying on LLMs to write the code for us erodes our ability to understand code, full stop. It’s a double-whammy.

There’s little doubting now that the devs who are most effective using “AI” are the ones who can work just fine without it.

Some say “Use AI or get left behind”. The reality is more probably “Rely on AI and get left behind”. Developers need to maintain their cognitive edge, which means keeping their hand in with regular unaided working.

There’s every chance that, in the future, those developers will be the most in-demand.

If developers don’t maintain that edge, it’s going to add a lot of risk for businesses – not least because the vendors will have them over a barrel. Price plans are heavily subsidised today. That can’t last, and if you’re not able to walk away, then you have no negotiating position.

That more businesses don’t see becoming completely reliant on a third-party as a strategic risk is baffling, especially in such a volatile market as “AI”. It could all implode at any moment.

But the real risk is that, one day, your core product or system will break, and there’ll be nobody capable of fixing it.

Well, nobody you could afford.

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my times as developer rather than architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modelling of architecture quite deeply, reverse-engineering code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I do not find purely structural descriptions very helpful in understanding what code does. It only describes what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long story short: it turned out they didn’t. Not a single end-to-end use case scenario worked.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, a dynamic model driven by usage examples is preferable in aiding my comprehension of architecture to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This can serve a dual purpose, both documenting scenarios as well as enabling us to see explicitly if the software satisfies them.

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.
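As a sketch of what that looks like in practice – the `Order` class and the scenario are hypothetical, echoing the commit-message example later in this piece – a test can read as a named, executable usage scenario:

```python
# Specification by example: the test documents a usage scenario end-to-end.

class Order:
    """Hypothetical domain class, for illustration only."""

    def __init__(self):
        self.items = []
        self.confirmed = False

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)

    def confirm(self):
        if not self.items:
            raise ValueError("cannot confirm an empty order")
        self.confirmed = True

def test_customer_confirms_an_order():
    # Scenario: a customer adds two items, then confirms the order.
    order = Order()
    order.add("coffee", 3.50)
    order.add("cake", 4.00)
    order.confirm()
    assert order.confirmed
    assert order.total() == 7.50

test_customer_confirms_an_order()
```

The test name and body together tell a newcomer what the system is for, not just what it is – and the assertion tells them whether that story is still true.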

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.
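For a taste of what such a tool might do under the hood, here’s a crude sketch using Python’s `sys.settrace` to record which functions collaborate while a scenario runs. The “components” are hypothetical stand-ins; a real tool would draw this as a sequence diagram:

```python
import sys

calls = []

def tracer(frame, event, arg):
    # Record every function call made while tracing is switched on.
    if event == "call":
        calls.append(frame.f_code.co_name)
    return tracer

# Hypothetical components standing in for a real architecture.
def fetch_price(item):
    return {"coffee": 3.50}[item]

def checkout(item):
    return fetch_price(item)

sys.settrace(tracer)
checkout("coffee")       # run one usage scenario...
sys.settrace(None)

print(" -> ".join(calls))  # ...and out falls the call chain for that scenario
```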

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one application of machine learning that I haven’t seen explored in software development yet is mining the rich stream of IDE, source-file and code-level events we give off as we work, to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

Writing Code May Be Dead (Not Really), But Reading Code Will Live On

“The age of the syntax writer is over”, hailed a post on LinkedIn. (Why is it always LinkedIn?)

I say: welcome to the age of the syntax READER!

If I presented you with a bowl of candy and told you only 1% had broken glass in them, what percentage would you check before giving them to your kids?

The fact is that a 1% error rate would be an order-of-magnitude improvement in the reliability of LLM-generated code, and one we shouldn’t expect for a very long time – if ever – as scaling hits the wall of diminishing returns at exponentially increasing cost.

The need to check generated code isn’t going to go away, and therefore neither is the need to understand it.

And, given the error rates involved, we may need to understand all of it. At least, anyone who doesn’t want to be handing candy-covered shards to their users will.

“But Jason, we don’t need to check machine code or assembly language generated by compilers.”

That’s a category mistake. When was the last time a compiler misinterpreted your source code, or hallucinated output? Compiler boo-boos are very rare.

If my compiler randomly misinterpreted my source code just 1% of the time, then, yes, I’d check all of the generated machine code for anything that was going to be put in the hands of significant numbers of users. And that means I’d need to be able to understand that machine code.

Now for the fun part.

Decades of studies into program comprehension show that one of the best predictors of a person’s ability to understand code is how much experience they have writing it.

Cognitively, we don’t engage with syntax and semantics more deeply than when we’re writing the code ourselves. (Which is why I strongly discourage students from copying and pasting – it stunts their growth.)

In my blog series The AI-Ready Software Developer, I talk about how a mountain of “comprehension debt” can rapidly accumulate as “AI” produces code far faster than we can wrap our heads around it. I call this “LGTM-speed”.

I recommend “staying sharp” where code comprehension is concerned, as well as slowing down code generation to the speed of comprehension. When you’re drinking from a firehose, the limit isn’t the firehose.

This implies that we’re not limited by how many tokens/second “AI” coding assistants can predict, but by how many tokens/second we can understand. That’s the thing we need to optimise.

The best way – the only way, really – to maintain good code comprehension is to write code regularly.

We need to keep our hand in to make sure we don’t get caught in a trap where comprehension debt balloons as our ability to comprehend withers.

That leads to a situation where serious, show-stopping – potentially business-stopping – errors become much more likely to make it into releases, and there’s nobody on the team who can fix them.

And in that scenario, who are they gonna call? The “AI-native developer” who boasts they haven’t written code in months, or the developer who was debugging that kind of code just this morning?

Clean Contexts

You’ve probably heard of “clean code” (and the “clean coder”, and “clean architecture”, and other things Bob Martin has added the word “clean” in front of to get another book out of it).

In this dawning age of “AI”-assisted software development, I’d like to propose clean contexts.

What is a “clean context”? Well, I’m glad you asked.

A clean context:

  • Addresses one problem – one failing test, one code quality rule, one refactoring etc.
  • Is small enough to stay inside the model’s effective context limit – which is going to be orders of magnitude smaller than the advertised maximum context
  • Uses clear and consistent shared language – if you’ve been calling it “sales tax”, don’t suddenly start calling it “VAT”
  • Clarifies with examples that can be used as success criteria (i.e., tests) – The code samples used in training were paired with usage examples, so it improves matching
  • Only contains information pertinent to the task – don’t divert the model’s attention (literally)
  • Only contains accurate information (“ground truth”) – the code and the architecture as it is now (not a bunch of changes back when you asked the tool to summarise it), the test failure message, the mutation testing results and so on. Ground your interactions in reality.
  • Only contains working code – if the model breaks the code, don’t feed it back to it. It can’t tell broken code from working code, and you’ll pollute the context. Revert and try again. The exception to this is bug fixes, of course. But if the model introduced the bug – git reset --hard
  • Contains code that doesn’t go outside the model’s data distribution – LLMs famously choke on code that lacks clarity, is overly complex and lacks separation of concerns because it’s far outside the distribution of examples they were trained on. When it comes to gnarly legacy code, I’ve had more success breaking it down myself initially before letting Claude loose on it. Y’know, like how an adult bird chews the food first before feeding it to its chicks.
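To make the checklist concrete, here’s a minimal sketch of assembling a fresh, task-specific context for a single interaction. Everything here is illustrative – the field names and the “context builder” function are mine, not any real tool’s API – but the structure is the point: one problem, ground truth, pertinent code, examples.

```python
# Sketch only: "one problem, ground truth, examples" is the idea,
# not these particular names.
def build_clean_context(failing_test: str, relevant_code: str, examples: str) -> str:
    # One problem, stated once, in clear and consistent language.
    return "\n\n".join([
        "Solve exactly one problem: make this failing test pass.",
        f"Ground truth - the actual test failure:\n{failing_test}",
        f"Only the code pertinent to this task:\n{relevant_code}",
        f"Examples that define success:\n{examples}",
    ])
```

Note what’s *not* in there: earlier conversation turns, broken intermediate versions of the code, or stale summaries of the architecture.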

And remember that a prompt isn’t the entire context. Claude Code and Cursor will use static analysis to determine what source code needs to be added. Context files may be added (e.g., CLAUDE.md). And of course, everything in the conversation – your (or your agent’s) prompts and the model’s responses – is part of the context. When an LLM “hallucinates”, that becomes part of the context, and the model has no way of telling fact from its own fiction. It’s all just context to a language model.

This is why I purge and then construct a new, task-specific context with each interaction. Many users are reporting how much more accurate LLMs tend to be with a fresh context.

Our goal with a clean context is to minimise ambiguity and the risk of misinterpretation, to minimise attention dilution and context drift, context pollution and context “rot”, and as much as possible, stay within the LLM’s training data distribution.

Basically, we’re aiming to maximise the chances of an accurate prediction from the LLM, and spend less time cleaning up mistakes and digging the tool out of “doom loops”.

Importantly, working in small steps – solving one problem at a time – opens up many more opportunities after each step to get feedback from testing, code review and merging, so clean contexts are highly compatible with much more iterative approaches.

Just as Continuous Delivery enables us to make progress by putting one foot in front of the other, ensuring a working product after every step, we also aim to start every step with a clean context that significantly reduces the risk of a stumble.

Yes, Maintainability Still Matters in “AI”-assisted Coding

A couple of people have asked, in relation to my 2-day Software Design Principles training course, whether maintainability matters anymore.

Perhaps they’ve read some of the wrong-headed posts here about why LLM-generated code doesn’t need to be understandable or maintainable by humans.

Putting aside the undeniable fact that these tools are nowhere near that reliable, in reality, code maintainability matters just as much – if not more – when LLMs are working with it.

First, and hopefully you’ve figured this out by now, “AI”-assisted programming without a good suite of fast-running regression tests is very, very risky. Fast tests have such a huge impact on the cost and the risk of changing code that Michael Feathers defines “legacy code” as code that lacks them.

More teams are discovering that they need to be constantly assessing the “strength” of the automated tests their “AI” assistant generates – they’re notorious for weak tests, and for cheating to get tests passing.

I highly recommend regular mutation testing to check for gaps in your test suites.
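Here’s a hand-rolled illustration of the idea (tools like PIT and Stryker automate this across a whole codebase – the function below is a made-up example): a “mutant” copy of the code with one mutation operator applied, a weak test that lets it survive, and a stronger test that kills it.

```python
# A hypothetical function under test, plus a "mutant" copy with one
# mutation operator applied (* became +).
def subtotal(price, quantity):
    return price * quantity

def subtotal_mutant(price, quantity):
    return price + quantity  # the mutation

# A weak test: 2 * 2 == 2 + 2, so it passes for BOTH versions.
# The mutant survives - this logic isn't really being tested.
def weak_test(fn):
    return fn(2, 2) == 4

# A stronger test: it fails for the mutant - the mutant is killed.
def strong_test(fn):
    return fn(3, 5) == 15

assert weak_test(subtotal) and weak_test(subtotal_mutant)          # survived
assert strong_test(subtotal) and not strong_test(subtotal_mutant)  # killed
```

A surviving mutant is the signal to write a new or better test before moving on.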

Clarity matters, because… well… language models. If I’m asking Claude to add a premium tier to video rentals pricing, but the code’s talking about “vd_prc_1” and “tr_rate_fs”, it hasn’t got much to match on. Concepts need to be clearly signposted and consistent with the language we use to describe our requirements.
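A before-and-after sketch of what I mean (the names and the tax calculation are hypothetical):

```python
# Before: nothing for a language model - or a human - to match against
# a requirement phrased as "video rentals pricing".
def calc(vd_prc_1, tr_rate_fs):
    return vd_prc_1 * (1 + tr_rate_fs)

# After: concepts clearly signposted, consistent with the language
# we use to describe our requirements.
def rental_price_with_sales_tax(rental_price, sales_tax_rate):
    return rental_price * (1 + sales_tax_rate)
```

Same behaviour, but the second version gives the model (and the next developer) something to match on.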

Duplication’s a problem, because logic repeated 5x takes up 5x the context, and also models might not actually “spot” the repetition, so there’s a risk of drift.

Complexity’s a big problem. LLMs don’t like complex patterns. Overly complex code is likely to fall outside the data distribution, leading to low-confidence matches and low-accuracy predictions.

And then there’s separation of concerns…

LLMs are trained on a huge amount of code snippets of the Stack Overflow variety that contain little or no modularity. That’s their comfort zone, and code they generate will tend to be like that, too.

The irony is that, while they suck at generating effectively modular code – cohesive, loosely-coupled modules that localise the ripple effect of changes – they also suck at modifying code that isn’t highly modular. The wider the ripple effect, the more code gets brought into play, and the further out-of-distribution the context grows.

In this way, they’ll tend to paint themselves into a corner as the code grows. So we really need to keep on top of modular design.

So, yes, maintainability matters in “AI”-assisted coding. A LOT.

<shameless-plug>

If you think your team could use some levelling up or a refresher on software design principles, my training's half-price if you confirm your booking by Jan 31st. Link in my profile.

</shameless-plug>

Why Does Test-Driven Development Work So Well In “AI”-assisted Programming?

In my series on The AI-Ready Software Developer, I propose a set of principles for getting better results using LLM-based coding assistants like Claude Code and Cursor.

Users of these tools report how often and how easily they go off the rails, producing code that doesn’t do what we want and frequently breaking code that was working. As the code grows, these risks grow with it. On large code bases, these tools can really struggle.

From experiment and from real-world use, I’ve seen a number of things help reduce those risks and keep the “AI” on the rails.

  • Working in smaller steps
  • Testing after every step
  • Reviewing code after every step
  • Refactoring code as soon as problems appear
  • Clarifying prompts with examples

Smaller Steps

Human programmers have a limited capacity for cognitive load. There’s only so much we can comfortably wrap our heads around with any real focus, and when we overload ourselves, mistakes become much more likely. When we’re trying to spin many plates, the most likely result is broken plates.

LLMs have a similarly-limited capacity for context. While vendors advertise very impressive maximum context sizes of hundreds of thousands of tokens, research – and experience – shows that they have effective context limits that are orders of magnitude smaller.

The more things we ask models to pay attention to, the less able they are to pay attention to any of them. Accuracy drops off a cliff once the context goes beyond these limits.

After thousands of hours working with “AI” coding assistants, I’ve found I get the best results – the fewest broken plates – when I ask the model to solve one problem at a time.

Continuous Testing

If I make one change to the code and test it straight away, and a test fails, I don’t need to be a debugging genius to figure out which change broke the code. It’s either a quick fix, or a very cheap undo.

If I make ten changes and then test it, it’s going to take significantly longer, potentially, to debug. And if I have to revert to the last known working version, it’s 10x the work and the time lost.

An LLM is more likely to generate breaking changes than a skilled programmer, so frequent testing is even more essential to keep us close to working code.

And if the model’s first change breaks the code, that broken code is now in its context and it – and I – don’t know it’s broken yet. So the model is predicting further code changes on top of a polluted context.

Many of us have been finding that a lot less rework is required when we test after every small step rather than saving up testing for the end of a batch of work.

There’s an implication here, though. If we’re testing and re-testing continuously, our tests need to be very fast.

Continuous Inspection

Left to their own devices, LLMs are very good at generating code they’re pretty bad at modifying later.

Some folks rely on rules and guardrails about code quality which are added to the context with every code-generating interaction with the model. This falls foul of the effective context limits of even the hyperscale LLMs. The model may “obey” – remember, they don’t in reality, they match and predict – some of these rules, but anyone who’s spent more than a few minutes attempting this approach will know that they rarely consistently obey all of them.

And filling up the context with rules runs the risk of “distracting” the LLM from the task at hand.

A more effective approach is to keep the context specific to the task – the problem to be solved – and then, when we’ve got something that works, we can turn our attention to maintainability.

After I’ve seen all my tests pass, I then do a code review, checking everything in the diff between the last working version and the latest. Because these diffs are small – one problem at a time – these code reviews are short and very focused, catching “code smells” as soon as they appear.

The longer I let the problems build up, the more the model ends up wading through its own “slop”, making every new change riskier and riskier.

I pay attention to pretty much the same things I would if I was writing all the code myself:

  • Clarity (LLMs really benefit from this, because… language model, duh!)
  • Complexity – the model needs the code likely to be affected in its context. More code, bigger context. Also, the more complex it is, the more likely it is to end up outside of the model’s training data distribution. Monkey no see, monkey can’t do.
  • Duplication – oh boy, do LLMs love duplicating code and concepts! Again, this is a context size issue. If I duplicate the same logic 5x, and need to make a change to the common logic, that’s 5x the code and 5x the tokens. But also, duplication often signposts useful abstractions and a more modular design. Talking of which…
  • Separation of Concerns – this is a big one. If I ask Claude Code to make a change to a 1,000-line class with 25 direct dependencies, that’s a lot of context, and we’re way outside the distribution. Many people have reported how their coding assistant craps out on code that lacks separation of concerns. I find I really have to keep on top of it. Modules should have one reason to change, and be loosely-coupled to other parts of the system.

On top of these, there are all kinds of low-level issues – security vulnerabilities, hanging imports, dead code etc. – that I find I need to look for. Static analysis can help me check diffs for a whole range of issues that would otherwise be easy to miss – by me, or by an LLM doing the code review. I’m seeing a lot of developers upping their game with linting as they use “AI” more in their work.
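One way to wire that up – a sketch in Python, assuming a Python codebase, git, and a linter such as Ruff on the PATH (the function names are mine):

```python
import subprocess

def changed_python_files(run=subprocess.run):
    # Ask git which files changed since the last commit.
    result = run(["git", "diff", "--name-only", "HEAD"],
                 capture_output=True, text=True)
    return [f for f in result.stdout.splitlines() if f.endswith(".py")]

def lint_the_diff(run=subprocess.run):
    # Lint only what changed - small diffs, short and focused reviews.
    files = changed_python_files(run)
    if files:
        run(["ruff", "check", *files])
```

Injecting `run` keeps the file-selection logic testable without a real repository.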

Continuous Refactoring

Of course, finding code quality issues is only academic if we don’t actually fix them. And, for the reasons I’ve already laid out – we want to give the model the smoothest surface to travel on – I fix them immediately.

And I don’t fix all the problems at once. I fix one problem at a time, again for reasons already stated.

And after I fix each problem, I run the tests again, in case the fix broke anything.

This process of fixing one “code smell” at a time, testing throughout, is called refactoring. You may well have heard of it. You may even think you’re doing it. There’s a very high probability that you’re not.
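For example – a deliberately tiny, hypothetical case – here’s one refactoring step that removes one smell (duplicated subtotal logic), after which the tests get run again before anything else is touched:

```python
# Before: the same subtotal logic in two places - one code smell.
def order_total(items):
    return sum(i["price"] * i["qty"] for i in items)

def invoice_total(lines):
    return sum(l["price"] * l["qty"] for l in lines)

# After one refactoring step: the duplication extracted into one place.
# (Run the tests now, before fixing the next smell.)
def line_subtotal(line):
    return line["price"] * line["qty"]

def order_total_v2(items):
    return sum(line_subtotal(i) for i in items)

def invoice_total_v2(lines):
    return sum(line_subtotal(l) for l in lines)
```

Behaviour is unchanged – which is exactly what re-running the tests confirms.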

Clarifying With Examples

Here’s an experiment you can try for yourself. Prepare two prompts for a small code project. In one prompt, try to describe what you want as precisely as possible in plain language, without giving any examples.

The total of items in the basket is the sum of the item subtotals, which are the item price multiplied by the item quantity

In the second version, give the exact same requirements, but using examples.

The total of items in a shopping basket is the sum of item subtotals:

item #1: price = 9.99, quantity = 1

item #2: price = 11.99, quantity = 2

shopping basket total = (9.99 * 1) + (11.99 * 2) = 33.97

See what kind of results you get with both approaches. How often does the model misinterpret precisely-described requirements vs. requirements accompanied by examples?

It’s worth knowing that code-generating LLMs are typically trained on code samples that are paired with examples like this. When we include examples, we’re giving the model more to match on, limiting the search space to examples that do what we want.

Examples help prevent LLMs grabbing the wrong end of the prompt, and many users have found them to greatly improve accuracy in generated code.

Harking back to the need for very fast tests, these examples make an ideal basis for fast-running automated “unit” tests (where “units” = units of behaviour). It would make good sense to ask our coding assistant to generate them for us, because we’re going to be needing them soon enough.
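The shopping basket example above translates almost mechanically into such a test. A sketch in Python/pytest style (the `(price, quantity)` pair representation is a simplifying assumption of mine):

```python
# The worked example from the prompt, turned directly into a unit test.
def basket_total(items):
    # items is a list of (price, quantity) pairs.
    return sum(price * quantity for price, quantity in items)

def test_basket_total_matches_the_worked_example():
    items = [(9.99, 1), (11.99, 2)]
    # (9.99 * 1) + (11.99 * 2) = 33.97, as in the example
    assert round(basket_total(items), 2) == 33.97
```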

Putting It All Together

If we were to imagine a workflow that incorporates all of these principles – small steps, continuous testing, continuous inspection, continuous refactoring, clarifying with examples – it would look very familiar to the small percentage of developers who practice Test-Driven Development.

TDD has been around for several decades, and builds on practices that have been around even longer. It’s a tried-and-tested approach that’s been enabling the rapid, reliable and sustainable evolution of working software for those in the know. If you look inside the “elite-performing” teams in the DORA data – the ones delivering the most reliable software with the shortest lead times and the lowest cost of change – you’ll find they’re pretty much all doing TDD, or something very like TDD.

TDD specifies what we want software to do using examples, in the form of tests. (Hence, “test-driven”).

It works in micro-iterations where we write a test that fails because it requires something the software doesn’t do yet. Then we write the simplest code – the quickest thing we can think of – to get the tests passing. When all the tests are passing, then we review the changes we’ve made, and if necessary refactor the code to fix any quality problems. Once we’re satisfied that the code is good enough – both working and easy to change – we move on to the next failing test case. And rinse and repeat until our feature or our change is complete.
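In code, one micro-iteration might look like this (a hypothetical, deliberately trivial example):

```python
# RED: write a test that fails because the behaviour doesn't exist yet.
def test_new_customer_gets_no_discount():
    assert discount_for(rentals=0) == 0.0

# GREEN: the simplest code that passes - even hard-coding is fine for now;
# the next failing test will force the real logic out.
def discount_for(rentals):
    return 0.0

# REFACTOR: with all tests green, review the diff, fix any smells,
# re-running the tests after each fix. Then pick the next failing test.
```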

[Image: the TDD micro-cycle]

TDD practitioners work one feature at a time, one usage scenario at a time, one outcome at a time, one example at a time, and one refactoring at a time. Basically, we solve one problem at a time.

And we’re continuously running our tests at every step to ensure the code is always working. While automated tests are a side-effect of driving design using tests, they’re a damned useful one! And because we’re only writing code that’s needed to pass tests, all of our code will end up being tested. It’s a self-fulfilling prophecy.

Embedded in that micro-cycle, many practitioners also use version control to ensure they’re making progress in safe, easily-reverted steps, progressing from one working version of the code to the next.

Some of us have discovered the benefits of a “commit on green, revert on red” approach to version control. If all the tests pass, we commit the changes. If any tests fail, we do a hard reset back to the previous working commit. This means that broken versions of the code don’t end up in the context for the next interaction. (Remember that LLMs can’t distinguish between working code and broken code – it’s all just context.)
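That workflow is simple enough to automate. A sketch in Python, wrapping pytest and git via subprocess (the function name is mine; pytest exits with 0 when all tests pass):

```python
import subprocess

def green_or_revert(run=subprocess.run):
    """Run the tests; commit on green, hard-reset on red.

    Returns the action taken. `run` is injectable so the decision
    logic can be tested without a real repository.
    """
    if run(["pytest", "-q"]).returncode == 0:
        run(["git", "add", "-A"])
        run(["git", "commit", "-m", "green: all tests passing"])
        return "committed"
    # Broken code must never end up in the next interaction's context.
    run(["git", "reset", "--hard"])
    return "reverted"
```

Run it after every micro-step, and broken versions of the code simply never accumulate.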

The beauty of TDD is that the benefits can be yours whether you’re using “AI” or not. Which is why I now teach it both ways.

The key to being effective with “AI” coding assistants is being effective without them.

Shameless Plug

Test-Driven Development is not a skill that you can just switch on, whether you’re doing it with “AI” or without. It takes a lot of practice to get the hang of it, and especially to build the discipline – the habits – of TDD.

An alarming number of TDD tutorials aren’t actually teaching TDD. (And the more people learn from them, the more bad tutorials we’ll no doubt see.)

If your team wants training in Test-Driven Development, including how to do it effectively using tools like Claude Code and Cursor, my 2-day TDD training workshop is half-price if you confirm your booking by January 31st.

The AI-Ready Software Developer #19 – Prompt-and-Fix

For over a billion years now, we’ve known that “code-and-fix” software development, where we write a whole bunch of code for a feature, or even for a whole release, and then check it for bugs, maintainability problems, security vulnerabilities and so on, is by far the most expensive and least effective approach to delivering production-ready software.

If I change one line of code and tests start failing, I’ve got a pretty good idea what broke it, and it’s a very small amount of work (or lost work) to fix it.

If I change 1,000 lines of code, and tests start failing… Well, we’re in a very different ballpark now. Figuring out what change(s) broke the software and then fixing them is a lot of work, and rolling back to the last known working version is a lot of work lost.

Also, checking a single change is likely to bring a lot more focus than checking 1,000. Hence my go-to meme for after-the-fact testing and code reviews:

[Image: meme]

The usual end result of code-and-fix development is buggier, less maintainable software delivered much later and at a much higher cost.

And all things in traditional software development have their “AI”-assisted equivalents, of course.

I see developers offloading large tasks – whole features or even sets of features for a release – and then setting the agentic dogs loose on them while they go off to eat a sandwich or plan a holiday or get a spa treatment or whatever it is software developers do these days.

Then they come back after the agent has finished to “check” the results. I’ve even heard them say “Looks good to me” out loud as they skim hundreds or thousands of changes.

Time for the meme again:

[Image: meme]

Now, there’s no doubting that “AI”-assisted coding tools have improved much in the last 6-12 months. But they’re still essentially LLMs wrapped in WHILE loops, with all the reliability we’ve come to expect.

Odds of it getting one change right? 80%, maybe, with a good wind behind it. Chances of it getting two right? 65%, perhaps.

Odds of it getting 100 changes right? Effectively zero.
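That’s just independent probabilities compounding (the 80% per-change success rate is my illustrative assumption, not a measured figure):

```python
p = 0.80  # assumed probability of getting any single change right

print(p ** 2)    # two changes in a row: roughly 0.64
print(p ** 100)  # a hundred changes in a row: about 2e-10 - effectively zero
```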

Sure, tests help. You gave it tests, right?

Guardrails can help, when the model actually pays attention to them.

External checking – linters and that sort of thing – can definitely help.

But, as anyone who’s spent enough time using these tools can tell you, no matter how we prompt or how we test or how we try to constrain the output, every additional problem we ask it to solve adds risk.

LLMs are unreliable narrators, and there’s really nothing we can do to get around that except to be skeptical of their output.

And then there are the “doom loops”, when the context goes outside the model’s data distribution, and even with infinite iterations, it just can’t do what we want it to do. It just can’t conjure up the code equivalent of “a wine glass full to the brim”.

[Image]

And the bigger the context – the more we ask for – the greater the risk of out-of-distribution behaviour, with each additional pertinent token collapsing the probability of matching the pattern even further. (Don’t believe me? Play one at chess and watch it go off that OOD cliff.)

So problems are very likely with this approach – which I’m calling “prompt-and-fix”, because I can – and finding them and fixing them, or backing out, is a bigger cost.

What I’ve seen most developers do is skim the changes and then wave the problems through into a release with a “LGTM”.

One more time:

[Image: meme]

This creates a comforting temporary illusion of time saved, just like code-and-fix. But we’re storing up a lot more time that’s going to be lost later with production fires, bug fixes and high cost-of-change.

One of the most important lessons in software development is that what’s downstream of present you is upstream of future you – as Sandra Bullock and George Clooney discovered in Gravity.

The antidote to code-and-fix was defect prevention. We take smaller steps, testing and reviewing changes continuously, so most problems are caught long before finding, fixing or reverting them becomes expensive.

I have a meme for that, too:

[Image: meme]

The equivalent in “AI”-assisted software development would be to work in small steps – one change at a time – and to test and review the code continuously after every step.

Sorry, folks. No time for that spa treatment! You’ll be keeping the “AI” on a very short leash – both hands on the wheel at all times, sort of thing.

The other benefit of small steps is that they’re much less likely to push the LLM out of its data distribution. Keeping the model in-distribution means screw-ups happen less often, while immediate problem detection means less work added – or lost – when things go south. It’s a WIN-WIN.

I know that some of you will be reading this and thinking “But Claude can break a big problem down into smaller problems and tackle them one at a time, running the tests and linting the code and all that”.

Yes, in that mode, it certainly can. But every step it takes carries a real risk of taking it in the wrong direction. And direction, despite what some fans of the technology claim, isn’t an LLM’s strong suit. Remember, they don’t understand, they don’t reason, they don’t plan. They recursively match patterns in the input to patterns in the model and predict what token comes next.

Any sense that they’re thinking or reasoning or planning is a product of the Actual Intelligence they’re trained on. It may look plausible, but on closer inspection – and “closer inspection” is often the problem here – it’s usually riddled with “brown M&Ms”.

So, no, you can’t just walk away and let them get on with it. If they take a wrong turn, that error will likely compound through the rest of the processing.

Think of what happens in traditional software development when a misunderstanding or an incorrect assumption goes unchecked while we merrily build on top of that code.