The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that catch the boo-boos – at the very latest – before they get deployed. Ideally, we catch the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far past a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time-out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.
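The mechanics are easy to sketch in a few lines of Python. This is a toy illustration only – the `add` function, the mutation operator and both test suites are invented here, and real mutation testing tools work on the syntax tree or bytecode rather than doing string surgery – but it shows what “killing a mutant” means:

```python
def make_function(source):
    """Compile a function definition from source and return the function."""
    namespace = {}
    exec(source, namespace)
    return namespace["add"]

def mutant_killed(source, tests):
    """Apply one mutation operator (flip the first + to a -), then rerun the tests."""
    mutant = make_function(source.replace("+", "-", 1))
    return any(not test(mutant) for test in tests)

original = "def add(a, b):\n    return a + b\n"

# A weak test: add(5, 0) gives 5 whether the code adds or subtracts.
weak_suite = [lambda add: add(5, 0) == 5]

# A stronger test that tells + and - apart.
strong_suite = [lambda add: add(2, 3) == 5]

print(mutant_killed(original, weak_suite))   # False - the mutant survives
print(mutant_killed(original, strong_suite)) # True - the mutant is killed
```

A surviving mutant doesn’t always mean a missing test – some mutants are “equivalent”, behaving identically to the original – but it’s always worth a look.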

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can run incrementally – only against code that’s changed – so you don’t need to run them again and again against code that isn’t changing. That makes mutation testing practical within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues of maintainability, security and so on is code reviews. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

  1. It catches problems straight away, before I start pouring more cement on them
  2. It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
  3. It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.
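A minimal version of one such rule takes only a few lines of Python, using the standard library’s `ast` module. The threshold here is made up for illustration:

```python
import ast

MAX_PARAMS = 3  # an illustrative threshold, not a universal standard

def check_parameter_counts(source):
    """Walk the AST and report any function with too many parameters."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            count = len(node.args.args)
            if count > MAX_PARAMS:
                findings.append((node.name, count))
    return findings

code = """
def fine(a, b):
    pass

def too_many(a, b, c, d, e):
    pass
"""

print(check_parameter_counts(code))  # [('too_many', 5)]
```

Real linters ship rules of this kind out of the box; the point is that each rule is a small, fast, deterministic check.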

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique; commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!
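For contrast, here’s how mechanical that check is for a deterministic tool. This is a simplified sketch – it only handles straightforward cases, not everything a real linter like Pyflakes covers – but it will find the same unused import every single run:

```python
import ast

def unused_imports(source):
    """Report imported names that are never loaded anywhere in the code."""
    imported, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
    return sorted(imported - used)

code = """
import os
import sys
from math import sqrt

print(sqrt(len(sys.argv)))
"""

print(unused_imports(code))  # ['os']
```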

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that comes up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.

With, say, a Large Language Model, the exact same input fed to the exact same model – their internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.
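A throwaway sketch of the difference, in Python. One throw tells us nothing; ten thousand throws reveal the shape of the distribution – with two fair dice, 7 should dominate, because six of the thirty-six combinations produce it:

```python
import random
from collections import Counter

def throw_many(n, seed):
    """Throw a pair of six-sided dice n times and tally the totals."""
    rng = random.Random(seed)  # seeded, so the experiment is repeatable
    totals = Counter()
    for _ in range(n):
        totals[rng.randint(1, 6) + rng.randint(1, 6)] += 1
    return totals

totals = throw_many(10_000, seed=42)

# A single lucky 7 proves nothing; the distribution is the evidence.
print(totals.most_common(1)[0][0])  # 7, by a wide margin
```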

I’ve been using small-scale, closed-loop experiments – once it’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scoring on how many tests passed).

But these are small-scale. I might run them 10x, because I simply can’t afford the tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.

And that’s why I’ve leaned more on related large-scale peer-reviewed studies – the same phenomena, but in different problem domains – and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fueled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Ready, Fire, Aim!

I teach Test-Driven Development. You may have heard.

And as a teacher of TDD for some quarter of a century now, you can probably imagine that I’ve heard every reason for not doing TDD under the Sun. (And some more reasons under the Moon.)

“It won’t work with our tech stack” is one of the most common, and one of the most easily addressed. I’ve done and seen done TDD on all of the tech stacks, at all levels of abstraction from 4GLs down through assembly language to the hardware design itself. If you can invoke it and get an output, you can automatically test it. And if you can automatically test it, you can write that test first.

(Typically, what they really mean is that the architecture of the framework(s) they’re using doesn’t make unit testing easy. That’s about separation of concerns, though, and usually work-aroundable.)

The second most common reason I hear is perhaps the more puzzling: “But how can I write tests first if I don’t know what the code’s supposed to do?”

The implication here is that developers are writing solution code without a clear idea of what they expect it to do – that they’re retrofitting intent to implementations.

I find that hard to imagine. When I write code, I “hear the tune” in my head, so to speak. The intended meaning is clear to me. When I run it, my understanding might turn out to be wrong. But there is an expectation of what the code will do: I think it’s going to do X.

My best guess is that we all kind of sort of have those inner expectations when we write code. The code has meaning to us, even if we turn out to have understood it wrong when we run it.

So I could perhaps rephrase “How can I write tests first if I don’t know what the code’s supposed to do?” to articulate what might actually be happening:

“How do I express what I want the code to do before I’ve seen that code?”

Take this example of code that calculates the total of items in a shopping basket:

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

When I write this code, in my head – often subconsciously – I have expectations about what it’s going to do. I start by declaring a sum of zero, because an empty basket will have a total of zero.

Then, for every item in the basket, I add that item’s price multiplied by its quantity to the sum.

So, in my head, there’s an expectation that if the basket had one item with a quantity of one, the total would equal just the price of that item.

If that item had a quantity of two, then the total would be the price multiplied by two.

If there were two items, the total would be the sum of price times quantity of both items.

And so on.

You’ll notice that my thinking isn’t very abstract. I’m thinking more with examples than with symbols.

  • No items.
  • One item with quantity of one.
  • One item with quantity of two.
  • Two items.

If you asked me to write unit tests for the total function, these examples might form the basis of them.

A test-driven approach just flips the script. I start by listing examples of what I expect the function to do, and then – one example at a time – I write a failing test, write the simplest code to pass the test, and then refactor if I need to before moving on to the next example.

    def test_total_of_empty_basket(self):
        items = []
        basket = Basket(items)
        
        self.assertEqual(0.0, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        return 0.0

What I’m doing – and this is part of the art of Test-Driven Development – is externalising the subconscious expectations I would no doubt have as I write the total function’s implementation.

Importantly, I’m not doing it in the abstract – “the total of the basket is the sum of price times quantity for all of its items”.

I’m using concrete examples, like the total of an empty basket, or the total of a single item of quantity one.

“But, Jason, surely it’s six of one and half-a-dozen of the other whether we write the tests first or write the implementation first. Why does it matter?”

The psychology of it is very interesting. You may have heard life coaches and business gurus tell their audience to visualise their goal – picture themselves in their perfect home, or sipping champagne on their yacht, or making that acceptance speech, or destabilising western democracy. It’s good to have goals.

When we set out with a clear goal, we’re much more likely to achieve it. It’s a self-fulfilling prophecy.

We make outcomes visible and concrete by adding key details – how many bedrooms does your perfect home have? How big is the yacht? Which Oscar did you win? How little regulation will be applied to your business dealings?

What should the total of a basket with no items be? What should the total of a basket with a single item with price 9.99 and quantity 1 be?

    def test_total_of_single_item(self):
        items = [
            Item(9.99, 1),
        ]
        basket = Basket(items)

        self.assertEqual(9.99, basket.total())

We precisely describe the “what” – the desired properties of the outcome – and work our way backwards directly to the “how”. What would be the simplest way of achieving that outcome?

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            return self.items[0].price
        return 0.0

Then we move on to the next outcome – the next example:

    def test_total_of_item_with_quantity_of_2(self):
        items = [
            Item(9.99, 2)
        ]
        basket = Basket(items)

        self.assertEqual(19.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            item = self.items[0]
            return item.price * item.quantity
        return 0.0

And then our final example:

    def test_total_of_two_items(self):
        items = [
            Item(9.99, 1),
            Item(5.99, 1)
        ]
        basket = Basket(items)

        self.assertEqual(15.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

If we enforce that items must have a price >= 0.0 and an integer quantity > 0, this code should cover any list of items, including an empty list, with any price and any quantity.

And our unit tests cover every outcome. If I were to break this code so that, say, an empty basket causes an error to be thrown, one of these tests would fail. I’d know straight away that I’d broken it.

This is another self-fulfilling prophecy of starting with the outcome and working directly backwards to the simplest way of achieving it – we end up with the code we need, and only the code we need, and we end up with tests that give us high assurance after every change that those outcomes are still being satisfied.

Which means that if I were to refactor the design of the total function:

    def total(self):
        return sum(
                map(lambda item: item.subtotal(), self.items))

I can do that with high confidence.
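For completeness: the refactored total delegates to an item.subtotal() method that the snippets above never show. A hypothetical Item that would support it – folding in the price and quantity constraints mentioned earlier – might look like this (my sketch, not code from the original tests):

```python
class Item:
    """A sketch of the Item the tests assume, with the constraints
    discussed above: price >= 0.0 and an integer quantity > 0."""

    def __init__(self, price, quantity):
        if price < 0.0:
            raise ValueError("price must be >= 0.0")
        if not isinstance(quantity, int) or quantity < 1:
            raise ValueError("quantity must be an integer > 0")
        self.price = price
        self.quantity = quantity

    def subtotal(self):
        # Basket.total() now just sums these.
        return self.price * self.quantity

print(Item(9.99, 2).subtotal())  # 19.98
```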

If I write the code and then write tests for it, several things tend to happen:

  • I may end up with code I didn’t actually need, and miss code I did need
  • I may well miss important cases, because unit tests? Such a chore when the work’s already done! I just wanna ship it!
  • It’s not safe to refactor the new code without those tests, so I have to leave that until the end, and – well, yeah. Refactoring? Such a chore! etc etc etc.
  • The tests I choose – the “what” – are now being driven by my design – the “how”. I’m asking “What test do I need to cover that branch?” and not “What branch do I need to pass that test?”

And finally, there’s the issue of design methodology. Effective software design methodologies are usually usage-driven. We don’t start by asking “What does this feature do?” We start by asking “How will this feature be used?”

What the feature does is a consequence of how it will be used. We don’t build stuff and then start looking for use cases for it. Well, I don’t, anyway.

In a test-driven approach, my tests are the first users of the total function. That’s what my tests are about – user outcomes. I’m thinking about the design from the user’s – the external – perspective and driving the design of my code from the outside in.

I’m not thinking “How am I going to test this total function?” I’m thinking “How will the user know the total cost of the basket?” and my tests reveal the need for a total function. I use it in the test, and that tells me I need it.

“Test-driven”. In case you were wondering what that meant.

When we design code from the user’s perspective, we’re far more likely to end up with useful code. And when we design code with tests playing the role of the user, we’re far more likely to end up with code that works.

One final question: if I find myself asking “What is this function supposed to do?”, is that a cue for me to start writing code in the hope that somebody will find a use for it?

Or is that my cue to go and speak to someone who understands the user’s needs?

The AI-Ready Software Developer #20 – It’s The Bottlenecks, Stupid!

For many years now, cycling has been consistently the fastest way to get around central London. Faster than taking the tube. Faster than taking the train. Faster than taking the bus. Faster than taking a cab. Faster than taking your car.

[Image]

All of these other modes of transport are, in theory, faster than a bike. But the bike will tend to get there first, not because it’s the fastest vehicle, but because it’s subject to the fewest constraints.

Cars, cabs, trains and buses move not at the top speed of the vehicle, but at the speed of the system.

And, of course, when we measure their journey speed at an average 9 mph, we don’t see them crawling along steadily at that pace.

“Travelling” in London is really mostly waiting. Waiting at junctions. Waiting at traffic lights. Waiting to turn. Waiting for the bus to pull out. Waiting on rail platforms. Waiting at tube stations. Waiting for the pedestrian to cross. Waiting for that van to unload.

Cyclists spend significantly less time waiting, and that makes them faster across town overall.

Similarly, development teams that can produce code much faster, but work in a system with real constraints – lots of waiting – will tend to be outperformed overall by teams who might produce code significantly slower, but who are less constrained – spend less time waiting.

What are developers waiting for? What are the traffic lights, junctions and pedestrian crossings in our work?

If I submit a Pull Request, I’m waiting for it to be reviewed. If I send my code for testing, I’m waiting for the results. If I don’t have SQL skills, and I need a new column in the database, I’m waiting for the DBA to add it for me. If I need someone on another team to make a change to their API, more waiting. If I pick up a feature request that needs clarifying, I’m waiting for the customer or the product owner to shed some light. If I need my manager to raise a request for a laptop, then that’s just yet more waiting.

Teams with handovers, sign-offs and other blocking activities in their development process will tend to be outperformed by teams who spend less time waiting, regardless of the raw coding power available to them.

Teams who treat activities like testing, code review, customer interaction and merging as “phases” in their process will tend to be outperformed by teams who do them continuously, regardless of how many LOC or tokens per minute they’re capable of generating.

This isn’t conjecture. The best available evidence is pretty clear. Teams who’ve addressed the bottlenecks in their system are getting there sooner – and in better shape – than teams who haven’t. With or without “AI”.

The teams who collaborate with customers every day – many times a day – outperform teams who have limited, infrequent access.

The teams who design, test, review, refactor and integrate continuously outperform teams who do them in phases.

The teams with wider skillsets outperform highly-specialised teams.

The teams working in cohesive and loosely-coupled enterprise architectures outperform teams working in distributed monoliths.

The teams with more autonomy outperform teams working in command-and-control hierarchies.

None of these things comes with your Claude Code plan. You can’t buy them. You can’t install them. But you can learn them.

And if you’re ticking none of those boxes, and you still think a code-generating supercar is going to make things better, I have a Bugatti Chiron Sport you might be interested in buying. Perfect for the school run!

The AI-Ready Software Developer #19 – Prompt-and-Fix

For over a billion years now, we’ve known that “code-and-fix” software development, where we write a whole bunch of code for a feature, or even for a whole release, and then check it for bugs, maintainability problems, security vulnerabilities and so on, is by far the most expensive and least effective approach to delivering production-ready software.

If I change one line of code and tests start failing, I’ve got a pretty good idea what broke it, and it’s a very small amount of work (or lost work) to fix it.

If I change 1,000 lines of code, and tests start failing… Well, we’re in a very different ballpark now. Figuring out what change(s) broke the software and then fixing them is a lot of work, and rolling back to the last known working version is a lot of work lost.

Also, checking a single change is likely to bring a lot more focus than checking 1,000. Hence my go-to meme for after-the-fact testing and code reviews:

[Image]

The usual end result of code-and-fix development is buggier, less maintainable software delivered much later and at a much higher cost.

And all things in traditional software development have their “AI”-assisted equivalents, of course.

I see developers offloading large tasks – whole features or even sets of features for a release – and then setting the agentic dogs loose on them while they go off to eat a sandwich or plan a holiday or get a spa treatment or whatever it is software developers do these days.

Then they come back after the agent has finished to “check” the results. I’ve even heard them say “Looks good to me” out loud as they skim hundreds or thousands of changes.

Time for the meme again:

[Image]

Now, no doubting that “AI”-assisted coding tools have improved much in the last 6-12 months. But they’re still essentially LLMs wrapped in WHILE loops, with all the reliability we’ve come to expect.

Odds of it getting one change right? 80%, maybe, with a good wind behind it. Chances of it getting two right? 65%, perhaps.

Odds of it getting 100 changes right? Effectively zero.
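The arithmetic behind those odds is just independent probabilities multiplied together – assuming, generously, that each change is independent and the per-change success rate doesn’t degrade as the context grows:

```python
# If each change lands correctly with probability p, the chance of
# n changes all landing correctly is p ** n.
p = 0.8

for n in (1, 2, 10, 100):
    print(n, round(p ** n, 10))

# 0.8 ** 2 is 0.64 - the "65%, perhaps" above.
# 0.8 ** 100 is roughly 2e-10: effectively zero.
```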

Sure, tests help. You gave it tests, right?

Guardrails can help, when the model actually pays attention to them.

External checking – linters and that sort of thing – can definitely help.

But, as anyone who’s spent enough time using these tools can tell you, no matter how we prompt or how we test or how we try to constrain the output, every additional problem we ask it to solve adds risk.

LLMs are unreliable narrators, and there’s really nothing we can do to get around that except to be skeptical of their output.

And then there are the “doom loops”, when the context goes outside the model’s data distribution, and even with infinite iterations, it just can’t do what we want it to do. It just can’t conjure up the code equivalent of “a wine glass full to the brim”.

[Image]

And the bigger the context – the more we ask for – the greater the risk of out-of-distribution behaviour, with each additional pertinent token collapsing the probability of matching the pattern even further. (Don’t believe me? Play one at chess and watch it go off that OOD cliff.)

So problems are very likely with this approach – which I’m calling “prompt-and-fix”, because I can – and finding them and fixing them, or backing out, is a bigger cost.

What I’ve seen most developers do is skim the changes and then wave the problems through into a release with a “LGTM”.

One more time:

[Image]

This creates a comforting temporary illusion of time saved, just like code-and-fix. But we’re storing up a lot more time that’s going to be lost later with production fires, bug fixes and high cost-of-change.

One of the most important lessons in software development is that what’s downstream of present you is upstream of future you – as Sandra Bullock and George Clooney discovered in Gravity.

The antidote to code-and-fix was defect prevention. We take smaller steps, testing and reviewing changes continuously, so most problems are caught long before finding, fixing or reverting them becomes expensive.

I have a meme for that, too:

[Image]

The equivalent in “AI”-assisted software development would be to work in small steps – one change at a time – and to test and review the code continuously after every step.

Sorry, folks. No time for that spa treatment! You’ll be keeping the “AI” on a very short leash – both hands on the wheel at all times, sort of thing.

The other benefit of small steps is that they’re much less likely to push the LLM out of its data distribution. Keeping the model in-distribution means screw-ups happen less often, while immediate problem detection reduces the work added or lost when things go south. It’s a WIN-WIN.

I know that some of you will be reading this and thinking “But Claude can break a big problem down into smaller problems and tackle them one at a time, running the tests and linting the code and all that”.

Yes, in that mode, it certainly can. But every step it takes carries a real risk of taking it in the wrong direction. And direction, despite what some fans of the technology claim, isn’t an LLM’s strong suit. Remember, they don’t understand, they don’t reason, they don’t plan. They recursively match patterns in the input to patterns in the model and predict what token comes next.

Any sense that they’re thinking or reasoning or planning is a product of the Actual Intelligence they’re trained on. It may look plausible, but on closer inspection – and “closer inspection” is often the problem here – it’s usually riddled with “brown M&Ms”.

So, no, you can’t just walk away and let them get on with it. If they take a wrong turn, that error will likely compound through the rest of the processing.

Think of what happens in traditional software development when a misunderstanding or an incorrect assumption goes unchecked while we merrily build on top of that code.

Do You Know Where Your Load-Bearing Code Is?

Do you know where your load-bearing code is?

90% of the time, TDD is enough to ensure that everyday code is reliable enough.

But some code really, really needs to work. I call it “load-bearing code”, and it’s rare to find a software product or system that doesn’t have any code that’s critical to its users in some way.

In my 3-day Code Craft training workshop, we go beyond Test-Driven Development to look at a couple of more advanced testing techniques that can help us make sure that code that really, really needs to work in all likelihood does.

It raises the question, how do we know which parts of our code are load-bearing, and therefore might warrant going that extra mile?

An obvious indicator is critical paths. If a feature or a usage scenario is a big deal for users and/or for the business, tracing which code lies on the execution path for it can lead us to code that may require higher assurance.

Some teams work with stakeholders to assess risk for usage scenarios, perhaps captured alongside examples that they use to drive the design (e.g., in .feature files), and then when these tests are run, use instrumentation (e.g., test coverage) to build a “heat map” of their code that graphically illustrates which code is cool – no big deal if this fails – and which code might be white hot – the consequences will be severe if it fails.

(It’s not as hard to build a tool like this as you might think, BTW.)
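A minimal sketch of how such a heat map tool might combine the two inputs – the scenario names, function names and risk scores below are all invented for illustration:

```python
# Hypothetical sketch: build a "heat map" of code by combining
# stakeholder-assessed scenario risk with per-scenario test coverage.
# All scenario names, function names and scores are invented examples.

def build_heat_map(scenario_risk, coverage):
    """scenario_risk: {scenario: risk score 0..1}
       coverage: {scenario: set of functions its tests execute}
       Returns {function: heat}, where heat is the risk of the
       hottest scenario that touches that function."""
    heat = {}
    for scenario, functions in coverage.items():
        risk = scenario_risk.get(scenario, 0.0)
        for fn in functions:
            heat[fn] = max(heat.get(fn, 0.0), risk)
    return heat

scenario_risk = {"take payment": 0.9, "browse catalogue": 0.2}
coverage = {
    "take payment": {"charge_card", "format_price"},
    "browse catalogue": {"format_price", "list_products"},
}

heat = build_heat_map(scenario_risk, coverage)
# charge_card comes out white hot; list_products stays cool.
print(sorted(heat.items(), key=lambda kv: -kv[1]))
```

A real version would pull the coverage data from instrumentation rather than a hand-written dictionary, but the shape of the calculation is the same.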

A less obvious indicator is dependencies. Code that’s widely reused, directly or indirectly, also presents a potentially higher risk. Static analysis tools like NDepend can calculate the “rank” of a method or a class or a package in the system (as in Google’s PageRank) to show where code is widely reused.
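Here’s a rough sketch of that idea – a PageRank-style power iteration over a made-up dependency graph. The module names are invented; a real tool would extract the edges from static analysis:

```python
# Hypothetical sketch: rank code elements by how widely they're depended
# on, PageRank-style. An edge points from a caller to what it calls, so a
# high rank means many (highly-ranked) things rely on you.

def dependency_rank(edges, damping=0.85, iterations=50):
    nodes = set()
    for caller, callee in edges:
        nodes.update((caller, callee))
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_links = {n: [c for (a, c) in edges if a == n] for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank across everything.
                for t in nodes:
                    new_rank[t] += damping * rank[n] / len(nodes)
        rank = new_rank
    return rank

# Invented example graph: three services all lean on MoneyUtils.
edges = [("OrderService", "MoneyUtils"), ("InvoiceService", "MoneyUtils"),
         ("ReportService", "MoneyUtils"), ("OrderService", "InvoiceService")]
ranks = dependency_rank(edges)
# MoneyUtils, being the most widely reused, ends up with the highest rank.
```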

Monitoring how often code’s executed in production can produce a similar, but dynamic, picture of which code’s used most often.

These are all measures of the potential impact of failure. But what about the likelihood of failure? A function may be on a critical path, and reused widely, but if it’s just adding a list of numbers together, it’s not very likely to fail.

Complex logic, on the other hand, presents many more ways of being wrong – the more complex, the greater that risk.

Code that’s load-bearing and complex should attract our attention.

And code that’s load-bearing, complex and changing often is white hot. That should be balanced by the strength of our testing. The hotter the code, the more exhaustively and the more frequently it might need testing.
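One (hypothetical) way to turn that into a single number is to multiply normalised impact, complexity and churn scores, so that being cool on any one axis keeps the overall temperature down. The thresholds here are arbitrary illustration, not recommendations:

```python
# Hypothetical sketch: a "temperature" per function, combining impact
# (from the heat map), complexity (from a static analysis tool) and
# churn (from version control history). All the numbers are invented.

def temperature(impact, cyclomatic_complexity, changes_last_quarter):
    # Multiplicative: a low score on any axis keeps the result cool.
    likelihood = min(cyclomatic_complexity / 10.0, 1.0)
    churn = min(changes_last_quarter / 20.0, 1.0)
    return impact * likelihood * churn

# A simple, stable function on a critical path stays cool...
cool = temperature(impact=0.9, cyclomatic_complexity=1, changes_last_quarter=1)
# ...while complex, frequently-changed, load-bearing code is white hot.
hot = temperature(impact=0.9, cyclomatic_complexity=12, changes_last_quarter=18)
```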

Hopefully, with a testing specialist in the team, you will have a good repertoire of software verification techniques to match against the temperature of the code – guided inspection, property-based testing, Design By Contract, decision tables, response matrices, state transition tables, model checking, maybe even proofs of correctness when it really needs to work.

But a good start is knowing where your hottest code actually is.

101 Uses of An Abstract Test #43: Contract Testing

As distributed systems have become more and more prevalent, I’ve seen how teams spend an increasing amount of time putting out fires caused by dependencies changing.

Team A goes to bed with a spiffy working doodad, and in the wee small hours, Team B does a release of their spiffy working thingummy that Team A just happens to rely on. Team A wakes up to a decidedly not-spiffy doodad that has mysteriously stopped working overnight.

Team B’s thingummy may well have passed all their tests before they deployed it, but their tests might not show when a change they’ve made isn’t backwards-compatible with how their clients are using it. For a system to be correct, the interactions between components need to be correct.

We can define the correctness of interactions between clients and suppliers using contracts that determine what’s expected of both parties.

The supplier promises to provide certain benefits to the client – the weather forecast for the next 7 days at their location, for example. But that promise only holds under specific circumstances – the client’s location has to be provided, and must be expressed in Decimal Degrees.

If the supplier changes the contract so that locations must now be provided in Degrees, Minutes and Seconds, it may well pass all their tests, but it breaks the client, who’s now getting error messages instead of weather forecasts.

Now, the client will likely have some integration tests where the end point is real. And those are the tests that enforce expectations about interactions with that end point.

What if we abstracted those tests so that the end point could be the real deal, or a stub or a mock? The object or function that’s responsible for the interaction could be supplied to each test via, say, a factory method that’s abstract in the test base class, and can be overridden in subclasses, enabling us to vary the set-up as we wish – real or pretend.

Then we can run the exact same tests with and without the real end point. If all the tests using pretend versions are passing, but the ones using the real thing suddenly start failing, that strongly suggests something’s changed at the other end. If the “unit” tests start failing too, then the problem is at our end.
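Here’s a sketch of what that might look like in Python with `unittest` – the weather service API and its stub are invented stand-ins; only the abstract-test structure matters:

```python
# Hypothetical sketch of an abstract contract test. The weather service
# interface - forecast(lat, lon) in decimal degrees, 7 days back - is an
# invented example matching the contract described in the text.
import unittest

class StubWeatherService:
    """Pretend end point that honours the contract."""
    def forecast(self, lat, lon):
        return [{"day": d, "temp_c": 20} for d in range(7)]

class WeatherContractTests:
    """Abstract: subclasses supply the end point via the factory method."""

    def create_service(self):
        raise NotImplementedError

    def test_returns_seven_day_forecast_for_decimal_degrees(self):
        service = self.create_service()
        forecast = service.forecast(51.5072, -0.1276)  # decimal degrees
        self.assertEqual(7, len(forecast))

class StubbedWeatherTests(WeatherContractTests, unittest.TestCase):
    def create_service(self):
        return StubWeatherService()

# A sibling subclass would override create_service() to return the real
# client. Running both tells you whether a failure is at your end or theirs:
#
# class RealWeatherTests(WeatherContractTests, unittest.TestCase):
#     def create_service(self):
#         return RealWeatherService("https://api.example.com")  # invented URL
```

Note that the abstract class deliberately doesn’t extend `unittest.TestCase`, so the test runner only collects the concrete subclasses.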

This gives client developers a heads-up as soon as integration fires start. But the real payoff is when the team at the other end can run those tests themselves before they even think about deploying.

The AI-Ready Software Developer #4 – Continuous Testing

Now, where were we? Ah, yes.

So, we’re working in small steps with our LLM, solving one problem at a time, which makes it easier for the model to pay attention to important details (just like in real life).

We’re keeping our contexts small, and making them more specific by clarifying with examples to reduce the risk of misinterpretation (just like in real life).

And we’re cleanly separating the different concerns in our architecture to limit the “blast radius” when the model changes code, reducing the risk of boo-boos (just like in real life), and keeping diffs smaller. (More about that in a future post – for now, smaller diffs == gooderer).

When we apply all three of these practices together, it opens a door: we can test more often.

Those examples we used to clarify our requirements can become tests we can perform after the model has done that work to check that it did what we told it to.

We could perform these tests ourselves by running the software, or by accessing the code directly at the command line in a Read-Evaluate-Print Loop (REPL). Or, if a UI is involved, we could run it and click the buttons ourselves.

I highly recommend seeing it work with your own eyes at least once. Trust no one, Agent Mulder!

But what about code that was working that the model has since changed? As the software grows, manually retreading dozens, hundreds, maybe thousands of tests to make sure we’re obeying Gorman’s First Law of Software Development:

“Thou shalt not break shit that was working”

– is going to take a lot of time. Eventually, the total testing effort becomes O(n²), where n is the number of tests: every time we add a new one – one problem at a time, remember? – we have to repeat all the existing tests.

Automation to the rescue! If we find ourselves performing the same test over and over, we can write code to perform it for us. Or we can get the LLM to write it for us (be careful here – triple-check every test the model writes! I’ve been burned by that multiple times).

And this is where clean separation of concerns turns into a superhero.

If the code that, say, calculates mortgage repayments is buried inside the module that generates the Repayments web page, and which also directly accesses an external web service to get interest rates, then you’ll have little choice but to test through the browser (or something pretending to be the browser).

But if there’s a separate MortgageCalculator module that does this work, and that module is decoupled from the code that fetches interest rates, a test can be automated directly against it that will run very reliably and very fast – milliseconds instead of seconds. Thousands of those kinds of “unit” tests could run in seconds instead of hours.
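A sketch of what that decoupling might look like – the rate-provider interface is invented, and the formula is the standard annuity repayment calculation:

```python
# Hypothetical sketch: the repayment calculation decoupled from
# rate-fetching. The rate_provider interface is invented; the formula is
# the standard annuity (amortised loan) repayment formula.

class MortgageCalculator:
    def __init__(self, rate_provider):
        # Anything with an annual_rate() method will do: a web-service
        # client in production, a fixed stub in tests.
        self._rates = rate_provider

    def monthly_repayment(self, principal, years):
        monthly_rate = self._rates.annual_rate() / 12
        months = years * 12
        if monthly_rate == 0:
            return principal / months
        factor = (1 + monthly_rate) ** months
        return principal * monthly_rate * factor / (factor - 1)

class FixedRate:
    """Test stub standing in for the real interest-rate service."""
    def __init__(self, rate):
        self._rate = rate

    def annual_rate(self):
        return self._rate

# A fast, deterministic "unit" test - no browser, no network:
calc = MortgageCalculator(FixedRate(0.06))
payment = calc.monthly_repayment(principal=100_000, years=25)
# 100k at 6% over 25 years comes out at roughly 644 a month.
```

Swapping `FixedRate` for the real rate-fetching client is the only change needed to turn this into an integration test – the same abstract-test trick as in contract testing.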

Which means comprehensively retesting your software after every small step, giving you an instant heads-up if the LLM broke anything (AND IT WILL), becomes completely practical.

Once again, you won’t be surprised to learn that this is very good news whether we’re using “AI” or not. Many teams consider it essential.

Code Reviews as Exploratory Testing


Code reviews? Let me tell you about code reviews!

To me, a code review done by people is exploratory testing. We gather around a piece of code (e.g., a merge diff for a new feature). We go through the code and we ask ourselves “What do we think of this?”

Maybe we see a method or a function that has control flow nested 4 deep. Eurgh! Difficult to test and easy to break (such cyclomatic complexity, much wow). So we flag it up.

So far, so normal.

Once that code quality “bug” has been flagged up, I’m sure we both agree that it needs fixing. So we fix it. Job done? Now, here’s where you and I may part company.

It’s almost guaranteed that won’t be the last time that problem rears its head in our code. So, as well as fixing the complex conditional we found in our review, we also fix the process that allowed the problem to make it that far – and waste a bunch of time – in the first place.

When we find a logic error in our code by exploratory testing, we don’t just fix the bug. We write a regression test in case a future change breaks it again. (We do, right?)

And when we find a code quality bug, we shouldn’t just refactor that example. We should add a code quality check for it to our suite of code inspections – automated if at all possible – that will catch it as soon as it reappears somewhere else in the code.
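For example, a custom check for deeply-nested control flow can be sketched in a few lines with Python’s `ast` module. The nesting limit of 3 is an arbitrary choice for illustration:

```python
# Hypothetical sketch of a custom code-quality check: flag functions
# whose control flow nests deeper than a threshold.
import ast

NESTING_LIMIT = 3
BLOCKS = (ast.If, ast.For, ast.While, ast.Try, ast.With)

def max_nesting(node, depth=0):
    """Deepest block-statement nesting anywhere under this node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, BLOCKS) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest

def deeply_nested_functions(source):
    tree = ast.parse(source)
    return [f.name for f in ast.walk(tree)
            if isinstance(f, ast.FunctionDef)
            and max_nesting(f) > NESTING_LIMIT]

sample = """
def ok(xs):
    for x in xs:
        if x:
            print(x)

def eurgh(xs):
    for x in xs:
        if x:
            while x:
                if x > 1:
                    x -= 1
"""
flagged = deeply_nested_functions(sample)
```

Wire something like this into the pre-merge pipeline and that particular review finding never needs flagging by a human again.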

Now, you can take this too far, as with all things. Automating regression tests is D.R.Y. applied to our testing process. If we need to perform the same test over and over, automating it makes a lot of sense.

But D.R.Y. has caveats, and one of those caveats is The Rule of Three. On average, we wait until we’ve seen something repeated 3 times before we refactor. This increases the odds that:

a. The refactoring will pay off later (the more examples we see, the more likely there’ll be more in the future)

b. We have more examples to guide us towards the better design.

Both apply as much to duplicated effort in our process as they do to duplication in our code.

So we might want to keep a log of the problems our code reviews find, and when we see the same type of issue appear 3 times (or thereabouts), that might be our cue to look into building a check for it into our Continuous Inspection rule suite. Maybe our linter already has a rule we can use. Maybe we’ll need to write our own custom rule for it. (That’s a very undervalued skillset, BTW.) Maybe we could train a small ANN to detect it. Maybe we’ll need to add it to the manual inspection checklist.

A good yardstick might be that the same type of code quality issue doesn’t appear in merges (or attempted merges) more than 3 times.

And there’s more. There are teams I’ve worked with who not only add a check to their Continuous Inspection suite, but also ask “Why does this problem happen in the first place?” How do 500-line functions become 500-line functions? How do deeply-nested IFs become deeply-nested IFs? How do classes end up with 12 distinct responsibilities and 25 dependencies?

The answer, BTW, is that – more often than not – the way functions get 500 lines long, IFs get deeply nested and classes end up doing so many different things is because the people writing that code didn’t see it as a problem.

And that’s usually where I come in 🙂