The AI-Ready Software Developer #19 – Prompt-and-Fix

For over a billion years now, we’ve known that “code-and-fix” software development, where we write a whole bunch of code for a feature, or even for a whole release, and then check it for bugs, maintainability problems, security vulnerabilities and so on, is by far the most expensive and least effective approach to delivering production-ready software.

If I change one line of code and tests start failing, I’ve got a pretty good idea what broke it, and it’s a very small amount of work (or lost work) to fix it.

If I change 1,000 lines of code, and tests start failing… Well, we’re in a very different ballpark now. Figuring out what change(s) broke the software and then fixing them is a lot of work, and rolling back to the last known working version is a lot of work lost.

Also, checking a single change is likely to bring a lot more focus than checking 1,000. Hence my go-to meme for after-the-fact testing and code reviews:

Image

The usual end result of code-and-fix development is buggier, less maintainable software delivered much later and at a much higher cost.

And all things in traditional software development have their “AI”-assisted equivalents, of course.

I see developers offloading large tasks – whole features or even sets of features for a release – and then setting the agentic dogs loose on them while they go off to eat a sandwich or plan a holiday or get a spa treatment or whatever it is software developers do these days.

Then they come back after the agent has finished to “check” the results. I’ve even heard them say “Looks good to me” out loud as they skim hundreds or thousands of changes.

Time for the meme again:

Image

Now, no doubting that “AI”-assisted coding tools have improved much in the last 6-12 months. But they’re still essentially LLMs wrapped in WHILE loops, with all the reliability we’ve come to expect.

Odds of it getting one change right? 80%, maybe, with a good wind behind it. Chances of it getting two right? 65%, perhaps.

Odds of it getting 100 changes right? Effectively zero.
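Here's a back-of-an-envelope sketch of that arithmetic. It assumes (and it is only an assumption) that every change succeeds independently, with the same 80% odds each time. The function name is mine, purely for illustration.

```python
def odds_all_correct(p_single: float, n_changes: int) -> float:
    """Probability that n independent changes are ALL correct.

    Assumes every change has the same chance of being right,
    and that one change going wrong doesn't affect the others.
    """
    return p_single ** n_changes


for n in (1, 2, 10, 100):
    print(f"{n:>3} changes: {odds_all_correct(0.8, n):.10f}")
```

Two changes at 80% each comes out at 64%, which is where my "65%, perhaps" comes from. Real changes aren't independent, of course, but the direction of travel is the point.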

Sure, tests help. You gave it tests, right?

Guardrails can help, when the model actually pays attention to them.

External checking – linters and that sort of thing – can definitely help.

But, as anyone who’s spent enough time using these tools can tell you, no matter how we prompt or how we test or how we try to constrain the output, every additional problem we ask it to solve adds risk.

LLMs are unreliable narrators, and there’s really nothing we can do to get around that except to be skeptical of their output.

And then there are the “doom loops”, when the context goes outside the model’s data distribution, and even with infinite iterations, it just can’t do what we want it to do. It just can’t conjure up the code equivalent of “a wine glass full to the brim”.

Image

And the bigger the context – the more we ask for – the greater the risk of out-of-distribution behaviour, with each additional pertinent token collapsing the probability of matching the pattern even further. (Don’t believe me? Play one at chess and watch it go off that OOD cliff.)

So problems are very likely with this approach – which I’m calling “prompt-and-fix”, because I can – and finding them and fixing them, or backing out, is a bigger cost.

What I’ve seen most developers do is skim the changes and then wave the problems through into a release with a “LGTM”.

One more time:

Image

This creates a comforting temporary illusion of time saved, just like code-and-fix. But we’re storing up a lot more time that’s going to be lost later with production fires, bug fixes and high cost-of-change.

One of the most important lessons in software development is that what’s downstream of present you is upstream of future you – as Sandra Bullock and George Clooney discovered in Gravity.

The antidote to code-and-fix was defect prevention. We take smaller steps, testing and reviewing changes continuously, so most problems are caught long before finding, fixing or reverting them becomes expensive.

I have a meme for that, too:

Image

The equivalent in “AI”-assisted software development would be to work in small steps – one change at a time – and to test and review the code continuously after every step.

Sorry, folks. No time for that spa treatment! You’ll be keeping the “AI” on a very short leash – both hands on the wheel at all times, sort of thing.

The other benefit of small steps is that they’re much less likely to push the LLM out of its data distribution. Keeping the model in-distribution more of the time, so screw-ups happen less often, while also catching problems immediately, so less work is added or lost when things go south, is a WIN-WIN.

I know that some of you will be reading this and thinking “But Claude can break a big problem down into smaller problems and tackle them one at a time, running the tests and linting the code and all that”.

Yes, in that mode, it certainly can. But every step it takes carries a real risk of taking it in the wrong direction. And direction, despite what some fans of the technology claim, isn’t an LLM’s strong suit. Remember, they don’t understand, they don’t reason, they don’t plan. They recursively match patterns in the input to patterns in the model and predict what token comes next.

Any sense that they’re thinking or reasoning or planning is a product of the Actual Intelligence they’re trained on. It may look plausible, but on closer inspection – and “closer inspection” is often the problem here – it’s usually riddled with “brown M&Ms”.

So, no, you can’t just walk away and let them get on with it. If they take a wrong turn, that error will likely compound through the rest of the processing.

Think of what happens in traditional software development when a misunderstanding or an incorrect assumption goes unchecked while we merrily build on top of that code.

Do You Know Where Your Load-Bearing Code Is?

Do you know where your load-bearing code is?

90% of the time, TDD is enough to assure us that code of the everyday variety is sufficiently reliable.

But some code really, really needs to work. I call it “load-bearing code”, and it’s rare to find a software product or system that doesn’t have any code that’s critical to its users in some way.

In my 3-day Code Craft training workshop, we go beyond Test-Driven Development to look at a couple of more advanced testing techniques that can help us make sure that code that really, really needs to work in all likelihood does.

It raises the question, how do we know which parts of our code are load-bearing, and therefore might warrant going that extra mile?

An obvious indicator is critical paths. If a feature or a usage scenario is a big deal for users and/or for the business, tracing which code lies on the execution path for it can lead us to code that may require higher assurance.

Some teams work with stakeholders to assess risk for usage scenarios, perhaps captured alongside examples that they use to drive the design (e.g., in .feature files), and then when these tests are run, use instrumentation (e.g., test coverage) to build a “heat map” of their code that graphically illustrates which code is cool – no big deal if this fails – and which code might be white hot – the consequences will be severe if it fails.

(It’s not as hard to build a tool like this as you might think, BTW.)
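To give a flavour of how little is involved, here's a minimal sketch. The scenario names, risk ratings (1 = cool, 5 = white hot) and file names are all made up. In real life the coverage data would come from your coverage tool, one run per scenario.

```python
# Hypothetical risk ratings agreed with stakeholders, per usage scenario
scenario_risk = {
    "checkout with card": 5,
    "browse catalogue": 2,
    "update avatar": 1,
}

# Hypothetical coverage data: which source files each scenario's tests executed
scenario_coverage = {
    "checkout with card": {"basket.py", "pricing.py", "payments.py"},
    "browse catalogue": {"catalogue.py", "pricing.py"},
    "update avatar": {"profile.py"},
}


def heat_map(risk, coverage):
    """Each file is as hot as the riskiest scenario that executes it."""
    heat = {}
    for scenario, files in coverage.items():
        for file in files:
            heat[file] = max(heat.get(file, 0), risk[scenario])
    return heat


# Print the hottest files first
for file, temperature in sorted(heat_map(scenario_risk, scenario_coverage).items(),
                                key=lambda entry: -entry[1]):
    print(f"{file:<15} {'#' * temperature}")
```

Taking the maximum is just one design choice. You could equally sum the risks, or weight them by how often each scenario happens.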

A less obvious indicator is dependencies. Code that’s widely reused, directly or indirectly, also presents a potentially higher risk. Static analysis tools like NDepend can calculate the “rank” of a method or a class or a package in the system (as in PageRank) to show where code is widely reused.
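For the curious, here's a hand-rolled sketch of the idea, using plain PageRank by power iteration over a made-up dependency graph. It's not how NDepend (or any other tool) actually implements its ranking. The module names, damping factor and iteration count are my assumptions.

```python
def page_rank(depends_on, damping=0.85, iterations=50):
    """Rank modules by how much the rest of the system leans on them.

    depends_on maps each module to the list of modules it uses, so
    "rank" flows from a client to the suppliers it depends on.
    """
    modules = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    n = len(modules)
    rank = {m: 1.0 / n for m in modules}

    for _ in range(iterations):
        new_rank = {m: (1.0 - damping) / n for m in modules}
        for module in modules:
            suppliers = depends_on.get(module, [])
            if suppliers:
                # Share this module's rank equally among what it depends on
                share = damping * rank[module] / len(suppliers)
                for supplier in suppliers:
                    new_rank[supplier] += share
            else:
                # No dependencies: spread its rank evenly over everything
                share = damping * rank[module] / n
                for other in modules:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical system: everything ends up leaning on "money"
dependencies = {
    "web": ["orders", "pricing"],
    "orders": ["pricing", "money"],
    "pricing": ["money"],
    "money": [],
}

ranks = page_rank(dependencies)
for module in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{module:<8} {ranks[module]:.3f}")
```

The module that's reused most, directly and indirectly, floats to the top, even though nothing in the code says "this one matters".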

Monitoring how often code’s executed in production can produce a similar, but dynamic, picture of which code’s used most often.

These are all measures of the potential impact of failure. But what about the likelihood of failure? A function may be on a critical path, and reused widely, but if it’s just adding a list of numbers together, it’s not very likely to fail.

Complex logic, on the other hand, presents many more ways of being wrong – the more complex, the greater that risk.

Code that’s load-bearing and complex should attract our attention.

And code that’s load-bearing, complex and changing often is white hot. That should be balanced by the strength of our testing. The hotter the code, the more exhaustively and the more frequently it might need testing.

Hopefully, with a testing specialist in the team, you will have a good repertoire of software verification techniques to match against the temperature of the code – guided inspection, property-based testing, DBC, decision tables, response matrices, state transition tables, model checking, maybe even proofs of correctness when it really needs to work.

But a good start is knowing where your hottest code actually is.

101 Uses of An Abstract Test #43: Contract Testing

As distributed systems have become more and more prevalent, I’ve seen how teams spend an increasing amount of time putting out fires caused by dependencies changing.

Team A goes to bed with a spiffy working doodad, and in the wee small hours, Team B does a release of their spiffy working thingummy that Team A just happens to rely on. Team A wakes up to a decidedly not-spiffy doodad that has mysteriously stopped working overnight.

Team B’s thingummy may well have passed all their tests before they deployed it, but their tests might not show when a change they’ve made isn’t backwards-compatible with how their clients are using it. For a system to be correct, the interactions between components need to be correct.

We can define the correctness of interactions between clients and suppliers using contracts that determine what’s expected of both parties.

The supplier promises to provide certain benefits to the client – the weather forecast for the next 7 days at their location, for example. But that promise only holds under specific circumstances – the client’s location has to be provided, and must be expressed in Decimal Degrees.

If the supplier changes the contract so that locations must now be provided in Degrees, Minutes and Seconds, it may well pass all their tests, but it breaks the client, who’s now getting error messages instead of weather forecasts.

Now, the client will likely have some integration tests where the end point is real. And those are the tests that enforce expectations about interactions with that end point.

What if we abstracted those tests so that the end point could be the real deal, or a stub or a mock? The object or function that’s responsible for the interaction could be supplied to each test via, say, a factory method that’s abstract in the test base class, and can be overridden in subclasses, enabling us to vary the set-up as we wish – real or pretend.

Then we can run the exact same tests with and without the real end point. If all the tests using pretend versions are passing, but the ones using the real thing suddenly start failing, that strongly suggests something’s changed at the other end. If the “unit” tests start failing too, then the problem is at our end.

This gives client developers a heads-up as soon as integration fires start. But the real payoff is when the team at the other end can run those tests themselves before they even think about deploying.
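As a rough sketch of the abstract test pattern described above, here's how it might look with Python's unittest. The class names, the forecast() signature and the stubbed data are all invented for illustration. The only real expectation being tested is the one from the weather example: give it a location in Decimal Degrees, get 7 days of forecast back.

```python
import unittest
from abc import ABC, abstractmethod


class ForecastContractTest(ABC):
    """The shared tests. Not a TestCase itself, so unittest won't run it directly."""

    @abstractmethod
    def create_forecaster(self):
        """Factory method: subclasses decide if the end point is real or pretend."""

    def test_seven_day_forecast_for_decimal_degrees(self):
        forecaster = self.create_forecaster()
        forecast = forecaster.forecast(latitude=51.5072, longitude=-0.1276)
        self.assertEqual(7, len(forecast))


class StubForecaster:
    """Pretend supplier that honours the contract without touching the network."""

    def forecast(self, latitude, longitude):
        return ["sunny"] * 7


class StubbedForecastTest(ForecastContractTest, unittest.TestCase):
    def create_forecaster(self):
        return StubForecaster()


# A RealForecastTest subclass would override create_forecaster() to return
# a client that calls the live end point (left out here, as it needs a network).
```

Run it with `python -m unittest` and only the concrete subclasses get picked up. If the stubbed version passes while the real one fails, look at the other end first.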

The AI-Ready Software Developer #4 – Continuous Testing

Now, where were we? Ah, yes.

So, we’re working in small steps with our LLM, solving one problem at a time, which makes it easier for the model to pay attention to important details (just like in real life).

We’re keeping our contexts small, and making them more specific by clarifying with examples to reduce the risk of misinterpretation (just like in real life).

And we’re cleanly separating the different concerns in our architecture to limit the “blast radius” when the model changes code, reducing the risk of boo-boos (just like in real life), and keeping diffs smaller. (More about that in a future post – for now, smaller diffs == gooderer).

When we apply all three of these practices together, it opens a door: we can test more often.

Those examples we used to clarify our requirements can become tests we can perform after the model has done that work to check that it did what we told it to.

We could perform these tests ourselves by running the software or by accessing the code directly at the command line in Read-Evaluate-Print loops (REPL). Or, if a UI is involved, we could run it and click the buttons ourselves.

I highly recommend seeing it work with your own eyes at least once. Trust no one, Agent Mulder!

But what about code that was working that the model has since changed? As the software grows, manually retreading dozens, hundreds, maybe thousands of tests to make sure we’re obeying Gorman’s First Law of Software Development:

“Thou shalt not break shit that was working”

– is going to take a lot of time. Eventually, the cost of our development process will become O(n²), where n is the number of tests, because every time we add a new one – one problem at a time, remember? – we have to repeat all the existing tests: 1 + 2 + … + n test runs in total.

Automation to the rescue! If we find ourselves performing the same test over and over, we can write code to perform it for us. Or we can get the LLM to write it for us (be careful here – triple-check every test the model writes! Been burned by that multiple times.)

And this is where clean separation of concerns turns into a superhero.

If the code that, say, calculates mortgage repayments is buried inside the module that generates the Repayments web page, and which also directly accesses an external web service to get interest rates, then you’ll have little choice but to test through the browser (or something pretending to be the browser).

But if there’s a separate MortgageCalculator module that does this work, and that module is decoupled from the code that fetches interest rates, a test can be automated directly against it that will run very reliably and very fast – milliseconds instead of seconds. Thousands of those kinds of “unit” tests could run in seconds instead of hours.
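Here's a minimal sketch of what that separation might look like. Only the MortgageCalculator name comes from the example above. The constructor, the method name and the idea of passing in a function that supplies the interest rate are my assumptions, and the sums are the standard repayment-loan formula.

```python
class MortgageCalculator:
    """Knows how to calculate repayments. Doesn't know where rates come from."""

    def __init__(self, get_annual_rate):
        # get_annual_rate is any callable returning the rate as a fraction
        # (e.g. 0.06 for 6%). Production code would pass in something that
        # calls the web service; tests pass in a canned value.
        self._get_annual_rate = get_annual_rate

    def monthly_repayment(self, principal, term_months):
        monthly_rate = self._get_annual_rate() / 12
        if monthly_rate == 0:
            return principal / term_months
        return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


def test_interest_free_loan_is_split_evenly():
    calculator = MortgageCalculator(lambda: 0.0)
    assert calculator.monthly_repayment(120_000, 120) == 1000.0


def test_one_month_loan_repays_principal_plus_a_month_of_interest():
    calculator = MortgageCalculator(lambda: 0.12)  # 1% a month
    assert abs(calculator.monthly_repayment(1000, 1) - 1010.0) < 1e-6
```

No browser, no web service, no waiting. Tests like these run in a blink, which is the whole point.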

Which means comprehensively retesting your software after every small step, giving you an instant heads-up if the LLM broke anything (AND IT WILL), becomes completely practical.

Once again, you won’t be surprised to learn that this is very good news whether we’re using “AI” or not. Many teams consider it essential.

Code Reviews as Exploratory Testing

Code reviews? Let me tell you about code reviews!

To me, a code review done by people is exploratory testing. We gather around a piece of code (e.g., a merge diff for a new feature). We go through the code and we ask ourselves “What do we think of this?”

Maybe we see a method or a function that has control flow nested 4 deep. Eurgh! Difficult to test and easy to break (such cyclomatic complexity, much wow). So we flag it up.

So far, so normal.

Once that code quality “bug” has been flagged up, I’m sure we both agree that it needs fixing. So we fix it. Job done? Now, here’s where you and I may part company.

It’s almost guaranteed that won’t be the last time that problem rears its head in our code. So, as well as fixing the complex conditional we found in our review, we also fix the process that allowed the problem to make it that far – and waste a bunch of time – in the first place.

When we find a logic error in our code by exploratory testing, we don’t just fix the bug. We write a regression test in case a future change breaks it again. (We do, right?)

And when we find a code quality bug, we shouldn’t just refactor that example. We should add a code quality check for it to our suite of code inspections – automated if at all possible – that will catch it as soon as it reappears somewhere else in the code.
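For example, a check for that nested-4-deep control flow could be sketched in a few lines using Python's built-in ast module. The function names, the list of node types that count as "nesting" and the limit of 3 are my choices, not a standard. Note that this naive version treats every elif as one level deeper, because that's how Python parses them.

```python
import ast

# Which statements count as a level of nesting (an assumption, tune to taste)
NESTING_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def max_nesting(node, depth=0):
    """Deepest level of nested control flow found beneath the given AST node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


def deeply_nested_functions(source, limit=3):
    """Names of functions whose control flow nests deeper than the limit."""
    tree = ast.parse(source)
    return [
        fn.name
        for fn in ast.walk(tree)
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef))
        and max_nesting(fn) > limit
    ]


sample = '''
def tidy(x):
    if x:
        return 1
    return 0

def tangled(rows):
    for row in rows:
        if row:
            for cell in row:
                if cell:
                    print(cell)
'''

print(deeply_nested_functions(sample))
```

Before rolling your own, it's worth checking whether your linter already has a rule for this.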

Now, you can take this too far, as with all things. Automating regression tests is D.R.Y. applied to our testing process. If we need to perform the same test over and over, automating it makes a lot of sense.

But D.R.Y. has caveats, and one of those caveats is The Rule of Three. On average, we wait until we’ve seen something repeated 3 times before we refactor. This increases the odds that:

a. The refactoring will pay off later (the more examples we see, the more likely there’ll be more in the future)

b. We have more examples to guide us towards the better design.

Both apply as much to duplicated effort in our process as they do to duplication in our code.

So we might want to keep a log of the problems our code reviews find, and when we see the same type of issue appear 3 times (or thereabouts), that might be our cue to look into building a check for it into our Continuous Inspection rule suite. Maybe our linter already has a rule we can use. Maybe we’ll need to write our own custom rule for it. (That’s a very undervalued skillset, BTW). Maybe we could train a small ANN to detect it. Maybe we’ll need to add it to the manual inspection checklist.
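The log itself needn't be anything fancy. A sketch, with made-up issue names and a made-up function name, might be as simple as:

```python
from collections import Counter

# Hypothetical log: one entry per issue flagged in code review
review_log = [
    "deep nesting",
    "long function",
    "deep nesting",
    "feature envy",
    "deep nesting",
]


def due_for_automation(findings, threshold=3):
    """Issue types we've now seen often enough to justify building a check."""
    counts = Counter(findings)
    return sorted(issue for issue, seen in counts.items() if seen >= threshold)


print(due_for_automation(review_log))
```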

A good yardstick might be that the same type of code quality issue doesn’t appear in merges (or attempted merges) more than 3 times.

And there’s more. There are teams I’ve worked with who not only add a check to their Continuous Inspection suite, but also ask “Why does this problem happen in the first place?” How do 500-line functions become 500-line functions? How do deeply-nested IFs become deeply-nested IFs? How do classes end up with 12 distinct responsibilities and 25 dependencies?

The answer, BTW, is that – more often than not – the way functions get 500 lines long, IFs get deeply nested and classes end up doing so many different things is because the people writing that code didn’t see it as a problem.

And that’s usually where I come in 🙂

Enterprise Refactoring Requires Enterprise Tests

A book that had quite an impact on me was Ubiquity: The Science of History by Mark Buchanan. It proposes that many catastrophic events are ultimately caused by the interconnectedness of things (banking system collapses, forest fires etc). The more interconnected the system, the more catastrophically effects can “ripple” out through those connections.

In software, we see these network effects as failures propagate through dependencies between components and systems. We also see how change can propagate for the same reasons. I’ve watched many dev organisations grapple with what could have been small changes to their software that ended up impacting every team because their applications, components and services were so tightly coupled.

At the level of source files (e.g., .java files or .py files), we can decouple modules by moving responsibilities to where the majority of their dependencies are. Things that get used together belong together, and things that get used together change together (and fail together).

Moving a feature from one Java class to another is easy as peas, even doing it with manual edits. Moving a feature from a Java web service to a Python web service maintained by a different team, or from a COBOL CICS system to a C# application, requires an order of magnitude more coordination. Which is why, when “Feature Envy” appears at that level – when a system or component is coupled to multiple features of a different system or component, indicating that behaviour may be in the wrong place – most organisations do nothing about it.

What we might call “enterprise refactoring” is a discipline that could benefit many organisations, though. And what distinguishes refactoring at any level of code organisation from just changing stuff willy-nilly is testing.

If I move a method from one Java class to another, I retest that code at a higher level for behaviours that involve both classes.

When we move a feature from one system or component to another, we again need to retest at a higher level, checking scenarios that involve both components or both systems – and these are typically business scenarios.

Many dev orgs lack tests at that level. They may see tests failing when a change breaks a component, but no tests fail when that change breaks the business. (This is why so many businesses can be blissfully unaware that, say, order fulfilment isn’t working. The POS system’s working. The warehouse system’s working. The shipping system’s working. But somewhere between them, the ball gets dropped.)

Indeed, too many organisations don’t actually know how their software’s being used in those wider contexts – business use cases, if you like. They understand what their system or component does, but have no real visibility of how it’s being used.

Businesses are systems, too. They have users and business use cases. These use cases are realised by internal processes, and they often involve multiple interacting software systems.

Refactoring those internal processes to localise the “ripple effect” of change is on a whole other level.

It’s the System, Stupid!

Since this “Age of A.I.” arrived in late 2022, something’s been nagging at me. As more and more data rolls in, we see an apparent paradox emerging where “A.I.” coding assistants are concerned.

Individual developers report productivity gains using these tools (though many also report significant frustrations with, for example, “hallucinations”).

And at the same time, data clearly shows that the more teams use them, the bigger the negative impact on team outcomes like delivery throughput and release stability.

How can both these things be true?

We have one very plausible candidate for a causal mechanism, and it’s an age-old story in our industry.

When programmers get a feeling that they’re getting things done faster, they’re often only considering the part where they write the code – particularly when that’s their part of the process.

What they’re not considering is the whole software development process, and especially downstream activities like testing, code review, merging, deployment and operations.

More code faster can mean bigger change sets – more to test (and more bugs to fix), more code to review (and more refactorings to get it through review), more changes to merge (and more conflicts to resolve), and so on.

“A.I.” code generation’s a local optimisation that can come at the expense of the development system as a whole, especially if that system is more batch-oriented, with design, coding, testing, review, merging and release operating like sequential phases in the delivery of a new feature. In such a system, more code faster means bigger bottlenecks later. So there’s no paradox at all: one causes the other.

When teams work in much smaller cycles – make one change, test it, review the code, refactor, commit that and maybe push it to the trunk – they may experience far fewer downstream bottlenecks, with or without “A.I.” coding assistance. Arguably, coding assistants might make little noticeable difference in such a workflow.

The DORA data strongly indicates that the teams with the shortest lead times and the highest release stability tend to work this way, with continuous testing, code review and merging as the code’s being written.

And all this got me to thinking, maybe we’re targeting machine learning and “A.I.” at the wrong problem. Instead of focusing on individual developer productivity with things like code generation, perhaps this technology would yield more fruit if it was focused on systemic issues and reducing bottlenecks.

Maybe, for example, instead of using ML models to generate code, could they be more productively applied to reviewing code? Could a “smart” linter reduce the need for after-the-fact code review?

Of course, many of us already enjoy the benefits of “smart” linters. We call it “pair programming” or “ensemble programming”. And, having used static code analysis tools that incorporated statistical models or neural networks, I found the results weren’t all that impressive. Hard to see such a tool significantly out-performing a classic linter + a second pair of experienced eyes (if such eyes are available to you, of course, and maybe that’s the use case).

Perhaps the real value might be found in widening our view. What if a model (or models) could be trained on data collected across the entire cycle, from product strategy through to operational telemetry, support and beyond?

Imagine a model that, given, say, a Figma UI wireframe, could predict how many support calls you’d be likely to get about it. Imagine a model that, given a source file, could predict its mean time to failure in production.

More generally, imagine a model that could, with reasonable accuracy, predict the downstream impact of upstream activities, so as SuperDuperAgenticAI spits out its slop, alarm bells start to go off about where this is likely to lead if it gets any further.

A pipe dream, you might think. But in actual fact, such predictive technologies exist in other disciplines like electronic engineering, where statistical and ML models are used to predict the reliability and probable lifetimes of printed circuit boards, for example.

There would be some major hurdles to overcome to apply similar techniques to software development, though, not least of which is the jungle of higgledy-piggledy data formats our many proprietary tools and platforms produce. Electronics has established data interchange standards. We, for the most part, do not – probably because that would require enough of us to agree on some stuff, and that isn’t really our strong suit.

But, if these challenges could be overcome, or worked around (e.g., with a translation/encoding layer), I’m pretty sure there are patterns hidden in our complex and multi-dimensional workflow data that maybe nobody’s spotted yet. I mean, we’ve barely scratched the surface in the last 70+ years.

In a very handwavy sense, though, I feel quite sure now that “A.I.” is being targeted at the wrong problem in software: with an exclusive focus on individual developer productivity, when the focus should be on the system as a whole.

In the meantime, we’re pretty sure at this point that things like continuous design, continuous testing, continuous code review and continuous integration do have a positive systemic impact, so focusing on that is probably the most productive I can be for the foreseeable future.


If your team would like training and mentoring in the technical practices that we know speed up delivery cycles, shorten lead times and improve product and system reliability, with or without “A.I.”, pay us a visit.

The A-Z of Code Craft – P is for Precondition

Teaching continuous testing and Test-Driven Development, I spend a lot of my time thinking about post-conditions. These are the expected outcomes of actions that we assert at the end of our tests.

In an online store, we might assert that the post-condition of buying an item is that the quantity purchased is deducted from that product’s available stock, so we don’t sell stuff we don’t have.

In a made-up Python-like language, we might assert something like:

product = Product(name="Acme Widget", price=9.99, stock=10)
item = OrderItem(product=product, quantity=2) 

item.buy() 

assert(product.stock == 8)

Reasonable enough. But a focus on post-conditions can have an interesting side-effect when we start to consider edge cases. What happens when there isn’t enough stock?

When we solve this problem with an outcome, we may choose something like:

product = Product(name="Acme Widget", price=9.99, stock=1)
item = OrderItem(product=product, quantity=2) 

assertRaises(InsufficientStockError, lambda: item.buy())

And we might pass this test by adding a guard condition to the buy() method:

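def buy(self):
    # Guard condition: refuse the purchase if there isn't enough stock
    if self.product.stock < self.quantity:
        raise InsufficientStockError("Sorry, only " + str(self.product.stock) + " available.")

    # Deduct the quantity purchased from the available stock
    self.product.stock -= self.quantity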

When we consider the design at this internal level, it all looks pretty sensible. Except… Well, is it?

It’s very easy – and very common – to lose sight of how the system should handle this scenario. That error, and its message? That’s part of the user’s experience. It’s UX design, buried in a guard condition in our core business logic.

Something higher up the call stack has to catch this error and then decide how the system should handle it. This is a conversation we should have had with our customer.

And I don’t know about you, but I get annoyed by software that lets me do things then says “Sorry, no can do!”, like it’s telling me off. BAD USER! Why did you select a quantity we don’t have?! (Like when ATMs offer you a choice of withdrawing £10, £20, £30, and when you select £30, they tell you only £20 notes are available. Grrr! BAD UX!!)

The key to a better user experience, and to simplifying our core business logic, is not to offer the user choices the system can’t fulfil. We need to shift our design focus from post-conditions to preconditions.

A precondition determines if an action can be performed. In the context of our buy() action, the precondition might be that we have sufficient stock:

item.quantity <= item.product.stock

If the precondition isn’t satisfied, we shouldn’t give the user the choice to buy. In our UI design, we could disable the “Buy” button. Or offer an alternative action, like “Pre-order”.

If, at the system level, we don’t allow the Buy action, there’s no need to guard against that scenario in our core logic, which significantly simplifies the code.
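Here's one way that shift might look, as a sketch. The can_buy() and available_actions() names are mine, and a list of strings is obviously a stand-in for a real UI. The point is that the precondition becomes a question the client can ask before it offers the choice.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    stock: int


@dataclass
class OrderItem:
    product: Product
    quantity: int

    def can_buy(self):
        """Precondition for buy(): there's enough stock to fulfil the order."""
        return self.quantity <= self.product.stock

    def buy(self):
        # No guard condition: callers only offer Buy when can_buy() is true
        self.product.stock -= self.quantity


def available_actions(item):
    """What the UI would offer the user for this item."""
    return ["Buy"] if item.can_buy() else ["Pre-order"]


widget = Product(name="Acme Widget", price=9.99, stock=1)
print(available_actions(OrderItem(product=widget, quantity=2)))
```

With only one widget in stock and two requested, the user never gets to press “Buy” in the first place.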

Of course, if buy() wasn’t an internal method, but instead was, say, part of a web service, we’d need to check because we don’t necessarily control the client code.

But if the client code is also our code, and we control it, then calling buy() when there’s not enough stock is a programming error, not a user error.

The remedy for that isn’t an “Oops, something went wrong!” message. The remedy for programming errors is to take more care over programming. I hear testing is quite the thing these days.


If you’re serious about building your team’s capability to rapidly, reliably and sustainably evolve software to meet rapidly changing business needs, visit codemanship.co.uk for details of high-quality hands-on training and mentoring for software developers.

It’s The Bottlenecks, Stupid!

It’s the top question I get asked. “What’s your AI strategy?” people demand to know.

And as I steer my taxi cab into their driveway, I tell them.

My AI strategy was to spend 2 years investigating various claims about just how “night and day” better the latest model was compared to that last one that had disappointed me, only to discover that the shiny new one was almost as disappointing. And that “almost” has been getting smaller and smaller with every new generation.

I’m satisfied now to just sit back with a nice glass of wine and watch from the sidelines. I don’t feel any urgency to “get with the latest” large language doodad or agentic watsimyjig, because I see teams using them every week, and the things that slowed them down 2+ years ago are still slowing them down.

If anything, “AI” code generation appears to exacerbate the downstream bottlenecks like merge conflicts (bigger changesets, see), testing (bigger changesets, see), code reviews (bigger changesets, see), technical debt (bigger and sloppier changesets, see), waiting for management & product decisions (bigger changesets, see), and waiting for customer/user feedback (bigger changesets, see).

It’s like trying to fix the bottleneck at the ferry port by raising the speed limit on the roads that feed it.

“Coding” hasn’t been the bottleneck in software development since people were manually rewriting the connections between vacuum tubes.

If your development process is full of blocking practices like command-and-control management or after-the-fact testing or requirements and design handovers or Pull Request code reviews, then faster code generation will just make the bottlenecks worse and the lead times even longer. And the DORA data seems to back this up.

And I’ve been seeing this in teams throughout. I can tell if they’ve been using LLMs to generate or modify code by looking at the code. (Oh boy, can I tell!)

I can’t tell by looking at their delivery metrics.

As always, Big Tech has solved the wrong problem.

I just so happen to offer training and mentoring for dev teams in *non-blocking* software delivery practices that were shrinking lead times, improving reliability and sustaining the pace of innovation before “AI” coding assistants were glints in their inventors’ eyes.

And yes, the DORA data backs *that* up 🙂

Visit www.codemanship.co.uk for details.

Fast-Running Tests Are Key To Agility. But How Fast Is ‘Fast’?

I’ve written before about how vital it is to be able to re-test our software quickly, so we can ensure it’s always shippable after every change we make to the code.

Achieving a fast-running test suite requires us to engineer our tests so that the vast majority run as quickly as possible, and that means most of our tests don’t involve any external dependencies like databases or web services.

If we visualise our test suites as a pyramid, the base of the pyramid – the bulk of the tests – should be these in-memory tests (let’s call them ‘unit tests’ for the sake of argument). The tip of the pyramid – the slowest running tests – would typically be end-to-end or system tests.

But one person’s “fast” is often another person’s “slow”. It’s kind of ambiguous as to what I and others mean when we say “your tests should run fast”. For a team relying on end-to-end tests that can take many seconds to run, 100ms sounds really fast. For a team relying on unit tests that take 1 or 2 milliseconds, 100ms sounds really slow.

A Twitter follower asked me how long a suite of 5,000 tests should take to run. If the test suite’s organised into an ideal pyramid, then – and, of course, these are very rough numbers based on my own experience – it might look something like this:

  • The top of the pyramid would be end-to-end tests. Let’s say each of those takes 1 second. You should aim to have about 1% of your tests be end-to-end tests. So, 50 tests = 50s.
  • The middle of the pyramid would be integration and contract tests that check interactions with external dependencies. Maybe they each take about 100ms to run. You should aim for these to make up less than 10% of your tests, so about 500 tests = 50s.
  • The base of the pyramid should be the remaining 4450 unit tests, each running in roughly 1-10ms. Let’s take an average of 5ms. 4450 unit tests = 22s.

You’d be in a good place if the entire suite could run in about 2 minutes.

Of course, these are ideal numbers. But it’s the ballpark we’re interested in. System tests run in seconds. Integration tests run in 100s of milliseconds. Unit tests run in milliseconds.

It’s also worth bearing in mind you wouldn’t need to run all of the tests all of the time. Integration and contract tests, for example, only need to be run when you’ve changed integration code. If that’s 10% of the code, then we might need to run them 10% of the time. End-to-end tests might be run even less frequently (e.g., in CI).

Now, what if your pyramid was actually a diamond shape, with the bulk of your tests hitting external processes? Then your test suite would take about 8 minutes to run, and you’d have to run those integration tests 90% of the time. Most teams would find that a serious bottleneck.

And if your pyramid was upside-down, with most tests being end-to-end, then you’re looking at 75 minutes for each test run, 90% of the time. I’ve seen those kinds of test execution times literally kill businesses with their inability to evolve their products and systems.
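If you'd like to play with the numbers yourself, here's the arithmetic as a sketch. The per-test timings are the rough ones from the pyramid above (1s, 100ms, 5ms). The diamond and upside-down splits are my assumption: I've simply swapped the 4,450 tests into the middle and the top, respectively.

```python
def suite_seconds(end_to_end, integration, unit):
    """Rough run time: 1s per end-to-end, 100ms per integration, 5ms per unit test."""
    return end_to_end * 1.0 + integration * 0.1 + unit * 0.005


# (end-to-end, integration, unit) test counts for a 5,000-test suite
shapes = {
    "pyramid": (50, 500, 4450),
    "diamond": (50, 4450, 500),
    "upside-down": (4450, 500, 50),
}

for shape, counts in shapes.items():
    print(f"{shape:<12} {suite_seconds(*counts) / 60:5.1f} minutes")
```

That gives roughly 2, 8 and 75 minutes. Plug in your own counts and timings and see which shape you've actually got.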