Engineering Leaders: Your AI Adoption Doesn’t Start With AI

In the past few months, I’ve been hearing from more and more teams that the use of AI coding tools is being strongly encouraged in their organisations.

I’ve also been hearing that this mandate often comes with high expectations about the productivity gains leaders expect this technology to bring. But this narrative is rapidly giving way to frustration when these gains fail to materialise.

The best data we have shows that a minority of development teams are reporting modest gains – in the order of 5%-15% – in outcomes like delivery lead times and throughput. The rest appear to be experiencing negative impacts, with lead times growing and the stability of releases getting worse.

The 2025 DevOps Research & Assessment State of AI-assisted Software Development report makes it clear that the teams reporting gains were already high-performing or elite by DORA’s classification, releasing frequently, with short lead times and with far fewer fires in production to put out.

As the report puts it, this is not about tools or technology – and certainly not about AI. It’s about the engineering capability of the team and the surrounding organisation.

It’s about the system.

Teams who design, test, review, refactor, merge and release in bigger batches are overwhelmed by what DORA describes as “downstream chaos” when AI code generation makes those batches even bigger. Queues and delays get longer, and more problems leak into releases.

Teams who design, test, review, refactor, merge and release continuously in small batches tend to get a boost from AI.

In this respect, the team’s ranking within those DORA performance classifications is a reasonably good predictor of the impact on outcomes when AI coding assistants are introduced.

The DORA website helpfully has a “quick check” diagnostic questionnaire that can give you a sense of where your team sits in their performance bands.

(Answer as accurately as you can. Perception and aspiration aren’t capability.)

The overall result is usefully colour-coded. Red is bad, blue is good. Average is Meh. Yep, Meh is a colour.

If your team’s overall performance is in the purple or red, AI code generation’s likely to make things worse.

If your team’s performance is comfortably in the blue, they may well get a little boost. (You can abandon any hopes of 2x, 5x or 10x productivity gains. At the level of team outcomes, that’s pure fiction.)

The upshot of all this is that before you even think about attaching a code-generating firehose to your development process, you need to make sure the team’s already performing at a blue level.

If they’re not, then they’ll need to shrink their batch sizes – take smaller steps, basically – and accelerate their design, test, review, refactor and merge feedback loops.

Before you adopt AI, you need to be AI-ready.

Many teams go in the opposite direction, tackling whole features in a single step – specifying everything, letting the AI generate all the code, testing it after-the-fact, reviewing the code in larger change-sets (“LGTM”), doing large-scale refactorings using AI, and integrating the whole shebang in one big bucketful of changes.

Heavy AI users like Microsoft and Amazon Web Services have kindly been giving us a large-scale demonstration of where that leads – more bugs, more outages, and significant reputational damage.

A smaller percentage of teams are learning that what worked well before AI works even better with it. Micro-iterative practices like Test-Driven Development, Continuous Integration, Continuous Inspection, and real refactoring (one small change at a time) are not just compatible with AI-assisted development, they’re essential for avoiding the “downstream chaos” DORA finds in the purple-to-red teams.

And while many focus on the automation aspects of Continuous Delivery – and a lot of automation is required to accelerate the feedback loops – by far the biggest barrier to pushing teams into the blue is skills.

Yes. SKILLS.

Skills that most developers, regardless of their level of experience, don’t have. The vast majority of developers have never even seen practices like TDD, refactoring and CI being performed for real.

That’s partly because real practitioners are pretty rare, so they’re unlikely to bump into one. But much of it is down to these practices’ famously steep learning curves. TDD, for example, takes months of regular practice to be able to use it on real production systems.

And, as someone who’s been practising TDD and teaching it for more than 25 years, I know it requires ongoing mindful practice to maintain the habits that make it work. Use it or lose it!

An experienced guide can be incredibly valuable in that journey. It’s unrealistic to expect developers new to these practices to figure it all out for themselves.

Maybe you’re lucky enough to have some of the 1% of software developers – yes, it really is that few – who can actually do this stuff for real. Or even one of the 0.1% who has had a lot of experience helping developers learn them. (Just because they can do it, it doesn’t necessarily follow that they can teach it.)

This is why companies like mine exist. With high-quality training and mentoring from someone who not only has many thousands of hours of practice, but also thousands of hours of experience teaching these skills, the journey can be rapidly accelerated.

I made all the mistakes so that you don’t have to.

And now for the good news: when you build this development capability, release cycles and lead times speed up – while reliability actually improves – whether you’re using AI or not.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Ready, Fire, Aim!

I teach Test-Driven Development. You may have heard.

And as a teacher of TDD for some quarter of a century now, you can probably imagine that I’ve heard every reason for not doing TDD under the Sun. (And some more reasons under the Moon.)

“It won’t work with our tech stack” is one of the most common, and one of the most easily addressed. I’ve done TDD – and seen it done – on all of the tech stacks, at all levels of abstraction from 4GLs down through assembly language to the hardware design itself. If you can invoke it and get an output, you can automatically test it. And if you can automatically test it, you can write that test first.

(Typically, what they really mean is that the architecture of the framework(s) they’re using doesn’t make unit testing easy. That’s about separation of concerns, though, and usually work-aroundable.)

The second most common reason I hear is perhaps the more puzzling: “But how can I write tests first if I don’t know what the code’s supposed to do?”

The implication here is that developers are writing solution code without a clear idea of what they expect it to do – that they’re retrofitting intent to implementations.

I find that hard to imagine. When I write code, I “hear the tune” in my head, so to speak. The intended meaning is clear to me. When I run it, my understanding might turn out to be wrong. But there is an expectation of what the code will do: I think it’s going to do X.

My best guess is that we all kind of sort of have those inner expectations when we write code. The code has meaning to us, even if we turn out to have understood it wrong when we run it.

So I could perhaps rephrase “How can I write tests first if I don’t know what the code’s supposed to do?” to articulate what might actually be happening:

“How do I express what I want the code to do before I’ve seen that code?”

Take this example of code that calculates the total of items in a shopping basket:

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

When I write this code, in my head – often subconsciously – I have expectations about what it’s going to do. I start by declaring a sum of zero, because an empty basket will have a total of zero.

Then, for every item in the basket, I add that item’s price multiplied by its quantity to the sum.

So, in my head, there’s an expectation that if the basket had one item with a quantity of one, the total would equal just the price of that item.

If that item had a quantity of two, then the total would be the price multiplied by two.

If there were two items, the total would be the sum of price times quantity of both items.

And so on.

You’ll notice that my thinking isn’t very abstract. I’m thinking more with examples than with symbols.

  • No items.
  • One item with quantity of one.
  • One item with quantity of two.
  • Two items.

If you asked me to write unit tests for the total function, these examples might form the basis of them.

A test-driven approach just flips the script. I start by listing examples of what I expect the function to do, and then – one example at a time – I write a failing test, write the simplest code to pass the test, and then refactor if I need to before moving on to the next example.

    def test_total_of_empty_basket(self):
        items = []
        basket = Basket(items)

        self.assertEqual(0.0, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        return 0.0

What I’m doing – and this is part of the art of Test-Driven Development – is externalising the subconscious expectations I would no doubt have as I write the total function’s implementation.

Importantly, I’m not doing it in the abstract – “the total of the basket is the sum of price times quantity for all of its items”.

I’m using concrete examples, like the total of an empty basket, or the total of a single item of quantity one.

“But, Jason, surely it’s six of one and half-a-dozen of the other whether we write the tests first or write the implementation first. Why does it matter?”

The psychology of it’s very interesting. You may have heard life coaches and business gurus tell their audience to visualise their goal – picture themselves in their perfect home, or sipping champagne on their yacht, or making that acceptance speech, or destabilising western democracy. It’s good to have goals.

When people set out with a clear goal, we’re much more likely to achieve it. It’s a self-fulfilling prophecy.

We make outcomes visible and concrete by adding key details – how many bedrooms does your perfect home have? How big is the yacht? Which Oscar did you win? How little regulation will be applied to your business dealings?

What should the total of a basket with no items be? What should the total of a basket with a single item with price 9.99 and quantity 1 be?

    def test_total_of_single_item(self):
        items = [
            Item(9.99, 1),
        ]
        basket = Basket(items)

        self.assertEqual(9.99, basket.total())

We precisely describe the “what” – the desired properties of the outcome – and work our way backwards directly to the “how”: What would be the simplest way of achieving that outcome?

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            return self.items[0].price
        return 0.0

Then we move on to the next outcome – the next example:

    def test_total_of_item_with_quantity_of_2(self):
        items = [
            Item(9.99, 2)
        ]
        basket = Basket(items)

        self.assertEqual(19.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            item = self.items[0]
            return item.price * item.quantity
        return 0.0

And then our final example:

    def test_total_of_two_items(self):
        items = [
            Item(9.99, 1),
            Item(5.99, 1)
        ]
        basket = Basket(items)

        self.assertEqual(15.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

If we enforce that items must have a price >= 0.0 and an integer quantity > 0, this code should cover any list of items, including an empty list, with any price and any quantity.
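One place to enforce those invariants is the item itself. This is a sketch under assumptions – the walkthrough never shows `Item`’s implementation, so the constructor and its checks here are mine, not the article’s:

```python
class Item:
    def __init__(self, price, quantity):
        # Enforce the invariants the argument above relies on:
        # a price >= 0.0 and an integer quantity > 0.
        if price < 0.0:
            raise ValueError("price must be >= 0.0")
        if not isinstance(quantity, int) or quantity <= 0:
            raise ValueError("quantity must be an integer > 0")
        self.price = price
        self.quantity = quantity
```

With inputs constrained like this, `total` doesn’t need its own defensive checks for negative prices or zero quantities.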

And our unit tests cover every outcome. If I were to break this code so that, say, an empty basket causes an error to be thrown, one of these tests would fail. I’d know straight away that I’d broken it.

This is another self-fulfilling prophecy of starting with the outcome and working directly backwards to the simplest way of achieving it – we end up with the code we need, and only the code we need, and we end up with tests that give us high assurance after every change that those outcomes are still being satisfied.

Which means that if I were to refactor the design of the total function:

    def total(self):
        return sum(
                map(lambda item: item.subtotal(), self.items))

I can do that with high confidence.
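That refactoring assumes each item can compute its own subtotal. A minimal sketch of that helper – the `subtotal` method is implied by the refactored `total`, but isn’t shown in the walkthrough, so treat this as an assumption:

```python
class Item:
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity

    def subtotal(self):
        # The same price * quantity previously computed inline in Basket.total
        return self.price * self.quantity
```

Moving that calculation onto `Item` is itself a small separation-of-concerns win: the basket no longer needs to know how a line item is priced.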

If I write the code and then write tests for it, several things tend to happen:

  • I may end up with code I didn’t actually need, and miss code I did need
  • I may well miss important cases, because unit tests? Such a chore when the work’s already done! I just wanna ship it!
  • It’s not safe to refactor the new code without those tests, so I have to leave that until the end, and – well, yeah. Refactoring? Such a chore! etc etc etc.
  • The tests I choose – the “what” – are now being driven by my design – the “how”. I’m asking “What test do I need to cover that branch?” and not “What branch do I need to pass that test?”

And finally, there’s the issue of design methodology. Effective software design methodologies are usually usage-driven. We don’t start by asking “What does this feature do?” We start by asking “How will this feature be used?”

What the feature does is a consequence of how it will be used. We don’t build stuff and then start looking for use cases for it. Well, I don’t, anyway.

In a test-driven approach, my tests are the first users of the total function. That’s what my tests are about – user outcomes. I’m thinking about the design from the user’s – the external – perspective and driving the design of my code from the outside in.

I’m not thinking “How am I going to test this total function?” I’m thinking “How will the user know the total cost of the basket?” and my tests reveal the need for a total function. I use it in the test, and that tells me I need it.

“Test-driven”. In case you were wondering what that meant.

When we design code from the user’s perspective, we’re far more likely to end up with useful code. And when we design code with tests playing the role of the user, we’re far more likely to end up with code that works.

One final question: if I find myself asking “What is this function supposed to do?”, is that a cue for me to start writing code in the hope that somebody will find a use for it?

Or is that my cue to go and speak to someone who understands the user’s needs?

Why Does Test-Driven Development Work So Well In “AI”-assisted Programming?

In my series on The AI-Ready Software Developer, I propose a set of principles for getting better results using LLM-based coding assistants like Claude Code and Cursor.

Users of these tools report how often and how easily they go off the rails, producing code that doesn’t do what we want and frequently breaking code that was working. As the code grows, these risks grow with them. On large code bases, they can really struggle.

From experiment and from real-world use, I’ve seen a number of things help reduce those risks and keep the “AI” on the rails.

  • Working in smaller steps
  • Testing after every step
  • Reviewing code after every step
  • Refactoring code as soon as problems appear
  • Clarifying prompts with examples

Smaller Steps

Human programmers have a limited capacity for cognitive load. There’s only so much we can comfortably wrap our heads around with any real focus, and when we overload ourselves, mistakes become much more likely. When we’re trying to spin many plates, the most likely result is broken plates.

LLMs have a similarly-limited capacity for context. While vendors advertise very impressive maximum context sizes of hundreds of thousands of tokens, research – and experience – shows that they have effective context limits that are orders of magnitude smaller.

The more things we ask models to pay attention to, the less able they are to pay attention to any of them. Accuracy drops off a cliff once the context goes beyond these limits.

After thousands of hours working with “AI” coding assistants, I’ve found I get the best results – the fewest broken plates – when I ask the model to solve one problem at a time.

Continuous Testing

If I make one change to the code and test it straight away, and a test fails, I don’t need to be a debugging genius to figure out which change broke the code. It’s either a quick fix, or a very cheap undo.

If I make ten changes and then test, debugging is potentially going to take significantly longer. And if I have to revert to the last known working version, that’s 10x the work and the time lost.

An LLM is more likely to generate breaking changes than a skilled programmer, so frequent testing is even more essential to keep us close to working code.

And if the model’s first change breaks the code, that broken code is now in its context and it – and I – don’t know it’s broken yet. So the model is predicting further code changes on top of a polluted context.

Many of us have been finding that a lot less rework is required when we test after every small step rather than saving up testing for the end of a batch of work.

There’s an implication here, though. If we’re testing and re-testing continuously, testing needs to be very fast.

Continuous Inspection

Left to their own devices, LLMs are very good at generating code they’re pretty bad at modifying later.

Some folks rely on rules and guardrails about code quality which are added to the context with every code-generating interaction with the model. This falls foul of the effective context limits of even the hyperscale LLMs. The model may “obey” – remember, they don’t in reality, they match and predict – some of these rules, but anyone who’s spent more than a few minutes attempting this approach will know that they rarely consistently obey all of them.

And filling up the context with rules runs the risk of “distracting” the LLM from the task at hand.

A more effective approach is to keep the context specific to the task – the problem to be solved – and then, when we’ve got something that works, we can turn our attention to maintainability.

After I’ve seen all my tests pass, I then do a code review, checking everything in the diff between the last working version and the latest. Because these diffs are small – one problem at a time – these code reviews are short and very focused, catching “code smells” as soon as they appear.

The longer I let the problems build up, the more the model ends up wading through its own “slop”, making every new change riskier and riskier.

I pay attention to pretty much the same things I would if I was writing all the code myself:

  • Clarity (LLMs really benefit from this, because… language model, duh!)
  • Complexity – the model needs the code likely to be affected in its context. More code, bigger context. Also, the more complex it is, the more likely it is to end up outside of the model’s training data distribution. Monkey no see, monkey can’t do.
  • Duplication – oh boy, do LLMs love duplicating code and concepts! Again, this is a context size issue. If I duplicate the same logic 5x, and need to make a change to the common logic, that’s 5x the code and 5x the tokens. But also, duplication often signposts useful abstractions and a more modular design. Talking of which…
  • Separation of Concerns – this is a big one. If I ask Claude Code to make a change to a 1,000-line class with 25 direct dependencies, that’s a lot of context, and we’re way outside the distribution. Many people have reported how their coding assistant craps out on code that lacks separation of concerns. I find I really have to keep on top of it. Modules should have one reason to change, and be loosely-coupled to other parts of the system.

On top of these, there are all kinds of low-level issues – security vulnerabilities, unused imports, dead code etc. – that I find I need to look for. Static analysis can help me check diffs for a whole range of issues that would otherwise be easy to miss – by me, or by an LLM doing the code review. I’m seeing a lot of developers upping their game with linting as they use “AI” more in their work.

Continuous Refactoring

Of course, finding code quality issues is purely academic if we don’t actually fix them. And, for the reasons I’ve already laid out – we want to give the model the smoothest surface to travel on – I fix them immediately.

And I don’t fix all the problems at once. I fix one problem at a time, again for reasons already stated.

And after I fix each problem, I run the tests again, in case the fix broke anything.

This process of fixing one “code smell” at a time, testing throughout, is called refactoring. You may well have heard of it. You may even think you’re doing it. There’s a very high probability that you’re not.

Clarifying With Examples

Here’s an experiment you can try for yourself. Prepare two prompts for a small code project. In one prompt, try to describe what you want as precisely as possible in plain language, without giving any examples.

The total of items in the basket is the sum of the item subtotals, which are the item price multiplied by the item quantity

In the second version, give the exact same requirements, but using examples.

The total of items in a shopping basket is the sum of item subtotals:

item #1: price = 9.99, quantity = 1

item #2: price = 11.99, quantity = 2

shopping basket total = (9.99 * 1) + (11.99 * 2) = 33.97

See what kind of results you get with both approaches. How often does the model misinterpret precisely-described requirements vs. requirements accompanied by examples?

It’s worth knowing that code-generating LLMs are typically trained on code samples that are paired with examples like this. When we include examples, we’re giving the model more to match on, limiting the search space to examples that do what we want.

Examples help prevent LLMs grabbing the wrong end of the prompt, and many users have found them to greatly improve accuracy in generated code.

Harking back to the need for very fast tests, these examples make an ideal basis for fast-running automated “unit” tests (where “units” = units of behaviour). It would make good sense to ask our coding assistant to generate them for us, because we’re going to be needing them soon enough.
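To make that concrete, here’s the worked basket example above expressed as a fast-running unit test. This is a sketch assuming the `Basket` class from the earlier TDD walkthrough; `Item` is a minimal stand-in, since its implementation was never shown:

```python
import unittest

class Item:
    # Minimal stand-in: just the two fields the basket needs.
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        # Same logic as the final implementation in the TDD walkthrough.
        result = 0.0
        for item in self.items:
            result += item.price * item.quantity
        return result

class BasketExampleTest(unittest.TestCase):
    def test_worked_example(self):
        # item #1: price = 9.99, quantity = 1
        # item #2: price = 11.99, quantity = 2
        basket = Basket([Item(9.99, 1), Item(11.99, 2)])
        # total = (9.99 * 1) + (11.99 * 2) = 33.97
        self.assertAlmostEqual(33.97, basket.total())
```

The example in the prompt and the assertion in the test are the same information in two forms – which is exactly why examples in prompts convert so cheaply into regression tests.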

Putting It All Together

If we were to imagine a workflow that incorporates all of these principles – small steps, continuous testing, continuous inspection, continuous refactoring, clarifying with examples – it would look very familiar to the small percentage of developers who practise Test-Driven Development.

TDD has been around for several decades, and builds on practices that have been around even longer. It’s a tried-and-tested approach that’s been enabling the rapid, reliable and sustainable evolution of working software for those in the know. If you look inside the “elite-performing” teams in the DORA data – the ones delivering the most reliable software with the shortest lead times and the lowest cost of change – you’ll find they’re pretty much all doing TDD, or something very like TDD.

TDD specifies what we want software to do using examples, in the form of tests. (Hence, “test-driven”).

It works in micro-iterations where we write a test that fails because it requires something the software doesn’t do yet. Then we write the simplest code – the quickest thing we can think of – to get the tests passing. When all the tests are passing, we review the changes we’ve made, and if necessary refactor the code to fix any quality problems. Once we’re satisfied that the code is good enough – both working and easy to change – we move on to the next failing test case. And rinse and repeat until our feature or our change is complete.

TDD practitioners work one feature at a time, one usage scenario at a time, one outcome at a time and one example at a time, and one refactoring at a time. Basically, we solve one problem at a time.

And we’re continuously running our tests at every step to ensure the code is always working. While automated tests are a side-effect of driving design using tests, they’re a damned useful one! And because we’re only writing code that’s needed to pass tests, all of our code will end up being tested. It’s a self-fulfilling prophecy.

Embedded in that micro-cycle, many practitioners also use version control to ensure they’re making progress in safe, easily-reverted steps, progressing from one working version of the code to the next.

Some of us have discovered the benefits of a “commit on green, revert on red” approach to version control. If all the tests pass, we commit the changes. If any tests fail, we do a hard reset back to the previous working commit. This means that broken versions of the code don’t end up in the context for the next interaction. (Remember that LLMs can’t distinguish between working code and broken code – it’s all just context.)
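As a sketch of what “commit on green, revert on red” might look like wired into a script: the `pytest` and `git` commands here are illustrative assumptions about your setup, and the test runner is injectable so the decision logic can be exercised without a real repo.

```python
import subprocess

def commit_on_green(run=subprocess.run, message="green: tests passing"):
    """Run the test suite; commit if green, hard-reset if red.

    Illustrative sketch: assumes a pytest suite and a git repository.
    `run` is injectable so the logic can be checked without either.
    """
    result = run(["pytest", "-q"])
    if result.returncode == 0:
        # Green: commit the known-working state.
        run(["git", "add", "-A"])
        run(["git", "commit", "-m", message])
        return "committed"
    # Red: throw the broken changes away so they never end up
    # in the context for the next interaction.
    run(["git", "reset", "--hard", "HEAD"])
    return "reverted"
```

The hard reset is the point: a failing step costs one cheap undo rather than a debugging session on top of polluted context.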

The beauty of TDD is that the benefits can be yours whether you’re using “AI” or not. Which is why I now teach it both ways.

The key to being effective with “AI” coding assistants is being effective without them.

Shameless Plug

Test-Driven Development is not a skill that you can just switch on, whether you’re doing it with “AI” or without. It takes a lot of practice to get the hang of it, and especially to build the discipline – the habits – of TDD.

An alarming number of TDD tutorials aren’t actually teaching TDD. (And the more people learn from them, the more bad tutorials we’ll no doubt see.)

If your team wants training in Test-Driven Development, including how to do it effectively using tools like Claude Code and Cursor, my 2-day TDD training workshop is half-price if you confirm your booking by January 31st.

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

In this series, I’ve explored the principles and practices that teams seeing modest improvements in software development outcomes have been applying.

More than four years after the first “AI” coding assistant, GitHub Copilot, appeared, the evidence is clear. Claims of teams achieving 2x, 5x, even 10x productivity gains simply don’t stand up to scrutiny. No shortage of anecdotal evidence, but not a shred of hard data. It seems when we measure it, the gains mysteriously disappear.

The real range, when it’s measured in terms of team outcomes like delivery lead time and release stability, is roughly 0.8x – 1.2x, with negative effects being substantially more common than positives.

And we know why. Faster cars != faster traffic. Gains in code generation, according to the latest DORA State of AI-Assisted Software Development report, are lost to “downstream chaos” for the majority of teams.

Coding never was the bottleneck in software development, and optimising a non-bottleneck in a system with real bottlenecks just makes those bottlenecks worse.

Far from boosting team productivity, for the majority of “AI” users, it’s actually slowing them down, while also negatively impacting product or system reliability and maintainability. They’re producing worse software, later.

Most of those teams won’t be aware that it’s happening, of course. They attached a code-generating firehose to their development plumbing, and while the business is asking why they’re not getting the power shower they were promised, most teams are measuring the water pressure coming out of the hose (lines of code, commits, Pull Requests) and not out of the shower (business outcomes), because those numbers look far more impressive.

The teams who are seeing improvements in lead times of 5%, 10%, 15%, without sacrificing reliability and without increasing the cost of change, are doing it the way they were always doing it:

  • Working in small batches, solving one problem at a time
  • Iterating rapidly, with continuous testing, code review, refactoring and integration
  • Architecting highly modular designs that localise the “blast radius” of changes
  • Organising around end-to-end outcomes instead of around role or technology specialisms
  • Working with high autonomy, making timely decisions on the ground instead of sending them up the chain of command

When I observe teams that fall into the “high-performing” and “elite” categories of the DORA capability classifications using tools like Claude Code and Cursor, I see feedback loops being tightened. Batch sizes get even smaller, quality gates get even narrower, iterations get even faster. They keep “AI” on a very tight leash, and that by itself could well account for the improvements in outcomes.

Meanwhile, the majority of teams are doing the opposite. They’re trying to specify large amounts of work in detail up-front. They’re leaving “AI agents” to chew through long tasks that have wide impact, generating or modifying hundreds or even thousands of lines of code while developers go to the proverbial pub.

And, of course, they test and inspect too late, applying too little rigour – “Looks good to me.” They put far too much trust in the technology, relying on “rules” and “guardrails” set out in Markdown files that we know LLMs will misinterpret and ignore randomly, barely keeping one hand on the wheel.

As far as I’ve seen, no team actually winning with the technology works like that. They’re keeping both hands firmly on the wheel. They’re doing the driving. As AI luminary Andrej Karpathy put it, “agentic” solutions built on top of LLMs just don’t work reliably enough today to leave them to get on with it.

It may be many years before they do. Statistical mechanics predicts it could well be never, with the order-of-magnitude improvement in accuracy needed to make them reliable enough (wrong 2% of the time instead of 20%) calculated to require 10²⁰ times the compute to train. To do that on similar timescales to the hyperscale models of today would require Dyson Spheres (plural) to power it.

Any autonomous software developer – human or machine – requires Actual Intelligence: the ability to reason, to learn, to plan and to understand. There’s no reason to believe that any technology built using deep learning alone will ever be capable of those things, regardless of how plausibly they can mimic them, and no matter how big we scale them. LLMs are almost certainly a dead end for AGI.

For this reason I’ve resisted speculating about how good the technology might become in the future, even though the entire value proposition we see coming out of the frontier labs continues to be about future capabilities. The gold is always over the next hill, it seems.

Instead, I’ve focused my experiments and my learning on present-day reality. And the present-day reality that we’ll likely have to live with for a long time is that LLMs are unreliable narrators. End of. Any approach that doesn’t embrace this fact is doomed to fail.

That’s not to say, though, that there aren’t things we can do to reduce the “hallucinations” and confabulations, and therefore the downstream chaos.

LLMs perform well – are less unreliable – when we present them with problems that are well-represented in their training data. The errors they make are usually a product of going outside of their data distribution, presenting them with inputs that are too complex, too novel or too niche.

Ask them for one thing, in a common problem domain, and chances are much higher that they’ll get it right. Ask them for 10 things, or for something in the long-tail of sparse training examples, and we’re in “hallucination” territory.

Clarifying with examples (e.g., test cases) helps to minimise the semantic ambiguity of inputs, reducing the risk of misinterpretation, and this is especially helpful when the model’s working with code because the samples they’re trained on are paired with those kinds of examples. They give the LLM more to match on.
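To make that concrete, here’s a sketch (the function and figures are my own illustration, not from any particular project). A prose request like “price the carpet, rounding the area up” is ambiguous – round the area, or the price? A couple of test cases settle it:

```python
import math

def carpet_price(width, length, price_per_sq_m):
    # Reference behaviour the examples below pin down unambiguously:
    # round the *area* up before pricing, not the final price.
    return math.ceil(width * length) * price_per_sq_m

# These examples would go in the prompt alongside the request.
assert carpet_price(3.5, 3.5, 10.0) == 130.0  # 12.25 sq m rounds up to 13
assert carpet_price(3.0, 4.0, 10.0) == 120.0  # whole number: no rounding
```

Two short assertions remove the ambiguity entirely, and they look exactly like the code-plus-tests pairs the model was trained on.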

Contexts need to be small and specific to the current task. How small? Research suggests that the effective usable context sizes of even the frontier LLMs are orders of magnitude smaller than advertised. Going over 1,000 tokens is likely to produce errors, but even contexts as small as 100 tokens can produce problems.

Attention dilution, drift, “probability collapse” (play one at chess and you’ll see what I mean), and the famous “lost in the middle” effect make the odds of a model following all of the rules in your CLAUDE.md file, or all the requirements for a whole feature, vanishingly remote. They just can’t accurately pay attention to that many things.

But even if they could, trying to match on dozens of criteria simultaneously will inevitably send them out-of-distribution.

So the smart money focuses on one problem at a time and one rule at a time, working in rapid iterations, testing and inspecting after every step to ensure everything’s tickety-boo before committing the change (singular) and moving on to the next problem.

And when everything’s not tickety-boo – e.g., tests start failing – they do a hard reset and try again, perhaps breaking the task down into smaller, more in-distribution steps. Or, after the model’s failed 2-3 times, writing the code themselves to get themselves out of a “doom loop”.
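That loop is simple enough to sketch. Here’s a minimal illustration in Python (assuming a pytest suite and a git working copy; the helper names are mine, not any particular tool’s):

```python
import subprocess

MAX_ATTEMPTS = 3  # after repeated failures, take the wheel yourself

def shell(cmd):
    """True if the command succeeded."""
    return subprocess.run(cmd).returncode == 0

def checkpoint(step, run=shell):
    """Apply one small change; commit if green, hard-reset if red."""
    for _ in range(MAX_ATTEMPTS):
        step()                               # one problem, one rule
        if run(["pytest", "-q"]):            # everything tickety-boo?
            run(["git", "commit", "-am", "small step"])
            return True
        run(["git", "reset", "--hard"])      # discard entirely and retry
    return False                             # doom loop: write it yourself
```

The point is the shape, not the tooling: every step ends in either a commit or a hard reset, so a failed attempt never contaminates the next one.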

There will be times – many times – when you’ll be writing or tweaking or fixing the code yourself. Over-relying on the tool is likely to cause your skills to atrophy, so it’s important to keep your hand in.

It will also be necessary to stay on top of the code. The risk, when code’s being created faster than we can understand it, is that a kind of “comprehension debt” will rapidly build up. When we have to edit the code ourselves, it’s going to take us significantly longer to understand it.

And, of course, it compounds the “looks good to me” problem with our own version of the Gell-Mann amnesia effect. Something I’ve heard often over the last 3 years is people saying “Well, it’s not good with <programming language they know well>, but it’s great at <programming language they barely know>”. The less we understand the output, the less we see the brown M&Ms in the bowl.

“Agentic” coding assistants are claimed to be able to break complex problems down, and plan and execute large pieces of work in smaller steps. Even if they can – and remember that LLMs don’t reason and don’t plan, they just produce plausible-looking reasoning and plausible-looking plans – that doesn’t mean we can hit “Play” and walk away to leave them to it. We still need to check the results at every step and be ready to grab the wheel when the model inevitably takes a wrong turn.

Many developers report how LLM accuracy falls off a cliff when tasked with making changes to code that lacks separation of concerns, and we know why this is, too. Changing large modules with many dependencies brings a lot more code into play, which means the model has to work with a much larger context. And we’re out-of-distribution again.

The really interesting thing is that the teams DORA found were succeeding with “AI” were already working this way. Practices like Test-Driven Development, refactoring, modular design and Continuous Integration are highly compatible with working with “AI” coding assistants. Not just compatible, in fact – essential.

But we shouldn’t be surprised, really. Software development – with or without “AI” – is inherently uncertain. Is this really what the user needs? Will this architecture scale like we want? How do I use that new library? How do I make Java do this, that or the other?

It’s one unknown after another. Successful teams don’t let that uncertainty pile up, heaping speculation and assumption on top of speculation and assumption. They turn the cards over as they’re being dealt. Small steps, rapid feedback. Adapting to reality as it emerges.

Far from “changing the game”, probabilistic “AI” coding assistants have just added a new layer of uncertainty. Same game, different dice.

Those of us who’ve been promoting and teaching these skills for decades may have the last laugh, as more and more teams discover it really is the only effective way to drink from the firehose.

Skills like Test-Driven Development, refactoring, modular design and Continuous Integration don’t come with your Claude Code plan. You can’t buy them or install them like an “AI” coding assistant. They take time to learn – lots of time. Expert guidance from an experienced practitioner can expedite things and help you avoid the many pitfalls.

If you’re looking for training and coaching in the practices that are distinguishing the high-performing teams from the rest – with or without “AI” – visit my website.

The AI-Ready Software Developer #20 – It’s The Bottlenecks, Stupid!

For many years now, cycling has been consistently the fastest way to get around central London. Faster than taking the tube. Faster than taking the train. Faster than taking the bus. Faster than taking a cab. Faster than taking your car.

Image

All of these other modes of transport are, in theory, faster than a bike. But the bike will tend to get there first, not because it’s the fastest vehicle, but because it’s subject to the fewest constraints.

Cars, cabs, trains and buses move not at the top speed of the vehicle, but at the speed of the system.

And, of course, when we measure their journey speed at an average 9 mph, we don’t see them crawling along steadily at that pace.

“Travelling” in London is really mostly waiting. Waiting at junctions. Waiting at traffic lights. Waiting to turn. Waiting for the bus to pull out. Waiting on rail platforms. Waiting at tube stations. Waiting for the pedestrian to cross. Waiting for that van to unload.

Cyclists spend significantly less time waiting, and that makes them faster across town overall.

Similarly, development teams that can produce code much faster, but work in a system with real constraints – lots of waiting – will tend to be outperformed overall by teams who might produce code significantly slower, but who are less constrained – spend less time waiting.
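The arithmetic is brutal. A toy illustration, with invented numbers – lead time is touch time plus wait time, and wait time usually dominates:

```python
# Invented figures, purely for illustration.
def lead_time_hours(coding, review_wait, test_wait, deploy_wait):
    return coding + review_wait + test_wait + deploy_wait

# Team A: codes twice as fast, but batches work up and waits in queues.
team_a = lead_time_hours(coding=4, review_wait=24, test_wait=16, deploy_wait=8)

# Team B: codes slower, but reviews, tests and releases continuously.
team_b = lead_time_hours(coding=8, review_wait=1, test_wait=0.5, deploy_wait=0.5)

print(team_a)  # 52 hours
print(team_b)  # 10.0 hours
```

Halving Team A’s coding time would shave its lead time from 52 hours to 50. Eliminating its queues would shave it to 4.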

What are developers waiting for? What are the traffic lights, junctions and pedestrian crossings in our work?

If I submit a Pull Request, I’m waiting for it to be reviewed. If I send my code for testing, I’m waiting for the results. If I don’t have SQL skills, and I need a new column in the database, I’m waiting for the DBA to add it for me. If I need someone on another team to make a change to their API, more waiting. If I pick up a feature request that needs clarifying, I’m waiting for the customer or the product owner to shed some light. If I need my manager to raise a request for a laptop, then that’s just yet more waiting.

Teams with handovers, sign-offs and other blocking activities in their development process will tend to be outperformed by teams who spend less time waiting, regardless of the raw coding power available to them.

Teams who treat activities like testing, code review, customer interaction and merging as “phases” in their process will tend to be outperformed by teams who do them continuously, regardless of how many LOC or tokens per minute they’re capable of generating.

This isn’t conjecture. The best available evidence is pretty clear. Teams who’ve addressed the bottlenecks in their system are getting there sooner – and in better shape – than teams who haven’t. With or without “AI”.

The teams who collaborate with customers every day – many times a day – outperform teams who have limited, infrequent access.

The teams who design, test, review, refactor and integrate continuously outperform teams who do them in phases.

The teams with wider skillsets outperform highly-specialised teams.

The teams working in cohesive and loosely-coupled enterprise architectures outperform teams working in distributed monoliths.

The teams with more autonomy outperform teams working in command-and-control hierarchies.

None of these things comes with your Claude Code plan. You can’t buy them. You can’t install them. But you can learn them.

And if you’re ticking none of those boxes, and you still think a code-generating supercar is going to make things better, I have a Bugatti Chiron Sport you might be interested in buying. Perfect for the school run!

Refactoring Is Like Chess

When I’m introducing developers to refactoring, I draw a parallel between this hugely valuable – but much-misunderstood – design discipline and chess.

Primitive refactorings are like the moves of chess that apply to the different pieces on a chess board.

A bishop can move diagonally, a rook can move horizontally or vertically, and so on.

Likewise, there are “pieces” in our code that we can rename, extract things from, introduce, inline, move, and so on.

These are the smallest “moves” we can make when we’re refactoring that bring us back to code that works.

At a higher level, there are tactics. These are sequences of basic moves that achieve a specific purpose, with designations like “Clearance Sacrifice” and “Desperado”. Serious players might study hundreds or even thousands of them.

Refactoring, too, has its tactics – sequences of primitive refactorings that achieve a higher-level goal. Many of those have their own designations, like “Replace Conditional With Polymorphism” and “Introduce Method Object”.

Importantly, they’re executed as a sequence of primitive, behaviour-preserving refactorings like Extract Method and Introduce Parameter. So, no matter how long the sequence, we’re never far from working (shippable) code.

Of course, we could spend a lifetime studying tactics, and still not cover even a tiny fraction of the possibilities. It’s an infinite problem space.

At the highest level, chess has strategies. These are the organising principles – the end goals – of tactics:

  • Material Count
  • Piece Activity
  • Pawn Structure
  • Space
  • King Safety

Strategies in chess are about gaining positional advantage in a game going forward.

And, at the highest level, refactoring has its strategies, too – organising principles that make changing code easier going forward. This is the software design equivalent of positional advantage.

You may know them as “software design principles”:

  • Readability
  • Complexity
  • Duplication
  • Coupling & Cohesion
  • (the one we tend to forget) Testability

Each refactoring tactic is designed to gain us “positional advantage” in one or more of these dimensions to:

  • make code easier to understand
  • make it simpler
  • remove duplication (by introducing modularity/generality)
  • reduce coupling (by improving cohesion – 2 sides of the same coin)
  • make it easier to test quickly, which is often a very valuable side-effect – and sometimes the main goal – of the first 4

The most effective refactorers operate seamlessly across all 3 levels.

They’re thinking strategically about their design goals and measuring impact along those dimensions.

They’re thinking tactically, looking several refactorings ahead, to get them safely from A to B.

And they’re working one primitive refactoring at a time, keeping the code working all the way.

And, like chess, this can take a lifetime to master. Expert help is highly recommended if you want to grasp it faster, of course 🙂

“AI”-Assisted Refactoring Golf

A very common interaction I see online is people talking about how hard it was to get their “AI” coding assistant to do what they wanted, and someone – inevitably – replying “It works for me. You must be doing it wrong.”

The difficulty with these conversations is knowing whether we’re comparing apples with apples. On the occasions when someone’s offered to try to solve a specific problem that was irking me, in the end, they’ve almost always moved the goalposts and done something else.

In a recent post I talked about whether refactoring is more efficient and effective using “AI” coding assistants or using automated refactorings in an IDE like IntelliJ.

My experience is that, nine times out of ten, automated refactorings win. I make exceptions when the IDE doesn’t have the refactoring I need (e.g., Move Instance Method in PyCharm). But the rest of the time, I’ll take predictable over powerful any day.

But this is, of course, subjective and qualitative. It’s my lived experience, and we all know how reliable that kind of evidence can be.

So, I thought to myself, what might be a more objective test?

It just so happens that there is a game we can play called Refactoring Golf. First run as a workshop at the original Software Craftsmanship 20xx conference by Dave Cleal, Ivan Moore and Mike Hill, this is a game that helps us get more familiar with the automated refactorings and other useful code-manipulating shortcuts in our IDE.

The original rules have been lost to time, but these are the rules I’ve been playing it by:

  • Contestants are given two versions of the same code, a “before” version and a refactored “after” that’s behaviourally identical. It does exactly the same thing.
  • The goal is to refactor the “before” into the “after” such that they have an identical abstract syntax tree (formatting notwithstanding), scoring as few points as possible – hence “golf”.
  • Every code edit made using an automated refactoring or other IDE shortcut (e.g., Find+Replace) costs 1 point.
  • Any code edit made manually costs 2 points. Any time you change a line of code, there’s a penalty.
  • Any edit made while the code isn’t working (tests failing or build failing) is double the points (so a manual edit with tests failing is 4 points, for example)
  • Any edit that doesn’t change the abstract syntax tree – e.g., reformatting or deleting blank lines – costs 0 points.
  • Needless to say, the tests must be re-run after every change.
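The scoring is mechanical enough to automate. Here’s a hypothetical scorer for the rules above (my own sketch, not part of the original game):

```python
# Hypothetical scorer for Refactoring Golf, implementing the rules above.
def edit_score(manual, tests_green, changes_ast):
    """Points for one edit: IDE/shortcut = 1, manual = 2,
    doubled while the code is broken, free if the AST is unchanged."""
    if not changes_ast:
        return 0                                   # reformatting costs nothing
    points = 2 if manual else 1                    # manual edits cost double
    return points if tests_green else points * 2   # red code doubles it again

def round_score(edits):
    return sum(edit_score(**edit) for edit in edits)

card = [
    dict(manual=False, tests_green=True,  changes_ast=True),   # Extract Method: 1
    dict(manual=True,  tests_green=True,  changes_ast=True),   # hand edit: 2
    dict(manual=True,  tests_green=False, changes_ast=True),   # hand edit, red: 4
    dict(manual=False, tests_green=True,  changes_ast=False),  # reformat: 0
]
print(round_score(card))  # 7
```

Logging each edit as you play makes comparing rounds – and players – straightforward.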

Emily Bache has helpfully curated a small selection of rounds of Refactoring Golf in Java, Kotlin and C#.

Anyhoo, I just happened to be revisiting the game this week for a refactoring deep-dive that I was running for a new client, and it got me to thinking: could this be the “apples with apples” test of my gut feeling about IDE-assisted vs. “AI”-assisted refactoring?

I’m proposing an “AI”-assisted version of the game that has the following rules:

  • Same “before” and “after”. Same goal.
  • Any code edit made in a single interaction with the LLM costs 1 point. Not an interaction with the coding assistant (e.g., Cursor, Codex), but with the actual model itself. So an agent sending 5 requests to the LLM that result in changes that are applied to the code – even if they’re applied in one final step – would cost 5 points. We’re scoring based on code changes generated by the model in a single interaction.
  • Any code edit made by anything other than the LLM (including using automated refactorings and IDE shortcuts) costs 2 points. Basically, every time you change a line of code, there’s a penalty.
  • Any code edit made while the code is broken costs 2x the points. So if you have to use the Extract Method refactoring in your IDE while tests are failing, that’s 4 points.
  • Any code edit made by you or the “AI” that doesn’t change the AST costs 0 points. Formatting costs nothing, basically. Good luck getting an LLM to reformat code without changing it!
  • Again, shouldn’t need saying, but the tests must be re-run after every change. If you’re going the “agentic” route, you will need to enforce this somehow.
  • YOU ARE ONLY ALLOWED TO INCLUDE THE CODE AS IT CURRENTLY IS, OR CODE EXAMPLES FROM A DIFFERENT PROBLEM, IN YOUR INPUTS (PROMPTS OR CONTEXT FILES etc). YOU MUST NOT TELL THE “AI” WHAT THE CODE SHOULD BE (because then you’re the one writing it).

I’ve tried this on the C# version of ROUND_3 in Emily’s repo, and it was a lot of fun trying to get Claude to do exactly what I want. It felt a bit like playing real golf with a bazooka.

I did manage to get one almost-clean round where I didn’t need to edit the code myself much at all, but – by jingo – we went around the houses!

I want to experiment more with this, because I also see it as a useful test of the principles I’ve been converging on for “AI”-assisted development more generally. For sure, smaller steps helped. Prompting with examples of how specific refactorings work helped, and I’ve been doing that for quite some time.

Is “AI-First” a Strategy, an Ideology, or a Performance?

I was recently observing a team doing their day-to-day work. Their C-suite had introduced an “AI-first” policy over the summer, mandating that development teams use “AI” as much as possible on their code.

Starting in November, this mandate turned into a KPI for individual developers, and for teams: % of AI-generated code in Pull Requests. (And, no, I have no idea how they measure that. But I understand that tool use is being tracked. More tokens, nurse!)

The underlying threat didn’t need to be said out loud. “Use this technology more, or start looking for a new job.”

Developers are now incentivised to find reasons to use “AI” coding assistants, and they’re doing it at any cost. All other priorities rescinded. Crew expendable.

By now, we probably all know Goodhart’s Law:

When a measure becomes a target, it ceases to be a good measure.

I have a shorter version: be careful what you wish for.

The history of software development is littered with the bones of teams who were given incentives to adopt dysfunctional behaviour.

The classic “Lines of Code”, “Function Points”, “Velocity” and other easily gameable measures of “productivity” have forced thousands upon thousands of teams to take their eyes off the prize – i.e. business outcomes – and focus their efforts on producing more stuff – output.

Introducing mandates about how that stuff must be produced is a step up the dysfunction ladder.

So I had the privilege of watching a Java developer write the following prompt, which I jotted down for posterity.

Please extract the selected block of code into a new method called 'averageDailySales'

Using their IDE, that would have been just Ctrl+Alt+M and a method name. And, importantly, it would have worked first time. They ended up taking a second pass to fix the missing parameter the new method needed.

The whole 2-hour session was a masterclass in trying to cook a complete roast dinner in a sandwich toaster. The goal was very clearly not to solve the problem, but to use the tool.

I’m not saying that a tool like Claude Code or Cursor would add no value in the process. I’m saying that developers should be incentivised to use the right tool for the job.

But the “AI-first” mandate has encouraged some of the developers to drop all the other tools. They’ve gone 100% “AI”. No IDE in sight.

An Integrated Development Environment is a Swiss Army Knife of tools for viewing, navigating, manipulating (including refactoring), executing, debugging, profiling, inspecting, testing, version controlling and merging code. Well, the ones I use are, anyway.

Could IDEs be better? For sure. But when it comes to, for example, extracting a method, they are still my go-to. It’s usually much faster, and it’s much, much safer. I’ll take predictable over powerful any day.

Using refactoring as an example, if my IDE doesn’t have the automated refactoring I need – e.g., there’s no Move Instance Method in PyCharm – then I’ll let Claude have a crack at it, with my finger poised over the reset button.

Because my focus is on achieving better outcomes, I’ve necessarily landed on a hybrid approach that uses Claude when that makes sense – and, if you read my blog regularly, you’ll know I’m still exploring that – and uses my IDE or some boring old-fashioned deterministic command line tool when that makes sense. And, right now, that’s most of the time.

I feel no compulsion to drink exclusively from the firehose “just because”.

But then, I’m the only shareholder. And that’s probably what “AI-first” policies are really about: optics. There’s something about this that genuinely feels performative. It’s not about using “AI”, it’s about being seen to use “AI”. Look at us! We’re cutting edge!

There’s no credible evidence that “AI” 10x’s dev team productivity. But there’s plenty of evidence that it can 10x a valuation.

The fact that, according to the more credible data, the technology slows most teams down – less reliable software gets delivered later and costs more – doesn’t seem to matter.

It’s quite revealing, if you think about it. Perhaps it never mattered?

I contracted in a London firm that would proudly announce in each year’s annual report how much they’d invested in technology. It didn’t seem to matter what return they got on that investment, just as long as they spent that £30 million on the latest “cool thing”.

When my team tried to engage with the business on real problems, the push-back came from the IT Director himself. That, apparently, was “not what we do here”. We’re here to chew bubblegum and spend money. And we’re all out of bubblegum.

So, in that sense, t’was ever thus. But, as with all things “AI” these days, it’s a question of scale. Watching team after team after team drop everything to try and tame the code-generating firehose, while real business and real user needs go unaddressed, is quite the spectacle. It’s a hyper-scaled dysfunction.

Of course, eventually, reality’s going to catch up with us. I was interviewed for a Financial Times newsletter, The AI Shift, a few weeks ago, and it was clear that the resetting of expectations has spread far beyond the dev floor. People who aren’t software developers are starting to notice.

If, like me, you’re interested in what’s real and what works in developing software – with or without “AI” – you might want to visit my training and coaching site for details of courses and consulting in principles and practices that are proven to shorten lead times, improve reliability of releases and lower the cost of change.

I mean, if that’s your sort of thing.

And if you’re curious about what really seems to work when we’re using “AI” coding assistants, I’ve brain-dumped my learnings from nearly 3 years experimenting with and exploring the code-generating firehose. You might be surprised to hear that it has very little to do with code generation, and almost everything to do with the real bottlenecks in development.

Then again, you might not.

Manual Refactoring: Python – Introduce Parameter Object & Move Instance Method

Two refactorings I can’t live without are Introduce Parameter Object and Move Instance Method.

I often find myself introducing new classes to separate concerns using them in a little dance I call “chunking”.

In IntelliJ and Rider or ReSharper, these are available as automated refactorings, which saves some time. (In Rider, Introduce Parameter Object is helpfully called “Transform Parameters” for no good reason.)

But when I’m working in dynamic languages – which suffer from a lack of type information – I have to do these refactorings by hand.

In half of the courses I run, I’m demonstrating in either Python or JavaScript, so this comes up a lot. I thought it might be helpful to document these manual refactorings for future reference.

In this example, I’ve been asked to change this code that generates quotes for fitted carpets so that rooms can have different shapes, meaning that there will be different ways of calculating the area of carpet required.

class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up):
        area = width * length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

My solution would be to introduce a class for calculating the room’s area that knows its dimensions. (If you were just thinking “switch statement”, give yourself a wobble.)

I want to introduce a parameter to the calculate method for the room. And I want to do it in teeny, safe steps.

Step #1 – Add a new room parameter

class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up,  room=None):
        area = width * length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

By giving room a default value, this code still runs and passes the tests.

Step #2 – Instantiate room in the client code (the tests) as a new class

class Room:
    pass


class CarpetQuoteTest(unittest.TestCase):
    def test_quote_for_carpet_no_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(122.50, quote.calculate(3.5, 3.5, 10.0, False, Room() ))

    def test_quote_for_carpet_with_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(130.0, quote.calculate(3.5, 3.5, 10.0, True, Room() ))

Step #3 – Pass in width and length as constructor parameters of Room

class Room:
    def __init__(self, width, length):
        pass


class CarpetQuoteTest(unittest.TestCase):
    def test_quote_for_carpet_no_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(122.50, quote.calculate(3.5, 3.5, 10.0, False, Room(3.5, 3.5) ))

    def test_quote_for_carpet_with_rounding(self):
        quote = CarpetQuote()
        self.assertEqual(130.0, quote.calculate(3.5, 3.5, 10.0, True, Room(3.5, 3.5) ))

Step #4 – Assign width and length to fields (member variables) of Room

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

Room is now ready to be used in the calculate method.

Step #5 – Replace references to calculate‘s width and length parameters with references to room‘s fields

class CarpetQuote:
    def calculate(self, width, length, price_per_sq_m, round_up,  room=None):
        area = room.width * room.length

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

We can now do a little cleaning up.

Step #6 – Remove unused width and length parameters from calculate (Safe Delete)

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room=None):

Step #7 – Remove redundant default value for room parameter

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):

Okay, that’s some hanging chads dealt with. Let’s look at moving the area calculation to where it now belongs.

Step #8 – Extract area calculation into a separate method

This involves cutting the calculation code and pasting it into the new method as a return value, and replacing that code with a call to the new method.

class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = self.area(room)

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

    def area(self, room):
        return room.width * room.length

We can now see that the area method has very obvious Feature Envy for room.

Step #9 – Move the area method to the Room class

First, I cut the area method and paste it into Room. I then switch the target of the call to area from self to room.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width
        
    def area(self, room):
        return room.width * room.length


class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = room.area(room)

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

Then I switch the references to room.length and room.width to self.length and self.width. Remember, room and self are the same object.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

    def area(self, room):
        return self.width * self.length

The room parameter is now unused. Let’s delete it.

class Room:
    def __init__(self, width, length):
        self.length = length
        self.width = width

    def area(self):
        return self.width * self.length


class CarpetQuote:
    def calculate(self, price_per_sq_m, round_up, room):
        area = room.area()

        if round_up:
            area = math.ceil(area)

        return area * price_per_sq_m

Now there’s no need to expose the width and length fields.

Step #10 – “Hide” width and length

Let’s rename these fields to indicate that they should not be accessed from outside Room.

class Room:
    def __init__(self, width, length):
        self._length = length
        self._width = width

    def area(self):
        return self._width * self._length

Now it’s easy to substitute different implementations of room in the CarpetQuote‘s calculate method. Job done!

One final note: every code snippet here was taken after I’d seen it pass the tests. That’s 13 test runs – and 13 commits – to do this refactoring.

…In case you were wondering what I mean by “small steps”.

(Of course, in IntelliJ or Rider, it would have been a lot fewer steps. That’s the pay-off for automated refactorings, and why I’ll choose my IDE with that in mind.)