Yeah, About Your “Precise” Specification…

Increasingly, I see people who’ve been struggling with LLM-based coding assistants reaching the conclusion that what’s needed is “better” specifications.

If you were to ask me what might make a specification “better”, I’d probably say:

  • Less ambiguous – less open to multiple valid interpretations
  • More complete – fewer gaps where expected system behaviour and other properties are left undefined
  • More consistent – fewer contradictions (e.g., Requirement #1: “Users can opt in to notifications”, Requirement #77: “By default, notifications must be on”)

Of these three factors, ambiguity is top of my list. It can mask contradictions and paper over gaps. When requirements are ambiguous, that takes us into physicist Wolfgang Pauli’s “not even wrong” territory.

It’s hard to know what the software’s supposed to do, and hard to know when it’s not doing it. This is why so many testers tell me that a large part of their job is figuring out what the requirements were in the first place. (Pro tip: bring them into those discussions.)

An ideal software specification therefore has no ambiguity. It’s not open to multiple interpretations. This enables us to spot gaps and inconsistencies more easily. But more importantly, it enables us to know with certainty when the software doesn’t conform to the specification.

We can never know, of course, that it always conforms to the specification. That would require infinite testing in most cases. But it only needs one test to refute it – and that requires the specification to be refutable.

So I guess when I talk about a “better” specification, I’m talking mostly about refutability.

“Precise”. You Keep Using That Word.

Refutability requires precision. And this is where our natural languages let us down. Try as we might to articulate rules in “precise English” or “precise French” or “precise Cantonese”, these languages haven’t evolved for precision.

Language entropy – the tendency of natural language statements to have multiple valid interpretations, and therefore uncertain meaning – is pretty inescapable.

For completely unambiguous statements, we need a formal language – a language with precisely-defined syntax – with formal semantics that precisely define how that syntax is to be interpreted. Statements made with these can have one – and only one – interpretation. It’s possible to know with certainty when an example contradicts it.

Computer programmers are very familiar with these formal systems. Programming languages are formal languages, and compilers and interpreters endow them with formal semantics – with precise meaning.

I half-joke, when product managers and software designers ask me where they can find good examples of complete software specifications, that they should look on GitHub. It’s full of them.

It’s only half a joke, because it’s literally true that program source code is a program specification, not an actual program. It expresses all of a program’s rules in a formal language, and those rules are then translated into lower-level formal languages like x86 assembly language or machine code. These in turn are translated into even lower-level representations, until eventually they’re interpreted by the machine itself – the ultimate arbiter of meaning.

It’s turtles all the way down, and given a specific stack of turtles, meaning – hardware failures notwithstanding – is completely predictable. The same source code, compiled by the same compiler, executed by the same CPU, will produce the same observable behaviour.

So we have a specification that’s refutable and predictable. The same rules will produce the same behaviour every time, and we can know with certainty when examples break the rules.

But, of course, a computer program does what it does. It will always conform to its program specification, expressed in Java or Python or – okay, maybe not JavaScript – or Go. That doesn’t mean it’s the right program.

So we need to take a step back from the program. Sure, it does what it does. But what is it supposed to do?

Remember those turtles? Well, it would be a mistake to believe the program source code is at the top of the stack. In order to meaningfully test whether we wrote the right program code, we need another formal specification (and I use those words advisedly) that describes the desired properties of the program without being part of the program itself.

Let’s think of a simple example. If I have a program that withdraws money from a bank account, and my customer and I agree that withdrawal amounts must be more than zero, and the account needs to have sufficient funds to cover the withdrawal, we might specify that withdrawals should only happen when both of those conditions are true.

In informal language, a precondition of any withdrawal is that the amount must be greater than zero, and the balance must be greater than or equal to the amount being withdrawn. If the withdraw function is invoked when that condition isn’t met, the program is wrong.

To remove any ambiguity, I would wish to express that in a formal language. I could do it in a programming language. I could insert an assertion at the start of the withdraw function that checks the condition and, e.g., throws an exception if it’s not satisfied, or halts execution during testing and reports an error.

e.g. in Python “defensive programming” (we can talk in another blog post about what terrible UX design this is – yes, UX design. In the code. Bazinga!)

def withdraw(self, amount):
    if amount <= 0:
        raise InvalidAmountError()
    if self.balance < amount:
        raise InsufficientFundsError()
    self.balance -= amount

e.g., using inline assertions that are checked during testing

def withdraw(self, amount):
    assert amount > 0
    assert self.balance >= amount
    self.balance -= amount

These approaches are fine, but they’re not a great way to establish what those rules are with our customer in the first place. Are we going to sit down with them and start writing code to capture the requirements?

In the late 1980s, formal languages started to appear specifically with the aim of creating precise external specifications of correct behaviour that aren’t part of the code at all.

The first I used was Z. Z was a notation founded on predicate logic and set theory. Here’s an artist’s impression of a Z specification that ChatGPT hallucinated for me.

Image

Not the most customer-friendly of notations. Other formal specification languages attempted to be more “business-friendly”, like the Object Constraint Language:

context BankAccount::withdraw(amount: Real)
pre: amount > 0
pre: balance >= amount
post: balance = balance@pre - amount

These OCL constraints were designed to extend UML models to make their meaning more precise. I remember being told that it was designed to be used by business people. I found that naivety endearing.

To cut a long story short, while formal specification certainly found a home in the niche of high-integrity and critical systems engineering, that same snow never settled on the plains of business and requirements analysis and everyday software development. We were expecting business stakeholders to become programmers. That rarely works out.

But for a time, I used formal specifications – luckily, my customers were electronics engineers and not marketing executives, so most already had programming experience.

Tests As Specifications

We’d firm up a specification using a combination of Z and the Object Modeling Technique (UML wasn’t a thing then) describing precisely what a feature or a function needed to do.

Then I’d analyse that specification and choose test examples.

BankAccount::withdraw

Example #1: invalid amount
    amount = 0
    Outcome: throws InvalidAmountError

Example #2: valid amount and sufficient funds
    amount = 50.0
    balance = 50.0
    Outcome: balance = 0.0

Example #3: insufficient funds
    amount = 50.01
    balance = 50.0
    Outcome: throws InsufficientFundsError
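Examples like these translate almost directly into executable tests. Here’s a sketch using Python’s standard unittest module, pairing the three examples with the defensive-programming version of withdraw from earlier (the class name and constructor are my assumptions):

```python
import unittest


class InvalidAmountError(Exception):
    pass


class InsufficientFundsError(Exception):
    pass


class BankAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise InvalidAmountError()
        if self.balance < amount:
            raise InsufficientFundsError()
        self.balance -= amount


class WithdrawTests(unittest.TestCase):
    def test_invalid_amount(self):
        # Example #1: invalid amount
        with self.assertRaises(InvalidAmountError):
            BankAccount(balance=50.0).withdraw(0)

    def test_valid_amount_and_sufficient_funds(self):
        # Example #2: valid amount and sufficient funds
        account = BankAccount(balance=50.0)
        account.withdraw(50.0)
        self.assertEqual(0.0, account.balance)

    def test_insufficient_funds(self):
        # Example #3: insufficient funds
        with self.assertRaises(InsufficientFundsError):
            BankAccount(balance=50.0).withdraw(50.01)
```

Each test case is one example from the specification, with the outcome asserted explicitly.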

It turned out that business stakeholders can much more easily understand specific examples than general rules expressed in formal languages. So we flipped the script, and explored examples first, and then generalised them to a formal specification.

It was when I first started learning about “test-first design”, one of the practices of the earliest documented versions of Extreme Programming, that the lightbulb moment came.

If we’ve got tests, do we need the formal specifications at all? Maybe we could cut out the middle-man and go straight to the tests?

This – exploring the precise meaning of requirements using test examples – often works well with non-programming stakeholders.

And many people are discovering that including test examples in our prompts helps LLMs match more accurately by reducing the search space of code patterns. It turns out that models are trained on code samples that have been paired with usage examples (tests, basically), so including examples in the prompt gives them more to match on.

So, if you were to ask me what might make a specification for LLM code generation “better”, I’d definitely say “tests”. (And there was you thinking it was the LLM’s job to dream up tests.)

Visualising The Gaps

That helps reduce ambiguity and the risk of misinterpretation, but what of completeness and consistency?

This is where some kind of generalisation is really needed, but it doesn’t have to take us down the Z or OCL road. What we really need is a way to visualise the state space of the problem.

One simple technique I’ve used to good effect is a decision table. This helps me to see how the rules of a function or an action map to different outcomes.

Image

Here, I’ve laid out all the possible combinations of conditions and mapped them to specific outcomes. There’s one simplification we can make – if the amount isn’t greater than zero, we don’t care if the account has sufficient funds.

Image

That maps exactly on to my three original test cases, so I’m confident they’re a complete description of this withdraw function.
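A decision table like that can also be captured as data, and used to compute the expected outcome for any concrete example. A minimal sketch, assuming the withdraw rules above (the condition and outcome labels are mine):

```python
# Each row: (amount > 0?, sufficient funds?) -> outcome.
# None means "don't care" -- the simplification made in the decision table.
RULES = [
    {"amount_ok": False, "funds_ok": None,  "outcome": "InvalidAmountError"},
    {"amount_ok": True,  "funds_ok": False, "outcome": "InsufficientFundsError"},
    {"amount_ok": True,  "funds_ok": True,  "outcome": "balance reduced by amount"},
]


def expected_outcome(amount, balance):
    """Look up the decision table for a concrete example."""
    for rule in RULES:
        if rule["amount_ok"] != (amount > 0):
            continue
        if rule["funds_ok"] is not None and rule["funds_ok"] != (balance >= amount):
            continue
        return rule["outcome"]
```

Each of the three original test cases lands on exactly one row of the table, which is what gives us confidence they’re a complete description.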

Mapping it out like this and exploring test cases encourages us to clarify exactly what the customer expects to happen. When the amount is greater than the balance, exactly what should the software do? It forces us and our customers to consider details that probably wouldn’t have come up otherwise.

Other tools we can use to visualise system behaviour and rules include Venn diagrams (have we tested every part of the diagram?), state transition diagrams and state transition tables (have we tested every transition from every state?), logic flow diagrams (have we tested every branch and every path?), and good old-fashioned truth tables – the top half of a decision table.
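A truth-table check like that can even be automated: enumerate every combination of conditions and report which ones have no covering example. A sketch for the two withdraw conditions (the example values are from the earlier test cases; the function names are mine):

```python
from itertools import product

# Each example is the set of inputs for one test case.
EXAMPLES = [
    {"amount": 0,     "balance": 50.0},   # invalid amount
    {"amount": 50.0,  "balance": 50.0},   # valid amount, sufficient funds
    {"amount": 50.01, "balance": 50.0},   # insufficient funds
]


def conditions(example):
    """Evaluate the two withdraw conditions for one example."""
    return (example["amount"] > 0, example["balance"] >= example["amount"])


def uncovered(examples):
    """Which rows of the truth table have no covering example?"""
    covered = {conditions(e) for e in examples}
    return [row for row in product([True, False], repeat=2) if row not in covered]
```

Running this on the three examples reports exactly one uncovered row – invalid amount combined with insufficient funds – which the simplified decision table told us was a “don’t care” combination.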

Isn’t This Testing?

“But, Jason, this sounds awfully like what testers do!”

Yup 🙂

Tests are to specifications what experiments are to hypotheses.

If I say “It should throw an error when the account holder tries to withdraw more than their balance” before any code’s been written to do that, I’m specifying what should happen. Hypothesis.

If I try to withdraw £100 from an account with a balance of £99, then that’s a test of whether the software satisfies its specification. It’s a test of what does happen. Experiment.

This is why I strongly recommend teams bring testing experts into requirements discussions. You’re far more likely to get a complete specification when someone in the room is thinking “Ah, but what if A and B, but not C?”

You can, of course, learn to think more like a tester. I did, so it can’t be that hard.

But there’s really no substitute for someone with deep and wide testing experience in the room.

If a function or a feature is straightforward, we can probably figure out what test cases we’d need to cover in our heads. My initial guesses at tests for the withdraw function were pretty good, it turned out.

But when they’re not straightforward, or when the scenario’s high risk, I’ve found these techniques very valuable.

As a bottom line, I’ve found that tests of some kind are table stakes. They’re the least I’ll include in my specification.

Shared Language

Another thing I’ve found that helps to minimise misinterpretations is establishing a shared model of the concepts we’re talking about in our specifications.

In a training exercise I run often, pairs are asked to use Test-Driven Development to create a simple online retail program. They’re given a set of requirements expressed in plain English and the idea is that they agree tests with the customer (one of them plays that role) to pin down what they think the requirements mean.

e.g.

Add item – add an item to an order. An order item has a product and a quantity. There must be sufficient stock of that product to fulfil the order

Total including shipping – calculate the total amount payable for the order, including shipping to the address

Confirm – when an order is confirmed, the stock levels of every product in the items are adjusted by the item quantity, and then the order is added to the sales history.

A couple of years back, I changed the exercise by giving them a “walking skeleton” – essentially a “Hello, world!” project for their tech stack with a dummy test and a CI build script set up and ready to go – to get them started.

And in that project I added a bare-bones domain model – just classes, fields and relationships – that modeled the concepts used in the requirements.

In UML, it looked something like this.

Image
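Since the diagram isn’t reproduced here, here’s a guess at what a bare-bones domain model like that might look like as Python dataclasses – just classes, fields and relationships, no behaviour. All the names and fields are my assumptions, not the actual model from the exercise:

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    name: str
    unit_price: float
    stock: int


@dataclass
class OrderItem:
    product: Product
    quantity: int


@dataclass
class Address:
    lines: str
    postcode: str


@dataclass
class Order:
    shipping_address: Address
    items: list = field(default_factory=list)   # of OrderItem


@dataclass
class SalesHistory:
    orders: list = field(default_factory=list)  # of confirmed Order
```

The point isn’t the code itself; it’s that everyone starts from the same named concepts and relationships.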

Before I added a domain model, pairs would come up with distinctly different interpretations of the requirements.

With the addition of a domain model, 90% of pairs would land on pretty much the same interpretation. Such is the power of a shared conceptual model of what it is we’re actually talking about.

It doesn’t need to be code or a UML diagram – but some expression in some form we hopefully can all understand of the concepts in our requirements and how they’re related evidently cuts out a lot of misunderstandings.

Precision In UX & UI Design

And, of course, if we’re trying to describe a user interface, pictures can really help there. Wireframes and mock-ups are great, but if we’re trying to describe dynamic behaviour – what happens when I click that button? – I highly recommend storyboards.

A storyboard is just a sequence of snapshots of the UI in specific test scenarios that illustrates what happens with each user interaction. Here’s a great example.

Image
Source: Annie Hay Design https://anniehaydesign.weebly.com/app-design/storyboarding

It’s another way of visualising a test case, just from the user’s perspective. In that sense, it can be a powerful tool in user experience design, helping stakeholders to come to a shared understanding of the user’s journey, and potentially revealing problems with the design early.

Precision != BDUF

Before anybody jumps in with accusations of Big Design Up-Front (BDUF), a quick reminder that I would never suggest trying to specify everything, then implement it, then test it, then merge and release it in one pass. I trust you know me better than that.

When clarity’s needed, I have a pretty well-stocked toolbox of techniques for providing it, as and when it’s needed in a highly iterative process delivering working software in thin slices – one feature at a time, one scenario at a time, one outcome at a time, and one example at a time. Solving one problem at a time in tight feedback loops.

Taking small steps with continuous feedback and opportunities to steer is highly compatible with working with LLM-based coding assistants. It’s actually kind of essential, really. Folks talking about specifying e.g., a whole feature “precisely” and then leaving the agent(s) to get on with it are… Well, you probably know what I think. I’ve seen those trains come off the rails so many times.

And with each step, I stay on-task. I’ll rarely, for example, model domain concepts that aren’t involved in the test cases I’m working on. I’m not one of these “First, I model ALL THE THINGS, then I think about the user’s goals” guys.

And using tests as specifications goes hand-in-glove with a test-driven approach to development, which you may have heard I’m quite partial to.

Believe it or not, agility and precision are completely compatible. How precise you’re being, and the size of the steps you’re taking that end in user feedback from working software, are orthogonal concerns. If you look in the original XP books, you’ll even find – gasp! – UML diagrams.

Hopefully you get some ideas about the kinds of things we can include in a specification to make it more precise, more complete and more consistent.

But at the very least, you might begin to rethink just how good your current specifications actually are.

Prompts Aren’t Code and LLMs Aren’t Compilers

One final thought. The formal systems of computer programming – programming languages, compilers, machine code and so on – and the “turtles” in an LLM-based stack are very different.

Prompts – even expressed in formal languages – aren’t code, and LLMs aren’t compilers. They will rarely produce the exact same output given the exact same input. It’s a category mistake to believe otherwise.

This means that no matter how precise our inputs are, they will not be processed precisely or predictably. Expect surprises.

But less ambiguity will – and I’ve tested this a lot – reduce the number of surprises. And refutability gives us a way to spot the brown M&Ms in the output more easily.

It’s easier to know when the model got it wrong.

Clean Contexts

You’ve probably heard of “clean code” (and the “clean coder”, and “clean architecture”, and other things Bob Martin has added the word “clean” in front of to get another book out of it).

In this dawning age of “AI”-assisted software development, I’d like to propose clean contexts.

What is a “clean context”? Well, I’m glad you asked.

A clean context:

  • Addresses one problem – one failing test, one code quality rule, one refactoring etc.
  • Is small enough to stay inside the model’s effective context limit – which is going to be orders of magnitude smaller than the advertised maximum context
  • Uses clear and consistent shared language – if you’ve been calling it “sales tax”, don’t suddenly start calling it “VAT”
  • Clarifies with examples that can be used as success criteria (i.e., tests) – the code samples used in training were paired with usage examples, so including examples improves matching
  • Only contains information pertinent to the task – don’t divert the model’s attention (literally)
  • Only contains accurate information (“ground truth”) – the code and the architecture as it is now (not a bunch of changes back when you asked the tool to summarise it), the test failure message, the mutation testing results and so on. Ground your interactions in reality.
  • Only contains working code – if the model breaks the code, don’t feed it back to it. It can’t tell broken code from working code, and you’ll pollute the context. Revert and try again. The exception to this is bug fixes, of course. But if the model introduced the bug – git reset --hard
  • Contains code that doesn’t go outside the model’s data distribution – LLMs famously choke on code that lacks clarity, is overly complex and lacks separation of concerns because it’s far outside the distribution of examples they were trained on. When it comes to gnarly legacy code, I’ve had more success breaking it down myself initially before letting Claude loose on it. Y’know, like how an adult bird chews the food first before feeding it to its chicks.

And remember that a prompt isn’t the entire context. Claude Code and Cursor will use static analysis to determine what source code needs to be added. Context files may be added (e.g., CLAUDE.md). And of course, everything in the conversation – your (or your agent’s) prompts and the model’s responses – is part of the context. When an LLM “hallucinates”, that becomes part of the context, and the model has no way of telling fact from its own fiction. It’s all just context to a language model.

This is why I purge and then construct a new, task-specific context with each interaction. Many users are reporting how much more accurate LLMs tend to be with a fresh context.

Our goal with a clean context is to minimise ambiguity and the risk of misinterpretation, to minimise attention dilution and context drift, context pollution and context “rot”, and as much as possible, stay within the LLM’s training data distribution.

Basically, we’re aiming to maximise the chances of an accurate prediction from the LLM, and spend less time cleaning up mistakes and digging the tool out of “doom loops”.

Importantly, working in small steps – solving one problem at a time – opens up many more opportunities after each step to get feedback from testing, code review and merging, so clean contexts are highly compatible with much more iterative approaches.

Just as Continuous Delivery enables us to make progress by putting one foot in front of the other, ensuring a working product after every step, we also aim to start every step with a clean context that significantly reduces the risk of a stumble.

The Great Filter (Or Why High Performance Still Eludes Most Dev Teams, Even With AI)

In my post about The Gorman Paradox, I compare the lack of any evidence of “AI”-assisted productivity gains to be found out here in the Real World™ with the famous Fermi Paradox, which asks: if the universe is teeming with intelligent life, where is everybody?

It’s been over 3 years, and we’ve seen no uptick in products being added to the app stores. We’ve seen no rising tide on business bottom lines. We’ve seen no impact on national GDPs.

There is a likely explanation, and it’s the most obvious one: “AI”-assisted coding doesn’t actually make the majority of dev teams more productive. For sure, it produces more code. But, on average, it creates no net additional value.

The DORA data does find some teams reaping modest gains in terms of software delivery lead times without sacrificing reliability, and – interestingly – the data shows that those high-performing teams using “AI” were already high-performing without it.

For the majority of teams, the data showed that “AI” actually slowed them down – and these were the teams that were already pretty slow before “AI”. Attaching a code-generating firehose to the process just made them marginally slower.

The differentiator? Are the high-performing teams super-skilled programmers? Are they getting paid more? Are they putting something in the office water supply?

It turns out that what separates the teams who get a negative boost from the teams who get a positive boost is that the latter have addressed the bottlenecks in their development process.

Blocking activities, like detailed up-front design, after-the-fact testing, Pull Request code reviews, and big merges to the main branch, have been turned into continuous activities.

Teams work in much smaller batches and in much tighter feedback loops, designing, testing, inspecting and merging many times an hour instead of every few days.

Work doesn’t sit in queues waiting for someone’s attention. There are very few traffic lights between the developer’s desktop and the outside world to slow that traffic down.

And this means that changes can make it into the hands of users very rapidly, with highly automated, highly reliable, frictionless delivery pipelines that – as the supermarket ads used to say – get the peas from the farmer’s field to your table in no time at all.

The just-in-time grocery supply chains of supermarkets are a good analogy for the processes high-performing teams are using. Supermarkets don’t buy a year’s supply of fresh peas once a year. They buy tomorrow’s supply today, and their formidable logistical capabilities get those peas on the shelves pronto.

Those formidable logistical capabilities didn’t just appear, either. They’re the product of many decades of investment. Supermarket chains have sunk billions into getting better at it, so they can maximise cash flow by minimising the amount of working capital they have committed at any time.

They don’t want millions of pounds-worth of produce sitting in warehouses making them no money.

And businesses don’t want millions of pounds-worth of software changes sitting in queues waiting to be released. They want them out there in the hands of users, creating value in the form of learning what works and what doesn’t. Software that can’t be used has no value.

Walk into any large organisation and take a snapshot of how much investment in developed code is “in progress”. For some, it literally is millions of pounds-worth – tens or hundreds of thousands of pounds, multiplied by dozens or hundreds of teams.

The impact on a business of being able to out-learn the competition can be so profound, we might ask ourselves “Why isn’t everybody doing this?” Can you imagine a supermarket chain deciding not to bother with JIT supply? They wouldn’t last long.

It’s come into focus even more sharply with the rise of “AI”-assisted software development. It’s quite clear now that even modest productivity gains lie at the other end of the spectrum, with teams who have addressed their bottlenecks and have low-friction delivery pipelines.

I see a “Great Filter” that continues to prevent the large majority of dev teams making it to that Nirvana. It requires a big, ongoing investment in the software development capability needed.

We’re talking about investment in people and skills. We’re talking about investment in teams and organisational design. We’re talking about investment in tooling and automation. We’re talking about investment in research and experimentation. We’re talking about investment in talent pipelines and outreach. We’re talking about investment in developer communities and the profession of software development.

Typically, I’ve seen that companies who manage to progress from the bottleneck-ridden ways of working to highly iterative, frictionless methods needed to invest 20-25% of their entire development budget in building and maintaining that capability.

And building that kind of capability takes years.

You can’t buy it. You can’t install it. You can’t have it flown in fresh from Silicon Valley.

And, as with organ transplants, any attempt to transplant that kind of capability into your business will be met with organisational antibodies protecting the status quo.

And that, folks, is The Great Filter.

Most organisations are simply not prepared to make that kind of commitment in time, effort and money.

Sure, they want the business benefits of faster lead times, more reliable releases, and a lower cost of change. But they’re just not willing to pony up to get it.

On a daily basis, I see people online warning us not to “get left behind by AI”. The reality is that the people who really are getting left behind are the ones who think that the bottlenecks and blockers they’ve struggled with in the past will magically get out of the way of the code-generating firehose.

Low-performing teams, now grappling with the downstream chaos caused by “AI” code generation, will probably always be the norm. And the value of this technology will probably never be realised by those businesses.

If you’re one of the few who are serious about building software development capability, my training courses in the technical practices that enable rapid, reliable and sustained evolution of software to meet changing needs are half price if you confirm your booking by Jan 31st.

Yes, Maintainability Still Matters in “AI”-assisted Coding

A couple of people have asked, in relation to my 2-day Software Design Principles training course, whether maintainability matters anymore.

Perhaps they’ve read some of the wrong-headed posts here about why LLM-generated code doesn’t need to be understandable or maintainable by humans.

Putting aside the undeniable fact that these tools are nowhere near that reliable, in reality, code maintainability matters just as much – if not more – when LLMs are working with it.

First, and hopefully you’ve figured this out by now, “AI”-assisted programming without a good suite of fast-running regression tests is very, very risky. Fast tests have such a huge impact on the cost and the risk of changing code that Michael Feathers defines “legacy code” as code that lacks them.

More teams are discovering that they need to be constantly assessing the “strength” of the automated tests their “AI” assistant generates – they’re notorious for weak tests, and for cheating to get tests passing.

I highly recommend regular mutation testing to check for gaps in your test suites.
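If a mutation testing tool isn’t to hand, the idea is easy to illustrate by hand: deliberately break the code, and see if any test notices. A sketch (the class and function names are mine):

```python
class Account:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("invalid amount")
        if self.balance < amount:
            raise ValueError("insufficient funds")
        self.balance -= amount


class MutantAccount(Account):
    """A hand-made 'mutant': the state change has been deleted."""
    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("invalid amount")
        if self.balance < amount:
            raise ValueError("insufficient funds")
        # self.balance -= amount   <- mutated out


def weak_test(account_class):
    """Exercises the happy path but never checks the outcome."""
    account_class(100.0).withdraw(50.0)
    return True  # "passed"


def strong_test(account_class):
    """Asserts on the outcome, so this mutant gets 'killed'."""
    account = account_class(100.0)
    account.withdraw(50.0)
    return account.balance == 50.0
```

The weak test passes against both the original and the mutant – the mutant survives, exposing the gap in the suite. Tools like mutmut (Python) or PIT (Java) generate mutants like this automatically and report the survivors.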

Clarity matters, because… well… language models. If I’m asking Claude to add a premium tier to video rentals pricing, but the code’s talking about “vd_prc_1” and “tr_rate_fs”, it hasn’t got much to match on. Concepts need to be clearly signposted and consistent with the language we use to describe our requirements.
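A hypothetical before-and-after to illustrate – same logic, but the second version uses names drawn from the language of the requirement:

```python
# Before: nothing for a language model (or a human) to match on.
def calc(vd_prc_1, tr_rate_fs):
    return vd_prc_1 * (1 + tr_rate_fs)


# After: the concepts in the requirement are clearly signposted.
def rental_price_with_tier_rate(base_rental_price, tier_rate):
    return base_rental_price * (1 + tier_rate)
```
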

Duplication’s a problem, because logic repeated 5x takes up 5x the context, and also models might not actually “spot” the repetition, so there’s a risk of drift.

Complexity’s a big problem. LLMs don’t like complex patterns. Overly complex code is likely to fall outside the data distribution, leading to low-confidence matches and low-accuracy predictions.

And then there’s separation of concerns…

LLMs are trained on a huge amount of code snippets of the Stack Overflow variety that contain little or no modularity. That’s their comfort zone, and code they generate will tend to be like that, too.

The irony is that, while they suck at generating effectively modular code – cohesive, loosely-coupled modules that localise the ripple effect of changes – they also suck at modifying code that isn’t highly modular. The wider the ripple effect, the more code gets brought into play, and the further out-of-distribution the context grows.

In this way, they’ll tend to paint themselves into a corner as the code grows. So we really need to keep on top of modular design.

So, yes, maintainability matters in “AI”-assisted coding. A LOT.

<shameless-plug>

If you think your team could use some levelling up or a refresher on software design principles, my training's half-price if you confirm your booking by Jan 31st. Link in my profile.

</shameless-plug>

Walking Skeletons, Delivery Pipelines & DevOps Drills

On my 3-day Code Craft training workshop (and if you’re reading this in January 2026, training’s half-price if you confirm your booking by Jan 31st), there’s a team exercise where the group need to work together to deliver a simple program to the customer’s (my) laptop where I can acceptance-test it.

It’s primarily an exercise in Continuous Delivery, bringing together many of the skills explored earlier in the course like Test-Driven Development and Continuous Integration.

But it also exercises the muscles that individual or pair-programmed exercises don’t reach. Any problem, even a simple one like the Mars Rover, tends to become much more complicated when we tackle it as a team. It requires a lot of communication and coordination. A team will typically take more time to complete it.

And it also exercises muscles that developers these days have never used before. In 2026, the average developer has never created, say, a command-line project from scratch in their tech stack. They’ve never set up a repo using their version control tool. They’ve never created a build script for Continuous Integration builds. They’ve never written a script to automatically deploy working software.

In the age of “developer experience”, a lot of people have these things done for them. Entry-level devs land on a project and it’s all just there.

That may seem like a convenience initially, but it comes with a sort of learned helplessness, with total reliance on other people to create and adapt build and deployment logic when it’s needed. A lot of developers would be on a significant learning curve if they ever needed to get a project up and running or to change, say, a build script.

It’s the delivery pipeline that frustrates most teams’ attempts to get any functionality in front of the customer in this exercise.

I urge them at the start to get that pipeline in place first. Code that can’t be used has no value. They may have written all of it, but if I can’t test it on my machine – nil points. Just like in real life.

They’re encouraged to create a “walking skeleton” for their tech stack – e.g., a command-line program that outputs “Hello, world!”, and has one dummy unit test.
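In Python, for example, a walking skeleton needn’t be more than a couple of files – something like this (the file names are illustrative):

```python
# walking_skeleton.py -- the simplest possible program, end to end
def greet() -> str:
    return "Hello, world!"

# test_walking_skeleton.py -- one dummy unit test, to prove the test runner works
def test_greet():
    assert greet() == "Hello, world!"

if __name__ == "__main__":
    print(greet())
```

It does nothing useful, and that’s the point: it gives us something real to push through every stage of the pipeline.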

This can then be added to a new GitHub repository, and the rest of the team can be invited to collaborate on it. That’s the first part of the pipeline.

Then someone can create a build script that runs the tests, and is triggered by pushes to the main (trunk) branch. On GitHub, if we keep our technical architecture vanilla for our tech stack (e.g., a vanilla Java/Maven project structure), GitHub Actions can usually generate a script for us. It might need a tweak or two – the right version of Java, for example – but it will get us in the ballpark.

So now everyone in the team can clone a repo that has a skeleton project with a dummy unit test and a simple output to check that it’s working end to end.

That’s the middle of the pipeline. We now have what we need to at least do Continuous Integration.

The final part of the pipeline is when the food makes it to the customer’s table. I remind teams that my laptop is a developer’s machine, and that I have versions of Python, Node.js, Java and .NET installed, as well as a Git client.

So, they could write a batch script that clones the repo, builds the software (e.g., runs pip install for a Python project), and runs the program. When I see “Hello, world!” appear on my screen, we have lift-off. The team can begin implementing the Mars Rover, and whenever a feature is complete, they can ping me and ask me to run that script again to test it.
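A sketch of what that script might look like – written here in Python rather than a batch file so the steps are easy to see and test. The repo URL and entry point are hypothetical placeholders:

```python
import subprocess

def deploy_steps(repo_url: str, entry_point: str) -> list:
    """The commands to clone the repo, 'build' it and run the program."""
    return [
        ["git", "clone", repo_url, "app"],
        ["pip", "install", "-r", "app/requirements.txt"],  # the 'build' step for Python
        ["python", f"app/{entry_point}"],
    ]

def deploy(repo_url: str, entry_point: str) -> None:
    for cmd in deploy_steps(repo_url, entry_point):
        subprocess.run(cmd, check=True)  # fail fast if any step breaks

# Example (hypothetical): deploy("https://github.com/example/mars-rover.git", "main.py")
```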

And thus, value begins to flow, in the form of meaningful user feedback from working software. (Aww, bless. Did you think the software was the value? No, mate. The value’s in what we learn, not what we deliver.)

And, of course, in the real world, that delivery pipeline will evolve, adding more quality gates (e.g., linting), parallelising test execution as the suite gets larger, progressing to more sophisticated deployment models and that sort of thing, as needs change.

DevOps – the marriage of software development and operations – means that the team writing the solution code also handles these matters. We don’t throw it over the wall to a separate “DevOps” team. That’s kind of the whole point of DevOps, really. When we need a change to, say, the build script, we – the team – make that change.

But you might be surprised how many people who describe themselves as “DevOps Engineers” wouldn’t even know where to start. (Or maybe you wouldn’t.)

It’s not their fault if they’ve been given no exposure to operations. And it’s not every day that we start a project from scratch, so the opportunities to gain experience are few and far between.

Given just how critical these pipelines are to our delivery lead times, it’s surprising how little time and effort many organisations invest in getting good at them. It should be a core competency in software development.

It’s especially mysterious why so many businesses allow it to become a bottleneck by favouring specialised teams over T-shaped DevOps software engineers who can do most of it themselves instead of waiting for someone else to do it. Teams could keep a specialised expert on hand for the rare times when deep expertise is really needed.

If the average developer knew the 20% they’d need 80% of the time to create and change delivery pipelines for their tech stack(s), there’d be a lot less waiting on “DevOps specialists” (which is an oxymoron, of course).

Just as a contractor who has to move house often tends to become very efficient at it, developers who have to get delivery pipelines up and running often tend to be much better at the yak shaving it involves.

So I encourage teams to make these opportunities by doing regular “DevOps drills” for their tech stacks. Get a Node Express “Hello, world” pipeline up and running from scratch. Get a Spring Boot pipeline up and running from scratch. And so on.

Typically, I see teams doing them monthly, and as they gain confidence, varying the parameters (e.g., parallel test execution, deployment to a cluster and so on), and making the quality gates more sophisticated (security testing, linting, mutation testing and so on), while learning how to optimise pipelines to keep them as frictionless as possible.

Why Does Test-Driven Development Work So Well In “AI”-assisted Programming?

In my series on The AI-Ready Software Developer, I propose a set of principles for getting better results using LLM-based coding assistants like Claude Code and Cursor.

Users of these tools report how often and how easily they go off the rails, producing code that doesn’t do what we want and frequently breaking code that was working. As the code grows, these risks grow with it. On large code bases, they can really struggle.

From experiment and from real-world use, I’ve seen a number of things help reduce those risks and keep the “AI” on the rails.

  • Working in smaller steps
  • Testing after every step
  • Reviewing code after every step
  • Refactoring code as soon as problems appear
  • Clarifying prompts with examples

Smaller Steps

Human programmers have a limited capacity for cognitive load. There’s only so much we can comfortably wrap our heads around with any real focus, and when we overload ourselves, mistakes become much more likely. When we’re trying to spin many plates, the most likely result is broken plates.

LLMs have a similarly-limited capacity for context. While vendors advertise very impressive maximum context sizes of hundreds of thousands of tokens, research – and experience – shows that they have effective context limits that are orders of magnitude smaller.

The more things we ask models to pay attention to, the less able they are to pay attention to any of them. Accuracy drops off a cliff once the context goes beyond these limits.

After thousands of hours working with “AI” coding assistants, I’ve found I get the best results – the fewest broken plates – when I ask the model to solve one problem at a time.

Continuous Testing

If I make one change to the code and test it straight away, and tests fail, I don’t need to be a debugging genius to figure out which change broke the code. It’s either a quick fix, or a very cheap undo.

If I make ten changes and then test it, it’s going to take significantly longer, potentially, to debug. And if I have to revert to the last known working version, it’s 10x the work and the time lost.

An LLM is more likely to generate breaking changes than a skilled programmer, so frequent testing is even more essential to keep us close to working code.

And if the model’s first change breaks the code, that broken code is now in its context and it – and I – don’t know it’s broken yet. So the model is predicting further code changes on top of a polluted context.

Many of us have been finding that a lot less rework is required when we test after every small step rather than saving up testing for the end of a batch of work.

There’s an implication here, though. If we’re testing and re-testing continuously, then testing needs to be very fast.

Continuous Inspection

Left to their own devices, LLMs are very good at generating code they’re pretty bad at modifying later.

Some folks rely on rules and guardrails about code quality which are added to the context with every code-generating interaction with the model. This falls foul of the effective context limits of even the hyperscale LLMs. The model may “obey” – remember, they don’t in reality, they match and predict – some of these rules, but anyone who’s spent more than a few minutes attempting this approach will know that they rarely consistently obey all of them.

And filling up the context with rules runs the risk of “distracting” the LLM from the task at hand.

A more effective approach is to keep the context specific to the task – the problem to be solved – and then, when we’ve got something that works, we can turn our attention to maintainability.

After I’ve seen all my tests pass, I then do a code review, checking everything in the diff between the last working version and the latest. Because these diffs are small – one problem at a time – these code reviews are short and very focused, catching “code smells” as soon as they appear.

The longer I let the problems build up, the more the model ends up wading through its own “slop”, making every new change riskier and riskier.

I pay attention to pretty much the same things I would if I was writing all the code myself:

  • Clarity (LLMs really benefit from this, because… language model, duh!)
  • Complexity – the model needs the code likely to be affected in its context. More code, bigger context. Also, the more complex it is, the more likely it is to end up outside of the model’s training data distribution. Monkey no see, monkey can’t do.
  • Duplication – oh boy, do LLMs love duplicating code and concepts! Again, this is a context size issue. If I duplicate the same logic 5x, and need to make a change to the common logic, that’s 5x the code and 5x the tokens. But also, duplication often signposts useful abstractions and a more modular design. Talking of which…
  • Separation of Concerns – this is a big one. If I ask Claude Code to make a change to a 1,000-line class with 25 direct dependencies, that’s a lot of context, and we’re way outside the distribution. Many people have reported how their coding assistant craps out on code that lacks separation of concerns. I find I really have to keep on top of it. Modules should have one reason to change, and be loosely-coupled to other parts of the system.

On top of these, there are all kinds of low-level issues – security vulnerabilities, hanging imports, dead code and so on – that I find I need to look for. Static analysis can help me check diffs for a whole range of issues that would otherwise be easy to miss – by me, or by an LLM doing the code review. I’m seeing a lot of developers upping their game with linting as they use “AI” more in their work.

Continuous Refactoring

Of course, finding code quality issues is only academic if we don’t actually fix them. And, for the reasons I’ve already laid out – we want to give the model the smoothest surface to travel on – I fix them immediately.

And I don’t fix all the problems at once. I fix one problem at a time, again for reasons already stated.

And after I fix each problem, I run the tests again, in case the fix broke anything.

This process of fixing one “code smell” at a time, testing throughout, is called refactoring. You may well have heard of it. You may even think you’re doing it. There’s a very high probability that you’re not.

Clarifying With Examples

Here’s an experiment you can try for yourself. Prepare two prompts for a small code project. In one prompt, try to describe what you want as precisely as possible in plain language, without giving any examples.

The total of items in the basket is the sum of the item subtotals, which are the item price multiplied by the item quantity

In the second version, give the exact same requirements, but using examples.

The total of items in a shopping basket is the sum of item subtotals:

item #1: price = 9.99, quantity = 1

item #2: price = 11.99, quantity = 2

shopping basket total = (9.99 * 1) + (11.99 * 2) = 33.97

See what kind of results you get with both approaches. How often does the model misinterpret precisely-described requirements vs. requirements accompanied by examples?

It’s worth knowing that code-generating LLMs are typically trained on code samples that are paired with examples like this. When we include examples, we’re giving the model more to match on, limiting the search space to examples that do what we want.

Examples help prevent LLMs grabbing the wrong end of the prompt, and many users have found them to greatly improve accuracy in generated code.

Harking back to the need for very fast tests, these examples make an ideal basis for fast-running automated “unit” tests (where “units” = units of behaviour). It would make good sense to ask our coding assistant to generate them for us, because we’re going to be needing them soon enough.
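For instance, the shopping basket example above translates almost directly into a fast-running unit test – a minimal Python sketch:

```python
def basket_total(items):
    """Sum of item subtotals, where each subtotal is price * quantity."""
    return round(sum(price * quantity for price, quantity in items), 2)

def test_basket_total():
    # item #1: price = 9.99, quantity = 1
    # item #2: price = 11.99, quantity = 2
    assert basket_total([(9.99, 1), (11.99, 2)]) == 33.97

test_basket_total()
```

The example from the prompt and the test are the same information in two forms, which is exactly why it costs so little to have both.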

Putting It All Together

If we were to imagine a workflow that incorporates all of these principles – small steps, continuous testing, continuous inspection, continuous refactoring, clarifying with examples – it would look very familiar to the small percentage of developers who practice Test-Driven Development.

TDD has been around for several decades, and builds on practices that have been around even longer. It’s a tried-and-tested approach that’s been enabling the rapid, reliable and sustainable evolution of working software for those in the know. If you look inside the “elite-performing” teams in the DORA data – the ones delivering the most reliable software with the shortest lead times and the lowest cost of change – you’ll find they’re pretty much all doing TDD, or something very like TDD.

TDD specifies what we want software to do using examples, in the form of tests. (Hence, “test-driven”).

It works in micro-iterations where we write a test that fails because it requires something the software doesn’t do yet. Then we write the simplest code – the quickest thing we can think of – to get the tests passing. When all the tests are passing, we review the changes we’ve made, and if necessary refactor the code to fix any quality problems. Once we’re satisfied that the code is good enough – both working and easy to change – we move on to the next failing test case. And rinse and repeat until our feature or our change is complete.
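One micro-iteration of that cycle might look like this, sketched in Python against the Mars Rover exercise (the rover API here is illustrative, not canonical):

```python
# Step 1 (red): a test that fails because the software doesn't do this yet.
def test_rover_moves_north():
    rover = Rover(x=0, y=0, facing="N")
    rover.execute("M")
    assert (rover.x, rover.y) == (0, 1)

# Step 2 (green): the simplest code that gets the test passing.
class Rover:
    def __init__(self, x, y, facing):
        self.x, self.y, self.facing = x, y, facing

    def execute(self, commands):
        for command in commands:
            if command == "M" and self.facing == "N":
                self.y += 1  # just enough to pass; more cases arrive with more tests

# Step 3: run the tests, then review and refactor before the next failing test.
test_rover_moves_north()
```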

Image

TDD practitioners work one feature at a time, one usage scenario at a time, one outcome at a time, one example at a time and one refactoring at a time. Basically, we solve one problem at a time.

And we’re continuously running our tests at every step to ensure the code is always working. While automated tests are a side-effect of driving design using tests, they’re a damned useful one! And because we’re only writing code that’s needed to pass tests, all of our code will end up being tested. It’s a self-fulfilling prophecy.

Embedded in that micro-cycle, many practitioners also use version control to ensure they’re making progress in safe, easily-reverted steps, progressing from one working version of the code to the next.

Some of us have discovered the benefits of a “commit on green, revert on red” approach to version control. If all the tests pass, we commit the changes. If any tests fail, we do a hard reset back to the previous working commit. This means that broken versions of the code don’t end up in the context for the next interaction. (Remember that LLMs can’t distinguish between working code and broken code – it’s all just context.)
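A minimal sketch of what “commit on green, revert on red” might look like as a helper script, assuming pytest as the test runner (the commit message and workflow details are illustrative):

```python
import subprocess

def next_commands(tests_passed):
    """Commit the change if tests pass; otherwise hard-reset to the last green commit."""
    if tests_passed:
        return [["git", "add", "-A"], ["git", "commit", "-m", "green: tests passing"]]
    return [["git", "reset", "--hard", "HEAD"]]  # broken code never enters the history

def commit_on_green():
    tests_passed = subprocess.run(["pytest", "-q"]).returncode == 0
    for cmd in next_commands(tests_passed):
        subprocess.run(cmd, check=True)
```

Because we only ever commit on green, `HEAD` is always the last known working version, so the reset path is a one-liner.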

The beauty of TDD is that the benefits can be yours whether you’re using “AI” or not. Which is why I now teach it both ways.

The key to being effective with “AI” coding assistants is being effective without them.

Shameless Plug

Test-Driven Development is not a skill that you can just switch on, whether you’re doing it with “AI” or without. It takes a lot of practice to get the hang of it, and especially to build the discipline – the habits – of TDD.

An alarming number of TDD tutorials aren’t actually teaching TDD. (And the more people learn from them, the more bad tutorials we’ll no doubt see.)

If your team wants training in Test-Driven Development, including how to do it effectively using tools like Claude Code and Cursor, my 2-day TDD training workshop is half-price if you confirm your booking by January 31st.

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

In this series, I’ve explored the principles and practices that teams seeing modest improvements in software development outcomes have been applying.

More than four years after the first “AI” coding assistant, GitHub Copilot, appeared, the evidence is clear. Claims of teams achieving 2x, 5x, even 10x productivity gains simply don’t stand up to scrutiny. There’s no shortage of anecdotal evidence, but not a shred of hard data. It seems when we measure it, the gains mysteriously disappear.

The real range, when it’s measured in terms of team outcomes like delivery lead time and release stability, is roughly 0.8x – 1.2x, with negative effects being substantially more common than positives.

And we know why. Faster cars != faster traffic. Gains in code generation, according to the latest DORA State of AI-Assisted Software Development report, are lost to “downstream chaos” for the majority of teams.

Coding never was the bottleneck in software development, and optimising a non-bottleneck in a system with real bottlenecks just makes those bottlenecks worse.

Far from boosting team productivity, for the majority of “AI” users, it’s actually slowing them down, while also negatively impacting product or system reliability and maintainability. They’re producing worse software, later.

Most of those teams won’t be aware that it’s happening, of course. They attached a code-generating firehose to their development plumbing, and while the business is asking why they’re not getting the power shower they were promised, most teams are measuring the water pressure coming out of the hose (lines of code, commits, Pull Requests) and not out of the shower (business outcomes), because those numbers look far more impressive.

The teams who are seeing improvements in lead times of 5%, 10%, 15%, without sacrificing reliability and without increasing the cost of change, are doing it the way they were always doing it:

  • Working in small batches, solving one problem at a time
  • Iterating rapidly, with continuous testing, code review, refactoring and integration
  • Architecting highly modular designs that localise the “blast radius” of changes
  • Organising around end-to-end outcomes instead of around role or technology specialisms
  • Working with high autonomy, making timely decisions on the ground instead of sending them up the chain of command

When I observe teams that fall into the “high-performing” and “elite” categories of the DORA capability classifications using tools like Claude Code and Cursor, I see feedback loops being tightened. Batch sizes get even smaller, quality gates get even narrower, iterations get even faster. They keep “AI” on a very tight leash, and that by itself could well account for the improvements in outcomes.

Meanwhile, the majority of teams are doing the opposite. They’re trying to specify large amounts of work in detail up-front. They’re leaving “AI agents” to chew through long tasks that have wide impact, generating or modifying hundreds or even thousands of lines of code while developers go to the proverbial pub.

And, of course, they test and inspect too late, applying too little rigour – “Looks good to me.” They put far too much trust in the technology, relying on “rules” and “guardrails” set out in Markdown files that we know LLMs will misinterpret and ignore randomly, barely keeping one hand on the wheel.

As far as I’ve seen, no team actually winning with the technology works like that. They’re keeping both hands firmly on the wheel. They’re doing the driving. As AI luminary Andrej Karpathy put it, “agentic” solutions built on top of LLMs just don’t work reliably enough today to leave them to get on with it.

It may be many years before they do. Statistical mechanics predicts it could well be never, with the order-of-magnitude improvement in accuracy needed to make them reliable enough (wrong 2% of the time instead of 20%) calculated to require 10²⁰ times the compute to train. To do that on similar timescales to the hyperscale models of today would require Dyson Spheres (plural) to power it.

Any autonomous software developer – human or machine – requires Actual Intelligence: the ability to reason, to learn, to plan and to understand. There’s no reason to believe that any technology built using deep learning alone will ever be capable of those things, regardless of how plausibly they can mimic them, and no matter how big we scale them. LLMs are almost certainly a dead end for AGI.

For this reason I’ve resisted speculating about how good the technology might become in the future, even though the entire value proposition we see coming out of the frontier labs continues to be about future capabilities. The gold is always over the next hill, it seems.

Instead, I’ve focused my experiments and my learning on present-day reality. And the present-day reality that we’ll likely have to live with for a long time is that LLMs are unreliable narrators. End of. Any approach that doesn’t embrace this fact is doomed to fail.

That’s not to say, though, that there aren’t things we can do to reduce the “hallucinations” and confabulations, and therefore the downstream chaos.

LLMs perform well – are less unreliable – when we present them with problems that are well-represented in their training data. The errors they make are usually a product of going outside of their data distribution, presenting them with inputs that are too complex, too novel or too niche.

Ask them for one thing, in a common problem domain, and chances are much higher that they’ll get it right. Ask them for 10 things, or for something in the long-tail of sparse training examples, and we’re in “hallucination” territory.

Clarifying with examples (e.g., test cases) helps to minimise the semantic ambiguity of inputs, reducing the risk of misinterpretation, and this is especially helpful when the model’s working with code because the samples they’re trained on are paired with those kinds of examples. They give the LLM more to match on.

Contexts need to be small and specific to the current task. How small? Research suggests that the effective usable context sizes of even the frontier LLMs are orders of magnitude smaller than advertised. Going over 1,000 tokens is likely to produce errors, but even contexts as small as 100 tokens can produce problems.

Attention dilution, drift, “probability collapse” (play one at chess and you’ll see what I mean), and the famous “lost in the middle” effect make the odds of a model following all of the rules in your CLAUDE.md file, or all the requirements for a whole feature, vanishingly remote. They just can’t accurately pay attention to that many things.

But even if they could, trying to match on dozens of criteria simultaneously will inevitably send them out-of-distribution.

So the smart money focuses on one problem at a time and one rule at a time, working in rapid iterations, testing and inspecting after every step to ensure everything’s tickety-boo before committing the change (singular) and moving on to the next problem.

And when everything’s not tickety-boo – e.g., tests start failing – they do a hard reset and try again, perhaps breaking the task down into smaller, more in-distribution steps. Or, after the model’s failed 2-3 times, writing the code themselves to get themselves out of a “doom loop”.

There will be times – many times – when you’ll be writing or tweaking or fixing the code yourself. Over-relying on the tool is likely to cause your skills to atrophy, so it’s important to keep your hand in.

It will also be necessary to stay on top of the code. The risk, when code’s being created faster than we can understand it, is that a kind of “comprehension debt” will rapidly build up. When we have to edit the code ourselves, it’s going to take us significantly longer to understand it.

And, of course, it compounds the “looks good to me” problem with our own version of the Gell-Mann amnesia effect. Something I’ve heard often over the last 3 years is people saying “Well, it’s not good with <programming language they know well>, but it’s great at <programming language they barely know>”. The less we understand the output, the less we see the brown M&Ms in the bowl.

“Agentic” coding assistants are claimed to be able to break complex problems down, and plan and execute large pieces of work in smaller steps. Even if they can – and remember that LLMs don’t reason and don’t plan, they just produce plausible-looking reasoning and plausible-looking plans – that doesn’t mean we can hit “Play” and walk away to leave them to it. We still need to check the results at every step and be ready to grab the wheel when the model inevitably takes a wrong turn.

Many developers report how LLM accuracy falls off a cliff when tasked with making changes to code that lacks separation of concerns, and we know why this is, too. Changing large modules with many dependencies brings a lot more code into play, which means the model has to work with a much larger context. And we’re out-of-distribution again.

The really interesting thing is that the teams DORA found were succeeding with “AI” were already working this way. Practices like Test-Driven Development, refactoring, modular design and Continuous Integration are highly compatible with working with “AI” coding assistants. Not just compatible, in fact – essential.

But we shouldn’t be surprised, really. Software development – with or without “AI” – is inherently uncertain. Is this really what the user needs? Will this architecture scale like we want? How do I use that new library? How do I make Java do this, that or the other?

It’s one unknown after another. Successful teams don’t let that uncertainty pile up, heaping speculation and assumption on top of speculation and assumption. They turn the cards over as they’re being dealt. Small steps, rapid feedback. Adapting to reality as it emerges.

Far from “changing the game”, probabilistic “AI” coding assistants have just added a new layer of uncertainty. Same game, different dice.

Those of us who’ve been promoting and teaching these skills for decades may have the last laugh, as more and more teams discover it really is the only effective way to drink from the firehose.

Skills like Test-Driven Development, refactoring, modular design and Continuous Integration don’t come with your Claude Code plan. You can’t buy them or install them like an “AI” coding assistant. They take time to learn – lots of time. Expert guidance from an experienced practitioner can expedite things and help you avoid the many pitfalls.

If you’re looking for training and coaching in the practices that are distinguishing the high-performing teams from the rest – with or without “AI” – visit my website.

The AI-Ready Software Developer #20 – It’s The Bottlenecks, Stupid!

For many years now, cycling has been consistently the fastest way to get around central London. Faster than taking the tube. Faster than taking the train. Faster than taking the bus. Faster than taking a cab. Faster than taking your car.

Image

All of these other modes of transport are, in theory, faster than a bike. But the bike will tend to get there first, not because it’s the fastest vehicle, but because it’s subject to the fewest constraints.

Cars, cabs, trains and buses move not at the top speed of the vehicle, but at the speed of the system.

And, of course, when we measure their journey speed at an average 9 mph, we don’t see them crawling along steadily at that pace.

“Travelling” in London is really mostly waiting. Waiting at junctions. Waiting at traffic lights. Waiting to turn. Waiting for the bus to pull out. Waiting on rail platforms. Waiting at tube stations. Waiting for the pedestrian to cross. Waiting for that van to unload.

Cyclists spend significantly less time waiting, and that makes them faster across town overall.

Similarly, development teams that can produce code much faster, but work in a system with real constraints – lots of waiting – will tend to be outperformed overall by teams who might produce code significantly slower, but who are less constrained – spend less time waiting.

What are developers waiting for? What are the traffic lights, junctions and pedestrian crossings in our work?

If I submit a Pull Request, I’m waiting for it to be reviewed. If I send my code for testing, I’m waiting for the results. If I don’t have SQL skills, and I need a new column in the database, I’m waiting for the DBA to add it for me. If I need someone on another team to make a change to their API, more waiting. If I pick up a feature request that needs clarifying, I’m waiting for the customer or the product owner to shed some light. If I need my manager to raise a request for a laptop, then that’s just yet more waiting.

Teams with handovers, sign-offs and other blocking activities in their development process will tend to be outperformed by teams who spend less time waiting, regardless of the raw coding power available to them.

Teams who treat activities like testing, code review, customer interaction and merging as “phases” in their process will tend to be outperformed by teams who do them continuously, regardless of how many LOC or tokens per minute they’re capable of generating.

This isn’t conjecture. The best available evidence is pretty clear. Teams who’ve addressed the bottlenecks in their system are getting there sooner – and in better shape – than teams who haven’t. With or without “AI”.

The teams who collaborate with customers every day – many times a day – outperform teams who have limited, infrequent access.

The teams who design, test, review, refactor and integrate continuously outperform teams who do them in phases.

The teams with wider skillsets outperform highly-specialised teams.

The teams working in cohesive and loosely-coupled enterprise architectures outperform teams working in distributed monoliths.

The teams with more autonomy outperform teams working in command-and-control hierarchies.

None of these things comes with your Claude Code plan. You can’t buy them. You can’t install them. But you can learn them.

And if you’re ticking none of those boxes, and you still think a code-generating supercar is going to make things better, I have a Bugatti Chiron Sport you might be interested in buying. Perfect for the school run!

The AI-Ready Software Developer #19 – Prompt-and-Fix

For over a billion years now, we’ve known that “code-and-fix” software development, where we write a whole bunch of code for a feature, or even for a whole release, and then check it for bugs, maintainability problems, security vulnerabilities and so on, is by far the most expensive and least effective approach to delivering production-ready software.

If I change one line of code and tests start failing, I’ve got a pretty good idea what broke it, and it’s a very small amount of work (or lost work) to fix it.

If I change 1,000 lines of code, and tests start failing… Well, we’re in a very different ballpark now. Figuring out what change(s) broke the software and then fixing them is a lot of work, and rolling back to the last known working version is a lot of work lost.

Also, checking a single change is likely to bring a lot more focus than checking 1,000. Hence my go-to meme for after-the-fact testing and code reviews:

Image

The usual end result of code-and-fix development is buggier, less maintainable software delivered much later and at a much higher cost.

And all things in traditional software development have their “AI”-assisted equivalents, of course.

I see developers offloading large tasks – whole features or even sets of features for a release – and then setting the agentic dogs loose on them while they go off to eat a sandwich or plan a holiday or get a spa treatment or whatever it is software developers do these days.

Then they come back after the agent has finished to “check” the results. I’ve even heard them say “Looks good to me” out loud as they skim hundreds or thousands of changes.

Time for the meme again:

Image

Now, there’s no doubting that “AI”-assisted coding tools have improved a lot in the last 6-12 months. But they’re still essentially LLMs wrapped in WHILE loops, with all the reliability we’ve come to expect.

Odds of it getting one change right? 80%, maybe, with a good wind behind it. Chances of it getting two right? 65%, perhaps.

Odds of it getting 100 changes right? Effectively zero.
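Those odds are just compounding probability. A minimal Python sketch, assuming (purely for illustration) an 80% per-change success rate and independent changes:

```python
# If each independent change succeeds with probability p,
# the chance of getting all n changes right is p ** n.
p = 0.80  # illustrative per-change success rate, not a measured figure

for n in (1, 2, 10, 100):
    print(f"{n:>3} changes: {p ** n:.2%} chance all are right")
```

Two changes at 80% each gives 64% – roughly the 65% above – and by 100 changes the probability has collapsed to around 2 in 10 billion. Real changes aren’t independent, of course, which can make matters better or (more often) worse.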

Sure, tests help. You gave it tests, right?

Guardrails can help, when the model actually pays attention to them.

External checking – linters and that sort of thing – can definitely help.

But, as anyone who’s spent enough time using these tools can tell you, no matter how we prompt or how we test or how we try to constrain the output, every additional problem we ask it to solve adds risk.

LLMs are unreliable narrators, and there’s really nothing we can do to get around that except to be skeptical of their output.

And then there are the “doom loops”, when the context goes outside the model’s data distribution, and even with infinite iterations, it just can’t do what we want it to do. It just can’t conjure up the code equivalent of “a wine glass full to the brim”.

Image

And the bigger the context – the more we ask for – the greater the risk of out-of-distribution behaviour, with each additional pertinent token collapsing the probability of matching the pattern even further. (Don’t believe me? Play one at chess and watch it go off that OOD cliff.)

So problems are very likely with this approach – which I’m calling “prompt-and-fix”, because I can – and finding them and fixing them, or backing out, is a bigger cost.

What I’ve seen most developers do is skim the changes and then wave the problems through into a release with a “LGTM”.

One more time:

Image

This creates a comforting temporary illusion of time saved, just like code-and-fix. But we’re storing up a lot more time that’s going to be lost later with production fires, bug fixes and high cost-of-change.

One of the most important lessons in software development is that what’s downstream of present you is upstream of future you – as Sandra Bullock and George Clooney discovered in Gravity.

The antidote to code-and-fix was defect prevention. We take smaller steps, testing and reviewing changes continuously, so most problems are caught long before finding, fixing or reverting them becomes expensive.

I have a meme for that, too:

Image

The equivalent in “AI”-assisted software development would be to work in small steps – one change at a time – and to test and review the code continuously after every step.

Sorry, folks. No time for that spa treatment! You’ll be keeping the “AI” on a very short leash – both hands on the wheel at all times, sort of thing.

The other benefit of small steps is that they’re much less likely to push the LLM out of its data distribution. Keeping the model in-distribution means screw-ups happen less often, and catching problems immediately means less work added or lost when things do go south. A WIN-WIN.
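The trade-off can be sketched with a toy simulation. The assumptions are deliberately crude and purely illustrative: a 20% per-change failure rate, checking after every change costs one change of rework per failure, and in the big batch everything built after the first bad change is suspect:

```python
import random

random.seed(0)
q = 0.2          # assumed per-change failure probability (illustrative)
n = 100          # changes in the batch
trials = 10_000

def rework_small_steps():
    # Check after every change: each bad change costs 1 change of rework.
    return sum(1 for _ in range(n) if random.random() < q)

def rework_big_batch():
    # Check only at the end: everything after the first bad change
    # was built on sand and may need revisiting.
    for i in range(n):
        if random.random() < q:
            return n - i
    return 0

small = sum(rework_small_steps() for _ in range(trials)) / trials
batch = sum(rework_big_batch() for _ in range(trials)) / trials
print(f"avg rework, small steps: {small:.1f} changes")
print(f"avg rework, big batch:   {batch:.1f} changes")
```

Under these toy assumptions the batch approach reworks most of the 100 changes on average, while small steps rework about 20 – the failures themselves, caught one at a time.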

I know that some of you will be reading this and thinking “But Claude can break a big problem down into smaller problems and tackle them one at a time, running the tests and linting the code and all that”.

Yes, in that mode, it certainly can. But every step it takes carries a real risk of taking it in the wrong direction. And direction, despite what some fans of the technology claim, isn’t an LLM’s strong suit. Remember, they don’t understand, they don’t reason, they don’t plan. They recursively match patterns in the input to patterns in the model and predict what token comes next.

Any sense that they’re thinking or reasoning or planning is a product of the Actual Intelligence they’re trained on. It may look plausible, but on closer inspection – and “closer inspection” is often the problem here – it’s usually riddled with “brown M&Ms”.

So, no, you can’t just walk away and let them get on with it. If they take a wrong turn, that error will likely compound through the rest of the processing.

Think of what happens in traditional software development when a misunderstanding or an incorrect assumption goes unchecked while we merrily build on top of that code.

The AI-Ready Software Developer #18 – “Productivity”. You Keep Using That Word.

It’s 20 years since I created a website with the banner “I Care About Software” as part of a loose “post-agile” movement that sought to step back from the tribes and factions that had grown to dominate software development at the time.

Regardless of whether we believed X, Y or Z was the “best way”, could we at least agree that the outcomes matter?

It matters if the software does what the user expects it to do. It matters if it does it reliably. It matters that it does it when they need it. It matters that when they need it to do something else, they don’t have to wait a year or three for us to bring them that change.

Unlike many other professions, and with few exceptions, we’re under no compulsion to produce useful, usable, reliable software or to be responsive to the customer’s needs. It’s largely voluntary.

We don’t usually get fined when we ship bugs. We won’t be sanctioned if the platform goes down for 24 hours. We won’t get struck off some professional register if the lead time on changes is months or years (or never).

(Of course, eventually, if we’re consistently bad, we can go out of business. But historically, another job – where we can screw up another business – hasn’t been difficult to find, even with a long trail of bodies behind us.)

And we don’t usually get a bonus for releases that go without incident, or a promotion for consistently maintaining short lead times.

In this sense, we have less incentive to do a good job than a takeaway delivery driver.

A friend once kindly introduced me to the project managers in her company to give them the old “better, sooner, for longer” pitch. I talked about teams I’d worked with who had built the capability to deliver and deliver and deliver, week after week, year after year, with no drama and no fires to put out.

They actually said the quiet part out loud: “But we get paid to put out the fires!”

For software developers, the carrot and the stick usually have very little to do with actual outcomes that customers and end users might care about. This is evidenced by the fact that so few teams keep even one eye on those outcomes.

The average development team doesn’t actually know how much of their time is spent fixing bugs instead of responding to user needs. They don’t know what their lead times are, or how they might be changing over the lifetime of the product or system. They’re often the last to know when the website’s down.

Most damning of all, the average development team has no idea what the users’ needs or the business goals of the product actually are. And that, you’d have thought, is where all the value we keep talking about really lies.

And so it’s entirely possible – inevitable, even – for the priorities of dev teams and of the people paying for and using the software to become very misaligned.

I’m always struck by the chasm that can grow between them, with developers genuinely believing they’re doing a great job while users just roll their eyes. You’d be surprised how often teams are blissfully unaware of how dissatisfied their customers are.

So, before you start that 2-year REPLACE ALL THE THINGS WITH RUST project, stop to ask yourselves “What impact would this have on overall outcomes?”

If your goal is to make your software more memory-safe, are there other ways that might be less radical or disruptive? (You might be surprised what you can do with static analysis, for example.)

Is it possible to do it a bit at a time, under the radar, to minimise the impact on customer-perceived value?

Will it really solve any problem the business actually has at all? I’m a fan of asking what the intended business outcomes are. You’d be amazed how often technical initiatives explode on contact with that question.

Which brings me to the topic du jour. The Gorman Paradox asks why, if “AI” coding assistants are having the profound impact on development team productivity many report – 2x, 5x, 10x, 100x (!) – we see no sign of it in the app stores, on business bottom lines, or in the wider economy. Where’s all this extra productivity going?

I also have to ask why the reports of productivity gains using “AI” vary so widely, with anecdotal reports of increases in excess of 1000%, and measured variances in the range of -20% to +20%.

The words doing all the work here are “anecdotal” and “measured”, I suspect. But the difference also lies in precisely what is being measured.

Optimistic findings are usually based on measurements of things the customer doesn’t care about – lines of code, commits, Pull Requests, etc.

The pessimistic – or certainly less sensational – findings are usually based on measurements of things the customer does care about, like lead times, reliability and overall costs.

It’s well-understood why producing more code faster – faster than we can understand it and test it – tends to overwhelm the real bottlenecks in the software development process. So there’s no great mystery about how “AI” code generation can actually reduce overall system performance.

What has been mysterious is why some teams see it, and most teams don’t.

They attach a code-generating firehose to their process and can’t understand why the business is complaining that they’re not getting the power shower they were promised.

There is a candidate for a causal mechanism. Most teams don’t see the impact on systemic outcomes because they’re not looking.

So when a developer tells you that, say, Claude Code has made them 10x more productive, they’re not lying. (Well, okay, maybe some of them are.) They just have a very different understanding of what “productivity” means.

If we’re to survive as professionals in this “age of AI”, I recommend pinning your flag to the mast of user needs and business outcomes.

Most importantly, we should be measuring our success by the business goals of the software, or the feature, or the change. If the goal is to, say, increase our share of the vegan takeaway market, the ultimate test is whether in reality we actually do.

This is the ultimate definition of “Done”.

We claim to develop software iteratively, but that implies we’re iterating towards some goal. If iterations don’t converge, we get (literal) chaos – just a random walk through solution space. Which would be a sadly accurate summary of the majority of efforts, with most teams unable to articulate what the goals actually are. If, indeed, there are any.