For over a billion years now, we’ve known that “code-and-fix” software development, where we write a whole bunch of code for a feature, or even for a whole release, and then check it for bugs, maintainability problems, security vulnerabilities and so on, is by far the most expensive and least effective approach to delivering production-ready software.
If I change one line of code and tests start failing, I’ve got a pretty good idea what broke it, and it’s a very small amount of work (or lost work) to fix it.
If I change 1,000 lines of code, and tests start failing… Well, we’re in a very different ballpark now. Figuring out what change(s) broke the software and then fixing them is a lot of work, and rolling back to the last known working version is a lot of work lost.
Also, a single change is likely to get a lot more focused attention than 1,000 changes checked at once. Hence my go-to meme for after-the-fact testing and code reviews:

The usual end result of code-and-fix development is buggier, less maintainable software delivered much later and at a much higher cost.
And all things in traditional software development have their “AI”-assisted equivalents, of course.
I see developers offloading large tasks – whole features or even sets of features for a release – and then setting the agentic dogs loose on them while they go off to eat a sandwich or plan a holiday or get a spa treatment or whatever it is software developers do these days.
Then they come back after the agent has finished to “check” the results. I’ve even heard them say “Looks good to me” out loud as they skim hundreds or thousands of changes.
Time for the meme again:

Now, there’s no doubting that “AI”-assisted coding tools have improved a lot in the last 6-12 months. But they’re still essentially LLMs wrapped in WHILE loops, with all the reliability we’ve come to expect.
Odds of it getting one change right? 80%, maybe, with a good wind behind it. Chances of it getting two right? 65%, perhaps.
Odds of it getting 100 changes right? Effectively zero.
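To show where numbers like these come from, here is a back-of-the-envelope sketch. It assumes every change succeeds independently with the same 80% probability, which is a simplification for illustration and not a measurement of any real tool. The function name `chance_all_correct` is made up for this example.

```python
def chance_all_correct(per_change_success: float, changes: int) -> float:
    """Probability that every one of `changes` changes is right.

    Assumes (for illustration only) that changes succeed independently
    and with the same probability, so the chances simply multiply.
    """
    if not 0.0 <= per_change_success <= 1.0:
        raise ValueError("per_change_success must be between 0 and 1")
    if changes < 0:
        raise ValueError("changes must not be negative")
    return per_change_success ** changes


# The 0.8 figure is the guess from the text, not a benchmark result.
for n in (1, 2, 10, 100):
    print(f"{n:>3} changes: {chance_all_correct(0.8, n):.10f}")
```

Under that assumption, two changes come out at 64%, which is where the "65%, perhaps" lands, and 100 changes come out at roughly two in ten billion.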
Sure, tests help. You gave it tests, right?
Guardrails can help, when the model actually pays attention to them.
External checking – linters and that sort of thing – can definitely help.
But, as anyone who’s spent enough time using these tools can tell you, no matter how we prompt or how we test or how we try to constrain the output, every additional problem we ask it to solve adds risk.
LLMs are unreliable narrators, and there’s really nothing we can do to get around that except to be skeptical of their output.
And then there are the “doom loops”, when the context goes outside the model’s data distribution, and even with infinite iterations, it just can’t do what we want it to do. It just can’t conjure up the code equivalent of “a wine glass full to the brim”.

And the bigger the context – the more we ask for – the greater the risk of out-of-distribution behaviour, with each additional pertinent token collapsing the probability of matching the pattern even further. (Don’t believe me? Play one at chess and watch it go off that OOD cliff.)
So problems are very likely with this approach – which I’m calling “prompt-and-fix”, because I can – and finding and fixing them, or backing out, costs more.
What I’ve seen most developers do is skim the changes and then wave the problems through into a release with an “LGTM”.
One more time:

This creates a comforting temporary illusion of time saved, just like code-and-fix. But we’re storing up a lot more time that’s going to be lost later with production fires, bug fixes and high cost-of-change.
One of the most important lessons in software development is that what’s downstream of present you is upstream of future you – as Sandra Bullock and George Clooney discovered in Gravity.
The antidote to code-and-fix was defect prevention. We take smaller steps, testing and reviewing changes continuously, so most problems are caught long before finding, fixing or reverting them becomes expensive.
I have a meme for that, too:

The equivalent in “AI”-assisted software development would be to work in small steps – one change at a time – and to test and review the code continuously after every step.
Sorry, folks. No time for that spa treatment! You’ll be keeping the “AI” on a very short leash – both hands on the wheel at all times, sort of thing.
The other benefit of small steps is that they’re much less likely to push the LLM out of its data distribution. Keeping the model in-distribution more of the time means screw-ups happen less often, and catching problems immediately means less work added or lost when things do go south. That’s a WIN-WIN.
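The small-step discipline described above might be sketched like this. It is only an illustration of the shape of the loop, not any real tool’s API: every name here (`work_in_small_steps`, `apply_step`, `checks_pass`, `human_approves`, `keep`, `undo`) is hypothetical, and in practice the callbacks would wrap whatever you actually use to prompt the model, run tests and linters, and commit or revert.

```python
from typing import Callable, Iterable, List


def work_in_small_steps(
    steps: Iterable[str],
    apply_step: Callable[[str], None],      # e.g. prompt for ONE change
    checks_pass: Callable[[], bool],        # e.g. run tests and linters
    human_approves: Callable[[str], bool],  # e.g. you actually read the diff
    keep: Callable[[str], None],            # e.g. commit the change
    undo: Callable[[str], None],            # e.g. revert to last good state
) -> List[str]:
    """Apply one change at a time, checking and reviewing after each."""
    kept: List[str] = []
    for step in steps:
        apply_step(step)
        if checks_pass() and human_approves(step):
            keep(step)
            kept.append(step)
        else:
            undo(step)
            break  # stop and rethink rather than build on a bad step
    return kept


# Tiny in-memory demo with fake callbacks standing in for real tooling.
codebase: List[str] = []
kept = work_in_small_steps(
    steps=["rename variable", "extract function", "broken refactor", "add feature"],
    apply_step=codebase.append,
    checks_pass=lambda: "broken" not in codebase[-1],
    human_approves=lambda step: True,
    keep=lambda step: None,
    undo=codebase.remove,
)
print(kept)
```

The point of the design is that a failure only ever costs one step: the bad change is undone straight away, and nothing gets built on top of it.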
I know that some of you will be reading this and thinking “But Claude can break a big problem down into smaller problems and tackle them one at a time, running the tests and linting the code and all that”.
Yes, in that mode, it certainly can. But every step it takes carries a real risk of taking it in the wrong direction. And direction, despite what some fans of the technology claim, isn’t an LLM’s strong suit. Remember, they don’t understand, they don’t reason, they don’t plan. They recursively match patterns in the input to patterns in the model and predict what token comes next.
Any sense that they’re thinking or reasoning or planning is a product of the Actual Intelligence they’re trained on. It may look plausible, but on closer inspection – and “closer inspection” is often the problem here – it’s usually riddled with “brown M&Ms”.
So, no, you can’t just walk away and let them get on with it. If they take a wrong turn, that error will likely compound through the rest of the processing.
Think of what happens in traditional software development when a misunderstanding or an incorrect assumption goes unchecked while we merrily build on top of that code.




