Engineering Leaders: Your AI Adoption Doesn’t Start With AI

In the past few months, I’ve been hearing from more and more teams that the use of AI coding tools is being strongly encouraged in their organisations.

I’ve also been hearing that this mandate often comes with high expectations about the productivity gains leaders expect this technology to bring. But this narrative is rapidly giving way to frustration when these gains fail to materialise.

The best data we have shows that a minority of development teams are reporting modest gains – in the order of 5%-15% – in outcomes like delivery lead times and throughput. The rest appear to be experiencing negative impacts, with lead times growing and the stability of releases getting worse.

The 2025 DevOps Research & Assessment State of AI-assisted Software Development report makes it clear that the teams reporting gains were already high-performing or elite by DORA’s classification, releasing frequently, with short lead times and with far fewer fires in production to put out.

As the report puts it, this is not about tools or technology – and certainly not about AI. It’s about the engineering capability of the team and the surrounding organisation.

It’s about the system.

Teams who design, test, review, refactor, merge and release in bigger batches are overwhelmed by what DORA describes as “downstream chaos” when AI code generation makes those batches even bigger. Queues and delays get longer, and more problems leak into releases.

Teams who design, test, review, refactor, merge and release continuously in small batches tend to get a boost from AI.

In this respect, the team’s ranking within those DORA performance classifications is a reasonably good predictor of the impact on outcomes when AI coding assistants are introduced.

The DORA website helpfully has a “quick check” diagnostic questionnaire that can give you a sense of where your team sits in their performance bands.

[Screenshot: the DORA quick check questionnaire]

(Answer as accurately as you can. Perception and aspiration aren’t capability.)

The overall result is usefully colour-coded. Red is bad, blue is good. Average is Meh. Yep, Meh is a colour.

[Screenshot: colour-coded DORA quick check results]

If your team’s overall performance is in the purple or red, AI code generation’s likely to make things worse.

If your team’s performance is comfortably in the blue, they may well get a little boost. (You can abandon any hopes of 2x, 5x or 10x productivity gains. At the level of team outcomes, that’s pure fiction.)

The upshot of all this is that before you even think about attaching a code-generating firehose to your development process, you need to make sure the team’s already performing at a blue level.

If they’re not, then they’ll need to shrink their batch sizes – take smaller steps, basically – and accelerate their design, test, review, refactor and merge feedback loops.

Before you adopt AI, you need to be AI-ready.

Many teams go in the opposite direction, tackling whole features in a single step – specifying everything, letting the AI generate all the code, testing it after-the-fact, reviewing the code in larger change-sets (“LGTM”), doing large-scale refactorings using AI, and integrating the whole shebang in one big bucketful of changes.

Heavy AI users like Microsoft and Amazon Web Services have kindly been giving us a large-scale demonstration of where that leads – more bugs, more outages, and significant reputational damage.

A smaller percentage of teams are learning that what worked well before AI works even better with it. Micro-iterative practices like Test-Driven Development, Continuous Integration, Continuous Inspection, and real refactoring (one small change at a time) are not just compatible with AI-assisted development, they’re essential for avoiding the “downstream chaos” DORA finds in the purple-to-red teams.

And while many focus on the automation aspects of Continuous Delivery – and a lot of automation is required to accelerate the feedback loops – by far the biggest barrier to pushing teams into the blue is skills.

Yes. SKILLS.

Skills that most developers, regardless of their level of experience, don’t have. The vast majority of developers have never even seen practices like TDD, refactoring and CI being performed for real.

That’s partly because real practitioners are pretty rare, so most developers are unlikely to bump into one. But much of it is down to these practices’ famously steep learning curves. TDD, for example, takes months of regular practice before you can use it on real production systems.

And, as someone who’s been practising TDD and teaching it for more than 25 years, I know it requires ongoing mindful practice to maintain the habits that make it work. Use it or lose it!

An experienced guide can be incredibly valuable in that journey. It’s unrealistic to expect developers new to these practices to figure it all out for themselves.

Maybe you’re lucky enough to have some of the 1% of software developers – yes, it really is that few – who can actually do this stuff for real. Or even one of the 0.1% who has had a lot of experience helping developers learn it. (Just because they can do it, it doesn’t necessarily follow that they can teach it.)

This is why companies like mine exist. With high-quality training and mentoring from someone who not only has many thousands of hours of practice, but also thousands of hours of experience teaching these skills, the journey can be rapidly accelerated.

I made all the mistakes so that you don’t have to.

And now for the good news: when you build this development capability, release cycles and lead times speed up – while reliability actually improves – whether you’re using AI or not.

The AI-Ready Software Developer #21 – Stuck In A “Doom Loop”? Drop A Gear


I find it helpful to visualise agentic workflows as sequences of dice throws. Take the now-popular “Ralph Wiggum loop”. You want 7. The agent throws the dice, and it’s 5. That fails your test for 7, so the agent reverts the code, flushes the context, and tries again, repeating this cycle until it throws the 7 you wanted.

Different numbers have different probabilities, and will therefore take more or fewer throws on average to achieve. Throwing 7 takes about 6 throws on average (probability 6/36). Throwing 2 or 12 takes about 36 (probability 1/36). It all depends on the probability distribution of the sum of two six-sided dice.
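The arithmetic is easy to sanity-check with a quick simulation (a toy, obviously – the function name and trial count are mine, not part of any agent framework):

```python
import random

def throws_until(target, trials=20_000):
    """Average number of throws of two six-sided dice until a target sum appears."""
    total_throws = 0
    for _ in range(trials):
        while True:
            total_throws += 1
            if random.randint(1, 6) + random.randint(1, 6) == target:
                break
    return total_throws / trials

# A sum with probability p takes 1/p throws on average (geometric distribution):
# 7 has p = 6/36, so ~6 throws; 2 and 12 have p = 1/36, so ~36 throws.
print(throws_until(7), throws_until(2))
```

The same geometric logic applies to a Ralph Wiggum loop: the less probable the pattern you’re asking for, the more iterations – and tokens – you should expect to burn.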

[Chart: probability distribution of the sum of two six-sided dice]

The outputs of Large Language Models also depend on probability distributions, and some outputs will take more attempts to achieve than others, depending on the input and on the distribution of data “learned” by the model.

A toy example of the distribution of probabilities of next-token predictions by an LLM. Note how there’s a very obvious choice because that exact token sequence appears often in the training data. The model’s prediction will be confident.

On average, a more complex context – one that asks the model to match on more complex patterns – will tend to require more improbable pattern matches and produce less confident predictions. This is “hallucination” territory. Little Ralph goes round and round in circles trying in vain to throw 13.

I call these “doom loops”.

An adapted agentic workflow recognises when a step has gone outside the model’s distribution (e.g., by the number of failed iterations) and “drops a gear” – meaning it switches back into read-only planning mode and asks the model to break that step down into simpler contributing steps that are less improbable – more in-distribution.
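As a sketch, that gear-dropping loop might look like this – the agent interface (generate, passes_test, revert_and_flush, decompose) is entirely hypothetical, standing in for whatever tool calls your framework actually makes:

```python
MAX_ATTEMPTS = 5   # assumed threshold for "this step is probably out-of-distribution"
MAX_DEPTH = 3      # give up rather than decompose forever

def solve(step, agent, depth=0):
    for _ in range(MAX_ATTEMPTS):
        candidate = agent.generate(step)        # throw the dice
        if agent.passes_test(step, candidate):  # did we get our 7?
            return candidate
        agent.revert_and_flush()                # discard the code and the context
    if depth >= MAX_DEPTH:
        raise RuntimeError(f"Step may simply not be in the data: {step}")
    # Drop a gear: back into read-only planning mode, break the step into
    # simpler, more in-distribution sub-steps, and tackle those instead.
    return [solve(sub_step, agent, depth + 1) for sub_step in agent.decompose(step)]
```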

(It’s a phrase I’ve been using informally for years in pairing sessions when a problem’s “gradient” turns out to be steeper than we can handle in a single step. It harks back to an idea in Kent Beck’s book Test-Driven Development By Example, where he talks about how we might want the teeth on a pulley to be smaller the heavier the load. The fiddlier the problem, the smaller the steps.)


If you think about it, this is exactly what “multi-step reasoning” is – tackling a more complex problem as a sequence of simpler problems. You can try this experiment to see the impact it has on the accuracy of outputs:

If you’re familiar with the Mars Rover programming exercise, where the surface of Mars is a 2D grid and you can drive a rover over it by sending it sequences of commands (F – forward, B – back, L – left, R – right), you can task a model like GPT-5.x with solving problems like “Given the rover is at (5,9) facing North and it’s told to go ‘RFFRBBBLB’, what is its final position and direction?”


In one interaction, instruct the model additionally “Don’t show your working. Respond with the final answer only.” Odds are it’ll get it wrong every time.

Then let it “reason”. It’ll likely get it right first or second time, and it does this by breaking the pattern down:

“R turns the rover clockwise, so it’s at (5,9) facing East.

FF moves the rover forward 2 squares in the +x direction, so it’s at (7,9) facing East.”

etc etc.
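The rover itself is trivial to simulate, which is what makes the exercise a nice probe – you can check the model’s answer mechanically. A minimal simulator, assuming North is +y and B moves opposite to the current facing (conventions the exercise leaves open):

```python
# Mars Rover on a 2D grid. Conventions assumed: North = +y, East = +x,
# and B moves one square opposite to the current facing.
HEADINGS = ["N", "E", "S", "W"]
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def run(x, y, facing, commands):
    h = HEADINGS.index(facing)
    for c in commands:
        if c == "R":
            h = (h + 1) % 4               # clockwise quarter-turn
        elif c == "L":
            h = (h - 1) % 4               # anticlockwise quarter-turn
        else:
            dx, dy = MOVES[HEADINGS[h]]
            sign = 1 if c == "F" else -1  # B reverses
            x, y = x + sign * dx, y + sign * dy
    return x, y, HEADINGS[h]

print(run(5, 9, "N", "RFFRBBBLB"))  # under these conventions: (6, 12, 'E')
```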

When little Ralph tries to throw 13, the model just can’t do it. It’s out-of-distribution. When the agent drops a gear, it might instead try to throw a 6 and then a 7, both of which are comfortably in-distribution.

A final note: sometimes a problem just isn’t in the data. It doesn’t matter how many times we throw the dice, or how we break the problem down – the model simply can’t do it. Be ready to step in.

Read the full series here

A New DORA Performance Level – Catastrophically Bad

DORA (DevOps Research & Assessment) has 4 broad levels for dev team performance: Elite, High, Medium, and Low.

A Low-Performing team deploys less than once a month, has lead times for changes > 1 month, sees as many as half their deployments go boom, and takes more than a week to fix them.

An Elite team deploys changes multiple times a day, has lead times typically of < 1 day, failure rates < 15% (fewer than about 1 in 7 deployments go boom), and can fix failed releases in under an hour.
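For illustration only, here are those two extremes as a crude classifier – thresholds paraphrased from the bands described above, not an official DORA tool (real DORA classification is survey-based):

```python
def dora_extreme(deploys_per_month, lead_time_days, failure_rate, mttr_hours):
    """Crude sketch of the two extremes of DORA's performance bands."""
    # Elite: multiple deploys a day, < 1 day lead time, < 15% failures, < 1h to fix.
    if (deploys_per_month >= 60 and lead_time_days < 1
            and failure_rate < 0.15 and mttr_hours < 1):
        return "Elite"
    # Low: < monthly deploys, > 1 month lead times, ~half of deploys failing,
    # or more than a week to recover.
    if (deploys_per_month < 1 or lead_time_days > 30
            or failure_rate >= 0.5 or mttr_hours > 7 * 24):
        return "Low"
    return "Somewhere in between"
```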

I picture a Low-Performing team walking a tightrope between two mountain peaks. It’s a long way to safety (working, shippable code), a long way down if they fall, and a long climb back up to try again.

I picture an Elite team as walking the same length tightrope, tied to wooden posts a few feet apart, just 3 feet off the ground. Safety’s never far away, and a fall’s no big deal. They can quickly recover.

I would like to propose a 5th performance level, one that I’ve seen for real more than once: Catastrophically Bad.

At this level, the tightrope has snapped.

I’ve seen teams stuck in a death spiral where no changes can be deployed because every release goes boom. One example was a financial services company here in London who’d been running themselves ragged trying to stabilise a release for almost a year.

Every deployment had to be rolled back, and every deployment cost them high six-figures in client compensation, not to mention loss of reputation.

You know the drill: testing was done 100% manually and took weeks. While the devs were fixing the bugs testing found, they were introducing all-new bugs (and reintroducing a few old favourites).

A change failure rate of 100% and a lead time of – effectively – infinity.

How does a Catastrophically Bad development process that delivers nothing turn into at the very least a Low-Performing process that delivers something? (And yes, this is the mythical “hyper-productivity” the Scrum folks told you about – and no, Scrum isn’t the answer).

What we did with my client started with 2 fundamental changes:

  • Fast-running automated “smoke” tests, selected by analysing which features broke most often
  • CI traffic lights – a wrapper around version control that forced check-ins to happen in single file, and blocked both check-ins and check-outs whenever the light wasn’t “green” (green meaning: no check-in in progress and the build working).
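A toy sketch of that traffic-light idea – the lock file and the build command here are illustrative stand-ins, not what we actually built:

```python
import os
import subprocess
import sys

LOCK = ".ci-lock"  # stand-in for a shared check-in token

def light():
    """Green = no check-in currently in progress."""
    return "red" if os.path.exists(LOCK) else "green"

def check_in(build_cmd=("make", "smoke-test")):
    if light() != "green":
        sys.exit("Light is red: wait, or fix the broken build first.")
    open(LOCK, "w").close()                      # light goes red
    if subprocess.run(build_cmd).returncode != 0:
        sys.exit("Build broken: light stays red until it's fixed.")  # lock kept
    subprocess.run(["git", "push"], check=True)
    os.remove(LOCK)                              # light back to green
```

In reality the token has to live somewhere shared, but the principle is the same: check-ins go in single file, and nothing moves while the light’s red.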

It took 12 weeks to go from Catastrophically Bad to Medium-Performing. (Pro tip for new leaders and process improvement consultants – poorly-performing teams are a gift because they have such low-hanging fruit).

You can build from here. In this case, by showing teams how to safely change legacy code in ways that add more fast-running regression tests and gradually simplify and modularise the parts of the code that are changing the most.

(The title image is taken from a Catastrophically Bad agentic “team”)

Is AI Killing Software Engineering Jobs?

One claim I see repeated widely on social media is that LLM-based coding tools are taking software engineering jobs. These are often accompanied by extrapolations that soon – any time now – the profession of software engineering will be 80%, 90%, even 100% done.

The evidence put forward to support these claims is the large number of tech layoffs and the decimation of the jobs market in recent years. The narrative typically starts with the launch of ChatGPT, and uses a little chronological sleight of hand to make the start of the downturn coincide. A sort of AI “Event One”, for the Whovians among us.

I also see a lot of stress, angst and worry in response to these claims, quite understandably.

But are the claims true?

Let’s look at the actual order of events, shall we? First of all, ChatGPT was launched on November 30th 2022. Before that, most of us had never heard of it.

So when did the downturn actually start?

[Chart: software engineering job postings over time]

The inflection point occurred around mid-May of that year. So whatever Event One was that triggered the downturn, it wasn’t ChatGPT. Had you even heard of Large Language Models in May 2022?

What did happen around that time that might have caused businesses to reconsider their headcount?

Note the massive ramp-up in job postings in the preceding two years. Businesses were hiring software engineers like they were going out of fashion.

How could they afford these huge increases in headcount? It’s simple. Borrowing was cheap. Not just cheap, but effectively free. At the start of the pandemic, central banks dropped their base interest rates to roughly zero to encourage borrowing and stimulate their locked-down economies. In particular, to stimulate investment in hiring.

And it evidently worked where software engineers were concerned. This coincided with a period of time when businesses had to adapt to a remote-first reality for employees and customers, and were discovering that their “digital transformations” had left them not quite as digital as they might have hoped. Programming skills were selling like hotcakes as they frantically tried to plug the gaps in their operations.

Then around February 2022, inflation kicked in – inevitably, when you pump free money into an economy – and interest rates started to go up again, quite sharply.

[Chart: central bank base interest rates]

It takes a while for an oil tanker to turn around, but turn it did in May – a complete 180°.

By the time ChatGPT was launched, the trend was already more than half played-out. Hiring dropped to pre-pandemic levels and then some, but has actually been growing again over the last year or so.

Job postings for software engineers are up about 5% on this time last year, and could be set to recover fully to pre-pandemic levels. Let’s wait and see.

“But, Jason, the CEO of AcmeAI said they were replacing devs with agents.”

First of all, what is the CEO of AcmeAI selling? And which sounds better to investors: “Oops, we over-hired!” or “We’re cutting edge with this AI stuff!”

As always, don’t listen to what these people say. Watch what they do. All of the businesses who’ve claimed to be replacing engineers with “AI” are actually hiring engineers, and in large numbers. Sure, not as large as in 2021-22, but they’re definitely planning on using human engineers for the foreseeable future. Just perhaps cheaper ones, maybe offshore.

As for the tumbleweed on job postings Main Street, I know a lot of software folks, and I’ve heard from many of them how – when this downturn was in full swing – they decided to stay put longer than they normally might have in a healthier market.

Without that usual background level of churn, open positions become much rarer. There are fewer dead man’s shoes to fill. It creates a self-reinforcing feedback loop. The fewer the live job postings, the more fear, the more people stay put, the fewer the live job postings.

Mix that with a narrative that says “It wuz AI wot done it, guv’nor!” and you get… well, LinkedIn, basically. And thus, the illusion of a disappearing profession is created.

But in terms of actual jobs – as opposed to job openings – the professional developer population was actually growing during this same period, albeit an ageing population.

I saw this after the dotcom bubble burst in 2000, after the financial crash in 2008, and now, I strongly suspect, after the post-pandemic inflation crisis of 2022. Money ain’t what it used to be. It’s a boom-and-bust cycle that comes around roughly once a decade. Our profession probably needs to get better at handling it.

This will no doubt be cold comfort to the many excellent software professionals who just happened to find themselves without a chair when the music stopped. It genuinely is completely random. Give it a few years, and the industry may well come to regret letting those people leave the game.

Just as they’ll very likely regret kicking away the on-ramp to the profession at the same time.

I have little sympathy for the folks spreading this kind of fear, uncertainty and doubt, based on made-up and distorted facts. You’re causing distress and making things worse for a lot of good people. And based on what?

Only time will tell how this is going to pan out, but there are plenty of good reasons to believe that demand for software engineers will recover like it has every time before.


ADDENDUM: One way that “AI” might kill software engineering jobs, of course, is when the bubble inevitably bursts and trillions of dollars are wiped off the valuations of AI companies and their hardware, Cloud and other services suppliers. But the economic, social and geopolitical ramifications of that are likely to reach far beyond our profession.

The Double-Edged Sword of Automation

Here’s a fun fact: my business has a fax number. No, really.

Nobody’s faxed me for more than 20 years, but I still have a fax number. I pay a few quid each month for a fax-to-email service just so I can have an active fax number, just in case.

Why do I still have a fax number? Because when some of my corporate clients sign my company up as a supplier, their system requires a fax number, and I’m scared that, one day, one of them will actually try to fax me.

How can it be, in 2026, that business systems still require a fax number? It’s arcane. Well, this is the thing – most business systems are arcane. Most business systems are legacy systems. They’re hard and risky to change without breaking business processes, so businesses tend not to change them unless they really, really have to. (A big part of my training and coaching is teaching teams how to slow down and even reverse software aging, by the way.)

This is the double-edged sword of automation; computerising business processes can make our work easier, but it can make changing how we work much harder. Automation bakes in workflows.

The other problem automation often brings is the edge cases the automators – that’s us, by the way – didn’t think of when we were doing the automating. Daily life in this digital world can involve a lot of “falling through the cracks” of computer systems that were designed years ago – often by people with no first-hand experience of the business processes involved – for an ideal, cracks-free world that doesn’t exist anymore (and very probably never did).

Savvy business leaders are well-aware of the limitations of automation, and for decades have left space in their operations for human judgement and human adaptation.

But humans take up space and you have to feed them and sign cards when they leave, so CEOs thought all their birthdays had come at once when someone claimed they’d invented a chatbot that could reason and adapt like a human. “We can fill the gaps with that!”

Three years later, most people have cottoned on to the fact that if there are two things the chatbot can’t do, it’s reasoning and adapting. It’s just more automation – only this time, not predictable or reliable enough for many use cases.

And now the fashion is for the automators to automate their automating with these language-predicting automatons. And the risk is that we’ll walk into exactly the same trap.

To be fair to the chatbots, development teams baking in their processes with automation is nothing new. I’ve lost count of the times I’ve been coaching pairs and they’ve been unable to change how they’re working – and that’s why I’m there, in case you were wondering – because the “DevOps team” (an oxymoron if ever there was one) has baked the integration process into an automated, one-size-fits-all workflow.

Let’s take TDD as an example. Test-Driven Development is being cited more and more as an essential part of an “AI”-assisted development workflow. (Although it looks like Google Search didn’t get that memo.)

Search trends reveal no renewed interest in Test-Driven Development in recent years

I, of course, agree. Prompting with tests helps pin down the meaning of requirements, producing more accurate pattern-matches from the model. And the rapid feedback loops of TDD, with continuous testing, code review and refactoring, tightly-coupled to version control and continuous integration, can minimise the risks of having the world’s most well-read idiot write the code for us.

I know many of us have attempted to completely automate the TDD workflow for agents so we can feed in our test examples (you may have heard these referred to as “specifications”), hit Play, and go to the proverbial pub while the agent crunches through them in a red-green-refactor “Ralph Wiggum loop”.

I’ve yet to see an attempt that isn’t clunky at best – running into walls and falling over often – and downright hilarious in some cases. Please, if you’re going to try to automate TDD, at least take 5 minutes to find out what it actually is.

I know the TDD workflow very, very well. I’ve performed that red-green-refactor loop hundreds of thousands of times over many years, I’ve taught it to thousands of developers, and I’ve watched thousands of developers doing it (with me saying “Now run your tests!” thousands of times – it could be a t-shirt).

I’ve thought very deeply about it. Talked to many, many practitioners about it. Experimented in many, many ways with it.

And, importantly, made numerous attempts to precisely codify it. But no matter how precise and detailed my TDD state machines are, I keep running into the double-edged sword of automation. It gets especially fiddly trying to coordinate concurrent TDD loops working on the same code base.

The TDD process unavoidably requires space for human judgement and for adaptation to outlier scenarios. It requires Actual Intelligence. So it always comes back to the simplest workflow: red -> green -> refactor -> rinse & repeat, leaving space for me to apply judgement and to adapt after every step. (It also prevents the code running ahead of my understanding of it.)
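That simplest workflow, expressed as a loop where the human checkpoints are deliberately left unautomated – the step callables here are placeholders for whatever mix of typing and prompting you use:

```python
def tdd_loop(write_failing_test, make_it_pass, refactor, run_tests, more_tests_needed):
    """Red -> green -> refactor, rinse & repeat, with judgement applied at each step."""
    while more_tests_needed():  # human judgement: are we done yet?
        write_failing_test()
        assert not run_tests(), "new test should fail first (red)"
        make_it_pass()
        assert run_tests(), "all tests should pass now (green)"
        refactor()
        assert run_tests(), "refactoring must keep the tests green"
```

The asserts encode the discipline; the judgement calls stay with the human.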

[Diagram: the red-green-refactor loop]

While I have fun experimenting with automated “agentic” workflows, when it matters, I keep myself very much in the loop. Not just because stochastic parrot, but also because I will never be able to think of every eventuality.

The closer we try to get to 100% automation – trying to plan for every possible scenario – the more the cost skyrockets. 100% is an asymptote – an anti-gravity wall that pushes us back the closer we get. Attempting to reach it is a fool’s errand.

And the closer we get to 100% automation, the more we bake ourselves into a workflow that leaves us less and less room for manoeuvre.

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that comes up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.

With, say, a Large Language Model, the exact same input fed to the exact same model – their internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.

I’ve been using small-scale, closed-loop experiments – once it’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scoring on how many tests passed).

But these are small-scale. I might run them 10x, because I simply can’t afford the tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.
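The arithmetic bears that out. A normal-approximation (Wald) confidence interval – a deliberately crude choice – shows just how uncertain a 10-run pass rate really is:

```python
import math

def pass_rate_interval(passed, runs, z=1.96):
    """Approximate 95% confidence interval for a pass rate (Wald interval)."""
    p = passed / runs
    margin = z * math.sqrt(p * (1 - p) / runs)
    return max(0.0, p - margin), min(1.0, p + margin)

# 7 passes out of 10: the true pass rate could plausibly be anywhere
# from roughly 42% to 98% - far too wide to declare "this prompt is better".
low, high = pass_rate_interval(7, 10)
print(round(low, 2), round(high, 2))
```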

And that’s why I’ve leaned more on related large-scale peer-reviewed studies – the same phenomena, but in different problem domains – and on the underlying science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fuelled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

71% of Developers and Engineering Leaders Believe “AI” Makes Engineering Discipline More Important

I ran the same poll on LinkedIn and Mastodon, asking:

In your estimation, does AI-assisted and agentic coding make engineering discipline:

  • More important than before?
  • As important as before?
  • Less important than before?
  • Don’t know

464 people responded, and the final votes tell an interesting story.

[Chart: poll results]

More than two-thirds of developers and engineering leaders believe that engineering discipline is more important when we’re using “AI”.

And a whopping 94% believe it’s at least as important as before.

This directly contradicts the narrative that “software engineering is dead” in this age of “AI” coding assistants and agents. And the evidence bears that out. “AI” is an amplifier of engineering capability, not a magic wand that fixes bottlenecks, blockers and quality leaks. Unless you address them (and that’s mostly a skills thing – it doesn’t come with your Claude Code plan), it just makes them worse.

Teams using Claude Code, Cursor, Copilot and other LLM-based code generating tools experience positive benefits – in terms of outcomes like lead times and release stability – if they’re applying good technical practices.

Teams that don’t apply enough engineering discipline experience negative impact on those outcomes. They deliver less reliable software that costs more to change, and they deliver it later.

And those teams are sadly in the majority, according to the DORA data.

So the beliefs seem to match the reality – engineering discipline is undeniably more important when we’re drinking from a code-generating firehose.

The mystery that remains is why, given we mostly all seem to agree on this, we see no signs of increased investment in engineering capability. Demand for training hasn’t been rising (and I would know if it had) – and that includes training in “AI-assisted” development practices.

Teams appear to have been left to figure it out for themselves by a process of trial and error.

At the same time, many developers now report feeling under pressure to deliver the claimed productivity gains of “AI” – spend 2 minutes on sites like LinkedIn and you’ll see some very high expectations being set – and the lack of support probably isn’t helping.

Practices like Specification By Example, Test-Driven Development, continuous inspection, refactoring, and continuous integration aren’t just compatible with “AI”-assisted workflows – they’re essential for them.

I’ve spent more than 25 years applying these practices successfully, and teaching them to thousands of developers face-to-face and countless more through online video tutorials, blog posts and social media.

And I’ve devoted a large chunk of the last three years exploring how they can benefit “AI”-assisted workflows. Check out my AI-Ready Software Developer blog series for the highlights.

For teams who want a hands-on introduction, my training courses now incorporate “AI”. Once you’ve learned these practices in your IDE, you’ll get a chance to apply them using tools like Claude Code and Cursor and see the difference they can make. This way, you can become more effective with and without “AI”.

It turns out that’s the key to succeeding with it.

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But that was apparent to me even at the time, after just a few dozen hours of experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

Image

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.
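The mechanism is easy to see in a toy simulation. The sketch below is entirely my own illustration, not the researchers’ actual set-up: each generation “trains” by fitting a Gaussian to the previous generation’s output and “generates” by sampling from that fit, with the long tail crudely underweighted. Watch the spread shrink generation after generation:

```python
import numpy as np

rng = np.random.default_rng(42)

def train_and_generate(data, n_samples, rng, tail_cutoff=2.0):
    """'Train' by fitting a Gaussian to the data, then 'generate' by
    sampling from the fit -- but drop the long tail, a crude stand-in
    for generated text clustering closer to the mean than its source."""
    mu, sigma = data.mean(), data.std()
    samples = rng.normal(mu, sigma, n_samples * 2)
    # Long-tail examples -- the novel and the niche -- get filtered out.
    kept = samples[np.abs(samples - mu) < tail_cutoff * sigma]
    return kept[:n_samples]

# Generation 0: diverse "human-created" data.
data = rng.normal(0.0, 1.0, 10_000)
print(f"gen 0: spread = {data.std():.3f}")

# Each generation trains only on the previous generation's output.
for gen in range(1, 6):
    data = train_and_generate(data, 10_000, rng)
    print(f"gen {gen}: spread = {data.std():.3f}")
```

The spread never recovers, because each generation’s output is measurably less diverse than what it was trained on. Don’t bet against entropy.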

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.

Image

As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).
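The quip lands because it’s not far from the truth. A minimal sketch of that loop – where `generate_patch`, `apply_patch` and `run_tests` are hypothetical stand-ins for the model call and the surrounding tooling, not any vendor’s actual API:

```python
def run_agent(task, generate_patch, apply_patch, run_tests, max_turns=20):
    """A coding 'agent' stripped to its skeleton: a bounded loop that
    asks the model for a change, applies it, runs the checks, and
    feeds any failures back in as the next prompt."""
    feedback = ""
    for turn in range(max_turns):
        patch = generate_patch(task, feedback)   # the LLM call
        apply_patch(patch)                       # tool use: edit the files
        ok, failures = run_tests()               # tool use: run the checks
        if ok:
            return f"done in {turn + 1} turn(s)"
        feedback = failures                      # loop on the errors
    return "gave up"                             # the turn budget ran out
```

Everything else – context management, planning modes, sub-agents – is elaboration on this skeleton. Note what happens when the checks can never pass: the loop burns through its entire turn budget.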

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with their customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for those practices in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

100% Autonomous “Agentic” Coding Is A Fool’s Errand

Though I’ve seen very little evidence of it being attempted on production systems with real users (because risk), my socials are flooded with posts about people’s attempts to crack fully-autonomous, completely-unattended software creation and evolution using “agents” at scale.

Demonstrations by Cursor and Anthropic of large-scale development done – they claim – almost entirely by agents working in parallel have proven that the current state of the art produces software that doesn’t work. Perhaps, to those businesses, that’s just a minor detail. In the real world, we kind of prefer it when it does.

I’ve attempted experiments myself to see if I can get to a set-up good enough that I can hit “Play” and walk away to leave the agents to it while I go to the proverbial pub.

That seems to be the end goal here – the pot of gold at the end of the rainbow. Whoever makes that work will surely, at the very least, make a name for themselves, and probably a few coins.

I’ve seen many people – some who understand this technology far better than me – attempt the same thing. Curiously, they don’t seem to have nailed it either, but are convinced that somebody else must have.

It’s that FOMO, I suspect, that continues to drive people to try, despite repeated failures.

But, as of writing, I’ve seen no concrete evidence that anybody has done it successfully on any appreciable scale. (And no, a GitHub repo you claim was 100% agent-generated, “Trust me bro”, doesn’t qualify, I’m afraid.)

The rules of my closed-loop experiments are quite simple: I can take as much time as I like setting things up for Claude Code in read-only planning mode, but once the wheels of code generation are set in motion, we’re like an improv troupe – everything it suggests, the answer is automatically “yes”. I just let it play out.

Progress is measured with pre-baked automated acceptance tests driving deployed software, which act as a rough proxy for “value created”, and help to avoid confirmation bias and the kind of “LGTM” assessments of progress that plague accounts of “agentic” achievements right now. It’s very much an “either it did, or it didn’t” final quality bar.

I can’t intervene until either Claude says it’s done, or progress stalls. I can’t correct anything. I can’t edit any of the generated files. I have to simply sit back, watch and wait.
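Those rules are simple enough to write down as a harness. Here’s a sketch of the outer loop – my own framing, with `agent_step` and `passing_count` as hypothetical hooks into the agent and the pre-baked acceptance test suite:

```python
def closed_loop(agent_step, passing_count, max_stalled=3):
    """The rules of the experiment as code: let the agent run, measure
    progress only by how many acceptance tests pass, never intervene,
    and stop when the agent says it's done or progress stalls."""
    best = passing_count()
    stalled = 0
    while stalled < max_stalled:
        claims_done = agent_step()   # everything it suggests gets a "yes"
        now = passing_count()        # the only progress metric that counts
        if now > best:
            best, stalled = now, 0
        else:
            stalled += 1             # no new tests passing = stalling
        if claims_done:
            break
    return best                      # "either it did, or it didn't"
```

Note that the agent claiming it’s done counts for nothing in the result – only the acceptance tests do.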

So far, no matter how I dice it and slice it, no set-up has produced 100% autonomous completion, or anything close.

No doubt the tools are improving – using LLMs in smarter ways. But there’s only so much we can do with context management, workflow, agent coordination, quality gates and version control before we reach the limits of reliability that are possible when LLMs are involved. I suspect some of us are almost at that plateau already.

Agents – with those faulty narrators at their core – will always get stuck in “doom loops” where the problem falls outside their training data distribution, or the constraints we try to impose on them conflict.

Round and round little Ralph Wiggum will go, throwing the dice again and again in the hope of getting 13, or any prime number greater than 5 that’s also divisible by 3.

Out-of-distribution problems will always be a feature of generative transformers. It’s an unfixable problem. The best solution OpenAI have managed to come up with is having the model look at the probabilities and, if there’s no clear winner for the next token, reply “I don’t know”. That’s not good news for full autonomy.
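The idea itself is easy to sketch. Something along these lines – my own illustration of the principle, not OpenAI’s actual implementation, which would live deep inside the serving stack:

```python
def next_token_or_abstain(token_probs, margin=0.2):
    """If no candidate clearly wins the next-token lottery,
    abstain instead of guessing."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    (top_token, p1), (_, p2) = ranked[0], ranked[1]
    # A clear winner beats the runner-up by at least `margin`.
    return top_token if p1 - p2 >= margin else "I don't know"

print(next_token_or_abstain({"the": 0.7, "a": 0.2, "an": 0.1}))    # → the
print(next_token_or_abstain({"the": 0.35, "a": 0.33, "an": 0.32}))  # → I don't know
```

An agent built on a model that abstains whenever the ground gets unfamiliar needs a human on hand to answer it – which is exactly what “fully unattended” rules out.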

And, no, a swarm of Ralphs won’t solve the problem, either. It just creates another major problem – coordination. No matter how many lanes your motorway has, ultimately every change has to go through the same garden gate of integration at the end.

A bunch of agents checking in on top of each other will almost certainly break the build, and once the build’s broken, everybody’s blocked, and your beeper is summoning you back from the proverbial pub to unblock them.

One amusing irony of all these attempts to fully define 100% autonomous “agentic” workflows is that it’s turning many advocates into software process engineers.

Just taking quality gates as the example, a completely automated code quality check will require us to precisely and completely describe exactly what we mean by “quality”, and in some form that can be directly interpreted against, for example, the code’s abstract syntax tree.

I know Feature Envy when I see it, but describing it precisely in those terms is a whole other story. Computing has a long history of teaching us that there are many things we thought we understood that, when we try to explain them to the computer, it turns out we don’t.
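To make the difficulty concrete, here’s a deliberately crude Feature Envy heuristic using Python’s `ast` module – entirely my own strawman, not a serious detector: flag a method that touches some other object’s attributes more often than its own. It catches the textbook case and little else, which is rather the point:

```python
import ast

def smells_of_feature_envy(method_source):
    """Crude heuristic: count attribute accesses on `self` versus on
    every other receiver. If some other receiver wins, flag it.
    (Real Feature Envy needs far more nuance than this.)"""
    tree = ast.parse(method_source)
    self_hits, other_hits = 0, {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if node.value.id == "self":
                self_hits += 1
            else:
                other_hits[node.value.id] = other_hits.get(node.value.id, 0) + 1
    return any(count > self_hits for count in other_hits.values())

envious = """
def total_price(self, order):
    return order.quantity * order.unit_price - order.discount
"""
tidy = """
def total_price(self):
    return self.quantity * self.unit_price - self.discount
"""
print(smells_of_feature_envy(envious))  # → True ('order' touched 3x, 'self' 0x)
print(smells_of_feature_envy(tidy))     # → False
```

Rename a variable, route the envy through a chained call or a local alias, and this check goes quietly blind – and that’s one of the best-defined smells we have.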

Software architecture and design is replete with woolly concepts – what exactly is a “responsibility”, for example? How could we instruct a computer to recognise when a function or a class has more than one reason to change? (Answers on a postcard, please.)

Fully autonomous code inspections are really, really, really (really) hard.

90% automated? Definitely do-able. But skill, nuance and judgement will likely always be required for the inevitable edge cases.

Having worked quite extensively in software process engineering earlier in my career, I know from experience that it’s largely a futile effort.

We naively believed that if we just described the processes well enough – the workflows, the inputs, the outputs, the roles and the rules – then we could shove a badger in a bowtie into any of those roles and the process would work. No skill or judgement was required.

You can probably imagine why this appealed to the people signing the salary cheques.

It didn’t work, of course. Not just because it’s way, way harder than we imagined to describe software development processes to that level of precision, but also because – you guessed it – teams never actually did it the way the guidance told them to. They painted outside the lines, and we just couldn’t stop them.

In 2026, some of us are making the same mistakes all over again, only now the well-dressed badger’s being paid by the token.

We might get 80% of the way and think we’re one-fifth away from full autonomy, but the long and chequered history of AI research is littered with the discarded bones of approaches that got us “most of the way”. Close, but no cigar.

It turns out that last few percent is almost always exponentially harder to achieve, as it represents the fundamental limits of the technology. On the graph of progress vs. cost, 100% is typically an asymptote. We need to recognise a wall when we see one and back away to where the costs make sense.

Attempting to achieve better outcomes using agents with more autonomy seems like a reasonable pursuit, as long as we’re actually getting those better outcomes – shorter lead times, more reliable releases, more satisfied customers.

Folks I know being successful with an “agentic” approach have stepped back from searching for the end of that rainbow, and have focused on what can be achieved while staying very much in the loop.

They let the firehose run in short, controlled bursts and check the results thoroughly – using a combination of automated checks and their expert judgement – after every one. And for a host of reasons, that’s probably why they’re getting better results.

It’s highly likely there’s no end to the “agentic” rainbow. Perhaps we should start looking for some gold where we actually are, using tools we’ve actually got?