Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, after just a few dozen hours of experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

[Video]

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another on the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.
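Here’s a toy illustration of that mechanism – to be clear, this is not how LLM training actually works, just a resampling sketch where each “generation” trains on the previous generation’s output with the long tail filtered out:

```python
import random
import statistics

def next_generation(data, rng):
    # Train a "model" on the data and generate output: here, resample the
    # data but drop the long tail (the rarest 20% of values), mimicking how
    # generated output clusters closer to the mean than the distribution
    # it was trained on.
    trimmed = sorted(data)[len(data) // 10 : -(len(data) // 10)]
    return [rng.choice(trimmed) for _ in range(len(data))]

rng = random.Random(1)
data = [rng.uniform(0, 100) for _ in range(1000)]  # diverse "human-written" corpus

for gen in range(8):
    print(f"gen {gen}: stdev = {statistics.stdev(data):6.2f}")
    data = next_generation(data, rng)
```

The standard deviation shrinks generation by generation. No single step looks catastrophic, but the diversity drains away all the same.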

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often more complicated than it needs to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.


As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Rely On AI And Get Left Behind

LLM-based code generation comes with multiple potential gotchas for a business.

More code hitting process bottlenecks like testing, review and integration makes most teams slower.

More problems being generated by text prediction engines that simply don’t understand what the requirements or the code mean makes releases less stable.

These are effects we’re seeing today. Most dev teams are shipping less reliable software, and shipping it later thanks to “AI”. Good work, everyone!

But I suspect the biggest risk is creeping up on us without us necessarily realising.

The more teams offload their thinking to these tools, the less they understand the code their business relies on. And the less they understand it, the less able they are to fix the problems that “AI” can’t fix. Outages last longer, and fixes are less sticky.

Mean Time To Failure goes down, and Mean Time To Recovery gets longer. More fires, taking longer to put out. Just ask the folks at AWS!

“AI”-assisted programming’s a bit like pedal-assist on electric bicycles. It makes progress feel easy, but we might not realise the impact that’s having on our coding “muscles” – our ability to comprehend and reason about code.

That is, until we run out of juice and have to pedal unaided. That’s when it becomes obvious just how much of our Code Fu we’ve lost as we’ve come to rely on that assistance more and more. Increasingly, I hear developers say “I’ve hit my token limit, so I’m blocked.”

Working in the low-gravity environment of “AI” coding, we may not notice the decline in our own abilities. And the more they decline, the less able we are to notice – until we find ourselves back in 1g and are suddenly unable to walk.

I’m seeing teams where most of the developers now have a hard time working on their code unaided by “AI”. It’s taking them much longer to wrap their heads around it, and they’re missing more and more problems.

Shipping code faster than we can understand it creates comprehension debt – extra time needed to build that understanding when the tool can’t do what we need it to do (and the need for that is very probably never going away).

Relying on LLMs to write the code for us erodes our ability to understand code, full stop. It’s a double-whammy.

There’s little doubting now that the devs who are most effective using “AI” are the ones who can work just fine without it.

Some say “Use AI or get left behind”. The reality is more probably “Rely on AI and get left behind”. Developers need to maintain their cognitive edge, which means keeping their hand in with regular unaided working.

There’s every chance that, in the future, these developers will be most in-demand.

If developers don’t keep that edge, then this is going to add a lot of risk for businesses – not least because the vendors will have them over a barrel. Price plans are heavily subsidised today. That can’t last, and if you’re not able to walk away, then you have no negotiating position.

That more businesses don’t see becoming completely reliant on a third party as a strategic risk is baffling, especially in a market as volatile as “AI”. It could all implode at any moment.

But the real risk is that, one day, your core product or system will break, and there’ll be nobody capable of fixing it.

Well, nobody you could afford.

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my times as developer rather than architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modelling of architecture quite deeply, reverse-engineering the code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I don’t find purely structural descriptions very helpful in understanding what code does; they only describe what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long story short: turns out they didn’t. Not a single end-to-end use case scenario worked.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, for aiding my comprehension of an architecture, a dynamic model driven by usage examples is preferable to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This can serve a dual purpose, both documenting scenarios as well as enabling us to see explicitly if the software satisfies them.

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.
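A quick sketch of what that can look like – the domain here (a little Order class) is invented purely for illustration:

```python
# A usage scenario captured as an executable specification.

class Order:
    def __init__(self):
        self.lines = []

    def add(self, item, price, quantity=1):
        self.lines.append((item, price, quantity))

    def total(self):
        # The total is the sum of price x quantity across line items.
        return sum(price * qty for _, price, qty in self.lines)


def test_customer_buys_two_coffees_and_a_muffin():
    # The test name documents the scenario; the assertion documents
    # the expected outcome.
    order = Order()
    order.add("coffee", 2.50, quantity=2)
    order.add("muffin", 1.75)
    assert order.total() == 6.75
```

Run under a test runner like pytest, this both records the scenario and tells us, on every build, whether the software still satisfies it.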

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.
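The core of such a tool needn’t be exotic. Here’s a primitive sketch using Python’s trace hook to record caller-to-callee edges while a usage scenario runs – the tiny “architecture” being observed is invented for the example:

```python
import sys

def trace_calls(scenario):
    # Run a usage scenario and record caller -> callee edges as it executes.
    edges = []

    def tracer(frame, event, arg):
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            if caller != "trace_calls":  # skip the harness itself
                edges.append((caller, frame.f_code.co_name))
        return None  # no line-level tracing needed

    sys.settrace(tracer)
    try:
        scenario()
    finally:
        sys.settrace(None)
    return edges

# A tiny invented "architecture" to observe:
def fetch_order():
    return {"total": 6.75}

def render_receipt(order):
    return f"Total: {order['total']}"

def checkout_scenario():
    render_receipt(fetch_order())

for caller, callee in trace_calls(checkout_scenario):
    print(f"{caller} -> {callee}")
```

Feed those edges into a diagramming tool and you have a dynamic, always-current picture of how components collaborate in that scenario – a functional slice, drawn from the running code rather than from somebody’s memory of it.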

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one area of machine learning I haven’t yet seen applied to software development is mining the rich stream of IDE, source-file and code-level events we give off as we work, in order to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

Writing Code May Be Dead (Not Really), But Reading Code Will Live On

“The age of the syntax writer is over”, hailed a post on LinkedIn. (Why is it always LinkedIn?)

I say: welcome to the age of the syntax READER!

If I presented you with a bowl of candy and told you only 1% had broken glass in them, what percentage would you check before giving them to your kids?

The fact is that a 1% error rate would be an order-of-magnitude improvement in the reliability of LLM-generated code, and one we shouldn’t expect for a very long time – if ever – as scaling hits the wall of diminishing returns at exponentially increasing cost.

The need to check generated code isn’t going to go away, and therefore neither is the need to understand it.

And, given the error rates involved, we may need to understand all of it. At least, anyone who doesn’t want to be handing candy-covered shards to their users will.

“But Jason, we don’t need to check machine code or assembly language generated by compilers.”

That’s a category mistake. When was the last time a compiler misinterpreted your source code, or hallucinated output? Compiler boo-boos are very rare.

If my compiler randomly misinterpreted my source code just 1% of the time, then, yes, I’d check all of the generated machine code for anything that was going to be put in the hands of significant numbers of users. And that means I’d need to be able to understand that machine code.

Now for the fun part.

Decades of studies into program comprehension show that one of the best predictors of a person’s ability to understand code is how much experience they have writing it.

Cognitively, we don’t engage with syntax and semantics more deeply than when we’re writing the code ourselves. (Which is why I strongly discourage students from copying and pasting – it stunts their growth.)

In my blog series The AI-Ready Software Developer, I talk about how a mountain of “comprehension debt” can rapidly accumulate as “AI” produces code far faster than we can wrap our heads around it. I call this “LGTM-speed”.

I recommend “staying sharp” where code comprehension is concerned, as well as slowing down code generation to the speed of comprehension. When you’re drinking from a firehose, the limit isn’t the firehose.

This implies that we’re not limited by how many tokens/second “AI” coding assistants can predict, but by how many tokens/second we can understand. That’s the thing we need to optimise.

The best way – the only way, really – to maintain good code comprehension is to write code regularly.

We need to keep our hand in to make sure we don’t get caught in a trap where comprehension debt balloons as our ability to comprehend withers.

That leads to a situation where serious, show-stopping – potentially business-stopping – errors become much more likely to make it into releases, and there’s nobody on the team who can fix them.

And in that scenario, who are they gonna call? The “AI-native developer” who boasts they haven’t written code in months, or the developer who was debugging that kind of code just this morning?

The Age of Coding “Agents”? Or The Age of “LGTM”?

I’ve watched a lot of people using “AI” coding assistants, and noted how often they wave through large batches of code changes the model proposes, sometimes actually saying it out loud: “Looks good to me”.

After nearly 3 years of experimentation using LLMs to generate and modify code, I know beyond any shadow of a doubt that you need to thoroughly check and understand every line of code they produce. Or there may be trouble ahead. (But while there’s music… etc)

But should I be surprised that so many developers are happily waving through code in such a lackadaisical way? Is this anything new, really?

I’ve watched developers check in code they hadn’t even run. Heck, code that didn’t even compile.

I’ve watched developers copy and paste armfuls of code from sites like Stack Overflow, and not even pause to read it, let alone try to understand it or even – gasp – try to improve it.

I’ve watched developers comment out or delete tests because they were failing. I’ve watched teams take testing out of their build pipeline to get broken software into production.

We’ve been living in an age of “LGTM” for a very long time.

What’s different now is the sheer amount of code being waved through into releases, and just how easy “AI” coding assistants make it for the driver to fall asleep at the wheel.

And when we put our coding assistant into “agent” mode – or, as I call it, “firehose mode” – that’s when things can very quickly run away from us. Dozens or hundreds of changes, perhaps even happening simultaneously as parallel agents make themselves busy on multiple tasks at once.

Even if there were no issues in any of those changes – and the odds of that are extremely remote – when code’s being churned out faster than we can understand it, a rapidly-growing mountain of comprehension debt piles up.

When the time comes – or should I say, when the times come – that the coding assistant gets stuck in a “doom loop” and we have to fix problems ourselves, that debt has to be repaid with interest.

Agents have no “intelligence”. They’re old-fashioned computer programs that call LLMs when they need sophisticated pattern recognition and token prediction. LLMs don’t follow instructions or rules. Use them for just a few minutes and you’ll see them crashing through their own guardrails, doing things we’ve explicitly told them not to do, and forgetting to do things we insist that they should.

The intelligence in this set-up is us. We’re the ones who can follow rules and instructions. We’re the ones who understand. We’re the ones who reason and plan. And we’re the ones who learn.

In 2025, and probably for many years to come, we are the agents. We’re the only ones qualified for the job.

My advice – based on the best available evidence and a lot of experience using these tools over the past 3 years – remains the same when you’re working on code that matters.

I recommend working one failing test at a time, one refactoring at a time, one bug at a time, and so on.

I recommend thoroughly testing after every step, and carefully reviewing the small amount of code that’s changed.

I recommend committing changes when the tests go green, and being ready to revert when they go red.

I recommend a fresh context, specific to the next step. I recommend relying on deterministic sources of truth – the code as it is (not the model’s summary of it), the actual test results, linter reports, mutation testing scores etc.

I strongly advise against letting LLMs mark their own homework or rely on their version of reality.

And forget “firehose mode” for code that matters. Keep it on a very tight leash.
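The green/red discipline above can be sketched as a tiny harness. To be clear, this assumes a git repo with a pytest suite, and the helper names are mine, not any tool’s:

```python
# A minimal sketch of "commit on green, revert on red".
# Assumes the current directory is a git repository with a pytest suite.
import subprocess

def tests_pass():
    # Run the whole suite; a zero exit code means green.
    return subprocess.run(["python", "-m", "pytest", "-q"]).returncode == 0

def checkpoint(message):
    # One small step at a time: commit the change if the tests are green,
    # throw it away if they're red.
    if tests_pass():
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "."], check=True)
```

Called after every small change, something like this keeps the working copy permanently one green commit away from safety – and keeps the diffs small enough to actually read.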

What’s In A Name?

The idea of “separation of concerns” originated from a need to make it possible for programmers to reason about a piece of code without the need to understand what’s going on inside its dependencies (and its dependencies’ dependencies).

In this sense, the primary benefit of modular design is to reduce cognitive load when working with any part of the system.

But that can only happen if every reference to other parts (e.g., function calls) “says what it does on the tin”, so we can form correct expectations about its behaviour within the context we’re reasoning about.

Ideally, to understand what a dependency does, we shouldn’t need to understand how it does it.

When names are unclear, or even misleading, we form the wrong expectations about what a dependency will do, and are forced to “look inside the box” to understand it.

In the same way that this increases the context size for an LLM – and therefore the risk of errors – it increases cognitive load for programmers, with the same end result. I often see folks complaining about needing a bunch of source files open just to understand what one piece of code is going to do.

It might be helpful here if I put forward a definition of code comprehensibility and a rough way of measuring it.

I see comprehensibility as the likelihood that the target audience (e.g., other team members) will correctly predict what a piece of code will do in specific cases.

ratings = [4, 6, 4, 5]

average_rating = sum(ratings)/len(ratings)

What will the value of average_rating be after that assignment?

Let’s refuctor that code to make it a little less obvious.

r = [4, 6, 4, 5]

ar = s(r)/l(r)

Now you need to know what the functions s and l do. That may mean looking it up, if there’s documentation available – more cognitive load.

Or it may mean actually peeking inside their implementations – even more cognitive load.

We could ask 10 developers to predict what the result will be. If 8 of them predict correctly, we might roughly gauge the comprehensibility of this code for that sample of people – remember who the audience is – at 80%.
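The measure itself is trivial to compute. Here it is as a sketch – the survey numbers are invented for illustration:

```python
def comprehensibility(predictions, actual):
    # Fraction of the audience who correctly predicted the code's behaviour.
    return sum(1 for p in predictions if p == actual) / len(predictions)

ratings = [4, 6, 4, 5]
actual = sum(ratings) / len(ratings)  # 4.75

# Invented survey: 10 developers asked to predict average_rating.
# 8 got it right; one guessed 5, and one guessed the un-divided sum.
predictions = [4.75, 4.75, 5, 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 19]
print(comprehensibility(predictions, actual))  # → 0.8
```

The number only means something relative to the audience you sampled – a team of Haskell veterans and a class of first-years will score the same code very differently.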

Or we could ask an LLM ten times. Though it’s important to remember that LLMs can’t reason about behaviour. They would literally just be matching patterns. And in that sense, this is a reasonable test of how closely the code correlates with examples within the training data distribution.

So, in summary, naming is very, very important in modular software design. Names help us form expectations about behaviour, and if those expectations are correct then this means we don’t need to go beyond a signpost to understand what’s down that road.