The Hottest AI Coding Skill In 2026? Coding Without AI

Psst. If your boss won’t invest in training you in Specification By Example or Test-Driven Development, I’m running out-of-hours workshops in May specifically for self-funding learners. £99 + UK VAT.

Statistics are showing that software developer hiring’s been on the rise again for about a year, but it seems priorities have changed.

Time was that our profession was a pyramid, with a base of entry-level and “junior” hires outnumbering seasoned professionals. But this time, we’re an aging population, with employers favouring developers with significant pre-AI experience.

“The current market has become increasingly senior-driven, with fewer junior roles available and employers expecting even entry-level candidates to have all the skills to hit the ground running.”

Harvey Nash, Software trends in the year ahead: A UK hiring outlook

Meanwhile, my LinkedIn feed’s awash with posts complaining that not only are coding tests still very much a thing, but they’re becoming even more of a thing. “Why”, they lament, “do employers want coding skills when AI can do all that?”

The problem is that – a significant proportion of the time – AI actually can’t do all that. Anyone who’s used AI coding assistants and agents for more than an hour or two will know that there are many times when we have to intervene. And intervening at the very least requires us to really understand what we’re intervening in.

As a trainer and mentor, I’ve watched code comprehension degrade alarmingly over the past 10 years as developers have been relying more and more on copying and pasting code from sources like Stack Overflow.

I especially implore junior developers not to do it. It very noticeably stunts their growth as programmers. The code really does need to go in through your eyes, through your brain and out through your fingers for knowledge to sink in. The language and reasoning centres of our brain need to be actively engaged.

If you lost your phone and urgently needed to call a friend, could you remember their number? Speed-dialing code has a similar effect.

In the last year, that’s gone into warp drive now that we have a powerful tool that automates the copypasta process at scale.

The effect of AI use on code comprehension is well-documented. The more we use it, the less we understand the code that’s being generated. And not just because we didn’t write it, it transpires – our ability to understand code generally atrophies. Use it or lose it.

Employers in 2026, it seems, favour developers who haven’t lost it. As I’ve been only half-joking for over a year, the devs who’ll be in highest demand in this age of AI will be the ones who don’t need it.

With highly public outages becoming routine, it looks like managers are reprioritising to make sure that when the you-know-what hits the fan on a critical system at 2 am, the person answering that call is capable of fixing it even when the AI is having one of its famous senior moments.

Those same employers, of course, then insist that the senior developers they hired because of their effectiveness without AI should use AI as much as possible. Sigh.

In my blog series The AI-Ready Software Developer, I wrote a post called Staying Sharp about how important it’s going to be to maintain our “traditional” programming skills. We need to find some time in the day or the week to leave the proverbial car at home and walk.

And if hitting your token limit is a blocker to doing more programming, you may already have deskilled yourself out of the market.

Essential Code Craft – The Roadmap

Some of you may have noticed that I’ve been running out-of-hours training workshops for self-funding learners recently, under the banner of Essential Code Craft.

In a way, this is a return to the early days of Codemanship when I ran regular weekend workshops – priced for individual pockets – that were mostly attended by developers investing in their own skills and career development.

Many of those people are now CTOs and heads of engineering, and I’ve been fortunate – and grateful – that quite a few have brought me in to provide the same kind of training for their teams.

But with senior engineering leaders now very distracted by the code-generating firehose – and while I wait for them to realise that nothing’s actually changed as far as software engineering fundamentals are concerned – I’m pivoting back to self-funders.

So far – just as it was way back when – the first two workshops filled up quickly. While the boss might not be thinking about investing in their developers at the moment, it seems a lot of developers are looking to invest in themselves.

And this is exactly the moment to do it. While a gazillion developers hunt for magic incantations to make a probabilistic next-token predictor act like something other than a probabilistic next-token predictor, the people who’ve done their homework already know: better results with AI coding tools have very little to do with the tools, and almost everything to do with the processes around them.

And it’s a double-win. The practices that produce the best outcomes with AI are the exact same practices that produce the best outcomes without AI.

The key to being effective with AI is being effective without it.

And here’s the hedge, but only for the informed gamblers – developer hiring is rising again, but the demographic of these new hires is changing. Employers are favouring senior developers with significant pre-LLM experience.

I, and a few others, predicted this would happen. Demand would be highest for people who can do the things AI coding tools can’t – like, well, understand code. I mean really understand it. Not “LGTM” understanding. Deep comprehension of programs.

Not only that, but for all kinds of good reasons – economic, environmental, energy, ethical, geopolitical – the future of hyperscale LLMs is by no means predictable. Folks grappling with reduced token limits and rapidly degrading performance with Anthropic’s newest models will hopefully have figured out by now that building workflows that depend heavily on hyperscale LLMs is building on quicksand.

Who are Acme Megacorp gonna’ hire – the dev who sits on their hands because they’re waiting for their token limit to reset, or the dev who can just carry on at roughly the same overall pace of delivery?

And we should be under no illusions that teams who’ve mastered the fundamentals of software delivery are routinely outperforming teams who haven’t – with or without AI. AI is clearly not the differentiator.

So, whether you’re going to apply these disciplines with Claude Code or Codex, or with IntelliJ or VS Code, they still matter – arguably more than ever.

And what are these disciplines? What is Essential Code Craft?

  • Specification By Example – build shared understanding and pin down requirements with testable specifications
  • Test-Driven Development – rapidly iterate working software designs with short delivery lead times and reliable releases
  • Continuous Integration – keep teams more in sync with their changes, merging and testing them many times a day to ensure a working, shippable-at-any-time product
  • Continuous Collaboration – keep teams on the same page by continuously communicating with practices like pair programming and teaming
  • Refactoring – reshape code to make change easier, while keeping it working and shippable at all times
  • Modular Design – optimise software architecture to localise the “blast radius” and minimise the cost of changes, while making rapid testing and smarter reuse easier
  • Continuous Inspection – minimise the bottleneck and the “LGTM” effect of downstream code review by making it a continuous and highly automated process
  • Continuous Delivery – combine these fundamentals in a delivery process that can get the proverbial peas from the farmer’s field to the kitchen table through rapid, reliable integration, build and deployment pipelines
  • Continuous Improvement – build development capability in an evidence-based way, learning what really works and what doesn’t as you build skills, automate tools and workflows, and explore and experiment with your approach (and that’s where I come in!)
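
To make the first item on that list a little more concrete, here is a minimal sketch of what a testable specification can look like. The delivery-charge rule, the prices and the function names are all hypothetical, invented purely for illustration; the point is only that the agreed examples are the specification, and they can be run.

```python
# Hypothetical requirement, pinned down as examples agreed with the customer:
#   (order total, expected delivery charge)
EXAMPLES = [
    (49.99, 3.95),   # just under the threshold: standard charge
    (50.00, 0.00),   # exactly on the threshold: free delivery
    (120.00, 0.00),  # well over the threshold: free delivery
]


def delivery_charge(order_total):
    """One possible implementation of the hypothetical rule above."""
    return 0.00 if order_total >= 50.00 else 3.95


def failing_examples():
    """Return every agreed example the implementation does not satisfy."""
    return [
        (total, expected)
        for total, expected in EXAMPLES
        if delivery_charge(total) != expected
    ]
```

An empty list from `failing_examples()` means the code agrees with every example; the boundary case at exactly 50.00 is the kind of detail a prose requirement tends to leave ambiguous.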

Workshops on Specification By Example and Test-Driven Development are already live and taking registrations. If there’s demand, more will follow.

The roadmap is to build a set of repeating individual workshops, rotating monthly, that will eventually cover all of these disciplines. Some will be covered explicitly, and some implicitly, like Continuous Integration and pair programming, which will be an integral part of most workshops.

Self-funders can pick and choose which to attend, and my hope is that they’ll be a bit like Pokemon cards – gotta collect ’em all!

Keep an eye on the Codemanship Ticket Tailor box office for details of upcoming workshops.

Also, details of new workshop times will be posted here first, so subscribe to this blog if you’d like to be kept in the loop for future workshops.

The AI-Ready Software Developer #23 – Speaking Clearly

Psst. If your boss won’t invest in training you in Test-Driven Development, I’m running out-of-hours workshops on April 7 and 11 specifically for self-funding learners. £99 + UK VAT.

I see a lot of wrongheaded takes on “AI” coding assistants and agents online, but one of the most misguided goes something along the lines of “Code doesn’t need to be easy for humans to understand anymore”.

Presumably, these people have never stopped to ask themselves why they’re called “language models”, but let’s entertain their chain of thought for a moment.

The theory is that humans won’t need to understand the code generated by LLMs (wrong!) because they won’t need to edit the code themselves (wrong!), and so code should be optimised for things like token efficiency, not readability (WRONG!)

Some even go so far as to propose token-optimised programming languages designed by and especially for LLMs.

Working backwards through this sequence of compounding errors quickly unpicks the logic.

First of all, even if LLMs could design a general-purpose, Turing-complete programming language from scratch – which they can’t – it would require huge numbers of examples to train them to work with such a language. It might work for small DSLs, but languages like Java and Go are a whole other ballpark.

Without that large corpus of training data – showing examples of every aspect of the language many, many times – every request to generate code in it would be out-of-distribution. There’s a good reason why Claude, GPT, Gemini etc are better at generating code in popular languages. So, are we going to write those millions of examples? In a non-human-readable programming language?

Secondly, when you look at the tokens in code, how much of it is language syntax – reserved words, semicolons and so on – and how much of it is names that we have chosen for the things in our code to make it more self-descriptive?

“Ah, but Jason, AI-generated code doesn’t need to be self-descriptive, because we don’t need to understand it.”

I have bad news for you on that front, I’m afraid. Research – both my experiments and larger-scale studies – has found that the less clear code is to human readers, the less clear it is to LLMs. When code is obfuscated, models make more mistakes interpreting it.

This should come as no surprise. Language Models – the clue is in the name. What did folks think they were matching patterns in?

Human-readable code is model-optimized code.

LLMs don’t “understand” code like compilers do; they pattern-match on language.

So:

  • Meaningful identifiers = stronger semantic anchors
  • Consistent naming = better pattern recall
  • Verbose clarity = richer signal

And, just as it’s incredibly helpful on a software team if everybody’s speaking the same shared language to describe the problem they’re setting out to solve – including (especially) in the code itself – it’s also essential that we use consistent shared language in our prompts and in the code. If I ask Claude to change the “sales tax” calculation logic, and in the code the function’s called ‘v_tx’, we’re probably not going to get a glowing result.
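
A small illustration of that mismatch: the two functions below do exactly the same thing. The parameter names and the second function's name are hypothetical, chosen only to show the contrast between code that shares no vocabulary with the prompt and code that does.

```python
# Both functions compute the same value. Only the names differ.

def v_tx(a, r):
    # Nothing here says "sales tax" to a human reader or to a language model.
    return a * r


def calculate_sales_tax(net_price, sales_tax_rate):
    # The words used in the prompt ("sales tax") appear in the code itself.
    return net_price * sales_tax_rate
```

A request to "change the sales tax calculation" has an obvious target in the second version. In the first, whoever picks up the task, human or model, has to guess.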

Image: Confusion of Tongues: The Construction of the Tower of Babel, Lucas van Valckenborch, 1594

And then there’s the take that humans won’t need to edit the code. Some even go so far as to say that LLMs will be generating machine code directly, bypassing the human-readable source.

The fact of the matter is that LLMs are unreliable. Very unreliable. I’ve documented the common experience of agents like Claude Code getting stuck in “doom loops”, when a task or problem goes outside their training data distribution, and they just can’t do it. This is a fact of life when we’re working with the technology, and the general consensus – even among the hyperscalers – is that it’s an unfixable problem at any achievable scale.

There will always be a need for the human in the loop, and there will always be a need for that human to understand the code.

If it’s token efficiency you’re after, the smart thing to do is to focus on reducing the probability of mistakes and unnecessary rework. Deliberately obfuscating our code is going to work against us in that respect.

The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that catch the boo-boos before they get deployed at the very latest. Ideally, catching the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far after a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time-out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.
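
The mechanics described above can be sketched in a few lines of Python using the standard `ast` module. This is a toy, not how PIT or Stryker are implemented: the `total_price` function, the two "tests" and the single mutation operator are all assumptions made up for the illustration.

```python
import ast

# Hypothetical code under test, held as source text so it can be mutated.
SOURCE = """
def total_price(net, tax):
    return net + tax
"""


class AddToSub(ast.NodeTransformer):
    """Mutation operator: turn every + into a -."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load(tree):
    """Compile a syntax tree and return the total_price function it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["total_price"]


def strong_test(total_price):
    # Checks the result, so it should notice broken arithmetic.
    return total_price(100, 20) == 120


def weak_test(total_price):
    # Executes the code but asserts nothing about the result.
    total_price(100, 20)
    return True


def survives(test):
    """True if the test still passes when run against the mutated code."""
    mutant_tree = AddToSub().visit(ast.parse(SOURCE))
    ast.fix_missing_locations(mutant_tree)
    return test(load(mutant_tree))
```

Here `survives(strong_test)` is `False` (the mutant is killed), while `survives(weak_test)` is `True`, even though both tests execute exactly the same line of `total_price`.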

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can be used to incrementally do this on code that’s changed, so you don’t need to run it again and again against code that isn’t changing, making it practical to do within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues relating to maintainability or security etc is code reviews. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

  1. It catches problems straight away, before I start pouring more cement on them
  2. It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
  3. It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.
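
As a rough sketch of that tree-walking idea, here is a single rule written against Python's built-in `ast` module. The limit of three parameters and the class and function names are hypothetical; real linters ship many such rules and handle far more cases.

```python
import ast

MAX_PARAMETERS = 3  # hypothetical team rule, not a recommendation


class TooManyParameters(ast.NodeVisitor):
    """Walks the syntax tree and records functions that break the rule."""

    def __init__(self):
        self.violations = []

    def visit_FunctionDef(self, node):
        # Counts ordinary positional parameters only; keyword-only
        # parameters, *args and **kwargs are ignored in this sketch.
        count = len(node.args.args)
        if count > MAX_PARAMETERS:
            self.violations.append((node.name, node.lineno, count))
        self.generic_visit(node)  # keep walking into nested functions


def check(source):
    """Parse the source and return (name, line, parameter count) per violation."""
    checker = TooManyParameters()
    checker.visit(ast.parse(source))
    return checker.violations
```

Running `check` over a file's text returns one tuple per offending function, which is quick enough to do after every passing test run.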

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique: commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!
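
The same kind of check can be scripted. The sketch below is not the JUnit 5 experiment described above; it is a scaled-down Python analogue with a deliberately simple, home-made unused-import rule, where `CLEAN`, `seed_unused_import` and the choice of `colorsys` are all made up for the illustration.

```python
import ast


def unused_imports(source):
    """Names that are imported but never referenced (a deliberately simple rule)."""
    imported, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a.b as c" binds "c".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


# Hypothetical clean code: every import is used.
CLEAN = "import math\n\nprint(math.pi)\n"


def seed_unused_import(source, module="colorsys"):
    """Commit the crime: add an import that nothing in the code uses."""
    return f"import {module}\n" + source
```

The gate passes its own test if it reports nothing for `CLEAN` and reports `colorsys` for the seeded copy. If the seeded problem goes unreported, the rule set has a gap that needs closing.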

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.

As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.
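
To put rough numbers on that, the usual steady-state availability formula is MTTF / (MTTF + MTTR). The figures in the comments below are invented purely for illustration; they are not measurements from any real team or system.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical "before": a failure roughly every 30 days, fixed in an hour.
before = availability(mttf_hours=720, mttr_hours=1)

# Hypothetical "after": a failure roughly every week, taking four hours to fix.
after = availability(mttf_hours=168, mttr_hours=4)
```

Both changes push in the same direction: a shorter time to failure and a longer time to recovery each reduce the fraction of time the system is up, and together they compound.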

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Rely On AI And Get Left Behind

LLM-based code generation comes with multiple potential gotchas for a business.

More code hitting process bottlenecks like testing, review and integration makes most teams slower.

More problems being generated by text prediction engines that simply don’t understand what the requirements or the code means makes releases less stable.

These are effects we’re seeing today. Most dev teams are shipping less reliable software, and shipping it later thanks to “AI”. Good work, everyone!

But I suspect the biggest risk is creeping up on us without us necessarily realising.

The more teams offload their thinking to these tools, the less they understand the code your business relies on. And the less they understand it, the less able they are to fix problems that “AI” can’t fix. Outages last longer, and fixes are less sticky.

Mean Time To Failure goes down, and Mean Time To Recovery gets longer. More fires, taking longer to put out. Just ask the folks at AWS!

“AI”-assisted programming’s a bit like pedal-assist on electric bicycles. It makes progress feel easy, but we might not realise the impact that’s having on our coding “muscles” – our ability to comprehend and reason about code.

That is, until we run out of juice and have to pedal unaided. That’s when it becomes obvious just how much of our Code Fu we’ve lost as we’ve come to rely on that assistance more and more. Increasingly, I hear developers say “I’ve hit my token limit, so I’m blocked.”

Working in the low-gravity environment of “AI” coding, we may not notice the decline in our own abilities. And the more they decline, the less able we are to notice – until we find ourselves back in 1g and suddenly are unable to walk.

I’m seeing teams where most of the developers now have a hard time working on their code unaided by “AI”. It’s taking them much longer to wrap their heads around it, and they’re missing more and more problems.

Shipping code faster than we can understand it creates comprehension debt – extra time needed to build that understanding when the tool can’t do what we need it to do (and the need for that is very probably never going away).

Relying on LLMs to write the code for us erodes our ability to understand code, full stop. It’s a double-whammy.

There’s little doubting now that the devs who are most effective using “AI” are the ones who can work just fine without it.

Some say “Use AI or get left behind”. The reality is more probably “Rely on AI and get left behind”. Developers need to maintain their cognitive edge, which means keeping their hand in with regular unaided working.

There’s every chance that, in the future, these developers will be most in-demand.

If developers don’t keep their hand in, then this is going to add a lot of risk for businesses – not least because the vendors will have them over a barrel. Price plans are heavily subsidised today. That can’t last, and if you’re not able to walk away, then you have no negotiating position.

That more businesses don’t see becoming completely reliant on a third-party as a strategic risk is baffling, especially in such a volatile market as “AI”. It could all implode at any moment.

But the real risk is that, one day, your core product or system will break, and there’ll be nobody capable of fixing it.

Well, nobody you could afford.

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my time as a developer rather than an architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modeling of architecture quite deeply, reverse-engineering code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I do not find purely structural descriptions very helpful in understanding what code does. It only describes what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long-story-short, turns out they didn’t. Not a single end-to-end use case scenario.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, a dynamic model driven by usage examples is preferable in aiding my comprehension of architecture to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This can serve a dual purpose: documenting the scenarios, and enabling us to see explicitly whether the software satisfies them.

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.
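To make that concrete, here’s a minimal sketch in Python of a usage example written as a test. Everything in it is hypothetical: Basket, add and total are names I’ve made up for illustration, not taken from any real code base. The point is only that the test name and body read as a statement of a user outcome, and running it tells us whether the code satisfies that outcome.

```python
# Hypothetical example: Basket, add and total are illustrative names only.
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Basket:
    items: list = field(default_factory=list)

    def add(self, name: str, unit_price: Decimal, quantity: int = 1) -> None:
        # Each entry records what was added, its unit price and how many.
        self.items.append((name, unit_price, quantity))

    def total(self) -> Decimal:
        # Sum price x quantity over every item, starting from zero.
        return sum((price * qty for _, price, qty in self.items), Decimal("0"))


def test_total_of_a_basket_is_the_sum_of_price_times_quantity():
    # Given a basket with two kinds of item
    basket = Basket()
    basket.add("notebook", Decimal("2.50"), quantity=2)
    basket.add("pen", Decimal("1.20"))
    # Then the customer is charged for everything in it
    assert basket.total() == Decimal("6.20")
```

A test runner such as pytest would pick up the test function by its name. Someone new to the code can read it as documentation of one scenario, then step into it with the debugger to see how the design fulfils it.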

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.
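As a rough illustration of how little it takes to get started, here’s a sketch that uses Python’s standard sys.setprofile hook to record which functions call which while a usage scenario runs. The scenario functions (confirm_order, reserve_stock, take_payment) and the record_interactions helper are hypothetical names of my own, and a real tool would need to handle threads, filter out library code and draw an actual diagram.

```python
import sys
from collections import Counter


def record_interactions(scenario):
    """Run scenario() and count caller -> callee pairs between Python functions."""
    calls = Counter()

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; f_back is its caller.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            calls[(caller, callee)] += 1

    sys.setprofile(profiler)
    try:
        scenario()
    finally:
        # Always switch the hook off again, even if the scenario fails.
        sys.setprofile(None)
    return calls


# A hypothetical usage scenario to observe.
def reserve_stock():
    pass


def take_payment():
    pass


def confirm_order():
    reserve_stock()
    take_payment()


interactions = record_interactions(confirm_order)
for (caller, callee), count in interactions.items():
    print(f"{caller} -> {callee} ({count})")
```

Each recorded pair is an edge in a dynamic model of the scenario, which is the raw material for something like a sequence diagram.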

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one area of machine learning that I haven’t seen applied to software development yet might lie in mining the rich stream of IDE, source file and code-level events we give off as we work to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

Writing Code May Be Dead (Not Really), But Reading Code Will Live On

“The age of the syntax writer is over”, hailed a post on LinkedIn. (Why is it always LinkedIn?)

I say: welcome to the age of the syntax READER!

If I presented you with a bowl of candy and told you only 1% had broken glass in them, what percentage would you check before giving them to your kids?

The fact is that a 1% error rate would be an order-of-magnitude improvement in the reliability of LLM-generated code, and one we shouldn’t expect for a very long time – if ever – as scaling hits the wall of diminishing returns at exponentially increasing cost.

The need to check generated code isn’t going to go away, and therefore neither is the need to understand it.

And, given the error rates involved, we may need to understand all of it. At least, anyone who doesn’t want to be handing candy-covered shards to their users will.
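To put a rough number on that, here’s a back-of-the-envelope sketch. It assumes, purely for illustration, that each piece of generated code independently has the same small chance of containing a shard. That’s a simplification (real errors are unlikely to be independent), and the function name is mine. Under that assumption, the chance that a batch of n pieces contains at least one bad one is 1 - (1 - rate)^n.

```python
def chance_of_at_least_one_defect(items: int, defect_rate: float = 0.01) -> float:
    """Probability that at least one of `items` independent items is defective."""
    # The chance that every item is fine is (1 - defect_rate) ** items,
    # so the chance that at least one is not is the complement of that.
    return 1 - (1 - defect_rate) ** items


for n in (1, 10, 100, 500):
    print(n, round(chance_of_at_least_one_defect(n), 3))
```

On those assumptions, a batch of 100 unchecked pieces has roughly a 63% chance of containing at least one problem, even at a 1% error rate.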

“But Jason, we don’t need to check machine code or assembly language generated by compilers.”

That’s a category mistake. When was the last time a compiler misinterpreted your source code, or hallucinated output? Compiler boo-boos are very rare.

If my compiler randomly misinterpreted my source code just 1% of the time, then, yes, I’d check all of the generated machine code for anything that was going to be put in the hands of significant numbers of users. And that means I’d need to be able to understand that machine code.

Now for the fun part.

Decades of studies into program comprehension show that one of the best predictors of a person’s ability to understand code is how much experience they have writing it.

Cognitively, we don’t engage with syntax and semantics more deeply than when we’re writing the code ourselves. (Which is why I strongly discourage students from copying and pasting – it stunts their growth.)

In my blog series The AI-Ready Software Developer, I talk about how a mountain of “comprehension debt” can rapidly accumulate as “AI” produces code far faster than we can wrap our heads around it. I call this “LGTM-speed”.

I recommend “staying sharp” where code comprehension is concerned, as well as slowing down code generation to the speed of comprehension. When you’re drinking from a firehose, the limit isn’t the firehose.

This implies that we’re not limited by how many tokens/second “AI” coding assistants can predict, but by how many tokens/second we can understand. That’s the thing we need to optimise.

The best way – the only way, really – to maintain good code comprehension is to write it regularly.

We need to keep our hand in to make sure we don’t get caught in a trap where comprehension debt balloons as our ability to comprehend withers.

That leads to a situation where serious, show-stopping – potentially business-stopping – errors become much more likely to make it into releases, and there’s nobody on the team who can fix them.

And in that scenario, who are they gonna call? The “AI-native developer” who boasts they haven’t written code in months, or the developer who was debugging that kind of code just this morning?

Clean Contexts

You’ve probably heard of “clean code” (and the “clean coder”, and “clean architecture”, and other things Bob Martin has added the word “clean” in front of to get another book out of it).

In this dawning age of “AI”-assisted software development, I’d like to propose clean contexts.

What is a “clean context”? Well, I’m glad you asked.

A clean context:

  • Addresses one problem – one failing test, one code quality rule, one refactoring etc.
  • Is small enough to stay inside the model’s effective context limit – which is going to be orders of magnitude smaller than the advertised maximum context
  • Uses clear and consistent shared language – if you’ve been calling it “sales tax”, don’t suddenly start calling it “VAT”
  • Clarifies with examples that can be used as success criteria (i.e., tests) – the code samples used in training were paired with usage examples, so including examples improves matching
  • Only contains information pertinent to the task – don’t divert the model’s attention (literally)
  • Only contains accurate information (“ground truth”) – the code and the architecture as they are now (not as they were a bunch of changes ago, when you asked the tool to summarise them), the test failure message, the mutation testing results and so on. Ground your interactions in reality.
  • Only contains working code – if the model breaks the code, don’t feed it back to it. It can’t tell broken code from working code, and you’ll pollute the context. Revert and try again. The exception to this is bug fixes, of course. But if the model introduced the bug – git reset --hard
  • Contains code that doesn’t go outside the model’s data distribution – LLMs famously choke on code that lacks clarity, is overly complex and lacks separation of concerns because it’s far outside the distribution of examples they were trained on. When it comes to gnarly legacy code, I’ve had more success breaking it down myself initially before letting Claude loose on it. Y’know, like how an adult bird chews the food first before feeding it to its chicks.

And remember that a prompt isn’t the entire context. Claude Code and Cursor will use static analysis to determine what source code needs to be added. Context files may be added (e.g., CLAUDE.md). And of course, everything in the conversation – your (or your agent’s) prompts and the model’s responses – is part of the context. When an LLM “hallucinates”, that becomes part of the context, and the model has no way of telling fact from its own fiction. It’s all just context to a language model.

This is why I purge and then construct a new, task-specific context with each interaction. Many users are reporting how much more accurate LLMs tend to be with a fresh context.
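Here’s a minimal sketch of what constructing a task-specific context from scratch might look like in Python. It’s an illustration of the checklist above rather than how any particular tool works: CleanContext, its fields and the 20,000-character budget are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanContext:
    task: str                # the one problem being addressed
    failure_message: str     # ground truth, e.g. the failing test's output
    source_files: dict       # path -> current contents, pertinent files only
    max_chars: int = 20_000  # assumed budget, far below any advertised maximum

    def render(self) -> str:
        # Build the context from scratch: nothing is carried over from
        # earlier conversations or from stale summaries.
        parts = [
            f"Task: {self.task}",
            f"Failing test output:\n{self.failure_message}",
        ]
        for path, contents in sorted(self.source_files.items()):
            parts.append(f"File: {path}\n{contents}")
        text = "\n\n".join(parts)
        if len(text) > self.max_chars:
            raise ValueError("Context too large: narrow the task or drop files")
        return text
```

The size check is deliberately crude. The design choice that matters is that every interaction starts from the current code and the current failure, not from whatever has piled up in the conversation.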

Our goal with a clean context is to minimise ambiguity and the risk of misinterpretation, to minimise attention dilution and context drift, context pollution and context “rot”, and as much as possible, stay within the LLM’s training data distribution.

Basically, we’re aiming to maximise the chances of an accurate prediction from the LLM, and spend less time cleaning up mistakes and digging the tool out of “doom loops”.

Importantly, working in small steps – solving one problem at a time – opens up many more opportunities after each step to get feedback from testing, code review and merging, so clean contexts are highly compatible with much more iterative approaches.
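The “revert and try again” rule can be sketched as a small loop. This is an assumption-laden illustration, not a description of any tool: attempt_step and its parameters are hypothetical names, and the three callables are passed in so the sketch stays self-contained. In practice, revert might shell out to git reset --hard and tests_pass might run your test suite.

```python
from typing import Callable


def attempt_step(
    apply_change: Callable[[], None],
    tests_pass: Callable[[], bool],
    revert: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Try one small change; keep it only if the tests pass, else revert."""
    for _ in range(max_attempts):
        apply_change()
        if tests_pass():
            return True  # working code: safe to commit and move on
        # Broken code never stays in the working copy, so it never
        # gets fed back into the next context.
        # In a real workflow this might be, for example:
        # subprocess.run(["git", "reset", "--hard"], check=True)
        revert()
    return False  # give up and take over by hand
```

Capping the number of attempts is one way to stay out of a doom loop: after a few failed tries, a human takes over.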

Just as Continuous Delivery enables us to make progress by putting one foot in front of the other, ensuring a working product after every step, we also aim to start every step with a clean context that significantly reduces the risk of a stumble.