code quality – Codemanship's Blog

The Hottest AI Coding Skill In 2026? Coding Without AI

_Psst. _{If your boss won’t invest in training you in Specification By Example or Test-Driven Development, I’m running out-of-hours workshops in May specifically for self-funding learners. £99 + UK VAT.}

Statistics are showing that software developer hiring’s been on the rise again for about a year, but it seems priorities have changed.

Time was that our profession was a pyramid, with a base of entry-level and “junior” hires outnumbering seasoned professionals. But this time, we’re an aging population, with employers favouring developers with significant pre-AI experience.

“The current market has become increasingly senior-driven, with fewer junior roles available and employers expecting even entry-level candidates to have all the skills to hit the ground running.”

Harvey Nash, Software trends in the year ahead: A UK hiring outlook

Meanwhile, my LinkedIn feed’s awash with posts complaining that not only are coding tests still very much a thing, but they’re becoming even more of a thing. “Why”, they lament, “do employers want coding skills when AI can do all that?”

The problem is that – a significant proportion of the time – AI actually can’t do all that. Anyone who’s used AI coding assistants and agents for more than an hour or two will know that there are many times when we have to intervene. And intervening at the very least requires us to really understand what we’re intervening in.

As a trainer and mentor, I’ve watched code comprehension degrade alarmingly over the past 10 years as developers have been relying more and more on copying and pasting code from sources like Stack Overflow.

I especially implore junior developers not to do it. It very noticeably stunts their growth as programmers. The code really does need to go in through your eyes, through your brain and out through your fingers for knowledge to sink in. The language and reasoning centres of our brain need to be actively engaged.

If you lost your phone and urgently needed to call a friend, could you remember their number? Speed-dialing code has a similar effect.

In the last year, that’s gone into warp drive now that we have a powerful tool that automates the copypasta process at scale.

The effect of AI use on code comprehension is well-documented. The more we use it, the less we understand the code that’s being generated. And not just because we didn’t write it, it transpires – our ability to understand code generally atrophies. Use it or lose it.

Employers in 2026, it seems, favour developers who haven’t lost it. As I’ve been only half-joking for over a year, the devs who’ll be in highest demand in this age of AI will be the ones who don’t need it.

With highly public outages becoming routine, it looks like managers are reprioritising to make sure that when the you-know-what hits the fan on a critical system at 2 am, the person answering that call is capable of fixing it even when the AI is having one of its famous senior moments.

Those same employers, of course, then insist that the senior developers they hired because of their effectiveness without AI should use AI as much as possible. Sigh.

In my blog series The AI-Ready Software Developer, I wrote a post called Staying Sharp about how important it’s going to be to maintain our “traditional” programming skills. We need to find some time in the day or the week to leave the proverbial car at home and walk.

And if hitting your token limit is a blocker to doing more programming, you may already have deskilled yourself out of the market.

Essential Code Craft – The Roadmap

Some of you may have noticed that I’ve been running out-of-hours training workshops for self-funding learners recently, under the banner of Essential Code Craft.

In a way, this is a return to the early days of Codemanship when I ran regular weekend workshops – priced for individual pockets – that were mostly attended by developers investing in their own skills and career development.

Many of those people are now CTOs and heads of engineering, and I’ve been fortunate – and grateful – that quite a few have brought me in to provide the same kind of training for their teams.

But with senior engineering leaders now very distracted by the code-generating firehose – and while I wait for them to realise that nothing’s actually changed as far as software engineering fundamentals are concerned – I’m pivoting back to self-funders.

So far – just as it was way back when – the first two workshops filled up quickly. While the boss might not be thinking about investing in their developers at the moment, it seems a lot of developers are looking to invest in themselves.

And this is exactly the moment to do it. While a gazillion developers hunt for magic incantations to make a probabilistic next-token predictor act like something other than a probabilistic next-token predictor, the people who’ve done their homework already know: better results with AI coding tools have very little to do with the tools, and almost everything to do with the processes around them.

And it’s a double-win. The practices that produce the best outcomes with AI are the exact same practices that produce the best outcomes without AI.

The key to being effective with AI is being effective without it.

And here’s the hedge, but only for the informed gamblers – developer hiring is rising again, but the demographic of these new hires is changing. Employers are favouring senior developers with significant pre-LLM experience.

I, and a few others, predicted this would happen. Demand would be highest for people who can do the things AI coding tools can’t – like, well, understand code. I mean really understand it. Not “LGTM” understanding. Deep comprehension of programs.

Not only that, but for all kinds of good reasons – economic, environmental, energy, ethical, geopolitical – the future of hyperscale LLMs is by no means predictable. Folks grappling with reduced token limits and rapidly degrading performance with Anthropic’s newest models will hopefully have figured out by now that building workflows that depend heavily in hyperscale LLMs is building on quicksand.

Who are Acme Megacorp gonna’ hire – the dev who sits on their hands because they’re waiting for their token limit to reset, or the dev who can just carry on at roughly the same overall pace of delivery?

And we should be under no illusions that teams who’ve mastered the fundamentals of software delivery are routinely outperforming teams who haven’t – with or without AI. AI is clearly not the differentiator.

So, whether you’re going to apply these disciplines with Claude Code or Codex, or with IntelliJ or VS Code, they still matter – arguably more than ever.

And what are these disciplines? What is Essential Code Craft?

Specification By Example – build shared understanding and pin down requirements with testable specifications
Test-Driven Development – rapidly iterate working software designs with short delivery lead times and reliable releases
Continuous Integration – keep teams more in sync with their changes, merging and testing them many times a day to ensure a working, shippable-at-any-time product
Continuous Collaboration – keep teams on the same page by continuously communicating with practices like pair programming and teaming
Refactoring – reshape code to make change easier, while keeping it working and shippable at all times
Modular Design – optimise software architecture to localise the “blast radius” and minimise the cost of changes, while making rapid testing and smarter reuse easier
Continuous Inspection – minimise the bottleneck and the “LGTM” effect of downstream code review by making it a continuous and highly automated process
Continuous Delivery – combine these fundamentals in a delivery process that can get the proverbial peas from the farmer’s field to the kitchen table through rapid, reliable integration, build and deployment pipelines
Continuous Improvement – build development capability in an evidence-based way, learning what really works and what doesn’t as you build skills, automate tools and workflows, and explore and experiment with your approach – and that’s where I come in!)

Workshops on Specification By Example and Test-Driven Development are already live and taking registrations. If there’s demand, more will follow.

The roadmap is to build a set of repeating individual workshops, rotating monthly, that will eventually cover all of these disciplines – some explicitly, some implicitly like Continuous Integration and pair programming, which will be an integral part of most workshops.

Self-funders can pick and choose which to attend, and my hope is that they’ll be a bit like Pokemon cards – gotta collect ’em all!

Keep an eye on the Codemanship Ticket Tailor box office for details of upcoming workshops.

Also, details of new workshop times will be posted here first, so subscribe to this blog if you’d like to be kept in the loop for future workshops.

The AI-Ready Software Developer #23 – Speaking Clearly

_Psst. _{If your boss won’t invest in training you in Test-Driven Development, I’m running out-of-hours workshops on April 7 and 11 specifically for self-funding learners. £99 + UK VAT.}

I see a lot of wrongheaded takes on “AI” coding assistants and agents online, but one of the most misguided goes something along the lines of “Code doesn’t need to be easy for humans to understand anymore”.

Presumably, these people have never stopped to ask themselves why they’re called “language models”, but let’s entertain their chain of thought for a moment.

The theory is that humans won’t need to understand the code generated by LLMs (wrong!) because they won’t need to edit the code themselves (wrong!), and so code should be optimised for things like token efficiency, not readability (WRONG!)

Some even go so far as to propose token-optimised programming languages designed by and especially for LLMs.

Working backwards through this sequence of compounding errors quickly unpicks the logic.

First of all, even if LLMs could design a general-purpose, Turing-complete programming language from scratch – which they can’t – it would require huge numbers of examples to train them to work with such a language. It might work for small DSLs, but languages like Java and Go are a whole other ballpark.

Without that large corpus of training data – showing examples of every aspect of the language many, many times – every request to generate code in it would be out-of-distribution. There’s a good reason why Claude, GPT, Gemini etc are better at generating code in popular languages. So, are we going to write those millions of examples? In a non-human-readable programming language?

Secondly, when you look at the tokens in code, how much of it is language syntax – reserved words, semicolons and so on – and how much of it is names that we have chosen for the things in our code to make it more self-descriptive?

“Ah, but Jason, AI-generated code doesn’t need to be self-descriptive, because we don’t need to understand it.”

I have bad news for you on that front, I’m afraid. Research – both my experiments and larger-scale studies – have found that the less clear code is to human readers, the less clear it is to LLMs. When code is obfuscated, models make more mistakes interpreting it.

This should come as no surprise. Language Models – the clue is in the name. What did folks think they were matching patterns in?

Human-readable code is model-optimized code.

LLMs don’t “understand” code like compilers do, they pattern-match on language.

So:

Meaningful identifiers = stronger semantic anchors
Consistent naming = better pattern recall
Verbose clarity = richer signal

And, just as it’s incredibly helpful on a software team if everybody’s speaking the same shared language to describe the problem they’re setting out to solve – including (especially) in the code itself – it’s also essential that we use consistent shared language in our prompts and in the code. If I ask Claude to change the “sales tax” calculation logic, and in the code the function’s called ‘v_tx’, we’re probably not going to get a glowing result.

Confusion of Tongues: The Construction of the Tower of Babel, Lucas van Valckenborch, 1594

And then there’s the take that humans won’t need to edit the code. Some even go so far as to say that LLMs will be generating machine code directly, bypassing the human-readable source.

The fact of the matter is that LLMs are unreliable. Very unreliable. I’ve documented the common experience of agents like Claude Code getting stuck in “doom loops”, when a task or problem goes outside their training data distribution, and they just can’t do it. This is a fact of life when we’re working with the technology, and the general consensus – even among the hyperscalers – is that it’s an unfixable problem at any achievable scale.

There will always be a need for the human in the loop, and there will always be a need for that human to understand the code.

If it’s token efficiency you’re after, the smart thing to do is to focus on reducing the probability of mistakes and unnecessary rework. Deliberately obfuscating our code is going to work against us in that respect.

The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that catch the boo-boos before they get deployed at the very latest. Ideally, catching the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far after a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time-out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can be used to incrementally do this on code that’s changed, so you don’t need to run it again and again against code that isn’t changing, making it practical to do within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues relating to maintainability or security etc is code reviews. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

It catches problems straight away, before I start pouring more cement on them
It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique; commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it become little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.

As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.