Essential Code Craft – Test-Driven Development – April 7 & 11

Something I hear often: “I’d love to go on one of your courses, but my boss won’t pay for it”.

Codemanship’s new Essential Code Craft training workshops are aimed at software developers who are self-funding their professional growth. If your employer won’t invest in you, perhaps you can invest in you. (Businesses and other VAT-registered entities should visit codemanship.co.uk for details of corporate training for teams.)

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Using TDD, teams have shortened delivery lead times dramatically while actually improving the reliability of their releases and lowering the cost of changing software.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

In this introductory workshop, you will learn how to solve problems working in TDD micro-cycles, rigorously specifying desired software behaviour using tests, writing the simplest solution code to pass those tests, and refactoring safely to enable a simple, clean design to emerge.
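To give a flavour of what a single micro-cycle looks like, here's a minimal sketch in Python (my own illustrative example, not material from the workshop – the `total` function and its tests are invented):

```python
def total(prices):
    """Simplest code that passes the tests written so far."""
    return sum(prices)

# RED: each test below started life as a small failing test,
# rigorously specifying one piece of desired behaviour...
def test_total_of_no_items_is_zero():
    assert total([]) == 0

def test_total_is_sum_of_item_prices():
    assert total([100, 250]) == 350

# ...GREEN: write the simplest code that passes;
# REFACTOR: clean up the design while the tests stay green.
test_total_of_no_items_is_zero()
test_total_is_sum_of_item_prices()
```

Each pass around the cycle adds one test, one small slice of solution code, and a safe tidy-up before moving on.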

The emphasis will be on learning by doing, with succinct practical instruction and guidance from a 25-year+ TDD practitioner and teacher.

You will work in pairs in your chosen programming language, swapping through Continuous Integration into a shared GitHub* repository after each TDD cycle, reinforcing the relationship between TDD, refactoring, version control and CI.

When you register, you’ll be asked to list up to 3 programming languages you’re comfortable working in (e.g., Java, Ruby, Go), and I’ll use that to pair folks up as best I can for the exercise. (Tip: put at least one popular one on the list – we may struggle to find you a pairing partner for Prolog.)

I’ll demonstrate in either Java, Python, JS or C#, depending on which of those is listed most often by registrants.

* Requires an active personal GitHub account

This workshop includes a 15-minute break

Find out more and register here

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that has come up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.
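That determinism is trivially easy to demonstrate. A toy sketch (illustrative only – not real banking code):

```python
# Deterministic behaviour: same input, same state, same output - every time.
def debit(balance: int, amount: int):
    """Return the new balance and a response message."""
    if amount > balance:
        return (balance, "No can do, amigo")
    return (balance - amount, "OK")

# Run it a thousand times; the answer never changes.
results = {debit(50, 51) for _ in range(1000)}
print(results)  # {(50, 'No can do, amigo')}
```

One test run tells us everything; a second run tells us nothing new.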

With, say, a Large Language Model, the exact same input fed to the exact same model – its internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.
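To labour the dice analogy, here's what "looking at the distribution" means in practice (a toy simulation, seeded for reproducibility):

```python
import random
from collections import Counter

random.seed(42)  # reproducible for the illustration

def throw_pair():
    """Throw two six-sided dice and return the total."""
    return random.randint(1, 6) + random.randint(1, 6)

# One throw proves nothing; the distribution over many throws is the data.
throws = [throw_pair() for _ in range(100_000)]
dist = Counter(throws)
for total in range(2, 13):
    print(total, round(dist[total] / len(throws), 4))
```

A single 7 tells us nothing about the dice; 100,000 throws reveal that 7 comes up about one time in six, exactly as theory predicts for fair dice.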

I’ve been using small-scale, closed-loop experiments – once one’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scored by how many tests pass.)
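The shape of such a fitness function is very simple – here's a sketch (the suite and the "system under test" are invented for illustration; real suites drive deployed software):

```python
def run_acceptance_suite(system, suite):
    """Score a system against a hidden suite of (input, expected) pairs.
    Each test either passes or fails - no partial credit."""
    passed = sum(1 for given, expected in suite if system(given) == expected)
    return passed, len(suite)

# A toy system under test and a toy hidden suite:
hidden_suite = [(2, 4), (3, 9), (10, 100)]
candidate = lambda x: x * x

passed, total = run_acceptance_suite(candidate, hidden_suite)
print(f"{passed}/{total} acceptance tests passed")  # 3/3 acceptance tests passed
```

The point is that the score is computed, not vibed: nobody gets to eyeball the output and declare "LGTM".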

But these are small-scale. I might run them only 10 times, because I simply can’t afford more tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.

And that’s why I’ve leaned more on related large-scale peer-reviewed studies – the same phenomena, but in different problem domains – and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fuelled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.


The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.
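The mechanism can be shown with a toy simulation (my own illustration, not the published model-collapse methodology): if each "generation" produces outputs that cluster closer to the mean of its training data – here, crudely, by averaging pairs of training examples – diversity collapses fast.

```python
import random
import statistics

random.seed(1)

# Generation 0: diverse "human-created" data.
data = [random.gauss(0, 10) for _ in range(10_000)]

def next_generation(samples):
    """A toy 'model': each output is the average of two random training
    examples, so outputs cluster closer to the mean than the inputs did."""
    return [
        (random.choice(samples) + random.choice(samples)) / 2
        for _ in range(len(samples))
    ]

gen = data
for g in range(5):
    print(f"generation {g}: stdev = {statistics.stdev(gen):.2f}")
    gen = next_generation(gen)
```

In this toy, the variance halves every generation: the long tail vanishes, and everything converges on the average.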

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.


As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Rely On AI And Get Left Behind

LLM-based code generation comes with multiple potential gotchas for a business.

More code hitting process bottlenecks like testing, review and integration makes most teams slower.

More problems being generated by text prediction engines that simply don’t understand what the requirements or the code means makes releases less stable.

These are effects we’re seeing today. Most dev teams are shipping less reliable software, and shipping it later thanks to “AI”. Good work, everyone!

But I suspect the biggest risk is creeping up on us without us necessarily realising.

The more teams offload their thinking to these tools, the less they understand the code their business relies on. And the less they understand it, the less able they are to fix problems that “AI” can’t fix. Outages last longer, and fixes are less sticky.

Mean Time To Failure goes down, and Mean Time To Recovery gets longer. More fires, taking longer to put out. Just ask the folks at AWS!

“AI”-assisted programming’s a bit like pedal-assist on electric bicycles. It makes progress feel easy, but we might not realise the impact that’s having on our coding “muscles” – our ability to comprehend and reason about code.

That is, until we run out of juice and have to pedal unaided. That’s when it becomes obvious just how much of our Code Fu we’ve lost as we’ve come to rely on that assistance more and more. Increasingly, I hear developers say “I’ve hit my token limit, so I’m blocked.”

Working in the low-gravity environment of “AI” coding, we may not notice the decline in our own abilities. And the more they decline, the less able we are to notice – until we find ourselves back in 1g and suddenly are unable to walk.

I’m seeing teams where most of the developers now have a hard time working on their code unaided by “AI”. It’s taking them much longer to wrap their heads around it, and they’re missing more and more problems.

Shipping code faster than we can understand it creates comprehension debt – extra time needed to build that understanding when the tool can’t do what we need it to do (and the need for that is very probably never going away).

Relying on LLMs to write the code for us erodes our ability to understand code, full stop. It’s a double-whammy.

There’s little doubting now that the devs who are most effective using “AI” are the ones who can work just fine without it.

Some say “Use AI or get left behind”. The reality is more probably “Rely on AI and get left behind”. Developers need to maintain their cognitive edge, which means keeping their hand in with regular unaided working.

There’s every chance that, in the future, these developers will be most in-demand.

If developers don’t maintain that edge, then this is going to add a lot of risk for businesses – not least because the vendors will have them over a barrel. Price plans are heavily subsidised today. That can’t last, and if you’re not able to walk away, then you have no negotiating position.

That more businesses don’t see becoming completely reliant on a third-party as a strategic risk is baffling, especially in such a volatile market as “AI”. It could all implode at any moment.

But the real risk is that, one day, your core product or system will break, and there’ll be nobody capable of fixing it.

Well, nobody you could afford.

100% Autonomous “Agentic” Coding Is A Fool’s Errand

Though I’ve seen very little evidence of it being attempted on production systems with real users (because risk), my socials are flooded with posts about people’s attempts to crack fully-autonomous, completely-unattended software creation and evolution using “agents” at scale.

Demonstrations by Cursor and Anthropic of large-scale development done – they claim – almost entirely by agents working in parallel have proven that the current state of the art produces software that doesn’t work. Perhaps, to those businesses, that’s just a minor detail. In the real world, we kind of prefer it when it does.

I’ve attempted experiments myself to see if I can get to a set-up good enough that I can hit “Play” and walk away to leave the agents to it while I go to the proverbial pub.

That seems to be the end goal here – the pot of gold at the end of the rainbow. Whoever makes that work will surely, at the very least, make a name for themselves, and probably a few coins.

I’ve seen many people – some who understand this technology far better than me – attempt the same thing. Curiously, they don’t seem to have nailed it either, but are convinced that somebody else must have.

It’s that FOMO, I suspect, that continues to drive people to try, despite repeated failures.

But, as of writing, I’ve seen no concrete evidence that anybody has done it successfully on any appreciable scale. (And no, a GitHub repo you claim was 100% agent-generated, “Trust me bro”, doesn’t qualify, I’m afraid.)

The rules of my closed-loop experiments are quite simple: I can take as much time as I like setting things up for Claude Code in read-only planning mode, but once the wheels of code generation are set in motion, we’re like an improv troupe – everything it suggests, the answer is automatically “yes”. I just let it play out.

Progress is measured with pre-baked automated acceptance tests driving deployed software, which act as a rough proxy for “value created”, and help to avoid confirmation bias and the kind of “LGTM” assessments of progress that plague accounts of “agentic” achievements right now. It’s very much an “either it did, or it didn’t” final quality bar.

I can’t intervene until either Claude says it’s done, or progress stalls. I can’t correct anything. I can’t edit any of the generated files. I have to simply sit back, watch and wait.

So far, no matter how I dice it and slice it, no set-up has produced 100% autonomous completion, or anything close.

No doubting, the tools are improving – using LLMs in smarter ways. But there’s only so much we can do with context management, workflow, agent coordination, quality gates and version control before we reach the limits of reliability that are possible when LLMs are involved. I suspect some of us are almost at that plateau already.

Agents – with those faulty narrators at their core – will always get stuck in “doom loops” where the problem falls outside their training data distribution, or the constraints we try to impose on them conflict.

Round and round little Ralph Wiggum will go, throwing the dice again and again in the hope of getting 13, or any prime number greater than 5 that’s also divisible by 3.

Out-of-distribution problems will always be a feature of generative transformers. It’s an unfixable problem. The best solution OpenAI have managed to come up with is having the model look at the probabilities and, if there’s no clear winner for next token, reply “I don’t know”. That’s not good news for full autonomy.

And, no, a swarm of Ralphs won’t solve the problem, either. It just creates another major problem – coordination. No matter how many lanes your motorway has, ultimately every change has to go through the same garden gate of integration at the end.

A bunch of agents checking in on top of each other will almost certainly break the build, and once the build’s broken, everybody’s blocked, and your beeper is summoning you back from the proverbial pub to unblock them.

One amusing irony of all these attempts to fully define 100% autonomous “agentic” workflows is that it’s turning many advocates into software process engineers.

Just taking quality gates as the example, a completely automated code quality check will require us to precisely and completely describe exactly what we mean by “quality”, and in some form that can be directly interpreted against, for example, the code’s abstract syntax tree.

I know Feature Envy when I see it, but describing it precisely in those terms is a whole other story. Computing has a long history of teaching us that there are many things we thought we understood that, when we try to explain them to the computer, it turns out we don’t.
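To illustrate just how crude a fully-automated check has to be, here's a sketch of a Feature Envy "detector" that works directly against Python's abstract syntax tree (the heuristic, the `Checkout` sample and all the names are invented for illustration – this is nowhere near a real code quality check):

```python
import ast

def find_feature_envy(source: str):
    """Flag methods that touch another object's attributes more often
    than self's - a deliberately crude proxy for Feature Envy."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Count attribute accesses per receiver name inside the method.
        counts = {}
        for sub in ast.walk(node):
            if isinstance(sub, ast.Attribute) and isinstance(sub.value, ast.Name):
                counts[sub.value.id] = counts.get(sub.value.id, 0) + 1
        others = {name: n for name, n in counts.items() if name != "self"}
        if others:
            envied, n = max(others.items(), key=lambda kv: kv[1])
            if n > counts.get("self", 0):
                smells.append((node.name, envied))
    return smells

sample = '''
class Checkout:
    def confirm(self, order):
        order.validate()
        order.reserve_stock()
        order.send_confirmation(self.email)
'''
print(find_feature_envy(sample))  # [('confirm', 'order')]
```

Even this trivial heuristic produces false positives and negatives by the bucketload – which is exactly the point: pinning "quality" down mechanically is far harder than recognising it.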

Software architecture and design is replete with woolly concepts – what exactly is a “responsibility”, for example? How could we instruct a computer to recognise when a function or a class has more than one reason to change? (Answers on a postcard, please.)

Fully autonomous code inspections are really, really, really (really) hard.

90% automated? Definitely do-able. But skill, nuance and judgement will likely always be required for the inevitable edge cases.

Having worked quite extensively in software process engineering earlier in my career, I know from experience that it’s largely a futile effort.

We naively believed that if we just described the processes well enough – the workflows, the inputs, the outputs, the roles and the rules – then we could shove a badger in a bowtie into any of those roles and the process would work. No skill or judgement was required.

You can probably imagine why this appealed to the people signing the salary cheques.

It didn’t work, of course. Not just because it’s way, way harder to describe software development processes to that level of precision, but also because – you guessed it – teams never actually did it the way the guidance told them to. They painted outside the lines, and we just couldn’t stop them.

In 2026, some of us are making the same mistakes all over again, only now the well-dressed badger’s being paid by the token.

We might get 80% of the way and think we’re one-fifth away from full autonomy, but the long and checkered history of AI research is littered with the discarded bones of approaches that got us “most of the way”. Close, but no cigar.

It turns out that last few percent is almost always exponentially harder to achieve, as it represents the fundamental limits of the technology. On the graph of progress vs. cost, 100% is typically an asymptote. We need to recognise a wall when we see one and back away to where the costs make sense.

Attempting to achieve better outcomes using agents with more autonomy seems like a reasonable pursuit, as long as we’re actually getting those better outcomes – shorter lead times, more reliable releases, more satisfied customers.

Folks I know being successful with an “agentic” approach have stepped back from searching for the end of that rainbow, and have focused on what can be achieved while staying very much in the loop.

They let the firehose run in short, controlled bursts and check the results thoroughly – using a combination of automated checks and their expert judgement – after every one. And for a host of reasons, that’s probably why they’re getting better results.

It’s highly likely there’s no end to the “agentic” rainbow. Perhaps we should start looking for some gold where we actually are, using tools we’ve actually got?

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my times as developer rather than architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modelling of architecture quite deeply, reverse-engineering code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I do not find purely structural descriptions very helpful in understanding what code does. It only describes what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long-story-short, turns out they didn’t. Not a single end-to-end use case scenario.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, a dynamic model driven by usage examples is preferable in aiding my comprehension of architecture to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This serves a dual purpose: documenting scenarios, and enabling us to see explicitly whether the software satisfies them.
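For example, the banking scenario from earlier could be captured as an executable specification like this (the domain and names are invented for illustration):

```python
class Account:
    """A toy account, here only to give the scenario something to run against."""
    def __init__(self, balance):
        self.balance = balance

    def debit(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

def test_debit_is_refused_when_funds_are_insufficient():
    """Scenario: a customer with $50 available asks to debit $51."""
    account = Account(balance=50)
    try:
        account.debit(51)
        assert False, "expected the debit to be refused"
    except ValueError:
        pass
    # The refused debit leaves the balance untouched.
    assert account.balance == 50

test_debit_is_refused_when_funds_are_insufficient()
```

Unlike a Word document, this description of the scenario fails loudly the moment it stops being true.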

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one application of machine learning I haven’t yet seen in software development is mining the rich stream of IDE, source-file and code-level events we give off as we work, to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

Why LLMs Will Always Need An Expert In The Loop

The gargantuan valuations of “AI” companies – companies making or using LLMs – are built on things that investors believe the technology is going to do in the future.

That future, they’ve promised us many times, has been “just around the corner” since 2022. But the reality, when we actually observe it, has remained far less impressive.

We can keep falling for the “Yes, but this time it’s different” claims every time a new model comes out, or we can look ahead at the theoretical limits of the technology.

Large Language Models are massive, hyperdimensional, probabilistic pattern-matching and text prediction systems (yep, they’re basically predictive texting). What theoretical discipline could meaningfully predict the limits of their performance as they scale towards infinity?

There’s a branch of physics called Statistical Mechanics that enables researchers to describe the behaviour of complex, probabilistic systems at scale. It uses probability to predict macroscopic thermodynamic properties of systems – like entropy, pressure and heat – from the microscopic properties of atoms and molecules.

Tools from statistical mechanics are increasingly used to understand deep neural networks at scale by treating their millions or billions of parameters as interacting degrees of freedom in a high-dimensional system.

Concepts such as energy landscapes, phase transitions, and collective behavior help explain why models can exhibit sudden changes in capability, remarkable robustness – and the lack thereof – or predictable scaling trends as size and data grow.

Researchers have been using statistical mechanics to explore questions like how much more accurate LLMs are likely to get with greater scale.

For folks claiming that “hallucinations” will soon be a solved problem, it’s not good news.

In their paper, “The wall confronting large language models”, Peter V. Coveney and Sauro Succi showed that producing a model with an order of magnitude better accuracy – e.g., wrong 3% of the time instead of 30% – would require 10²⁰ times more data and compute to train.

We’re talking Dyson Spheres (plural) to train a model of that size on similar timescales to the frontier models of today.

And it would require a further factor of 10²⁰ to go from 3% to 0.3%.
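To make that scaling concrete, here’s a back-of-the-envelope sketch. If error falls as a power law in training data, the figures quoted above imply a tiny exponent – the value below is inferred from those figures, not lifted from the paper itself:

```python
# Back-of-the-envelope sketch of the scaling claim above.
# Assume error ∝ data^(-alpha). A 10x error reduction costing 10^20x
# more data implies alpha = 1/20 = 0.05 (inferred, illustrative only).

def data_multiplier(error_before, error_after, alpha=0.05):
    """How much more training data a power-law error curve would need."""
    return (error_after / error_before) ** (-1 / alpha)

print(f"{data_multiplier(0.30, 0.03):.1e}")    # 30% -> 3% error: ~1e20
print(f"{data_multiplier(0.03, 0.003):.1e}")   # 3% -> 0.3% error: another ~1e20
```

The brutal part is that the multiplier depends only on the ratio of the errors, so every further order of magnitude of accuracy costs the same astronomical factor again.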

To scale an LLM to be as reliable as a modern compiler – as some people wrongheadedly claim they are – you’d probably need to be a Type V civilisation. And why would anybody burn the multiverse to create something undergrads can probably make?

Any kind of text-predicting AI that performs with that level of accuracy won’t be a generative transformer. It’ll be technology that doesn’t exist yet.

“Ah, but Jason, models are improving in benchmarks all the time.”

Yeah, about that. What researchers and users are finding when they measure it objectively is that, while benchmark scores do indeed improve, real-world performance is a very different story.

The newer scores can be explained by the fact that many benchmarks contain questions, prompts, or examples that overlap with the model’s training data. Combined with overfitting – fine-tuning models to benchmark data – this creates the illusion of solving problems when it’s actually statistical recall. Models are being taught to the test.

Summary: thoroughly testing & reviewing LLM output is here to stay.

Another question statistical mechanics sheds light on is how much better deep neural networks will get at handling long-range, “deep” patterns.

In his 2019 paper, “Mutual Information Scaling and Expressive Power of Sequence Models”, Huitao Shen links long‑range dependencies in recurrent neural networks to mutual information scaling, a concept also studied in statistical mechanics. Shen found that exponential decay of mutual information with distance implies difficulty in capturing very long dependencies.

And in his paper “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Michael Hahn explains how attention at long ranges will always be a problem for generative transformers at any scale.

Cliffs Notes version: LLMs will always be driving in thick fog. This explains why they’re so bad at modular design.

The practical upshot of this is that LLMs don’t now, and very probably never will, “see” the bigger picture. For the foreseeable future, high-level concerns around modular design, dependencies and architecture will remain human concerns.

Any AI that can effectively handle long-range patterns will be built on a technology that doesn’t exist yet.

So, there are two very likely constraints we can work with (or, more accurately, around): we should not expect LLMs to get significantly more reliable, and we should not expect them to make sound decisions about modular design, dependencies and high-level architecture on non-trivial systems.

The full skinny: a software developer will be required behind the wheel for any “AI” or “agentic” workflow that’s built around LLMs.

It’s physics.

An Ode To “It Can’t Be Done”

Yesterday, on its 25th birthday, I blogged about the Manifesto for Agile Software Development.

Whenever I mention Agile (as she’s known to her friends and to her enemies), inevitably there’ll be at least one comment along the lines of “It can’t be done” or “Nobody’s ever done it”.

They may qualify that by citing organisational or technical obstacles to applying the values and principles laid out in the document, like entrenched management or unskilled developers or a risk-averse, command-and-control culture.

For sure, these are all obstacles I’ve faced many times. Fair comment.

Having said that, these are also obstacles that I’ve overcome in 90% of instances.

It’s possible to manage upwards, winning the support needed for – at the very least – your customer to actively engage with an agile development process.

You don’t need to change the whole organisation to create an island of self-organisation and rapid feedback. This is why I focus on teams and individuals, and don’t get involved with organisation-level “agile transformations”. From what I’ve seen in the last 25 years, They. Just. Don’t. Work. Not on that scale. But that’s a whole other post.

As for developer skills, well, there’s a reason I ended up becoming a trainer and mentor in skills like usage-driven analysis & design, TDD, refactoring and continuous integration. To some extent, I’ve had to train most of the teams I’ve worked with, going back to the late 1990s.

My response when friends complain that the team they’re stuck with “don’t know what they’re doing” is to ask “And what are you going to do about that?”

Command-and-control cultures are often a product of lack of trust. And in many cases, that lack of trust has been earned by past performance. Many disappointments, many broken promises. So organisations micromanage, and if anything’s guaranteed to end an experiment in software agility, it’s that.

To earn trust, teams need to deliver rapidly, reliably and sustainably. To gain the autonomy needed to deliver rapidly, reliably and sustainably, teams need to be trusted. Catch-22.

Here’s the thing, though. Nothing’s stopping you from having a conversation about that. There’s no law that says you can’t spell out the impasse to the managers you’ll be needing that trust from.

There’s a window of opportunity here for someone coming in to the organisation with fresh eyes to break the cycle. And that’s been my specialty for most of my career.

This is where it can help to have some courage, as well as a healthy disrespect for hierarchical authority.

Does it ruffle feathers? You betcha! You’ll struggle to find a project manager who has a good word to say about me.

Did we actually deliver? Almost always, yes.

Often begrudgingly, organisations have to acknowledge that the medicine is working after the team’s delivered and delivered and delivered, and users are making positive noises about the quality of it.

The fact that it’s hardly ever your code that wakes the head of engineering up at 2 am definitely sends a message.

The bargain with the Devil, of course, is that you have to deliver, and deliver, and deliver. Results are a much easier sell than promises. You really need to know what you’re doing.

While the rest of the engineering department may still be stuck in micromanagement hell, my experience has been that a reliable track record – suitably trumpeted to make sure the right people notice (dev teams need good PR) – will usually encourage management to back off.

Not always, of course. Sometimes it really is just about status and control, and I’ve used more ruthless tactics in those situations.

To borrow from comedian Stewart Lee, I can do office politics. I just choose not to.

I’d much rather have an honest conversation, but in leadership roles, my job is to remove the obstacles standing in the way of the team (and the customer, and sometimes the customer is the obstacle, so some tough love may be required).

If the choice is between creating value, and preserving the status quo, you can probably guess which door I normally go through.

Some teams are working in industries that are heavily regulated, where development processes are highly prescribed. There are a lot of boxes that need to be ticked before software can be released.

Here’s a little secret that I’ll let you have for free: you can fit a more iterative process inside a less iterative process without the high-ups and the box-tickers realising. As John Daniels and Alan Cameron Wills once said to me when we hired their company to do some specialised custom work for my client, “Yeah, we can make it look like that.”

The manual demands “Big Design Up-Front”? We stick a UML reverse-engineering step into our build pipeline. And for once, they get an architecture model that accurately reflects the actual design! We don’t need to mention that we already wrote the code.

(I would normally do something like that anyway, because an up-to-date-ish bigger picture helps us steer the architecture more effectively.)

More generally, when teams I’ve worked with have faced obstacles to rapid, reliable and sustainable development, we’ve almost always found ways to remove them or go around them or dig a tunnel under them.

9 times out of 10, the solution is very simply not to ask for permission. Just f**king do it.

Do we need the boss’s approval to write the tests first? No. If we’re delivering, what business is that of theirs?

Do we need C-suite buy-in to check-in changes in smaller batches? Again. Who’s gonna know?

It’s a happy accident that some of the changes we could make that have the highest leverage – the biggest impact for the smallest investment – are also the least visible outside the team. We don’t need to get budget to start writing unit tests. We don’t need the CTO’s signature to run a linter.

It’s usually only when the change you want to make will require people outside the team to change that buy-in is really required. This is where cohesive cross-functional teams shine best. The less we need from other teams or from management, the more autonomous we can choose to be.

And becoming agile under the radar reduces the risk of attracting the attention of organisational antibodies. We can be delivering very visible results, and the mechanisms involved can be largely invisible.

I’ve worked in teams where we’ve done everything from setting up our own token-ring network because the client wouldn’t allow us on theirs (obviously, don’t do this if you work for MI5), to inventing a team member so we could use their PC as a build server after they refused to give us a machine.

If it’s in the business’s best interest, we can usually find a way. Just takes a bit of creative thinking, a pinch of courage, and a little dash of charm. “Wah, wah, wah, we want a build machine!” is less persuasive than you think. At the very least, be ready to explain why it’s in their interest. A lot of the time, “It can’t be done” really means “I couldn’t persuade them”.

I’d guesstimate that more than half the successes I’ve been involved with doing software development in an agile way have been despite the hand we were dealt.

And never underestimate the bargaining power of a united team. Those 10% of times when greater agility stayed firmly, stubbornly out of reach were when I was the lone voice in the wilderness. In those situations, I cut my losses. And I’ve grown much better at recognising the signs before I engage.

By 2000, I was usually in a position to change the makeup of my team, too. Extreme Programmers of a feather and so on.

The poker metaphor is very appropriate here, since the real benefit of agile software development is minimising risk. We don’t let uncertainty pile up into big batches and big releases, and managers usually find that their picture of actual progress is far more realistic, enabling them to make more informed decisions grounded in the reality of user feedback about working software.

So not only can it be done, it has been done, and successfully, many times. I’ve seen it, and I’ve lived it.

You might as well tell me that my house “can’t be built”.

Psst. Did you know that, as well as delivering high-quality training and coaching in agile development skills like TDD, refactoring and CI, I’m also available as a fractional principal engineer. I’ve worked with clients on an ongoing basis helping dev teams to turn “It can’t be done” into “We did it!” Let me have those difficult conversations with management and absorb the heat coming from the project office.

Tech Leaders: Low-Performing Teams Are A Gift, Not A Curse

It’s an age-old story. You’re parachuted in to lead a software development organisation that’s experiencing – let’s be diplomatic – challenges.

Releases are far apart and full of drama. Lead times are long, and unless it’s super, super necessary – and they’re prepared to dig deep into their pockets and wait – the business has just stopped asking for changes. It saves time. They know the answer will probably be “too expensive”.

The only respite comes with code freezes and release embargoes – never on a Friday, never over the holidays. But is a break really a break when you know what you’ll be coming back to?

Morale is low, the best people keep leaving, and they can almost smell the chaos on you when you’re trying to hire them.

The finger-pointing is never-ending. And soon enough, those fingers will be pointing at you. Honeymoon periods don’t last long, and the clock is ticking to effect noticeable positive change. They lit the fuse the moment you said “yes”.

I know, right! Nightmare!

Except it isn’t. For a leader new in the chair, it’s actually a golden ticket.

The mistake most tech leaders make is to go strategic – bring in the big guns, make the big changes. It’s a big song and dance number, often with the words “Agile” and/or “transformation” above the door.

This creates highly-visible activity at management and process level. But not meaningful change. Agility Theatre is first and foremost still theatre. You’re focusing on trying to speed up the outer feedback loops of software and systems development – at high cost – without recognising where the real leverage is.

Where I work as a teacher and advisor is at the level of small, almost invisible changes to the way software developers work day-to-day, hour-to-hour, minute-to-minute.

The biggest leverage in a system of feedback loops within feedback loops is in the innermost loops.

Think about it in programming terms: you have nested loops in a function that’s really slow, and you need to speed it up. Which loop do you focus your attention on – the outer loop, or the inner loop? Which one would optimising give you the most leverage for the least change?
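As a toy illustration (the function and numbers are invented), hoisting even a small amount of work out of the innermost loop pays off far more than trimming the outer one, because that’s where the iterations multiply:

```python
# Work inside the inner loop runs rows*cols times; work in the outer loop
# runs only rows times. So a small optimisation in the inner loop has the
# most leverage. Illustrative example only.

def total_slow(grid, tax_rate):
    total = 0.0
    for row in grid:                      # outer loop: len(grid) iterations
        for price in row:                 # inner loop: runs for EVERY cell
            multiplier = 1 + tax_rate     # invariant work, needlessly repeated
            total += price * multiplier
    return total

def total_fast(grid, tax_rate):
    multiplier = 1 + tax_rate             # hoisted: computed exactly once
    total = 0.0
    for row in grid:
        for price in row:
            total += price * multiplier
    return total

grid = [[1.0, 2.0], [3.0, 4.0]]
assert total_slow(grid, 0.2) == total_fast(grid, 0.2)  # same answer, less work
```

The same logic applies to feedback loops: a tiny improvement to something developers do hundreds of times a day compounds far faster than a big improvement to something the organisation does once a quarter.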

I worked with a FinTech company in London who called me in because they’d been going round in circles trying to stabilise a release. Every time they deployed, the system exploded on contact with users – costing a lot of money in fines and compensation, and doing much damage to their reputation.

“Agile coaches” had been engaging with management and, to a lesser extent, teams to try to “fix the process”.

I took one look, saw their total reliance on manual testing, and recognised the “doom loop” immediately. Testing the complete system took a big old QA team several weeks, during which time the developers were busy fixing the bugs reported from the last test cycle, while – of course – introducing all-new bugs (and reintroducing a few old favourites).

They’d been stuck in this loop for 9 months, at a cost running into millions of pounds – plus whatever they’d been paying for the “Agile coaching” that had made no discernible difference. You can’t fix this problem with “stand-ups”, “cadences” and “epics”.

I took the developers into a meeting room and taught them to write NUnit tests. We prioritised the most unstable features, and the problem was gone within 6 months, never to return.

And something else happened – something quite magical. Not only did the software become much more reliable, but release cycles got shorter. Better software, sooner.

That’s quite the notch on the belt for a new CTO or head of engineering, dontcha think?

After decades of helping organisations build software development capability, I know from wide experience that taking a team from bad to okay is way easier than taking them from okay to good, which is way easier than from good to excellent.

Having helped teams climb that ladder all the way to the top, I can attest that the first rung is a real gift. Your CEO or investors probably can’t distinguish excellent from good, but the difference between bad and okay is very noticeable to businesses.

Unit testing is table stakes in that journey, but the difference it makes is often huge. Low-visibility, but very high-leverage.

Here’s the 101: the inner loops of software development – coding, testing, reviewing, refactoring, merging – are where the highest-leverage changes can be made at surprisingly low cost.

90% of the time, nobody but the developers themselves need get involved. No extra budget’s required (you might even save money). No management permission need be sought. No buy-in from other parts of the organisation is necessary.

High-leverage changes that can slip under the radar, but that can also have quite profound impact.

And results are much easier to sell than promises. Imagine, when your honeymoon period’s up, being able to point to a graph showing how lead times have shrunk, and how fires in production have become rare.

I’ve seen that buy tech leaders the higher trust and autonomy they need to make those strategic changes that the infamous organisational antibodies may well have attacked before. They were immunised by results.

Now for the really good news. In all those cases I’ve been involved in – many over the years – tech leaders only did one thing.

I am not an “Agile coach” or a “transformation lead”. I do not sit in meetings with management. I do not address strategic matters or matters of high-level process.

I work directly with developers, helping them to make those high-leverage, mostly invisible changes to their innermost workflows. Your CEO will likely never meet me, or even hear my name.

The first rule of Code Craft is that we do not talk about Code Craft.

Right now, you may be thinking, “But Jason, we’ve got AI. Developers can produce 2x the code in the same time.”

Do you think 2x the code hitting that FinTech company’s testing bottleneck would have made things better? Or would that just have been more bugs hitting QA in bigger batches?

Attaching a code-generating firehose to your dev process is very likely to overwhelm bottlenecks like testing, code review and merging – making the system as a whole slower and leakier. I call it “LGTM-speed”.
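A toy queueing sketch (with invented numbers) shows why: once arrivals exceed a bottleneck’s capacity, the backlog doesn’t flow through faster – it just grows without bound:

```python
# Toy model of a fixed-capacity bottleneck (e.g. code review or QA).
# All numbers are invented for illustration.

def backlog_after(days, arrivals_per_day, capacity_per_day):
    backlog = 0
    for _ in range(days):
        # each day, new work arrives and at most `capacity_per_day` is cleared
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
    return backlog

# Suppose the team can review and test 10 changes a day.
print(backlog_after(30, arrivals_per_day=10, capacity_per_day=10))  # 0
print(backlog_after(30, arrivals_per_day=20, capacity_per_day=10))  # 300
```

Double the code hitting a bottleneck that hasn’t changed, and after a month you have a 300-change backlog – or, more realistically, 300 changes waved through at rubber-stamp speed.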

Tightening up those inner loops is even more essential when we’re using “AI” code generators. If we don’t, lead times get longer, releases get less stable, and the cost of change goes up. Worse software, later.

Good luck presenting that graph to the C-suite!

If you need help tightening up the inner feedback loops in your software development processes – with or without “AI” – that’s right up my street.

Visit https://codemanship.co.uk/ for details of training and coaching in the technical practices that enable rapid, reliable and sustainable evolution of software to meet changing needs.