Essential Code Craft – Test-Driven Development – April 7 & 11

Something I hear often: “I’d love to go on one of your courses, but my boss won’t pay for it”.

Codemanship’s new Essential Code Craft training workshops are aimed at software developers who are self-funding their professional growth. If your employer won’t invest in you, perhaps you can invest in you. (Businesses and other VAT-registered entities should visit codemanship.co.uk for details of corporate training for teams.)

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Using TDD, teams have shortened delivery lead times dramatically while actually improving the reliability of their releases and lowering the cost of changing software.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

In this introductory workshop, you will learn how to solve problems working in TDD micro-cycles, rigorously specifying desired software behaviour using tests, writing the simplest solution code to pass those tests, and refactoring safely to enable a simple, clean design to emerge.
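To give a flavour of what a single micro-cycle looks like, here's a minimal sketch in Python (my own illustrative example, not material from the workshop – the `total` function and its tests are invented):

```python
def total(prices):
    """Simplest code that passes the tests written so far."""
    return sum(prices)

# RED: each test below started life as a small failing test,
# rigorously specifying one piece of desired behaviour...
def test_total_of_no_items_is_zero():
    assert total([]) == 0

def test_total_is_sum_of_item_prices():
    assert total([100, 250]) == 350

# ...GREEN: write the simplest code that passes;
# REFACTOR: clean up the design while the tests stay green.
test_total_of_no_items_is_zero()
test_total_is_sum_of_item_prices()
```

Each pass around the cycle adds one test, one small slice of solution code, and a safe tidy-up before moving on.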

The emphasis will be on learning by doing, with succinct practical instruction and guidance from a 25-year+ TDD practitioner and teacher.

You will work in pairs in your chosen programming language, swapping through Continuous Integration into a shared GitHub* repository after each TDD cycle, reinforcing the relationship between TDD, refactoring, version control and CI.

When you register, you’ll be asked to list up to 3 programming languages you’re comfortable working in (e.g., Java, Ruby, Go), and I’ll use that to pair folks up as best I can for the exercise. (Tip: put at least one popular one on the list – we may struggle to find you a pairing partner for Prolog.)

I’ll demonstrate in either Java, Python, JS or C#, depending on which of those is listed most often by registrants.

* Requires an active personal GitHub account

This workshop includes a 15-minute break

Find out more and register here

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that has come up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.
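That determinism is trivially easy to demonstrate. A toy sketch (illustrative only – not real banking code):

```python
# Deterministic behaviour: same input, same state, same output - every time.
def debit(balance: int, amount: int):
    """Return the new balance and a response message."""
    if amount > balance:
        return (balance, "No can do, amigo")
    return (balance - amount, "OK")

# Run it a thousand times; the answer never changes.
results = {debit(50, 51) for _ in range(1000)}
print(results)  # {(50, 'No can do, amigo')}
```

One test run tells us everything; a second run tells us nothing new.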

With, say, a Large Language Model, the exact same input fed to the exact same model – its internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.
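To labour the dice analogy, here's what "looking at the distribution" means in practice (a toy simulation, seeded for reproducibility):

```python
import random
from collections import Counter

random.seed(42)  # reproducible for the illustration

def throw_pair():
    """Throw two six-sided dice and return the total."""
    return random.randint(1, 6) + random.randint(1, 6)

# One throw proves nothing; the distribution over many throws is the data.
throws = [throw_pair() for _ in range(100_000)]
dist = Counter(throws)
for total in range(2, 13):
    print(total, round(dist[total] / len(throws), 4))
```

A single 7 tells us nothing about the dice; 100,000 throws reveal that 7 comes up about one time in six, exactly as theory predicts for fair dice.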

I’ve been using small-scale, closed-loop experiments – once one’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scored by how many tests pass.)
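The shape of such a fitness function is very simple – here's a sketch (the suite and the "system under test" are invented for illustration; real suites drive deployed software):

```python
def run_acceptance_suite(system, suite):
    """Score a system against a hidden suite of (input, expected) pairs.
    Each test either passes or fails - no partial credit."""
    passed = sum(1 for given, expected in suite if system(given) == expected)
    return passed, len(suite)

# A toy system under test and a toy hidden suite:
hidden_suite = [(2, 4), (3, 9), (10, 100)]
candidate = lambda x: x * x

passed, total = run_acceptance_suite(candidate, hidden_suite)
print(f"{passed}/{total} acceptance tests passed")  # 3/3 acceptance tests passed
```

The point is that the score is computed, not vibed: nobody gets to eyeball the output and declare "LGTM".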

But these are small-scale. I might run them only 10 times, because I simply can’t afford more tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.

And that’s why I’ve leaned more on related large-scale peer-reviewed studies – the same phenomena, but in different problem domains – and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fuelled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.


The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.
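The mechanism can be shown with a toy simulation (my own illustration, not the published model-collapse methodology): if each "generation" produces outputs that cluster closer to the mean of its training data – here, crudely, by averaging pairs of training examples – diversity collapses fast.

```python
import random
import statistics

random.seed(1)

# Generation 0: diverse "human-created" data.
data = [random.gauss(0, 10) for _ in range(10_000)]

def next_generation(samples):
    """A toy 'model': each output is the average of two random training
    examples, so outputs cluster closer to the mean than the inputs did."""
    return [
        (random.choice(samples) + random.choice(samples)) / 2
        for _ in range(len(samples))
    ]

gen = data
for g in range(5):
    print(f"generation {g}: stdev = {statistics.stdev(gen):.2f}")
    gen = next_generation(gen)
```

In this toy, the variance halves every generation: the long tail vanishes, and everything converges on the average.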

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.


As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with our customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Rely On AI And Get Left Behind

LLM-based code generation comes with multiple potential gotchas for a business.

More code hitting process bottlenecks like testing, review and integration makes most teams slower.

More problems being generated by text prediction engines that simply don’t understand what the requirements or the code means makes releases less stable.

These are effects we’re seeing today. Most dev teams are shipping less reliable software, and shipping it later thanks to “AI”. Good work, everyone!

But I suspect the biggest risk is creeping up on us without us necessarily realising.

The more teams offload their thinking to these tools, the less they understand the code their business relies on. And the less they understand it, the less able they are to fix problems that “AI” can’t fix. Outages last longer, and fixes are less sticky.

Mean Time To Failure goes down, and Mean Time To Recovery gets longer. More fires, taking longer to put out. Just ask the folks at AWS!

“AI”-assisted programming’s a bit like pedal-assist on electric bicycles. It makes progress feel easy, but we might not realise the impact that’s having on our coding “muscles” – our ability to comprehend and reason about code.

That is, until we run out of juice and have to pedal unaided. That’s when it becomes obvious just how much of our Code Fu we’ve lost as we’ve come to rely on that assistance more and more. Increasingly, I hear developers say “I’ve hit my token limit, so I’m blocked.”

Working in the low-gravity environment of “AI” coding, we may not notice the decline in our own abilities. And the more they decline, the less able we are to notice – until we find ourselves back in 1g and suddenly are unable to walk.

I’m seeing teams where most of the developers now have a hard time working on their code unaided by “AI”. It’s taking them much longer to wrap their heads around it, and they’re missing more and more problems.

Shipping code faster than we can understand it creates comprehension debt – extra time needed to build that understanding when the tool can’t do what we need it to do (and the need for that is very probably never going away).

Relying on LLMs to write the code for us erodes our ability to understand code, full stop. It’s a double-whammy.

There’s little doubting now that the devs who are most effective using “AI” are the ones who can work just fine without it.

Some say “Use AI or get left behind”. The reality is more probably “Rely on AI and get left behind”. Developers need to maintain their cognitive edge, which means keeping their hand in with regular unaided working.

There’s every chance that, in the future, these developers will be most in-demand.

If developers don’t maintain that edge, then this is going to add a lot of risk for businesses – not least because the vendors will have them over a barrel. Price plans are heavily subsidised today. That can’t last, and if you’re not able to walk away, then you have no negotiating position.

That more businesses don’t see becoming completely reliant on a third-party as a strategic risk is baffling, especially in such a volatile market as “AI”. It could all implode at any moment.

But the real risk is that, one day, your core product or system will break, and there’ll be nobody capable of fixing it.

Well, nobody you could afford.

100% Autonomous “Agentic” Coding Is A Fool’s Errand

Though I’ve seen very little evidence of it being attempted on production systems with real users (because risk), my socials are flooded with posts about people’s attempts to crack fully-autonomous, completely-unattended software creation and evolution using “agents” at scale.

Demonstrations by Cursor and Anthropic of large-scale development done – they claim – almost entirely by agents working in parallel have proven that the current state of the art produces software that doesn’t work. Perhaps, to those businesses, that’s just a minor detail. In the real world, we kind of prefer it when it does.

I’ve attempted experiments myself to see if I can get to a set-up good enough that I can hit “Play” and walk away to leave the agents to it while I go to the proverbial pub.

That seems to be the end goal here – the pot of gold at the end of the rainbow. Whoever makes that work will surely, at the very least, make a name for themselves, and probably a few coins.

I’ve seen many people – some who understand this technology far better than me – attempt the same thing. Curiously, they don’t seem to have nailed it either, but are convinced that somebody else must have.

It’s that FOMO, I suspect, that continues to drive people to try, despite repeated failures.

But, as of writing, I’ve seen no concrete evidence that anybody has done it successfully on any appreciable scale. (And no, a GitHub repo you claim was 100% agent-generated, “Trust me bro”, doesn’t qualify, I’m afraid.)

The rules of my closed-loop experiments are quite simple: I can take as much time as I like setting things up for Claude Code in read-only planning mode, but once the wheels of code generation are set in motion, we’re like an improv troupe – everything it suggests, the answer is automatically “yes”. I just let it play out.

Progress is measured with pre-baked automated acceptance tests driving deployed software, which act as a rough proxy for “value created”, and help to avoid confirmation bias and the kind of “LGTM” assessments of progress that plague accounts of “agentic” achievements right now. It’s very much an “either it did, or it didn’t” final quality bar.

I can’t intervene until either Claude says it’s done, or progress stalls. I can’t correct anything. I can’t edit any of the generated files. I have to simply sit back, watch and wait.

So far, no matter how I dice it and slice it, no set-up has produced 100% autonomous completion, or anything close.

No doubting, the tools are improving – using LLMs in smarter ways. But there’s only so much we can do with context management, workflow, agent coordination, quality gates and version control before we reach the limits of reliability that are possible when LLMs are involved. I suspect some of us are almost at that plateau already.

Agents – with those faulty narrators at their core – will always get stuck in “doom loops” where the problem falls outside their training data distribution, or the constraints we try to impose on them conflict.

Round and round little Ralph Wiggum will go, throwing the dice again and again in the hope of getting 13, or any prime number greater than 5 that’s also divisible by 3.

Out-of-distribution problems will always be a feature of generative transformers. It’s an unfixable problem. The best solution OpenAI have managed to come up with is having the model look at the probabilities and, if there’s no clear winner for next token, reply “I don’t know”. That’s not good news for full autonomy.

And, no, a swarm of Ralphs won’t solve the problem, either. It just creates another major problem – coordination. No matter how many lanes your motorway has, ultimately every change has to go through the same garden gate of integration at the end.

A bunch of agents checking in on top of each other will almost certainly break the build, and once the build’s broken, everybody’s blocked, and your beeper is summoning you back from the proverbial pub to unblock them.

One amusing irony of all these attempts to fully define 100% autonomous “agentic” workflows is that it’s turning many advocates into software process engineers.

Just taking quality gates as the example, a completely automated code quality check will require us to precisely and completely describe exactly what we mean by “quality”, and in some form that can be directly interpreted against, for example, the code’s abstract syntax tree.

I know Feature Envy when I see it, but describing it precisely in those terms is a whole other story. Computing has a long history of teaching us that there are many things we thought we understood that, when we try to explain them to the computer, it turns out we don’t.
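To illustrate just how crude a fully-automated check has to be, here's a sketch of a Feature Envy "detector" that works directly against Python's abstract syntax tree (the heuristic, the `Checkout` sample and all the names are invented for illustration – this is nowhere near a real code quality check):

```python
import ast

def find_feature_envy(source: str):
    """Flag methods that touch another object's attributes more often
    than self's - a deliberately crude proxy for Feature Envy."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Count attribute accesses per receiver name inside the method.
        counts = {}
        for sub in ast.walk(node):
            if isinstance(sub, ast.Attribute) and isinstance(sub.value, ast.Name):
                counts[sub.value.id] = counts.get(sub.value.id, 0) + 1
        others = {name: n for name, n in counts.items() if name != "self"}
        if others:
            envied, n = max(others.items(), key=lambda kv: kv[1])
            if n > counts.get("self", 0):
                smells.append((node.name, envied))
    return smells

sample = '''
class Checkout:
    def confirm(self, order):
        order.validate()
        order.reserve_stock()
        order.send_confirmation(self.email)
'''
print(find_feature_envy(sample))  # [('confirm', 'order')]
```

Even this trivial heuristic produces false positives and negatives by the bucketload – which is exactly the point: pinning "quality" down mechanically is far harder than recognising it.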

Software architecture and design is replete with woolly concepts – what exactly is a “responsibility”, for example? How could we instruct a computer to recognise when a function or a class has more than one reason to change? (Answers on a postcard, please.)

Fully autonomous code inspections are really, really, really (really) hard.

90% automated? Definitely do-able. But skill, nuance and judgement will likely always be required for the inevitable edge cases.

Having worked quite extensively in software process engineering earlier in my career, I know from experience that it’s largely a futile effort.

We naively believed that if we just described the processes well enough – the workflows, the inputs, the outputs, the roles and the rules – then we could shove a badger in a bowtie into any of those roles and the process would work. No skill or judgement was required.

You can probably imagine why this appealed to the people signing the salary cheques.

It didn’t work, of course. Not just because it’s way, way harder to describe software development processes to that level of precision, but also because – you guessed it – teams never actually did it the way the guidance told them to. They painted outside the lines, and we just couldn’t stop them.

In 2026, some of us are making the same mistakes all over again, only now the well-dressed badger’s being paid by the token.

We might get 80% of the way and think we’re one-fifth away from full autonomy, but the long and checkered history of AI research is littered with the discarded bones of approaches that got us “most of the way”. Close, but no cigar.

It turns out that last few percent is almost always exponentially harder to achieve, as it represents the fundamental limits of the technology. On the graph of progress vs. cost, 100% is typically an asymptote. We need to recognise a wall when we see one and back away to where the costs make sense.

Attempting to achieve better outcomes using agents with more autonomy seems like a reasonable pursuit, as long as we’re actually getting those better outcomes – shorter lead times, more reliable releases, more satisfied customers.

Folks I know being successful with an “agentic” approach have stepped back from searching for the end of that rainbow, and have focused on what can be achieved while staying very much in the loop.

They let the firehose run in short, controlled bursts and check the results thoroughly – using a combination of automated checks and their expert judgement – after every one. And for a host of reasons, that’s probably why they’re getting better results.

It’s highly likely there’s no end to the “agentic” rainbow. Perhaps we should start looking for some gold where we actually are, using tools we’ve actually got?

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my times as developer rather than architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modelling of architecture quite deeply, reverse-engineering code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I do not find purely structural descriptions very helpful in understanding what code does. It only describes what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long-story-short, turns out they didn’t. Not a single end-to-end use case scenario.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, a dynamic model driven by usage examples is preferable in aiding my comprehension of architecture to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This serves a dual purpose: documenting scenarios, and enabling us to see explicitly whether the software satisfies them.
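For example, the banking scenario from earlier could be captured as an executable specification like this (the domain and names are invented for illustration):

```python
class Account:
    """A toy account, here only to give the scenario something to run against."""
    def __init__(self, balance):
        self.balance = balance

    def debit(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

def test_debit_is_refused_when_funds_are_insufficient():
    """Scenario: a customer with $50 available asks to debit $51."""
    account = Account(balance=50)
    try:
        account.debit(51)
        assert False, "expected the debit to be refused"
    except ValueError:
        pass
    # The refused debit leaves the balance untouched.
    assert account.balance == 50

test_debit_is_refused_when_funds_are_insufficient()
```

Unlike a Word document, this description of the scenario fails loudly the moment it stops being true.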

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one application of machine learning I haven’t yet seen in software development is mining the rich stream of IDE, source-file and code-level events we give off as we work, to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

Why LLMs Will Always Need An Expert In The Loop

The gargantuan valuations of “AI” companies – companies making or using LLMs – are built on things that investors believe the technology is going to do in the future.

That future, they’ve promised us many times, has been “just around the corner” since 2022. But the reality, when we actually observe it, has remained far less impressive.

We can keep falling for the “Yes, but this time it’s different” claims every time a new model comes out, or we can look ahead at the theoretical limits of the technology.

Large Language Models are massive, hyperdimensional, probabilistic pattern-matching and text prediction systems (yep, they’re basically predictive texting). What theoretical discipline could meaningfully predict the limits of their performance as they scale towards infinity?

There’s a branch of physics called Statistical Mechanics that enables researchers to describe the behaviour of complex, probabilistic systems at scale. It uses probability to predict macroscopic thermodynamic properties of systems – like entropy, pressure and heat – from the microscopic properties of atoms and molecules.

Tools from statistical mechanics are increasingly used to understand deep neural networks at scale by treating their millions or billions of parameters as interacting degrees of freedom in a high-dimensional system.

Concepts such as energy landscapes, phase transitions, and collective behavior help explain why models can exhibit sudden changes in capability, remarkable robustness – and the lack thereof – or predictable scaling trends as size and data grow.

Researchers have been using statistical mechanics to explore questions like how much more accurate LLMs are likely to get with greater scale.

For folks claiming that “hallucinations” will soon be a solved problem, it’s not good news.

In their paper, “The wall confronting large language models”, Peter V. Coveney and Sauro Succi showed that producing a model with an order of magnitude better accuracy – e.g., wrong 3% of the time instead of 30% – would require 10²⁰ times more data and compute to train.

We’re talking Dyson Spheres (plural) to train a model of that size on similar timescales to the frontier models of today.

And it would require a further factor of 10²⁰ to go from 3% to 0.3%.
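To make that scaling concrete, here’s a back-of-the-envelope sketch. If error falls as a power law in training data, the figures quoted above imply a tiny exponent – the value below is inferred from those figures, not lifted from the paper itself:

```python
# Back-of-the-envelope sketch of the scaling claim above.
# Assume error ∝ data^(-alpha). A 10x error reduction costing 10^20x
# more data implies alpha = 1/20 = 0.05 (inferred, illustrative only).

def data_multiplier(error_before, error_after, alpha=0.05):
    """How much more training data a power-law error curve would need."""
    return (error_after / error_before) ** (-1 / alpha)

print(f"{data_multiplier(0.30, 0.03):.1e}")    # 30% -> 3% error: ~1e20
print(f"{data_multiplier(0.03, 0.003):.1e}")   # 3% -> 0.3% error: another ~1e20
```

The brutal part is that the multiplier depends only on the ratio of the errors, so every further order of magnitude of accuracy costs the same astronomical factor again.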

To scale an LLM to be as reliable as a modern compiler – as some people wrongheadedly claim they are – you’d probably need to be a Type V civilisation. And why would anybody burn the multiverse to create something undergrads can probably make?

Any kind of text-predicting AI that performs with that level of accuracy won’t be a generative transformer. It’ll be technology that doesn’t exist yet.

“Ah, but Jason, models are improving in benchmarks all the time.”

Yeah, about that. What researchers and users are finding when they measure it objectively is that, while benchmark scores do indeed improve, real-world performance is a very different story.

The newer scores can be explained by the fact that many benchmarks contain questions, prompts, or examples that overlap with the model’s training data. Combined with overfitting – fine-tuning models to benchmark data – this creates the illusion of solving problems when it’s actually statistical recall. Models are being taught to the test.

Summary: thoroughly testing & reviewing LLM output is here to stay.

Another question statistical mechanics sheds light on is how much better deep neural networks will get at handling long-range, “deep” patterns.

In his 2019 paper, “Mutual Information Scaling and Expressive Power of Sequence Models”, Huitao Shen links long‑range dependencies in recurrent neural networks to mutual information scaling, a concept also studied in statistical mechanics. Shen found that exponential decay of mutual information with distance implies difficulty in capturing very long dependencies.

And in his paper “Theoretical Limitations of Self-Attention in Neural Sequence Models”, Michael Hahn explains how attention at long ranges will always be a problem for generative transformers at any scale.

Cliffs Notes version: LLMs will always be driving in thick fog. This explains why they’re so bad at modular design.

The practical upshot of this is that LLMs don’t now, and very probably never will, “see” the bigger picture. For the foreseeable future, high-level concerns around modular design, dependencies and architecture will remain human concerns.

Any AI that can effectively handle long-range patterns will be built on a technology that doesn’t exist yet.

So, there are two very likely constraints we can work with (or, more accurately, around): we should not expect LLMs to get significantly more reliable, and we should not expect them to make sound decisions about modular design, dependencies and high-level architecture on non-trivial systems.

The full skinny: a software developer will be required behind the wheel for any “AI” or “agentic” workflow that’s built around LLMs.

It’s physics.

An Ode To “It Can’t Be Done”

Yesterday, on its 25th birthday, I blogged about the Manifesto for Agile Software Development.

Whenever I mention Agile (as she’s known to her friends and to her enemies), inevitably there’ll be at least one comment along the lines of “It can’t be done” or “Nobody’s ever done it”.

They may qualify that by citing organisational or technical obstacles to applying the values and principles laid out in the document, like entrenched management or unskilled developers or a risk-averse, command-and-control culture.

For sure, these are all obstacles I’ve faced many times. Fair comment.

Having said that, these are also obstacles that I’ve overcome in 90% of instances.

It’s possible to manage upwards, winning the support needed for – at the very least – your customer to actively engage with an agile development process.

You don’t need to change the whole organisation to create an island of self-organisation and rapid feedback. This is why I focus on teams and individuals, and don’t get involved with organisation-level “agile transformations”. From what I’ve seen in the last 25 years, They. Just. Don’t. Work. Not on that scale. But that’s a whole other post.

As for developer skills, well, there’s a reason I ended up becoming a trainer and mentor in skills like usage-driven analysis & design, TDD, refactoring and continuous integration. To some extent, I’ve had to train most of the teams I’ve worked with, going back to the late 1990s.

My response when friends complain that the team they’re stuck with “don’t know what they’re doing” is to ask “And what are you going to do about that?”

Command-and-control cultures are often a product of lack of trust. And in many cases, that lack of trust has been earned by past performance. Many disappointments, many broken promises. So organisations micromanage, and if anything’s guaranteed to end an experiment in software agility, it’s that.

To earn trust, teams need to deliver rapidly, reliably and sustainably. To gain the autonomy needed to deliver rapidly, reliably and sustainably, teams need to be trusted. Catch-22.

Here’s the thing, though. Nothing’s stopping you from having a conversation about that. There’s no law that says you can’t spell out the impasse to the managers you’ll be needing that trust from.

There’s a window of opportunity here for someone coming in to the organisation with fresh eyes to break the cycle. And that’s been my specialty for most of my career.

This is where it can help to have some courage, as well as a healthy disrespect for hierarchical authority.

Does it ruffle feathers? You betcha! You’ll struggle to find a project manager who has a good word to say about me.

Did we actually deliver? Almost always, yes.

Often begrudgingly, organisations have to acknowledge that the medicine is working after the team’s delivered and delivered and delivered, and users are making positive noises about the quality of it.

The fact that it’s hardly ever your code that wakes the head of engineering up at 2 am definitely sends a message.

The bargain with the Devil, of course, is that you have to deliver, and deliver, and deliver. Results are a much easier sell than promises. You really need to know what you’re doing.

While the rest of the engineering department may still be stuck in micromanagement hell, my experience has been that a reliable track record – suitably trumpeted to make sure the right people notice (dev teams need good PR) – will usually encourage management to back off.

Not always, of course. Sometimes it really is just about status and control, and I’ve used more ruthless tactics in those situations.

To borrow from comedian Stewart Lee, I can do office politics. I just choose not to.

I’d much rather have an honest conversation, but in leadership roles, my job is to remove the obstacles standing in the way of the team (and the customer, and sometimes the customer is the obstacle, so some tough love may be required).

If the choice is between creating value, and preserving the status quo, you can probably guess which door I normally go through.

Some teams are working in industries that are heavily regulated, where development processes are highly prescribed. There are a lot of boxes that need to be ticked before software can be released.

Here’s a little secret that I’ll let you have for free: you can fit a more iterative process inside a less iterative process without the high-ups and the box-tickers realising. As John Daniels and Alan Cameron Wills once said to me when we hired their company to do some specialised custom work for my client, “Yeah, we can make it look like that.”

The manual demands “Big Design Up-Front”? We stick a UML reverse-engineering step into our build pipeline. And for once, they get an architecture model that accurately reflects the actual design! We don’t need to mention that we already wrote the code.

(I would normally do something like that anyway, because an up-to-date-ish bigger picture helps us steer the architecture more effectively.)

More generally, when teams I’ve worked with have faced obstacles to rapid, reliable and sustainable development, we’ve almost always found ways to remove them or go around them or dig a tunnel under them.

9 times out of 10, the solution is very simply not to ask for permission. Just f**king do it.

Do we need the boss’s approval to write the tests first? No. If we’re delivering, what business is that of theirs?

Do we need C-suite buy-in to check-in changes in smaller batches? Again. Who’s gonna know?

It’s a happy accident that some of the changes we could make that have the highest leverage – the biggest impact for the smallest investment – are also the least visible outside the team. We don’t need to get budget to start writing unit tests. We don’t need the CTO’s signature to run a linter.

It’s usually only when the change you want to make will require people outside the team to change that buy-in is really required. This is where cohesive cross-functional teams shine best. The less we need from other teams or from management, the more autonomous we can choose to be.

And becoming agile under the radar reduces the risk of attracting the attention of organisational antibodies. We can be delivering very visible results, and the mechanisms involved can be largely invisible.

I’ve worked in teams where we’ve done everything from setting up our own token-ring network because the client wouldn’t allow us on theirs (obviously, don’t do this if you work for MI5), to inventing a team member so we could use their PC as a build server after they refused to give us a machine.

If it’s in the business’s best interest, we can usually find a way. Just takes a bit of creative thinking, a pinch of courage, and a little dash of charm. “Wah, wah, wah, we want a build machine!” is less persuasive than you think. At the very least, be ready to explain why it’s in their interest. A lot of the time, “It can’t be done” really means “I couldn’t persuade them”.

I’d guesstimate that more than half the successes I’ve been involved with doing software development in an agile way have been despite the hand we were dealt.

And never underestimate the bargaining power of a united team. Those 10% of times when greater agility stayed firmly, stubbornly out of reach were when I was the lone voice in the wilderness. In those situations, I cut my losses. And I’ve grown much better at recognising the signs before I engage.

By 2000, I was usually in a position to change the makeup of my team, too. Extreme Programmers of a feather and so on.

The poker metaphor is very appropriate here, since the real benefit of agile software development is minimising risk. We don’t let uncertainty pile up into big batches and big releases, and managers usually find that their picture of actual progress is far more realistic, enabling them to make more informed decisions grounded in the reality of user feedback about working software.

So not only can it be done, it has been done, and successfully, many times. I’ve seen it, and I’ve lived it.

You might as well tell me that my house “can’t be built”.

Psst. Did you know that, as well as delivering high-quality training and coaching in agile development skills like TDD, refactoring and CI, I’m also available as a fractional principal engineer. I’ve worked with clients on an ongoing basis helping dev teams to turn “It can’t be done” into “We did it!” Let me have those difficult conversations with management and absorb the heat coming from the project office.

Tech Leaders: Low-Performing Teams Are A Gift, Not A Curse

It’s an age-old story. You’re parachuted in to lead a software development organisation that’s experiencing – let’s be diplomatic – challenges.

Releases are far apart and full of drama. Lead times are long, and unless it’s super, super necessary – and they’re prepared to dig deep into their pockets and wait – the business has just stopped asking for changes. It saves time. They know the answer will probably be “too expensive”.

The only respite comes with code freezes and release embargoes – never on a Friday, never over the holidays. But is a break really a break when you know what you’ll be coming back to?

Morale is low, the best people keep leaving, and they can almost smell the chaos on you when you’re trying to hire them.

The finger-pointing is never-ending. And soon enough, those fingers will be pointing at you. Honeymoon periods don’t last long, and the clock is ticking to effect noticeable positive change. They lit the fuse the moment you said “yes”.

I know, right! Nightmare!

Except it isn’t. For a leader new in the chair, it’s actually a golden ticket.

The mistake most tech leaders make is to go strategic – bring in the big guns, make the big changes. It’s a big song and dance number, often with the words “Agile” and/or “transformation” above the door.

This creates highly-visible activity at management and process level. But not meaningful change. Agility Theatre is first and foremost still theatre. You’re focusing on trying to speed up the outer feedback loops of software and systems development – at high cost – without recognising where the real leverage is.

Where I work as a teacher and advisor is at the level of small, almost invisible changes to the way software developers work day-to-day, hour-to-hour, minute-to-minute.

The biggest leverage in a system of feedback loops within feedback loops is in the innermost loops.

Think about it in programming terms: you have nested loops in a function that’s really slow, and you need to speed it up. Which loop do you focus your attention on – the outer loop, or the inner loop? Which one would optimising give you the most leverage for the least change?
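As a toy illustration (the function and numbers are invented), hoisting even a small amount of work out of the innermost loop pays off far more than trimming the outer one, because that’s where the iterations multiply:

```python
# Work inside the inner loop runs rows*cols times; work in the outer loop
# runs only rows times. So a small optimisation in the inner loop has the
# most leverage. Illustrative example only.

def total_slow(grid, tax_rate):
    total = 0.0
    for row in grid:                      # outer loop: len(grid) iterations
        for price in row:                 # inner loop: runs for EVERY cell
            multiplier = 1 + tax_rate     # invariant work, needlessly repeated
            total += price * multiplier
    return total

def total_fast(grid, tax_rate):
    multiplier = 1 + tax_rate             # hoisted: computed exactly once
    total = 0.0
    for row in grid:
        for price in row:
            total += price * multiplier
    return total

grid = [[1.0, 2.0], [3.0, 4.0]]
assert total_slow(grid, 0.2) == total_fast(grid, 0.2)  # same answer, less work
```

The same logic applies to feedback loops: a tiny improvement to something developers do hundreds of times a day compounds far faster than a big improvement to something the organisation does once a quarter.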

I worked with a FinTech company in London who called me in because they’d been going round in circles trying to stabilise a release. Every time they deployed, the system exploded on contact with users – costing a lot of money in fines and compensation, and doing much damage to their reputation.

“Agile coaches” had been engaging with management and, to a lesser extent, teams to try to “fix the process”.

I took one look, saw their total reliance on manual testing, and recognised the “doom loop” immediately. Testing the complete system took a big old QA team several weeks, during which time the developers were busy fixing the bugs reported from the last test cycle, while – of course – introducing all-new bugs (and reintroducing a few old favourites).

They’d been stuck in this loop for 9 months, at a cost running into millions of pounds – plus whatever they’d been paying for the “Agile coaching” that had made no discernible difference. You can’t fix this problem with “stand-ups”, “cadences” and “epics”.

I took the developers into a meeting room and taught them to write NUnit tests. We prioritised the most unstable features, and the problem was gone within 6 months, never to return.

And something else happened – something quite magical. Not only did the software become much more reliable, but release cycles got shorter. Better software, sooner.

That’s quite the notch on the belt for a new CTO or head of engineering, dontcha think?

After decades of helping organisations build software development capability, I know from wide experience that taking a team from bad to okay is way easier than taking them from okay to good, which is way easier than from good to excellent.

Having helped teams climb that ladder all the way to the top, I can attest that the first rung is a real gift. Your CEO or investors probably can’t distinguish excellent from good, but the difference between bad and okay is very noticeable to businesses.

Unit testing is table stakes in that journey, but the difference it makes is often huge. Low-visibility, but very high-leverage.

Here’s the 101: the inner loops of software development – coding, testing, reviewing, refactoring, merging – are where the highest-leverage changes can be made at surprisingly low cost.

90% of the time, nobody but the developers themselves need get involved. No extra budget’s required (you might even save money). No management permission need be sought. No buy-in from other parts of the organisation is necessary.

High-leverage changes that can slip under the radar, but that can also have quite profound impact.

And results are much easier to sell than promises. Imagine, when your honeymoon period’s up, being able to point to a graph showing how lead times have shrunk, and how fires in production have become rare.

I’ve seen that buy tech leaders the higher trust and autonomy they need to make those strategic changes that the infamous organisational antibodies may well have attacked before. They were immunised by results.

Now for the really good news. In all those cases I’ve been involved in – many over the years – tech leaders only did one thing.

I am not an “Agile coach” or a “transformation lead”. I do not sit in meetings with management. I do not address strategic matters or matters of high-level process.

I work directly with developers, helping them to make those high-leverage, mostly invisible changes to their innermost workflows. Your CEO will likely never meet me, or even hear my name.

The first rule of Code Craft is that we do not talk about Code Craft.

Right now, you may be thinking, “But Jason, we’ve got AI. Developers can produce 2x the code in the same time.”

Do you think 2x the code hitting that FinTech company’s testing bottleneck would have made things better? Or would that just have been more bugs hitting QA in bigger batches?

Attaching a code-generating firehose to your dev process is very likely to overwhelm bottlenecks like testing, code review and merging – making the system as a whole slower and leakier. I call it “LGTM-speed”.
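A toy queueing sketch (with invented numbers) shows why: once arrivals exceed a bottleneck’s capacity, the backlog doesn’t flow through faster – it just grows without bound:

```python
# Toy model of a fixed-capacity bottleneck (e.g. code review or QA).
# All numbers are invented for illustration.

def backlog_after(days, arrivals_per_day, capacity_per_day):
    backlog = 0
    for _ in range(days):
        # each day, new work arrives and at most `capacity_per_day` is cleared
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
    return backlog

# Suppose the team can review and test 10 changes a day.
print(backlog_after(30, arrivals_per_day=10, capacity_per_day=10))  # 0
print(backlog_after(30, arrivals_per_day=20, capacity_per_day=10))  # 300
```

Double the code hitting a bottleneck that hasn’t changed, and after a month you have a 300-change backlog – or, more realistically, 300 changes waved through at rubber-stamp speed.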

Tightening up those inner loops is even more essential when we’re using “AI” code generators. If we don’t, lead times get longer, releases get less stable, and the cost of change goes up. Worse software, later.

Good luck presenting that graph to the C-suite!

If you need help tightening up the inner feedback loops in your software development processes – with or without “AI” – that’s right up my street.

Visit https://codemanship.co.uk/ for details of training and coaching in the technical practices that enable rapid, reliable and sustainable evolution of software to meet changing needs.