Codemanship's Blog – Musings And Mutterings By Jason Gorman

Essential Code Craft – Test-Driven Development – April 7 & 11

Something I hear often: “I’d love to go on one of your courses, but my boss won’t pay for it”.

Codemanship’s new Essential Code Craft training workshops are aimed at software developers who are self-funding their professional growth. If your employer won’t invest in you, perhaps you can invest in you. (Businesses and other VAT-registered entities should visit codemanship.co.uk for details of corporate training for teams.)

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Teams have shortened delivery lead times dramatically, while actually improving the reliability of their releases, and lowering the cost of changing software using TDD.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

In this introductory workshop, you will learn how to solve problems working in TDD micro-cycles, rigorously specifying desired software behaviour using tests, writing the simplest solution code to pass those tests, and refactoring safely to enable a simple, clean design to emerge.

The emphasis will be on learning by doing, with succinct practical instruction and guidance from a 25-year+ TDD practitioner and teacher.

You will work in pairs in your chosen programming language, swapping through Continuous Integration into a shared GitHub* repository after each TDD cycle, reinforcing the relationship between TDD, refactoring, version control and CI.

When you register, you’ll be asked to list up to 3 programming languages you’re comfortable working in (e.g., Java, Ruby, Go), and I’ll use that to pair folks as best I can up for the exercise. (Tip: put at least one popular one on the list – we may struggle to find you a pairing partner for Prolog)

I’ll demonstrate in either Java, Python, JS or C#, depending on which of those is listed most often by registrants.

* Requires an active personal GitHub account

This workshop includes a 15-minute break

Find out more and register here

The AI-Ready Software Developer #22 – Test Your Tests

Many teams are learning that the key to speeding up the feedback loops in software development is greater automation.

The danger of automation that enables us to ship changes faster is that it can allow bad stuff to make it out into the real world faster, too.

To reduce that risk, we need effective quality gates that catch the boo-boos before they get deployed at the very latest. Ideally, catching the boo-boos as they’re being – er – boo-boo’d, so we don’t carry on far after a wrong turn.

There’s much talk about automating tests and code reviews in the world of “AI”-assisted software development these days. Which is good news.

But there’s less talk about just how good these automated quality gates are at catching problems. LLM-generated tests and LLM-performed code reviews in particular can often leave gaping holes through which problems can slip undetected, and at high speed.

The question we need to ask ourselves is: “If the agent broke this code, how soon would we know?”

Imagine your automated test suite is a police force, and you want to know how good your police force is at catching burglars. A simple way to test that might be to commit a burglary and see if you get caught.

We can do something similar with automated tests. Commit a crime in the code that leaves it broken, but still syntactically valid. Then run your tests to see if any of them fail.

This is a technique called mutation testing. It works by applying “mutations” to our code – for example, turning a + into a -, or a string into “” – such that our code no longer does what it did before, but we can still run it.

Then our tests are run against that “mutant” copy of the code. If any tests fail (or time-out), we say that the tests “killed the mutant”.

If no tests fail, and the “mutant” survives, then we may well need to look at whether that part of the code is really being tested, and potentially close the gap with new or better tests.

Mutation testing tools exist for most programming languages – like PIT for Java and Stryker for JS and .NET. They systematically go through solution code line by line, applying appropriate mutation operators, and running your tests.

This will produce a detailed report of what mutations were performed, and what the outcome of testing was, often with a summary of test suite “strength” or “mutation coverage”. This helps us gauge the likelihood that at least one test would fail if we broke the code.

This is much more meaningful than the usual code coverage stats that just tell us which lines of code were executed when we ran the tests.

Some of the best mutation testing tools can be used to incrementally do this on code that’s changed, so you don’t need to run it again and again against code that isn’t changing, making it practical to do within, say, a TDD micro-cycle.

So that answers the question about how we minimise the risk of bugs slipping through our safety net.

But what about other kinds of problems? What about code smells, for example?

The most common route teams take to checking for issues relating to maintainability or security etc is code reviews. Many teams are now learning that after-the-fact code reviews – e.g., for Pull Requests – are both too little, too late and a bottleneck that a code-generating firehose easily overwhelms.

They’re discovering that the review bottleneck is a fundamental speed limit for AI-assisted engineering, and there’s much talk online about how to remove or circumvent that limit.

Some people are proposing that we don’t review the code anymore. These are silly people, and you shouldn’t listen to them.

As we’ve known for decades now, when something hurts, you do it more often. The Cliffs Notes version of how to unblock a bottleneck in software development is to put the word “continuous” in front of it.

Here, we’re talking about continuous inspection.

I build code review directly into my Test-Driven micro-cycle. Whenever the tests are passing after a change to the code, I review it, and – if necessary – refactor it.

Continuous inspection has three benefits:

It catches problems straight away, before I start pouring more cement on them
It dramatically reduces the amount of code that needs to be reviewed, so brings far greater focus
It eliminates the Pull Request/after-the-horse-has-bolted code review bottleneck (it’s PAYG code review)

But, as with our automated tests, the end result is only as good as our inspections.

Some manual code inspection is highly recommended. It lets us consider issues of high-level modular design, like where responsibilities really belong and what depends on what. And it’s really the only way to judge whether code is easy to understand.

But manual inspections tend to miss a lot, especially low-level details like unused imports and embedded URLs. There are actually many, many potential problems we need to look for.

This is where automation is our friend. Static analysis – the programmatic checking of code for conformance to rules – can analyse large amounts of code completely systematically for dozens or even hundreds of problems.

Static analysers – you may know them as “linters” – work by walking the abstract syntax tree of the parsed code, applying appropriate rules to each element in the tree, and reporting whenever an element breaks one of our rules. Perhaps a function has too many parameters. Perhaps a class has too many methods. Perhaps a method is too tightly coupled to the features of another class.

We can think of those code quality rules as being like fast-running unit tests for the structure of the code itself. And, like unit tests, they’re the key to dramatically accelerating code review feedback loops, making it practical to do comprehensive code reviews many times an hour.

The need for human understanding and judgement will never go away, but if 80%-90% of your coding standards and code quality goals can be covered by static analysis, then the time required reduces very significantly. (100% is a Fool’s Errand, of course.)

And, just like unit tests, we need to ask ourselves “If I made a mess in my code, would the automated code inspection catch it?”

Here, we can apply a similar technique; commit a crime in the code and see if the inspection detects it.

For example, I cloned a copy of the JUnit 5 framework source – which is a pretty high-quality code base, as you might expect – and “refuctored” it to introduce unused imports into random source files. Then I asked Claude to look for them. This, by the way, is when I learned not to trust code reviews undertaken by LLMs. They’re not linters, folks!

Continuous inspection is an advanced discipline. You have to invest a lot of time and thought into building and evolving effective quality gates. And a big part of that is testing those gates and closing gaps. Out of the box, most linters won’t get you anywhere near the level of confidence you’ll need for continuous inspection.

If we spot a problem that slipped through the net – and that’s why manual inspections aren’t going away (think of them as exploratory code quality testing) – we need to feed that back into further development of the gate.

(It also requires a good understanding of the code’s abstract syntax, and the ability to reason about code quality. Heck, though, it is our domain model, so it’ll probably make you a better programmer.)

Read the whole series

Engineering Leaders: Your AI Adoption Doesn’t Start With AI

In the past few months, I’ve been hearing from more and more teams that the use of AI coding tools is being strongly encouraged in their organisations.

I’ve also been hearing that this mandate often comes with high expectations about the productivity gains leaders expect this technology to bring. But this narrative is rapidly giving way to frustration when these gains fail to materialise.

The best data we have shows that a minority of development teams are reporting modest gains – in the order of 5%-15% – in outcomes like delivery lead times and throughput. The rest appear to be experiencing negative impacts, with lead times growing and the stability of releases getting worse.

The 2025 DevOps Research & Assessment State of AI-assisted Software Development report makes it clear that the teams reporting gains were already high-performing or elite by DORA’s classification, releasing frequently, with short lead times and with far fewer fires in production to put out.

As the report puts it, this is not about tools or technology – and certainly not about AI. It’s about the engineering capability of the team and the surrounding organisation.

It’s about the system.

Teams who design, test, review, refactor, merge and release in bigger batches are overwhelmed by what DORA describes as “downstream chaos” when AI code generation makes those batches even bigger. Queues and delays get longer, and more problems leak into releases.

Teams who design, test, review, refactor, merge and release continuously in small batches tend to get a boost from AI.

In this respect, the team’s ranking within those DORA performance classifications is a reasonably good predictor of the impact on outcomes when AI coding assistants are introduced.

The DORA website helpfully has a “quick check” diagnostic questionnaire that can give you a sense of where your team sits in their performance bands.

(Answer as accurately as you can. Perception and aspiration aren’t capability.)

The overall result is usefully colour-coded. Red is bad, blue is good. Average is Meh. Yep, Meh is a colour.

If your team’s overall performance is in the purple or red, AI code generation’s likely to make things worse.

If your team’s performance is comfortably in the blue, they may well get a little boost. (You can abandon any hopes of 2x, 5x or 10x productivity gains. At the level of team outcomes, that’s pure fiction.)

The upshot of all this is that before you even think about attaching a code-generating firehose to your development process, you need to make sure the team’s already performing at a blue level.

If they’re not, then they’ll need to shrink their batch sizes – take smaller steps, basically – and accelerate their design, test, review, refactor and merge feedback loops.

Before you adopt AI, you need to be AI-ready.

Many teams go in the opposite direction, tackling whole features in a single step – specifying everything, letting the AI generate all the code, testing it after-the-fact, reviewing the code in larger change-sets (“LGTM”), doing large-scale refactorings using AI, and integrating the whole shebang in one big bucketful of changes.

Heavy AI users like Microsoft and Amazon Web Services have kindly been giving us a large-scale demonstration of where that leads – more bugs, more outages, and significant reputational damage.

A smaller percentage of teams are learning that what worked well before AI works even better with it. Micro-iterative practices like Test-Driven Development, Continuous Integration, Continuous Inspection, and real refactoring (one small change at a time) are not just compatible with AI-assisted development, they’re essential for avoiding the “downstream chaos” DORA finds in the purple-to-red teams.

And while many focus on the automation aspects of Continuous Delivery – and a lot of automation is required to accelerate the feedback loops – by far the biggest barrier to pushing teams into the blue is skills.

Yes. SKILLS.

Skills that most developers, regardless of their level of experience, don’t have. The vast majority of developers have never even seen practices like TDD, refactoring and CI being performed for real.

That’s certainly because real practitioners are pretty rare, so they’re unlikely to bump into one. But much of this is because of their famously steep learning curves. TDD, for example, takes months of regular practice to to be able to use it on real production systems.

And, as someone who’s been practicing TDD and teaching it for more than 25 years, I know it requires ongoing mindful practice to maintain the habits that make it work. Use it or lose it!

An experienced guide can be incredibly valuable in that journey. It’s unrealistic to expect developers new to these practices to figure it all out for themselves.

Maybe you’re lucky to have some of the 1% of software developers – yes, it really is that few – who can actually do this stuff for real. Or even one of the 0.1% who has had a lot of experience helping developers learn them. (Just because they can do it, it doesn’t necessarily follow that they can teach it.)

This is why companies like mine exist. With high-quality training and mentoring from someone who not only has many thousands of hours of practice, but also thousands of hours of experience teaching these skills, the journey can be rapidly accelerated.

I made all the mistakes so that you don’t have to.

And now for the good news: when you build this development capability, the speed-ups in release cycles and lead times, while reliability actually improves, happen whether you’re using AI or not.

The AI-Ready Software Developer #21 – Stuck In A “Doom Loop”? Drop A Gear

I find it helpful to visualise agentic workflows as sequences of dice throws. Take the now-popular “Ralph Wiggum loop“. You want 7. The agent throws the dice, and it’s 5. That fails your test for 7, so the agent reverts the code, flushes the context, and tries again, repeating this cycle until it throws the 7 you wanted.

Different numbers have different probabilities, and therefore might take more or less throws to achieve. Throwing 7 might take 6 throws. Throwing 2 or 12 might take 36 throws. It depends on the probability distribution associated with 2x six-sided dice.

The outputs of Large Language Models also depend on probability distributions, and some outputs will take more attempts to achieve than others, depending on the input and on the distribution of data “learned” by the model.

A toy example of the distribution of probabilities of next-token predictions by an LLM. Note how there’s a very obvious choice because that exact token sequence appears often in the training data. The model’s prediction will be confident.

On average, a more complex context – one that asks the model to match on more complex patterns – will tend to require more improbable pattern matches and produce less confident predictions. This is “hallucination” territory. Little Ralph goes round and round in circles trying in vain to throw 13.

I call these “doom loops”.

An adapted agentic workflow recognises when a step has gone outside the model’s distribution (e.g., by the number of failed iterations) and “drops a gear” – meaning it switches back into read-only planning mode and asks the model to break that step down into simpler contributing steps that are less improbable – more in-distribution.

(It’s a phrase I’ve been using informally for years in pairing sessions when a problem’s “gradient” turns out to be steeper than we can handle in a single step. It harks back to an idea in Kent Beck’s book Test-Driven Development By Example, where he talks about how we might want the teeth on a pulley to be smaller the heavier the load. The fiddlier the problem, the smaller the steps.)

If you think about it, this is exactly what “multi-step reasoning” is – tackling a more complex problem as a sequence of simpler problems. You can try this experiment to see the impact it has on the accuracy of outputs:

If you’re familiar with the Mars Rover programming exercise, where the surface of Mars is a 2D grid and you can drive a rover over it by sending it sequences of commands (F – forward, B – back, L – left, R – right), you can task a model like GPT-5.x with solving problems like “Given the rover is at (5,9) facing North and it’s told to go ‘RFFRBBBLB’, what is its final position and direction?”

In one interaction, instruct the model additionally “Don’t show your working. Respond with the final answer only.” Odds are it’ll get it wrong every time.

Then let it “reason”. It’ll likely get it right first or second time, and it does this by breaking the pattern down:

“R turns the rover clockwise, so it’s at (5,9) facing East.

FF moves the rover forward 2 squares in the +x direction, so it’s at (7,9) facing East.”

etc etc.

When little Ralph tries to throw 13, the model just can’t do it. It’s OOD. When the agent drops a gear, it might try to throw 6+7, both of which are comfortably in-distribution.

Final note, sometimes a problem is just not in the data. It doesn’t matter how many times we throw the dice, or how we break it down, the model just can’t do it. Be ready to step in.

Read the full series here

A New DORA Performance Level – Catastrophically Bad

DORA (DevOps Research & Assessment) has 4 broad levels for dev team performance: Elite, High, Medium, and Low.

A Low-Performing team deploys less than once a month, has lead times for changes > 1 month, sees as many as half their deployments go boom, and takes more than a week to fix them.

An Elite team deploys changes multiple times a day, has lead times typically of < 1 day, failure rates < 15% (fewer than 1 in 8 deployments go boom), and can fix failed releases in under an hour.

I picture a Low-Performing team walking a tightrope between two mountain peaks. It’s a long way to safety (working, shippable code), a long way down if they fall, and a long climb back up to try again.

I picture an Elite team as walking the same length tightrope, tied to wooden posts a few feet apart, just 3 feet off the ground. Safety’s never far away, and a fall’s no big deal. They can quickly recover.

I would like to propose a 5th performance level, one that I’ve seen for real more than once: Catastrophically Bad.

At this level, the tightrope has snapped.

I’ve seen teams stuck in a death spiral where no changes can be deployed because every release goes boom. One example was a financial services company here in London who’d been running themselves ragged trying to stabilise a release for almost a year.

Every deployment had to be rolled back, and every deployment cost them high six-figures in client compensation, not to mention loss of reputation.

You know the drill: testing was done 100% manually and took weeks. While the devs were fixing the bugs testing found, they were introducing all-new bugs (and reintroducing a few old favourites).

A change failure rate of 100% and a lead time of – effectively – infinity.

How does a Catastrophically Bad development process that delivers nothing turn into at the very least a Low-Performing process that delivers something? (And yes, this is the mythical “hyper-productivity” the Scrum folks told you about – and no, Scrum isn’t the answer).

What we did with my client started with 2 fundamental changes:

Fast-running automated “smoke” tests, selected by analysing which features broke most often
CI traffic lights – a wrapper around version control that forced check-ins to go in single file, and blocked check-ins and check-outs of changes when the light wasn’t “green” (meaning, no check-in in progress and build is working).

It took 12 weeks to go from Catastrophically Bad to Medium-Performing. (Pro tip for new leaders and process improvement consultants – poorly-performing teams are a gift because they have such low-hanging fruit).

You can build from here. In this case, by showing teams how to safely change legacy code in ways that add more fast-running regression tests and gradually simplify and modularise the parts of the code that are changing the most.

(The title image is taken from a Catastrophically Bad agentic “team”)

Yeah, About Your CLAUDE.md file…

After 3 years experimenting with and learning about Large Language Models, I’ve long suspected that project-level or global context files are a waste of time.

I’ve been advocating small, task-specific contexts, grounded in reality – not in somebody or something’s summary of what was real at some point in the past. I’ve certainly seen it produce more correct output more often, measured by acceptance tests passed.

So it’s nice to finally see some empirical evidence from an independent study lend weight to that.

“Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.”

_{Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?}

A combination of attention dilution and context rot can produce an effect that’s like sending someone to the shops with a 1979 street map and instructions that include guidance on when to empty the dishwasher and where the spare towels are kept.

But I appreciate that some people luuuurve their CLAUDE.md and AGENTS.md files, and swear blind that they help. Whether they actually do, in any measurable way, seems to be the question here. And the data says “probably not”.

My CLAUDE.md file usually has one globally-applicable piece of advice: “KEEP IT SIMPLE”.

When I find myself repeating the same instructions multiple times (e.g., how to move an instance method, with an example), I’ll capture that in a task-specific .md file to recall only when it’s relevant.

And I haven’t been relying on summaries of the code or the architecture for quite a long time. The moment your “AI” coding assistant changes the code, they’re out of date, and the model can’t tell the difference between what’s real and what’s summarised. Plus, we know that LLMs are unreliable narrators, so any LLM-generated context files are likely to mislead. Besides, we have static analysis for the kind of thing.

As more hard data comes in, it lends increasing weight to my notion of a “clean context”.

Is AI Killing Software Engineering Jobs?

One claim I see repeated widely on social media is that LLM-based coding tools are taking software engineering jobs. These are often accompanied by extrapolations that soon – any time now – the profession of software engineering will be 80%, 90%, even 100% done.

The evidence put forward to support these claims is the large number of tech layoffs and the decimation of the jobs market in recent years. The narrative typically starts with the launch of ChatGPT, and uses a little chronological sleight of hand to make the start of the downturn coincide. A sort of AI “Event One”, for the Whovians among us.

I also a see a lot of stress, and angst and worry in response to these claims, quite understandably.

But are the claims true?

Let’s look at the actual order of events, shall we? First of all, ChatGPT was launched on November 30th 2022. Before that, most of us had never heard of it.

So when did the downturn actually start?

The inflection point occurred around mid-May of that year. So whatever Event One was that triggered the downturn, it wasn’t ChatGPT. Had you even heard of Large Language Models in May 2022?

What did happen around that time that might have caused businesses to reconsider their headcount?

Note the massive ramp-up in job postings in the preceding two years. Businesses were hiring software engineers like they were going out of fashion.

How could they afford these huge increases in headcount? It’s simple. Borrowing was cheap. Not just cheap, but effectively free. At the start of the pandemic, central banks dropped their base interest rates to roughly zero to encourage borrowing and stimulate their locked-down economies. In particular, to stimulate investment in hiring.

And it evidently worked where software engineers were concerned. This coincided with a period of time when businesses had to adapt to a remote-first reality for employees and customers, and were discovering that their “digital transformations” had left them not quite as digital as they might have hoped. Programming skills were selling like hotcakes as they frantically tried to plug the gaps in their operations.

Then around February 2022, inflation kicked in – inevitably, when you pump free money into an economy – and interest rates started to go up again, quite sharply.

It takes a while for the oil tanker to turn around, but turn it did in May in a complete 180°.

By the time ChatGPT was launched, the trend was already more than half played-out. Hiring dropped to pre-pandemic levels and then some, but has actually been growing again over the last year or so.

Job postings for software engineers are up about 5% on this time last year, and could be set to recover fully to pre-pandemic levels. Let’s wait and see.

“But, Jason, the CEO of AcmeAI said they were replacing devs with agents.”

First of all, what is the CEO of AcmeAI selling? And which sounds better to investors: “Oops, we over-hired!” or “We’re cutting edge with this AI stuff!”

As always, don’t listen to what these people say. Watch what they do. All of the businesses who’ve claimed to be replacing engineers with “AI” are actually hiring engineers, and in large numbers. Sure, not as large as in 2021-22, but they’re definitely planning on using human engineers for the foreseeable future. Just perhaps cheaper ones, maybe offshore.

As for the tumbleweed on job postings Main Street, I know a lot of software folks, and I’ve heard from many of them how – when this downturn was in full swing – they decided to stay put longer than they normally might have in a healthier market.

Without that usual background level of churn, open positions become much rarer. There are fewer dead man’s shoes to fill. It creates a negative feedback loop that feeds on itself. The fewer the live job postings, the more fear, the more people stay put, the fewer the live job postings.

Mix that with a narrative that says “It wuz AI wot done it, guv’nor!” and you get… well, LinkedIn, basically. And thus, the illusion of a disappearing profession is created.

But in term of actual jobs – as opposed to job openings – the professional developer population was actually growing during this same period, albeit an aging population.

I saw this after the dotcom bubble burst in 2000, after the financial crash in 2008, and now, I strongly suspect, after the post-pandemic inflation crisis of 2022. Money ain’t what it used to be. It’s a boom-and-bust cycle that comes around roughly once a decade. Our profession probably needs to get better at handling it.

This will no doubt be cold comfort to the many excellent software professionals who just happened to find themselves without a chair when the music stopped. It genuinely is completely random. Give it a few years, and the industry may well come to regret letting those people leave the game.

Just as they’ll very likely regret kicking away the on-ramp to the profession at the same time.

I have little sympathy for the folks spreading this kind of fear, uncertainty and doubt, based on made-up and distorted facts. You’re causing distress and making things worse for a lot of good people. And based on what?

Only time will tell how this is going to pan out, but there are plenty of good reasons to believe that demand for software engineers will recover like it has every time before.

ADDENDUM: One way that “AI” might kill software engineering jobs, of course, is when the bubble inevitably bursts and trillions of dollars are wiped off the valuations of AI companies and their hardware, Cloud and other services suppliers. But the economic, social and geopolitical ramifications of that are likely to reach far beyond our profession.

The Double-Edged Sword of Automation

Here’s a fun fact: my business has a fax number. No, really.

Nobody’s faxed me for more than 20 years, but I still have a fax number. I pay a few quid each month for a fax-to-email service just so I can have an active fax number, just in case.

Why do I still have a fax number? Because when some of my corporate clients sign my company up as a supplier, their system requires a fax number, and I’m scared that, one day, one of them will actually try to fax me.

How can it be, in 2026, that business systems still require a fax number? It’s arcane. Well, this is the thing – most business systems are arcane. Most business systems are legacy systems. They’re hard and risky to change without breaking business processes, so businesses tend not to change them unless they really, really have to. (A big part of my training and coaching is teaching teams how to slow down and even reverse software aging, by the way.)

This is the double-edged sword of automation; computerising business processes can make our work easier, but it can make changing how we work much harder. Automation bakes in workflows.

The other problem automation often brings is the edge cases the automators – that’s us, by the way – didn’t think of when we were doing the automating. Daily life in this digital world can involve a lot of “falling through the cracks” of computer systems that were designed years ago for an ideal, cracks-free world that doesn’t exist anymore (and very probably never did), often by people with no first-hand experience of the reality of those business processes.

Savvy business leaders are well-aware of the limitations of automation, and for decades have left space in their operations for human judgement and human adaptation.

But humans take up space and you have to feed them and sign cards when they leave, so CEOs thought all their birthdays had come at once when someone claimed they’d invented a chatbot that could reason and adapt like a human. “We can fill the gaps with that!”

Three years later, most people have cottoned on to the fact that if there’s two things the chatbot can’t do, it’s reasoning and adapting. It’s just more automation, only this time, not predictable or reliable enough for many use cases.

And now the fashion is for the automators to automate their automating with these language-predicting automatons. And the risk is that we’ll walk into exactly the same trap.

To be fair to the chatbots, development teams baking in their processes with automation is nothing new. I’ve lost count of the times I’ve been coaching pairs and they’ve been unable to change how they’re working – and that’s why I’m there, in case you were wondering – because the “DevOps team” (an oxymoron if ever there was one) has baked the integration process into an automated, one-size-fits-all workflow.

Let’s take TDD as an example. Test-Driven Development is being cited more and more as an essential part of an “AI”-assisted development workflow. (Although it looks like Google Search didn’t get that memo.)

Search trends reveal no renewed interest in Test-Driven Development in recent years

I, of course, agree. Prompting with tests helps pin down the meaning of requirements, producing more accurate pattern-matches from the model. And the rapid feedback loops of TDD, with continuous testing, code review and refactoring, tightly-coupled to version control and continuous integration, can minimise the risks of having the world’s most well-read idiot write the code for us.

I know many of us have attempted to completely automate the TDD workflow for agents so we can feed in our test examples (you may have heard these referred to as “specifications”), hit Play, and go to the proverbial pub while the agent crunches through them in a red-green-refactor “Ralph Wiggum loop”.

I’ve yet to see an attempt that isn’t clunky at best – running into walls and falling over often – and downright hilarious in some cases. Please, if you’re going to try to automate TDD, at least take 5 minutes to find out what it actually is.

I know the TDD workflow very, very well. I’ve performed that red-green-refactor loop hundreds of thousands of times over many years, I’ve taught it to thousands of developers, and I’ve watched thousands of developers doing it (with me saying “Now run your tests!” thousands of times – I could be a t-shirt).

I’ve thought very deeply about it. Talked to many, many practitioners about it. Experimented in many, many ways with it.

And, importantly, made numerous attempts to precisely codify it. But no matter how precise and detailed my TDD state machines are, I keep running into the double-edged sword of automation. It gets especially fiddly trying to coordinate concurrent TDD loops working on the same code base.

The TDD process unavoidably requires space for human judgement and for adaptation to outlier scenarios. It requires Actual Intelligence. So it always come back to the simplest workflow: red->green->refactor-> rinse & repeat, leaving space for me to apply judgement and to adapt after every step. (It also prevents the code running ahead of my understanding of it.)

While I have fun experimenting with automated “agentic” workflows, when it matters, I keep myself very much in the loop. Not just because stochastic parrot, but also because I will never be able to think of every eventuality.

The closer we try to get to 100% automation – trying to plan for every possible scenario – the more the cost skyrockets. 100% is an asymptote – an anti-gravity wall that pushes us back the closer we get. Attempting to reach it is a Fool’s Errand.

And the closer we get to 100% automation, the more we bake ourselves into a workflow that leaves us less and less room for manoeuvre.

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that comes up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.

With, say, a Large Language Model, the exact same input fed to the exact same model – their internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.

I’ve been using small-scale, closed-loop experiments – once it’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scoring on how many tests passed).

But these are small-scale. I might run them 10x, because I simply can’t afford the tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.

And that’s why I’ve leaned more on related – the same phenomena but in different problem domains – large-scale peer-reviewed studies and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fueled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

What Makes AI Agents Particularly Dangerous Is “Silent Failure”

One of the most potentially dangerous failure modes of LLM-based coding assistants and agents is “silent failure” – the “AI” equivalent of that old Visual Basic chestnut, On Error Resume Next.

For example, a common strategy they use when they’re not able to fix a problem they created is to delete failing tests, or remove testing from the build completely, to sweep the broken crockery under the proverbial rug.

Sure, all the builds are green. But is the code really being tested?

It’s important to be aware of this. LLMs are unreliable narrators. When they say they’re “done”, or that they’ve “fixed” a problem, you really need to check. Sorry, that need is never going away.

And before the inevitable “You must be holding it wrong” folks pipe up, everybody experiences this, no matter how many flight hours they have with these tools. Though, of course, not everybody notices. And not everybody’s telling the truth about it.

It’s an unfixable problem, so the way you really “Hold it right” is to know it when you see it, and work around it.

As always, the skill here is:

Knowing what questions to ask
Being able to determine if the answer’s correct
Knowing how to fix it when it isn’t

While companies like Anthropic and Cursor are locked in a “space race” to extend the time that their agents are able to work unattended – the Holy Grail of agentic “AI” being 100% autonomy (which I believe is a Fool’s Errand) – we need to be careful that they’re not achieving the illusion of greater autonomy through silent failing.