Uncategorized – Codemanship's Blog

A Car Crash In Slow Motion

Since I’m among friends, I hope I can be open with you.

I started Codemanship 17 years ago in my late 30s, as a response to being asked by a recruiter for the gazillionth time “Why are you still a software developer?”

I’d been contracting for 12 years, and been programming professionally for 18, and that is what I do. I passed through lead developer roles into architect and then senior/head architect roles, and decided to walk my career back to being hands-on as a developer, but with enough authority and control over how my teams worked to do a good job – despite management.

Over the previous decade, I’d spent more and more time mentoring developers, as well as bits of structured training here and there.

The job that flipped the switch in me to make the jump permanently to that role was working as a Software Development Coach – a title I invented for myself because I didn’t like the one they’d given me (Technical Architect) – at BBC Worldwide. Even if I say so myself, I made a difference there. Not just to one developer or one team. Software development at BBC Worldwide was different by the time I moved on.

I didn’t move far, though. The lunchtime talks and workshops I’d instigated were increasingly being attended by software engineers from down the road at BBC corporate. After the Worldwide gig, I spent several years popping in to various BBC sites in London and Manchester running training and a very successful peer-led coaching experiment in TV Platforms (the iPlayer folks).

And in between, I ran my last team full-time for a small new consulting company owned by two guys who knew nothing about software development. I won’t go into the details, but that last contract left such a sour taste – and the BBC work was looming – that I finally did it and became a full-time trainer and coach, and founded Codemanship in 2009.

If you’ve started your own business, you’ll know that the first 2-3 years can be tough. I had savings, but one client was never going to be enough. I had a pretty modest goal: make half what I was making as a contractor, doing something I really enjoy. (And I like to think it has real value, too – we’ll circle back to that.)

So I found a small building on Blackfriars Road in South London – opposite Southwark tube station and, importantly, a decent boozer – and rented training rooms for weekend courses for folks funding their own career growth. I didn’t have the business contacts, but I knew a lot of software developers.

For a fraction of the price of corporate training, groups of ~16 enthusiastic folks gave up their weekends and a few hundred quid of their hard-earned cash to do what you might recognise as the ancestors of the Code Craft courses that I’ve run for many clients and for thousands of developers since then.

It was such good value that folks flew in from as far afield as Russia – back when they could – and the US and Canada. They’d book a hotel for a day or two after to see the sights – make a city break of it.

We’d spend a day TDD-ing and refactoring and SOLID-ing and wotnot, and then retire to The Ring to reflect on the day and talk around what we’d covered, going into really cool asides and general chit-chat.

Training has never felt like work to me – hence my ambition to do it for my living – but these weekend workshops felt even less like work, and more like tech meetups or conferences. Y’know – the interesting bits in between and after the talks. If there were 17 of us in the workshop, there’d usually be 8-10 of us in the pub after. That’s a sociable ratio.

Private workshops for corporate clients rarely end like that. (With exceptions, of course – Hi to the folks at Hostelworld in Porto and their magically replenishing beer fridge!)

The weekend workshops felt much more like community events than corporate training. They’re very hands-on, folks are pairing up and meeting new people, and I am “Jason from Twitter/LinkedIn/Jason’s Blog”, and not just “That Guy Who’s Running The Course I’ve Been Told To Go On”.

Although it never really occurred to me, these original training cohorts became my busy bees, buzzing from gig to gig, gaining seniority and influence year on year, until corporate orders started coming in from places they were working.

Codemanship’s client base grew quite organically from things like this, as well as from my activities within the developer community – speaking at events, organising conferences, going to meetups etc.

In fits and starts, the business grew. Sure, there were fallow periods, and there were busy periods. And I didn’t feel the need after a while to run out-of-hours training. Organising small, low-priced public workshops is a lot more work per £ it brings in. There was just enough corporate training and coaching to keep the lights on. And, generally, the trend was “number go up” by roughly 10% a year.

By 2019, I was on track to achieve my goal – half my contract income doing something I love. At the time, I really didn’t think I was reaching for the Moon.

Then, in early 2020… Well, you know what happened in early 2020.

Business just disappeared for 6 months. But then something fell into my lap, courtesy of Nat Pryce, that ultimately led to autumn 2020 to autumn 2022 being the best two years the business ever had. And although I knew an ongoing coaching gig was going to be a financial aberration, and that I shouldn’t get used to it, during the same time, the training side of the business grew too – beyond my original goal.

The summer of 2022 was the peak. I was better off than I’d been for many years, and was even about to put in an offer on a detached house in Wiltshire – with a garage and a garden and a utility room! Imagine that – having a room just for utility! Folks like me living in an expensive-ish London postcode can only dream of such luxuries.

Then, as with all peaks, it’s downhill on the other side. Unwittingly, I’d been the beneficiary of a hiring frenzy bankrolled by free credit during the lockdown era. It was only then I realised just how closely my sales tracked with entry-level hiring. In the room, I saw plenty of senior developers. But, it turns out, those workshops wouldn’t have been booked at all if it wasn’t for the junior intake in the room. I’d become the Onboarding Guy.

In 2023, sales dropped 65% – an almost exact match to entry-level hiring here in the UK and Europe. And the ongoing coaching gig had already ended with the rapid rise in interest rates, so that generous tap was turned off.

That’s a big pay cut.

But, I’d been through these cycles before, and had weathered them with savings and loans. So that’s what I did. I didn’t panic. I didn’t think “Shit, I need to get a contract”. I thought “This, too, shall pass.”

In 2024, it didn’t pass. Entry-level hiring fell further. Layoffs, layoffs, layoffs in the news. But still, I didn’t panic. I had savings. I had time.

In 2025, hiring started to recover, but not entry-level hiring. (Hey, it’s a good job all those senior developers know TDD, right?)

But by the middle of the year, something gave me hope and made me stay my course. By this time, after two years of experimenting with and researching AI-assisted coding, I’d figured something out – the principles and practices that I’d been teaching for 25 years, far from becoming less important, were becoming more important than ever with the rise of AI.

Clever Jason! They’ll be queuing outside my door any moment.

As 2025 went on, and more and more good data rolled in, that position just got more and more solid.

Any minute now…

AI coding tool adoption passed a tipping point over the Christmas break, as many engineering leaders finally found some time to play with the technology, got it to build them a Calendar app or a TO-DO list in a nanosecond, and came back to the office and proclaimed to their teams “You will use this!” Because real software development is exactly the same as doing self-contained mini projects by yourself for fun.

So we’ve been seeing more and more teams finding out what I – and many others – figured out a year or more ago. Without solid engineering foundations, that stuff will hurt you. It’ll slow down your release cycles, it’ll make your lead times longer, and it’ll create a growing mountain of quality problems that leaks into production. The evidence is now overwhelming that’s really what’s happening for the majority of teams.

And when I’ve spoken to engineering leaders and polled them about it, they agree that engineering foundations are more important than ever.

Any minute now…

In the meantime, my savings are gone, my credit cards are maxxed out, and orders in 2026 are down to 10% of what they were five years ago.

Two months ago, I pivoted back to where I started – out-of-hours training for people funding their own learning. If your boss doesn’t see the value in engineering foundations, maybe you do. And, if I set the price accordingly, maybe I can put that kind of training within your reach.

And this actually started well. The first few workshops on Tuesday evenings and Saturday mornings sold out. And they’ve really taken me back to that training room on Blackfriars Road, because they’ve felt much more like community events with some training thrown in to give us something to talk about.

I’ve really been enjoying the after-workshop discussions, and the ratio has again been very sociable – typically more than 70% stay on to chat. And this gave me hope.

To quote John Cleese in the 80s movie Clockwise, “It’s not the despair, Laura. I can stand the despair. It’s the hope.”

After those first few, interest has dropped off dramatically. I suspect the 60 or so folks who’ve bought tickets are the extent of the market within my reach.

I do not know where it goes from here.

So, after a little cry early this morning – not kidding – I think maybe it’s time to do some adulting and let go of this particular life goal. I can’t hold out any longer. In fact, I should have let go last year because now I’m a year older and in a real hole.

I don’t know what I’m going to do next. I find myself 55 years old and having not been employed by anybody else for 17 years. Friends will know that I’ve stayed very hands-on and current throughout that, and am still very capable of working as a developer and also leading teams. And – having used so many over the years – I can learn programming languages, tools and tech stacks very fast, even at my age.

But it’s not you I’d need to convince. As I understand it, job applicants have to contend with so many layers of corporate gatekeepers these days (human and AI) – who wouldn’t know a software developer from a hole in the ground – that I suspect I will struggle to get in front of the right people.

The final public workshops will go ahead as planned. Folks have bought tickets. And there are still some places left – I’d be very happy if you could join us. This could well be your last chance to experience what I’ve spent 17 years making a unique, hands-on training experience.

June 16 & 20 – Refactoring

June 30 – Specification By Example

July 7-9 – Code Craft (one final public voyage for my flagship 3-day workshop)

And if you ask me to run a private workshop for your team, I’m not going to say no. I’d be a fool to.

But I’m officially now “between careers”. Where that ends up at my age… I guess I’m going to find out.

Codemanship has turned out to be half my entire career. I’d hoped one day it would be my retirement. I love to do this job when I’m given the chance. And, if you follow me on social media, you probably know I do it even when nobody’s paying, which has been most of the time. (And, of course, there were times when I didn’t realise I wasn’t being paid – but that’s the life of a small business owner.)

I can’t complain. It’s been my dream job, and I’m very grateful to everyone I’ve met along the way.

CRESS Principles for Context Engineering – S is for Small

We’ll get to the effective context limits of Large Language Models in due course, but let’s open with a software engineering fundamental.

More than 75 years of people writing computer programs has taught us a few hard lessons, and one of the most important concerns the size of the steps we take.

In my early days, I could code for hours before compiling and running my program. And when I ran the program, it inevitably didn’t work. Of course.

So I’d spend even more hours trying to figure out why it wasn’t working, and fixing the bugs.

As a young software developer, I believed debugging was Programming Skill #1, because I spent so much of my time doing it.

I later learned that this approach to writing programs was called “code-and-fix” development by those in the know. It’s the equivalent of shooting an entire movie without checking the footage, and then trying to fix all the mistakes in the editing suite.

“Code-and-fix” is very costly, and the end result is less than ideal, to say the least. A lot of the bugs never got fixed, because there just wasn’t time. (And because I was dumb enough to provide estimates that didn’t take debugging into account).

Now, here’s the funny part. I’d code for hours – hundreds of lines of code in one sitting – and then hit “Run”. But how I debugged code was a whole different approach. In debugging mode, I’d focus on one problem at a time, make one change to the code, run it again to see if that fixed it. If it did, I’d move on to the next bug in the (usually long) list.

The breakthrough came when I realised that’s how I should have written the code in the first place – solve one problem at a time, running the program many times an hour to check that the problem was indeed solved before moving on to the next problem.

Don’t shoot the whole movie and then look at the footage. Shoot one short take, and then go to video village and see how it looks. Actor didn’t hit his mark? Let’s go again now. Y’know, while the actor’s still here, along with the crew, and the set.

If I screw up a single change to a single line of code and find out immediately, it’s a quick and easy fix – I know exactly which change broke the code. If I screw up a bunch of changes to a bunch of code and only then find out, I’m going to end up in the debugger.

Having learned to work in small steps, making one change to the software at a time and getting feedback from running and testing the software, I found myself dealing with far fewer bugs, and – counterintuitively, because it feels slower – actually shipping sooner.

Changing code is like walking a tightrope. When we make lots of changes and then test the software, we’re walking a tightrope tied between two mountain peaks – by the time we reach the middle, it’s a long way to safety (working code) and a long way down if we fall.

When we make one change at a time and get feedback from testing as we go, our tightrope is tied to wooden posts a few feet apart and a few feet off the ground. We’re never far from safety, and if we do fall, we can just get back on the rope at the last point of safety with little time or effort wasted.

Importantly, as we progress in small steps from one tested, working version of the code to the next, every one of those posts represents a potential release. We’re never far from software that’s shippable.

This accelerates a more important feedback loop. When we can ship more often, we can get user feedback from working software more often. This enables us to learn what works and what doesn’t faster. And, it turns out, that learning is where the real value tends to be found – not from what we planned to deliver, but what we learned from what we delivered.

Working in smaller, tested steps gives us many more opportunities to steer the ship away from the rocks and towards the docks.

There are secondary systemic benefits for teams to doing the work in smaller slices, too. Larger batches of changes hitting downstream bottlenecks in the development process like testing, code review and merging to the release branch make these activities take longer. Our changes spend a lot of time sitting in queues waiting their turn. The more changes in progress – the more cars on the road, if you like – the more time’s spent waiting instead of moving forward.

Faster cars != faster traffic.
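
To put rough numbers on the traffic analogy, here's a minimal sketch using Little's Law, which says that in a stable system the average time an item spends in the system equals the average work in progress divided by the average throughput. The function name and the figures (a review-and-merge stage that completes 5 changes a day) are made up for illustration, not measurements from any real team.

```python
def average_lead_time(work_in_progress: float, throughput_per_day: float) -> float:
    """Little's Law: average time in system = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return work_in_progress / throughput_per_day


# Hypothetical bottleneck: review and merge can complete 5 changes a day.
REVIEW_CAPACITY_PER_DAY = 5

# Same bottleneck, different numbers of changes in flight.
few_changes_in_flight = average_lead_time(10, REVIEW_CAPACITY_PER_DAY)
many_changes_in_flight = average_lead_time(40, REVIEW_CAPACITY_PER_DAY)

print(few_changes_in_flight)   # 2.0 days on average
print(many_changes_in_flight)  # 8.0 days on average
```

Generating changes four times faster without touching the bottleneck quadruples the average wait. Nobody got slower at reviewing; there are just more cars on the road.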

LLMs can generate a lot of code very fast, and the tendency for AI-assisted development to exacerbate these bottlenecks – leading to worse software delivery performance overall – is well-documented.

The impact of batch sizes on delivery lead times and release stability is so big – much, much bigger than AI code generation – that it’s a mystery why more teams don’t pull that lever.

Now for the really fun part; for all the reasons I’ve stated, it serves us well to work in small, tested slices – putting one foot in front of the other – whether we’re using AI or not.

But when we are using AI, it helps us in another important way. These days, LLMs have large advertised maximum context sizes in the order of as much as 1 million tokens. But they do not remain effective at that order of magnitude, or indeed anywhere near it.

The accuracy of token predictions drops off rapidly with contexts as small as just a few hundred tokens, according to independent studies.

As the amount of text an LLM has to keep track of grows, its performance tends to get worse for a few reasons. One is “attention dilution,” where the model spreads its focus too thinly across too much information, making its predictions less confident and precise.

Another is “probability collapse,” where the model struggles more as a conversation or task becomes longer and less similar to examples it saw during training – like how a chess-playing model can make increasingly poor moves deep into a game. Together, these effects make LLMs less reliable and effective when handling larger contexts.

For these reasons, contexts should be as small as possible – contain the least amount of information the model requires for the job at hand. We’ll explore the importance of being specific in the next post, but suffice to say that when extraneous or irrelevant information’s included, it reduces the chances of getting the outcome we want.

Tools like Claude Code and Cursor will typically “compress” contexts when they get too large – which involves summarising parts of the context, and that’s a lossy process. But if you see them doing that, the context is already way outside of the effective zone. In my own workflows, I very rarely see it happening.

When we work in small, tested steps – tackling one problem at a time – and apply the CRESS principles we’ve covered so far, this tends to keep contexts in the order of a few hundred tokens, comfortably within the limit where models are effective.

When we don’t, we tend to end up spending more time fixing problems, more time doing retakes, and more time with our work sitting in queues. And right now, this is the average picture for the majority of teams, because they’re not slicing the work thinner like they should be. Indeed, some of them are actively moving in the opposite direction and making these problems worse, enthusiastically cheered on by AI vendors who ought to know better.

One final word about context size: another major factor in how much information needs to be included in each interaction with a coding model is the “blast radius” of the code affected.

If our code has low modularity and poor separation of concerns, a single functional change could bring many source files into play, all of which will need to be included in the context.

If our design effectively localises the impact of changes by splitting code up into cohesive and loosely-coupled modules, then a lot less of it needs to be included.
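
As a rough way to picture blast radius, here's a sketch that treats it as the changed module plus everything that depends on it, directly or transitively. That definition is a simplifying assumption, and the module names and both dependency graphs are hypothetical.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], changed: str) -> set[str]:
    """Return `changed` plus every module that depends on it, directly or transitively."""
    # Invert the graph: module -> the modules that depend on it.
    dependents: dict[str, set[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk outwards from the changed module.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical graphs: module -> modules it depends on.
tangled = {
    "pricing": {"orders", "customers"},
    "orders": {"pricing", "customers"},
    "customers": {"orders"},
    "reports": {"pricing", "orders"},
}
modular = {
    "pricing": set(),
    "orders": {"pricing"},
    "customers": set(),
    "reports": {"orders"},
}

print(sorted(blast_radius(tangled, "customers")))  # ['customers', 'orders', 'pricing', 'reports']
print(sorted(blast_radius(modular, "customers")))  # ['customers']
```

In the tangled version, a change to one module drags the whole codebase into the context. In the modular version, the same change stays in one file.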

As with the “small” in “small steps”, “modular” enjoys a wide range of interpretations. What we’re learning with AI coding tools is that what’s really required is – as I saw someone describe it recently – a kind of “radical modularity”. When I looked at code they described as “radically modular”, to me, I just saw modular code as I understand that to mean. I suppose it’s a bit like how what we call “organic food” in the UK, in France they just call “food”.

LLMs, famously by now, have a bit of a problem with modular design. They’re very good at generating code that they’re pretty bad at modifying later, and the lack of separation of concerns in generated code appears to be one of the main culprits. A program I might have implemented in 100 source files, Claude Code might squeeze into a dozen.

So you really need to keep on top of that, continuously reviewing and refactoring the design to steer yourself clear of a Big Ball of LLM-Unfriendly Mud.

You might be thinking “I’ll just get the LLM to handle that” right now. That would be a mistake. Research shows that models struggle to learn long-range patterns. Matching local patterns is where LLMs are strongest. They can’t do “bigger picture”. Basically, they’re driving in fog, at any scale of model.

Modular design remains very much a “you” thing.

May Workshops for Self-Funding Learners – Update

Hiya. Just a quick note about the Essential Code Craft training workshops aimed at self-funding learners that are happening this month.

Specification By Example

Tuesday May 12 (evening) has 3 places available. I’m guessing it’ll be sold out by the end of this week.

Saturday May 16 (morning) is half-full at time of writing.

-> Register

Test-Driven Development

Tuesday May 19 (evening) is sold out, but you can add yourself to the waitlist in case anybody drops out and to be among the first to hear about future workshops.

Saturday May 23 (morning) still has plenty of places available.

-> Register

And if you want to keep an eye out for future workshops in June and beyond, bookmark our dedicated web page for self-funding learners on Ticket Tailor.

Upcoming skills we’ll be covering include Modular Design and Refactoring. Y’know? Boring skills that are nevertheless essential.

Essential Code Craft – Workshops In May

May will soon be upon us, and I’ve scheduled three out-of-hours workshops for self-funding learners in my Essential Code Craft series.

If your employer won’t invest in you, invest in yourself and join us.

Tues May 12th 18:45 BST & Sat May 16th 09:45 BST – Specification By Example

Over more than 70 years of developing software products and systems, we’ve learned that misunderstandings about the meaning of requirements are one of the biggest sources of avoidable rework.

Reducing ambiguity in specifications can dramatically reduce the risk of misinterpretation, whether it’s among human stakeholders or when we’re working with AI coding tools.

Tues May 19th 18:45 BST – Test-Driven Development

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Teams have shortened delivery lead times dramatically, while actually improving the reliability of their releases, and lowering the cost of changing software using TDD.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

Specification By Example Was Essential Before AI. It’s Twice As Essential Now.

Psst. If your boss won’t invest in training you in Specification By Example, I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

The research I’ve done over the last 3 years into AI-assisted programming, including my own closed-loop experiments, found that one major factor in the likelihood that an LLM will correctly interpret a specification is whether or not examples are included to clarify requirements.

Completion rates – as measured by acceptance tests passed – improve dramatically, even in a single pass.

In multiple passes, with feedback from acceptance testing, models given examples converge on impressive completion of ~80%, while without examples they tend to just go around in circles, with completion barely improving.

This should come as no surprise, because we saw a similar effect with dev teams before AI coding tools appeared on the scene. Teams who clarify requirements using examples are much more likely to interpret what the customer (or the product manager, or the business analyst) means correctly.

And, as requirements misunderstandings are typically one of the biggest sources of avoidable rework, they save a lot of time and money correcting mistakes that could have been spotted before a line of code was written.

The LLM equivalent means the same outcomes – features delivered as the prompter intended – in fewer passes, using fewer tokens (and burning down fewer proverbial forests).

Done right, specifications with examples can be translated pretty directly into executable tests that can drive the design and development of working software using techniques like Test-Driven Development.

A specification for totaling items in an order that uses examples is test-ready. Essentially, it is a test.

    def test_one_item(self):
        product = Product(id=327, price=159.95, stock=7, hold=1)
        order = Order([Item(product=product, quantity=1)])

        total = order.total()
        
        self.assertEqual(total, 159.95)
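
For anyone who wants to run that example, here's one way it might be fleshed out. The Product, Item and Order classes are a guess at a minimal implementation, assumed purely for illustration. One item at 159.95 with a quantity of 1 should total 159.95, and a real system would probably use Decimal rather than float for money.

```python
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical fields, taken from the keyword arguments used in the test.
    id: int
    price: float
    stock: int
    hold: int


@dataclass(frozen=True)
class Item:
    product: Product
    quantity: int


class Order:
    def __init__(self, items):
        self.items = list(items)

    def total(self):
        # Sum of unit price multiplied by quantity for every item in the order.
        return sum(item.product.price * item.quantity for item in self.items)


class OrderTotalTest(unittest.TestCase):
    def test_one_item(self):
        product = Product(id=327, price=159.95, stock=7, hold=1)
        order = Order([Item(product=product, quantity=1)])

        total = order.total()

        self.assertEqual(total, 159.95)
```

Saved as a module, this can be run with `python -m unittest`.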

When I’ve provided specifications with examples in TDD training workshops, and measured successful interpretation of requirements by students, I’ve found the same trend that my experiments found with LLMs – it roughly doubles, and often hits 100% completion.

When I don’t include examples… Well, I’ve lost count of the number of times students thought that telling the Mars Rover to turn right moved it x+1, or that Roman Numerals should be converted into integers. As a trainer, it saves me and my students a lot of time – especially if I get to them later.

But humans can do things LLMs can’t, like understand, reason and learn. So levels of misinterpretation tend to be lower, because we can apply an understanding of the world and judge whether a requirement makes sense. Misinterpretation by AI coding assistants is a higher risk, and therefore the need to clarify is significantly heightened.

As is the need to use language consistently. While some folks claim – presumably because they haven’t tested it in any meaningful way – that LLMs don’t need code to be human-readable, the evidence is clear that they really, really do.

I’ve seen many times myself how completion rates dropped significantly when code wasn’t clearly and consistently signposted, using language that had a close conceptual correlation to our specifications. If I call it “sales tax” in one interaction, and “VAT” in another, the model struggles to anchor on a name for that variable, often interpreting them as distinct variables in the code.
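
One cheap guard against that kind of drift is to check prompts and specifications against a glossary before they go anywhere near a model. This is only a sketch: the glossary, the function name and the sample prompt are all hypothetical, and whole-word matching is a deliberately crude assumption.

```python
import re

# Hypothetical glossary: preferred domain term -> synonyms we want to avoid.
GLOSSARY = {
    "sales tax": ["VAT", "value added tax"],
}


def inconsistent_terms(text: str, glossary: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (synonym found, preferred term) pairs, in glossary order."""
    findings = []
    for preferred, synonyms in glossary.items():
        for synonym in synonyms:
            # Whole-word, case-insensitive match.
            if re.search(rf"\b{re.escape(synonym)}\b", text, flags=re.IGNORECASE):
                findings.append((synonym, preferred))
    return findings


prompt = "Add VAT to the order total. Sales tax is 20% of the net price."
print(inconsistent_terms(prompt, GLOSSARY))  # [('VAT', 'sales tax')]
```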

Specifying with examples gives us an opportunity to establish a shared vocabulary for describing our problem domain, which aids communication between stakeholders, but also between humans and LLMs.

When developers are stuck for a name for a class or a function that makes the intent of that code clear, I encourage them to write that intent in plain English and take inspiration from that. Specifying with examples can help establish a shared language before a line of code’s been written.

The AI-Ready Software Developer #24 – Specification Is A Conversation

Psst. If your boss won’t invest in training you in Specification By Example, I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

A sentiment I see often on social media about “AI”-assisted and agentic coding goes something along the lines of “If you’re just translating specs into code, your job is disappearing”.

It sounds reasonable on the surface, if you believe that’s all many programmers were doing. Someone – say, a product manager or an architect – hands the programmer a specification for a feature, and the programmer just “codes it up” like a pharmacist filling a prescription.

But was that ever really a thing?

In reality, most software specifications are incomplete and ambiguous, and often contain logical contradictions that are hard to spot – because of the incompleteness and the ambiguity.

Think of the movie script that contains the line “A huge battle ensues”. The studio asks “How much will that cost?” The producer has absolutely no idea, because that part of the script still needs to be written. The line’s just a placeholder for more work to flesh out the details. And in software development, just as in movie-making, the devil is in the details. That’s where the time and the money goes.

And that’s the reality of software specifications written in natural languages like English, even ones written by programmers. At best, they’re placeholders for conversations. Extreme Programming actually makes that explicit: a “user story” is not a requirements specification. It’s just a placeholder for a chat with the person who wrote it. That’s why it’s a waste of time making them detailed.

And this means that the programmer’s job is not just to “code up the spec”, it’s to figure out what the specification actually means. What exactly happens in this huge battle?

And because specifications are incomplete and ambiguous and often contradictory, this process inevitably has to walk everything back to figuring out what the need being addressed is in the first place.

This is why, for so many years, I drummed it into product managers, requirements analysts and the like to come to the team not with the “what”, and certainly not with the “how”, but with the “why” – what problem are we aiming to solve?

We then work as a team – leveraging our combined expertise in systems design and development and the problem domain – to learn together how to solve the problem through successive iterations of best guesses informed by rapid user feedback.

Now, I could be wrong, but that doesn’t sound like “just translating specs into code” to me.

The promise of this new generation of “AI” coding tools is that non-programmers will be able to iterate working software by themselves. And this is true, to an extent.

Tools like Claude Code and Cursor have proven themselves to be very useful for generating prototypes and proofs-of-concept with no programmer involvement, enabling business analysts, UX designers, product managers, start-up founders and pastry chefs to test simple ideas quickly and cheaply.

The problem is that without the expert judgement of experienced programmers, it doesn’t mature to reliable, scalable, secure software that stands up to real-world production rigours.

So, at some point, you’ll have to pick up the phone to Programmers-R-Us and get some involved if you want your experiment to scale. Have your cheque book ready!

And this is where the problems really start. You now have kind-of, sort-of working software that validates your idea. There’s a fork in the road here. You can either:

1. Have the programmers find and fix all the problems to make the prototype market-ready
2. Use the prototype as the specification and have the programmers build a production-quality version from scratch with, y’know, tests and architecture and stuff

Let’s go through Door #1.

So, the prototype sort-of works, but there are bugs – oh, boy are there bugs?! – and security vulnerabilities and performance bottlenecks and scaling blockers and some gone-off cheddar and discarded prams and all the kind of stuff that LLMs will tend to leave in your code if you let them. Which you did, because you can’t tell a switch statement from a discarded pram.

So the programmers need to test the software thoroughly to find all the usage scenarios where the software doesn’t do what it’s supposed to. And there’s the Catch-22. What is it supposed to do? If only there was a complete and precise specification!

We don’t fare much better with Door #2.

Now your programmers have to reverse-engineer the prototype to figure out what it does. What happens when the user leaves that field blank and clicks “Continue”? What happens when the clock strikes midnight and interest needs to be applied to the account? What happens when the vehicle remains stationary for more than 5 minutes? Edge case after edge case after edge case.

You run into the exact same wall. A complete and precise specification for any non-trivial software is made up of thousands of definitive answers to these kinds of questions. Software systems are the most complex machines we’ve ever built. You can’t specify them on the back of a cigarette packet.

Or, to put it the way a customer once put it to me when we had this discussion about a new feature, “Why are you making it so complicated, Jason?”

“I’m not making it complicated. That’s how complicated what you’re asking for is.”

The system has to handle all of these inputs in some meaningful way, otherwise it will break. If the user’s email address isn’t valid, a whole bunch of features won’t work. Are you happy for the system to just not work for those users?

Then, as it almost always does, the conversation turned into a negotiation about the scope and complexity of that feature for the next release. We can always remove one variable now, and add it in a later iteration. It’s an old physics trick (see: Special Relativity).

And this is why requirements specifications are placeholders for conversations. If there’s no conversation, issues will not get addressed by experts who understand them until much, much later when they’re much, much harder to fix.

This is why, as a tech lead, I almost always – when presented with a “requirements specification” as a fait accompli – pressed the “Reset” button and started the conversation again at “Okay, so what seems to be the problem?”

That’s Door #3 – involve programmers early. Because those conversations have to happen whether you like it or not, and the sooner you have them, the sooner you’ll converge on a workable, production-ready solution.

A simple prototype can help you validate your idea before you pick up that phone, but the more design decisions you make before involving experts, the bigger and badder the catch-up’s going to be later. And you might be surprised – when you have a clear end goal in mind – how simple the simplest proofs-of-concept can be.

I’ve been in this game for 34 years, and in that time I’ve seen countless attempts to demarcate this process of building an understanding of not just what the software needs to do, but why it needs to do it.

They all inevitably walk into the same wall. You cannot pay someone else to understand something for you. It’s like paying someone to revise for your exams.

Software specification is necessarily a conversation between people with needs – and, ideally, money – and people who specialise in meeting needs using computers. ’Twas ever thus, ’twill ever be.

Unless, of course, your specification is complete, consistent and mathematically precise.

And a complete, consistent, mathematically precise specification of a computer program is that computer program. That’s what source code is, and that’s why programming languages were invented.

A person who just translates complete, consistent and mathematically precise specifications into executable code is a compiler.

Fans of Spec-Driven Development may be feeling vindicated now because you believe your specifications are complete, consistent and precise. If you’ve clarified requirements using examples – to me and you, tests – that might push them towards being of that integrity.
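
As a minimal sketch of what “clarifying a requirement with an example” can look like, here’s the blank-field question pinned down as executable checks in Python. The function name, the error message and the rule itself (reject a blank email) are illustrative assumptions, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class FormResult:
    accepted: bool
    errors: list[str]


def submit_signup(email: str) -> FormResult:
    """Hypothetical 'Continue' handler for a sign-up form."""
    if email.strip() == "":
        # Assumed decision: a blank field is rejected with a message,
        # rather than being silently accepted.
        return FormResult(accepted=False, errors=["Email is required"])
    return FormResult(accepted=True, errors=[])


# Each assertion is one definitive answer to a "what happens when...?" question.
assert submit_signup("") == FormResult(False, ["Email is required"])
assert submit_signup("   ") == FormResult(False, ["Email is required"])
assert submit_signup("pat@example.com") == FormResult(True, [])
```

Three examples settle three questions. Whether they’re the right answers is still something only a conversation with the people who need the feature can tell us.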

But even if your specs really are completely complete, and completely consistent and completely precise – and even if LLMs were capable of reliably translating such specifications into code (which they’re not) – you need to remember that they will still be full of assumptions about what’s really needed. Basically, a formal specification is just formalised guesswork.

To quote the Second Doctor, “Logic, my dear Zoe, merely enables one to be wrong with authority”.

The real knowledge isn’t in the spec, or in the code, it’s in the feedback we get when people use it in the real world. In this sense, iterating is the ultimate requirements discipline – it’s where most of the real value gets discovered.

So, by all means, spec away. But don’t spec far – just enough to test an assumption with user feedback from working software. And user feedback’s like code reviews – the more changes we ask for feedback on, the less attention gets paid to most of them.

Research has found that when users give feedback, they often anchor on one or two standout moments – positive or negative – rather than the entire user experience. Psychologists call it the “peak-end rule” – it’s “LGTM” for user eyeballs.

Spec one change to functionality at a time, build it in rapid, tested iterations, ship it through a reliable delivery pipeline, and then go get that focused feedback. Because the spec very probably will need to change.

And if the spec rarely changes, I’d worry that we aren’t listening to our users. Either that or we got incredibly lucky (or clairvoyant).

It’s all one big, ongoing conversation.

Essential Code Craft – Test-Driven Development – April 7 & 11

Something I hear often: “I’d love to go on one of your courses, but my boss won’t pay for it”.

Codemanship’s new Essential Code Craft training workshops are aimed at software developers who are self-funding their professional growth. If your employer won’t invest in you, perhaps you can invest in you. (Businesses and other VAT-registered entities should visit codemanship.co.uk for details of corporate training for teams.)

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Teams have shortened delivery lead times dramatically, while actually improving the reliability of their releases, and lowering the cost of changing software using TDD.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

In this introductory workshop, you will learn how to solve problems working in TDD micro-cycles, rigorously specifying desired software behaviour using tests, writing the simplest solution code to pass those tests, and refactoring safely to enable a simple, clean design to emerge.

The emphasis will be on learning by doing, with succinct practical instruction and guidance from a 25-year+ TDD practitioner and teacher.

You will work in pairs in your chosen programming language, swapping through Continuous Integration into a shared GitHub* repository after each TDD cycle, reinforcing the relationship between TDD, refactoring, version control and CI.

When you register, you’ll be asked to list up to 3 programming languages you’re comfortable working in (e.g., Java, Ruby, Go), and I’ll use that to pair folks up as best I can for the exercise. (Tip: put at least one popular one on the list – we may struggle to find you a pairing partner for Prolog.)

I’ll demonstrate in either Java, Python, JS or C#, depending on which of those is listed most often by registrants.

* Requires an active personal GitHub account

This workshop includes a 15-minute break

Find out more and register here

How Can We Progress Past “AI” Woo?

For decades, when I’ve interacted with folks working in machine learning, one of the big questions that comes up is “How do we test these systems?”

Traditional software’s behaviour is (usually) predictably repeatable, in the sense that the exact same inputs provided to the system in the exact same internal state will produce the exact same output.

If I ask to debit $51 from an account with $50 available, the system will say “No can do, amigo”.
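
A minimal sketch of that repeatability in Python, assuming a hypothetical `Account` class invented for illustration: the same request against the same starting state gets the same answer, however many times we ask.

```python
class InsufficientFunds(Exception):
    """Raised when a debit would take the balance below zero."""


class Account:
    def __init__(self, balance: int):
        self.balance = balance

    def debit(self, amount: int) -> None:
        if amount > self.balance:
            raise InsufficientFunds("No can do, amigo")
        self.balance -= amount


def try_debit(balance: int, amount: int) -> str:
    """Fresh account each time, so the internal state is identical."""
    account = Account(balance)
    try:
        account.debit(amount)
        return "accepted"
    except InsufficientFunds:
        return "refused"


# Same input, same state, 1,000 times over: only one distinct outcome.
outcomes = {try_debit(balance=50, amount=51) for _ in range(1000)}
assert outcomes == {"refused"}
```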

With, say, a Large Language Model, the exact same input fed to the exact same model – its internal state doesn’t change – can produce surprisingly different outputs each time.

When I say “I tried this, and it didn’t work”, and then you say “Well, I tried it, and it did work”, that’s a bit like saying “Well, I threw a 7, so you must have been throwing the dice wrong”.

This is where probabilistic technology collides with 70+ years of experience with deterministic software. We’re still treating it like the banking system, expecting our results to be reproducible in the same way.

When we consider what’s real and what works when we’re using ML, we have to look beyond data points and start looking for statistically significant trends. We don’t throw the dice, get the 7 we wanted, and declare “These dice are better”. We throw the dice a bunch of times, and see what the distribution of outputs is.
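
The dice analogy can be sketched in a few lines of Python. This assumes two fair six-sided dice and a seeded random generator (so the run is repeatable); the function names are made up for the example.

```python
import random
from collections import Counter


def roll_two_dice(rng: random.Random) -> int:
    return rng.randint(1, 6) + rng.randint(1, 6)


def distribution(throws: int, seed: int = 0) -> dict[int, float]:
    """Share of throws that landed on each total from 2 to 12."""
    rng = random.Random(seed)
    counts = Counter(roll_two_dice(rng) for _ in range(throws))
    return {total: counts[total] / throws for total in range(2, 13)}


# One throw tells us almost nothing; many throws reveal the shape.
dist = distribution(throws=100_000)
for total, share in dist.items():
    print(f"{total:2d}  {share:.3f}  {'#' * round(share * 100)}")
```

A single 7 is one data point. The histogram is the thing we can actually compare between one set of dice and another.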

I’ve been using small-scale, closed-loop experiments – once it’s started, I can’t intervene – to test claims about what improves model accuracy, with hard “either it did or it didn’t” fitness functions to gauge success. (Typically, a hidden acceptance test suite, scoring on how many tests passed).

But these are small-scale. I might run them 10x, because I simply can’t afford the tokens. If I published the results, that’s the first thing folks would say: “But this is just 10 data points, Jason”.

And they’d be quite right to, just as I’d be right to take a single data point (especially an anecdotal one) with a big pinch of salt.
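
To get a feel for why 10 runs is thin, here’s a toy simulation in Python. Everything in it is an assumption made up for illustration: a “true” pass rate of 0.7, 20 hidden acceptance tests per run, and tests that pass or fail independently of each other. It compares two batches of runs where nothing at all differs between them, and reports how far apart their average scores land by chance alone.

```python
import random
import statistics


def run_score(rng, pass_rate, tests_per_run):
    """One closed-loop run: fraction of hidden acceptance tests passed."""
    passed = sum(rng.random() < pass_rate for _ in range(tests_per_run))
    return passed / tests_per_run


def average_score(rng, runs, pass_rate=0.7, tests_per_run=20):
    """Average score over a batch of runs of the same set-up."""
    scores = [run_score(rng, pass_rate, tests_per_run) for _ in range(runs)]
    return statistics.fmean(scores)


def typical_gap(runs, trials=500, seed=1):
    """Mean gap between two batches when NOTHING differs between them."""
    rng = random.Random(seed)
    gaps = [
        abs(average_score(rng, runs) - average_score(rng, runs))
        for _ in range(trials)
    ]
    return statistics.fmean(gaps)


print(f"10 runs per batch:  {typical_gap(10):.3f}")
print(f"100 runs per batch: {typical_gap(100):.3f}")
```

Under these assumptions, the chance gap between two 10-run batches comes out at roughly three times the gap between two 100-run batches, because the noise in an average shrinks with the square root of the number of runs. Any real improvement smaller than that chance gap is invisible at 10 runs.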

And that’s why I’ve leaned more on related – the same phenomena but in different problem domains – large-scale peer-reviewed studies and on the science (e.g., statistical mechanics). So the stage I’m at is: “In theory, this should happen, and in my small experiment, that’s kind of what happened”.

In that respect, I appear to be ahead of the curve, with most folks still relying entirely on how it “feels”. Confirmation bias and projection still dominate the discourse.

It turns out there’s a significant correlation between confidence in “AI” output and belief in the paranormal. Our susceptibility to suggestion appears to be a factor here.

Meanwhile, I see more and more of the ideas presented in my AI-Ready Software Developer blog series being incorporated into workflows and even into the tools themselves. I can’t take the credit, of course. But when a bunch of different people independently come to similar conclusions, that’s interesting data in itself. We could all just be suffering the same delusions, though. Just because most teams use Pull Requests, that doesn’t make them a good idea.

In my defence, I’ve tried very hard to follow the best available evidence, and I’ve been taking nobody’s word for it. This has, at times, made me about as welcome as James Randi at a spoon-bending convention.

I see folks claiming that all of the frontier models and all of the “AI” coding assistants took a big step up in performance in the last 2-3 months. Claude users are saying it. Cursor users are saying it. Copilot users are saying it.

But did the technology really shift over Christmas, or was it actually a shift in expectations and incentives feeding a sort of mass delusion fuelled by social proof?

When the CTO comes back to the office in January and declares “This is the future! Get with it, teams!”, does our perception of tool performance change in response? Is this how religions start?

But this is nothing new in our field.

If we’re to move forward in our understanding of what’s real and what works, and avoid surrendering to “AI Woo”, perhaps we need a way to test our hypotheses at a statistically significant scale, and in a more methodical way?

It’s my hope that – on this particularly important subject for our profession – we can just this once move beyond social proof, beyond online debate, beyond committees and manifestos drawn up by people who just happened to be in the room at the time, and beyond appeals to authority, and engage with the reality in a more objective way.

I, for one, would really like to know what that reality is.

Super-Mediocrity

March will be the third anniversary of the beginning of my journey with Large Language Models and generative “A.I.”

At the time, we were all being dazzled – myself included – by ChatGPT, the chat interface to OpenAI’s “frontier” LLM, GPT-4.

There was much talk at the time of this technology eventually producing Artificial General Intelligence (AGI) – intelligence equal to that of a human being – and, from there, ascending to god-like “super-intelligence”. All we needed was more data and more GPUs.

It’s now becoming very clear that scaling very probably isn’t the path to AGI, let alone super-intelligence. But it was pretty clear to me at the time, even after just a few dozen hours experimenting with the technology.

The way I see it, LLMs are playing a giant game of Blankety Blank. And you don’t win at Blankety Blank by being original or witty or clever. You win at Blankety Blank by being as average as possible.

The more data we train them on, the more compute we use, the more average the output will get, I speculated. Model performance will tend towards the mean.

Three years later, is there any hard evidence to back this up? Turns out there is – the well-documented phenomenon of model collapse.

Train an LLM on human-created text, then train another LLM on outputs from the first LLM, then another from the outputs generated by that “copy”. Researchers found that, from one generation to the next, output degraded until it became little more than gibberish.

What causes this is that output generated by LLMs clusters closer to the mean than the data they’re trained on. Long-tail examples – things that are novel or niche – get effectively filtered out. The text generated by LLMs is measurably less diverse, less surprising, less “smart” than the text they’re trained on.

Given infinite training data, and infinite compute, the resulting model will not become infinitely smart – it will become infinitely average. I coined the term “super-mediocrity” to describe this potential final outcome of scaling LLMs.
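
The drift towards the mean can be illustrated with a toy simulation in Python. To be clear, this is nothing like a real LLM. It assumes each “model” is just a bell curve fitted to its training data, and that generation under-samples the long tail, modelled here by dropping anything more than 2 standard deviations from the mean. Those assumptions, and all the names, are invented for the sketch.

```python
import random
import statistics


def next_generation(data, rng, tail_cutoff=2.0):
    """'Train' on data (fit mean and spread), then 'generate' a new data set."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    samples = []
    while len(samples) < len(data):
        x = rng.gauss(mu, sigma)
        # Assumption: long-tail outputs never get generated.
        if abs(x - mu) <= tail_cutoff * sigma:
            samples.append(x)
    return samples


def spread_by_generation(generations=10, size=2000, seed=42):
    """Standard deviation of the data at generation 0, 1, 2, ..."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(size)]  # "human" data
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        data = next_generation(data, rng)
        spreads.append(statistics.pstdev(data))
    return spreads


for generation, spread in enumerate(spread_by_generation()):
    print(f"generation {generation:2d}: spread {spread:.3f}")
```

Each generation is trained only on what the previous one produced, so the lost tails never come back and the spread keeps shrinking.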

(What really strikes me, watching this video again after 3 years, is just how on-the-money I was even then. I guess the lesson is: don’t bet against entropy.)

Naturally, my focus is on software development. The burning question for me is, what does super-mediocre code look like? When it comes to the code that models like Claude Opus and GPT-5 are trained on, what’s the mean? And it’s bad news, I’m afraid.

We know what the large-scale publicly-available sources of code examples are – places like Stack Overflow and GitHub. And the large majority of code samples we find on these sites are… how can I put this tactfully?… crap.

The ones that actually even compile often contain bugs. The ones that don’t contain bugs are often written with little thought to making them easy to understand and easy to change. And that’s before we get on to the subject of things like security vulnerabilities.

Hard to believe, I know, but when Olaf posted that answer on Stack Overflow, he wasn’t thinking about those sorts of things. Because who in their right mind would just copy and paste a Stack Overflow answer into their business-critical code? Right? RIGHT?

And LLM-generated code tends towards the average of that. It tends to be idiomatic, “boilerplate” and often subtly wrong – as well as often being more complicated than it needed to be. It’s that junior developer who just copies what they’ve seen other developers do, without stopping to wonder why they did it. Monkey see, monkey do.

What does super-mediocrity at scale look like, we might ask? I think a bit of a clue can be found on the Issues pages of our most-beloved “AI” coding tools.

As a daily user of these tools, I’m often taken aback at just how buggy updates can be. And I see a lot of chatter online complaining about how unreliable some of the most popular “AI” coding assistants are, so I’m evidently not alone.

Anthropic have been boasting about how pretty much 100% of their code’s generated by one of their models these days, usually being driven in FOR loops (you may know them as “agents”).

I’ll skip the jokes about dealers “getting high on their own supply”, and just make a basic observation about the practical implications of attaching a super-mediocrity generator that’s been trained on mostly crap to your development process.

Just as I don’t lift code blindly from Stack Overflow without putting it through some kind of quality check – and that often involves fixing problems in it, which requires me to understand it – I also don’t accept LLM-generated code without putting it through the same filter. It has to go through my brain to make it into the code.

This is an unavoidable speed limit on code generation – code doesn’t get created (or modified) faster than I can comprehend it.

When code generation outruns comprehension, slipping into what I call “LGTM-speed”, well… we see what happens. Problems accumulate faster, while our understanding of the code – and therefore our ability to fix the problems – withers. Mean Time To Failure gets shorter. Mean Time To Recovery gets longer.

Your outages happen more and more often, and they last longer and longer.
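
The arithmetic behind that can be sketched with the standard steady-state availability formula, MTTF / (MTTF + MTTR). The figures below are hypothetical, picked only to show the direction of travel.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the share of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, purely for illustration.
before = availability(mttf_hours=720, mttr_hours=1)  # fails about monthly, fixed in an hour
after = availability(mttf_hours=168, mttr_hours=8)   # fails about weekly, takes a working day

print(f"before: {before:.2%}")
print(f"after:  {after:.2%}")
```

Both trends push the same way: a shorter time to failure shrinks the numerator, and a longer time to recovery grows the denominator.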

Yes, this happens with human teams, too. But an “AI” coding assistant can get us there in weeks instead of years.

As of writing, there’s no shortcut. Sorry.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.