Tech Leaders: Low-Performing Teams Are A Gift, Not A Curse

It’s an age-old story. You’re parachuted in to lead a software development organisation that’s experiencing – let’s be diplomatic – challenges.

Releases are far apart and full of drama. Lead times are long, and unless it’s super, super necessary – and they’re prepared to dig deep into their pockets and wait – the business has just stopped asking for changes. It saves time. They know the answer will probably be “too expensive”.

The only respite comes with code freezes and release embargos – never on a Friday, never over the holidays. But is a break really a break when you know what you’ll be coming back to?

Morale is low, the best people keep leaving, and they can almost smell the chaos on you when you’re trying to hire them.

The finger-pointing is never-ending. And soon enough, those fingers will be pointing at you. Honeymoon periods don’t last long, and the clock is ticking to effect noticeable positive change. They lit the fuse the moment you said “yes”.

I know, right! Nightmare!

Except it isn’t. For a leader new in the chair, it’s actually a golden ticket.

The mistake most tech leaders make is to go strategic – bring in the big guns, make the big changes. It’s a big song and dance number, often with the words “Agile” and/or “transformation” above the door.

This creates highly-visible activity at management and process level. But not meaningful change. Agility Theatre is first and foremost still theatre. You’re focusing on trying to speed up the outer feedback loops of software and systems development – at high cost – without recognising where the real leverage is.

Where I work as a teacher and advisor is at the level of small, almost invisible changes to the way software developers work day-to-day, hour-to-hour, minute-to-minute.

The biggest leverage in a system of feedback loops within feedback loops is in the innermost loops.

Think about it in programming terms: you have nested loops in a function that’s really slow, and you need to speed it up. Which loop do you focus your attention on – the outer loop, or the inner loop? Which one would optimising give you the most leverage for the least change?
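To make that concrete, here’s a tiny, invented Python sketch: the same micro-optimisation – hoisting an invariant calculation out of a loop – pays back far more when applied to the innermost loop, because that’s the code that runs most often. (The function names and the scenario are mine, purely for illustration.)

```python
import math

def slow_total(grid, origin):
    # The scale factor never changes, but it's recomputed
    # on every pass through the *inner* loop body.
    total = 0.0
    for row in grid:
        for value in row:
            scale = math.hypot(origin[0], origin[1])  # invariant work
            total += value * scale
    return total

def fast_total(grid, origin):
    # The same invariant hoisted out: computed once instead of
    # rows * columns times. Optimising the innermost loop gives
    # the most leverage for the least change.
    scale = math.hypot(origin[0], origin[1])
    total = 0.0
    for row in grid:
        for value in row:
            total += value * scale
    return total
```

The same principle applies to development workflows: a small improvement to the innermost loop compounds across every iteration of the loops that contain it.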

I worked with a FinTech company in London who called me in because they’d been going round in circles trying to stabilise a release. Every time they deployed, the system exploded on contact with users – costing a lot of money in fines and compensation, and doing much damage to their reputation.

“Agile coaches” had been engaging with management and, to a lesser extent, teams to try to “fix the process”.

I took one look, saw their total reliance on manual testing, and recognised the “doom loop” immediately. Testing the complete system took a big old QA team several weeks, during which time the developers were busy fixing the bugs reported from the last test cycle, while – of course – introducing all-new bugs (and reintroducing a few old favourites).

They’d been stuck in this loop for 9 months, at a cost running into millions of pounds – plus whatever they’d been paying for the “Agile coaching” that had made no discernible difference. You can’t fix this problem with “stand-ups”, “cadences” and “epics”.

I took the developers into a meeting room and taught them to write NUnit tests. By prioritising the most unstable features, they made the problem go away within 6 months, never to return.

And something else happened – something quite magical. Not only did the software become much more reliable, but release cycles got shorter. Better software, sooner.

That’s quite the notch on the belt for a new CTO or head of engineering, dontcha think?

After decades of helping organisations build software development capability, I know from wide experience that taking a team from bad to okay is way easier than taking them from okay to good, which is way easier than from good to excellent.

Having helped teams climb that ladder all the way to the top, I can attest that the first rung is a real gift. Your CEO or investors probably can’t distinguish excellent from good, but the difference between bad and okay is very noticeable to businesses.

Unit testing is table stakes in that journey, but the difference it makes is often huge. Low-visibility, but very high-leverage.

Here’s the 101: the inner loops of software development – coding, testing, reviewing, refactoring, merging – are where the highest-leverage changes can be made at surprisingly low cost.

90% of the time, nobody but the developers themselves need get involved. No extra budget’s required (you might even save money). No management permission need be sought. No buy-in from other parts of the organisation is necessary.

High-leverage changes that can slip under the radar, but that can also have quite profound impact.

And results are much easier to sell than promises. Imagine, when your honeymoon period’s up, being able to point to a graph showing how lead times have shrunk, and how fires in production have become rare.

I’ve seen that buy tech leaders the higher trust and autonomy they need to make those strategic changes that the infamous organisational antibodies may well have attacked before. They were immunised by results.

Now for the really good news. In all those cases I’ve been involved in – many over the years – tech leaders only did one thing.

I am not an “Agile coach” or a “transformation lead”. I do not sit in meetings with management. I do not address strategic matters or matters of high-level process.

I work directly with developers, helping them to make those high-leverage, mostly invisible changes to their innermost workflows. Your CEO will likely never meet me, or even hear my name.

The first rule of Code Craft is that we do not talk about Code Craft.

Right now, you may be thinking, “But Jason, we’ve got AI. Developers can produce 2x the code in the same time.”

Do you think 2x the code hitting that FinTech company’s testing bottleneck would have made things better? Or would that just have been more bugs hitting QA in bigger batches?

Attaching a code-generating firehose to your dev process is very likely to overwhelm bottlenecks like testing, code review and merging – making the system as a whole slower and leakier. I call it “LGTM-speed”.

Tightening up those inner loops is even more essential when we’re using “AI” code generators. If we don’t, lead times get longer, releases get less stable, and the cost of change goes up. Worse software, later.

Good luck presenting that graph to the C-suite!

If you need help tightening up the inner feedback loops in your software development processes – with or without “AI” – that’s right up my street.

Visit https://codemanship.co.uk/ for details of training and coaching in the technical practices that enable rapid, reliable and sustainable evolution of software to meet changing needs.

Am I Anti-AI? No. I’m Anti-Harm.

If you’ve been following my blog posts and social media noisings about generative “AI” (see, there’s those quotation marks again) – and especially about LLM-assisted software development – you might be forgiven for thinking that I’m against it.

For sure, I take a skeptical, evidence-based view. I’ve followed the data, learned the underlying mechanics, and in many instances tested claims myself to try and establish what’s real and what works. (And for it to work, it has to be real.)

This has led me to a far less sensational position than many you’ll see out there on Teh Internets. And, from the perspective of people who’ve drunk the Kool-Aid – and now insist that we all drink it – that can look like an anti-AI take.

For the sake of clarity, I’m not “anti-AI”.

I’m anti:

  • Shipping code at LGTM-speed
  • Replacing user research & feedback with hallucinations
  • Abdicating thinking
  • BDUF in .md files
  • Burning the planet for – in most cases – lower productivity & worse reliability
  • Stealing IP on an industrial scale
  • Economy-distorting investment fraud
  • Gross invasions of privacy
  • “AI psychosis”
  • Technofeudalism & technofascism (Let’s be honest – the folks controlling the hyper-scale models are not big fans of democracy)

A future is possible where development teams use small, specialised language models, trained on ethically-sourced data using a tiny fraction of the compute and energy, and running on consumer hardware under our control. Nobody else sees our data.

These models, and the data they’re trained on, can be a common good – prepared by communities with the express intent of licensing them for ethically-trained-and-operated “AI”.

Base models can be finetuned for our purposes using our data (e.g., our code).

And they won’t be trained to try to fool us into believing they think, feel or care about us, or to drive engagement.

Devs will use them when it adds value, not just “because AI”.

And they’ll use them in highly iterative processes where they’re 100% engaged – continuously testing, inspecting, refactoring and integrating code in small batches, and never letting things slide into LGTM-speed.

So, yes, I’m actually “pro-AI” – I use it every day.

I just have a different vision of the future to what the hyper-scalers and their backers would prefer – democratic, environmentally-friendly, economically viable, genuinely useful, and ethically and socially responsible.

Thank you for your attention to this matter.

</rant>

Coding Is When We’re Least Productive

One old dragon that’s reared its head again in this “age of AI” is the very wrongheaded notion that productivity == code. Managers aim to maximise the amount of code their dev teams produce, and so they maximise the time devs spend writing code.

Let me tell you a story about the value of code.

When I first started contracting, I worked on a Point Of Sale system upgrade for a major retailer. I was stuck on a feature, and just couldn’t wrap my head around the use case because I’d never actually seen their existing system in operation.

So I walked to a local branch on the high street, showed them my security pass, and asked if I could observe and – when the cashier wasn’t busy – ask questions.

They told me they could do better than that. Upstairs they had a room with a working till in it, which they used to train new staff. They also used it to test system updates, which was the first time I’d seen what we now call a “model office” for software testing.

We were able to run through the use case scenarios I was stuck on, with them showing me how to use the POS system to achieve specific goals. The mist cleared. It was like I’d been reading a book on how to ride a bicycle that had no pictures, and then someone gave me a bicycle.

Enlightened, I walked back to the office and changed about three lines of code. That was all the coding I did that day. Three lines, in 8 hours.

But if I hadn’t made that trip and seen for myself, and had a chance to talk to a department manager in that store, those three lines would have been applying special offers wrong. (If only the person who wrote the spec had done this in the first place…)

Now multiply that error by 250 branches nationwide. I potentially saved my client a bunch of money and embarrassment with that 3-line change.

Now, I consider that a productive day.

But had I been measured on my contribution by lines of code, or commits, or features finished, it would have been seen as a very unproductive day by my manager.

I may have felt pressured to stay at my workstation, bashing out more code, compounding the mistake and costing my client more money.

A very teachable moment for me early in my career. The lightbulb pinged: some code is worth more than others, and coding and productivity aren’t the same thing.

We could even argue that coding is the interruption. What’s the shortcut in IntelliJ that tells you if you’re writing the wrong code? Oh, that’s right. There isn’t one.

When I’m heads-down-coding, I’m not seeing, I’m not asking, and I’m not learning about the problem. To do that, I have to get up from my desk, go to where the problem is and/or the people I need to ask are, and have a conversation. Let the dog see the rabbit.

This takes time. And if we’re producing code faster than we can validate it – either by exploring the problem ourselves, or learning from user feedback if our release cycles are fast enough – then we’re piling assumptions on top of assumptions.

I’ve seen so many times how 10 lines of code can end up being worth £millions, and 10,000 ends up being worthless.

If productivity, in reality, is a measure of how much net value we create, then that learning feedback loop is where the real productivity happens, and not at our desks punching keys. Coding is when we’re least productive.

Writing Code May Be Dead (Not Really), But Reading Code Will Live On

“The age of the syntax writer is over”, hailed a post on LinkedIn. (Why is it always LinkedIn?)

I say: welcome to the age of the syntax READER!

If I presented you with a bowl of candy and told you only 1% had broken glass in them, what percentage would you check before giving them to your kids?

The fact is that a 1% error rate would be an order-of-magnitude improvement in the reliability of LLM-generated code, and one we shouldn’t expect for a very long time – if ever – as scaling hits the wall of diminishing returns at exponentially increasing cost.

The need to check generated code isn’t going to go away, and therefore neither is the need to understand it.

And, given the error rates involved, we may need to understand all of it. At least, anyone who doesn’t want to be handing candy-covered shards to their users will.

“But Jason, we don’t need to check machine code or assembly language generated by compilers.”

That’s a category mistake. When was the last time a compiler misinterpreted your source code, or hallucinated output? Compiler boo-boos are very rare.

If my compiler randomly misinterpreted my source code just 1% of the time, then, yes, I’d check all of the generated machine code for anything that was going to be put in the hands of significant numbers of users. And that means I’d need to be able to understand that machine code.

Now for the fun part.

Decades of studies into program comprehension show that one of the best predictors of a person’s ability to understand code is how much experience they have writing it.

Cognitively, we don’t engage with syntax and semantics more deeply than when we’re writing the code ourselves. (Which is why I strongly discourage students from copying and pasting – it stunts their growth.)

In my blog series The AI-Ready Software Developer, I talk about how a mountain of “comprehension debt” can rapidly accumulate as “AI” produces code far faster than we can wrap our heads around it. I call this “LGTM-speed”.

I recommend “staying sharp” where code comprehension is concerned, as well as slowing down code generation to the speed of comprehension. When you’re drinking from a firehose, the limit isn’t the firehose.

This implies that we’re not limited by how many tokens/second “AI” coding assistants can predict, but by how many tokens/second we can understand. That’s the thing we need to optimise.

The best way – the only way, really – to maintain good code comprehension is to write code regularly.

We need to keep our hand in to make sure we don’t get caught in a trap where comprehension debt balloons as our ability to comprehend withers.

That leads to a situation where serious, show-stopping – potentially business-stopping – errors become much more likely to make it into releases, and there’s nobody on the team who can fix them.

And in that scenario, who are they gonna call? The “AI-native developer” who boasts they haven’t written code in months, or the developer who was debugging that kind of code just this morning?

Ready, Fire, Aim!

I teach Test-Driven Development. You may have heard.

And as a teacher of TDD for some quarter of a century now, you can probably imagine that I’ve heard every reason for not doing TDD under the Sun. (And some more reasons under the Moon.)

“It won’t work with our tech stack” is one of the most common, and one of the most easily addressed. I’ve done and seen done TDD on all of the tech stacks, at all levels of abstraction from 4GLs down through assembly language to the hardware design itself. If you can invoke it and get an output, you can automatically test it. And if you can automatically test it, you can write that test first.

(Typically, what they really mean is that the architecture of the framework(s) they’re using doesn’t make unit testing easy. That’s about separation of concerns, though, and usually work-aroundable.)

The second most common reason I hear is perhaps the more puzzling: “But how can I write tests first if I don’t know what the code’s supposed to do?”

The implication here is that developers are writing solution code without a clear idea of what they expect it to do – that they’re retrofitting intent to implementations.

I find that hard to imagine. When I write code, I “hear the tune” in my head, so to speak. The intended meaning is clear to me. When I run it, my understanding might turn out to be wrong. But there is an expectation of what the code will do: I think it’s going to do X.

My best guess is that we all kind of sort of have those inner expectations when we write code. The code has meaning to us, even if we turn out to have understood it wrong when we run it.

So I could perhaps rephrase “How can I write tests first if I don’t know what the code’s supposed to do?” to articulate what might actually be happening:

“How do I express what I want the code to do before I’ve seen that code?”

Take this example of code that calculates the total of items in a shopping basket:

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

When I write this code, in my head – often subconsciously – I have expectations about what it’s going to do. I start by declaring a sum of zero, because an empty basket will have a total of zero.

Then, for every item in the basket, I add that item’s price multiplied by its quantity to the sum.

So, in my head, there’s an expectation that if the basket had one item with a quantity of one, the total would equal just the price of that item.

If that item had a quantity of two, then the total would be the price multiplied by two.

If there were two items, the total would be the sum of price times quantity of both items.

And so on.

You’ll notice that my thinking isn’t very abstract. I’m thinking more with examples than with symbols.

  • No items.
  • One item with quantity of one.
  • One item with quantity of two.
  • Two items.

If you asked me to write unit tests for the total function, these examples might form the basis of them.

A test-driven approach just flips the script. I start by listing examples of what I expect the function to do, and then – one example at a time – I write a failing test, write the simplest code to pass the test, and then refactor if I need to before moving on to the next example.

    def test_total_of_empty_basket(self):
        items = []
        basket = Basket(items)

        self.assertEqual(0.0, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        return 0.0

What I’m doing – and this is part of the art of Test-Driven Development – is externalising the subconscious expectations I would no doubt have as I write the total function’s implementation.

Importantly, I’m not doing it in the abstract – “the total of the basket is the sum of price times quantity for all of its items”.

I’m using concrete examples, like the total of an empty basket, or the total of a single item of quantity one.

“But, Jason, surely it’s six of one and half-a-dozen of the other whether we write the tests first or write the implementation first. Why does it matter?”

The psychology of it’s very interesting. You may have heard life coaches and business gurus tell their audience to visualise their goal – picture themselves in their perfect home, or sipping champagne on their yacht, or making that acceptance speech, or destabilising western democracy. It’s good to have goals.

When people set out with a clear goal, we’re much more likely to achieve it. It’s a self-fulfilling prophecy.

We make outcomes visible and concrete by adding key details – how many bedrooms does your perfect home have? How big is the yacht? Which Oscar did you win? How little regulation will be applied to your business dealings?

What should the total of a basket with no items be? What should the total of a basket with a single item with price 9.99 and quantity 1 be?

    def test_total_of_single_item(self):
        items = [
            Item(9.99, 1),
        ]
        basket = Basket(items)

        self.assertEqual(9.99, basket.total())

We precisely describe the “what” – the desired properties of the outcome – and work our way backwards directly to the “how”: what would be the simplest way of achieving that outcome?

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            return self.items[0].price
        return 0.0

Then we move on to the next outcome – the next example:

    def test_total_of_item_with_quantity_of_2(self):
        items = [
            Item(9.99, 2)
        ]
        basket = Basket(items)

        self.assertEqual(19.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        if len(self.items) > 0:
            item = self.items[0]
            return item.price * item.quantity
        return 0.0

And then our final example:

    def test_total_of_two_items(self):
        items = [
            Item(9.99, 1),
            Item(5.99, 1)
        ]
        basket = Basket(items)

        self.assertEqual(15.98, basket.total())

class Basket:
    def __init__(self, items):
        self.items = items

    def total(self):
        sum = 0.0
        for item in self.items:
            sum += item.price * item.quantity
        return sum

If we enforce that items must have a price >= 0.0 and an integer quantity > 0, this code should cover any list of items, including an empty list, with any price and any quantity.
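The Item class itself isn’t shown in these examples. One hypothetical way it might enforce those invariants – this is my sketch, not code from the original – is to reject bad values at construction time:

```python
class Item:
    def __init__(self, price, quantity):
        # Enforce the invariants assumed above:
        # price >= 0.0, and an integer quantity > 0.
        if price < 0.0:
            raise ValueError("price must be >= 0.0")
        if not isinstance(quantity, int) or quantity <= 0:
            raise ValueError("quantity must be an integer > 0")
        self.price = price
        self.quantity = quantity
```

With the invariants guarded at the boundary, total doesn’t need defensive checks of its own.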

And our unit tests cover every outcome. If I were to break this code so that, say, an empty basket causes an error to be thrown, one of these tests would fail. I’d know straight away that I’d broken it.

This is another self-fulfilling prophecy of starting with the outcome and working directly backwards to the simplest way of achieving it – we end up with the code we need, and only the code we need, and we end up with tests that give us high assurance after every change that those outcomes are still being satisfied.

Which means that if I were to refactor the design of the total function:

    def total(self):
        return sum(
                map(lambda item: item.subtotal(), self.items))

I can do that with high confidence.
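Note that the refactored total delegates to an item.subtotal() method we haven’t seen yet – the price-times-quantity logic has moved onto the item itself. A minimal sketch of what that might look like (my invention, not code from the original examples):

```python
class Item:
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity

    def subtotal(self):
        # Each item knows its own line total, so Basket.total
        # can simply sum the subtotals.
        return self.price * self.quantity
```

Pushing the calculation onto Item is itself a small design improvement – the kind of refactoring the tests make safe.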

If I write the code and then write tests for it, several things tend to happen:

  • I may end up with code I didn’t actually need, and miss code I did need
  • I may well miss important cases, because unit tests? Such a chore when the work’s already done! I just wanna ship it!
  • It’s not safe to refactor the new code without those tests, so I have to leave that until the end, and – well, yeah. Refactoring? Such a chore! etc etc etc.
  • The tests I choose – the “what” – are now being driven by my design – the “how”. I’m asking “What test do I need to cover that branch?” and not “What branch do I need to pass that test?”

And finally, there’s the issue of design methodology. Any effective software design methodology is usually usage-driven. We don’t start by asking “What does this feature do?” We start by asking “How will this feature be used?”

What the feature does is a consequence of how it will be used. We don’t build stuff and then start looking for use cases for it. Well, I don’t, anyway.

In a test-driven approach, my tests are the first users of the total function. That’s what my tests are about – user outcomes. I’m thinking about the design from the user’s – the external – perspective and driving the design of my code from the outside in.

I’m not thinking “How am I going to test this total function?” I’m thinking “How will the user know the total cost of the basket?” and my tests reveal the need for a total function. I use it in the test, and that tells me I need it.

“Test-driven”. In case you were wondering what that meant.

When we design code from the user’s perspective, we’re far more likely to end up with useful code. And when we design code with tests playing the role of the user, we’re far more likely to end up with code that works.

One final question: if I find myself asking “What is this function supposed to do?”, is that a cue for me to start writing code in the hope that somebody will find a use for it?

Or is that my cue to go and speak to someone who understands the user’s needs?

Thanks to AI, Your Waterfall Is Showing

Here’s my hypothesis (and I’ve seen real-world examples with client teams that make me ask this question):

Dev teams bring “AI” coding assistants into their daily workflows. Very quickly, much more code starts hitting the feedback loops in their process: testing, code review, integration, user feedback.

This starts to overwhelm those loops. Delays get longer, more Bad Stuff leaks through into production, systems get less stable and teams end up spending more and more time playing whack-a-mole with issues.

Far from making these teams faster overall, the traffic jams get worse and their journeys take even longer.

So some teams adapt*. They reduce batch sizes going into the feedback loops: fewer changes being tested, reviewed and integrated at a time, in tighter feedback loops.

We know that if you tighten these feedback loops, three things tend to happen:

1. Lead times shrink

2. Reliability improves

3. Cost of change goes down

My hypothesis is that when we see positive systemic impact from “AI” code generation, it’s actually more attributable to the team adapting to it, and not directly to the “AI” itself.

“AI” code generation load-tests your dev process and forces you to address the worst bottlenecks.

Basically, “Your Waterfall is showing”.

* And, of course, some teams run in the exact opposite direction, getting even more Waterfall. Silly Billies!


It just so happens that I specialise in helping development teams build the technical skills needed to shrink lead times, improve reliability and lower cost of change – with and without AI.

I know, right! What a happy coincidence!

Visit my website to find out more.

Finally! Proof That Agentic AI Scales (For Creating Broken Software)

Some of the marketing choices made by the “AI” industry over the last few years have seemed a little… odd.

The latest is a “breakthrough” in “agentic AI” coding heralded by Cursor, in which they claim that a 3+ million-lines-of-code (MLOC) web browser was generated by 100 or so agents in a week.

It certainly sounds impressive, and many of the usual AI boosters have been amplifying it online as “proof” that agentic software development works at scale.

But don’t start ordering your uniform to fight in the Butlerian Jihad just yet. They might be getting a little ahead of themselves.

Did 100 agents generate 3 MLOC in about a week? It would appear so, yes. So that part of the claim’s probably true.

Did 100 agents generate a working web browser? Well, I couldn’t get it to work. And, apparently, other developers couldn’t get it to work.

Feel free to try it yourself if you have a Rust compiler.

And while you’re looking at the repo – and it surprises me it didn’t occur to them that anybody might – you might want to hop over to the Action performance metrics on the Insights page.

An 88% job failure rate is very high. It’s kind of indicative of a code base that doesn’t work. And looking at the CI build history on the Actions page, it appears it wasn’t working for a long time. I couldn’t go back far enough to find out when it became a sea of red builds.

Curiously, near to the end, builds suddenly started succeeding. Did the agents “fix” the build in the same way they sometimes “fix” failing tests, I wonder? If you’re a software engineering researcher, I suspect there’s at least one PhD project hiding in the data.

But, true to form, it ended on a broken build and what does indeed appear to be broken software.

The repo’s Action usage metrics tell an interesting story.

The total time GitHub spent running builds on this repo was 143,911 minutes. That’s more than three months of round-the-clock builds in about a week.

This strongly suggests that builds were happening in parallel, and that strongly suggests agents were checking in changes on top of each other. It also suggests agents were pulling changes while CI builds were in progress.

This is Continuous Integration 101. While a build is in progress, the software’s like Schrödinger’s Cat – simultaneously working and not working. Basically, we don’t know if the changes being tested in that check-in have broken the software.

The implication is, if our goal is to keep the code working, that nobody else should push or pull changes until they know the build’s green. And this means that builds shouldn’t be happening in parallel on the same code base.
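To make the “garden gate” concrete, here’s a toy model – entirely invented, not any real CI tool’s API. Integration to a shared mainline behaves like a mutual-exclusion lock, and a red build rejects the change rather than landing it:

```python
from threading import Lock

class Mainline:
    """Toy model of a CI 'garden gate': one check-in at a time,
    and a red build rejects the change instead of landing it."""

    def __init__(self):
        self._gate = Lock()
        self._head = []  # committed changes: always a green build

    def check_in(self, change, build_passes):
        # build_passes: a callable that runs the full build and tests
        # against head + change, returning True only if it's green.
        with self._gate:  # nobody pushes or pulls mid-build
            if build_passes(self._head + [change]):
                self._head.append(change)
                return True
            # "Roll back": head stays at the last green build.
            return False
```

Serialising through the gate is what makes “the last green build” a meaningful place to roll back to; parallel check-ins on the same mainline destroy that guarantee.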

Your dev team – agentic or of the meat-puppet variety – may be a 100-lane motorway, but a safe CI pipeline remains a garden gate.

The average job queuing time in the Action performance metrics illustrates what happens when a 100-lane motorway meets a garden gate.

And the 88% build failure rate illustrates what happens when motorists don’t stop for it.

The other fact staring us in the face is that the agents could not have been doing what Kent Beck calls “Clean Check-ins” – only checking in code that’s passing the tests.

They must have been pulling code from broken builds to stay in sync, and pushing demonstrably broken code (if they were running the tests, of course).

In the real world, when the build breaks and we can’t fix it quickly, we roll back to the previous working version – the last green build. Their agentic pile-on doesn’t appear to have done this. It broke, and they just carried on 88% of the time.

Far from proving that agentic software development works at scale, this experiment has proved my point. You can’t outrun a bottleneck.

If the agents had been constrained to producing software that works, all their check-ins would have had to go in single file – one at a time through the garden gate.

That’s where those 143,911 total build minutes tell a very different story. That’s the absolute minimum time it would have taken – with no slip-ups, no queueing, and so on – to produce a working web browser on that scale.

Realistically, with real-world constraints and LLMs’ famous unreliability – years, if ever. I strongly suspect it just wouldn’t be possible, and this experiment has just strengthened that case.

Who cares how fast we can generate broken code?

The discipline of real Continuous Integration – that results in working, shippable software – is something we explore practically with a team CI & CD exercise on my 3-day Code Craft training workshop. If you book it by January 31st 2026, you could save £thousands with our 50% off deal.

Productivity Theatre

The value proposition of Large Language Models is that they might boost our productivity as programmers (when we use them with good engineering discipline). And there’s no doubting that there are things we can do faster using this technology.

It would be a mistake, though, to assume that we can do everything faster using them.

I’ve watched many developers prompting, say, Claude or Cursor asking them to perform tasks that they could have done much faster – and more reliably – themselves using “classical” tools or just typing the damn code instead of a prompt.

For example, there have been times when I’ve seen developers writing prompts like “Claude, please extract lines 23-29 into a new method called foo that returns the value of x” when their IDE could do that with a few keystrokes.
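For illustration, this is the kind of change the IDE’s Extract Method refactoring makes in a few keystrokes – the function and variable names here are my own hypothetical example, not from any real session:

```python
# Illustrative before/after of an IDE's "Extract Method" refactoring
# (names are hypothetical).

# Before: the discount calculation is buried inline in checkout()
def checkout_before(prices, loyalty_years):
    subtotal = sum(prices)
    discount = min(loyalty_years * 0.01, 0.10)   # <- lines you'd select...
    payable = subtotal * (1 - discount)          # <- ...for extraction
    return round(payable, 2)

# After: the selected lines become a named method that returns the value -
# exactly what the IDE does automatically, with no prompt round-trip.
def apply_loyalty_discount(subtotal, loyalty_years):
    discount = min(loyalty_years * 0.01, 0.10)
    return subtotal * (1 - discount)

def checkout_after(prices, loyalty_years):
    subtotal = sum(prices)
    return round(apply_loyalty_discount(subtotal, loyalty_years), 2)
```

The IDE guarantees the behaviour is preserved; a generated edit has to be checked by hand.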

In these moments, the tool isn’t making them more productive. It’s making them less productive. So we might, when we find ourselves doing it – and I certainly have – pause to reflect on why.

It could be that we just don’t know the easier way. You might be surprised at how many developers haven’t even looked at the refactoring menu in their IDE, for example. Or that we know there’s an easier way, but don’t want to take the time to learn it.

In the latter case, it’s true that it would probably take them longer the first time. So they continue doing it the long way. Arrested development – often under time pressure, or perceived time pressure – is a common condition in our profession.

But in many cases, it seems performative. We know there’s a quicker, easier way, but we feel we need to show that it can be done – a bit like those people who insist you can cook anything in a microwave. Yes, technically you can, but is that always the best or the easiest option?

Someone calling themselves an “AI engineer” or “AI-native” might feel the need to signal to the people around them that they can indeed cook anything in the proverbial microwave.

And then it ceases to be about productivity. It’s about making a point, and demonstrating prowess to peers, superiors and random strangers on LinkedIn. The technology has become part of their professional identity.

Sacrificing real productivity in service to a specific technology or a technique is nothing new, of course. Software developers have been applying the “if all you’ve got is a hammer” principle for many decades – “I don’t know how we’re going to solve this problem, but we’re going to do it with microservices” sort of thing.

Quite often, these decisions – conscious or unconscious – seem to be career and status-driven. If “AI-native” is hot in the job market, that’s what we want on our CV. “AI when it makes sense” is not hot right now. It may be rational, but it’s less in-demand.

I’m still very much unfashionably rational, having sustained a long career by avoiding getting pigeonholed in the many fads and fashions that have come and gone. I’m interested in what’s real and in what works.

You never know. One day that might catch on.

If you want to hone your “classical” software engineering skills for the times when those are the better option, as well as learn how to apply engineering principles to “AI”-assisted development in an evidence-based approach that more and more developers are discovering gets better results – if it’s better results you’re after, of course – then check out my training website for details of courses and coaching, and oodles of free learning resources.

The Gorman Paradox – Solution II: They’re In The Bin

Software development’s essentially a learning process. Most of the value in a product or system’s added in response to user and eventually market feedback.

With each iteration we get the design less wrong. With each iteration, we learn.

The effect of batch size on learning is profound.

I urge teams to work on the basis that every design decision is guesswork until it hits the real world. We can’t know with certainty that we made the right decisions.

Getting user feedback is the only meaningful mechanism we have to “turn the cards over” and find out if we guessed right. In this sense, learning is characterised as reducing or eliminating uncertainty in product design. Teams who do this faster will tend to out-learn their competition.

Imagine trying to guess a random 4-digit number in one go vs. guessing one digit at a time.

In both approaches, we start with the same odds of guessing it right: 1/10,000. But with each guess, the uncertainty collapses orders of magnitude faster when we’re guessing one digit at a time. The latter approach out-learns the former.

Even if we had an “AI” random 4-digit number generator that enabled us to make 10x as many guesses in the same time, guessing one digit at a time would still out-learn us.

The chances of a complete solution delivered in a single pass – guessing all 4 digits in one go – being even on the same continent as correct are vanishingly remote, and we learn very little because of the nature of user feedback.
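The arithmetic of the toy game can be made concrete with a quick simulation – an illustrative sketch, nothing more. On average, guessing whole 4-digit numbers takes around 10,000 attempts, while guessing digit-by-digit takes around 40, so even a 10x faster whole-number guesser is out-learned by orders of magnitude:

```python
import random

def guesses_for_whole_number(secret, rng):
    """Guess complete 4-digit numbers until one matches."""
    guesses = 0
    while True:
        guesses += 1
        if rng.randrange(10_000) == secret:
            return guesses

def guesses_digit_at_a_time(secret, rng):
    """Guess one digit at a time, locking each in when it matches."""
    guesses = 0
    for digit in f"{secret:04d}":
        while True:
            guesses += 1
            if rng.randrange(10) == int(digit):
                break
    return guesses

rng = random.Random(1)  # seeded so the run is repeatable
trials = 50
avg_whole = sum(guesses_for_whole_number(rng.randrange(10_000), rng)
                for _ in range(trials)) / trials
avg_digits = sum(guesses_digit_at_a_time(rng.randrange(10_000), rng)
                 for _ in range(trials)) / trials
# Expectations: ~10,000 guesses vs. ~40. A 10x speed-up on the
# whole-number guesser doesn't come close to closing the gap.
print(avg_whole, avg_digits)
```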

If I deliver 50 changes (e.g., new features) in a single release and ask users “waddaya think?”, I won’t get meaningful feedback about all 50 changes.

Most likely I’ll get general feedback of the “LGTM” or “meh” variety, and maybe some specific feedback about things that stood out. (Bugs in a release tend to overshadow anything else, for example – the proverbial fly in the soup. “Waddaya think of the soup?” “There’s a fly in it!”)

If I deliver ONE change, they’ll probably have something meaningful to say about it. We can at least observe what impact that one change has on user behaviour (e.g., engagement, completing tasks etc).

So we learn faster when we iterate fewer changes into the hands of users at a time. This inevitably forces us to apply the brakes on the creation of code, because we need to wait for feedback, and we need to do that often.

I see many posts here from folks claiming to have generated entire applications in days or even hours using LLM-based coding tools. That’s the equivalent of “guessing all 4 digits in one go using an ‘AI’ 4-digit number generator”. That’s an entire application – hundreds of design decisions – created without any user feedback.

Creating an entire application in a single pass is every bit as “Big Design Up-Front” as wireframing or modeling the whole thing in UML in advance. And assumptions and guesses in your early decisions get compounded in later decisions, piling up uncertainty under a mountain of interconnected complexity. Failure is almost inevitable.

This is another potential solution to the Gorman Paradox.

Where are all the “AI”-generated apps? In the bin.

It just so happens that I train and mentor teams in the technical practices that enable them to learn faster from user and market feedback. I know, right! What are the chances?

And it also just so happens that any Codemanship training course booked by January 31st 2026 is HALF-PRICE. Which is nice.

“First, We Model The Domain…”

In my previous blog post talking about the preciseness of software specifications, I used an example from one of my training workshops to illustrate the value in adding clarity when we have a shared understanding of the problem domain.

Now, when many developers see a UML class diagram – especially those who lived through the age of Big Architecture in the 1990s – it immediately connotes BDUF (Big Design Up-Front). And to be fair, it’s understandable how visual modeling, and UML in particular, gained that reputation, with its association with heavyweight model-driven development processes.

But teams who reject visual modeling outright because it’s “not Agile” are throwing the baby out with the BDUF bathwater.

I recounted in my post how providing a basic domain model with the requirements dramatically reduced misinterpretations in the training exercise.

And I’ve seen it have the same effect in real projects, too. As a tech lead I would often take on the responsibility of creating visual models based on our actual code and displaying them prominently in the team space. As the code evolved, I’d regularly update the models so they were a slightly-lagging but mostly accurate reflection of our design.

Domain models – the business concepts and their relationships – have proven to be the most useful things to share, helping to keep the team on the same page in our understanding of what it is we’re actually talking about.

Most importantly, there’s no hint of BDUF in sight. I describe the domain concepts that are pertinent to the test cases we’re working on. The model grows as our software grows, working in vertical slices in tight feedback loops, and never getting ahead of ourselves. We don’t model the entire problem domain, just the concepts we need for the functionality we’re working on.

In this sense, to describe our approach to design as “domain-driven” might be misleading. The domain doesn’t drive the design, user needs do. And user needs dictate what domain concepts our design needs.

Let’s examine the original requirements:

• Add item – add an item to an order. An order item has a product and a quantity. There must be sufficient stock of that product to fulfil the order

• Total including shipping – calculate the total amount payable for the order, including shipping to the address

• Confirm – when an order is confirmed, the stock levels of every product in the items are adjusted by the item quantity, and then the order is added to the sales history.

I’d tackle these one at a time. The domain model for Add Item would look like:

[Image: domain model for the Add Item use case]

Note that Product price isn’t pertinent to this use case, so it’s not in the model.
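As a rough idea of what that slice might look like in code – the class and member names here are my own illustration, not the workshop’s actual model – the Add Item concepts boil down to:

```python
# Hypothetical code-level rendering of the Add Item slice of the domain model
# (names are illustrative, not from the workshop code).
class Product:
    def __init__(self, name, stock):
        self.name = name    # note: no price - not pertinent to Add Item
        self.stock = stock

class OrderItem:
    def __init__(self, product, quantity):
        self.product = product
        self.quantity = quantity

class Order:
    def __init__(self):
        self.items = []

    def add_item(self, product, quantity):
        # "There must be sufficient stock of that product to fulfil the order"
        if quantity > product.stock:
            raise ValueError("insufficient stock")
        self.items.append(OrderItem(product, quantity))
```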

When I start working on the next use case, Total Including Shipping, the domain model evolves.

[Image: domain model evolved for the Total Including Shipping use case]

And it evolves again to handle the Confirm use case.

[Image: domain model evolved for the Confirm use case]
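A hypothetical sketch of the behaviour Confirm adds – with minimal stand-in classes so the slice is self-contained, and all names my own illustration:

```python
# Minimal stand-ins for the concepts already in the model (illustrative).
class Product:
    def __init__(self, stock):
        self.stock = stock

class OrderItem:
    def __init__(self, product, quantity):
        self.product = product
        self.quantity = quantity

class SalesHistory:
    def __init__(self):
        self.orders = []

class Order:
    def __init__(self, items):
        self.items = items

    def confirm(self, history):
        # "the stock levels of every product in the items are adjusted by
        # the item quantity, and then the order is added to the sales history"
        for item in self.items:
            item.product.stock -= item.quantity
        history.orders.append(self)
```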

And the level of detail in the model, again, is only what’s pertinent. We do not need to know about the getters and the setters and all that low-level malarkey. We can look at the code to get the details. Otherwise it just becomes visual clutter, making the models less useful as communication tools.

Another activity in which visual modeling can really help is as an aid to collaborative design.

I’ve seen it so many times: developers or pairs pick up different requirements, go off into their respective corners, design in isolation, and come back with some serious trainwrecks – duplicated concepts, mismatched architectures, conflicting assumptions, and so on.

It’s the classic “Borrow a video”/”Return a video” situation, where we end up with two versions of the same conceptual model that don’t connect.

It’s especially risky early in the life of a software product, when a basic architecture hasn’t been established yet and everything’s up in the air.

I’ve found it very helpful in those early stages to get everybody around a whiteboard so each person can lay out the design for their specific requirement as part of the same shared model. So if somebody’s already added a Rental class, they add their behaviour around that, and not around their own rental concept.

As the code grows, maintaining a picture of what’s in it – especially domain concepts – gives the team a shared map of what things are and where things go, and a shared vocabulary for discussing and reasoning about problems together.

This is part of the wider discipline of Continuous Architecture, where understanding, planning, evaluating and steering software design is happening throughout the day.

The opposite of Big Design Up-Front.

If your team wants to level up their capability to rapidly, reliably and sustainably evolve working software to meet changing business needs, check out my live, instructor-led and very hands-on training workshops.