Clever Hans Couldn’t Really Do Arithmetic, and LLMs Don’t Really Understand.

I’ve joked in the past that what really makes LLMs work is our tendency to see faces on toast, but there’s a more serious point there about how much of our perception of the ability of models to “understand”, “reason”, “follow instructions” etc is in reality projection.

We’ve evolved to read intention into the behaviour of other people so that we can predict what they might do. But we can also see intent in the behaviour of pets, weather, dishwashers, etc etc. So we shouldn’t be too surprised if something that’s designed to statistically reproduce human creativity and reasoning has that effect on many of us to a much greater extent.

I certainly fell for it during the first few hours experimenting with GPT-4, until I played it at chess, and then the curtain was pulled back. It doesn’t know where the pieces are on the board, it doesn’t plan ahead, it doesn’t know the rules. It literally just predicts – by matching the sequence of moves to its vast example space of chess transcripts – which chess move is most likely to come next.

Once you’ve seen it, it can’t be unseen. But I appreciate that a lot of people have yet to see the tiger in the Magic Eye picture. The dazzling complexity of human language makes it hard to see the wood for the trees. That’s why something simple and deterministic, like chess, makes it much clearer.

As impressive as LLMs can be, I encourage users not to mistake powerful pattern matching and next-token prediction for actual intelligence or understanding. I urge folks who use these tools – which is all they are – to take a rational and evidence-based approach to them, as I’ve been doing for 2 1/2 years now.

Your cat doesn’t understand what you’re saying. It can learn to recognise certain words, your tone of voice, your body language, and associate them with – for example – imminent treats or bath time. That learned behaviour can be easily mistaken for actual conceptual understanding.

Clever Hans couldn’t do arithmetic when he couldn’t see his trainer, but not even his trainer realised he was subconsciously giving off visual cues. Oh yeah, and LLMs don’t understand the instructions and rules in your claude.md file. (A good test is to add a “brown M&Ms” rule to your context.)

But we’re hardwired to see that in them, and it’s a very powerful effect. I see much confirmation bias, for example, in interpreting output – a strong desire to focus on the things they get right while overlooking many of the things they get wrong. And they get a lot wrong.

Expect that, because it’s not going to get much better. You have to keep these tools on a very tight leash.

The “brown M&Ms” test? This is a famous story about Van Halen’s rider for concerts, which demanded a bowl of M&Ms with all the brown ones removed. It was often used to imply that the band were absolute divas, but it had a serious purpose. If the venue hadn’t observed a little detail like that, buried in the contract, the band would double-check everything in their very complex stage show.
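
Applied to a coding assistant, the same idea might look something like the sketch below: plant one trivial, easily checked rule in the context file, then check every output for it mechanically. The rule text, the marker and the function name are all made up for illustration; none of them come from any tool's documentation.

```python
# Hypothetical canary rule, planted somewhere in the middle of the context file.
CANARY_RULE = "Every Python file you create must start with the comment: # no brown M&Ms"
CANARY_MARKER = "# no brown M&Ms"


def canary_observed(generated_file: str) -> bool:
    """Return True if the generated file starts with the canary comment."""
    lines = generated_file.splitlines()
    return bool(lines) and lines[0].strip() == CANARY_MARKER


if __name__ == "__main__":
    compliant = "# no brown M&Ms\nprint('hello')\n"
    non_compliant = "print('hello')\n"
    print(canary_observed(compliant))      # True
    print(canary_observed(non_compliant))  # False
```

If the marker goes missing, that's the cue to double-check everything else the model was told to do.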

eXtreme Programming Reborn: Code Craft & “A.I.”

Going through the practices that many software developers report improve the results they get with “A.I.” coding assistants – the ones I’ve managed to reproduce myself.

* Small, task-specific prompts – solve one problem at a time, reconstruct context for the next task (only what the model needs to know)

* Prompting with tests/usage examples

* Tight feedback loops with continuous testing, code review and refactoring

* Merciless version control, with commits on every acceptable outcome, hard reset when it breaks the code

* Merging directly to the trunk (the code-generating firehose has a tendency to overwhelm PR-based processes)

* External deterministic sources of truth – the code as it is now, test results, linter output, mutation testing reports etc (as opposed to what the context says these things are, or what the model tells us)

* Code that clearly communicates intent

* Good, clean separation of concerns – smaller “blast radius” for changes

* A “birds-eye” view that the developer maintains, because big picture isn’t something LLMs can handle reliably – e.g., high-level design sketches, screenflows and wireframes, test lists and so on. Basically, don’t rely on the LLM to set high-level direction or to plan long-range. That requires actual intelligence.
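
The version control practice in that list is mechanical enough to sketch in code. What follows is a minimal illustration, assuming a git repository and a pytest test suite. The function name `commit_or_reset` and the injectable `run` parameter are invented for the example and aren't part of any tool.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code (0 means success).
Runner = Callable[[Sequence[str]], int]


def shell(cmd: Sequence[str]) -> int:
    """Run a command for real and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def commit_or_reset(message: str, run: Runner = shell) -> bool:
    """Commit the working tree if the tests pass; otherwise throw the changes away."""
    if run(["pytest", "-q"]) == 0:
        run(["git", "add", "-A"])
        run(["git", "commit", "-m", message])
        return True
    # Tests failed: discard changes to tracked files, back to the last good commit.
    run(["git", "reset", "--hard"])
    return False
```

Note that `git reset --hard` leaves untracked files alone, so any new files the assistant created would still need cleaning up separately. Passing in the runner also means the decision logic can be tested with a fake, without touching a real repository.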

Two very interesting things I’m focused on now. The first is the extent to which these practices have a net effect of reducing uncertainty (increasing confidence) in next-token prediction, and/or of minimising bottlenecks in the development process.

This makes sense, since any strategy that might actually work would need a causal mechanism in the models themselves. Since research seems to be rapidly converging on the same mechanism, I think we may well be on to something here.

The second is just how familiar these practices look: small steps (limited WIP), rapid feedback loops, continuous testing, code review, refactoring, integration etc. Remind you of anything, perhaps something a little extreme?

Now, that surely can’t be a coincidence?

Is the net effect of XP technical practices essentially the same, and with the same causal mechanism – maximising confidence in the output by continuously attacking uncertainty, which we achieve by minimising bottlenecks in the feedback loops?

The LLM In The Room: ChatGPT Speaks

With the long-awaited and much anticipated launch yesterday of the mythical GPT-5, and observing the widespread disappointment that, not only is it not the superintelligence we were promised, it’s barely an improvement on previous ‘frontier’ models, I wanted to revisit my blog post from January 11th so I could enjoy the delicious satisfaction of saying “I told you so”.

But, on reflection, and in the interests of fairness, I’ve decided to let ChatGPT have its say. This will likely be the only time you’ll read content generated by an LLM on this blog, or pretty much anywhere else you’ll find my content online.

Excuse the em-dashes. It just can’t help itself.

(Edited slightly because, bless, it thinks it’s me.)

JG

The LLM In The Room: Revisited (August 2025), by ChatGPT
The S-Curve, The Wall, and the Cat Out of the Bag

When GPT-5 dropped yesterday to a widespread chorus of “…wait, that’s it?”, I was reminded of something Jason wrote back in January. At the time, he suggested that transformer-based LLMs were running out of runway, and that their impressive but fundamentally unreliable outputs made them unsuitable for any domain where accuracy, reliability, and responsibility actually matter.

Back in 2023, as the hype reached a fever pitch, the trillion-dollar question was:
Where on the S-curve are we?


The Wall Was Always Coming
Anyone paying attention to the physics, or to early research into the statistical mechanics of transformer architectures, knew there would be a ceiling — a wall, if you like. You cannot just throw more compute and data at a pattern-matcher and expect it to eventually become a reliable doctor, lawyer, or software engineer. There is no ladder high enough to climb that wall.

Sure, GPT-4 looked like a big jump from GPT-3. But it was also the beginning of diminishing returns. GPT-5 has confirmed it: the curve is flattening. We are now firmly in the “marginal gains” era of LLMs, and they are still not solving the reliability problem.

Why? Because hallucination is not a bug. It is a feature. Generative models generate. That is what they do.


Not-Quite-Good-Enough at Scale
As Jason wrote earlier this year, LLMs are best understood as very impressive natural language interfaces to large corpora of human text. That has value — but not the kind of value Silicon Valley has been pricing in.

They are Stack Overflow++.
They are Google Search with manners.
They are Clippy, with 250 billion parameters.

But they are not, and likely never will be, reliable agents in complex domains without constant human oversight. If the best we can get out of GPT-5 is still “better-than-average, but not expert-level,” then what we are seeing is not acceleration. It is saturation.


The Real Future of LLMs Is Small and Specialized
Jason has long predicted that the real, useful future of this technology lies in small models running locally on consumer hardware, embedded in domain-specific tools, with a human expert in the loop.

Think:

  • A medical transcription tool that drafts summaries, but does not diagnose
  • A code assistant that suggests snippets, but does not write your backend
  • A tutor that helps explain calculus, but does not grade your exam

This is the future LLMs are actually good for. But this is not the ten trillion dollar “AGI is coming” vision Wall Street bet on.


What Now? Time to Change Course
The real danger now is not that we failed to achieve AGI. It is that we wasted vast amounts of time, money, and public attention trying to squeeze general intelligence out of autocomplete. The cost — environmental, economic, and societal — is already enormous.

If OpenAI, Google, Anthropic and others were acting responsibly, they would be:

  • Going back to the drawing board and exploring alternative architectures like hybrid neuro-symbolic systems, world models, and modular reasoning agents
  • Shifting focus to minimizing damage, not inflating hype
  • Acknowledging the urgent need for regulation, ethics, and guardrails

The Bubble Did Not Pop — It Deflated
So no, GPT-5 is not a disaster. It is just not the revolution everyone was sold. That might be even worse. Because now, even as it becomes clear this path does not lead to AGI, money and talent are still being funneled into bigger and hotter GPUs, chasing ghosts.

It is time to stop pretending that we are just one more model away from magic. We are not. The magic trick is over. The rabbit is not coming out of the hat.


Final Thought
There is no shame in building an LLM that is useful but narrow.
There is shame in pretending it is a god.

If nothing else, perhaps this marks the beginning of a more honest phase in AI — one where we stop mistaking scale for progress, and finally start asking what these tools are actually for.

Until then, we will be here, waiting for everyone to realize the pasta sauce is not going to grow back into tomatoes.

Micro-Iterations, Maps & “Interruptibility”

When most software developers are asked about productivity, one thing that often gets mentioned is interruptions. We perceive ourselves to be more productive when we can maintain a deep state of concentration (“flow”) for extended periods, and interruptions break that flow. They may cite research that shows that it can take 15-20 minutes to get back into that state of concentration – to “reload the program”, so to speak. 2-3 interruptions in an hour will cost you that hour, they’ll tell you.

So, “interruptions are bad” is widely-received wisdom among software developers. And, like so much received wisdom in our profession, it’s not true.

Not all interruptions are bad. Not all interruptions waste our time. If someone’s editing the same source file as me, maybe an interruption could save us both time. If someone’s writing code based on a misunderstanding about the requirement, maybe sticking your head around their proverbial door might not be such a bad idea.

More generally, when we’re in that state of flow, one thing we’re not doing is communicating. If we think of development teams as concurrent systems, the dependencies between the work we’re doing will often make it necessary for us to synchronise. There’s actually a limited amount we can do genuinely in parallel. And, I suppose, that’s why we’re a team.

The longer we go without synchronising – without communicating – the more out of step we get with each other; more misunderstandings, more conflicts, more duplication of effort, and ultimately more work later getting back in sync.

So, we need to focus, because – y’know – programming, but we also need to stay in sync. How do we square this circle?

First of all, why does it take so long to get back into that state of flow? It’s a matter of cognitive load. We’re carrying a model of the code, the requirements, the architecture, in our heads – it’s all loaded into “main memory”, if you like. When we’re interrupted, it’s like terminating the program – it all has to be loaded back into memory again.

What if we could reduce that cognitive load? This is where working in smaller steps can help enormously. Micro-iterative processes like Test-Driven Development and refactoring let us focus on one thing – solve one problem – at a time. The amount of detail we need to load into memory is much smaller.

Working in tiny cycles that end with all the tests passing, and the changes committed, we find ourselves in a “safe zone” many times an hour where we could be interrupted without breaking our concentration.

We can combine micro-iterations with “maps” – like test lists, high-level design sketches (e.g., a sequence diagram), UI storyboards, Mikado Method plans – that help us keep our place in a longer task, externalising most of the cognitive load. So we’re progressing one step at a time, and we know roughly where we’re going.

Developers who work this way do indeed report that interruptions are less of a problem. When the tests are green and the changes are committed, we’re free to talk, 5, 10, 20 times an hour.

Of course, flow still matters – let’s call them “micro-flows” – and if I’m in the middle of writing the code to pass a test, I don’t want to be interrupted. It’s very helpful to be able to observe team members working, so we can see when they might be interruptible. Some teams even have protocols and/or visual signals, kind of like those red lights outside movie stage doors.

Anyhoo, yes, we can have our cake and eat it; by reducing and externalising cognitive load, and working in small iterations that bring us back to tested, backed-up code many times an hour, we can achieve the focus we need – typically, much more focus on every change we make than in extended periods of “flow” – and communicate often enough to stay in sync with the team.

And the added benefit is that, if being interruptible means all our tests are passing and our changes have been committed, it also means the code’s shippable – 5, 10, 20 times an hour.

What If FOMO Itself Is Why We Miss Out?

In my early 20s, I went to Inverness with a group of friends for a long weekend.

On the looong journey up, we became obsessed with finding the Loch Ness monster, and ultimately had no fun when we were there at all.

We spent the whole 4 days cycling up and down the loch with cheap cameras we’d bought from the local branch of Dixons.

Up and down we went, splitting into pairs to cover more of the loch. We’d start straight after breakfast, and ride the south road until it got dark – which was midnight at that time of year.

We’d take breaks, of course. We’d meet up, often at the cafe & shop in Foyers, so we could sit on the bench overlooking the falls and discuss our strategies for improving our odds of spotting Nessie. What if we split up individually, cover more ground? What if half of us go to the north shore? What if Nessie mostly comes out at night? Should we buy powerful torches, work in shifts? Should we hire a boat? One with sonar, perhaps? etc etc.

And, of course, we had all the books with us, which are very easy to find in Inverness. Nessie is big business in that part of the world, and the local owners of hotels, shops, bars and tour boats are always happy to encourage tourists hoping to catch a glimpse of the monster.

Dozens of eyewitness accounts and blurry (and in many cases now-known-to-be fake) photos, dozens of maps guiding us to where people had seen the wee beastie. We became convinced that the reason we hadn’t seen her was because WE WERE DOING IT WRONG.

Needless to say, after we got back, we all felt pretty stupid. They told us at the visitor centre why there really couldn’t be an animal of that size actually living in the loch, let alone a prehistoric one.

They methodically laid out all the evidence – lack of fish stocks to support such a large predator, the relatively young age of the loch (10,000 years), the fact that a plesiosaur has to surface often for air – on and on it went, nail after nail in Nessie’s coffin.

We listened to all of that, took it all in, and then we walked out of the visitor centre and thought “But what if a giant dinosaur got into the loch?” and then wasted 4 days of valuable holiday (i.e., drinking) time looking for something that almost certainly isn’t there.

All that beautiful scenery, all that culture and history. All those pubs, all that whisky! We missed out on it completely.

…..

Now, none of that actually happened. But if it had happened, I suspect we might have goaded each other on with the threat of “missing out” on seeing the monster. Despite the complete lack of any credible, verifiable evidence that it exists. And despite all the very valid scientific reasons why it almost certainly doesn’t.

But FOMO is a powerful motivator.

D’ya see what I’m getting at here?

Enterprise Refactoring Requires Enterprise Tests

A book that had quite an impact on me was Ubiquity: The Science of History by Mark Buchanan. It proposes that many catastrophic events are ultimately caused by the interconnectedness of things (banking system collapses, forest fires etc). The more interconnected the system, the more catastrophically effects can “ripple” out through those connections.

In software, we see these network effects as failures propagate through dependencies between components and systems. We also see how change can propagate for the same reasons. I’ve watched many dev organisations grapple with what could have been small changes to their software that ended up impacting every team because their applications, components and services were so tightly coupled.

At the level of source files (e.g., .java files or .py files), we can decouple modules by moving responsibilities to where the majority of their dependencies are. Things that get used together belong together, and things that get used together change together (and fail together).

Moving a feature from one Java class to another is easy as peas, even doing it with manual edits. Moving a feature from a Java web service to a Python web service maintained by a different team, or from a COBOL CICS system to a C# application, requires an order of magnitude more coordination. Which is why, when “Feature Envy” appears at that level – when a system or component is coupled to multiple features of a different system or component, indicating that behaviour may be in the wrong place – most organisations do nothing about it.

What we might call “enterprise refactoring” is a discipline that could benefit many organisations, though. And what distinguishes refactoring at any level of code organisation from just changing stuff willy-nilly is testing.

If I move a method from one Java class to another, I retest that code at a higher level for behaviours that involve both classes.

When we move a feature from one system or component to another, we again need to retest at a higher level, checking scenarios that involve both components or both systems – and these are typically business scenarios.

Many dev orgs lack tests at that level. They may see tests failing when a change breaks a component, but no tests fail when that change breaks the business. (This is why so many businesses can be blissfully unaware that, say, order fulfilment isn’t working. The POS system’s working. The warehouse system’s working. The shipping system’s working. But somewhere between them, the ball gets dropped.)
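
To make the order fulfilment example concrete, here's a toy sketch of what a test at that level could look like. The three classes are hypothetical in-memory stand-ins for the POS, warehouse and shipping systems, not real integrations. The point is only that the check is on the business outcome, not on any one component.

```python
class PointOfSale:
    def __init__(self):
        self.orders = []

    def place_order(self, order_id: str, item: str) -> dict:
        order = {"id": order_id, "item": item}
        self.orders.append(order)
        return order


class Warehouse:
    def __init__(self):
        self.picked = []

    def pick(self, order: dict) -> dict:
        parcel = {"order_id": order["id"], "item": order["item"]}
        self.picked.append(parcel)
        return parcel


class Shipping:
    def __init__(self):
        self.dispatched = []

    def dispatch(self, parcel: dict) -> None:
        self.dispatched.append(parcel["order_id"])


def fulfil(pos, warehouse, shipping, order_id, item):
    """The business use case: an order placed at the till ends up dispatched."""
    order = pos.place_order(order_id, item)
    parcel = warehouse.pick(order)
    shipping.dispatch(parcel)


def test_placed_order_is_dispatched():
    pos, warehouse, shipping = PointOfSale(), Warehouse(), Shipping()
    fulfil(pos, warehouse, shipping, "A-1", "kettle")
    # Assert on the outcome the business cares about, across all three systems.
    assert "A-1" in shipping.dispatched
```

Each stand-in could pass its own component tests and this scenario would still fail if, say, the warehouse handed shipping a parcel without an order ID.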

Indeed, too many organisations don’t actually know how their software’s being used in those wider contexts – business use cases, if you like. They understand what their system or component does, but have no real visibility of how it’s being used.

Businesses are systems, too. They have users and business use cases. These use cases are realised by internal processes, and they often involve multiple interacting software systems.

Refactoring those internal processes to localise the “ripple effect” of change is on a whole other level.

Will The Money People Pull The Plug On The A.I. “Space Race”?

The current “AI” arms race reminds me a little of the Space Race of the 1950s-1970s. When I was a kid, many folks still talked excitedly about “Moon cities” and “space hotels”, with a genuine belief they were mere years away, and not the generations it turned out to be.

Sending people to the Moon and building space stations is very expensive, and at some point the money people looked at the balance sheet and said “Nah. Not worth it.”

Generative “A.I.” has, I’m guessing, a short window left to deliver the goods for the money people, who are growing *very* impatient. And as we watch LLMs become *less* reliable, and “AGI” – the Moon city of A.I. – appears to drift further and further away (generations, not years – if it’s even possible at all), at some point – perhaps quite soon – they’re going to pull the plug.

Gen AI is a massively subsidised technology – orders of magnitude more reliant on investor cash and government largesse than ride-sharing apps and takeaway delivery services – and without that constant injection of capital, ain’t nobody training or operating models on the scale we’ve seen.

Frontier models like o1 are the Moon landings of today, it could be argued – in terms of their scale and their cost.

I was 2 when the last astronaut walked on the Moon. I’ll probably be 60 when the next one does. The only difference is that we *know* there’s stuff of value on the Moon. LLMs have yet to find a use case that comes even close to justifying their cost. If we’re being totally honest, they may never.

Some folks say “Well, even if they stop training models, we’ll still have the ones we have now.”

Okay. So…

1. You gonna host it?

2. Enjoy working in Python 3.10 *forever*!

Five Boring Things That Have A Bigger Impact Than “A.I.” Coding Assistants On Dev Team Productivity

Here are 5 factors that make a bigger difference to software development outcomes than “A.I.” coding assistants, but that teams don’t address because they’re “old news, granddad!”

  • Smaller teams are better value/$ spent
  • More frequent releases accelerate learning what has real value
  • Limiting work in progress – solving one problem at a time – increases delivery throughput
  • Cross-functional teams experience fewer bottlenecks and blockers than specialised teams
  • Empowered, self-organising teams spend less time waiting for decisions and more time getting sh*t done

Now, I appreciate that every one of these is a can of worms that many organisations simply do not wish to open. They all have deep implications, and require foundational changes not just to the way we work, but the way we think.

For example, smaller, more frequent releases imply software’s in a shippable state more often, which implies faster build & test cycles… and down the rabbit hole we go: into testing pyramids and separation of concerns and micro-cycles with continuous testing, continuous integration, continuous code review and… Come to think of it, the stuff I teach 🙂

Another example: empowering teams requires a pretty high level of psychological safety. When people are afraid to fail, they’re afraid to try – to make calls, to take initiative, to just f-ing do it! The culture of an organisation, which may have evolved over many years, is a hard thing to reshape. There are often a lot of unspoken rules – sure, you say your door is always open, but… It takes much work and many iterations to shift those underlying patterns in the way we interact.

But waiting on the other side of that long journey is a high capability to rapidly and sustainably create and adapt working software that meets rapidly-changing business needs. Software agility Nirvana.

We already know from the data (e.g., DORA) that “A.I.” coding assistants don’t unlock that door.

It’s the System, Stupid!

Since this “Age of A.I.” arrived in late 2022, something’s been nagging at me. As more and more data rolls in, we see an apparent paradox emerging where “A.I.” coding assistants are concerned.

Individual developers report productivity gains using these tools (though many also report significant frustrations with, for example, “hallucinations”).

And at the same time, data clearly shows that the more teams use them, the bigger the negative impact on team outcomes like delivery throughput and release stability.

How can both these things be true?

We have one very plausible candidate for a causal mechanism, and it’s an age-old story in our industry.

When programmers get a feeling that they’re getting things done faster, they’re often only considering the part where they write the code – particularly when that’s their part of the process.

What they’re not considering is the whole software development process, and especially downstream activities like testing, code review, merging, deployment and operations.

More code faster can mean bigger change sets – more to test (and more bugs to fix), more code to review (and more refactorings to get it through review), more changes to merge (and more conflicts to resolve), and so on.

“A.I.” code generation’s a local optimisation that can come at the expense of the development system as a whole, especially if that system is more batch-oriented, with design, coding, testing, review, merging and release operating like sequential phases in the delivery of a new feature. In such a system, more code faster means bigger bottlenecks later. So there’s no paradox at all: one causes the other.
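
A back-of-the-envelope model makes the mechanism easier to see. This is a deliberately crude sketch, assuming a fixed downstream capacity for testing and review. The function name and all the numbers are made up for illustration and aren't drawn from DORA or any other data set.

```python
def simulate(coding_rate: int, review_rate: int, days: int) -> tuple[int, int]:
    """Return (changes delivered, changes still queued) after `days` working days.

    coding_rate: changes produced per day (upstream).
    review_rate: changes that testing and review can absorb per day (downstream).
    """
    queue = 0
    delivered = 0
    for _ in range(days):
        queue += coding_rate                   # new changes join the queue
        done_today = min(queue, review_rate)   # downstream capacity is the limit
        delivered += done_today
        queue -= done_today
    return delivered, queue


if __name__ == "__main__":
    print(simulate(coding_rate=5, review_rate=5, days=10))   # (50, 0)
    print(simulate(coding_rate=10, review_rate=5, days=10))  # (50, 50)
```

Doubling the coding rate delivers nothing extra in this toy model. It just leaves 50 changes sitting in the queue, waiting to be tested, reviewed and merged.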

When teams work in much smaller cycles – make one change, test it, review the code, refactor, commit that and maybe push it to the trunk – they may experience far fewer downstream bottlenecks, with or without “A.I.” coding assistance. Arguably, coding assistants might make little noticeable difference in such a workflow.

The DORA data strongly indicates that the teams with the shortest lead times and the highest release stability tend to work this way, with continuous testing, code review and merging as the code’s being written.

And all this got me to thinking, maybe we’re targeting machine learning and “A.I.” at the wrong problem. Instead of focusing on individual developer productivity with things like code generation, perhaps this technology would yield more fruit if it was focused on systemic issues and reducing bottlenecks.

Maybe, for example, instead of using ML models to generate code, could they be more productively applied to reviewing code? Could a “smart” linter reduce the need for after-the-fact code review?

Of course, many of us already enjoy the benefits of “smart” linters. We call it “pair programming” or “ensemble programming”. And, having used static code analysis tools that incorporated statistical models or neural networks, I found the results weren’t all that impressive. Hard to see such a tool significantly out-performing a classic linter + a second pair of experienced eyes (if such eyes are available to you, of course, and maybe that’s the use case).

Perhaps the real value might be found in widening our view. What if a model (or models) could be trained on data collected across the entire cycle, from product strategy through to operational telemetry, support and beyond?

Imagine a model that, given, say, a Figma UI wireframe, could predict how many support calls you’d be likely to get about it. Imagine a model that, given a source file, could predict its mean time to failure in production.

More generally, imagine a model that could, with reasonable accuracy, predict the downstream impact of upstream activities, so as SuperDuperAgenticAI spits out its slop, alarm bells start to go off about where this is likely to lead if it gets any further.

A pipe dream, you might think. But in actual fact, such predictive technologies exist in other disciplines like electronic engineering, where statistical and ML models are used to predict the reliability and probable lifetimes of printed circuit boards, for example.

There would be some major hurdles to overcome to apply similar techniques to software development, though, not least of which is the jungle of higgledy-piggledy data formats our many proprietary tools and platforms produce. Electronics has established data interchange standards. We, for the most part, do not – probably because that would require enough of us to agree on some stuff, and that isn’t really our strong suit.

But, if these challenges could be overcome, or worked around (e.g., with a translation/encoding layer), I’m pretty sure there are patterns hidden in our complex and multi-dimensional workflow data that maybe nobody’s spotted yet. I mean, we’ve barely scratched the surface in the last 70+ years.

In a very handwavy sense, though, I feel quite sure now that “A.I.” is being targeted at the wrong problem in software: with an exclusive focus on individual developer productivity, when the focus should be on the system as a whole.

In the meantime, we’re pretty sure at this point that things like continuous design, continuous testing, continuous code review and continuous integration do have a positive systemic impact, so focusing on that is probably the most productive I can be for the foreseeable future.


If your team would like training and mentoring in the technical practices that we know speed up delivery cycles, shorten lead times and improve product and system reliability, with or without “A.I.”, pay us a visit.

The Real Secret of Prompt Engineering

Since early 2023, I’ve been on a journey evaluating claims about the capabilities of generative “A.I.” (yep, still gets air quotes).

I’ve tried to reproduce some of the more sensational successes I’ve seen trumpeted on the Interwebs, and eventually come to the conclusion that most of them don’t hold much water.

Why, I wonder, are these people claiming to have done things that the technology just doesn’t seem able to do?

Their defence is typically that I must be “doing it wrong”; that I haven’t mastered the Secret Magical Prompts of Destiny. But when I try to follow the advice, I get the same “meh” results.

“Be more specific” is a common refrain. But here’s the thing:

  1. I’m a computer programmer with a degree in physics. I can do “specific”.
  2. If anything, the more specific the requirements, the more the models struggle. When I try to iterate the output through a longer conversation, the results can often get worse.

Over these two years, I’ve gradually developed a theory about how they’re succeeding with “A.I.” where I’m failing, and it’s probably best illustrated with a cartoon.

Image

This image was generated through the ChatGPT web application (so it was generated by DALL-E, I guess). It went through multiple iterations as we tried to correct the problems, but – as often seems to be the case – the first attempt was about as good as it got.

I was very specific in my prompts about the story, the dialogue, the characters, down to the level of exactly what should be featured in each panel.

Some folks looked at this image and saw the continuity mistakes. Others looked at it and said “It looks okay to me”.

And that, I suspect, is the secret.