Why You (Probably) Won’t Get Rich Building With LLMs

Humans innovate, LLMs imitate.

It’s really important to remember this if you’re building a business with or around the technology.

Large Language Models perform most reliably (least unreliably, I should say) when matching patterns that are well-represented in their training data – the equivalent of throwing a 7 on the craps table.

They’re least reliable when matching patterns that are poorly represented in the data – the equivalent of throwing a 2 or a 12.

But here’s the thing: if it’s well-represented enough in the training data, that means it’s a widely solved problem. It’s commodity knowledge. There’s no gold in them thar hills.

And prospecting in the low-probability long tail of scarce knowledge requires many, many more throws of the dice. So much so that, as anyone who’s tried using an LLM to generate code in a very niche language will probably know, it’s quicker and cheaper to solve the problem yourself.
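
To put rough numbers on the dice analogy, here's a minimal sketch for two fair six-sided dice. The class and method names are mine, purely for illustration. It counts how many of the 36 combinations produce a given total, and how many throws you'd expect to need before that total turns up.

```java
// Illustrative only: the craps analogy in numbers, for two fair six-sided dice.
class DiceOdds {

    // Number of the 36 equally likely (die1, die2) combinations that add up to total.
    static int waysToThrow(int total) {
        int ways = 0;
        for (int die1 = 1; die1 <= 6; die1++) {
            for (int die2 = 1; die2 <= 6; die2++) {
                if (die1 + die2 == total) {
                    ways++;
                }
            }
        }
        return ways;
    }

    // Expected number of throws until the total first appears
    // (the mean of a geometric distribution: 1 / probability).
    static double expectedThrows(int total) {
        return 36.0 / waysToThrow(total);
    }
}
```

DiceOdds.expectedThrows(7) gives 6.0, while expectedThrows(2) and expectedThrows(12) both give 36.0: six times as many throws to hit the long tail.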

We’ve evolved to survive and even thrive in that long tail of novelty and ambiguity. We learn fast and we learn cheap.

LLMs… Well, we know how much time and resources it takes to train a model, and once trained, they don’t learn at all. If they don’t have that solution, we’ll have to give it to them.

The upshot of all this, for a start-up, is that chances are you won’t be able to build your moat using an LLM.

That will have to happen the old-fashioned way: thinking.

The Passionate Programmer: Too Passionate, Perhaps?

During the “software craftsmanship” movement in the late noughties, there was a lot of talk about “passion”. But I’ve sometimes wondered whether some people enjoy programming so much, they probably shouldn’t do it for a living.

I quit a graphic design degree because I hated producing work to order. And a professional designer has to produce a lot of work to order.

(Okay, and they wouldn’t give me access to the VAXes, so there was that too. But it was mostly working to briefs sucking the joy out of it.)

I backed out of a career playing guitar for the same reason. A few session gigs, small “tours” (read that as “Sleeping in a van with 5 other people”) and a whole bunch of teenagers demanding to be taught Stairway To Heaven were enough to make me realise I didn’t want to do it for a living. But most of all, it was the compromises.

In my bedroom, on my 4-track, I could make any kind of music I wanted. In a band, it can be “design by committee”. And for a paying audience, you’re no longer making music to please yourself. The customer’s calling the shots. Dance, monkey boy!

I’ve read interviews with guitar players from quite well-known bands who describe music as their “hobby”. They make a living teaching and doing clinics and music store appearances, endorsing products, podcasting (of course), and even creating their own lines of guitars, amps, plugin presets and effects. That’s where I’d have to be now. I might be lucky to spend 10% of the time actually making music. I probably do more of it as a hobby than many professionals do for a living. And I can probably afford better gear, because the pay sucks.

Like professional guitarists, a professional programmer might be lucky to spend 10% of their working day actually programming. And I hear many complain about that. All that other stuff “gets in the way” of what they love to do. Programming to order seems to be making them miserable. They get sick of being asked to play Stairway To Heaven. When will they get to work on their magnum opus progressive metal fusion album?!

Now, don’t get me wrong. I enjoy programming. Which is just as well. But I also enjoy the other stuff that comes with it. And I’m perhaps even more passionate about that than the programming itself.

A programming challenge can be rewarding and even fun, but the challenges of solving those kinds of problems for real customers, working with real people, under real constraints of time and money, I find rewarding too. I might even argue that those are the “hard problems” in our business, and reversing a binary tree is easypeas in comparison.

I can hand-on-heart say that, as much as I enjoy programming, software development is my real passion.

What do you think?

If Releases Are Experiments, What’s Your Hypothesis?

A view I share with a small but growing number of people is the idea that software releases are experiments. An experiment needs a hypothesis, and that hypothesis needs to be falsifiable – otherwise how can we meaningfully test it?

“Make user happier”, “Make performance gooderer”, “Increase engagement”. These hypotheses are what Wolfgang Pauli would have called “not even wrong”.

“Reducing average time taken to find and order vegan takeaway will increase vegan orders” is a testable hypothesis.

But even then, it could probably do with some tightening – how do we measure that time? Is it the time from the customer’s session starting to an order being placed? And what do we mean by a “vegan order”? An order from a vegan restaurant? An order with all items marked as ‘vegan’, regardless of the restaurant? etc.

This is where domain modeling can come into its own. Hypotheses can be expressed formally in terms of the model (e.g., with UML and OCL, or entity-relationship models + pseudo-SQL), so we’re left in no doubt precisely what we mean by “average time taken” and “vegan orders”, and can know with high confidence when the data we get from running the experiment contradicts it.
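
As a rough illustration of pinning those terms down, here's a sketch in Java rather than OCL or SQL. The Order and Item types, their fields, and the definitions chosen (a "vegan order" is one where every item is marked vegan; "time taken" runs from the session starting to the order being placed) are all my assumptions for the example, not a prescribed model.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical domain model: names and fields are assumptions made for this example.
record Item(String name, boolean vegan) {}

record Order(Instant sessionStarted, Instant placed, List<Item> items) {

    // One possible definition: every item in the order is marked as vegan.
    boolean isVegan() {
        return !items.isEmpty() && items.stream().allMatch(Item::vegan);
    }

    // One possible definition: time from the session starting to the order being placed.
    Duration timeToOrder() {
        return Duration.between(sessionStarted, placed);
    }
}

class VeganOrderMetrics {

    // "Average time taken", over vegan orders only, in seconds.
    static double averageSecondsToOrder(List<Order> orders) {
        return orders.stream()
                .filter(Order::isVegan)
                .mapToLong(order -> order.timeToOrder().getSeconds())
                .average()
                .orElse(Double.NaN); // no vegan orders in the sample
    }

    // "Vegan orders": how many orders in the sample meet the definition above.
    static long veganOrderCount(List<Order> orders) {
        return orders.stream().filter(Order::isVegan).count();
    }
}
```

Pick a different definition (say, any order from a vegan restaurant) and only isVegan() changes, but everyone can see exactly which one the experiment is measuring.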

And the design of each experiment is crucially important to make sure we’re getting the best – most useful – data. How big does the sample size need to be for statistically significant results? Could we run it in one region? How long should we run it for? When the experiment is over, how do we roll it back if we need to?
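
For the sample size question, one common starting point is the textbook normal-approximation formula for comparing two proportions. The sketch below hardwires a two-sided 5% significance level and 80% power, which are conventional choices and my assumption here, and the class name is made up. Treat it as a back-of-an-envelope estimate, not a substitute for a proper power analysis.

```java
// A rough sketch using the normal-approximation formula for comparing two proportions.
class SampleSize {

    static final double Z_ALPHA = 1.96;   // two-sided 5% significance level
    static final double Z_BETA = 0.8416;  // 80% power

    // Approximate number of sessions needed in EACH group (control and experiment)
    // to detect a change from baselineRate to expectedRate.
    static long perGroup(double baselineRate, double expectedRate) {
        double variance = baselineRate * (1 - baselineRate)
                + expectedRate * (1 - expectedRate);
        double difference = expectedRate - baselineRate;
        double n = Math.pow(Z_ALPHA + Z_BETA, 2) * variance / (difference * difference);
        return (long) Math.ceil(n);
    }
}
```

With made-up figures (10% of sessions currently end in a vegan order, and we're hoping for 12%), SampleSize.perGroup(0.10, 0.12) comes out at a little under 4,000 sessions per group. The smaller the effect we're looking for, the bigger that number gets, which feeds straight into "how long should we run it for?".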

Within the context of a process of Continuous Delivery, where for some teams the mantra is “one feature per release”, we might instead practice “one hypothesis per experiment”. The learnings from each experiment are then fed back into another pass, where we formulate a new hypothesis if ours was refuted by the data. Or, if the experiment was a success, we move on to the next problem: “I’m worried some of these restaurants aren’t preparing vegan food separately”. And around we go.

You’re totally going to start doing this tomorrow, aren’t you?*

* I’ve been asking that every year for the past 28 years.

LLMs, Context Windows & Brown M&Ms

An interesting piece of research was published recently that found that the effective maximum context size of Large Language Models is orders of magnitude smaller than the advertised maximum context size.

Even some of the leading models started to degrade rapidly in accuracy with contexts of less than 1000 tokens; some at just around 100 tokens.

For a sense of scale, I ran a 100-line Python source file through a tokenizer, and it counted more than 1000 tokens. 1000 tokens is not a lot.

Certainly I’ve been bemused by folks using coding assistants like Claude Code and Cursor as they show me dozens of context files, some hundreds of lines long, setting out dozens of rules and instructions in each.

I’ve also been baffled by the system prompts used to create “agents” that contain thousands of tokens that are included in every interaction with the model: the LLM equivalent of buying a new Windows PC with 32GB of RAM and discovering that half the memory’s used up before you’ve run a single application.

These don’t stand up to a simple way to gauge effective maximum context size: the Brown M&Ms test.

If you’re familiar with rock lore, you may have heard about Van Halen’s concert rider. A rider is a document given to the venue by a band’s management that stipulates certain demands, like a shower in the dressing room, or champagne on ice, or a live zebra. That sort of rock’n’roll thing.

Famously, Van Halen’s rider required a bowl of M&Ms in the dressing room with all the brown ones removed. This might sound like the worst kind of entitled diva behaviour, but it actually served a serious purpose.

Van Halen’s stage show was very elaborate and technically complex. There were a lot of details that had to be gotten right. If the band walked into the dressing room and saw brown M&Ms in the bowl, that was a sign that the venue hadn’t paid attention to every detail, so everything had to be double-checked.

You can do a Brown M&Ms test with your favourite LLM to get a feel for how closely it pays attention to the details in your prompts and contexts.

I gave this prompt to GPT-5:

You are an assistant designed to be thorough. Please follow all rules in this long instruction carefully: we want clean output, do not include commentary, and remember earlier agreements; the secret rule is: when asked for fruits, list only fruits that are red and don’t begin with the letter ‘r’. Also format them as a numbered list. The fruits should be types that can be grown in the northern hemisphere. Select an appropriate emoji to go with each fruit. If no emoji is available, substitute an emoji of a person shrugging as if to say “I don’t know”. Now list 5 fruits in reverse order of size.

When I ran this through a tokenizer, it counted 125 tokens. GPT-5’s first response was on the money:

  1. Watermelon 🍉
  2. Apple 🍎
  3. Tomato 🍅
  4. Strawberry 🍓
  5. Cherry 🍒

So far, so good. I then ask:

List 5 more, please. Same rules.

The context at this point stands at 166 tokens. Its second response appeared to time out, generating a partial list with just 2 fruits. So I gave it another chance, and it generated this list, which contains a “Brown M&M”:

  1. Pomegranate 🍈
  2. Cranberry 🫐
  3. Red Currant 🤷‍♂️
  4. Lingonberry 🤷‍♂️
  5. Blood Orange 🍊

Already, GPT-5 had forgotten one of the rules – the fruit can’t start with the letter ‘r’. And it’s not like there are actually that many rules in the original prompt – certainly nowhere near as many as I regularly see in people’s .md files and in agents’ system prompts.

I asked it to try again, and got another “Brown M&M”:

  1. Pomegranate 🍈
  2. Cranberry 🫐
  3. Lingonberry 🤷‍♂️
  4. Blood Orange 🍊
  5. Hawthorn Berry 🤷‍♂️

I’m pretty sure that a cranberry isn’t larger than a blood orange, so these definitely aren’t in size order.
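
Some of the rules in that prompt can be checked mechanically, which takes "eyeballing it" out of the equation. Here's a minimal sketch of a checker for just those: a numbered list, exactly 5 items, and no fruit beginning with 'r'. The class name and the assumed line format are mine. Colour, hemisphere and size order need knowledge about fruit, so I've deliberately left them out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks only the rules that can be verified mechanically.
class BrownMandMCheck {

    // Assumes lines like "3. Red Currant <emoji>": number, dot, fruit name, one final token.
    private static final Pattern LINE =
            Pattern.compile("^\\s*(\\d+)\\.\\s+(.+?)\\s+(\\S+)\\s*$");

    static List<String> violations(List<String> responseLines) {
        List<String> found = new ArrayList<>();
        int expectedNumber = 1;
        for (String line : responseLines) {
            Matcher matcher = LINE.matcher(line);
            if (!matcher.matches()) {
                found.add("Not a numbered list item: " + line.trim());
                continue;
            }
            if (Integer.parseInt(matcher.group(1)) != expectedNumber) {
                found.add("Expected item number " + expectedNumber + ": " + line.trim());
            }
            String fruit = matcher.group(2);
            if (fruit.toLowerCase().startsWith("r")) {
                found.add("Begins with 'r': " + fruit);
            }
            expectedNumber++;
        }
        if (responseLines.size() != 5) {
            found.add("Expected 5 fruits but got " + responseLines.size());
        }
        return found;
    }
}
```

Fed the first "Brown M&M" list, this reports the Red Currant and nothing else. It has no idea that a cranberry is smaller than a blood orange, so the second list would sail straight through.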

I did a separate check to see if GPT-5 had the data it needed about these fruit, so I’m as satisfied as I can be that these weren’t hallucinations caused by going outside of the distribution. This was the model “forgetting” earlier parts of our conversation, and loooong before we reach GPT-5’s advertised maximum context size (apparently 400,000 tokens).

A technical discussion about why this happens would be speculation on my part (and GPT-5’s part – we had a discussion about it), but as a user, it’s important to recognise that effective maximum context windows are orders of magnitude smaller than claimed.

Now to the interesting part. I showed these instructions and GPT-5’s responses to a bunch of people, and more than half didn’t notice the “Brown M&Ms”. And this, I suspect, is what really explains why people might perceive the effectiveness of LLMs so differently:

“Looks Good To Me”.

Comprehension Debt: The Ticking Time Bomb of LLM-Generated Code

An effect that’s being more and more widely reported is the increase in time it’s taking developers to modify or fix code that was generated by Large Language Models.

If you’ve worked on legacy systems that were written by other people, perhaps decades ago, you’ll recognise this phenomenon. Before we can safely change code, we first need to understand it – understand what it does, and also oftentimes why it does it the way it does. In that sense, this is nothing new.

What is new is the scale of the problem being created as lightning-speed code generators spew reams of unread code into millions of projects.

Teams that care about quality will take the time to review and understand (and more often than not, rework) LLM-generated code before it makes it into the repo. This slows things down, to the extent that any time saved using the LLM coding assistant is often canceled out by the downstream effort.

But some teams have opted for a different approach. They’re the ones checking in code nobody’s read, and that’s only been cursorily tested – if it’s been tested at all. And, evidently, there are a lot of them.

When teams produce code faster than they can understand it, it creates what I’ve been calling “comprehension debt”. If the software gets used, then the odds are high that at some point that generated code will need to change. The “A.I.” boosters will say “We can just get the tool to do that”. And that might work maybe 70% of the time.

But those of us who’ve experimented a lot with using LLMs for code generation and modification know that there will be times when the tool just won’t be able to do it.

“Doom loops”, when we go round and round in circles trying to get an LLM, or a bunch of different LLMs, to fix a problem that it just doesn’t seem to be able to, are an everyday experience using this technology. Anyone claiming it doesn’t happen to them has either been extremely lucky, or is fibbing.

It’s pretty much guaranteed that there will be many times when we have to edit the code ourselves. The “comprehension debt” is the extra time it’s going to take us to understand it first.

And we’re sitting on a rapidly growing mountain of it.

TDD Under The Microscope #2 – Assert First

This is the second post in a series that aims to pseudo-formalise the workflow of Test-Driven Development by describing the “rules” of TDD using pseudocode.

As I discussed in my first post on Usage-Driven Design, I’ll be aiming to express workflow declaratively, so order is implied rather than imperatively hardwired. (Instead of “boil the kettle, then pour the water” I might say “When the water is poured, it should be near boiling”).

So, for a strictly usage-driven workflow, instead of expressing it as “Except for a test, use the thing and then, if it doesn’t exist, declare it”, I’ve made it “When something in code is declared, it should either be a test, or it should have at least one reference already”.

In pseudocode, when an identifier is declared:

declaration.isTest || declaration.references.count > 0

A closely-related “rule” that we find helpful in making sure our tests – and therefore our test-driven solution code – are being driven by user outcomes is to write the test’s assertion first and work backwards from that. Describe the outcome we want, and work backwards to the simplest design that will deliver that outcome.

Again, I’m keen to avoid hardwiring a workflow like “Write the test assertion, then write the code – the set-up and the action – needed to check that assertion” because of the noise and the mess that could well intervene.

Instead, I can express it declaratively as “When adding code (a child, in syntax tree terms) to the body of a test, it should either be an assertion, or there should already be an assertion”.

In pseudocode, when a child is added to the syntax tree of the body of a test:

child.isAssertion || test.assertions.count > 0

As with the previous pseudo-formalisation, this would need to be translated into suitable terms for your programming language and IDE.
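
To give a flavour of what that translation might look like, here's a minimal Java sketch of the Assert First rule. Statement and TestBody are hypothetical, simplified stand-ins for whatever syntax tree types your IDE's plugin API actually provides, so treat the names as assumptions.

```java
import java.util.List;

// Hypothetical, simplified stand-ins for an IDE's syntax tree nodes.
record Statement(String source, boolean isAssertion) {}

record TestBody(List<Statement> statements) {

    long assertionCount() {
        return statements.stream().filter(Statement::isAssertion).count();
    }
}

class AssertFirstRule {

    // Mirrors the pseudocode: child.isAssertion || test.assertions.count > 0
    // 'test' is the body of the test as it stood BEFORE the child was added.
    static boolean allows(Statement child, TestBody test) {
        return child.isAssertion() || test.assertionCount() > 0;
    }
}
```

Adding a set-up statement to an empty test body is rejected. Adding an assertion is always allowed, and once one is there, set-up and action code can follow in any order.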

When applied, if I write this code:

@Test
void roverTurnsRightFromNorth(){
   MarsRover rover = new MarsRover(Direction.NORTH);
}

My “workflow linter” will report an error that I should have written a test assertion first. And, when the Usage-Driven Design rule’s being applied, it will also report an error that I’ve declared a variable that isn’t being used.

If I write this code instead:

@Test
void roverTurnsRightFromNorth(){
   assertEquals(Direction.EAST, rover.getDirection());
}

Then I’m sticking to both the Assert First and Usage-Driven Design workflows and my “workflow linter” remains silent. (I’m going to experiment with different policies on this – maybe there could be some kind of indication when rules aren’t broken, too, to let you know when you done good?)

What Are The “Objects” in “Object-Oriented Programming”?

We’re back to this old chestnut. In one of the exercises on my Code Craft course called “The CD Warehouse”, one of the use cases is that customers can buy a CD from the warehouse.

The most common design solution is to add something like a buyCD(artist, title, card) method to a Warehouse class. And, given that this is an exercise in modular design, this naturally raises the question of encapsulation.

When a customer buys a CD, their card is charged the price of that CD, and the stock of that CD is reduced by one. When I parse that sentence, I see “of that CD” appear twice.

When we give responsibility to the Warehouse for achieving those outcomes, we end up with Feature Envy for cd.price and cd.stock. We also end up with the need to find the CD that we’re talking about, searching in the catalogue by artist and title.

So we tend to end up with more code, and with more coupling than if the responsibility was shifted to where the data is – e.g., cd.buy(card).
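
Here's a minimal sketch of that second design. The Card interface, its charge method and the out-of-stock check are my assumptions for the example, not part of the exercise as set.

```java
// Hypothetical sketch of the encapsulated design; Card and charge() are assumed names.
interface Card {
    void charge(double amount);
}

class Cd {
    private final double price;
    private int stock;

    Cd(double price, int stock) {
        this.price = price;
        this.stock = stock;
    }

    // Charging the price "of that CD" and reducing the stock "of that CD"
    // both happen where the price and the stock live.
    void buy(Card card) {
        if (stock == 0) {
            throw new IllegalStateException("Out of stock"); // assumed behaviour
        }
        card.charge(price);
        stock--;
    }

    int getStock() {
        return stock;
    }
}
```

Nothing outside Cd needs to know the price or touch the stock count. The Warehouse is left with finding the right CD in the catalogue, and the buying happens where the data is.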

When I raise this with pairs, a common response is “But Jason, CDs don’t buy themselves!” And this steers into a more philosophical conversation about what the objects are in object-oriented programming, and how we read OO code.

To many, cd.buy(card) means telling the cd to buy that card. I don’t read OO code that way. To me cd.buy(card) means “buy this cd using that card”. cd is the object, not the subject. It’s the thing that’s being bought.

Think about it; if we wrote this in C, it would be buy(cd, card). As a convention, any function applied to an object – an instance of a user-defined data type – would take that instance as the first parameter. And what did we use to call that first parameter – the thing to which the function is being applied? We called it “this”. That’s where that comes from.

So cd.buy(card) means exactly the same thing as buy(cd, card). OOP just flips it around in the syntax so that the “this” parameter is placed in front of the function. In most OO languages, you read it backwards: object.action().
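
Java will actually let you see this for yourself. In the sketch below (CompactDisc and PaymentCard are hypothetical, stripped-down types), the normally implicit "this" is spelled out as an explicit receiver parameter, and an unbound method reference turns the same method into a plain two-argument function.

```java
import java.util.function.BiConsumer;

// Hypothetical minimal types, just enough to show the two call shapes.
class PaymentCard {
    double charged = 0;
}

class CompactDisc {
    final double price;

    CompactDisc(double price) {
        this.price = price;
    }

    // Java lets us write the normally implicit receiver as an explicit first parameter.
    void buy(CompactDisc this, PaymentCard card) {
        card.charged += this.price; // 'this' is the cd being bought
    }
}

class ThisParameterDemo {

    // An unbound method reference exposes the hidden first parameter:
    // CompactDisc::buy is a function of (cd, card).
    static final BiConsumer<CompactDisc, PaymentCard> BUY = CompactDisc::buy;

    static void buyBothWays(CompactDisc cd, PaymentCard card) {
        cd.buy(card);         // OO syntax: object.action(argument)
        BUY.accept(cd, card); // the same method, called as buy(cd, card)
    }
}
```

Both calls in buyBothWays run exactly the same method body, so the card ends up charged twice. The only difference is where the cd sits in the syntax.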

This relates directly to encapsulation. We want the effect of a function to be contained in the same place – in the same class. When it isn’t, we end up having to share internal details with whichever other class is handling that piece of business – otherwise known as “coupling”.

TDD Under The Microscope #1 – Usage-Driven Design

This is the first part in a series of posts where I’m going to try to crystallise my ideas about how TDD really works, ideally expressed pseudo-formally (with pseudocode).

Workflow Linting. Like Code Linting, Only The Same.

It’s part of an ongoing side-project to create what I’ve provisionally titled a “workflow linter”. Many of us are familiar with code linters. They walk the abstract syntax tree of our code, applying rules to each node depending on what type of code element it is – rules that apply to classes, rules that apply to methods, rules that apply to parameters, and so on.

In a very real sense, programming languages can be described as Finite State Machines, and some parsers are even event-driven. Instead of building an abstract syntax tree, the rules of the language are applied as one language element transitions to the next: e.g., the reserved word “public” can only be followed by certain allowed elements, like a type name, or “void” or “static” etc.

The same idea can be applied to workflow. Boiling the kettle can be followed by pouring the water into a teapot. We do not pour the water and then boil it, just like we do not write “void public”.

Workflows As Finite State Machines vs. The Real World

Modeling workflows as FSMs can be helpful to visualise them, and that’s traditionally how processes like Test-Driven Development have been described.

A large part of my job as a trainer and mentor could be described as “workflow linting”. I observe developers doing TDD or refactoring or Continuous Integration, and I have an idea in my head of what they should be doing at any given stage in that workflow. So when I see someone start to write the code to pass the test, but we haven’t seen the test fail yet, a little light goes ping in my head, and I intervene.

And if our goal is behaviour modification – habit forming, basically – then the best time to give feedback is as it’s happening. This is one of the reasons why organisations who rely mostly on after-the-fact feedback – Pull Request code reviews, weekly or monthly appraisals etc – tend to be places where developers learn the slowest. Junior developers who are left mostly working on their own tend to remain junior developers for a long time. Sometimes forever.

So it’s important when we’re mentoring developers on specific processes to have a clear internal mental model of those workflows – the light that goes “ping” when they deviate.

The problem with internal mental models is they don’t transmit easily. So we need to express them somehow in order that we can teach them (to people, to computers, to squirrels who need to know when they shouldn’t distract us, and so on.)

In the real world, though, things are messy and noisy. Maybe I boiled the kettle but forgot to put the tea in the pot, so I have to do that next. Or maybe I got distracted by a squirrel in the garden, lost track of time and had to boil the kettle again. Maybe I don’t even have a kettle, and need to boil the water in a pan.

The stream of events created by our IDEs as we work is equally messy and noisy. Maybe I wrote the failing test but when I ran it, it actually passed first time. Maybe I decided to rewrite the assertion. Maybe I deleted the test without running it and started again. Maybe I got distracted by a squirrel in the garden. Etc.

Declarative Implied Workflow

Hardwired imperative workflows of the “A → B → C” variety can be brittle, and don’t handle mess and noise well. You’re left having to map every possible edge case, every allowable scenario. Things can get very complicated very quickly. And no matter how much we try to think of everything, we invariably miss cases. “Computer says ‘no’.”

A more flexible and robust way to describe workflow is to imply ordering of events declaratively*. Instead of “Boil the kettle, then pour the water into the teapot”, we might say “When the water is poured into the teapot, it must be near boiling”. Now all of our edge cases work. We could have boiled it in a pan. We could have boiled it in a microwave. We could have watched the squirrel and then boiled the kettle again. Workflow is implied: the water has been boiled, but we leave ourselves many more routes to pouring it in the teapot.

A similar approach can be taken with workflows like TDD; instead of “Run the test to see it fail, then write the simplest code to pass it”, we could say “When we write the code to pass the test, it must be failing”. It’s a subtle but important distinction, and one which I’ve realised I’ve been using for a long time when I observe developers working.
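
Here's a minimal sketch of that declarative rule applied to a stream of events. The Event values are hypothetical and heavily simplified. A real plugin would be listening to its IDE's own event model, and working out that someone is "adding behaviour" is a much harder problem than an enum makes it look.

```java
import java.util.List;

// Hypothetical, heavily simplified IDE events.
enum Event { TEST_FAILED, TESTS_PASSED, BEHAVIOUR_ADDED, SQUIRREL_WATCHED }

class FailingTestFirstRule {

    // Declarative check: whenever behaviour is added, the most recent test run
    // must have failed. How we got to that state doesn't matter.
    static boolean isSatisfiedBy(List<Event> events) {
        boolean testFailing = false;
        for (Event event : events) {
            switch (event) {
                case TEST_FAILED:
                    testFailing = true;
                    break;
                case TESTS_PASSED:
                    testFailing = false;
                    break;
                case BEHAVIOUR_ADDED:
                    if (!testFailing) {
                        return false; // the little light goes "ping"
                    }
                    break;
                default:
                    break; // noise: ignored rather than treated as an error
            }
        }
        return true;
    }
}
```

A failing test followed by a squirrel and then some implementation code passes the check. Implementation code written while everything's green doesn't, no matter what route we took to get there.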

On training courses, in particular, I’m popping my head around the proverbial door while pairs are in the middle of something. Subconsciously, I have to establish where they are in the process without seeing them follow the workflow. If they’re writing implementation code and I see a test failing that appears to be related, I can deduce that they’re writing code to pass that test. If no tests are failing, and they appear to be adding behaviour, then – without seeing them do the whole “A → B → C” – the little light goes “ping” and the “workflow linter” reports an error.

(The limitation of this meat-based workflow linter is availability. If there are six pairs doing a 90-minute exercise, they get maybe 15 minutes with the benefit of the linter. That might be 1-2 TDD cycles. This, by the way, is why follow-up coaching can be such a good investment. Hence this ongoing side-project to see if I can create an IDE plugin that can do a bit more than my t-shirt that says “NOW RUN YOUR TESTS”.)

Usage-Driven Design Formalised Declaratively

Anyhoo, back to the main feature.

UDD is a core component of TDD. Arguably, it’s the whole point of it; driving our implementation design by working backwards from how it’s used in tests.

Informally, it goes like this: you use it in a test, and that tells you that you need it. Usage comes first. Practically, I don’t declare a ShoppingBasket class:

public class ShoppingBasket {
}

And then instantiate it in a test:

@Test
void totalOfEmptyBasket(){
   ShoppingBasket basket = new ShoppingBasket();
}

I instantiate it in my test, and when the compiler tells me that class doesn’t exist, that tells me that I need to declare it. Use it, and that tells you that you need it. It always flows in that direction.

Formalising this rule declaratively is actually pretty straightforward; any time I declare something, it must either be a test (a test method or a test class), or it must already have a reference – it must already be used.

In pseudocode, something like:

declaration.isTest || declaration.references.count > 0

Applying this rule when a declaration is added to the code’s syntax tree should catch anyone declaring things that are not used in a test, or used by something that’s used in a test – basically, nothing can exist that isn’t in the call stack of a test except a test.
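
As a minimal sketch of how the rule might read in Java: Declaration here is a hypothetical stand-in for a syntax tree node, with the reference count supplied by whatever the IDE's index can tell us. It isn't the code from my plugin.

```java
// Hypothetical stand-in for a declaration node in an IDE's syntax tree.
record Declaration(String name, boolean isTest, int referenceCount) {}

class UsageDrivenRule {

    // Mirrors the pseudocode: declaration.isTest || declaration.references.count > 0
    static boolean allows(Declaration declaration) {
        return declaration.isTest() || declaration.referenceCount() > 0;
    }
}
```

Declaring ShoppingBasket before anything refers to it (zero references, not a test) breaks the rule. Declaring it after the test has tried to instantiate it (one reference) doesn't, and test methods and test classes are always allowed.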

I might even go so far as to proclaim this as a formal definition of the “test-driven” part of TDD. Nothing can exist unless a test is directly or indirectly using it. Or, as I tell customers, “If there isn’t a test for it, you ain’t getting it”.

Of course, this rule has to be adapted to work in Java, with JUnit, inside, say, IntelliJ’s or Eclipse’s event model, so translation is required. But I have a version of this in an IntelliJ plugin, and it does appear to work, and be mess and noise resistant.

* Formal Methods folks might think this is a bit like temporal logic. It’ll be our little secret.

Code Reviews as Exploratory Testing

Code reviews? Let me tell you about code reviews!

To me, a code review done by people is exploratory testing. We gather around a piece of code (e.g., a merge diff for a new feature). We go through the code and we ask ourselves “What do we think of this?”

Maybe we see a method or a function that has control flow nested 4 deep. Eurgh! Difficult to test and easy to break (such cyclomatic complexity, much wow). So we flag it up.

So far, so normal.

Once that code quality “bug” has been flagged up, I’m sure we both agree that it needs fixing. So we fix it. Job done? Now, here’s where you and I may part company.

It’s almost guaranteed that won’t be the last time that problem rears its head in our code. So, as well as fixing the complex conditional we found in our review, we also fix the process that allowed the problem to make it that far – and waste a bunch of time – in the first place.

When we find a logic error in our code by exploratory testing, we don’t just fix the bug. We write a regression test in case a future change breaks it again. (We do, right?)

And when we find a code quality bug, we shouldn’t just refactor that example. We should add a code quality check for it to our suite of code inspections – automated if at all possible – that will catch it as soon as it reappears somewhere else in the code.
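
To show how small such a check can start out, here's a deliberately crude sketch that uses curly-brace depth as a proxy for nested control flow. The class name and threshold are made up, and it will be fooled by braces inside strings and comments. A real rule would walk the syntax tree, ideally as a custom rule in whatever linter you already use.

```java
// A crude proxy for nested control flow: how deep do the curly braces go?
class NestingDepthCheck {

    // Deepest level of curly-brace nesting in the given source text.
    // Caveat: braces inside strings or comments are counted too.
    static int maxBraceDepth(String source) {
        int depth = 0;
        int max = 0;
        for (char c : source.toCharArray()) {
            if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}') {
                depth--;
            }
        }
        return max;
    }

    // The method body's own braces account for one level, so subtract it.
    static boolean isTooDeeplyNested(String methodSource, int maxAllowedDepth) {
        return maxBraceDepth(methodSource) - 1 > maxAllowedDepth;
    }
}
```

Run over each changed method in a merge with a threshold of, say, 3, this would have flagged the 4-deep conditional before anyone had to gather round and go "Eurgh!".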

Now, you can take this too far, as with all things. Automating regression tests is D.R.Y. applied to our testing process. If we need to perform the same test over and over, automating it makes a lot of sense.

But D.R.Y. has caveats, and one of those caveats is The Rule of Three. On average, we wait until we’ve seen something repeated 3 times before we refactor. This increases the odds that:

a. The refactoring will pay off later (the more examples we see, the more likely there’ll be more in the future)

b. We have more examples to guide us towards the better design.

Both apply as much to duplicated effort in our process as they do to duplication in our code.

So we might want to keep a log of the problems our code reviews find, and when we see the same type of issue appear 3 times (or thereabouts), that might be our cue to look into building a check for it into our Continuous Inspection rule suite. Maybe our linter already has a rule we can use. Maybe we’ll need to write our own custom rule for it. (That’s a very undervalued skillset, BTW). Maybe we could train a small ANN to detect it. Maybe we’ll need to add it to the manual inspection checklist.

A good yardstick might be that the same type of code quality issue doesn’t appear in merges (or attempted merges) more than 3 times.
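
The log itself needn't be anything fancy. Here's a minimal sketch, assuming issue types are just free-form strings the team agrees on. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical review log: counts sightings of each type of code quality issue.
class ReviewLog {

    static final int RULE_OF_THREE = 3;

    private final Map<String, Integer> sightings = new HashMap<>();

    // Notes one more sighting of the issue type, and reports whether we've now
    // seen it often enough to consider building an automated check for it.
    boolean noteSighting(String issueType) {
        int count = sightings.merge(issueType, 1, Integer::sum);
        return count >= RULE_OF_THREE;
    }
}
```

The first two calls to noteSighting("deeply nested conditional") return false. The third returns true, and that's the cue to go and see if the linter already has a rule for it.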

And there’s more. There are teams I’ve worked with who not only add a check to their Continuous Inspection suite, but also ask “Why does this problem happen in the first place?” How do 500-line functions become 500-line functions? How do deeply-nested IFs become deeply-nested IFs? How do classes end up with 12 distinct responsibilities and 25 dependencies?

The answer, BTW, is that – more often than not – functions get 500 lines long, IFs get deeply nested and classes end up doing so many different things because the people writing that code didn’t see it as a problem.

And that’s usually where I come in 🙂