Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise rework due to misunderstandings about requirements, they’ll need to describe requirements in a testable way as part of a close and ongoing collaboration with their customers.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of code reviews, they’ll need to build review into the coding workflow itself, and automate the majority of their code quality checks.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise merge conflicts and broken builds, and to minimise software delivery lead times, they’ll need to integrate their changes more often and automatically build and test the software each time to make it ready for automated deployment.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the “blast radius” of changes, they’ll need to cleanly separate concerns in their designs to reduce coupling and increase cohesion.

“No, we don’t have time for that, Jason.”

I’ve spent the best part of 3 decades telling teams that to minimise the cost and the risk of changing code, they’ll need to continuously refactor their code to keep its intent clear, and keep it simple, modular and low in duplication.

“We definitely don’t have time for that, Jason!”

“AI” coding assistants don’t solve any of these problems. They AMPLIFY THEM.

More code, with more problems, hitting these bottlenecks at accelerated speed turns the code-generating firehose into a load test for your development process.

For most teams, the outcome is less reliable software that costs more to change and is delivered later.

Those teams are being easily outperformed by teams who test, review, refactor and integrate continuously, and who build shared understanding of requirements using examples – with and without “AI”.

Will you make time for them in 2026? Drop me a line if you think it’s about time your team addressed these bottlenecks.

Or was productivity never the point?

Best Way To Keep A Secret In Software Development? Document It

Once upon a time, in a magical land far, far away – just north of London Bridge – I wore the shiny cape and pointy hat of a senior software architect.

A big part of my job was to document architectures, and the key decisions that led to them.

Many diagrams. Many words. Many tables. Many, many documents.

Structure, dynamics, patterns, goals, principles, problem domains, business processes. You name it, I documented it. Much long time writing them, much even longer time keeping them up-to-date. Descriptions of things that aren’t the things themselves, it turns out, go stale fast.

It only takes a couple of gigs to become suspicious that maybe, perhaps, nobody actually reads these documents. So I started keeping the receipts. I’d monitor file access to see when a document was last pulled from our DMS or shared server, or when Wiki pages were last viewed.

My suspicions were confirmed. Teams weren’t looking at the documents much, or – usually – at all. Those fiends!

In their defence, as the saying goes:

Documentation’s useful until you need it

(I looked up the source of this anonymous quote, which I’ve used often. Turns out it’s me. LOL. Why didn’t I remember? I probably wrote it in a document.)

A lot of architecture documentation is out-of-date to the point where it becomes misleading. Code typically evolves faster than one person’s ability to keep an accurate record of the changes.

A fair amount was never in-date in the first place because it describes what the architect wanted the developers to do, and not what they actually did.

The upshot is that architecture documentation is rarely an accurate guide to reality. It’s either stale history, or creation myths.

Recent research found that familiarity with a code base is a far better predictor of a developer’s comprehension of the architecture than access to documentation, and the style of the documentation didn’t seem to make any significant difference.

When I think back to my times as developer rather than architect, that rings true. I’d often skim the documentation – if I looked at it at all – then go look at the code. Because the code is the code. It’s the truth of the architecture.

For architecture documentation to be more useful in aiding comprehension, it has to be tightly-coupled to the reality it describes. Changes to the architecture need to be reflected very soon after they’re made – ideally, pretty much immediately.

I experimented with round-trip modeling of architecture quite deeply, reverse-engineering code on every successful build, producing an updated description for every working version.

But that, too, produced web documentation that I know for a fact almost never got looked at. Least of all by me, as one of the developers.

Reverse-engineering code tends to produce static models – models of structure. I do not find purely structural descriptions very helpful in understanding what code does. It only describes what code is.

If a library has documentation generated from JavaDoc, I’ll tend to ignore that and look for examples of how the library can be used.

When I want to understand a modular design, I’ll look for examples of how key components in the architecture interact in specific usage scenarios – how does the architecture achieve user outcomes?

Quick war story: I joined a team in an investment bank who’d been working in isolation on different COM components. The Friday before our launch date, we still hadn’t seen how these very well-specified components – we all got handed detailed Word documents for our parts – would work together to satisfy key use cases.

Long-story-short, turns out they didn’t. Not a single end-to-end use case scenario.

Every part of the system was described in detail. But the system as a whole didn’t work, and we couldn’t see that until it was too late.

This is why I start with usage scenarios. You can keep your documentation – I’m gonna set breakpoints, run the damn thing, and click some buttons so I can step down through the call stack and build a picture of the real architecture, as it is now, in functional slices.

And I would have damned well designed it that way in the first place, starting with user outcomes and working backwards to an architecture that achieves them.

So, a dynamic model driven by usage examples is preferable in aiding my comprehension of architecture to structural models of static elements and their compile-time relationships.

One neat way of capturing usage examples is by writing them as tests. This can serve a dual purpose, both documenting scenarios as well as enabling us to see explicitly if the software satisfies them.

Tests as specifications – specification by example – is a cornerstone of test-driven approaches to software design. And it can also be a cornerstone of test-driven approaches to software comprehension.
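As a sketch of that dual purpose (the Order and Product classes here are hypothetical, invented for illustration, not taken from any real code base), a unit test can read as a documented usage scenario:

```python
import unittest


class Product:
    """Hypothetical domain class, invented for illustration."""
    def __init__(self, name, stock):
        self.name = name
        self.stock = stock


class Order:
    def __init__(self):
        self.items = []

    def add_item(self, product, quantity):
        if quantity > product.stock:
            raise ValueError("insufficient stock")
        self.items.append((product, quantity))


class AddItemScenarios(unittest.TestCase):
    """Each test doubles as documentation of a usage scenario."""

    def test_item_is_added_when_stock_is_sufficient(self):
        widget = Product("widget", stock=10)
        order = Order()
        order.add_item(widget, quantity=3)
        self.assertEqual([(widget, 3)], order.items)

    def test_item_is_rejected_when_stock_is_insufficient(self):
        widget = Product("widget", stock=2)
        order = Order()
        with self.assertRaises(ValueError):
            order.add_item(widget, quantity=3)
```

Unlike a Wiki page, these scenario descriptions fail loudly the moment they go stale.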

When tests describe user outcomes, we can pull on that thread to visualise how the architecture fulfils them. That so few tools exist to do that automatically – generate diagrams and descriptions of key functions, key components, key interactions as the code is running (a sort of visual equivalent of your IDE’s debugger) – is puzzling. Other fields of design and engineering have had them for decades.
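In the absence of such a tool, we can improvise. This Python sketch (the confirm_order, adjust_stock and record_sale functions are toy stand-ins, not real code) uses sys.settrace to record which functions call which while a usage scenario runs – the raw material for a sequence diagram:

```python
import sys


def trace_calls(scenario):
    """Run a usage scenario and record every (caller, callee) pair.

    A sketch only: a real tool would filter out library noise and
    render the result as a sequence diagram rather than a flat list.
    """
    calls = []

    def tracer(frame, event, arg):
        if event == "call" and frame.f_back is not None:
            calls.append((frame.f_back.f_code.co_name,
                          frame.f_code.co_name))
        return tracer

    sys.settrace(tracer)
    try:
        scenario()
    finally:
        sys.settrace(None)
    return calls


# Toy stand-ins for real components, so the sketch is runnable:
def confirm_order():
    adjust_stock()
    record_sale()

def adjust_stock(): pass
def record_sale(): pass
```

Running trace_calls(confirm_order) records that confirm_order calls adjust_stock and then record_sale – exactly the kind of interaction a usage-driven dynamic model would show.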

As for understanding the architecture’s history, again I’ll defer to what actually happened over what someone says happened. The version history of a code base can reveal all sorts of interesting things about its evolution.

Of course, the problem with software paleontology is that we’re often working with an incomplete fossil record. Some teams will make many design decisions in each commit, and then document them with a helpful message like “Changed some stuff, bro”.

A more fine-grained version history, where each diff solves one problem and every commit explains what that problem was, provides a more complete picture of not just what was done to the design, but why it was done.

e.g., “Refactor: moved Checkout.confirm() to Order to eliminate Feature Envy”

Couple this with a per-commit history of outcomes – test run results, linter issues, etc – and we can get a detailed picture of the architecture’s evolution, as well as the development process itself. You can tell a lot about an animal’s lifestyle from what it excretes. Or, as Jeff Goldblum put it in Jurassic Park, “That is one big pile of shit”.

Naturally, there will be times when the reason for a design decision isn’t clear from the fossil record. Just as with code comments, in those cases we probably need to add an explanation to that record.

This is my approach to architecture decision records – we don’t record every decision, just the ones that need explaining. And these, too, need to be tightly-coupled to the code affected. So if I’m looking at a piece of code and going “Buh?”, there’s an easy way to find out what happened.

Right now, some of you are thinking “But I can just get Claude to do that”. The problem with LLM-generated summaries is that they’re unreliable narrators – just like us.

I’ve been encouraging developers to keep the receipts for “AI”-generated documentation. It’s pretty clear that humans aren’t reading it most of the time, and when we do, it’s at “LGTM-speed”. We’re not seeing the brown M&Ms in the bowl.

Load unreliable summaries into the context alongside actual code, and models can’t tell the difference. Their own summaries are lying to them. Humans, at least, are capable of approaching documentation skeptically.

Whenever possible, contexts – just like architecture descriptions intended for human eyes – should be grounded in contemporary reality and validated history.

And talking of “AI” – because, it seems, we always are these days – one application of machine learning I haven’t yet seen in software development is mining the rich stream of IDE, source file and code-level events we give off as we work, to recognise workflows and intents.

A friend of mine, for his PhD in Machine Vision, trained a model to recognise what people were doing – or attempting to do – in kitchens from videos of them doing it.

Such pattern recognition of developer activity might also be useful to classify probable intent, and even predict certain likely outcomes as we work. At the very least, more accurate and meaningful commit messages could be automatically generated. No more “Changed some stuff, bro”.

At the end of the (very long) day, comprehension is the goal here. Not documentation. If a document is not comprehended – and if nobody’s reading it, then that’s a given – then it’s of no help.

And if it’s going to be read, it needs to be helpful in aiding comprehension. Like, duh!

Personally, what I would find useful is not better documentation, but better visualisation – visualisation of static structure and dynamic behaviour, of components and their interactions, and of the software’s evolution.

And visualisation at multiple scales from individual functions and modules all the way up to systems of systems, and the business processes and problem domains they interact with.

And when we want the team to actually read the documentation, we need to take it to them. Diagrams should be prominently displayed (I’ve spent a lot of time updating domain models to be printed on A0 paper and hung on the wall), explanations should be communicated. Ideally, architecture should be a shared team activity – e.g., with teaming sessions, with code reviews, with pair programming – and an active, ongoing, highly-visible collaboration.

Active participation in architecture is essential to better comprehension. Doing > seeing > reading or hearing.

That also goes for legacy architecture – active engagement (clicking buttons, stepping through call stacks in the debugger, writing tests for existing code) tends to build deeper understanding faster than reading documentation.

The fastest way to understand code is to write it. The second fastest is to debug it.

This is especially critical in highly-iterative, continuous approaches to software and system architecture, where decisions are being made in real-time many times a day. Without a comprehensible, bang-up-to-date picture of the architecture, we’re basing our decisions on shaky foundations.

Like an LLM generating code by matching patterns in its own summaries, we risk coming untethered from reality.

Which would be an apt description of most architecture documents, including mine.

“First, We Model The Domain…”

In my previous blog post talking about the preciseness of software specifications, I used an example from one of my training workshops to illustrate the value in adding clarity when we have a shared understanding of the problem domain.

Now, when many developers see a UML class diagram – especially those who lived through the age of Big Architecture in the 1990s – they immediately associate it with BDUF (Big Design Up-Front). And to be fair, it’s understandable how visual modeling, and UML in particular, gained that reputation, with its association with heavyweight model-driven development processes.

But teams who reject visual modeling outright because it’s “not Agile” are throwing the baby out with the BDUF bathwater.

I recounted in my post how providing a basic domain model with the requirements dramatically reduced misinterpretations in the training exercise.

And I’ve seen it have the same effect in real projects, too. As a tech lead I would often take on the responsibility of creating visual models based on our actual code and displaying them prominently in the team space. As the code evolved, I’d regularly update the models so they were a slightly-lagging but mostly accurate reflection of our design.

Domain models – the business concepts and their relationships – have proven to be the most useful things to share, helping to keep the team on the same page in our understanding of what it is we’re actually talking about.

Most importantly, there’s no hint of BDUF in sight. I describe the domain concepts that are pertinent to the test cases we’re working on. The model grows as our software grows, working in vertical slices in tight feedback loops, and never getting ahead of ourselves. We don’t model the entire problem domain, just the concepts we need for the functionality we’re working on.

In this sense, to describe our approach to design as “domain-driven” might be misleading. The domain doesn’t drive the design, user needs do. And user needs dictate what domain concepts our design needs.

Let’s examine the original requirements:

• Add item – add an item to an order. An order item has a product and a quantity. There must be sufficient stock of that product to fulfil the order

• Total including shipping – calculate the total amount payable for the order, including shipping to the address

• Confirm – when an order is confirmed, the stock levels of every product in the items are adjusted by the item quantity, and then the order is added to the sales history.

I’d tackle these one at a time. The domain model for Add Item would look like:

[Image: domain model for the Add Item use case]

Note that Product price isn’t pertinent to this use case, so it’s not in the model.
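The diagram itself isn’t reproduced here, but a code sketch of what that Add Item model describes might look like this (names inferred from the requirement, not taken from a real code base):

```python
class Product:
    def __init__(self, stock):
        self.stock = stock  # price isn't pertinent to Add Item, so it's absent


class OrderItem:
    def __init__(self, product, quantity):
        self.product = product
        self.quantity = quantity


class Order:
    def __init__(self):
        self.items = []

    def add_item(self, product, quantity):
        # "There must be sufficient stock of that product to fulfil the order"
        if product.stock < quantity:
            raise ValueError("insufficient stock")
        self.items.append(OrderItem(product, quantity))
```

Just the concepts and relationships this one use case needs, and nothing more.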

When I start working on the next use case, Total Including Shipping, the domain model evolves.

[Image: the domain model, evolved for the Total Including Shipping use case]

And it evolves again to handle the Confirm use case.

[Image: the domain model, evolved again for the Confirm use case]

And the level of detail in the model, again, is only what’s pertinent. We do not need to know about the getters and the setters and all that low-level malarkey. We can look at the code to get the details. Otherwise it just becomes visual clutter, making the models less useful as communication tools.

Another activity in which visual modeling can really help is as an aid to collaborative design.

I’ve seen so many times developers or pairs picking up different requirements and going off into their respective corners, designing in isolation and coming up with some serious trainwrecks – duplicated concepts, mismatched architectures, conflicting assumptions, and so on.

It’s the classic “Borrow a video”/”Return a video” situation, where we end up with two versions of the same conceptual model that don’t connect.

It’s especially risky early in the life of a software product, when a basic architecture hasn’t been established yet and everything’s up in the air.

I’ve found it very helpful in those early stages to get everybody around a whiteboard, where each developer or pair lays out the design for their specific requirement as part of the same shared model. So if somebody’s already added a Rental class, they add their behaviour around that, and not around their own rental concept.

As the code grows, maintaining a picture of what’s in it – especially domain concepts – gives the team a shared map of what things are and where things go, and a shared vocabulary for discussing and reasoning about problems together.

This is part of the wider discipline of Continuous Architecture, where understanding, planning, evaluating and steering software design is happening throughout the day.

The opposite of Big Design Up-Front.

If your team wants to level up their capability to rapidly, reliably and sustainably evolve working software to meet changing business needs, check out my live, instructor-led and very hands-on training workshops.

Modular Design: The Secret Sauce

In pretty much every Codemanship training course, I try to stress the fundamental importance of modular design in software development.

When we fail to separate the different concerns in our design in a way that enables us to understand, test, change and reuse them independently of the rest of the design, bad things happen. (Not “bad things can happen”. Bad things happen. End of.)

“What are those bad things, Jason?”

Bad Thing #1 – If it’s not a cleanly separated concern, in order to understand how, say, mortgage repayments are calculated, we might need to understand how to read the result from a web page using XPath, fetch account data from the database, and get the latest base interest rate from a web service. Lack of separation of concerns increases cognitive load, which increases the time taken to make changes, and increases the risk of mistakes.

Bad Thing #2 – If it’s not a cleanly separated concern, to test that the mortgage repayments are calculated correctly, we might need integration tests that hit databases and web services, or even end-to-end tests that go through the UI.

Testing’s the inner feedback loop of software development. If the inner feedback loop’s slow, the outer feedback loops will be very slow – delivery lead times go from days to weeks to months to maybe never. Slow test suites kill businesses every day. I’m not kidding.
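To make that concrete, here’s a sketch of the alternative: the repayment calculation as a pure function (the standard annuity formula; the monthly_repayment name is invented for this example), which can be unit-tested in milliseconds with no database or web service in sight:

```python
def monthly_repayment(principal, annual_rate, years):
    """Standard annuity repayment formula, as a pure function.

    Deliberately knows nothing about databases, web services or HTML,
    so it can be tested thousands of times per second in isolation.
    """
    n = years * 12              # number of monthly payments
    r = annual_rate / 12        # monthly interest rate
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)
```

A fast test of this needs only arguments and a return value: no fixtures, no stubs, no network.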

Bad Thing #3 – If it’s not a cleanly separated concern, changing the code that calculates mortgage repayments will mean modifying files that handle other concerns. And changing the code that handles those other concerns will mean modifying the files that handle calculating mortgage repayments. This is where teams start tripping over each other’s feet.

Modules that do many things will often have many dependencies, too. We end up with big modules that are tightly-coupled to many other modules, so even tiny changes can have a wide “blast radius” – many files end up being affected. This means more to test, more to review, more to merge (with more merge conflicts to resolve – everybody’s editing the same files). These bigger batch sizes can easily overwhelm downstream bottlenecks in software delivery. Couple that with slow build & test cycles, and it’s a perfect storm.

Bad Thing #4 – Say we want to calculate mortgage repayments on our website, but we’d also like to offer that feature in our Android app. If the code that fetches the account data, gets the latest base interest rate, calculates the repayments, and renders the result in HTML is all in the same module, we have a problem. To use the radio, we need the whole Mercedes.

If we want to use it in a different car, we have to buy a new radio. What teams will often do is duplicate logic in multiple places. And when (yes, WHEN) that logic needs to change, they’ve multiplied the cost and the risk of doing that.

It’s no coincidence teams categorised in the DORA reports as “elite” have high modularity in their designs. You don’t get fast release cycles and short lead times without it. Simple as that.

The next question is, what does that degree of modularity actually look like? It’s much more modular than most developers think.

Refactoring – The Most Important, Least-Understood Dev Skill

At the moment, I offer 5 “off-the-shelf” training workshops focused on the core technical practices that enable rapid, reliable and sustained evolution of working software to meet changing needs.

Basically, the practices that have been shown to reduce delivery lead times, while improving release stability and reducing cost of change.

They’re mutually supporting (e.g., you can’t have continuous testing without good separation of concerns) – so ideally, your team would apply all of them in a “virtuous circle”.

But when I look at the sales history of each workshop, there’s a worrying imbalance.

* Code Craft (the flagship workshop) sells 49% of the time.

* The 2-day introduction to Test-Driven Development, aimed at less experienced developers, sells 32% of the time.

* The 1-day introduction to Unit Testing sells 9% of the time.

* The 2-day Design Principles deep-dive sells 8% of the time.

* And the 2-day Refactoring deep-dive only 2%. In fact, nobody’s booked a refactoring workshop since before the pandemic!

Refactoring, as a skill, exercises many of the “muscle groups” involved in Continuous Delivery, and is one of the most challenging to learn.

It’s also one of the most valuable. Whether you’re doing TDD or not, whether you’re continuously integrating or not, whether you’re agile or not – the ability to safely and predictably reshape code to accommodate change is gold.

Without it, you are far more likely to break Gorman’s First Law of Software Development:

Thou shalt not break shit that was working

Especially when you consider that most developers are working on hard-to-change legacy code most of the time. Refactoring is the skill for working with legacy products and systems.

I promised I wouldn’t be mentioning it this week, but I’ll just subtly hint that this problem is currently accelerating because of… well, y’know.

I routinely cite it as the second most important software development skill. (Can you guess what I believe is the first?)

It’s ironic, then, that it’s one of the rarest and one of the least in-demand, if job specs and training orders are any indication.

For sure, most developers will use the word (typically not knowing what it means), and most developers will claim they do it. But the large majority have never even seen it being done – hence the many misapprehensions about what it is.

At the very least, it would be a step-change for the profession if the average software developer could recognise the most common “code smells” and had a decent set of primitive refactorings in their repertoire to deal with them. I call this “short-form” refactoring.
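To illustrate one of those primitive refactorings (the class names here are hypothetical, echoing the commit message example earlier), here’s a before-and-after sketch of Move Method fixing Feature Envy:

```python
# Before: Checkout does all of its work with Order's data --
# a classic Feature Envy smell.
class EnviousCheckout:
    def total(self, order):
        return sum(item.price * item.quantity for item in order.items)


# After: the primitive refactoring Move Method relocates the
# behaviour to where the data lives.
class Item:
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity


class Order:
    def __init__(self, items):
        self.items = items

    def total(self):
        return sum(item.price * item.quantity for item in self.items)
```

Crucially, the observable behaviour is identical before and after – which is what makes it a refactoring and not just a change.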

And ideally, a good percentage of us would be capable of “long-form” refactoring so we can reshape architecture at a higher level safely. The best software architects have learned to think that way. (e.g., Joshua Kerievsky’s excellent book Refactoring To Patterns).

If you’d like to build your team’s Refactoring Fu, visit the website for details.

(Well, a man can hope, can’t he?)

The AI-Ready Software Developer #14 – Continuous Architecture

They say a journey of a thousand miles starts with a single step, and the art of “AI”-assisted software development is very much putting one foot in front of the other. But we still need to look where we’re going.

One complaint that’s often leveled at micro-iterative development practices like Test-Driven Development and refactoring is that they can produce ad-hoc, “that’ll do for now” architectures.

There’s some truth to this for teams who are unskilled at high-level software design and lack the refactoring skills to reshape architecture as it emerges.

Software design at the level of individual tests or behaviours or modules could be considered the “short form”. Software architecture at the component and system level is the “long form”.

Other creative disciplines have their short forms and their long forms. A paragraph vs. a novel. A melody vs. a symphony. A scene vs. a movie.

We’ve probably all sat through a movie directed by someone very experienced at the short form (e.g., adverts), only to find that, while individual scenes or shots are beautifully done, the overall pacing and structure are all over the place.

There is structure in a paragraph, in a melody, in a scene, and there is structure in a chapter, in a movement, and in an act. And there’s structure in a novel, in a symphony, and in a feature film.

It’s wheels within wheels. Or turtles all the way down, if you prefer.

What goes wrong in many dev teams is they’re not operating across these multiple levels of structure.

They may be entirely focused on the code window in front of them – beautiful prose, but the story makes no sense.

They may be thinking about their service-oriented architecture, but not taking care of the internal design of each service – a thrilling narrative, but it reads like it was written by a fourth grader. (This is a very real risk when architecture and implementation are considered separate activities or roles – or worse still, teams!)

And you’ll be amazed how often an “insignificant” implementation detail can end up fundamentally changing the higher-level architecture without teams realising it’s happened. It’s chilling the havoc that can be wreaked just by adding a dependency to left-pad strings.

This is especially true when we attach a code-generating firehose to the process. Structure is now emerging non-deterministically at a rate of knots, at least in part generated by LLMs who are gonna do what they’re gonna do, no matter what we told them to do.

And we must stay aware that LLMs do not operate effectively at larger scales of code organisation, mostly because of their very limited effective context windows, and the power-law distribution of examples in their training data – many short forms, vanishingly few long forms. Architecture just isn’t their strong suit.

So it’s essential to keep a handle on the structures as they emerge – visualising, analysing and steering the architecture into a form that’s going to do what we need today and be open to tomorrow’s inevitable changes.

The most effective developers don’t just focus on the code in front of them and the task at hand. They also see how it fits into a bigger picture. They see the jigsaw as well as the pieces.

When we consider composition – how the pieces fit together into bigger structures – then dependencies, coupling and cohesion start to loom large. As we climb the scale of code organisation, the arrows become more important than the boxes.

[Image: composition at multiple scales – the arrows matter more than the boxes]

There are tools we can use to help us see beyond the code window. Some are simple – pencil and paper, marker and whiteboard, Sharpie and Post-It note. Some are very sophisticated, like Rational Rose. (Maybe a bit too sophisticated.)

Visualising what we’ve got has always proven to be valuable in helping us to comprehend and reason about design at a higher-level. I’ll often spot a problem only when I’ve seen it on a diagram.

It’s also very handy for communicating design concepts more clearly and efficiently – an essential tool in collaborative design. And machine vision has matured to a point where that can include collaborating with “AI”.

Analysing what we’ve got – complexity, coupling, cohesion and other qualities of modular designs – is also a powerful tool for understanding problems and for exploring solutions to those problems.
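As a small taste of what that analysis can look like (a sketch only, nothing like a full-blown tool), this Python snippet counts efferent coupling – the number of outgoing dependencies per module – straight from import statements:

```python
import ast


def efferent_coupling(sources):
    """Count each module's outgoing dependencies from its imports.

    `sources` maps module name -> source code. A sketch only: a real
    analysis would also consider afferent coupling, cohesion and
    dependency cycles.
    """
    coupling = {}
    for name, code in sources.items():
        deps = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        coupling[name] = len(deps - {name})
    return coupling
```

Feeding it two toy modules – a checkout module that imports orders and payments, and an orders module that imports nothing – flags checkout as the more coupled of the two.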

Planning where we want to take the architecture next is the end goal of visualising and analysing it. It could involve a simple sketch on a whiteboard, or figuring out key roles, responsibilities and collaborations using CRC cards, or it could just be a conversation.

Figuring out how we’re going to get there safely, in a sequence of small feedback cycles, and without overwhelming the delivery process, is where we need to scale up our refactoring skills. Techniques like the Mikado Method for planning large-scale (long-form) refactorings can be very helpful here, provided you’ve got the small-scale refactoring skills to execute those plans.

Teams working with or without – but especially with – “AI” coding assistants need to master the short form and the long form, and all the forms in between. They need to see the wood and the trees.

As I argued in my previous post, big picture thinking is a job for Actual Intelligence.

And they need to visualise, analyse, communicate, plan and execute architecture continuously.

The AI-Ready Software Developer #13 – *You* Are The Intelligence

Human beings are funny old things. Over millions of years of evolution we’ve developed some traits that served us well in the wild, but might arguably work against us in our domesticated form.

We’re susceptible to psychological tics that can distort our thinking and make us act irrationally, even against our own interests.

One of those tics is our tendency to assign agency or intent to things that demonstrably don’t have it. We evolved to have Theory of Mind so that we can put ourselves in another person or animal’s shoes, and ask “Can that sabre-toothed tiger see me behind this tree?” or “Is Ugg planning to steal my best rock?”

The problems can start when we apply Theory of Mind to the weather (why does it insist on raining as soon as I put my coat on?), machinery (this washing machine hates me!), or – just for instance – a Large Language Model.

It’s understandable when we mistake software that matches patterns and predicts what comes next for something that actually thinks, because the patterns it’s matching are products of actual thinking – Actual Intelligence.

Heck, when he was stranded on that remote island, Tom Hanks formed a close friendship with a volleyball, and all that took was a handprint with eyes. The bar before anthropomorphism kicks in isn’t set very high.

Many LLM users ascribe qualities and abilities to the models that they demonstrably don’t have, like the ability to reason or to understand or to plan.

What they can do is to help us to reason and to understand and to plan.

Very importantly, we humans can also learn. In real time. From surprisingly few examples. And we don’t need a 100 MW power supply and the contents of Lake Michigan to do it.

In a collaboration between a human expert and an LLM, if we assign roles according to our strengths, the LLM is the powerful statistical pattern matcher and token predictor, trained on the sum total of current human knowledge – be it accurate or not – as of its training cut-off date. But it cannot think. It’s the world’s most well-read idiot. And we are the brains of the outfit.

We also need to remember that, despite what enthusiastic promoters of “agentic” coding assistants claim, LLMs have no capability to see the bigger picture and to think and plan strategically about things like the business domain, the user’s goals, the system architecture, or any of those “bird’s eye” concerns. Because they have no ability to think.

When we ask them to, they’ll “hallucinate” a high-level plan for us quite happily (and there I go, anthropomorphising). Like most “AI” output, it will look very plausible – more convincing than a handprint on a volleyball. But on closer inspection, there’s a very high probability that it will be full of brown M&Ms. At such context sizes, it’s pretty much guaranteed.

And this is where psychology comes in again. Some people don’t see the problems. Maybe they don’t recognise them when they see them? Maybe they choose not to see them? Some folks really want to believe…

I have found it necessary to continually remind myself of the true nature of LLMs when I’m using them, and of the inherent – and very probably unfixable – limitations of their architecture.

The developers I’m seeing getting the best results using LLMs use them in ways that play to the tool’s strengths, and retain complete control over work that plays to theirs – keeping the LLM on a very short leash. They have the map. They set the route. They do the navigating.

The AI-Ready Software Developer #4 – Continuous Testing

Now, where were we? Ah, yes.

So, we’re working in small steps with our LLM, solving one problem at a time, which makes it easier for the model to pay attention to important details (just like in real life).

We’re keeping our contexts small, and making them more specific by clarifying with examples to reduce the risk of misinterpretation (just like in real life).

And we’re cleanly separating the different concerns in our architecture to limit the “blast radius” when the model changes code, reducing the risk of boo-boos (just like in real life), and keeping diffs smaller. (More about that in a future post – for now, smaller diffs == gooderer).

When we apply all three of these practices together, it opens a door: we can test more often.

Those examples we used to clarify our requirements can become tests we can perform after the model has done that work to check that it did what we told it to.

We could perform these tests ourselves by running the software, or by exercising the code directly at the command line in a Read-Eval-Print Loop (REPL). Or, if a UI is involved, we could run it and click the buttons ourselves.

I highly recommend seeing it work with your own eyes at least once. Trust no one, Agent Mulder!

But what about code that was working that the model has since changed? As the software grows, manually re-running dozens, hundreds, maybe thousands of tests to make sure we’re obeying Gorman’s First Law of Software Development:

“Thou shalt not break shit that was working”

– is going to take a lot of time. Eventually, our development process will become O(n²) complex, where n is the number of tests: every time we add a new one – one problem at a time, remember? – we have to repeat all the existing tests, so n additions mean 1 + 2 + … + n = n(n + 1)/2 test runs in total.

Automation to the rescue! If we find ourselves performing the same test over and over, we can write code to perform it for us. Or we can get the LLM to write it for us (be careful here – triple-check every test the model writes! Been burned by that multiple times.)
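As a sketch of what this looks like in Python – the function and the discount rule here are made up purely for illustration – a requirement clarified with an example becomes a regression test we can re-run cheaply after every change the model makes:

```python
# A clarifying example from the requirements conversation:
#   "Orders over £100 get a 5% discount, so a £200 order costs £190."
# That example becomes an automated test.

def discounted_total(order_total: float) -> float:
    """Hypothetical function under test."""
    return order_total * 0.95 if order_total > 100 else order_total

def test_discount_applies_over_100():
    assert discounted_total(200) == 190

def test_no_discount_at_or_below_100():
    assert discounted_total(100) == 100
```

Run these with a test runner like pytest after every small step, and the example from the requirements conversation is now guarding the behaviour it described.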

And this is where clean separation of concerns turns into a superhero.

If the code that, say, calculates mortgage repayments is buried inside the module that generates the Repayments web page – a module that also directly accesses an external web service to get interest rates – then you’ll have little choice but to test through the browser (or something pretending to be the browser).

But if there’s a separate MortgageCalculator module that does this work, and that module is decoupled from the code that fetches interest rates, a test can be automated directly against it that will run very reliably and very fast – milliseconds instead of seconds. Thousands of those kinds of “unit” tests could run in seconds instead of hours.
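A minimal sketch of what that can look like in Python – the class name, formula and figures are illustrative, not from any real system. Because the interest rate is passed in rather than fetched, the test needs no network and no browser:

```python
class MortgageCalculator:
    """Does the repayment arithmetic only - no web page, no web service."""

    def monthly_repayment(self, principal: float,
                          annual_rate: float, years: int) -> float:
        # Standard amortisation formula
        r = annual_rate / 12
        n = years * 12
        return principal * r / (1 - (1 + r) ** -n)

# A "unit" test that runs in milliseconds, reliably, with no I/O:
calc = MortgageCalculator()
repayment = calc.monthly_repayment(100_000, 0.06, 25)
assert abs(repayment - 644.30) < 0.01
```

The test talks straight to the module that owns the behaviour. Nothing to spin up, nothing to mock, nothing to flake.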

Which means comprehensively retesting your software after every small step, giving you an instant heads-up if the LLM broke anything (AND IT WILL), becomes completely practical.

Once again, you won’t be surprised to learn that this is very good news whether we’re using “AI” or not. Many teams consider it essential.

The AI-Ready Software Developer #1 – Separation of Concerns

Can we talk about separation of concerns and cognitive load?

One thing about LLM coding assistants that’s very interesting is how they tend to crap out on code that has poor separation of concerns.

Despite some pretty darn big advertised maximum context sizes (e.g., GPT-5 has 400K tokens), the effective maximum context size – beyond which accuracy degrades rapidly, and downstream rework multiplies – is orders of magnitude smaller. Studies have found it to be on the order of 100–1,000 tokens, even for the hyperscale “frontier” models.

Coding assistants like Claude Code and Codex have to work out which source files to pull into the context for a particular task – and a file’s dependencies tend to get pulled in along with it.

If you ask it to make a change to a 1,000-line source file that has 20 direct dependencies on other source files, that’s a lot of context.

If you ask it to make a change to a 100-line source file with 2 direct dependencies on other files, the context is much smaller. Changes with a smaller “blast radius” are less likely to go wrong.

I think of LLM context as being analogous to cognitive load: in order to understand Module A, what else do I need to understand?

Higher modularity tends to reduce cognitive load when it’s done effectively. If I can understand the contents of Module A, I shouldn’t need to understand the contents of any of its dependencies. To reverse an old marketing slogan, each dependency “Says what it does on the tin”, so to speak.

A useful test of code comprehensibility is to ask people to predict what a method or function will do in a specific test case without letting them see the implementations of any other methods or functions it uses. What these dependencies are doing should be obvious from the outside, and the details of how they do it should be irrelevant.
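For example – a made-up checkout function – could you predict what this returns without reading total_price() or charge()? With intent-revealing names, you shouldn’t need to:

```python
def total_price(cart: list) -> int:
    return sum(cart)

def charge(card: str, amount: int) -> str:
    return f"charged {amount} to card {card}"

def checkout(cart: list, card: str) -> str:
    # What each dependency does is obvious from its name;
    # how it does it is irrelevant to understanding this function.
    return charge(card, total_price(cart))

assert checkout([10, 5], "1234") == "charged 15 to card 1234"
```

If you found yourself needing to open up total_price() to answer the question, that’s a smell – either the name or the abstraction is wrong.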

And it turns out that’s good advice when working with LLMs, too. Signposting dependencies clearly (e.g., with intent-revealing names and/or type information) helps models pattern-match more accurately – they don’t need to “guess”. And in my own experiments I’ve seen it reduce context size – “cognitive load” – on many occasions, producing fewer “hallucinations” in the output.

In languages like C# and Java, we don’t get much of a choice over whether we provide type information (though watch out for those inferred types!)

In dynamically-typed languages, I’ve found I need to be more careful. If, for example, a dependency name doesn’t correspond to its type in my Python code, I’ll add a type hint to give the model more to go on.
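A sketch of what I mean – the names here are hypothetical. The hint tells both the model and the human reader what kind of thing rates actually is:

```python
def repayment_schedule(principal, rates):
    # "rates"... a list? a dict? a service object? The model has to guess.
    ...

def repayment_schedule_hinted(principal: float,
                              rates: dict[int, float]) -> list[float]:
    # Now it's clear: a mapping of month number to interest rate,
    # producing a list of repayment amounts.
    ...
```

One line of annotation, and the dependency says what it does on the tin without anyone having to read its implementation.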

One final thought: LLMs are famously good at generating code they’re bad at modifying. I routinely see larger source files with lots of dependencies being spat out by Claude, GPT-5, Llama etc. So you need to keep on top of your modular design.

(I see some folks posting that they get the model to review and break modules down once a day. IME, generated code can be tripping the model up before lunch, so I’d recommend refactoring more often than that. Indeed, I recommend refactoring continuously.)

What Are The “Objects” in “Object-Oriented Programming”?

We’re back to this old chestnut. In one of the exercises on my Code Craft course – “The CD Warehouse” – one of the use cases is that customers can buy a CD from the warehouse.

The most common design solution is to add something like a buyCD(artist, title, card) method to a Warehouse class. And, given that this is an exercise in modular design, this naturally raises the question of encapsulation.

When a customer buys a CD, their card is charged the price of that CD, and the stock of that CD is reduced by one. When I parse that sentence, I see “of that CD” appear twice.

When we give responsibility to the Warehouse for achieving those outcomes, we end up with Feature Envy for cd.price and cd.stock. We also end up with the need to find the CD that we’re talking about, searching in the catalogue by artist and title.

So we tend to end up with more code, and with more coupling than if the responsibility was shifted to where the data is – e.g., cd.buy(card).
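Here’s that design as a sketch in Python – the class and field names come from the exercise, but the code itself is illustrative. With cd.buy(card), the behaviour lives with the data it changes, so nothing needs to reach into the CD’s internals:

```python
class Card:
    def __init__(self):
        self.charged = 0

    def charge(self, amount):
        self.charged += amount

class CD:
    def __init__(self, artist, title, price, stock):
        self.artist, self.title = artist, title
        self.price, self.stock = price, stock

    def buy(self, card):
        # No Feature Envy: price and stock are used where they live
        card.charge(self.price)
        self.stock -= 1

cd = CD("Some Artist", "Some Album", 9.99, 5)
card = Card()
cd.buy(card)
assert card.charged == 9.99 and cd.stock == 4
```

Compare that with a Warehouse.buyCD(artist, title, card) that has to search the catalogue, then pull cd.price and cd.stock out of the CD to do the same work at arm’s length.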

When I raise this with pairs, a common response is “But Jason, CDs don’t buy themselves!” And this steers into a more philosophical conversation about what the objects are in object-oriented programming, and how we read OO code.

To many, cd.buy(card) reads as telling the cd to do the buying. I don’t read OO code that way. To me, cd.buy(card) means “buy this cd using that card”. cd is the object, not the subject. It’s the thing that’s being bought.

Think about it: if we wrote this in C, it would be buy(cd, card). As a convention, any function applied to an object – an instance of a user-defined data type – would take that instance as the first parameter. And what did we call that first parameter – the thing to which the function is being applied? We called it “this”. That’s where the keyword comes from.

So cd.buy(card) means exactly the same thing as buy(cd, card). OOP just flips it around in the syntax so that the “this” parameter is placed in front of the function. In most OO languages, you read it backwards: object.action().
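Python makes this explicit: self sits right there in the signature, and the two calls below are literally the same call (a minimal, made-up class to demonstrate):

```python
class CD:
    def buy(self, card):
        # "self" is the cd being bought; "card" is what we buy it with
        return f"bought one copy using {card}"

cd = CD()
# The OO syntax just moves the first argument in front of the function:
assert cd.buy("visa") == CD.buy(cd, "visa")
```

Same function, same arguments – the dot notation is syntactic rearrangement, not a statement about who’s doing the buying.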

This relates directly to encapsulation. We want the effect of a function to be contained in the same place – in the same class. When it isn’t, we end up having to share internal details with whichever other class is handling that piece of business – otherwise known as “coupling”.