The AI-Ready Software Developer #16 – A Token of Our eXtreme

Some of you reading the posts in this series might be thinking, “This all sounds a bit familiar”. And you’d be right.

“AI” coding assistants built around Large Language Models may be a relatively new technology, but we’re discovering that the best ways to use them are decades old.

Some teams have been working in small batches – solving one problem at a time – and testing, reviewing, refactoring and integrating their code continuously for decades.

Some teams are cohesive, cross-functional and largely autonomous – adapting and self-organising to address problems in the moment, instead of waiting for permission from above.

Some teams have been using examples to pin down the meaning of system requirements, and driving the design of the software directly from executable interpretations of those examples (you may know them as “tests”), since the 1950s.

One software development methodology in particular encapsulates all of the principles and practices we’ve explored in this series: eXtreme Programming (XP).

XP was born in the mid-1990s, and is most closely associated with Kent Beck. It’s a shining example of what can happen when you get the right people in the room.

It was undoubtedly the main inspiration for the Agile Software Development movement that started at a ski resort – where else? – in 2001, but was subsequently somewhat overshadowed by it. But for many of us, when you say “Agile”, we still hear “XP”.

If you look carefully at the group photo on the Agile Manifesto’s home page, and check the original signatories, you’ll see that most of the people attending that summit – like Ron Jeffries and Ward Cunningham – were closely involved with the early evolution of XP.

In this new age of “AI”-assisted programming, XP is experiencing something of a renaissance – although many folks currently rediscovering the approach might not realise it already has a name. So let me fill in the blanks.

eXtreme Programming brought together key lessons about what works and what doesn’t in developing software learned by programmers over the preceding 40 years.

Core to the technical practices of eXtreme Programming is a micro-iterative process we now call Test-Driven Development (TDD).

In TDD, we work in small steps – solving one problem at a time. We specify using examples (tests). We test continuously. We review the code continuously. We refactor continuously. And when we’re refactoring, we make one small change at a time, testing and reviewing the results at every step.

Many of us build version control into the TDD micro-cycle, committing changes whenever the tests are all “green”. Some even revert if any tests fail and try again, perhaps taking a smaller, safer step. And many of us push our changes directly to the trunk branch multiple times an hour, rather than waiting until a feature’s been completed.
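
As a rough illustration of “commit on green, revert on red”, here is a minimal sketch in Python. It assumes a Git repository and pytest as the test runner, neither of which is prescribed by anything above, and the function name commit_or_revert is made up for this example.

```python
import subprocess
from typing import Callable, Sequence


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def commit_or_revert(message: str,
                     runner: Callable[[Sequence[str]], int] = run) -> str:
    """Commit if the tests pass, otherwise discard the step just taken.

    Assumptions for illustration only: pytest is the test command and
    the project lives in a Git repository.
    """
    if runner(["pytest", "-q"]) == 0:
        # Green: record this small step.
        runner(["git", "add", "-A"])
        runner(["git", "commit", "-m", message])
        return "committed"
    # Red: restore tracked files to the last commit (untracked files are
    # left in place), then try again with a smaller, safer step.
    runner(["git", "reset", "--hard"])
    return "reverted"
```

The runner parameter is only there so the decision logic can be exercised without touching a real repository. In day-to-day use you would call commit_or_revert("Handle empty basket") after each small change.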

XP teams will often work in pairs so that, as well as having an extra brain for problem-solving and providing direction, there’s also an extra pair of eyes reviewing the code as it’s being written.

XP teams tackle architecture in a highly collaborative and ongoing fashion. Always striving for simplicity, XP teams will have short design sessions throughout the day, using simple modeling techniques to visualise, understand, plan and communicate software design. (A very common misconception about XP is that teams “don’t do any design planning”. They’re doing it all the time.)

XP teams tend to be small and cohesive, encapsulating the skills needed to deliver customer requirements end-to-end whenever possible.

They are also highly autonomous and self-organising, making decisions together when they need to be made, instead of sending them up the chain of command. In XP, the team – working closely with the customer – is in command.

eXtreme Programming works by minimising uncertainty, and it does this by minimising the amount of work in progress – solving one problem at a time, maximising focus and minimising cognitive load – and maximising objective feedback. Basically, XP teams turn the cards over one at a time, as they’re being dealt.

In terms of team productivity, the small batch sizes, fast feedback loops and continuous – rather than phased (blocking) – design, testing, review and integration tend to minimise bottlenecks and maximise the flow of value. Skilled XP teams tend to have very short delivery lead times, produce very stable releases, and are able to sustain the pace of delivery for years on the same product.

When we’re using “AI” coding assistants, uncertainty is even more in play. We can minimise uncertainty in pretty much exactly the same kinds of ways. Smaller steps with less ambiguity, and faster feedback.

Many “AI”-assisted developers are learning that a process like TDD can significantly reduce the “downstream chaos” that DORA data shows plagues most teams using these tools.

Working one test case (one example) at a time reduces context – the LLM equivalent of cognitive load – and specifying with tests minimises semantic ambiguity, dramatically reducing the risk of models misinterpreting our requirements.
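
To make the point about ambiguity concrete, compare a requirement like “bigger orders ship free” with the same rule pinned down by examples. The sketch below is purely hypothetical: the shipping_cost function, the 50 threshold and the 4.99 flat rate are invented for illustration.

```python
def shipping_cost(order_total: float) -> float:
    """Hypothetical rule: orders of 50 or more ship free, otherwise 4.99."""
    return 0.0 if order_total >= 50 else 4.99


# Each test is one example of the requirement, and one small step in TDD.
def test_orders_of_50_or_more_ship_free():
    assert shipping_cost(50.00) == 0.0


def test_orders_under_50_pay_the_flat_rate():
    assert shipping_cost(49.99) == 4.99
```

The two examples settle the questions the prose version leaves open: how big is “bigger”, and is the boundary inclusive? Handing a model one such test at a time leaves it far less room to guess.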

It also gives us many opportunities to test (and re-test) the code in smaller feedback cycles, as well as many more opportunities to review generated code and refactor – again, one small step at a time – if there are problems (which there will be, and often!). We can use a process like TDD to keep the model on a very tight leash.

Combine TDD with merciless version control – commit on green, revert on red – to keep the code base on the path of working (shippable) software in every small increment – and do frequent merges to the release branch, and you have an approach that is, in most key respects, eXtreme Programming.

You can call it “eXtreme Vibing”, if you think that will look better on your CV.

The AI-Ready Software Developer #15 – It’s (Still) About The Team

So far, I’ve explored technical bottlenecks in the development process. In this post I want to talk about people bottlenecks. In particular, I want to talk about the team.

So, you’re working in small batches and micro feedback loops, solving one problem at a time. You’re testing continuously. You’re reviewing code continuously. You’re integrating continuously.

But are you making the decisions when they need to be made? Or are you waiting for permission from above?

Lack of autonomy, perhaps as a result of a lack of trust in the team, can create serious bottlenecks in the development process. The less decision-making power teams – and individuals on teams – have, the more decisions have to be sent up the chain of command, and potentially the longer we have to wait for a thumbs-up.

It’s just another queue. (And if the boss is one of those whose diary’s blocked out for the next 3 months, probably a very long queue.)

Just as we can reduce bottlenecks in testing or code review by doing it as the code’s being written, we can reduce bottlenecks in decision-making by making decisions when they’re needed. And that means making the decisions ourselves a lot of the time.

Managers who are prepared to relinquish day-to-day control and let teams just get on with it are sadly not the norm. Sometimes that’s down to a lack of trust that a team can be left to get on with it. (And, to be fair, the team may have earned that mistrust through past disappointments.)

It can be a vicious circle: to earn trust and gain autonomy, teams need to deliver consistently and reliably. To deliver consistently and reliably, teams require a significant amount of autonomy.

My party trick as a contractor was to break the cycle, and I did it many times. If autonomy wouldn’t be given freely, as the lead developer, I would take it.

The team then has to put up or shut up, of course. There’s a window of opportunity to get the train moving, and to be seen to be moving. Changes have to happen fast. Look for low-hanging fruit!

That takes some nerve, though. It’s not in everybody’s comfort zone. (I remember being pulled aside by a team member and accused of “setting out to succeed” when we should be “covering our backsides” for when we inevitably fail. I don’t subscribe to that newsletter. I’m here to chew bubblegum and deliver software. And I’m all out of bubblegum.)

Once a consistent pattern of drama-free delivery’s been established, management will tend to back off. (Not always, of course – for some, it really is about control. But that’s a whole other series of posts.)

But the reality is inescapable. If the team doesn’t have sufficient autonomy, that will create bottlenecks in decision-making. I know. It’s so unfair!

The make-up of the team can also create bottlenecks. Over-specialised teams (the “front-end” team, the “back-end” team, the testing team, the “DevOps” team – an oxymoron if ever I saw one) will likely spend a lot of time waiting on other teams.

Organising teams around specific skillsets or technology areas suffers the same consequences as organising our code that way – the “UI layer”, the “services” layer, the “business logic” layer etc.

It ends up creating networks of teams that are tightly coupled to each other. To deliver anything end-to-end requires lots of inter-team communication and coordination, making business outcomes an order of magnitude harder to achieve.

Again, it’s about architecture – coupling and cohesion. Instead of “front-end”, “back-end”, “data” etc, how about the “Mortgage Applications” team, that encapsulates – as much as possible – the skills needed to deliver Mortgage Application functionality end-to-end?

Front-end, back-end, DB, ops, architecture, testing, security, UX design – these can simultaneously become internal communities of practice, if they’re given the time and the space to meet and to share ideas. This would be part of the 20-25% of the total dev budget that organisations need to invest to build long-term capability.

And even when we have cohesive, loosely-coupled teams organised around business outcomes, there’s still much potential for bottlenecks inside the team if members themselves are over-specialised.

If a team of 6 – a nice manageable number for communication – has one testing specialist, and nobody else on the team knows anything about software testing, then there’s going to be a queue forming for their services. That’s not conducive to testing as we code.

Flipping that around, if the testing specialist has no programming skills, they’re going to be waiting for someone to automate their tests. Either that, or the team relies entirely on manual regression testing.

Either way, the testing bottleneck is back!

If everybody on the team has some foundational testing and programming skills – the 20% we might need 80% of the time – then this bottleneck can be minimised. Team members may need to call on the expertise of a testing guru, perhaps inviting them to pair on a serious testing problem, but now one testing specialist may be enough for 5 programmers.

Developers who have deep expertise in one or two disciplines or technologies, but have a practical foundational grasp of others, are often referred to as “T-shaped”, or “generalising specialists”.

It takes quite a lot of experience and ongoing learning to become genuinely T-shaped, which is why T-shaped developers tend to have been working in software in a hands-on capacity for a decade or more.

(That’s not to say, of course, that every developer with a decade or more’s experience is T-shaped. There are plenty who’ve had “1 year’s experience 10 times”.)

The implication for team make-up is that it will be skewed towards more experienced people. Many managers endorse a diamond-shaped distribution of experience on a team, with maybe one very experienced lead or principal developer, a whole bunch of “mid-level” or “journeyman” developers, and maybe one or two junior trainees to maintain a healthy talent pipeline.

I, on the other hand, recommend an upside-down pyramid, with the bulk of the developers being very experienced and T-shaped, and with a narrower pipeline feeding up to that. An aging population, if you like.

And the implication of that is that teams are long-lived, and that the most skilled and experienced developers – regardless of their specialisms – stay developers.

The team is the real product.

Now, some of you may be thinking “But I don’t need to understand X, Y or Z, because Claude or GPT-5 can fill in those gaps for me.”

And it’s certainly true that hyperscale LLMs will have a lot of knowledge about other disciplines in their training data. But, as I’ve argued throughout this series, they don’t understand and they don’t think. So how that vast knowledge might be applied to real-world problems would, for me, be a genuine worry.

It may look plausible to my untrained eyes, but… Be wary of falling prey to the Gell-Mann Amnesia effect.

And I don’t think testers, architects, operations, product managers, security experts and others would be too happy to learn we don’t think what they do requires Actual Intelligence, just as we can be offended when people imply the same about us.

The AI-Ready Software Developer – Index

You attached a code-generating firehose to your dev plumbing, and measured the pressure of the water going in to be 10x what it was before.

So you can’t understand why the business is complaining that they’re not getting the power shower you promised them.

Also, why are the carpets so wet?

The explanation is actually quite simple.

Industry data and empirical studies about the impact of AI coding assistants on development team productivity show a clear trend.

AI code generation used by teams with bottlenecks, blockers and quality leaks in the development process makes delivery delays and problems in production worse.

And we have a pretty good idea why. Optimising a non-bottleneck like coding in a development system with real bottlenecks – like testing, code review and integration – will make those bottlenecks worse.
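
A toy model shows the effect. The sketch below is an illustration only, not a measurement: the stage names and daily capacities are invented, and it assumes work flows strictly in sequence from coding to review to testing.

```python
def simulate(stage_rates: dict[str, int], days: int) -> tuple[int, dict[str, int]]:
    """Push work through sequential stages, each with a daily capacity.

    The first stage generates new work at its own rate. Returns the number
    of items delivered and the queue left in front of each later stage.
    """
    stages = list(stage_rates)
    queues = {name: 0 for name in stages[1:]}
    delivered = 0
    for _ in range(days):
        incoming = stage_rates[stages[0]]
        for name in stages[1:]:
            queues[name] += incoming
            done = min(queues[name], stage_rates[name])
            queues[name] -= done
            incoming = done  # only finished work moves downstream
        delivered += incoming
    return delivered, queues


before = simulate({"coding": 10, "review": 8, "testing": 6}, days=10)
after = simulate({"coding": 100, "review": 8, "testing": 6}, days=10)
```

With these made-up numbers, both runs deliver 60 items in 10 days, because testing can only handle 6 a day. The difference is the review queue, which grows from 20 items to 920. That’s the wet carpet.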

Only teams who were already high-performing are seeing any tangible benefits from using AI.

It turns out that the key to being effective with AI coding assistants is being effective without them.

This is my guide – based on experience, experiment and evidence built over the last 3 years – to getting some actual value out of the code-generating firehose.


The AI-Ready Software Developer #14 – Continuous Architecture

They say a journey of a thousand miles starts with a single step, and the art of “AI”-assisted software development is very much putting one foot in front of the other. But we still need to look where we’re going.

One complaint that’s often leveled at micro-iterative development practices like Test-Driven Development and refactoring is that they can produce ad-hoc, “that’ll do for now” architectures.

There’s some truth to this for teams who are unskilled at high-level software design and lack the refactoring skills to reshape architecture as it emerges.

Software design at the level of individual tests or behaviours or modules could be considered the “short form”. Software architecture at the component and system level is the “long form”.

Other creative disciplines have their short forms and their long forms. A paragraph vs. a novel. A melody vs. a symphony. A scene vs. a movie.

We’ve probably all sat through a movie directed by someone very experienced at the short form (e.g., adverts) to find that, while individual scenes or shots are beautifully done, in the overall experience, pacing and structure are all over the place.

There is structure in a paragraph, in a melody, in a scene, and there is structure in a chapter, in a movement, and in an act. And there’s structure in a novel, in a symphony, and in a feature film.

It’s wheels within wheels. Or turtles all the way down, if you prefer.

What goes wrong in many dev teams is they’re not operating across these multiple levels of structure.

They may be entirely focused on the code window in front of them – beautiful prose, but the story makes no sense.

They may be thinking about their service-oriented architecture, but not taking care of the internal design of each service – a thrilling narrative, but it reads like it was written by a fourth grader. (This is a very real risk when architecture and implementation are considered separate activities or roles – or worse still, teams!)

And you’ll be amazed how often an “insignificant” implementation detail can end up fundamentally changing the higher-level architecture without teams realising it’s happened. It’s chilling the havoc that can be wreaked just by adding a dependency to left-pad strings.

This is especially true when we attach a code-generating firehose to the process. Structure is now emerging non-deterministically at a rate of knots, at least in part generated by LLMs who are gonna do what they’re gonna do, no matter what we told them to do.

And we must stay aware that LLMs do not operate effectively at larger scales of code organisation, mostly because of their very limited effective context windows, and the power-law distribution of examples in their training data – many short forms, vanishingly few long forms. Architecture just isn’t their strong suit.

So it’s essential to keep a handle on the structures as they emerge – visualising, analysing and steering the architecture into a form that’s going to do what we need today and be open to tomorrow’s inevitable changes.

The most effective developers don’t just focus on the code in front of them and the task at hand. They also see how it fits into a bigger picture. They see the jigsaw as well as the pieces.

When we consider composition – how the pieces fit together into bigger structures – then dependencies, coupling and cohesion start to loom large. As we climb the scale of code organisation, the arrows become more important than the boxes.

There are tools we can use to help us see beyond the code window. Some are simple – pencil and paper, marker and whiteboard, Sharpie and Post-It note. Some are very sophisticated, like Rational Rose. (Maybe a bit too sophisticated.)

Visualising what we’ve got has always proven to be valuable in helping us to comprehend and reason about design at a higher level. I’ll often spot a problem only when I’ve seen it on a diagram.

It’s also very handy for communicating design concepts more clearly and efficiently – an essential tool in collaborative design. And machine vision has matured to a point where that can include collaborating with “AI”.

Analysing what we’ve got – complexity, coupling, cohesion and other qualities of modular designs – is also a powerful tool for understanding problems and for exploring solutions to those problems.
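
As one small example of this kind of analysis, the sketch below counts incoming (afferent) and outgoing (efferent) dependencies for each module and derives Robert C. Martin’s instability metric, efferent / (afferent + efferent). The module names and the dependency map are hypothetical. In practice you’d extract them from the real code rather than typing them in.

```python
def coupling_metrics(deps: dict[str, set[str]]) -> dict[str, dict[str, float]]:
    """Compute afferent/efferent coupling and instability per module.

    deps maps each module to the set of modules it depends on.
    """
    modules = set(deps) | {t for targets in deps.values() for t in targets}
    metrics = {}
    for module in sorted(modules):
        # Outgoing arrows: what this module depends on (ignoring itself).
        efferent = len(deps.get(module, set()) - {module})
        # Incoming arrows: which other modules depend on this one.
        afferent = sum(
            1 for source, targets in deps.items()
            if module in targets and source != module
        )
        total = afferent + efferent
        metrics[module] = {
            "afferent": afferent,
            "efferent": efferent,
            "instability": efferent / total if total else 0.0,
        }
    return metrics


# Hypothetical dependency map, for illustration only.
example = coupling_metrics({
    "ui": {"services"},
    "services": {"domain", "db"},
    "reports": {"db"},
})
```

Here db has two incoming arrows and none going out (instability 0.0), so changing it risks rippling into services and reports, while ui depends on others but nothing depends on it (instability 1.0).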

Planning where we want to take the architecture next is the end goal of visualising and analysing it. It could involve a simple sketch on a whiteboard, or figuring out key roles, responsibilities and collaborations using CRC cards, or it could just be a conversation.

Figuring out how we’re going to get there safely, in a sequence of small feedback cycles, and without overwhelming the delivery process, is where we need to scale up our refactoring skills. Techniques like the Mikado Method for planning large-scale (long-form) refactorings can be very helpful here, provided you’ve got the small-scale refactoring skills to execute those plans.
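
Once a Mikado graph has been discovered (by attempting the change, noting what breaks, and reverting), the order of work falls out of it: leaves first, goal last. The sketch below covers only that ordering step, using Python’s standard graphlib module. The goal and prerequisites are invented for illustration.

```python
from graphlib import TopologicalSorter


def refactoring_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the steps so that every prerequisite comes before what needs it.

    graph maps each step to the set of steps that must be done first.
    """
    return list(TopologicalSorter(graph).static_order())


# Hypothetical Mikado graph: the goal and the prerequisites found along the way.
mikado = {
    "Replace legacy PaymentGateway": {
        "Introduce PaymentPort interface",
        "Cover checkout with tests",
    },
    "Introduce PaymentPort interface": {
        "Extract payment calls from OrderService",
    },
    "Extract payment calls from OrderService": set(),
    "Cover checkout with tests": set(),
}

steps = refactoring_order(mikado)
```

Each step in the resulting list is a small, safe refactoring that can be committed on its own, with the goal arriving last.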

Teams working with or without – but especially with – “AI” coding assistants need to master the short form and the long form, and all the forms in between. They need to see the wood and the trees.

As I argued in my previous post, big picture thinking is a job for Actual Intelligence.

And they need to visualise, analyse, communicate, plan and execute architecture continuously.

The AI-Ready Software Developer #13 – *You* Are The Intelligence

Human beings are funny old things. Over millions of years of evolution we’ve developed some traits that served us well in the wild, but might arguably work against us in our domesticated form.

We’re susceptible to psychological tics that can distort our thinking and make us act irrationally, and even act against our own interests.

One of those tics is our tendency to assign agency or intent to things that demonstrably don’t have it. We evolved to have Theory of Mind so that we can put ourselves in another person or animal’s shoes, and ask “Can that sabre-toothed tiger see me behind this tree?” or “Is Ugg planning to steal my best rock?”

The problems can start when we apply Theory of Mind to the weather (why does it insist on raining as soon as I put my coat on?), machinery (this washing machine hates me!), or – just for instance – a Large Language Model.

It’s understandable when we mistake software that matches patterns and predicts what comes next for something that actually thinks, because the patterns it’s matching are products of actual thinking – Actual Intelligence.

Heck, when he was stranded on that remote island, Tom Hanks formed a close friendship with a volleyball, and all that took was a handprint with eyes. The bar before anthropomorphism kicks in isn’t set very high.

Many LLM users ascribe qualities and abilities to the models that they demonstrably don’t have, like the ability to reason or to understand or to plan.

What they can do is to help us to reason and to understand and to plan.

Very importantly, we can also learn. In real time. From surprisingly few examples. And we don’t need a 100 MW power supply and the contents of Lake Michigan to do it.

In a collaboration between a human expert and an LLM, if we assign roles according to our strengths, the LLM is the powerful statistical pattern matcher and token predictor, trained on the sum total of current human knowledge – be it accurate or not – as of its training cut-off date. But it cannot think. It’s the world’s most well-read idiot. And we are the brains of the outfit.

We also need to remember that, despite what enthusiastic promoters of “agentic” coding assistants claim, LLMs have no capability to see the bigger picture and to think and plan strategically about things like the business domain, the user’s goals, the system architecture, or any of those “bird’s eye” concerns. Because they have no ability to think.

When we ask them to, they’ll “hallucinate” a high-level plan for us quite happily (and there I go, anthropomorphising). Like most “AI” output, it will look very plausible – more convincing than a handprint on a volleyball. But on closer inspection, there’s a very high probability that it will be full of Brown M&Ms. At such context sizes, it’s pretty much guaranteed.

And this is where psychology comes in again. Some people don’t see the problems. Maybe they don’t recognise them when they see them? Maybe they choose not to see them? Some folks really want to believe…

I have found it necessary to continually remind myself of the true nature of LLMs when I’m using them, and of the inherent – and very probably unfixable – limitations of their architecture.

The developers I’m seeing getting the best results using LLMs use them in ways that play to the tool’s strengths, and retain complete control over work that plays to theirs – keeping the LLM on a very short leash. They have the map. They set the route. They do the navigating.

The AI-Ready Software Developer #12 – Ground Truth

When Large Language Models hit the headlines in late 2022, with much speculation about impending Artificial General Intelligence (AGI) and the displacement of hundreds of millions of knowledge workers – including software developers – I naturally felt I needed to wrap my head around this technology.

After some initial “Wow! How is it doing this?” experimentation, the cracks soon started to show. Sessions with GPT-4 often ended in frustration as the LLM, if it could do what I wanted at all, would require lots of time-consuming coaxing and checking and fixing of outputs.

It would routinely “forget” instructions. It would routinely “lie” to me. And it made a lot of mistakes.

But it was still hard to see what was really going on. On first impressions, LLMs seem like magic.

When I played a game of chess with it, though, the tiger in the Magic Eye picture became visible. Once seen, it can’t be unseen.

For a fair few moves, I was again genuinely amazed that it was actually playing chess. Each move seemed reasonable, and reasoned. Sitting at my home office desk, staring at the screen, I was genuinely getting the feeling that there was some kind of mind looking back at me.

Eventually, the game reached a point where I could see mate in three if I sacrificed my queen. And I distinctly remember thinking, “But surely it can see that?”

It took the bait, and it was indeed checkmate in three more moves. Inevitably.

That’s when I saw the tiger. It doesn’t know where the pieces are. It doesn’t understand the rules of chess. And it’s not looking ahead in the way a human or, more exhaustively, a chess program does.

It’s literally matching the pattern in the sequence of moves so far against, presumably, a large corpus of chess game transcripts in its training data, and predicting what move comes next.

Could be a good move. Could be a bad move. It can’t tell the difference. It has no capacity to understand or reason about chess. It recursively matches input patterns to patterns in the model and predicts what token comes next.

And that’s how LLMs do everything. That’s how they summarise annual reports. That’s how they write poetry. And that’s how they write code. Could be good code. Could be bad code. They have no capacity to understand or reason about code. As a source of truth, this makes them too unreliable for any use case where fidelity matters.

In modular software design, there’s a principle for decoupling called “Tell, Don’t Ask”. I’m going to overload that principle and reuse it in this context.

Instead of asking an LLM for information related to the task at hand, we tell it what it needs to know. Models perform (match and predict) more accurately when the data they’re using comes from the real world and not from the model.

When you “talk” to an LLM, your conversation – your prompts, and the model’s replies – all form part of the context that the model is matching on. That includes all the bad chess moves and all the inaccurate summaries and all the bad poetry. And it also includes all the bad code. All the “hallucinated” libraries. All the incorrectly calculated test data. And all that jazz.

In previous posts, I explained why small, specific contexts work better – produce stronger predictions with fewer errors – and we can expand on that principle by making sure the context in each interaction contains a faithful representation of the real world as it pertains to the task: the code as it is right now, the tests as we specify them, the actual test run results, the actual linter output, the findings of our code review, and so on.
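
One way to picture this is a small helper that assembles each prompt from artefacts captured from the real world, rather than from the model’s earlier replies. This is a minimal sketch under my own assumptions: the function name build_context, the section labels and the sample artefacts are all hypothetical, and it is not tied to any particular tool’s API.

```python
def build_context(task: str, artefacts: dict[str, str]) -> str:
    """Assemble a prompt from a task plus ground-truth artefacts.

    artefacts maps a label (e.g. "test results") to text captured from the
    real source: a file on disk, a test run, a linter run, review notes.
    """
    sections = [f"TASK\n{task.strip()}"]
    for label, content in artefacts.items():
        sections.append(f"{label.upper()}\n{content.strip()}")
    return "\n\n".join(sections)


# Hypothetical artefacts; in practice these would be read from files and
# captured from actual tool runs, not typed in or recalled by the model.
prompt = build_context(
    "Make the failing test pass with the simplest change.",
    {
        "code": "def total(items):\n    return 0\n",
        "test results": "FAILED test_total_sums_prices - assert 0 == 15\n",
    },
)
```

The design choice is that nothing goes into the context on the model’s say-so. If the tests were re-run, the fresh output replaces the old.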

(“Ah, but Jason, LLMs are good at code review.” That doesn’t pass the “Brown M&Ms” test, I’m afraid. Don’t believe me? Take an Open Source code base on GitHub, insert unused imports into randomly-selected source files, and ask GPT-5 or Claude to find them. LLMs aren’t linters.)
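
For comparison, a deterministic check for unused imports takes only a few lines. The sketch below uses Python’s standard ast module and is deliberately simplified: it handles Python source only, and ignores things like __all__, re-exports and names that appear only inside strings. Real linters such as Pyflakes do this far more thoroughly.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in the source."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a.b as c" binds "c".
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":
                    imported.add(alias.asname or alias.name)
    # Every bare name referenced anywhere in the module.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


sample = (
    "import os\n"
    "import sys\n"
    "from math import sqrt, pi\n"
    "print(os.getcwd(), sqrt(4))\n"
)
result = unused_imports(sample)  # ['pi', 'sys']
```

It gives the same answer every time, for every file, which is exactly what we want from a source of ground truth.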

Ground every interaction in a more reliable truth. Use deterministic sources of information whenever possible.

And when the model tells you it’s raining, go outside and look!

The AI-Ready Software Developer #11 – Staying Sharp

As has been discussed in previous posts, there will be times – many times – when “AI” coding assistants simply won’t be able to do what we need them to do. And that means that we will be the ones writing or fixing or refactoring that code.

This has two implications:

I’ve been observing how increased reliance on “AI” coding assistants can erode our programming knowledge and skills. Our edge becomes dulled the more we let them do the thinking for us.

This atrophying of cognitive ability is now widely reported, and it’s a serious issue. LLMs aren’t anywhere near reliable enough that human thinking won’t be required, and the more we use them, the less we’re capable of it.

This isn’t a new thing, of course. I’ve observed how increasing reliance on copying and pasting code from sources like Stack Overflow and GitHub has also diminished people’s ability to comprehend and reason about code. It seems that our brains need to be fully engaged for stuff to really sink in.

“But Jason, you learned programming by copying code.”

Absolutely I did. In the 1980s, I’d copy code from books and magazines. Here’s the thing, though: to get the code from the page into my Commodore 64, I had to read it and type the code myself. It had to go in my eyes, through my brain and out my fingers.

Copying isn’t the problem. The problem is pasting. When we skip the “through the brain” step, we don’t engage with source material anywhere near as deeply.

I’d read the code, try to understand the code, write the code, then run it to see what it does. “What it does” is another way of describing the semantics of that code. I observe the syntax of it through copying it manually, and then learn the semantics from executing it.

So I’ve developed a facility for comprehending and reasoning about programs that’s similar in many ways to sight-reading for musicians. When I read code and write code, I can hear the notes. I am quite fluent in code.

(Developers who grew up in the age of copy & paste are often amazed by my seemingly magical ability to execute blocks of code in my head.)

“AI” coding assistants appear to be accelerating this decline in the fluency of software developers who rely on them often. It takes them much longer to understand code, and they find it harder to reason about it – to predict what the code will do if they, say, change an AND to an OR.

So when the tool fails us, we’re far out at sea, struggling to swim. (And “vibe coders” with no programming skills need rescuing.)

The effects of comprehension debt caused by letting “AI” generate code faster than you understand it are compounded by reliance on the tools eroding our programming abilities.

It’s essential, therefore, to keep your hand in. Write code every day. Read, understand, copy (don’t paste) and keep learning.

I’m especially concerned about junior developers relying on “AI” coding assistants. They might think they’re getting more done, but that’s not their main job. The main job of a junior developer is to grow into a senior developer. Reliance on “AI” will stunt your growth.

(I say the exact same thing about copying and pasting.)

For this reason, I recommend to teams that they keep these tools away from their least experienced developers. Yeah, I know. Tough love.

But even senior developers can easily find that over-relying on “AI” quickly turns them back into junior developers if they don’t balance it with a decent amount of hands-on practice.

Turn off your nav computers and use the Force!

The AI-Ready Software Developer #10 – Comprehension Debt

In my previous post, I talked about the need to recognise when an “AI” coding assistant is circling the event horizon of a “doom loop” and take the wheel.

Taking the wheel, of course, requires that you can still drive and you know where the car’s supposed to be going.

In the next post I’ll talk about why it’s so important to maintain your edge as a programmer when you’re using these tools. But in this post, I want to explore one specific aspect of that: our understanding of the code the “AI” is generating.

Legacy code is something that has many software developers running screaming for the hills. A large part of the fear of legacy code is that it can be hard to comprehend, because somebody else – probably somebody who isn’t around anymore – wrote it.

When we’re asked to make a change to code we didn’t have a hand in writing, to do that safely – without breaking the software – we first need to wrap our heads around that code. And that takes time.

Studies vary in the details, but there can be no doubting – from eight decades of the business of software – that developers spend a lot more time reading code than we do writing it.

(The wisdom holds therefore that we should optimise our approach for the ease of reading, not writing, code. Give it another eight decades, and maybe that message will finally sink in.)

The extra time it takes to understand code so that we can change it without breaking it is what I call comprehension debt. The bigger the gap to understanding, the bigger the debt that has to be paid, and the more expensive the change.

Attaching a code-generating firehose to our development process is an accelerant for the creation of comprehension debt. Pre-LLMs, legacy code was a big problem for our industry. Now it’s well on the way to being a major threat to society, with an increasing number of teams – often under pressure from management, who drank the “AI” Kool-Aid – pushing code nobody understands into production.

Maybe it works today, but what happens when it needs to change tomorrow? Because odds are, it will. Code that gets used gets changed.

It’s vitally important to keep on top of the code that the machine is spitting out at a vast rate of knots. It’s vitally important that we really understand it. We need to read it, think about it, and inwardly digest its meaning.

This puts a hard limit on the speed of code generation, which isn’t about how many tokens per second the model can predict, but how many tokens per second we can understand.

When we’re drinking from the firehose, the limit isn’t the firehose. The limit is us.

This is the main reason I don’t let “AI” coding assistants directly affect my source code without running suggestions by me first. Only when I’ve fully grokked – pun intended – the changes and agree with them (which isn’t often) will I let them be applied without any interventions from me.

Working in small steps, solving one problem at a time, really helps here. The less code there is to comprehend, the more focus I can give to every decision the model suggests. I keep “AI” coding assistants on a very short leash. You will not find me – unless for experimentation – using these tools in any kind of “autonomous” or “agentic” mode.

And, of course, the same factors that make code easier to comprehend apply regardless of who wrote the code. Simplicity, clear naming (“says what it does on the tin”), and effective separation of concerns – so we can understand one aspect of the system without having to understand many others – all have their place here.

The usual poor substitutes for clear code – comments and documentation – are what LLMs tend to fall back on, so I look for opportunities to incorporate those messages into the code itself if I feel it’s needed. (And quite often, the comments, docstrings etc that models like Claude Opus and GPT-5 will add to code turn out to be redundant anyway.)

When explaining what code does, I try to make it clear in the code itself – to have it tell its own story. When I feel that I need to explain why it does it the way it does, then I might use inline documentation of some kind, like a comment.
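As a rough illustration of that split, here’s a minimal Python sketch. The renewal rule, the 30-day figure and every name in it are invented for the example. The point is only that the names carry the “what”, while the one comment carries a “why” that the code can’t express by itself.

```python
from datetime import date

# Why 30 days? An assumed business rule for this sketch: customers are
# given one full billing cycle to respond before their policy lapses.
RENEWAL_NOTICE_DAYS = 30


def is_due_for_renewal_notice(expiry: date, today: date) -> bool:
    # The names say what is being checked, so no "what" comment is needed.
    days_until_expiry = (expiry - today).days
    return 0 <= days_until_expiry <= RENEWAL_NOTICE_DAYS
```

A docstring repeating “returns True if the policy is due for a renewal notice” would add nothing here that the function name doesn’t already say.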

Some “AI” coding assistant users will have the model generate Markdown files with explanations of what was done and why. These are about as useful as you’d expect, if you’ve ever been told to write, say, an architecture document. And, if you actually check the contents thoroughly, they usually don’t pass the “Brown M&Ms” test.

As one person put it: “Documentation is useful until you need it.” Often misleading. Often out of date. Often just ticking a box.

And, just as with legacy code, the big one is fast automated tests. The ability to quickly check that a change hasn’t broken anything is such a big factor in the cost of changing code that, in his book Working Effectively with Legacy Code, Michael Feathers defines “legacy code” as code that lacks those tests.

Well-written automated tests can also serve as living, executable documentation that shows us not just what we expect the code to do, but how to use or reuse it. I’ll take tests over comments and docstrings any day of the week.
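To show what I mean by tests as documentation, here’s a small sketch using Python’s built-in unittest module. The function, the discount rule and the numbers are all hypothetical. Read the test names and bodies, and you learn both what the code is expected to do and how to call it.

```python
import unittest


def order_total_pence(unit_price_pence: int, quantity: int,
                      bulk_threshold: int = 10,
                      bulk_discount_percent: int = 10) -> int:
    # Hypothetical pricing rule, invented for this example.
    total = unit_price_pence * quantity
    if quantity >= bulk_threshold:
        total = total * (100 - bulk_discount_percent) // 100
    return total


class OrderTotalExamples(unittest.TestCase):
    def test_small_orders_pay_full_price(self):
        self.assertEqual(order_total_pence(500, 2), 1000)

    def test_orders_at_the_bulk_threshold_get_ten_percent_off(self):
        self.assertEqual(order_total_pence(500, 10), 4500)

    def test_bulk_threshold_can_be_overridden(self):
        self.assertEqual(order_total_pence(500, 2, bulk_threshold=2), 900)
```

Saved to a file, these can be run with `python -m unittest`. Unlike a comment, they fail loudly the moment they stop being true.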

Anyhoo, back to the main point. When developers are generating code faster than they are understanding it, a mountain of comprehension debt can form very quickly.

It’s an age-old category mistake: optimising your dev process for adding, rather than changing, code.

You will pay for comprehension sooner or later, but remember that this debt accrues interest rapidly.

The AI-Ready Software Developer #9 – Well-Trodden Paths

A very common experience for LLM users is what I call the “doom loop”.

You ask the model to do something, and it gets it wrong. You say “That’s wrong”, and it apologises: “You’re absolutely right. Louis Armstrong was not the first person to set foot on the Moon. Let me try that again.”

Then it will proceed to either make the exact same mistake again, or a completely new mistake as it tries to fix the first one.

By now, experienced users should be aware that there are some things LLMs simply cannot do. This is where the mask slips and we get a glimpse into their true nature.

Let’s consider a simple example: times tables.

LLMs – even the hyperscale “frontier” models – are good at multiplication… Until they’re not.

Image

It’s completely understandable, when we watch an LLM correctly calculate 3×3, 9×7, 11×4, that we might conclude that it can do multiplication. It’s a multiplying genius!

But as the factors get higher, the model starts to get the answers wrong more and more often. The graph above clearly shows some kind of distribution of accuracy that tails off rapidly.

This distribution of accuracy almost certainly corresponds to the distribution of examples the model was trained on.

Image

When I look online for times tables resources, they rarely go above 12×12. Examples that go up to 20×20 are vanishingly rare.

LLMs do not do multiplications. They pattern-match multiplications taken from examples they were trained on. The fewer the examples, the lower the confidence of their matches, and therefore their token predictions. This is “hallucination” territory.

They perform well on problems that are well-represented in the training data. They perform poorly in the long tail of scarce examples.

Put more simply, there are going to be all kinds of things the model just can’t do – no matter how we prompt it – because the data simply isn’t there.

Text-to-image diffusion models suffer the exact same limitation, and illustrate the problem graphically. Try getting Midjourney to generate an image of a wine glass full to the brim.

Image

The skill here isn’t about prompting or “context engineering”. The skill here is recognising when we’re trying to get a cat to lay eggs.

You may have noticed that examples of “vibe-coded” software generated entirely by “AI” coding assistants are almost always – in fact, I don’t think I’ve seen one that isn’t – solving problems that have been solved many times before. The Calendar app. The TO-DO list. The data aggregator. The HTTP server. etc etc.

One of my favourite jokes is to respond to a social media post boasting about the “app” that Lovable or Cursor generated for someone in “hours” instead of “weeks” with a screen grab of me forking something very similar on GitHub and exclaiming how my tool “did it in seconds”.

We’ve got two choices here. We can stick completely to well-solved problems, which isn’t great news for innovation, and isn’t likely to distinguish your product much – because, basically, if your “AI” coding assistant can do it in hours, anybody’s can.

Or we can recognise this limitation, and work around it. Some cookie cutter jobs are suited to LLMs – “boilerplate” code gets mentioned a lot. Solving hard and novel problems is probably best suited to human minds.

Andrej Karpathy, who coined the term “vibe coding”, would appear to agree.

Image

Again, here the skill is not in how we use “AI” coding assistants, but in when we use them, and when we know to write the code ourselves.

Recognising a doom loop as the model circles its event horizon, and knowing when to cut our losses and intervene by hand, is a skill more and more users of “AI” tools are learning.

I only have my own experiences of using tools like Claude Code and Cursor, and watching other developers use them, to go on here at the moment. (Please point me to any non-vendor-backed studies on this if you know of any.) But after a couple of thousand hours at the wheel, I’ve noticed how, if the model hasn’t managed to do it in the first pass, the odds that it will be able to do it at all are less than 50/50.

LLMs have effective maximum context sizes that are orders of magnitude smaller than advertised. Every new interaction – every additional pass at a problem – takes the context further and further out of the model’s data distribution and into “hallucination” territory.

In code generation, I usually apply a policy of zero tolerance. If a change breaks the code, I do a hard reset – not just of the code, but of the context – and get the model to try again with a “fresh pair of eyes”, typically after I’ve broken the problem down into smaller, less improbable steps.

And I’ll maybe give that 2-3 goes, and if I’m still not getting any joy, I write the code myself. I like to think that I recognise when we’ve gone out of the model’s distribution. I don’t waste any more time trying to get a picture of “an empty room with no elephant in it”. (That’s a fun one to try, by the way.)
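For what it’s worth, that policy can be sketched as a loop. This is my own rough rendering in Python, not a feature of any tool: `generate_change`, `tests_pass` and `reset` are hypothetical callables standing in for “ask the model”, “run the tests” and “revert the code and start a new context”.

```python
from typing import Callable


def try_with_fresh_context(
    generate_change: Callable[[], None],
    tests_pass: Callable[[], bool],
    reset: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Return True if a generated change passed, False if it's time to write it by hand."""
    for _ in range(max_attempts):
        generate_change()
        if tests_pass():
            return True
        # Zero tolerance: a broken change is thrown away, and so is the context.
        reset()
    return False


# Usage with fakes: the first two attempts break the tests, the third passes.
outcomes = iter([False, False, True])
log = []
succeeded = try_with_fresh_context(
    generate_change=lambda: log.append("generate"),
    tests_pass=lambda: next(outcomes),
    reset=lambda: log.append("reset"),
)
```

The part that matters is the `return False`. The loop has a hard stop, and what comes after it is a human writing the code.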

The upshot of all this is that you will be writing a significant portion of the code yourself. We’ll talk about that in the next post.

We humans, of course, also have our data distributions – the things we’ve learned in the past – and will often find ourselves working outside of our distribution on novel problems and with novel technologies.

But we have abilities that LLMs don’t. We can reason – actually reason, not just pattern-match reasoning – and we can learn. We can learn fast, from surprisingly few examples, and we can learn cheap. We don’t require gigawatt power and city-scale water supplies, or to see a gazillion examples, to figure out how to use a new library. So this makes us far more suited to navigating unfamiliar territory. We wouldn’t be here today if we weren’t.

Finally, there are implications here for the technologies we can successfully let LLMs work with. Python code is very well-represented in training data sources, for example. Mainframe COBOL, on the other hand, is seen far less often on sites like GitHub and Stack Overflow.

When I’ve tried to work with languages or libraries that are niche, I’ve noticed a much higher incidence of problems in generated code. Basically, the model’s “guessing”.

In those instances, like Andrej, I’ve ended up writing pretty much all of the code myself. It’s just so much quicker.

The AI-Ready Software Developer #8 – Continuous Integration

This new “age of AI” has produced a paradox. While individual developers report “huge” productivity gains, bashing out code faster than ever, these gains mysteriously evaporate when we observe what actually makes it into the hands of end users.

Actually, there’s no mystery. We’ve understood for many decades why individual productivity doesn’t translate into team productivity, and why more code faster doesn’t mean more value sooner.

It’s a common misapprehension; developers who are constrained within “coding” see their part of the process as the development process, and confuse speeding up the creation of code with speeding up the creation of value.

The reality is that “coding” hasn’t been the bottleneck in software development since programmers were literally punching holes in cards representing individual binary digits.

Take any system that has bottlenecks and optimise a non-bottleneck, and you’ll make the real bottlenecks, and overall system performance, worse.
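A toy model makes the point. This is a deliberately crude Python sketch with made-up rates, assuming a two-stage pipeline where changes are coded and then reviewed. Doubling the coding rate delivers nothing extra, because review is the constraint. It just grows the queue.

```python
def simulate(coding_rate: int, review_rate: int, hours: int) -> tuple[int, int]:
    """Return (changes delivered, changes still queued for review)."""
    queue = 0
    delivered = 0
    for _ in range(hours):
        queue += coding_rate               # new changes arrive from "coding"
        reviewed = min(queue, review_rate)  # review can only handle so many per hour
        queue -= reviewed
        delivered += reviewed
    return delivered, queue


before = simulate(coding_rate=5, review_rate=5, hours=8)
after = simulate(coding_rate=10, review_rate=5, hours=8)
```

In this model, `before` is 40 changes delivered with nothing waiting, and `after` is the same 40 delivered with another 40 stuck in the review queue.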

When we work with “AI” coding assistants, we need to be thinking at the system level. What are the downstream consequences of producing more code faster?

More to test? More to review? More to fix? More to refactor? And more to merge into the release branch. Otherwise ain’t no users getting nothing no time soon.

The Theory of Constraints teaches us that bigger batch sizes – e.g., larger change sets – create longer queues (for code review, for testing, for integration, for deployment, for customer feedback) and systemic delays, and important work ends up languishing like goods lorries stuck in a 6-mile tailback on a Kent motorway.

Counterintuitively for a lot of people, reducing the batch sizes and limiting work in progress actually makes the system faster – in the sense that work is delivered sooner. And businesses like “sooner”, often even more than they like “cheaper”.
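Here’s a back-of-an-envelope version of that in Python. It’s a simplified model of my own, not a calculation taken from the Theory of Constraints literature: one change is finished per unit of time, and a change only reaches users when the whole batch it belongs to has been finished. Per-batch overheads such as review and deployment time are ignored.

```python
from math import ceil


def mean_delivery_time(items: int, batch_size: int) -> float:
    """Average time at which a change reaches users, under the assumptions above."""
    delivery_times = [
        # A change ships when the last change in its batch is finished.
        min(ceil(i / batch_size) * batch_size, items)
        for i in range(1, items + 1)
    ]
    return sum(delivery_times) / items


one_big_batch = mean_delivery_time(items=12, batch_size=12)
batches_of_four = mean_delivery_time(items=12, batch_size=4)
one_at_a_time = mean_delivery_time(items=12, batch_size=1)
```

Same 12 changes, same total effort. In one big batch the average change arrives at time 12. In batches of four it arrives at time 8, and one at a time it arrives at 6.5. Nobody typed any faster.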

Practices we’ve already looked at – solving one problem at a time, testing continuously, reviewing code continuously, refactoring continuously – all help to reduce batch sizes. We don’t solve all of the problems at once. We don’t do all of our testing at the end. We don’t leave code review until all the code’s been written. And so on. We drink from the code-generating firehose one mouthful at a time.

But that is all for naught if, after all that, we try to merge all our changes in one big batch. Teams who work on their own isolated, long-lived branches and who only integrate when, say, a feature is complete – e.g., by submitting Pull Requests that require peer review – are experiencing worsening delays using “AI” to generate code.

To be fair, they were probably experiencing pretty bad delays before “AI”, as evidenced by their 6-mile-long backlogs. But the firehose is demonstrably making the delays worse.

Bosses hooking the firehose to their dev teams expecting to get a power shower are a little disappointed with the results, to say the least. Naively, this is often attributed to teams “using the AI wrong”. That’s rarely the case. The “wrong” here is the development process they’re using it in.

Merging smaller change sets more often reduces these delays. Merging them continuously minimises them.

“But Jason, don’t the changes need to be tested first?”

Yes. And they have been. Continuously.

“Ah, but doesn’t the code need to be reviewed?”

Absolutely. And it has been, and every problem discovered has been addressed. Continuously.

I can imagine that some teams, watching us automating tests, stopping every couple of minutes to run those tests, reviewing the code every few minutes, refactoring every few minutes, committing every couple of minutes and pushing to the trunk many times an hour, might think “Wow. They’re so slow!”

But our code is ready for immediate release at any time. And we don’t have a huge backlog of work waiting in queues. So our lead times are very short.

Who’s slow now?

Curiously, the latest DORA State of AI-Assisted Software Development report finds a clear trend. Teams who were already continuously testing, reviewing, refactoring and integrating experience a modest but measurable increase in software delivery throughput and a reduction in lead times when they use “AI”, without sacrificing reliability.

Teams working on long-lived branches and relying on after-the-fact testing and code review experience worse systemic performance – lower throughput, longer lead times, and more boo-boos in production.

This is why this series of blog posts, which could maybe become a book or a course or a musical or a cake, refers to developers being “AI-ready” instead of “AI-assisted”.

The key to being effective using “AI” coding assistants is being effective without them.