The AI-Ready Software Developer #15 – It’s (Still) About The Team

So far, I’ve explored technical bottlenecks in the development process. In this post I want to talk about people bottlenecks. In particular, I want to talk about the team.

So, you’re working in small batches and micro feedback loops, solving one problem at a time. You’re testing continuously. You’re reviewing code continuously. You’re integrating continuously.

But are you making the decisions when they need to be made? Or are you waiting for permission from above?

Lack of autonomy, perhaps as a result of a lack of trust in the team, can create serious bottlenecks in the development process. The less decision-making power teams – and individuals on teams – have, the more decisions have to be sent up the chain of command, and potentially the longer we have to wait for a thumbs-up.

It’s just another queue. (And if the boss is one of those whose diary’s blocked out for the next 3 months, probably a very long queue.)

Just as we can reduce bottlenecks in testing or code review by doing them as the code's being written, we can reduce bottlenecks in decision-making by making decisions when they're needed. And that means making the decisions ourselves a lot of the time.

Managers who are prepared to relinquish day-to-day control and let teams just get on with it are sadly not the norm. Sometimes that’s down to a lack of trust that a team can be left to get on with it. (And, to be fair, the team may have earned that mistrust through past disappointments).

It can be a vicious circle: to earn trust and gain autonomy, teams need to deliver consistently and reliably. To deliver consistently and reliably, teams require a significant amount of autonomy.

My party trick as a contractor was to break the cycle, and I did it many times. If autonomy wouldn’t be given freely, as the lead developer, I would take it.

The team then has to put up or shut up, of course. There’s a window of opportunity to get the train moving, and to be seen to be moving. Changes have to happen fast. Look for low-hanging fruit!

That takes some nerve, though. It’s not in everybody’s comfort zone. (I remember being pulled aside by a team member and accused of “setting out to succeed” when we should be “covering our backsides” for when we inevitably fail. I don’t subscribe to that newsletter. I’m here to chew bubblegum and deliver software. And I’m all out of bubblegum.)

Once a consistent pattern of drama-free delivery’s been established, management will tend to back off. (Not always, of course – for some, it really is about control. But that’s a whole other series of posts.)

But the reality is inescapable. If the team doesn’t have sufficient autonomy, that will create bottlenecks in decision-making. I know. It’s so unfair!

The make-up of the team can also create bottlenecks. Over-specialised teams (the “front-end” team, the “back-end” team, the testing team, the “DevOps” team – an oxymoron if ever I saw one) will likely spend a lot of time waiting on other teams.

Organising teams around specific skillsets or technology areas suffers the same consequences as organising our code that way – the “UI layer”, the “services” layer, the “business logic” layer etc.


It ends up creating networks of teams that are tightly coupled to each other. To deliver anything end-to-end requires lots of inter-team communication and coordination, making business outcomes an order of magnitude harder to achieve.

Again, it’s about architecture – coupling and cohesion. Instead of “front-end”, “back-end”, “data” etc, how about the “Mortgage Applications” team, that encapsulates – as much as possible – the skills needed to deliver Mortgage Application functionality end-to-end?


Front-end, back-end, DB, ops, architecture, testing, security, UX design – these can simultaneously become internal communities of practice, if they’re given the time and the space to meet and to share ideas. This would be part of the 20-25% of the total dev budget that organisations need to invest to build long-term capability.

And even when we have cohesive, loosely-coupled teams organised around business outcomes, there’s still much potential for bottlenecks inside the team if members themselves are over-specialised.

If a team of 6 – a nice manageable number for communication – has one testing specialist, and nobody else on the team knows anything about software testing, then there’s going to be a queue forming for their services. That’s not conducive to testing as we code.

Flipping that around, if the testing specialist has no programming skills, they’re going to be waiting for someone to automate their tests. Either that, or the team relies entirely on manual regression testing.

Either way, the testing bottleneck is back!

If everybody on the team has some foundational testing and programming skills – the 20% we might need 80% of the time – then this bottleneck can be minimised. Team members may need to call on the expertise of a testing guru, perhaps inviting them to pair on a serious testing problem, but now one testing specialist may be enough for 5 programmers.

Developers who have deep expertise in one or two disciplines or technologies, but have a practical foundational grasp of others, are often referred to as “T-shaped”, or “generalising specialists”.

It takes quite a lot of experience and ongoing learning to become genuinely T-shaped, which is why T-shaped developers tend to have been working in software in a hands-on capacity for a decade or more.

(That’s not to say, of course, that every developer with a decade or more’s experience is T-shaped. There are plenty who’ve had “1 year’s experience 10 times”.)

The implication for team make-up is that it will be skewed towards more experienced people. Many managers endorse a diamond-shaped distribution of experience on a team, with maybe one very experienced lead or principal developer, a whole bunch of “mid-level” or “journeyman” developers, and maybe one or two junior trainees to maintain a healthy talent pipeline.

I, on the other hand, recommend an upside-down pyramid, with the bulk of the developers being very experienced and T-shaped, and with a narrower pipeline feeding up to that. An aging population, if you like.

And the implication of that is that teams are long-lived, and that the most skilled and experienced developers – regardless of their specialisms – stay developers.

The team is the real product.

Now, some of you may be thinking “But I don’t need to understand X, Y or Z, because Claude or GPT-5 can fill in those gaps for me.”

And it’s certainly true that hyperscale LLMs will have a lot of knowledge about other disciplines in their training data. But, as I’ve argued throughout this series, they don’t understand and they don’t think. So how that vast knowledge might be applied to real-world problems would, for me, be a genuine worry.

It may look plausible to my untrained eyes, but… Be wary of falling prey to the Gell-Mann Amnesia effect.

And I don’t think testers, architects, operations, product managers, security experts and others would be too happy to learn we don’t think what they do requires Actual Intelligence, just as we can be offended when people imply the same about us.

The AI-Ready Software Developer – Index

You attached a code-generating firehose to your dev plumbing, and measured the pressure of the water going in to be 10x what it was before.

So you can’t understand why the business is complaining that they’re not getting the power shower you promised them.

Also, why are the carpets so wet?

The explanation is actually quite simple.

Industry data and empirical studies about the impact of AI coding assistants on development team productivity show a clear trend.

AI code generation used by teams with bottlenecks, blockers and quality leaks in the development process makes delivery delays and problems in production worse.

And we have a pretty good idea why. Optimising a non-bottleneck like coding in a development system with real bottlenecks – like testing, code review and integration – will make those bottlenecks worse.

Only teams who were already high-performing are seeing any tangible benefits from using AI.

It turns out that the key to being effective with AI coding assistants is being effective without them.

This is my guide – based on experience, experiment and evidence built over the last 3 years – to getting some actual value out of the code-generating firehose.


The AI-Ready Software Developer #14 – Continuous Architecture

They say a journey of a thousand miles starts with a single step, and the art of “AI”-assisted software development is very much putting one foot in front of the other. But we still need to look where we’re going.

One complaint that’s often levelled at micro-iterative development practices like Test-Driven Development and refactoring is that they can produce ad-hoc, “that’ll do for now” architectures.

There’s some truth to this for teams who are unskilled at high-level software design and lack the refactoring skills to reshape architecture as it emerges.

Software design at the level of individual tests or behaviours or modules could be considered the “short form”. Software architecture at the component and system level is the “long form”.

Other creative disciplines have their short forms and their long forms. A paragraph vs. a novel. A melody vs. a symphony. A scene vs. a movie.

We’ve probably all sat through a movie directed by someone very experienced at the short form (e.g., adverts), only to find that, while individual scenes or shots are beautifully done, in the overall experience the pacing and structure are all over the place.

There is structure in a paragraph, in a melody, in a scene, and there is structure in a chapter, in a movement, and in an act. And there’s structure in a novel, in a symphony, and in a feature film.

It’s wheels within wheels. Or turtles all the way down, if you prefer.

What goes wrong in many dev teams is they’re not operating across these multiple levels of structure.

They may be entirely focused on the code window in front of them – beautiful prose, but the story makes no sense.

They may be thinking about their service-oriented architecture, but not taking care of the internal design of each service – a thrilling narrative, but it reads like it was written by a fourth grader. (This is a very real risk when architecture and implementation are considered separate activities or roles – or worse still, teams!)

And you’ll be amazed how often an “insignificant” implementation detail can end up fundamentally changing the higher-level architecture without teams realising it’s happened. It’s chilling the havoc that can be wreaked just by adding a dependency to left-pad strings.

This is especially true when we attach a code-generating firehose to the process. Structure is now emerging non-deterministically at a rate of knots, at least in part generated by LLMs who are gonna do what they’re gonna do, no matter what we told them to do.

And we must stay aware that LLMs do not operate effectively at larger scales of code organisation, mostly because of their very limited effective context windows, and the power-law distribution of examples in their training data – many short forms, vanishingly few long forms. Architecture just isn’t their strong suit.

So it’s essential to keep a handle on the structures as they emerge – visualising, analysing and steering the architecture into a form that’s going to do what we need today and be open to tomorrow’s inevitable changes.

The most effective developers don’t just focus on the code in front of them and the task at hand. They also see how it fits into a bigger picture. They see the jigsaw as well as the pieces.

When we consider composition – how the pieces fit together into bigger structures – then dependencies, coupling and cohesion start to loom large. As we climb the scale of code organisation, the arrows become more important than the boxes.


There are tools we can use to help us see beyond the code window. Some are simple – pencil and paper, marker and whiteboard, Sharpie and Post-It note. Some are very sophisticated, like Rational Rose. (Maybe a bit too sophisticated.)

Visualising what we’ve got has always proven to be valuable in helping us to comprehend and reason about design at a higher level. I’ll often spot a problem only when I’ve seen it on a diagram.

It’s also very handy for communicating design concepts more clearly and efficiently – an essential tool in collaborative design. And machine vision has matured to a point where that can include collaborating with “AI”.

Analysing what we’ve got – complexity, coupling, cohesion and other qualities of modular designs – is also a powerful tool for understanding problems and for exploring solutions to those problems.
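To make that concrete, here’s a minimal sketch of one such analysis: counting each module’s efferent coupling (fan-out to other modules in the system) by parsing its imports with Python’s standard `ast` module. The module names and the toy “system” are invented for illustration; real tools measure much more, but even this crude signal flags the modules most exposed to ripple effects.

```python
import ast

def efferent_coupling(sources: dict[str, str]) -> dict[str, int]:
    """Count how many *other* modules in the system each module imports.

    `sources` maps module name -> source code. High fan-out is a fragility
    signal: a change in any dependency can ripple into that module.
    """
    names = set(sources)
    coupling = {}
    for module, src in sources.items():
        imported = set()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        # only count dependencies on modules inside our own system
        coupling[module] = len((imported & names) - {module})
    return coupling

# A hypothetical three-module system:
system = {
    "orders": "import pricing\nimport customers\n",
    "pricing": "import math\n",          # stdlib import, not counted
    "customers": "import pricing\n",
}
print(efferent_coupling(system))  # {'orders': 2, 'pricing': 0, 'customers': 1}
```

Run over a real codebase, numbers like these are a starting point for a conversation, not a verdict.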

Planning where we want to take the architecture next is the end goal of visualising and analysing it. It could involve a simple sketch on a whiteboard, or figuring out key roles, responsibilities and collaborations using CRC cards, or it could just be a conversation.

Figuring out how we’re going to get there safely, in a sequence of small feedback cycles, and without overwhelming the delivery process, is where we need to scale up our refactoring skills. Techniques like the Mikado Method for planning large-scale (long-form) refactorings can be very helpful here, provided you’ve got the small-scale refactoring skills to execute those plans.
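The heart of the Mikado Method is a graph of goals and prerequisites, worked leaves-first so the build stays green at every step. Here’s a minimal sketch of that idea; the refactoring goals in the graph are invented for illustration, and a real Mikado graph is discovered by attempting changes and reverting, not written down up front.

```python
# A Mikado-style graph: each refactoring goal maps to the prerequisite
# changes that must land (with the build green) before it can.
graph = {
    "extract PricingService": ["break Order->DB dependency", "add pricing tests"],
    "break Order->DB dependency": ["introduce repository interface"],
    "add pricing tests": [],
    "introduce repository interface": [],
}

def execution_order(graph, goal):
    """Depth-first walk: do the leaves (prerequisites) before the goal."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for prerequisite in graph.get(node, []):
            visit(prerequisite)
        order.append(node)
    visit(goal)
    return order

print(execution_order(graph, "extract PricingService"))
```

Each entry in the resulting list is a small, safe step that can be delivered on its own – which is exactly what keeps a long-form refactoring from overwhelming the delivery process.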

Teams working with or without – but especially with – “AI” coding assistants need to master the short form and the long form, and all the forms in between. They need to see the wood and the trees.

As I argued in my previous post, big picture thinking is a job for Actual Intelligence.

And they need to visualise, analyse, communicate, plan and execute architecture continuously.

The AI-Ready Software Developer #13 – *You* Are The Intelligence

Human beings are funny old things. Over millions of years of evolution we’ve developed some traits that served us well in the wild, but might arguably work against us in our domesticated form.

We’re susceptible to psychological tics that can distort our thinking and make us act irrationally – even against our own interests.

One of those tics is our tendency to assign agency or intent to things that demonstrably don’t have it. We evolved to have Theory of Mind so that we can put ourselves in another person or animal’s shoes, and ask “Can that sabre-toothed tiger see me behind this tree?” or “Is Ugg planning to steal my best rock?”

The problems can start when we apply Theory of Mind to the weather (why does it insist on raining as soon as I put my coat on?), machinery (this washing machine hates me!), or – just for instance – a Large Language Model.

It’s understandable when we mistake software that matches patterns and predicts what comes next for something that actually thinks, because the patterns it’s matching are products of actual thinking – Actual Intelligence.

Heck, when he was stranded on that remote island, Tom Hanks formed a close friendship with a volleyball, and all that took was a handprint with eyes. The bar before anthropomorphism kicks in isn’t set very high.

Many LLM users ascribe qualities and abilities to the models that they demonstrably don’t have, like the ability to reason or to understand or to plan.

What they can do is to help us to reason and to understand and to plan.

Very importantly, we can also learn. In real time. From surprisingly few examples. And we don’t need a 100 MW power supply and the contents of Lake Michigan to do it.

In a collaboration between a human expert and an LLM, if we assign roles according to our strengths, the LLM is the powerful statistical pattern matcher and token predictor, trained on the sum total of current human knowledge – be it accurate or not – as of its training cut-off date. But it cannot think. It’s the world’s most well-read idiot. And we are the brains of the outfit.

We also need to remember that, despite what enthusiastic promoters of “agentic” coding assistants claim, LLMs have no capability to see the bigger picture and to think and plan strategically about things like the business domain, the user’s goals, the system architecture, or any of those “bird’s eye” concerns. Because they have no ability to think.

When we ask them to, they’ll “hallucinate” a high-level plan for us quite happily (and there I go, anthropomorphising). Like most “AI” output, it will look very plausible – more convincing than a handprint on a volleyball. But on closer inspection, there’s a very high probability that it will be full of Brown M&Ms. At such context sizes, it’s pretty much guaranteed.

And this is where psychology comes in again. Some people don’t see the problems. Maybe they don’t recognise them when they see them? Maybe they choose not to see them? Some folks really want to believe…

I have found it necessary to continually remind myself of the true nature of LLMs when I’m using them, and of the inherent – and very probably unfixable – limitations of their architecture.

The developers I’m seeing getting the best results using LLMs use them in ways that play to the tool’s strengths, and retain complete control over work that plays to theirs – keeping the LLM on a very short leash. They have the map. They set the route. They do the navigating.

The AI-Ready Software Developer #12 – Ground Truth

When Large Language Models hit the headlines in late 2022, with much speculation about impending Artificial General Intelligence (AGI) and the displacement of hundreds of millions of knowledge workers – including software developers – I naturally felt I needed to wrap my head around this technology.

After some initial “Wow! How is it doing this?” experimentation, the cracks soon started to show. Sessions with GPT-4 often ended in frustration as the LLM, if it could do what I wanted at all, would require lots of time-consuming coaxing and checking and fixing of outputs.

It would routinely “forget” instructions. It would routinely “lie” to me. And it made a lot of mistakes.

But it was still hard to see what was really going on. On first impressions, LLMs seem like magic.

When I played a game of chess with it, though, the tiger in the Magic Eye picture became visible. Once seen, it can’t be unseen.

For a fair few moves, I was again genuinely amazed that it was actually playing chess. Each move seemed reasonable, and reasoned. Sitting at my home office desk, staring at the screen, I was genuinely getting the feeling that there was some kind of mind looking back at me.

Eventually, the game reached a point where I could see mate in three if I sacrificed my queen. And I distinctly remember thinking, “But surely it can see that?”

It took the bait, and it was indeed checkmate in three more moves. Inevitably.

That’s when I saw the tiger. It doesn’t know where the pieces are. It doesn’t understand the rules of chess. And it’s not looking ahead in the way a human or, more exhaustively, a chess program does.

It’s literally matching the pattern in the sequence of moves so far against, presumably, a large corpus of chess game transcripts in its training data, and predicting what move comes next.

Could be a good move. Could be a bad move. It can’t tell the difference. It has no capacity to understand or reason about chess. It recursively matches input patterns to patterns in the model and predicts what token comes next.

And that’s how LLMs do everything. That’s how they summarise annual reports. That’s how they write poetry. And that’s how they write code. Could be good code. Could be bad code. They have no capacity to understand or reason about code. As a source of truth, this makes them too unreliable for any use case where fidelity matters.

In modular software design, there’s a principle for decoupling called “Tell, Don’t Ask”. I’m going to overload that principle and reuse it in this context.

Instead of asking an LLM for information related to the task at hand, we tell it what it needs to know. Models perform (match and predict) more accurately when the data they’re using comes from the real world and not from the model.

When you “talk” to an LLM, your conversation – your prompts, and the model’s replies – all form part of the context that the model is matching on. That includes all the bad chess moves and all the inaccurate summaries and all the bad poetry. And it also includes all the bad code. All the “hallucinated” libraries. All the incorrectly calculated test data. And all that jazz.

In previous posts, I explained why small, specific contexts work better – they produce stronger predictions with fewer errors. We can expand on that principle by making sure the context in each interaction contains a faithful representation of the real world as it pertains to the task: the code as it is right now, the tests as we specify them, the actual test run results, the actual linter output, the findings of our code review, and so on.

(“Ah, but Jason, LLMs are good at code review.” That doesn’t pass the “Brown M&Ms” test, I’m afraid. Don’t believe me? Take an Open Source code base on GitHub, insert unused imports into randomly-selected source files, and ask GPT-5 or Claude to find them. LLMs aren’t linters.)
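The point is that an unused import is a deterministic property of the code, checkable in a few lines – no pattern-matching required. Here’s a toy sketch of such a check using Python’s `ast` module (real linters like pyflakes do this properly; this just illustrates what “deterministic source of truth” means):

```python
import ast

def unused_imports(source: str) -> list[str]:
    """A toy, deterministic unused-import check."""
    imported, used = {}, set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # the name actually bound in the module's namespace
                bound = alias.asname or alias.name.split(".")[0]
                imported[bound] = node.lineno
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(name for name in imported if name not in used)

code = "import os\nimport sys\n\nprint(sys.argv)\n"
print(unused_imports(code))  # ['os']
```

Feed the output of a check like this into the model’s context, rather than asking the model to find the problem itself.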

Ground every interaction in a more reliable truth. Use deterministic sources of information whenever possible.

And when the model tells you it’s raining, go outside and look!

The AI-Ready Software Developer #11 – Staying Sharp

As has been discussed in previous posts, there will be times – many times – when “AI” coding assistants simply won’t be able to do what we need them to do. And that means that we will be the ones writing or fixing or refactoring that code.

This has two implications: we need to keep the skills to write that code ourselves, and – as we’ll see – relying on the tools erodes exactly those skills.

I’ve been observing how increased reliance on “AI” coding assistants can erode our programming knowledge and skills. Our edge becomes dulled the more we let them do the thinking for us.

This atrophying of cognitive ability is now widely reported, and it’s a serious issue. LLMs aren’t anywhere near reliable enough that human thinking won’t be required, and the more we use them, the less we’re capable of it.

This isn’t a new thing, of course. I’ve observed how increasing reliance on copying and pasting code from sources like Stack Overflow and GitHub has also diminished people’s ability to comprehend and reason about code. It seems that our brains need to be fully engaged for stuff to really sink in.

“But Jason, you learned programming by copying code.”

Absolutely I did. In the 1980s, I’d copy code from books and magazines. Here’s the thing, though: to get the code from the page into my Commodore 64, I had to read it and type the code myself. It had to go in my eyes, through my brain and out my fingers.

Copying isn’t the problem. The problem is pasting. When we skip the “through the brain” step, we don’t engage with source material anywhere near as deeply.

I’d read the code, try to understand the code, write the code, then run it to see what it does. “What it does” is another way of describing the semantics of that code. I’d observe the syntax of it through copying it manually, and then learn the semantics from executing it.

So I’ve developed a facility for comprehending and reasoning about programs that’s similar in many ways to sight-reading for musicians. When I read code and write code, I can hear the notes. I am quite fluent in code.

(Developers who grew up in the age of copy & paste are often amazed by my seemingly magical ability to execute blocks of code in my head.)

“AI” coding assistants appear to be accelerating this decline in the fluency of software developers who rely on them often. It takes them much longer to understand code, and they find it harder to reason about it – to predict what the code will do if they, say, change an AND to an OR.

So when the tool fails us, we’re far out at sea, struggling to swim. (And “vibe coders” with no programming skills need rescuing.)

The effects of comprehension debt caused by letting “AI” generate code faster than you understand it are compounded by reliance on the tools eroding our programming abilities.

It’s essential, therefore, to keep your hand in. Write code every day. Read, understand, copy (don’t paste) and keep learning.

I’m especially concerned about junior developers relying on “AI” coding assistants. They might think they’re getting more done, but that’s not their main job. The main job of a junior developer is to grow into a senior developer. Reliance on “AI” will stunt your growth.

(I say the exact same thing about copying and pasting.)

For this reason, I recommend to teams that they keep these tools away from their least experienced developers. Yeah, I know. Tough love.

But even senior developers can find that over-relying on “AI” quickly turns them back into junior developers if they don’t balance it with a decent amount of hands-on practice.

Turn off your nav computers and use the Force!

The AI-Ready Software Developer #10 – Comprehension Debt

In my previous post, I talked about the need to recognise when an “AI” coding assistant is circling the event horizon of a “doom loop” and take the wheel.

Taking the wheel, of course, requires that you can still drive and you know where the car’s supposed to be going.

In the next post I’ll talk about why it’s so important to maintain your edge as a programmer when you’re using these tools. But in this post, I want to explore one specific aspect of that: our understanding of the code the “AI” is generating.

Legacy code is something that has many software developers running screaming for the hills. A large part of the fear of legacy code is that it can be hard to comprehend, because somebody else – probably somebody who isn’t around anymore – wrote it.

When we’re asked to make a change to code we didn’t have a hand in writing, to do that safely – without breaking the software – we first need to wrap our heads around that code. And that takes time.

Studies vary in the details, but there can be no doubting – from eight decades of the business of software – that developers spend a lot more time reading code than we do writing it.

(The wisdom holds therefore that we should optimise our approach for the ease of reading, not writing, code. Give it another eight decades, and maybe that message will finally sink in.)

The extra time it takes to understand code so that we can change it without breaking it is what I call comprehension debt. The bigger the gap to understanding, the bigger the debt that has to be paid, and the more expensive the change.

Attaching a code-generating firehose to our development process is an accelerant for the creation of comprehension debt. Pre-LLMs, legacy code was a big problem for our industry. Now it’s well on the way to being a major threat to society, with an increasing number of teams – often under pressure from management, who drank the “AI” Kool-Aid – pushing code nobody understands into production.

Maybe it works today, but what happens when it needs to change tomorrow? Because odds are, it will. Code that gets used gets changed.

It’s vitally important to keep on top of the code that the machine is spitting out at a vast rate of knots. It’s vitally important that we really understand it. We need to read it, think about it, and inwardly digest its meaning.

This puts a hard limit on the speed of code generation, which isn’t about how many tokens per second the model can predict, but how many tokens per second we can understand.

When we’re drinking from the firehose, the limit isn’t the firehose. The limit is us.

This is the main reason I don’t let “AI” coding assistants directly affect my source code without running suggestions by me first. Only when I’ve fully grokked – pun intended – the changes and agree with them (which isn’t often) will I let them be applied without any interventions from me.

Working in small steps, solving one problem at a time, really helps here. The less code there is to comprehend, the more focus I can give to every decision the model suggests. I keep “AI” coding assistants on a very short leash. You will not find me – unless for experimentation – using these tools in any kind of “autonomous” or “agentic” mode.

And, of course, the same factors that make code easier to comprehend apply regardless of who wrote the code. Simplicity, clear naming (“says what it does on the tin”), and effective separation of concerns – so we can understand one aspect of the system without having to understand many others – all have their place here.

The usual poor substitutes for clear code – comments and documentation – are what LLMs tend to fall back on, so I look for opportunities to incorporate those messages into the code itself if I feel it’s needed. (And quite often, the comments, docstrings etc that models like Claude Opus and GPT-5 will add to code turn out to be redundant anyway.)

When explaining what code does, I try to make it clear in the code itself – to have it tell its own story. When I feel that I need to explain why it does it the way it does, then I might use inline documentation of some kind, like a comment.
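A minimal before-and-after sketch of that distinction – the names and the threshold here are invented for illustration:

```python
# Before: a comment carries the "what".
def process(orders):
    # keep only orders big enough for free shipping
    return [o for o in orders if o > 50]

# After: the code says what it does; the only comment left explains *why*.
FREE_SHIPPING_THRESHOLD = 50  # agreed with the carrier (hypothetical); revisit yearly

def orders_qualifying_for_free_shipping(order_totals):
    return [total for total in order_totals if total > FREE_SHIPPING_THRESHOLD]

print(orders_qualifying_for_free_shipping([20, 60, 75]))  # [60, 75]
```

The behaviour is identical; the difference is how much the reader has to reconstruct for themselves.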

Some “AI” coding assistant users will have the model generate Markdown files with explanations of what was done and why. These are about as useful as you’d expect, if you’ve ever been told to write, say, an architecture document. And, if you actually check the contents thoroughly, they usually don’t pass the “Brown M&Ms” test.

As one person put it: “Documentation is useful until you need it.” Often misleading. Often out of date. Often just ticking a box.

And, just as with legacy code, the big one is fast automated tests. The ability to quickly check that a change hasn’t broken anything is such a big factor in the cost of changing code that, in his book Working Effectively With Legacy Code, Michael Feathers defines “legacy code” as code that lacks those tests.

Well-written automated tests can also serve as living, executable documentation that shows us not just what we expect the code to do, but how to use or reuse it. I’ll take tests over comments and docstrings any day of the week.
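For example, a test like this one documents the behaviour by name and doubles as a usage example. (`Money` is a hypothetical value type, invented purely for illustration.)

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    amount: int    # minor units (pence), sidestepping float rounding
    currency: str

    def add(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError("can't mix currencies")
        return Money(self.amount + other.amount, self.currency)

# The test names the behaviour and shows exactly how to construct
# and combine the objects - documentation that can't quietly go stale,
# because it runs.
def test_adding_money_in_the_same_currency():
    subtotal = Money(1999, "GBP").add(Money(500, "GBP"))
    assert subtotal == Money(2499, "GBP")

test_adding_money_in_the_same_currency()
```

Unlike a comment, if this stops being true, the build tells you.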

Anyhoo, back to the main point. When developers are generating code faster than they are understanding it, a mountain of comprehension debt can form very quickly.

It’s an age-old category mistake: optimising your dev process for adding, rather than changing, code.

You will pay for comprehension sooner or later, but remember that this debt accrues interest rapidly.

The AI-Ready Software Developer #9 – Well-Trodden Paths

A very common experience for LLM users is what I call the “doom loop”.

You ask the model to do something, and it gets it wrong. You say “That’s wrong”, and it apologises: “You’re absolutely right. Louis Armstrong was not the first person to set foot on the Moon. Let me try that again.”

Then it will proceed to either make the exact same mistake again, or a completely new mistake as it tries to fix the first one.

By now, experienced users should be aware that there are some things LLMs simply cannot do. This is where the mask slips and we get a glimpse into their true nature.

Let’s consider a simple example: times tables.

LLMs – even the hyperscale “frontier” models – are good at multiplication… Until they’re not.

[Graph: LLM multiplication accuracy, tailing off as the factors grow]

It’s completely understandable, when we watch an LLM correctly calculate 3×3, 9×7, 11×4, that we might conclude that it can do multiplication. It’s a multiplying genius!

But as the factors get higher, the model starts to get the answers wrong more and more often. The graph above clearly shows some kind of distribution of accuracy that tails off rapidly.

This distribution of accuracy almost certainly corresponds to the distribution of examples the model was trained on.

[Image: distribution of times tables examples found in online resources]

When I look online for times tables resources, they rarely go above 12×12. Examples that go up to 20×20 are vanishingly rare.

LLMs do not do multiplication. They pattern-match multiplications taken from examples they were trained on. The fewer the examples, the lower the confidence of their matches, and therefore of their token predictions. This is “hallucination” territory.

They perform well on problems that are well-represented in the training data. They perform poorly in the long tail of scarce examples.

Put more simply, there are going to be all kinds of things the model just can’t do – no matter how we prompt it – because the data simply isn’t there.
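You can probe this for yourself by measuring accuracy per factor-size bucket. A minimal sketch: `probe_multiplication` works with any callable that takes a prompt and returns a reply, so you could wire it to a real model API; the `fake_model` below is my own stand-in that just mimics the tail-off for illustration.

```python
import random
import re

def probe_multiplication(ask, max_factor=20, samples_per_bucket=50):
    """Measure multiplication accuracy per factor-size bucket.

    `ask` is any callable taking a prompt string and returning a string
    reply -- a placeholder for a real model API call.
    """
    results = {}
    for n in range(2, max_factor + 1):
        correct = 0
        for _ in range(samples_per_bucket):
            a, b = random.randint(2, n), random.randint(2, n)
            reply = ask(f"What is {a} x {b}? Reply with the number only.")
            if reply.strip() == str(a * b):
                correct += 1
        results[n] = correct / samples_per_bucket
    return results

def fake_model(prompt):
    # A stand-in "model": reliable up to 12x12, guessy beyond --
    # mimicking the distribution of training examples found online.
    a, b = map(int, re.findall(r"\d+", prompt))
    if a <= 12 and b <= 12:
        return str(a * b)
    return str(a * b + random.choice([0, 0, 1, -1, 10]))
```

Plot `probe_multiplication(fake_model)` and you get the same shape as the graph above: near-perfect accuracy in the well-trodden 12×12 range, degrading as the factors leave the training distribution.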

Text-to-image diffusion models suffer the exact same limitation, and illustrate the problem graphically. Try getting Midjourney to generate an image of a wine glass full to the brim.

[Image: Midjourney’s attempt at “a wine glass full to the brim”]

The skill here isn’t about prompting or “context engineering”. The skill here is recognising when we’re trying to get a cat to lay eggs.

You may have noticed that examples of “vibe-coded” software generated entirely by “AI” coding assistants are almost always – in fact, I don’t think I’ve seen one that isn’t – solving problems that have been solved many times before. The Calendar app. The TO-DO list. The data aggregator. The HTTP server. etc etc.

One of my favourite jokes is to respond to a social media post boasting about the “app” that Lovable or Cursor generated for someone in “hours” instead of “weeks” with a screen grab of me forking something very similar on GitHub and exclaiming how my tool “did it in seconds”.

We’ve got two choices here. We can stick completely to well-solved problems, which isn’t great news for innovation, and isn’t likely to distinguish your product much – because, basically, if your “AI” coding assistant can do it in hours, anybody’s can.

Or we can recognise this limitation, and work around it. Some cookie cutter jobs are suited to LLMs – “boilerplate” code gets mentioned a lot. Solving hard and novel problems is probably best suited to human minds.

Andrej Karpathy, the inventor of “vibe coding”, would appear to agree.

[Image: screenshot of a post by Andrej Karpathy]

Again, here the skill is not in how we use “AI” coding assistants, but in when we use them, and when we know to write the code ourselves.

Recognising a doom loop as the model circles its event horizon, and knowing when to cut our losses and intervene by hand, is a skill more and more users of “AI” tools are learning.

I only have my own experiences of using tools like Claude Code and Cursor, and watching other developers use them, to go on here at the moment. (Please point me to any non-vendor-backed studies on this if you know of any.) But after a couple of thousand hours at the wheel, I’ve noticed how, if the model hasn’t managed to do it in the first pass, the odds that it will be able to do it at all are less than 50/50.

LLMs have effective maximum context sizes that are orders of magnitude smaller than advertised. Every new interaction – every additional pass at a problem – takes the context further and further out of the model’s data distribution and into “hallucination” territory.

In code generation, I usually apply a policy of zero tolerance. If a change breaks the code, I do a hard reset – not just of the code, but of the context – and get the model to try again with a “fresh pair of eyes”, typically after I’ve broken the problem down into smaller, less improbable steps.

And I’ll maybe give that 2-3 goes, and if I’m still not getting any joy, I write the code myself. I like to think that I recognise when we’ve gone out of the model’s distribution. I don’t waste any more time trying to get a picture of “an empty room with no elephant in it”. (That’s a fun one to try, by the way.)

The upshot of all this is that you will be writing a significant portion of the code yourself. We’ll talk about that in the next post.

We humans, of course, also have our data distributions – the things we’ve learned in the past – and will often find ourselves working outside of our distribution on novel problems and with novel technologies.

But we have abilities that LLMs don’t. We can reason – actually reason, not just pattern-match reasoning – and we can learn. We can learn fast, from surprisingly few examples, and we can learn cheap. We don’t require gigawatt power and city-scale water supplies, or to see a gazillion examples, to figure out how to use a new library. So this makes us far more suited to navigating unfamiliar territory. We wouldn’t be here today if we weren’t.

Finally, there are implications here for the technologies we can successfully let LLMs work with. Python code is very well-represented in training data sources, for example. Mainframe COBOL, on the other hand, is seen far less often on sites like GitHub and Stack Overflow.

When I’ve tried to work with languages or libraries that are niche, I’ve noticed a much higher incidence of problems in generated code. Basically, the model’s “guessing”.

In those instances, like Andrej, I’ve ended up writing pretty much all of the code myself. It’s just so much quicker.

The AI-Ready Software Developer #8 – Continuous Integration

This new “age of AI” has produced a paradox. While individual developers report “huge” productivity gains, bashing out code faster than ever, these gains mysteriously evaporate when we observe what actually makes it into the hands of end users.

Actually, there’s no mystery. We’ve understood for many decades why individual productivity doesn’t translate into team productivity, and why more code faster doesn’t mean more value sooner.

It’s a common misapprehension: developers who are confined to “coding” see their part of the process as the whole development process, and confuse speeding up the creation of code with speeding up the creation of value.

The reality is that “coding” hasn’t been the bottleneck in software development since programmers were literally punching holes in cards representing individual binary digits.

Take any system that has bottlenecks and optimise a non-bottleneck, and you’ll make those real bottlenecks and overall system performance worse.

When we work with “AI” coding assistants, we need to be thinking at the system level. What are the downstream consequences of producing more code faster?

More to test? More to review? More to fix? More to refactor? And more to merge into the release branch. Otherwise ain’t no users getting nothing no time soon.

The Theory of Constraints teaches us that bigger batch sizes – e.g., larger change sets – create longer queues (for code review, for testing, for integration, for deployment, for customer feedback) and systemic delays, and important work ends up languishing like goods lorries stuck in a 6-mile tailback on a Kent motorway.

Counterintuitively for a lot of people, reducing the batch sizes and limiting work in progress actually makes the system faster – in the sense that work is delivered sooner. And businesses like “sooner”, often even more than they like “cheaper”.
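A toy model makes the arithmetic concrete. Suppose 100 changes each take one unit of review-and-merge effort, and a change only reaches users when its whole batch merges. The numbers are illustrative, not from any study:

```python
def lead_times(num_changes, batch_size):
    """Time at which each change reaches users, when changes are
    reviewed and merged together in batches of `batch_size`."""
    times, t = [], 0
    for start in range(0, num_changes, batch_size):
        batch = min(batch_size, num_changes - start)
        t += batch                 # the whole batch is processed together
        times.extend([t] * batch)  # every change lands when its batch merges
    return times

def average(xs):
    return sum(xs) / len(xs)

big = average(lead_times(100, batch_size=100))   # one big batch: 100.0
small = average(lead_times(100, batch_size=1))   # continuous merging: 50.5
```

Total effort is identical either way. What changes is how long each piece of work sits in the queue – which is exactly the lead time that businesses feel.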

Practices we’ve already looked at – solving one problem at a time, testing continuously, reviewing code continuously, refactoring continuously – all help to reduce batch sizes. We don’t solve all of the problems at once. We don’t do all of our testing at the end. We don’t leave code review until all the code’s been written. And so on. We drink from the code-generating firehose one mouthful at a time.

But that is all for naught if, after all that, we try to merge all our changes in one big batch. Teams who work on their own isolated, long-lived branches and who only integrate when, say, a feature is complete – e.g., by submitting Pull Requests that require peer review – are experiencing worsening delays using “AI” to generate code.

To be fair, they were probably experiencing pretty bad delays before “AI”, as evidenced by their 6-mile-long backlogs. But the firehose is demonstrably making the delays worse.

Bosses hooking the firehose to their dev teams expecting to get a power shower are a little disappointed with the results, to say the least. Naively, this is often attributed to teams “using the AI wrong”. That’s rarely the case. The “wrong” here is the development process they’re using it in.

Merging smaller change sets more often reduces these delays. Merging them continuously minimises them.

“But Jason, don’t the changes need to be tested first?”

Yes. And they have been. Continuously.

“Ah, but doesn’t the code need to be reviewed?”

Absolutely. And it has been, and every problem discovered has been addressed. Continuously.

I can imagine some teams, watching us automate tests, stop every couple of minutes to run those tests, review the code every few minutes, refactor every few minutes, commit every couple of minutes, and push to the trunk many times an hour, thinking “Wow. They’re so slow!”

But our code is ready for immediate release at any time. And we don’t have a huge backlog of work waiting in queues. So our lead times are very short.

Who’s slow now?

Curiously, the latest DORA State of AI-Assisted Software Development report finds a clear trend. Teams who were already continuously testing, reviewing, refactoring and integrating experience a modest but measurable increase in software delivery throughput and a reduction in lead times when they use “AI”, without sacrificing reliability.

Teams working on long-lived branches and relying on after-the-fact testing and code review experience worse systemic performance – lower throughput, longer lead times, and more boo-boos in production.

This is why this series of blog posts, that could maybe become a book or a course or a musical or a cake, refers to developers being “AI-ready” instead of “AI-assisted”.

The key to being effective using “AI” coding assistants is being effective without them.

The AI-Ready Software Developer #7 – Commit On Green, Revert On Red

Imagine you’re walking a tightrope tied to the peaks of two mountains. When you reach the middle, it’s a long way to safety – forwards or backwards – and a long way down if you fall.

Changing code’s a bit like walking a tightrope. Every step we take risks a fall, and the more changes we make, the more likely we are to experience catastrophe.

[Image: a tightrope strung between two mountain peaks]

Now imagine the same rope, but this time it’s tied to wooden posts just a few feet apart and a few feet tall. The risk of a fall with each step remains the same, but you’re never far from safety, and if you do fall, it’s no big deal. You can just climb back up and carry on from the last post you reached.

If “safety” in terms of software means code we’re confident works, and that we could therefore ship if we wanted to, then we want those safe points to be close together and “low to the ground” – easy to climb back on at the last safe point and try again if we fall.

In practice, this means getting into the habit of committing our changes whenever we see all the tests pass. Provided they’re good tests, of course. Another thing LLM coding assistants are notorious for is generating meaningless, “weak” tests – or even commenting them out when they fail. Gosh, I wonder where they learned that? Monkey see, monkey do. My advice? Test your tests!
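“Test your tests” can itself be made concrete: deliberately break the code (by hand, or with a mutation testing tool) and check that the test goes red. A minimal sketch – the function names and the broken variant are mine, for illustration:

```python
def add(a, b):
    return a + b

def broken_add(a, b):      # a deliberate mutation of add()
    return a - b

def weak_test(add_fn):
    # The kind of "weak" test LLMs often generate: it exercises the
    # code but asserts nothing meaningful about the result.
    assert add_fn(2, 2) is not None

def strong_test(add_fn):
    assert add_fn(2, 2) == 4

def catches_mutant(test, mutant):
    """Run a test against deliberately broken code.
    True means the test went red -- it's doing its job."""
    try:
        test(mutant)
        return False   # the test passed against broken code: weak test
    except AssertionError:
        return True
```

Here `catches_mutant(weak_test, broken_add)` comes back `False` – the weak test happily passes against broken code, which tells you it isn’t protecting you.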

This works hand-in-hand with working in short feedback loops, solving one problem at a time and testing continuously. The bigger the feedback loops, the more changes between safe points, the further apart the wooden posts get, and the bigger the drop if we fall.

And the bigger the sunk cost.

If I change one line of code and tests fail, it’s no big deal to figure out which change broke the code. I can usually see the problem and fix it quickly. If not, I can revert to the previous working commit and try again with very little time lost.

If I change 100 lines of code and tests fail… Well, now I have to figure out which of those 100 changes broke it, and if I can’t, that’s a lot of time lost with a reset. In this situation, we’ll naturally be unwilling to cut our losses.

LLMs can generate a lot of changes very quickly, and because they understand nothing, each change is significantly more likely to break the software.

And models can’t distinguish between working code and broken code. It’s all just context to a language model. Ideally, we don’t want the broken stuff figuring in its machinations, so it’s important to remove broken code as soon as it appears, so the model is building on solid ground whenever possible.

The easy way to do that is a hard reset back to the previous working commit. Otherwise, we can send the model into a “doom loop” where it keeps trying to fix the problem, but actually makes things worse with each attempt, contaminating the context for subsequent passes. This usually means resetting the context, too.

Some “AI” coding assistant users report success with a “three strikes and out” policy. If the tests fail, the model is given two more attempts to fix any problems, before a hard reset. But I’ve been finding that a “zero tolerance” approach works well for me. I revert the code, adapt the prompt – often looking for a smaller intermediate step – and ask the model to try again.

(And, yes, I do have a policy on how many attempts I’ll allow before I write the code myself. We’ll also be talking about when to grab the wheel in a future post.)

LLMs work better on a clean slate.
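The whole loop fits in a few lines of shell. This demo builds a throwaway repo so it’s self-contained; in real use, you’d run your actual test suite where the `grep` stand-in appears:

```shell
#!/bin/sh
# Commit on green, revert on red, demonstrated in a throwaway git repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

echo 'print("ok")' > app.py
git add -A && git commit -qm "green: initial working state"

echo 'broken line' >> app.py            # a change that breaks the code

# Stand-in for running the test suite (pytest, npm test, etc.):
if ! grep -q "broken" app.py; then
    git add -A && git commit -qm "green: small verified step"
else
    git reset --hard -q HEAD            # red: back to the last safe point
fi

cat app.py                              # the known-good version survives
```

Every green run becomes a wooden post a few feet away; every red run costs you seconds, not an afternoon of archaeology.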