It’s the System, Stupid!

Since this “Age of A.I.” arrived in late 2022, something’s been nagging at me. As more and more data rolls in, we see an apparent paradox emerging where “A.I.” coding assistants are concerned.

Individual developers report productivity gains using these tools (though many also report significant frustrations with, for example, “hallucinations”).

And at the same time, data clearly shows that the more teams use them, the bigger the negative impact on team outcomes like delivery throughput and release stability.

How can both these things be true?

We have one very plausible candidate for a causal mechanism, and it’s an age-old story in our industry.

When programmers get a feeling that they’re getting things done faster, they’re often only considering the part where they write the code – particularly when that’s their part of the process.

What they’re not considering is the whole software development process, and especially downstream activities like testing, code review, merging, deployment and operations.

More code faster can mean bigger change sets – more to test (and more bugs to fix), more code to review (and more refactorings to get it through review), more changes to merge (and more conflicts to resolve), and so on.

“A.I.” code generation’s a local optimisation that can come at the expense of the development system as a whole, especially if that system is more batch-oriented, with design, coding, testing, review, merging and release operating like sequential phases in the delivery of a new feature. In such a system, more code faster means bigger bottlenecks later. So there’s no paradox at all: one causes the other.

When teams work in much smaller cycles – make one change, test it, review the code, refactor, commit that and maybe push it to the trunk – they may experience far fewer downstream bottlenecks, with or without “A.I.” coding assistance. Arguably, coding assistants might make little noticeable difference in such a workflow.

The DORA data strongly indicates that the teams with the shortest lead times and the highest release stability tend to work this way, with continuous testing, code review and merging as the code’s being written.

And all this got me to thinking: maybe we’re targeting machine learning and “A.I.” at the wrong problem. Instead of focusing on individual developer productivity with things like code generation, perhaps this technology would yield more fruit if it were focused on systemic issues and reducing bottlenecks.

For example, instead of using ML models to generate code, could they be more productively applied to reviewing it? Could a “smart” linter reduce the need for after-the-fact code review?

Of course, many of us already enjoy the benefits of “smart” linters. We call it “pair programming” or “ensemble programming”. And, having used static code analysis tools that incorporate statistical models or neural networks, I can say the results weren’t all that impressive. It’s hard to see such a tool significantly out-performing a classic linter plus a second pair of experienced eyes (if such eyes are available to you, of course – and maybe that’s the use case).

Perhaps the real value might be found in widening our view. What if a model (or models) could be trained on data collected across the entire cycle, from product strategy through to operational telemetry, support and beyond?

Imagine a model that, given, say, a Figma UI wireframe, could predict how many support calls you’d be likely to get about it. Imagine a model that, given a source file, could predict its mean time to failure in production.

More generally, imagine a model that could, with reasonable accuracy, predict the downstream impact of upstream activities, so that as SuperDuperAgenticAI spits out its slop, alarm bells start going off about where this is likely to lead if it gets any further.

A pipe dream, you might think. But in actual fact, such predictive technologies exist in other disciplines like electronic engineering, where statistical and ML models are used to predict the reliability and probable lifetimes of printed circuit boards, for example.

There would be some major hurdles to overcome to apply similar techniques to software development, though, not least of which is the jungle of higgledy-piggledy data formats our many proprietary tools and platforms produce. Electronics has established data interchange standards. We, for the most part, do not – probably because that would require enough of us to agree on some stuff, and that isn’t really our strong suit.

But, if these challenges could be overcome, or worked around (e.g., with a translation/encoding layer), I’m pretty sure there are patterns hidden in our complex and multi-dimensional workflow data that maybe nobody’s spotted yet. I mean, we’ve barely scratched the surface in the last 70+ years.

In a very handwavy sense, though, I feel quite sure now that “A.I.” is being targeted at the wrong problem in software: an exclusive focus on individual developer productivity, when the focus should be on the system as a whole.

In the meantime, we’re pretty sure at this point that things like continuous design, continuous testing, continuous code review and continuous integration do have a positive systemic impact, so focusing on those is probably the most productive thing I can do for the foreseeable future.


If your team would like training and mentoring in the technical practices that we know speed up delivery cycles, shorten lead times and improve product and system reliability, with or without “A.I.”, pay us a visit.

The A-Z of Code Craft – D is for D.R.Y.

“Don’t Repeat Yourself” is a widely misunderstood, often misapplied, and consequently much-maligned principle in the design of software.

While it’s true that repetition in code can hurt us, by multiplying the cost of change, it’s by no means the worst thing we can do. Indeed, sometimes repetition can help us if it makes code easier to understand. (If you refactor code to remove duplication, stop to ask if that’s made it harder to follow. If it has, put the duplication back!)

But that’s not what D.R.Y. is really about. Think of it this way: what’s the opposite of duplication? REUSE.

When we see multiple repetitions of a similar thing – be it copied-and-pasted code, or a repeated concept that appears in multiple places (I remember one application that had 3 Customer tables in the database, each created by different people for different features) – that’s a hint about what our design needs to be.

When we refactor to consolidate, we discover the need for reusable abstractions like parameterised functions or shared classes. Duplication points us towards potential modularity.
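To make that concrete with a made-up example (the function names and discount rates are invented for illustration): two copied-and-pasted pricing functions, differing only in the rate they apply, consolidate into one parameterised function.

```python
# Before: two near-duplicate functions, differing only in the rate they apply.
def net_price_standard(gross):
    return gross * (1 - 0.25)

def net_price_premium(gross):
    return gross * (1 - 0.50)

# After: the duplication pointed us to a reusable, parameterised abstraction.
def net_price(gross, discount_rate):
    return gross * (1 - discount_rate)
```

The consolidated version isn’t speculative reuse – both call sites already exist, so we know `net_price` will pay for itself.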

This is an evidence-based approach to design. We don’t speculate that a function might be reused; we see where it will be reused – we see the need for it in the current code.

Duplication in code can act as bread crumbs leading us to a better design and to genuinely useful – because they’re being used – reusable components. Removing duplication is where some of our most popular libraries and frameworks came from.

As for taking it too far, it’s certainly true that jumping on duplication too quickly can produce over-abstracted code, and a much higher risk of choosing the wrong abstractions. The more examples we see, the more likely an abstraction is to be both the right one and to actually pay for itself in the future.

But let the duplication build up, and the refactoring’s going to take longer. In the zero-sum game of software development, things that take longer are less likely to happen, so we need to strike a balance.

The “Rule of Three” is a rough and ready guide for how many examples we might want to see before we refactor. Sometimes more, sometimes fewer, but on average, around three.

Scale is also a factor here. Reuse creates dependencies. If those cross team boundaries, it really needs to be worth it.

Don’t forget, either, that repetition applies not just to our code, but also to our process for creating it. Automating repeated tests (regression tests) is a good example of how refactoring duplication of effort in our process can streamline delivery.
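As a tiny sketch of that idea (the rule and names are hypothetical): a check we’d otherwise repeat by hand before every release, captured once as an automated regression test.

```python
# A hypothetical formatting rule we'd otherwise re-check manually each release.
def order_reference(customer_id, order_number):
    return f"ORD-{customer_id:04d}-{order_number:06d}"

# The manual check, captured once as an automated regression test:
# references must stay zero-padded, or downstream systems break.
def test_order_reference_is_zero_padded():
    assert order_reference(42, 7) == "ORD-0042-000007"

test_order_reference_is_zero_padded()
```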

Be mindful, though, that just as over-abstraction is a risk in refactoring duplicated code, over-zealous automation is a risk in refactoring duplicated effort. I’ve worked with teams who have so many scripts and custom tools that it takes new joiners weeks or even months to get up to speed, and some of those tools saved them less time and money than they took to create and maintain.


If you’re serious about building your team’s capability to rapidly, reliably and sustainably evolve software to meet rapidly changing business needs, my Code Craft and Test-Driven Development live remote training workshops are HALF PRICE until March 31st 2025.

The A-Z of Code Craft – C is for Continuous

Old-fashioned approaches to creating software often encourage us to think of the activities involved as stages or phases in the process: the design phase, the coding phase, the testing phase, the integration phase, the release phase, and so on.

This approach has some major drawbacks. In fact, many of us have found that it simply doesn’t work on problems of any appreciable complexity.

The moment we start writing code, we see how the design needs to change. The moment we start testing, we see how the code needs to change. The moment we integrate our changes, we see how ours or other people’s code needs to change. The moment we release working software into the world, we learn how the software needs to change.

Around and around we go, feeding back our lessons into a never-ending continuous cycle of designing, coding (which I might argue is also designing), testing, integrating and releasing. The lines between these activities become very blurred. If I’m writing a failing test in a test-driven approach, am I designing, or am I coding, or am I testing? When I’m refactoring code, am I designing, or coding, or testing?

The correct answer is: YES.

And if we work backwards from the goal of having working software that can be shipped at any time, we inevitably arrive at the need for continuous integration, and that doesn’t work without continuous testing, and that doesn’t work if we try to design and write all the code before we do any testing. Instead, we work in micro feedback loops, progressing one small step at a time, gathering feedback throughout so we can iterate towards a good result.
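One such micro feedback loop, sketched with a hypothetical example: the test is written first (and fails), just enough code is written to make it pass, then we refactor and go around again.

```python
# Step 1: a failing test, written first. It pins down one small behaviour.
def test_leap_years():
    assert is_leap_year(2024)
    assert not is_leap_year(1900)   # century years usually aren't leap years...
    assert is_leap_year(2000)       # ...unless divisible by 400

# Step 2: just enough code to make the test pass.
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Step 3: run the test, refactor if needed, commit, and repeat.
test_leap_years()
```

In that one loop, were we designing, coding or testing? YES.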

But the continuous loops don’t end there. To ensure the software’s open to change, we also need to be continuously reviewing or inspecting the code. And to get the bigger picture right as the software grows – considering how the pieces of the jigsaw fit together – we need to be continuously architecting our products and systems on that larger scale.

And, finally, to be capable of doing these things well – abilities none of us are born with – we need to be continuously learning. In each nugget of feedback, we can see things that went well, and things we could do better. Rather than saving it all up for a “post-mortem” after a major release, and trying to change 1,001 things in our approach – which never works out! – we need to act on that feedback throughout the process, evolving our approach one lesson at a time.

Some, myself included, might say that if code craft could be crystallised in one word, that word would be “continuous”.


If you’re serious about building your team’s capability to rapidly, reliably and sustainably evolve software to meet rapidly changing business needs, my Code Craft and Test-Driven Development live remote training workshops are HALF PRICE until March 31st 2025.

The A-Z of Code Craft – B is for Builds

Automated software builds, where the product’s prepared for a potential release from the source files, are a central part of code craft. 

Every time developers merge their code to the main (“trunk”) branch, an automated build’s triggered that checks out the latest version of the code, resolves its dependencies, compiles the code if necessary, and runs the automated tests to make sure it all works in the build environment (and not just on the developer’s machine).

If all is good, and all the tests – and other possible quality checks, like code linting – pass, then we can have confidence that the current version of the code sitting on the trunk could be shipped if we wanted, though that might require further steps like containerisation.

It’s important that our build “pipeline” – the sequence of steps performed in an automated build – contains sufficiently robust quality gates to give us that confidence, or issues may leak into production. 
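The sequence of gates can be sketched as a toy model (step names only mirror the stages above; a real pipeline would invoke the compiler, test runner and linter at each one):

```python
# A build pipeline as a sequence of quality gates: steps run in order,
# and the first failure "breaks" the build, blocking release.
def run_pipeline(steps):
    for name, passed in steps:
        if not passed():
            return f"build broken at: {name}"
    return "shippable"

# Stand-in checks; real ones would shell out to build and test tools.
pipeline = [
    ("checkout",     lambda: True),
    ("dependencies", lambda: True),
    ("compile",      lambda: True),
    ("tests",        lambda: False),  # a failing test gates the release
]
```

The point of the model: a green result at the end of the sequence is what earns the code its “passport”.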

As well as constructing a shippable version of the software from the source code, we can also think of automated builds as being like passport control at an airport departure gate, preventing software getting on the release plane if it’s likely to present a problem.

If any tests or checks fail during an automated build, we say that the build is “broken”, and the software’s blocked from being released. This makes it everybody’s problem, so it’s important that broken builds are fixed quickly, or the code on the trunk is rolled back to the previous working build so the team can carry on delivering value.

The best developers are very “build-aware”. They keep one eye on the status of the build, because it signals changes to the code base being made by other people on the team. If a build succeeds, they’ll get those latest changes and merge them into their local copy to keep in sync. If a build fails, they know it’s not safe to merge their changes into the trunk, or to get changes from the trunk, until the build’s fixed.

The execution time of builds has a profound effect on delivery lead times, due to a phenomenon known as “short delay, long queue”. This is why performing builds manually isn’t a realistic option in continuous delivery. It’s very often the case that speeding up builds will increase agility at the team level.
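To illustrate the effect with invented numbers (assuming a single build machine running builds serially): when pushes arrive faster than builds complete, waiting time compounds for everyone behind in the queue.

```python
# Toy model of "short delay, long queue": one machine, serial builds.
def total_wait_minutes(build_minutes, push_gap_minutes, pushes):
    wait = 0.0
    machine_free_at = 0.0
    for i in range(pushes):
        arrival = i * push_gap_minutes
        start = max(arrival, machine_free_at)
        wait += start - arrival            # time this push sat queued
        machine_free_at = start + build_minutes
    return wait

# 30-minute builds with a push every 20 minutes: queueing time snowballs,
# and each extra push waits longer than the one before it.
# 10-minute builds at the same push rate: nobody queues at all.
```

A modest reduction in build time can eliminate the queue entirely, which is where the team-level agility comes from.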

_____________________________

If you’re serious about building your team’s capability to rapidly, reliably and sustainably evolve software to meet rapidly changing business needs, my Code Craft and Test-Driven Development live remote training workshops are HALF PRICE if delivered by March 31st 2025. 

The LLM In The Room

Over two years ago, the then not-for-profit research organisation OpenAI released a new version of their Large Language Model, GPT-3.5, under the friendlier brand name of ChatGPT, and started a media and market frenzy.

This was arguably the first time a chat interface could genuinely fool users into believing it was a person, and there was much talk about the age of “artificial general intelligence” and even “super-intelligence” now being upon us. Many pundits predicted the end of knowledge workers like lawyers, doctors, and – of course – software developers within a few years.

Naturally, this was a claim I had to check out for myself, so when GPT-4 was released a few months later, I signed up for the paid “Plus” version of ChatGPT to get (limited) access to it and started to experiment in various problem domains, including programming and software development.

Like millions of people, I was initially very impressed with GPT-4 (not so much with 3.5, I have to say). But as I started to try to actually do things – specific things – with it, its limitations became more and more apparent. While it is indeed remarkable that what is essentially a predictive texting engine can write Python or Java or C# that actually compiles – let’s not take that away from OpenAI – the actual code itself was less impressive.

In fact, it was often not acceptable at all. LLMs – and generative transformers more generally – are not very good at specifics. An honest marketing slogan for the technology might be “Impressive, but wrong.”

I found myself having to double-check everything, correct more than half the code, and routinely ended up having to “coach” GPT-4 to get half-decent results that didn’t look like they were written by an intern in a hurry. This often took longer than if I’d just written the code myself. As I’ve evaluated each new model, this has stubbornly remained the case.

No doubt in the intervening 20 months, “A.I. coding assistants” have improved, and I’ve been keeping a close eye on new models as they’ve been emerging to see just how much improved. In January 2025, we’re still at a point where LLM-generated code needs double-checking, correcting and refactoring too often to make these tools usable on anything beyond small one-shot “How do I…?” tasks. They are – as of today – at best, conversational interfaces to code examples included in their training data. They’re an improvement on Stack Overflow searches.

Hyperbolic claims by some of achieving 10x or even 100x productivity with these tools, or of non-programmers creating complex working products with them, like reports of flying saucers, have a tendency to evaporate on contact with reality. As yet, I’ve not seen a shred of hard evidence to back them up.

More tempered claims of modest productivity gains, backed up by hard data (i.e., not surveys of how productive devs feel LLMs are making them), paint a very ambiguous picture. Maybe they help a little. Maybe they don’t. Programming’s such a small part of software development that even if they did speed it up 10x – which at this point I’m confident they don’t – that’s a 90% saving on 10% of the work. There’s even hard data to suggest that, at the team level – and that’s where the productivity rubber meets the road – extensive LLM use can actually have a small negative impact. More code faster != more value sooner. I try to bear in mind that the feeling of productivity can often be deceptive. (For every “I stayed late and wrote a tonne of code uninterrupted” story, there are usually at least four more “I spent the whole morning trying to understand 100s of changes some dude had pushed the night before” stories.)

One obvious long-term risk of having a big chunk of your code generated by “AI” at speed is that a team’s understanding of their code base will run away from them, creating a kind of “comprehension debt” that seems likely to significantly increase the cost of fixing problems that the LLM can’t fix. We should keep an eye on the Mean-Time To Recovery of businesses who proudly claim that a growing percentage of their code’s “AI-generated” (presumably to impress investors).

Now, a conversational interface to gazillions of code examples – a kind of Stack Overflow++ – is not to be sniffed at. Good for them! But what it most certainly is not is a replacement for actual software developers. Not even close. And outside of our profession, confident pronouncements by CEOs and pundits in the media that they are have been doing real damage to the industry.

As “software developers”, they remain stubbornly not good enough. It would appear that this is an un-fixable problem, no matter how much training data and compute they throw at it. Pattern matchers are gonna pattern-match!

At some point, even investors, executives and commentators are going to be confronted with the reality that this technology hasn’t replaced any software developers. If anything, all the low-quality code these tools are churning out is creating a Mount Everest of technical debt that will require even more developers to keep the wheels on their enterprises turning in the future.

At this point, someone usually says “Ah, but Jason, maybe they’re not good enough now, but what about future models?” And this is where we all place our bets.

Some, like Microsoft, OpenAI and Nvidia, are betting that model performance is just going to keep improving until we reach AGI and beyond, even if we have to burn the planet to get there. This is their “growth story” upon which their current stock prices – riding at record highs – are based. If it’s not true, their stock prices will plummet back to what they were before this current “A.I.” bubble started to inflate. That’s trillions of dollars wiped off the NASDAQ. So there are a lot of very wealthy people with a very big interest in it turning out to be true. This is the biggest bet in history.

So anything that one of these models does that kind of sort of looks like AGI – in a certain light, from a distance, if we squint – is leapt upon as evidence that the Singularity is upon us, and that we should all start digging bunkers and buying canned goods in preparation for the inevitable Butlerian Jihad.

I’m skeptical of that. These claims are usually supported by A.I. performance benchmarks, and the models can be trained and fine-tuned to do well in these standard tests. There’s no shortage of training data.

And when I say “well”, I mean not as well as a human expert, but better than the average Joe. And while the gap closes little by little, that “little” seems to get “littler” with each new iteration. I speculated that transformer performance would converge on not-quite-good-enough. Needs more work. See me after. Not so much “super-intelligence” as “super-mediocrity”. Yes, it can write code, but not good code. Yes, it can play chess. Just not well. And so on.

The strength of LLMs is that they are not-quite-good-enough at very many text-based problems. But commercially, what’s the value proposition here? A not-quite-good-enough programmer that is also a not-quite-good-enough tax lawyer? An under-performing car that can also bake cakes is still an under-performing car.

And even as LLMs inch forward, there’s also the cost to consider with each new model. At $20 per month for ChatGPT Plus, OpenAI were losing money hand over fist. The price of the new Pro plan is ten times that. And they’re still burning through enormous amounts of investor cash. Executives at OpenAI have recently been floating the idea of a $2,000/month plan. But would they even break even at that price? Reports that a single task performed by the newest model, in “high-compute” mode, can cost thousands of dollars, and still fall short of expert performance, make me wonder if the final destination of all this research, all this fanfare, and all this MONEY might be a world where human experts are both the better and the cheaper option. That would be very funny. I would laugh a lot as the world economy collapses!

Much has been made of the idea that the newest models can follow and evaluate multiple “chains of thought”, and there seems little doubt that this improves their performance in benchmark tests. I’m not at all convinced that this is, as the makers claim, “reasoning”.

There’s also the question of what these models are evaluating their “chain of thought” against. What’s telling them that this is the right maths answer, or the best chess move, or the right Python code? How could a language model know?

I wonder if OpenAI are, in these cases, using their LLMs as interfaces to, say, maths programs, or chess programs, or Python testing or linting tools. And is that “artificial general intelligence”, or is that a natural language interface to point solutions; application-specific intelligence?

And after all that, the end results are still not-quite-good-enough, even with oceans of computing power thrown at the problem.

I don’t have a crystal ball, so this is just a bet. And I’m betting that LLMs will eventually – once decision makers finally see the tiger in the Magic Eye picture of generative A.I. – find their natural fit in the world as very impressive conversational natural language interfaces. The question that follows is: natural language interfaces to what, exactly? And in many cases, the answer is: something we haven’t figured out how to build yet.

So, back into A.I. winter we go, until the next major breakthrough. Perhaps next time, businesses will have been so badly burned by the crash – we’ve never seen tech hyped on this scale before, and it’s distorting everything – that they’ll think a little more critically about claims of “A.G.I.” and “super-intelligence”.

I’d like to think that investors and executives, unlike LLMs, are capable of learning from experience and applying a little dynamic reasoning next time around.

In the meantime, we – software developers, and the businesses who rely on us – have a looming pipeline problem of potentially epic proportions. Businesses who’ve stopped hiring and training entry-level developers because “GitHub Copilot can do what they do” are going to find out what happens when nobody plants tomatoes because “Hey, who needs tomatoes? We’ve already got pasta sauce”.

Combine that with a backlog that stretches to the Moon of real business problems neglected while “A.I.” has been sucking all the oxygen out of the room, and a planet-sized amount of LLM-generated technical debt, and you have the perfect storm.

When that happens, I’ll be here if you need me, shopping for superyachts 🙂

NB: For those thinking “Yes, but what about the environmental and ethical impact of LLMs?”: as a paid-up member of the Green Party, I’m right there with you. But my argument isn’t aimed at people with a track record of making business decisions on ethical grounds. We don’t live in that world any more (if we ever did).