Here are five factors that make a bigger difference to software development outcomes than “A.I.” coding assistants, but that teams don’t address because they’re “old news, granddad!”
Smaller teams deliver better value per $ spent
More frequent releases accelerate learning what has real value
Limiting work in progress – solving one problem at a time – increases delivery throughput
Cross-functional teams experience fewer bottlenecks and blockers than specialised teams
Empowered, self-organising teams spend less time waiting for decisions and more time getting sh*t done
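The work-in-progress point, in particular, can be made concrete with a toy model. Everything below is invented for illustration: one developer, five features of five days’ effort each, and an assumed half-day re-orientation cost for every context switch.

```python
# Toy model of the WIP point above. All numbers are invented:
# one developer, five features of five days' effort each, and an
# assumed half-day re-orientation cost per context switch.

TASKS = 5          # features in progress
EFFORT = 5.0       # days of work per feature
SLICE = 1.0        # round-robin time slice when multitasking
SWITCH_COST = 0.5  # assumed cost of each context switch

def one_at_a_time():
    """WIP limit of 1: finish each feature before starting the next."""
    clock, finish_times = 0.0, []
    for _ in range(TASKS):
        clock += EFFORT
        finish_times.append(clock)
    return finish_times

def all_at_once():
    """WIP = 5: round-robin in one-day slices, paying the switch cost."""
    remaining = [EFFORT] * TASKS
    clock, finish_times, current = 0.0, [0.0] * TASKS, None
    while any(r > 0 for r in remaining):
        for i in range(TASKS):
            if remaining[i] <= 0:
                continue
            if current is not None and current != i:
                clock += SWITCH_COST
            current = i
            step = min(SLICE, remaining[i])
            clock += step
            remaining[i] -= step
            if remaining[i] <= 0:
                finish_times[i] = clock
    return finish_times

serial = one_at_a_time()  # features ship on days 5, 10, 15, 20, 25
multi = all_at_once()     # everything lands between days 31 and 37
```

With WIP limited to one, features ship after 15 days on average and the last is done by day 25. Juggling all five, the average lead time stretches to 34 days and the final feature lands on day 37, because the switching overhead also eats throughput. The numbers are made up; the shape of the result is the point.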
Now, I appreciate that every one of these is a can of worms that many organisations simply do not wish to open. They all have deep implications, and require foundational changes not just to the way we work, but the way we think.
For example, smaller, more frequent releases imply software’s in a shippable state more often, which implies faster build & test cycles… and down the rabbit hole we go: into testing pyramids and separation of concerns and micro-cycles with continuous testing, continuous integration, continuous code review and… Come to think of it, the stuff I teach 🙂
Another example: empowering teams requires a pretty high level of psychological safety. When people are afraid to fail, they’re afraid to try – to make calls, to take initiative, to just f-ing do it! The culture of an organisation, which may have evolved over many years, is a hard thing to reshape. There are often a lot of unspoken rules – sure, you say your door is always open, but… It takes much work and many iterations to shift those underlying patterns in the way we interact.
But waiting on the other side of that long journey is the capability to rapidly and sustainably create and adapt working software that meets fast-changing business needs. Software agility Nirvana.
We already know from the data (e.g., DORA) that “A.I.” coding assistants don’t unlock that door.
Since this “Age of A.I.” arrived in late 2022, something’s been nagging at me. As more and more data rolls in, we see an apparent paradox emerging where “A.I.” coding assistants are concerned.
Individual developers report productivity gains using these tools (though many also report significant frustrations with, for example, “hallucinations”).
And at the same time, data clearly shows that the more teams use them, the bigger the negative impact on team outcomes like delivery throughput and release stability.
How can both these things be true?
We have one very plausible candidate for a causal mechanism, and it’s an age-old story in our industry.
When programmers get a feeling that they’re getting things done faster, they’re often only considering the part where they write the code – particularly when that’s their part of the process.
What they’re not considering is the whole software development process, and especially downstream activities like testing, code review, merging, deployment and operations.
More code faster can mean bigger change sets – more to test (and more bugs to fix), more code to review (and more refactorings to get it through review), more changes to merge (and more conflicts to resolve), and so on.
“A.I.” code generation’s a local optimisation that can come at the expense of the development system as a whole, especially if that system is more batch-oriented, with design, coding, testing, review, merging and release operating like sequential phases in the delivery of a new feature. In such a system, more code faster means bigger bottlenecks later. So there’s no paradox at all: one causes the other.
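A back-of-envelope sketch of that bottleneck effect, with all numbers invented: a downstream review stage that can absorb 400 changed lines a day, fed by coding output at two different rates.

```python
# Back-of-envelope sketch of the bottleneck effect. All numbers are
# invented: a review stage that can absorb 400 changed lines a day,
# fed by coding output at two different rates.

REVIEW_CAPACITY = 400  # lines/day the review stage can get through

def review_backlog(lines_per_day, days):
    """Lines still waiting for review after the given number of days."""
    backlog = 0
    for _ in range(days):
        backlog = max(0, backlog + lines_per_day - REVIEW_CAPACITY)
    return backlog

steady = review_backlog(300, 10)      # 0: the reviewer keeps up
doubled = review_backlog(600, 10)     # 2000 lines queued after 10 days
latency = doubled / REVIEW_CAPACITY   # 5.0 extra days before feedback
```

Push coding output past the capacity of the next stage and the queue, and the wait for feedback, grows without bound. The local speed-up reappears downstream as latency.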
When teams work in much smaller cycles – make one change, test it, review the code, refactor, commit that and maybe push it to the trunk – they may experience far fewer downstream bottlenecks, with or without “A.I.” coding assistance. Arguably, coding assistants might make little noticeable difference in such a workflow.
The DORA data strongly indicates that the teams with the shortest lead times and the highest release stability tend to work this way, with continuous testing, code review and merging as the code’s being written.
And all this got me to thinking: maybe we’re targeting machine learning and “A.I.” at the wrong problem. Instead of focusing on individual developer productivity with things like code generation, perhaps this technology would yield more fruit if it were focused on systemic issues and reducing bottlenecks.
For example, instead of using ML models to generate code, might they be more productively applied to reviewing it? Could a “smart” linter reduce the need for after-the-fact code review?
Of course, many of us already enjoy the benefits of “smart” linters. We call it “pair programming” or “ensemble programming”. And, having used static code analysis tools that incorporate statistical models or neural networks, I wasn’t all that impressed by the results. It’s hard to see such a tool significantly out-performing a classic linter plus a second pair of experienced eyes (if such eyes are available to you, of course, and maybe that’s the use case).
Perhaps the real value might be found in widening our view. What if a model (or models) could be trained on data collected across the entire cycle, from product strategy through to operational telemetry, support and beyond?
Imagine a model that, given, say, a Figma UI wireframe, could predict how many support calls you’d be likely to get about it. Or a model that, given a source file, could predict its mean time to failure in production.
More generally, imagine a model that could, with reasonable accuracy, predict the downstream impact of upstream activities, so as SuperDuperAgenticAI spits out its slop, alarm bells start to go off about where this is likely to lead if it gets any further.
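A deliberately crude sketch of the idea, with invented data points and a straight-line least-squares fit standing in for a real ML pipeline: learn the relationship between an upstream signal (here, change-set size) and a downstream outcome (here, defect reports traced back to the change), then raise the alarm before the change ships.

```python
# Deliberately crude sketch: a least-squares line standing in for a
# real ML pipeline. The data points are invented: (change-set size
# in lines, defect reports traced back to that change).

history = [(50, 1.0), (120, 2.0), (300, 4.5), (800, 11.0), (1500, 20.0)]

n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
         / sum((x - mean_x) ** 2 for x, _ in history))
intercept = mean_y - slope * mean_x

def predicted_defects(lines_changed):
    """Predicted downstream defect reports for a proposed change set."""
    return intercept + slope * lines_changed
```

A real system would learn across the whole cycle, from wireframes through to operational telemetry, on far richer features than line counts. The point is only the direction of the signal: from upstream artefacts to downstream outcomes.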
There would be some major hurdles to overcome to apply similar techniques to software development, though, not least of which is the jungle of higgledy-piggledy data formats our many proprietary tools and platforms produce. Electronics has established data interchange standards. We, for the most part, do not – probably because that would require enough of us to agree on some stuff, and that isn’t really our strong suit.
But, if these challenges could be overcome, or worked around (e.g., with a translation/encoding layer), I’m pretty sure there are patterns hidden in our complex and multi-dimensional workflow data that maybe nobody’s spotted yet. I mean, we’ve barely scratched the surface in the last 70+ years.
In a very handwavy sense, though, I feel quite sure now that “A.I.” is being targeted at the wrong problem in software: an exclusive focus on individual developer productivity, when the focus should be on the system as a whole.
In the meantime, we’re pretty sure at this point that things like continuous design, continuous testing, continuous code review and continuous integration do have a positive systemic impact, so focusing on that is probably the most productive thing I can do for the foreseeable future.
If your team would like training and mentoring in the technical practices that we know speed up delivery cycles, shorten lead times and improve product and system reliability, with or without “A.I.”, pay us a visit.
Since early 2023, I’ve been on a journey evaluating claims about the capabilities of generative “A.I.” (yep, still gets air quotes).
I’ve tried to reproduce some of the more sensational successes I’ve seen trumpeted on the Interwebs, and eventually came to the conclusion that most of them don’t hold much water.
Why, I wonder, are these people claiming to have done things that the technology just doesn’t seem able to do?
Their defence is typically that I must be “doing it wrong”; that I haven’t mastered the Secret Magical Prompts of Destiny. But when I try to follow the advice, I get the same “meh” results.
“Be more specific” is a common refrain. But here’s the thing:
I’m a computer programmer with a degree in physics. I can do “specific”.
If anything, the more specific the requirements, the more the models struggle. When I try to iterate the output through a longer conversation, the results can often get worse.
Over these two years, I’ve gradually developed a theory about how they’re succeeding with “A.I.” where I’m failing, and it’s probably best illustrated with a cartoon.
This image was generated through the ChatGPT web application (so it was generated by DALL-E, I guess). It went through multiple iterations as we tried to correct the problems, but – as often seems to be the case – the first attempt was about as good as it got.
I was very specific in my prompts about the story, the dialogue, the characters, down to the level of exactly what should be featured in each panel.
Some folks looked at this image and saw the continuity mistakes. Others looked at it and said “It looks okay to me”.