requirements – Codemanship's Blog

Public Code Craft Training – July 7-9

For the small percentage of engineering orgs who’d genuinely like to be shipping more reliable software and be more responsive to the needs of their business and their users – it’s a niche, I know – I’m running a public 3-day online Code Craft workshop on July 7-9.

If you’re a developer, twist your manager’s arm – especially if they’re expecting you to be more productive using tools like Claude Code and Copilot.

If you’re an engineering leader, this is the real AI-assisted software engineering training your teams need – and, funnily enough, it’s mostly about software engineering and only a little bit about AI. It’s about making teams AI-ready.

It’s 6x half-day modules that give developers a practical, hands-on introduction to the foundational technical practices that enable teams to accelerate release cycles, shrink lead times and improve release reliability – with and without AI.

Specification By Example
Test-Driven Development
Refactoring
Design Principles
Continuous Delivery
Code Craft & AI – grounded in hard data, includes how to apply CRESS principles for context engineering to AI-assisted workflows

To learn more and register, visit https://codemanship.co.uk/codecraft.html

Places are limited.

CRESS Principles for Context Engineering – R is for Refutable

If speculative ideas can not be tested, they’re not science; they don’t even rise to the level of being wrong.

Wolfgang Pauli

When we interact with a language model, we’re communicating in natural language. And communicating in natural language is a lossy process.

There’s what I intended it to mean, and then there’s the meaning the model interprets, and they’re often not the same thing.

Many bad things have happened in the world because the receiver misinterpreted the intent of the sender. So it’s important to know with high confidence if we’ve grabbed the wrong end of the stick.

The inherent ambiguity of natural languages works against our desire to make our meaning clear.

In real-world communication, a simple technique to uncover misunderstandings is to test interpretations to see if they satisfy the original intent.

Including a test in an instruction given to an LLM serves two useful purposes:

It restricts pattern-matching to outputs that match the test as well as the natural language instruction. Coding models are actually trained by pairing code samples with tests of some kind, and more recently test execution has been used as a reward function in reinforcement learning. LLMs are sort of built for tests.
It potentially gives us a direct way to check if the output doesn’t satisfy the intent. If our success criteria are turned into executable tests – e.g. unit tests – then we can run them against the output and get immediate feedback.

Imagine we want our LLM to generate code to add items to an online shopping basket. I regularly see prompts that look something like this.

Please generate a Python function for adding items to a shopping
basket. It should take product and quantity as parameters.

But the devil’s in the detail. What exactly are we expecting to happen when the function adds the item? How will we know if it doesn’t happen the way we intended?

I’ve been providing BDD-style tests in my contexts, along the lines of:

Given an empty basket,
And the customer has selected the product with ID 811 and stock of 3
When the customer adds the product to the basket with quantity 2
Then a new order item is added to the basket with product 811 and quantity 2
And 2 of product 811’s stock are put on hold, leaving available stock of 1

This gives the LLM much more to go on regarding the expected behaviour – the precise intent – of adding an item to the basket.

And it can be directly translated into unit tests:

			
import unittest

# Product and add_to_basket are the code under test; they're assumed to be
# importable from the module being developed.


class AddToBasket(unittest.TestCase):
    def test_order_item_is_added(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)
        item = basket[0]

        self.assertEqual(item.product, product)
        self.assertEqual(item.quantity, 2)

    def test_stock_put_on_hold(self):
        basket = []
        product = Product(id=811, stock=3)

        add_to_basket(basket, product, quantity=2)

        self.assertEqual(product.hold, 2)
        self.assertEqual(product.available_stock(), 1)

		

(NB: In my workflow, I’d tackle one test at a time – we’ll cover that in the final two letters in CRESS.)

Provided the executable tests the LLM generates match the intent – and it’s really important to check that they do – any implementation it generates will need to pass them.
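To make that concrete, here's a minimal sketch of one implementation that would satisfy the two tests above. The OrderItem class and the dataclass layout are my assumptions for illustration; the tests only pin down Product(id, stock), product.hold, product.available_stock() and add_to_basket(basket, product, quantity). Note that the example says nothing about what happens when the quantity exceeds available stock, so the sketch doesn't guess.

```python
from dataclasses import dataclass


@dataclass
class Product:
    id: int
    stock: int
    hold: int = 0  # units reserved by baskets but not yet sold

    def available_stock(self) -> int:
        # Stock that can still be added to a basket
        return self.stock - self.hold


@dataclass
class OrderItem:
    # Hypothetical name: the tests only require .product and .quantity
    product: Product
    quantity: int


def add_to_basket(basket: list, product: Product, quantity: int) -> None:
    # Put the requested units on hold, then record the order item.
    # Behaviour for quantity > available stock is deliberately unspecified
    # here, because the example in the spec doesn't cover it.
    product.hold += quantity
    basket.append(OrderItem(product=product, quantity=quantity))


basket = []
product = Product(id=811, stock=3)
add_to_basket(basket, product, quantity=2)
print(basket[0].quantity, product.hold, product.available_stock())
```

The gap around insufficient stock is exactly the kind of thing another example would close.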

If the implementation doesn’t pass the tests, or the tests don’t match the intent, I revert the changes, flush the context (see “C is for Current“) and try again – perhaps adding further clarification to the context, like additional tests, if needed.

Does this really make a difference? It certainly does. I conducted closed-loop experiments where I tasked Claude Code – using Opus 4.6 – to implement a set of features for a small, but non-trivial, system.

I’d written my own reference implementation with tests that used a simple API that didn’t reveal any internal design details. I preserved the API and moved the tests to where Claude couldn’t see them, leaving just my instructions and the API for it to work with.

When Claude had finished, I moved the tests back into the project and ran them, scoring each pass by the % of tests passing.
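For the curious, scoring a pass by % of tests passing can be sketched in a few lines with Python's unittest. This isn't the harness I used, just an illustration; pass_rate and demo_suite are hypothetical names, and the demo test case stands in for the hidden reference tests.

```python
import unittest


def pass_rate(suite: unittest.TestSuite) -> float:
    """Run the suite and return the percentage of tests that passed."""
    result = unittest.TestResult()
    suite.run(result)
    if result.testsRun == 0:
        return 0.0
    not_passed = len(result.failures) + len(result.errors)
    return 100.0 * (result.testsRun - not_passed) / result.testsRun


def demo_suite() -> unittest.TestSuite:
    # Stand-in for the reference tests: three pass, one fails on purpose
    class Demo(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(1 + 1, 2)

        def test_multiplication(self):
            self.assertEqual(2 * 2, 4)

        def test_truth(self):
            self.assertTrue(True)

        def test_deliberate_failure(self):
            self.assertEqual(1, 2)

    return unittest.defaultTestLoader.loadTestsFromTestCase(Demo)


print(f"{pass_rate(demo_suite()):.0f}% of tests passing")
```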

I didn’t intervene until Claude said it was done. (In real life, I don’t use it this way, of course.)

In one version of the experiment, I provided BDD-style examples in the prompt. In another, I just gave Claude the basic feature descriptions. In both versions, Claude was instructed to generate its own tests from its interpretation of the requirements.

In a single pass, measured by % of tests passing, the difference was big.

Over multiple passes, feeding back test results after each, the difference got even bigger.

With test examples provided, the agent has explicit success criteria to converge on. Without them, it just goes around in circles, literally aimlessly. Poor little Ralph!

One final thought: not all interactions with an AI coding tool will be about adding or changing functionality. What if the task is a refactoring?

Well, hopefully your refactorings have goals – they’re done with intent to improve the design.

In my TDD workflow, at every green light – whenever the tests are passing again – I perform a mini code review on the changes. I might, for example, run a linter over the diff. Let’s say one of my code quality checks – just another kind of test – is for functions or methods that have a cyclomatic complexity > 5.

If the LLM changes a function and makes CC = 6, I now have a failing test. I could revert and feed that back in another pass, although giving an LLM two objectives in the same interaction reduces the odds of either being satisfied, so we could be here all day throwing the dice over and over again.

Or I could ask the LLM to refactor the function, and then run the check again to see if the restructured version is within limits.

However I choose to handle it, importantly I have a clear way to know when it hasn’t worked.
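As a rough illustration of that kind of check, here's a sketch that approximates cyclomatic complexity with Python's ast module and reports any function over the limit. It's a simplification, not a replacement for a real linter: it counts common decision points only, and the names cyclomatic_complexity, functions_over_limit and the SAMPLE code are all made up for the example.

```python
import ast

# Node types that each add one decision point (an approximation)
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate CC as 1 + the number of decision points in the function."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the loop, plus one per 'if' filter
            complexity += 1 + len(node.ifs)
    return complexity


def functions_over_limit(source: str, limit: int = 5) -> dict:
    """Return {function name: complexity} for functions exceeding the limit."""
    offenders = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            cc = cyclomatic_complexity(node)
            if cc > limit:
                offenders[node.name] = cc
    return offenders


SAMPLE = '''
def shipping(order):
    if order.total > 100:
        return 0
    if order.express:
        return 15
    if order.country == "UK":
        return 4
    if order.country == "FR":
        return 8
    if order.weight > 20:
        return 12
    return 6

def simple(x):
    return x + 1
'''

print(functions_over_limit(SAMPLE))
```

An empty result is the green light; anything else is a failing test to feed back or to refactor away.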

May Workshops for Self-Funding Learners – Update

Hiya. Just a quick note about the Essential Code Craft training workshops aimed at self-funding learners that are happening this month.

Specification By Example

Tuesday May 12 (evening) has 3 places available. I’m guessing it’ll be sold out by the end of this week.

Saturday May 16 (morning) is half-full at time of writing.

-> Register

Test-Driven Development

Tuesday May 19 (evening) is sold out, but you can add yourself to the waitlist in case anybody drops out and to be among the first to hear about future workshops.

Saturday May 23 (morning) still has plenty of places available.

-> Register

And if you want to keep an eye out for future workshops in June and beyond, bookmark our dedicated web page for self-funding learners on Ticket Tailor.

Upcoming skills we’ll be covering include Modular Design and Refactoring. Y’know? Boring skills that are nevertheless essential.

Essential Code Craft – Workshops In May

May will soon be upon us, and I’ve scheduled three out-of-hours workshops for self-funding learners in my Essential Code Craft series.

If your employer won’t invest in you, invest in yourself and join us.

Tues May 12th 18:45 BST & Sat May 16th 09:45 BST – Specification By Example

Over more than 70 years of developing software products and systems, we’ve learned that misunderstandings about the meaning of requirements are one of the biggest sources of avoidable rework.

Reducing ambiguity in specifications can dramatically reduce the risk of misinterpretation, whether it’s among human stakeholders or when we’re working with AI coding tools.

Tues May 19th 18:45 BST – Test-Driven Development

For nearly 30 years, Test-Driven Development has been the technical core of successful agile software development.

Teams have shortened delivery lead times dramatically, while actually improving the reliability of their releases, and lowering the cost of changing software using TDD.

And TDD is proving to be not just compatible with AI-assisted software development, but essential.

Have We Lost Sight Of Our Patients & Their Problems?

Psst. If your boss won’t invest in training you in Specification By Example or Test-Driven Development, I’m running out-of-hours workshops in May specifically for self-funding learners. £99 + UK VAT.

Many of us pretend that software releases are an end in themselves – that shipping what we said we would means success. We give the medicine to the patient, and that’s the end of the treatment.

Hopefully your doctor isn’t quite so naïve. The treatment doesn’t end when the pharmacist fills the prescription, or even when the patient takes the medicine.

There’s the little matter of the effect of the medicine on the patient – is it actually working? Does their blood pressure go down? Does their heart rhythm stabilise? Is the medicine producing the desired outcome?

In the UK, if you test positive for any one of three conditions – high blood pressure, Type 2 diabetes or high cholesterol – you’ll be tested for the other two. Bad things tend to come in threes.

If it turns out you’ve got the full set, interventions for all three may be required – ranging from lifestyle changes to prescription drugs, depending on how acute each condition is.

And your doctor’s unlikely to prescribe treatments for all three at once, unless it’s really urgent. Typically, they’ll prescribe, say, a calcium channel blocker for high blood pressure and then monitor your BP for a while – long enough that they’d expect to see some significant change.

Depending on the feedback – the measurements that indicate what effect the treatment’s having – they may up the dose, or add another prescription, or send you on a meditation course, or confiscate your smartphone. It all really depends on what works and what doesn’t, as measured over time.

Once the numbers are going in the right direction, they may then move on to other treatments for other conditions – e.g., statins for your high cholesterol. And again, they’ll monitor what effect each treatment’s having on the patient in reality.

Biology’s complicated, and the effect of a medical intervention on a specific patient can’t be predicted with high accuracy. Yes, statins will probably bring your cholesterol down, just like it probably won’t snow in London in April. But it’s by no means guaranteed.

Businesses are also complicated, even a tiny business like mine. And the effect of an intervention like, say, changing the design of the home page is by no means guaranteed. We might guess that displaying our top-selling vegan products prominently will increase their sales, but until our changes hit the real world, that’s all it is – guesswork.

And if vegan sales go up, do sales of hamburgers and sausages go down?

The word “solution” implies we’re solving a problem, but this all too often gets lost in the cut-and-thrust of software development. We become bogged down in the detail of prescribing and dispensing the medicine, and too easily lose sight of the patient and their condition.

In my workshops for self-funding learners on Specification By Example, you’ll learn to start not with the prescription nor with the pharmacy, but to put the patient & the problem at the centre of the development process.

Specification By Example – May 12 18:45 BST & May 16 09:45 BST

Essential Code Craft – The Roadmap

Some of you may have noticed that I’ve been running out-of-hours training workshops for self-funding learners recently, under the banner of Essential Code Craft.

In a way, this is a return to the early days of Codemanship when I ran regular weekend workshops – priced for individual pockets – that were mostly attended by developers investing in their own skills and career development.

Many of those people are now CTOs and heads of engineering, and I’ve been fortunate – and grateful – that quite a few have brought me in to provide the same kind of training for their teams.

But with senior engineering leaders now very distracted by the code-generating firehose – and while I wait for them to realise that nothing’s actually changed as far as software engineering fundamentals are concerned – I’m pivoting back to self-funders.

So far – just as it was way back when – the first two workshops filled up quickly. While the boss might not be thinking about investing in their developers at the moment, it seems a lot of developers are looking to invest in themselves.

And this is exactly the moment to do it. While a gazillion developers hunt for magic incantations to make a probabilistic next-token predictor act like something other than a probabilistic next-token predictor, the people who’ve done their homework already know: better results with AI coding tools have very little to do with the tools, and almost everything to do with the processes around them.

And it’s a double-win. The practices that produce the best outcomes with AI are the exact same practices that produce the best outcomes without AI.

The key to being effective with AI is being effective without it.

And here’s the hedge, but only for the informed gamblers – developer hiring is rising again, but the demographic of these new hires is changing. Employers are favouring senior developers with significant pre-LLM experience.

I, and a few others, predicted this would happen. Demand would be highest for people who can do the things AI coding tools can’t – like, well, understand code. I mean really understand it. Not “LGTM” understanding. Deep comprehension of programs.

Not only that, but for all kinds of good reasons – economic, environmental, energy, ethical, geopolitical – the future of hyperscale LLMs is by no means predictable. Folks grappling with reduced token limits and rapidly degrading performance with Anthropic’s newest models will hopefully have figured out by now that building workflows that depend heavily on hyperscale LLMs is building on quicksand.

Who are Acme Megacorp gonna’ hire – the dev who sits on their hands because they’re waiting for their token limit to reset, or the dev who can just carry on at roughly the same overall pace of delivery?

And we should be under no illusions that teams who’ve mastered the fundamentals of software delivery are routinely outperforming teams who haven’t – with or without AI. AI is clearly not the differentiator.

So, whether you’re going to apply these disciplines with Claude Code or Codex, or with IntelliJ or VS Code, they still matter – arguably more than ever.

And what are these disciplines? What is Essential Code Craft?

Specification By Example – build shared understanding and pin down requirements with testable specifications
Test-Driven Development – rapidly iterate working software designs with short delivery lead times and reliable releases
Continuous Integration – keep teams more in sync with their changes, merging and testing them many times a day to ensure a working, shippable-at-any-time product
Continuous Collaboration – keep teams on the same page by continuously communicating with practices like pair programming and teaming
Refactoring – reshape code to make change easier, while keeping it working and shippable at all times
Modular Design – optimise software architecture to localise the “blast radius” and minimise the cost of changes, while making rapid testing and smarter reuse easier
Continuous Inspection – minimise the bottleneck and the “LGTM” effect of downstream code review by making it a continuous and highly automated process
Continuous Delivery – combine these fundamentals in a delivery process that can get the proverbial peas from the farmer’s field to the kitchen table through rapid, reliable integration, build and deployment pipelines
Continuous Improvement – build development capability in an evidence-based way, learning what really works and what doesn’t as you build skills, automate tools and workflows, and explore and experiment with your approach (and that’s where I come in!)

Workshops on Specification By Example and Test-Driven Development are already live and taking registrations. If there’s demand, more will follow.

The roadmap is to build a set of repeating individual workshops, rotating monthly, that will eventually cover all of these disciplines – some explicitly, some implicitly like Continuous Integration and pair programming, which will be an integral part of most workshops.

Self-funders can pick and choose which to attend, and my hope is that they’ll be a bit like Pokemon cards – gotta collect ’em all!

Keep an eye on the Codemanship Ticket Tailor box office for details of upcoming workshops.

Also, details of new workshop times will be posted here first, so subscribe to this blog if you’d like to be kept in the loop for future workshops.

Specification By Example Was Essential Before AI. It’s Twice As Essential Now.

Psst. If your boss won’t invest in training you in Specification By Example, I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

The research I’ve done over the last 3 years into AI-assisted programming, including my own closed-loop experiments, found that one major factor in the likelihood that an LLM will correctly interpret a specification is whether or not examples are included to clarify requirements.

Completion rates – as measured by acceptance tests passed – improve dramatically, even in a single pass.

In multiple passes, with feedback from acceptance testing, models given examples converge on impressive completion of ~80%, while without examples they tend to just go around in circles, with completion barely improving.

This should come as no surprise, because we saw a similar effect with dev teams before AI coding tools appeared on the scene. Teams who clarify requirements using examples are much more likely to interpret what the customer (or the product manager, or the business analyst) means correctly.

And, as requirements misunderstandings are typically one of the biggest sources of avoidable rework, they save a lot of time and money correcting mistakes that could have been spotted before a line of code was written.

The LLM equivalent means the same outcomes – features delivered as the prompter intended – in fewer passes, using fewer tokens (and burning down fewer proverbial forests).

Done right, specifications with examples can be translated pretty directly into executable tests that can drive the design and development of working software using techniques like Test-Driven Development.

A specification for totaling items in an order that uses examples is test-ready. Essentially, it is a test.

    def test_one_item(self):
        product = Product(id=327, price=159.95, stock=7, hold=1)
        order = Order([Item(product=product, quantity=1)])

        total = order.total()
        
        self.assertEqual(total, 159.95)
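And here's a minimal sketch of code that the one-item example could drive out. Only the constructor shapes and order.total() come from the test; the dataclass layout is my assumption, and I've kept prices as floats because the test does, though real money-handling code might well prefer Decimal.

```python
from dataclasses import dataclass


@dataclass
class Product:
    id: int
    price: float
    stock: int
    hold: int = 0


@dataclass
class Item:
    product: Product
    quantity: int


@dataclass
class Order:
    items: list

    def total(self) -> float:
        # Sum of unit price x quantity across all items in the order
        return sum(item.product.price * item.quantity for item in self.items)


product = Product(id=327, price=159.95, stock=7, hold=1)
order = Order([Item(product=product, quantity=1)])
print(order.total())
```

With one item at quantity 1, the total is simply the product's price.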

When I’ve provided specifications with examples in TDD training workshops, and measured successful interpretation of requirements by students, I’ve found the same trend that my experiments found with LLMs – it roughly doubles, and often hits 100% completion.

When I don’t include examples… Well, I’ve lost count of the number of times students thought that telling the Mars Rover to turn right moved it x+1, or that Roman Numerals should be converted into integers. As a trainer, including examples saves me and my students a lot of time – especially if I get to them later.

But humans can do things LLMs can’t, like understand, reason and learn. So levels of misinterpretation tend to be lower, because we can apply an understanding of the world and judge whether a requirement makes sense. Misinterpretation by AI coding assistants is a higher risk, and therefore the need to clarify is significantly heightened.

As is the need to use language consistently. While some folks claim – presumably because they haven’t tested it in any meaningful way – that LLMs don’t need code to be human-readable, the evidence is clear that they really, really do.

I’ve seen many times myself how completion rates dropped significantly when code wasn’t clearly and consistently signposted, using language that had a close conceptual correlation to our specifications. If I call it “sales tax” in one interaction, and “VAT” in another, the model struggles to anchor on a name for that variable, often interpreting them as distinct variables in the code.

Specifying with examples gives us an opportunity to establish a shared vocabulary for describing our problem domain, which aids communication between stakeholders, but also between humans and LLMs.

When developers are stuck for a name for a class or a function that makes the intent of that code clear, I encourage them to write that intent in plain English and take inspiration from that. Specifying with examples can help establish a shared language before a line of code’s been written.

The AI-Ready Software Developer #24 – Specification Is A Conversation

Psst. If your boss won’t invest in training you in Specification By Example, I’m running out-of-hours workshops on May 12 and 16 specifically for self-funding learners. £99 + UK VAT.

A sentiment I see often on social media about “AI”-assisted and agentic coding goes something along the lines of “If you’re just translating specs into code, your job is disappearing”.

It sounds reasonable on the surface, if you believe that’s all many programmers were doing. Someone – say, a product manager or an architect – hands the programmer a specification for a feature, and the programmer just “codes it up” like a pharmacist filling a prescription.

But was that ever really a thing?

In reality, most software specifications are incomplete and ambiguous, and often contain logical contradictions that are hard to spot – because of the incompleteness and the ambiguity.

Think of the movie script that contains the line “A huge battle ensues”. The studio asks “How much will that cost?” The producer has absolutely no idea, because that part of the script still needs to be written. The line’s just a placeholder for more work to flesh out the details. And in software development, just as in movie-making, the devil is in the details. That’s where the time and the money goes.

And that’s the reality of software specifications written in natural languages like English, even ones written by programmers. At best, they’re placeholders for conversations. Extreme Programming actually makes that explicit: a “user story” is not a requirements specification. It’s just a placeholder for a chat with the person who wrote it. That’s why it’s a waste of time making them detailed.

And this means that the programmer’s job is not just to “code up the spec”, it’s to figure out what the specification actually means. What exactly happens in this huge battle?

And because specifications are incomplete and ambiguous and often contradictory, this process inevitably has to walk everything back to figuring out what the need being addressed is in the first place.

This is why, for so many years, I drummed it into product managers, requirements analysts and the like to come to the team not with the “what”, and certainly not with the “how”, but with the “why” – what problem are we aiming to solve?

We then work as a team – leveraging our combined expertise in systems design and development and the problem domain – to learn together how to solve the problem through successive iterations of best guesses informed by rapid user feedback.

Now, I could be wrong, but that doesn’t sound like “just translating specs into code” to me.

The promise of this new generation of “AI” coding tools is that non-programmers will be able to iterate working software by themselves. And this is true, to an extent.

Tools like Claude Code and Cursor have proven themselves to be very useful for generating prototypes and proofs-of-concept with no programmer involvement, enabling business analysts, UX designers, product managers, start-up founders and pastry chefs to test simple ideas quickly and cheaply.

The problem is that without the expert judgement of experienced programmers, it doesn’t mature to reliable, scalable, secure software that stands up to real-world production rigours.

So, at some point, you’ll have to pick up the phone to Programmers-R-Us and get some involved if you want your experiment to scale. Have your cheque book ready!

And this is where the problems really start. You now have kind-of, sort-of working software that validates your idea. There’s a fork in the road here. You can either:

Have the programmers find and fix all the problems to make the prototype market-ready
Use the prototype as the specification and have the programmers build a production-quality version from scratch with, y’know, tests and architecture and stuff

Let’s go through Door #1.

So, the prototype sort-of works, but there are bugs – oh, boy are there bugs?! – and security vulnerabilities and performance bottlenecks and scaling blockers and some gone-off cheddar and discarded prams and all the kind of stuff that LLMs will tend to leave in your code if you let them. Which you did, because you can’t tell a switch statement from a discarded pram.

So the programmers need to test the software thoroughly to find all the usage scenarios where the software doesn’t do what it’s supposed to. And there’s the Catch 22. What is it supposed to do? If only there was a complete and precise specification!

We don’t fare much better with Door #2.

Now your programmers have to reverse-engineer the prototype to figure out what it does. What happens when the user leaves that field blank and clicks “Continue”? What happens when the clock strikes midnight and interest needs to be applied to the account? What happens when the vehicle remains stationary for more than 5 minutes? Edge case after edge case after edge case.

You run into the exact same wall. A complete and precise specification for any non-trivial software is made up of thousands of definitive answers to these kinds of questions. Software systems are the most complex machines we’ve ever built. You can’t specify them on the back of a cigarette packet.

Or, to put it the way a customer once put it to me when we had this discussion about a new feature, “Why are you making it so complicated, Jason?”

“I’m not making it complicated. That’s how complicated what you’re asking for is.”

The system has to handle all of these inputs in some meaningful way, otherwise it will break. If the user’s email address isn’t valid, a whole bunch of features won’t work. Are you happy for the system to just not work for those users?

Then, as it almost always does, the conversation turned into a negotiation about the scope and complexity of that feature for the next release. We can always remove one variable now, and add it in a later iteration. It’s an old physics trick (see: Special Relativity).

And this is why requirements specifications are placeholders for conversations. If there’s no conversation, issues will not get addressed by experts who understand them until much, much later when they’re much, much harder to fix.

This is why, as a tech lead, I almost always – when presented with a “requirements specification” as a fait accompli – pressed the “Reset” button and started the conversation again at “Okay, so what seems to be the problem?”

That’s Door #3 – involve programmers early. Because those conversations have to happen whether you like it or not, and the sooner you have them, the sooner you’ll converge on a workable, production-ready solution.

A simple prototype can help you validate your idea before you pick up that phone, but the more design decisions you make before involving experts, the bigger and badder the catch-up’s going to be later. And you might be surprised – when you have a clear end goal in mind – how simple the simplest proofs-of-concept can be.

I’ve been in this game for 34 years, and in that time I’ve seen countless attempts to demarcate this process of building an understanding of not just what the software needs to do, but why it needs to do it.

They all inevitably walk into the same wall. You cannot pay someone else to understand something for you. It’s like paying someone to revise for your exams.

Software specification is necessarily a conversation between people with needs – and, ideally, money – and people who specialise in meeting needs using computers. ’Twas ever thus, ’twill ever be.

Unless, of course, your specification is complete, consistent and mathematically precise.

And a complete, consistent, mathematically precise specification of a computer program is that computer program. That’s what source code is, and that’s why programming languages were invented.

A person who just translates complete, consistent and mathematically precise specifications into executable code is a compiler.

Fans of Spec-Driven Development may be feeling vindicated now because you believe your specifications are complete, consistent and precise. If you’ve clarified requirements using examples – to me and you, tests – that might push them towards being of that integrity.

But even if your specs really are completely complete, and completely consistent and completely precise – and even if LLMs were capable of reliably translating such specifications into code (which they’re not) – you need to remember that it will still be full of assumptions about what’s really needed. Basically, a formal specification is just formalised guesswork.

To quote the Second Doctor, “Logic, my dear Zoe, merely enables one to be wrong with authority”.

The real knowledge isn’t in the spec, or in the code, it’s in the feedback we get when people use it in the real world. In this sense, iterating is the ultimate requirements discipline – it’s where most of the real value gets discovered.

So, by all means, spec away. But don’t spec far – just enough to test an assumption with user feedback from working software. And user feedback’s like code reviews – the more changes we ask for feedback on, the less attention gets paid to most of them.

Research has found that when users give feedback, they often anchor on one or two standout moments – positive or negative – rather than the entire user experience. Psychologists call it the “peak-end rule” – it’s “LGTM” for user eyeballs.

Spec one change to functionality at a time, build it in rapid, tested iterations, ship it through a reliable delivery pipeline, and then go get that focused feedback. Because the spec very probably will need to change.

And if the spec rarely changes, I’d worry that we aren’t listening to our users. Either that or we got incredibly lucky (or clairvoyant).

It’s all one big, ongoing conversation.

Will You Finally Address Your Development Bottlenecks In 2026?

I’ve spent the best part of 3 decades telling teams that to minimise the bottleneck of testing changes to their code, they’ll need to build testing right into their innermost workflow, and write fast-running automated regression tests.