An effect that’s being more and more widely reported is the increase in time it’s taking developers to modify or fix code that was generated by Large Language Models.
If you’ve worked on legacy systems that were written by other people, perhaps decades ago, you’ll recognise this phenomenon. Before we can safely change code, we first need to understand it – understand what it does, and also oftentimes why it does it the way it does. In that sense, this is nothing new.
What is new is the scale of the problem being created as lightning-speed code generators spew reams of unread code into millions of projects.
Teams that care about quality will take the time to review and understand (and more often than not, rework) LLM-generated code before it makes it into the repo. This slows things down, to the extent that any time saved using the LLM coding assistant is often cancelled out by the downstream effort.
But some teams have opted for a different approach. They’re the ones checking in code nobody’s read, and that’s only been cursorily tested – if it’s been tested at all. And, evidently, there’s a lot of them.
When teams produce code faster than they can understand it, it creates what I’ve been calling “comprehension debt”. If the software gets used, then the odds are high that at some point that generated code will need to change. The “A.I.” boosters will say “We can just get the tool to do that”. And that might work maybe 70% of the time.
But those of us who’ve experimented a lot with using LLMs for code generation and modification know that there will be times when the tool just won’t be able to do it.
“Doom loops”, when we go round and round in circles trying to get an LLM, or a bunch of different LLMs, to fix a problem that it just doesn’t seem able to solve, are an everyday experience using this technology. Anyone claiming it doesn’t happen to them has either been extremely lucky, or is fibbing.
It’s pretty much guaranteed that there will be many times when we have to edit the code ourselves. The “comprehension debt” is the extra time it’s going to take us to understand it first.
And we’re sitting on a rapidly growing mountain of it.
This is the second post in a series that aims to pseudo-formalise the workflow of Test-Driven Development by describing the “rules” of TDD using pseudocode.
As I discussed in my first post on Usage-Driven Design, I’ll be aiming to express workflow declaratively, so order is implied rather than imperatively hardwired. (Instead of “boil the kettle, then pour the water” I might say “When the water is poured, it should be near boiling”).
So, for a strictly usage-driven workflow, instead of expressing it as “Except for a test, use the thing and then, if it doesn’t exist, declare it”, I’ve made it “When something in code is declared, it should either be a test, or it should have at least one reference already”.
A closely-related “rule” that we find helpful in making sure our tests – and therefore our test-driven solution code – are being driven by user outcomes is to write the test’s assertion first and work backwards from that. Describe the outcome we want, and work backwards to the simplest design that will deliver that outcome.
Again, I’m keen to avoid hardwiring a workflow like “Write the test assertion, then write the code – the set-up and the action – needed to check that assertion” because of the noise and the mess that could well intervene.
Instead, I can express it declaratively as “When adding code (a child, in syntax tree terms) to the body of a test, it should either be an assertion, or there should already be an assertion”.
In pseudocode, when a child is added to the syntax tree of the body of a test:
child.isAssertion || test.assertions.count > 0
As with the previous pseudo-formalisation, this would need to be translated into suitable terms for your programming language and IDE.
When applied, if I write this code:
@Test
void roverTurnsRightFromNorth(){
MarsRover rover = new MarsRover(Direction.NORTH);
}
It will report a workflow error: I should have written a test assertion first. And with the Usage-Driven Design rule applied, it will also report an error that I’ve declared a variable that isn’t being used.
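If I begin with the assertion instead – working backwards from the outcome – the starting point might look something like this (getDirection is a hypothetical accessor, and rover is deliberately not yet declared, so the compiler error can drive its declaration next):

```java
@Test
void roverTurnsRightFromNorth(){
assertEquals(Direction.EAST, rover.getDirection());
}
```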
By starting with the assertion and working backwards, I’m sticking to both the Assert First and Usage-Driven Design workflows and my “workflow linter” remains silent. (I’m going to experiment with different policies on this – maybe there could be some kind of indication when rules aren’t broken, too, to let you know when you done good?)
We’re back to this old chestnut. In one of the exercises on my Code Craft course called “The CD Warehouse”, one of the use cases is that customers can buy a CD from the warehouse.
The most common design solution is to add something like a buyCD(artist, title, card) method to a Warehouse class. And, given that this is an exercise in modular design, this naturally raises the question of encapsulation.
When a customer buys a CD, their card is charged the price of that CD, and the stock of that CD is reduced by one. When I parse that sentence, I see “of that CD” appear twice.
When we give responsibility to the Warehouse for achieving those outcomes, we end up with Feature Envy for cd.price and cd.stock. We also end up with the need to find the CD that we’re talking about, searching in the catalogue by artist and title.
So we tend to end up with more code, and with more coupling than if the responsibility was shifted to where the data is – e.g., cd.buy(card).
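To make the contrast concrete, here’s a minimal runnable sketch of the shifted responsibility. The class and method names are my own illustration, not the course solution – the point is that the CD charges the card its own price and decrements its own stock, so price and stock never leak out:

```java
import java.math.BigDecimal;

// Hypothetical sketch: responsibility lives where the data lives.
class Card {
    BigDecimal charged = BigDecimal.ZERO;
    void charge(BigDecimal amount) { charged = charged.add(amount); }
}

class CD {
    private final BigDecimal price; // stays private - no Feature Envy
    private int stock;              // stays private - no Feature Envy

    CD(BigDecimal price, int stock) { this.price = price; this.stock = stock; }

    // "When a customer buys a CD, their card is charged the price of that CD,
    // and the stock of that CD is reduced by one."
    void buy(Card card) {
        card.charge(price);
        stock--;
    }

    int getStock() { return stock; }
}

public class Main {
    public static void main(String[] args) {
        CD cd = new CD(new BigDecimal("9.99"), 5);
        Card card = new Card();
        cd.buy(card);
        System.out.println("stock=" + cd.getStock());  // stock=4
        System.out.println("charged=" + card.charged); // charged=9.99
    }
}
```

No Warehouse search by artist and title, and no getters for price or stock needed by some other class handling the purchase.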
When I raise this with pairs, a common response is “But Jason, CD’s don’t buy themselves!” And this steers into a more philosophical conversation about what the objects are in object-oriented programming, and how we read OO code.
To many, cd.buy(card) means telling the cd to buy that card. I don’t read OO code that way. To me, cd.buy(card) means “buy this cd using that card”. cd is the object, not the subject. It’s the thing that’s being bought.
Think about it: if we wrote this in C, it would be buy(cd, card). As a convention, any function applied to an object – an instance of a user-defined data type – would take that instance as the first parameter. And what did we use to call that first parameter – the thing to which the function is being applied? We called it “this”. That’s where that comes from.
So cd.buy(card) means exactly the same thing as buy(cd, card). OOP just flips it around in the syntax so that the “this” parameter is placed in front of the function. In most OO languages, you read it backwards: object.action().
This relates directly to encapsulation. We want the effect of a function to be contained in the same place – in the same class. When it isn’t, we end up having to share internal details with whichever other class is handling that piece of business – otherwise known as “coupling”.
This is the first part in a series of posts where I’m going to try to crystallise my ideas about how TDD really works, ideally expressed pseudo-formally (with pseudocode).
Workflow Linting. Like Code Linting, Only The Same.
It’s part of an ongoing side-project to create what I’ve provisionally titled a “workflow linter”. Many of us are familiar with code linters. They walk the abstract syntax tree of our code, applying rules to each node depending on what type of code element it is – rules that apply to classes, rules that apply to methods, rules that apply to parameters, and so on.
In a very real sense, programming languages can be described as Finite State Machines, and some parsers are even event-driven. Instead of building an abstract syntax tree, the rules of the language are applied as one language element transitions to the next: e.g., the reserved word “public” can only be followed by certain allowed elements, like a type name, or “void” or “static” etc.
The same idea can be applied to workflow. Boiling the kettle can be followed by pouring the water into a teapot. We do not pour the water and then boil it, just like we do not write “void public”.
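As a toy illustration of that hardwired approach (the step names are invented for the tea-making example), the allowed transitions can be captured in a lookup, exactly like a parser rejecting “void public”:

```java
import java.util.Map;
import java.util.Set;

// Toy sketch: workflow steps as FSM states with allowed transitions.
public class WorkflowFsm {
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "boilKettle", Set.of("pourWater"),
        "pourWater", Set.of("addMilk", "drink"),
        "addMilk", Set.of("drink")
    );

    // A transition is legal only if it's in the lookup - anything else
    // is a "syntax error" in the workflow.
    static boolean isAllowed(String from, String to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("boilKettle", "pourWater")); // legal
        System.out.println(isAllowed("pourWater", "boilKettle")); // "void public"
    }
}
```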
Workflows As Finite State Machines vs. The Real World
Modelling workflows as FSMs can be helpful for visualising them, and that’s traditionally how processes like Test-Driven Development have been described.
A large part of my job as a trainer and mentor could be described as “workflow linting”. I observe developers doing TDD or refactoring or Continuous Integration, and I have an idea in my head of what they should be doing at any given stage in that workflow. So when I see someone start to write the code to pass the test, but we haven’t seen the test fail yet, a little light goes ping in my head, and I intervene.
And if our goal is behaviour modification – habit forming, basically – then the best time to give feedback is as it’s happening. This is one of the reasons why organisations who rely mostly on after-the-fact feedback – Pull Request code reviews, weekly or monthly appraisals etc – tend to be places where developers learn the slowest. Junior developers who are left mostly working on their own tend to remain junior developers for a long time. Sometimes forever.
So it’s important when we’re mentoring developers on specific processes to have a clear internal mental model of those workflows – the light that goes “ping” when they deviate.
The problem with internal mental models is they don’t transmit easily. So we need to express them somehow in order that we can teach them (to people, to computers, to squirrels who need to know when they shouldn’t distract us, and so on.)
In the real world, though, things are messy and noisy. Maybe I boiled the kettle but forgot to put the tea in the pot, so I have to do that next. Or maybe I got distracted by a squirrel in the garden, lost track of time and had to boil the kettle again. Maybe I don’t even have a kettle, and need to boil the water in a pan.
The stream of events created by our IDEs as we work is equally messy and noisy. Maybe I wrote the failing test but when I ran it, it actually passed first time. Maybe I decided to rewrite the assertion. Maybe I deleted the test without running it and started again. Maybe I got distracted by a squirrel in the garden. Etc.
Declarative Implied Workflow
Hardwired imperative workflows of the “A → B → C” variety can be brittle, and don’t handle mess and noise well. You’re left having to map every possible edge case, every allowable scenario. Things can get very complicated very quickly. And no matter how much we try to think of everything, we invariably miss cases. “Computer says ‘no’.”
A more flexible and robust way to describe workflow is to imply ordering of events declaratively*. Instead of “Boil the kettle, then pour the water into the teapot”, we might say “When the water is poured into the teapot, it must be near boiling”. Now all of our edge cases work. We could have boiled it in a pan. We could have boiled it in a microwave. We could have watched the squirrel and then boiled the kettle again. Workflow is implied: the water has been boiled, but we leave ourselves many more routes to pouring it in the teapot.
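That rule can be sketched as a predicate evaluated when an event occurs, rather than a hardwired sequence. The event name and temperature threshold here are my own invention for the tea example:

```java
import java.util.Optional;

// Sketch of a declarative workflow check: the rule fires when the event
// happens, and says nothing about how we got there.
public class DeclarativeWorkflow {
    record Event(String name, double waterTempCelsius) {}

    // "When the water is poured into the teapot, it must be near boiling."
    // Kettle, pan, microwave, second boil after the squirrel - all fine.
    static Optional<String> check(Event e) {
        if (e.name().equals("pourWater") && e.waterTempCelsius() < 95.0) {
            return Optional.of("Water should be near boiling");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(check(new Event("pourWater", 98.0)).orElse("OK"));
        System.out.println(check(new Event("pourWater", 60.0)).orElse("OK"));
    }
}
```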
A similar approach can be taken with workflows like TDD; instead of “Run the test to see it fail, then write the simplest code to pass it”, we could say “When we write the code to pass the test, it must be failing”. It’s a subtle but important distinction, and one which I’ve realised I’ve been using for a long time when I observe developers working.
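In the same pseudocode style as the other rules in this series, when a change is made to the solution (non-test) code:

```
failingTests.count > 0
```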
On training courses, in particular, I’m popping my head around the proverbial door while pairs are in the middle of something. Subconsciously, I have to establish where they are in the process without seeing them follow the workflow. If they’re writing implementation code and I see a test failing that appears to be related, I can deduce that they’re writing code to pass that test. If no tests are failing, and they appear to be adding behaviour, then – without seeing them do the whole “A → B → C” – the little light goes “ping” and the “workflow linter” reports an error.
(The limitation of this meat-based workflow linter is availability. If there are six pairs doing a 90-minute exercise, they get maybe 15 minutes with the benefit of the linter. That might be 1-2 TDD cycles. This, by the way, is why follow-up coaching can be such a good investment. Hence this ongoing side-project to see if I can create an IDE plugin that can do a bit more than my t-shirt that says “NOW RUN YOUR TESTS”.)
Usage-Driven Design Formalised Declaratively
Anyhoo, back to the main feature.
UDD is a core component of TDD. Arguably, it’s the whole point of it; driving our implementation design by working backwards from how it’s used in tests.
Informally, it goes like this: you use it in a test, and that tells you that you need it. Usage comes first. Practically, I don’t declare a ShoppingBasket class:
public class ShoppingBasket {
}
And then instantiate it in a test:
@Test
void totalOfEmptyBasket(){
ShoppingBasket basket = new ShoppingBasket();
}
I instantiate it in my test, and when the compiler tells me that class doesn’t exist, that tells me that I need to declare it. Use it, and that tells you that you need it. It always flows in that direction.
Formalising this rule declaratively is actually pretty straightforward; any time I declare something, it must either be a test (a test method or a test class), or it must already have a reference – it must already be used.
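In pseudocode, when a declaration is added to the syntax tree:

```
declaration.isTest || declaration.references.count > 0
```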
Applying this rule when a declaration is added to the code’s syntax tree should catch anyone declaring things that are not used in a test, or used by something that’s used in a test – basically, nothing can exist that isn’t in the call stack of a test except a test.
I might even go so far as to proclaim this as a formal definition of the “test-driven” part of TDD. Nothing can exist unless a test is directly or indirectly using it. Or, as I tell customers, “If there isn’t a test for it, you ain’t getting it”.
Of course, this rule has to be adapted to work in Java, with JUnit, inside, say, IntelliJ’s or Eclipse’s event model, so translation is required. But I have a version of this in an IntelliJ plugin, and it does appear to work, and to be mess- and noise-resistant.
* Formal Methods folks might think this is a bit like temporal logic. It’ll be our little secret.
To me, a code review done by people is exploratory testing. We gather around a piece of code (e.g., a merge diff for a new feature). We go through the code and we ask ourselves “What do we think of this?”
Maybe we see a method or a function that has control flow nested 4 deep. Eurgh! Difficult to test and easy to break (such cyclomatic complexity, much wow). So we flag it up.
So far, so normal.
Once that code quality “bug” has been flagged up, I’m sure we both agree that it needs fixing. So we fix it. Job done? Now, here’s where you and I may part company.
It’s almost guaranteed that won’t be the last time that problem rears its head in our code. So, as well as fixing the complex conditional we found in our review, we also fix the process that allowed the problem to make it that far – and waste a bunch of time – in the first place.
When we find a logic error in our code by exploratory testing, we don’t just fix the bug. We write a regression test in case a future change breaks it again. (We do, right?)
And when we find a code quality bug, we shouldn’t just refactor that example. We should add a code quality check for it to our suite of code inspections – automated if at all possible – that will catch it as soon as it reappears somewhere else in the code.
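As a toy illustration of writing such a custom rule, here’s a crude sketch that flags nesting deeper than the 4 levels we spotted in the review. A real rule would walk the AST like any other linter rule, rather than counting braces in a string:

```java
// Crude sketch of a custom "nesting depth" inspection.
// Brace depth 1 is the method body itself, so a depth over 4
// means control flow nested 4 or more levels deep inside it.
public class NestingCheck {
    static int maxDepth(String code) {
        int depth = 0, max = 0;
        for (char c : code.toCharArray()) {
            if (c == '{') { depth++; max = Math.max(max, depth); }
            else if (c == '}') depth--;
        }
        return max;
    }

    public static void main(String[] args) {
        String nasty = "void f(){ if(a){ if(b){ if(c){ if(d){ } } } } }";
        System.out.println(maxDepth(nasty) > 4 ? "FLAG: nesting too deep" : "ok");
    }
}
```

Wired into the Continuous Inspection suite, this would catch the next deeply-nested conditional before it ever reached a human reviewer.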
Now, you can take this too far, as with all things. Automating regression tests is D.R.Y. applied to our testing process. If we need to perform the same test over and over, automating it makes a lot of sense.
But D.R.Y. has caveats, and one of those caveats is The Rule of Three. On average, we wait until we’ve seen something repeated 3 times before we refactor. This increases the odds that:
a. The refactoring will pay off later (the more examples we see, the more likely there’ll be more in the future)
b. We have more examples to guide us towards the better design.
Both apply as much to duplicated effort in our process as they do to duplication in our code.
So we might want to keep a log of the problems our code reviews find, and if we see the same type of issue appear 3 times (or thereabouts), that might be our cue to look into building a check for it into our Continuous Inspection rule suite. Maybe our linter already has a rule we can use. Maybe we’ll need to write our own custom rule for it. (That’s a very undervalued skillset, BTW). Maybe we could train a small ANN to detect it. Maybe we’ll need to add it to the manual inspection checklist.
A good yardstick might be that the same type of code quality issue doesn’t appear in merges (or attempted merges) more than 3 times.
And there’s more. There are teams I’ve worked with who not only add a check to their Continuous Inspection suite, but also ask “Why does this problem happen in the first place?” How do 500-line functions become 500-line functions? How do deeply-nested IFs become deeply-nested IFs? How do classes end up with 12 distinct responsibilities and 25 dependencies?
The answer, BTW, is that – more often than not – the way functions get 500 lines long, IFs get deeply nested and classes end up doing so many different things is because the people writing that code didn’t see it as a problem.
I’ve joked in the past that what really makes LLMs work is our tendency to see faces on toast, but there’s a more serious point there about how much of our perception of the ability of models to “understand”, “reason”, “follow instructions” etc is in reality projection.
We’ve evolved to read intention into the behaviour of other people so that we can predict what they might do. But we can also see intent in the behaviour of pets, weather, dishwashers, etc etc. So we shouldn’t be too surprised if something that’s designed to statistically reproduce human creativity and reasoning has that effect on many of us to a much greater extent.
I certainly fell for it during the first few hours experimenting with GPT-4, until I played it at chess, and then the curtain was pulled back. It doesn’t know where the pieces are on the board, it doesn’t plan ahead, it doesn’t know the rules. It literally just predicts – by matching the sequence of moves to its vast example space of chess transcripts – which chess move is most likely to come next.
Once you’ve seen it, it can’t be unseen. But I appreciate that a lot of people have yet to see the tiger in the Magic Eye picture. The dazzling complexity of human language makes it hard to see the wood for the trees. That’s why something simple and deterministic, like chess, makes it much clearer.
As impressive as LLMs can be, I encourage users not to mistake powerful pattern matching and next-token prediction for actual intelligence or understanding. I urge folks who use these tools – which is all they are – to take a rational and evidence-based approach to them, as I’ve been doing for 2 1/2 years now.
Your cat doesn’t understand what you’re saying. It can learn to recognise certain words, your tone of voice, your body language, and associate it with – for example – imminent treats or bath time. That learned behaviour can be easily mistaken for actual conceptual understanding.
Clever Hans couldn’t do arithmetic when he couldn’t see his trainer, but not even his trainer realised he was subconsciously giving off visual cues. Oh yeah, and LLMs don’t understand the instructions and rules in your claude.md file. (A good test is to add a “brown M&Ms” rule to your context.)
But we’re hardwired to see that in them, and it’s a very powerful effect. I see much confirmation bias, for example, in interpreting output – a strong desire to focus on the things they get right while overlooking many of the things they got wrong. And they get a lot wrong.
Expect that, because it’s not going to get much better. You have to keep these tools on a very tight leash.
The “brown M&Ms” test? This is a famous story about Van Halen’s rider for concerts. It was often used to imply that the band were absolute divas, but it had a serious purpose: a little detail like that, buried in the contract, revealed whether the contract had been read carefully. When a venue didn’t observe it, the band would double-check everything in their very complex stage show.
Going through the practices that many software developers report improve the results they get with “A.I.” coding assistants – the ones I’ve managed to reproduce myself.
* Small, task-specific prompts – solve one problem at a time, reconstruct context for the next task (only what the model needs to know)
* Prompting with tests/usage examples
* Tight feedback loops with continuous testing, code review and refactoring
* Merciless version control, with commits on every acceptable outcome, hard reset when it breaks the code
* Merging directly to the trunk (the code-generating firehose has a tendency to overwhelm PR-based processes)
* External deterministic sources of truth – the code as it is now, test results, linter output, mutation testing reports etc (as opposed to what the context says these things are, or what the model tells us)
* Code that clearly communicates intent
* Good, clean separation of concerns – smaller “blast radius” for changes
* A “birds-eye” view that the developer maintains, because big picture isn’t something LLMs can handle reliably – e.g., high-level design sketches, screenflows and wireframes, test lists and so on. Basically, don’t rely on the LLM to set high-level direction or to plan long-range. That requires actual intelligence.
Two very interesting things have my focus now. The first is the extent to which these practices have a net effect of reducing uncertainty (increasing confidence) in next-token prediction, and/or of minimising bottlenecks in the development process.
This makes sense: any strategy that might actually work would need a causal mechanism in the models themselves. And given that research seems to be rapidly converging on the same mechanism, I think we may well be on to something here.
The second is just how familiar these practices look: small steps (limited WIP), rapid feedback loops, continuous testing, code review, refactoring, integration etc. Remind you of anything, perhaps something a little extreme?
Now, that surely can’t be a coincidence?
Is the net effect of XP technical practices essentially the same, and with the same causal mechanism – maximising confidence in the output by continuously attacking uncertainty, which we achieve by minimising bottlenecks in the feedback loops?