The LLM In The Room

Over 2 years ago, the at-the-time not-for-profit research organisation OpenAI released a new version of their Large Language Model, GPT 3.5, under the friendlier brand name of ChatGPT, and started a media and market frenzy.

This was arguably the first time a chat interface could genuinely fool users into believing it was a person, and there was much talk about the age of “artificial general intelligence” and even “super-intelligence” now being upon us. Many pundits predicted the end of knowledge workers like lawyers, doctors, and – of course – software developers within a few years.

Naturally, this was a claim I had to check out for myself, so when GPT-4 was released a few months later, I signed up for the “Plus” version of ChatGPT to get (limited) access to it and started to experiment in various problem domains, including programming and software development.

Like millions of people, I was initially very impressed with GPT-4 (not so much with 3.5, I have to say). But as I started to try to actually do things – specific things – with it, its limitations became more and more apparent. While it is indeed remarkable that what is essentially a predictive texting engine can write Python or Java or C# that actually compiles – let’s not take that away from OpenAI – the actual code itself was less impressive.

In fact, it was often not acceptable at all. LLMs – and generative transformers more generally – are not very good at specifics. An honest marketing slogan for the technology might be “Impressive, but wrong.”

I found myself having to double-check everything, correct more than half the code, and routinely ended up having to “coach” GPT-4 to get half-decent results that didn’t look like they were written by an intern in a hurry. This often took longer than if I’d just written the code myself. As I’ve evaluated each new model, this has stubbornly remained the case.

No doubt in the intervening 20 months, “A.I. coding assistants” have improved, and I’ve been keeping a close eye on new models as they’ve been emerging to see just how much improved. In January 2025, we’re still at a point where LLM-generated code needs double-checking, correcting and refactoring too often to make these tools usable on anything beyond small one-shot “How do I…?” tasks. They are – as of today – at best, conversational interfaces to code examples included in their training data. They’re an improvement on Stack Overflow searches.

Hyperbolic claims by some of achieving 10x or even 100x productivity with these tools, or of non-programmers creating complex working products with them, like reports of flying saucers, have a tendency to evaporate on contact with reality. As yet, I’ve not seen a shred of hard evidence to back them up.

More tempered claims of modest productivity gains, backed up by hard data (i.e., not surveys of how productive devs feel LLMs are making them), paint a very ambiguous picture. Maybe they help a little. Maybe they don’t. Programming’s such a small part of software development that even if they did speed it up 10x – which at this point I’m confident they don’t – that’s a 90% saving on 10% of the work, or about 9% overall. There’s even hard data to suggest that, at the team level – and that’s where the productivity rubber meets the road – extensive LLM use can actually have a small negative impact. More code faster != more value sooner. I try to bear in mind that the feeling of productivity can often be deceptive. (For every “I stayed late and wrote a tonne of code uninterrupted” story, there are usually at least four more “I spent the whole morning trying to understand 100s of changes some dude had pushed the night before” stories.)
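To put rough numbers on the “90% saving on 10% of the work” point, here is a back-of-the-envelope sketch in Python using Amdahl's law. The 10% coding share is the illustrative figure from the paragraph above, not a measurement, and the function names are my own.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole process when `fraction`
    of the work is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def time_saved(fraction: float, factor: float) -> float:
    """Share of total elapsed time saved by that same speedup."""
    return 1.0 - ((1.0 - fraction) + fraction / factor)


# Assumption for illustration only: coding is 10% of total delivery effort.
coding_share = 0.10

for factor in (2, 10, 100):
    print(
        f"{factor}x faster coding: "
        f"{time_saved(coding_share, factor):.1%} of total time saved, "
        f"{overall_speedup(coding_share, factor):.2f}x overall"
    )
```

Under that assumption, a 10x coding speedup saves 9% of the total, a 100x speedup saves 9.9%, and no speedup, however large, can save more than the 10% that coding takes up in the first place.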

One obvious long-term risk of having a big chunk of your code generated by “AI” at speed is that a team’s understanding of their code base will run away from them, creating a kind of “comprehension debt” that seems likely to significantly increase the cost of fixing problems that the LLM can’t fix. We should keep an eye on the Mean Time To Recovery of businesses who proudly claim that a growing percentage of their code’s “AI-generated” (presumably to impress investors).

Now, a conversational interface to gazillions of code examples – a kind of Stack Overflow++ – is not to be sniffed at. Good for them! But what it most certainly is not is a replacement for actual software developers. Not even close. But outside of our profession, the confident pronouncements by CEOs and pundits in the media that it is have been doing real damage to the industry.

As “software developers”, they remain stubbornly not good enough. It would appear that this is an un-fixable problem, no matter how much training data and compute they throw at it. Pattern matchers are gonna pattern-match!

At some point, even investors, executives and commentators are going to be confronted with the reality that this technology hasn’t replaced any software developers. If anything, all the low-quality code these tools are churning out is creating a Mount Everest of technical debt that will require even more developers to keep the wheels on their enterprises turning in the future.

At this point, someone usually says “Ah, but Jason, maybe they’re not good enough now, but what about future models?” And this is where we all place our bets.

Some, like Microsoft, OpenAI and Nvidia, are betting that model performance is just going to keep improving until we reach AGI and beyond, even if we have to burn the planet to get there. This is their “growth story” upon which their current stock prices – riding at record highs – are based. If it’s not true, their stock prices will plummet back to what they were before this current “A.I.” bubble started to inflate. That’s trillions of dollars wiped off the NASDAQ. So there are a lot of very wealthy people with a very big interest in it turning out to be true. This is the biggest bet in history.

So anything that one of these models does that kind of sort of looks like AGI – in a certain light, from a distance, if we squint – is leapt upon as evidence that the Singularity is upon us, and that we should all start digging bunkers and buying canned goods in preparation for the inevitable Butlerian Jihad.

I’m skeptical of that. These claims are usually supported by A.I. performance benchmarks, and the models can be trained and fine-tuned to do well in these standard tests. There’s no shortage of training data.

And when I say “well”, I mean not as well as a human expert, but better than the average Joe. And while the gap closes little by little, that “little” seems to get “littler” with each new iteration. I speculated that transformer performance would converge on not-quite-good-enough. Needs more work. See me after. Not so much “super-intelligence” as “super-mediocrity”. Yes, it can write code, but not good code. Yes, it can play chess. Just not well. And so on.

The strength of LLMs is that they are not-quite-good-enough at very many text-based problems. But commercially, what’s the value proposition here? A not-quite-good-enough programmer that is also a not-quite-good-enough tax lawyer? An under-performing car that can also bake cakes is still an under-performing car.

And even as LLMs inch forward, there’s also the cost to consider with each new model. At $20 per month for ChatGPT Plus, OpenAI were losing money hand over fist. The price of the new Pro plan is ten times that. And they’re still burning through enormous amounts of investor cash. Executives at OpenAI have recently been floating the idea of a $2,000/month plan. But would they break even at that price? Reports that a single task performed by the newest model, in “high-compute” mode, can cost thousands of dollars, and still fall short of expert performance, make me wonder if the final destination of all this research, all this fanfare, and all this MONEY, might be a world where human experts are both the better and the cheaper option. That would be very funny. I would laugh a lot as the world economy collapses!

Much has been made of the idea that the newest models can follow and evaluate multiple “chains of thought”, and there seems little doubt that this improves their performance in benchmark tests. I’m not at all convinced that this is, as the makers claim, “reasoning”.

There’s also the question of what these models are evaluating their “chain of thought” against. What’s telling them that this is the right maths answer, or the best chess move, or the right Python code? How could a language model know?

I wonder if OpenAI are, in these cases, using their LLMs as interfaces to, say, maths programs, or chess programs, or Python testing or linting tools. And is that “artificial general intelligence”, or is that a natural language interface to point solutions; application-specific intelligence?
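To make that speculation concrete, here is a toy sketch of the “interface to a point solution” pattern: candidate answers are ranked not by the model but by an external checker. Everything in it is hypothetical. The hard-coded `candidates` list stands in for model output, `passes_checks` stands in for a real test or lint tool, and none of it is a claim about how OpenAI's systems actually work.

```python
from typing import Callable, Iterable, Optional


def passes_checks(source: str) -> bool:
    """Hypothetical external verifier: run the candidate and test it.
    The 'knowledge' of what is correct lives here, not in the generator."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy example only; never exec untrusted code
        return namespace["add"](2, 3) == 5
    except Exception:
        # Syntax errors, missing functions and runtime errors all count as failure.
        return False


def first_verified(
    candidates: Iterable[str], verifier: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


# Stand-ins for several generated attempts at the same task.
candidates = [
    "def add(a, b): return a - b",  # runs, but gives the wrong answer
    "def add(a, b) return a + b",   # does not even parse
    "def add(a, b): return a + b",  # passes the check
]

chosen = first_verified(candidates, passes_checks)
print(chosen)
```

In this sketch the generator never needs to know which answer is right. The test does, which is exactly why I'd call it application-specific rather than general intelligence.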

And after all that, the end results are still not-quite-good-enough, even with oceans of computing power thrown at the problem.

I don’t have a crystal ball, so this is just a bet. And I’m betting that LLMs will eventually – once decision makers finally see the tiger in the Magic Eye picture of generative A.I. – find their natural fit in the world as very impressive conversational natural language interfaces. The question that follows is: natural language interfaces to what, exactly? And in many cases, the answer is: something we haven’t figured out how to build yet.

So, back into A.I. winter we go, until the next major breakthrough. Perhaps next time, businesses will have been so badly burned by the crash – we’ve never seen tech hyped on this scale before, and it’s distorting everything – that they’ll think a little more critically about claims of “A.G.I.” and “super-intelligence”.

I’d like to think that investors and executives, unlike LLMs, are capable of learning from experience and applying a little dynamic reasoning next time around.

In the meantime, we – software developers, and the businesses who rely on us – have a looming pipeline problem of potentially epic proportions. Businesses who’ve stopped hiring and training entry-level developers because “GitHub Copilot can do what they do” are going to find out what happens when nobody plants tomatoes because “Hey, who needs tomatoes? We’ve already got pasta sauce”.

Combine that with a backlog that stretches to the Moon of real business problems neglected while “A.I.” has been sucking all the oxygen out of the room, and a planet-sized amount of LLM-generated technical debt, and you have the perfect storm.

When that happens, I’ll be here if you need me, shopping for superyachts 🙂

NB: For those thinking “Yes, but what about the environmental and ethical impact of LLMs?”: as a paid-up member of the Green Party, I’m right there with you. But my argument isn’t aimed at people with a track record of making business decisions on ethical grounds. We don’t live in that world any more (if we ever did).