As you already know if you’ve read any of this blog in the last few years, I am
a somewhat reluctant — but nevertheless quite staunch — critic of LLMs. This
means that I have enthusiasts of varying degrees sometimes taking issue with my
stance.
It seems that I am not going to get away from discussions, and, let’s be
honest, pretty intense arguments about “AI” any time soon. These arguments are
starting to make me quite upset. So it might be time to set some rules of
engagement.
I’ve written about all of these before at greater length, but this is a short
post because it’s not about the technology or making a broader point, it’s
about me. These are rules for engaging with me, personally, on this topic.
Others are welcome to adopt these rules if they so wish but I am not
encouraging anyone to do so.
Thus, I’ve made this post as short as I can so everyone interested in engaging
can read the whole thing. If you can’t make it through to the end, then please
just follow Rule Zero.
Rule Zero: Maybe Don’t
You are welcome to ignore me. You can think my take is stupid and I can think
yours is. We don’t have to get into an Internet Fight about it; we can even
remain friends. You do not need to instigate an argument with me at all, if
you think that my analysis is so bad that it doesn’t require rebutting.
Rule One: No ‘Just’
As I explained in a post with perhaps the least-predictive title I’ve ever
written, “I Think I’m Done Thinking About genAI For
Now”, I’ve already
heard a bunch of bad arguments. Don’t tell me to ‘just’ use a better model,
use an agentic tool, use a more recent version, or use some prompting trick
that you personally believe works better. If you skim my work and think that I
must not have deeply researched anything or read about it because you don’t
like my conclusion, that is wrong.
Rule Two: No ‘Look At This Cool Thing’
Purely as a productivity tool, I have had a terrible experience with genAI.
Perhaps you have had a great one. Neat. That’s great for you. As I explained
at great length in “The Futzing Fraction”,
my concern with generative AI is that I believe it probably has a net
negative impact on productivity, based on both my experience and plenty of
citations. Go check out the copious footnotes if you’re interested in more
detail.
Therefore, I have already acknowledged that you can get an LLM to do various
impressive, cool things, sometimes. If I tell you that you will, on average,
lose money betting on a slot machine, a picture of a slot machine hitting a
jackpot is not evidence against my position.
Rule Two And A Half: Engage In Metacognition
I specifically didn’t title the previous rule “no anecdotes” because data
beyond anecdotes may be extremely expensive to produce. I don’t want to say
you can never talk to me unless you’re doing a randomized controlled trial.
However, if you are going to tell me an anecdote about the way that you’re
using an LLM, I am interested in hearing how you are compensating for the
well-documented biases that LLM use tends to induce. Try to measure what you
can.
Rule Three: Do Not Cite The Deep Magic To Me
As I explained in “A Grand Unified Theory of the AI Hype
Cycle”, I already know quite a bit of
history of the “AI” label. If you are tempted to tell me something about how
“AI” is really such a broad field, and it doesn’t just mean LLMs, especially if
you are trying to launder the reputation of LLMs by jumbling them
together with other things that have been called “AI”, I assure you that
this will not be convincing to me.
Rule Four: Ethics Are Not Optional
I have made several arguments in my previous writing: there are ethical
arguments, efficacy arguments, structuralist arguments, efficiency arguments
and aesthetic arguments.
I am happy to, for the purposes of a good-faith discussion, focus on a specific
set of concerns or an individual point that you want to make where you think I
got something wrong. If you convince me that I am entirely incorrect about the
effectiveness or predictability of LLMs in general or of a specific LLM product,
you don’t need to make a comprehensive argument about whether one should use
the technology overall. I will even assume that you have your own ethical
arguments.
However, if you scoff at the idea that one should have any ethical boundaries
at all, and think that there’s no reason to care about the overall utilitarian
impact of this technology, that it’s worth using no matter what else it does as
long as it makes you 5% better at your job, that’s sociopath behavior.
This includes extreme whataboutism regarding things like the water use of
datacenters, other elements of the surveillance technology stack, and so on.
Consequences
These are rules, once again, just for engaging with me. I have no particular
power to enact broader sanctions upon you, nor would I be inclined to do so if
I could. However, if you can’t stay within these basic parameters and you
insist upon continuing to direct messages to me about this topic, I will
summarily block you with no warning, on Mastodon, email, GitHub, IRC, or
wherever else you’re choosing to do that. This is for your benefit as well:
such a discussion will not be a productive use of time for either of us.
The dawning of a new year is an opportune moment to contemplate what has
transpired in the old year, and consider what is likely to happen in the new
one.
Today, I’d like to contemplate that contemplation itself.
The 20th century was an era characterized by rapidly accelerating change in
technology and industry, creating shorter and shorter cultural cycles of
changes in lifestyles. Thus far, the 21st century seems to be following that
trend, at least in its recently concluded first quarter.
The first half of the twentieth century saw the massive disruption caused by
electrification, radio, motion pictures, and then television.
In 1971, Intel poured gasoline on that fire by releasing the 4004, a microchip
generally recognized as the first general-purpose microprocessor. Popular
innovations rapidly followed: the computerized cash register, the personal
computer, credit cards, cellular phones, text messaging, the Internet, the web,
online games, mass surveillance, app stores, social media.
These innovations arrived faster than those of previous generations, but they also
have crossed a crucial threshold: that of the human lifespan.
While the entire second millennium A.D. has been characterized by a gradually
accelerating rate of technological and social change — the printing press and
the industrial revolution were no slouches, in terms of changing society, and
those predate the 20th century — most of those changes had the benefit of
unfolding throughout the course of a generation or so.
This means that any individual person in any given century up to the 20th
might remember one major world-altering social shift within their lifetime,
not five to ten of them. The diversity of human experience is vast, but most
people would not expect that the defining technology of their lifetime was
merely the latest in a progression of predictable civilization-shattering
marvels.
Along with each of these successive generations of technology, we minted a new
generation of industry titans. Westinghouse, Carnegie, Sarnoff, Edison, Ford,
Hughes, Gates, Jobs, Zuckerberg, Musk. Not just individual rich people, but
entire new classes of rich people that did not exist before. “Radio DJ”,
“Movie Star”, “Rock Star”, “Dot Com Founder”, were all new paths to wealth
opened (and closed) by specific technologies. While most of these people did
come from at least some level of generational wealth, they no longer came
from a literal hereditary aristocracy.
To describe this new feeling of constant acceleration, a new phrase was
coined: “The Next Big
Thing”. In addition to
denoting that some Thing was coming and that it would be Big (i.e.: that it
would change a lot about our lives), this phrase also carries the strong
implication that such a Thing would be a product. Not a development in
social relationships or a shift in cultural values, but some new and amazing
form of conveying salted meatpaste or what-have-you, that would make
whatever lucky tinkerer who stumbled into it into a billionaire — along with
any friends and family lucky enough to believe in their vision and get in on
the ground floor with an investment.
In the latter part of the 20th century, our entire model of capital allocation
shifted to account for this widespread belief. No longer were mega-businesses
built by bank loans, stock issuances, and reinvestment of profit; the new model
was “Venture Capital”. Venture capital is a model of capital allocation
explicitly predicated on the idea that carefully considering each bet on a
likely-to-succeed business and reducing one’s risk was a waste of time, because
the return on the equity from the Next Big Thing would be so disproportionately
huge — 10x, 100x, 1000x — that one could afford to make at least 10 bad bets
for each good one, and still come out ahead.
The biggest risk was in missing the deal, not in giving a bunch of money to a
scam. Thus, value investing and focus on fundamentals have been broadly
disregarded in favor of the pursuit of the Next Big Thing.
If Americans of the twentieth century were temporarily embarrassed
millionaires, those of the twenty-first are all temporarily embarrassed
FAANG CEOs.
The predicament that this tendency leaves us in today is that the world is
increasingly run by generations — GenX and Millennials — with the shared
experience that the computer industry, either hardware or software, would
produce some radical innovation every few years. We assume that to be true.
But all things change, even change itself, and that industry is beginning to
slow down. Transistor density is starting to brush up against physical
limits.
Economically, most people are drowning in more compute power than they know
what to do with anyway. Users already have most of what they need from the
Internet.
The big new feature in every operating system is a bunch of useless junk
nobody really wants, and it is seeing remarkably little uptake. Social media
and smartphones changed
the world, true, but… those are both innovations from 2008. They’re just not
new any more.
So we are all — collectively, culturally — looking for the Next Big Thing, and
we keep not finding it.
It wasn’t 3D printing. It wasn’t crowdfunding. It wasn’t smart watches. It
wasn’t VR. It wasn’t the Metaverse, it wasn’t Bitcoin, it wasn’t NFTs1.
It’s also not AI, but this is why so many people assume that it will be AI.
Because it’s got to be something, right? If it’s got to be something then
AI is as good a guess as anything else right now.
The fact is, our lifetimes have been an extreme anomaly. Things like the
Internet used to come along every thousand years or so, and while we might
expect that the pace will stay a bit higher than that, it is not reasonable to
expect that something new like “personal computers” or “the Internet”3
will arrive again.
We are not going to get rich by getting in on the ground floor of the next
Apple or the next Google because the next Apple and the next Google are Apple
and Google. The industry is maturing. Software technology, computer
technology, and internet technology are all maturing.
There Will Be Next Things
Research and development is happening in all fields all the time. Amazing new
developments quietly and regularly occur in pharmaceuticals and in materials
science. But these are not predictable. They do not inhabit the public
consciousness until they’ve already happened, and they are rarely so profound
and transformative that they change everybody’s life.
There will even be new things in the computer industry, both software and
hardware. Foldable phones do address a real problem (I wish the screen were
even bigger but I don’t want to carry around such a big device), and would
probably be more popular if they got the costs under control. One day
somebody’s going to crack the problem of volumetric displays, probably. Some VR
product will probably, eventually, hit a more realistic price/performance ratio
where the niche will expand at least a little more.
Maybe there will even be something genuinely useful, which is recognizably
adjacent to the current “AI” fad, but if it is, it will be some new
development that we haven’t seen yet. If current AI technology were
sufficient to drive some interesting product, it would already be doing it, not
using marketing disguised as
science
to conceal diminishing
returns
on current investments.
But They Will Not Be Big
The impulse to find the One Big Thing that will dominate the next five years is
a fool’s errand. Incremental gains are diminishing across the board. The
markets for time and attention2 are largely saturated. There’s no need for
another streaming service if 100% of your leisure time is already committed to
TikTok, YouTube and Netflix; famously, Netflix has already considered
sleep
its primary competitor for close to a decade, since years before the pandemic.
Those rare tech markets which aren’t saturated are suffering from pedestrian
economic problems like wealth inequality, not technological bottlenecks.
For example, the thing preventing the development of a robot that can do your
laundry and your dishes without your input is not necessarily that we couldn’t
build something like that, but that most households just can’t afford it
without wage growth catching up to productivity
growth. It doesn’t make sense for
anyone to commit to the substantial R&D investment that such a thing would
take, if the market doesn’t exist because the average worker isn’t paid enough
to afford it on top of all the other tech which is already required just to
participate in society.
Even if we were to accept the premise of an actually-“AI” version of this, that
is still just a wish that ChatGPT could somehow improve enough behind the
scenes to replace that worker, not any substantive investment in a novel,
proprietary-to-the-chores-robot software system which could reliably perform
specific functions.
What, Then?
The expectation for, and lack of, a “big thing” is a big problem. There are
others who could describe its economic, political, and financial dimensions
better than I can. So then let me speak to my expertise and my audience: open
source software developers.
When I began my own involvement with open source, a big part of the draw for me
was participating in a low-cost (to the corporate developer) but high-value (to
society at large) positive externality. None of my employers would ever have
cared about many of the applications for which
Twisted forms a core bit of infrastructure; nor would I
have been able to predict those applications’ existence. Yet, it is nice to
have contributed to their development, even a little bit.
However, it’s not actually a positive externality if the public at large can’t
directly benefit from it.
When real world-changing, disruptive developments are occurring, the
bean-counters are not watching positive externalities too closely. As we
discovered with many of the other benefits that temporarily accrued to
labor
in the tech economy, Open Source that is usable by individuals and small
companies may have been a ZIRP phenomenon. If you know you’re gonna make a billion
dollars you’re not going to worry about giving away a few hundred thousand here
and there.
When gains are smaller and harder to realize, and margins are starting to get
squeezed, it’s harder to justify the investment in vaguely good vibes.
But this, itself, is not a call to action. I doubt very much that anyone
reading this can do anything about the macroeconomic reality of higher interest
rates. The technological reality of “development is happening slower” is
inherently something that you can’t change on purpose.
However, what we can do is to be aware of this trend in our own work.
Fight Scale Creep
It seems to me that more and more open source infrastructure projects are tools
for hyper-scale application development, only relevant to massive cloud
companies. This is just a subjective assessment on my part — I’m not sure what
tools even exist today to measure this empirically — but I remember a big part
of the open source community when I was younger being things like Inkscape,
Themes.Org and Slashdot, not React, Docker Hub and Hacker News.
This is not to say that the hobbyist world no longer exists. There is of course
a ton of stuff going on with Raspberry Pi, Home Assistant, OwnCloud, and so on.
If anything there’s a bit of a resurgence of self-hosting. But the interests
of self-hosters and corporate developers are growing apart; there seems to be
far less of a beneficial overflow from corporate infrastructure projects into
these enthusiast or prosumer communities.
This is the concrete call to action: if you are employed in any capacity as an
open source maintainer, dedicate more energy to medium- or small-scale open
source projects.
If your assumption is that you will eventually reach a hyper-scale inflection
point, then mimicking Facebook and Netflix is likely to be a good idea.
However, if we can all admit to ourselves that we’re not going to achieve a
trillion-dollar valuation and a hundred thousand engineer headcount, we can
begin to consider ways to make our Next Thing a bit smaller, and to accommodate
the world as it is rather than as we wish it would be.
Be Prepared to Scale Down
Here are some design guidelines you might consider, for just about any open
source project, particularly infrastructure ones:
Don’t assume that your software can sustain an arbitrarily large fixed
overhead because “you just pay that cost once” and you’re going to be
running a billion instances so it will always amortize; maybe you’re only
going to be running ten.
Remember that such fixed overhead includes not just CPU, RAM, and filesystem
storage, but also the learning curve for developers. Front-loading a
massive amount of conceptual complexity to accommodate the problems of
hyper-scalers is a common mistake. Try to smooth out these complexities and
introduce them only when necessary.
Test your code on edge devices. This means supporting Windows and macOS, and
even Android and iOS. If you want your tool to help empower individual
users, you will need to meet them where they are, which is not on an EC2
instance.
This includes considering Desktop Linux as a platform, as opposed to Server
Linux as a platform; while they certainly have plenty in common, they are also
distinct in some details. Consider the highly specific example of
secret storage: if you are writing something that intends to live in a cloud
environment, and you need to configure it with a secret, you will probably
want to provide it via a text file or an environment variable. By contrast,
if you want this same code to run on a desktop system, your users will
expect you to support the Secret Service (a sketch of this appears just
after this list).
This will likely only require a few lines of code to accommodate, but it is
a massive difference to the user experience.
Don’t rely on LLMs remaining cheap or free. If you have LLM-related
features4, make sure that they are sufficiently severable from the rest of
your offering that if ChatGPT starts costing $1000 a month, your tool
doesn’t break completely (a second sketch after this list shows one way to
gate such a feature). Similarly, do not require that your users have
easy access to half a terabyte of VRAM and a rack full of 5090s in order to
run a local model.
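For the secret-storage point above, the accommodation really can be just a few
lines. Here is a minimal sketch; the environment variable and service names are
hypothetical, and the third-party keyring package is just one convenient way to
reach the desktop Secret Service (or its platform equivalents).

import os
from typing import Optional

def load_api_token() -> Optional[str]:
    # Cloud-style configuration: read the secret from an environment variable.
    token = os.environ.get("MYAPP_API_TOKEN")
    if token:
        return token
    # Desktop-style configuration: ask the platform keyring, which talks to
    # the Secret Service on desktop Linux and the Keychain on macOS.
    try:
        import keyring
    except ImportError:
        return None
    return keyring.get_password("myapp", "api-token")

Either source feeds the same code path, so supporting both costs very little.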
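For the LLM guideline, severability can be a similarly small amount of code.
Another minimal sketch, with everything hypothetical (the environment variable,
the injected model callable, and the fallback behavior); the point is only that
the tool keeps working when the model becomes unavailable or too expensive.

import os
from typing import Callable, Optional

def summarize(text: str, llm: Optional[Callable[[str], str]] = None) -> str:
    # Use the LLM-backed path only when a model call is actually configured.
    if llm is not None and os.environ.get("MYTOOL_LLM_API_KEY"):
        try:
            return llm(text)
        except Exception:
            pass  # the model failed or went away; degrade rather than break
    # Non-LLM fallback: a plain truncated preview keeps the feature usable.
    return text[:200]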
Even if you were going to scale up to infinity, the ability to scale down and
consider smaller deployments means that you can run more comfortably on, for
example, a developer’s laptop. So even if you can’t convince your employer
that this is where the economy and the future of technology in our lifetimes is
going, it can be easy enough to justify this sort of design shift, particularly
as individual choices. Make your onboarding cheaper, your development feedback loops tighter, and your systems generally more resilient to economic headwinds.
So, please design your open source libraries, applications, and services to run
on smaller devices, with less complexity. It will be worth your time as well
as your users’.
But if you can fix the whole wealth inequality thing, do that first.
... or even their lesser-but-still-profound aftershocks like “Social
Media”, “Smartphones”, or “On-Demand Streaming Video” ...
secondary manifestations of the underlying innovation of a packet-switched
global digital network ... ↩
My preference would of course be that you just didn’t have such features
at all, but perhaps even if you agree with me, you are part of an
organization with some mandate to implement LLM stuff. Just try not to
wrap the chain of this anchor all the way around your code’s neck. ↩
You’re working on an application. Let’s call it “FooApp”. FooApp has a
dependency on an open source library, let’s call it “LibBar”. You find a bug
in LibBar that affects FooApp.
To envisage the best possible version of this scenario, let’s say you actively
like LibBar, both technically and socially. You’ve contributed to it in the
past. But this bug is causing production issues in FooApp today, and
LibBar’s release schedule is quarterly. FooApp is your job; LibBar is (at
best) your hobby. Blocking on the full upstream contribution cycle and waiting
for a release is an absolute non-starter.
What do you do?
There are a few common reactions to this type of scenario, all of which are
bad options.
I will enumerate them specifically here, because I suspect that some of them
may resonate with many readers:
Find an alternative to LibBar, and switch to it.
This is a bad idea because a transition to a core infrastructure component
could be extremely expensive.
Vendor LibBar into your codebase and fix your vendored version.
This is a bad idea because carrying this one fix now requires you to
maintain all the tooling associated with a monorepo1: you have to be
able to start pulling in new versions from LibBar regularly, reconcile your
changes even though you now have a separate version history on your
imported version, and so on.
This is a bad idea because you are now extremely tightly coupled to a
specific version of LibBar. By modifying LibBar internally like this,
you’re inherently violating its compatibility contract, in a way which is
going to be extremely difficult to test. You can test this change, of
course, but as LibBar changes, you will need to replicate any relevant
portions of its test suite (which may be its entire test suite) in
FooApp. Lots of potential duplication of effort there.
Implement a workaround in your own code, rather than fixing it.
This is a bad idea because you are distorting the responsibility for
correct behavior. LibBar is supposed to do LibBar’s job, and unless you
have a full wrapper for it in your own codebase, other engineers (including
“yourself, personally”) might later forget to go through the alternate,
workaround codepath, and invoke the buggy LibBar behavior again in some new
place.
Implement the fix upstream in LibBar anyway, because that’s the Right
Thing To Do, and burn credibility with management while you anxiously wait
for a release with the bug in production.
This is a bad idea because you are betraying your users — by allowing the
buggy behavior to persist — for the workflow convenience of your dependency
providers. Your users are probably giving you money, and trusting you with
their data. This means you have both ethical and economic obligations to
consider their interests.
As much as it’s nice to participate in the open source community and take
on an appropriate level of burden to maintain the commons, this cannot
sustainably be at the explicit expense of the population you serve
directly.
Even if we only care about the open source maintainers here, there’s
still a problem: as you are likely to come under immediate pressure to ship
your changes, you will inevitably relay at least a bit of that stress to
the maintainers. Even if you try to be exceedingly polite, the maintainers
will know that you are coming under fire for not having shipped the fix
yet, and are likely to feel an even greater burden of obligation to ship
your code fast.
Much as it’s good to contribute the fix, it’s not great to put this on the
maintainers.
The respective incentive structures of software development — specifically, of
corporate application development and open source infrastructure development —
make options 1-4 very common.
On the corporate / application side, these issues are:
it’s difficult for corporate developers to get clearance to spend even small amounts of
their work hours on upstream open source projects, but clearance to spend
time on the project they actually work on is implicit. If it takes 3 hours
of wrangling with Legal2 and 3 hours of implementation work to fix the
issue in LibBar, but 0 hours of wrangling with Legal and 40 hours of
implementation work in FooApp, a FooApp developer will often perceive it as
“easier” to fix the issue downstream.
it’s difficult for corporate developers to get clearance from management to
spend even small amounts of money sponsoring upstream reviewers, so even if
they can find the time to contribute the fix, chances are high that it will
remain stuck in review unless they are personally well-integrated members of
the LibBar development team already.
even assuming there’s zero pressure whatsoever to avoid open sourcing the
upstream changes, there’s still the fact inherent to any development team
that FooApp’s developers will be more familiar with FooApp’s codebase and
development processes than they are with LibBar’s. It’s just easier to
work there, even if all other things are equal.
systems for tracking risk from open source dependencies often lack visibility
into vendoring, particularly if you’re doing a hybrid approach and only
vendoring a few things to address work in progress, rather than a
comprehensive and disciplined approach to a monorepo. If you fully absorb a
vendored dependency and then modify it, Dependabot isn’t going to tell you
that a new version is available any more, because it won’t be present in your
dependency list. Organizationally this is bad of course but from the
perspective of an individual developer this manifests mostly as fewer
annoying emails.
But there are problems on the open source side as well. Those problems are all
derived from one big issue: because we’re often working with relatively small
sums of money, it’s hard for upstream open source developers to consume
either money or patches from application developers. It’s nice to say that you
should contribute money to your dependencies, and you absolutely should, but
the cost-benefit function is discontinuous. Before a project reaches the
fiscal threshold where it can be at least one person’s full-time job to worry
about this stuff, there’s often no-one responsible in the first place.
Developers will therefore gravitate to the issues that are either fun, or
relevant to their own job.
These mutually-reinforcing incentive structures are a big reason that users of
open source infrastructure, even teams who work at corporate users with
zillions of dollars, don’t reliably contribute back.
The Answer We Want
All those options are bad. If we had a good option, what would it look like?
It is both practically necessary3 and morally required4 for you to have a
way to temporarily rely on a modified version of an open source dependency,
without permanently diverging.
Below, I will describe a desirable abstract workflow for achieving this goal.
Step 0: Report the Problem
Before you get started with any of these other steps, write up a clear
description of the problem and report it to the project as an issue,
specifically, rather than writing it up as a pull request. Describe the
problem before submitting a solution.
You may not be able to wait for a volunteer-run open source project to respond
to your request, but you should at least tell the project what you’re
planning on doing.
If you don’t hear back from them at all, you will have at least made sure to
comprehensively describe your issue and strategy beforehand, which will provide
some clarity and focus to your changes.
If you do hear back from them, in the worst case scenario, you may discover
that a hard fork will be necessary because they don’t consider your issue
valid, but even that information will save you time, if you know it before you
get started. In the best case, you may get a reply from the project telling
you that you’ve misunderstood its functionality and that there is already a
configuration parameter or usage pattern that will resolve your problems with
no new code. But in all cases, you will benefit from early coordination on
what needs fixing before you get to how to fix it.
Step 1: Source Code and CI Setup
Fork the source code for your upstream dependency to a writable location where
it can live at least for the duration of this one bug-fix, and possibly for the
duration of your application’s use of the dependency. After all, you might
want to fix more than one bug in LibBar.
You want to have a place where you can put your edits, that will be version
controlled and code reviewed according to your normal development process.
This probably means you’ll need to have your own main branch that diverges from
your upstream’s main branch.
Remember: you’re going to need to deploy this to your production, so testing
gates that your upstream only applies to final releases of LibBar will need to
be applied to every commit here.
Depending on LibBar’s own development process, this may result in slightly
unusual configurations where, for example, your fixes are written against the
last LibBar release tag, rather than its current main5; if the project has a branch-freshness requirement, you
might need two branches, one for your upstream PR (based on main) and one for
your own use (based on the release branch with your changes).
Ideally for projects with really good CI and a strong “keep main
release-ready at all times” policy, you can deploy straight from a development
branch, but it’s good to take a moment to consider this before you get started.
It’s usually easier to rebase changes from an older HEAD onto a newer one than
it is to go backwards.
Speaking of CI, you will want to have your own CI system. The fact that GitHub
Actions has become a de-facto lingua franca of continuous integration means
that this step may be quite simple, and your forked repo can just run its own
instance.
Optional Bonus Step 1a: Artifact Management
If you have an in-house artifact repository, you should set that up for your
dependency too, and upload your own build artifacts to it. You can often treat
your modified dependency as an extension of your own source tree and install
from a GitHub URL, but if you’ve already gone to the trouble of having an
in-house package repository, you can pretend you’ve taken over maintenance of
the upstream package temporarily (which you kind of have) and leverage those
workflows for caching and build-time savings as you would with any other
internal repo.
Step 2: Do The Fix
Now that you’ve got somewhere to edit LibBar’s code, you will want to actually
fix the bug.
Step 2a: Local Filesystem Setup
Before you have a production version on your own deployed branch, you’ll want
to test locally, which means having both repositories in a single integrated
development environment.
At this point, you will want to have a local filesystem reference to your
LibBar dependency, so that you can make real-time edits, without going through
a slow cycle of pushing to a branch in your LibBar fork, pushing to a FooApp
branch, and waiting for all of CI to run on both.
This is useful in both directions: as you prepare the FooApp branch that makes
any necessary updates on that end, you’ll want to make sure that FooApp can
exercise the LibBar fix in any integration tests. As you work on the LibBar
fix itself, you’ll also want to be able to use FooApp to exercise the code and
see if you’ve missed anything, which is something you wouldn’t get in CI,
since LibBar can’t depend on FooApp itself.
In short, you want to be able to treat both projects as an integrated
development environment, with support from your usual testing and debugging
tools, just as much as you want your deployment output to be an integrated
artifact.
Step 2b: Branch Setup for PR
However, for continuous integration to work, you will also need to have a
remote resource reference of some kind from FooApp’s branch to LibBar. You
will need 2 pull requests: the first to land your LibBar changes to your
internal LibBar fork and make sure it’s passing its own tests, and then a
second PR to switch your LibBar dependency from the public repository to your
internal fork.
At this step it is very important to ensure that there is an issue filed on
your own internal backlog to drop your LibBar fork. You do not want to lose
track of this work; it is technical debt that must be addressed.
Until it’s addressed, automated tools like Dependabot will not be able to apply
security updates to LibBar for you; you’re going to need to manually integrate
every upstream change. This type of work is itself very easy to drop or lose
track of, so you might just end up stuck on a vulnerable version.
Step 3: Deploy Internally
Now that you’re confident that the fix will work, and that your
temporarily-internally-maintained version of LibBar isn’t going to break
anything on your site, it’s time to deploy.
Some time deployed in your own production should help to provide some evidence
that your fix is ready to land in LibBar, but at the next step, please remember
that your production environment isn’t necessarily representative of that of
all LibBar users.
Step 4: Propose Externally
You’ve got the fix, you’ve tested the fix, you’ve got the fix in your own
production, you’ve told upstream you want to send them some changes. Now, it’s
time to make the pull request.
You’re likely going to get some feedback on the PR, even if you think it’s
already ready to go; as I said, despite having been proven in your production
environment, you may get feedback about additional concerns from other users
that you’ll need to address before LibBar’s maintainers can land it.
As you process the feedback, make sure that each new iteration of your branch
gets re-deployed to your own production. It would be a huge bummer to go
through all this trouble, and then end up unable to deploy the next publicly
released version of LibBar within FooApp because you forgot to test that your
responses to feedback still worked on your own environment.
Step 4a: Hurry Up And Wait
If you’re lucky, upstream will land your changes to LibBar. But, there’s still
no release version available. Here, you’ll have to stay in a holding pattern
until upstream can finalize the release on their end.
Depending on some particulars, it might make sense at this point to archive
your internal LibBar repository and move your pinned release version to a git
hash of the LibBar version where your fix landed, in their repository.
Before you do this, check in with the LibBar core team and make sure that they
understand that’s what you’re doing and they don’t have any wacky workflows
which may involve rebasing or eliding that commit as part of their release
process.
Step 5: Unwind Everything
Finally, you eventually want to stop carrying any patches and move back to an
official released version that integrates your fix.
You want to do this because this is what the upstream will expect when you are
reporting bugs. Part of the benefit of using open source is benefiting from
the collective work to do bug-fixes and such, so you don’t want to be stuck off
on a pinned git hash that the developers do not support for anyone else.
As I said in step 2b6, make sure to maintain a tracking task for doing this
work, because leaving this sort of relatively easy-to-clean-up technical debt
lying around is something that can potentially create a lot of aggravation for
no particular benefit. Make sure to put your internal LibBar repository into
an appropriate state at this point as well.
Up Next
This is part 1 of a 2-part series. In part 2, I will explore in depth how to
execute this workflow specifically for Python packages, using some popular
tools. I’ll discuss my own workflow, standards like PEP 517 and
pyproject.toml, and of course, by the popular demand that I just know will
come, uv.
if you already have all the tooling associated with a monorepo,
including the ability to manage divergence and reintegrate patches with
upstream, you already have the higher-overhead version of the workflow I am
going to propose, so, never mind. But chances are you don’t have that; very
few companies do. ↩
In any business where one must wrangle with Legal, 3 hours is a wildly
optimistic estimate. ↩
The most optimistic vision of generative AI1 is that it will relieve us of
the tedious, repetitive elements of knowledge work so that we can get to work
on the really interesting problems that such tedium stands in the way of.
Even if you fully believe in this vision, it’s hard to deny that today, some
tedium is associated with the process of using generative AI itself.
Generative AI also
isn’t free,
and so, as responsible consumers, we need to ask: is it worth it? What’s the
ROI
of genAI, and how can we tell? In this post, I’d like to explore a logical
framework for evaluating genAI expenditures, to determine if your organization
is getting its money’s worth.
Perpetually Proffering Permuted Prompts
I think most LLM users would agree with me that a typical workflow with an LLM
rarely involves prompting it only one time and getting a perfectly useful
answer that solves the whole problem.
Generative AI best practices, even from the most optimistic
vendors,
all suggest that you should continuously evaluate everything. ChatGPT, which
is really the
only
genAI product with significantly scaled adoption, still says at the bottom of
every interaction:
ChatGPT can make mistakes. Check important info.
If we have to “check important info” on every interaction, it stands to reason
that even if we think it’s useful, some of those checks will find an error.
Again, if we think it’s useful, presumably the next thing to do is to perturb
our prompt somehow, and issue it again, in the hopes that the next invocation
will produce a correct answer this time, by dint of either:
enhanced application of our skill to
engineer
a better prompt based on the deficiencies of the current inference, or
better performance of the model by populating additional
context in subsequent chained
prompts.
Unfortunately, given the relative lack of reliable methods to re-generate the
prompt and receive a better answer2, checking the output and re-prompting
the model can feel like just kinda futzing around with it. You try, you get a
wrong answer, you try a few more times, eventually you get the right answer
that you wanted in the first place. It’s a somewhat unsatisfying process, but
if you get the right answer eventually, it does feel like progress, and you
didn’t need to use up another human’s time.
In fact, the hottest buzzword of the last hype cycle is “agentic”. While I
have my own feelings about this particular word3, its current practical
definition is “a generative AI system which automates the process of
re-prompting itself, by having a deterministic program evaluate its outputs for
correctness”.
A better term for an “agentic” system would be a “self-futzing system”.
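To make that definition concrete, here is a minimal sketch of such a loop;
prompt_llm and check are hypothetical stand-ins for the model call and the
deterministic evaluation, not any particular vendor’s API.

from typing import Callable, Optional, Tuple

def agentic_loop(
    prompt_llm: Callable[[str], str],          # the model call, vendor-specific
    check: Callable[[str], Tuple[bool, str]],  # deterministic correctness check
    task: str,
    max_attempts: int = 5,
) -> Optional[str]:
    # Automate the futzing: keep feeding the checker's feedback back into the
    # prompt until the check passes or we run out of attempts.
    feedback = ""
    for _ in range(max_attempts):
        answer = prompt_llm(task + feedback)
        ok, feedback = check(answer)
        if ok:
            return answer
    return None  # no luck; a human gets to futz with it from here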
However, the ability to automate some level of checking and re-prompting does
not mean that you can fully delegate tasks to an agentic tool, either. It
is, plainly put, not safe. If you leave the AI on its own, you will get
terrible results that will at best make for a funny story45 and at
worst might end up causing serious damage67.
Taken together, this all means that for any consequential task that you want
to accomplish with genAI, you need an expert human in the
loop. The human must be
capable of independently doing the job that the genAI system is being asked to
accomplish.
When the genAI guesses correctly and produces usable output, some of the
human’s time will be saved. When the genAI guesses wrong and produces
hallucinatory gibberish or even “correct” output that nevertheless fails to
account for some unstated but necessary property such as security or scale,
some of the human’s time will be wasted evaluating it and re-trying it.
Income from Investment in Inference
Let’s evaluate an abstract, hypothetical genAI system that can automate some
work for our organization. To avoid implicating any specific vendor, let’s
call the system “Mallory”.
Is Mallory worth the money? How can we know?
Logically, there are only two outcomes that might result from using Mallory to
do our work.
We prompt Mallory to do some work; we check its work, it is correct, and
some time is saved.
We prompt Mallory to do some work; we check its work, it fails, and we futz
around with the result; this time is wasted.
As a logical framework, this makes sense, but ROI is an arithmetical concept,
not a logical one. So let’s translate this into some terms.
In order to evaluate Mallory, let’s define the Futzing Fraction, “FF”, in terms of the following variables:
H: the average amount of time a Human worker would take to do a task,
unaided by Mallory
I: the amount of time that Mallory takes to run one Inference8
C: the amount of time that a human has to spend Checking Mallory’s output for
each inference
P: the Probability that Mallory will produce a correct inference for each prompt
W: the average amount of time that it takes for a human to Write one prompt for
Mallory
E: since we are normalizing everything to time, rather than money, we do also have to account for the dollar cost of Mallory as a product, so we will include the Equivalent amount of human time we could purchase for the marginal cost of one9 inference.
As in last week’s example of simple ROI
arithmetic, we will put our costs in the
numerator, and our benefits in the denominator.
The idea here is that for each prompt, the minimum amount of time-equivalent cost possible is W + I + C + E. The user must, at least once, write a prompt, wait for inference to run, then check the output; and, of course, pay any costs to Mallory’s vendor.
If the probability of a correct answer is P, then they will do this entire process 1/P times10, so we put P in the denominator. Finally, we divide everything by H, because we are trying to determine if we are actually saving any time or money, versus just letting our existing human, who has to be driving this process anyway, do the whole thing. That gives us:

FF = (W + I + C + E) / (P × H)
If the Futzing Fraction evaluates to a number greater than 1, as previously discussed, you are a bozo; you’re spending more time futzing with Mallory than getting value out of it.
Figuring out the Fraction is Frustrating
In order to evaluate the Futzing Fraction, though, you have to
have a sound method for getting at least a vague sense of all of its terms.
If you are a business leader, a lot of this is relatively easy to measure. You
vaguely know what H is, because you know what your
payroll costs, and similarly, you can figure out E with
some pretty trivial arithmetic based on Mallory’s pricing table. There are endless
YouTube channels, spec sheets and benchmarks to give you I. W is probably going to be so small compared to H that it hardly merits consideration11.
But, are you measuring C? If your employees are not checking the outputs of the AI, you’re on a path to catastrophe that no ROI calculation can capture, so it had better be greater than zero.
Are you measuring P? How often does the AI get it right on the first try?
Challenges to Computing Checking Costs
In the fraction defined above, the C term is going to be
large. Larger than you think.
Measuring C and P with a high
degree of precision is probably going to be very hard; possibly unreasonably
so, or too expensive12 to bother with in practice. So you will undoubtedly need
to work with estimates and proxy metrics. But you have to be aware that this
is a problem domain where your normal method of estimating is going to be
extremely vulnerable to inherent cognitive bias, and find ways to measure.
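One minimal sketch of what “finding ways to measure” could look like, with
entirely hypothetical field names: record each Mallory-assisted task as it
happens, and compute the fraction only from what was written down, never from
memory.

from dataclasses import dataclass

@dataclass
class TaskRecord:
    prompts_issued: int       # total prompts before the output was good enough
    minutes_writing: float    # W, summed over all of those prompts
    minutes_inference: float  # I, summed over all of those prompts
    minutes_checking: float   # C, summed over all of those prompts
    minutes_unaided: float    # H: estimate of doing the task without Mallory
    cost_as_minutes: float    # E: dollar cost converted into human-minutes

def observed_futzing_fraction(records: list[TaskRecord]) -> float:
    # Each record already includes every retry, so dividing total futzing time
    # by total unaided time folds P into the measurement automatically.
    spent = sum(r.minutes_writing + r.minutes_inference
                + r.minutes_checking + r.cost_as_minutes for r in records)
    saved = sum(r.minutes_unaided for r in records)
    return spent / saved

def observed_p(records: list[TaskRecord]) -> float:
    # The empirical probability that a single prompt produces a good-enough
    # answer: tasks completed divided by prompts issued.
    return len(records) / sum(r.prompts_issued for r in records)

The point is not this exact schema, but that the numbers get written down at
the moment they are spent, before the biases described below have a chance to
edit them.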
Margins, Money, and Metacognition
First let’s discuss cognitive and metacognitive bias.
My favorite cognitive bias is the availability
heuristic and a close
second is its cousin salience
bias.
Humans are empirically predisposed towards noticing and remembering things that
are more striking, and to overestimate their frequency.
If you are estimating the variables above based on the vibe that you’re
getting from the experience of using an LLM, you may be overestimating its
utility.
Consider a slot machine.
If you put a dollar in to a slot machine, and you lose that dollar, this is an
unremarkable event. Expected, even. It doesn’t seem interesting. You can
repeat this over and over again, a thousand times, and each time it will seem
equally unremarkable. If you do it a thousand times, you will probably get
gradually more anxious as your sense of your dwindling bank account becomes
slowly more salient, but losing one more dollar still seems unremarkable.
If you put a dollar in a slot machine and it gives you a thousand dollars,
that will probably seem pretty cool. Interesting. Memorable. You might tell
a story about this happening, but you definitely wouldn’t really remember any
particular time you lost one dollar.
Luckily, when you arrive at a casino with slot machines, you probably know well
enough to set a hard budget in the form of some amount of physical currency you
will have available to you. The odds are against you, you’ll probably lose it
all, but any responsible gambler will have an immediate, physical
representation of their balance in front of them, so when they have lost it
all, they can see that their hands are empty, and can try to resist the “just
one more pull” temptation, after hitting that limit.
Now, consider Mallory.
If you put ten minutes into writing a prompt, and Mallory gives a completely
off-the-rails, useless answer, and you lose ten minutes, well, that’s just what
using a computer is like sometimes. Mallory malfunctioned, or hallucinated,
but it does that sometimes, everybody knows that. You only wasted ten minutes.
It’s fine. Not a big deal. Let’s try it a few more times. Just ten more
minutes. It’ll probably work this time.
If you put ten minutes into writing a prompt, and it completes a task that
would have otherwise taken you 4 hours, that feels amazing. Like the computer
is magic! An absolute endorphin rush.
Very memorable. When it happens, it feels like hitting the jackpot.
But... did you have a time budget before you started? Did you have a specified
N such that “I will give up on Mallory as soon as I have spent N minutes
attempting to solve this problem with it”? When the jackpot finally pays out
that 4 hours, did you notice that you put 6 hours’ worth of 10-minute prompt
coins into it?
If you are attempting to use the same sort of heuristic intuition that probably
works pretty well for other business leadership decisions, Mallory’s
slot-machine chat-prompt user interface is practically designed to subvert
those sensibilities. Most business activities do not have nearly such an
emotionally variable, intermittent reward schedule. They’re not going to trick
you with this sort of cognitive illusion.
Thus far we have been talking about cognitive bias, but there is a
metacognitive bias at play too: while
Dunning-Kruger,
everybody’s favorite metacognitive bias, does have some
problems
with it, the main underlying metacognitive bias is that we tend to believe our
own thoughts and perceptions, and it requires active effort to distance
ourselves from them, even if we know they might be wrong.
This means you must assume any intuitive estimate of C is going to be biased
low; similarly, any intuitive estimate of P is going to be
biased high. You will forget the time you spent checking, and you will
underestimate the number of times you had to re-check.
To avoid this, you will need to decide on a Ulysses
pact to provide some inputs to a
calculation for these factors that you will not be able to fudge if
they seem wrong to you.
Problematically Plausible Presentation
Another nasty little cognitive-bias landmine for you to watch out for is the
authority bias, for two
reasons:
People will tend to see Mallory as an unbiased, external authority, and
thereby see it as more of an authority than a similarly-situated human13.
Being an LLM, Mallory will be overconfident in its answers14.
The nature of LLM training is also such that commonly co-occurring tokens in
the training corpus produce higher likelihood of co-occurring in the output;
they’re just going to be closer together in the vector-space of the weights;
that’s, like, what training a model is, establishing those relationships.
If you’ve ever used an heuristic to informally evaluate someone’s credibility
by listening for industry-specific shibboleths or ways of describing a
particular issue, that skill is now useless. Having ingested every industry’s
expert literature, commonly-occurring phrases will always be present in
Mallory’s output. Mallory will usually sound like an expert, but then make
mistakes at random15.
While you might intuitively estimate C by thinking “well,
if I asked a person, how could I check that they were correct, and how long
would that take?” that estimate will be extremely optimistic, because the
heuristic techniques you would use to quickly evaluate incorrect information
from other humans will fail with Mallory. You need to go all the way back to
primary sources and actually fully verify the output every time, or you will
likely fall into one of these traps.
Mallory Mangling Mentorship
So far, I’ve been describing the effect Mallory will have in the context of an
individual attempting to get some work done. If we are considering
organization-wide adoption of Mallory, however, we must also consider the
impact on team dynamics. There are a number of potential side effects
that one might consider, but here I will focus on just one that
I have observed.
I have a cohort of friends in the software industry, most of whom are
individual contributors. I’m a programmer who likes programming, so are most
of my friends, and we are also (sigh), charitably, pretty solidly
middle-aged at this point, so we tend to have a lot of experience.
As such, we are often the folks that the team — or, in my case, the community —
goes to when less-experienced folks need answers.
On its own, this is actually pretty great. Answering questions from more
junior folks is one of the best parts of a software development job. It’s an
opportunity to be helpful, mostly just by knowing a thing we already knew. And
it’s an opportunity to help someone else improve their own agency by giving
them knowledge that they can use in the future.
However, generative AI throws a bit of a wrench into the mix.
Let’s imagine a scenario where we have 2 developers: Alice, a staff engineer
who has a good understanding of the system being built, and Bob, a relatively
junior engineer who is still onboarding.
The traditional interaction between Alice and Bob, when Bob has a question,
goes like this:
Bob gets confused about something in the system being developed, because
Bob’s understanding of the system is incorrect.
Bob formulates a question based on this confusion.
Bob asks Alice that question.
Alice knows the system, so she gives an answer which
accurately reflects the state of the system to Bob.
Bob’s understanding of the system improves, and thus he will have fewer and
better-informed questions going forward.
You can imagine how repeating this simple 5-step process will eventually
transform Bob into a senior developer, and then he can start answering
questions on his own. Making sufficient time for regularly iterating this loop
is the heart of any good mentorship process.
Now, though, with Mallory in the mix, the process has a new decision point,
changing it from a linear sequence to a flow chart.
We begin the same way, with steps 1 and 2. Bob’s confused, Bob formulates a
question, but then:
Bob asks Mallory that question.
Here, our path then diverges into a “happy” path, a “meh” path, and a “sad”
path.
The “happy” path proceeds like so:
Mallory happens to formulate a correct answer.
Bob’s understanding of the system improves, and thus he will have fewer and
better-informed questions going forward.
Great. Problem solved. We just saved some of Alice’s time. But as we learned earlier,
Mallory can make mistakes. When that happens, we will need to check
important info. So let’s get checking:
Mallory happens to formulate an incorrect answer.
Bob investigates this answer.
Bob realizes that this answer is incorrect because it is inconsistent with
some of his prior, correct knowledge of the system, or his investigation.
Bob asks Alice the same question; GOTO traditional interaction step 4.
On this path, Bob spent a while futzing around with Mallory, to no particular
benefit. This wastes some of Bob’s time, but then again, Bob could have
ended up on the happy path, so perhaps it was worth the risk; at least Bob
wasn’t wasting any of Alice’s much more valuable time in the process.16
Notice that beginning at the start of step 4, we must begin allocating all of
Bob’s time to C, so C already
starts getting a bit bigger than if it were just Bob checking Mallory’s output
specifically on tasks that Bob is doing.
That brings us to the “sad” path.
Mallory happens to formulate an incorrect answer.
Bob investigates this answer.
Bob does not realize that this answer is incorrect because he is unable to
recognize any inconsistencies with his existing, incomplete knowledge of the
system.
Bob integrates Mallory’s incorrect information of the system into his mental
model.
Bob proceeds to make a larger and larger mess of his work, based on an
incorrect mental model.
Eventually, Bob asks Alice a new, worse question, based on this incorrect
understanding.
Sadly we cannot return to the happy path at this point, because now Alice
must unravel the complex series of confusing misunderstandings that Mallory
has unfortunately conveyed to Bob. In the really sad
case, Bob actually doesn’t believe Alice for a while, because Mallory
seems unbiased17, and Alice has to waste even more time convincing Bob
before she can simply explain to him.
Now, we have wasted some of Bob’s time, and some of Alice’s time. Everything
from steps 5-10 is C, and as soon as Alice gets involved,
we are now adding to C at double real-time. If more
team members are pulled into the investigation, you are now multiplying C by the number of investigators, potentially running at triple
or quadruple real time.
But That’s Not All
Here I’ve presented a brief selection of reasons why C
will be both large, and larger than you expect. To review:
Gambling-style mechanics of the user interface will interfere with your own
self-monitoring and developing a good estimate.
You can’t use human heuristics for quickly spotting bad answers.
Wrong answers given to junior people who can’t evaluate them will waste more
time from your more senior employees.
But this is a small selection of ways that Mallory’s output can cost you
money and time. It’s harder to simplistically model second-order effects like
this, but there’s also a broad range of possibilities for ways that, rather
than simply checking and catching errors, an error slips through and starts
doing damage. Or ways in which the output isn’t exactly wrong, but still
sub-optimal in ways which can be difficult to notice in the short term.
For example, you might successfully vibe-code your way to launch a series of
applications, successfully “checking” the output along the way, but then
discover that the resulting code is unmaintainable garbage that prevents future
feature delivery, and needs to be re-written18. But this kind of
intellectual debt isn’t specific to writing code; it can
even affect such apparently genAI-amenable fields as LinkedIn content
marketing19.
Problems with the Prediction of P
C isn’t the only challenging term,
though. P is just as important, if not more so, and just as
hard to measure.
LLM marketing materials love to phrase their accuracy in terms of a
percentage. Accuracy claims for LLMs in general tend to hover around
70%20. But these scores vary per field, and when you aggregate them across
multiple topic areas, they start to trend down. This is exactly why “agentic”
approaches for more immediately-verifiable LLM outputs (with checks like “did
the code work”) got popular in the first place: you need to try more than once.
Independently measured claims about accuracy tend to be quite a bit lower21.
The field of AI benchmarks is exploding, but it probably goes without saying
that LLM vendors game those benchmarks22, because of course every incentive
would encourage them to do that. Regardless of what their arbitrary scoring on
some benchmark might say, all that matters to your business is whether it is
accurate for the problems you are solving, for the way that you use it.
Which is not necessarily going to correspond to any benchmark. You will need to
measure it for yourself.
With that goal in mind, our formulation of P must be a
somewhat harsher standard than “accuracy”. It’s not merely “was the factual
information contained in any generated output accurate”, but, “is the output
good enough that some given real knowledge-work task is done and the human
does not need to issue another prompt”?
Surprisingly Small Space for Slip-Ups
The problem with reporting these things as percentages at all, however, is that
our actual definition of P is 1/N, where N, the number of prompts needed for any
given attempt, must be an integer greater than or equal to 1.
Taken in aggregate, if we succeed on the first prompt more often than not, we
could end up with a P above 50%, but combined with the previous observation that
you almost always have to prompt it more than once, the practical reality is
that P will start at 50% and go down from there.
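Concretely, if you log how many prompts each completed task actually took, you can back an estimate of P out of those logs. Here is a minimal sketch of that arithmetic; the counts are invented for illustration, not taken from anywhere:

# Invented example: number of prompts each completed task needed.
prompts_per_task = [1, 2, 2, 3, 1, 4, 2, 5, 2, 3]

# The simple closed-form model treats a task that needed N prompts as P = 1/N,
# so one crude aggregate is the average of those per-task values...
mean_of_inverses = sum(1 / n for n in prompts_per_task) / len(prompts_per_task)

# ...and another is "completed tasks per prompt issued".
tasks_per_prompt = len(prompts_per_task) / sum(prompts_per_task)

print(f"P as mean of 1/N:      {mean_of_inverses:.2f}")  # 0.51 for this sample
print(f"P as tasks per prompt: {tasks_per_prompt:.2f}")  # 0.40 for this sample

Either way you slice it, the estimate only gets worse as tasks routinely need more prompts.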
If we plug in some numbers, trying to be as extremely optimistic as we can,
and say that we have a uniform stream of tasks, every one of which can be
addressed by Mallory, every one of which:
we can measure perfectly, with no overhead
would take a human 45 minutes (H = 45)
takes Mallory only a single minute to generate a response (I = 1)
Mallory will require only 1 re-prompt, so the output is “good enough” half the time (P = 1/2)
takes a human only 5 minutes to write a prompt for (W = 5)
takes a human only 5 minutes to check the result of (C = 5)
has a per-prompt cost of the equivalent of a single second of a human’s time (E)
Thought experiments are a dicey basis for reasoning in the face of
disagreements, so I have tried to formulate something that is absolutely,
comically, over-the-top stacked in favor of the AI optimist here.
Would that be profitable? It sure seems like it, given that we are trading
off 45 minutes of human time for 1 minute of Mallory-time and 10 minutes of
human time. If we ask Python:
>>> def FF(H, I, C, P, W, E):
...     return (W + I + C + E) / (P * H)
...
>>> FF(H=45.0, I=1.0, C=5.0, P=1/2, W=5.0, E=0.01)
0.48933333333333334
We get a futzing fraction of about 0.49. Not bad! Sounds like, at least
under these conditions, it would indeed be cost-effective to deploy Mallory.
But… realistically, do you reliably get useful, done-with-the-task quality
output on the second prompt? Let’s bump up N, the number of prompts in the
denominator of P, just a little bit there, and see how we fare:
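The code for that little test isn’t reproduced here, so here is a sketch of my own that lines up with the figures quoted below. It reuses the FF function defined above, and takes E as one second of human time expressed in minutes (1/60), per the assumption in the list above (the one-off call earlier plugged in a flat 0.01):

# Sketch (mine, not the original computation): let N, the number of prompts
# each task needs, grow, and watch the futzing fraction climb.
for n in (2, 3, 4, 5):
    print(f"{n} prompts per task: {FF(H=45.0, I=1.0, C=5.0, P=1 / n, W=5.0, E=1 / 60):.5f}")
# 2 prompts per task: 0.48963
# 3 prompts per task: 0.73444
# 4 prompts per task: 0.97926
# 5 prompts per task: 1.22407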
With this little test, we can see that by the time Mallory needs 4 prompts per
task we are already at 0.9792, and by 5 tries per prompt, even in this absolute
fever-dream of an over-optimistic scenario, with a futzing fraction of 1.2240,
Mallory is now a net detriment to our bottom line.
Harm to the Humans
We are treating H as functionally constant so far, an
average around some hypothetical Gaussian distribution, but the distribution
itself can also change over time.
Formally speaking, an increase to H would be good for
our fraction. Maybe it would even be a good thing; it could mean we’re taking
on harder and harder tasks due to the superpowers that Mallory has given us.
But an observed increase to H would probably not be
good. An increase could also mean your humans are getting worse at solving
problems, because using Mallory has atrophied their skills23 and sabotaged
learning opportunities2425. It could also go up because your senior,
experienced people now hate their jobs26.
For some more vulnerable folks, Mallory might just take a shortcut to all these
complex interactions and drive them completely insane27 directly. Employees
experiencing an intense psychotic episode are famously less productive than
those who are not.
This could all be very bad if our futzing fraction eventually does head north
of 1 and you need to consider re-introducing human-only workflows, without
Mallory.
Abridging the Artificial Arithmetic (Alliteratively)
To reiterate, I have proposed this fraction:

FF = (W + I + C + E) / (P × H)

which shows us positive ROI when FF is less than 1, and negative ROI when it is
more than 1. (Here W is the time to write a prompt, I is the time spent waiting
on inference, C is the time to check the result, E is the per-prompt expense
expressed in human-time, P is the probability that a single prompt yields a
good-enough result, and H is the time it would take a human to simply do the
task.)
This model is heavily simplified. A comprehensive measurement program that
tests the efficacy of any technology, let alone one as complex and rapidly
changing as LLMs, is more complex than could be captured in a single blog post.
Real-world work might be insufficiently uniform to fit into a closed-form
solution like this. Perhaps an iterated simulation with variables based on the
range of values seen in your team’s metrics would give better results.
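For a flavor of what such an iterated simulation could look like, here is a minimal Monte Carlo sketch; every distribution and parameter in it is an invented placeholder standing in for your own measurements, not anything from this post:

import random

def simulated_futzing_fraction(trials=100_000, seed=0):
    # Monte Carlo sketch: draw per-task values for the terms of FF and compare
    # the total cost of futzing with Mallory against total human-only effort.
    rng = random.Random(seed)
    total_futzing = 0.0  # minutes spent prompting, waiting, checking, and paying
    total_human = 0.0    # minutes the same tasks would have cost a human directly
    for _ in range(trials):
        H = max(5.0, rng.gauss(45.0, 15.0))  # human time per task (minutes)
        W = max(1.0, rng.gauss(5.0, 2.0))    # time to write each prompt
        I = max(0.2, rng.gauss(1.0, 0.5))    # inference time per prompt
        C = max(1.0, rng.gauss(5.0, 3.0))    # time to check each response
        E = 1 / 60                           # per-prompt expense, in human-minutes
        N = rng.randint(1, 6)                # prompts needed until "good enough"
        # Per task, N * (W + I + C + E) is FF's numerator with P = 1/N folded in.
        total_futzing += N * (W + I + C + E)
        total_human += H
    return total_futzing / total_human

print(f"simulated futzing fraction: {simulated_futzing_fraction():.2f}")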
However, in this post, I want to illustrate that if you are going to try to
evaluate an LLM-based tool, you need to at least include some representation
of each of these terms somewhere. They are all fundamental to the way the
technology works, and if you’re not measuring them somehow, then you are flying
blind into the genAI storm.
I also hope to show that a lot of existing assumptions about how benefits
might be demonstrated, for example with user surveys about general impressions,
or by evaluating artificial benchmark scores, are deeply flawed.
Even making what I consider to be wildly, unrealistically optimistic
assumptions about these measurements, I hope I’ve shown:
in the numerator, C might be a lot higher than you
expect,
in the denominator, P might be a lot lower than you
expect,
repeated use of an LLM might make H go up, but despite
the fact that it’s in the denominator, that will ultimately be quite bad for
your business.
Personally, I don’t have all that many concerns about I (inference time) and E (expense). E is still seeing significant loss-leader pricing, and inference time might not be coming down as fast as vendors would like us to believe, but if the other numbers work out, I don’t think these two terms make a huge difference. However, there might still be surprises lurking in there, and if you want to rationally evaluate the effectiveness of a model, you need to be able to measure them and incorporate them as well.
In particular, I really want to stress the importance of the influence of LLMs on your team dynamic, as that can cause massive, hidden increases to C. LLMs present opportunities for junior employees to generate an endless stream of chaff that will simultaneously:
wreck your performance review process by making them look much more
productive than they are,
increase stress and load on senior employees who need to clean up unforeseen
messes created by their LLM output,
and ruin their own opportunities for career development by skipping over
learning opportunities.
If you’ve already deployed LLM tooling without measuring these things and
without updating your performance management processes to account for the
strange distortions that these tools make possible, your Futzing Fraction may
be much, much greater than 1, creating hidden costs and technical debt that
your organization will not notice until a lot of damage has already been done.
If you got all the way here, particularly if you’re someone who is
enthusiastic about these technologies, thank you for reading. I appreciate
your attention and I am hopeful that if we can start paying attention to these
details, perhaps we can all stop futzing around so much with this stuff and
get back to doing real work.
I do not share this optimism, but I want to try very hard in this
particular piece to take it as a given that genAI is in fact helpful. ↩
If we could have a better prompt on demand via some repeatable and
automatable process, surely we would have used a prompt that got the answer
we wanted in the first place. ↩
The software idea of a “user
agent”
straightforwardly comes from the legal principle of an
agent, which has deep roots
in common law, jurisprudence, philosophy, and
math. When we
think of an agent (some software) acting on behalf of a principal (a human
user), this historical baggage imputes some important ethical
obligations
to the developer of the agent software. genAI vendors have been as eager
as any software vendor to dodge responsibility for faithfully representing
the user’s
interests
even as there are some indications that at least some courts are not
persuaded
by this dodge, at least by the consumers of genAI attempting to pass on the
responsibility all the way to end users. Perhaps it goes without saying,
but I’ll say it anyway: I don’t like this newer interpretation of “agent”. ↩
During which a human will be busy-waiting on an answer. ↩
Given the fluctuating pricing of these products, and fixed subscription overhead, this will obviously need to be amortized; including all the additional terms to actually convert this from your inputs is left as an exercise for the reader. ↩
I feel like I should emphasize explicitly here that everything is an
average over repeated interactions. For example, you might observe that a
particular LLM has a low probability of outputting acceptable work on the
first prompt, but higher probability on subsequent prompts in the same
context, such that it usually takes 4 prompts. For the purposes of this
extremely simple closed-form model, we’d still consider that a P of 25%,
even though a more sophisticated model, or a Monte Carlo simulation that sets
progressive bounds on the probability,
might produce more accurate values. ↩
It’s worth noting that all this expensive measuring itself must be
included in the numerator until you have a solid grounding for
all your metrics, but let’s optimistically leave all of that out for the
sake of simplicity. ↩
My father, also known as
“R0ML”, once described a methodology for evaluating
volume purchases that I think needs to be more popular.
If you are a hardcore fan, you might know that he has already described this
concept publicly in a talk at OSCON in 2005, among other places, but it has
never found its way to the public Internet, so I’m giving it a home here, and
in the process, appropriating some of his words.1
Let’s say you’re running a circus. The circus has many clowns. Ten thousand
clowns, to be precise. They require bright red clown noses. Therefore, you
must acquire a significant volume of clown noses. An enterprise licensing
agreement for clown noses, if you will.
If the nose
plays,
it can really make the act. In order to make sure you’re getting quality
noses, you go with a quality vendor. You select a vendor who can supply noses
for $100 each, at retail.
Do you want to buy retail? Ten thousand clowns, ten thousand noses, one
hundred dollars: that’s a million bucks worth of noses, so it’s worth your
while to get a good deal.
As a conscientious executive, you go to the golf course with your favorite
clown accessories vendor and negotiate yourself a 50% discount, with a
commitment to buy all ten thousand noses.
Is this a good deal? Should you take it?
To determine this, we will use an analytical tool called R0ML’s Ratio (RR).
The ratio has 2 terms:
the Full Undiscounted Retail List Price of Units Used (FURLPoUU), which can
of course be computed by the individual retail list price of a single unit
(in our case, $100) multiplied by the number of units used
the Total Price of the Entire Enterprise Volume Licensing Agreement
(TPotEEVLA), which in our case is $500,000.
It is expressed as:

RR = TPotEEVLA / FURLPoUU
Crucially, you must be able to compute the number of units used in order to
complete this ratio. If, as expected, every single clown wears their nose at
least once during the period of the license agreement, then our Units Used is
10,000, our FURLPoUU is $1,000,000 and our TPotEEVLA is $500,000, which makes
our RR 0.5.
Congratulations. If R0ML’s Ratio is less than 1, it’s a good deal. Proceed.
But… maybe the nose doesn’t play. Not every clown’s costume is an exact
clone of the traditional, stereotypical image of a clown. Many are
avant-garde. Perhaps this plentiful proboscis pledge was premature. Here, I
must quote the originator of this theoretical framework directly:
What if the wheeze doesn’t please?
What if the schnozz gives some pause?
In other words: what if some clowns don’t wear their noses?
If we were to do this deal, and then ask around afterwards to find out that
only 200 of our 10,000 clowns ever used their noses, then FURLPoUU comes
out to 200 * $100, for a total of $20,000. In that scenario, RR is 25,
which you may observe is substantially greater than 1.
If you do a deal where R0ML’s ratio is greater than 1, then you are the bozo.
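The arithmetic is easy enough to sanity-check in a few lines of Python; here is a quick sketch (the function name is mine) running both scenarios from above:

def r0mls_ratio(total_deal_price, retail_unit_price, units_used):
    # TPotEEVLA divided by FURLPoUU.
    return total_deal_price / (retail_unit_price * units_used)

print(r0mls_ratio(500_000, 100, 10_000))  # 0.5: every clown wears the nose, a good deal
print(r0mls_ratio(500_000, 100, 200))     # 25.0: only 200 noses ever worn; you are the bozo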
I apologize if I have belabored this point. As R0ML expressed in the email we
exchanged about this many years ago,
I do not mind if you blog about it — and I don't mind getting the credit —
although one would think it would be obvious.
And yeah, one would think this would be obvious? But I have belabored it
because many discounted enterprise volume purchasing agreements still fail the
R0ML’s Ratio Bozo Test.2
In the case of clown noses, if you pay the discounted price, at least you get
to keep the nose; maybe lightly-used clown noses have some resale value. But
in software licensing or SaaS deals, once you’ve purchased the “discounted”
software or service, once you have provisioned the “seats”, the money is gone,
and if your employees don’t use it, then no value for your organization will
ever result.
Measuring the number of units used is very important. Without this
number, you have no idea if you are a bozo or not.
It is often better to give your individual employees a corporate card and
allow them to make arbitrary individual purchases of software licenses and SaaS
tools, with minimal expense-reporting overhead; this will always keep R0ML’s
Ratio at 1.0, and thus, you will never be a bozo.
It is always better to do that the first time you are purchasing a new
software tool, because the first time making such a purchase you (almost by
definition) have no information about “units used” yet. You have no idea — you
cannot have any idea — if you are a bozo or not.
If you don’t know who the bozo is, it’s probably you.
Acknowledgments
Thank you for reading, and especially thank you to my
patrons who are supporting my writing on this blog. Of
course, extra thanks to dad for, like, having this idea and doing most of the
work here beyond my transcription. If you like my dad’s ideas and you’d like
me to post more of them, or you’d like to support my various open-source
endeavors, you can support my work as a
sponsor!
One of my other favorite posts on this blog was just
stealing another one of his ideas, so
hopefully this one will be good too. ↩
This concept was first developed in 2001, but it has some implications
for extremely recent developments in the software industry; that’s a
post for another day. ↩