Stories by Mark Williams on Medium

Context is Infrastructure, Not Instructions

Mark Williams — Fri, 15 May 2026 15:31:00 GMT

What teams gain when they govern AI context like a software dependency

A team replaces task-specific prompts with a generic “improved” template. Extraction accuracy drops from 100% to 90%. RAG compliance (the degree to which a model’s answers stay grounded in retrieved documents rather than generating from its own training data) falls from 93.3% to 80%. The model is the same. The new instructions look better on paper. What changed was the context, and nobody tested whether the change was safe before deploying it.

This is context regression, a term borrowed from software engineering where “regression” means a change that was supposed to improve something but degraded existing behavior instead. It behaves like any other dependency compatibility problem in a software supply chain, and the governance response (production contracts, risk-based test suites, compatibility gates) is the same one software teams already use for their other dependencies.

“Context is the New Code” established context engineering as a formal discipline with its own taxonomy, maturity levels, and practitioner artifacts, and “The Turn as the Unit of Quality” explored how structured iteration with checklists and selective memory improves turn-level quality. This article picks up a different thread. What happens when context moves from a single team’s configuration file to an organizational dependency serving dozens of agents across thousands of daily interactions? Recent research suggests that the teams making the fastest progress are the ones applying familiar software supply chain governance to their context, and the returns are measurable.

What Structured Context Unlocks

A study of 200 documented interactions across four AI tools found that incomplete context was associated with 72% of iteration cycles. That number is worth sitting with. Nearly three-quarters of the rework, the back-and-forth where a human corrects, clarifies, and re-prompts, traced not to a bad model or a poorly worded instruction but to missing information that should have been available from the start.

When the same study introduced structured context assembly, a methodology that organizes context into five roles (Authority, Exemplar, Constraint, Rubric, and Metadata), iteration cycles dropped from an average of 3.8 to 2.0 per task, and first-pass acceptance rose from 32% to 55%. Authority context establishes what standards govern the task. Exemplar context provides reference outputs that demonstrate the expected quality. Constraint context defines boundaries the output must respect. Rubric context specifies how the output will be evaluated. Metadata context supplies facts, dates, names, and domain-specific details. Having names for these roles is not a minor convenience. It is what makes the difference between ad hoc tuning and repeatable engineering, because a team that cannot describe what is missing from its context cannot systematically fix it.
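The five roles can be made concrete with a small data structure. The following is a minimal sketch, not the study's implementation: the `Role` enum, the `ContextBundle` class, and the `missing_roles` helper are hypothetical names introduced here for illustration. It shows how naming the roles lets a team ask which ones a given context bundle has left empty.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # The five roles named in the study; the enum itself is an illustrative assumption.
    AUTHORITY = "authority"
    EXEMPLAR = "exemplar"
    CONSTRAINT = "constraint"
    RUBRIC = "rubric"
    METADATA = "metadata"


@dataclass
class ContextBundle:
    # Maps each role to the context snippets supplied for it.
    items: dict[Role, list[str]] = field(default_factory=dict)

    def add(self, role: Role, text: str) -> None:
        self.items.setdefault(role, []).append(text)

    def missing_roles(self) -> list[Role]:
        # Roles with no snippets, in declaration order.
        return [role for role in Role if not self.items.get(role)]


bundle = ContextBundle()
bundle.add(Role.AUTHORITY, "Follow the internal API style guide.")
bundle.add(Role.CONSTRAINT, "Output must be valid JSON.")

# Prints ['exemplar', 'rubric', 'metadata']
print([role.value for role in bundle.missing_roles()])
```

A check like this could run before a task is dispatched, so that a missing rubric or exemplar is flagged up front rather than discovered through extra iteration cycles.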

Like a well-organized server room where every cable run is labeled and every rack follows a standard layout, structured context gives a team the ability to reason about what the AI is actually working with. The evaluation-driven iteration research reinforces this by showing that context quality is not one-dimensional. A change that improves instruction-following can simultaneously degrade extraction accuracy. A prompt that scores better on helpfulness can score worse on format compliance. The minimum viable evaluation suite (MVES) framework proposes tiered evaluation requirements, one set for general applications, another for retrieval-augmented generation systems, and a third for agentic workflows, precisely because quality along one dimension does not guarantee quality along others. The practical implication is that quality has multiple dimensions that can trade against each other, and navigating those trade-offs requires measurement infrastructure, not intuition.

Governing Context as a Dependency

The clearest articulation of this shift comes from research that frames LLM update management as a software supply chain governance problem. Hosted language model services evolve through provider-side updates without explicit version changes, so the API endpoint stays the same while the behavior underneath shifts. Empirical work cited within that framework documents cases where code execution accuracy dropped from 52% to 10% within three months with no version change on the consumer side. This is behavioral drift (a gradual, unannounced change in how a model responds to the same inputs), and it affects every piece of context that was tuned against the previous behavior.

The proposed governance framework has three components that map directly to established software engineering practice. Production contracts define explicit behavioral rules with measurable thresholds, things like “authentication code must pass security tests” or “JSON outputs must be valid.” Risk-category-based testing organizes evaluation around deployment risk areas rather than relying on a single aggregate score, preventing critical regressions in formatting or safety from being masked by overall performance improvements. Compatibility gates block updates that fail defined thresholds, requiring review before a model update is adopted into production. None of these ideas are new to software engineering. What is new is recognizing that context, the system prompts, retrieved documents, and configuration files that shape AI behavior, is a dependency that deserves the same governance.
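A compatibility gate of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's actual code: the `Contract` dataclass, the `compatibility_gate` function, the contract names, and the threshold values are all hypothetical. The point it shows is that each contract is checked against its own threshold, so a regression in one risk category cannot hide behind a healthy average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    name: str             # hypothetical contract identifier
    risk_category: str    # e.g. "formatting", "safety", "quality"
    min_pass_rate: float  # threshold the candidate must meet


def compatibility_gate(contracts, pass_rates):
    """Return (approved, failures). Every contract must meet its own threshold."""
    failures = [
        c.name
        for c in contracts
        # A contract with no measurement counts as failing.
        if pass_rates.get(c.name, 0.0) < c.min_pass_rate
    ]
    return (not failures, failures)


contracts = [
    Contract("json_validity", "formatting", 0.99),
    Contract("auth_security_tests", "safety", 1.00),
    Contract("extraction_accuracy", "quality", 0.95),
]

# Measured pass rates for a candidate context or model update (made-up numbers).
candidate = {
    "json_validity": 0.995,
    "auth_security_tests": 1.0,
    "extraction_accuracy": 0.90,
}

approved, failures = compatibility_gate(contracts, candidate)
print(approved, failures)  # False ['extraction_accuracy']
```

The candidate's mean pass rate here is above 0.96, yet the gate blocks it because one contract falls below its threshold, which is the behavior risk-category testing is meant to guarantee.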

A readiness harness for LLM and RAG applications demonstrates what this looks like in practice. The system combines automated benchmarks, OpenTelemetry observability (a standardized way to collect and export telemetry data like traces, metrics, and logs), and CI quality gates (automated checkpoints in the deployment pipeline that block releases if quality checks fail) under a minimal API contract. Rather than reducing readiness to a single metric, it aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and latency into scenario-weighted readiness scores. In ticket-routing experiments, the regression gates consistently rejected unsafe prompt variants before deployment. This is a concrete example of the shift from “the model was tested” to “the deployment pipeline tested every context change before it reached production.”
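One plausible reading of a scenario-weighted readiness score is a weighted average over normalized metrics. The sketch below assumes exactly that and nothing more: the `readiness_score` function, the weights, and the metric values are hypothetical, and every metric is assumed to be pre-normalized to the range 0 to 1 with higher meaning better (so cost and latency would need to be inverted upstream). The harness described above may aggregate differently.

```python
def readiness_score(metrics, weights):
    """Weighted average of normalized metrics (all in [0, 1], higher is better)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


# Made-up measurements for one candidate deployment.
metrics = {
    "workflow_success": 0.9,
    "policy_compliance": 1.0,
    "groundedness": 0.8,
    "retrieval_hit_rate": 0.7,
    "cost": 0.5,
    "latency": 0.6,
}

# Hypothetical weights for a ticket-routing scenario, where correctness
# and compliance matter more than cost or latency.
ticket_routing_weights = {
    "workflow_success": 3.0,
    "policy_compliance": 3.0,
    "groundedness": 2.0,
    "retrieval_hit_rate": 1.0,
    "cost": 0.5,
    "latency": 0.5,
}

score = readiness_score(metrics, ticket_routing_weights)
print(round(score, 3))
```

A different scenario would reuse the same measurements with a different weight table, which is what makes the score scenario-specific rather than a single global number.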

One challenge specific to AI systems is that the same configuration can produce different outputs across runs. Traditional binary pass/fail testing struggles with this fundamental non-determinism. A regression testing framework designed for this problem replaces binary verdicts with three-valued probabilistic outcomes (Pass, Fail, Inconclusive) backed by confidence intervals and sequential analysis. The framework achieves 78 to 100% cost reduction compared to naive repeated testing while maintaining statistical guarantees, and its behavioral fingerprinting approach achieves 86% detection power on regressions where binary pass/fail testing has 0%. The cost reduction matters as much as the accuracy. Testing that is too expensive to run routinely is testing that does not get run, and context changes that do not get tested are the ones that cause production surprises.
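The idea of a three-valued verdict can be illustrated with a confidence interval on the observed pass rate. The sketch below uses a Wilson score interval at a fixed sample size, which is a simplification chosen for this example. It is not the cited framework's method, which relies on sequential analysis and behavioral fingerprinting. The `verdict` function and the threshold of 0.8 are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def verdict(successes, trials, threshold):
    """Pass if the whole interval clears the threshold, Fail if it sits below, else Inconclusive."""
    low, high = wilson_interval(successes, trials)
    if low >= threshold:
        return "Pass"
    if high < threshold:
        return "Fail"
    return "Inconclusive"


print(verdict(48, 50, 0.8))  # Pass
print(verdict(20, 50, 0.8))  # Fail
print(verdict(8, 10, 0.8))   # Inconclusive
```

An Inconclusive result is a signal to collect more runs rather than to guess. Note that re-checking the interval after every batch without a correction inflates the error rate, which is one reason a production framework would use a proper sequential test instead.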

From Files to Living Systems

The governance patterns above treat context as a versioned artifact, something written, tested, and deployed. But a growing body of work suggests that this framing, while useful, captures only part of the picture. In production multi-agent systems, context is not a file. It is a runtime-constructed “View” projected into an agent’s context window (the maximum amount of text a model can consider at once) from a pool of global artifacts, and that View changes dynamically based on the task, the step, and the state of the system.

Research on what the authors call “Loosely-Structured Software” characterizes this as a class of system whose defining property is runtime generation and evolution under uncertainty. Classic software architecture assumes build-time decomposition and slow-changing boundaries. Multi-agent AI systems violate those assumptions in three ways. First, an agent’s effective program is determined not by compiled code but by a View assembled at runtime from system prompts, skills, plans, tools, and memories. Second, the connections between components form dynamically through semantic understanding rather than fixed function signatures. Third, the system’s own executable substrate, the artifacts that mediate its behavior, can be rewritten by the system itself.

To make this governable, the research proposes a three-layer engineering framework. View/Context Engineering manages the execution environment and maintains task-relevant Views. This is the layer where the static context files that teams already write (the CLAUDE.md and AGENTS.md files examined in “Context is the New Code”) get assembled, filtered, and delivered at runtime. Structure Engineering organizes the dynamic bindings between agents and artifacts, governing how components find and connect to each other. Evolution Engineering manages the lifecycle of self-rewriting artifacts, ensuring that when the system modifies its own context (a capability that “The Edge of the Underdefined” documents self-improving agents already demonstrating), those modifications remain within governed bounds.

This is where context infrastructure becomes genuinely adaptive. Instead of choosing between static configuration files (reliable but rigid) and autonomous self-modification (flexible but ungoverned), the three-layer framework offers a middle path. Context can evolve in response to operational feedback, while infrastructure constraints prevent that evolution from drifting outside acceptable bounds. The combination of governance patterns from the supply chain framing with the runtime adaptivity from the loosely-structured software framing points toward a more complete picture of what production context infrastructure might look like.

The Maturity Opportunity

The infrastructure patterns described here (production contracts, multi-dimensional evaluation, CI gates, statistical regression testing, and runtime View management) each have working implementations backed by empirical evidence. The gap between what the research demonstrates and what most teams have actually built is mostly one of adoption, not of available tools.

Survey data suggests that prompt usage in software engineering remains largely ad hoc, with prompts refined through trial-and-error and rarely reused. As “Context is the New Code” noted, only about 5% of surveyed open-source repositories have adopted any context file format at all. The parallel to early unit testing adoption or early version control adoption is hard to miss. A practice that starts as optional among a skilled minority tends to become standard once enough teams experience the cost of not doing it.

What distinguishes this moment is that the infrastructure does not need to be invented from scratch. Supply chain governance, production testing methodology, continuous deployment practice, and statistical experiment design all have established patterns that transfer directly to context management. Treating context as infrastructure is largely a matter of applying existing engineering discipline to a new class of artifact, one that happens to shape every decision an AI system makes.

The teams moving fastest appear to be the ones that recognized this early. They built the infrastructure to measure, test, and govern the context their models consume, and that investment compounded over time. For teams still tuning prompts by hand and evaluating by feel, the patterns are available to adopt directly, without rediscovering the hard lessons from scratch.

Originally published at https://thinkata.com.

The Turn as the Unit of Quality

Mark Williams — Fri, 08 May 2026 17:56:00 GMT

What makes iterative refinement productive, and when it starts to hurt

Iterative refinement is one of the defining features of how language models are used in practice. Rather than producing a final result in a single pass, users and autonomous agents refine outputs across multiple turns of interaction. Early work on self-feedback and verbal reflection established that this approach reliably outperforms single-pass generation. But how reliably, and for how long?

This finding connects three ideas that keep appearing across recent AI systems research. Structured checklists decompose quality into individually verifiable criteria, formalizing what “targeted feedback” actually means. Selective memory architectures decide what to retain and what to forget between turns, preventing the context window from becoming a graveyard of stale instructions. Deterministic validation layers enforce constraints that probabilistic models cannot guarantee on their own. Each imposes structure on what would otherwise be an open-ended, drift-prone process.

Why Turns Go Wrong

Understanding why unstructured iteration degrades output requires looking at what happens inside a model’s context window (the maximum amount of text a model can consider at once) as turns accumulate. Research on the “lost in the middle” phenomenon showed that language model performance is highest when relevant information appears at the beginning or end of the input, and drops significantly when the model must access information positioned in the middle of long contexts. As conversations grow longer, earlier instructions are not just diluted by newer content. The model’s attention mechanism actively deprioritizes them. A survey covering over 1,400 research papers formalized this challenge by decomposing context engineering into three stages, retrieval, processing, and management, each introducing its own failure modes. The default mode of iterative interaction, appending each turn’s output to a growing window without structured curation, is working against sustained quality from the start.

Checklists That Steer

A sound engineer at a mixing console adjusts each channel independently, setting levels for bass, treble, reverb, and compression on separate faders rather than turning a single “make it sound better” knob. Structured quality evaluation works the same way. The TICK framework demonstrated that decomposing quality into checklist-based yes/no questions is more reliable for both humans and language models than holistic scoring. Answering “Does the response address the user’s budget constraint?” is a simpler cognitive task than assigning an overall quality rating on a 10-point scale. The decomposition reduces the inconsistency that plagues open-ended judgments, and composable pipelines like AutoChecklist can now generate such criteria automatically from a task description.

This connects directly to the 12-turn study’s central finding. When Javaji et al. compared vague “improve it” feedback against prompts targeting specific quality dimensions, the targeted version sustained improvement over more turns precisely because it functioned as a single-item checklist. A multi-item checklist extends this logic by ordering quality dimensions by importance. Each turn addresses the highest-priority unsatisfied criterion, and the checklist records what has already been verified so that subsequent turns do not undo earlier gains. The model is no longer guessing what “better” means. The checklist tells it.
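The turn-selection rule described above, address the highest-priority unsatisfied criterion, can be sketched as follows. This is an illustration only: the `Criterion` dataclass and `next_target` function are hypothetical names, and the checks are simple string predicates standing in for what would in practice be a model-based or human judgment.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Criterion:
    question: str                  # yes/no question in the TICK style
    priority: int                  # lower number means more important
    check: Callable[[str], bool]   # stand-in for a model or human evaluation


def next_target(criteria, draft) -> Optional[Criterion]:
    """Return the highest-priority criterion the draft fails, or None if all pass."""
    for criterion in sorted(criteria, key=lambda c: c.priority):
        if not criterion.check(draft):
            return criterion
    return None


criteria = [
    Criterion("Is the response under 50 words?", 2,
              lambda d: len(d.split()) < 50),
    Criterion("Does the response address the user's budget constraint?", 1,
              lambda d: "budget" in d.lower()),
]

draft = "We recommend the premium plan."
target = next_target(criteria, draft)
print(target.question)  # Does the response address the user's budget constraint?
```

Because every criterion is re-evaluated in priority order on each turn, a later edit that breaks an earlier, more important criterion surfaces immediately as the next target instead of going unnoticed.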

This pattern appears in practitioner tools as well. The Codified Context framework, developed during construction of a 108,000-line C# distributed system, included a “constitution” file that functioned as a prioritized checklist. Naming conventions came first, build commands second, orchestration protocols third. The ordering was not arbitrary. It reflected which violations were most costly to fix if left uncaught. Across 283 development sessions, this structure prevented repeated failures by ensuring each session validated high-priority constraints before moving to less critical ones. The criteria themselves can be generated by a model, but the prioritization, the decision about which quality dimension matters most, still required human judgment about costs and consequences.

Remembering What Matters

A library that never removes a book eventually buries its most valuable references under sheer accumulation. AI memory faces a similar problem. A checklist that structures each turn is only useful if the system remembers what was checked and what was found, but retaining everything introduces its own degradation.

The Agentic Context Engineering (ACE) framework named two failure modes that make this concrete. Brevity bias is the tendency for iterative optimization to compress rich context into short, generic summaries that strip away the domain-specific knowledge that actually made previous turns successful. A detailed playbook that says “when the build fails on the orchestration layer, check the gRPC timeout before restarting the container” gets summarized into “handle build failures appropriately,” and the specific knowledge that prevented a two-hour debugging session disappears. Context collapse is the complementary failure. Successive rewrites gradually erode important details, each individual edit seeming reasonable in isolation but the cumulative effect hollowing out the context’s value.

ACE addressed both by treating context as an evolving playbook updated through structured, incremental additions rather than wholesale rewrites, achieving a 10.6% improvement over strong baselines. One counterintuitive finding from this work is that language models appear to perform better with long, detailed contexts than with tight summaries. Unlike humans, who benefit from concise briefings, LLMs can extract relevance from comprehensive inputs autonomously. Stripping context down for brevity’s sake may sacrifice exactly the edge-case knowledge that separates correct output from output that merely compiles.

The Dynamic Cheatsheet (DC) framework demonstrates what effective curation looks like in practice. DC equips a language model with a persistent, self-curating external memory. After each query, the system explicitly decides which problem-solving strategies deserve to be kept, which should be discarded, and which existing entries should be updated. The results are impressive. On math competition problems, one model’s accuracy more than doubled (from 23% to 50%) by retaining algebraic insights across problems. On the Game of 24 puzzle, another model went from 10% to 99% by accumulating and reusing solution templates. The gains did not come from better prompting or a larger model. They came from the system learning what was worth remembering, and what was not, across successive encounters with similar problems. Meta Context Engineering takes this one step further by having a separate agent optimize the curation procedures themselves, meaning even the format and structure of what gets remembered becomes subject to improvement.
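The keep, discard, and update cycle can be pictured as a small set of explicit operations applied to a memory store after each query. The sketch below is a hypothetical illustration, not the DC implementation: in DC the language model itself makes the curation decisions, whereas here they are supplied as a hand-written list, and the `apply_curation` function and action names are assumptions made for this example.

```python
def apply_curation(memory, decisions):
    """Apply (action, key, text) decisions to a copy of the memory store.

    Actions: "keep_new" adds an entry, "update" overwrites one,
    "discard" removes one. Entries not mentioned are retained unchanged.
    """
    updated = dict(memory)
    for action, key, text in decisions:
        if action in ("keep_new", "update"):
            updated[key] = text
        elif action == "discard":
            updated.pop(key, None)
        else:
            raise ValueError(f"unknown curation action: {action}")
    return updated


memory = {
    "game24_template": "Try pairing numbers to reach 24 via 4*6 or 3*8.",
    "stale_hint": "Always start with addition.",
}

# Decisions that, in a real system, a model would emit after a query.
decisions = [
    ("update", "game24_template", "Try 4*6, 3*8, or 12*2; check division last."),
    ("discard", "stale_hint", None),
    ("keep_new", "algebra_insight", "Substitute to reduce two variables to one."),
]

memory = apply_curation(memory, decisions)
print(sorted(memory))  # ['algebra_insight', 'game24_template']
```

Making the decisions explicit, rather than rewriting the whole store, keeps each change small and inspectable, which echoes the incremental-update approach ACE takes to avoid context collapse.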

Hard Constraints for Soft Outputs

Checklists and selective memory both improve iteration quality, but they share a limitation. Both rely on the language model itself, or a similar model, to make evaluative judgments. A model asked to evaluate its own output against a checklist can exhibit the same biases and inconsistencies that it exhibits in generation. For constraints that must hold without exception, a different mechanism is needed, one that removes the model from the decision entirely.

The general principle is to separate what the model does well (natural language understanding, flexible reasoning, tolerant interpretation of ambiguous input) from what it does poorly (logical guarantees, strict constraint enforcement). VERUS-LM demonstrates this by splitting reasoning into two responsibilities. The language model translates a task description into a formal representation. A symbolic reasoning engine then performs logically sound inference over that representation. On logical reasoning benchmarks, the advantage of this hybrid approach grew as task complexity increased. The model is good at understanding what the problem is. The symbolic engine is good at solving it correctly. Neither works as well alone.

An application of this division of labor uses the Lean 4 theorem prover as a verification layer for financial compliance. Every proposed action by the agent is translated into a formal logical proposition and verified by the Lean 4 proof kernel before execution. If the proof does not check, the action does not execute. There is no probability threshold, no confidence score, no “this looks right.” A compliance rule under this architecture becomes a constraint enforced with mathematical certainty, independent of whatever the model’s next-token distribution might prefer. From a systems perspective, this is the kind of guarantee that makes the difference between a prototype and a production deployment in regulated industries.

What This Suggests

The three mechanisms operate at different stages of the refinement cycle and address distinct failure modes. A checklist defines what “better” means for the current turn. Selective memory decides what to carry forward. Deterministic validation enforces constraints that must hold regardless of the model’s probabilistic output.

Any one of these in isolation appears to be insufficient. A checklist without selective memory will eventually be overwhelmed by accumulated context. Selective memory without structured criteria risks curating toward the wrong quality dimensions. Deterministic validation without good memory and good criteria will enforce hard constraints on output that is otherwise drifting.

For teams building iterative workflows, whether for code generation, research, writing, or any domain where quality develops through successive passes, the practical takeaway is that the turn is the unit of design. The effort spent deciding what each turn evaluates, remembers, and enforces may matter at least as much as the effort spent on the initial prompt. Whether the structuring of turns will itself be automated, as early work on meta-level skill evolution tentatively suggests, or whether it will remain a domain where human judgment about priorities and consequences provides durable value, is a question the field has not yet answered.

Originally published at https://thinkata.com.

The Capability-Reliability Split in Agent Systems

Mark Williams — Thu, 30 Apr 2026 15:36:01 GMT

Why frontier agents reach state-of-the-art on one run, and fail at the same task on the next

A frontier agent can occasionally surpass a published research baseline and, in another run on the same task, fail to make any meaningful progress. The pattern recurs often enough across recent evaluations that researchers have started to treat it as a structural feature of agent systems rather than a quirk of any single implementation. Capability asks whether a model can perform a task in principle. Reliability asks whether it does so consistently, across repeated attempts, across small perturbations, and across tasks that take dozens or hundreds of steps to complete. Recent evidence suggests these two properties drift apart faster than benchmark headlines make visible.

The split has practical stakes. An agent system, in this context, refers to a large language model (LLM, the underlying neural network that processes text) coupled with a scaffold (the surrounding software that decides when to call the model, what tools to invoke, and how to handle errors). When the same agent passes a benchmark on Monday and breaks on a near-identical task on Tuesday, the deployment question is no longer whether the technology can do the work. The question becomes how often it does.

When the Same Agent Both Wins and Fails

ResearchGym, a benchmark that places agents inside containerized research environments rebuilt from accepted papers at ICML, ICLR, and ACL, captures the split with unusual clarity. In a controlled evaluation of an agent powered by GPT-5, the system improved over the provided baselines in only 1 of 15 evaluations, an improvement rate of 6.7%, and completed only 26.5% of sub-tasks on average across 39 sub-tasks total. In a single run, the same agent surpassed the solution from an ICML 2025 Spotlight paper, evidence that the underlying capability is real even when the reliability is not. Proprietary scaffolds built on Claude Code (Opus-4.5) and Codex (GPT-5.2) displayed a similar gap.

Across Long Horizons

HORIZON, a cross-domain diagnostic benchmark released in April 2026, looked at the same problem from a different angle. Across more than 3,100 trajectories collected from frontier models in the GPT-5 and Claude families, the authors documented a horizon-dependent degradation pattern. Agents that performed strongly on short tasks broke down on long-horizon work that required extended, interdependent action sequences.

Across Many Models

The Holistic Agent Leaderboard (HAL), introduced by a group at Princeton, ran 21,730 agent rollouts spanning 9 models, 9 benchmarks, and four domains, comparing models, scaffolds, and benchmarks side by side and bringing the cost of large-scale agent evaluation down by roughly an order of magnitude. One counterintuitive finding from that data is worth pausing on. Higher reasoning effort, the practice of allocating more inference-time compute to deliberation, reduced accuracy in the majority of runs.

A move that should obviously help did not. Bigger headline numbers and steadier behavior are not the same thing, even when the same lever is being pulled.

Why Standard Benchmarks Miss the Gap

Part of the reliability story is methodological. Most agent evaluations report pass@1, the probability that an agent succeeds on a single attempt. A 2026 study collected 60,000 agentic trajectories on SWE-Bench-Verified, a software engineering benchmark, across three models and two scaffolds, and found that single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0, the setting that should produce the most deterministic behavior. Reported improvements of 2 to 3 percentage points, the kind that often headline a new release, may reflect evaluation noise rather than genuine progress. Trajectories diverged early, often within the first few percent of generated tokens (a token is the unit of text the model processes, roughly a word or word fragment), and these small differences cascaded into entirely different solution strategies.
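The run-to-run spread is easy to measure once multiple runs are recorded. The sketch below is a minimal illustration with made-up data, not the study's analysis: the `pass_at_1_by_run` helper and the three tiny runs are hypothetical. It computes the per-run pass@1, then the mean, sample standard deviation, and range across runs, which are the quantities a reported single-run number hides.

```python
import statistics


def pass_at_1_by_run(runs):
    """runs: list of runs, each a list of booleans (one outcome per task)."""
    return [sum(outcomes) / len(outcomes) for outcomes in runs]


# Three repeated runs of the same agent on the same four tasks (made-up data).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]

rates = pass_at_1_by_run(runs)
mean_rate = statistics.mean(rates)
spread = statistics.stdev(rates)      # sample standard deviation across runs
run_range = max(rates) - min(rates)   # gap between the luckiest and unluckiest run

print(rates)      # [0.75, 0.5, 1.0]
print(mean_rate)  # 0.75
print(spread)     # 0.25
print(run_range)  # 0.5
```

Reporting the mean together with the spread makes it possible to judge whether a 2 to 3 point improvement between two systems is larger than the noise of either one.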

A Reliability Science for Agents

Just as a cockpit instrument panel separates altitude, airspeed, and fuel into independent gauges so a pilot can see when one is failing, a reliability science framework released in March 2026 splits agent performance into separate dimensions tracked over time. The authors evaluated 10 models across 23,392 episodes on a 396-task benchmark that varied task duration and domain, and proposed four metrics including a Reliability Decay Curve, which tracks how success rate falls as tasks lengthen, and a Variance Amplification Factor, which measures how variability in outcomes grows with horizon. Capability and reliability rankings diverged substantially, with multi-rank inversions at long horizons. A model ranked first on short tasks could fall to fourth or fifth once tasks stretched out. Frontier models showed the highest meltdown rates, up to 19%, because they attempted ambitious multi-step strategies that sometimes spiraled into failure.

A March 2025 survey of agent evaluation methods, updated through 2026, identified the same pattern at a higher level. Cost-efficiency, safety, and robustness remain underassessed in most agent benchmarks.

The Mechanics of Long-Horizon Failure

The next question is mechanical. What is actually breaking when an agent that performs well on short tasks falls apart on long ones? A January 2026 analysis frames the answer as a mismatch between reasoning and planning. Step-wise reasoning, the chain-of-thought pattern that has driven much of the recent progress in LLMs, induces what the authors call a step-wise greedy policy. The agent picks the locally best next action without modeling delayed consequences. Over short horizons this often suffices. Over long horizons, early myopic commitments compound and become difficult to recover from. The proposed fix, FLARE (Future-aware Lookahead with Reward Estimation), pushes value propagation back through the trajectory so that downstream outcomes can shape early decisions. Across multiple benchmarks, FLARE often allowed a smaller open-source model to outperform a larger frontier model running standard step-by-step reasoning. The argument draws a clearer line between reasoning, the local manipulation of intermediate steps, and planning, the explicit consideration of how early choices constrain later ones.

ResearchGym catalogs the same phenomenon from the failure side. Across runs, the recurring problems were impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits imposed by context length, the maximum number of tokens an LLM can consider at once. None of these are pure capability failures. An agent that knows what a good experiment looks like can still abandon it too early, commit to the wrong hypothesis with too much confidence, or simply run out of working memory before the task ends. The capabilities the model has in isolation do not translate cleanly into behavior under sustained pressure.

What Helps, and What Surprisingly Does Not

Mitigation research has clustered around test-time scaling, the practice of allocating more compute at inference time to improve outcomes without retraining. The first systematic study of test-time scaling for language agents, published in mid-2025, found that scaling helps, that knowing when to reflect matters, that list-wise verification methods, which compare a list of candidates rather than ranking them pairwise, outperform alternatives, and that diversifying rollouts has a positive effect on task performance. A 2026 framework called ARTIS extended these ideas to settings where actions touch external systems and cannot be undone, by decoupling exploration from commitment through simulated interactions before real-world execution. The authors flag a less obvious finding. Naive LLM-based simulators struggle to capture rare but high-impact failure modes, which means simulators have to be deliberately trained to be honest about how things go wrong, not only how they go right.

What Helps

For long-horizon coding agents specifically, a 2026 study argued that test-time scaling is fundamentally a problem of representation, selection, and reuse rather than generating more attempts. By converting each rollout into a structured summary of hypotheses, progress, and failure modes, then using methods like Recursive Tournament Voting and Parallel-Distill-Refine to select among candidates, the authors moved Claude-4.5-Opus from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0.

What Hurts

The same reliability framework that documented divergence between capability and reliability also reported a counterintuitive negative result. Across all 10 models tested, memory scaffolds, the systems designed to give agents persistent context across turns, universally hurt long-horizon performance. The default assumption that more memory is always better appears to be wrong in this regime, at least for the scaffolds and tasks studied. The HAL finding that higher reasoning effort can reduce accuracy points in a similar direction. More of a thing is not always more useful.

What This Might Mean

The picture that emerges, while still incomplete, points toward a few useful adjustments rather than a single fix. The field appears to be moving toward treating reliability as a first-class evaluation dimension rather than a footnote to capability. Multi-run pass@1, statistical power analysis, and pessimistic bounds like pass^k are entering the conversation precisely because the cost of mistaking noise for progress is now visible. The design assumption that more compute, more memory, or more reasoning effort always helps is being tested empirically and sometimes failing. The gap between “the agent did this once” and “the agent does this when it matters” remains the gap that separates impressive demos from production deployments.
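The difference between an optimistic and a pessimistic reading of the same runs is easy to see in code. The sketch below is not taken from any of the studies above. It assumes one common formalization: given n recorded runs of a task with c successes, pass@k estimates the chance that at least one of k sampled runs succeeds, while pass^k estimates the chance that all k succeed. The function names and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs (drawn from n recorded runs) succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k exceeds the number of failures, so this is safe.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs (drawn from n recorded runs) succeeded, i.e. pass^k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical task: 10 recorded runs, 7 of them successful.
runs, successes = 10, 7
for k in (1, 3, 5):
    optimistic = pass_at_k(runs, successes, k)
    pessimistic = pass_hat_k(runs, successes, k)
    print(f"k={k}  pass@k={optimistic:.3f}  pass^k={pessimistic:.3f}")
```

With 7 successes in 10 runs, both metrics start at 0.70 for k=1, but at k=3 pass@k rises to about 0.99 while pass^k falls to about 0.29. The same agent looks nearly perfect or badly unreliable depending on which question is being asked.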

For organizations evaluating agent systems, the implication is straightforward enough to state without overstatement. A single high score on a benchmark suggests what the system can sometimes do. It does not, on its own, describe what the system will do under repetition, perturbation, or duration. The evidence from late 2025 and early 2026 suggests treating these as different questions, and budgeting evaluation accordingly. One open question is whether the next generation of agent improvements will close the split or widen it.

Originally published at https://thinkata.com.

The Edge of the Underdefined

Mark Williams — Thu, 23 Apr 2026 15:41:01 GMT

This is the final article in “The Meta-Engineer,” a three-part series examining how AI is reshaping the identity and skill set of software engineers. The first article, “Context is the New Code,” traced the rise of context engineering as a discipline. The second, “Don’t Vibe, Architect,” showed how professionals orchestrate agents at scale. Both ended with the same uncomfortable observation. The artifacts and skills that feel distinctly human are already beginning to be automated by the systems they were designed to guide.

This final article takes up the question directly. If self-improving agents can refine their own prompts, playbooks, and architectures, what remains durably human? The answer requires examining two things. First, which engineering skills are being commoditized, and which are gaining value. Second, how far the automation of meta-knowledge, knowledge about how to manage knowledge, has actually progressed. The evidence points toward a conclusion more precise than either “everything will be automated” or “humans will always be needed.”

Which Skills Survive

The analysis of 57 practitioner videos that identified the conductor metaphor in the previous article also raised a pointed concern about what happens at the entry level. Junior engineers who accept AI output without understanding it create “house of cards” solutions, code that compiles and passes tests but rests on foundations no one in the room actually understands. The study argued for curricular shifts toward problem-solving, architectural thinking, code review, and early integration of large language model (LLM) tools, precisely because the skills that agents handle well (syntax, boilerplate, routine implementation) are the same skills that traditionally served as the training ground for new developers. If the on-ramp disappears, the question becomes how to develop judgment without the years of hands-on experience that currently produce it.

A paper framing the emergence of “SE 3.0” documented the broader role shift from manual coding to high-level orchestration and projected that traditional IDEs (integrated development environments, the text editors and tooling that programmers use to write code) will eventually give way to agent orchestration environments. This describes tools and workflows that already exist in prototype form.

What’s Commoditizing

The first direct comparison of agent and human code proficiency found that agents generate overwhelmingly basic-level code, with over 90% of Python constructs falling into beginner and elementary categories. The proficiency profiles of agent-written code and human-written code were broadly similar, with small but statistically significant differences. Agents are not writing qualitatively different code. They are writing structurally similar code faster and cheaper, which makes the commoditization of routine implementation concrete rather than theoretical.

What’s Getting More Expensive

These gains come with real costs. Industry surveys report an increase of nearly 89% in computing expenses from 2023 to 2025, driven largely by generative AI adoption, with some companies already postponing AI initiatives because the business case collapsed once costs were factored in. Cost-aware engineering, the discipline of managing token budgets (tokens are the units of text that language models process, and each one costs money), model selection, and compute allocation, is emerging as a professional competency that did not exist two years ago. The cheap part is getting cheaper. The expensive part is getting more expensive.

An industry-academia consortium of over 30 European partners attempted to map where all of this is heading. Their five-year vision projects “self-star” systems (self-healing, self-optimizing software) enabled by agentic AI across all phases of the software development lifecycle, from requirements gathering through maintenance. The role of the software professional, in this projection, shifts decisively toward oversight, intent specification, and high-level design. The GENIUS project is building tools for this transition, but the transition itself is not waiting for the tools to be ready.

When Agents Learn to Improve Themselves

The skills gaining value, architectural thinking, constraint specification, quality judgment, all involve what might be called meta-knowledge, knowledge about how to organize, evaluate, and direct other knowledge. The uncomfortable question is whether this meta-level work is itself automatable. A growing body of research suggests that it is, at least partially.

A comprehensive survey of self-evolving AI agents reviewed techniques spanning prompt evolution (automatically refining the instructions given to agents), memory adaptation (optimizing how agents store and retrieve information), tool creation (agents building new capabilities they were not initially given), and architecture search (automatically discovering better organizational structures for multi-agent systems). The scope is striking. These are not narrow improvements to individual outputs. They are systematic methods for automatically enhancing every major component of an agent system through interaction data and environmental feedback.

The Compression Pattern

Just as a caterpillar’s cocoon becomes unnecessary once the butterfly can fly, layers of engineered scaffolding around an AI agent can become counterproductive when the underlying model grows capable enough. The SICA system (Self-Improving Coding Agent) demonstrated this by autonomously editing its own codebase, improving from 17% to 53% on a subset of SWE-Bench Verified, a benchmark that tests whether agents can resolve real GitHub issues. When a reasoning model was provided as a sub-component, crude reasoning scaffolds that SICA had built for itself actually hurt performance, because the model’s native reasoning was better than the agent’s self-designed wrapper. This recurs throughout the history of software. A layer that was necessary at one capability level becomes dead weight at the next.

The ACE framework, described in the first article of this series, treats context as an evolving playbook refined through a generate-reflect-curate cycle. Without any labeled training data, relying solely on execution feedback, ACE matched the top-ranked production-level agent on the AppWorld benchmark, a test suite that evaluates agents on realistic multi-step tasks, despite using a smaller open-source model. The configuration files that feel novel and human-crafted today are already beginning to be optimized by the systems they guide. The MASS framework (Multi-Agent System Search) went further by automating the search over both agent prompts and the topologies connecting multiple agents, treating not just what individual agents do but how they are organized as an optimization target. And the ALAS system (Autonomous Learning Agent System) demonstrated autonomous knowledge acquisition through an iterative loop that generates its own learning curriculum, retrieves information from the web, distills it into training data, fine-tunes the model, evaluates results, and revises its plan without human intervention. This is an agent that expands its own knowledge boundary through self-directed research.

The evidence is clear enough to state plainly. Prompt optimization, memory management, tool selection, coordination strategy, and even knowledge acquisition, every major dimension of what this series has called “context engineering,” is already the subject of automated improvement. The question is not whether these capabilities will be partially automated. They already are.

The Four Things That Stay

The analysis across this series does not support either comfortable conclusion. Claiming that everything will be automated ignores the specific structural reasons why certain problems resist computational solutions. Claiming that humans will always be needed, as a reassurance, obscures the question of what exactly they will be needed for.

The more precise claim, supported by the evidence across these studies, is that four categories of work resist automation, and they resist it not because they are computationally hard but because they require external grounding that agent systems do not have access to.

Goal formation. What should the system do, and why does it matter? Every agent system begins with an objective that a human defined. The choice to build a distributed multiplayer game, to prioritize latency over consistency, to serve a particular user population, these are not optimization problems. They are decisions about what is worth doing, grounded in values, strategy, and institutional context that sits outside any training corpus.

Constraint legitimacy. Legal requirements, ethical boundaries, and business constraints come from outside the computational system. An agent can be told to comply with GDPR (the European data protection regulation), but it cannot independently determine that GDPR compliance matters, or negotiate the trade-offs between privacy protection and product functionality. These constraints originate in institutions, not in data.

Taste and judgment. The anti-mock instructions that appear in CLAUDE.md files, described in the first article, offer a small but concrete example. Someone had to decide that excessive mocking constitutes bad practice for that particular project. That is a judgment call agents do not make on their own, because “good” is not a property of code. It is a property of the relationship between code and human intentions, and those intentions vary by context in ways that no benchmark captures.

Accountability. When systems fail, someone must be responsible. This is not a technical constraint but an institutional one. The question of who is accountable when an autonomous agent introduces a security vulnerability or makes an architectural decision that causes a production outage cannot be resolved computationally. It requires the kind of social contract that only humans can enter into.

These four categories share a common structure. They are not technical problems. They are social, institutional, and epistemic. They persist not because they are difficult to compute, but because the ground truth lives outside the system, in human values, legal frameworks, organizational priorities, and the continuous generation of new ambiguity that the real world produces faster than any system can resolve.

Where the Edge Moves

Every abstraction layer in the history of software has eventually been formalized and then automated. Assembly gave way to compilers. Manual memory management gave way to garbage collectors. Boilerplate gave way to frameworks. Code generation gave way to autonomous agents. And context engineering, despite feeling like a distinctly human cognitive skill right now, is already being partially automated by the systems it was designed to guide.

The real long-term role of the engineer has less to do with writing code or designing context than with operating at the edge of what machines still cannot define. That edge moves, and it moves fast. But it does not disappear, because the world keeps generating new ambiguity faster than systems can resolve it. The engineer of 2030 probably will not be writing CLAUDE.md files by hand. That engineer will be defining intent, negotiating constraints, and reviewing outcomes, the same things that were always the hardest part of engineering, dressed in new tools.

The pattern across this series suggests that humans do not simply move up the stack. They move to wherever meaning is still underdefined.

Originally published at https://thinkata.com.

Don’t Vibe, Architect

Mark Williams — Thu, 16 Apr 2026 15:51:01 GMT

How professionals work with agents, how context scales, and why orchestration is a transitional skill

This is the second article in “The Meta-Engineer,” a three-part series examining how AI is reshaping the identity and skill set of software engineers. The first article is “Context is the New Code.”

The first article in this series described a new category of software artifact, configuration files that tell AI coding agents how to behave within a particular codebase. Those files have measurable impact on agent efficiency and output quality. But they immediately raise a deeper question. If structured context is the foundation of effective agent use, who creates it, and what does the rest of the work actually look like?

The popular narrative about coding agents splits into two contradictory claims. One holds that agents are replacing developers, writing code at a pace no human can match. The other insists they are merely fancier autocomplete, useful for boilerplate but incapable of real engineering. A growing body of field research, large-scale repository analysis, and detailed practitioner case studies supports neither version. Professional developers are using agents extensively, but in a mode that looks nothing like “vibe coding,” the practice of trusting AI output without careful review. They plan, supervise, validate, and increasingly build elaborate infrastructure to keep agents effective across complex, long-running projects. The work has not disappeared. It has changed shape.

What Professionals Actually Do

A field study combining 13 in-depth observations with a qualitative survey of 99 experienced developers found a consistent pattern. Professional developers value agents as a productivity boost, but they retain authority over software design and implementation. They plan before implementing, validate all agent outputs, and insist on fundamental quality attributes like maintainability, test coverage, and architectural coherence. Developers found agents well-suited to straightforward, well-described tasks but not to complex ones involving architectural judgment or unfamiliar domains. The relationship resembles less a pair programming partnership and more a delegation arrangement where the human sets the specification and reviews the results.

“The role is more… if you think of it like a conductor of sorts as opposed to the actual instrument player.”

-Practitioner quoted in Chang et al., 2025

A separate qualitative analysis of 57 practitioner videos published between late 2024 and October 2025 confirmed a complementary picture. Developers consistently describe their evolving role using the metaphor of a conductor, someone who directs rather than plays. The cognitive load has not decreased so much as shifted. Instead of grappling with syntax, APIs, and repetitive implementation details, developers devote greater attention to domain modeling, architectural decisions, and system integration. Natural language has become the primary medium of software composition, but the reasoning behind that language, the judgment about what to build and why, remains firmly human. The study also raised a specific warning about junior engineers who accept AI output without understanding it, creating what practitioners described as “house of cards” solutions that compile and pass tests but rest on foundations no one in the room actually understands.

The scale of adoption is already substantial and growing fast. A study of over 129,000 GitHub projects found that between 15.8% and 22.6% show traces of coding agent use, a remarkably high figure for tools that have existed in their current form for less than a year. Agent-assisted commits tend to be larger than purely human commits and focus disproportionately on features and bug fixes, suggesting developers use agents for substantive production work rather than experimentation. A complementary dataset of over 456,000 agent-generated pull requests (proposed code changes submitted to a repository for review) across 61,000 repositories reinforced the trend. OpenAI Codex alone produced more than 400,000 pull requests within two months of its release. Developers appear to work in two distinct modes, using agents for “acceleration” on familiar tasks where the goal is speed, and for “exploration” of unfamiliar design spaces where the goal is learning. The relevant productivity question, one that frameworks like SPACE address by measuring satisfaction, collaboration, and efficiency alongside raw throughput, is not how fast agents generate code but how effectively the combined human-agent system produces correct, maintainable software.

What these studies collectively describe is neither replacement nor mere assistance. The developer’s contribution has shifted from producing code to producing specifications, constraints, and quality judgments, a transition that turns out to demand more expertise rather than less.

When a Config File Isn’t Enough

The configuration files described in the first article, CLAUDE.md and AGENTS.md, work well for modest-sized projects. A few hundred lines of instructions can orient an agent to a codebase’s conventions, testing expectations, and architectural patterns. But what happens when a project reaches 108,000 lines of code, spans 45 subsystems, and defines 35 network message types? A single file no longer suffices.

Three Tiers of Machine Memory

Just as a large library organizes its holdings into different levels of accessibility, with reference materials on open shelves, specialized texts in reserve, and archival documents retrieved on request, a sufficiently complex software project needs layered knowledge infrastructure for its AI agents. A detailed case study documented exactly what this looks like. A researcher built a 108,000-line C# distributed system using Claude Code as the sole code-generation tool, developing a three-tier context architecture across 283 development sessions. The first tier, a “hot memory” constitution of roughly 660 lines, loaded into every agent session automatically. It encoded naming conventions, build commands, and orchestration protocols. The second tier comprised 19 specialized domain-expert agents, each responsible for a specific subsystem like networking, physics, or UI, totaling around 9,300 lines. The third tier was a cold-memory knowledge base of 34 on-demand specification documents served through a retrieval tool only when relevant. The total context infrastructure amounted to about 26,000 lines, roughly 24% of the codebase it supported.
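The three tiers can be pictured as a small data structure. The sketch below is not the case study's implementation. The class name, the example contents, and the keyword-overlap lookup that stands in for the retrieval tool are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical three-tier store: hot constitution, specialists, cold specs."""
    hot: str                                                    # tier 1: loaded every session
    specialists: dict[str, str] = field(default_factory=dict)   # tier 2: one brief per subsystem
    cold: dict[str, str] = field(default_factory=dict)          # tier 3: title -> spec document

    def assemble(self, subsystem: str, query: str) -> list[str]:
        """Build the context for one session, cheapest tier first."""
        parts = [self.hot]
        if subsystem in self.specialists:
            parts.append(self.specialists[subsystem])
        # Stand-in for a real retrieval tool: include a cold document only
        # when its title shares at least one word with the query.
        query_words = set(query.lower().split())
        for title, document in self.cold.items():
            if query_words & set(title.lower().split()):
                parts.append(document)
        return parts


store = ContextStore(
    hot="Constitution: naming conventions, build commands, orchestration protocol.",
    specialists={"networking": "Networking expert brief."},
    cold={
        "message serialization spec": "Spec: how network messages are serialized.",
        "physics step spec": "Spec: fixed-timestep physics loop.",
    },
)
context = store.assemble("networking", "add a new serialization message")
print(len(context))

# The ratio reported in the case study: context lines over codebase lines.
print(f"{26_000 / 108_000:.0%}")
```

The point of the layering is that only the hot tier is paid for on every session. The specialist brief and the cold documents enter the context only when the task calls for them.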

The detail that the researcher’s primary background is in chemistry, not software engineering, inverts a common assumption about who can do this kind of work. Building complex software with agents may depend less on traditional coding skill and more on the ability to design knowledge architectures, to decompose a problem domain into structured components and write clear specifications for each. That is an architectural competency, but not necessarily a programming one. The context infrastructure itself was AI-generated under human architectural direction, with the human’s role being to decide what knowledge to capture and how to organize it.

Similar infrastructure patterns appear in other systems. A technical report on the OpenDev terminal agent described five-stage progressive context compaction that activates at increasing token pressure thresholds, from 70% to 99% of the model’s context window capacity (the maximum amount of text it can consider at once). To counteract “instruction fade-out,” the phenomenon where agents gradually stop following their original instructions as a conversation grows longer, the system injects event-driven reminders at key decision points rather than relying solely on the initial prompt. A three-tier Skills hierarchy, spanning built-in, project-level, and user-defined instructions, manages reusable templates through lazy loading, injecting only what each specific task requires. These are infrastructure-level solutions to a problem that anyone running a long agent session has encountered.
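Progressive compaction amounts to a lookup from token pressure to an increasingly aggressive action. The sketch below is a guess at the shape of such a scheme, not OpenDev's actual design. Only the 70% and 99% endpoints and the count of five stages come from the description above. The intermediate thresholds and every stage name are invented for illustration.

```python
from bisect import bisect_right
from typing import Optional

# (pressure threshold, action). Only 0.70 and 0.99 are from the report as
# described; the middle thresholds and all action names are assumptions.
STAGES = [
    (0.70, "trim_tool_output"),
    (0.80, "summarize_old_turns"),
    (0.90, "drop_stale_file_contents"),
    (0.95, "collapse_history_to_summary"),
    (0.99, "emergency_compaction"),
]
THRESHOLDS = [threshold for threshold, _ in STAGES]


def compaction_stage(tokens_used: int, window_size: int) -> Optional[str]:
    """Return the most aggressive stage whose threshold has been reached, if any."""
    pressure = tokens_used / window_size
    reached = bisect_right(THRESHOLDS, pressure)  # number of thresholds <= pressure
    return STAGES[reached - 1][1] if reached else None


for used in (50_000, 70_000, 92_000, 99_500):
    print(used, compaction_stage(used, window_size=100_000))
```

Ordering the stages from least to most destructive means a session loses cheap, recoverable material (verbose tool output) long before it loses anything the agent is likely to need again.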

Multi-agent approaches add another dimension of complexity and capability. A study of context engineering for coordinated coding systems found that retrieving both external knowledge (research papers and documentation) and internal codebase context (project files and conventions) substantially improved task resolution on the SWE-Bench Lite benchmark, a widely used test suite for evaluating whether agents can resolve real GitHub issues. The multi-agent approach yielded higher single-shot success rates than single-agent baselines, at the cost of roughly 3 to 5 times more tokens per task. Dividing work among specialized sub-agents, each operating within a focused context window, reduced hallucinations (plausible but incorrect AI-generated content) and improved adherence to project conventions. But orchestrating multiple agents introduced its own complexity. Someone had to design the task decomposition, define agent roles, and ensure shared state remained consistent. For now, that someone is a human.

The Orchestration Paradox

The orchestration patterns that professionals develop, decomposing tasks, routing work to specialized agents, maintaining shared memory across sessions, represent genuine engineering skill. They also represent the next thing likely to be automated.

The Darwin Gödel Machine demonstrated this directly. Rather than relying on a fixed, human-designed coordinator to direct improvements, the system iteratively modified its own codebase, including its own orchestration logic, and empirically validated each change against coding benchmarks. On SWE-bench, it improved performance from 20% to 50%. On the Polyglot benchmark, which tests across six programming languages, it improved from 14.2% to 30.7%. The key architectural insight is that this is a single system that both solves coding problems and refines its own implementation, removing the need for a separate, hand-crafted meta-agent. The better tools and workflows it discovered were not anticipated by its designers.

When the Wrapper Becomes Redundant

Just as a machine tool capable of manufacturing other machine tools represents a fundamentally different category than one that merely stamps out parts, a coding agent that can edit its own source code occupies a different position than one that simply follows instructions. The SICA system (Self-Improving Coding Agent) demonstrated this by autonomously modifying its own Python codebase, improving from 17% to 53% on a subset of SWE-Bench Verified. One finding proved particularly telling. When a reasoning model was provided as a sub-component, crude reasoning scaffolds that SICA had built for itself actually hurt performance, because the model’s native reasoning was better than the agent’s self-designed wrapper. This is a concrete instance of a recurring compression pattern, where a layer that was necessary at one capability level becomes counterproductive when the underlying system matures.

Meanwhile, trajectory-informed memory generation already automates the extraction of structured lessons from agent execution histories. Rather than relying on humans to document what worked and what failed after each session, the system analyzes completed task trajectories, identifies which decisions led to successes or failures through causal attribution, and generates categorized guidance for future runs, including strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient successes. On the AppWorld benchmark, this approach improved task completion by up to 14.3 percentage points, with the strongest gains on the most complex tasks. This is essentially automating the “lessons learned” process that the codified context researcher performed manually across 283 development sessions.
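The three categories of guidance map onto a simple sorting rule. The sketch below illustrates only that categorization step, under loud assumptions: the record fields, the step budget, and the rule itself are hypothetical, and the causal attribution that the real system uses to derive each lesson is replaced by a pre-written `lesson` string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    """Hypothetical summary of one completed agent run."""
    task: str
    succeeded: bool
    steps: int
    errors_recovered: int  # failures the agent hit and then worked around
    lesson: str            # placeholder for the output of causal attribution


def categorize(run: Trajectory, step_budget: int) -> Optional[str]:
    """Assign a run to one tip category, or None if it yields no guidance here."""
    if run.errors_recovered > 0:
        return "recovery"        # learned from handling a failure
    if run.succeeded and run.steps > step_budget:
        return "optimization"    # succeeded, but inefficiently
    if run.succeeded:
        return "strategy"        # a clean success worth repeating
    return None


def build_guidance(runs: list[Trajectory], step_budget: int = 20) -> dict[str, list[str]]:
    guidance: dict[str, list[str]] = {"strategy": [], "recovery": [], "optimization": []}
    for run in runs:
        category = categorize(run, step_budget)
        if category is not None:
            guidance[category].append(f"{run.task}: {run.lesson}")
    return guidance


history = [
    Trajectory("book flight", True, 8, 0, "check the calendar before searching"),
    Trajectory("split bill", True, 35, 0, "batch the lookups into one call"),
    Trajectory("send invoice", True, 12, 2, "re-authenticate when the session expires"),
]
for category, tips in build_guidance(history).items():
    print(category, tips)
```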

The pattern is consistent across these studies. The conductor role that practitioners are developing right now is structurally similar to what self-improving systems are learning to do autonomously. Decomposing tasks, routing to specialists, and refining strategies based on execution feedback are exactly the capabilities that agent systems are acquiring through their own operation. Code completion automated the first layer of developer effort. Context configuration is being formalized now. Orchestration appears to be next.

The final article in this series will take up the question this observation raises. If the orchestration layer compresses too, what remains durably human? The answer, the evidence across these studies suggests, has less to do with any particular abstraction level and more to do with wherever meaning is still underdefined.

Originally published at https://thinkata.com.

Context is the New Code

Mark Williams — Thu, 09 Apr 2026 14:57:01 GMT

This is the first article in “The Meta-Engineer,” a three-part series examining how AI is reshaping the identity and skill set of software engineers.

Sometime in mid-2025, a shift began among engineers building production AI systems. The previous two years had been dominated by a single idea, that the key to getting good results from a language model was learning to talk to it well. Entire job titles sprang up around the skill. Courses, certifications, and prompt libraries proliferated. And for a while, the idea held. Careful phrasing did produce better outputs. But as AI coding tools evolved from autocomplete assistants into autonomous agents, the engineers working with them found that “prompt engineering,” however refined, was no longer sufficient. The tasks they faced, getting an agent to navigate a 100,000-line codebase, maintain architectural consistency across sessions, and avoid repeating past mistakes, had little to do with crafting a clever sentence. They needed something more systematic. The emerging answer is context engineering, a discipline that treats the entire informational environment surrounding an AI agent as a designed artifact.

The distinction is more than semantic. Prompt engineering focuses on the instruction itself, the text sent to a language model. Context engineering encompasses everything the model sees at inference time, from system prompts and retrieved documents to session memory, tool definitions, and the structure organizing all of it. If prompt engineering is writing a memo to a new employee, context engineering is designing the entire onboarding program, complete with reference materials, reporting lines, institutional knowledge, and decision-making protocols. The memo matters, but it cannot compensate for a badly designed information environment.

The need for systematic context design became especially visible as coding agents moved from autocomplete tools to autonomous systems capable of multi-step reasoning. An agent that only completes the next line of code can function adequately with a short prompt. An agent that independently creates a feature branch, writes an implementation spanning multiple files, runs tests, diagnoses failures, and iterates until the build passes needs far more than an instruction. It needs to understand the project’s technology stack, its conventions for error handling and logging, its test infrastructure, which directories contain which types of code, and the architectural rationale behind structural decisions that might otherwise look arbitrary. Providing all of this reliably, economically, and in the right format at the right time is a design problem, and it is the problem that context engineering exists to solve.

A Discipline Takes Shape

A comprehensive survey covering over 1,400 research papers formalized this field, establishing a taxonomy that decomposes context engineering into three foundational components. The first, context retrieval and generation, addresses where relevant information comes from, whether through search over documents, tool calls to external APIs, or synthesis from prior interactions. The second, context processing, covers how that information is filtered, compressed, and structured for relevance. The third, context management, deals with the ongoing challenge of maintaining context within a model’s context window, the maximum amount of text it can consider at once, across multi-step interactions. Each stage introduces its own design decisions and failure modes, and the survey reveals that treating any single stage in isolation produces fragile systems.
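The three components chain together naturally as a pipeline. The sketch below is a toy rendering of that taxonomy and not anything prescribed by the survey. Keyword matching stands in for retrieval, word truncation stands in for compression, word counts stand in for tokens, and all function names are assumptions.

```python
from collections import deque


def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Stage 1, retrieval: naive keyword match stands in for real search."""
    terms = query.lower().split()
    return [doc for doc in corpus if any(term in doc.lower() for term in terms)]


def process(snippets: list[str], max_words: int) -> list[str]:
    """Stage 2, processing: compress each snippet by truncating it."""
    return [" ".join(snippet.split()[:max_words]) for snippet in snippets]


def manage(pinned: str, window: deque, new_items: list[str], budget_words: int) -> list[str]:
    """Stage 3, management: add new context, evict the oldest when over budget.

    The pinned entry (for example a system prompt) is never evicted.
    """
    window.extend(new_items)

    def words_used() -> int:
        return len(pinned.split()) + sum(len(item.split()) for item in window)

    while window and words_used() > budget_words:
        window.popleft()
    return [pinned, *window]


corpus = [
    "Logging uses structured JSON via the app logger.",
    "Tests live under tests/ and run with the project test command.",
    "Deployment notes for the staging cluster.",
]
pinned = "System prompt: follow project conventions."
snippets = process(retrieve("logging tests", corpus), max_words=6)
context = manage(pinned, deque(), snippets, budget_words=12)
print(context)
```

Even in a toy like this the stages interact: a looser retrieval step forces harsher truncation or earlier eviction downstream, which is one way to read the survey's warning about treating any stage in isolation.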

From Craft to Maturity Model

Just as a well-organized notebook helps a researcher locate the right reference at the right moment, context engineering structures the informational landscape an AI agent draws from. A separate framework proposes a four-level maturity pyramid for what it calls “agent engineering.” At the base sits prompt engineering, the craft of writing individual queries. Above it sits context engineering, the design and management of the entire informational environment. The third level, intent engineering, encodes organizational goals and trade-off hierarchies into agent infrastructure, moving beyond operational instructions to strategic alignment. At the top, specification engineering creates machine-readable corpora of corporate policies enabling multi-agent systems to operate autonomously at scale. Each level subsumes the one below it as a necessary foundation.

The same framework proposes five quality criteria for evaluating engineered context. Relevance means the agent receives only what pertains to the current task. Sufficiency means nothing critical is left out. Isolation, especially important in multi-agent architectures where multiple AI sub-agents collaborate on different parts of a task, ensures each sub-agent’s context does not leak into another’s. Economy demands minimum token expenditure for maximum informational value. Provenance requires that every element of context be traceable to a verified source. Most teams operating at the prompt engineering level address one or two of these criteria at best, and typically only by instinct rather than by design.
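One way to move these criteria from instinct to design is to write them down as checks. The sketch below is an illustration, not part of the framework. It reduces each criterion to a pass/fail test over a hypothetical `ContextItem` record, and the field names, file path, topics, and token budget are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextItem:
    """Hypothetical record for one piece of context handed to a sub-agent."""
    topic: str
    tokens: int
    owner: str             # the sub-agent this item was prepared for
    relevant: bool         # judged relevant to the current task
    source: Optional[str]  # where the item came from, if known


def audit(items: list[ContextItem], agent: str,
          required_topics: set[str], token_budget: int) -> dict[str, bool]:
    """Return a pass/fail flag for each of the five criteria."""
    return {
        "relevance": all(item.relevant for item in items),
        "sufficiency": required_topics <= {item.topic for item in items},
        "isolation": all(item.owner == agent for item in items),
        "economy": sum(item.tokens for item in items) <= token_budget,
        "provenance": all(item.source is not None for item in items),
    }


items = [
    ContextItem("error handling", 400, "backend", True, "docs/errors.md"),
    ContextItem("logging", 300, "backend", True, None),  # no recorded source
]
report = audit(items, agent="backend",
               required_topics={"error handling", "logging"}, token_budget=1000)
print(report)
```

In this example every criterion passes except provenance, because the logging item has no recorded source. A real audit would need graded scores and a human or model judgment behind the `relevant` flag, but even a boolean checklist makes the gaps visible.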

“Whoever controls the agent’s context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.”

-Vishnyakova, 2026

The gap between this vision and current practice is wide. An exploratory survey of 74 software professionals across six countries found that prompt usage in software engineering remains “largely ad hoc,” with prompts refined through trial-and-error, rarely reused, and shaped more by individual heuristics than standardized practices. Most organizations are still at level one of the maturity pyramid. The knowledge to do better exists, but the institutional habits have not caught up.

A related line of work pushes further by arguing that prompts should be treated not as informal text but as first-class software artifacts, subject to the same lifecycle of requirements engineering, design, testing, and versioning as traditional code. That paper describes the present state as a “promptware crisis,” an echo of the original “software crisis” of the 1960s that gave rise to software engineering as a discipline. The parallel is illuminating. Early software development was also trial-and-error, driven by individual skill rather than systematic method. It took decades of accumulated failures, ballooning complexity, and hard-won professional norms to establish the field. Context engineering may be at a similar inflection point, the moment before a craft becomes a discipline.

The Artifacts Practitioners Actually Build

While the academic literature establishes frameworks and taxonomies, a parallel development is happening in practice. Developers working with agentic coding tools like Claude Code, Codex, and Cursor have begun creating a new category of software artifact, configuration files that serve as persistent, structured instructions for AI agents. Files named CLAUDE.md, AGENTS.md, and .cursorrules are essentially “READMEs for AI,” machine-readable documents that encode project-specific knowledge an agent needs to operate effectively within a particular codebase.

Several empirical studies have examined what developers actually put in these files. An analysis of 328 CLAUDE.md files from popular GitHub projects found that 72.6% specify application architecture, making it the most common concern, followed by testing instructions, development guidelines, and project overviews. A separate study of 253 Claude Code manifests confirmed consistent structural patterns, typically one main heading with several subsections, dominated by operational commands, technical implementation notes, and high-level architectural descriptions. The shallow structure is not a sign of immaturity. It appears to reflect what agents actually need, a flat, scannable set of instructions rather than deeply nested documentation.

Scaling Across Tools

Just as a growing organization eventually needs written policies that work across departments rather than relying on informal tribal knowledge, the expanding ecosystem of AI coding tools needs configuration standards that work across platforms. The broadest study to date examined 2,923 GitHub repositories and identified eight distinct configuration mechanisms spanning a spectrum from static context files to executable integrations. Context Files, simple Markdown documents like CLAUDE.md and AGENTS.md, dominate the landscape. More advanced mechanisms such as Skills (structured packages with executable resources) and Subagents remain only shallowly adopted, with most repositories defining just one or two configuration artifacts. AGENTS.md has emerged as a de facto interoperable standard, recognized across multiple tools. The picture is of an ecosystem in its early days, where the simplest approach, a well-written Markdown file, is doing the heavy lifting.

These files are not just documentation. A controlled study of 10 repositories and 124 pull requests found that the presence of an AGENTS.md file was associated with a 29% reduction in median agent runtime and a 17% reduction in output token consumption, while maintaining comparable task completion behavior. The researchers hypothesize that agents spend less time on exploratory navigation when they have explicit project context, needing fewer planning iterations and fewer repeated calls to the model. In practical terms, a well-crafted context file may cut the median duration of an agent session by more than a quarter and its output token spend by about a sixth.

Yet adoption remains strikingly low. A study of open-source software projects found that only about 5% of surveyed repositories have adopted any context file format. This is a field where the early adopters are seeing real gains, but the vast majority of projects have not yet begun to invest in structured agent context. The parallel to early version control adoption, or early unit testing adoption, is hard to miss. A practice that starts as optional among a skilled minority tends to become standard once enough teams experience the cost of not doing it.

What Goes In, and Why It Matters

The content of these files reveals something important about what developers have learned through experience with agents. Architecture specifications dominate because agents without architectural context tend to generate code that works in isolation but violates the system’s structural assumptions. A microservices project with strict domain boundaries, for example, will see an unconstrained agent casually import across those boundaries, creating coupling that takes hours to untangle. An agent working without knowledge of a project’s event-driven architecture might implement a synchronous function call where an asynchronous message was expected, producing code that compiles but behaves incorrectly under load. The agent has no way to infer architectural intent from the code alone. Architectural decisions are often conventions enforced by humans rather than patterns enforced by compilers.

Testing instructions appear frequently, and a recent empirical study reveals exactly why. An analysis of over 1.2 million commits across 2,168 repositories found that coding agents are significantly more likely to add mock objects to tests than human developers. Specifically, 36% of agent commits that modify test files introduce mocks, compared with 26% for human-authored commits. The study also found that 23% of commits made by coding agents add or change test files, compared with only 13% by non-agents, and that 68% of repositories with agent test activity also contain agent mock activity. Repositories created more recently showed even higher proportions of agent-generated test and mock commits, suggesting the trend is accelerating as agent adoption grows. Mock objects, which substitute simplified stand-ins for real system components during testing, are easier for agents to generate automatically but less effective at validating how components actually interact. Tests that mock everything pass reliably but verify very little about the real system’s behavior. The researchers explicitly recommend including guidance on mocking practices in agent configuration files.

Developers have independently arrived at the same conclusion. Anti-mock instructions appear in CLAUDE.md files across many projects, a concrete example of the feedback loop between agent output and human judgment. The chain of reasoning behind such an instruction is worth unpacking. Someone had to encounter the problematic tests, recognize the pattern of excessive mocking, diagnose that the agent was reaching for mocks as the path of least resistance, and then encode a corrective instruction that prevents recurrence. That entire chain, from recognizing a quality problem to articulating a rule that addresses its root cause, is precisely the kind of reasoning that context engineering formalizes.

Project overviews also appear frequently, and their function is subtler than it first appears. An agent that knows it is working on a distributed event-processing system written in Rust makes different choices than one operating under the assumption that it is working on a standard web application. The overview is not there for the agent’s curiosity. It establishes the interpretive frame within which every subsequent instruction and code change should be understood. Without that frame, the agent optimizes locally, generating code that satisfies the immediate request. With it, the agent’s local decisions become more likely to cohere with the system’s global design intent. Software projects accumulate unstated assumptions over time, assumptions about performance targets, deployment environments, backward compatibility requirements, and acceptable trade-offs between code clarity and runtime efficiency. A human developer absorbs these assumptions gradually through code review, team conversations, and debugging sessions. An agent has none of that ambient context. The project overview and its associated configuration files are the only mechanism for transmitting what would otherwise require months of socialization.

The First Signs of Compression

The configuration files described above are brand new, barely a year old as a widespread practice. They represent a distinctly human contribution, the product of engineering judgment, project-specific knowledge, and hard-won experience. And yet, there are already early signs that the same systems these files were designed to guide are learning to generate and refine similar artifacts autonomously.

The ACE (Agentic Context Engineering) framework treats context not as a static human-authored artifact but as an “evolving playbook”. Through a modular cycle of generation, reflection, and curation, ACE accumulates, refines, and organizes strategies without any labeled training data, relying instead on natural execution feedback. In practice, the generation phase creates new strategy elements from recent task experiences. The reflection phase evaluates which strategies contributed to successes or failures. And the curation phase integrates promising strategies into the evolving playbook while pruning elements that have proven unhelpful. What distinguishes ACE from simple prompt optimization is the cumulative, structured nature of the updates. Rather than rewriting the entire context on each iteration, the framework makes targeted additions and modifications, preserving the accumulated knowledge that prior iterations have validated.
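
The generate, reflect, curate cycle can be illustrated with a minimal Python sketch. This is not the ACE implementation: the `Playbook` and `Strategy` names, the helpful/harmful counters, and the pruning threshold are all assumptions made for illustration. What it tries to show is the incremental character of the updates, where each step touches individual entries rather than rewriting the whole context.

```python
from dataclasses import dataclass, field


@dataclass
class Strategy:
    """One playbook entry plus hypothetical usefulness counters."""
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    strategies: dict[str, Strategy] = field(default_factory=dict)

    def generate(self, key: str, text: str) -> None:
        # Add a new entry; existing entries are never overwritten.
        self.strategies.setdefault(key, Strategy(text))

    def reflect(self, used_keys: list[str], success: bool) -> None:
        # Credit or blame only the entries that were used in this task.
        for key in used_keys:
            entry = self.strategies.get(key)
            if entry is None:
                continue
            if success:
                entry.helpful += 1
            else:
                entry.harmful += 1

    def curate(self, min_net_score: int = -1) -> None:
        # Prune entries whose net score fell below the (assumed) threshold.
        self.strategies = {
            key: entry
            for key, entry in self.strategies.items()
            if entry.helpful - entry.harmful >= min_net_score
        }
```

Because `curate` removes entries one at a time, a strategy that has been validated by earlier runs survives later iterations unless execution feedback turns against it.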

ACE demonstrated a 10.6% improvement over strong baselines on agent benchmarks and 8.6% on domain-specific financial reasoning tasks. On the AppWorld leaderboard, ACE matched the top-ranked production-level agent on the overall average and surpassed it on the harder test-challenge split, despite using a smaller open-source model.

The ACE researchers identified two failure modes that plague simpler, static approaches. Brevity bias is the tendency for iterative optimization to collapse rich context into short, generic summaries that strip away domain-specific heuristics. Context collapse occurs when iterative rewriting gradually erodes important details over time. ACE addresses both with structured, incremental updates guided by a “grow-and-refine” principle that preserves detailed knowledge rather than compressing it. The framework argues, counterintuitively, that large language models are actually more effective with long, detailed contexts than with tight summaries. Unlike humans, LLMs can autonomously distill relevance from comprehensive inputs, so stripping context down may sacrifice the edge-case knowledge that separates correct output from output that merely compiles.

This is proto-self-context-engineering. The artifacts that feel novel and distinctly human today, the carefully authored CLAUDE.md files and AGENTS.md specifications that encode project architecture and testing conventions, are already beginning to be optimized by the very systems they were written to guide.

The Automation Ladder

There is a pattern worth noticing, and it recurs so reliably across the history of software that it probably qualifies as structural rather than coincidental. Every major abstraction layer eventually got formalized, stabilized, and then partially or fully automated.

In the 1950s, programmers encoded instructions in raw machine language, addressing memory registers by number. Compilers eliminated that work. In the decades that followed, programmers managed memory by hand, tracking every allocation and deallocation. Garbage collectors eliminated that work. By the 1990s, developers wrote boilerplate business logic from scratch for every project, implementing authentication, database access, and request routing by hand. Frameworks and libraries eliminated most of that work. Entire product categories, e-commerce, content management, analytics, became platforms. And in the last three years, code generation itself has undergone a dramatic shift. What began as autocomplete suggestions in IDEs evolved into autonomous agents capable of creating features, writing tests, and issuing pull requests with minimal human direction.

Context engineering sits at the latest step on this ladder. It feels like the domain of uniquely human judgment, and for now, in most practical settings, it is. Designing the right information environment for an AI agent requires understanding the project, its architecture, its failure modes, and its quality standards in ways that demand genuine expertise. The decision to include anti-mock instructions in a CLAUDE.md file, for instance, reflects not just a knowledge of testing patterns but a judgment about what “good” means for that particular codebase. That judgment currently lives in human heads.

But the ACE framework demonstrates that at least the refinement of context, the iterative improvement of playbooks based on execution feedback, can be automated today. The generate-reflect-curate loop does not need labeled data. It does not need a human reviewing each iteration. It learns from the natural consequences of its own decisions, and it demonstrably outperforms static, human-authored baselines on agent benchmarks.

One question the remaining articles in this series will explore is where the ladder leads. If agents can learn to refine their own context, and the orchestration patterns that coordinate multi-agent work are themselves being learned by self-improving systems, what remains durably human? Professional developers are already shifting from writing code to designing context. If context design itself begins to compress, as the evidence tentatively suggests, the next shift may not be upward to a higher rung on the same ladder. It may be toward a different kind of work entirely.

The answer, as the evidence from practitioner studies, scaled infrastructure projects, and self-improving agent systems will suggest across this series, has less to do with any particular abstraction layer and more to do with the nature of the work itself. Humans persist wherever meaning is still underdefined. That edge moves, and it moves fast. But it does not disappear, because the world keeps generating new ambiguity faster than systems can resolve it.

Originally published at https://thinkata.com.

Closing the Loop: How Human Corrections Can Make AI Systems Smarter Over Time

Mark Williams — Wed, 01 Apr 2026 15:51:01 GMT

Every day, thousands of domain experts in law firms, hospitals, and financial institutions review the outputs of AI systems and quietly fix the mistakes. A legal automation tool misclassifies a contract clause. A clinical decision support system recommends the wrong risk category. A customer service bot generates an irrelevant response. In each case, a human steps in, corrects the output, and moves on. But what happens to those corrections? In most production systems today, the answer is surprisingly little. The same mistakes keep recurring, reviewers grow frustrated, and the promised value of automation slowly erodes. Even at companies with sophisticated ML infrastructure, model update cycles often stretch to months before corrections feed back into training.

The fundamental challenge is architectural. Converting scattered human corrections into durable improvements requires a carefully designed feedback pipeline. That pipeline must respect privacy constraints, handle noisy annotations, and adapt at the right speed for each use case. Recent advances in reinforcement learning, adaptive routing, and noise-robust supervision are making this feedback loop increasingly practical.

The Core Problem: Two Timescales of Improvement

Like a Pilot’s Instrument Panel

A pilot monitors altitude and heading in real time, making constant small corrections. But deeper analysis happens only after landing, from mechanical inspections to route adjustments. An effective correction system works the same way. A fast loop provides immediate, lightweight adjustments without changing the model’s core parameters. A slow loop periodically retrains the model using accumulated, quality-filtered correction data. Mixing these two timescales creates a system that is either too slow to fix obvious errors or too unstable for high-stakes deployment.

Production correction systems also face constraints that academic benchmarks rarely address. Privacy regulations in healthcare and finance may prohibit storing full model outputs, limiting the system to structured metadata about each correction. Annotation quality varies across reviewers, meaning a single careless override can push the model in the wrong direction. In platforms that serve multiple client organizations, different clients may need distinct model behaviors, making a single shared update inappropriate.

Learning from Preferences: RLHF and DPO

Reinforcement Learning from Human Feedback (RLHF) is one of the most influential approaches to aligning model behavior with human intent. The technique works in two stages. First, it trains a reward model from human preference data, meaning pairs of outputs where a human has indicated which is better. Then it uses reinforcement learning to fine-tune the target model so it produces outputs the reward model scores highly. A landmark demonstration showed that a relatively small RLHF-aligned model could be preferred by human raters over a much larger unaligned model. Alignment through feedback can be more efficient than simply making models bigger.

A notable trend in 2024–2025 is the growing adoption of online iterative RLHF, where feedback is collected continuously from the current model rather than from a pre-collected dataset. This matters because reward models trained on outputs from a previous version of the model often struggle with outputs from the current version. The data goes stale. Online iterative approaches solve this by keeping feedback current, ensuring the training data matches what the model is actually producing now.

A cost-effective variant called RLTHF (Targeted Human Feedback) achieves comparable alignment in benchmark evaluations using only about 6–7% of the typical human annotation effort. It does this by focusing corrections on the hardest samples, the ones the reward model itself flags as uncertain. Whether these efficiency gains hold in production, where error distributions and reviewer behavior differ from controlled benchmarks, remains an open question. But the direction is promising for settings where human review time is the scarcest resource.
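
The idea of spending a small review budget on the most uncertain samples can be sketched in a few lines of Python. This is a simplified stand-in, not the selection procedure from the RLTHF work: it assumes uncertainty can be approximated by how close the reward model's score margin between two candidate outputs is to zero, and the function name, the `margins` mapping, and the 6% default are illustrative.

```python
def select_for_human_review(
    margins: dict[str, float], budget_fraction: float = 0.06
) -> list[str]:
    """Return the sample ids whose reward margin is closest to zero.

    `margins` maps a sample id to (reward of output A - reward of output B).
    A margin near zero means the reward model barely prefers either output,
    which this sketch treats as a proxy for uncertainty.
    """
    budget = max(1, round(len(margins) * budget_fraction))
    ranked = sorted(margins, key=lambda sample_id: abs(margins[sample_id]))
    return ranked[:budget]
```

Samples outside the returned list would keep their automatically assigned preference labels, while the selected ones go to a human reviewer.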

Direct Preference Optimization (DPO) takes a different path by eliminating the separate reward model entirely. Instead of the two-stage RLHF process, DPO converts preference pairs directly into a training signal for the model. It does so by expressing the reward implicitly, in terms of the policy being trained and a frozen reference model, so the same alignment objective reduces to a single classification-style loss. Because DPO skips the reward-model stage, it is substantially more stable and computationally lighter than traditional RLHF, making it practical for teams that batch corrections on a weekly schedule. A comprehensive 2025 survey organizes the growing DPO research into four dimensions covering data strategy, learning framework, constraint mechanisms, and model properties. One important finding is that including ambiguous or difficult preference pairs in training data can actually harm alignment, underscoring the importance of careful data curation.
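
For a single preference pair, the DPO objective can be written out directly. The sketch below assumes the summed token log-probabilities of the chosen and rejected responses are already available, both from the policy being trained and from a frozen reference model. The function name and the default `beta` are illustrative choices, and a real training loop would compute this over batches inside an autodiff framework.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy is identical to the reference model the loss equals log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one.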

Smart Routing: Contextual Bandits for Model Selection

Choosing Where to Eat, at Machine Speed

Imagine walking down a street lined with restaurants. Should the diner return to a familiar spot or try someplace new? This is the exploration-exploitation dilemma, and it is exactly the trade-off that contextual bandits solve for AI systems. These algorithms provide a principled way to route each incoming query to the best-suited model or configuration. The key insight is that in deployment, only the outcome of the chosen model is observed. The system never learns what would have happened with a different choice, a constraint that most simpler routing approaches ignore.

The BaRP (Bandit-feedback Routing with Preferences) framework, introduced in 2025, treats routing as a balancing act between performance and cost. Operators can adjust that trade-off on the fly without retraining, simply by specifying how much they value accuracy versus cost savings. In preprint results not yet peer-reviewed, experiments across diverse benchmarks show BaRP outperforming strong alternatives by at least 12% while simultaneously reducing costs, and generalizing well to tasks never seen during training.

In a production correction loop, each time a human corrects a model output, that correction updates the router’s estimate of how well that model handles similar queries. Over time, the router learns to steer traffic away from models that consistently underperform on certain query types. Each client organization can maintain its own routing preferences, while new clients benefit from patterns already learned across the broader user base.
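
A minimal epsilon-greedy router makes the correction-driven update concrete. It is a deliberately simple stand-in for illustration, not BaRP or any published routing algorithm: the class name, the per-query-type counters, and the 1-of-2 starting prior are all assumptions. Note that `record` only ever receives feedback for the model that was actually chosen, which is the bandit constraint described earlier.

```python
import random
from collections import defaultdict


class CorrectionAwareRouter:
    """Epsilon-greedy routing over models, keyed by query type."""

    def __init__(self, models, epsilon=0.1, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # [accepted, total] per (query_type, model); 1 of 2 is a weak prior.
        self.stats = defaultdict(lambda: [1, 2])

    def estimate(self, query_type, model):
        accepted, total = self.stats[(query_type, model)]
        return accepted / total

    def choose(self, query_type):
        # Explore occasionally so weaker-looking models can recover.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.estimate(query_type, m))

    def record(self, query_type, model, was_corrected):
        # A human correction counts against the model that produced the output.
        counts = self.stats[(query_type, model)]
        if not was_corrected:
            counts[0] += 1
        counts[1] += 1
```

Keeping one router instance per client organization would give each tenant its own estimates. Sharing the starting counts across tenants is one possible way to let new clients inherit earlier patterns, though that design choice is not specified in the source.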

When Corrections Themselves Are Wrong

Human corrections are imperfect. Reviewers vary in expertise, attention, and consistency. A correction loop that treats every override as ground truth will inevitably amplify errors. Programmatic Weak Supervision (PWS) addresses this by treating each labeling source, including each human reviewer, as an imperfect signal whose reliability can be measured and weighted accordingly.

Recent work has advanced this idea significantly. A 2025 methodology attaches confidence scores to the labels produced by weak supervision systems, enabling the learning pipeline to quantify uncertainty and reduce the influence of unreliable labels. This connects to a broader principle in production ML. Label noise should be treated as a first-class design concern, with explicit mechanisms for detection and mitigation, rather than as a data-cleaning afterthought.
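
Reliability weighting and a confidence score can be illustrated with a small Python sketch. It assumes each reviewer's accuracy is already known and that reviewers err independently, which real weak supervision systems do not take for granted since they typically have to estimate reliabilities from agreement patterns. The function name and the example reviewers are hypothetical.

```python
import math
from collections import defaultdict


def weighted_label(
    votes: dict[str, str], accuracy: dict[str, float]
) -> tuple[str, float]:
    """Combine reviewer votes into one label plus a rough confidence.

    Each vote is weighted by the log-odds of that reviewer being right,
    so one highly reliable reviewer can outweigh several unreliable ones.
    """
    scores: dict[str, float] = defaultdict(float)
    for reviewer, label in votes.items():
        # Unknown reviewers get 0.5 (no information); clamp to avoid log(0).
        acc = min(max(accuracy.get(reviewer, 0.5), 0.01), 0.99)
        scores[label] += math.log(acc / (1.0 - acc))
    best = max(scores, key=scores.get)
    # Softmax over the label scores serves as the confidence estimate.
    total = sum(math.exp(score) for score in scores.values())
    return best, math.exp(scores[best]) / total
```

A downstream filter could then drop or down-weight labels whose confidence falls below a chosen threshold instead of treating every override as ground truth.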

Putting It Together: A Correction-to-Improvement Pipeline

One way to organize these techniques into a practical architecture is a three-stage correction pipeline, all under a shared governance layer. The specific design draws on patterns from the literature cited above, though the overall structure is an editorial synthesis rather than any single paper’s proposal.

Ingestion and signal processing. Every corrected output event produces structured metadata (error type, model version, tenant ID, confidence score) written to a permanent event log. Raw corrections then pass through several quality filters, including noise reduction, confidence scoring, and prioritization of the most informative examples, before reaching any model.
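
The ingestion stage can be sketched as a metadata-only event record with an append-only log. The four fields mirror the metadata listed above; everything else here, including the class names, the JSON-lines encoding, and the 0.7 threshold, is an assumption. The record deliberately holds no raw model output, in line with the privacy constraint discussed earlier, and the prioritization step is omitted for brevity.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectionEvent:
    """Structured metadata about one human correction (no raw text)."""
    error_type: str
    model_version: str
    tenant_id: str
    confidence: float


class EventLog:
    """Append-only log; entries are serialized once and never mutated."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, event: CorrectionEvent) -> None:
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def events(self) -> list[CorrectionEvent]:
        return [CorrectionEvent(**json.loads(line)) for line in self._lines]


def training_candidates(events, min_confidence=0.7):
    # Quality filter: low-confidence corrections stay in the log for audit
    # purposes but are not passed on to any model.
    return [event for event in events if event.confidence >= min_confidence]
```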

Fast loop (real-time). Between retraining cycles, the fast loop improves behavior without changing the model itself. It injects prompt hints based on common confusion patterns, adds validated corrections to a reference knowledge base the model can consult at query time, updates the routing system’s performance estimates, and monitors correction rates in real time.

Slow loop (periodic). On a weekly or event-triggered schedule, accumulated preference pairs feed fine-tuning through either DPO or online RLHF workflows. Updated models must pass a quality check before deployment, verifying that accuracy has not dropped and that correction rates on held-out test samples stay at or below the baseline. Validated updates then roll out gradually, initially serving only 5–10% of traffic before expanding.
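
The release check and the gradual rollout can be sketched as two small functions. The metric names, the dictionary shape, and the hash-based bucketing are assumptions made for illustration. The gate reads "below baseline" as "at or below", and the canary fraction defaults to the low end of the 5–10% range mentioned above.

```python
import hashlib


def passes_release_gate(candidate: dict, baseline: dict) -> bool:
    """Accuracy must not drop and the held-out correction rate must not rise."""
    return (
        candidate["accuracy"] >= baseline["accuracy"]
        and candidate["correction_rate"] <= baseline["correction_rate"]
    )


def in_canary(request_id: str, canary_fraction: float = 0.05) -> bool:
    """Deterministically assign a request to the canary slice.

    Hashing the request id keeps the assignment stable across retries, and
    raising `canary_fraction` only ever adds requests to the slice.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < canary_fraction
```

A rollout controller would call `passes_release_gate` once per candidate model, then widen `canary_fraction` in steps while watching live correction rates.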

A governance layer spans all three stages, enforcing a permanent audit log and filtering of personally identifiable information at ingestion. It also provides independent rollback capabilities for the model and routing system, along with access controls that prevent one client’s correction data from leaking to another.

Open Questions and Limits

Not every correction loop should be closed. When correction volume is too low to be statistically meaningful, feeding sparse overrides into training risks overfitting to noise rather than learning genuine patterns. When the operating environment shifts, older corrections may no longer reflect current conditions. And when the task is inherently subjective, with reasonable experts regularly disagreeing on the right answer, consensus-based retraining can suppress legitimate diversity of judgment. Recognizing when not to retrain is as important as building the pipeline to do so.

Among the challenges that do apply, reward hacking is perhaps the most concerning. Models optimized repeatedly against imperfect reward signals can learn to game the system, producing outputs that score well on the reward model but miss the mark on true human intent. This can be subtle. A customer service model might learn to generate responses that match evaluator style preferences without actually resolving the underlying issue. Detecting this kind of drift requires monitoring not just the reward signal but also downstream task outcomes, an additional layer of instrumentation that many teams underinvest in.

Annotation cost remains a major bottleneck. Even with active learning and targeted feedback strategies like RLTHF, correction loops demand sustained human effort. One promising approach, demonstrated in production at Airbnb, embeds annotation directly into operational workflows rather than treating it as a separate labeling task, compressing model update cycles from months to weeks. AI-generated feedback offers another path toward partial automation at lower cost per data point, but it introduces its own risks and should complement rather than replace human review in high-stakes domains.

The central thesis emerging from recent research is clear. A robust correction loop requires the separation of timescales. Fast-loop mechanisms like prompt hints, retrieval augmentation, and bandit routing deliver immediate responsiveness. Slow-loop mechanisms deliver principled fine-tuning on accumulated, quality-filtered preference data, whether through DPO’s single-step approach or iterative online RLHF pipelines. The convergence of targeted feedback strategies, smart routing, and confidence-aware weak supervision means that a production-grade human-correction loop is now within reach, for the right kinds of tasks and with clear-eyed awareness of its limits. The organizations that invest in closing this loop will find their AI systems not just tolerating human oversight but actively benefiting from it, getting measurably better with every correction.

Originally published at https://thinkata.com.

Offline RL and the Data Flywheel

Mark Williams — Wed, 25 Mar 2026 17:06:00 GMT

How production systems learn from logged data, and why dataset quality is the most underinvested layer of the RL stack

Photo by Aamy Dugiere on Unsplash

Every reinforcement learning (RL) system needs data. In textbook settings, the agent, the decision-making program being trained, generates its own data by exploring an environment, trying actions, and updating its behavior based on the results. In production settings, this assumption is often untenable. Exploration is expensive. In healthcare, an agent cannot try random treatment plans to observe what happens. In autonomous driving, a bad exploratory action is measured in human safety. In recommendation systems, even brief periods of degraded performance carry real revenue consequences.

Offline reinforcement learning offers a different premise. Instead of learning through active interaction, the agent learns entirely from a static dataset of previously collected experiences. The logged actions of prior policies, human operators, or existing systems become the training signal. This paradigm shift, from learning by doing to learning from records, changes the engineering surface of RL dramatically. The algorithm is no longer the bottleneck. The data is.

The Core Problem of Learning from Logs

The central technical challenge in offline RL is distributional shift, the mismatch that arises when a model trained on one distribution of data is applied in conditions that look different from training. Think of a navigator who has studied detailed charts of the Pacific but is dropped in the Arctic. The tools are the same, but the territory has changed.

In offline RL, this mismatch is structural. When an RL algorithm updates its value estimates, meaning its predictions of how rewarding a given action will be, it needs to evaluate the consequences of actions the current policy would take. In online RL, the policy generates its own experience. In offline RL, the agent can only observe the consequences of actions that were actually taken by whatever behavior policy, the prior system that collected the data, was running at the time. Actions the new policy would prefer may never appear in the dataset at all.

This gap creates a destructive failure mode. Standard off-policy methods like deep Q-learning estimate the value of unseen state-action pairs by extrapolating from observed data. When these estimates are wrong, and they frequently are for actions far from the data distribution, the learning algorithm can latch onto erroneously high value estimates and produce policies that confidently take actions with no empirical support. Levine et al. describe this as the fundamental challenge that makes offline RL qualitatively harder than its online counterpart, noting that standard off-policy methods routinely fail in the offline setting due to unchecked value overestimation.

Three Approaches to Taming Distributional Shift

Conservative Value Estimation

The first strategy accepts that value estimates for unseen actions will be unreliable and works to make them deliberately pessimistic. Conservative Q-Learning (CQL) augments the standard Q-learning objective with a regularization term, a mathematical penalty that pushes down estimated values for actions not well-represented in the dataset while pushing up values for actions that are. The result is a Q-function that provably lower-bounds the true value of the learned policy, ensuring the agent does not chase phantom value in unexplored regions of the action space. The trade-off is that excessive conservatism can leave value on the table, as an overly cautious agent may decline actions that would have been beneficial simply because they were underrepresented in training data.
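
For a small discrete action set, the conservative penalty can be written out by hand. The sketch below follows the commonly used form of the CQL regularizer, a log-sum-exp over all actions minus the Q-value of the action that actually appears in the dataset, and evaluates it for a single transition. The function names and the default `alpha` are illustrative, and a real implementation would operate on batched network outputs.

```python
import math


def cql_penalty(q_values: list[float], dataset_action: int) -> float:
    """Log-sum-exp over all actions minus Q of the logged action.

    The first term pushes every action's value down, with the most weight
    on the largest values. The second term pushes the logged action back
    up, so the penalty is large only when unseen actions look better than
    the data suggests.
    """
    peak = max(q_values)
    log_sum_exp = peak + math.log(sum(math.exp(q - peak) for q in q_values))
    return log_sum_exp - q_values[dataset_action]


def cql_loss(q_values, dataset_action, td_target, alpha=1.0):
    # Standard squared TD error plus the weighted conservative penalty.
    td_error = q_values[dataset_action] - td_target
    return 0.5 * td_error**2 + alpha * cql_penalty(q_values, dataset_action)
```

The `alpha` weight is where the trade-off described above shows up: a larger value makes the agent more pessimistic about actions the dataset does not support.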

In-Sample Learning

The second strategy avoids the problem of evaluating unseen actions entirely. Implicit Q-Learning (IQL) never queries the value of actions outside the dataset. Instead of computing the maximum Q-value over all possible actions, IQL approximates this maximum implicitly by fitting an upper expectile, a statistical summary that focuses on the better-performing tail, of the value distribution using only actions present in the data. IQL is particularly effective on tasks that require “trajectory stitching,” where no single sequence of actions in the dataset solves the complete task, but the optimal path can be assembled from fragments of different suboptimal trajectories. For production systems that must learn from heterogeneous data collected by multiple prior policies of varying quality, this stitching capability is essential.
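
The expectile idea is easiest to see on a handful of numbers. The sketch below implements the asymmetric squared loss used for expectile regression and fits a single scalar to a list of Q-value samples by plain gradient descent. It illustrates the statistic only and is not an IQL training loop: in IQL the fitted quantity is a learned state-value function, and the function names, step count, and learning rate here are assumptions.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Asymmetric squared loss on diff = q - v.

    With tau > 0.5, samples above the estimate (diff > 0) are weighted
    more heavily, pulling the estimate toward the better-performing tail.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_samples, tau=0.7, steps=200, lr=0.1):
    """Fit a scalar v that minimizes the mean expectile loss over q_samples."""
    v = sum(q_samples) / len(q_samples)
    for _ in range(steps):
        grad = 0.0
        for q in q_samples:
            weight = tau if q > v else 1.0 - tau
            grad += -2.0 * weight * (q - v)
        v -= lr * grad / len(q_samples)
    return v
```

With `tau=0.5` the result is the ordinary mean. As `tau` approaches 1 the fitted value moves toward the largest sample, which is how the method approximates a maximum using only actions present in the data.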

Sequence Modeling

The third strategy reframes the RL problem entirely. The Decision Transformer treats offline RL as a sequence modeling problem rather than a dynamic programming problem. Dynamic programming, the traditional approach, works backward from rewards to infer action values. Sequence modeling instead treats the problem like language translation, learning to predict what action comes next given a history of states, prior actions, and a target level of performance. At inference time, the desired performance level is specified as a conditioning variable, and the model generates actions aimed at achieving it. This reframing imports the scaling properties of transformer architectures, the same class of model that powers large language models, directly into the decision-making domain. For organizations already operating transformer training infrastructure, the marginal cost of deploying a Decision Transformer is substantially lower than building a separate RL training stack.
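
The conditioning variable is usually a return-to-go, the sum of rewards from the current step to the end of the trajectory. The sketch below shows how a logged trajectory could be turned into the interleaved sequence such a model is trained on. The tuple-based token format and function names are illustrative, and the transformer itself is out of scope.

```python
def returns_to_go(rewards: list[float]) -> list[float]:
    """For each timestep, the total reward from that step to the end."""
    result = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    return result[::-1]


def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) for every timestep."""
    sequence = []
    for rtg, state, action in zip(returns_to_go(rewards), states, actions):
        sequence.append(("rtg", rtg))
        sequence.append(("state", state))
        sequence.append(("action", action))
    return sequence
```

At inference time the first return-to-go is set to the desired target rather than computed from data, then reduced by each observed reward as the episode unfolds.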

Dataset Quality as a First-Class Concern

The Bottleneck Is the Data, Not the Algorithm

Just as a skilled chef cannot cook a great meal from poor ingredients, even the most sophisticated offline RL algorithm cannot compensate for a poorly characterized dataset. Research on the relationship between dataset characteristics and algorithm performance has established that popular offline RL methods are profoundly sensitive to the composition of the data they train on. Two properties matter most. The first is trajectory quality, measured by the average return, or cumulative reward, of the trajectories in the dataset. The second is state-action coverage, measured by the proportion of the state-action space represented in the data. Selecting an offline RL algorithm without first understanding the dataset is an unreliable engineering practice. Dataset characterization must precede algorithm selection, and it must be treated as a recurring operational task rather than a one-time analysis.

As the system’s behavior policy changes, as user populations shift, and as the product evolves, the statistical properties of the logged data will change with them. An algorithm that performed well on last quarter’s data may underperform on this quarter’s if the composition of the underlying dataset has drifted. The feature store, the embedding pipeline, the data validation layer, and the logging infrastructure are not ancillary support systems for the RL component. They are the RL component’s most consequential dependency.
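Both properties named above can be computed directly from logged trajectories when states and actions are discrete. The sketch below assumes a hypothetical `(state, action, reward)` tuple format and known state and action counts. Continuous spaces would need binning or a learned density estimate instead.

```python
from typing import Dict, List, Tuple

Step = Tuple[int, int, float]  # (state, action, reward), an assumed log format


def characterize(trajectories: List[List[Step]],
                 n_states: int, n_actions: int) -> Dict[str, float]:
    """Report trajectory quality and state-action coverage for a dataset."""
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    visited = {(s, a) for traj in trajectories for (s, a, _) in traj}
    return {
        # Trajectory quality: average cumulative reward per trajectory.
        "mean_return": sum(returns) / len(returns),
        # Coverage: fraction of all state-action pairs seen at least once.
        "coverage": len(visited) / (n_states * n_actions),
    }
```

Running a report like this on every training snapshot, and comparing it against the previous one, is one way to turn dataset characterization into the recurring operational task described above.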

The Data Flywheel

The most powerful production pattern that emerges from offline RL is the data flywheel. The cycle operates as follows. A deployed policy generates interactions with users or environments. Those interactions are logged with full state, action, and outcome information. The logged data is curated, filtered, and used to train an improved policy via offline RL. The improved policy is deployed, generating higher-quality interactions, which in turn produce a better training dataset for the next iteration.

When the Flywheel Spins Backward

What makes the RL instantiation of this cycle distinctive is that the quality of the data is a direct function of the quality of the policy that generated it. In supervised learning, the training data and the model are largely independent. In RL, they are coupled. A poor policy generates poor data, which trains another poor policy, which generates more poor data. The flywheel can spin in either direction. Breaking out of a negative flywheel requires deliberate intervention at the data layer. Mixing logged production data with expert demonstrations ensures that high-quality trajectories are always present in the training set. Importance sampling techniques can reweight the dataset to emphasize transitions from higher-performing episodes. And offline-to-online fine-tuning, where a policy learned offline is subsequently refined through limited live interaction, provides a principled bridge between the static dataset and the live environment. Each of these interventions is an infrastructure decision, not a modeling decision.
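One simple way to emphasize higher-performing episodes is to sample them with a probability that grows with episode return. The sketch below uses a softmax over returns. This is a swapped-in illustrative scheme, not a formal importance sampling estimator, and the `episode_weights` name and `temperature` parameter are assumptions.

```python
import math
from typing import List, Sequence


def episode_weights(returns: Sequence[float],
                    temperature: float = 1.0) -> List[float]:
    """Sampling probability per episode, increasing with episode return.

    A lower temperature concentrates sampling on the best episodes; a
    higher temperature moves the weights toward uniform.
    """
    top = max(returns)  # subtract the max for numerical stability
    scores = [math.exp((g - top) / temperature) for g in returns]
    total = sum(scores)
    return [s / total for s in scores]
```

The weights can be passed to any weighted sampler when drawing training batches. Because the temperature determines how aggressively low-return data is discarded, it is worth versioning alongside the dataset rather than treating it as a modeling hyperparameter.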

The data flywheel also intersects directly with reward design. In offline RL, rewards must be present in the logged data, meaning they were computed by whatever reward function was active when the data was collected. If the reward function has since been updated, the logged rewards may no longer reflect the current definition of success. The data infrastructure must track which reward function was active when each transition was logged, and the training pipeline must be capable of either filtering for compatibility or relabeling rewards under the updated function. The dataset is not neutral raw material. It encodes the objectives, the biases, and the limitations of every prior policy and reward function that contributed to its creation.
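The two options named above, filtering for compatibility and relabeling, can be sketched as follows. The `Transition` fields, the `reward_version` tag, and the assumption that the logged `outcome` is enough to recompute the reward are all hypothetical. A real pipeline would need to log whatever raw signals the updated reward function depends on.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Transition:
    state: int
    action: int
    outcome: float        # raw logged signal the reward is computed from
    reward: float         # reward as computed at logging time
    reward_version: str   # which reward function produced `reward`


def filter_compatible(data: List[Transition], version: str) -> List[Transition]:
    """Keep only transitions logged under the given reward version."""
    return [t for t in data if t.reward_version == version]


def relabel(data: List[Transition],
            reward_fn: Callable[[Transition], float],
            version: str) -> List[Transition]:
    """Recompute stale rewards under the current reward function."""
    return [
        t if t.reward_version == version
        else replace(t, reward=reward_fn(t), reward_version=version)
        for t in data
    ]
```

Filtering is safe but discards data. Relabeling preserves coverage, but it is only valid when everything the new reward function needs was captured at logging time.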

The Bottom Line

Offline RL transforms the economics of learning systems. It makes it possible to extract value from historical interaction data without the cost and risk of live exploration. But it also shifts the engineering center of gravity from model training to data management. The quality, coverage, and provenance of the training dataset become the primary determinants of system performance, and the infrastructure to manage those properties becomes the primary investment.

For organizations building AI-native systems, the data pipeline is not a prerequisite for the RL system. It is the RL system. Neglecting it in favor of algorithm selection is equivalent to optimizing the engine of a car while ignoring the fuel supply. The system must not only learn from its data; it must also learn about its data, continuously, as a condition of safe and effective operation.

Originally published at https://thinkata.com.

Reward Design as Architecture

Mark Williams — Wed, 18 Mar 2026 20:06:00 GMT

Why the reward function is the most consequential, and most overlooked, design decision in any RL system

Every reinforcement learning (RL) system contains a reward function, a mathematical signal that tells a learning agent whether a given action was a step in the right direction. In production contexts, teams spend enormous effort selecting policy algorithms, designing neural network architectures, and building the infrastructure to run them. The reward function is often treated as a brief specification step, a few lines of code decided before the real engineering begins. This is precisely the wrong instinct. Reward design is, in fact, an architectural decision, one that shapes everything the system will ever learn to do.

The parallel to software architecture is instructive. A poorly chosen database schema does not simply produce slow queries; it constrains what questions can be asked at all. A poorly designed reward function does not simply slow down training; it shapes the agent toward a version of success that may diverge from the intended goal, sometimes in ways that only become visible under production load.

The Anatomy of a Reward Function

In formal terms, a reward function maps each state-action pair in an environment to a scalar numerical value. At each step, the agent takes an action, receives a reward, and updates its internal estimates of which behaviors are worth repeating. Over many interactions, the agent learns a policy, a mapping from situations to actions, that tends to maximize the cumulative reward it expects to receive over time. The entire objective of learning is therefore defined by the reward function. Change the reward, and a completely different policy emerges from the same algorithm and the same data.

This is what makes reward design an architectural concern rather than a configuration detail. In most software systems, a misconfigured parameter can be corrected by changing a value and redeploying. A misconfigured reward function propagates its assumptions through every parameter in the model, baking in the wrong objective at the foundation of everything the system has learned.

Comprehensive analysis of reward engineering methods across real-world RL applications confirms that inadequately crafted reward functions frequently lead to reward hacking and unpredictable agent behaviors, particularly when objectives are ambiguous or when the reward fails to account for unintended exploitation paths.

The Sparse-Dense Trade-off

Two Fundamentally Different Feedback Regimes

Like a lighthouse that emits a single concentrated beam across miles of darkness, a sparse reward delivers one clear signal amid long stretches of silence. Dense rewards, by contrast, provide feedback at every step. Each regime shapes the agent’s learning in profoundly different ways, and the choice between them carries long-term architectural implications for production systems.

One of the most consequential choices in reward design is how frequently to provide feedback. A sparse reward provides a signal only at meaningful milestones, such as when a customer completes a purchase, a robot arm places an object successfully, or a recommendation results in a click. A dense reward provides a signal at every step, rewarding or penalizing incremental progress continuously.
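The difference between the two regimes can be made concrete with a toy task in which an agent moves along a line toward a goal position. Both functions below are hypothetical illustrations. The dense variant encodes one particular assumption about what counts as progress, namely a reduction in distance to the goal.

```python
def sparse_reward(position: int, goal: int) -> float:
    """Signal only at the milestone: 1.0 on reaching the goal, else 0.0."""
    return 1.0 if position == goal else 0.0


def dense_reward(prev_position: int, position: int, goal: int) -> float:
    """Signal at every step: how much closer this step brought the agent.

    The assumption that "closer is better" is a design choice that the
    sparse version does not have to make.
    """
    return float(abs(goal - prev_position) - abs(goal - position))
```

Along the path 0, 1, 2, 3 with the goal at 3, the sparse function returns 0.0, 0.0, 1.0 while the dense function returns 1.0 at every step. The dense signal is far more informative, but only for tasks where moving closer really is progress.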

Sparse rewards have an important advantage. They align naturally with the true objective. If the goal is for the agent to complete a task, the reward is given when the task is completed. There is no intermediate signal to game, no proxy metric to optimize at the expense of the genuine outcome. Research comparing sparse and dense reward paradigms across robotic control tasks found that sparse formulations not only match the intended goal more faithfully, but can in some cases produce higher-quality policies than their dense counterparts, which tend to converge on locally optimal behaviors that satisfy intermediate rewards without achieving the core objective.

The drawback of sparse rewards is that they provide little information to the learning algorithm during the vast majority of training. An agent receiving feedback only at the end of a long interaction must somehow determine which of the many preceding actions contributed to success, a challenge related directly to the credit assignment problem examined in Thinkata’s insight “When Success Has No Author.” Dense rewards address this by providing richer feedback throughout the trajectory, but they introduce a different risk. Every intermediate signal is a design choice that encodes assumptions about what constitutes progress, and those assumptions can be wrong.

Reward shaping, the practice of supplementing a sparse reward with additional signals to guide learning, is a powerful technique that must be applied with care. Research on potential-based reward shaping, a mathematically principled approach, shows that properly constructed shaping functions preserve the optimal policy of the original reward. Arbitrary shaping functions, however, can fundamentally alter what the agent learns to do, steering it toward behaviors that maximize the intermediate signals rather than the true goal. The theoretical guarantees of potential-based shaping are not preserved when engineers add heuristic signals based on intuition alone.
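In the potential-based form, the shaping term added at each step is the discounted potential of the next state minus the potential of the current state. The sketch below shows why this is safe: over any trajectory the shaping terms telescope, so their discounted total depends only on the start and end states, not on the path taken between them. The function names and the list-of-potentials input are assumptions for illustration.

```python
from typing import Sequence


def shaped_reward(reward: float, phi_s: float, phi_next: float,
                  gamma: float = 0.99) -> float:
    """Original reward plus the potential-based term gamma*phi(s') - phi(s)."""
    return reward + gamma * phi_next - phi_s


def discounted_shaping_total(potentials: Sequence[float], gamma: float) -> float:
    """Discounted sum of the shaping terms along one trajectory.

    potentials[t] is phi(s_t) for the states visited in order.
    """
    return sum(
        gamma ** t * (gamma * potentials[t + 1] - potentials[t])
        for t in range(len(potentials) - 1)
    )
```

For a trajectory of T steps the total equals `gamma**T * phi(s_T) - phi(s_0)` regardless of the intermediate potentials. A heuristic bonus that is not written as a difference of potentials does not telescope this way, which is why it can change which policy is optimal.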

Goodhart’s Law in Deployed Systems

The most consequential failure mode in production RL is reward hacking, a phenomenon closely related to Goodhart’s Law, the principle that when a measure becomes a target, it ceases to be a good measure. In RL systems, the reward function is a proxy for the true objective. When the agent optimizes this proxy aggressively, it can discover strategies that score well by the proxy metric while failing, sometimes dramatically, on the true goal.

Empirical analysis of Goodhart’s Law in Markov decision processes demonstrates that optimizing an imperfect proxy reward beyond a critical threshold reliably causes performance on the true objective to degrade, and that this effect is robust across a wide range of environments and reward functions. The research provides a geometric explanation for why this occurs and proposes early stopping strategies that can bound the degradation. From a production standpoint, this means that continued training is not always better training, and that monitoring proxy reward performance alongside real-world outcome metrics is an architectural necessity, not an optional quality check.

The RLHF (Reinforcement Learning from Human Feedback) literature has quantified this dynamic with particular precision. Studies on reward model overoptimization show that as a policy is optimized further against a proxy reward model, performance according to a “gold standard” true reward initially improves and then declines, following a pattern whose shape and magnitude scale with model size and data quantity. The practical implication is that the relationship between optimization budget and real-world performance is non-monotonic. Systems that train longer do not necessarily perform better on what actually matters.
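A simple guard against this pattern is to evaluate each checkpoint on both the proxy reward and a held-out measure of the true objective, and to stop when the two diverge. The sketch below is a plain patience-based heuristic, not the early stopping method proposed in the research cited above. The function name and the `patience` parameter are assumptions.

```python
from typing import Optional, Sequence


def overoptimization_checkpoint(proxy: Sequence[float],
                                true_metric: Sequence[float],
                                patience: int = 3) -> Optional[int]:
    """Return the checkpoint to roll back to, or None if training looks healthy.

    Divergence is flagged when the true metric has not improved for
    `patience` checkpoints while the proxy reward has kept rising.
    """
    best = max(range(len(true_metric)), key=lambda i: true_metric[i])
    stale_for = len(true_metric) - 1 - best
    if stale_for >= patience and proxy[-1] > proxy[best]:
        return best
    return None
```

The check only works if the true metric is measured independently of the reward model being optimized, for example through human evaluation or downstream business outcomes.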

In practice, this degradation takes several recognizable forms. The most direct is proxy reward gaming, where the agent finds behaviors that score well on the reward function while violating the spirit of the objective. Classic examples include agents that achieve high scores by exploiting physics simulation bugs, or recommender systems that maximize engagement by surfacing extreme content. A subtler variant is overoptimization degradation, where performance on the true objective falls even as the proxy reward continues to rise. The agent has learned to satisfy the measure rather than the goal it was designed to represent. The third pattern, distribution shift exploitation, emerges when reward functions calibrated on offline data encounter production conditions. Inputs that were rare during training become common edge cases that the reward function handles poorly, opening gaps the agent can exploit in ways that were never observed during development.

Reward Design as a Governance Decision

There is a dimension of reward design that technical framing often obscures. Every reward function encodes a value judgment. The choice to reward engagement over wellbeing, throughput over accuracy, or short-term conversion over long-term retention is not a neutral engineering decision. It is a statement about what the system is for and whose interests it serves.

This is why reward design cannot be treated as purely a machine learning concern. As explored in “When Oversight Becomes Infrastructure,” governing AI agents requires enforcement mechanisms that operate independently of the systems they govern. Reward functions are precisely the layer at which governance must engage. A governance framework that audits outputs without examining the reward structure misses the source of the behavior it is trying to control.

Reward Specification as Policy

Just as an architectural blueprint commits a building to a specific structure before the first brick is laid, the reward function commits an AI system to a specific definition of success before training begins. Misalignment at this layer cannot be fully corrected by monitoring outputs or adding guardrails downstream. The specification itself must be treated as a first-class governance artifact, subject to review, versioning, and audit alongside the models trained against it.

The parallel to contract design is useful here. A contract that is technically fulfilled but contrary to the spirit of the agreement produces outcomes the parties would not have endorsed. An RL agent that technically maximizes its reward function while producing outcomes the designers would not endorse has been given the wrong contract. The correction does not come from better enforcement of the existing reward; it comes from redesigning the reward to better reflect what is actually wanted.

In multi-objective AI-native systems, the reward function must also integrate coherently with the broader objective hierarchy. Modular, composable architectures present a particular challenge. When multiple specialized components each operate under distinct reward signals, their joint optimization can produce emergent behaviors that satisfy no individual component’s objective while appearing locally optimal to each. Designing reward functions that remain coherent under composition requires explicit attention to how component-level incentives aggregate, not just to how each component performs in isolation.

Reward Design in Practice: Architectural Principles

The treatment of reward design as architecture implies a set of practical commitments. The reward function should be versioned, like any other critical infrastructure component, so that changes can be audited and traced back to behavioral differences in deployed systems. It should be tested against adversarial scenarios that probe for exploitable patterns before deployment, extending evaluation strategies directly to the reward specification layer.
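One lightweight way to make a reward specification traceable is to derive a version identifier from its content, so that any change to the weights or terms produces a new identifier that can be stamped on logged transitions and trained models. The sketch below assumes the specification can be expressed as a JSON-serializable dictionary. The field names in the usage note are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def reward_fingerprint(spec: Dict[str, Any]) -> str:
    """Short, stable identifier derived from the reward specification.

    Keys are sorted before hashing so that two specs with the same
    content always produce the same fingerprint.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

For example, `reward_fingerprint({"click": 1.0, "return_visit": 5.0})` stays the same if the keys are reordered and changes if either weight is edited. That is the property needed to trace a behavioral difference in a deployed system back to a specific reward change.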

Reward monitoring requires dedicated observability. The proxy reward during training is not the only metric worth tracking. Production systems should maintain instrumentation on real-world outcome metrics and watch for the divergence between proxy reward and true performance that characterizes overoptimization. Reward signal behavior should be treated as a first-class observable alongside latency, accuracy, and routing decisions.

Finally, reward design should be treated as an iterative discipline rather than a one-time specification. The assumptions baked into a reward function at training time will encounter production conditions that differ from the design environment. The reward function must be expected to evolve, and the systems around it must support that evolution without requiring full retraining cycles from scratch.

What This Means for Production Systems

The most important shift that reward design as architecture demands is organizational. It relocates the reward function from the domain of model training to the domain of system design, placing it in the same category as data schema decisions, API contracts, and infrastructure topology choices, which is to say decisions whose consequences propagate far and whose correction is expensive.

This reframing has direct implications for who should be involved in reward design decisions. The engineers building the training pipeline, the product managers defining the system’s success criteria, the governance and compliance functions responsible for its behavior in deployment, and, in high-stakes applications, the domain experts who understand what the system’s outputs will mean for the people affected by them, all have legitimate standing in these decisions.

Building an RL system without explicit architectural treatment of its reward function is equivalent to building a complex software system without explicit treatment of its data model. The agent will learn something. The question is whether what it learns is what anyone actually intended, and whether the system has been designed to detect the difference before that divergence becomes a production incident.

Every downstream challenge in production RL, learning from logged data that no longer matches the live environment, coordinating multiple agents with competing incentives, adapting policies as the world shifts beneath them, eventually traces back to the reward function. Get the reward right and the system has a foundation worth building on. Get it wrong and no amount of architectural sophistication will compensate for an objective that was never what anyone actually intended.

Originally published at https://thinkata.com.

Designing for Graceful Failure in Compound AI Systems

Mark Williams — Wed, 11 Mar 2026 16:36:00 GMT

Compound AI systems, architectures in which multiple specialized AI agents collaborate to complete complex tasks, are becoming the dominant model for production AI-native deployment. One agent retrieves information, another reasons over it, a third formats and delivers results. When it works, the result is impressive. When it fails, the consequences are rarely as simple as a single error message.

The core challenge is that failure in a compound AI system does not look like a crash. An agent that hallucinates (producing confident but factually wrong output) does not throw an error. An agent that times out mid-task may leave downstream agents waiting on data that will never arrive. A reasoning component that returns subtly nonsensical output can corrupt every subsequent step in the pipeline, and the system may report success the entire time. Research now confirms what engineers who have built these systems already suspect, that failures in multi-agent systems are “frequently complex,” involving compounding effects across agent interactions rather than clearly isolated faults.

Why Compound Systems Break Differently

Before designing for resilience, it helps to understand why multi-agent failure is a distinct problem from single-model failure. In a traditional application, an error in one component typically produces a detectable signal, a crash, an exception, or a null return value. Multi-agent systems break this assumption.

Research analyzing seven popular multi-agent systems identified 14 distinct failure modes, organized into three broad categories that span specification and system design failures, inter-agent misalignment, and task verification and termination failures. What makes this taxonomy significant is not the number of failure modes but the implication that most of them are invisible by default. An agent may complete its assigned task according to its own internal logic while producing output that is semantically wrong for the context it operates in.

The scaling dynamics make this worse. Research into multi-agent scaling has formalized the intuition that “more moving parts increase fragility,” finding quantitatively that each additional tool in an agent’s chain amplifies error sensitivity. A system with five specialized agents is not five times more fragile than a single agent. The error amplification is multiplicative across coordination paths. Centralized architectures, where a coordinator validates outputs before passing them along, showed substantially higher resilience than flat peer-to-peer designs, despite the added overhead.

The Silent Failure Problem

Agent failures wear a mask of plausibility. Unlike a software crash, they present as normal-looking outputs. A hallucinating agent does not know it is wrong. A timed-out agent may return a cached or partial result indistinguishable from a valid one. Without active detection mechanisms to look behind that mask, these failures propagate downstream until they surface as user-facing errors, often far removed from their origin.

Fallback Chains and Degradation Hierarchies

The engineering response to this challenge starts with accepting that every AI component in a production system will eventually fail, and designing accordingly. This means building explicit degradation hierarchies, predetermined sequences of fallback behaviors that trigger when a primary agent cannot perform reliably.

A well-designed degradation hierarchy for a question-answering agent might proceed through three levels, starting with full AI reasoning with retrieval, falling back to a simpler retrieval-only response that surfaces raw source documents without synthesis, and finally handing off to a human operator with relevant context attached. The key insight is that each level must be independently functional, not merely a warning that the primary system failed. Research on cognitive degradation in agentic systems identifies “fallback logic rerouting,” the ability to redirect execution to predefined safe outputs when primary logic degrades, as one of seven essential runtime controls for production AI.
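The three-level hierarchy described above can be sketched as an ordered list of handlers that are tried in turn. The handler names and the convention that a handler signals failure by raising an exception or returning an empty result are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

Handler = Callable[[str], Optional[str]]


def answer_with_fallbacks(question: str,
                          levels: Sequence[Tuple[str, Handler]]) -> Tuple[str, str]:
    """Try each degradation level in order; return (level_name, answer).

    A level is treated as failed if it raises or returns an empty
    result, in which case the next level is tried.
    """
    for name, handler in levels:
        try:
            result = handler(question)
        except Exception:
            continue  # this level failed; degrade to the next one
        if result:
            return name, result
    raise RuntimeError("all degradation levels failed")
```

Returning the name of the level that answered matters as much as the answer itself, because it lets the system record how often it is operating in a degraded mode. Each handler should be callable and testable on its own, in line with the requirement that every level be independently functional.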

The analogy from distributed systems engineering is the circuit breaker pattern. A circuit breaker, in software terms, monitors a downstream component for repeated failures and, after a defined threshold, stops sending requests to it entirely, routing traffic to a fallback instead. Applied to AI agents, this means tracking output quality signals in real time, and automatically reducing the system’s reliance on a component whose behavior has degraded below an acceptable threshold, before users notice the problem.
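A minimal circuit breaker for an agent call might look like the sketch below. It is illustrative rather than production-ready. The class and parameter names are assumptions, and the `is_acceptable` callback stands in for whatever output quality signal the system actually tracks.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to wait before a retry
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means circuit is closed

    def call(self, primary: Callable[[Any], Any], fallback: Callable[[Any], Any],
             is_acceptable: Callable[[Any], bool], request: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback(request)    # open: skip the primary entirely
            # Cool-down elapsed: allow one trial call; one failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        result = None
        try:
            result = primary(request)
            ok = is_acceptable(result)
        except Exception:
            ok = False
        if ok:
            self.failures = 0
            return result
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # trip the breaker
        return fallback(request)
```

Unlike a conventional circuit breaker, which trips on exceptions and timeouts, this version also counts a low-quality output as a failure. That matters for AI components, which can fail without raising any error.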

Detecting Failure Before It Reaches Users

Circuit breakers require signals to trip them. This is where hallucination detection becomes a structural concern rather than a model improvement concern. Research on watchdog frameworks for LLM-based agents demonstrates that hallucination monitoring can be implemented as a layer external to the model itself, requiring no access to the model’s internal state. This matters enormously for compound systems built on commercial APIs, where internal model inspection is impossible.

Reliability as a Multi-Dimensional Property

Agent reliability cannot be inferred from average task success alone. Research proposes measuring it across four independent dimensions: consistency, robustness, predictability, and safety. A system that scores well on one dimension can fail badly on another, and those failures often only appear in production.

Practically, effective pre-user detection combines three instrumentation layers. The first is output confidence monitoring, which tracks consistency signals across repeated or varied queries to identify agents operating outside their reliable range. The second is latency-based health probes, continuous checks that flag agents showing response time anomalies, a common early signal of context flooding or resource exhaustion. The third is cross-agent consistency checks at coordination boundaries, where a lightweight validator confirms that an agent’s output is plausibly coherent with the inputs it received before passing results downstream.
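The first two layers can be approximated with very little code. The sketch below is a simplified illustration under stated assumptions: agreement is measured by exact match after normalization, which only suits short factual answers, and latency anomalies are flagged with a plain z-score. The function names and thresholds are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Sequence


def agreement_rate(answers: Sequence[str]) -> float:
    """Fraction of repeated answers that match the most common answer.

    A low rate suggests the agent is outside its reliable range.
    """
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def latency_anomaly(history: Sequence[float], latest: float,
                    z_threshold: float = 3.0) -> bool:
    """Flag a response time far from the recent baseline."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold
```

Either signal can feed a circuit breaker or a coordinator-level validator. Longer free-text outputs would need a semantic similarity measure in place of exact matching.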

The Topology Question

Architecture is not neutral with respect to resilience. Research on optimizing multi-agent system structure found that both the topology (how agents connect to each other) and the prompts governing their behavior have strong, measurable impacts on overall resilience. Hierarchical structures, where a coordinator validates and routes between specialized agents, consistently outperform flat collaborative structures when faulty agents are present. The coordinator’s advantage is not merely that it can catch errors, but that centralized validation creates a natural checkpoint where fallback logic can activate before errors propagate.

This finding has a practical implication that goes beyond architecture diagrams. Teams building compound AI systems should treat the coordinator agent as the primary resilience surface, the component most heavily instrumented, most aggressively tested, and most conservatively designed. Complexity and creativity belong in specialized sub-agents, while the coordinator’s job is to be reliably boring.
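A “reliably boring” coordinator can be little more than a loop that runs each specialized agent, validates its output against its input, and substitutes a predefined fallback when validation fails. The stage tuple layout and the function name below are assumptions for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Stage = Tuple[
    str,                        # stage name, used in the trace
    Callable[[Any], Any],       # agent: input -> output
    Callable[[Any, Any], bool], # validator: (input, output) -> acceptable?
    Callable[[Any], Any],       # fallback: input -> safe output
]


def coordinate(task: Any, stages: Sequence[Stage]) -> Tuple[Any, List[Tuple[str, str]]]:
    """Run stages in order, validating at every coordination boundary."""
    data = task
    trace: List[Tuple[str, str]] = []
    for name, agent, validate, fallback in stages:
        try:
            output = agent(data)
            accepted = validate(data, output)
        except Exception:
            accepted = False
        if accepted:
            trace.append((name, "ok"))
        else:
            output = fallback(data)  # stop the error before it propagates
            trace.append((name, "fallback"))
        data = output
    return data, trace
```

The coordinator contains no model calls of its own, so its behavior can be unit-tested exhaustively. The returned trace shows where in the pipeline a fallback was activated.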

What This Means for Production Teams

The gap between AI systems that impress in demos and AI systems that hold up in production is largely a gap in failure design. Graceful degradation requires upfront decisions about what “acceptable failure” looks like at each layer, instrumentation that surfaces failure signals before they become user-visible, and fallback paths that are tested as rigorously as primary paths.

The research is clear that these properties do not emerge automatically from capable models or clever prompting. They require explicit architectural choices made early, before the system is under load and before the pressure to ship has narrowed the design space. Building for graceful failure is not pessimism; it is the engineering discipline that separates systems that scale from systems that eventually collapse under their own complexity.

Originally published at https://thinkata.com.