Pipeline orchestration engine for multi-agent LLM workflows. Define pipelines in .dip files (Dippin language), execute them with parallel agents, and watch progress in a TUI dashboard.
Built by 2389.ai.
# Install
go install github.com/2389-research/tracker/cmd/tracker@latest
# See what's built in
tracker workflows
# Run a built-in pipeline by name — no file needed
tracker build_product
# Or copy it locally to customize
tracker init build_product
tracker build_product.dip
# Run fully autonomous with an LLM judge
tracker --autopilot mid build_product
# Use the Claude Code backend for file editing + terminal (native is the default)
tracker --backend claude-code build_product
# Or route nodes through an Agent Client Protocol server
tracker --backend acp build_product
# Check your setup (API keys, dippin binary, working directory)
tracker doctor
# Configure LLM providers interactively
tracker setup
# Validate a pipeline without running it
tracker validate build_product
# Resume a stopped run
tracker -r <run-id> build_product.dip
# When something goes wrong
tracker diagnose
# Override workflow params at runtime (v0.19.0)
tracker --param model=claude-opus-4 --param retries=3 build_product
# Pin the run's artifact directory explicitly (v0.19.0)
tracker --artifact-dir /tmp/tracker-runs build_product

What's new in v0.34.0 (2026-05-27): the `workflows/` mirror is gone; built-in workflows are now embedded directly from `examples/` via explicit-file `//go:embed` lines (#256). The repo previously kept four byte-identical copies of the built-in workflow `.dip` files under `workflows/` and `examples/`, kept in lockstep by a `make sync-workflows` / `make check-workflows` target, a pre-commit gate, and a CI step. Despite those guardrails, the two copies drifted three times. The `//go:embed workflows/*.dip` glob is replaced with four explicit `//go:embed examples/<name>.dip` lines pointing directly at the canonical copies, and the `workflows/` directory plus all of the sync infrastructure (Makefile targets, pre-commit gate #9, CI `Embedded workflows in sync` step) is deleted. No functional change for CLI users: `tracker workflows`, `tracker init`, and bare-name resolution behave identically. Library-API note: `WorkflowInfo.File` now reports paths with the `examples/` prefix instead of `workflows/`; that's the only externally visible delta. ~3,200 lines deleted net.

Previously in v0.33.0 (2026-05-27): the last two visible
`build_product` audit gaps from #233 close; the eight-gap thread is now down to a single design-scope follow-up (Gap 5.2 `OutcomeHumanOverride`), which has wide blast radius and is tracked separately. Gap 5.1 (#263): `parseAutoStatus` now tolerates markdown emphasis on STATUS lines. The audit observed `**STATUS: fail**` (bold) silently falling back to the default `success` because `HasPrefix("**STATUS:", "STATUS:")` returns false, and LLMs commonly bold/italicize STATUS lines when they want the verdict to draw the eye. `parseStatusLine` now `strings.Trim`s the line and value with the `"*_"` cutset, so `**STATUS: fail**` / `*STATUS: fail*` / `STATUS: **fail**` / `__STATUS: success__` all parse correctly. Locked semantics from `TestParseAutoStatus_V3FailFirstContract` are unchanged: last-line-wins + default-success-on-empty. New regression coverage in `TestParseAutoStatus_Gap5_1_AuditedShapes` (11 sub-tests): six previously RED, now GREEN; five pin existing behavior. Gap 5.3 (#264): `build_product` re-runs reviewers after `ApplyReviewFixes` with a one-shot budget. Pre-fix, the workflow only ran reviewers once and routed straight to `FinalBuild` after fixes, so a fix commit introducing W4 (zero-assertion stubs), W5 (wrong-target tests), or W13 (DI-bypass) regressions reached `Done` unchecked: `FinalSpecCheck` per Gap 8 v5 scopes to W17 + iface + `SPEC.md` and explicitly delegates W4/W5/W13 to reviewer rubric point 3, which only ran pre-fix. New `CheckReviewFixBudget` tool node increments `.ai/build/review_fix_attempts` and routes `ApplyReviewFixes -> CheckReviewFixBudget -> ReviewParallel restart:true` (one re-review pass) or `-> EscalateReview` (budget exhausted, MAX_ATTEMPTS=1). A `ResetReviewBudget` tool node sits on the `EscalateReview "retry" -> Decompose` edge so the counter clears when the human picks retry; otherwise the stale counter would immediately fail-close the retry's first re-review pass (caught by Codex / Copilot during PR review). Pattern mirrors the existing `TestMilestone` `fix_attempts` precedent.
No breaking changes; all four example pipelines remain A grade on `dippin doctor`.

Previously in v0.32.0 (2026-05-27): dippin-lang dependency bumped from the v0.31.0-window SHA pin to the v0.32.0 tag, the joint-release closeout for #258 / dippin-lang#41. v0.31.0 shipped with a transient pseudo-version pin pointing at the dippin-lang#41 merge SHA because dippin v0.32.0 didn't exist yet when tracker was tagged. dippin v0.32.0 (whose go.mod pins tracker v0.31.0) is now published, so tracker v0.32.0 swaps the SHA for the tag and updates
`PinnedDippinVersion`. No functional changes vs v0.31.0: the underlying dippin code is byte-identical between the pseudo-version and the v0.32.0 tag. With this release, the joint-release loop is closed: tracker v0.32.0 ↔ dippin v0.32.0 are mutually pinned and both tags are published.

Previously in v0.31.0 (2026-05-27): two more
`build_product` audit gaps closed + new `tool_access` runtime enforcement that bounds the v0.28.2 single-agent multi-tool-call vector. Seven of the eight #233 audit gaps now closed; only Gap 5 (engine-level `auto_status` audit + `OutcomeHumanOverride` + post-`ApplyReviewFixes` re-check) remains as a follow-up. Gap 7 (#254): `FinalSpecCheck` enforces interface-method reachability with shown-work grep evidence. `Setup` writes `.ai/build/iface-reachability-rubric.md` (language detection + per-language enumeration patterns + caller-discipline rules + stdlib carve-out principle + waiver discipline + known-limitation skips); `FinalSpecCheck` and the three reviewers reference it so the discipline lives in one place. The prompt opens with an inverted STATUS contract (`STATUS:fail` first, `STATUS:success` last only on full pass) so a truncated response fails closed under last-line-wins parsing, defending against the `parseAutoStatus` default-success-on-empty shape that was the original Gap 7 bug. Gap 8 (#257): `FinalSpecCheck` adds a sleep-as-fence section restricted to tracked test files via `git grep` (so dependency dirs like `node_modules/`, `vendor/`, `.venv/`, `target/` are excluded by construction); each hit needs disposition (cite SPEC.md timing contract OR cite deterministic primitive OR `.ai/decisions/` waiver). Reviewer rubric point 3 strengthened across all three reviewers: Gemini's "stdlib-only tests FAIL" sentence promoted to ReviewClaude and ReviewCodex (W5), plus shown-work demands for zero-assertion bodies (W4) and DI bypass when a Clock/Random/IO seam exists (W13). The legacy STATUS tail at `FinalSpecCheck` was rewritten to align with PR #254's inverted contract. `tool_access: none` enforcement (#259, joint with dippin-lang v0.32.0): bounds the v0.28.2 single-agent multi-tool-call vector, where an LLM emitting `[bash("rm -rf"), write("payload"), bash("./payload")]` in one response would otherwise dispatch all of them before `max_turns` checks the cap.
With `tool_access: none` set on an agent node, tracker hands the LLM zero tools so the response comes back as plain text. Defense in depth: built-in registry gate + post-`WithTools` registry clear in `NewSession` + `request.ToolChoice = ToolChoiceNone()` + `executeToolCalls` early-exit + Params-bypass defense (`allowed_tools` / `disallowed_tools` / `permission_mode` ignored when `tool_access` is set). Backend coverage: native full honor; claude-code best-effort via `--disallowedTools` enumeration; ACP refuses session creation (no verified deny-equivalent yet). Fail-closed for typos: `noen` / `None` / `off` all disable tools so a misspelling can't ship full tools. No breaking changes; all three core example pipelines remain A / 100 on `dippin doctor`.

Previously in v0.30.0 (2026-05-19):
`build_product` workflow hardened against five of the eight failure modes the #233 audit uncovered. The audit ran `build_product` end-to-end on a real Phase 1 spec and found 38 issues the workflow declared "Done" on: wrong API shape, off-by-one retries, red CI, dead interface methods, Phase-6 features built in Phase 1 with green tests pinning the bug. Five gaps closed in this release; three (Gap 5 `auto_status` engine audit, Gap 7 interface reachability, Gap 8 `TestQuality` step) remain. Gap 1: `TestMilestone` + `FinalBuild` now probe for a project `Makefile` and run the first defined target out of `ci` / `check` / `lint` (parsed via sed+awk that handles multi-target rules and rejects variable assignments); a missing `make` binary fails loud instead of silently skipping. Gaps 3 + 6: `Implement` carries explicit "spec literals are contracts" + "tests verify the contract, not your code" + "snapshot tests must be hand-verified" rules, and `Decompose` produces a per-milestone "DO NOT implement" list so deferred-phase affordances stay inert. Gap 2: the three cross-reviewers (`ReviewClaude`, `ReviewCodex`, `ReviewGemini`) now share a 5-point structured rubric (spec literals grep, interface reachability, test-verifies-contract, scope, architecture); each carries a "do not trust spec ✅ markers" warning; `ReviewGemini` is now explicitly adversarial ("faithful and high-quality realization" is a forbidden phrase); `SynthesizeReviews` weights findings by evidence not vote count, so a single reviewer with grep / file:line evidence flips `STATUS:fail` over two evidence-free PASSes. Gap 4: `VerifyMilestone` reads `SPEC.md` directly (not just the milestone notes), runs spec-literal greps inline (with a `.ai/decisions/` documented-deviation carve-out matching Implement's contract), applies the test-asserts-contract check, and confirms tests + CI passed via the `tests-pass` sentinel rather than scanning for visible test output (which the 64KB stdout tail-cap routinely elides).
Also bumps `dippin-lang` v0.27.0 → v0.29.0 across two passthrough PRs (#248, #250): `ir.ToolConfig` gains `MarkerGrep` / `RouteRequired` / `OutputLimit` / `Outputs` adapter forwarding, and dippin-side polish lands lint suppression on `marker_grep` tools (DIP138), `parseBoolAttr` normalization (`yes` / `1` / `TRUE` now parse correctly on `goal_gate` / `auto_status` / `cache_tools` / `route_required`), and `Outputs` DOT round-trip parity. No breaking changes; all three example pipelines remain A grade on `dippin doctor`.

Previously in v0.29.0 (2026-05-18): workflows can now declare environmental dependencies in the header (#234). A new
`requires: <list>` keyword in the `.dip` workflow header (e.g. `requires: git`) tells tracker to verify those dependencies before any node executes: git installed AND working tree present AND HEAD points at a real commit. Unrecognized entries (`docker`, `gh`, `jq`, etc.) parse fine and warn instead of erroring, so workflow authors can forward-declare deps that future tracker versions will check. New `--git=auto|off|warn|require|init` CLI flag overrides the policy: `auto` (default) respects the workflow's `requires:`, `off` bypasses, `warn` downgrades hard-fails to warnings, `require` forces the check even if the workflow didn't declare it, and `init` (with a mandatory `--allow-init` latch or interactive `[Y/n]` prompt) auto-runs `git init` and an empty initial commit so HEAD is born. Safety refusals fire for `$HOME`, `/`, and nested git contexts, including linked worktrees (`.git` is a file), submodules (same), and bare repos (no `.git` at all); `$HOME` and root comparisons fold case on Windows. Auto-init refuses in non-empty workdirs rather than silently `git add -A`-ing content that might be secrets, build artifacts, or untracked files the user hasn't decided to keep. `tracker doctor` gets a new Git Requires check that previews the runtime decision under the resolved policy (OK/Warn/Error/Skip) with copy-paste remediation in `Hint`. The three built-in workflows that commit/branch/merge mid-run (`ask_and_execute`, `build_product`, `build_product_with_superspec`) now declare `requires: git` so a non-git workdir fails in seconds with a remediation message instead of burning $20–$100 of LLM spend before the first git operation. Requires dippin-lang v0.26.0.

Previously in v0.28.2 (2026-05-14): patch release fixing a runaway-agent bug in three of the four built-in workflows (#230).
`ask_and_execute`, `build_product`, and `build_product_with_superspec` defined `Start` / `Done` as `agent` nodes with `prompt: Initialize pipeline.` and `prompt: Pipeline complete.`, which caused `ensureStartExitNodes` to skip the passthrough handler and turn them into real codergen sessions with full default tool access and no `max_turns` cap. A real `build_product` run was observed spending ~10 minutes and ~39k output tokens inside `Start`, implementing an entire separate Go project from a SPEC.md found on disk, before getting `context canceled`. Dropping the prompt lines makes Start/Done passthroughs, matching `deep_review.dip` which was already correct. Engine-policy follow-ups (default `max_turns` cap, cancel→retry mapping, runaway-node detection in `tracker diagnose`, tool-call arg logging, token accounting) remain tracked on #230. No engine changes; no breaking changes.

Previously in v0.28.1 (2026-05-14): maintenance release that picks up dippin-lang v0.25.0 (
`.dipx` format v1.1). For tracker, three bundle-load improvements arrive automatically: cycle detection in `dipx.Open` now walks every manifest-listed workflow (was: only entry-reachable), `ErrManifestInvalid` / `ErrUnsupportedFormatVersion` errors now always carry the bundle path, and `dipx.Pack` subgraph parse failures classify correctly as `ErrSubgraphParse` instead of `ErrEntryParse`. `Open` and `Pack` are also cancellable mid-loop (`ctx.Err()` checked per-entry). No tracker-side feature changes; no breaking changes. The `Source.Workflow` signature change in dippin-lang v0.25.0 doesn't affect tracker; we use `Bundle.Entry()` / `Bundle.Lookup()` directly.

Previously in v0.28.0 (2026-05-13): the entire #208 follow-up arc lands: five hardening issues, all closed. New typed routing channels remove the
`ctx.tool_stdout`-grep foot-gun: `marker_grep:` (#210) is a declarative regex on a tool node that populates `ctx.tool_marker` with the last match's capture group; `_TRACKER_ROUTE=<value>` (#212) is a reserved sentinel line tools emit on stdout that populates `ctx.tool_route`. Both fail loudly via `OutcomeFail` + dedicated events (`tool_marker_missing` / `tool_route_missing`) when the expected marker isn't found, instead of silently falling through. New `TRK101` validate-time lint (#211) flags the risky shape (`ctx.tool_stdout` conditional + unconditional fallback + tee/2>&1 volume + no `marker_grep`) at `tracker validate` / `tracker doctor` time, before a pipeline ships. Activity log integrity hardening (#213) relocates live writes from `<workDir>/.tracker/runs/<runID>/activity.jsonl` (mode 0o644, reachable from tool subprocesses via relative path) to `$XDG_STATE_HOME/tracker/runs/<runID>/` (mode 0o600, override via `TRACKER_AUDIT_DIR`), prefixes every runtime line with a `\x1f\x1e` sentinel that diagnose validates, and writes a sentinel-stripped snapshot back to the legacy path at Close so bundle export and external tooling keep working. New `SuggestionAuditLogInjection` fires when the secure log has non-sentinel lines. Detection-only, not authentication; documented in CLAUDE.md. Property-based tests for `tailBuffer` (#214) generalize the v0.27.0 boundary tests across the full state space via `pgregory.net/rapid`.

Previously in v0.27.0 (2026-05-13): tool stdout/stderr truncation now keeps the tail, not the head (closes #208). Pre-fix, the per-stream 64KB cap kept the first 64KB of output, silently dropping routing markers (
`printf 'tests-pass'`) past the boundary; pipeline routing could then fall through the unconditional fallback edge and ship broken code as if it had passed. New `tailBuffer` ring-buffer keeps the trailing `limit` bytes; `CommandResult` gains structured `StdoutTruncated` / `BytesDropped` / `StderrTruncated` / `StderrBytesDropped` fields; the in-band `"...(output truncated at N bytes)"` suffix is gone. Two new activity events surface what happened: `EventToolOutputTruncated` (per truncated stream) and `EventConditionalFallthrough` (when conditional edges all evaluated false and routing fell back). `tracker diagnose` correlates the two on the same node and surfaces a combined "your routing marker may have been dropped" suggestion.

Previously in v0.26.0 (2026-05-12): native
`.dipx` bundle support across the whole CLI. Tracker now accepts content-addressed `.dipx` bundles (produced by `dippin pack`) anywhere it accepts a pipeline file: `tracker validate`, `tracker simulate`, `tracker run`, `tracker doctor`, and `tracker -r <runID>` resume. Pre-fix, tracker read the bundle's ZIP bytes as `.dip` source and failed with bogus `DIP001` / `DIP002` validation errors. The new `pipeline.LoadDipxBundle` opens the bundle via `dipx.Open` (SHA-256 verifies every file in `manifest.json` before any content reaches the parser), uses the bundle's pre-parsed `*ir.Workflow` directly, and bypasses the filesystem subgraph walker entirely since dipx already verifies ref closure + acyclicity. The bundle's content-addressed identity (`sha256:<hex>`) is stamped on every line of `activity.jsonl`, persisted into `checkpoint.json`, and surfaced in `tracker list` (new `Bundle` column) and `tracker audit` (new `Bundle:` header line). Resume against a `.dipx` strictly verifies the stored identity matches; a mismatch aborts with both hashes shown, and `--force-bundle-mismatch` is the escape hatch. dippin-lang dependency bumped v0.23.0 → v0.24.0 for the new `dipx` package.

Previously in v0.25.1 (2026-05-11): Bedrock gateway integration polish + Gemini token-usage fix. The Bedrock Gateway integration guide is refreshed for upstream gateway fixes (Cloudflare AI Gateway native routing prefixes
`/anthropic` / `/openai` / `/google-ai-studio` / `/compat`, and Gemini's `/v1beta/models/...` paths); tracker's `--gateway-url` flag now works end-to-end against `https://bedrock-gateway.2389-research-inc.workers.dev` and `provider: gemini` is no longer broken. The Gemini SSE adapter coalesces split finish + `usageMetadata` chunks into a single `EventFinish`, fixing both the 0/0 token-count bug (gateway emits usage as a standalone trailing chunk) and the duplicate "llm finish" trace line that the first fix would have left behind. Partial-failure streams correctly emit a terminal finish before the error so accumulators record the reason for work that completed before the stream broke.

Previously in v0.25.0 (2026-05-05): architect-side machinery for local codegen + self-healing declared writes. New agent-tool primitive
`TerminalTool` lets a tool flag itself as the terminal step of an agent session; the runtime breaks the loop the moment it succeeds, with no wasted post-dispatch turns. Three new tools: `dispatch_sprints` runs a deterministic in-tool loop over a `{path, description}` JSONL plan with bounded retry+backoff for transient provider errors; `write_enriched_sprint` calls a mid-tier LLM once per sprint with a 4-strategy SEARCH/REPLACE matcher (exact → indent-preserving → whitespace-insensitive → fuzzy) plus partial-apply semantics and a tolerant audit-verdict parser; `generate_code` expands a contract into one or more files via a cheap/fast model. All four are env-gated via `TRACKER_SPRINT_WRITER_MODEL` / `TRACKER_CODEGEN_MODEL`. Validated end-to-end on Notebook synthetic (41/41 pytest passing, ~$2, 28min) and NIFB architect-only (16 sprints, ~$5, 47min). Declarative `writes:` now self-heals when an LLM returns prose instead of JSON: 4-step extraction cascade (direct parse → fenced block via strict-shape regex → balanced-brace scan that handles stray brace pairs and string-internal `{` correctly → 8 KiB-capped raw-response fallback for single-key writes only), with the fallback gated on "no extractable JSON found" so a model that returned valid JSON missing the declared key still hard-fails. Reserved-key collision rejection on `writes:` covers the tool_command safe-key allowlist (security) and the `writes_error` / `writes_warning` signal keys (integrity). `tracker doctor` provider probe restored to 16-token max output (a `maxTok=1` regression had been breaking OpenAI keys with HTTP 400). New `docs/bedrock-gateway.md` integration guide for routing through the 2389 Cloudflare Worker.

Previously in v0.24.2 (2026-05-03): 10-finding security audit pass: ACP
`CreateTerminal` now validates commands against the built-in denylist and constrains `cwd`; Claude Code backend kills subprocess process group on pipeline cancellation; `TRACKER_PASS_API_KEYS` requires `=1` (was any non-empty value); engine fails on unknown outcome status; pipeline goroutine panic recovery; `manager_loop` `steer_context` keys namespaced under `steer.*` to close a future LLM-controlled steer-key bypass.

Previously in v0.24.1 (2026-04-24): cost-accounting + observability layer on top of v0.24.0: claude-code backend now parses
`cache_read_input_tokens` / `cache_creation_input_tokens` from the NDJSON result envelope (pre-fix: silently dropped, ~3× input-cost overcount on heavy-cache Sonnet workloads); new `TRACKER_ACP_CACHE_READ_RATIO` env var lets operators tell the ACP estimator what fraction of estimated input to price as cache-read; new `--tool-denylist-add <glob>` CLI flag + `tool_denylist_add` graph attribute let workflows extend the built-in tool-command denylist for defense in depth (completes the deferred `WorkflowDefaults.ToolDenylistAdd` adapter wiring); `Estimated` flag now plumbs through `SessionStats` → `ProviderUsage` → CLI/TUI/NDJSON so a mixed native+ACP run correctly marks heuristic spend with `(estimated)` suffixes instead of silently mixing metered and estimated figures.

Previously in v0.24.0 (2026-04-24): two budget-bypass P1 fixes:
`subgraph` and `stack.manager_loop` nodes no longer ignore `--max-tokens` / `--max-cost` (child engines now inherit the parent's `BudgetGuard` + baseline usage, and their spend rolls up via `Outcome.ChildUsage` so parent-level guards fire between nodes); ACP backend surfaces approximate per-prompt token usage from rune counts across assistant text, reasoning chunks, and tool-call argument/result payloads; `claude-code` and `acp` backends now correctly populate `SessionResult.Provider` and thread `cfg.Model` into `TokenTracker.AddUsage`; `llm.EstimateCost` warns once per unknown model; dippin-lang v0.23.0 bump.

Previously in v0.23.0 (2026-04-22):
`.dip` authors can now declare `stack.manager_loop` supervisors directly via the new `ir.NodeManagerLoop` IR kind (dippin-lang v0.22.0 contract: subgraph_ref, poll_interval, max_cycles, stop_condition, steer_condition, steer_context with percent-encoded round-trip); three new tool-safety CLI flags (`--bypass-denylist`, `--tool-allowlist <pattern>`, `--max-output-limit <bytes>`) plus `tool_commands_allow` graph attribute that unions with the CLI allowlist; manager_loop evaluator-compatibility fixes (`&&` / `||` Parsed-fallback, comma-ok attr precedence, strict steer_context validation) surfaced by the v0.22.0 review-squad pass.

Previously in v0.22.0 (2026-04-22): new
`tracker-swebench analyze <results-dir>` subcommand for bulk triage of completed SWE-bench runs (auto-detects evaluator reports, surfaces empty-patch diagnostics, per-repo breakdown, error-class distribution, `--json` for downstream tooling); typed `NodeConfig` accessors on `*pipeline.Node` that replace scattered `map[string]string` parsing in codergen/human/tool/parallel/retry paths (closes the #19 Primitive Obsession refactor); tool-node `timeout: "0"` or negative durations now error with a clear "non-positive timeout" message when the tool node executes instead of being silently passed through to `context.WithTimeout` (behavior change; see CHANGELOG).

Previously in v0.21.0 (2026-04-21): declarative
`writes:` / `reads:` unified structured output for agent/human/tool nodes; `tracker.SimulateGraph` graph-in variant; repository localization pre-processing; agent episodic memory across retries; plan-before-execute phase; accurate cost accounting fixes.

Previously in v0.20.0 (2026-04-21):
`stack.manager_loop` supervisor handler (Attractor spec 4.11); engine-level steering channel; accurate cost estimation via catalog with cache-token pricing; April 2026 model catalog refresh (Claude Opus 4.7, GPT-5.4 family, Gemini 2.5 GA, Gemini 3.1 pro preview); ACP sandbox hardening against `..` path traversal. See CHANGELOG.md and `docs/architecture/handlers/manager-loop.md`.
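The v0.33.0 and v0.31.0 notes above describe STATUS parsing as last-line-wins, default-success-on-empty, with markdown emphasis trimmed via a `"*_"` cutset. The following is a minimal Go sketch of those three rules, not tracker's actual code: the function bodies are written from the description alone, and ignoring a `STATUS:` line with an empty value is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStatusLine is an illustrative reimplementation (not tracker's source).
// It trims markdown emphasis from the line and from the value, then reports
// the value if the line starts with "STATUS:".
func parseStatusLine(line string) (string, bool) {
	line = strings.Trim(strings.TrimSpace(line), "*_")
	rest, ok := strings.CutPrefix(line, "STATUS:")
	if !ok {
		return "", false
	}
	return strings.Trim(strings.TrimSpace(rest), "*_"), true
}

// parseAutoStatus applies last-line-wins with a default of "success".
func parseAutoStatus(response string) string {
	status := "success" // default-success-on-empty
	for _, line := range strings.Split(response, "\n") {
		// Assumption: a STATUS line with no value is ignored.
		if v, ok := parseStatusLine(line); ok && v != "" {
			status = v // a later STATUS line overrides an earlier one
		}
	}
	return status
}

func main() {
	fmt.Println(parseAutoStatus("**STATUS: fail**"))                  // fail
	fmt.Println(parseAutoStatus("STATUS: fail\n...\nSTATUS: success")) // success
	fmt.Println(parseAutoStatus("STATUS: fail\ntruncated mid-sen"))   // fail
}
```

The third call shows why the inverted contract fails closed: a response cut off before the final `STATUS:success` line still carries the leading `STATUS:fail`.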
Four pipelines are embedded in the binary and available via tracker workflows:
Competitive implementation: ask the user what to build, fan out to 3 agents (Claude/Codex/Gemini) in isolated git worktrees, cross-critique the implementations, select the best one, apply it, clean up the rest.
Sequential milestone builder: read a SPEC.md, decompose into milestones, implement each with verification loops (opus-powered fix agent with 50 turns), cross-review the complete result, verify full spec compliance. Context-specific escalation gates let you override flaky tests or skip milestones without aborting the build.
graph LR
ReadSpec --> Decompose --> ApprovePlan
ApprovePlan -->|approve| PickNext
PickNext -->|milestone N| Implement --> Test
Test -->|pass| Verify --> MarkDone --> PickNext
Test -->|fail| Fix --> Test
Test -->|escalate| EscalateMilestone
EscalateMilestone -->|mark done| MarkDone
EscalateMilestone -->|retry| Implement
PickNext -->|all done| CrossReview --> FinalBuild --> FinalSpec --> Cleanup --> Done
Parallel stream execution for large structured specs: reads the spec's work streams and dependency graph, executes independent streams in parallel (with git worktree isolation), enforces quality gates between phases, cross-reviews with 3 specialized reviewers (architect/QA/product), and audits traceability.
Interview-driven codebase review: describe what you want reviewed, answer structured interview questions to scope the analysis, then three parallel agents analyze correctness, security, and design. A second interview presents findings for your context (is this intentional? known issue?), a third prioritizes remediation, and the pipeline produces an actionable remediation plan.
graph LR
DescribeGoal --> Explore --> ScopeInterview
ScopeInterview --> AnalyzeParallel
AnalyzeParallel --> Correctness & Security & Design
Correctness & Security & Design --> Join
Join --> Synthesize --> FindingsInterview
FindingsInterview --> PriorityInterview
PriorityInterview --> RemediationPlan --> ReviewPlan
ReviewPlan -->|approve| Finalize --> Done
ReviewPlan -->|revise| RemediationPlan
Pipelines are embedded in the binary so brew and go install users can run them without cloning the repo:
tracker workflows # List all built-in workflows
tracker build_product # Run directly by name
tracker validate build_product # Validate works too
tracker simulate build_product # Simulate too
tracker init build_product # Copy to ./build_product.dip for editing

Local .dip files always take precedence over built-ins. After tracker init build_product, running tracker build_product uses your local copy.
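The precedence rule can be pictured as a two-step lookup. This is a hedged Go sketch rather than tracker's resolver: `resolveWorkflow`, the `"local"` / `"builtin"` labels, and the use of `fstest.MapFS` to stand in for the working directory and the embedded files are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// resolveWorkflow is a hypothetical resolver: it looks for <name>.dip in the
// local filesystem first and falls back to the built-in set only if absent.
func resolveWorkflow(local, builtin fs.FS, name string) (string, []byte, error) {
	file := name + ".dip"
	if data, err := fs.ReadFile(local, file); err == nil {
		return "local", data, nil
	}
	if data, err := fs.ReadFile(builtin, file); err == nil {
		return "builtin", data, nil
	}
	return "", nil, fmt.Errorf("workflow %q not found locally or among built-ins", name)
}

// demo resolves the same bare name before and after a local copy exists.
func demo() (before, after string) {
	builtin := fstest.MapFS{"build_product.dip": {Data: []byte("workflow BuildProduct")}}
	local := fstest.MapFS{}
	before, _, _ = resolveWorkflow(local, builtin, "build_product")
	// Simulates what copying the workflow into the working directory would do.
	local["build_product.dip"] = &fstest.MapFile{Data: []byte("workflow Customized")}
	after, _, _ = resolveWorkflow(local, builtin, "build_product")
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // prints "builtin local"
}
```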
Pipelines are defined in .dip files using the Dippin language:
workflow MyPipeline
  goal: "Build something great"
  start: Begin
  exit: Done

  defaults
    model: claude-sonnet-4-6
    provider: anthropic

  agent Begin
    label: Start

  human AskUser
    label: "What should we build?"
    mode: freeform

  agent Implement
    label: "Build It"
    prompt: |
      The user wants: ${ctx.human_response}
      Implement it following the project's conventions.

  agent Done
    label: Done

  edges
    Begin -> AskUser
    AskUser -> Implement
    Implement -> Done
Workflows can declare what they need from the host environment with a requires: line in the header:
workflow BuildProduct
  goal: "..."
  requires: git
  start: Start
  exit: Done
Tracker checks these at startup (when you invoke tracker <workflow>). If the env doesn't satisfy them, the run fails in seconds with a copy-paste remediation instead of burning LLM spend before the first failure. Override per-run with --git=auto|off|warn|require|init (default auto respects requires:; --git=init --allow-init auto-initializes the workdir, with safety refusals for $HOME, /, and nested repos — including bare repos, linked worktrees, and submodules). v0.29.0 implements git; unrecognized entries warn and continue so workflow authors can forward-declare deps that future tracker versions will check.
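To make the policy values concrete, here is a small Go sketch of the decision table implied by the description. It is an illustration, not tracker's implementation: the `decide` function and its `"run"` / `"warn"` / `"fail"` results are invented names, `init` is left out because it has side effects, and treating `warn` as a no-op when the workflow did not declare `requires: git` is an assumption.

```go
package main

import "fmt"

// decide maps a --git policy, whether the workflow declared "requires: git",
// and the outcome of the git readiness check to an action.
// Hypothetical helper for illustration only.
func decide(policy string, declared, gitReady bool) (string, error) {
	switch policy {
	case "off":
		return "run", nil // bypass the check entirely
	case "auto":
		if !declared || gitReady {
			return "run", nil
		}
		return "fail", nil // respect the workflow's requires: line
	case "warn":
		// Assumption: warn only matters when the dependency was declared.
		if !declared || gitReady {
			return "run", nil
		}
		return "warn", nil // hard-fail downgraded to a warning
	case "require":
		if gitReady {
			return "run", nil
		}
		return "fail", nil // checked even without a requires: line
	default:
		return "", fmt.Errorf("policy %q is not covered by this sketch", policy)
	}
}

func main() {
	for _, policy := range []string{"auto", "off", "warn", "require"} {
		action, _ := decide(policy, true, false)
		fmt.Printf("%-8s declared, git missing -> %s\n", policy, action)
	}
}
```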
| Type | Shape | Description |
|---|---|---|
| `agent` | `box` | LLM agent session (codergen) |
| `human` | `hexagon` | Human-in-the-loop gate (choice, freeform, or hybrid) |
| `tool` | `parallelogram` | Shell command execution |
| `parallel` | `component` | Fan-out to concurrent branches |
| `fan_in` | `tripleoctagon` | Join parallel branches |
| `subgraph` | `tab` | Execute a referenced sub-pipeline |
| `manager_loop` | `house` | Managed iteration loop |
| `conditional` | `diamond` | Condition-based routing |
Three namespaces for ${...} syntax in prompts:
- `${ctx.outcome}`: runtime pipeline context (outcome, last_response, human_response, tool_stdout)
- `${params.model}`: workflow-level `vars` (optionally overridden by `--param key=value` at run time, v0.19.0) and subgraph parameters passed from a parent pipeline
- `${graph.goal}`: workflow-level attributes
Declare defaults in a top-level vars block and override them per-run:
workflow MyPipeline
  vars
    model: claude-sonnet-4-6
    retries: 3
tracker --param model=claude-opus-4 --param retries=1 MyPipeline

Unknown --param keys hard-fail at startup. Dippin-lang's lint (run automatically at .dip load) flags undeclared ${params.*} references and other variable-reference mistakes; see dippin doctor for the full lint catalog.
Variables are expanded in a single pass — resolved values are never re-scanned, preventing recursive expansion.
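A short Go sketch shows why single-pass expansion blocks recursion: the substituted text is written to the output and never fed back through the matcher. The `expand` function, the regular expression, and the choice to leave unknown references untouched are assumptions for illustration, not tracker's expander.

```go
package main

import (
	"fmt"
	"regexp"
)

// ref matches ${namespace.key} style references (an assumed grammar).
var ref = regexp.MustCompile(`\$\{([A-Za-z_][\w.]*)\}`)

// expand substitutes each reference exactly once. ReplaceAllStringFunc scans
// the template left to right and does not rescan the replacement text.
func expand(tmpl string, vars map[string]string) string {
	return ref.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := m[2 : len(m)-1] // strip "${" and "}"
		if v, ok := vars[key]; ok {
			return v
		}
		return m // assumption: unknown references are left as written
	})
}

func main() {
	vars := map[string]string{
		"ctx.human_response": "${params.model}", // user typed something that looks like a reference
		"params.model":       "claude-opus-4",
	}
	fmt.Println(expand("The user wants: ${ctx.human_response}", vars))
	// prints "The user wants: ${params.model}" because the value is not re-expanded
}
```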
Important: Each agent node runs a fresh LLM session. Data flows between nodes via context keys, not conversation history. Per-node scoping (${ctx.node.<nodeID>.<key>}) lets you reference a specific earlier node's output without relying on the last-writer-wins last_response key. See Pipeline Context Flow for the full model, fidelity levels, and parallel-branch patterns.
edges
  Check -> Pass when ctx.outcome = success
  Check -> Fail when ctx.outcome = fail
  Check -> Retry when ctx.outcome = retry
  Gate -> Next when ctx.tool_stdout contains all-done
  Gate -> Loop when ctx.tool_stdout not contains all-done
Supported operators: =, !=, contains, not contains, startswith, not startswith, endswith, not endswith, in, not in, &&, ||, not.
ctx.tool_stdout and ctx.tool_stderr capture the tail of a tool node's output (default cap 64KB per stream, configurable per-node via output_limit: 256KB; --max-output-limit is a hard global ceiling, default 10MB, that caps how high a per-node output_limit can go). Routing markers emitted at end-of-output via printf survive truncation by construction; tracker diagnose surfaces a tool_output_truncated suggestion when a stream was clipped so you know to raise the limit if the captured tail isn't what you expected.
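The effect of tail-keeping truncation on a routing marker can be demonstrated with a tiny writer. This is a simplified Go sketch, assuming a plain slice that is shifted on overflow; it is not tracker's ring buffer, and `tailWriter` with its fields is a hypothetical name.

```go
package main

import "fmt"

// tailWriter keeps only the last limit bytes written to it and counts
// how many bytes were discarded. Illustrative only.
type tailWriter struct {
	limit   int
	buf     []byte
	dropped int
}

// Write appends p, then discards the oldest bytes beyond the limit.
func (t *tailWriter) Write(p []byte) (int, error) {
	t.buf = append(t.buf, p...)
	if over := len(t.buf) - t.limit; over > 0 {
		copy(t.buf, t.buf[over:]) // shift the newest bytes to the front
		t.buf = t.buf[:t.limit]
		t.dropped += over
	}
	return len(p), nil
}

func main() {
	w := &tailWriter{limit: 16}
	for i := 0; i < 100; i++ {
		fmt.Fprint(w, "x") // noisy tool output
	}
	fmt.Fprint(w, "tests-pass") // routing marker printed last
	fmt.Printf("tail=%q dropped=%d\n", w.buf, w.dropped)
	// prints: tail="xxxxxxtests-pass" dropped=94
}
```

A head-keeping cap of the same size would have stored sixteen `x` bytes and lost the marker, which is the failure the tail behavior avoids.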
Conditions support the ctx. namespace prefix (dippin convention) and internal.* references for engine-managed state.
Agent, tool, and `mode: interview` human nodes can declare the context keys they produce and consume (v0.21.0):
agent Planner
  response_format: json_object
  writes:
    - milestone_id
    - files
  reads:
    - spec_path
The node output must be a valid top-level JSON object; every declared key in writes: must be present or the node hard-fails. Extras are allowed (surfaced as warnings), strings are stored verbatim, non-string values are stored as compact JSON. reads: pins fidelity for upstream keys so downstream nodes see consistent data. See Pipeline Context Flow for the full contract, worked examples, and interview-mode semantics.
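The storage rules in that paragraph can be sketched in Go as follows. This is an illustration written from the description, not tracker's handler: `applyWrites` is a hypothetical name, the self-healing extraction cascade is omitted, and not storing undeclared keys is an assumption (the text only says they surface as warnings).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// applyWrites checks a node's JSON output against its declared writes: keys.
// Strings are stored verbatim, other values as compact JSON, and a missing
// declared key is a hard failure.
func applyWrites(output string, writes []string) (map[string]string, []string, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(output), &obj); err != nil || obj == nil {
		return nil, nil, fmt.Errorf("output is not a top-level JSON object")
	}
	stored := make(map[string]string, len(writes))
	declared := make(map[string]bool, len(writes))
	for _, key := range writes {
		declared[key] = true
		raw, ok := obj[key]
		if !ok {
			return nil, nil, fmt.Errorf("declared key %q missing from output", key)
		}
		raw = bytes.TrimSpace(raw)
		if len(raw) > 0 && raw[0] == '"' { // JSON string: store the decoded text
			var s string
			if err := json.Unmarshal(raw, &s); err != nil {
				return nil, nil, err
			}
			stored[key] = s
			continue
		}
		var compact bytes.Buffer // anything else: store as compact JSON
		if err := json.Compact(&compact, raw); err != nil {
			return nil, nil, err
		}
		stored[key] = compact.String()
	}
	var warnings []string
	for key := range obj {
		if !declared[key] {
			warnings = append(warnings, "undeclared key in output: "+key)
		}
	}
	sort.Strings(warnings)
	return stored, warnings, nil
}

func main() {
	out := `{"milestone_id": "m1", "files": ["a.go", "b.go"], "note": 1}`
	stored, warnings, err := applyWrites(out, []string{"milestone_id", "files"})
	fmt.Println(stored, warnings, err)
}
```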
For git worktree isolation in parallel implementations:
agent ImplementClaude
  working_dir: .ai/worktrees/claude
  model: claude-sonnet-4-6
  prompt: Implement the spec in this isolated worktree.
The working_dir attribute is validated against path traversal and shell metacharacters.
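As a rough picture of what such validation might involve, the Go sketch below rejects `..` path segments and a set of shell metacharacters. The exact rules tracker applies are not given here, so the character set and the `validateWorkingDir` name are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// shellMeta is an assumed set of characters to reject; the real list may differ.
const shellMeta = ";&|`$<>(){}*?!'\"\\\n"

// validateWorkingDir is a hypothetical check for a working_dir value.
func validateWorkingDir(dir string) error {
	if dir == "" {
		return fmt.Errorf("working_dir is empty")
	}
	if i := strings.IndexAny(dir, shellMeta); i >= 0 {
		return fmt.Errorf("working_dir contains shell metacharacter %q", dir[i])
	}
	for _, part := range strings.Split(dir, "/") {
		if part == ".." {
			return fmt.Errorf("working_dir %q climbs out of the workdir", dir)
		}
	}
	return nil
}

func main() {
	for _, dir := range []string{".ai/worktrees/claude", "../outside", "work; rm -rf /"} {
		fmt.Printf("%-24q -> %v\n", dir, validateWorkingDir(dir))
	}
}
```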
Five gate modes:
- Choice mode (default): presents outgoing edge labels as a radio list. Arrow keys navigate, Enter selects.
- Freeform mode (`mode: freeform`): captures text input. If the response matches an edge label (case-insensitive), it routes to that edge. Otherwise it's stored as `ctx.human_response`.
- Hybrid mode (automatic): when a freeform gate has labeled outgoing edges, the TUI presents a radio list of labels plus an "other" option for custom feedback. Selecting a label submits it directly; selecting "other" opens a textarea for specific instructions.
- Yes/No mode (`mode: yes_no`): fixed two-option prompt. Yes maps to `OutcomeSuccess`, No maps to `OutcomeFail`; route with `when ctx.outcome = success` / `when ctx.outcome = fail` edges. Distinct from choice mode, where the outcome is always success and routing uses `preferred_label`.
- Interview mode (`mode: interview`): structured multi-field form driven by upstream agent output. An agent generates markdown questions with inline options; the handler parses them into individual form fields and presents a fullscreen interview form. Answers are stored as JSON and markdown summary.
Long prompts with labels (e.g., escalation gates with agent output) automatically use a fullscreen review hybrid view: glamour-rendered scrollable viewport on top (PgUp/PgDn to scroll), radio label selection in the middle, and an "other" freeform option at the bottom for custom retry instructions. Long prompts without labels use a split-pane review: scrollable viewport on top, textarea on bottom.
human ApproveSpec
label: "Review the spec. Approve, refine, or reject."
mode: freeform
edges
ApproveSpec -> Build label: "approve"
ApproveSpec -> Revise label: "refine" restart: true
ApproveSpec -> Done label: "reject"
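The freeform routing rule (a response that matches an edge label case-insensitively routes to that edge, anything else is kept as free text) might look like the sketch below. `routeFreeform` is a hypothetical name, and trimming surrounding whitespace before comparing is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// routeFreeform returns the edge label that the response matches, if any.
// Hypothetical helper for illustration; whitespace trimming is an assumption.
func routeFreeform(response string, labels []string) (string, bool) {
	trimmed := strings.TrimSpace(response)
	for _, label := range labels {
		if strings.EqualFold(trimmed, label) {
			return label, true
		}
	}
	return "", false // caller would keep the text as ctx.human_response
}

func main() {
	labels := []string{"approve", "refine", "reject"}
	for _, resp := range []string{"Approve", "tighten the auth section first"} {
		label, ok := routeFreeform(resp, labels)
		fmt.Printf("%q -> label=%q matched=%v\n", resp, label, ok)
	}
}
```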
Interview gates let an agent generate structured questions that the user answers via a form:
human ScopeInterview
label: "Help us focus the review."
mode: interview
questions_key: interview_questions
answers_key: scope_answers
The upstream agent writes markdown questions to the questions_key context variable. The parser extracts:
- Numbered/bulleted questions ending in ? or imperative prompts ("Describe...", "List...")
- Inline options from trailing parentheticals: Auth model? (API key, OAuth, JWT) becomes a select field
- Yes/no patterns detected automatically as confirm toggles
The TUI presents a fullscreen form with per-field navigation (arrow keys), pagination (PgUp/PgDn for 10+ questions), elaboration textareas (Tab), and pre-fill from previous answers on retry. Answers are stored as JSON at answers_key and as a markdown summary at human_response. If zero questions are parsed, the gate falls back to freeform. Cancellation returns outcome=fail.
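To make the parsing rules concrete, here is a rough sketch covering two of them: numbered or bulleted lines ending in a question mark, and trailing parentheticals that become select options. The `field` type, the regular expressions, and `parseQuestions` are all hypothetical; imperative prompts and yes/no detection are left out, and the real parser may differ in its details.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// field is a hypothetical form-field shape used only in this sketch.
type field struct {
	Prompt  string
	Kind    string // "text" or "select"
	Options []string
}

var (
	// Leading "1.", "2)", "-" or "*" list markers.
	bulletRe = regexp.MustCompile(`^\s*(?:\d+[.)]|[-*])\s+`)
	// A question followed by a trailing "(a, b, c)" option list.
	optionsRe = regexp.MustCompile(`^(.*\?)\s*\(([^()]+)\)\s*$`)
)

// parseQuestions extracts form fields from markdown list items.
func parseQuestions(markdown string) []field {
	var fields []field
	for _, line := range strings.Split(markdown, "\n") {
		if !bulletRe.MatchString(line) {
			continue // only numbered or bulleted lines are considered
		}
		text := strings.TrimSpace(bulletRe.ReplaceAllString(line, ""))
		if m := optionsRe.FindStringSubmatch(text); m != nil {
			var opts []string
			for _, o := range strings.Split(m[2], ",") {
				opts = append(opts, strings.TrimSpace(o))
			}
			fields = append(fields, field{Prompt: m[1], Kind: "select", Options: opts})
		} else if strings.HasSuffix(text, "?") {
			fields = append(fields, field{Prompt: text, Kind: "text"})
		}
	}
	return fields
}

func main() {
	md := "1. Auth model? (API key, OAuth, JWT)\n- Which services are in scope?\nSome prose."
	for _, f := range parseQuestions(md) {
		fmt.Printf("%s [%s] %v\n", f.Prompt, f.Kind, f.Options)
	}
}
```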
A reusable interview loop pattern is available in examples/subgraphs/interview-loop.dip — embed it via subgraph nodes with topic and focus parameters.
Submit with Ctrl+S. Enter inserts newlines. Esc cancels (empty) or submits (with content). Ctrl+C cancels and unblocks the pipeline (no deadlock).
Tracker supports four LLM providers: anthropic, openai, gemini, and openai-compat (for any OpenAI-compatible API). Set up with:
# Interactive setup wizard
tracker setup
# Verify your configuration
tracker doctor
Keys are stored in ~/.config/2389/tracker/.env. You can also export them directly:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GEMINI_API_KEY=...
Important: Use gemini (not google) as the provider name in .dip files.
Non-retryable provider errors (quota exceeded, auth failure, model not found) immediately fail the pipeline with a clear message instead of silently retrying.
Tracker can route every provider through Cloudflare AI Gateway so you stop hitting rate limits (Anthropic, OpenAI, etc. cap per-account request rates; Cloudflare's gateway capacity is much higher), gain central analytics and caching, and enable model routing on the gateway side.
Set one env var or flag instead of four:
# The root URL of your Cloudflare AI Gateway:
# https://gateway.ai.cloudflare.com/v1/<account_id>/<gateway_slug>
export TRACKER_GATEWAY_URL="https://gateway.ai.cloudflare.com/v1/acc/gw"
# API keys still go to the provider — Cloudflare just proxies.
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GEMINI_API_KEY=...
tracker build_product
Or as a CLI flag:
tracker --gateway-url https://gateway.ai.cloudflare.com/v1/acc/gw build_product
Tracker automatically appends the per-provider suffix:
| Provider | Resolved URL |
|---|---|
| anthropic | <gateway>/anthropic |
| openai | <gateway>/openai |
| gemini | <gateway>/google-ai-studio |
| openai-compat | <gateway>/compat |
Per-provider overrides still win. If you set ANTHROPIC_BASE_URL directly, Anthropic traffic goes there, and the gateway only proxies the providers you haven't explicitly overridden. This means you can point Anthropic at a self-hosted proxy while keeping OpenAI on Cloudflare with one command.
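The precedence described above (per-provider override first, then gateway plus suffix, otherwise the provider's own default) can be sketched as follows. The suffixes come from the provider table in this section; `resolveBaseURL` is a hypothetical name, and trimming a trailing slash from the gateway URL is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// gatewaySuffix mirrors the per-provider suffixes listed in this section.
var gatewaySuffix = map[string]string{
	"anthropic":     "/anthropic",
	"openai":        "/openai",
	"gemini":        "/google-ai-studio",
	"openai-compat": "/compat",
}

// resolveBaseURL picks the base URL for one provider.
// An empty result means "use the provider's default endpoint".
// Hypothetical helper for illustration; not tracker's API.
func resolveBaseURL(provider, override, gateway string) string {
	if override != "" {
		return override // e.g. ANTHROPIC_BASE_URL wins over the gateway
	}
	suffix, ok := gatewaySuffix[provider]
	if gateway == "" || !ok {
		return ""
	}
	return strings.TrimRight(gateway, "/") + suffix
}

func main() {
	gw := "https://gateway.ai.cloudflare.com/v1/acc/gw"
	fmt.Println(resolveBaseURL("anthropic", "https://proxy.internal", gw))
	fmt.Println(resolveBaseURL("openai", "", gw))
	fmt.Println(resolveBaseURL("gemini", "", gw))
}
```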
Troubleshooting:
- 429 from Cloudflare: something bigger is wrong (account-level limits, bad gateway slug). 429s from direct provider calls are what the gateway is meant to prevent.
- 401: check your provider API key, not the gateway; Cloudflare passes auth through.
- Empty responses: verify the gateway slug is correct and the provider is enabled in the Cloudflare dashboard.
Tracker is a three-layer stack: an LLM client (provider adapters and token tracking), an agent session (turn loop, tool execution, context compaction), and a pipeline engine (graph execution, edge routing, checkpoints, decision audit, TUI). The dippin adapter converts parsed .dip IR into tracker's Graph model, and handlers implement per-node behavior.
graph TB
subgraph "Layer 3: Pipeline Engine"
Engine["Graph Execution<br/>Edge Routing<br/>Checkpoints<br/>Decision Audit"]
Handlers["Handlers: start, exit, codergen, tool,<br/>wait.human, parallel, parallel.fan_in,<br/>conditional, subgraph, stack.manager_loop"]
Adapter["Dippin Adapter<br/>IR → Graph"]
TUI["TUI: node list,<br/>activity log, modals"]
end
subgraph "Layer 2: Agent Session"
Session["Tool Execution<br/>Context Compaction<br/>Event Streaming"]
end
subgraph "Layer 1: LLM Client"
Anthropic & OpenAI & Gemini
end
Engine --> Handlers
Engine --> Adapter
Engine --> TUI
Handlers --> Session
Session --> Anthropic & OpenAI & Gemini
For subsystem-level architecture docs, see ARCHITECTURE.md and docs/architecture/.
The terminal UI shows:
- Pipeline panel: node list in topological execution order (Kahn's algorithm) with status lamps, thinking spinners, and tool execution indicators
- Activity log: per-node streaming with line-level formatting (headers, code blocks, bullets), node change separators, multi-node activity indicators for parallel execution, and inline FAILED: / RETRYING: messages when nodes fail or retry
- Subgraph nodes: dynamically inserted and indented under their parent
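For readers unfamiliar with Kahn's algorithm, the ordering used for the pipeline panel can be sketched as below. This is a generic textbook version, not tracker's code: `topoOrder` and the sample node names are hypothetical, ties are broken alphabetically purely to make the output deterministic, and a cycle is reported as an error (how tracker orders pipelines with loop or restart edges is not covered here).

```go
package main

import (
	"fmt"
	"sort"
)

// topoOrder returns nodes in topological order using Kahn's algorithm:
// repeatedly emit a node with no remaining incoming edges.
func topoOrder(edges map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	for from, tos := range edges {
		if _, ok := indegree[from]; !ok {
			indegree[from] = 0
		}
		for _, to := range tos {
			indegree[to]++
		}
	}
	var ready []string
	for node, deg := range indegree {
		if deg == 0 {
			ready = append(ready, node)
		}
	}
	var order []string
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic tie-break (an assumption)
		node := ready[0]
		ready = ready[1:]
		order = append(order, node)
		for _, to := range edges[node] {
			indegree[to]--
			if indegree[to] == 0 {
				ready = append(ready, to)
			}
		}
	}
	if len(order) != len(indegree) {
		return order, fmt.Errorf("graph has a cycle")
	}
	return order, nil
}

func main() {
	edges := map[string][]string{
		"Start": {"Plan"},
		"Plan":  {"ImplA", "ImplB"},
		"ImplA": {"Merge"},
		"ImplB": {"Merge"},
		"Merge": {"Done"},
	}
	order, err := topoOrder(edges)
	fmt.Println(order, err)
}
```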
| Icon | Meaning |
|---|---|
| ○ | Pending — not yet reached |
| 🟡 (spinner) | Running — LLM thinking |
| ⚡ | Running — tool executing |
| ● (green) | Completed successfully |
| ✗ (red) | Failed |
| ↻ (amber) | Retrying |
| ⊘ (dim) | Skipped — pipeline took a different path |
| Key | Action |
|---|---|
| v | Cycle log verbosity (all / tools / errors / reasoning) |
| z | Toggle zen mode (full-width log, sidebar hidden) |
| / | Search the activity log (n/N next/prev, Esc exits) |
| ? | Help overlay with all shortcuts |
| Enter | Drill down into the selected node (Esc exits) |
| y | Copy the visible log to the clipboard |
| Ctrl+O | Toggle expand/collapse tool output |
| Ctrl+S | Submit human gate input |
| Esc | Cancel (empty) or submit (with content) |
| PgUp/PgDn | Scroll review viewport (plan approval) |
| q | Quit |
Every run produces an activity.jsonl log. Live writes go to the integrity-protected path under $XDG_STATE_HOME/tracker/runs/<id>/activity.jsonl (mode 0o600, default $HOME/.local/state/tracker/runs/<id>/, override via TRACKER_AUDIT_DIR; #213). At run-end a sentinel-stripped snapshot is mirrored back to .tracker/runs/<id>/activity.jsonl for bundle export and post-run grep/jq workflows. Captured content:
- Pipeline events: node start/complete/fail, checkpoint saves
- Agent events: LLM turns, tool calls, text output
- Decision events: edge selection (with priority level and context snapshot), condition evaluations (with match results), node outcomes (with token counts), restart detections
Reconstruct any routing decision after the fact (post-run snapshot path):
# See all edge decisions
grep 'decision_edge' .tracker/runs/<id>/activity.jsonl | python3 -m json.tool --json-lines
# See condition evaluations
grep 'decision_condition' .tracker/runs/<id>/activity.jsonl | python3 -m json.tool --json-lines
# See node outcomes with token counts
grep 'decision_outcome' .tracker/runs/<id>/activity.jsonl | python3 -m json.tool --json-lines
Enable git artifacts from the library via the WithGitArtifacts(true) option on the engine builder; the artifact run directory becomes a git repository and each terminal node outcome creates a commit. There is no CLI flag for this today; to produce a portable bundle from any run directory, use the ExportBundle helper or the --export-bundle CLI flag.
node(start): start outcome=success
node(middle): codergen outcome=success
node(end): exit outcome=success
Checkpoint tags (checkpoint/<runID>/<nodeID>) mark each save point.
ExportBundle packages the entire git history — commits and tags — into a single file you can copy anywhere:
// Library usage
result, _ := engine.Run(ctx)
if err := tracker.ExportBundle(result.ArtifactRunDir, "/tmp/run.bundle"); err != nil {
log.Printf("bundle export failed: %v", err)
}
# CLI usage: export bundle after the pipeline completes
tracker --export-bundle /tmp/run.bundle examples/ask_and_execute.dip
# Restore and inspect on any machine with git
git clone /tmp/run.bundle /tmp/run
cd /tmp/run && git log --oneline
The bundle is self-contained: no network access needed. Clone it on another machine, inspect the exact sequence of node outcomes, and replay from any checkpoint tag.
When a pipeline run doesn't go as expected, tracker gives you tools to understand what happened:
tracker diagnose analyzes a run's failures and surfaces the information you need (tool stdout/stderr, error messages, timing anomalies) without manually grepping through JSONL files.
# Diagnose the most recent run
tracker diagnose
# Diagnose a specific run (prefix matching works)
tracker diagnose 7813b
The output shows each failed node with its output, stderr, errors, and actionable suggestions. For example, it will tell you if a tool node failed because of a stale counter file, or if a node completed suspiciously fast (suggesting a configuration issue).
For a broader view of a run's timeline, retries, and recommendations:
# List all runs
tracker list
# Full audit report for a specific run
tracker audit <run-id>
| Symptom | Cause | Fix |
|---|---|---|
| "no LLM providers configured" | Missing API keys | tracker setup or export env vars |
| TestMilestone instantly escalates | Stale fix_attempts counter | rm .ai/milestones/fix_attempts |
| Node fails with no visible error | Tool stderr not surfaced | tracker diagnose shows full output |
| Pipeline loops forever | Unconditional fallback to loop target | Ensure fallbacks go to an exit node (Done, escalation gate), not back into the loop |
| Tool retries same error 5 times | Deterministic command bug | tracker diagnose flags identical retries — fix the command in the .dip file |
| Every milestone needs fixing | known_failures has comments or bad format | Ensure bare test names only, no comments — v0.11.2 strips them automatically |
| Build loop skips all milestones | Milestone headers don't match expected format | Use ## Milestone N: Title format — v0.11.2 is flexible + fails loudly |
Tracker exposes per-provider token and dollar cost from every run, and can halt pipelines that exceed configured ceilings.
Library consumers read cost via Result.Cost:
result, _ := tracker.Run(ctx, source, tracker.Config{
Budget: pipeline.BudgetLimits{
MaxTotalTokens: 100_000,
MaxCostCents: 500, // $5.00
MaxWallTime: 30 * time.Minute,
},
})
if result.Status == pipeline.OutcomeBudgetExceeded {
log.Printf("halt: %s, spent $%.4f", result.Cost.LimitsHit, result.Cost.TotalUSD)
}
for provider, pc := range result.Cost.ByProvider {
log.Printf("%s: %d tokens, $%.4f", provider, pc.Usage.InputTokens+pc.Usage.OutputTokens, pc.USD)
}
CLI users pass flags directly to tracker:
tracker --max-tokens 100000 --max-cost 500 --max-wall-time 30m \
examples/ask_and_execute.dip
A halted run prints a HALTED: budget exceeded section naming the dimension that tripped. Run tracker diagnose to see the per-provider breakdown and remediation guidance.
Streaming consumers subscribe to EventCostUpdated via
tracker.Config.EventHandler. Each terminal-node outcome emits a
CostSnapshot with aggregate tokens, dollar cost, per-provider totals,
and wall-clock elapsed time.
Budget ceilings can also be declared inline in the workflow's defaults: block (v0.19.0) and act as the fallback when neither Config.Budget nor the matching --max-* CLI flags are set:
workflow MyPipeline
defaults
model: claude-sonnet-4-6
max_total_tokens: 100000
max_cost_cents: 500
max_wall_time: 30m
Explicit library/CLI values still win over the .dip defaults.
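A small sketch of that precedence, assuming it is applied per field and that zero means "not set / no limit" (consistent with the 0 = no limit flag semantics). The `limits` type, `resolveLimits`, and `exceeded` are hypothetical names, and wall-time handling is omitted for brevity.

```go
package main

import "fmt"

// limits holds two of the budget dimensions; zero means "not set / no limit".
// Hypothetical type for illustration; not tracker's BudgetLimits.
type limits struct {
	MaxTotalTokens int
	MaxCostCents   int
}

// pick returns the explicit value when set, otherwise the .dip default.
func pick(explicit, dipDefault int) int {
	if explicit != 0 {
		return explicit
	}
	return dipDefault
}

// resolveLimits merges explicit (library/CLI) limits over .dip defaults.
func resolveLimits(explicit, dipDefaults limits) limits {
	return limits{
		MaxTotalTokens: pick(explicit.MaxTotalTokens, dipDefaults.MaxTotalTokens),
		MaxCostCents:   pick(explicit.MaxCostCents, dipDefaults.MaxCostCents),
	}
}

// exceeded names the first dimension that tripped, or "" if within budget.
func exceeded(l limits, tokens, costCents int) string {
	if l.MaxTotalTokens > 0 && tokens > l.MaxTotalTokens {
		return "tokens"
	}
	if l.MaxCostCents > 0 && costCents > l.MaxCostCents {
		return "cost"
	}
	return ""
}

func main() {
	dip := limits{MaxTotalTokens: 100000, MaxCostCents: 500}
	cli := limits{MaxCostCents: 200} // e.g. only --max-cost 200 was passed
	effective := resolveLimits(cli, dip)
	fmt.Printf("%+v tripped=%q\n", effective, exceeded(effective, 50000, 250))
}
```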
--webhook-url enables fully headless operation: instead of pausing the pipeline to wait for a human at a terminal, tracker POSTs every human gate as JSON to your URL and waits for a callback.
This is the integration point for Slack bots, email approval flows, mobile push notifications, factory workers, or any custom approval system.
- A human gate fires → tracker POSTs a WebhookGatePayload to --webhook-url.
- Your service receives the payload, routes it to a human (Slack message, email, etc.).
- The human responds → your service POSTs a WebhookGateResponse to the callback_url field.
- Tracker resumes the pipeline with the human's answer.
tracker --webhook-url https://factory.example.com/api/gate \
--gate-timeout 30m \
--gate-timeout-action fail \
--webhook-auth "Bearer sk_live_..." \
examples/build_product.dip
| Flag | Default | Description |
|---|---|---|
| --webhook-url | (required to enable) | URL to POST gate payloads to |
| --gate-callback-addr | :8789 | Local addr for the inbound callback server |
| --gate-timeout | 10m | How long to wait for a reply per gate |
| --gate-timeout-action | fail | What to do on timeout: fail or success |
| --webhook-auth | (empty) | Authorization header on outbound POSTs |
--webhook-url is mutually exclusive with --autopilot and --auto-approve.
Tracker POSTs JSON with this shape:
{
"gate_id": "uuid",
"run_id": "optional-run-id",
"node_id": "ApproveSpec",
"prompt": "Review the spec. Approve, refine, or reject.",
"choices": [{"label": "approve", "value": "approve"}, ...],
"callback_url": "http://localhost:8789/gate/f47ac10b-58cc-4372-a567-0e02b2c3d479",
"timeout_seconds": 1800,
"gate_token": "per-gate-secret"
}
Your service POSTs back to callback_url with:
{
"choice": "approve",
"freeform": "optional free-text response",
"reasoning": "optional explanation"
}
Include the gate_token value in the X-Tracker-Gate-Token header; the callback server rejects requests with missing or wrong tokens (HTTP 401).
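On the receiving side, a service might decode the gate payload and build the callback request roughly like this. The JSON field names and the X-Tracker-Gate-Token header come from the shapes shown above; the Go struct names, `buildCallback`, and the Content-Type header are assumptions of this sketch, which builds the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// gatePayload holds the subset of the outbound payload this sketch needs.
type gatePayload struct {
	GateID      string `json:"gate_id"`
	NodeID      string `json:"node_id"`
	Prompt      string `json:"prompt"`
	CallbackURL string `json:"callback_url"`
	GateToken   string `json:"gate_token"`
}

// gateResponse mirrors the callback body; optional fields are omitted when empty.
type gateResponse struct {
	Choice    string `json:"choice"`
	Freeform  string `json:"freeform,omitempty"`
	Reasoning string `json:"reasoning,omitempty"`
}

// buildCallback prepares the POST back to callback_url with the gate token.
// Hypothetical helper for illustration; Content-Type is an assumption.
func buildCallback(p gatePayload, r gateResponse) (*http.Request, error) {
	body, err := json.Marshal(r)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, p.CallbackURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Tracker-Gate-Token", p.GateToken)
	return req, nil
}

func main() {
	raw := []byte(`{"gate_id":"uuid","node_id":"ApproveSpec","prompt":"Review the spec.",
		"callback_url":"http://localhost:8789/gate/uuid","gate_token":"per-gate-secret"}`)
	var p gatePayload
	if err := json.Unmarshal(raw, &p); err != nil {
		panic(err)
	}
	req, err := buildCallback(p, gateResponse{Choice: "approve"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("X-Tracker-Gate-Token"))
	// Send with http.DefaultClient.Do(req) while a tracker run is waiting on the gate.
}
```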
⚠️ Stability note (pre-v1.0): tracker's library API is usable now, but breaking changes may still happen between minor releases while the surface is finalized. Check CHANGELOG.md before upgrading.
Library consumers set tracker.Config.WebhookGate instead of using CLI flags:
result, _ := tracker.Run(ctx, source, tracker.Config{
WebhookGate: &tracker.WebhookGateConfig{
WebhookURL: "https://factory.example.com/api/gate",
CallbackAddr: ":8789",
Timeout: 30 * time.Minute,
TimeoutAction: "fail",
AuthHeader: "Bearer sk_live_...",
},
})
import (
"context"
"fmt"
"log"
tracker "github.com/2389-research/tracker"
)
ctx := context.Background()
report, err := tracker.DiagnoseMostRecent(ctx, ".")
if err != nil { log.Fatal(err) }
for _, f := range report.Failures {
fmt.Printf("failed: %s (handler=%s, retries=%d)\n",
f.NodeID, f.Handler, f.RetryCount)
}
for _, s := range report.Suggestions {
fmt.Printf(" %s: %s\n", s.Kind, s.Message)
}
tracker.Audit, tracker.DiagnoseMostRecent, tracker.Simulate, and tracker.Doctor all accept context.Context as their first argument and return JSON-serializable reports. tracker.ListRuns and DiagnoseMostRecent/Diagnose accept an optional config (AuditConfig, DiagnoseConfig) with a LogWriter for non-fatal parse warnings; if LogWriter is left unset, warnings are discarded, so embedded callers are silent by default. Set LogWriter to something like os.Stderr (or another writer/logger sink) if you want to receive those warnings. Audit and Simulate currently take just ctx (plus their payload); Doctor takes a required DoctorConfig plus optional functional options (e.g., tracker.WithVersionInfo).
If you currently shell out to tracker diagnose and scrape stdout, migrate to
tracker.Diagnose() / tracker.DiagnoseMostRecent() and read
DiagnoseReport directly instead of parsing formatted CLI text.
To stream events programmatically in the same NDJSON format as tracker --json, use tracker.NewNDJSONWriter:
w := tracker.NewNDJSONWriter(os.Stdout)
result, _ := tracker.Run(ctx, source, tracker.Config{
EventHandler: w.PipelineHandler(),
AgentEvents: w.AgentHandler(),
})
tracker [flags] <pipeline> Run a pipeline (file path or built-in name)
tracker workflows List built-in workflows
tracker init <workflow> Copy a built-in to current directory
tracker setup Interactive provider configuration
tracker validate <pipeline> Check pipeline structure
tracker simulate <pipeline> Dry-run execution plan
tracker doctor Preflight health check
tracker diagnose [runID] Analyze failures in a run
tracker audit <runID> Full audit report for a run
tracker list List recent pipeline runs
tracker update Self-update to the latest GitHub release
tracker version Show version information
Flags:
- -w, --workdir: working directory (default: current)
- -r, --resume: resume a previous run by ID
- --format: pipeline format override, dip (default) or dot (legacy; emits a deprecation warning)
- --json: stream events as NDJSON to stdout
- --no-tui: disable TUI dashboard, use plain console
- --verbose: show raw provider stream events
- --backend: agent backend, one of native (default), claude-code, or acp
- --autopilot <persona>: replace human gates with an LLM judge (lax/mid/hard/mentor)
- --auto-approve: deterministically accept every human gate (no LLM)
- --param key=value: override a declared workflow var at run time (repeatable)
- --artifact-dir: override the node state directory (default: <workdir>/.tracker/runs)
- --max-tokens: halt if total tokens across the run exceed this value (0 = no limit)
- --max-cost: halt if total cost in cents exceeds this value (0 = no limit)
- --max-wall-time: halt if pipeline wall time exceeds this duration (0 = no limit)
- --gateway-url: Cloudflare AI Gateway root URL (per-provider *_BASE_URL env vars win)
- --webhook-url: POST human gate prompts to this URL and wait for callback (headless)
- --gate-callback-addr: local addr for the webhook callback server (default: :8789)
- --gate-timeout: per-gate wait timeout when --webhook-url is set (default: 10m)
- --gate-timeout-action: what to do on gate timeout, fail (default) or success
- --webhook-auth: Authorization header for outbound webhook requests
- --export-bundle: write a portable git bundle of run artifacts to the given path after completion
- --bypass-denylist: disable the built-in tool command denylist (prints a stderr warning; sandboxed use only)
- --tool-allowlist <pattern>: glob pattern a tool command must match to execute (repeatable or comma-separated)
- --max-output-limit <bytes>: hard ceiling per tool command output stream (default: 10MB)
# Run tests
go test ./... -short
# Validate all example pipelines
for f in examples/*.dip; do tracker validate "$f"; done
# Run dippin simulation tests
for f in examples/*.dip; do dippin test "$f"; done
# Check with dippin-lang tools
dippin doctor examples/build_product.dip
dippin simulate -all-paths examples/build_product.dip
See LICENSE.