Composite work, surfaced as a plan. Approved before it runs.
When a brief is bigger than one call, Sage's flow turns it into a graph of atomic TaskEscrow records — generated dynamically per-brief by an LLM classifier, surfaced as a structured artifact the user reviews before any on-chain spawn. The pattern is dynamic (plans aren't hardcoded workflows) and observable (decomposition is visible, not buried in agent-side context). This is Sage's angle per ADR-0008; the design lives in ADR-0007. Live at /demo/composite.
What it is
A composite task is a brief that decomposes into several atomic settlement records, each its own createTask → acceptTask → completeTask → approvePayment cycle. The flow doesn't replace the primitive — it composes on top of it.
brief → classify → plan card → approve / edit / cancel
                                  │
                                  ↓ (only after explicit approval)
    ┌──── sub-task #1 ────┐   ┌──── sub-task #2 ────┐
    │ createTask          │   │ createTask          │
    │ acceptTask          │   │ acceptTask          │
    │ completeTask        │   │ completeTask        │
    │ approvePayment      │   │ approvePayment      │
    └─────────────────────┘   └─────────────────────┘
                                  ↓
                             plan settled
The user sees the full graph (sub-task types, executors, costs, dependencies, total estimated duration) before any sponsor money moves. Approve commits the plan. Edit splices in changes (reorder, re-assign, drop). Cancel returns the run to idle.
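The approve / edit / cancel step can be pictured as pure functions over a plan snapshot. This is a minimal sketch, not Sage's implementation: the sub-task field names follow the ClassificationResult description elsewhere on this page, while `PlanRun`, the `"approved"` status, and the rule that dropping a sub-task also removes references to it are assumptions made for illustration.

```typescript
// Hypothetical plan snapshot; only the sub-task field names come from the page.
interface SubTask {
  id: number;
  type: string;
  spec: string;
  estimated_cost_units: number;
  deadline_offset_s: number;
  depends_on: number[];
}

type PlanStatus = "plan-ready" | "approved" | "idle";

interface PlanRun {
  status: PlanStatus;
  subTasks: SubTask[];
}

// Total shown on the plan card before the user approves anything.
function totalCost(run: PlanRun): number {
  return run.subTasks.reduce((sum, t) => sum + t.estimated_cost_units, 0);
}

// Edit: drop one sub-task and remove dangling depends_on references to it
// (an assumed policy; the real editor may reject such a drop instead).
function dropSubTask(run: PlanRun, id: number): PlanRun {
  const subTasks = run.subTasks
    .filter((t) => t.id !== id)
    .map((t) => ({ ...t, depends_on: t.depends_on.filter((d) => d !== id) }));
  return { ...run, subTasks };
}

// Approve commits whatever snapshot is current; nothing spawns before this.
function approve(run: PlanRun): PlanRun {
  if (run.status !== "plan-ready") throw new Error("nothing to approve");
  return { ...run, status: "approved" };
}

// Cancel returns the run to idle without touching the chain.
function cancel(run: PlanRun): PlanRun {
  return { ...run, status: "idle" };
}
```

Keeping the edits as functions that return a new snapshot matches the "Save replaces the plan snapshot" behavior described in the lifecycle.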
Why externalize the plan
The naive shape for multi-step agent work is to hand the whole brief to one agent and let it decide the steps internally. Plan generation, sub-task spawning, result aggregation — all in one LLM's context. This is the path of least implementation resistance and almost universally what platforms ship.
The cost shows up later. The decomposition exists only inside the agent's context window: no on-chain record of which sub-tasks ran, no per-step approval, no granular dispute, no indexable graph for downstream tools. When the work goes wrong, there's no surface to act on except "run the whole thing again."
Sage's bet: the decomposition is more valuable as a first-class artifact than as agent-internal state. Externalizing it means:
- Per-step verification. Each approvePayment is a discrete checkpoint — client signs off only when the result for that sub-task is acceptable. The fund-release cadence matches the work-delivery cadence.
- Granular dispute. One sub-task can be paused, retried, or routed to a different executor without unwinding the work that already settled. Compare to a monolithic delivery where dispute means rerun the entire pipeline.
- Indexable lineage + faithful content. Sub-tasks carry a structured envelope in their specUri — { parent, spec, source?, inputs? } per ADR-0018. parent lets an off-chain indexer rebuild the parent-child graph from TaskCreated events alone; source attaches the original brief payload to a root sub-task and inputs carries an upstream step's output to a dependent one — so a worker sees the real material, not a truncated instruction, and a chain (translate → summarize) passes results forward. No proprietary platform state required.
- Pre-execute review. The user sees the plan before the first sub-task spawns. Most of the cost of agentic work goes to wrong-thing-built, not slow-thing-built; structured pre-review converts the cheapest kind of feedback (plan-time) from impossible to routine.
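The lineage claim above can be sketched as an indexer that reads only TaskCreated events. The envelope keys (`parent`, `spec`, `source`, `inputs`) come from the text; everything else is an assumption for illustration: that `parent` is a string id, that the specUri payload is plain JSON text, and the `TaskCreatedEvent` shape itself.

```typescript
// Envelope keys follow the page; the value types are assumed.
interface SpecEnvelope {
  parent: string;
  spec: string;
  source?: string;
  inputs?: string;
}

// Hypothetical, simplified view of a decoded TaskCreated event.
interface TaskCreatedEvent {
  taskId: string;
  specUri: string; // assumed to carry the envelope as JSON text
}

// Returns null for anything that is not a well-formed envelope.
function parseEnvelope(specUri: string): SpecEnvelope | null {
  let raw: unknown;
  try {
    raw = JSON.parse(specUri);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  const parent = obj["parent"];
  const spec = obj["spec"];
  if (typeof parent !== "string" || typeof spec !== "string") return null;
  const envelope: SpecEnvelope = { parent, spec };
  const source = obj["source"];
  const inputs = obj["inputs"];
  if (typeof source === "string") envelope.source = source;
  if (typeof inputs === "string") envelope.inputs = inputs;
  return envelope;
}

// Groups child task ids under their parent, using events alone.
function buildLineage(events: TaskCreatedEvent[]): Map<string, string[]> {
  const children = new Map<string, string[]>();
  for (const ev of events) {
    const envelope = parseEnvelope(ev.specUri);
    if (envelope === null) continue; // not a composite sub-task
    const siblings = children.get(envelope.parent) ?? [];
    siblings.push(ev.taskId);
    children.set(envelope.parent, siblings);
  }
  return children;
}
```

In a translate → summarize chain, the first sub-task would carry `source` and the second would carry `inputs`, both under the same `parent`.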
Trigger axes — decomposability × stakes
Not every brief should go through the full plan-then-execute UX — a one-shot summarize call shouldn't surface a graph with one node. A classifier reads the brief along two axes and the resulting quadrant determines UX intensity. Operationalized definitions are in classification-trigger-design.md.
Confidence is asymmetric. The classifier emits confidence_decomposability and confidence_stakes per call, and a deterministic heuristic cross-check (regex on the brief) halves the confidence when its own signals contradict the LLM. When both confidences drop below threshold, the system defaults to the cautious quadrant — composite / high — because under-protecting is the worse failure mode.
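A rough sketch of the cross-check and the cautious default follows. The two confidence field names and the composite / high fallback come from the text; the `"atomic"` and `"low"` labels, the `HeuristicSignals` shape, and the 0.5 threshold are placeholders, since the page does not state them.

```typescript
type Decomposability = "atomic" | "composite"; // "atomic" is an assumed label
type Stakes = "low" | "high"; // "low" is an assumed label

interface Classification {
  decomposability: Decomposability;
  stakes: Stakes;
  confidence_decomposability: number;
  confidence_stakes: number;
}

// What the regex pass over the brief found (hypothetical shape).
interface HeuristicSignals {
  composite: boolean;
  stakes: boolean;
}

const THRESHOLD = 0.5; // assumed value, not from the source

// Halve a confidence only when the heuristic saw a signal the LLM did not label.
function crossCheck(c: Classification, h: HeuristicSignals): Classification {
  const missedComposite = h.composite && c.decomposability !== "composite";
  const missedStakes = h.stakes && c.stakes !== "high";
  return {
    ...c,
    confidence_decomposability: missedComposite
      ? c.confidence_decomposability / 2
      : c.confidence_decomposability,
    confidence_stakes: missedStakes ? c.confidence_stakes / 2 : c.confidence_stakes,
  };
}

// When both confidences are below threshold, fall back to the cautious quadrant.
function route(c: Classification): { decomposability: Decomposability; stakes: Stakes } {
  if (c.confidence_decomposability < THRESHOLD && c.confidence_stakes < THRESHOLD) {
    return { decomposability: "composite", stakes: "high" };
  }
  return { decomposability: c.decomposability, stakes: c.stakes };
}
```

The correction only ever lowers confidence, which is what makes it asymmetric: the heuristic can push a brief toward more ceremony but never toward less.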
Lifecycle
From brief submission to plan-settled. Each transition emits an SSE event over the same channel the 3-mode demo uses, so existing layout primitives render the live state without mode-specific plumbing.
- classify. Brief → POST /api/demo/composite/classify. Returns a ClassificationResult: axis labels, confidences, proposed plan, reasoning trace, signal trace. The plan card renders from this; the run is still in plan-ready — no on-chain transactions have happened.
- approve / edit / cancel. The user picks one. Edit re-opens the plan-editor — reorder, re-assign executor, drop sub-tasks. Save replaces the plan snapshot; Approve commits whatever's current.
- execute. Approved plan → POST /api/demo/composite/execute. Server returns a runId immediately and kicks off runPlan in the background; client attaches to GET /api/demo/composite/stream/:runId for SSE lifecycle events.
- per sub-task. Each sub-task fires subtask_created → subtask_accepted → subtask_completed → subtask_paid. The plan-graph re-renders node colors live; per-node drawer surfaces tx hashes + decoded result.
- plan settled. Final plan_completed closes the channel. Run state stays queryable; the user can start a new plan or hand off via the shareable URL (chain selector preserved via ?chain=…).
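On the client, the lifecycle events listed above can be folded into per-node state with a small reducer. The event names are taken from the list; the payload shape (`subTaskId`) and the `RunView` type are assumptions, and the wire format of the real SSE stream may differ.

```typescript
type NodeState = "created" | "accepted" | "completed" | "paid";

// Event names come from the page; the payload fields are assumed.
type PlanEvent =
  | { type: "subtask_created"; subTaskId: number }
  | { type: "subtask_accepted"; subTaskId: number }
  | { type: "subtask_completed"; subTaskId: number }
  | { type: "subtask_paid"; subTaskId: number }
  | { type: "plan_completed" };

interface RunView {
  nodes: Map<number, NodeState>;
  settled: boolean;
}

const STATE_FOR_EVENT = {
  subtask_created: "created",
  subtask_accepted: "accepted",
  subtask_completed: "completed",
  subtask_paid: "paid",
} as const;

// Pure reducer: returns a new view so a UI can re-render on every event.
function applyEvent(view: RunView, ev: PlanEvent): RunView {
  if (ev.type === "plan_completed") {
    return { nodes: view.nodes, settled: true };
  }
  const nodes = new Map(view.nodes);
  nodes.set(ev.subTaskId, STATE_FOR_EVENT[ev.type]);
  return { nodes, settled: view.settled };
}
```

A browser client would typically call `applyEvent` from an `EventSource` message handler attached to the stream endpoint and recolor the plan graph from the resulting map.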
Dependencies between sub-tasks are honored by topological-sorted sequential execution — depends_on: [1] on a sub-task delays its spawn until #1's approvePayment receipt lands (this avoids sponsor-nonce races; same constraint as the 3-mode demo). Parallel fan-out across independent sub-tasks is a tracked future motion, not v1.
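The ordering rule can be sketched as a simple topological ordering followed by a strictly sequential loop. `PlanNode`, `topoOrder`, `runSequentially`, and the `settle` callback are hypothetical names; the callback stands in for the full createTask → approvePayment cycle and is expected to resolve only once the payment receipt has landed.

```typescript
interface PlanNode {
  id: number;
  depends_on: number[];
}

// Repeatedly pick the first task (in plan order) whose dependencies are done.
// Quadratic, which is fine for plans with a handful of sub-tasks.
function topoOrder(tasks: PlanNode[]): number[] {
  const known = tasks.map((t) => t.id);
  for (const t of tasks) {
    for (const d of t.depends_on) {
      if (known.indexOf(d) === -1) throw new Error(`unknown dependency ${d}`);
    }
  }
  const done: number[] = [];
  let pending = tasks.slice();
  while (pending.length > 0) {
    const ready = pending.find((t) => t.depends_on.every((d) => done.indexOf(d) !== -1));
    if (ready === undefined) throw new Error("dependency cycle");
    done.push(ready.id);
    pending = pending.filter((t) => t !== ready);
  }
  return done;
}

// Sequential execution: the next sub-task spawns only after the previous one
// has fully settled, so the sponsor never has two transactions in flight.
async function runSequentially(
  tasks: PlanNode[],
  settle: (id: number) => Promise<void>,
): Promise<void> {
  for (const id of topoOrder(tasks)) {
    await settle(id);
  }
}
```

Running everything in one sequence is stricter than the dependency graph requires, which is the trade the text describes: independent sub-tasks wait for each other in exchange for avoiding nonce races.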
High-stakes defense (three layers)
The stakes:high classification means "before you spawn, make the user pick deliberately." The intent has to survive accidental UI bypasses and LLM-emitted artifacts that coincidentally pass spot-checks. Defense-in-depth runs in three layers; all three are exercised by the canonical brief Send 0.1 USDC to 0x….
The order matters. Layer 1's check runs before any LLM-emitted executor_address is trusted. The ordering comes from a failure observed on the first run of Send 0.1 USDC to 0x0DA5…: the LLM echoed the recipient address into executor_address, that address happened to match a known worker EOA, and the trust-known-worker check, which then ran first, silently passed. Reordering closed the gap; the stakes axis is now authoritative regardless of type-stem coverage.
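The ordering lesson can be illustrated with a small gate function. This is only a sketch of the check order described above, not the three layers themselves: the `Gate` outcomes, the `gateSubTask` name, and the worker list are hypothetical.

```typescript
// Hypothetical sub-task shape as emitted by the classifier.
interface ProposedSubTask {
  type: string;
  executor_address?: string;
}

// Hypothetical outcomes for the pre-spawn gate.
type Gate = "require_deliberate_pick" | "auto_assign" | "manual_pick";

function gateSubTask(
  stakes: "low" | "high",
  sub: ProposedSubTask,
  knownWorkers: string[],
): Gate {
  // Stakes first: a high-stakes brief never reaches the address check,
  // so a coincidental match with a known worker cannot bypass the gate.
  if (stakes === "high") return "require_deliberate_pick";

  // Only for low-stakes briefs is a known-worker match considered at all.
  const addr = sub.executor_address?.toLowerCase();
  if (addr !== undefined && knownWorkers.some((w) => w.toLowerCase() === addr)) {
    return "auto_assign";
  }
  return "manual_pick";
}
```

Swapping the two checks reproduces the failure from the first run: the address match would return before the stakes label is ever read.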
Classifier model
The classifier is an LLM call with function-calling, not a fine-tuned model. Same idea as the 3-mode workers — different prompt, different output schema.
- Backend. OpenAI gpt-4o-mini via the tools (function-calling) API. A mock template serves as the fallback when OPENAI_API_KEY is unset (covers local dev and initial e2e tests).
- Output shape. ClassificationResult — decomposability + stakes labels with confidences, proposed plan (sub-tasks with type, spec, estimated_cost_units, deadline_offset_s, depends_on), reasoning string, signal trace (lexical / semantic / stakes cues that fired).
- Heuristic cross-check. Pure-function pass over the brief: composite verbs (plan, research, top-N quantifiers), stakes verbs (send, transfer), $-value regex. When the heuristic flags signals the LLM missed, the corresponding confidence halves: an asymmetric correction toward caution.
- Executor resolution. The LLM classifies capability, never the executor address — any model-emitted address is stripped. After classify, each sub-task's capability is resolved against AgentRegistryV2: the cheapest active agent advertising it wins, and its registry price (not an LLM estimate) fills estimated_cost_units. A registry miss leaves the sub-task unassigned for a manual pick. This is the platform substrate — see foreign agents.
- Failure handling. Retry once on malformed / 5xx; if the second attempt fails, return a degraded result with confidence_*=0 — forces the maximum-ceremony quadrant. Better to over-protect on uncertainty than to under-protect.
- Trace logging. 5 JSON events per pass (started → llm_attempt → raw → heuristic_applied → completed) on stderr. Ready for PostHog ingestion when calibration data starts to matter.
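The executor-resolution rule from the list above can be sketched against an in-memory stand-in for the registry. The real AgentRegistryV2 interface is not shown on this page, so `RegisteredAgent`, `resolveExecutor`, and the field names other than `executor_address` and `estimated_cost_units` are assumptions.

```typescript
// In-memory stand-in for registry entries (assumed shape).
interface RegisteredAgent {
  address: string;
  capabilities: string[];
  priceUnits: number;
  active: boolean;
}

// What the classifier hands over; any address it emitted is ignored.
interface ClassifiedSubTask {
  capability: string;
  executor_address?: string;
  estimated_cost_units?: number;
}

interface ResolvedSubTask {
  capability: string;
  executor_address: string | null; // null means "unassigned, pick manually"
  estimated_cost_units: number | null;
}

function resolveExecutor(
  sub: ClassifiedSubTask,
  registry: RegisteredAgent[],
): ResolvedSubTask {
  // Only active agents that advertise the capability are candidates.
  const candidates = registry.filter(
    (a) => a.active && a.capabilities.indexOf(sub.capability) !== -1,
  );
  if (candidates.length === 0) {
    return { capability: sub.capability, executor_address: null, estimated_cost_units: null };
  }
  // Cheapest wins; its registry price replaces any LLM estimate.
  const cheapest = candidates.reduce((best, a) => (a.priceUnits < best.priceUnits ? a : best));
  return {
    capability: sub.capability,
    executor_address: cheapest.address,
    estimated_cost_units: cheapest.priceUnits,
  };
}
```

Because the result is built from scratch rather than spread from the classifier output, a model-emitted address or cost has no path into the resolved sub-task.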
The known calibration weakness: LLM-self-reported confidence is systematically over-confident. Empirical calibration via user override-tracking (when does the user reject the auto-route?) is the v2 path; multi-LLM ensemble + logit-based scoring is the v3 path. Both deferred until there's empirical data.
Dispute path — review gate, council, appeal
A completed sub-task isn't paid blindly. With review mode on (an opt-in toggle on the plan card; it is off by default, which keeps the existing auto-approve behavior), each Completed sub-task pauses before payment and the client picks:
- Approve & pay. approvePayment releases the escrowed USDC; the plan continues. Silence past the review window auto-approves — it mirrors the on-chain auto-release-after-grace, so an absent client never strands a delivered result.
- Dispute + reason. disputeTask(reason) freezes the funds and hands the case to an off-chain council — a single gpt-4o-mini judge in v1 (ADR-0019) that reads the spec, the executor's result, and the reason, then returns a verdict: worker (pay in full), client (refund), or split (an executorSharePct of the amount). A configured arbiter EOA executes that verdict on-chain via resolveDispute → Paid / Refunded / Split (ADR-0017).
The council is conservative: if the LLM judge fails twice it degrades to client (refund) — don't pay for an unverified result. A worker or split verdict leaves the result usable and the plan continues; a client refund ends the run with plan_failed (dispute_refunded). After any verdict that didn't fully favor the client, an Appeal button surfaces a second-level human-arbiter review — a stub in this demo (the council verdict is final here), with the real contract appeal window + dedicated arbiter left as future hardening. Trust posture is honest: in the demo the sponsor, client, and arbiter collapse to one party.
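The verdict handling can be sketched as two small functions: one that applies the conservative fallback, one that turns a verdict into amounts. The verdict kinds, `executorSharePct`, and the Paid / Refunded / Split outcomes come from the text; the synchronous `judge` callback, the integer base-unit amounts, and rounding the executor share down are assumptions.

```typescript
type Verdict =
  | { kind: "worker" }
  | { kind: "client" }
  | { kind: "split"; executorSharePct: number };

interface Settlement {
  outcome: "Paid" | "Refunded" | "Split";
  toExecutor: number; // integer base units (assumed representation)
  toClient: number;
}

// Two attempts at most; if both fail, refund rather than pay for an
// unverified result. A real judge call would be async; this is simplified.
function decideWithFallback(judge: () => Verdict): Verdict {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return judge();
    } catch {
      // fall through to the next attempt
    }
  }
  return { kind: "client" };
}

function settle(amount: number, verdict: Verdict): Settlement {
  switch (verdict.kind) {
    case "worker":
      return { outcome: "Paid", toExecutor: amount, toClient: 0 };
    case "client":
      return { outcome: "Refunded", toExecutor: 0, toClient: amount };
    case "split": {
      const pct = verdict.executorSharePct;
      if (pct < 0 || pct > 100) throw new Error("executorSharePct out of range");
      // Round the executor share down so the two parts always sum to amount.
      const toExecutor = Math.floor((amount * pct) / 100);
      return { outcome: "Split", toExecutor, toClient: amount - toExecutor };
    }
  }
}
```

For a 100000-unit escrow and a split verdict with `executorSharePct` of 30, this yields 30000 to the executor and 70000 back to the client.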
Separately, when re-running is a better remedy than adjudicating escrow, the plan can recover a sub-task by re-spawning rather than disputing: Retry (same executor, fresh deadline), Change executor (route to a different agent), or Cancel (settled sub-tasks stay settled, pending ones return to idle; a 2-minute pause timeout is treated as Cancel). Server side: POST /api/demo/composite/review-decision resolves the review gate and /retry-subtask resolves the replan; the plan-runner consumes either via an in-memory run-registry rendezvous.
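The recovery rules can be expressed as two pure functions. The three choices, the 2-minute timeout, and the "settled stays settled, pending returns to idle" rule come from the text; the type names, the millisecond timestamps, and the `"failed"` state are assumptions for illustration.

```typescript
type Recovery = "retry" | "change_executor" | "cancel";

const PAUSE_TIMEOUT_MS = 2 * 60 * 1000; // the 2-minute pause timeout

// An explicit choice wins; otherwise the pause resolves to cancel once the
// timeout has elapsed, and stays undecided (null) before that.
function effectiveDecision(
  choice: Recovery | null,
  pausedAtMs: number,
  nowMs: number,
): Recovery | null {
  if (choice !== null) return choice;
  return nowMs - pausedAtMs >= PAUSE_TIMEOUT_MS ? "cancel" : null;
}

// Hypothetical per-sub-task states for the run registry.
type SubTaskState = "pending" | "settled" | "idle" | "failed";

// Cancel leaves settled work alone and returns pending sub-tasks to idle.
function applyCancel(states: SubTaskState[]): SubTaskState[] {
  return states.map((s) => (s === "pending" ? "idle" : s));
}
```

Passing the clock in as `nowMs` keeps the timeout rule testable without timers.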
When not to reach for it
Composite plans are the right shape for briefs that genuinely decompose. They're the wrong shape for two adjacent cases:
- One-shot tasks. A simple "summarize this article" brief decomposes into one sub-task; running it through the plan-card UX adds a review-and-approve step the user doesn't need. The classifier short-circuits these to direct execute. Manually targeting the 3-mode /demo path is also fine — same TaskEscrow primitive underneath.
- Static workflows. If your pipeline is "always summarize → always translate → always ship" with no per-brief variation, the classifier's decomposition adds machinery you don't need. Wire the steps directly via the SDK in your orchestrator. Sage's escrow primitive doesn't require the plan-then-execute UX; that UX is a layer on top for the cases where the decomposition is genuinely dynamic.
Source pointers
- ADR-0007: Observable decomposition
- ADR-0008: Sage angle / position
- ADR-0017: Task escrow arbitration
- ADR-0018: Composite content envelope
- ADR-0019: Off-chain council v1
- research/observable-decomposition.md
- research/classification-trigger-design.md
- apps/demo-agents/src/parent/ (orchestrator)
- apps/web/app/demo/composite/ (frontend)
- blog/observable-decomposition-shipped.md
- Try it live → /demo/composite
Foreign agents are the platform substrate the classifier routes through: anyone can register an agent, undercut the price, and get picked. Permissionless by construction.