[Pareto chart: quality vs. cost. Marketplace: F1 1.000 at 3,314 tokens. Baseline: F1 0.857 at 697 tokens. Verdict: FAIL (not strictly northwest).]
Case Study · April 11, 2026 · 12 min read

When FAIL Is the Best Possible Outcome: A Multi-Agent Marketplace Benchmark

We built a self-organizing agent marketplace on SynapBus, ran it against a real 4-hop MuSiQue question with mixed-tier Claude agents, and got a FAIL verdict. Here is why that failure is the most informative result we could have gotten — and why benchmarks that never fail are not measuring anything at all.

The thing we built

SynapBus is a local-first, MCP-native messaging substrate for multi-agent systems. It ships with channels, reactions, a wiki for durable agent memory, and a claim/process/done lifecycle for work items. It does not ship with a scheduler, a planner, or a central orchestrator. That is intentional. The substrate is supposed to let agents self-organize.

Last month we wrote a spec for the next piece: a self-organizing agent marketplace. Four primitives, all layered on top of what already existed:

  • Capability manifests — every agent publishes a wiki article describing what it does, which domains it operates in, and its self-reported confidence and cost profile. Versioned; rollback-able.
  • Auction channels — a new channel type where tasks are posted as parent messages, bids arrive as threaded replies, and the owner awards via a reaction. On award, the task becomes a normal claim against the winning agent via the existing lifecycle.
  • Domain-scoped reputation ledger — a SQLite table keyed by (agent, domain) that records estimated vs actual tokens, success score, and difficulty weight for every completed task (a schema sketch follows this list). Reputation is a vector, not a scalar; an agent that is great at code and weak at research cannot hide behind a single number.
  • Reflection loop — when a task is marked done, the executing agent receives the task, its bid, the execution trace, and any feedback, and proposes a diff to its own capability manifest. Diffs are not auto-applied; they are submitted as revision proposals that require approval.
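
The ledger itself is deliberately small. A minimal sketch of what the table could look like, written in Python/SQLite purely for illustration; the real migration lives in the Go codebase and its column names may differ:

    import sqlite3

    # Illustrative schema for the domain-scoped reputation ledger.
    # One row per completed task; the per-(agent, domain) aggregates quoted
    # later in this post (runs / correct / tokens_spent) fall out of a GROUP BY.
    conn = sqlite3.connect("synapbus.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reputation (
            agent             TEXT    NOT NULL,
            domain            TEXT    NOT NULL,
            task_id           TEXT    NOT NULL,
            estimated_tokens  INTEGER NOT NULL,
            actual_tokens     INTEGER NOT NULL,
            success_score     REAL    NOT NULL,   -- e.g. final-answer F1
            difficulty_weight REAL    NOT NULL DEFAULT 1.0,
            PRIMARY KEY (agent, domain, task_id)
        )
    """)
    conn.commit()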

The MVP implementation landed last week as feature 016-agent-marketplace: six new MCP actions dispatched through the existing execute tool (post_auction, bid, award, mark_task_done, read_skill_card, query_reputation), a new SQLite migration for the reputation table, and a new awarded reaction. Four test functions cover the full lifecycle. All 34 Go packages still pass.
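
On the benchmark side, the harness drives these actions through a single entry point, the way it would drive the real execute tool. A hypothetical sketch of that shape; only the six action names come from the feature, the handler bodies are placeholders:

    # Hypothetical in-process stub: one dispatch point, six marketplace actions.
    # Handler bodies are placeholders; only the action names come from 016-agent-marketplace.
    def post_auction(**p):     return {"auction_id": "a-1", **p}
    def bid(**p):              return {"bid_id": "b-1", **p}
    def award(**p):            return {"claim_id": "c-1", **p}
    def mark_task_done(**p):   return {"done": True, **p}
    def read_skill_card(**p):  return {"manifest": {}, **p}
    def query_reputation(**p): return {"rows": [], **p}

    ACTIONS = {f.__name__: f for f in (
        post_auction, bid, award, mark_task_done, read_skill_card, query_reputation)}

    def execute(action, **params):
        """Single dispatch point, mirroring the real MCP execute tool."""
        return ACTIONS[action](**params)

    # execute("post_auction", channel="auctions", task="4-hop MuSiQue question")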

The question was: does it actually work?

The question: not piano tuners

Fermi estimates make great pedagogy but they are hard to benchmark. "How many piano tuners in Chicago?" has no authoritative ground truth, and the profession itself is rare enough that the example feels dated. What we needed was a modern task that exercises the same multi-agent features — dynamic decomposition, duplicate-work detection, uncertainty aggregation, cost accounting, orphaned-spawn recovery — but runs on local data, produces a verifiable answer, and costs less than a dollar per run.

We picked MuSiQue-Ans, a 2022 benchmark from Stony Brook NLP built specifically to defeat single-pass reasoning on multi-hop questions. Every MuSiQue-Ans question ships with a gold decomposition DAG, which means you can score not just the final answer but also whether the agent actually decomposed the problem. The 4-hop subset is genuinely hard; even frontier models benefit from explicit step-by-step reasoning.

We curated three questions from the dev set that all share the United States as a pivot bridge entity. The intent is that a shared scratchpad should see cache hits across the three questions. For the first run, we focused on one:

What treaty ceded territory to the US extending west to the body of water by the city where the designer of Southeast Library died?

Gold answer: Treaty of Paris. The gold decomposition DAG has four hops: Southeast Library designer → Ralph Rapson; place of death → Minneapolis; body of water by Minneapolis → Mississippi River; treaty ceding territory west to Mississippi River → Treaty of Paris. This is not a question a reader will answer in their head, and it requires traversing four specific Wikipedia paragraphs hidden among 16 distractors.
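
For scoring decomposition quality, the gold DAG is just data. An illustrative rendering of this question's four hops, where #N refers to the answer of hop N (the field names are ours, not the MuSiQue file format):

    # The four gold hops, each depending on the answer to the previous one.
    gold_decomposition = [
        {"q": "Who designed Southeast Library?",          "a": "Ralph Rapson",      "needs": []},
        {"q": "Where did #1 die?",                        "a": "Minneapolis",       "needs": [1]},
        {"q": "What body of water lies by #2?",           "a": "Mississippi River", "needs": [2]},
        {"q": "Which treaty ceded territory west to #3?", "a": "Treaty of Paris",   "needs": [3]},
    ]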

The mixed-tier agent pool

One of the reasons we wanted a marketplace rather than a fixed pipeline is that real production agent systems are heterogeneous. Different agents have different underlying models with different cost/quality profiles. A naive orchestrator uses the most capable model for everything. A smart marketplace routes cheap questions to cheap models and expensive questions to expensive models — and learns which is which from experience.

For the first run we configured two agents:

  • haiku-agent — powered by Claude Haiku 4.5. Cheap, fast, good at single-hop fact lookups. Self-reports confidence 0.45 on multi-hop bridging questions. Bids 4000 estimated tokens.
  • sonnet-agent — powered by Claude Sonnet 4.6. Higher cost, higher confidence on multi-hop reasoning. Self-reports confidence 0.80. Bids 12000 estimated tokens.

We also set up a baseline: a single Sonnet call that receives the question plus all 20 candidate paragraphs (the 4 supporting ones and the 16 distractors) in one prompt with chain-of-thought instructions, no decomposition, no marketplace. This is the standard single-agent-RAG approach, and it is what a marketplace has to beat to justify its complexity.

The Pareto rule

We measure two axes simultaneously: final-answer F1 (token-level overlap with the gold answer after normalization) and total token cost. A run passes only if the marketplace lands strictly northwest of the baseline on the quality-vs-cost plane: at least as good on F1, at least as cheap on tokens, with strict improvement on at least one axis.

This rule is chosen on purpose. A benchmark that only measures quality rewards expensive solutions and makes the auction mechanics meaningless. A benchmark that only measures cost rewards trivially cheap wrong answers. Pareto dominance is the honest way to ask whether the marketplace is adding value.
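
Concretely, the pass rule is a two-line dominance check. A minimal sketch in the shape the harness might use (ours, not quoted from the benchmark code):

    def strictly_northwest(mkt_f1, mkt_tokens, base_f1, base_tokens):
        """PASS only if the marketplace Pareto-dominates the baseline:
        no worse on either axis, strictly better on at least one."""
        no_worse        = mkt_f1 >= base_f1 and mkt_tokens <= base_tokens
        strictly_better = mkt_f1 > base_f1 or mkt_tokens < base_tokens
        return no_worse and strictly_better

    # This run: better F1 but far more tokens, so the check returns False.
    strictly_northwest(1.000, 3314, 0.857, 697)   # -> False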

What happened

The harness posted the auction. Both agents bid as specified. The stub's scoring formula is estimated_tokens / confidence with a reputation discount; at epoch one both agents sit at the 0.5 prior, so Haiku won with 4000 / 0.45 ≈ 8889 against Sonnet's 12000 / 0.80 = 15000 (lower score wins). The auction converted the award into a claim on Haiku via the existing lifecycle, Haiku executed against the 20 candidate paragraphs, and returned an answer.
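
A sketch of that scoring step, under the assumption that the reputation discount is a simple multiplier; its exact form is not given here, and at the shared 0.5 prior it applies equally to both agents, so the epoch-one ranking reduces to tokens over confidence:

    def bid_score(estimated_tokens, confidence, reputation_discount=1.0):
        """Lower score wins. reputation_discount is an assumed multiplier;
        at epoch one both agents carry the same 0.5 prior, so it cancels."""
        return (estimated_tokens / confidence) * reputation_discount

    bids = {
        "haiku-agent":  bid_score(4000, 0.45),    # ~8889
        "sonnet-agent": bid_score(12000, 0.80),   # 15000
    }
    winner = min(bids, key=bids.get)              # -> "haiku-agent"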

This is where it gets interesting.

Marketplace result (Haiku 4.5)

Answer: Treaty of Paris — exact match to gold. F1: 1.000.
Tokens: 3,314. Wall time: 29.9 seconds.
Reasoning: Haiku produced a clean four-step trace through paragraphs 5, 2, 12, and 18, correctly identifying Ralph Rapson, Minneapolis, the Mississippi River, and the Treaty of Paris in order.

This alone was unexpected. Haiku 4.5 is the cheap tier. It is not supposed to nail 4-hop MuSiQue questions in one try. It did.

Baseline result (Sonnet 4.6)

Answer: The Treaty of Paris (1783) — semantically correct but penalized by the token-overlap F1 for the parenthetical year. F1: 0.857.
Tokens: 697. Wall time: 13.7 seconds.

Sonnet was more concise, faster, and 4.75x cheaper in tokens. But the strict token-overlap scoring penalized the extra (1783), costing Sonnet 0.143 F1 points.
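
That 0.857 is exactly what SQuAD-style token-level F1 produces here, which is our best guess at the scorer in play (a reconstruction, not the harness code): normalization drops punctuation and articles, so the only leftover mismatch is the year token.

    import re, string
    from collections import Counter

    def normalize(s):
        """SQuAD-style normalization: lowercase, drop punctuation and articles."""
        s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return s.split()

    def token_f1(pred, gold):
        p, g = Counter(normalize(pred)), Counter(normalize(gold))
        overlap = sum((p & g).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / sum(p.values()), overlap / sum(g.values())
        return 2 * precision * recall / (precision + recall)

    token_f1("The Treaty of Paris (1783)", "Treaty of Paris")   # -> 0.857
    token_f1("Treaty of Paris", "Treaty of Paris")              # -> 1.0

A substring or semantic-similarity scorer would give the baseline full credit; point 5 below returns to that.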

The verdict

The marketplace is strictly north of the baseline (F1 1.000 > 0.857). It is also strictly east (3314 > 697 tokens). These two points do not dominate each other on the Pareto plane. Neither strictly wins. Under the strict-northwest rule, the marketplace FAILs.

Why FAIL is the most valuable result we could get

A benchmark that always passes is not a benchmark, it is a compliment machine. A real benchmark has to be willing to say no. Here are the five things this FAIL verdict tells us, in order of importance:

1. The marketplace mechanism works end-to-end

Auction → bids → award → claim → execution → mark_done → reputation ledger update, all against real Claude API calls with real tokens counted. The failure mode is not "the plumbing is broken." The plumbing is demonstrably fine. The failure is at the policy layer.

2. The cheap model can solve hard questions

Haiku 4.5 producing an exact-match answer on a 4-hop MuSiQue question is a meaningful data point. Multi-hop decomposition is not the exclusive domain of frontier models; a well-prompted Haiku can walk the DAG. This has direct implications for cost-sensitive agent systems: the expensive model is not always necessary, and a marketplace can discover that.

3. Cheaper per token is not cheaper per task

Haiku bid 4000 tokens and actually spent 3314 — close to its estimate. But Sonnet's one-shot baseline used only 697. The marketplace bid heuristic assumed Haiku would be cheaper than Sonnet, and in terms of per-output-token pricing it is. In terms of total token cost for this specific task, Sonnet's concise one-shot beat Haiku's verbose decomposition. Haiku is cheaper per token but used more tokens. Nobody calibrated the marketplace to notice this, because at epoch one there is no history to calibrate from.

4. Reputation exists for exactly this reason

After this run, the reputation ledger contains: haiku-agent | multi-hop-qa | runs=1 correct=1 tokens_spent=3314. Sonnet has no reputation in multi-hop-qa yet, because it never won a task. If we ran a 5-epoch learning tier — which the spec explicitly supports — epoch 2 would trigger Sonnet's bootstrap exploration credit, giving it a forced first award. Epoch 2 would add sonnet-agent | multi-hop-qa | runs=1 correct=0.857 tokens_spent=697 to the ledger. By epoch 3, the auction heuristic should start recognizing that Sonnet's actual tokens-per-correct-answer is better than Haiku's, and the award would flip. By epoch 5, the marketplace should be routing this question class to Sonnet.
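
One way to see why the flip should happen: once both agents have ledger rows, a natural ranking statistic is actual tokens per unit of correctness, and the one-shot style wins it by roughly four to one. A sketch, under the assumption that the award heuristic eventually folds something like this into the bid score (the real update rule is the spec's, not this):

    # Hypothetical ledger state after epoch 2, using the rows described above.
    ledger = {
        "haiku-agent":  {"runs": 1, "correct": 1.000, "tokens_spent": 3314},
        "sonnet-agent": {"runs": 1, "correct": 0.857, "tokens_spent": 697},
    }

    def tokens_per_correct(row):
        # Lower is better; the epsilon guards agents that got everything wrong.
        return row["tokens_spent"] / max(row["correct"], 1e-6)

    {a: round(tokens_per_correct(r)) for a, r in ledger.items()}
    # -> {"haiku-agent": 3314, "sonnet-agent": 813}   # the award should flip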

This is the promised learning curve. It is not present at N=1. It is the next experiment.

5. The answer-matching metric itself is a lever

A more permissive scorer (substring match, semantic similarity, or ANLS) would score Sonnet's "The Treaty of Paris (1783)" at 1.000, and the baseline would win unambiguously on both axes. This does not let the marketplace off the hook; it reveals that metric choice is load-bearing. In production, "correct" is whatever the downstream consumer accepts. A marketplace has to learn which metric matters for which task, and that is another degree of freedom worth exposing in the capability manifest.

What it proves about SynapBus as a substrate

The whole point of SynapBus is that you build coordination primitives and let the agents self-organize on top. The marketplace feature adds four primitives: manifests, auctions, reputation, reflection. This run exercised the first three end-to-end with real model calls and real token accounting. The spec-to-runnable-integration path took one autonomous session: specs in the morning, parallel Go and Python implementations via git worktrees, merged to main, tested, run. Zero user interruptions after the autonomous mode was declared.

The entire marketplace is about 1,500 lines of Go layered on the existing SynapBus channels, reactions, wiki, and claim/done lifecycle. The benchmark is about 800 lines of Python with an in-process stub that mirrors the MCP action names exactly — swapping the stub for real MCP calls is mechanical. That is what a substrate looks like when it is working: features cost hundreds of lines, not thousands.

What we deferred, honestly

Some items from the spec did not ship in this run, on purpose. We wanted to see the core mechanism work before building the parts that assume it does:

  • Reflection loop (US4) — agents proposing skill-card diffs after task completion, subject to human approval. Deferred until we have more than one data point to reflect on.
  • Automatic tombstoning — deprecating skill-card revisions that correlate with rolling failure. Deferred until reflection exists.
  • Hard-stop budget enforcement daemon — 80% soft warning, 100% auto-fail. Estimated and actual tokens are recorded but not enforced in real time.
  • 5-epoch learning tier — the thing that would actually prove reputation convergence. Deferred because it requires a stable single-shot mode first.
  • Full 3-question trio run — for measurable duplicate-work detection. The trio file is checked in; only the run itself was deferred on budget.

Every deferral is traced to a specific FR in the spec. None of them are invisible. Deferrals you can audit are the only kind that don't rot.

The takeaway

If you are building a multi-agent system and your benchmark never fails, you are measuring the wrong thing. A real benchmark has to be capable of saying "this does not dominate the baseline" and mean it. Our marketplace feature passed every green checkmark it was designed to pass — builds clean, tests green, MCP actions wired, auction lifecycle end-to-end, reputation ledger write-through — and then correctly reported that its first-epoch policy is under-tuned for this specific question class.

That is what we wanted. The next experiment is the one that flips the verdict: a 5-epoch learning tier on the same trio, with reputation and skill cards persisting and exploration credits kicking in. Either the curve goes the right direction and the marketplace demonstrates real learning, or it doesn't and we have a concrete bug to fix. Both are useful. Neither is possible without a benchmark willing to fail.


All code for this run is open source: the marketplace lives at internal/marketplace/ in the SynapBus Go codebase, the benchmark at benchmark/. The raw run output — auctions, bids, token counts, reasoning traces — is in benchmark/results/latest.json. Real Claude API calls were made via the Claude Agent SDK. The rich run report is at autonomous_report.html.