Does a universal harness move the needle?

A pre-registered, falsifiable, reproducible benchmark.

TL;DR — Six agents, one non-trivial task, three LLM judges. At matched models the score difference between each vendor's own CLI and D.U.H. sits well inside judge noise on two of three pairs (Opus 4.7 Δ = −0.3, GPT-5.4 Δ = −0.6) and favours D.U.H. on the third (Gemini 3.1 Δ = +2.0). Model capability dominates harness choice by over 10×. The full method, the rubric, every diff, every session log, and the 27 judge outputs are in the repo so you can re-run and see for yourself.

Scoreboard

Rank	Agent	j-opus	j-gpt54	j-g31	Mean /35	Elapsed
1=	`claude-code-opus`	35	30	35	33.3	742 s
1=	`codex-gpt54`	35	30	35	33.3	510 s
3	`duh-opus`	35	29	35	33.0	915 s
4	`duh-gpt54`	33	30	35	32.7	230 s
5	`duh-gemini-3.1`	25	23	33	27.0	305 s
6	`gemini-cli-3.1`	25	22	28	25.0	358 s

Same-model deltas

Model	First-party CLI	D.U.H.	Δ
Opus 4.7	33.3	33.0	−0.3
GPT-5.4	33.3	32.7	−0.6
Gemini 3.1 Pro	25.0	27.0	+2.0

Two of three deltas fall cleanly inside the ±2-point pre-registered parity band. The Gemini pair sits at the edge of the band and points in D.U.H.'s direction — driven almost entirely by test coverage (3.0 vs 1.3 /5) and implementation completeness (3.7 vs 3.3 /5).

Method, in one page

Task: add a double-agent TDD flow (Driver writes tests, Navigator reviews; six strict phases) to the D.U.H. codebase itself. An ADR must predate the first code change. Real wiring, not a skeleton. Unit tests that exercise the phase contract. README + wiki + help-text updates. No git commits; the working tree is the submission.
Baseline: every agent starts in a fresh git worktree of the same pinned commit (645d91a, D.U.H. v0.8.0 release-prep).
Agents: Claude Code @ Opus 4.7 · D.U.H. @ Opus 4.7 · Codex CLI @ GPT-5.4 · D.U.H. @ GPT-5.4 · Gemini CLI @ Gemini 3.1 Pro · D.U.H. @ Gemini 3.1 Pro.
Judges: Opus 4.7, GPT-5.4, Gemini 3.1 Pro, each via D.U.H.. Each judge scores every submission — including runs produced by its own model family — so single-provider bias cancels on the average.
Rubric: seven dimensions, each 0–5, summed to /35. ADR quality · Implementation completeness · Use of existing abstractions · Test coverage of the six-phase contract · Documentation · Code quality · Protocol adherence.

Pre-registered hypotheses

Three hypotheses, each with a numeric falsification threshold. All three were declared before any run.

H1 — Harness parity at matched model. Score delta between first-party CLI and D.U.H. at the same model is ≤ 2 points /35. Result: supported (−0.3, −0.6, +2.0). Gemini pair sits at the edge — a one-point judge swing would cross the threshold.
H2 — Measurable same-family judge bias. Judges score own-family submissions ≥ 2 points higher than cross-family judges do. Result: supported strongly for Gemini (same-family bias +9.0 on one target), weakly for Anthropic (+2.5 to +3.0), and inverted for OpenAI (−5.0 to −4.0; the GPT judge was harsher on GPT submissions). Three-judge averaging substantially — but not fully — cancels the bias.
H3 — Model capability dominates CLI choice. Model score range strictly greater than CLI score range. Result: supported decisively. Model range = 23.3 (33.0 → 9.7 across D.U.H. runs). Max CLI range at any matched model = 2.0.

Why didn't the harness help on Opus and GPT-5.4?

The honest answer: this task doesn't exercise the axes where a harness is supposed to differentiate. It's a single-session, sub-hour feature. All three vendor CLIs give their model a context, a tool schema, and an edit loop; D.U.H. does the same. On short tasks inside native context, the model does the work and the harness is a courier.

Where a harness should matter — long-context behaviour, context compaction, cross-session memory, multi-agent orchestration, cross-provider model hopping — this benchmark doesn't test at all. Two follow-on benchmarks are specced to stress those axes: a multi-file distributed-systems task with adversarial tests, and a documentation-generation task over a real codebase.

What D.U.H. unlocks that the vendor CLIs cannot

There is one structural result the scoreboard above understates. Claude Code drives Claude. Codex drives OpenAI. Gemini CLI drives Gemini. That's it.

D.U.H. in this benchmark also ran llama-4-scout, gpt-oss-120b, and qwen3-32b — strong open models served via Groq, none of which has a first-party coding CLI of any kind. Two of those three runs were rate-limited out by Groq's free-tier TPM cap (recorded in the results as failures, not silently dropped); the third ran cleanly and scored below the frontier agents. The scores are one signal; the bench inclusion is another. Vendor CLIs can't even line up at the start.

Reproducing this run

# From the D.U.H. repo
cd benchmarks/double-agent-tdd

export ANTHROPIC_API_KEY=…  OPENAI_API_KEY=…
export GEMINI_API_KEY=…     GROQ_API_KEY=…

./preflight.sh        # verifies CLIs, keys, baseline commit
./run_all.sh          # six vendor runs + three open-model runs
./judge_all.sh        # 27 judgments
python3 aggregate.py  # writes results/scoreboard.md

Every artefact — diff, session log, meta.json, judge JSON — persists under results/. If you run this on a different machine with different keys and your numbers come out more than 3 points off ours on any agent, file an issue. That is the path to invalidating any of H1/H2/H3.

duhwave-agile (May 2026)

A separate, complementary benchmark for the persistent-swarm extension. A single CLI invocation drives a 5-stage agile-team pipeline (PM → Architect → Engineer → Tester → Reviewer) against real OpenAI models. Each stage spawns a worker via duhwave's Spawn tool, reads exposed handles from the coordinator's RLM REPL, and binds its result back as a new named handle.

Metric	gpt-4o-mini	gpt-4o
Stages completed	5 / 5	5 / 5
Wall (single-threaded by design)	35.5 s	29.3 s
Total prompt tokens	3,934	4,706
Total completion tokens	1,553	1,900
Estimated cost	$0.0015	$0.0308
Cost ratio	1×	~20×
pytest pass rate on produced code	3 / 5	5 / 6

Headline finding. Real coordination defects surface naturally — the Reviewer agent reads but does not execute, missing test failures the Tester introduced. gpt-4o-mini's test_error_handling fails because an earlier time.sleep(2) in another test refills the bucket; gpt-4o's test_rate_limiter_thread_safety references threading.Thread but the test file imports only pytest and time. Both Reviewers issued APPROVE. The benchmark catches genuine multi-agent coordination bugs that wouldn't show up in a stub-only demo.

The obvious next step: add a sixth role — Runner — that executes the test suite and binds the failures back as a handle for the Reviewer to peek. Adding a sixth handle-passing stage is one new entry in the pipeline list. Architecture composes.

Full per-stage ledger, cost breakdown, and reproducibility steps: benchmarks/duhwave-agile/RESULT.md. See also the duhwave guide for the architecture this benchmark exercises.

Caveats

N = 1 per (agent, task). Absolute means carry ±3 points of judge-sampling uncertainty; ordinal ranking is stable across judges.
Single task. Generalising H1 to all coding tasks is not warranted. Benchmarks 2 and 3 exist to stress different axes.
Rubric is document-grounded, not execution-grounded. codex-gpt54 had two actually-failing unit tests that every judge missed. A future rubric iteration should run pytest.
Protocol-honesty note. The first Claude Code run interpreted the task's reference to the D.U.H. source path literally and wrote files to the upstream repo instead of its benchmark worktree. The prompt was rewritten to remove the absolute path and the Claude Code invocation was hardened with --add-dir=$WORKTREE. All reported numbers use the hardened prompt.

See the benchmark Try D.U.H.