Does a universal harness move the needle?
A pre-registered, falsifiable, reproducible benchmark.
TL;DR โ Six agents, one non-trivial task, three LLM judges. At matched models the score difference between each vendor's own CLI and D.U.H. sits well inside judge noise on two of three pairs (Opus 4.7 ฮ = โ0.3, GPT-5.4 ฮ = โ0.6) and favours D.U.H. on the third (Gemini 3.1 ฮ = +2.0). Model capability dominates harness choice by over 10ร. The full method, the rubric, every diff, every session log, and the 27 judge outputs are in the repo so you can re-run and see for yourself.
Scoreboard
| Rank | Agent | j-opus | j-gpt54 | j-g31 | Mean /35 | Elapsed |
|---|---|---|---|---|---|---|
| 1= | claude-code-opus | 35 | 30 | 35 | 33.3 | 742 s |
| 1= | codex-gpt54 | 35 | 30 | 35 | 33.3 | 510 s |
| 3 | duh-opus | 35 | 29 | 35 | 33.0 | 915 s |
| 4 | duh-gpt54 | 33 | 30 | 35 | 32.7 | 230 s |
| 5 | duh-gemini-3.1 | 25 | 23 | 33 | 27.0 | 305 s |
| 6 | gemini-cli-3.1 | 25 | 22 | 28 | 25.0 | 358 s |
Same-model deltas
| Model | First-party CLI | D.U.H. | ฮ |
|---|---|---|---|
| Opus 4.7 | 33.3 | 33.0 | โ0.3 |
| GPT-5.4 | 33.3 | 32.7 | โ0.6 |
| Gemini 3.1 Pro | 25.0 | 27.0 | +2.0 |
Two of three deltas fall cleanly inside the ยฑ2-point pre-registered parity band. The Gemini pair sits at the edge of the band and points in D.U.H.'s direction โ driven almost entirely by test coverage (3.0 vs 1.3 /5) and implementation completeness (3.7 vs 3.3 /5).
Method, in one page
- Task: add a double-agent TDD flow (Driver writes tests, Navigator reviews; six strict phases) to the D.U.H. codebase itself. An ADR must predate the first code change. Real wiring, not a skeleton. Unit tests that exercise the phase contract. README + wiki + help-text updates. No git commits; the working tree is the submission.
- Baseline: every agent starts in a fresh
git worktreeof the same pinned commit (645d91a, D.U.H. v0.8.0 release-prep). - Agents: Claude Code @ Opus 4.7 ยท D.U.H. @ Opus 4.7 ยท Codex CLI @ GPT-5.4 ยท D.U.H. @ GPT-5.4 ยท Gemini CLI @ Gemini 3.1 Pro ยท D.U.H. @ Gemini 3.1 Pro.
- Judges: Opus 4.7, GPT-5.4, Gemini 3.1 Pro, each via D.U.H.. Each judge scores every submission โ including runs produced by its own model family โ so single-provider bias cancels on the average.
- Rubric: seven dimensions, each 0โ5, summed to /35. ADR quality ยท Implementation completeness ยท Use of existing abstractions ยท Test coverage of the six-phase contract ยท Documentation ยท Code quality ยท Protocol adherence.
Pre-registered hypotheses
Three hypotheses, each with a numeric falsification threshold. All three were declared before any run.
- H1 โ Harness parity at matched model. Score delta between first-party CLI and D.U.H. at the same model is โค 2 points /35. Result: supported (โ0.3, โ0.6, +2.0). Gemini pair sits at the edge โ a one-point judge swing would cross the threshold.
- H2 โ Measurable same-family judge bias. Judges score own-family submissions โฅ 2 points higher than cross-family judges do. Result: supported strongly for Gemini (same-family bias +9.0 on one target), weakly for Anthropic (+2.5 to +3.0), and inverted for OpenAI (โ5.0 to โ4.0; the GPT judge was harsher on GPT submissions). Three-judge averaging substantially โ but not fully โ cancels the bias.
- H3 โ Model capability dominates CLI choice. Model score range strictly greater than CLI score range. Result: supported decisively. Model range = 23.3 (33.0 โ 9.7 across D.U.H. runs). Max CLI range at any matched model = 2.0.
Why didn't the harness help on Opus and GPT-5.4?
The honest answer: this task doesn't exercise the axes where a harness is supposed to differentiate. It's a single-session, sub-hour feature. All three vendor CLIs give their model a context, a tool schema, and an edit loop; D.U.H. does the same. On short tasks inside native context, the model does the work and the harness is a courier.
Where a harness should matter โ long-context behaviour, context compaction, cross-session memory, multi-agent orchestration, cross-provider model hopping โ this benchmark doesn't test at all. Two follow-on benchmarks are specced to stress those axes: a multi-file distributed-systems task with adversarial tests, and a documentation-generation task over a real codebase.
What D.U.H. unlocks that the vendor CLIs cannot
There is one structural result the scoreboard above understates. Claude Code drives Claude. Codex drives OpenAI. Gemini CLI drives Gemini. That's it.
D.U.H. in this benchmark also ran llama-4-scout,
gpt-oss-120b, and qwen3-32b โ strong open
models served via Groq, none of which has a first-party coding CLI of
any kind. Two of those three runs were rate-limited out by Groq's
free-tier TPM cap (recorded in the results as failures, not silently
dropped); the third ran cleanly and scored below the frontier agents.
The scores are one signal; the bench inclusion is another.
Vendor CLIs can't even line up at the start.
Reproducing this run
# From the D.U.H. repo
cd benchmarks/double-agent-tdd
export ANTHROPIC_API_KEY=โฆ OPENAI_API_KEY=โฆ
export GEMINI_API_KEY=โฆ GROQ_API_KEY=โฆ
./preflight.sh # verifies CLIs, keys, baseline commit
./run_all.sh # six vendor runs + three open-model runs
./judge_all.sh # 27 judgments
python3 aggregate.py # writes results/scoreboard.md
Every artefact โ diff, session log, meta.json, judge JSON โ persists
under results/. If you run this on a different machine
with different keys and your numbers come out more than 3 points off
ours on any agent, file an issue. That is the path to invalidating any
of H1/H2/H3.
duhwave-agile (May 2026)
A separate, complementary benchmark for the persistent-swarm
extension. A single CLI invocation drives a 5-stage agile-team
pipeline (PM โ Architect โ Engineer โ Tester โ
Reviewer) against real OpenAI models. Each stage spawns a
worker via duhwave's Spawn tool, reads exposed handles
from the coordinator's RLM REPL, and binds its result back as a new
named handle.
| Metric | gpt-4o-mini | gpt-4o |
|---|---|---|
| Stages completed | 5 / 5 | 5 / 5 |
| Wall (single-threaded by design) | 35.5 s | 29.3 s |
| Total prompt tokens | 3,934 | 4,706 |
| Total completion tokens | 1,553 | 1,900 |
| Estimated cost | $0.0015 | $0.0308 |
| Cost ratio | 1ร | ~20ร |
| pytest pass rate on produced code | 3 / 5 | 5 / 6 |
Headline finding. Real coordination defects surface
naturally โ the Reviewer agent reads but does not execute,
missing test failures the Tester introduced. gpt-4o-mini's
test_error_handling fails because an earlier
time.sleep(2) in another test refills the bucket;
gpt-4o's test_rate_limiter_thread_safety references
threading.Thread but the test file imports only
pytest and time. Both Reviewers issued
APPROVE. The benchmark catches genuine multi-agent coordination bugs
that wouldn't show up in a stub-only demo.
The obvious next step: add a sixth role โ Runner โ that executes the test suite and binds the failures back as a handle for the Reviewer to peek. Adding a sixth handle-passing stage is one new entry in the pipeline list. Architecture composes.
Full per-stage ledger, cost breakdown, and reproducibility steps: benchmarks/duhwave-agile/RESULT.md. See also the duhwave guide for the architecture this benchmark exercises.
Caveats
- N = 1 per (agent, task). Absolute means carry ยฑ3 points of judge-sampling uncertainty; ordinal ranking is stable across judges.
- Single task. Generalising H1 to all coding tasks is not warranted. Benchmarks 2 and 3 exist to stress different axes.
- Rubric is document-grounded, not execution-grounded.
codex-gpt54had two actually-failing unit tests that every judge missed. A future rubric iteration should run pytest. - Protocol-honesty note. The first Claude Code run
interpreted the task's reference to the D.U.H. source path literally
and wrote files to the upstream repo instead of its benchmark
worktree. The prompt was rewritten to remove the absolute path and
the Claude Code invocation was hardened with
--add-dir=$WORKTREE. All reported numbers use the hardened prompt.