Does a universal harness move the needle?

A pre-registered, falsifiable, reproducible benchmark.

TL;DR: Six agents, one non-trivial task, three LLM judges. At matched models the score difference between each vendor's own CLI and D.U.H. sits well inside judge noise on two of three pairs (Opus 4.7 Δ = −0.3, GPT-5.4 Δ = −0.6) and favours D.U.H. on the third (Gemini 3.1 Δ = +2.0). Model capability dominates harness choice by over 10×. The full method, the rubric, every diff, every session log, and the 27 judge outputs are in the repo so you can re-run and see for yourself.

Scoreboard

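| Rank | Agent | j-opus | j-gpt54 | j-g31 | Mean /35 | Elapsed |
| --- | --- | --- | --- | --- | --- | --- |
| 1= | claude-code-opus | 35 | 30 | 35 | 33.3 | 742 s |
| 1= | codex-gpt54 | 35 | 30 | 35 | 33.3 | 510 s |
| 3 | duh-opus | 35 | 29 | 35 | 33.0 | 915 s |
| 4 | duh-gpt54 | 33 | 30 | 35 | 32.7 | 230 s |
| 5 | duh-gemini-3.1 | 25 | 23 | 33 | 27.0 | 305 s |
| 6 | gemini-cli-3.1 | 25 | 22 | 28 | 25.0 | 358 s |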

Same-model deltas

| Model | First-party CLI | D.U.H. | Δ |
| --- | --- | --- | --- |
| Opus 4.7 | 33.3 | 33.0 | −0.3 |
| GPT-5.4 | 33.3 | 32.7 | −0.6 |
| Gemini 3.1 Pro | 25.0 | 27.0 | +2.0 |

Two of three deltas fall cleanly inside the ±2-point pre-registered parity band. The Gemini pair sits at the edge of the band and points in D.U.H.'s direction, driven almost entirely by test coverage (3.0 vs 1.3 /5) and implementation completeness (3.7 vs 3.3 /5).
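The deltas can be recomputed from the scoreboard by hand. The sketch below is not the repo's aggregate.py; it is a minimal illustration that assumes each agent's mean is rounded to one decimal before subtracting, which is the rounding order that reproduces the published Δ column. The names `SCORES`, `PAIRS`, and `same_model_deltas` are made up for this example.

```python
from statistics import mean

# Judge scores (j-opus, j-gpt54, j-g31), each out of 35, copied from the scoreboard.
SCORES = {
    "claude-code-opus": (35, 30, 35),
    "codex-gpt54": (35, 30, 35),
    "duh-opus": (35, 29, 35),
    "duh-gpt54": (33, 30, 35),
    "duh-gemini-3.1": (25, 23, 33),
    "gemini-cli-3.1": (25, 22, 28),
}

# (model, first-party CLI agent, D.U.H. agent)
PAIRS = [
    ("Opus 4.7", "claude-code-opus", "duh-opus"),
    ("GPT-5.4", "codex-gpt54", "duh-gpt54"),
    ("Gemini 3.1 Pro", "gemini-cli-3.1", "duh-gemini-3.1"),
]

PARITY_BAND = 2.0  # the pre-registered band of plus or minus 2 points


def same_model_deltas(scores, pairs, band=PARITY_BAND):
    """Return (model, vendor mean, D.U.H. mean, delta, inside band) per pair."""
    rows = []
    for model, vendor, duh in pairs:
        # Assumption: means are rounded first, then subtracted.
        vendor_mean = round(mean(scores[vendor]), 1)
        duh_mean = round(mean(scores[duh]), 1)
        delta = round(duh_mean - vendor_mean, 1)
        rows.append((model, vendor_mean, duh_mean, delta, abs(delta) <= band))
    return rows


if __name__ == "__main__":
    for model, vendor_mean, duh_mean, delta, inside in same_model_deltas(SCORES, PAIRS):
        print(f"{model:15} {vendor_mean:5.1f} {duh_mean:5.1f} {delta:+5.1f} inside band: {inside}")
```

With an inclusive comparison, the Gemini pair at exactly +2.0 counts as inside the band, which is why the text describes it as sitting at the edge.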

Method, in one page

Pre-registered hypotheses

Three hypotheses, each with a numeric falsification threshold. All three were declared before any run.

Why didn't the harness help on Opus and GPT-5.4?

The honest answer: this task doesn't exercise the axes where a harness is supposed to differentiate. It's a single-session, sub-hour feature. All three vendor CLIs give their model a context, a tool schema, and an edit loop; D.U.H. does the same. On short tasks inside native context, the model does the work and the harness is a courier.

The axes where a harness should matter (long-context behaviour, context compaction, cross-session memory, multi-agent orchestration, cross-provider model hopping) are not tested by this benchmark at all. Two follow-on benchmarks are specced to stress those axes: a multi-file distributed-systems task with adversarial tests, and a documentation-generation task over a real codebase.

What D.U.H. unlocks that the vendor CLIs cannot

There is one structural result the scoreboard above understates. Claude Code drives Claude. Codex drives OpenAI. Gemini CLI drives Gemini. That's it.

D.U.H. in this benchmark also ran llama-4-scout, gpt-oss-120b, and qwen3-32b: strong open models served via Groq, none of which has a first-party coding CLI of any kind. Two of those three runs were rate-limited out by Groq's free-tier TPM cap (recorded in the results as failures, not silently dropped); the third ran cleanly and scored below the frontier agents. The scores are one signal; the bench inclusion is another. Vendor CLIs can't even line up at the start.

Reproducing this run

# From the D.U.H. repo
cd benchmarks/double-agent-tdd

export ANTHROPIC_API_KEY=…  OPENAI_API_KEY=…
export GEMINI_API_KEY=…     GROQ_API_KEY=…

./preflight.sh        # verifies CLIs, keys, baseline commit
./run_all.sh          # six vendor runs + three open-model runs
./judge_all.sh        # 27 judgments
python3 aggregate.py  # writes results/scoreboard.md

Every artefact (diff, session log, meta.json, judge JSON) persists under results/. If you run this on a different machine with different keys and your numbers come out more than 3 points off ours on any agent, file an issue. That is the path to invalidating any of H1/H2/H3.
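The 3-point rule can be checked mechanically once you have your own means. The sketch below takes a plain dict of rerun means rather than parsing results/scoreboard.md, because that file's exact layout is not described here. `PUBLISHED_MEANS` is copied from the scoreboard above; the function name and the example rerun values are hypothetical.

```python
# Published means (/35) from the scoreboard in this write-up.
PUBLISHED_MEANS = {
    "claude-code-opus": 33.3,
    "codex-gpt54": 33.3,
    "duh-opus": 33.0,
    "duh-gpt54": 32.7,
    "duh-gemini-3.1": 27.0,
    "gemini-cli-3.1": 25.0,
}

TOLERANCE = 3.0  # "more than 3 points off ours on any agent"


def agents_outside_tolerance(rerun, published=PUBLISHED_MEANS, tolerance=TOLERANCE):
    """Return {agent: rerun minus published} for agents that differ by more than tolerance."""
    flagged = {}
    for agent, published_mean in published.items():
        if agent not in rerun:
            continue  # agent was not part of this rerun
        diff = round(rerun[agent] - published_mean, 1)
        if abs(diff) > tolerance:
            flagged[agent] = diff
    return flagged


if __name__ == "__main__":
    # Made-up rerun numbers, for illustration only.
    example_rerun = {"duh-opus": 29.5, "codex-gpt54": 33.0}
    print(agents_outside_tolerance(example_rerun))
```

Any non-empty result is the case the text asks you to report as an issue.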

duhwave-agile (May 2026)

A separate, complementary benchmark for the persistent-swarm extension. A single CLI invocation drives a 5-stage agile-team pipeline (PM → Architect → Engineer → Tester → Reviewer) against real OpenAI models. Each stage spawns a worker via duhwave's Spawn tool, reads exposed handles from the coordinator's RLM REPL, and binds its result back as a new named handle.

| Metric | gpt-4o-mini | gpt-4o |
| --- | --- | --- |
| Stages completed | 5 / 5 | 5 / 5 |
| Wall (single-threaded by design) | 35.5 s | 29.3 s |
| Total prompt tokens | 3,934 | 4,706 |
| Total completion tokens | 1,553 | 1,900 |
| Estimated cost | $0.0015 | $0.0308 |
| Cost ratio | 1× | ~20× |
| pytest pass rate on produced code | 3 / 5 | 5 / 6 |

Headline finding. Real coordination defects surface naturally: the Reviewer agent reads but does not execute, missing test failures the Tester introduced. gpt-4o-mini's test_error_handling fails because an earlier time.sleep(2) in another test refills the bucket; gpt-4o's test_rate_limiter_thread_safety references threading.Thread but the test file imports only pytest and time. Both Reviewers issued APPROVE. The benchmark catches genuine multi-agent coordination bugs that wouldn't show up in a stub-only demo.

The obvious next step: add a sixth role, Runner, that executes the test suite and binds the failures back as a handle for the Reviewer to peek. Adding a sixth handle-passing stage is one new entry in the pipeline list. Architecture composes.

Full per-stage ledger, cost breakdown, and reproducibility steps: benchmarks/duhwave-agile/RESULT.md. See also the duhwave guide for the architecture this benchmark exercises.

Caveats