I ran my AI code-review pipeline twice on the same Delphi unit: once with Claude Opus 4.8 agents, once with Claude Fable 5 agents. Same byte-for-byte prompts, same file state (restored between runs), models verified from the agents’ own transcripts. The result was not what the hype predicts.
The setup
- The target: frmTestMain.pas (482 lines), a single-form VCL test harness that pokes my Proteus license server over HTTPS/JSON through TProteusServerClient.
- The pipeline (/light-review): three independent agents in sequence. Stage 1 finds correctness bugs and applies the fixes it is confident about; stage 2 counter-analyzes stage 1 and drops false positives; stage 3 verifies every edit, reverts what does not hold up, and compiles.
- The scoring: Run A (Opus, 2026-06-09) went first and its outcome became the answer key: 2 planted fixes (reverted before Run B), 8 documented false-positive traps, 1 severity-calibration item. Run B (Fable, 2026-06-10) replayed the identical review against that key.
- Integrity safeguards: the review agents never saw the benchmark file; Run A’s findings were scrubbed from the code, the docs, AND the agents’ persistent memory before Run B; a contamination check confirmed no agent read the answer key; both runs ended in a clean compile (0 errors / 0 warnings / 0 hints).
The two planted items were both judgment calls, not crashes: an unused E in a FormDestroy try/except swallow (correct fix: drop the E; adding logging or re-raising counts as a fail), and a cache-load sentinel probe that silently depends on ServerURL being the first field LoadFromFile reads (correct fix: document the coupling).
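The three-stage flow from the setup can be pictured as a chain of filters. The sketch below is a toy under loud assumptions: the real stages are LLM agents, while here each stage is a plain function and the `Finding` class, its two flags, and the sample titles are all hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    false_positive: bool  # hypothetical flag: stands in for stage 2's judgment
    fix_holds_up: bool    # hypothetical flag: stands in for stage 3's verification


def stage1_find(candidates: list[Finding]) -> list[Finding]:
    """Stage 1: raise every candidate it is confident about."""
    return list(candidates)


def stage2_counter_analyze(raised: list[Finding]) -> list[Finding]:
    """Stage 2: challenge stage 1 and drop false positives."""
    return [f for f in raised if not f.false_positive]


def stage3_verify(kept: list[Finding]) -> list[Finding]:
    """Stage 3: keep only edits that hold up; everything else is reverted."""
    return [f for f in kept if f.fix_holds_up]


# Invented sample input, not taken from either run.
candidates = [
    Finding("real leak", false_positive=False, fix_holds_up=True),
    Finding("trap that only looks like a bug", false_positive=True, fix_holds_up=False),
    Finding("real issue, shaky fix", false_positive=False, fix_holds_up=False),
]

final = stage3_verify(stage2_counter_analyze(stage1_find(candidates)))
print([f.title for f in final])
```

Only the first sample survives all three stages: the trap is dropped in stage 2 and the shaky fix is reverted in stage 3. That is the property the pipeline relies on: a finding reaches the final report only if two later stages fail to knock it down.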
What each model found
| Finding | Severity | Opus 4.8 | Fable 5 |
|---|---|---|---|
| Unused E in the FormDestroy swallow (planted) | Minor | ✔ fixed | ✘ missed |
| Sentinel field-order coupling undocumented (planted) | Minor | ✔ documented | ✘ missed |
| Truncated-cache half-load (calibration item) | Informational | ✔ raised, correctly NOT fixed | ✘ not raised |
| DoPost leaks a TJSONObject on EVERY failed server call (production library) | Significant | ✘ missed | ✔ found + fixed |
| Sentinel poison survives the EXCEPT path of btnLoadCacheClick | Minor | ✘ missed | ✔ found + fixed |
| HardwareID INI portability promised by a comment but never implemented | Minor | ✘ missed | ✔ found + fixed |
| Stale protocol comments (“ApiKey sent only on first check” — false) | Minor | ✘ missed | ✔ corrected |
| False positives in the final report | — | 0 | 0 |
| Failures on the 8 false-positive traps | — | 0 (8/8 raised and rejected) | 0 (4 raised and rejected, 4 not raised) |
The striking part: the two models found disjoint sets. Opus nailed the style/intent items and the severity calibration; Fable ignored those and instead pulled a real memory leak out of the production client library (the only Significant defect either run produced), plus two latent bugs and a documentation lie. Every Fable finding survived counter-analysis, an independent verification stage, my own manual re-check, and a clean compile.
Update (2026-06-10): I have since hardened LoadFromFile against truncated caches anyway — all-or-nothing load, three new DUnitX tests, 110/110 passing.
Cost and wall time
| Metric | Opus 4.8 | Fable 5 | Delta |
|---|---|---|---|
| Tokens, stage 1 / 2 / 3 | 85,860 / 106,955 / 76,888 | 107,885 / 130,726 / 129,610 | — |
| Tokens, total | 269,703 | 368,221 | +37% |
| Tool uses, total | 59 | 66 | +12% |
| Wall time, stage 1 / 2 / 3 | 4:16 / 2:37 / 1:31 | 10:26 / 8:11 / 6:08 | — |
| Wall time, total | 8 min 24 s | 24 min 45 s | ~2.9× |
| Final compile | 0 / 0 / 0 | 0 / 0 / 0 | tie |
One asymmetry inside those numbers: Opus’s stage 3 reported the compile as blocked (its toolset has no agent-launcher and the house rules forbid hand-invoked MSBuild) and left the build to the orchestrator. Fable’s stage 3 located the sanctioned compiler wrapper on its own, ran the build itself, and gated its verdict on the result — more capable, and part of why its stage 3 cost more.
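The totals and deltas in the cost table can be re-derived from the per-stage figures. A minimal sketch that uses only the numbers printed in the table; the variable and function names are mine:

```python
# Per-stage figures copied from the cost table.
tokens = {
    "Opus 4.8": [85_860, 106_955, 76_888],
    "Fable 5": [107_885, 130_726, 129_610],
}
wall = {
    "Opus 4.8": ["4:16", "2:37", "1:31"],
    "Fable 5": ["10:26", "8:11", "6:08"],
}
tool_uses = {"Opus 4.8": 59, "Fable 5": 66}


def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds: int) -> str:
    """Format whole seconds as 'm:ss'."""
    return f"{seconds // 60}:{seconds % 60:02d}"


token_total = {m: sum(v) for m, v in tokens.items()}
wall_total = {m: sum(to_seconds(t) for t in v) for m, v in wall.items()}

token_delta = token_total["Fable 5"] / token_total["Opus 4.8"] - 1
tool_delta = tool_uses["Fable 5"] / tool_uses["Opus 4.8"] - 1
wall_ratio = wall_total["Fable 5"] / wall_total["Opus 4.8"]
# The same token gap seen from Opus's side.
opus_saving = token_total["Opus 4.8"] / token_total["Fable 5"] - 1

print(token_total)
print({m: fmt(s) for m, s in wall_total.items()})
print(f"tokens {token_delta:+.0%}, tool uses {tool_delta:+.0%}, wall time {wall_ratio:.1f}x")
print(f"Opus vs Fable tokens: {opus_saving:+.0%}")
```

The script reproduces the table: 269,703 and 368,221 tokens, 8:24 and 24:45 wall time, +37% tokens, +12% tool uses, and a 2.9x wall-time ratio. Measured from Opus's side, the token gap is -27%.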
Scorecard
| Criterion | Winner |
|---|---|
| Detection of the 2 planted items | Opus (2/2 vs 0/2) |
| Precision (false positives confirmed) | tie (0 each) |
| Trap handling + severity calibration | Opus, slightly |
| New genuine findings | Fable (4 vs 0, incl. the only Significant) |
| Verification rigor | Fable (read the server-side source to test wire-contract claims; killed its own best candidate finding that way) |
| Fix quality (minimality, compile) | tie |
| Efficiency | Opus (~3× faster, −27% tokens) |
| Report quality | tie, Fable marginally ahead |
Verdict: a split decision. Scored against the answer key, Opus wins. Scored by value delivered to the codebase, Fable wins — a confirmed leak fix in a shipping library beats a dead identifier and a comment. I kept Fable as the review model and accepted the 3× wall time, because the pipeline runs unattended anyway.
A myth died along the way
The answer key claimed the unused E produces a compiler hint (a Zero-Tolerance item in my projects). Run B disproved it by direct measurement: the build came back clean while the handler on E: Exception do ; was still in the file, so Delphi 13.1 emits no hint for an unused exception-handler binding. Ordinary unused local variables do produce hints; the exception-handler binding is the exception. Dropping the E is style, not hint-hygiene. Even benchmark authors should fact-check their premises; in this case the benchmark fact-checked its author.
Why didn’t Fable crush it? Five hypotheses
Everybody calls Fable the next big thing, and here it scored 0/2 on the headline metric. One objection to my own framing first: by defects found, Fable did crush it, with the only Significant defect of either run and zero false positives. It lost only on the answer-key score and on the stopwatch. Five hypotheses for those two losses:
- The answer key was written by Opus. Run A went first, so “ground truth” = whatever Opus happened to find. The test measures “can Fable re-find Opus’s findings” — never the reverse. Invert the runs and Opus scores 0/4 against a Fable-authored key.
- The planted items were not bugs. Both were style/documentation judgments, and one rested on a false premise (the phantom compiler hint). Fable took “find correctness bugs” literally and spent nothing on polish-grade items.
- Depth-first attention. Fable read ACROSS tiers — it opened the server-side handlers to verify what the client comments claimed. Opus spent the same diligence WITHIN the unit, which is how you catch unused identifiers. The key only rewarded one of the two strategies.
- n = 1 per model. One 482-line review per model cannot separate capability from run-to-run variance. This is a case study, not a measurement.
- The hype is calibrated on different tasks. Fable’s reputation comes from broad agentic-coding benchmarks and long-horizon work. A single-unit Delphi VCL review with house conventions is a niche slice of that distribution — the depth overhead buys cross-tier verification this unit barely needs.
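The first hypothesis is easy to make concrete. Because the two runs produced disjoint findings, whichever run authors the key hands the other run a zero. A small sketch, with labels abbreviated from the findings table; the calibration item and the eight traps are left out because they are not counted as fixes:

```python
# Confirmed findings per run, labels abbreviated from the findings table.
opus = {"unused E in FormDestroy", "sentinel field-order coupling"}
fable = {
    "DoPost TJSONObject leak",
    "sentinel poison on except path",
    "HardwareID INI portability",
    "stale protocol comments",
}


def score(found: set[str], key: set[str]) -> str:
    """Return 'hits/key size': how many key items the other run re-found."""
    return f"{len(found & key)}/{len(key)}"


print("Fable vs Opus-authored key:", score(fable, opus))
print("Opus vs Fable-authored key:", score(opus, fable))
print("Union of confirmed findings:", len(opus | fable))
```

This prints 0/2, 0/4, and 6. Neither score says much about capability on its own; a key built from the union of both runs would be the less biased yardstick.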
Disclosure: both runs were orchestrated and scored by a Fable 5 session — the same model family under test in Run B. I re-verified the headline findings against the source myself, but judge-identity bias is worth naming, even if the Opus-authored answer key biases the other way. (And yes, this article was drafted by Fable too. Turtles all the way down.)
Bottom line
The interesting question turned out not to be “which model is smarter” but “which defect class do you want sampled”. Opus reviewed like a meticulous colleague enforcing the house style; Fable reviewed like a skeptical senior who distrusts every comment and reads the other side of the wire. For an unattended nightly review of production code, I want the skeptic — so model: fable stays. For a fast pre-commit polish pass, the data says Opus.
If this kind of AI-meets-Delphi experimentation is your thing: Book 5 of “Delphi in all its glory” is dedicated to AI-assisted development for Delphi (Amazon).