Anthropic shipped Opus 4.7 a few weeks ago and my feed has been on fire ever since. Confident hallucinations. Ignored instructions. Long-context amnesia. The usual question: is the model genuinely worse, or is this the obligatory “new model bad” tantrum that goes off after every release?
A bit of both, it turns out. Here is what I found.
It’s not just one cranky guy on Reddit
- GitHub bug #50235 — somebody actually sat down and classified the failure modes. Seven of them. I dissect this one in the next section.
- Reddit (r/ClaudeAI, r/ClaudeCode) — a “serious regression” thread hit 2,300 upvotes in 24 hours. One user reported 77 hallucinations in a single session. Another claims the model invented commit hashes with a straight face. A third had their resume rewritten with fake schools and a surname they had never heard of.
- XDA Developers — accuses the model of confabulating instead of searching the web, and “doing the bare minimum”. (Welcome to my life since 2023.)
- X / Twitter — 14k likes on “no improvement over 4.6”. “Legendarily bad”, apparently. Argues nonstop even when shown evidence to the contrary, which any married person will tell you is a survival trait.
Inside GitHub issue #50235 — seven named hallucination patterns
The single best source on the whole mess is the issue itself, opened on 18 April by GitHub user tomtokitajr. Still open at time of writing, labelled area:model, bug, platform:windows. The thing that makes it useful is that the author did not just write “it sucks”. They sat down and built a taxonomy. Seven failure modes, each with a name. Paraphrased here so you do not need a coffee just to read them:
- Confident-prose fabrication — the model talks confidently and treats its own talking as “explanation” rather than “claims that need backing up”. So no tool calls fire and no sources get checked. (Same energy as my high-school history essays.)
- Bucket-bypass drift — your `CLAUDE.md` says "do not fabricate assertions". The model dutifully obeys, and then fabricates inside tags, status labels, data fields, or context instead. Like a kid told not to eat sweets, who promptly invents the new category "biscuits".
- Narrative-confirming reconciliation bias — two tools return contradictory facts. The model quietly picks the one that fits the story it is already telling and never mentions the conflict. Not random. It always rounds toward the direction it was already heading.
- Tool-output temporal decay — a value the model read with a tool five steps ago somehow becomes a slightly different value by the time it is repeated, even though the original output is still sitting right there in context. The number rusted in transit.
- Correctness-by-accident insulation — sometimes the fabricated claim happens to be right, because the training data leaned that way. So you do not notice. You praise the model and move on. Until next week, when the same broken process spits out something wrong on a Friday afternoon and breaks production.
- Negative fabrication — "X does not exist", stated with full confidence after one half-hearted `grep`. The most dangerous one on the list, because it sounds like due diligence. It is not.
- Plan/list-emission bypass — single-claim mode triggers verification rules. Bullet-list mode does not. Once the model is "just laying out a plan", the safety nets stop firing. Apparently "this is a list" is a magic word.
The author then makes a set of observations that should actually scare anyone with a hand-tuned `CLAUDE.md`:
- Behavioural rules that just sit in context do not reliably fire during composition on 4.7. Writing it down is not enough.
- Principle-style rules (“be careful”, “fact-check before asserting”) drift. Command-format rules with explicit trigger, action, examples, and forbidden formats fire more reliably. The model wants a recipe, not a vibe.
- Tool-call hooks beat behavioural rules. If you really care that something does not happen, enforce it in code, not in prose (see the hook sketch after this list).
- Sub-agents do not reliably inherit the parent's rules. Writing "do X" in `CLAUDE.md` does not magically bind them.
- Long sessions and post-compaction state get sloppier. After a while, the model is running on vibes.
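Here is that hook sketch: a minimal PreToolUse hook in Python, assuming Claude Code's hook contract as I understand it (tool call details arrive as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the model). The duplicate-unit check and file layout are my own hypothetical example, not anything from the issue:

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: refuse to let the agent create a .pas unit
whose name already exists elsewhere in the repo (the "negative
fabrication" failure mode from issue #50235, caught in code)."""
import json
import sys
from pathlib import Path

payload = json.load(sys.stdin)  # hook input: one JSON object on stdin
if payload.get("tool_name") == "Write":
    target = Path(payload.get("tool_input", {}).get("file_path", ""))
    if target.suffix.lower() == ".pas":
        clashes = [p for p in Path.cwd().rglob(target.name)
                   if p.resolve() != target.resolve()]
        if clashes:
            # Exit code 2 blocks the tool call; stderr goes to the model.
            print(f"{target.name} already exists at {clashes[0]}. "
                  "Read it before re-implementing it.", file=sys.stderr)
            sys.exit(2)
sys.exit(0)  # anything else passes through untouched
```

Register it as a PreToolUse command hook in `.claude/settings.json` with a matcher on `Write`; the exact config shape has shifted between releases, so check the current hooks documentation rather than trusting this paragraph.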
No minimal repro is attached, and Anthropic has not commented on the thread, which is its own kind of answer. But the patterns line up with what everyone else is reporting, which is why this issue is worth bookmarking even if you have never touched Claude Code. If you write production Delphi with an LLM in the loop, "negative fabrication" alone is the kind of thing that quietly poisons a refactor: the model declares your unit does not exist, you obediently re-implement it, and now you have two units doing the same thing and a future merge headache with your own code.
Anthropic’s own post-mortem
On 23 April Anthropic put out a post-mortem owning up to three bugs that were quietly nuking quality between March and April:
- Default effort silently dropped from high to medium (Mar 4 → reverted Apr 7). Yes, your “high effort” was secretly medium for a month.
- Thinking-cache bug threw away the reasoning history every turn (Mar 26 → fixed Apr 10). The model was rebuilding its own train of thought on every call.
- A “≤25 words between tool calls” prompt knocked coding quality down by 3% (Apr 16 → reverted Apr 20). Somebody on the prompt team really wanted brevity.
So a chunk of the “4.7 is broken” noise was actually harness bugs, not the model itself. Fair enough. The uncomfortable part: the hallucination complaints kept rolling in after all three fixes shipped.
Benchmarks: 4.7 vs 4.6
On paper, 4.7 looks like a clear win:
| Benchmark | Opus 4.6 | Opus 4.7 |
|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% |
| SWE-bench Pro | 53.4% | 64.3% |
| CursorBench | 58% | 70% |
On paper.
Real-world is messier
- 4.7 follows instructions more literally. It will not refactor stuff you didn’t ask it to refactor, which is a small miracle. Good for surgical edits.
- 4.7 regresses on long context above 100K tokens. Bad for big Delphi codebases. Very bad if your project is one of those 200-unit monsters with twelve sub-packages.
- 4.7’s tokenizer charges roughly 35% more tokens for the same prompt. Tokens are free on a benchmark score sheet. Not free on your invoice (quick math after this list).
- 4.7 hallucinates confidently. If you depend on the model not inventing procedure signatures or unit names, this one alone might disqualify it.
- Multi-file coherence is genuinely better. Refactors that span six units come out cleaner.
- Root-cause analysis from stack traces is also better. When it actually reads the trace instead of guessing.
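To put a number on the tokenizer overhead from the list above: only the 35% figure comes from the reports; the daily volume and the per-million-token price below are made-up placeholders, so substitute your own rate card.

```python
# Back-of-envelope: what a ~35% fatter tokenizer does to the invoice.
PROMPT_TOKENS_PER_DAY = 400_000   # hypothetical context-heavy workload
TOKENIZER_INFLATION = 1.35        # the reported ~35% overhead on 4.7
PRICE_PER_MTOK = 15.00            # hypothetical $/million input tokens

cost_46 = PROMPT_TOKENS_PER_DAY / 1e6 * PRICE_PER_MTOK
cost_47 = cost_46 * TOKENIZER_INFLATION
print(f"4.6: ${cost_46:.2f}/day   4.7: ${cost_47:.2f}/day")
# 4.6: $6.00/day   4.7: $8.10/day
```

Same work, same prompts, two extra dollars a day, and it scales linearly with usage.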
Should you downgrade Claude Opus 4.7 to 4.6?
If you write Delphi against a serious codebase and you rely on the model not making up TLightForm methods that do not exist, downgrading to 4.6 is the safer bet. The benchmark wins on 4.7 are not going to help when it confidently invents the API you were about to call.
If you do surgical edits in small files, where literal instruction-following matters more than recall, stick with 4.7. It really is better at not “improving” code you didn’t ask it to improve.
How to switch back to 4.6
If you launch Claude Code through a wrapper script (mine is `c:\Users\trei\.local\bin\claude-max.cmd`), change `--model default` to `--model claude-opus-4-6`. That pins 4.6 instead of auto-tracking whichever Opus Anthropic shipped this morning. Flip back when you feel brave.
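For reference, the business end of such a wrapper is a single line; the path and the `%*` passthrough below are just my setup, yours will differ:

```cmd
:: claude-max.cmd: pin Opus 4.6 instead of tracking the default
claude --model claude-opus-4-6 %*
```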
Sources
- Anthropic — April 23 post-mortem
- GitHub — claude-code issue #50235 ([BUG] Opus 4.7 Hallucinations)
- XDA Developers — Claude's new model is a step forward and two steps back
- Vellum — Claude Opus 4.7 benchmarks explained
- MindStudio — Opus 4.7 vs 4.6 what changed
- Medium — Opus 4.7 is the worst release
- Anthropic — Claude Opus 4.7 announcement
- The Register — Claude Opus 4.7 AUC overzealous