The Controversy
In early 2026, two independent research teams published the first empirical studies on whether CLAUDE.md-style instruction files actually improve AI coding performance. Their headlines seemed to tell different stories — one modest, one dramatic. Tech blogs promptly cherry-picked the modest one and ran with it. An article on XDA Developers declared that CLAUDE.md helping your projects “is a myth” and that you should focus on “better prompting” instead.
Naturally, this research caught my attention. Let’s look at what both studies actually found, and why they’re not contradictory at all.
Study 1 – ETH Zurich, February 2026
Does It Solve More Tasks?
Researchers at ETH Zurich tested coding agents on two benchmarks: SWE-bench Lite (300 tasks from 11 popular Python repositories) and AGENTbench (138 tasks from 12 niche Python repositories that already had developer-provided context files).
They compared three scenarios: no context file, an LLM-generated context file, and a human-written context file. Their paper is titled “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” (Fournier et al., 2026).
Note: The numbers are given in this order: SWE-bench Lite / AGENTbench.
LLM-generated context files (auto-generated by AI)
- Reduced task success by 0.5% / 2%
- Increased inference cost by ~21%
- Made agents run more steps (+2.4 / +3.9)
Human-written context files
- Improved task success by 4% on average (on AGENTbench)
- Still increased cost by up to 19%
- Still added extra steps (+3.3 per task)
Read that again. Auto-generated context files made things worse. Human-written ones made things better. That’s a critical distinction that most of the blog posts glossed over.
The researchers also found that when they stripped documentation from repositories (simulating poorly-documented projects), even auto-generated context files consistently improved performance by 2.7%.
Study 2 – SMU/Heidelberg, January 2026
Does It Work Faster and Cheaper?
A few weeks earlier, a team from Singapore Management University, Heidelberg University, the University of Bamberg, and King’s College London published a different kind of study. Their paper, “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents” (Lulla et al., 2026), didn’t ask whether the agent solved the task — it asked how efficiently the agent worked.
They selected 10 repositories with existing AGENTS.md files and sampled 124 merged pull requests (each under 100 lines, touching at most 5 files). For every PR, they reconstructed the repository as it was just before the merge, then ran OpenAI Codex on it twice: once with the instruction file, once without. Each run happened in an isolated Docker container.
The results:
| Metric | Without instruction file | With instruction file | Improvement |
| --- | --- | --- | --- |
| Median wall-clock time | 99 s | 70 s | 28.6% faster |
| Mean wall-clock time | 163 s | 130 s | 20.3% faster |
| Median output tokens | 2,925 | 2,440 | 16.6% fewer |
| Mean output tokens | 5,745 | 4,591 | 20.1% fewer |
| Mean input tokens | 353,010 | 318,652 | 9.7% fewer |
Both wall-clock time and output token reductions were statistically significant. Task completion behavior remained comparable — the agent wasn’t finishing faster by doing less. It was finishing faster by doing smarter work.
An interesting pattern emerged: the mean improvements were larger than the median for token usage, suggesting that instruction files have the greatest impact on the hardest, most expensive tasks — exactly the ones where you’d want the help most.
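As a quick sanity check, the percentage reductions in the token rows can be recomputed directly from the reported table values (a sketch using only the numbers above; nothing here comes from the paper's raw data):

```python
# Recompute the token-usage improvements from the reported summary values
# (Lulla et al., 2026) to confirm the percentages in the table.

def improvement_pct(without_file: float, with_file: float) -> float:
    """Percentage reduction when the instruction file is present."""
    return round(100 * (without_file - with_file) / without_file, 1)

median_output = improvement_pct(2925, 2440)      # median output tokens
mean_output   = improvement_pct(5745, 4591)      # mean output tokens
mean_input    = improvement_pct(353010, 318652)  # mean input tokens

print(median_output, mean_output, mean_input)
```

The mean reduction (20.1%) exceeding the median (16.6%) is exactly the tail-heavy pattern: modest savings on typical tasks, large savings on the expensive outliers.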
Why the Numbers Look So Different
At first glance, “4% improvement” and “29% faster” seem contradictory. They’re not. The two studies measured completely different things:
- ETH Zurich measured correctness: did the agent solve the task? A binary pass/fail. Their answer: 4% more tasks solved with human-written files.
- Lulla et al. measured efficiency: how fast and cheaply did the agent work? Their answer: 29% faster, 17% fewer output tokens.
These findings are complementary. The instruction file makes the agent both slightly more likely to succeed and significantly faster and cheaper while doing so. It’s like giving a contractor both a better map and a shorter route — they find the destination a bit more reliably, and they get there much faster.
There’s one genuine tension between the studies: ETH Zurich found that human-written files increased cost by up to 19%, while Lulla et al. found they decreased token usage. Several factors likely explain this:
- Different agents. ETH Zurich tested multiple agents; Lulla et al. used only Codex, which was specifically designed to read AGENTS.md files.
- Different file quality. ETH Zurich’s human-written files averaged 641 words across 9.7 sections — quite verbose. Lulla et al. used files from production repositories written by developers for their own use, possibly leaner and more focused.
- Different metrics. ETH Zurich counted “steps” (tool calls, file reads, test runs). More steps means more tokens. Lulla et al. measured raw token counts and wall-clock time. An agent can take fewer steps but still explore more code per step.
The bottom line: a concise, well-targeted instruction file helps on both fronts. A bloated one may improve correctness while increasing cost.
Why Auto-Generated Files Hurt
The ETH Zurich study found something important about agent behavior: coding agents faithfully follow context file instructions. When a file mentions a tool like uv, agents use it 1.6x more often.
The problem is that auto-generated context files are bloated. They dump in everything — obvious things the agent could figure out on its own, redundant information already in the README, generic best practices that add nothing. The agent then dutifully follows all these instructions, running extra tests, searching extra files, reading extra documentation. More work, no better results. Just a bigger bill.
In well-documented repositories, auto-generated context files are almost entirely redundant. The agent can already find what it needs. The context file just adds noise.
Why Human-Written Files Help
Human-written context files improve performance because humans include things the AI genuinely can’t figure out on its own:
- Custom build commands that aren’t documented anywhere
- Non-obvious project conventions (“we use FreeAndNil, never Free”)
- Unusual tooling or folder structures
- Team-specific patterns that contradict common practice
In other words: the institutional knowledge that lives in people’s heads, not in the README. The Lulla et al. study hypothesizes that instruction files reduce exploratory work — the agent can skip the discovery phase and go straight to productive work. Think of it as the difference between dropping a new developer into a codebase with no documentation versus handing them an onboarding guide.
What This Means for Your Delphi Projects
If you’re a Delphi developer using Claude Code, you’re almost certainly working on a “poorly-documented” project from the agent’s perspective. Claude was trained predominantly on Python, JavaScript, and other mainstream languages. Delphi conventions, LightSaber framework patterns, MSBuild quirks from the command line — none of this is in Claude’s training data. Your project is the niche, under-documented codebase where context files deliver the biggest gains.
For Delphi specifically, the payoff may be even larger than what either study measured. Both studies used Python repositories with mainstream tooling. The gap between what the model knows and what your project needs is wider for Delphi than for Python. A CLAUDE.md that specifies your formatting conventions, your preferred libraries, your build process, and your forbidden constructs removes guesswork that simply doesn’t exist in Python projects.
The studies also examined relatively small tasks. Real-world Delphi work — refactoring a 2000-line form unit, implementing a new plugin, debugging a serialization issue — is often larger and more complex. The Lulla et al. data trend suggests these are precisely the scenarios where instruction files deliver the biggest efficiency gains.
But the research gives us clear guidance on how to write them:
Do include:
- Build instructions (your Build.cmd path, MSBuild configuration)
- Personal conventions that contradict common practice
- Framework-specific knowledge (use my StringList instead of TStringList)
- Forbidden patterns (with statement, ProcessMessages, raw pointers)
- Things Claude consistently gets wrong in your language
Don’t include:
- Obvious things (“use try-finally for resource management”; Claude already knows this)
- Information already in your source files’ comments
- Long explanations when a short rule will do
- Every possible coding guideline you can think of
As the ETH Zurich researchers put it: “Human-written context files should describe only minimal requirements.” Keep it lean. Every unnecessary instruction is another thing the agent wastes tokens following.
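Putting those rules together, a lean CLAUDE.md for a hypothetical Delphi project might look like the sketch below. Every path, command, and class name here is an invented placeholder (your build script, libraries, and conventions will differ); the point is the shape: short, specific, and limited to what the agent cannot discover on its own.

```markdown
# CLAUDE.md

## Build
- Build with `Build.cmd Release` from the repo root (it wraps MSBuild; never call the compiler directly).

## Conventions
- Use `FreeAndNil(Obj)`, never `Obj.Free`.
- Use our custom `StringList` class instead of `TStringList`.

## Forbidden
- No `with` statements.
- No `Application.ProcessMessages`.
- No raw pointer arithmetic.
```

Note what is absent: no generic best practices, no restating of the README, no explanations of things any competent agent already knows.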
The Real Takeaway
Taken together, the two studies paint a clear and consistent picture:
- A human-written instruction file makes the agent run 29% faster, use 17% fewer tokens, and solve 4% more tasks correctly.
- An auto-generated instruction file makes things worse. Don’t run /init and call it a day.
- The more niche your project (and Delphi qualifies), the bigger the benefit.
That’s not a marginal improvement. A 29% reduction in execution time — from a single markdown file that takes an hour to write and minutes to maintain — is a fundamental shift in how effectively the agent operates.
The CLAUDE.md file is not documentation for humans. It’s an interface specification between you and your AI collaborator. Treat it with the same care you’d give to any other critical piece of your development infrastructure.
The Irony
There’s a delicious irony in this whole episode. The tech blogs screaming “CLAUDE.md is a myth!” were, in many cases, clearly written by AI. The XDA article reads like it was generated by prompting an LLM with a study’s abstract and asking for a hot take. It conflates two different types of context files, draws a conclusion the study doesn’t support, and offers advice (“just prompt better”) that’s so vague it’s useless.
In other words: an AI-generated article about how AI-generated context files don’t work. The study’s own findings, applied to itself.
References
- Lulla, J.L., Mohsenimofidi, S., Galster, M., Zhang, J.M., Baltes, S., & Treude, C. (2026). On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents. arXiv:2601.20404v1.
- Fournier, G. et al. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? ETH Zurich. arXiv:2602.11988.
- Rice-Jones, J. (2026). “CLAUDE.md helping your projects is a myth.” XDA Developers.