
Context Engineering with Skills: How to Layer CLAUDE.md, AGENTS.md, and SKILL.md for Maximum Agent Performance

Every token in your context window competes with reasoning capacity. Learn the three-layer architecture — always-on CLAUDE.md, cross-tool AGENTS.md, on-demand SKILL.md — to keep your AI agent fast, focused, and correct.

Danny Huang

The Problem: Your Context Window Is Not Free

Every token you put into an AI agent's context window is a token that competes with the agent's ability to reason. This is not a metaphor. It is how transformer attention works. More input tokens mean the model's attention is spread thinner across more material, and the quality of its output degrades measurably.

Chroma's context rot research tested 18 frontier models and found that every single one gets worse as input length increases — even on simple retrieval tasks. A model with a 200K token window can show significant degradation at 50K tokens. Adding full conversation history (~113K tokens) can drop accuracy by 30% compared to a focused 300-token version of the same information.

The ETH Zurich study on AGENTS.md files ("Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" by Gloaguen et al.) confirmed this directly for coding agents: LLM-generated context files reduced task success rates by 3% compared to no context file at all. Human-written files improved success by only 4%, while increasing inference costs by over 20%.

The implication is clear: context is not free, and more context is not better. Intelligence is not the bottleneck — context is. The question is not "what can I tell my agent?" but "what is the minimum high-signal information my agent needs for this specific task?"

This is context engineering. And the three-layer architecture of CLAUDE.md, AGENTS.md, and SKILL.md is the best tool we have in 2026 for solving it.

The Three-Layer Architecture

The solution is progressive disclosure — load information at the right time, in the right amount, for the right task. Three files, three scopes, three loading behaviors.

| Layer | File | Loaded When | Scope | Token Budget |
|-------|------|-------------|-------|--------------|
| 1 — Foundation | AGENTS.md | Every session, always | Cross-tool project context | < 100 lines |
| 2 — Tool-specific | CLAUDE.md | Every session, always | Claude Code-specific overrides | < 20 lines |
| 3 — On-demand | SKILL.md | Only when task matches description | Task-specific capability | < 500 lines per skill |

Summary: AGENTS.md carries the universal project context that every AI tool reads. CLAUDE.md adds Claude Code-specific instructions. Skills load specialized knowledge only when the current task needs it. The result is a lean baseline context with deep capability available on demand.

Layer 1: AGENTS.md — Always-On, Cross-Tool

AGENTS.md is the universal project context file. It is read by Claude Code, Codex CLI, Copilot CLI, Gemini CLI, and Cursor. It loads at the start of every session and stays in context for the entire conversation. Every token in this file counts against every task you run.

Because of this, AGENTS.md must be ruthlessly lean. Under 100 lines. Every line must pass one test: "Would removing this cause the agent to make a mistake it cannot recover from by reading the code?"

What belongs in AGENTS.md:

  • Architecture decisions the code does not show. The code shows you use Fastify, but not that you chose it over Express for specific reasons. The code shows PostgreSQL queries, but not that you use the Result pattern for error handling.
  • Hard constraints. Never modify migration files. Never use default exports. All new endpoints need integration tests. These are guardrails the agent cannot infer.
  • Build and test commands. pnpm test, pnpm lint, pnpm build. The agent should not have to guess.

What does NOT belong in AGENTS.md:

  • Information the agent can see in package.json, tsconfig.json, or the code itself.
  • Task-specific workflows (those belong in skills).
  • Linter rules (the linter enforces those deterministically).
  • Lengthy reference documentation (link to it instead).
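Putting these rules together, a minimal AGENTS.md might look like the following sketch. The specific decisions (Fastify over Express, the Result pattern, the pnpm commands) are the examples used above; everything else is illustrative, not a prescription.

```markdown
# AGENTS.md

## Architecture
- Fastify over Express: we need schema-based request validation and lower overhead.
- Database errors use the Result pattern; never throw from the data layer.

## Hard Constraints
- Never modify migration files.
- Never use default exports.
- All new endpoints need integration tests.

## Commands
- Test: pnpm test
- Lint: pnpm lint
- Build: pnpm build
```

Sixteen lines, and every one of them tells the agent something it cannot recover by reading the code.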

Layer 2: CLAUDE.md — Always-On, Claude Code-Specific

If your team uses only Claude Code, you could put everything in CLAUDE.md. But the moment anyone on the team uses Codex CLI, Copilot, or Cursor, that context becomes invisible to them. The practical strategy: AGENTS.md holds the canonical context, CLAUDE.md holds only Claude Code-specific overrides.

# CLAUDE.md

Read AGENTS.md for project architecture and conventions.

## Claude Code-Specific
- When compacting, preserve the full list of modified files
- Prefer subagents for research tasks over inline exploration
- Use TodoWrite for multi-step tasks

That is 8 lines. It references AGENTS.md for everything shared, and adds only what is unique to Claude Code's behavior. Zero duplication.

Layer 3: SKILL.md — On-Demand, Task-Specific

This is where the architecture pays off. Skills load only when the current task matches the skill's description. Claude Code reads all installed skill descriptions at session start — these are short strings, costing a few hundred tokens total. The full skill body loads only on match.

A developer can have 20 or 50 skills installed, but until one is triggered, each consumes only a sliver of the available token budget. When a skill does load, it can be detailed — up to 500 lines of task-specific instructions, templates, and references — because that cost is paid only when the knowledge is needed.

This is progressive disclosure applied to AI agent context. The concept comes from UI design (show users what they need when they need it), but it maps perfectly to token economics: pay the context cost only for what the current task requires.
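Concretely, the split between the always-loaded description and the on-demand body looks like this sketch. The frontmatter fields follow the SKILL.md format; the deployment content is illustrative.

```markdown
---
name: deployment
description: >
  Deploy services to staging or production. Triggers on release branches,
  deploy checklists, rollback or hotfix requests.
---

<!-- Only the frontmatter above is read at session start; everything below
     loads into context only when the task matches the description. -->

## Deployment Workflow
1. Run the full test suite before building.
2. Build and tag the release artifact.
3. Deploy to staging, verify health checks, then promote to production.
```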

Case Study: Splitting a 2000-Line CLAUDE.md

Here is a real scenario. A fintech team had a 2000-line CLAUDE.md that had grown over six months. It contained architecture overview, API design conventions, database migration procedures, deployment checklists, code review standards, testing patterns, error handling guidelines, security requirements, performance benchmarks, and onboarding instructions.

The result: Claude Code was slow, expensive, and inconsistent. Some instructions were followed, others were ignored. The ETH Zurich research explains why — the agent's attention was diluted across 2000 lines of instructions, most of which were irrelevant to any given task.

Here is how they split it.

Before: One Monolithic File

CLAUDE.md (2000 lines)
├── Architecture overview (80 lines)
├── Code conventions (120 lines)
├── API design standards (250 lines)
├── Database migration procedures (180 lines)
├── Deployment checklist (200 lines)
├── Code review standards (150 lines)
├── Testing patterns (220 lines)
├── Error handling guidelines (130 lines)
├── Security requirements (300 lines)
├── Performance benchmarks (170 lines)
└── Onboarding instructions (200 lines)

After: Three Layers

AGENTS.md (65 lines)
├── Architecture overview (condensed to 20 lines)
├── Code conventions (condensed to 15 lines)
├── Hard constraints (10 lines)
└── Build/test commands (20 lines)

CLAUDE.md (8 lines)
├── Reference to AGENTS.md
└── Claude Code-specific behaviors

.claude/skills/
├── api-design/SKILL.md (180 lines)
├── database-migration/SKILL.md (140 lines)
├── deployment/SKILL.md (160 lines)
├── code-review/SKILL.md (120 lines)
└── security-audit/SKILL.md (220 lines)

Baseline context dropped from ~2000 lines to ~73 lines. The five skills total 820 lines, but any given task loads at most one or two of them. A typical session that involves a code review loads 73 + 120 = 193 lines of context. A deployment task loads 73 + 160 = 233 lines. Compare that to 2000 lines for every task, regardless of relevance.

The Token Math

Rough token estimates (1 line ≈ 15 tokens on average for markdown with code blocks):

| Configuration | Lines in Context | Estimated Tokens | Session Cost (at $3/M input tokens) |
|---------------|------------------|------------------|-------------------------------------|
| Monolithic CLAUDE.md | 2,000 | ~30,000 | ~$0.09 per session |
| Layered (code review task) | 193 | ~2,900 | ~$0.009 per session |
| Layered (deployment task) | 233 | ~3,500 | ~$0.011 per session |

Summary: The layered approach uses roughly 10x fewer context tokens per session. This is not just a cost saving — it is a performance improvement. Fewer irrelevant tokens means the agent's attention is concentrated on what matters.
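The figures above can be reproduced with a few lines of shell. The 15-tokens-per-line ratio and the $3-per-million-token price are this article's rough estimates, not measured values.

```shell
# Estimate context tokens and per-session input cost from a line count.
# Assumes ~15 tokens per line and $3 per million input tokens (rough figures).
estimate() {
  lines=$1
  tokens=$((lines * 15))
  cost=$(awk -v t="$tokens" 'BEGIN { printf "%.4f", t * 3 / 1000000 }')
  echo "$lines lines -> ~$tokens tokens -> ~\$$cost/session"
}

estimate 2000   # monolithic CLAUDE.md
estimate 193    # layered setup, code review task
```

Swap in your own line counts and pricing to see where your setup falls.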

The team reported that after the split, Claude Code's instruction-following accuracy improved noticeably. Tasks that previously required 2-3 attempts started completing on the first try. The agent stopped ignoring instructions — not because it became smarter, but because the relevant instructions were no longer buried in 2000 lines of noise.

Progressive Disclosure in Action

Let us trace exactly what happens when a developer asks Claude Code to "create a new migration for adding a preferences column to the users table."

Phase 1: Session start. Claude Code loads AGENTS.md (65 lines, ~975 tokens) and CLAUDE.md (8 lines, ~120 tokens). It also reads all skill descriptions — short strings from the frontmatter of each installed skill. Total: ~1,200 tokens of baseline context.

Phase 2: Task matching. The user's request mentions "migration." Claude Code matches this against the database-migration skill's description:

description: >
  Create, modify, or review database migrations. Triggers on migration files,
  schema changes, Drizzle ORM modifications, column additions or removals.

Match. Claude Code reads the full database-migration/SKILL.md (140 lines, ~2,100 tokens).

Phase 3: Execution. Claude Code now has 65 + 8 + 140 = 213 lines of context. Every line is relevant to the task. It follows the migration workflow step by step: reads the current schema, checks existing migrations for naming conventions, modifies the schema, generates the migration, runs tests.

Phase 4: Skill references. The migration skill says "follow the error handling conventions in AGENTS.md" instead of restating them. Claude Code already has AGENTS.md in context. No duplication, no extra tokens.

If the same developer then asks "review the code I just wrote," the database-migration skill stays loaded (it is still relevant), and the code-review skill triggers additionally. Two skills, both relevant, both earning their context cost.

This is what Anthropic means when they say "intelligence is not the bottleneck — context is." The same model, with better context engineering, produces measurably better results.

Measuring Context Impact

You cannot improve what you do not measure. Here is how to quantify the impact of your context engineering.

Token Counting

Claude Code does not expose a built-in token counter for context files, but you can estimate:

# Count tokens in your context files (rough estimate: 1 word ≈ 1.3 tokens)
wc -w AGENTS.md CLAUDE.md
# Multiply word count by 1.3 for approximate token count

# Count skill tokens
find .claude/skills -name "SKILL.md" -exec wc -w {} +

For precise counts, use the Anthropic tokenizer API or tiktoken with the cl100k_base encoding (close enough for estimation).

The 5% Rule

Your always-on context (AGENTS.md + CLAUDE.md) should consume less than 5% of the model's effective context window. For Claude Opus 4.6 with a 200K token window, that means under 10,000 tokens — plenty of room for a lean 73-line setup, and a clear red flag if your files exceed 500 lines.
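A quick shell check makes the rule enforceable in CI or a pre-commit hook. The 1.3-tokens-per-word ratio and the 200K window are the estimates used in this article; adjust both for your model.

```shell
# Fail when always-on context files exceed 5% of the model's context window.
# Assumes ~1.3 tokens per word and a 200K-token window (rough estimates).
check_context_budget() {
  window=200000
  budget=$((window / 20))   # 5% of the window = 10,000 tokens
  words=$(cat "$@" 2>/dev/null | wc -w)
  tokens=$(awk -v w="$words" 'BEGIN { printf "%d", w * 1.3 }')
  if [ "$tokens" -gt "$budget" ]; then
    echo "WARN: ~$tokens tokens of always-on context (budget $budget)"
    return 1
  fi
  echo "OK: ~$tokens tokens of always-on context (budget $budget)"
}

check_context_budget AGENTS.md CLAUDE.md
```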

Tracking Instruction Compliance

After restructuring, test 10 representative tasks and score how many instructions the agent follows correctly. Compare against the same tasks with the old monolithic file. This is crude but effective. If compliance goes from 60% to 90%, the restructuring worked.
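One lightweight way to keep score: record one line per tested task in a results file, "pass" if the agent followed the instructions and "fail" otherwise (a made-up convention, not a Claude Code feature), then compute the rate.

```shell
# Compute an instruction-compliance rate from a results file with one
# "pass" or "fail" line per tested task (hypothetical file format).
compliance_rate() {
  total=$(wc -l < "$1")
  passed=$(grep -c '^pass$' "$1")
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%d/%d (%d%%)\n", p, t, p * 100 / t }'
}

printf 'pass\npass\nfail\npass\n' > /tmp/compliance.txt
compliance_rate /tmp/compliance.txt   # 3 of 4 tasks passed
```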

Cost Monitoring

Use Claude Code's built-in cost tracking (/cost command) to compare session costs before and after restructuring. The 10x context reduction translates to meaningful savings on teams running hundreds of sessions per day.

The ETH Zurich Finding: Less Is More

The ETH Zurich study deserves deeper examination because it overturns a common assumption.

Most developers assume more context helps. "If I tell the agent everything about my project, it will make fewer mistakes." The study tested this directly with AGENTbench — 138 real-world Python tasks from niche repositories — and four different agents (Claude 3.5 Sonnet, Codex GPT-5.2, GPT-5.1 mini, and Qwen Code).

The results:

  1. LLM-generated context files degraded performance by ~3% compared to no context file at all.
  2. Human-written context files improved performance by only ~4%, but increased inference costs by 19-20%.
  3. Both types of context files encouraged broader exploration — more testing, more file traversal — which sometimes helped but often wasted compute.

The core finding: detailed codebase overviews are largely redundant with what the agent can already discover by reading the code. Overly restrictive instructions constrain the agent's problem-solving ability. The sweet spot is minimal, high-signal context that prevents unrecoverable mistakes without micromanaging the agent's approach.

This is exactly what the three-layer architecture achieves. AGENTS.md provides the minimal guardrails. Skills provide deep knowledge only when relevant. The agent's reasoning capacity is preserved for the actual task.


Real Project Case Study: Monorepo with Multiple Teams

Consider a monorepo with four packages, each owned by a different team:

monorepo/
├── packages/
│   ├── api/          # Backend team (Fastify, PostgreSQL, Drizzle)
│   ├── web/          # Frontend team (Next.js, React, Tailwind)
│   ├── mobile/       # Mobile team (React Native, Expo)
│   └── shared/       # Platform team (shared types, utils)
├── AGENTS.md         # Root: monorepo-wide context
├── CLAUDE.md         # Root: Claude Code-specific
└── .claude/skills/   # Root: shared skills

Root AGENTS.md (Monorepo-Wide)

## Monorepo Structure
pnpm workspace. 4 packages: api, web, mobile, shared.

## Conventions
- All packages use TypeScript strict mode
- Shared types live in packages/shared/src/types/
- Inter-package imports use workspace protocol: "shared": "workspace:*"
- CI runs affected-only: pnpm --filter ...[origin/main] test

## Hard Constraints
- Never import from api/ in web/ or mobile/ (use shared/ for shared logic)
- Never modify packages/shared/src/types/api-contract.ts without API team review

20 lines. Applies to every task in every package.

Package-Level AGENTS.md (Nested)

Each package has its own AGENTS.md with package-specific conventions:

packages/api/AGENTS.md        # Fastify conventions, DB patterns, auth rules
packages/web/AGENTS.md        # Next.js patterns, component conventions
packages/mobile/AGENTS.md     # React Native conventions, Expo config
packages/shared/AGENTS.md     # Type export patterns, versioning rules

Claude Code and other tools load the root AGENTS.md plus the relevant package AGENTS.md based on which files are being edited. The backend team's conventions only load when working on backend files.

Skills Per Package

This is where skills shine in a monorepo. Different teams need different skills:

.claude/skills/
├── api-endpoint/SKILL.md         # Backend: create Fastify endpoints
├── db-migration/SKILL.md         # Backend: database migration workflow
├── react-component/SKILL.md      # Frontend: component creation pattern
├── storybook-story/SKILL.md      # Frontend: Storybook conventions
├── expo-screen/SKILL.md          # Mobile: new screen workflow
├── deep-link/SKILL.md            # Mobile: deep link configuration
├── shared-type/SKILL.md          # Platform: shared type modification protocol
└── release/SKILL.md              # Platform: release and versioning workflow

A frontend developer asking "create a new Button component" triggers react-component and possibly storybook-story. The backend skills stay dormant. A backend developer asking "add a new endpoint for user preferences" triggers api-endpoint and possibly db-migration. The frontend skills stay dormant.

The total skill library is 8 skills across 4 teams. But any given developer interaction loads at most 2-3, keeping context focused and relevant.

Workspace-Level Configuration

The challenge in monorepos is keeping skill configurations synchronized across packages. When the platform team updates the shared type protocol, the shared-type skill needs to reflect that change. When the API team adds a new middleware pattern, the api-endpoint skill needs updating.

This is where workspace-level tooling becomes important. Managing different skill configurations, context files, and terminal environments across multiple packages in a monorepo is exactly the kind of multi-project coordination that Termdock's workspace sync handles — switching between packages with their full context setup intact, so each team member works with the right skills loaded for their package.

Workspace-Level Skill Configuration

Beyond monorepos, individual developers work across multiple projects with different context needs. A developer contributing to three open-source projects and two work projects has five different AGENTS.md files, five different skill sets, and five different tool configurations.

Switching between these projects means switching mental models — and switching AI agent context. Your CLAUDE.md for the Rust project is wrong for the TypeScript project. Your deployment skill for the AWS project does not apply to the Vercel project.

Personal skills in ~/.claude/skills/ help: your code review style, your commit message format, your debugging approach travel with you across projects. But project-specific skills need to activate when you switch into that project and deactivate when you leave.

This workspace-level coordination — the right context, the right skills, the right terminal environment, all switching together when you change projects — is what separates productive multi-project workflows from constant context-switching friction. Termdock's workspace sync manages this: terminal layouts, environment variables, git state, and skill configurations move as a unit when you switch workspaces.

Practical Steps: Restructuring Your Context

If you have an existing CLAUDE.md that has grown beyond 200 lines, here is the restructuring process.

Step 1: Audit. Read every line of your CLAUDE.md. For each section, ask: "Does this apply to every task, or only to specific types of tasks?"

Step 2: Extract universal context. Move everything that applies to every task into AGENTS.md. Condense aggressively — if the code already shows it, remove it.

Step 3: Identify skill candidates. Each task-specific section becomes a skill candidate. Look for sections with: multi-step workflows, domain-specific knowledge, templates or examples, validation checklists.

Step 4: Create skills. For each candidate, create a directory in .claude/skills/ with a SKILL.md. Write a specific, slightly pushy description. Move the instructions from CLAUDE.md into the skill body.

Step 5: Eliminate duplication. Search for any instruction that appears in both AGENTS.md and a skill. The skill should say "follow the conventions in AGENTS.md" rather than restating them.

Step 6: Reduce CLAUDE.md. What remains should be Claude Code-specific behavior only. Reference AGENTS.md for everything else.

Step 7: Test. Run 5 representative tasks per skill. Verify trigger accuracy (does the right skill load?), instruction compliance (does the agent follow the skill?), and output quality (is the result better than before?).
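Step 4 can be mechanized with a small scaffold script. The directory name and description below mirror the database-migration example from earlier in this article; the workflow bullets are placeholders to replace with the instructions you extract from CLAUDE.md.

```shell
# Scaffold a skill directory with frontmatter; edit the body afterwards.
mkdir -p .claude/skills/database-migration
cat > .claude/skills/database-migration/SKILL.md <<'EOF'
---
name: database-migration
description: >
  Create, modify, or review database migrations. Triggers on migration files,
  schema changes, Drizzle ORM modifications, column additions or removals.
---

## Workflow
1. Read the current schema before changing anything.
2. Check existing migrations for naming conventions.
3. Generate the migration and run the test suite.
EOF
```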

For a deeper dive into skill design — description optimization, eval loops, progressive disclosure patterns — see the good skill design principles guide. For understanding when to use which file, the SKILL.md vs CLAUDE.md vs AGENTS.md comparison provides a complete decision framework.

The Pattern That Emerges

Context engineering with skills follows the same principle that makes good software architecture: separation of concerns with clear interfaces.

AGENTS.md is the interface contract — the universal truths about your project that every tool and every task needs. CLAUDE.md is the adapter — tool-specific behavior layered on top. Skills are the implementations — deep, specialized knowledge loaded on demand through a clean trigger interface (the description field).

The teams that get this right spend less on inference, get more consistent results, and can scale their AI agent setup across projects and team members without the context file becoming a maintenance burden.

Context is the bottleneck. Engineer it accordingly.
