
Context Engineering with Skills: How to Layer CLAUDE.md, AGENTS.md, and SKILL.md for Maximum Agent Performance

Every token in your context window competes with reasoning capacity. Learn the three-layer architecture — always-on CLAUDE.md, cross-tool AGENTS.md, on-demand SKILL.md — to keep your AI agent fast, focused, and correct.

Danny Huang

The Problem: Your Context Window Is Not Free

Every token you put into an AI agent's context window is a token that competes with the agent's ability to reason. This is not a metaphor. It is how transformer attention works. More input tokens mean the model's attention is spread thinner across more material, and the quality of its output degrades measurably.

Chroma's context rot research tested 18 frontier models and found that every single one gets worse as input length increases — even on simple retrieval tasks. A model with a 200K token window can show significant degradation at 50K tokens. Adding full conversation history (~113K tokens) can drop accuracy by 30% compared to a focused 300-token version of the same information.

The ETH Zurich study on AGENTS.md files ("Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" by Gloaguen et al.) confirmed this directly for coding agents: LLM-generated context files reduced task success rates by 3% compared to no context file at all. Human-written files improved success by only 4%, while increasing inference costs by over 20%.

The implication is clear: context is not free, and more context is not better. Intelligence is not the bottleneck — context is. The question is not "what can I tell my agent?" but "what is the minimum high-signal information my agent needs for this specific task?"

This is context engineering. And the three-layer architecture of CLAUDE.md, AGENTS.md, and SKILL.md is the best tool we have in 2026 for solving it.

The Three-Layer Architecture

The solution is progressive disclosure — load information at the right time, in the right amount, for the right task. Three files, three scopes, three loading behaviors.

| Layer | File | Loaded When | Scope | Token Budget |
|-------|------|-------------|-------|--------------|
| 1 — Foundation | AGENTS.md | Every session, always | Cross-tool project context | < 100 lines |
| 2 — Tool-specific | CLAUDE.md | Every session, always | Claude Code-specific overrides | < 20 lines |
| 3 — On-demand | SKILL.md | Only when task matches description | Task-specific capability | < 500 lines per skill |

Summary: AGENTS.md carries the universal project context that every AI tool reads. CLAUDE.md adds Claude Code-specific instructions. Skills load specialized knowledge only when the current task needs it. The result is a lean baseline context with deep capability available on demand.

Layer 1: AGENTS.md — Always-On, Cross-Tool

AGENTS.md is the universal project context file. It is read by Claude Code, Codex CLI, Copilot CLI, Gemini CLI, and Cursor. It loads at the start of every session and stays in context for the entire conversation. Every token in this file counts against every task you run.

Because of this, AGENTS.md must be ruthlessly lean. Under 100 lines. Every line must pass one test: "Would removing this cause the agent to make a mistake it cannot recover from by reading the code?"

What belongs in AGENTS.md:

  • Architecture decisions the code does not show. The code shows you use Fastify, but not that you chose it over Express for specific reasons. The code shows PostgreSQL queries, but not that you use the Result pattern for error handling.
  • Hard constraints. Never modify migration files. Never use default exports. All new endpoints need integration tests. These are guardrails the agent cannot infer.
  • Build and test commands. pnpm test, pnpm lint, pnpm build. The agent should not have to guess.

What does NOT belong in AGENTS.md:

  • Information the agent can see in package.json, tsconfig.json, or the code itself.
  • Task-specific workflows (those belong in skills).
  • Linter rules (the linter enforces those deterministically).
  • Lengthy reference documentation (link to it instead).
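Putting these rules together, a minimal AGENTS.md might look like the following sketch. The specific decisions (Fastify over Express, the Result pattern, the pnpm commands) are the examples used above; everything else is illustrative, not a prescription.

```markdown
# AGENTS.md

## Architecture
- Fastify over Express: we need schema-based request validation and lower overhead.
- Database errors use the Result pattern; never throw from the data layer.

## Hard Constraints
- Never modify migration files.
- Never use default exports.
- All new endpoints need integration tests.

## Commands
- Test: pnpm test
- Lint: pnpm lint
- Build: pnpm build
```

Sixteen lines, and every one of them tells the agent something it cannot recover by reading the code.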

Layer 2: CLAUDE.md — Always-On, Claude Code-Specific

If your team uses only Claude Code, you could put everything in CLAUDE.md. But the moment anyone on the team uses Codex CLI, Copilot, or Cursor, that context becomes invisible to them. The practical strategy: AGENTS.md holds the canonical context, CLAUDE.md holds only Claude Code-specific overrides.

# CLAUDE.md

Read AGENTS.md for project architecture and conventions.

## Claude Code-Specific
- When compacting, preserve the full list of modified files
- Prefer subagents for research tasks over inline exploration
- Use TodoWrite for multi-step tasks

That is 8 lines. It references AGENTS.md for everything shared, and adds only what is unique to Claude Code's behavior. Zero duplication.

Layer 3: SKILL.md — On-Demand, Task-Specific

This is where the architecture pays off. Skills load only when the current task matches the skill's description. Claude Code reads all installed skill descriptions at session start — these are short strings, costing a few hundred tokens total. The full skill body loads only on match.

A developer can have 20 or 50 skills installed, but until one is triggered, each consumes only a sliver of the available token budget. When a skill does load, it can be detailed — up to 500 lines of task-specific instructions, templates, and references — because that cost is paid only when the knowledge is needed.

This is progressive disclosure applied to AI agent context. The concept comes from UI design (show users what they need when they need it), but it maps perfectly to token economics: pay the context cost only for what the current task requires.
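Concretely, the split between the always-loaded description and the on-demand body looks like this sketch. The frontmatter fields follow the SKILL.md format; the deployment content is illustrative.

```markdown
---
name: deployment
description: >
  Deploy services to staging or production. Triggers on release branches,
  deploy checklists, rollback or hotfix requests.
---

<!-- Only the frontmatter above is read at session start; everything below
     loads into context only when the task matches the description. -->

## Deployment Workflow
1. Run the full test suite before building.
2. Build and tag the release artifact.
3. Deploy to staging, verify health checks, then promote to production.
```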

Case Study: Splitting a 2000-Line CLAUDE.md

Here is a real scenario. A fintech team had a 2000-line CLAUDE.md that had grown over six months. It contained architecture overview, API design conventions, database migration procedures, deployment checklists, code review standards, testing patterns, error handling guidelines, security requirements, performance benchmarks, and onboarding instructions.

The result: Claude Code was slow, expensive, and inconsistent. Some instructions were followed, others were ignored. The ETH Zurich research explains why — the agent's attention was diluted across 2000 lines of instructions, most of which were irrelevant to any given task.

Here is how they split it.

Before: One Monolithic File

CLAUDE.md (2000 lines)
├── Architecture overview (80 lines)
├── Code conventions (120 lines)
├── API design standards (250 lines)
├── Database migration procedures (180 lines)
├── Deployment checklist (200 lines)
├── Code review standards (150 lines)
├── Testing patterns (220 lines)
├── Error handling guidelines (130 lines)
├── Security requirements (300 lines)
├── Performance benchmarks (170 lines)
└── Onboarding instructions (200 lines)

After: Three Layers

AGENTS.md (65 lines)
├── Architecture overview (condensed to 20 lines)
├── Code conventions (condensed to 15 lines)
├── Hard constraints (10 lines)
└── Build/test commands (20 lines)

CLAUDE.md (8 lines)
├── Reference to AGENTS.md
└── Claude Code-specific behaviors

.claude/skills/
├── api-design/SKILL.md (180 lines)
├── database-migration/SKILL.md (140 lines)
├── deployment/SKILL.md (160 lines)
├── code-review/SKILL.md (120 lines)
└── security-audit/SKILL.md (220 lines)

Baseline context dropped from ~2000 lines to ~73 lines. The five skills total 820 lines, but any given task loads at most one or two of them. A typical session that involves a code review loads 73 + 120 = 193 lines of context. A deployment task loads 73 + 160 = 233 lines. Compare that to 2000 lines for every task, regardless of relevance.

The Token Math

Rough token estimates (1 line ≈ 15 tokens on average for markdown with code blocks):

| Configuration | Lines in Context | Estimated Tokens | Session Cost (at $3/M input tokens) |
|---------------|------------------|------------------|-------------------------------------|
| Monolithic CLAUDE.md | 2,000 | ~30,000 | ~$0.09 per session |
| Layered (code review task) | 193 | ~2,900 | ~$0.009 per session |
| Layered (deployment task) | 233 | ~3,500 | ~$0.011 per session |

Summary: The layered approach uses roughly 10x fewer context tokens per session. This is not just a cost saving — it is a performance improvement. Fewer irrelevant tokens means the agent's attention is concentrated on what matters.
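The figures above can be reproduced with a few lines of shell. The 15-tokens-per-line ratio and the $3-per-million-token price are this article's rough estimates, not measured values.

```shell
# Estimate context tokens and per-session input cost from a line count.
# Assumes ~15 tokens per line and $3 per million input tokens (rough figures).
estimate() {
  lines=$1
  tokens=$((lines * 15))
  cost=$(awk -v t="$tokens" 'BEGIN { printf "%.4f", t * 3 / 1000000 }')
  echo "$lines lines -> ~$tokens tokens -> ~\$$cost/session"
}

estimate 2000   # monolithic CLAUDE.md
estimate 193    # layered setup, code review task
```

Swap in your own line counts and pricing to see where your setup falls.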

The team reported that after the split, Claude Code's instruction-following accuracy improved noticeably. Tasks that previously required 2-3 attempts started completing on the first try. The agent stopped ignoring instructions — not because it became smarter, but because the relevant instructions were no longer buried in 2000 lines of noise.

Progressive Disclosure in Action

Let us trace exactly what happens when a developer asks Claude Code to "create a new migration for adding a preferences column to the users table."

Phase 1: Session start. Claude Code loads AGENTS.md (65 lines, ~975 tokens) and CLAUDE.md (8 lines, ~120 tokens). It also reads all skill descriptions — short strings from the frontmatter of each installed skill. Total: ~1,200 tokens of baseline context.

Phase 2: Task matching. The user's request mentions "migration." Claude Code matches this against the database-migration skill's description:

description: >
  Create, modify, or review database migrations. Triggers on migration files,
  schema changes, Drizzle ORM modifications, column additions or removals.

Match. Claude Code reads the full database-migration/SKILL.md (140 lines, ~2,100 tokens).

Phase 3: Execution. Claude Code now has 65 + 8 + 140 = 213 lines of context. Every line is relevant to the task. It follows the migration workflow step by step: reads the current schema, checks existing migrations for naming conventions, modifies the schema, generates the migration, runs tests.

Phase 4: Skill references. The migration skill says "follow the error handling conventions in AGENTS.md" instead of restating them. Claude Code already has AGENTS.md in context. No duplication, no extra tokens.

If the same developer then asks "review the code I just wrote," the database-migration skill stays loaded (it is still relevant), and the code-review skill triggers additionally. Two skills, both relevant, both earning their context cost.

This is what Anthropic means when they say "intelligence is not the bottleneck — context is." The same model, with better context engineering, produces measurably better results.

Measuring Context Impact

You cannot improve what you do not measure. Here is how to quantify the impact of your context engineering.

Token Counting

Claude Code does not expose a built-in token counter for context files, but you can estimate:

# Count tokens in your context files (rough estimate: 1 word ≈ 1.3 tokens)
wc -w AGENTS.md CLAUDE.md
# Multiply word count by 1.3 for approximate token count

# Count skill tokens
find .claude/skills -name "SKILL.md" -exec wc -w {} +

For precise counts, use the Anthropic tokenizer API or tiktoken with the cl100k_base encoding (close enough for estimation).

The 5% Rule

Your always-on context (AGENTS.md + CLAUDE.md) should consume less than 5% of the model's effective context window. For Claude Opus 4.6 with a 200K token window, that means under 10,000 tokens — plenty of room for a lean 73-line setup, and a clear red flag if your files exceed 500 lines.
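A quick shell check makes the rule enforceable in CI or a pre-commit hook. The 1.3-tokens-per-word ratio and the 200K window are the estimates used in this article; adjust both for your model.

```shell
# Fail when always-on context files exceed 5% of the model's context window.
# Assumes ~1.3 tokens per word and a 200K-token window (rough estimates).
check_context_budget() {
  window=200000
  budget=$((window / 20))   # 5% of the window = 10,000 tokens
  words=$(cat "$@" 2>/dev/null | wc -w)
  tokens=$(awk -v w="$words" 'BEGIN { printf "%d", w * 1.3 }')
  if [ "$tokens" -gt "$budget" ]; then
    echo "WARN: ~$tokens tokens of always-on context (budget $budget)"
    return 1
  fi
  echo "OK: ~$tokens tokens of always-on context (budget $budget)"
}

check_context_budget AGENTS.md CLAUDE.md
```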

Tracking Instruction Compliance

After restructuring, test 10 representative tasks and score how many instructions the agent follows correctly. Compare against the same tasks with the old monolithic file. This is crude but effective. If compliance goes from 60% to 90%, the restructuring worked.
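One lightweight way to keep score: record one line per tested task in a results file, "pass" if the agent followed the instructions and "fail" otherwise (a made-up convention, not a Claude Code feature), then compute the rate.

```shell
# Compute an instruction-compliance rate from a results file with one
# "pass" or "fail" line per tested task (hypothetical file format).
compliance_rate() {
  total=$(wc -l < "$1")
  passed=$(grep -c '^pass$' "$1")
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%d/%d (%d%%)\n", p, t, p * 100 / t }'
}

printf 'pass\npass\nfail\npass\n' > /tmp/compliance.txt
compliance_rate /tmp/compliance.txt   # 3 of 4 tasks passed
```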

Cost Monitoring

Use Claude Code's built-in cost tracking (/cost command) to compare session costs before and after restructuring. The 10x context reduction translates to meaningful savings on teams running hundreds of sessions per day.

The ETH Zurich Finding: Less Is More

The ETH Zurich study deserves deeper examination because it overturns a common assumption.

Most developers assume more context helps. "If I tell the agent everything about my project, it will make fewer mistakes." The study tested this directly with AGENTbench — 138 real-world Python tasks from niche repositories — and four different agents (Claude 3.5 Sonnet, Codex GPT-5.2, GPT-5.1 mini, and Qwen Code).

The results:

  1. LLM-generated context files degraded performance by ~3% compared to no context file at all.
  2. Human-written context files improved performance by only ~4%, but increased inference costs by 19-20%.
  3. Both types of context files encouraged broader exploration — more testing, more file traversal — which sometimes helped but often wasted compute.

The core finding: detailed codebase overviews are largely redundant with what the agent can already discover by reading the code. Overly restrictive instructions constrain the agent's problem-solving ability. The sweet spot is minimal, high-signal context that prevents unrecoverable mistakes without micromanaging the agent's approach.

This is exactly what the three-layer architecture achieves. AGENTS.md provides the minimal guardrails. Skills provide deep knowledge only when relevant. The agent's reasoning capacity is preserved for the actual task.


Real Project Case Study: Monorepo with Multiple Teams

Consider a monorepo with four packages, each owned by a different team:

monorepo/
├── packages/
│   ├── api/          # Backend team (Fastify, PostgreSQL, Drizzle)
│   ├── web/          # Frontend team (Next.js, React, Tailwind)
│   ├── mobile/       # Mobile team (React Native, Expo)
│   └── shared/       # Platform team (shared types, utils)
├── AGENTS.md         # Root: monorepo-wide context
├── CLAUDE.md         # Root: Claude Code-specific
└── .claude/skills/   # Root: shared skills

Root AGENTS.md (Monorepo-Wide)

## Monorepo Structure
pnpm workspace. 4 packages: api, web, mobile, shared.

## Conventions
- All packages use TypeScript strict mode
- Shared types live in packages/shared/src/types/
- Inter-package imports use workspace protocol: "shared": "workspace:*"
- CI runs affected-only: pnpm --filter ...[origin/main] test

## Hard Constraints
- Never import from api/ in web/ or mobile/ (use shared/ for shared logic)
- Never modify packages/shared/src/types/api-contract.ts without API team review

20 lines. Applies to every task in every package.

Package-Level AGENTS.md (Nested)

Each package has its own AGENTS.md with package-specific conventions:

packages/api/AGENTS.md        # Fastify conventions, DB patterns, auth rules
packages/web/AGENTS.md        # Next.js patterns, component conventions
packages/mobile/AGENTS.md     # React Native conventions, Expo config
packages/shared/AGENTS.md     # Type export patterns, versioning rules

Claude Code and other tools load the root AGENTS.md plus the relevant package AGENTS.md based on which files are being edited. The backend team's conventions only load when working on backend files.

Skills Per Package

This is where skills shine in a monorepo. Different teams need different skills:

.claude/skills/
├── api-endpoint/SKILL.md         # Backend: create Fastify endpoints
├── db-migration/SKILL.md         # Backend: database migration workflow
├── react-component/SKILL.md      # Frontend: component creation pattern
├── storybook-story/SKILL.md      # Frontend: Storybook conventions
├── expo-screen/SKILL.md          # Mobile: new screen workflow
├── deep-link/SKILL.md            # Mobile: deep link configuration
├── shared-type/SKILL.md          # Platform: shared type modification protocol
└── release/SKILL.md              # Platform: release and versioning workflow

A frontend developer asking "create a new Button component" triggers react-component and possibly storybook-story. The backend skills stay dormant. A backend developer asking "add a new endpoint for user preferences" triggers api-endpoint and possibly db-migration. The frontend skills stay dormant.

The total skill library is 8 skills across 4 teams. But any given developer interaction loads at most 2-3, keeping context focused and relevant.

Workspace-Level Configuration

The challenge in monorepos is keeping skill configurations synchronized across packages. When the platform team updates the shared type protocol, the shared-type skill needs to reflect that change. When the API team adds a new middleware pattern, the api-endpoint skill needs updating.

This is where workspace-level tooling becomes important. Managing different skill configurations, context files, and terminal environments across multiple packages in a monorepo is exactly the kind of multi-project coordination that Termdock's workspace sync handles — switching between packages with their full context setup intact, so each team member works with the right skills loaded for their package.

Workspace-Level Skill Configuration

Beyond monorepos, individual developers work across multiple projects with different context needs. A developer contributing to three open-source projects and two work projects has five different AGENTS.md files, five different skill sets, and five different tool configurations.

Switching between these projects means switching mental models — and switching AI agent context. Your CLAUDE.md for the Rust project is wrong for the TypeScript project. Your deployment skill for the AWS project does not apply to the Vercel project.

Personal skills in ~/.claude/skills/ help: your code review style, your commit message format, your debugging approach travel with you across projects. But project-specific skills need to activate when you switch into that project and deactivate when you leave.

This workspace-level coordination — the right context, the right skills, the right terminal environment, all switching together when you change projects — is what separates productive multi-project workflows from constant context-switching friction. Termdock's workspace sync manages this: terminal layouts, environment variables, git state, and skill configurations move as a unit when you switch workspaces.

Practical Steps: Restructuring Your Context

If you have an existing CLAUDE.md that has grown beyond 200 lines, here is the restructuring process.

Step 1: Audit. Read every line of your CLAUDE.md. For each section, ask: "Does this apply to every task, or only to specific types of tasks?"

Step 2: Extract universal context. Move everything that applies to every task into AGENTS.md. Condense aggressively — if the code already shows it, remove it.

Step 3: Identify skill candidates. Each task-specific section becomes a skill candidate. Look for sections with: multi-step workflows, domain-specific knowledge, templates or examples, validation checklists.

Step 4: Create skills. For each candidate, create a directory in .claude/skills/ with a SKILL.md. Write a specific, slightly pushy description. Move the instructions from CLAUDE.md into the skill body.

Step 5: Eliminate duplication. Search for any instruction that appears in both AGENTS.md and a skill. The skill should say "follow the conventions in AGENTS.md" rather than restating them.

Step 6: Reduce CLAUDE.md. What remains should be Claude Code-specific behavior only. Reference AGENTS.md for everything else.

Step 7: Test. Run 5 representative tasks per skill. Verify trigger accuracy (does the right skill load?), instruction compliance (does the agent follow the skill?), and output quality (is the result better than before?).
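Step 4 can be mechanized with a small scaffold script. The directory name and description below mirror the database-migration example from earlier in this article; the workflow bullets are placeholders to replace with the instructions you extract from CLAUDE.md.

```shell
# Scaffold a skill directory with frontmatter; edit the body afterwards.
mkdir -p .claude/skills/database-migration
cat > .claude/skills/database-migration/SKILL.md <<'EOF'
---
name: database-migration
description: >
  Create, modify, or review database migrations. Triggers on migration files,
  schema changes, Drizzle ORM modifications, column additions or removals.
---

## Workflow
1. Read the current schema before changing anything.
2. Check existing migrations for naming conventions.
3. Generate the migration and run the test suite.
EOF
```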

For a deeper dive into skill design — description optimization, eval loops, progressive disclosure patterns — see the good skill design principles guide. For understanding when to use which file, the SKILL.md vs CLAUDE.md vs AGENTS.md comparison provides a complete decision framework.

The Pattern That Emerges

Context engineering with skills follows the same principle that makes good software architecture: separation of concerns with clear interfaces.

AGENTS.md is the interface contract — the universal truths about your project that every tool and every task needs. CLAUDE.md is the adapter — tool-specific behavior layered on top. Skills are the implementations — deep, specialized knowledge loaded on demand through a clean trigger interface (the description field).

The teams that get this right spend less on inference, get more consistent results, and can scale their AI agent setup across projects and team members without the context file becoming a maintenance burden.

Context is the bottleneck. Engineer it accordingly.
