
What Makes a Good Skill? Design Principles for Agent Skills That Actually Work

Most agent skills never trigger or give bad instructions. Learn the 6 design principles — from description optimization to eval loops — that separate good skills from the noise in a 351K+ skill ecosystem.

DH
Danny Huang

Most Skills Are Bad. Here's Why.

The agent skills ecosystem crossed 351,000 published skills in March 2026. That number sounds impressive until you start using them. Most skills are filler. They never trigger when they should, trigger when they should not, or give instructions so vague the agent would have done better without them.

Writing a SKILL.md is trivially easy. You create a directory, drop a Markdown file in it, write a few lines of YAML frontmatter, and you have a "skill." The step-by-step tutorial takes five minutes. But having the format right and having the skill actually work are two entirely different problems.

The quality bar is low because the ecosystem rewards volume, not effectiveness. SkillsMP crawls GitHub and indexes anything with a SKILL.md file. The result: thousands of skills that look complete but fail on first contact with a real user prompt.

This article is about the gap between a syntactically valid skill and a genuinely good one. If you already know the SKILL.md format, this is what comes next — the design thinking that separates skills people actually use from skills that sit in a marketplace rotting.

The principles here are drawn from Anthropic's Skill Creator methodology, Jesse Vincent's Superpowers framework, the ETH Zurich research on context file effectiveness, and a lot of reading skill run transcripts to see where things actually break.

Principle 1: The Description Is Everything

The description field in your SKILL.md frontmatter is the primary trigger mechanism. When a user sends a prompt, the agent reads every installed skill's description and decides which ones to load. If your description is vague, the skill undertriggers — it sits there, installed but never activated. If it is too narrow, it misses valid use cases.

Here is what bad looks like:

```yaml
description: Helps with code review.
```

"Helps with" is meaningless. The agent cannot distinguish when this skill is more relevant than its own built-in code review behavior. There are no trigger conditions, no specificity, no signal.

Here is what good looks like:

```yaml
description: >
  Performs comprehensive code review after writing or modifying code.
  Use when completing logical chunks of development work. Analyzes
  security vulnerabilities, correctness issues, performance problems,
  and maintainability concerns. Outputs structured findings with
  severity ratings. Activate for PR reviews, staged change reviews,
  and file-level audits.
```

This description tells the agent exactly what the skill does, when to activate it, and what the output looks like. The agent can make a clear yes/no decision on any given prompt.

The "Pushy" Technique

Anthropic's own documentation acknowledges that Claude tends to undertrigger skills — to not use them when they would be useful. The fix is to write descriptions that are slightly aggressive about when to activate.

Instead of "Can be used for deployment tasks," write "Use this skill whenever the user asks to deploy, push to staging, release, ship, or when you detect deployment-related file changes." List the trigger conditions explicitly. Err on the side of triggering too often — the cost of loading an irrelevant skill (a few hundred tokens of context) is much lower than the cost of not loading a relevant one (the agent improvises and gets it wrong).
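As a sketch, a "pushy" deployment description might look like this (the skill name and the exact trigger list are illustrative, not taken from any real skill):

```yaml
name: deploy-helper
description: >
  Use this skill whenever the user asks to deploy, push to staging,
  release, ship, roll back, or promote a build, or when you detect
  changes to Dockerfiles, CI workflows, or infrastructure config.
  Runs pre-flight checks, walks through the deployment sequence, and
  reports status. Do not use for restarting a local dev server.
```

Note how the verbs double as synonyms for the same intent: the agent can match any of them against the user's phrasing.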

Testing Trigger Accuracy

The Skill Creator's description optimization system provides a rigorous approach: create 20 eval queries — 10 that should trigger the skill, 10 that should not. Run each query 3 times to get reliable trigger rates. The system splits the eval set 60/40 into train and test, evaluates the current description, proposes improvements based on failures, and iterates.

You do not need the Skill Creator to do this manually. Write down 20 prompts. Run them against your agent. Count how many correctly triggered or correctly did not trigger. If accuracy is below 80%, rewrite the description and test again. This is tedious but it is the single highest-leverage improvement you can make.
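The bookkeeping for a manual run fits in a few lines of shell. This sketch assumes you record each run as a tab-separated line of expected and actual behavior ("trigger" or "skip") in a file of your own; the function name and file format are this example's conventions, not part of any tool:

```shell
# trigger_accuracy: read expected<TAB>actual lines ("trigger"/"skip")
# from the file given as $1 and print the percent judged correctly.
trigger_accuracy() {
  awk -F'\t' '
    { total++; if ($1 == $2) correct++ }
    END { printf "%.0f\n", (total ? 100 * correct / total : 0) }
  ' "$1"
}

# Example: 4 recorded runs, 3 of which behaved as expected.
printf 'trigger\ttrigger\ntrigger\tskip\nskip\tskip\nskip\tskip\n' \
  > /tmp/trigger-results.tsv
trigger_accuracy /tmp/trigger-results.tsv   # prints 75
```

Anything under 80 means the description goes back for another rewrite.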

Principle 2: Explain Why, Not Just What

LLMs are not rule-following machines that need rigid instructions. They are reasoning engines that respond better to explained rationale than to dictated commands. This has real implications for how you write skill instructions.

Here is the rigid approach:

```markdown
## Rules
- ALWAYS use TDD. NEVER skip tests.
- ALWAYS validate input. NEVER trust user data.
- MUST use error boundaries around every async operation.
```

Here is the reasoning approach:

```markdown
## Testing Philosophy

Tests catch regressions before they reach users. Write the test first
so you know exactly what success looks like before writing implementation
code. When the test fails initially, you have proof it is actually
testing the right behavior — not just passing by accident.

## Input Validation

External data — user input, API responses, file contents — arrives in
shapes you do not control. Validate at the boundary so internal code
can trust its inputs. This moves error handling to the edges where
context is richest, keeping core logic clean.
```

Both say "do TDD" and "validate input." The second version explains why, which gives the agent enough context to make good judgment calls in edge cases. When the model encounters a situation the skill author did not anticipate, reasoning-based instructions generalize. Rule-based instructions leave the model guessing.

The MUST/ALWAYS/NEVER Trap

Heavy-handed rules feel authoritative but perform worse than explained reasoning. ETH Zurich's February 2026 study found that overly detailed and restrictive instructions actually reduced agent task success rates. The agents that performed best worked with lean, principled guidance — not walls of ALL-CAPS mandates.

When Rigid Rules ARE Appropriate

Security constraints and compliance requirements are the exception. "Never commit AWS credentials to version control" is not a place for nuanced reasoning. "Never modify existing migration files" is a hard constraint where violation causes real damage. Use rigid rules for binary, non-negotiable constraints. Use reasoning for everything else.

Principle 3: Progressive Disclosure

A SKILL.md that dumps 500 lines of instructions into the context window on every trigger is a SKILL.md that degrades agent performance. The ETH Zurich research confirmed this: more context does not mean better results. Agents are surprisingly good at discovering what they need; front-loading everything adds token cost and cognitive overhead.

The three-level system:

Level 1: Metadata (~100 words, always loaded). The name and description fields in frontmatter. Every installed skill's metadata is loaded at session start. This is the trigger mechanism — keep it tight.

Level 2: SKILL.md body (< 500 lines, loaded when triggered). The main instructions. This is what the agent reads when it decides the skill is relevant. It should contain the procedure, the key rules, and pointers to reference files.

Level 3: Reference files + scripts (loaded on demand). Detailed API specs, comprehensive checklists, template libraries, validation scripts. These load only when the agent encounters a task that requires them.
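On disk, the three levels map to a layout along these lines (the file names are illustrative):

```
my-skill/
├── SKILL.md                      # Level 1 frontmatter + Level 2 body
├── references/
│   └── security-checklist.md    # Level 3: loaded on demand
└── scripts/
    └── validate.sh              # Level 3: run on demand
```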

The rule of thumb: if a section only applies to 20% of use cases, it belongs in a reference file, not the SKILL.md body. A code review skill that includes a 200-line security-specific checklist should put that checklist in references/security-checklist.md and tell the agent to load it only when reviewing auth-related code.
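In that hypothetical code review skill, the body's pointer might read (paths and wording are illustrative):

```markdown
## Security Review

Apply the general checklist below for every review. When the change
touches auth, sessions, or secrets handling, also load
references/security-checklist.md and work through it item by item.
```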

For a full breakdown of how SKILL.md, CLAUDE.md, and AGENTS.md interact — and how progressive disclosure fits into the broader context engineering strategy — see SKILL.md vs CLAUDE.md vs AGENTS.md.

Principle 4: Scripts for Deterministic Work

If something can be a script, it should be a script. LLMs are unreliable at repetitive, exact tasks. Counting lines of code, validating structured formats, running specific command sequences, checking file existence patterns — these are all tasks where an LLM will occasionally hallucinate or vary its approach between runs.

Good candidates for scripts:

  • File validation (does the frontmatter have all required fields?)
  • Format checking (does the commit message match Conventional Commits?)
  • Data extraction (pull all TODOs from the codebase)
  • API calls (hit a health endpoint and report status)
  • Pre-flight checks (are all dependencies installed? is the database running?)

Bad candidates for scripts:

  • Creative decisions (which architecture pattern fits this problem?)
  • Context-dependent choices (should this be a new component or extend an existing one?)
  • Subjective evaluation (is this code well-structured?)

Here is a concrete example — a validation script that checks SKILL.md frontmatter:

```bash
#!/bin/bash
# Validates SKILL.md frontmatter fields
SKILL_FILE="$1"

if [ ! -f "$SKILL_FILE" ]; then
  echo "Error: File not found: $SKILL_FILE"
  exit 1
fi

ERRORS=0

if ! grep -q '^name:' "$SKILL_FILE"; then
  echo "Missing required field: name"
  ERRORS=$((ERRORS + 1))
fi

if ! grep -q '^description:' "$SKILL_FILE"; then
  echo "Missing required field: description"
  ERRORS=$((ERRORS + 1))
fi

LINES=$(wc -l < "$SKILL_FILE")
if [ "$LINES" -gt 500 ]; then
  echo "Warning: SKILL.md is $LINES lines (recommended < 500)"
fi

if [ "$ERRORS" -gt 0 ]; then
  echo "$ERRORS error(s) found."
  exit 1
fi

echo "Validation passed."
exit 0
```

The agent runs this script and gets a deterministic pass/fail result. No interpretation variance, no hallucinated field names, no skipped checks. The script does what the script does, every single time.

The SKILL.md references it simply:

## Validation

Before publishing, validate the skill:

```bash
bash scripts/validate-skill.sh SKILL.md
```

Script source code never enters the context window — only the output does. This keeps the agent's reasoning space clean while ensuring exact, repeatable behavior for mechanical tasks.

Principle 5: Design for the Edge Case You Haven't Seen

Good skills handle unexpected inputs gracefully. The failure mode of most skills is not that they break on the happy path — it is that they encounter a slightly different project structure, a different phrasing, or a different technology stack and produce garbage.

The "theory of mind" approach:

Think about the 10 different ways a user might phrase the task. "Review this code," "check my PR," "audit the auth module," "is this safe to merge?" — a code review skill that only triggers on "review code" misses most real usage.

Think about the 5 different project types where this skill might run. A deployment skill that assumes Docker will fail in serverless projects. A testing skill that assumes Jest will fail in projects using Vitest. Write instructions that reference the project's actual tooling rather than hardcoding your preferred stack.
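One way to avoid hardcoding a stack is to have the skill detect it. As a rough sketch, a pre-flight script could guess the test runner from files actually present in the project; the heuristics below are illustrative and deliberately incomplete:

```shell
# detect_test_runner: guess the project's test runner from marker files
# in the directory given as $1. These checks are a starting point, not
# an exhaustive detection scheme.
detect_test_runner() {
  DIR="${1:-.}"
  if [ -f "$DIR/vitest.config.ts" ] || [ -f "$DIR/vitest.config.js" ]; then
    echo vitest
  elif grep -q '"jest"' "$DIR/package.json" 2>/dev/null; then
    echo jest
  elif [ -f "$DIR/pytest.ini" ] || grep -q 'pytest' "$DIR/pyproject.toml" 2>/dev/null; then
    echo pytest
  else
    echo unknown
  fi
}
```

A testing skill that calls something like this, and falls back to asking the user when it prints `unknown`, survives projects its author never saw.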

Write instructions that generalize, not instructions that overfit. Jesse Vincent's Superpowers framework demonstrates this well. The brainstorming skill does not say "ask exactly these 5 questions." It explains the principle — explore multiple approaches before committing — and lets the agent adapt the principle to the specific context. This is why Superpowers works across wildly different project types while rigid template skills break on anything outside the author's specific setup.

The generalization test: could a developer working on a completely different project type still benefit from this skill's instructions? If no, you have overfit. Extract the principles and let the agent apply them contextually.

Principle 6: Keep It Lean

Every instruction that does not pull its weight wastes context and confuses the model. The best skills are short. Not because short is inherently good, but because every line in a short skill is load-bearing.

Read transcripts, not just outputs. The most revealing debugging technique for skills is reading the full agent transcript — the reasoning trace, not just the final output. You will find:

  • Sections the agent reads but never applies (remove them)
  • Instructions the agent misinterprets because they are ambiguous (rewrite them)
  • Places where the agent spends reasoning tokens re-stating your instructions back to itself (your instructions are too verbose)

Remove sections that always get skipped. If your skill has an "Edge Cases" section that the agent never references in 10 test runs, it is dead weight. Either the edge cases are not triggering, or the agent handles them fine without the instructions.

Reframe ALL-CAPS rules as reasoning. If you are writing ALWAYS and NEVER in uppercase, stop. You are compensating for unclear reasoning with volume. "NEVER use default exports" is weaker than "Named exports enable tree-shaking and make refactoring safer because the import name is tied to the export site, not the consumer." The second version explains why, which means the agent applies it correctly even in edge cases the first version does not cover.

The Agent Skills Guide recommends keeping SKILL.md under 500 lines. That is a ceiling, not a target. Many effective skills are under 100 lines. If your skill exceeds 200 lines, audit it: is every section earning its context cost?

The Eval Loop: How to Actually Test a Skill

Writing a skill without testing it is like shipping code without running it. The Anthropic Skill Creator — updated in March 2026 with eval, improve, and benchmark modes — formalizes this into a methodology that any skill author can follow.

The Four-Step Process

Step 1: Write 2-3 test prompts. These should be realistic — phrased the way a real user would type, with realistic context. Not "test the skill" but "I just finished the auth module refactor. Review the changes on this branch and let me know if it's safe to merge."

Step 2: Run with-skill vs without-skill. Does the skill improve the output? This is the only question that matters. If the agent produces equivalent results without the skill, the skill is adding context cost without adding value.

Step 3: Review the full transcript. Not just the output — the reasoning trace. Did the agent load the skill? Did it follow the instructions? Where did it deviate? Where did it waste time?

Step 4: Iterate. Fix the skill based on what you found. Rerun. Review again. The Skill Creator automates this loop — it uses four composable sub-agents (executor, grader, comparator, analyzer) to run skills against eval prompts, grade outputs, perform blind A/B comparisons between skill versions, and surface patterns that aggregate stats might hide.
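Steps 1, 2, and 4 can be wired together with a small harness. This sketch loops over prompt files and captures a with-skill and a baseline transcript for each; the agent command and the `SKILLS_ENABLED` toggle are placeholders for however your agent is actually invoked, not a real CLI interface:

```shell
# run_ab: for every .txt prompt in $2, run the agent command $1 twice,
# once with skills enabled and once without, saving both outputs to $3.
# SKILLS_ENABLED is a stand-in for your agent's real enable/disable switch.
run_ab() {
  AGENT_CMD="$1"; PROMPT_DIR="$2"; OUT_DIR="$3"
  mkdir -p "$OUT_DIR"
  for p in "$PROMPT_DIR"/*.txt; do
    name=$(basename "$p" .txt)
    SKILLS_ENABLED=1 $AGENT_CMD "$(cat "$p")" > "$OUT_DIR/$name.with-skill.txt"
    SKILLS_ENABLED=0 $AGENT_CMD "$(cat "$p")" > "$OUT_DIR/$name.baseline.txt"
  done
}
```

Passing `echo` as the agent command is a quick way to check the harness itself before pointing it at a real session; diffing the paired transcripts afterward is Step 3.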

Description Optimization

The Skill Creator's description optimization deserves special attention. It uses a train/test split: 60% of your eval queries become the training set, 40% are held out for testing. The system evaluates your current description by running each training query 3 times and measuring the trigger rate. Then it proposes an improved description based on what failed, re-evaluates, and iterates.

This is the most systematic way to improve trigger accuracy. But even without the tool, the principle is the same: measure, analyze failures, improve, re-measure. Most skill authors write a description once and never touch it again. That is why most skills undertrigger.

The Multi-Pane Advantage

This iterative testing loop — edit skill, run prompt, check transcript, edit again — is where having multiple terminal panes side by side pays off. The SKILL.md file in one pane, the agent session in another, the transcript or output in a third. No tab switching, no losing context, just a tight feedback loop.


Anti-Patterns: Skills That Look Good but Fail

The "Kitchen Sink" Skill

Tries to handle code review, deployment, testing, documentation, and PR descriptions in a single skill. The description becomes so broad it either triggers on everything or nothing. The body is 800+ lines and the agent drowns in instructions. Split it. One skill, one verb.

The "Copy-Paste" Skill

A raw prompt dump with no structure — no clear trigger conditions, no process steps, just paragraphs of text. The author copied their ChatGPT prompt into a SKILL.md and called it a skill. The agent cannot parse intent from an unstructured wall of text. Structure matters: sections, headers, ordered steps.

The "Rigid Template" Skill

Works perfectly when the user phrases their request exactly as the author expected, and breaks on any deviation. "When the user says 'create component X', do the following..." fails the moment someone says "build a new widget for the dashboard." Write for intent, not for exact phrasing.

The "Undocumented" Skill

Has no clear trigger conditions in the description or body. Relies on the user knowing the slash command exists. Without explicit "When to Activate" and "When NOT to Activate" sections, auto-triggering is a coin flip. Skills that only work via manual invocation are skills that are effectively invisible to 90% of users.
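A minimal version of those sections might read as follows, here for a hypothetical deployment skill (the conditions are illustrative):

```markdown
## When to Activate
- The user asks to deploy, release, ship, or push to staging.
- Deployment-related files (Dockerfile, CI workflows) were modified.

## When NOT to Activate
- The user is starting a local dev server.
- The change only touches documentation.
```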

The "Security Hole" Skill

Executes shell commands without validation, downloads files from hardcoded URLs, or instructs the agent to read sensitive files and include their contents in outputs. The Snyk ToxicSkills study found 534 skills with critical security issues out of 3,984 scanned. If your skill runs commands, validate inputs. If it reads files, scope the paths. If it makes network requests, make the URLs configurable and documented.
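Scoping a file path costs only a few lines when a skill's helper script takes one from the conversation. A sketch, assuming a POSIX shell and a known project root; the function name is illustrative, and real code should also consider symlinks:

```shell
# safe_read: print the file at $2 only if it resolves inside the
# directory given as $1; refuse anything outside that root.
safe_read() {
  BASE=$(cd "$1" && pwd)
  TARGET="$(cd "$(dirname "$2")" 2>/dev/null && pwd)/$(basename "$2")"
  case "$TARGET" in
    "$BASE"/*) cat "$TARGET" ;;
    *) echo "Refusing to read outside $BASE" >&2; return 1 ;;
  esac
}

# Example: only paths under the project root are readable.
mkdir -p /tmp/demo-proj && printf 'ok' > /tmp/demo-proj/notes.txt
safe_read /tmp/demo-proj /tmp/demo-proj/notes.txt   # prints ok
```

A request for `/etc/passwd` through the same function is refused with a nonzero exit, so the agent sees a clear error instead of sensitive file contents.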

A Checklist Before You Ship

Before publishing or sharing a skill, run through this quality gate:

  • Description is specific, includes trigger conditions, and is slightly "pushy"
  • Body explains WHY behind rules, not just WHAT to do
  • SKILL.md is under 500 lines; detailed references live in separate files
  • Deterministic tasks use scripts, not natural-language instructions
  • Tested with 2-3 realistic prompts — outputs compared against no-skill baseline
  • Trigger accuracy tested: at least 10 should-trigger and 10 should-not-trigger queries
  • No hardcoded paths, API keys, or environment-specific values
  • Works across Claude Code and at least one other agent (Codex CLI or Copilot)
  • Has clear "When to Activate" and "When NOT to Activate" sections
  • Every section in the body is load-bearing — no dead weight

Building good skills is the same discipline as writing good code: ruthless about what stays in, honest about what works, and always tested against reality. The Agent Skills Guide covers the full ecosystem — marketplaces, security, cross-agent compatibility. This article covered the craft. Both matter.

#agent-skills #skill-md #best-practices #design-principles #claude-code
