
What Makes a Good Skill? Design Principles for Agent Skills That Actually Work

Most agent skills never trigger or give bad instructions. Learn the 6 design principles — from description optimization to eval loops — that separate good skills from the noise in a 351K+ skill ecosystem.

DH
Danny Huang

Most Skills Are Bad. Here's Why.

The agent skills ecosystem crossed 351,000 published skills in March 2026. That number sounds impressive until you start using them. Most skills are filler. They never trigger when they should, trigger when they should not, or give instructions so vague the agent would have done better without them.

Writing a SKILL.md is trivially easy. You create a directory, drop a Markdown file in it, write a few lines of YAML frontmatter, and you have a "skill." The step-by-step tutorial takes five minutes. But having the format right and having the skill actually work are two entirely different problems.

The quality bar is low because the ecosystem rewards volume, not effectiveness. SkillsMP crawls GitHub and indexes anything with a SKILL.md file. The result: thousands of skills that look complete but fail on first contact with a real user prompt.

This article is about the gap between a syntactically valid skill and a genuinely good one. If you already know the SKILL.md format, this is what comes next — the design thinking that separates skills people actually use from skills that sit in a marketplace rotting.

The principles here are drawn from Anthropic's Skill Creator methodology, Jesse Vincent's Superpowers framework, the ETH Zurich research on context file effectiveness, and a lot of reading skill run transcripts to see where things actually break.

Principle 1: The Description Is Everything

The description field in your SKILL.md frontmatter is the primary trigger mechanism. When a user sends a prompt, the agent reads every installed skill's description and decides which ones to load. If your description is vague, the skill undertriggers — it sits there, installed but never activated. If it is too narrow, it misses valid use cases.

Here is what bad looks like:

```yaml
description: Helps with code review.
```

"Helps with" is meaningless. The agent cannot distinguish when this skill is more relevant than its own built-in code review behavior. There are no trigger conditions, no specificity, no signal.

Here is what good looks like:

```yaml
description: >
  Performs comprehensive code review after writing or modifying code.
  Use when completing logical chunks of development work. Analyzes
  security vulnerabilities, correctness issues, performance problems,
  and maintainability concerns. Outputs structured findings with
  severity ratings. Activate for PR reviews, staged change reviews,
  and file-level audits.
```

This description tells the agent exactly what the skill does, when to activate it, and what the output looks like. The agent can make a clear yes/no decision on any given prompt.

The "Pushy" Technique

Anthropic's own documentation acknowledges that Claude tends to undertrigger skills — to not use them when they would be useful. The fix is to write descriptions that are slightly aggressive about when to activate.

Instead of "Can be used for deployment tasks," write "Use this skill whenever the user asks to deploy, push to staging, release, ship, or when you detect deployment-related file changes." List the trigger conditions explicitly. Err on the side of triggering too often — the cost of loading an irrelevant skill (a few hundred tokens of context) is much lower than the cost of not loading a relevant one (the agent improvises and gets it wrong).
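As a sketch, a "pushy" deployment description might look like this (the skill name and the exact trigger list are illustrative, not taken from any real skill):

```yaml
name: deploy-helper
description: >
  Use this skill whenever the user asks to deploy, push to staging,
  release, ship, roll back, or promote a build, or when you detect
  changes to Dockerfiles, CI workflows, or infrastructure config.
  Runs pre-flight checks, walks through the deployment sequence, and
  reports status. Do not use for restarting a local dev server.
```

Note how the verbs double as synonyms for the same intent: the agent can match any of them against the user's phrasing.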

Testing Trigger Accuracy

The Skill Creator's description optimization system provides a rigorous approach: create 20 eval queries — 10 that should trigger the skill, 10 that should not. Run each query 3 times to get reliable trigger rates. The system splits the eval set 60/40 into train and test, evaluates the current description, proposes improvements based on failures, and iterates.

You do not need the Skill Creator to do this manually. Write down 20 prompts. Run them against your agent. Count how many correctly triggered or correctly did not trigger. If accuracy is below 80%, rewrite the description and test again. This is tedious but it is the single highest-leverage improvement you can make.
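The bookkeeping for a manual run fits in a few lines of shell. This sketch assumes you record each run as a tab-separated line of expected and actual behavior ("trigger" or "skip") in a file of your own; the function name and file format are this example's conventions, not part of any tool:

```shell
# trigger_accuracy: read expected<TAB>actual lines ("trigger"/"skip")
# from the file given as $1 and print the percent judged correctly.
trigger_accuracy() {
  awk -F'\t' '
    { total++; if ($1 == $2) correct++ }
    END { printf "%.0f\n", (total ? 100 * correct / total : 0) }
  ' "$1"
}

# Example: 4 recorded runs, 3 of which behaved as expected.
printf 'trigger\ttrigger\ntrigger\tskip\nskip\tskip\nskip\tskip\n' \
  > /tmp/trigger-results.tsv
trigger_accuracy /tmp/trigger-results.tsv   # prints 75
```

Anything under 80 means the description goes back for another rewrite.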

Principle 2: Explain Why, Not Just What

LLMs are not rule-following machines that need rigid instructions. They are reasoning engines that respond better to explained rationale than to dictated commands. This has real implications for how you write skill instructions.

Here is the rigid approach:

```markdown
## Rules
- ALWAYS use TDD. NEVER skip tests.
- ALWAYS validate input. NEVER trust user data.
- MUST use error boundaries around every async operation.
```

Here is the reasoning approach:

```markdown
## Testing Philosophy

Tests catch regressions before they reach users. Write the test first
so you know exactly what success looks like before writing implementation
code. When the test fails initially, you have proof it is actually
testing the right behavior — not just passing by accident.

## Input Validation

External data — user input, API responses, file contents — arrives in
shapes you do not control. Validate at the boundary so internal code
can trust its inputs. This moves error handling to the edges where
context is richest, keeping core logic clean.
```

Both say "do TDD" and "validate input." The second version explains why, which gives the agent enough context to make good judgment calls in edge cases. When the model encounters a situation the skill author did not anticipate, reasoning-based instructions generalize. Rule-based instructions leave the model guessing.

The MUST/ALWAYS/NEVER Trap

Heavy-handed rules feel authoritative but perform worse than explained reasoning. ETH Zurich's February 2026 study found that overly detailed and restrictive instructions actually reduced agent task success rates. The agents that performed best worked with lean, principled guidance — not walls of ALL-CAPS mandates.

When Rigid Rules ARE Appropriate

Security constraints and compliance requirements are the exception. "Never commit AWS credentials to version control" is not a place for nuanced reasoning. "Never modify existing migration files" is a hard constraint where violation causes real damage. Use rigid rules for binary, non-negotiable constraints. Use reasoning for everything else.

Principle 3: Progressive Disclosure

A SKILL.md that dumps 500 lines of instructions into the context window on every trigger is a SKILL.md that degrades agent performance. The ETH Zurich research confirmed this: more context does not mean better results. Agents are surprisingly good at discovering what they need; front-loading everything adds token cost and cognitive overhead.

The three-level system:

Level 1: Metadata (~100 words, always loaded). The name and description fields in frontmatter. Every installed skill's metadata is loaded at session start. This is the trigger mechanism — keep it tight.

Level 2: SKILL.md body (< 500 lines, loaded when triggered). The main instructions. This is what the agent reads when it decides the skill is relevant. It should contain the procedure, the key rules, and pointers to reference files.

Level 3: Reference files + scripts (loaded on demand). Detailed API specs, comprehensive checklists, template libraries, validation scripts. These load only when the agent encounters a task that requires them.
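On disk, the three levels map to a layout along these lines (the file names are illustrative):

```
my-skill/
├── SKILL.md                      # Level 1 frontmatter + Level 2 body
├── references/
│   └── security-checklist.md    # Level 3: loaded on demand
└── scripts/
    └── validate.sh              # Level 3: run on demand
```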

The rule of thumb: if a section only applies to 20% of use cases, it belongs in a reference file, not the SKILL.md body. A code review skill that includes a 200-line security-specific checklist should put that checklist in references/security-checklist.md and tell the agent to load it only when reviewing auth-related code.
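In that hypothetical code review skill, the body's pointer might read (paths and wording are illustrative):

```markdown
## Security Review

Apply the general checklist below for every review. When the change
touches auth, sessions, or secrets handling, also load
references/security-checklist.md and work through it item by item.
```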

For a full breakdown of how SKILL.md, CLAUDE.md, and AGENTS.md interact — and how progressive disclosure fits into the broader context engineering strategy — see SKILL.md vs CLAUDE.md vs AGENTS.md.

Principle 4: Scripts for Deterministic Work

If something can be a script, it should be a script. LLMs are unreliable at repetitive, exact tasks. Counting lines of code, validating structured formats, running specific command sequences, checking file existence patterns — these are all tasks where an LLM will occasionally hallucinate or vary its approach between runs.

Good candidates for scripts:

  • File validation (does the frontmatter have all required fields?)
  • Format checking (does the commit message match Conventional Commits?)
  • Data extraction (pull all TODOs from the codebase)
  • API calls (hit a health endpoint and report status)
  • Pre-flight checks (are all dependencies installed? is the database running?)

Bad candidates for scripts:

  • Creative decisions (which architecture pattern fits this problem?)
  • Context-dependent choices (should this be a new component or extend an existing one?)
  • Subjective evaluation (is this code well-structured?)

Here is a concrete example — a validation script that checks SKILL.md frontmatter:

```bash
#!/bin/bash
# Validates SKILL.md frontmatter fields
SKILL_FILE="$1"

if [ ! -f "$SKILL_FILE" ]; then
  echo "Error: File not found: $SKILL_FILE"
  exit 1
fi

ERRORS=0

if ! grep -q '^name:' "$SKILL_FILE"; then
  echo "Missing required field: name"
  ERRORS=$((ERRORS + 1))
fi

if ! grep -q '^description:' "$SKILL_FILE"; then
  echo "Missing required field: description"
  ERRORS=$((ERRORS + 1))
fi

LINES=$(wc -l < "$SKILL_FILE")
if [ "$LINES" -gt 500 ]; then
  echo "Warning: SKILL.md is $LINES lines (recommended < 500)"
fi

if [ "$ERRORS" -gt 0 ]; then
  echo "$ERRORS error(s) found."
  exit 1
fi

echo "Validation passed."
exit 0
```

The agent runs this script and gets a deterministic pass/fail result. No interpretation variance, no hallucinated field names, no skipped checks. The script does what the script does, every single time.

The SKILL.md references it simply:

## Validation

Before publishing, validate the skill:

```bash
bash scripts/validate-skill.sh SKILL.md
```

Script source code never enters the context window — only the output does. This keeps the agent's reasoning space clean while ensuring exact, repeatable behavior for mechanical tasks.

Principle 5: Design for the Edge Case You Haven't Seen

Good skills handle unexpected inputs gracefully. The failure mode of most skills is not that they break on the happy path — it is that they encounter a slightly different project structure, a different phrasing, or a different technology stack and produce garbage.

The "theory of mind" approach:

Think about the 10 different ways a user might phrase the task. "Review this code," "check my PR," "audit the auth module," "is this safe to merge?" — a code review skill that only triggers on "review code" misses most real usage.

Think about the 5 different project types where this skill might run. A deployment skill that assumes Docker will fail in serverless projects. A testing skill that assumes Jest will fail in projects using Vitest. Write instructions that reference the project's actual tooling rather than hardcoding your preferred stack.
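One way to avoid hardcoding a stack is to have the skill detect it. As a rough sketch, a pre-flight script could guess the test runner from files actually present in the project; the heuristics below are illustrative and deliberately incomplete:

```shell
# detect_test_runner: guess the project's test runner from marker files
# in the directory given as $1. These checks are a starting point, not
# an exhaustive detection scheme.
detect_test_runner() {
  DIR="${1:-.}"
  if [ -f "$DIR/vitest.config.ts" ] || [ -f "$DIR/vitest.config.js" ]; then
    echo vitest
  elif grep -q '"jest"' "$DIR/package.json" 2>/dev/null; then
    echo jest
  elif [ -f "$DIR/pytest.ini" ] || grep -q 'pytest' "$DIR/pyproject.toml" 2>/dev/null; then
    echo pytest
  else
    echo unknown
  fi
}
```

A testing skill that calls something like this, and falls back to asking the user when it prints `unknown`, survives projects its author never saw.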

Write instructions that generalize, not instructions that overfit. Jesse Vincent's Superpowers framework demonstrates this well. The brainstorming skill does not say "ask exactly these 5 questions." It explains the principle — explore multiple approaches before committing — and lets the agent adapt the principle to the specific context. This is why Superpowers works across wildly different project types while rigid template skills break on anything outside the author's specific setup.

The generalization test: could a developer working on a completely different project type still benefit from this skill's instructions? If no, you have overfit. Extract the principles and let the agent apply them contextually.

Principle 6: Keep It Lean

Every instruction that does not pull its weight wastes context and confuses the model. The best skills are short. Not because short is inherently good, but because every line in a short skill is load-bearing.

Read transcripts, not just outputs. The most revealing debugging technique for skills is reading the full agent transcript — the reasoning trace, not just the final output. You will find:

  • Sections the agent reads but never applies (remove them)
  • Instructions the agent misinterprets because they are ambiguous (rewrite them)
  • Places where the agent spends reasoning tokens re-stating your instructions back to itself (your instructions are too verbose)

Remove sections that always get skipped. If your skill has an "Edge Cases" section that the agent never references in 10 test runs, it is dead weight. Either the edge cases are not triggering, or the agent handles them fine without the instructions.

Reframe ALL-CAPS rules as reasoning. If you are writing ALWAYS and NEVER in uppercase, stop. You are compensating for unclear reasoning with volume. "NEVER use default exports" is weaker than "Named exports enable tree-shaking and make refactoring safer because the import name is tied to the export site, not the consumer." The second version explains why, which means the agent applies it correctly even in edge cases the first version does not cover.

The Agent Skills Guide recommends keeping SKILL.md under 500 lines. That is a ceiling, not a target. Many effective skills are under 100 lines. If your skill exceeds 200 lines, audit it: is every section earning its context cost?

The Eval Loop: How to Actually Test a Skill

Writing a skill without testing it is like shipping code without running it. The Anthropic Skill Creator — updated in March 2026 with eval, improve, and benchmark modes — formalizes this into a methodology that any skill author can follow.

The Four-Step Process

Step 1: Write 2-3 test prompts. These should be realistic — phrased the way a real user would type, with realistic context. Not "test the skill" but "I just finished the auth module refactor. Review the changes on this branch and let me know if it's safe to merge."

Step 2: Run with-skill vs without-skill. Does the skill improve the output? This is the only question that matters. If the agent produces equivalent results without the skill, the skill is adding context cost without adding value.

Step 3: Review the full transcript. Not just the output — the reasoning trace. Did the agent load the skill? Did it follow the instructions? Where did it deviate? Where did it waste time?

Step 4: Iterate. Fix the skill based on what you found. Rerun. Review again. The Skill Creator automates this loop — it uses four composable sub-agents (executor, grader, comparator, analyzer) to run skills against eval prompts, grade outputs, perform blind A/B comparisons between skill versions, and surface patterns that aggregate stats might hide.
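Steps 1, 2, and 4 can be wired together with a small harness. This sketch loops over prompt files and captures a with-skill and a baseline transcript for each; the agent command and the `SKILLS_ENABLED` toggle are placeholders for however your agent is actually invoked, not a real CLI interface:

```shell
# run_ab: for every .txt prompt in $2, run the agent command $1 twice,
# once with skills enabled and once without, saving both outputs to $3.
# SKILLS_ENABLED is a stand-in for your agent's real enable/disable switch.
run_ab() {
  AGENT_CMD="$1"; PROMPT_DIR="$2"; OUT_DIR="$3"
  mkdir -p "$OUT_DIR"
  for p in "$PROMPT_DIR"/*.txt; do
    name=$(basename "$p" .txt)
    SKILLS_ENABLED=1 $AGENT_CMD "$(cat "$p")" > "$OUT_DIR/$name.with-skill.txt"
    SKILLS_ENABLED=0 $AGENT_CMD "$(cat "$p")" > "$OUT_DIR/$name.baseline.txt"
  done
}
```

Passing `echo` as the agent command is a quick way to check the harness itself before pointing it at a real session; diffing the paired transcripts afterward is Step 3.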

Description Optimization

The Skill Creator's description optimization deserves special attention. It uses a train/test split: 60% of your eval queries become the training set, 40% are held out for testing. The system evaluates your current description by running each training query 3 times and measuring the trigger rate. Then it proposes an improved description based on what failed, re-evaluates, and iterates.

This is the most systematic way to improve trigger accuracy. But even without the tool, the principle is the same: measure, analyze failures, improve, re-measure. Most skill authors write a description once and never touch it again. That is why most skills undertrigger.

The Multi-Pane Advantage

This iterative testing loop — edit skill, run prompt, check transcript, edit again — is where having multiple terminal panes side by side pays off. The SKILL.md file in one pane, the agent session in another, the transcript or output in a third. No tab switching, no losing context, just a tight feedback loop.


Anti-Patterns: Skills That Look Good but Fail

The "Kitchen Sink" Skill

Tries to handle code review, deployment, testing, documentation, and PR descriptions in a single skill. The description becomes so broad it either triggers on everything or nothing. The body is 800+ lines and the agent drowns in instructions. Split it. One skill, one verb.

The "Copy-Paste" Skill

A raw prompt dump with no structure — no clear trigger conditions, no process steps, just paragraphs of text. The author copied their ChatGPT prompt into a SKILL.md and called it a skill. The agent cannot parse intent from an unstructured wall of text. Structure matters: sections, headers, ordered steps.

The "Rigid Template" Skill

Works perfectly when the user phrases their request exactly as the author expected, and breaks on any deviation. "When the user says 'create component X', do the following..." fails the moment someone says "build a new widget for the dashboard." Write for intent, not for exact phrasing.

The "Undocumented" Skill

Has no clear trigger conditions in the description or body. Relies on the user knowing the slash command exists. Without explicit "When to Activate" and "When NOT to Activate" sections, auto-triggering is a coin flip. Skills that only work via manual invocation are skills that are effectively invisible to 90% of users.
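A minimal version of those sections might read as follows, here for a hypothetical deployment skill (the conditions are illustrative):

```markdown
## When to Activate
- The user asks to deploy, release, ship, or push to staging.
- Deployment-related files (Dockerfile, CI workflows) were modified.

## When NOT to Activate
- The user is starting a local dev server.
- The change only touches documentation.
```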

The "Security Hole" Skill

Executes shell commands without validation, downloads files from hardcoded URLs, or instructs the agent to read sensitive files and include their contents in outputs. The Snyk ToxicSkills study found 534 skills with critical security issues out of 3,984 scanned. If your skill runs commands, validate inputs. If it reads files, scope the paths. If it makes network requests, make the URLs configurable and documented.
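Scoping a file path costs only a few lines when a skill's helper script takes one from the conversation. A sketch, assuming a POSIX shell and a known project root; the function name is illustrative, and real code should also consider symlinks:

```shell
# safe_read: print the file at $2 only if it resolves inside the
# directory given as $1; refuse anything outside that root.
safe_read() {
  BASE=$(cd "$1" && pwd)
  TARGET="$(cd "$(dirname "$2")" 2>/dev/null && pwd)/$(basename "$2")"
  case "$TARGET" in
    "$BASE"/*) cat "$TARGET" ;;
    *) echo "Refusing to read outside $BASE" >&2; return 1 ;;
  esac
}

# Example: only paths under the project root are readable.
mkdir -p /tmp/demo-proj && printf 'ok' > /tmp/demo-proj/notes.txt
safe_read /tmp/demo-proj /tmp/demo-proj/notes.txt   # prints ok
```

A request for `/etc/passwd` through the same function is refused with a nonzero exit, so the agent sees a clear error instead of sensitive file contents.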

A Checklist Before You Ship

Before publishing or sharing a skill, run through this quality gate:

  • Description is specific, includes trigger conditions, and is slightly "pushy"
  • Body explains WHY behind rules, not just WHAT to do
  • SKILL.md is under 500 lines; detailed references live in separate files
  • Deterministic tasks use scripts, not natural-language instructions
  • Tested with 2-3 realistic prompts — outputs compared against no-skill baseline
  • Trigger accuracy tested: at least 10 should-trigger and 10 should-not-trigger queries
  • No hardcoded paths, API keys, or environment-specific values
  • Works across Claude Code and at least one other agent (Codex CLI or Copilot)
  • Has clear "When to Activate" and "When NOT to Activate" sections
  • Every section in the body is load-bearing — no dead weight

Building good skills is the same discipline as writing good code: ruthless about what stays in, honest about what works, and always tested against reality. The Agent Skills Guide covers the full ecosystem — marketplaces, security, cross-agent compatibility. This article covered the craft. Both matter.

#agent-skills #skill-md #best-practices #design-principles #claude-code
