Tuning & Troubleshooting

The field guide to why your agent underperforms — and the specific fix for each failure mode.

Agents fail in patterns. Once you recognize the pattern, the fix is usually mechanical. This page is the lookup table.

Quick diagnosis

Symptom	Likely cause	Fix
Invents APIs that don't exist	Stale training data	Context7 or paste current docs
Ignores your conventions	Rules missing or buried	Tighten AGENTS.md
Great start, dumb finish	Context window exhaustion	Smaller tasks, scratch files, fresh sessions
Declares victory on broken code	No verification step	Phase gates: "run tests before done"
Thrashes — edits, reverts, re-edits	Missing context or wrong mental model	Stop; make it write its understanding first
Wrong scope — refactors the world	No boundaries stated	Explicit scope: "only touch files in src/auth/"
Same mistake every session	Session learning evaporates	Move the correction into AGENTS.md
Slow and expensive	Bloated context, wrong model size	Trim servers/rules; route simple tasks to cheaper models

The rest of this page expands the five failure modes that cause 90% of the pain.

1. Hallucinated APIs

The model writes code against the library version in its training data, not yours.

Fixes, best first:

Add Context7 — injects current, version-specific docs on demand
Keep a docs/ folder with the API docs you depend on, referenced from AGENTS.md
State exact versions in AGENTS.md ("Next.js 16 — App Router, proxy.ts not middleware.ts")
Make the agent verify imports compile (typecheck gate) before declaring done

2. Rules that don't stick

You told it twice, it still uses any.

The escalation ladder — move up one rung each time a rule fails:

Rule exists in AGENTS.md, phrased as a hard prohibition ("Never use any; use unknown + type guards")
Rule has a concrete example of the right pattern next to it
Rule is enforced mechanically — lint rule, type setting, CI check. Agents respect failing commands far more than prose

If your AGENTS.md is long, that's the problem itself: every rule dilutes attention on the others. Cut it to the rules that get violated. (Best practices.)

3. Context exhaustion (the "it got dumber" problem)

Every session has a budget: instructions + files read + tool outputs + conversation. When it fills, runners compact older content — and details silently vanish. Symptoms: forgetting earlier decisions, re-reading the same files, contradicting its own plan.

Fixes:

Externalize state: plans and decisions go in scratch/ files, not just conversation — scratchpad discipline
One task per session when tasks are independent; fresh context beats long memory
Subagents for deep dives (Claude Code): exploration burns context in the child, returns only the summary — example
Trim standing costs: every MCP server's tool list and every AGENTS.md line is loaded on every request — uninstall what you don't use
Restart deliberately: when a session degrades, have the agent write a handoff file, then start fresh and read it back

4. No verification loop

An agent that can't check its work converges on plausible-looking output instead of working output.

The fix is structural, not prompting harder:

Exact test/lint commands in AGENTS.md (the agent will actually run them)
Phase gates in workflows: "Do not report done until pnpm test passes" — workflow guide
The self-critique loop for non-code artifacts
For UI work, give it eyes: Playwright MCP so it can load the page and see the result

5. Cost and speed tuning

Token spend follows context size and iteration count, so the fixes above also cut cost. Beyond those:

Match model to task — frontier models for architecture and hard debugging; faster/cheaper models for boilerplate, renames, and lint fixes. Most runners let you switch per-session.
Kill failed approaches fast — three failed attempts at the same fix means the agent lacks context, and attempts four through ten will fail too. Stop and diagnose (debug-analyzer).
Batch related small tasks into one session; split unrelated ones.
Measure before optimizing — Langfuse or Arize Phoenix show where tokens actually go.

The meta-loop: tune like an engineer

Every fix above follows the same cycle:

Capture the failure — save the transcript or output
Classify it against the table at the top
Apply the smallest fix — one new rule, one removed server, one phase gate
Re-run the same task and compare

One change at a time, evidence over vibes. The prompt-engineer skill automates exactly this loop for your prompts and skills — and improvements compound: a tuned AGENTS.md + a verification gate + Context7 routinely turns a frustrating agent into a dependable one.

← Multi-Agent Patterns Examples & Patterns →