Tuning & Troubleshooting
The field guide to why your agent underperforms — and the specific fix for each failure mode.
Agents fail in patterns. Once you recognize the pattern, the fix is usually mechanical. This page is the lookup table.
Quick diagnosis
| Symptom | Likely cause | Fix |
|---|---|---|
| Invents APIs that don't exist | Stale training data | Context7 or paste current docs |
| Ignores your conventions | Rules missing or buried | Tighten AGENTS.md |
| Great start, dumb finish | Context window exhaustion | Smaller tasks, scratch files, fresh sessions |
| Declares victory on broken code | No verification step | Phase gates: "run tests before done" |
| Thrashes — edits, reverts, re-edits | Missing context or wrong mental model | Stop; make it write its understanding first |
| Wrong scope — refactors the world | No boundaries stated | Explicit scope: "only touch files in src/auth/" |
| Same mistake every session | Session learning evaporates | Move the correction into AGENTS.md |
| Slow and expensive | Bloated context, wrong model size | Trim servers/rules; route simple tasks to cheaper models |
The rest of this page expands the five failure modes that cause 90% of the pain.
1. Hallucinated APIs
The model writes code against the library version in its training data, not yours.
Fixes, best first:
- Add Context7 — injects current, version-specific docs on demand
- Keep a
docs/folder with the API docs you depend on, referenced from AGENTS.md - State exact versions in AGENTS.md ("Next.js 16 — App Router,
proxy.tsnotmiddleware.ts") - Make the agent verify imports compile (typecheck gate) before declaring done
2. Rules that don't stick
You told it twice, it still uses any.
The escalation ladder — move up one rung each time a rule fails:
- Rule exists in AGENTS.md, phrased as a hard prohibition ("Never use
any; useunknown+ type guards") - Rule has a concrete example of the right pattern next to it
- Rule is enforced mechanically — lint rule, type setting, CI check. Agents respect failing commands far more than prose
If your AGENTS.md is long, that's the problem itself: every rule dilutes attention on the others. Cut it to the rules that get violated. (Best practices.)
3. Context exhaustion (the "it got dumber" problem)
Every session has a budget: instructions + files read + tool outputs + conversation. When it fills, runners compact older content — and details silently vanish. Symptoms: forgetting earlier decisions, re-reading the same files, contradicting its own plan.
Fixes:
- Externalize state: plans and decisions go in
scratch/files, not just conversation — scratchpad discipline - One task per session when tasks are independent; fresh context beats long memory
- Subagents for deep dives (Claude Code): exploration burns context in the child, returns only the summary — example
- Trim standing costs: every MCP server's tool list and every AGENTS.md line is loaded on every request — uninstall what you don't use
- Restart deliberately: when a session degrades, have the agent write a handoff file, then start fresh and read it back
4. No verification loop
An agent that can't check its work converges on plausible-looking output instead of working output.
The fix is structural, not prompting harder:
- Exact test/lint commands in AGENTS.md (the agent will actually run them)
- Phase gates in workflows: "Do not report done until
pnpm testpasses" — workflow guide - The self-critique loop for non-code artifacts
- For UI work, give it eyes: Playwright MCP so it can load the page and see the result
5. Cost and speed tuning
Token spend follows context size and iteration count, so the fixes above also cut cost. Beyond those:
- Match model to task — frontier models for architecture and hard debugging; faster/cheaper models for boilerplate, renames, and lint fixes. Most runners let you switch per-session.
- Kill failed approaches fast — three failed attempts at the same fix means the agent lacks context, and attempts four through ten will fail too. Stop and diagnose (debug-analyzer).
- Batch related small tasks into one session; split unrelated ones.
- Measure before optimizing — Langfuse or Arize Phoenix show where tokens actually go.
The meta-loop: tune like an engineer
Every fix above follows the same cycle:
- Capture the failure — save the transcript or output
- Classify it against the table at the top
- Apply the smallest fix — one new rule, one removed server, one phase gate
- Re-run the same task and compare
One change at a time, evidence over vibes. The prompt-engineer skill automates exactly this loop for your prompts and skills — and improvements compound: a tuned AGENTS.md + a verification gate + Context7 routinely turns a frustrating agent into a dependable one.