Tuning & Troubleshooting

The field guide to why your agent underperforms — and the specific fix for each failure mode.

Agents fail in patterns. Once you recognize the pattern, the fix is usually mechanical. This page is the lookup table.

Quick diagnosis

SymptomLikely causeFix
Invents APIs that don't existStale training dataContext7 or paste current docs
Ignores your conventionsRules missing or buriedTighten AGENTS.md
Great start, dumb finishContext window exhaustionSmaller tasks, scratch files, fresh sessions
Declares victory on broken codeNo verification stepPhase gates: "run tests before done"
Thrashes — edits, reverts, re-editsMissing context or wrong mental modelStop; make it write its understanding first
Wrong scope — refactors the worldNo boundaries statedExplicit scope: "only touch files in src/auth/"
Same mistake every sessionSession learning evaporatesMove the correction into AGENTS.md
Slow and expensiveBloated context, wrong model sizeTrim servers/rules; route simple tasks to cheaper models

The rest of this page expands the five failure modes that cause 90% of the pain.

1. Hallucinated APIs

The model writes code against the library version in its training data, not yours.

Fixes, best first:

  • Add Context7 — injects current, version-specific docs on demand
  • Keep a docs/ folder with the API docs you depend on, referenced from AGENTS.md
  • State exact versions in AGENTS.md ("Next.js 16 — App Router, proxy.ts not middleware.ts")
  • Make the agent verify imports compile (typecheck gate) before declaring done

2. Rules that don't stick

You told it twice, it still uses any.

The escalation ladder — move up one rung each time a rule fails:

  1. Rule exists in AGENTS.md, phrased as a hard prohibition ("Never use any; use unknown + type guards")
  2. Rule has a concrete example of the right pattern next to it
  3. Rule is enforced mechanically — lint rule, type setting, CI check. Agents respect failing commands far more than prose

If your AGENTS.md is long, that's the problem itself: every rule dilutes attention on the others. Cut it to the rules that get violated. (Best practices.)

3. Context exhaustion (the "it got dumber" problem)

Every session has a budget: instructions + files read + tool outputs + conversation. When it fills, runners compact older content — and details silently vanish. Symptoms: forgetting earlier decisions, re-reading the same files, contradicting its own plan.

Fixes:

  • Externalize state: plans and decisions go in scratch/ files, not just conversation — scratchpad discipline
  • One task per session when tasks are independent; fresh context beats long memory
  • Subagents for deep dives (Claude Code): exploration burns context in the child, returns only the summary — example
  • Trim standing costs: every MCP server's tool list and every AGENTS.md line is loaded on every request — uninstall what you don't use
  • Restart deliberately: when a session degrades, have the agent write a handoff file, then start fresh and read it back

4. No verification loop

An agent that can't check its work converges on plausible-looking output instead of working output.

The fix is structural, not prompting harder:

  • Exact test/lint commands in AGENTS.md (the agent will actually run them)
  • Phase gates in workflows: "Do not report done until pnpm test passes" — workflow guide
  • The self-critique loop for non-code artifacts
  • For UI work, give it eyes: Playwright MCP so it can load the page and see the result

5. Cost and speed tuning

Token spend follows context size and iteration count, so the fixes above also cut cost. Beyond those:

  • Match model to task — frontier models for architecture and hard debugging; faster/cheaper models for boilerplate, renames, and lint fixes. Most runners let you switch per-session.
  • Kill failed approaches fast — three failed attempts at the same fix means the agent lacks context, and attempts four through ten will fail too. Stop and diagnose (debug-analyzer).
  • Batch related small tasks into one session; split unrelated ones.
  • Measure before optimizingLangfuse or Arize Phoenix show where tokens actually go.

The meta-loop: tune like an engineer

Every fix above follows the same cycle:

  1. Capture the failure — save the transcript or output
  2. Classify it against the table at the top
  3. Apply the smallest fix — one new rule, one removed server, one phase gate
  4. Re-run the same task and compare

One change at a time, evidence over vibes. The prompt-engineer skill automates exactly this loop for your prompts and skills — and improvements compound: a tuned AGENTS.md + a verification gate + Context7 routinely turns a frustrating agent into a dependable one.