How we cut our AI agent token usage by ~80% (11 fixes + the audit prompts)

One weekend. Eleven fixes. Approximately 80% reduction in tokens consumed per Claude Code session.

The savings weren't from switching models or cutting agent runs. They came from removing invisible overhead that had accumulated across months of building: context files too large for the model to attend to, tools that loaded thousands of tokens before any work started, persona prompts that measurably hurt benchmark performance, and cron jobs that fired into empty sessions.

This post documents all eleven fixes with the reasoning behind each and a paste-ready audit prompt you can run against your own setup. The fixes are sorted roughly by impact.

Key takeaways

  • Frontier models attend reliably to ~150–200 instructions. A 5,000-token CLAUDE.md costs 5,000 tokens per turn and most of it isn't read.
  • Each MCP tool definition consumes 2–5K tokens before any work starts. At 20 tools, that's 40–100K tokens of context overhead per session.
  • Persona conditioning on sub-agents costs up to 26.2% on agentic benchmarks (arXiv 2602.12285). The drop is real and measurable.
  • Prompt caching gives a 90% discount on stable prefixes. Almost nobody enables it.

Fix 1: Trim CLAUDE.md to ≤600 tokens

What we found: Our root CLAUDE.md had grown to ~5,000 tokens. The lessons.md it imported ran 226 lines. Every turn loaded all of it.

Why it matters: HumanLayer benchmarks show frontier models attend reliably to roughly 150–200 instructions. A 5K-token CLAUDE.md costs 5K tokens per turn and most of what's past line 200 is invisible to the model anyway. You're paying to load instructions the agent isn't reading.

What we did: Trimmed lessons.md from 226 lines to 56 (75% reduction). Promoted 15 evergreen topic-block patterns to docs/engineering-doctrine.md — loaded on demand, not every turn. Nested CLAUDE.md files for subdirectories were audited and trimmed to load-bearing content only.

Result: Root CLAUDE.md context under 1,000 tokens loaded per session.

Audit prompt:

# Token-count every CLAUDE.md in your repo
find . -name "CLAUDE.md" -not -path "./node_modules/*" -exec wc -w {} \;

# Flag any file over 600 words (rough proxy for 800 tokens)
find . -name "CLAUDE.md" -not -path "./node_modules/*" | while read f; do
  words=$(wc -w < "$f")
  if [ "$words" -gt 600 ]; then
    echo "OVERSIZED ($words words): $f"
  fi
done

Paste this into your session and ask the model: "For each oversized CLAUDE.md, identify which sections are genuinely load-bearing — meaning: what specific failure mode happens without them? Anything you can't name a failure mode for is a deletion or migration candidate."


Fix 2: Drop persona conditioning from sub-agents

What we found: Each of our five specialist agents (researcher, designer, engineer, tester, GTM) had been assigned F1 driver personas — names, character traits, behavioral specs. The intent was to make the agents more distinct and memorable. The effect was measurable performance degradation.

Why it matters: arXiv 2602.12285 documents that persona conditioning costs up to 26.2% on agentic benchmarks. MMLU drops from 71.6% baseline to 66.3% with a long persona spec. The model is spending attention budget on character maintenance instead of task execution.

What we did: Replaced the persona conditioning with a single precision-and-brevity directive applied uniformly across all agents: "You value precision and information density. Default to brevity." Names retained at the UI layer (Telegram bot display names) where they're useful for human readability. Removed from the system prompt where they burn context.

Audit prompt:

# Check each agent spec for persona sections
grep -n -i "persona\|character\|personality\|you are.*named\|your name is" .claude/agents/*.md

# Check for long persona blocks (more than 3 consecutive persona-flavored lines)
for f in .claude/agents/*.md; do
  echo "=== $f ==="; grep -A 5 -i "you are\|your role\|persona" "$f" | head -20
done

Ask the model: "For each agent spec, identify any section that describes personality, character traits, or behavioral style beyond task-relevant instructions. Flag anything that would score on a 'tell me about yourself' prompt. Those sections are candidates for removal."


Fix 3: Cap MCP tools at 6 per agent

What we found: MCP tool definitions are loaded into context at session start. Each definition consumes 2–5K tokens. With 10 tools active, that's 20–50K tokens of overhead before any work starts — before reading a file, before running a grep.

Why it matters: Scott Spence's case study on MCP tool sprawl documents the compounding cost. At 20 tools (easy to reach if you're adding tools for every new integration), you're at 40–100K tokens of pre-work overhead. In a 200K context window, that's 20–50% consumed before turn one.

What we did: Audited each tool in .mcp.json for genuine load-bearing status. The criterion: is there a workflow that requires this tool that can't be done via direct API call? Held the line at ≤6 per agent. Currently at 4.

Also checked for phantom token inflation: Claude Code v2.1.100+ has a known issue where ~20K cache_creation_input_tokens are added silently per session. Unused OAuth connectors add another ~22K. We disconnected unused connectors.

Audit prompt:

# List active MCP tools
claude mcp list

# Check for phantom tokens — run this in an active session
# In Claude Code: /status
# Look for cache_creation_input_tokens > 20K

# Count tools per agent spec
for f in .claude/agents/*.md; do
  count=$(grep -c "mcp__" "$f" 2>/dev/null || echo 0)
  echo "$f: $count tool refs"
done

Ask the model: "For each MCP tool in .mcp.json, answer: (1) when was it last actively used in a real workflow, (2) is there a direct API call that replaces it, (3) what is the cost of removing it? Flag any tool where the answer to #2 is yes."


Fix 4: Assign models at design time, not via dynamic dispatch

What we found: Agents were running on inherited model selection — defaulting to whatever the session model was, or relying on the orchestrator to route them. This meant triage-grade tasks were running on expensive models, and the cost scaled with usage.

Why it matters: Haiku is ~15× cheaper per token than Opus. A tester running mechanical checks (does this file exist, does the function have a caller) on Opus is wasting 14× the budget on a task that doesn't need reasoning depth. Static model routing makes costs predictable and lets you right-size every agent.

What we did: Explicit model: lines in every agent spec frontmatter. Researcher, designer, engineer, GTM → Sonnet 4.6. Tester → Haiku 4.5 (mechanical checks). Engineer gets an Opus override for hard architectural problems, dispatched explicitly when needed — not as default.

Audit prompt:

# Check current model assignments
grep -A 1 "^model:" .claude/agents/*.md

# Flag any agent without an explicit model assignment
for f in .claude/agents/*.md; do
  if ! grep -q "^model:" "$f"; then
    echo "NO MODEL ASSIGNED: $f"
  fi
done

Ask the model: "For each agent spec, given the tasks it typically runs, is the current model assignment appropriate? Specifically: are any agents running on a model with more capability than their most demanding task requires?"


Fix 5: Cache stable system prompts

What we found: None of our agent system prompts had prompt caching enabled. Every turn re-ingested the full prompt at full input cost.

Why it matters: Anthropic's prompt caching gives a 90% discount on cached prefixes. Cache write costs 1.25× base (5-min TTL) or 2× (1-hour). Cache read costs 0.1× base. Minimum cacheable prefix is 4,096 tokens. System prompts that don't change between turns are the highest-value caching target. Almost nobody enables this explicitly.

The catch: Caching a bloated CLAUDE.md is expensive — you pay 1.25× to write a large cache entry on every cache miss. Get Fix 1 done first. Then add cache_control markers.

Audit prompt:

# Check if any agent specs or system prompts use cache_control
grep -rn "cache_control\|cache-control\|cacheable" .claude/agents/ CLAUDE.md

# In an active Claude session, check your cache hit rate
# /status → look for cache_read_input_tokens vs cache_creation_input_tokens
# High creation / low read = caching not working or TTL expiring between sessions

Ask the model: "Which parts of our system prompts are stable across turns — meaning: the content doesn't change between one turn and the next? Those are cache candidates. Which parts are dynamic (injected per-turn context, file contents, search results)? Those stay uncached."


Fix 6: Consolidate cron to a 3-layer pattern

What we found: Cron-style behavior was scattered across four mechanisms: a Claude Routines daily brief, session-start hooks, scripts in /scripts/, and a proposed weekly maintenance cron. The duplication meant similar work was running multiple times, and behavior was hard to predict.

Why it matters: A Routine that fires on a fixed schedule whether or not a session is open wastes budget on runs nobody benefits from. A session-start hook that runs every time a session opens is nearly free — it only fires when it's actually useful. The architectural choice has direct cost implications.

What we did: Settled on a 3-layer pattern: (1) session-start hooks for briefs and state loads — only run when a session is alive; (2) launchd/cron scripts for genuinely time-based work (nightly pruning, monthly audits); (3) ad-hoc PM triggers for everything else. Killed the Routines-based daily brief.

Audit prompt:

# Find all cron-style behavior in your setup
find . -name "*.sh" -not -path "./node_modules/*" | xargs grep -l "cron\|schedule\|daily\|weekly" 2>/dev/null
find . -name "*.md" | xargs grep -l "Routine\|cron\|launchd" 2>/dev/null

# Check for duplicate brief/summary generation
grep -rn "brief\|summary\|daily" .claude/ --include="*.md" | grep -v "^Binary"

Ask the model: "List every cron or scheduled behavior in this setup. For each: does it need to run on a fixed time schedule, or could it fire on-demand when a session starts? Routines that fire when no session is open are pure waste."


Fix 7: Use a session-start hook for the PM brief, not a Routine

This is an extension of Fix 6, but it deserves its own entry because the failure mode is specific.

What we found: The proposed PM daily brief was designed as a Claude Routine running Mon–Sat at 8am. The intent was to surface context before opening Claude Code.

Why it matters: A Routine fires at 8am whether you open Claude Code or not. A session-start hook fires only when you actually open a session. For a brief that's only valuable when you're about to work, the Routine is burning budget on ~30% of fires that happen with no one reading. The session-start hook is free in comparison.

What we did: Moved the PM brief to a session-start hook (hooks/session-start-brief.sh). Fires on session open. Reads tasks/todo.md, surfaces the top 3 priorities and any open PRs. Outputs to Telegram. Budget: ~500 tokens per fire, zero wasted fires.

Audit prompt:

# Check what's in your Claude Routines
# (In Claude.ai settings → Routines, or check any configured routine files)
ls .claude/routines/ 2>/dev/null || echo "No routines directory"

# Check what's in your hooks
ls .claude/hooks/ 2>/dev/null
cat .claude/hooks/*.sh 2>/dev/null | head -50

Ask the model: "For each configured Routine, answer: does this Routine provide value if it fires when no human is present? If the answer is no, it should be a session-start hook instead."


Fix 8: Tool-level deny rules beat spec wording

What we found: Agent specs had instructions like "do not write to docs/" and "do not modify production configuration files." The agents wrote to docs/ anyway. Twice.

Why it matters: Spec wording is advisory. The model is doing its best to follow instructions, but under ambiguity or task pressure, instructions in a system prompt lose to the model's judgment about what's needed. Tool-level deny rules in .claude/settings.local.json are enforced. The agent cannot make the denied call regardless of how the task is framed.

What we did: Added specific deny rules to .claude/settings.local.json for high-risk paths. The principle: any file where a bad write causes an irreversible state change (CLAUDE.md, agent specs, migration files, production config) gets a tool-level deny rule, not just a spec instruction.

Audit prompt:

# Check current deny rules
cat .claude/settings.local.json | grep -A 2 "deny\|block" 2>/dev/null

# Check agent specs for instruction-only guardrails (candidates for promotion to deny rules)
grep -n "do not\|don't\|never write\|avoid modifying" .claude/agents/*.md

# Find files that if accidentally overwritten would cause irreversible damage
# (CLAUDE.md, migrations, production secrets, agent specs)
find . -name "*.json" -path ".claude/*" -o -name "CLAUDE.md" -o -path "*/migrations/*" | head -20

Ask the model: "Look at the deny rules in .claude/settings.local.json and compare them against the 'do not' instructions in agent specs. Which spec instructions should be promoted to tool-level deny rules? The promotion criterion: would a bad write to this path cause damage that can't be undone in under 5 minutes?"


Fix 9: Keep workstream PM single, time-box it

What we found: As work split across multiple tracks (frontend, backend, GTM, research), the question came up: should we split into multiple PM agents, one per workstream?

Why it matters: Multiple PMs need cross-pod sync. Who reconciles conflicting priorities when the frontend PM and GTM PM both want designer time? More critically: each PM needs its own CLAUDE.md, its own session context. Three PMs = triple the context overhead. And if the founder is still the bottleneck regardless of how many PM agents exist, the split adds overhead without removing the constraint.

What we did: Stayed single-PM with stronger time-boxing discipline. Two hours on the highest-leverage workstream, then hard context switch. The PM daily brief (Fix 7) surfaces the priority order at session start. The bottleneck in most solo-founder setups is founder review cycles, not PM bandwidth — splitting PMs doesn't fix that.

Audit prompt:

# In the last 4 weeks of sessions, how often was the PM context window a bottleneck?
# (Check session logs or git log for "context limit" / "session close" patterns)
git log --since="4 weeks ago" --oneline | grep -i "context\|limit\|restart" | wc -l

# How many distinct workstreams are active?
ls tasks/prd/*.md 2>/dev/null | wc -l

Ask the model: "Given our current workstreams, name 3 specific instances in the last 4 weeks where a separate workstream PM would have unblocked work faster. If you can't name 3, the split is premature. If you can, describe the coordination mechanism that would have prevented priority conflicts between the pods."


Fix 10: Codify session protocols as slash commands, not free-text

What we found: Session start and end behaviors were documented in CLAUDE.md as free-text instructions: "at the start of every session, do X." In long sessions, these instructions were sometimes followed inconsistently — the model was following them from memory at turn 50 of a 100-turn session, with variable accuracy.

Why it matters: CLAUDE.md instructions are context that loads every turn. Slash commands (/start, /end) are explicit invocations — they only run when called, they're reliable when called, and they don't bloat per-turn context with protocol text. Moving session protocol to slash commands cuts per-turn overhead and makes the behavior more consistent.

What we did: Created .claude/commands/start.md and .claude/commands/end.md with explicit step-by-step protocols. CLAUDE.md now contains one-line pointers to those files. The command specs are the single source of truth; CLAUDE.md doesn't duplicate them.

Audit prompt:

# Check if session protocol lives in CLAUDE.md (bloat) or slash commands (efficient)
grep -n "session start\|session end\|at the start\|begin every" CLAUDE.md | head -10

# Check if slash commands exist
ls .claude/commands/ 2>/dev/null || echo "No commands directory"

# Measure how much of CLAUDE.md is protocol vs doctrine
wc -l CLAUDE.md
grep -c "session\|start\|end\|protocol" CLAUDE.md

Ask the model: "Read CLAUDE.md and identify every instruction that describes a repeated process (something that runs at session start, session end, or on a trigger). For each: would this work better as a slash command that's explicitly invoked, rather than an instruction that runs from memory? The criterion: if missing it once costs more than 10 minutes to recover from, it belongs in a slash command."


Fix 11: Chrome profile isolation for high-risk sessions

What we found: ClaudeBleed (CVE disclosed May 2026) allows other Chrome extensions to hijack the Claude in Chrome extension. Our setup had Claude in Chrome as the active extension in the same profile used for authenticated sessions (Supabase dashboard, Vercel, GitHub).

Why it matters: Simon Willison's documentation of the "lethal trifecta" — (private data) + (untrusted content) + (exfiltration vector) — maps directly to this setup. A malicious web page opened in the same Chrome profile as an authenticated Claude session is a prompt injection attack surface. The attack vector is open as of this writing.

What we did: Moved high-risk browsing (authenticated dashboards, untrusted content ingestion) to a separate Chrome profile without the Claude extension. For workflows that genuinely require Claude + browser access, the preference is WebFetch from within the agent session or Playwright MCP — both have a different security posture than the Chrome extension.

Audit prompt:

# Check what MCPs are wired that could expose auth'd sessions
cat .mcp.json | grep -i "chrome\|browser\|playwright"

# Check agent specs for browser access patterns that might be in the wrong context
grep -rn "chrome\|browser\|selenium\|playwright" .claude/agents/*.md

Ask the model: "For any workflow that uses browser automation or Claude in Chrome: does that workflow involve authenticated sessions (production dashboards, OAuth flows, admin UIs)? If yes, is there an alternative path (WebFetch, API call, Playwright in isolation) that doesn't require an authenticated browser session?"


The combined effect

These eleven fixes don't require changing your agent architecture, switching providers, or rebuilding workflows. They're mostly removals: removing instructions the model wasn't reading, removing tools that were loaded but not needed, removing persona conditioning that was hurting performance, removing cron patterns that fired without anyone listening.

The 80% reduction is real. The math:

  • CLAUDE.md trim (Fix 1): ~4,000 tokens removed per turn from context overhead
  • MCP audit (Fix 3): phantom token inflation resolved (~20K tokens recovered)
  • Model routing (Fix 4): Haiku at $0.25/MTok vs Opus at $15/MTok for mechanical tasks = ~15× cost reduction on tester runs
  • Persona removal (Fix 2): 26.2% performance improvement on agentic benchmarks means more tasks completed per token budget

The audit prompts above are designed to be run in your current session without any setup. Each takes under 2 minutes. Start with Fix 1 — the CLAUDE.md token count. That single number tells you whether you have a context bloat problem worth addressing.


Frequently asked questions

How do I know if I have phantom token inflation?

Run /status in an active Claude Code session. Look for cache_creation_input_tokens. If it's consistently above 20K with no clear source, you have phantom inflation. Common causes: unused OAuth connectors on claude.ai (disconnect them), or Claude Code versions around v2.1.100–v2.1.120 (update to the latest).

Do these fixes apply outside of Claude Code?

The model-routing fix (Fix 4), persona fix (Fix 2), and prompt caching fix (Fix 5) apply to any agentic system — LangChain, CrewAI, AutoGen, custom orchestrators. The CLAUDE.md and slash-command fixes are Claude Code–specific, but the principle (keep per-turn context minimal, make protocols explicit invocations) applies everywhere.

How do you maintain the ≤600 token CLAUDE.md without losing important context?

The key is distinguishing between context that needs to load every turn vs context that loads when work touches the relevant directory. Nested CLAUDE.md files in subdirectories load on demand. Evergreen reference docs in docs/ load when explicitly fetched. Only put in the root CLAUDE.md what the model needs on turn one, before any files are read. Everything else belongs in a nested file or reference doc.

Is 80% reduction sustainable as the codebase grows?

The primary risk is context accumulation: new files added to CLAUDE.md over time, new MCP tools added without removing others, new agent specs that grow over months. The mitigation is a monthly token-count cron that flags any CLAUDE.md over 600 words and any agent spec over 300 words. We run this on the 1st of each month. The fixes above set the baseline; the cron keeps it from drifting.


Run your own audit

Lume's AI Visibility Audit is a different kind of check — it tells you whether your business shows up correctly in ChatGPT, Perplexity, and Gemini citations, not just Google. But the structural pattern is the same: invisible overhead accumulating silently, caught only by a deliberate pass.

30 seconds. No login.

Run the free AI Visibility Audit →