10 Claude Code optimizations that cut token costs and doubled throughput
Most AI developer setups are expensive by default. Not because the tools are expensive — because the defaults were designed for ease, not efficiency. Context windows that load everything. Model tiers that default to flagship for every subtask. Permissions set to wide-open. No memory architecture, so the same ground gets re-covered in every session.
A production audit of a real Claude Code workflow surfaced 10 specific fixes. Combined, they reduced token spend by roughly half and meaningfully improved output consistency. None required new tooling. All of them are on-by-default problems that most teams haven't looked at yet.
Key Takeaways
- Default Claude Code setups auto-load far more context than most tasks need
- Model tier misrouting (using Opus for tasks Sonnet handles equally well) is often the single biggest cost lever
- A nested CLAUDE.md architecture reduces root-file auto-load by 60–80% in production
- Memory systems prevent agents from re-solving the same problems across sessions
- 5 of the 10 fixes below are structural (one-time changes); 5 are operational (judgment calls that compound)
The baseline problem
Claude Code auto-loads your root CLAUDE.md on every session. If that file is 400 lines covering your entire stack, product strategy, and agent protocols, that's dead weight on a session where the task is "refactor a single component."
The audit ran across a six-agent production setup: PM, engineer, designer, researcher, tester, GTM. Combined auto-load was approximately 13,400 tokens per session before any task-specific context. At 40–50 sessions per week, that's 500,000+ tokens of pure overhead — not model reasoning, just file-loading.
Here are the 10 fixes that came out of the audit.
1. Trim the root CLAUDE.md to an index
A root CLAUDE.md that tries to cover everything ends up covering nothing well and loading everything expensively. The fix: reduce the root file to a lean index (~100–150 lines) of pointers and absolute essentials. Move domain-specific guidance (engineering patterns, design system rules, content voice) into nested CLAUDE.md files in the corresponding directories.
Nested files load on demand — only when an agent's tool calls touch that directory. A session working on src/ pulls src/CLAUDE.md. A session working on content/ pulls content/CLAUDE.md. Sessions scoped to planning load neither.
This is Anthropic's own stated design intent for the nested CLAUDE.md pattern. Most teams don't implement it because the default is to append, not restructure.
Why it matters: In the audit, root-only CLAUDE.md loaded ~13,400 tokens per session. After moving domain-specific content to 4 nested files and trimming the root, sessions without full-stack scope loaded ~3,200 tokens. That is a 76% reduction on common sessions.
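To see where a setup stands before restructuring, a rough per-file estimate is usually enough. The sketch below assumes about 4 characters per token, which is a rule of thumb rather than the tokenizer's real count, and the helper names are hypothetical. The last function reproduces the audit's 13,400 to 3,200 arithmetic.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not an exact tokenizer count


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def claude_md_report(repo_root: str) -> dict[str, int]:
    """Map every CLAUDE.md under repo_root to its estimated token count."""
    root = Path(repo_root)
    return {
        str(path.relative_to(root)): estimate_tokens(path.read_text(encoding="utf-8"))
        for path in sorted(root.rglob("CLAUDE.md"))
    }


def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction going from `before` to `after` tokens."""
    return round(100 * (before - after) / before, 1)
```

Calling `claude_md_report(".")` from the repository root lists the root file and every nested file side by side, and `reduction_pct(13_400, 3_200)` gives the audit's 76.1.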
2. Route tasks to the right model tier
Opus for hard architectural problems. Sonnet for day-to-day coding and analysis. Haiku for simple, high-volume tasks (testing scaffolds, format conversions, boilerplate).
The default when people set up multi-agent systems is to run everything on Opus or Sonnet "just in case." The actual price of that default: Opus is roughly 15× more expensive per token than Haiku, and 3× more expensive than Sonnet, though the exact ratios shift between model generations, so check current pricing. Running a test coverage agent on Opus when it's pattern-matching against an existing test suite is pure waste.
Fix: Assign model tiers explicitly in each agent's spec file. Require a justification to bump an agent to Opus. In production, 3 of 6 agents (tester, researcher, GTM) ran correctly on Haiku or Sonnet; only the PM and engineer agents needed an Opus override, and both still default to Sonnet and reserve Opus for hard problems.
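One way to make the justification rule enforceable is to validate tier assignments in code. This is a minimal sketch, not the audited setup's actual spec format: the `AgentTier` class and its field names are hypothetical, and the Haiku versus Sonnet split for tester, researcher, and GTM is an assumption, since the audit only says they ran on one or the other. The designer agent is left out because its tier is not stated.

```python
from dataclasses import dataclass

TIERS = ("haiku", "sonnet", "opus")  # cheapest to most expensive


@dataclass(frozen=True)
class AgentTier:
    agent: str
    default_tier: str
    opus_override: bool = False
    justification: str = ""  # required whenever Opus is reachable

    def __post_init__(self) -> None:
        if self.default_tier not in TIERS:
            raise ValueError(f"{self.agent}: unknown tier {self.default_tier!r}")
        uses_opus = self.default_tier == "opus" or self.opus_override
        if uses_opus and not self.justification.strip():
            raise ValueError(f"{self.agent}: Opus access needs a written justification")


# Illustrative assignments; the per-agent Haiku/Sonnet choice is assumed.
ROUTING = [
    AgentTier("tester", "haiku"),
    AgentTier("researcher", "sonnet"),
    AgentTier("gtm", "haiku"),
    AgentTier("pm", "sonnet", opus_override=True,
              justification="cross-agent planning trade-offs"),
    AgentTier("engineer", "sonnet", opus_override=True,
              justification="hard architectural problems"),
]
```

With this shape, `AgentTier("tester", "opus")` raises immediately, so an unjustified bump fails at load time instead of showing up on the invoice.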
3. Add an explicit permission allow-list and audit it quarterly
Claude Code ships with wide-open defaults. The allow-list in .claude/settings.local.json is the mechanism for tightening that. Most teams set it once and forget it.
A permissions audit of a production setup found 5 wildcard rules that were strictly broader than other rules already in the file — functionally dead rules that added surface area without adding capability. 60 entries collapsed to 55. The fix is not dramatic; the discipline of running the audit is what matters.
Fix: Every quarter, run a grep-based audit against your allow-list. For each wildcard rule, ask: what specifically does this need to cover? If you can replace it with a more precise rule, do it.
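The overlap check can also be scripted. The sketch below treats each rule as a shell-style glob and reports pairs where one rule already covers another. That matching is an approximation chosen for illustration, not Claude Code's own rule semantics, and the sample entries are hypothetical.

```python
from fnmatch import fnmatchcase


def find_subsumed(rules: list[str]) -> list[tuple[str, str]]:
    """Return (narrow, broad) pairs where `broad` already covers `narrow`."""
    pairs = []
    for broad in rules:
        if "*" not in broad:
            continue  # only wildcard rules can cover other rules
        for narrow in rules:
            if narrow != broad and fnmatchcase(narrow, broad):
                pairs.append((narrow, broad))
    return pairs


# Hypothetical allow-list entries, for illustration only.
allow = [
    "Bash(npm run test:*)",
    "Bash(npm run test:unit)",
    "Bash(git status)",
    "Bash(git *)",
]
```

Each reported pair is a prompt for the audit question rather than an automatic deletion: either the narrow rule is redundant, or the wildcard is broader than the work requires and should be the one to go.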
4. Build a persistent memory architecture
Without memory, agents re-solve the same problems every session. A researcher agent that identifies a competitor's pricing model in one session will re-research that same competitor three sessions later unless the finding is persisted.
The pattern that works in production: a /agent-memory/ directory with typed memory files (user context, feedback, project state, reference pointers). Each file has a frontmatter slug and a one-line summary. A MEMORY.md index (max 200 lines) loads as a pointer layer — agents scan the index, then load specific memory files on demand.
What this prevents: "Founder preferences" memory (how Prashanth likes to review PRs, what register to use in Telegram posts, which escalations to surface) otherwise has to be re-established through friction at the start of every session.
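The index layer described above can be generated from the memory files themselves. This is a sketch under stated assumptions: frontmatter is simple `key: value` lines between `---` markers, the field names `slug` and `summary` are hypothetical, and the function takes a path-to-text mapping so it stays independent of the directory layout.

```python
MAX_INDEX_LINES = 200  # cap from the pattern described above


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the leading `---` markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_index(files: dict[str, str]) -> list[str]:
    """Build MEMORY.md pointer lines: one `slug: summary (path)` per memory file."""
    index = []
    for path, text in sorted(files.items()):
        meta = parse_frontmatter(text)
        if "slug" in meta and "summary" in meta:
            index.append(f"- {meta['slug']}: {meta['summary']} ({path})")
    if len(index) > MAX_INDEX_LINES:
        raise ValueError(f"index has {len(index)} lines; cap is {MAX_INDEX_LINES}")
    return index
```

Failing loudly at the cap is a deliberate choice here: a silently truncated index would hide memories from agents, which recreates the problem the index exists to solve.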
5. Scope agent worktrees so they don't collide
When two agents edit the same files in the same Git tree, you get merge conflicts and lost work. The worktree pattern (each agent spawn gets its own isolated worktree, created with git worktree add) eliminates this.
The pattern: PM operates on main. Each engineer spawn gets a fresh worktree from the feature branch. Agents working in overlapping domains get explicit file-scope constraints in their spawn prompt.
The failure mode this prevents: two agents opening PRs that both modify src/app/(marketing)/page.tsx — the second merge breaks the first, silently.
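A spawn helper for the worktree step might look like the following. The branch and directory naming scheme is hypothetical; the only real interface used is `git worktree add -b <new-branch> <path> <commit-ish>`.

```python
import subprocess


def worktree_command(agent: str, feature_branch: str,
                     base_dir: str = "../worktrees") -> list[str]:
    """Build `git worktree add -b <new-branch> <path> <start-point>` for one spawn."""
    new_branch = f"{feature_branch}-{agent}"
    # Flatten slashes so each worktree gets a single directory level.
    path = f"{base_dir}/{new_branch.replace('/', '-')}"
    return ["git", "worktree", "add", "-b", new_branch, path, feature_branch]


def create_worktree(agent: str, feature_branch: str) -> None:
    """Run the command inside the current repository; raises if git fails."""
    subprocess.run(worktree_command(agent, feature_branch), check=True)
```

Building the command separately from running it keeps the naming logic testable without touching a repository.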
6. Add a session start/end protocol
The single biggest driver of repeated work across sessions is state loss. When a session ends without externalizing the current state — what's open, what's blocked, what decision was made and why — the next session rebuilds context from scratch.
A /start and /end command pattern solves this. At session start: read the todo, pull the current roadmap state, check for any audit signals or researcher briefs. At session end: sweep open PRs, capture lessons, write the EOD summary to the shared team channel.
Cost of skipping this: Approximately 20–30 minutes of context reconstruction at the start of every session, multiplied by session frequency. At 5 sessions per week, that's 100–150 minutes, or roughly 1.7 to 2.5 hours, of wasted agent time per week.
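The state that /end externalizes and /start reads back can be as small as one JSON file. The sketch below is one possible shape, assuming a hypothetical handoff file with three fields; the commands themselves would call helpers like these.

```python
import json
from pathlib import Path


def end_session(path: str, open_items: list[str], blocked: list[str],
                decisions: dict[str, str]) -> None:
    """Write what the next session needs: open work, blockers, decisions and why."""
    state = {"open": open_items, "blocked": blocked, "decisions": decisions}
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def start_session(path: str) -> dict:
    """Load the previous handoff, or an empty one on the first session."""
    file = Path(path)
    if not file.exists():
        return {"open": [], "blocked": [], "decisions": {}}
    return json.loads(file.read_text(encoding="utf-8"))
```

Storing decisions as a mapping from decision to reason keeps the "why" next to the "what", which is the part most often lost between sessions.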
7. Separate agent concerns strictly
The temptation when building multi-agent systems is to give every agent access to everything. The cost of that temptation: agents start doing each other's jobs imprecisely, context windows bloat with irrelevant tool calls, and the system becomes non-auditable.
In production, each agent should have a spec file that defines exactly: role, tool whitelist, output format, boundaries (what it is NOT responsible for), and escalation path. The PM agent routes and orchestrates; it doesn't write code. The engineer writes code; it doesn't make product decisions.
The audit finding: agents with over-broad specs produced outputs that were 30–40% less targeted than agents with tight specs. Tight specs are also easier to debug when something goes wrong — you know exactly what the agent was supposed to do.
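The five spec elements listed above lend themselves to a mechanical completeness check. This sketch assumes specs are available as dictionaries with hypothetical key names; the sample PM spec and its tool names are illustrative.

```python
REQUIRED_FIELDS = ("role", "tools", "output_format", "boundaries", "escalation")


def missing_fields(spec: dict) -> list[str]:
    """List required spec fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not spec.get(field)]


def tool_allowed(spec: dict, tool: str) -> bool:
    """True only if the tool is on the agent's explicit whitelist."""
    return tool in spec.get("tools", [])


# Hypothetical spec for the PM agent.
pm_spec = {
    "role": "route and orchestrate work across agents",
    "tools": ["Read", "Grep"],
    "output_format": "task brief",
    "boundaries": ["does not write code"],
    "escalation": "founder",
}
```

Running `missing_fields` over every spec file before a session catches the common drift case, where a new agent is added with a role and tools but no stated boundaries.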
8. Add a Telegram (or equivalent) broadcast layer
When multiple agents are running in parallel, status visibility collapses without a shared broadcast channel. Email is too slow. Slack is noisy. A dedicated Telegram group per project, with one bot per agent, gives each specialist a voice while keeping the feed auditable.
The implementation overhead is low: BotFather for each bot token (5 minutes each), one poller process that routes @agent-name mentions, one poster that formats output per agent style.
What this enables: the founder/PM can approve or redirect any agent mid-task from a phone, without opening a terminal. The agent posts a checkpoint; the PM reacts with an emoji or a message; the agent continues or pivots.
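The routing half of the poller reduces to extracting mentions and matching them against known agents. The sketch below covers only that step and leaves out the Telegram transport; the function name and the agent set are hypothetical.

```python
import re

# Matches the @agent-name form described above.
MENTION = re.compile(r"@([A-Za-z0-9_-]+)")


def route_mentions(text: str, known_agents: set[str]) -> list[str]:
    """Return the known agents mentioned in a message, in order, without duplicates."""
    routed = []
    for name in MENTION.findall(text):
        agent = name.lower()
        if agent in known_agents and agent not in routed:
            routed.append(agent)
    return routed
```

Unknown mentions are dropped rather than raised, so a message that tags a human alongside an agent still reaches the agent.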
9. Store secrets in a gitignored JSON file, not environment variables
This is operational hygiene that almost every team gets wrong initially. Secrets in .zshrc or .env are in plaintext in files that may be synced, backed up, or accidentally committed. Secrets in environment variables are inherited by every subprocess, including ones that don't need them.
The fix: a gitignored bots.json (or equivalent named file) in a gitignored directory (.claude/secrets/). Each agent spec references the specific key it needs. Rotation means updating one file in one place.
The supply-chain implication: in a real npm supply-chain attack against a production codebase, secrets stored in environment variables are trivially exfiltrated by a malicious postinstall script. Secrets in a gitignored file are harder to reach if the file path isn't predictable. This is defense-in-depth, not a guarantee — but it costs nothing to implement.
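Reading a single key at the point of use, rather than exporting everything into the environment, could look like this. The flat key-to-value layout of bots.json and the key names are assumptions; the default path follows the directory named above.

```python
import json
from pathlib import Path


def load_secret(key: str, path: str = ".claude/secrets/bots.json") -> str:
    """Return one named secret; the caller never handles the rest of the file."""
    secrets = json.loads(Path(path).read_text(encoding="utf-8"))
    if key not in secrets:
        raise KeyError(f"no secret named {key!r} in {path}")
    return secrets[key]
```

Because the value is returned to the caller and never written to `os.environ`, subprocesses spawned later do not inherit it.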
10. Run a codebase audit before major milestones
Technical debt is invisible until it's load-bearing. The most efficient time to run a systematic audit (dead-code detection, security/permissions sweep, code-size review) is before a major feature build, not after it ships.
The grep that catches wired-but-inert features (the most common dead-code miss):
grep -rln "<EntryPointFunction>" src/app/api/ src/app/\(app\)/
If the only file that matches is the file that defines the function, the feature shipped but nothing calls it. No test suite catches this. Type-checking doesn't catch it. The grep does.
What the audit found: an entire AEO v2 analysis chain (8 PRs of engineering work) was inert in production — fully built, no production consumers. The wiring step was missed in a rate-limit-salvaged ship. Found and fixed in the same afternoon.
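The same wired-but-inert check can run as a script, for example in CI before a milestone. This is a sketch of the grep's logic using plain substring matching; the function names are hypothetical, and the caller supplies the symbol, the defining file, and the directories to search.

```python
from pathlib import Path


def files_referencing(symbol: str, roots: list[str]) -> list[str]:
    """List files under the given roots whose text contains the symbol."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and symbol in path.read_text(encoding="utf-8", errors="ignore"):
                hits.append(str(path))
    return hits


def is_inert(symbol: str, defining_file: str, roots: list[str]) -> bool:
    """True when no file other than the defining one mentions the symbol."""
    defining = Path(defining_file).resolve()
    return all(Path(hit).resolve() == defining
               for hit in files_referencing(symbol, roots))
```

Substring matching will count a mention in a comment as a reference, so a `False` result still deserves a quick look at the listed files.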
The setup that works in production
Putting these together: a lean root CLAUDE.md (index only), nested domain files, explicit model tiers per agent, a permission allow-list audited quarterly, persistent typed memory, worktree isolation, session bookend protocols, tight agent specs, a broadcast channel, secrets kept in a gitignored file, and a codebase audit before each major milestone.
The combined result on a six-agent setup: token spend roughly halved, output consistency meaningfully higher, and session reconstruction time near zero.
The individual changes are not dramatic. The compounding of all 10 is.
Frequently asked questions
How long does the initial setup take?
The structural changes (CLAUDE.md restructure, memory architecture, worktree pattern, Telegram integration) take 1–2 days on a production codebase. The operational changes (session protocols, agent specs, permissions audit) are ongoing but lightweight. Most teams can get through the structural setup in a single focused session.
Which of the 10 has the highest ROI?
Model tier routing (#2) and root CLAUDE.md trimming (#1) together typically yield 40–60% cost reduction and can be implemented in an afternoon. If you only have time for two, start there.
Does this work for solo developers or only teams?
All 10 apply to solo setups. The Telegram broadcast layer is most obviously useful for teams, but it also works for solo developers as a persistent audit log — every agent action is timestamped and searchable. The memory architecture is arguably more valuable for solos, since there's no second human to carry context across sessions.
What's the risk of over-constraining agents with tight specs?
The main risk is gaps: if an agent's spec doesn't cover an edge case that comes up in production, the agent either fails silently or escalates unnecessarily. The fix is to treat the spec as a living document and update it based on real escalations. The operational lesson: a tight spec with an explicit "escalate to PM if outside scope" rule is better than a loose spec that tries to cover everything.