How AI agents catch the bugs your test suite misses
Test suites catch what you wrote tests for. That sounds obvious, but the implication is easy to miss: the most consequential class of production bugs — features that shipped but never wired up, permissions set wider than intended, structural dead weight accumulating over months — generates no test failures, no type errors, and no CI alerts. It just silently exists.
Three audit passes on a production Next.js codebase, run using isolated Claude Code agents, surfaced findings in exactly this class. The sessions took 30 minutes of human time and roughly $3 in API costs. Here is what they found, and what the detection patterns look like.
Key Takeaways
- AI agents catch a class of bug that test suites systematically miss: inert features, structural dead code, and permission surface area
- Worktree-isolated audit agents (no main-tree side effects) run clean passes without risk to the production branch
- The grep pattern that catches wired-but-inert features is short enough to live in a README
- A codebase audit before a major build cycle is cheaper than debugging the same issues after they compound
What the three audits covered
Each audit ran as a separate Claude Code agent in a git worktree: a separate working directory with its own branch checked out, so nothing the agent does touches the main working tree. Each agent received a specific brief: run a grep-and-read pass, surface findings, and open a PR for anything actionable.
Audit 1: Code-size sweep. Which files have grown beyond the point of maintainability, and what are the natural extraction seams?
Audit 2: Permissions audit. Which rules in the Claude Code allow-list are redundant, overlapping, or strictly broader than necessary?
Audit 3: Dead-code detection. Which exported functions, analysis chains, or features have no callers in production?
What came out of each pass
Code-size sweep: 1,702 lines extracted
Four page-level files had grown well beyond what a single file should carry. The largest was a grader client component at 1,200+ lines that was load-bearing for the /aeo-grader page. After the agent's refactor, it came out at 628 lines across three focused modules. Same behavior. No regressions. The agent split along natural seams — form state, results rendering, scoring display — that were already implicit in the code but hadn't been formalized.
Total: 1,702 lines extracted from 4 files into 12 modules. One PR.
The pattern the agent used was straightforward: identify files over 600 lines, read for natural import boundaries, propose splits. The judgment calls (what constitutes a natural seam, what to name the modules) were the agent's, reviewed in the PR diff. Approval took 5 minutes.
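The first step of that pattern, finding files over 600 lines, can be sketched in shell. This is an illustration under stated assumptions, not the agent's actual procedure: the function name oversized_files and the .ts/.tsx filter are hypothetical, and only the 600-line threshold comes from the description above.

```shell
# List TypeScript source files whose line count exceeds a threshold,
# largest first. Output format: <line-count><TAB><path>
# Usage: oversized_files <dir> [threshold]   (threshold defaults to 600)
oversized_files() {
  local dir="$1" threshold="${2:-600}"
  find "$dir" -type f \( -name '*.ts' -o -name '*.tsx' \) -exec wc -l {} + |
    awk -v limit="$threshold" '
      # wc adds a "total" row when it is given several files; skip it.
      NF == 2 && $2 == "total" { next }
      $1 + 0 > limit + 0 {
        count = $1
        sub(/^[ \t]*[0-9]+[ \t]+/, "")   # drop the count, keep the full path
        print count "\t" $0
      }' |
    sort -rn
}
```

Running oversized_files src/app 600 gives the candidate list. Deciding where the extraction seams are remains a judgment call that a line count cannot make.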
Permissions audit: 60 → 55 rules
The Claude Code allow-list had grown across several months of agent additions. A grep-based audit found 5 wildcard rules that were strictly subsumed by other rules already in the file: functionally dead entries that added permission surface without adding capability.
No security regression from removing them. The allow-list is tighter with fewer entries.
The audit also produced a documented runbook for the cuts — each removed rule annotated with the specific overlapping rule that makes it redundant. The runbook means the next engineer who reviews the allow-list knows why it looks the way it does.
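A subsumption check of this kind can be approximated in a few lines of shell. The sketch below rests on two loud assumptions that do not come from the audit itself: the rules have been exported to a plain text file, one per line, and a rule counts as covered when another rule matches it under ordinary shell-glob semantics (where * matches any run of characters). The tool's real matching logic may differ, so treat the output as candidates for manual review. The function name redundant_rules is hypothetical.

```shell
# Print every rule that is covered by a different rule in the same file.
# Input: a file with one rule per line (an assumed export format).
# Matching: shell-glob semantics, assumed here for illustration only.
# Output format: <rule><TAB>covered by<TAB><covering rule>
redundant_rules() {
  local file="$1" rule other
  while IFS= read -r rule; do
    [ -n "$rule" ] || continue
    while IFS= read -r other; do
      [ -n "$other" ] || continue
      # Exact duplicates are not reported; only strictly different rules.
      [ "$rule" = "$other" ] && continue
      # Unquoted $other acts as a pattern; quoted "$rule" is a literal string.
      case "$rule" in
        $other) printf '%s\tcovered by\t%s\n' "$rule" "$other"; break ;;
      esac
    done < "$file"
  done < "$file"
}
```

Each output line pairs a redundant rule with the rule that covers it, which is the same annotation the runbook records for every cut.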
Dead-code detection: the production bug
This is the finding that matters.
The agent found three unused code paths. Two were genuinely dead — orphaned utility helpers from a past authentication provider swap. They compile, they import cleanly, they contribute to nothing.
The third was different. An entire analysis chain — eight PRs of engineering work, a new AEO v2 scoring system — was inert in production. Fully built, fully tested, passing all unit tests. No production consumers.
The route handler that was supposed to call the new chain was still calling the old scorer. The wiring step was missed during a ship that was salvaged after hitting a rate limit, and the final integration file never got written. The tests passed because they tested the chain in isolation. The chain worked. Nothing called it.
The grep that caught it:
grep -rln "analyzeV2" src/app/api/ src/app/\(app\)/
One match: the file that defined analyzeV2. No route file. No page. The chain existed in the source tree with zero production consumers.
Found and fixed in the same session. The fix was wiring the route handler to the new chain — a 20-line change. The preceding 8 PRs then became live in production.
The detection pattern that generalizes
The inert-feature pattern is a specific, catchable failure mode. A function ships. Its tests pass. Nothing calls it. Months later, someone refactors based on the assumption that the function is load-bearing — or worse, deletes it thinking it's dead.
The detection rule:
grep -rln "<EntryPointFunction>" src/app/api/ src/app/\(app\)/
If the only match is the file that defines the function, the feature is inert. Test suites don't catch this. Type checking doesn't catch this. The grep does.
The rule scales one level up. If the inert function is itself the entry point of a larger chain, the entire chain is inert. grep -rln "<ChainRoot>" against route files and page files is the check. Anything with zero route-level consumers is not running in production, regardless of what the unit tests say.
Why test suites miss this class of bug
A unit test for analyzeV2 tests the function given inputs. It does not test that any route handler invokes analyzeV2 with real user data. Integration tests that test the route handler would catch this — but most codebases have integration tests that test the route's interface, not the specific implementation path it calls.
The gap: there is no automated check that asserts "the feature I built last sprint is actually being called by a production endpoint." That check has to be written explicitly or run as an audit pass.
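One way to write that check explicitly is a small gate that fails when a listed entry point has no consumer in the route layer. This is a sketch under assumptions: the manifest file, its "<symbol> <defining-file>" line format, and the function name assert_wired are all hypothetical, and the match is textual (whole-word grep), so it cannot tell a real call from a mention in a comment.

```shell
# Exit non-zero if any symbol in the manifest has no consumer in the
# entry-point directories. Manifest lines: "<symbol> <defining-file>".
# Usage: assert_wired <manifest> <dir>...
assert_wired() {
  local manifest="$1" symbol definition consumers status=0
  shift
  while read -r symbol definition; do
    # Skip blank lines and comment lines.
    case "$symbol" in ''|'#'*) continue ;; esac
    # -r recurse, -l file names only, -w whole word, -F fixed string.
    # The second grep drops the defining file from the list of consumers.
    consumers=$(grep -rlwF -- "$symbol" "$@" 2>/dev/null |
      grep -vxF -- "$definition" | wc -l | tr -d ' ') || true
    if [ "$consumers" -eq 0 ]; then
      echo "FAIL: $symbol has no entry-point consumer" >&2
      status=1
    else
      echo "ok: $symbol ($consumers consuming file(s))"
    fi
  done < "$manifest"
  return "$status"
}
```

Called as assert_wired entrypoints.txt src/app/api "src/app/(app)" in a CI step, the gate turns the one-off grep into a standing assertion. The defining-file path in each manifest line must be written the way grep prints it for the directories given.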
AI agents running grep-based audits are well-suited to this class of check because they're reading the codebase as a whole rather than exercising it in isolation. The audit is structural, not behavioral. It finds what exists and what connects to what, not whether the behavior is correct.
The cost calculus
Three audit passes. 30 minutes of human review time. ~$3 in API costs. One production bug caught (8 PRs of inert engineering work, now live). 1,702 lines of structural debt resolved.
The alternative cost of the inert AEO v2 chain: the next engineer to work on the scoring system would have started from the assumption that the old scorer was the live one. Any improvements to v2 would have been built on an inert foundation. The compounding cost of that misunderstanding grows with every session that operates on a false picture of production state.
Pre-build audits are the cheapest time to find this class of problem. Post-ship audits find the same problems after they've compounded.
Running this on your codebase
The three-pass structure (code-size, permissions, dead-code) is portable. The specific implementation depends on your stack, but the pattern is the same:
- Use git worktree add to isolate the audit agent from your main branch
- Brief the agent with a specific scope and a grep-based detection rule
- Review the PR — findings are the PR description, fixes are the diff
- Run the orphan-detection grep against route files after every major feature ship
The worktree isolation is load-bearing. An audit agent with write access to your main branch can make things worse. In a worktree, the worst-case outcome is a PR you don't merge.
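The isolation step can be sketched with two small helpers around git worktree. The branch prefix audit/, the sibling-directory naming, and both function names are assumptions chosen for illustration; the only parts taken from git itself are the worktree add and worktree remove subcommands.

```shell
# Create an isolated worktree on a fresh branch for one audit pass and
# print its path. Run from inside the repository.
# Usage: start_audit_worktree <audit-name>
start_audit_worktree() {
  local name="$1" repo_root worktree_dir
  repo_root=$(git rev-parse --show-toplevel) || return 1
  # Place the worktree beside the repo, not inside it, so the audit
  # checkout never shows up as untracked files in the main tree.
  worktree_dir="$(dirname "$repo_root")/$(basename "$repo_root")-audit-$name"
  # git's progress messages go to stderr so stdout carries only the path.
  git worktree add -b "audit/$name" "$worktree_dir" HEAD >&2 || return 1
  printf '%s\n' "$worktree_dir"
}

# Remove the worktree once its PR is merged or abandoned.
# Usage: finish_audit_worktree <worktree-path>
finish_audit_worktree() {
  git worktree remove "$1"
}
```

Point the agent at the printed path. If the PR is not merged, removing the worktree leaves the main working tree exactly as it was, though the audit branch itself remains until it is deleted.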
Frequently asked questions
How do you prevent the audit agent from introducing regressions?
Worktree isolation means the agent's changes don't touch your main branch until you merge the PR. Before merging, run your test suite and typecheck against the worktree branch. The audit's code-size refactor in this case was validated by npm test and npm run typecheck before the PR merged.
Should audits run on a schedule or ad hoc?
Both. A quarterly structural audit (code-size, permissions, dead-code) catches compounding debt. An ad-hoc orphan-detection grep after every major feature ship catches inert wiring before it compounds. The grep takes 10 seconds; there's no reason not to run it as a post-ship checklist item.
What's the risk of a false positive in dead-code detection?
The inert-feature grep (no route-level consumers) can produce false positives if the function is called from a non-route surface (a CLI script, a cron job, a test fixture). The check is: does this function have any intended production caller? Read the callers, not just the count. Two minutes of manual verification eliminates false positives for most codebases.
Does this apply outside of Next.js?
The pattern applies to any codebase with a clear entry-point layer (route handlers, controllers, event handlers). The grep targets that layer. For a Django codebase, substitute views.py and urls.py. For a Go service, substitute main.go and the handler registration files. The structural logic is the same: if the new function isn't wired into the entry-point layer, it isn't running.
What AI visibility audits look like for your business
The same structural gap — something that should be live in production, isn't — applies to AI search visibility. A business can have a website, a Google Business Profile, and a solid review count, and still be absent from ChatGPT, Perplexity, and Gemini citations because the structural signals (schema markup, entity clarity, FAQ coverage) weren't wired up.
The Lume AEO Grader runs the audit equivalent on your business's AI visibility: what's present, what's missing, and what to fix first. 30 seconds, no login.