Your AI Code Reviewer Fails on Rule 12
Here's Why
You run AI-assisted code review on a substantial diff. The agent produces a clean report. You merge. Three days later someone spots a missing params.expect() call that should have been flagged. You check the review. The rule was in the agent’s instructions. The violation was right there in the diff. The agent simply didn’t catch it.
This is not a prompt engineering problem. It’s an architecture problem.
The Structural Failure
When you give a single agent a checklist of 17 rules and a diff covering controllers, models, views, and migrations, you’re not giving it a task. You’re giving it a competition.
The agent starts at rule 1 with full attention. By rule 12, its context window carries the accumulated weight of everything it has already processed: every false positive considered, every file section scanned. Rule 17 gets whatever is left. Later rules are structurally disadvantaged.
The disadvantage isn’t positional in the token stream. It’s cumulative cognitive load. Each additional rule increases reasoning complexity and attention fragmentation.
What makes this insidious: a missed violation looks identical whether caused by context dilution or an ambiguous rule definition. You can’t distinguish them from the output. You might tighten the prompt when the real problem is an underspecified reference document, or rewrite the reference doc when the agent simply ran out of focused attention. You end up iterating on the wrong variable, indefinitely.
Better prompts don’t escape this. More emphasis, critical rules first, ALL CAPS: these help rule 1 and hurt rule 17 further. Prominence requires contrast, and contrast has a fixed budget. Longer system prompts add context weight, which is the wrong direction. Splitting rules across two or three agents improves the ratio but doesn’t solve the dynamic.
Larger context windows postpone structural problems. They don’t eliminate them. Architecture does.
The root cause is treating independent constraint checks as a single composite reasoning task. Each check is independent. None benefits from the results of the others. It’s a lookup task repeated N times. The natural shape is parallel, not sequential.
This becomes visible once you’ve run enough agentic reviews to notice which violations keep slipping through. It’s not the most complex rules that get missed. It’s whichever rules happen to fire late.
The Architecture: One Rule, One Agent
An orchestrator agent handles coordination. Haiku-class agents handle rule checking, one per rule, all in parallel.
code-review-orchestrator (Sonnet)
├── rule-reviewer: BR-01 params-expect (Haiku)
├── rule-reviewer: BR-08 prevent-n-plus-1 (Haiku)
├── rule-reviewer: BR-12 api-ready-controllers (Haiku)
├── rule-reviewer: FR-01 dom-id (Haiku)
├── rule-reviewer: FR-06 form-with-only (Haiku)
└── ... N agents total, all running simultaneously

The orchestrator reads the rule index, maps file paths to scopes, filters rules to those relevant to what changed, builds a prompt for each applicable rule, and spawns all agents at once. It does not review code.
Each rule-reviewer receives three things: the rule definition, the full reference document for that rule, and a scoped diff covering only the files relevant to that rule’s scope. A controller rule does not receive view templates or CSS.
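Those three inputs can be assembled mechanically. A minimal Ruby sketch of the prompt a rule-reviewer might receive; the field names, file layout, and prompt wording are illustrative assumptions, not the system's actual format:

```ruby
# Sketch: assemble one rule-reviewer's input from its three pieces.
# Field names (:code, :id, :severity, :scope, :ref) are assumptions.
def build_reviewer_prompt(rule, scoped_diff)
  reference = File.read(rule[:ref]) # full reference document for this rule
  <<~PROMPT
    Rule: #{rule[:code]} (#{rule[:id]}), severity: #{rule[:severity]}

    Reference document:
    #{reference}

    Scoped diff (only files in scope "#{rule[:scope]}"):
    #{scoped_diff}

    Output JSON only: rule_id, rule_code, severity, violations,
    violation_count, checked.
  PROMPT
end
```

The prompt contains nothing but the rule, its reference, and the scoped diff: no other rules, no unrelated files, no accumulated state.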
The rule-reviewer’s job is deterministic. Read the rule. Read the reference. Find matches in the diff. Output JSON:
{
  "rule_id": "rails.backend.params-expect",
  "rule_code": "BR-01",
  "severity": "error",
  "violations": [
    {
      "file": "app/controllers/notifications_controller.rb",
      "line": 18,
      "violation": "Uses params.require instead of params.expect",
      "excerpt": "params.require(:notification).permit(:message)"
    }
  ],
  "violation_count": 1,
  "checked": true
}

No narrative. No suggestions. One rule, one answer.
Structured output eliminates interpretive drift.
This separation is itself a design principle: coordination to Sonnet, evaluation to Haiku. Coordination requires reasoning — which rules apply, which files are in scope, how to aggregate results. Rule checking requires a precise definition and a focused diff. Matching model to task keeps costs low and outputs clean.
Scope Filtering
Each rule has a scope field: controllers, models, views, helpers, jobs, css, routes, migrations, tests. The orchestrator maps changed file paths to scopes and spawns agents only for rules whose scope intersects what changed.
A backend-only feature touching controllers and models will not spawn CSS, routing, or migration rule agents. For a typical full-stack feature, 15 to 20 agents run.
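The mapping itself is ordinary pattern matching. A sketch under assumed conventions (the scope names come from the article; the path patterns are a guess at a standard Rails layout):

```ruby
# Illustrative path → scope table. Scope names mirror the rule index;
# the patterns assume a conventional Rails directory layout.
SCOPE_PATTERNS = {
  controllers: %r{\Aapp/controllers/},
  models:      %r{\Aapp/models/},
  views:       %r{\Aapp/views/},
  helpers:     %r{\Aapp/helpers/},
  jobs:        %r{\Aapp/jobs/},
  css:         %r{\.(css|scss)\z},
  routes:      %r{\Aconfig/routes\.rb\z},
  migrations:  %r{\Adb/migrate/},
  tests:       %r{\A(test|spec)/}
}.freeze

def scopes_for(changed_files)
  SCOPE_PATTERNS.select { |_, pat| changed_files.any? { |f| f.match?(pat) } }
                .keys
end

def applicable_rules(rules, changed_files)
  active = scopes_for(changed_files)
  rules.select { |r| active.include?(r[:scope]) }
end
```

A controllers-and-models diff activates exactly two scopes, so only rules in those scopes spawn agents.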
Profiles
Profiles define which rules run.
Run fast during active development. Run strict before merging.
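One possible shape for profiles, sketched as severity filters; the article names only fast and strict, so the structure here is an assumption:

```ruby
# Assumed profile shape: a profile selects which severities run.
PROFILES = {
  fast:   { severities: %w[error] },          # active development: errors only
  strict: { severities: %w[error warning] }   # pre-merge: errors + warnings
}.freeze

def rules_for_profile(rules, profile)
  allowed = PROFILES.fetch(profile)[:severities]
  rules.select { |r| allowed.include?(r[:severity]) }
end
```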
Output and Gating
Two artifacts per run.
The JSON report is machine-readable: every rule checked, every violation, aggregated counts, and a passed boolean. Pipeline gating:
passed = (blocking_count == 0)

The markdown report is human-readable. Blocking violations appear first with file, line, excerpt, and a link to the reference doc. Advisory warnings follow. Each run ends with a Verdict:
APPROVED: no violations
APPROVED WITH SUGGESTIONS: no blocking violations, advisory warnings present
CHANGES_REQUIRED: one or more blocking violations
If CHANGES_REQUIRED, the violations report returns to the engineer. Fixes are applied. The review reruns. Maximum two iterations.
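Aggregation and gating reduce to a fold over the per-agent JSON. A sketch, assuming error means blocking and warning means advisory (the result hashes follow the rule-reviewer output shape):

```ruby
# Sketch: aggregate per-agent results into a verdict and a gate.
# Assumes severity "error" = blocking, "warning" = advisory.
def verdict(results)
  blocking = results.select { |r| r[:severity] == "error" }
                    .sum { |r| r[:violation_count] }
  advisory = results.select { |r| r[:severity] == "warning" }
                    .sum { |r| r[:violation_count] }

  return "CHANGES_REQUIRED" if blocking.positive?
  advisory.positive? ? "APPROVED WITH SUGGESTIONS" : "APPROVED"
end

def passed?(results)
  verdict(results) != "CHANGES_REQUIRED"
end
```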
Cost
Each Haiku rule-reviewer processes roughly 3,000 tokens (rule definition + reference doc + scoped diff) and returns about 300 tokens of JSON, approximately $0.003 per agent.
The result is roughly 1.5 to 2x the cost of the single-agent approach. The previous approach used a more expensive model for a task that doesn’t require reasoning; the switch to Haiku largely offsets the cost of parallelism.
Costs assume scoped diffs and disciplined reference docs. Large diffs or poorly bounded rule documents increase token usage linearly. Cost scales directly with the number of rules in your index. Start with your highest-value rules, validate they catch what matters, then expand.
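For a back-of-envelope estimate, the article's per-agent figures can be plugged into a tiny cost model. The per-token prices below are placeholder assumptions chosen to reproduce the ~$0.003-per-agent figure, not published rates:

```ruby
# Placeholder per-token prices (assumptions, not published rates),
# tuned so one agent at ~3,000 in / ~300 out lands near $0.003.
INPUT_PRICE_PER_TOKEN  = 0.000_000_6
OUTPUT_PRICE_PER_TOKEN = 0.000_004

def run_cost(agents:, input_tokens: 3_000, output_tokens: 300)
  per_agent = input_tokens * INPUT_PRICE_PER_TOKEN +
              output_tokens * OUTPUT_PRICE_PER_TOKEN
  agents * per_agent
end
```

At 18 agents, a typical full-stack run comes out around $0.05 in Haiku spend, before the orchestrator's Sonnet cost.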
The Diagnostic Benefit
Per-rule isolation makes failures diagnostic.
With a single-agent reviewer, a missed violation is ambiguous: context dilution or ambiguous rule definition, impossible to tell which. You adjust architecture and docs simultaneously, never knowing which change mattered.
With per-rule agents, the ambiguity collapses. If a dedicated agent with one rule and one focused diff still misses a known violation, the reference document is the problem. Inject a known violation, run the agent, see if it catches it. The rule docs become testable.
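That injection test can live in your suite as an ordinary regression check. A sketch for BR-01; check_rule here is a local stub standing in for the real call that spawns the agent and parses its JSON, so the shape of the test runs without an API key:

```ruby
# Known-violation regression check for the BR-01 reference doc.
# check_rule is a stand-in: the real version would spawn the BR-01
# agent with the rule doc and this diff, then parse its JSON output.
# Here it is stubbed with the pattern the doc describes.
def check_rule(_rule_code, diff)
  violations = diff.each_line.with_index(1).filter_map do |line, n|
    { line: n, excerpt: line.strip } if line.include?("params.require(")
  end
  { checked: true, violation_count: violations.size, violations: violations }
end

# A diff seeded with one deliberate BR-01 violation.
SEEDED_DIFF = <<~DIFF
  +  def create
  +    params.require(:notification).permit(:message)
  +  end
DIFF
```

If the real agent misses this seeded violation, the reference doc is the problem; the test points you straight at it.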
Each missed catch points to which reference document needs work, not to a mystery about which part of the system failed.
In Practice
BR-08, preventing N+1 queries, fires on controller diffs, requires matching query calls against eager loading, and returns a specific line and excerpt. In a single-agent review checking 20 rules, it fires somewhere in the middle. In a parallel review, a dedicated agent reads the controller diff with one question: is there an N+1? It doesn’t matter when it fires relative to other rules. Every agent fires at the same time.
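For concreteness, here is the shape BR-08 flags, simulated with a query counter so it runs without a database. Post, the author names, and the counting scheme are illustrative stand-ins, not ActiveRecord:

```ruby
# Simulated query counting to make the N+1 shape visible.
$query_count = 0

class Post
  AUTHOR_NAMES = %w[ada grace alan].freeze

  def self.all
    $query_count += 1                      # one query for the posts
    AUTHOR_NAMES.each_index.map { |i| new(i) }
  end

  def self.with_authors_included
    $query_count += 1                      # one query loads posts and authors
    AUTHOR_NAMES.each_index.map { |i| new(i, preloaded: true) }
  end

  def initialize(index, preloaded: false)
    @index = index
    @preloaded = preloaded
  end

  def author_name
    $query_count += 1 unless @preloaded    # lazy load: one query per post
    AUTHOR_NAMES[@index]
  end
end

# Violation shape (what BR-08 flags): 1 query for posts + 3 for authors.
$query_count = 0
Post.all.map(&:author_name)
n_plus_one_total = $query_count            # 4 queries

# Eager-loaded shape: everything in one query.
$query_count = 0
Post.with_authors_included.map(&:author_name)
eager_total = $query_count                 # 1 query
```

The agent's whole job is to spot the first shape in a controller diff and point at the line.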
A violation that passed a single-agent review (present in the diff, rule active) was caught on the first run with a dedicated BR-08 agent. Same diff. Same rule. Different architecture.
The rule wasn’t wrong. The reference doc wasn’t ambiguous. The agent had simply processed 14 other rules before getting there.
Principles
Interleaved tasks degrade with scale. A single agent handling N rules gives earlier rules more attention and later rules less. Structural, not fixable with prompts.
Task type determines architecture. Pattern matching against an explicit definition benefits from isolation. Coordination benefits from reasoning. Match the model to the task.
Scope filtering is precision, not performance. An agent seeing only relevant files gives more useful results than one filtering mentally from everything.
Ambiguous failures compound. If you can’t distinguish “architecture failed” from “reference doc is wrong,” you can’t systematically improve either. Per-rule isolation makes failures diagnostic.
Cost should be proportional to stakes. fast for pre-commit, strict for pre-merge. Design your review tiers deliberately.
Generalization
This pattern is not specific to Rails or code review.
Any system enforcing independent constraints through a single reasoning process will degrade as constraints scale. Not because the model is weak, but because the task shape is wrong.
The architectural law is simple: independent constraints should not share cognitive state.
This applies wherever you have N independent checks, each with an explicit definition, none depending on the others’ reasoning, and where missing one creates silent failure. Linting. Security policy enforcement. Spec validation. Compliance checks. Schema enforcement. Feature conformance. AI guardrails.
When you force a single agent to juggle unrelated constraints, you compress N independent validation tasks into one composite reasoning chain. Cognitive load increases. Failure modes hide.
This is not about parallelism for speed. It is about isolation for determinism.
Getting Started
Day 1: Build the rule index and reference docs. Each entry needs id, severity, scope, and ref. Each rule gets a dedicated reference document: what the rule is, what a violation looks like with code examples, what correct usage looks like. Start with the rules where a miss has cost you before. Specificity here directly determines detection quality.
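One possible shape for the index, sketched as Ruby data; the field names come from the article, and every value here is an illustrative assumption:

```ruby
# Assumed rule index shape: each entry carries id, severity, scope, ref.
RULE_INDEX = [
  {
    id:       "rails.backend.params-expect",
    code:     "BR-01",
    severity: "error",              # blocking
    scope:    :controllers,
    ref:      "docs/rules/br-01-params-expect.md"
  },
  {
    id:       "rails.backend.prevent-n-plus-1",
    code:     "BR-08",
    severity: "error",
    scope:    :controllers,
    ref:      "docs/rules/br-08-prevent-n-plus-1.md"
  },
  {
    id:       "rails.frontend.form-with-only",
    code:     "FR-06",
    severity: "warning",            # advisory
    scope:    :views,
    ref:      "docs/rules/fr-06-form-with-only.md"
  }
].freeze
```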
Week 2: Build the orchestrator and rule-reviewer agent. The orchestrator handles scope mapping, rule filtering, prompt construction, parallel spawning, and aggregation. The orchestrator must not inspect code content. The rule-reviewer takes three inputs and outputs JSON only. If it produces narrative, it is scope-creeping.
Week 3: Test with known violations. Inject deliberate violations into a test diff. Any rule that fails to catch its known violation has a reference doc problem: fix the doc, not the architecture. Then build profiles, at minimum fast (errors only) and strict (errors + warnings), and wire them to your pre-commit hook and pull request gate.
The system will expose gaps in your reference documents faster than you expect. That is working as intended.
What Changes
When this is running, a code review is a lookup, not a read. The review fires, runs in parallel, and returns a report with specific file names, line numbers, and excerpts for every violation found. Either the report is clean, or it contains a precise list of things to fix. No ambiguity about what the agent noticed or missed.
Violations that reach main are the ones your rule definitions didn’t cover: fixable information, not unexplained failure.
False confidence scales faster than visible failure. A noisy system gets fixed. A confident system that quietly misses violations gets trusted, until the violation reaches production.
AI-assisted review that misses violations unpredictably is worse than no review. A clean report from a structurally flawed system inserts false assurance between the engineer and the error.
This pattern generalizes. Any time you have independent constraints to evaluate, isolate them. Let coordination reason. Let evaluation specialize. That is the difference between hoping an agent remembers everything and designing a system that doesn’t require it to.
The parallel per-rule architecture does not guarantee perfect detection. Nothing does. But it removes the structural cause of inconsistency, makes failures diagnostic rather than opaque, and scales linearly with the number of rules you enforce.
That is a different quality of system.

