Agentic Engineering: Orchestration Design
Building Orchestration Systems That Actually Work
Most AI orchestration systems work. That’s not the problem.
The problem is they work like junior-level architecture: functional in the moment, creating maintenance debt that compounds over time. Fragile coordination logic. Unpredictable context windows. Agents that can’t run independently. Failures that cascade silently.
You trade coding time for debugging time. The leverage disappears.
The Orchestration Quality Gap
Building multi-agent systems typically follows one of two patterns:
Monolithic agents:
Single prompt handles planning, execution, validation, review
Context balloons unpredictably (20k-60k tokens)
Works for Case A, breaks on Case B
2+ hours per successful run, unpredictable costs
Hand-rolled coordination:
Orchestration script passes content between agents as strings
Context pollution (Agent B sees Agent A’s scratch work)
Orchestrator makes domain decisions it shouldn’t
Distributed monolith with unclear boundaries
Both fail for the same reason: treating orchestration like code execution instead of team coordination.
The Pattern That Emerged
After building the Visionaire orchestration system—50+ features end-to-end with consistent quality—a structural pattern emerged:
AI orchestration fails when we architect it like code execution, not like team coordination.
Effective orchestration coordinates specialists:
Who works when (sequencing)
Clean handoffs (file-based interfaces)
Progress tracking (metadata)
No domain decisions (agents decide HOW)
The framework that produces consistent results has four core architectural properties.
Autonomy Emerges From Constraint
Unclear roles create hesitation. An agent without explicit expertise doesn’t know when its judgment applies. It defaults to asking permission rather than risk exceeding unclear bounds.
Vague authority creates insecurity. Without knowing what’s fixed versus flexible, agents either violate scope boundaries or seek validation for decisions within their authority. Both waste time.
Unlimited freedom creates chaos. An agent with no explicit constraints has no framework for judgment. It tries everything, fails repeatedly, learns nothing transferable between tasks.
Explicit boundaries enable autonomy. When an agent knows precisely what it cannot change, it moves confidently within what it can. When it knows which tools are forbidden, it uses allowed tools without trial-and-error. When it knows when to ask versus proceed, it asks only when necessary.
This applies equally to humans and AI agents. Senior engineers are effective not despite constraints, but because of them. Rails conventions don’t limit DHH—they enable him to build faster by eliminating low-value decisions. The same mechanism works for agents.
The four layers that follow formalize this principle into practice.
The Four-Layer Architecture
Orchestration systems that produced reliable output shared this structure:
┌─────────────────────────────────────────┐
│ Layer 1: Input Specification │
├─────────────────────────────────────────┤
│ Layer 2: Derived Context │
├─────────────────────────────────────────┤
│ Layer 3: Phase Pipeline │
├─────────────────────────────────────────┤
│ Layer 4: Metadata & Learning │
└─────────────────────────────────────────┘

These layers enforce separation of concerns:
Problem 1: Unpredictable execution costs
Layer 3 runs each agent in a fresh context window with explicit inputs only. Predictable token costs per phase.
Problem 2: Unclear failure modes
Layer 1 validates upfront. Layer 3 enforces explicit failure handling with detailed error messages.
Problem 3: Orchestrators making domain decisions
Architectural constraint—orchestrators coordinate (THAT things happen), agents decide (HOW).
Problem 4: No learning or improvement
Layer 4 tracks execution data: confidence, quality scores, tokens, duration, domain signals.
Layer 1: Input Specification
Defines required inputs, validates format, fails fast with clear errors.
Most orchestration failures happen because ambiguous inputs create ambiguous execution. Validating structure upfront prevents downstream agents from working with malformed data.
Pattern:
## Input (Required)
- Feature spec file path
Example: `docs/features/F-003-notifications.md`
## Input Validation
Feature spec filename must match pattern: `F-###-*.md`
If filename does not match, STOP and surface error:
❌ Feature spec must match format: F-###-name.md
Example: docs/features/F-003-notifications.md
   Got: [actual filename]

Orchestrators that silently proceed with malformed inputs fail 20 minutes into execution. With upfront validation, failures happen in under 1 second with actionable errors.
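The validation rule above can be sketched as a small helper, assuming a Python orchestrator (`validate_spec_path` is an illustrative name; the real system expresses this rule in a command file):

```python
import re
import sys

# Filename contract: F-###-name.md
SPEC_PATTERN = re.compile(r"^F-\d{3}-[\w-]+\.md$")

def validate_spec_path(path: str) -> str:
    """Fail fast, with an actionable error, if the spec filename is malformed."""
    filename = path.rsplit("/", 1)[-1]
    if not SPEC_PATTERN.match(filename):
        sys.exit(
            "❌ Feature spec must match format: F-###-name.md\n"
            "   Example: docs/features/F-003-notifications.md\n"
            f"   Got: {filename}"
        )
    return filename
```

The point is the placement, not the regex: this check runs before any agent is launched, so a malformed input costs one second instead of twenty minutes.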
Layer 2: Derived Context
Derives all necessary context deterministically from validated inputs. IDs, paths, branch names, storage locations, all calculated once, upfront.
Agents need context: where to read inputs, write outputs, what IDs to use. If each agent derives this independently, you get inconsistency (Agent A writes to feature/F-003/, Agent B to features/F3/). When the orchestrator derives once and passes explicitly, you get consistency by construction.
Pattern:
## Derived Context (Deterministic)
From feature spec filename (`F-003-notifications.md`):
- **Feature ID:** `F-003`
- **Feature Slug:** `notifications`
- **Target Branch:** `feature/F-003`
- **Artifact Directory:** `implementation/F-003/`
If filename does not match `F-###-*`, STOP and surface error.

One source of truth. All agents receive the same derived context via prompt.
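A minimal sketch of the same derivation, again assuming a Python orchestrator (`derive_context` is a hypothetical name):

```python
import re

def derive_context(spec_path: str) -> dict:
    """Derive all orchestration context once, upfront, from the spec filename."""
    filename = spec_path.rsplit("/", 1)[-1]
    match = re.match(r"^(F-\d{3})-([\w-]+)\.md$", filename)
    if not match:
        raise ValueError(f"Feature spec must match F-###-*.md, got: {filename}")
    feature_id, slug = match.groups()
    # Every downstream path and name is a pure function of the validated input.
    return {
        "feature_id": feature_id,
        "feature_slug": slug,
        "target_branch": f"feature/{feature_id}",
        "artifact_dir": f"implementation/{feature_id}/",
    }
```

Because every value is computed in one place, Agent A and Agent B cannot disagree about where `implementation/F-003/` lives.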
Layer 3: Phase Pipeline
Defines explicit phases that run sequentially. Each phase launches a specialized agent in a fresh context window via the Task tool, passes only necessary inputs as file paths, handles success/failure explicitly.
This is where most orchestration systems fail:
Pass content between agents (token bloat, context pollution)
Share context windows (tight coupling, unpredictable costs)
Handle failures implicitly (silent degradation)
The Phase Pipeline pattern enforces clean separation and predictable execution.
Fresh Context Windows
Each agent runs via the Task tool with:
Agent’s own prompt/system instructions
Explicit inputs passed as file paths
No other context from previous phases
This produces:
Clean separation of concerns
Predictable token costs per phase
No context pollution
Composable black boxes
Testable agents in isolation
Pattern:
## Phase 1 — Architecture Planning
**Subagent:** `visionaire-rails-team:architect`
**Invocation:**

Use Task tool with:
  subagent_type: "visionaire-rails-team:architect"
  description: "Design architecture for feature"
  prompt: |
    You are in DESIGN MODE.

    Read the feature specification at: docs/features/F-003-notifications.md
    Design the complete architecture and create an implementation plan at:
    implementation/F-003/F-003-notifications-IMPLEMENTATION.md
**Expected Output:**
- File: `implementation/F-003/F-003-notifications-IMPLEMENTATION.md`
**Failure Handling:**
If architect reports blocking ambiguity:
1. Set `final_status = "halted"`
2. Set `error = "Architecture planning halted: {explanation}"`
3. STOP orchestration

Anti-Pattern: Passing Content
# ❌ BAD: Passing content, not paths
spec_content = read_file(spec_path)
plan = agent_1.call(f"Here's the spec:\n{spec_content}\nMake a plan")

Why this is bad:
Wastes tokens (orchestrator already read this)
Prevents agent from re-reading if needed
Creates tight coupling
Context pollution (agent gets orchestrator’s interpretation)
Correct Pattern: Passing Paths
# ✅ GOOD: Passing file paths
prompt: |
  Read the feature specification at: docs/features/F-003-notifications.md
  Read the implementation plan at: implementation/F-003/F-003-notifications-IMPLEMENTATION.md

Why this is better:
Agent reads what it needs, when it needs it
Agent can re-read for clarification
Minimal token usage in invocation
Loose coupling (file-based interface)
Agent gets source material, not interpretation
Context windows with content passing: 40k tokens per agent. Cost per feature: $1.20.
Context windows with file paths: 8-20k tokens per agent. Cost per feature: $0.55.
The difference: Interfaces, not pipelines.
Layer 4: Metadata & Learning
Tracks rich execution metadata after each phase: agent confidence, quality scores, token costs, duration, inputs, outputs, domain-specific signals.
Without metadata, you can’t improve. You don’t know which phases are expensive, which specs are ambiguous, which agents need refinement. With rich metadata, patterns emerge.
Pattern:
{
  "schema_version": "2.0",
  "spec_id": "F-003",
  "started_at": "2026-01-31T01:00:00Z",
  "completed_at": "2026-01-31T01:50:00Z",
  "status": "complete",
  "phases": [
    {
      "phase": "architecture",
      "agent": "visionaire-rails-team:architect",
      "execution": {
        "model": "claude-opus-4-6",
        "input_tokens": 12000,
        "output_tokens": 3500,
        "duration_seconds": 600,
        "cost_usd": 0.18
      },
      "agent_insights": {
        "confidence": 0.95,
        "quality_score": 0.92
      }
    }
  ],
  "metrics": {
    "total_cost_usd": 0.55,
    "revision_cycles": 1
  },
  "learning_signals": {
    "complexity": "medium",
    "avg_agent_confidence": 0.88
  }
}

Patterns that emerge from metadata:
Low confidence scores correlate with ambiguous specs
High token counts correlate with complex domains
Revision cycles correlate with missing validation rules
Data-driven improvement instead of guessing.
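As a sketch of how those patterns surface: assuming each run writes an `orchestration.json` with the schema above into its artifact directory, a hypothetical `summarize_runs` helper can aggregate signals across runs:

```python
import json
from pathlib import Path
from statistics import mean

def summarize_runs(root: str) -> dict:
    """Aggregate cost and confidence signals across orchestration.json files."""
    confidences, costs = [], []
    for path in Path(root).glob("*/orchestration.json"):
        run = json.loads(path.read_text())
        costs.append(run["metrics"]["total_cost_usd"])
        for phase in run["phases"]:
            confidences.append(phase["agent_insights"]["confidence"])
    return {
        "runs": len(costs),
        "avg_cost_usd": round(mean(costs), 2) if costs else None,
        "avg_confidence": round(mean(confidences), 3) if confidences else None,
    }
```

Sort the same data by phase or by spec and the correlations above (low confidence ↔ ambiguous specs, high tokens ↔ complex domains) become visible rather than anecdotal.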
The Orchestrator Boundary
The framework enforces a critical architectural constraint:
Orchestrators enforce THAT things happen, not HOW they happen.
The Orchestrator MUST:
Launch all agents via Task tool (fresh context windows)
Pass inputs as file paths, not content
Update orchestration.json after every phase
Halt on terminal failures with detailed errors
Enforce phase sequencing and retry limits
The Orchestrator MUST NOT:
Interpret specifications (planning agent’s job)
Define quality rules (agents + skills decide)
Judge technical decisions (architect decides)
Evaluate code quality (reviewer decides)
Determine if requirements met (validator decides)
When orchestrators make domain decisions, they become bottlenecks. Every domain change requires updating the orchestrator. When orchestrators only coordinate, domain expertise lives in agents and skills, where it belongs.
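The boundary can be sketched as a loop, with `launch_agent` standing in for the Task tool and all names hypothetical: the orchestrator sequences phases, records metadata, and halts on failure, but never inspects or judges the artifacts themselves.

```python
def run_pipeline(phases, launch_agent, record_metadata):
    """Coordinate THAT each phase runs; never decide HOW."""
    for phase in phases:
        # Pass the agent its prompt (which contains file *paths*, not content).
        result = launch_agent(
            subagent=phase["subagent"],
            prompt=phase["prompt"],
        )
        record_metadata(phase["name"], result)
        if result.get("status") != "complete":
            # Halt with a detailed error; no silent degradation.
            return {"final_status": "halted",
                    "error": f"{phase['name']} halted: {result.get('error')}"}
    return {"final_status": "complete"}
```

Note what is absent: no spec parsing, no quality rules, no code inspection. Domain changes touch agents and skills, never this loop.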
Real-World Results
My reference implementation: visionaire-rails-team
Domain: Full-stack Rails feature development
Pipeline: Spec → Architecture → Implementation → Validation → Review
Agents:
Architect (Opus) - Designs architecture
Engineer (Opus) - Implements with TDD
Feature Validator (Sonnet) - Validates plan compliance
Code Reviewer (Sonnet) - Reviews against sacred rules
Spec Validator (Sonnet) - Checks completeness
Results after 50+ features:
Cost metrics:
Average cost: $0.55 per feature (previously $1.20 with single-agent)
Token predictability: Phase 1: ~12k, Phase 2: ~20k, Phase 3-5: ~8k (previously 8k-60k variance)
Quality metrics:
Sacred Rule violations: 0.5 per feature (previously 4-5)
Revision cycles: 1.2 average (previously 2.5)
Code review: 2-3 minor suggestions (previously 30-60 minutes refactoring)
Behavioral change:
Before: Code “worked” but required senior refactoring. Quality inconsistent. No visibility into failures.
After: Code follows established patterns from start. Quality consistent. Full observability via metadata. Minimal human intervention.
The mechanism: Specialized agents in fresh context windows produce depth in their domain. File-based communication eliminates context pollution. Metadata reveals patterns.
The Eight Core Principles
From 50+ orchestrated features:
1. Subagent Isolation
One job per agent. Depth over breadth. Specialization enables depth.
2. File-Based Communication
Agents communicate through artifacts, not context. Orchestrators pass file paths, never content. Loose coupling, independent testing.
3. Fresh Context Windows
Each phase runs in a clean slate via Task tool. No context pollution. Predictable costs.
4. Metadata-Driven Learning
Track rich execution data to identify patterns. Data drives improvement.
5. Fail Fast With Context
Validate inputs immediately. Halt on failures with detailed errors. No silent degradation.
6. Enforce Structure, Not Content
Orchestrators coordinate (THAT), agents decide (HOW). Expertise belongs with specialists.
7. Deterministic Derivation
Derive all context from inputs once, upfront. Consistency by construction.
8. Revision With Limits
Allow quality gates to trigger re-execution, cap iterations. Bounded automation prevents infinite loops.
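Principle 8 reduces to a capped loop. In this sketch, the hypothetical `execute` and `review` callables stand in for the engineer and reviewer agents:

```python
def run_with_revisions(execute, review, max_revisions=2):
    """Quality gate with a hard cap: bounded automation, no infinite loops."""
    feedback = None
    for attempt in range(max_revisions + 1):
        artifact = execute(feedback)  # re-run with reviewer feedback, if any
        verdict = review(artifact)
        if verdict["approved"]:
            return {"status": "complete", "revision_cycles": attempt}
        feedback = verdict["feedback"]
    return {"status": "halted",
            "error": f"Quality gate still failing after {max_revisions} revision cycles"}
```

The cap is the point: a gate that can trigger unbounded re-execution is just an expensive infinite loop with extra steps.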
Getting Started
Start with one orchestration. Elevate its quality. Then scale.
Day 1: Define Your Pipeline (2 hours)
Pick one multi-step workflow you currently run manually or with a monolithic agent.
Create a command file with:
Input specification
Derived context
Phase definitions (use Task tool, pass file paths)
Day 2: Test One Phase (1 hour)
Run Phase 1 in isolation. Compare output quality to your current approach. The specialization should be noticeable.
Week 2: Add Metadata Tracking (3 hours)
Create orchestration.json after each phase. Track tokens, cost, duration. Observe predictability.
Week 3: Add Failure Handling (2 hours)
For each phase, define explicit failure modes. Run a test case that should fail. Verify clear error messages.
Month 2: Add Remaining Phases
Build out your full pipeline incrementally. Each phase: fresh context window, file-based inputs, explicit outputs, clear failure modes, metadata tracking.
The Structural Choice
Monolithic agents or hand-rolled coordination produces:
Output that works in the moment
Unpredictable debugging time on context pollution
Token costs that vary wildly
Agents as black boxes with no observability
Quality that fluctuates
Structured orchestration produces:
Predictable costs (known token ranges per phase)
Consistent quality (specialized agents in their domain)
Clear failures (detailed errors, not mysteries)
Observability (metadata showing what happened)
Improvement signals (data showing where to refine)
The framework doesn’t eliminate all problems. Agents will occasionally misinterpret requirements. Features will need revision cycles. Complex domains will cost more tokens. Edge cases will halt execution.
But the difference: systems that improve over time versus systems that accumulate debt.
Quick Reference
The Four Layers:
Input Specification - Validate upfront, fail fast
Derived Context - Single source of truth
Phase Pipeline - Fresh context windows, file-based communication
Metadata & Learning - Track execution data
The Eight Principles:
Subagent Isolation - One job per agent
File-Based Communication - Pass paths, not content
Fresh Context Windows - Clean slate per phase
Metadata-Driven Learning - Track execution data
Fail Fast With Context - Validate early, halt clearly
Enforce Structure, Not Content - Coordinate THAT, delegate HOW
Deterministic Derivation - Single source of truth
Revision With Limits - Bounded improvement
Quality Indicators:
Predictable token costs per phase
Clear failure messages
Consistent output quality
Observable execution via metadata
Improvement signals over time
Start Here:
Define one pipeline (input → phases → output)
Use Task tool for fresh context windows
Pass file paths, not content
Track metadata (tokens, cost, duration)

