Agentic Engineering: The Learning Layer
Measuring What Matters for Self-Improving AI Teams
The Metrics Gap
Your agent team shipped its twentieth feature. Code works. Tests pass.
But you don’t know:
Why Feature 5 cost $2.80 when Feature 3 cost $0.45
Which Sacred Rules are violated most often
Whether agents improve over time or degrade
What makes features fast versus slow
How spec quality affects downstream outcomes
Traditional CI/CD tracks binary outcomes: pass or fail. Build time. Exit codes.
That’s insufficient for AI-assisted development.
CI/CD tracks: “Tests passed in 10 minutes”
You need: “Tests passed, coverage 95%, all Sacred Rules followed, architect confidence 0.98, engineer used 90K tokens ($0.38), medium complexity, similar to F-002”
The difference is signal quality.
Systems that improve over time measure execution in ways that reveal how to improve.
Improvement Requires Visibility
Most agent systems execute work in isolation. Each feature independent. No memory. No learning loop. Same mistakes repeated.
This isn’t a model limitation. It’s an architecture limitation.
Three mechanisms prevent learning:
Unmeasured inputs obscure causation. Low-quality specs produce uncertain architects. Uncertain architects produce revision cycles. But without measuring spec quality, the root cause stays hidden. You optimize symptoms.
Binary outcomes hide gradients. “Tests passed” reveals nothing about confidence, adherence to patterns, or edge case handling. Without nuance, you can’t distinguish excellent from acceptable.
Isolated executions prevent pattern detection. Feature costs vary 6x ($0.45 to $2.80). Without historical context and similarity metrics, each feature is unpredictable.
Visibility creates the feedback loop. Measure inputs → Track process → Aggregate signals → Detect patterns → Optimize structure.
This applies equally to human organizations and AI systems. You cannot improve what you cannot see. The three-tier measurement architecture makes invisible processes visible.
The Three-Tier Measurement Architecture
Systems that learn share the same structure:
```
┌─────────────────────────────────────────────────┐
│ TIER 1: AGENT SELF-ASSESSMENT                   │
│ Agents report: confidence, quality, insights    │
└─────────────────────────────────────────────────┘
                        ↓
┌─────────────────────────────────────────────────┐
│ TIER 2: EXECUTION TRACKING                      │
│ Orchestrator captures: tokens, cost, duration   │
└─────────────────────────────────────────────────┘
                        ↓
┌─────────────────────────────────────────────────┐
│ TIER 3: LEARNING SIGNALS                        │
│ Aggregated metrics: complexity, patterns, trends│
└─────────────────────────────────────────────────┘
```
Tier 1: Agent Self-Assessment
Most systems treat agents as black boxes. Work enters. Artifacts exit. No insight into process.
Agents are closest to the work. They assess nuances automated metrics miss. An agent confident (0.95) in a simple CRUD implementation might be uncertain (0.65) about edge cases in complex state transitions—even if both pass tests.
That signal matters.
Core Metrics
Every agent reports:
Confidence (0-1): Certainty in output
Quality Score (0-1): Assessment of input quality
Input Reference: What was analyzed (file path, type, preview)
Architect Example
```json
{
  "agent_insights": {
    "confidence": 0.95,
    "quality_score": 0.92,
    "input_reference": {
      "type": "file",
      "path": "docs/features/F-003-notifications.md",
      "preview": "Feature: Real-time notifications..."
    },
    "key_decisions": [
      {
        "decision": "Use Turbo Streams for real-time updates",
        "rationale": "Spec requires instant notification display",
        "alternatives_considered": ["ActionCable", "Polling"],
        "confidence": 0.98
      }
    ],
    "architectural_patterns": ["Turbo Streams", "RESTful resources"],
    "risks_identified": ["WebSocket connection reliability"],
    "assumptions": ["User model has notification_preferences field"]
  }
}
```
What this enables:
Spec quality feedback: Track how spec quality affects downstream phases. Low spec quality correlates with clarifications and longer duration.
Decision traceability: Six months later: “Why Turbo Streams?” → Check F-003 architect insights.
Risk awareness: “WebSocket reliability” flagged during architecture → Plan mitigation before implementation.
Assumption validation: Engineer verifies “User has notification_preferences?” against architect assumptions.
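The assumption and risk fields become actionable when the next phase reads them back. As a minimal sketch (assuming the `agent_insights` schema shown above, with the architect phase identified by name), the orchestrator can extract a pre-implementation checklist for the engineer:

```ruby
# Pull the architect's assumptions and flagged risks out of an
# orchestration record so the engineer can verify them before coding.
# Sketch only: assumes the agent_insights schema shown above.
def architect_checklist(orchestration)
  architect = orchestration['phases'].find { |p| p['phase'] == 'architecture_planning' }
  insights = architect&.dig('agent_insights') || {}
  {
    assumptions: insights.fetch('assumptions', []),
    risks: insights.fetch('risks_identified', [])
  }
end

record = {
  'phases' => [
    { 'phase' => 'architecture_planning',
      'agent_insights' => {
        'assumptions' => ['User model has notification_preferences field'],
        'risks_identified' => ['WebSocket connection reliability'] } }
  ]
}

checklist = architect_checklist(record)
checklist[:assumptions].each { |a| puts "Verify: #{a}" }
checklist[:risks].each { |r| puts "Mitigate: #{r}" }
```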
Engineer Example
```json
{
  "agent_insights": {
    "confidence": 0.88,
    "quality_score": 0.92,
    "input_reference": {
      "type": "file",
      "path": "implementation/F-003/F-003-IMPLEMENTATION.md"
    },
    "skills_applied": {
      "backend": ["BR-01", "BR-08", "BR-11", "BR-12"],
      "frontend": ["FR-02", "FR-07"]
    },
    "challenges_encountered": [
      "WebSocket authentication required custom middleware"
    ],
    "deviations_from_plan": [],
    "test_results": {
      "total_tests": 47,
      "passed": 47
    }
  }
}
```
What this enables:
Skills tracking: Features using BR-08 correlate with higher quality scores.
Challenge documentation: “WebSocket authentication required custom middleware” → Document in skills.
Deviation tracking: Zero deviations → engineer followed the plan. Logged deviations → the plan required adjustment.
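Challenges become useful when they recur. A minimal sketch (assuming each feature record exposes the engineer's `agent_insights` as shown above; the recurrence threshold is an arbitrary choice) that surfaces challenges worth promoting into skills documentation:

```ruby
# Surface recurring engineer challenges across features so they can be
# promoted into skills documentation. Threshold (>= 2) is an assumption.
features = [
  { 'agent_insights' => { 'challenges_encountered' =>
      ['WebSocket authentication required custom middleware'] } },
  { 'agent_insights' => { 'challenges_encountered' =>
      ['WebSocket authentication required custom middleware'] } },
  { 'agent_insights' => { 'challenges_encountered' =>
      ['N+1 query in notifications index'] } }
]

recurring = features
  .flat_map { |f| f.dig('agent_insights', 'challenges_encountered') || [] }
  .tally
  .select { |_, count| count >= 2 }

recurring.each_key { |challenge| puts "Candidate for skills doc: #{challenge}" }
```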
Tier 2: Execution Tracking
Agent insights reveal cognitive process. Execution metrics reveal operational cost.
Per-Invocation Metrics
```json
{
  "execution": {
    "model": "claude-sonnet-4-5",
    "temperature": 0.0,
    "input_tokens": 45000,
    "output_tokens": 8200,
    "duration_seconds": 600,
    "cost_usd": 0.18
  }
}
```
Model: Enables comparison (Opus vs Sonnet for architect phase)
Input tokens: Tracks context size, identifies bloat
Output tokens: Tracks verbosity
Duration: Identifies slow phases
Cost: Enables budget prediction
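Deriving `cost_usd` from the token counts is a small lookup. A sketch, with placeholder per-million-token rates — substitute your provider's current pricing, which will change the result:

```ruby
# Derive cost_usd from token counts. The per-million-token rates below
# are placeholders (assumptions), not actual provider pricing.
RATES = {
  'claude-sonnet-4-5' => { input: 3.00, output: 15.00 }  # USD per 1M tokens (assumed)
}.freeze

def invocation_cost(model:, input_tokens:, output_tokens:)
  rate = RATES.fetch(model)
  (input_tokens * rate[:input] + output_tokens * rate[:output]) / 1_000_000.0
end

cost = invocation_cost(model: 'claude-sonnet-4-5',
                       input_tokens: 45_000, output_tokens: 8_200)
puts format('$%.2f', cost)
```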
Per-Phase Tracking
```json
{
  "phases": [
    {
      "phase": "architecture_planning",
      "agent": "visionaire-rails-team:architect",
      "execution": {
        "model": "claude-opus-4-6",
        "input_tokens": 35000,
        "output_tokens": 6500,
        "cost_usd": 1.01,
        "duration_seconds": 420
      }
    },
    {
      "phase": "implementation",
      "agent": "visionaire-rails-team:engineer",
      "execution": {
        "model": "claude-sonnet-4-5",
        "input_tokens": 65000,
        "output_tokens": 12000,
        "cost_usd": 0.38,
        "duration_seconds": 1200
      }
    }
  ]
}
```
Enables phase-level cost analysis, model optimization, and performance bottleneck identification.
Tier 3: Learning Signals
Raw metrics don’t explain patterns. Learning signals aggregate metrics into predictive insights.
Nine Core Signals
1. Feature Complexity (simple | medium | complex | very_complex)
Calculated from duration, revision cycles, files changed, agent confidence.
Enables cost prediction: “F-010 looks medium complexity → Expect ~$2.15, ~40 minutes.”
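One possible shape for the complexity calculation — a minimal sketch in which the thresholds and scoring are assumptions to tune against your own feature history (files changed and agent confidence, also inputs per the list above, are omitted here for brevity):

```ruby
# Bucket a feature into a complexity tier from duration and revision
# cycles. Scoring and thresholds are illustrative assumptions.
def assess_complexity(orchestration)
  duration = orchestration.dig('metrics', 'total_duration_seconds').to_i
  revisions = orchestration.dig('metrics', 'revision_cycles').to_i
  score = (duration / 600) + revisions  # one point per ~10 minutes, plus revisions
  case score
  when 0..2 then 'simple'
  when 3..5 then 'medium'
  when 6..9 then 'complex'
  else 'very_complex'
  end
end

tier = assess_complexity('metrics' => { 'total_duration_seconds' => 2520,
                                        'revision_cycles' => 1 })
puts tier
```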
2. Spec Quality Score (0-1)
Source: architect.agent_insights.quality_score
Creates feedback loop: “Low spec quality → 2+ revisions in 60% of cases.”
3. Average Agent Confidence (0-1)
Calculation: average(all phases.agent_insights.confidence)
Low confidence signals review needed before merge.
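That review trigger can be a simple gate. A sketch, assuming the phase schema shown earlier; the 0.8 threshold is an assumption, not a value from the team's data:

```ruby
# Pre-merge gate on average agent confidence across phases.
# The 0.8 threshold is an assumed cutoff.
CONFIDENCE_THRESHOLD = 0.8

def needs_review?(orchestration)
  scores = orchestration['phases']
             .map { |p| p.dig('agent_insights', 'confidence') }
             .compact
  return true if scores.empty? # no signal at all — be conservative
  (scores.sum / scores.size) < CONFIDENCE_THRESHOLD
end

record = { 'phases' => [
  { 'agent_insights' => { 'confidence' => 0.95 } },
  { 'agent_insights' => { 'confidence' => 0.55 } }
] }
puts needs_review?(record)  # average 0.75 → flag for review
```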
4. Implementation Quality Score (0-1)
Source: code_review.agent_insights.quality_score
Tracks quality trends over time.
5. Plan-to-Implementation Fidelity (0-1)
Source: feature_validator.agent_insights.plan_quality_score
Low fidelity indicates plans need improvement or engineer guidance.
6. Skills Referenced (array)
Source: engineer.agent_insights.skills_applied + code_review.skills_followed
Identifies patterns: “Features using BR-08 average 20% longer.”
7-9. Future Enhancements
Required Clarifications: Count AskUserQuestion calls → Track spec ambiguity
External Research: Detect WebSearch/WebFetch → Identify knowledge gaps
Similar Features: Embeddings-based similarity → Better predictions
Calculation Example
```ruby
def calculate_learning_signals(orchestration)
  {
    feature_complexity: assess_complexity(orchestration),
    spec_quality_score: orchestration.dig('phases', 0, 'agent_insights', 'quality_score'),
    avg_agent_confidence: average_confidence(orchestration),
    implementation_quality_score: orchestration.dig('phases', 3, 'agent_insights', 'quality_score'),
    plan_to_implementation_fidelity: orchestration.dig('phases', 2, 'agent_insights', 'plan_quality_score'),
    skills_referenced: extract_skills(orchestration)
  }
end
```
The Measurement Format: orchestration.json
All three tiers captured in structured JSON:
implementation/F-003/orchestration.json:
```json
{
  "schema_version": "2.0",
  "feature_id": "F-003",
  "started_at": "2026-01-31T14:00:00Z",
  "completed_at": "2026-01-31T15:30:00Z",
  "final_status": "complete",
  "phases": [ /* Tier 1 + 2 combined */ ],
  "metrics": { /* Aggregated execution */ },
  "learning_signals": { /* Tier 3 */ }
}
```
Structured, versioned, durable, queryable.
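Because every feature writes to a predictable path, the full history loads in a few lines. A sketch, assuming the `implementation/<feature-id>/orchestration.json` layout shown above:

```ruby
require 'json'

# Load every feature's orchestration record, oldest first.
# Assumes the implementation/<feature-id>/orchestration.json layout.
def load_features(root = 'implementation')
  Dir.glob(File.join(root, '*', 'orchestration.json'))
     .map { |path| JSON.parse(File.read(path)) }
     .sort_by { |f| f['started_at'].to_s }
end
```

Every analysis in the next section starts from this one call.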
What Proper Metrics Enable
1. Cost Prediction
```ruby
# Find similar features
similar = features.select do |f|
  f['learning_signals']['feature_complexity'] == 'medium' &&
    f['learning_signals']['skills_referenced'].include?('BR-08')
end

avg_cost = similar.map { |f| f['metrics']['total_cost_usd'] }.sum / similar.size
# => $2.15
```
2. Quality Prediction
```ruby
spec_quality = orchestration.dig('phases', 0, 'agent_insights', 'quality_score')

if spec_quality < 0.7
  low_quality_specs = features.select { |f|
    f.dig('learning_signals', 'spec_quality_score') < 0.7
  }
  revision_rate = low_quality_specs.count { |f|
    f.dig('metrics', 'revision_cycles') >= 2
  } / low_quality_specs.size.to_f
  # Historical: specs < 0.7 require 2+ revisions 60% of the time
end
```
3. Performance Optimization
```ruby
# Compare Opus vs Sonnet for architect phase
opus_avg_cost = opus_features.map { |f|
  f.dig('phases', 0, 'execution', 'cost_usd')
}.sum / opus_features.size

opus_avg_quality = opus_features.map { |f|
  f.dig('phases', 0, 'agent_insights', 'confidence')
}.sum / opus_features.size
# Compare cost/quality tradeoff with data
```
4. Pattern Detection
```ruby
# Correlate skills with quality
with_br08 = features.select { |f|
  f.dig('learning_signals', 'skills_referenced')&.include?('BR-08')
}

with_quality = with_br08.map { |f|
  f.dig('learning_signals', 'implementation_quality_score')
}.compact.sum / with_br08.size
# Identify high-impact patterns
```
5. Continuous Improvement
```ruby
features.sort_by { |f| f['started_at'] }.each_slice(10) do |batch|
  avg_duration = batch.map { |f| f.dig('metrics', 'total_duration_seconds') }.sum / batch.size
  avg_quality = batch.map { |f| f.dig('learning_signals', 'implementation_quality_score') }.compact.sum / batch.size
  # Track trends over time
end
```
Real-World Results
visionaire-rails-team after 20 features:
Cost metrics:
Average: $2.06 per feature
Range: $0.45 - $2.80
Most expensive phase: Architecture ($1.01 with Opus)
Quality metrics:
Sacred Rule violations: 0.3 per feature (baseline: 4-5)
Implementation quality: 0.92 average
Agent confidence: 0.93 average
Performance metrics:
Average duration: 42 minutes
Revision cycles: 0.2 per feature
Behavioral change mechanisms:
Before measurement: No cost visibility, unpredictable outcomes, repeated mistakes, no quality trends.
With measurement: Per-phase cost optimization, prediction from similar features, pattern-based estimation, tracked quality improvement.
The change driver: Visibility into previously opaque processes enabled targeted optimization.
The Learning Loop (Infrastructure Complete, Algorithms Next)
Metrics infrastructure is operational. Learning algorithms are next phase.
Auto-Improving Skills
```ruby
# Analyze violations
violation_counts = features.flat_map { |f|
  f['issues'] || []
}.group_by { |i| i['pattern'] }.transform_values(&:size)

top_patterns = violation_counts.sort_by { |_, count| -count }.first(5)

top_patterns.each do |pattern, count|
  if count > 10
    # Auto-generate Sacred Rule from pattern
    # Add to skills
    # Update navigation
  end
end
```
Predictive Quality Gates
```ruby
def predict_quality(spec_quality, architect_confidence)
  (0.3 * spec_quality) + (0.7 * architect_confidence)
end

predicted_quality = predict_quality(
  orchestration.dig('phases', 0, 'agent_insights', 'quality_score'),
  orchestration.dig('phases', 0, 'agent_insights', 'confidence')
)

if predicted_quality < 0.75
  # Historical: quality < 0.75 correlates with issues
  # Recommendation: Review architect plan before implementation
end
```
Adaptive Model Selection
```ruby
quality_delta = opus_quality - sonnet_quality  # 0.04 (4%)
cost_delta = opus_avg_cost - sonnet_avg_cost   # $0.63 (60%)

if quality_delta < 0.05
  # Sonnet acceptable (4% quality loss, 60% cost savings)
else
  # Opus worth premium
end
```
The Eight Design Principles
1. Agent Self-Assessment: Agents closest to work assess nuance automated metrics miss.
2. Input Quality Feedback Loop: Output quality depends on input quality. Track both.
3. Dual Scoring: Measurable (test coverage) + judgment (code clarity). Track both.
4. Granular Cost Tracking: Per-phase, per-model costs enable optimization.
5. Learning Signals Over Raw Metrics: Complexity and skills explain patterns. Tokens don’t.
6. Structured But Extensible: Core fields standard. Agent-specific fields optional.
7. Versioned Schema: Schema version tracked. Future changes additive.
8. Checkpoint-Based Resumability: orchestration.json doubles as checkpoint for recovery.
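Principle 8 in sketch form: on restart, read the partial orchestration.json and resume at the first phase without a completed record. The per-phase `status` field and the later phase names are assumptions about the schema, not confirmed fields:

```ruby
# Checkpoint-based recovery sketch: find the first phase that hasn't
# completed. The 'status' field and later phase names are assumptions.
PHASE_ORDER = %w[architecture_planning implementation code_review feature_validation].freeze

def next_phase(orchestration)
  done = orchestration.fetch('phases', [])
           .select { |p| p['status'] == 'complete' }
           .map { |p| p['phase'] }
  PHASE_ORDER.find { |phase| !done.include?(phase) }
end

checkpoint = { 'phases' => [
  { 'phase' => 'architecture_planning', 'status' => 'complete' },
  { 'phase' => 'implementation', 'status' => 'failed' }
] }
puts next_phase(checkpoint)  # resume at implementation
```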
Beyond Software
The three-tier measurement architecture applies to any domain requiring continuous improvement.
Legal Contract Review:
Tier 1: Counsel confidence, document completeness score
Tier 2: Clauses reviewed, time per section
Tier 3: Contract complexity, risk level, precedent availability
Result: “Contracts with missing precedents take 40% longer → Flag early.”
Content Production:
Tier 1: Editor confidence, source quality, factual accuracy
Tier 2: Research depth, revision cycles
Tier 3: Content complexity, source availability
Result: “Low source quality requires 3+ fact-checking rounds → Improve sources first.”
Implementation Path
Week 1: Add Agent Self-Assessment
```markdown
## Execution Report

At the end of your work, provide:

**Confidence (0-1):** How certain are you in your output?
**Quality Score (0-1):** How clear/complete was your input?
**Insights:** Key decisions, challenges, assumptions

Format as JSON.
```
Week 2: Add Execution Tracking
```json
{
  "execution": {
    "model": "claude-sonnet-4-5",
    "input_tokens": 45000,
    "output_tokens": 8200,
    "duration_seconds": 600,
    "cost_usd": 0.18
  }
}
```
Week 3: Calculate Initial Signals
```ruby
avg_confidence = phases.map { |p|
  p.dig('agent_insights', 'confidence')
}.compact.sum / phases.size

total_cost = phases.map { |p|
  p.dig('execution', 'cost_usd')
}.sum
```
Week 4: Analyze First 10 Features
```ruby
costs = features.map { |f| f['total_cost'] }
confidences = features.map { |f| f['avg_confidence'] }
# Identify ranges, trends, outliers
```
Month 2: Add spec quality, implementation quality, skills referenced, complexity
Month 3: Build violation tracking, pattern detection, cost prediction, quality trends
Summary
AI teams require proper measurement to learn.
Three-tier architecture—Agent Self-Assessment, Execution Tracking, Learning Signals—captures process, cost, and patterns.
Nine core signals transform raw metrics into predictive insights: complexity, spec quality, confidence, implementation quality, plan fidelity, skills referenced, clarifications, research, similarity.
Structured orchestration.json format per feature. Queryable. Analyzable. Evolvable.
Results: Cost optimization, quality prediction, performance tuning, pattern detection, continuous improvement.
Learning loop (auto-improving skills, predictive gates, adaptive models) becomes possible because measurement infrastructure exists.
Visibility enables optimization. You cannot improve what you cannot see.
Quick Reference
Three Tiers:
Agent Self-Assessment (confidence, quality, insights)
Execution Tracking (tokens, cost, duration, model)
Learning Signals (aggregated metrics revealing patterns)
Nine Core Signals:
Feature complexity
Spec quality score
Average agent confidence
Implementation quality score
Plan-to-implementation fidelity
Skills referenced
Required clarifications (future)
External research (future)
Similar features (future)
Five Analysis Categories:
Cost monitoring and prediction
Quality prediction from early signals
Performance optimization
Pattern detection
Continuous improvement
Implementation:
Add agent self-assessment
Track execution metrics
Calculate signals
Analyze patterns
Build learning loop

