feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.85 (major rewrite below 0.70)
# Fitness Evaluation Workflow
Post-workflow fitness evaluation and automatic optimization loop.
## Overview
This workflow runs after every completed pipeline workflow to:
1. Evaluate fitness objectively via `pipeline-judge`
2. Trigger optimization if fitness < threshold
3. Re-run and compare before/after
4. Log results to fitness-history.jsonl
## Flow
```
[Workflow Completes]
        ↓
[@pipeline-judge] ← runs tests, measures tokens/time
        ↓
   fitness score
        ↓
┌────────────────────┬───────────────────────────────────┐
│ fitness >= 0.85    │ log + done (no action)            │
│ fitness 0.70-0.84  │ [@prompt-optimizer] minor tuning  │
│ fitness 0.50-0.69  │ [@prompt-optimizer] major rewrite │
│ fitness < 0.50     │ [@agent-architect] redesign agent │
└────────────────────┴───────────────────────────────────┘
        ↓ (if optimization was triggered)
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
compare fitness_before vs fitness_after
        ↓
┌───────────────────────────────────┐
│ improved?                         │
│   Yes → commit new prompts        │
│   No  → revert, try a different   │
│         strategy (max 3 attempts) │
└───────────────────────────────────┘
```
## Fitness Score Formula
```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
where:
test_pass_rate = passed_tests / total_tests
quality_gates_rate = passed_gates / total_gates
efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
```
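A minimal TypeScript sketch of the formula (names are illustrative, not the actual `pipeline-judge` implementation):

```typescript
interface RunMetrics {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  actualTokens: number;
  actualTimeS: number;
}

interface Budget {
  tokens: number;
  timeS: number;
}

const clamp = (x: number, lo: number, hi: number) => Math.min(Math.max(x, lo), hi);

// fitness = 0.50·test_pass_rate + 0.25·quality_gates_rate + 0.25·efficiency_score
function computeFitness(m: RunMetrics, budget: Budget): number {
  const testPassRate = m.totalTests > 0 ? m.passedTests / m.totalTests : 0;
  const qualityGatesRate = m.totalGates > 0 ? m.passedGates / m.totalGates : 0;
  // normalized_cost weights token overrun and time overrun equally
  const normalizedCost =
    (m.actualTokens / budget.tokens) * 0.5 + (m.actualTimeS / budget.timeS) * 0.5;
  const efficiencyScore = 1.0 - clamp(normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}
```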
## Quality Gates
Each gate is binary (pass/fail):
| Gate | Command | Weight |
|------|---------|--------|
| build | `bun run build` | 1/5 |
| lint | `bun run lint` | 1/5 |
| types | `bun run typecheck` | 1/5 |
| tests | `bun test` | 1/5 |
| coverage | `bun test --coverage` (line coverage ≥ workflow's min) | 1/5 |
## Budget Defaults
| Workflow | Token Budget | Time Budget (s) | Min Coverage |
|----------|-------------|-----------------|---------------|
| feature | 50000 | 300 | 80% |
| bugfix | 20000 | 120 | 90% |
| refactor | 40000 | 240 | 95% |
| security | 30000 | 180 | 80% |
## Workflow-Specific Benchmarks
```yaml
benchmarks:
feature:
token_budget: 50000
time_budget_s: 300
min_test_coverage: 80%
max_iterations: 3
bugfix:
token_budget: 20000
time_budget_s: 120
min_test_coverage: 90% # higher for bugfix - must prove fix works
max_iterations: 2
refactor:
token_budget: 40000
time_budget_s: 240
min_test_coverage: 95% # must not break anything
max_iterations: 2
security:
token_budget: 30000
time_budget_s: 180
min_test_coverage: 80%
max_iterations: 2
required_gates: [security] # security gate MUST pass
```
## Execution Steps
### Step 1: Collect Metrics
Agent: `pipeline-judge`
```bash
# Run test suite (keep stderr out of the JSON report)
bun test --reporter=json > /tmp/test-results.json 2>/dev/null

# Count results
TOTAL=$(jq '.numTotalTests' /tmp/test-results.json)
PASSED=$(jq '.numPassedTests' /tmp/test-results.json)
FAILED=$(jq '.numFailedTests' /tmp/test-results.json)

# Check quality gates (each is binary pass/fail)
bun run build >/dev/null 2>&1 && BUILD_OK=true || BUILD_OK=false
bun run lint >/dev/null 2>&1 && LINT_OK=true || LINT_OK=false
bun run typecheck >/dev/null 2>&1 && TYPES_OK=true || TYPES_OK=false
```
### Step 2: Read Pipeline Log
Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked
### Step 3: Calculate Fitness
```
test_pass_rate     = PASSED / TOTAL
quality_gates_rate = (BUILD_OK + LINT_OK + TYPES_OK + TESTS_CLEAN + COVERAGE_OK) / 5
efficiency         = 1.0 - min((tokens/token_budget + time/time_budget) / 2, 1.0)  # budgets from the workflow's benchmark entry
FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
```
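As an illustrative worked example (numbers chosen against the feature budgets, not taken from a real run): 45/47 passing tests gives test_pass_rate ≈ 0.957; all five gates green gives quality_gates_rate = 1.0; 38,400 tokens and 245 s against the 50,000-token / 300 s budgets give efficiency = 1 − (0.768 + 0.817)/2 ≈ 0.208; so FITNESS ≈ 0.479 + 0.250 + 0.052 ≈ 0.78, which lands in the minor-tuning band.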
### Step 4: Decide Action
| Fitness | Action |
|---------|--------|
| >= 0.85 | Log to fitness-history.jsonl, done |
| 0.70-0.84 | Call `prompt-optimizer` for minor tuning |
| 0.50-0.69 | Call `prompt-optimizer` for major rewrite |
| < 0.50 | Call `agent-architect` to redesign agent |
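The same thresholds as a TypeScript sketch (the action shape is illustrative):

```typescript
type Action =
  | { kind: "log_only" }
  | { kind: "prompt_optimizer"; mode: "minor" | "major" }
  | { kind: "agent_architect" };

// Map a fitness score to the action table above.
function decideAction(fitness: number): Action {
  if (fitness >= 0.85) return { kind: "log_only" };
  if (fitness >= 0.70) return { kind: "prompt_optimizer", mode: "minor" };
  if (fitness >= 0.50) return { kind: "prompt_optimizer", mode: "major" };
  return { kind: "agent_architect" };
}
```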
### Step 5: Re-test After Optimization
If optimization was triggered:
1. Re-run the same workflow with new prompts
2. Call `pipeline-judge` again
3. Compare fitness_before vs fitness_after
4. If improved: commit the new prompts
5. If not improved: revert and try a different strategy (max 3 attempts)
### Step 6: Log Results
Append to `.kilo/logs/fitness-history.jsonl`:
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
```
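Appending an entry is a single write; a sketch using `node:fs` (field names follow the example above):

```typescript
import { appendFileSync } from "node:fs";

interface FitnessEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

function logFitness(entry: FitnessEntry, path = ".kilo/logs/fitness-history.jsonl"): void {
  // JSONL: one JSON object per line
  appendFileSync(path, JSON.stringify(entry) + "\n");
}
```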
## Usage
### Automatic (post-pipeline)
The workflow triggers automatically after any workflow completes.
### Manual
```bash
/evolve # evolve last completed workflow
/evolve --issue 42 # evolve workflow for issue #42
/evolve --agent planner # focus evolution on one agent
/evolve --dry-run # show what would change without applying
/evolve --history # print fitness trend chart
```
## Integration Points
- **After `/pipeline`**: pipeline-judge scores the workflow
- **After prompt update**: evolution loop retries
- **Weekly**: Performance trend analysis
- **On request**: Recommendation generation
## Orchestrator Learning
The orchestrator uses fitness history to optimize future pipeline construction:
### Pipeline Selection Strategy
```
For each new issue:
1. Classify issue type (feature|bugfix|refactor|api|security)
2. Look up fitness history for same type
3. Find pipeline configuration with highest fitness
4. Use that as template, but adapt to current issue
5. Skip agents that consistently score 0 contribution
```
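A sketch of the lookup, assuming each history entry also records the pipeline that produced it (the `pipeline` field is a hypothetical extension of the log schema above):

```typescript
interface HistoryEntry {
  workflow: string;
  fitness: number;
  pipeline: string[]; // hypothetical: agents invoked, in order
}

// Pick the highest-fitness pipeline previously used for this issue type.
function bestPipelineFor(type: string, history: HistoryEntry[]): string[] | undefined {
  const candidates = history.filter((e) => e.workflow === type);
  if (candidates.length === 0) return undefined; // no history: use the default template
  return candidates.reduce((best, e) => (e.fitness > best.fitness ? e : best)).pipeline;
}
```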
### Agent Ordering Optimization
```
From fitness-history.jsonl, extract per-agent metrics:
- avg tokens consumed
- avg contribution to fitness
- failure rate (how often this agent's output causes downstream failures)
agents_by_roi = sort(agents, key=contribution/tokens, descending)
For parallel phases:
- Run high-ROI agents first
- Skip agents with ROI < 0.1 (cost more than they contribute)
```
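The ROI sort as a sketch (per-agent aggregates are assumed to have been extracted from the history already):

```typescript
interface AgentStats {
  name: string;
  avgTokens: number;       // average tokens consumed per run
  avgContribution: number; // average contribution to fitness per run
}

// ROI = contribution / tokens; the 0.1 cutoff is this document's threshold and
// assumes contribution and tokens are expressed on comparable scales.
function orderByRoi(agents: AgentStats[], minRoi = 0.1): AgentStats[] {
  return agents
    .map((a) => ({ ...a, roi: a.avgContribution / a.avgTokens }))
    .filter((a) => a.roi >= minRoi)
    .sort((a, b) => b.roi - a.roi);
}
```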
### Token Budget Allocation
```
total_budget = 50000 tokens (configurable)
For each agent in pipeline:
agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
If agent exceeds budget by >50%:
→ prompt-optimizer compresses that agent's prompt
→ or swap to a smaller/faster model
```
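The proportional split as a sketch (the stats shape is illustrative):

```typescript
// Allocate the total token budget proportionally to each agent's average contribution.
function allocateBudgets(
  agents: { name: string; avgContribution: number }[],
  totalBudget = 50_000,
): Map<string, number> {
  const sum = agents.reduce((s, a) => s + a.avgContribution, 0);
  if (sum === 0) return new Map(); // no contribution data yet: nothing to allocate
  return new Map(
    agents.map((a): [string, number] => [a.name, Math.floor(totalBudget * (a.avgContribution / sum))]),
  );
}
```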
## Prompt Evolution Protocol
When prompt-optimizer is triggered:
1. Read current agent prompt from `.kilo/agents/<agent>.md`
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history
4. Analyze pattern:
- IF consistently low → systemic prompt issue
- IF regression after change → revert
- IF one-time failure → might be task-specific, no action
5. Generate improved prompt:
- Keep same structure (description, mode, model, permissions)
- Modify ONLY the instruction body
   - Add explicit output format IF format was the issue
- Add few-shot examples IF quality was the issue
- Compress verbose sections IF tokens were the issue
6. Save to `.kilo/agents/<agent>.md.candidate`
7. Re-run workflow with .candidate prompt
8. `@pipeline-judge` scores again
9. IF fitness_new > fitness_old: mv .candidate → .md (commit)
ELSE: rm .candidate (revert)
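Steps 6-9 reduce to a compare-and-swap on the prompt file; a sketch using `node:fs` (paths follow the convention above):

```typescript
import { renameSync, unlinkSync } from "node:fs";

// Promote the .candidate prompt if it scored better, otherwise discard it.
function settleCandidate(agent: string, fitnessOld: number, fitnessNew: number): void {
  const promptPath = `.kilo/agents/${agent}.md`;
  const candidatePath = `${promptPath}.candidate`;
  if (fitnessNew > fitnessOld) {
    renameSync(candidatePath, promptPath); // mv .candidate → .md (then commit)
  } else {
    unlinkSync(candidatePath); // rm .candidate (revert)
  }
}
```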