feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions
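
The fitness formula stated in the commit message can be sketched in a few lines. This is a minimal illustration, not the pipeline-judge implementation: the `PipelineMetrics` shape, the function names, and the zero-denominator handling are all assumptions, and `normalizedCost` is assumed to arrive already normalized, since the commit does not say how cost is scaled.

```typescript
// Hypothetical input shape; field names are illustrative, not taken from the repository.
interface PipelineMetrics {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  normalizedCost: number; // assumed pre-normalized: 0 = cheapest, 1 = at or over budget
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// Assumption: an empty denominator counts as a rate of 0 rather than dividing by zero.
const ratio = (num: number, den: number): number => (den > 0 ? num / den : 0);

// fitness = test_pass_rate * 0.50 + quality_gates_rate * 0.25 + efficiency_score * 0.25
function computeFitness(m: PipelineMetrics): number {
  const testPassRate = ratio(m.passedTests, m.totalTests);
  const qualityGatesRate = ratio(m.passedGates, m.totalGates);
  const efficiencyScore = 1.0 - clamp(m.normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}

// Per the commit message, prompt-optimizer is auto-triggered when fitness < 0.70.
function needsOptimization(fitness: number): boolean {
  return fitness < 0.7;
}
```

Because the weights sum to 1.0 and each component lies in [0, 1], the result stays in the 0.0-1.0 range. A run that passes every test and gate at zero normalized cost scores exactly 1.0.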
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -17,12 +17,15 @@ Agent: Runs full pipeline for issue #42 with Gitea logging
 |---------|-------------|-------|
 | `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
 | `/status <issue>` | Check pipeline status for issue | `/status 42` |
+| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
 | `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
 | `/plan` | Creates detailed task plans | `/plan feature X` |
 | `/ask` | Answers codebase questions | `/ask how does auth work` |
 | `/debug` | Analyzes and fixes bugs | `/debug error in login` |
 | `/code` | Quick code generation | `/code add validation` |
 | `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
+| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
+| `/evolution report` | Generate evolution report | `/evolution report` |

 ## Pipeline Agents (Subagents)

@@ -62,7 +65,8 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 |-------|------|--------------|
 | `@release-manager` | Git operations | Status: releasing |
 | `@evaluator` | Scores effectiveness | Status: evaluated |
-| `@prompt-optimizer` | Improves prompts | When score < 7 |
+| `@pipeline-judge` | Objective fitness scoring | After workflow completes |
+| `@prompt-optimizer` | Improves prompts | When fitness < 0.70 |
 | `@capability-analyst` | Analyzes task coverage | When starting new task |
 | `@agent-architect` | Creates new agents | When gaps identified |
 | `@workflow-architect` | Creates workflows | New workflow needed |
@@ -94,9 +98,27 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 [releasing] 
  ↓ @release-manager
 [evaluated] 
-  ↓ @evaluator
-  ├── [score ≥ 7] → [completed]
-  └── [score < 7] → @prompt-optimizer → [completed]
+  ↓ @evaluator (subjective score 1-10)
+  ├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
+  └── [score < 7] → @prompt-optimizer → [evaluated]
+        ↓
+    [@pipeline-judge] ← runs tests, measures tokens/time
+        ↓
+    fitness score
+        ↓
+┌──────────────────────────────────────┐
+│ fitness >= 0.85                      │──→ [completed]
+│ 0.70 <= fitness < 0.85               │──→ @prompt-optimizer → [evolving]
+│ 0.50 <= fitness < 0.70               │──→ @prompt-optimizer (major) → [evolving]
+│ fitness < 0.50                       │──→ @agent-architect → redesign
+└──────────────────────────────────────┘
+        ↓
+[evolving] → re-run workflow → [@pipeline-judge]
+        ↓
+    compare fitness_before vs fitness_after
+        ↓
+    [improved?] → commit prompts → [completed]
+              └─ [not improved?] → revert → try different strategy
 ```

 ## Capability Analysis Flow
@@ -167,6 +189,14 @@ Scores saved to `.kilo/logs/efficiency_score.json`:
 }
 ```

+### Fitness Tracking
+
+Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
+```jsonl
+{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
+{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
+```
+
 ## Manual Agent Invocation

 ```typescript
@@ -192,11 +222,34 @@ GITEA_TOKEN=your-token-here
 ## Self-Improvement Cycle

 1. **Pipeline runs** for each issue
-2. **Evaluator scores** each agent (1-10)
-3. **Low scores (<7)** trigger prompt-optimizer
-4. **Prompt optimizer** analyzes failures and improves prompts
-5. **New prompts** saved to `.kilo/agents/`
-6. **Next run** uses improved prompts
+2. **Evaluator scores** each agent (1-10) - subjective
+3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
+4. **Low fitness (<0.70)** triggers prompt-optimizer
+5. **Prompt optimizer** analyzes failures and improves prompts
+6. **Re-run workflow** with improved prompts
+7. **Compare fitness** before/after - commit if improved
+8. **Log results** to `.kilo/logs/fitness-history.jsonl`
+
+### Evaluator vs Pipeline Judge
+
+| Aspect | Evaluator | Pipeline Judge |
+|--------|-----------|----------------|
+| Type | Subjective | Objective |
+| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
+| Metrics | Observations | Tests, tokens, time |
+| Trigger | After workflow | After evaluator |
+| Action | Logs to Gitea | Triggers optimization |
+
+### Fitness Score Components
+
+```
+fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
+
+where:
+  test_pass_rate = passed_tests / total_tests
+  quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
+  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
+```

 ## Architecture Files