feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions
--- a/agent-evolution/ideas/evolution-workflow.md
+++ b/agent-evolution/ideas/evolution-workflow.md
@@ -0,0 +1,201 @@
+# Evolution Workflow
+
+Continuous self-improvement loop for the agent pipeline.
+Triggered automatically after every workflow completion.
+
+## Overview
+
+```
+[Workflow Completes]
+       ↓
+[@pipeline-judge] ← runs tests, measures tokens/time
+       ↓
+   fitness score
+       ↓
+┌──────────────────────────┐
+│ fitness >= 0.85          │──→ Log + done (no action)
+│ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
+│ fitness 0.50 - 0.69      │──→ [@prompt-optimizer] major rewrite
+│ fitness < 0.50           │──→ [@agent-architect] redesign agent
+└──────────────────────────┘
+       ↓
+   [Re-run same workflow with new prompts]
+       ↓
+   [@pipeline-judge] again
+       ↓
+   compare fitness_before vs fitness_after
+       ↓
+┌──────────────────────────┐
+│ improved?                │
+│  Yes → commit new prompts│
+│  No  → revert, try       │
+│        different strategy│
+│        (max 3 attempts)  │
+└──────────────────────────┘
+```
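A minimal Python sketch of the scoring and routing in the diagram above. The weights are the ones stated in this commit's message (`fitness = test_rate*0.5 + gates*0.25 + efficiency*0.25`). The function names `fitness` and `route`, and the assumption that all three inputs are already normalized to [0, 1], are illustrative and not taken from the pipeline-judge implementation. Bands are checked lowest first, so a score below 0.50 is routed to `@agent-architect` only, and any score from 0.70 up to (but not including) 0.85 counts as minor tuning.

```python
def fitness(test_rate: float, gates: float, efficiency: float) -> float:
    """Weighted fitness; inputs are assumed to be normalized to [0, 1]."""
    inputs = (("test_rate", test_rate), ("gates", gates), ("efficiency", efficiency))
    for name, value in inputs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return test_rate * 0.5 + gates * 0.25 + efficiency * 0.25


def route(score: float) -> str:
    """Map a fitness score to the action named in the diagram."""
    if score < 0.50:
        return "agent-architect: redesign agent"
    if score < 0.70:
        return "prompt-optimizer: major rewrite"
    if score < 0.85:
        return "prompt-optimizer: minor tuning"
    return "log only"
```

For example, `route(fitness(0.8, 1.0, 0.5))` scores 0.775 and falls in the minor-tuning band.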
+
+## Fitness History
+
+All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:
+
+```jsonl
+{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
+{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
+```
+
+Each run appends one record, so the file forms a time series of pipeline fitness.
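One way the append-only log could be written and read back, sketched in Python. The helper names `append_fitness` and `load_history` are hypothetical; only the path and the one-JSON-object-per-line layout come from the section above.

```python
import json
from pathlib import Path


def append_fitness(path: Path, record: dict) -> None:
    """Append one run as a single JSON line, creating the log directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, separators=(",", ":")) + "\n")


def load_history(path: Path) -> list[dict]:
    """Read every non-blank line back as a dict; a missing file means no history."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file in append mode keeps earlier records untouched, which is what makes the file usable as a history.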
+
+## Orchestrator Evolution
+
+The orchestrator uses fitness history to optimize future pipeline construction:
+
+### Pipeline Selection Strategy
+```
+For each new issue:
+  1. Classify issue type (feature|bugfix|refactor|api|security)
+  2. Look up fitness history for same type
+  3. Find the pipeline configuration with highest fitness
+  4. Use that as template, but adapt to current issue
+  5. Skip agents that consistently score 0 contribution
+```
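Steps 2 and 3 of the selection strategy could look like the sketch below. It assumes each history record carries a `pipeline` field listing the agents that ran; that field is hypothetical and does not appear in the sample records in this file, and `best_pipeline` is an illustrative name.

```python
from typing import Optional


def best_pipeline(history: list[dict], issue_type: str) -> Optional[list[str]]:
    """Return the agent list of the highest-fitness past run for this issue type."""
    same_type = [
        r for r in history
        if r.get("workflow") == issue_type and "pipeline" in r  # hypothetical field
    ]
    if not same_type:
        return None  # no precedent: the caller falls back to a default pipeline
    return max(same_type, key=lambda r: r["fitness"])["pipeline"]
```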
+
+### Agent Ordering Optimization
+```
+From fitness-history.jsonl, extract per-agent metrics:
+  - avg tokens consumed
+  - avg contribution to fitness
+  - failure rate (how often this agent's output causes downstream failures)
+
+agents_by_roi = sort(agents, key=contribution/tokens, descending)
+
+For parallel phases:
+  - Run high-ROI agents first
+  - Skip agents with ROI < 0.1 (cost more than they contribute)
+```
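The ROI ordering above, as a small Python sketch. `AgentStats` and `order_by_roi` are hypothetical names. The section does not say which units the 0.1 cutoff is expressed in, so the cutoff is a parameter here and the caller is assumed to pass token counts in a unit that makes it meaningful (the usage example uses thousands of tokens).

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    name: str
    avg_tokens: float        # unit chosen by the caller (assumption)
    avg_contribution: float  # average contribution to fitness

    @property
    def roi(self) -> float:
        """Contribution per unit of tokens; zero-token agents get ROI 0."""
        return self.avg_contribution / self.avg_tokens if self.avg_tokens > 0 else 0.0


def order_by_roi(agents: list[AgentStats], min_roi: float = 0.1) -> list[AgentStats]:
    """Drop agents below the cutoff, then sort the rest by ROI, highest first."""
    kept = [a for a in agents if a.roi >= min_roi]
    return sorted(kept, key=lambda a: a.roi, reverse=True)
```

With `AgentStats("planner", 2.0, 0.6)`, `AgentStats("linter", 4.0, 0.2)`, and `AgentStats("reviewer", 1.0, 0.5)`, the ROIs are 0.3, 0.05, and 0.5, so the reviewer runs first, the planner second, and the linter is skipped.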
+
+### Token Budget Allocation
+```
+total_budget = 50000 tokens (configurable)
+
+For each agent in pipeline:
+  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
+  
+  If agent exceeds budget by >50%:
+    → prompt-optimizer compresses that agent's prompt
+    → or swap to a smaller/faster model
+```
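A sketch of the proportional allocation and the 50% overrun check described above. The function names are hypothetical, and rounding each share to a whole number of tokens is an assumption, since the section does not say how fractional budgets are handled.

```python
def allocate_budgets(contributions: dict[str, float],
                     total_budget: int = 50_000) -> dict[str, int]:
    """Split the total budget in proportion to each agent's average contribution."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one agent must have a positive contribution")
    return {name: round(total_budget * c / total) for name, c in contributions.items()}


def over_budget(used: dict[str, int], budgets: dict[str, int],
                tolerance: float = 0.5) -> list[str]:
    """Names of agents whose usage exceeds their budget by more than `tolerance`."""
    return [name for name, tokens in used.items()
            if tokens > budgets[name] * (1 + tolerance)]
```

Agents returned by `over_budget` are the ones the section hands to prompt-optimizer for compression or moves to a smaller model.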
+
+## Standard Test Suites
+
+No manual test configuration needed. Tests are auto-discovered:
+
+### Test Discovery
+```bash
+# Unit tests
+find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l
+
+# E2E tests  
+find tests/e2e -name "*.test.ts" | wc -l
+
+# Integration tests
+find tests/integration -name "*.test.ts" | wc -l
+```
+
+### Quality Gates (standardized)
+```yaml
+gates:
+  build:      "bun run build"
+  lint:       "bun run lint"
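+  typecheck:  "bun run typecheck"
+  unit_tests: "bun test"
+  e2e_tests:  "bun run test:e2e"
+  # Passes when the "All files" row reports >= 80; column $10 depends on the coverage reporter's table layout
+  coverage:   "bun test --coverage 2>&1 | awk '/All files/ { ok = ($10 + 0 >= 80) } END { exit !ok }'"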
+  security:   "bun audit --level=high | grep 'found 0'"
+```
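A gate table like the one above can be evaluated by running each command and treating exit status 0 as a pass. The sketch below assumes that convention and assumes the `gates` term of the fitness formula is the fraction of gates that pass; neither is confirmed by this file, and `run_gates` and `gate_pass_rate` are hypothetical names.

```python
import subprocess


def run_gates(gates: dict[str, str], timeout_s: float = 600.0) -> dict[str, bool]:
    """Run each gate command in a shell; a gate passes when it exits with status 0."""
    results: dict[str, bool] = {}
    for name, command in gates.items():
        try:
            # shell=True because gate commands use pipes; commands must come
            # from trusted configuration, never from user input.
            proc = subprocess.run(command, shell=True, capture_output=True,
                                  timeout=timeout_s)
            results[name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[name] = False  # a hung gate counts as a failure
    return results


def gate_pass_rate(results: dict[str, bool]) -> float:
    """Fraction of gates that passed; an empty gate set scores 0."""
    return sum(results.values()) / len(results) if results else 0.0
```

For a piped command the shell reports the status of the last stage, so a gate such as `... | grep 'found 0'` passes only when `grep` finds a match.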
+
+### Workflow-Specific Benchmarks
+```yaml
+benchmarks:
+  feature:
+    token_budget: 50000
+    time_budget_s: 300
+    min_test_coverage: 80%
+    max_iterations: 3
+    
+  bugfix:
+    token_budget: 20000
+    time_budget_s: 120
+    min_test_coverage: 90%  # higher for bugfix — must prove fix works
+    max_iterations: 2
+    
+  refactor:
+    token_budget: 40000
+    time_budget_s: 240
+    min_test_coverage: 95%  # must not break anything
+    max_iterations: 2
+    
+  security:
+    token_budget: 30000
+    time_budget_s: 180
+    min_test_coverage: 80%
+    max_iterations: 2
+    required_gates: [security]  # security gate MUST pass
+```
+
+## Prompt Evolution Protocol
+
+When prompt-optimizer is triggered:
+
+```
+1. Read current agent prompt from .kilo/agents/<agent>.md
+2. Read fitness report identifying the problem
+3. Read last 5 fitness entries for this agent from history
+
+4. Analyze pattern:
+   - IF consistently low → systemic prompt issue
+   - IF regression after change → revert
+   - IF one-time failure → might be task-specific, no action
+
+5. Generate improved prompt:
+   - Keep same structure (description, mode, model, permissions)
+   - Modify ONLY the instruction body
+   - Add explicit output format if instruction following (IF) was the issue
+   - Add few-shot examples if quality was the issue
+   - Compress verbose sections if tokens were the issue
+
+6. Save to .kilo/agents/<agent>.md.candidate
+
+7. Re-run the SAME workflow with .candidate prompt
+
+8. [@pipeline-judge] scores again
+
+9. IF fitness_new > fitness_old:
+     mv .candidate → .md (commit)
+   ELSE:
+     rm .candidate (revert)
+```
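Step 9 of the protocol, sketched in Python. `resolve_candidate` is a hypothetical helper; it assumes the candidate sits next to the agent file with a `.candidate` suffix, as in step 6, and that the two fitness values come from judging the same workflow before and after the change.

```python
from pathlib import Path


def resolve_candidate(agent_file: Path, fitness_old: float, fitness_new: float) -> bool:
    """Promote <agent>.md.candidate over <agent>.md if fitness improved, else discard it.

    Returns True when the candidate was promoted.
    """
    candidate = agent_file.with_name(agent_file.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(agent_file)      # overwrite the current prompt
        return True
    candidate.unlink(missing_ok=True)      # revert: drop the candidate
    return False
```

A tie counts as no improvement, matching the strict `fitness_new > fitness_old` comparison in the protocol.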
+
+## Usage
+
+```bash
+# Triggered automatically after any workflow
+# OR manually:
+/evolve                    # run evolution on last workflow
+/evolve --issue 42         # run evolution on specific issue
+/evolve --agent planner    # evolve specific agent's prompt
+/evolve --history          # show fitness trend
+```
+
+## Configuration
+
+```yaml
+# Add to kilo.jsonc or capability-index.yaml
+evolution:
+  enabled: true
+  auto_trigger: true           # trigger after every workflow
+  fitness_threshold: 0.70      # below this → auto-optimize
+  max_evolution_attempts: 3    # max retries per cycle
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  token_budget_default: 50000    # tokens
+  time_budget_default: 300       # seconds
+```