- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
202 lines
5.7 KiB
Markdown
# Evolution Workflow

Continuous self-improvement loop for the agent pipeline.
Triggered automatically after every workflow completion.

## Overview

```
[Workflow Completes]
        ↓
[@pipeline-judge] ← runs tests, measures tokens/time
        ↓
   fitness score
        ↓
┌──────────────────────────┐
│ fitness >= 0.85          │──→ Log + done (no action)
│ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
│ fitness 0.50 - 0.69      │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50           │──→ [@agent-architect] redesign agent
└──────────────────────────┘
        ↓
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
compare fitness_before vs fitness_after
        ↓
┌──────────────────────────┐
│ improved?                │
│ Yes → commit new prompts │
│ No  → revert, try        │
│       different strategy │
│       (max 3 attempts)   │
└──────────────────────────┘
```
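
The routing in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names and action labels are hypothetical, and it assumes the bands are read as non-overlapping, so a score below 0.50 goes only to `@agent-architect`.

```python
from typing import Callable, Iterable, Optional, Tuple


def route_by_fitness(fitness: float) -> str:
    """Map a fitness score in [0, 1] to the follow-up action from the diagram."""
    if not 0.0 <= fitness <= 1.0:
        raise ValueError(f"fitness must be in [0, 1], got {fitness}")
    if fitness >= 0.85:
        return "log"             # good enough: log and stop
    if fitness >= 0.70:
        return "minor_tuning"    # @prompt-optimizer, small edits
    if fitness >= 0.50:
        return "major_rewrite"   # @prompt-optimizer, larger rewrite
    return "redesign_agent"      # @agent-architect


def first_improving_strategy(
    fitness_before: float,
    strategies: Iterable[Tuple[str, Callable[[], float]]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try strategies in order; return the first whose re-run beats the baseline.

    Each strategy is (name, rerun), where rerun() re-executes the workflow
    with the changed prompts and returns the new fitness (hypothetical hook).
    """
    for attempt, (name, rerun) in enumerate(strategies):
        if attempt >= max_attempts:
            break
        if rerun() > fitness_before:
            return name          # caller commits the new prompts
    return None                  # caller reverts to the old prompts
```

A caller would pass the judge's score to `route_by_fitness`, and on anything other than `"log"` hand a list of candidate strategies to `first_improving_strategy`.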

## Fitness History

All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:

```jsonl
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```

The log forms a time series that shows how pipeline fitness changes from run to run.
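
Appending to and reading from the history file might look like the sketch below. The helper names are hypothetical, and the record fields are simply those shown in the sample lines above; nothing here is confirmed as the project's own code.

```python
import json
from pathlib import Path
from typing import Optional


def append_fitness(path: Path, record: dict) -> None:
    """Append one run record as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, separators=(",", ":")) + "\n")


def load_history(path: Path) -> list:
    """Read all records, skipping blank lines; a missing file means no history."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def mean_fitness(records: list, workflow: Optional[str] = None) -> Optional[float]:
    """Average fitness, optionally restricted to one workflow type."""
    scores = [
        r["fitness"]
        for r in records
        if workflow is None or r.get("workflow") == workflow
    ]
    return sum(scores) / len(scores) if scores else None
```

Because each record is one self-contained line, a crashed run can at worst leave one partial line at the end of the file rather than corrupting earlier entries.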

## Orchestrator Evolution

The orchestrator uses fitness history to optimize future pipeline construction:

### Pipeline Selection Strategy

```
For each new issue:
  1. Classify issue type (feature|bugfix|refactor|api|security)
  2. Look up fitness history for same type
  3. Find the pipeline configuration with highest fitness
  4. Use that as template, but adapt to current issue
  5. Skip agents that consistently score 0 contribution
```
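
Steps 2 and 3 can be sketched as a lookup over the history records. This assumes each record carries a `pipeline` field listing the agents that ran; that field is hypothetical and does not appear in the sample history lines. Averaging fitness per configuration is likewise one possible reading of "highest fitness".

```python
from collections import defaultdict
from statistics import mean
from typing import Optional, Tuple


def best_pipeline(history: list, issue_type: str) -> Optional[Tuple[str, ...]]:
    """Return the agent sequence with the highest mean fitness for this type.

    Records without a 'pipeline' field (hypothetical) are ignored.
    """
    scores = defaultdict(list)
    for rec in history:
        if rec.get("workflow") == issue_type and "pipeline" in rec:
            scores[tuple(rec["pipeline"])].append(rec["fitness"])
    if not scores:
        return None  # no history for this type: fall back to a default pipeline
    return max(scores, key=lambda p: mean(scores[p]))
```

The returned tuple is only a template; adapting it to the current issue (step 4) is left to the orchestrator.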

### Agent Ordering Optimization

```
From fitness-history.jsonl, extract per-agent metrics:
  - avg tokens consumed
  - avg contribution to fitness
  - failure rate (how often this agent's output causes downstream failures)

agents_by_roi = sort(agents, key=contribution/tokens, descending)

For parallel phases:
  - Run high-ROI agents first
  - Skip agents with ROI < 0.1 (cost more than they contribute)
```
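
The ROI ordering above could be expressed as follows. The source does not define the units behind the 0.1 cutoff, so this sketch assumes token counts are expressed in thousands of tokens; the `AgentStats` type and that scaling are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AgentStats:
    name: str
    avg_tokens_k: float       # assumed unit: thousands of tokens per run
    avg_contribution: float   # average contribution to fitness

    @property
    def roi(self) -> float:
        """Contribution per unit of tokens; zero if the agent used no tokens."""
        if self.avg_tokens_k <= 0:
            return 0.0
        return self.avg_contribution / self.avg_tokens_k


def order_by_roi(agents: Iterable[AgentStats], min_roi: float = 0.1) -> List[AgentStats]:
    """Drop agents below the ROI cutoff, then sort the rest best-first."""
    kept = [a for a in agents if a.roi >= min_roi]
    return sorted(kept, key=lambda a: a.roi, reverse=True)
```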

### Token Budget Allocation

```
total_budget = 50000 tokens (configurable)

For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)

If agent exceeds budget by >50%:
  → prompt-optimizer compresses that agent's prompt
  → or swap to a smaller/faster model
```
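
As a worked version of the allocation rule, the sketch below splits a budget proportionally and flags agents that overshoot by more than 50%. Function names are hypothetical, and rounding down to whole tokens is an assumption.

```python
from typing import Dict, List


def allocate_budgets(contributions: Dict[str, float], total_budget: int = 50_000) -> Dict[str, int]:
    """Split total_budget across agents in proportion to average contribution."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one agent must have a positive contribution")
    return {name: int(total_budget * c / total) for name, c in contributions.items()}


def over_budget(used: Dict[str, int], budgets: Dict[str, int], tolerance: float = 0.5) -> List[str]:
    """Names of agents whose token usage exceeds their budget by more than tolerance."""
    return [name for name, tokens in used.items() if tokens > budgets[name] * (1 + tolerance)]
```

For example, contributions of 1 : 2 : 1 over a 50,000-token budget give 12,500, 25,000 and 12,500 tokens, and an agent with a 12,500-token budget is flagged once it uses more than 18,750.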

## Standard Test Suites

No manual test configuration is needed; tests are auto-discovered:

### Test Discovery

```bash
# Unit tests (parentheses group the two name patterns explicitly)
find src \( -name "*.test.ts" -o -name "*.spec.ts" \) | wc -l

# E2E tests
find tests/e2e -name "*.test.ts" | wc -l

# Integration tests
find tests/integration -name "*.test.ts" | wc -l
```

### Quality Gates (standardized)

```yaml
gates:
  build: "bun run build"
  lint: "bun run lint"
  typecheck: "bun run typecheck"
  unit_tests: "bun test"
  e2e_tests: "bun test:e2e"
  # awk exits 0 only when the 'All files' line-coverage column is at least 80
  coverage: "bun test --coverage | grep 'All files' | awk '{ exit !($10 >= 80) }'"
  security: "bun audit --level=high | grep 'found 0'"
```
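
One way to turn the gates into a number is to run each command and take the fraction that exit with status 0, then plug it into the formula from the change notes, fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25). The sketch below does that. Treating `gates` as a pass fraction and `efficiency` as a precomputed value in [0, 1] are assumptions, since the source defines neither.

```python
import subprocess
from typing import Dict


def run_gates(gates: Dict[str, str]) -> Dict[str, bool]:
    """Run each gate command in a shell; a gate passes when it exits with 0."""
    results = {}
    for name, cmd in gates.items():
        proc = subprocess.run(cmd, shell=True, capture_output=True)
        results[name] = proc.returncode == 0
    return results


def fitness(test_rate: float, gate_results: Dict[str, bool], efficiency: float) -> float:
    """Weighted score: 50% test pass rate, 25% gate pass rate, 25% efficiency."""
    gate_rate = sum(gate_results.values()) / len(gate_results) if gate_results else 0.0
    return 0.5 * test_rate + 0.25 * gate_rate + 0.25 * efficiency
```

With the sample history entry of 45 passing tests out of 47, `test_rate` would be 45/47.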

### Workflow-Specific Benchmarks

```yaml
benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3

  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%   # higher for bugfix: must prove fix works
    max_iterations: 2

  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%   # must not break anything
    max_iterations: 2

  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]   # security gate MUST pass
```
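
A run can be checked against these benchmarks with a small comparison function. The sketch below hard-codes the values from the YAML above; the function name, the violation labels, and the idea of returning a list of violated keys are illustrative choices rather than the project's API.

```python
from typing import Dict, List, Optional

# Values copied from the benchmarks YAML; coverage is stored as a percentage.
BENCHMARKS = {
    "feature":  {"token_budget": 50_000, "time_budget_s": 300, "min_test_coverage": 80, "required_gates": []},
    "bugfix":   {"token_budget": 20_000, "time_budget_s": 120, "min_test_coverage": 90, "required_gates": []},
    "refactor": {"token_budget": 40_000, "time_budget_s": 240, "min_test_coverage": 95, "required_gates": []},
    "security": {"token_budget": 30_000, "time_budget_s": 180, "min_test_coverage": 80, "required_gates": ["security"]},
}


def benchmark_violations(
    workflow: str,
    tokens: int,
    time_ms: int,
    coverage_pct: float,
    gate_results: Optional[Dict[str, bool]] = None,
) -> List[str]:
    """List which benchmark limits a run broke (empty list means within limits)."""
    bench = BENCHMARKS[workflow]
    gate_results = gate_results or {}
    violations = []
    if tokens > bench["token_budget"]:
        violations.append("token_budget")
    if time_ms > bench["time_budget_s"] * 1000:
        violations.append("time_budget_s")
    if coverage_pct < bench["min_test_coverage"]:
        violations.append("min_test_coverage")
    for gate in bench["required_gates"]:
        if not gate_results.get(gate, False):   # a missing result counts as a failure
            violations.append(f"gate:{gate}")
    return violations
```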

## Prompt Evolution Protocol

When prompt-optimizer is triggered:

```
1. Read current agent prompt from .kilo/agents/<agent>.md
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history

4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action

5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format if instruction following (IF) was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue

6. Save to .kilo/agents/<agent>.md.candidate

7. Re-run the SAME workflow with .candidate prompt

8. [@pipeline-judge] scores again

9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
```
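
Step 9 amounts to an atomic rename or a delete. A minimal sketch follows, assuming the candidate sits next to the agent file with a `.candidate` suffix as in step 6; the function name is hypothetical.

```python
from pathlib import Path


def resolve_candidate(agent_file: Path, fitness_old: float, fitness_new: float) -> bool:
    """Promote <agent>.md.candidate if it scored higher, otherwise discard it.

    Returns True when the candidate was committed, False when it was reverted.
    """
    candidate = agent_file.with_name(agent_file.name + ".candidate")
    if not candidate.exists():
        raise FileNotFoundError(candidate)
    if fitness_new > fitness_old:
        candidate.replace(agent_file)   # rename over the old prompt (commit)
        return True
    candidate.unlink()                  # drop the candidate (revert)
    return False
```

A tie counts as no improvement here, matching the strict `>` in the protocol, so the existing prompt is kept unless the candidate is measurably better.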

## Usage

```bash
# Triggered automatically after any workflow
# OR manually:
/evolve                  # run evolution on last workflow
/evolve --issue 42       # run evolution on specific issue
/evolve --agent planner  # evolve specific agent's prompt
/evolve --history        # show fitness trend
```

## Configuration

```yaml
# Add to kilo.jsonc or capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```