feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.85 (major rewrite below 0.70)
# Fitness Evaluation Workflow
Post-workflow fitness evaluation and automatic optimization loop.
## Overview
This workflow runs after every completed pipeline workflow to:
1. Evaluate fitness objectively via `pipeline-judge`
2. Trigger optimization if fitness < threshold
3. Re-run and compare before/after
4. Log results to fitness-history.jsonl
## Flow
```
[Workflow Completes]
        ↓
[@pipeline-judge] ← runs tests, measures tokens/time
        ↓
   fitness score
        ↓
┌────────────────────┬───────────────────────────────────┐
│ fitness >= 0.85    │ log + done (no action)            │
│ fitness 0.70-0.84  │ [@prompt-optimizer] minor tuning  │
│ fitness 0.50-0.69  │ [@prompt-optimizer] major rewrite │
│ fitness < 0.50     │ [@agent-architect] redesign agent │
└────────────────────┴───────────────────────────────────┘
        ↓ (if optimization was triggered)
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
compare fitness_before vs fitness_after
        ↓
┌───────────────────────────────────┐
│ improved?                         │
│   Yes → commit new prompts        │
│   No  → revert, try a different   │
│         strategy (max 3 attempts) │
└───────────────────────────────────┘
```
## Fitness Score Formula
```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
where:
test_pass_rate = passed_tests / total_tests
quality_gates_rate = passed_gates / total_gates
efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
```
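A minimal TypeScript sketch of the formula (names are illustrative, not the actual `pipeline-judge` implementation):

```typescript
interface RunMetrics {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  actualTokens: number;
  actualTimeS: number;
}

interface Budget {
  tokens: number;
  timeS: number;
}

const clamp = (x: number, lo: number, hi: number) => Math.min(Math.max(x, lo), hi);

// fitness = 0.50·test_pass_rate + 0.25·quality_gates_rate + 0.25·efficiency_score
function computeFitness(m: RunMetrics, budget: Budget): number {
  const testPassRate = m.totalTests > 0 ? m.passedTests / m.totalTests : 0;
  const qualityGatesRate = m.totalGates > 0 ? m.passedGates / m.totalGates : 0;
  // normalized_cost weights token overrun and time overrun equally
  const normalizedCost =
    (m.actualTokens / budget.tokens) * 0.5 + (m.actualTimeS / budget.timeS) * 0.5;
  const efficiencyScore = 1.0 - clamp(normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}
```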
## Quality Gates
Each gate is binary (pass/fail):
| Gate | Command | Weight |
|------|---------|--------|
| build | `bun run build` | 1/5 |
| lint | `bun run lint` | 1/5 |
| types | `bun run typecheck` | 1/5 |
| tests | `bun test` | 1/5 |
| coverage | `bun test --coverage` (line coverage ≥ workflow's min) | 1/5 |
## Budget Defaults
| Workflow | Token Budget | Time Budget (s) | Min Coverage |
|----------|-------------|-----------------|---------------|
| feature | 50000 | 300 | 80% |
| bugfix | 20000 | 120 | 90% |
| refactor | 40000 | 240 | 95% |
| security | 30000 | 180 | 80% |
## Workflow-Specific Benchmarks
```yaml
benchmarks:
feature:
token_budget: 50000
time_budget_s: 300
min_test_coverage: 80%
max_iterations: 3
bugfix:
token_budget: 20000
time_budget_s: 120
min_test_coverage: 90% # higher for bugfix - must prove fix works
max_iterations: 2
refactor:
token_budget: 40000
time_budget_s: 240
min_test_coverage: 95% # must not break anything
max_iterations: 2
security:
token_budget: 30000
time_budget_s: 180
min_test_coverage: 80%
max_iterations: 2
required_gates: [security] # security gate MUST pass
```
## Execution Steps
### Step 1: Collect Metrics
Agent: `pipeline-judge`
```bash
# Run test suite (keep stderr out of the JSON report)
bun test --reporter=json > /tmp/test-results.json 2>/dev/null

# Count results
TOTAL=$(jq '.numTotalTests' /tmp/test-results.json)
PASSED=$(jq '.numPassedTests' /tmp/test-results.json)
FAILED=$(jq '.numFailedTests' /tmp/test-results.json)

# Check quality gates (each is binary pass/fail)
bun run build >/dev/null 2>&1 && BUILD_OK=true || BUILD_OK=false
bun run lint >/dev/null 2>&1 && LINT_OK=true || LINT_OK=false
bun run typecheck >/dev/null 2>&1 && TYPES_OK=true || TYPES_OK=false
```
### Step 2: Read Pipeline Log
Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked
### Step 3: Calculate Fitness
```
test_pass_rate     = PASSED / TOTAL
quality_gates_rate = (BUILD_OK + LINT_OK + TYPES_OK + TESTS_CLEAN + COVERAGE_OK) / 5
efficiency         = 1.0 - min((tokens/token_budget + time/time_budget) / 2, 1.0)  # budgets from the workflow's benchmark entry
FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
```
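As an illustrative worked example (numbers chosen against the feature budgets, not taken from a real run): 45/47 passing tests gives test_pass_rate ≈ 0.957; all five gates green gives quality_gates_rate = 1.0; 38,400 tokens and 245 s against the 50,000-token / 300 s budgets give efficiency = 1 − (0.768 + 0.817)/2 ≈ 0.208; so FITNESS ≈ 0.479 + 0.250 + 0.052 ≈ 0.78, which lands in the minor-tuning band.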
### Step 4: Decide Action
| Fitness | Action |
|---------|--------|
| >= 0.85 | Log to fitness-history.jsonl, done |
| 0.70-0.84 | Call `prompt-optimizer` for minor tuning |
| 0.50-0.69 | Call `prompt-optimizer` for major rewrite |
| < 0.50 | Call `agent-architect` to redesign agent |
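The same thresholds as a TypeScript sketch (the action shape is illustrative):

```typescript
type Action =
  | { kind: "log_only" }
  | { kind: "prompt_optimizer"; mode: "minor" | "major" }
  | { kind: "agent_architect" };

// Map a fitness score to the action table above.
function decideAction(fitness: number): Action {
  if (fitness >= 0.85) return { kind: "log_only" };
  if (fitness >= 0.70) return { kind: "prompt_optimizer", mode: "minor" };
  if (fitness >= 0.50) return { kind: "prompt_optimizer", mode: "major" };
  return { kind: "agent_architect" };
}
```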
### Step 5: Re-test After Optimization
If optimization was triggered:
1. Re-run the same workflow with new prompts
2. Call `pipeline-judge` again
3. Compare fitness_before vs fitness_after
4. If improved: commit the new prompts
5. If not improved: revert and try a different strategy (max 3 attempts)
### Step 6: Log Results
Append to `.kilo/logs/fitness-history.jsonl`:
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
```
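Appending an entry is a single write; a sketch using `node:fs` (field names follow the example above):

```typescript
import { appendFileSync } from "node:fs";

interface FitnessEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

function logFitness(entry: FitnessEntry, path = ".kilo/logs/fitness-history.jsonl"): void {
  // JSONL: one JSON object per line
  appendFileSync(path, JSON.stringify(entry) + "\n");
}
```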
## Usage
### Automatic (post-pipeline)
The workflow triggers automatically after any workflow completes.
### Manual
```bash
/evolve # evolve last completed workflow
/evolve --issue 42 # evolve workflow for issue #42
/evolve --agent planner # focus evolution on one agent
/evolve --dry-run # show what would change without applying
/evolve --history # print fitness trend chart
```
## Integration Points
- **After `/pipeline`**: pipeline-judge scores the workflow
- **After prompt update**: evolution loop retries
- **Weekly**: Performance trend analysis
- **On request**: Recommendation generation
## Orchestrator Learning
The orchestrator uses fitness history to optimize future pipeline construction:
### Pipeline Selection Strategy
```
For each new issue:
1. Classify issue type (feature|bugfix|refactor|api|security)
2. Look up fitness history for same type
3. Find pipeline configuration with highest fitness
4. Use that as template, but adapt to current issue
5. Skip agents that consistently score 0 contribution
```
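A sketch of the lookup, assuming each history entry also records the pipeline that produced it (the `pipeline` field is a hypothetical extension of the log schema above):

```typescript
interface HistoryEntry {
  workflow: string;
  fitness: number;
  pipeline: string[]; // hypothetical: agents invoked, in order
}

// Pick the highest-fitness pipeline previously used for this issue type.
function bestPipelineFor(type: string, history: HistoryEntry[]): string[] | undefined {
  const candidates = history.filter((e) => e.workflow === type);
  if (candidates.length === 0) return undefined; // no history: use the default template
  return candidates.reduce((best, e) => (e.fitness > best.fitness ? e : best)).pipeline;
}
```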
### Agent Ordering Optimization
```
From fitness-history.jsonl, extract per-agent metrics:
- avg tokens consumed
- avg contribution to fitness
- failure rate (how often this agent's output causes downstream failures)
agents_by_roi = sort(agents, key=contribution/tokens, descending)
For parallel phases:
- Run high-ROI agents first
- Skip agents with ROI < 0.1 (cost more than they contribute)
```
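The ROI sort as a sketch (per-agent aggregates are assumed to have been extracted from the history already):

```typescript
interface AgentStats {
  name: string;
  avgTokens: number;       // average tokens consumed per run
  avgContribution: number; // average contribution to fitness per run
}

// ROI = contribution / tokens; the 0.1 cutoff is this document's threshold and
// assumes contribution and tokens are expressed on comparable scales.
function orderByRoi(agents: AgentStats[], minRoi = 0.1): AgentStats[] {
  return agents
    .map((a) => ({ ...a, roi: a.avgContribution / a.avgTokens }))
    .filter((a) => a.roi >= minRoi)
    .sort((a, b) => b.roi - a.roi);
}
```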
### Token Budget Allocation
```
total_budget = 50000 tokens (configurable)
For each agent in pipeline:
agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
If agent exceeds budget by >50%:
→ prompt-optimizer compresses that agent's prompt
→ or swap to a smaller/faster model
```
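The proportional split as a sketch (the stats shape is illustrative):

```typescript
// Allocate the total token budget proportionally to each agent's average contribution.
function allocateBudgets(
  agents: { name: string; avgContribution: number }[],
  totalBudget = 50_000,
): Map<string, number> {
  const sum = agents.reduce((s, a) => s + a.avgContribution, 0);
  if (sum === 0) return new Map(); // no contribution data yet: nothing to allocate
  return new Map(
    agents.map((a): [string, number] => [a.name, Math.floor(totalBudget * (a.avgContribution / sum))]),
  );
}
```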
## Prompt Evolution Protocol
When prompt-optimizer is triggered:
1. Read current agent prompt from `.kilo/agents/<agent>.md`
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history
4. Analyze pattern:
- IF consistently low → systemic prompt issue
- IF regression after change → revert
- IF one-time failure → might be task-specific, no action
5. Generate improved prompt:
- Keep same structure (description, mode, model, permissions)
- Modify ONLY the instruction body
   - Add explicit output format IF format was the issue
- Add few-shot examples IF quality was the issue
- Compress verbose sections IF tokens were the issue
6. Save to `.kilo/agents/<agent>.md.candidate`
7. Re-run workflow with .candidate prompt
8. `@pipeline-judge` scores again
9. IF fitness_new > fitness_old: mv .candidate → .md (commit)
ELSE: rm .candidate (revert)
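Steps 6-9 reduce to a compare-and-swap on the prompt file; a sketch using `node:fs` (paths follow the convention above):

```typescript
import { renameSync, unlinkSync } from "node:fs";

// Promote the .candidate prompt if it scored better, otherwise discard it.
function settleCandidate(agent: string, fitnessOld: number, fitnessNew: number): void {
  const promptPath = `.kilo/agents/${agent}.md`;
  const candidatePath = `${promptPath}.candidate`;
  if (fitnessNew > fitnessOld) {
    renameSync(candidatePath, promptPath); // mv .candidate → .md (then commit)
  } else {
    unlinkSync(candidatePath); // rm .candidate (revert)
  }
}
```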