feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
agent-evolution/ideas/evolution-workflow.md
@@ -0,0 +1,201 @@
# Evolution Workflow

Continuous self-improvement loop for the agent pipeline.
Triggered automatically after every workflow completion.

## Overview

```
[Workflow Completes]
        ↓
[@pipeline-judge]  ← runs tests, measures tokens/time
        ↓
   fitness score
        ↓
┌──────────────────────────┐
│ fitness >= 0.85          │──→ Log + done (no action)
│ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
│ fitness 0.50 - 0.69      │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50           │──→ [@agent-architect] redesign agent
└──────────────────────────┘
        ↓
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
compare fitness_before vs fitness_after
        ↓
┌──────────────────────────┐
│ improved?                │
│  Yes → commit new prompts│
│  No  → revert, try       │
│        different strategy│
│        (max 3 attempts)  │
└──────────────────────────┘
```
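
The scoring and routing steps in the diagram can be sketched in a few lines. This is a minimal illustration, not the pipeline-judge implementation: the weights come from the commit message (`test_rate*0.5 + gates*0.25 + efficiency*0.25`), the function names and action labels are hypothetical, and the bands are read as non-overlapping, so a score below 0.50 goes to a redesign rather than a major rewrite.

```python
def compute_fitness(test_rate: float, gates: float, efficiency: float) -> float:
    """Weighted fitness; all three inputs are assumed to lie in [0, 1]."""
    return 0.5 * test_rate + 0.25 * gates + 0.25 * efficiency


def route(fitness: float) -> str:
    """Map a fitness score to the follow-up action from the diagram.

    The returned labels are hypothetical names for the four branches.
    """
    if fitness >= 0.85:
        return "log"             # log + done, no action
    if fitness >= 0.70:
        return "minor_tuning"    # @prompt-optimizer, minor tuning
    if fitness >= 0.50:
        return "major_rewrite"   # @prompt-optimizer, major rewrite
    return "redesign_agent"      # @agent-architect


if __name__ == "__main__":
    score = compute_fitness(test_rate=45 / 47, gates=1.0, efficiency=0.5)
    print(round(score, 3), route(score))
```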

## Fitness History

All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:

```jsonl
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```

This creates a time series that shows how the pipeline evolves.
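
As a rough illustration of how this log could be consumed, the sketch below parses JSONL text and averages fitness per workflow type. The field names match the two sample entries above; the helper names (`load_history`, `mean_fitness_by_workflow`) are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean

# The two sample entries from the log format shown above.
SAMPLE = """\
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
"""


def load_history(text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def mean_fitness_by_workflow(entries: list[dict]) -> dict[str, float]:
    """Group entries by their 'workflow' field and average 'fitness'."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for entry in entries:
        grouped[entry["workflow"]].append(entry["fitness"])
    return {workflow: mean(values) for workflow, values in grouped.items()}


if __name__ == "__main__":
    history = load_history(SAMPLE)
    print(mean_fitness_by_workflow(history))
```

In practice the text would be read from `.kilo/logs/fitness-history.jsonl` instead of an inline sample.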

## Orchestrator Evolution

The orchestrator uses fitness history to optimize future pipeline construction:

### Pipeline Selection Strategy
```
For each new issue:
1. Classify issue type (feature|bugfix|refactor|api|security)
2. Look up fitness history for the same type
3. Find the pipeline configuration with the highest fitness
4. Use that as a template, adapted to the current issue
5. Skip agents whose contribution to fitness is consistently 0
```
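
A minimal sketch of steps 2 to 5 follows. It assumes each history entry also records the agents that ran in a `pipeline` field; that field is hypothetical and does not appear in the sample log entries. The function name `best_pipeline` and the `skip` parameter are likewise illustrative.

```python
from typing import Optional


def best_pipeline(
    history: list[dict],
    issue_type: str,
    skip: frozenset = frozenset(),
) -> Optional[list[str]]:
    """Return the agent list of the highest-fitness run for this issue type.

    Entries without the (hypothetical) 'pipeline' field are ignored.
    Agents named in `skip` (for example, consistent zero contributors)
    are dropped from the returned template.
    """
    candidates = [
        entry for entry in history
        if entry["workflow"] == issue_type and "pipeline" in entry
    ]
    if not candidates:
        return None  # no history yet: the caller falls back to a default
    best = max(candidates, key=lambda entry: entry["fitness"])
    return [agent for agent in best["pipeline"] if agent not in skip]


if __name__ == "__main__":
    history = [
        {"workflow": "bugfix", "fitness": 0.78, "pipeline": ["planner", "coder"]},
        {"workflow": "bugfix", "fitness": 0.91, "pipeline": ["coder", "reviewer"]},
    ]
    print(best_pipeline(history, "bugfix"))
```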

### Agent Ordering Optimization
```
From fitness-history.jsonl, extract per-agent metrics:
- avg tokens consumed
- avg contribution to fitness
- failure rate (how often this agent's output causes downstream failures)

agents_by_roi = sort(agents, key=contribution/tokens, descending)

For parallel phases:
- Run high-ROI agents first
- Skip agents with ROI < 0.1 (cost more than they contribute)
```
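
The ROI ordering above might look like the following sketch. One assumption is made explicit: token counts are expressed in thousands, since a contribution of at most 1.0 divided by raw token counts would always fall below the 0.1 cutoff. The `AgentStats` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    name: str
    avg_contribution: float  # average contribution to fitness
    avg_tokens_k: float      # average tokens consumed, in thousands (assumption)


def roi(stats: AgentStats) -> float:
    """Contribution per thousand tokens; 0.0 when no tokens were recorded."""
    if stats.avg_tokens_k <= 0:
        return 0.0
    return stats.avg_contribution / stats.avg_tokens_k


def order_agents(agents: list[AgentStats], min_roi: float = 0.1) -> list[str]:
    """Drop agents below the ROI cutoff, then sort the rest by ROI, descending."""
    kept = [agent for agent in agents if roi(agent) >= min_roi]
    return [agent.name for agent in sorted(kept, key=roi, reverse=True)]


if __name__ == "__main__":
    agents = [
        AgentStats("planner", 0.30, 1.0),
        AgentStats("reviewer", 0.10, 2.0),
        AgentStats("coder", 0.50, 4.0),
    ]
    print(order_agents(agents))
```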

### Token Budget Allocation
```
total_budget = 50000 tokens (configurable)

For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)

If agent exceeds budget by >50%:
  → prompt-optimizer compresses that agent's prompt
  → or swap to a smaller/faster model
```
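
The proportional split and the 50% overrun check can be sketched as below. The function names are hypothetical, and the fallback to an equal split when no contribution data exists is an assumption, since the source does not say what happens in that case.

```python
def allocate_budgets(
    contributions: dict[str, float],
    total_budget: int = 50_000,
) -> dict[str, int]:
    """Split total_budget across agents in proportion to average contribution."""
    if not contributions:
        return {}
    total = sum(contributions.values())
    if total <= 0:
        # Assumption: with no usable contribution data, split equally.
        share = total_budget // len(contributions)
        return {agent: share for agent in contributions}
    return {
        agent: round(total_budget * value / total)
        for agent, value in contributions.items()
    }


def exceeds_budget(tokens_used: int, budget: int, tolerance: float = 0.5) -> bool:
    """True when usage is more than `tolerance` (50% by default) over budget."""
    return tokens_used > budget * (1 + tolerance)


if __name__ == "__main__":
    budgets = allocate_budgets({"planner": 1.0, "coder": 3.0, "reviewer": 1.0})
    print(budgets)
    print(exceeds_budget(16_000, budgets["planner"]))
```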

## Standard Test Suites

No manual test configuration is needed. Tests are auto-discovered:

### Test Discovery
```bash
# Unit tests
find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l

# E2E tests
find tests/e2e -name "*.test.ts" | wc -l

# Integration tests
find tests/integration -name "*.test.ts" | wc -l
```

### Quality Gates (standardized)
```yaml
gates:
  build: "bun run build"
  lint: "bun run lint"
  typecheck: "bun run typecheck"
  unit_tests: "bun test"
  e2e_tests: "bun test:e2e"
  # awk exits 0 only when the coverage column is >= 80;
  # adjust $10 if the coverage reporter uses a different column layout
  coverage: "bun test --coverage | grep 'All files' | awk '{exit !($10 >= 80)}'"
  security: "bun audit --level=high | grep 'found 0'"
```
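
One way to turn these gates into the `gates` term of the fitness formula is to run each command and take the fraction that exits with status 0. The sketch below assumes that pass/fail is decided by exit code alone; `run_gates`, `gate_score`, and the injectable `run` parameter are hypothetical names, the latter added so the logic can be exercised without a real toolchain.

```python
import subprocess
from typing import Callable, Optional


def _shell_passes(command: str) -> bool:
    """Run a command through the shell; a gate passes when it exits with 0."""
    return subprocess.run(command, shell=True, capture_output=True).returncode == 0


def run_gates(
    gates: dict[str, str],
    run: Optional[Callable[[str], bool]] = None,
) -> dict[str, bool]:
    """Execute every gate command and record whether it passed."""
    runner = run if run is not None else _shell_passes
    return {name: runner(command) for name, command in gates.items()}


def gate_score(results: dict[str, bool]) -> float:
    """Fraction of gates that passed; 0.0 when no gates are configured."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


if __name__ == "__main__":
    # A fake runner stands in for the real commands in this demonstration.
    demo = run_gates(
        {"build": "bun run build", "lint": "bun run lint"},
        run=lambda command: command.endswith("build"),
    )
    print(demo, gate_score(demo))
```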

### Workflow-Specific Benchmarks
```yaml
benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3

  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%   # higher for bugfix: must prove the fix works
    max_iterations: 2

  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%   # must not break anything
    max_iterations: 2

  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
```
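
A history entry can be checked against these budgets with a small helper. The sketch below copies the token and time budgets from the YAML above and uses the `tokens` and `time_ms` fields from the fitness log; the function name `budget_violations` is hypothetical, and how violations feed into the efficiency score is not specified in the source, so the helper only reports them.

```python
# Token and time budgets copied from the benchmarks above.
BENCHMARKS = {
    "feature": {"token_budget": 50_000, "time_budget_s": 300},
    "bugfix": {"token_budget": 20_000, "time_budget_s": 120},
    "refactor": {"token_budget": 40_000, "time_budget_s": 240},
    "security": {"token_budget": 30_000, "time_budget_s": 180},
}


def budget_violations(entry: dict, benchmarks: dict = BENCHMARKS) -> list[str]:
    """Return which budgets ('tokens', 'time') a history entry exceeded."""
    bench = benchmarks[entry["workflow"]]
    violations = []
    if entry["tokens"] > bench["token_budget"]:
        violations.append("tokens")
    # The log stores milliseconds; the benchmark is in seconds.
    if entry["time_ms"] > bench["time_budget_s"] * 1000:
        violations.append("time")
    return violations


if __name__ == "__main__":
    entry = {"workflow": "feature", "tokens": 38400, "time_ms": 245000}
    print(budget_violations(entry))
```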

## Prompt Evolution Protocol

When prompt-optimizer is triggered:

```
1. Read current agent prompt from .kilo/agents/<agent>.md
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history

4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action

5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format if instruction following (IF) was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue

6. Save to .kilo/agents/<agent>.md.candidate

7. Re-run the SAME workflow with .candidate prompt

8. [@pipeline-judge] scores again

9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
```
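
Steps 4 and 9 are the two decision points in this protocol, and both can be sketched briefly. The version below assumes "consistently low" means every recent score is under the 0.70 threshold, and that the caller knows whether the prompt changed recently (the `prompt_changed` flag); both are interpretations, and all names are hypothetical.

```python
from pathlib import Path


def classify(recent: list[float], threshold: float = 0.70,
             prompt_changed: bool = False) -> str:
    """Classify the recent fitness pattern for one agent (step 4)."""
    if not recent or recent[-1] >= threshold:
        return "healthy"       # latest run is fine, nothing to analyze
    if all(score < threshold for score in recent):
        return "systemic"      # consistently low: systemic prompt issue
    if prompt_changed:
        return "regression"    # dropped after a prompt change: revert
    return "one_off"           # isolated failure: likely task-specific


def resolve_candidate(prompt_path: Path, fitness_old: float,
                      fitness_new: float) -> bool:
    """Promote <agent>.md.candidate over <agent>.md, or discard it (step 9)."""
    candidate = prompt_path.with_name(prompt_path.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(prompt_path)  # commit: candidate becomes the prompt
        return True
    candidate.unlink()                  # revert: drop the candidate
    return False


if __name__ == "__main__":
    print(classify([0.62, 0.65, 0.58]))
    print(classify([0.90, 0.88, 0.60], prompt_changed=True))
```

Using `Path.replace` keeps the promotion a single rename, so the prompt file is never left half-written.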

## Usage

```bash
# Triggered automatically after any workflow
# OR manually:
/evolve                    # run evolution on last workflow
/evolve --issue 42         # run evolution on specific issue
/evolve --agent planner    # evolve specific agent's prompt
/evolve --history          # show fitness trend
```

## Configuration

```yaml
# Add to kilo.jsonc or capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000  # tokens
  time_budget_default: 300     # seconds, matching time_budget_s in the benchmarks
```