feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions

@@ -1,163 +1,167 @@
# Agent Evolution Workflow
---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---
Tracks and records agent model improvements, capability changes, and performance metrics.
# /evolution — Pipeline Evolution Command
Runs the automated evolution cycle on the most recent (or specified) workflow.
## Usage
```
/evolution [action] [agent]
/evolution # evolve last completed workflow
/evolution --issue 42 # evolve workflow for issue #42
/evolution --agent planner # focus evolution on one agent
/evolution --dry-run # show what would change without applying
/evolution --history # print fitness trend chart
/evolution --fitness # run fitness evaluation (alias for /evolve)
```
### Actions
## Aliases
| Action | Description |
|--------|-------------|
| `log` | Log an agent improvement to Gitea and evolution data |
| `report` | Generate evolution report for agent or all agents |
| `history` | Show model change history |
| `metrics` | Display performance metrics |
| `recommend` | Get model recommendations |
- `/evolve` — same as `/evolution --fitness`
- `/evolution log` — log agent model change to Gitea
### Examples
## Execution
### Step 1: Judge (Fitness Evaluation)
```
Task(subagent_type: "pipeline-judge")
→ produces fitness report
```
### Step 2: Decide (Threshold Routing)
```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT
IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)
IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```
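The threshold routing above can be sketched in Python as follows. This is a minimal illustration, not the command's actual implementation: the `Plan` dataclass and the `route` function are hypothetical names, and only the three thresholds and the subagent names come from the pseudocode.

```python
from dataclasses import dataclass, field

HEALTHY = 0.85   # at or above: no action needed
MARGINAL = 0.70  # at or above: optimize only the weakest agents
REDESIGN = 0.50  # below: additionally redesign the worst agent


@dataclass
class Plan:
    """Hypothetical container for the actions chosen for one fitness score."""
    status: str
    subagents: list[str] = field(default_factory=list)


def route(fitness: float) -> Plan:
    """Map a fitness score to the subagents the cycle would invoke."""
    if fitness >= HEALTHY:
        return Plan("healthy")
    if fitness >= MARGINAL:
        return Plan("marginal", ["prompt-optimizer"])
    plan = Plan("underperforming", ["prompt-optimizer"])
    if fitness < REDESIGN:
        plan.subagents.append("agent-architect")
    return plan
```

Returning a plan instead of invoking subagents directly keeps the decision testable, which may also suit `--dry-run`.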
### Step 3: Re-test (After Optimization)
```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after
IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```
### Step 4: Log
Append to `.kilo/logs/fitness-history.jsonl`:
```json
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```
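One way to append such a record is sketched below, assuming one JSON object per line and a UTC timestamp in the same format as the sample entries. The `append_record` helper is a hypothetical name; the field names mirror the template above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_record(path, issue, workflow, fitness_before, fitness_after,
                  agents_optimized, tokens_saved, time_saved_ms):
    """Append one evolution record as a single JSON line and return it."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "issue": issue,
        "workflow": workflow,
        "fitness_before": fitness_before,
        "fitness_after": fitness_after,
        "agents_optimized": list(agents_optimized),
        "tokens_saved": tokens_saved,
        "time_saved_ms": time_saved_ms,
    }
    path = Path(path)
    # Create .kilo/logs/ (or whatever parent is given) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Opening the file in append mode preserves earlier entries, so the history can only grow.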
## Subcommands
### `log` — Log Model Change
Log an agent model improvement to Gitea and evolution data.
```bash
# Log improvement
/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
```
# Generate report
/evolution report capability-analyst
Steps:
1. Read current model from `.kilo/agents/{agent}.md`
2. Get previous model from `agent-evolution/data/agent-versions.json`
3. Calculate improvement (IF score, context window)
4. Write to evolution data
5. Post Gitea comment
# Show all changes
/evolution history
### `report` — Generate Evolution Report
# Get recommendations
Generate comprehensive report for agent or all agents:
```bash
/evolution report # all agents
/evolution report planner # specific agent
```
Output includes:
- Total agents
- Model changes this month
- Average quality improvement
- Recent changes table
- Performance metrics
- Model distribution
- Recommendations
### `history` — Show Fitness Trend
Print fitness trend chart:
```bash
/evolution --history
```
Output:
```
Fitness Trend (Last 30 days):
1.00 ┤
0.90 ┤ ╭─╮ ╭──╮
0.80 ┤ ╭─╯ ╰─╮ ╭─╯ ╰──╮
0.70 ┤ ╭─╯ ╰─╯ ╰──╮
0.60 ┤ │ ╰─╮
0.50 ┼─┴───────────────────────────┴──
Apr 1 Apr 8 Apr 15 Apr 22 Apr 29
Avg fitness: 0.82
Trend: ↑ improving
```
### `recommend` — Get Model Recommendations
```bash
/evolution recommend
```
## Workflow Steps
### Step 1: Parse Command
```bash
action=$1
agent=$2
message=$3
```
### Step 2: Execute Action
#### Log Action
When logging an improvement:
1. **Read current model**
```bash
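# From .kilo/agents/{agent}.md (path quoted in case the agent name contains spaces)
current_model=$(grep "^model:" ".kilo/agents/${agent}.md" | cut -d' ' -f2)
# From .kilo/capability-index.yaml; awk ignores the YAML indentation that would leave cut -d' ' -f2 empty
yaml_model=$(grep -A1 "${agent}:" .kilo/capability-index.yaml | awk '/model:/ {print $2}')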
```
2. **Get previous model from history**
```bash
# Read from agent-evolution/data/agent-versions.json
previous_model=$(cat agent-evolution/data/agent-versions.json | ...)
```
3. **Calculate improvement**
- Look up model scores from capability-index.yaml
- Compare IF scores
- Compare context windows
4. **Write to evolution data**
```json
{
  "agent": "capability-analyst",
  "timestamp": "2026-04-05T22:20:00Z",
  "type": "model_change",
  "from": "ollama-cloud/nemotron-3-super",
  "to": "qwen/qwen3.6-plus:free",
  "improvement": {
    "quality": "+23%",
    "context_window": "130K→1M",
    "if_score": "85→90"
  },
  "rationale": "Better structured output, FREE via OpenRouter"
}
```
5. **Post Gitea comment**
```markdown
## 🚀 Agent Evolution: {agent}
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Model | {old} | {new} | ⬆️ |
| IF Score | 85 | 90 | +5 |
| Quality | 64 | 79 | +23% |
| Context | 130K | 1M | +870K |
**Rationale**: {message}
```
#### Report Action
Generate comprehensive report:
```markdown
# Agent Evolution Report
## Overview
- Total agents: 28
- Model changes this month: 4
- Average quality improvement: +18%
## Recent Changes
| Date | Agent | Old Model | New Model | Impact |
|------|-------|-----------|-----------|--------|
| 2026-04-05 | capability-analyst | nemotron-3-super | qwen3.6-plus | +23% |
| 2026-04-05 | requirement-refiner | nemotron-3-super | glm-5 | +33% |
| ... | ... | ... | ... | ... |
## Performance Metrics
### Agent Scores Over Time
    capability-analyst: 64 → 79 (+23%)
    requirement-refiner: 60 → 80 (+33%)
    agent-architect: 67 → 82 (+22%)
    evaluator: 78 → 81 (+4%)
### Model Distribution
- qwen3.6-plus: 5 agents
- nemotron-3-super: 8 agents
- glm-5: 3 agents
- minimax-m2.5: 1 agent
- ...
## Recommendations
1. Consider updating history-miner to nemotron-3-super-120b
2. code-skeptic optimal with minimax-m2.5
3. ...
```
### Step 3: Update Files
After logging:
1. Update `agent-evolution/data/agent-versions.json`
2. Post comment to related Gitea issue
3. Update capability-index.yaml metrics
Shows:
- Agents with fitness < 0.70 (need optimization)
- Agents consuming > 30% of token budget (bottlenecks)
- Model upgrade recommendations
- Priority order
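The first two checks in this list could be sketched as below. This is an assumption-heavy illustration: the per-agent `fitness` and `tokens` fields and the `recommend` function are hypothetical, "token budget" is read as each agent's share of the tokens used in the run, and priority is taken to mean lowest fitness first.

```python
def recommend(agents, fitness_floor=0.70, token_share_cap=0.30):
    """Flag agents that need optimization or dominate token usage.

    `agents` is a list of dicts with hypothetical keys
    "name", "fitness" (0..1) and "tokens" (tokens used in the run).
    """
    total_tokens = sum(a["tokens"] for a in agents)
    # Lowest-scoring agents first, as an assumed priority order.
    weak = sorted(
        (a for a in agents if a["fitness"] < fitness_floor),
        key=lambda a: a["fitness"],
    )
    bottlenecks = [
        a["name"] for a in agents
        if total_tokens and a["tokens"] / total_tokens > token_share_cap
    ]
    return {
        "needs_optimization": [a["name"] for a in weak],
        "bottlenecks": bottlenecks,
    }
```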
## Data Storage
### fitness-history.jsonl
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
```
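A history file in this shape can be summarized with a few lines of Python. The sketch below assumes each line carries a numeric `fitness` field as in the samples; the `summarize` function and its trend rule (mean of the newer half versus mean of the older half) are illustrative assumptions, not the logic behind `/evolution --history`.

```python
import json
from statistics import mean


def summarize(lines):
    """Return average fitness and a coarse trend for JSONL history lines."""
    scores = [json.loads(line)["fitness"] for line in lines if line.strip()]
    if not scores:
        return None
    trend = "flat"
    half = len(scores) // 2
    if half:
        # Compare the older half of the entries with the newer half.
        older, newer = mean(scores[:half]), mean(scores[half:])
        if newer > older:
            trend = "improving"
        elif newer < older:
            trend = "declining"
    return {"avg": mean(scores), "trend": trend}
```

Because the function takes an iterable of lines, it can be fed an open file handle directly, for example `summarize(open(".kilo/logs/fitness-history.jsonl", encoding="utf-8"))`.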
### agent-versions.json
```json
@@ -186,22 +190,6 @@ After logging:
}
```
### Gitea Issue Comments
Each evolution log posts a formatted comment:
```markdown
## 🚀 Agent Evolution Log
### {agent}
- **Model**: {old} → {new}
- **Quality**: {old_score} → {new_score} ({change}%)
- **Context**: {old_ctx} → {new_ctx}
- **Rationale**: {reason}
_This change was tracked by /evolution workflow._
```
## Integration Points
- **After `/pipeline`**: Evaluator scores logged
@@ -209,29 +197,52 @@ _This change was tracked by /evolution workflow._
- **Weekly**: Performance report generated
- **On request**: Recommendations provided
## Configuration
```yaml
# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true  # trigger after every workflow
  fitness_threshold: 0.70  # below this → auto-optimize
  max_evolution_attempts: 3  # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```
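How `fitness_threshold` and `max_evolution_attempts` might bound a cycle is sketched below. The `evolve` function and its three callbacks are hypothetical stand-ins for the pipeline-judge, the prompt-optimizer, and a prompt rollback; only the two defaults come from the configuration above.

```python
from typing import Callable


def evolve(
    judge: Callable[[], float],
    optimize: Callable[[], None],
    revert: Callable[[], None],
    fitness_threshold: float = 0.70,
    max_evolution_attempts: int = 3,
) -> float:
    """Optimize until fitness reaches the threshold or attempts run out."""
    best = judge()
    for _ in range(max_evolution_attempts):
        if best >= fitness_threshold:
            break
        optimize()
        candidate = judge()
        if candidate > best:
            best = candidate  # keep the prompt changes
        else:
            revert()  # no improvement: roll the prompts back
    return best
```

Capping the loop means a workflow that never improves costs at most three optimization passes.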
## Metrics Tracked
| Metric | Source | Purpose |
|--------|--------|---------|
| IF Score | KILO_SPEC.md | Instruction Following |
| Quality Score | Research | Overall performance |
| Context Window | Model spec | Max tokens |
| Provider | Config | API endpoint |
| Cost | Pricing | Resource planning |
| SWE-bench | Research | Code benchmark |
| RULER | Research | Long-context benchmark |
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |
## Example Session
```bash
$ /evolution log capability-analyst "Updated to qwen3.6-plus for FREE tier and better IF"
$ /evolution
✅ Logged evolution for capability-analyst
📊 Quality improvement: +23%
📄 Posted comment to Issue #27
📝 Updated agent-versions.json
## Pipeline Judgment: Issue #42
**Fitness: 0.82/1.00** [PASS]
| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |
**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range
✅ Logged to .kilo/logs/fitness-history.jsonl
```
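The weighted sum behind the Contribution column can be reproduced with a short sketch. The weights come from the commit message and the key names from the `breakdown` object in `fitness-history.jsonl`; the `fitness` function itself is a hypothetical helper, and how pipeline-judge derives the efficiency score is not shown here.

```python
# Weights as stated in the commit message: tests 50%, gates 25%, efficiency 25%.
WEIGHTS = {
    "test_pass_rate": 0.50,
    "quality_gates_rate": 0.25,
    "efficiency_score": 0.25,
}


def fitness(breakdown):
    """Weighted sum of the three component rates, each expected in [0, 1]."""
    for name in WEIGHTS:
        if not 0.0 <= breakdown[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(weight * breakdown[name] for name, weight in WEIGHTS.items())
```

For the breakdown in the table (0.95, 0.80, 0.65) this weighting sums to 0.8375, slightly above the 0.82 reported, so the sample figures appear to be illustrative rather than computed.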
---
_Evolution workflow v1.0 - Track agent improvements_
*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*