feat: add pipeline-judge agent and evolution workflow system
- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
@@ -1,163 +1,167 @@
# Agent Evolution Workflow
---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---

Tracks and records agent model improvements, capability changes, and performance metrics.
# /evolution — Pipeline Evolution Command

Runs the automated evolution cycle on the most recent (or specified) workflow.

## Usage

```
/evolution [action] [agent]
/evolution                    # evolve last completed workflow
/evolution --issue 42         # evolve workflow for issue #42
/evolution --agent planner    # focus evolution on one agent
/evolution --dry-run          # show what would change without applying
/evolution --history          # print fitness trend chart
/evolution --fitness          # run fitness evaluation (alias for /evolve)
```

### Actions
## Aliases

| Action | Description |
|--------|-------------|
| `log` | Log an agent improvement to Gitea and evolution data |
| `report` | Generate evolution report for agent or all agents |
| `history` | Show model change history |
| `metrics` | Display performance metrics |
| `recommend` | Get model recommendations |
- `/evolve` — same as `/evolution --fitness`
- `/evolution log` — log agent model change to Gitea

### Examples
## Execution

### Step 1: Judge (Fitness Evaluation)

```
Task(subagent_type: "pipeline-judge")
→ produces fitness report
```
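
The commit message gives the scoring formula as `fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)`. The following is a minimal Python sketch of that weighted sum, assuming each component is already normalized to the range 0 to 1. The function names are illustrative and the dictionary keys mirror the `breakdown` object in `fitness-history.jsonl`; none of this is taken from the pipeline-judge implementation itself.

```python
# Weights as stated in the commit message; all names below are illustrative.
WEIGHTS = {
    "test_pass_rate": 0.50,
    "quality_gates_rate": 0.25,
    "efficiency_score": 0.25,
}


def contributions(breakdown):
    """Return each component's weighted contribution to the fitness score."""
    result = {}
    for name, weight in WEIGHTS.items():
        value = breakdown[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        result[name] = weight * value
    return result


def compute_fitness(breakdown):
    """Weighted sum of the three component rates."""
    return sum(contributions(breakdown).values())


if __name__ == "__main__":
    sample = {"test_pass_rate": 0.95, "quality_gates_rate": 0.80, "efficiency_score": 0.65}
    for name, value in contributions(sample).items():
        print(f"{name}: {value:.3f}")
```

Keeping the per-component contributions separate from the total makes it easy to print a Metric / Weight / Contribution table alongside the score.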

### Step 2: Decide (Threshold Routing)

```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT

IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)

IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```
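
The threshold routing can be sketched as a pure function that maps a score to a list of follow-up actions. This is a hedged illustration only: the thresholds come from the pseudocode in this step, while the action labels are hypothetical strings standing in for the `Task(...)` calls.

```python
def route(fitness):
    """Map a fitness score to the follow-up actions described in Step 2.

    The returned labels are hypothetical placeholders, not real task names.
    """
    if fitness >= 0.85:
        # Healthy: record the score and stop.
        return ["log"]
    if fitness >= 0.70:
        # Marginal: optimize only the weakest agents.
        return ["optimize_weak_agents"]
    # Underperforming: optimize every flagged agent.
    actions = ["optimize_all_flagged_agents"]
    if fitness < 0.50:
        # Severe: additionally ask for a redesign of the worst agent.
        actions.append("redesign_worst_agent")
    return actions


if __name__ == "__main__":
    for score in (0.91, 0.82, 0.64, 0.41):
        print(score, route(score))
```

Returning labels instead of invoking tasks directly keeps the decision testable on its own, for example in a `--dry-run` mode.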

### Step 3: Re-test (After Optimization)

```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after

IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```
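
A minimal sketch of the commit-or-revert decision, assuming both scores are plain floats. The function name and the returned action strings are illustrative. Note that the comparison is strict, so an unchanged score leads to a revert, as in the pseudocode.

```python
def retest_decision(fitness_before, fitness_after):
    """Return (action, message) for the re-test step.

    Only a strict improvement keeps the prompt changes.
    """
    if fitness_after > fitness_before:
        message = f"Fitness improved: {fitness_before:.2f} → {fitness_after:.2f}"
        return "commit", message
    return "revert", "No improvement. Reverting."


if __name__ == "__main__":
    print(retest_decision(0.62, 0.78))
    print(retest_decision(0.62, 0.62))
```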

### Step 4: Log

Append to `.kilo/logs/fitness-history.jsonl`:

```json
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```
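
Appending a record can be sketched as follows, assuming one JSON object per line (the JSON Lines convention implied by the `.jsonl` extension) and a UTC timestamp in the same `...Z` form used by the sample data. The helper name and its parameters are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_fitness_record(path, record, now=None):
    """Append one evolution record as a single JSON line and return it.

    `now` may be passed for reproducible timestamps; it defaults to the
    current UTC time. A naive datetime is interpreted as local time.
    """
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = {"ts": stamp.strftime("%Y-%m-%dT%H:%M:%SZ"), **record}
    log_path = Path(path)
    # Create .kilo/logs/ (or whichever parent directory) on first use.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


if __name__ == "__main__":
    append_fitness_record(
        ".kilo/logs/fitness-history.jsonl",
        {
            "issue": 42,
            "workflow": "feature",
            "fitness_before": 0.64,
            "fitness_after": 0.78,
            "agents_optimized": ["planner", "requirement-refiner"],
        },
    )
```

Opening the file in append mode means earlier records are never rewritten, which suits an audit-style history log.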

## Subcommands

### `log` — Log Model Change

Log an agent model improvement to Gitea and evolution data.

```bash
# Log improvement
/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
```

# Generate report
/evolution report capability-analyst
Steps:
1. Read current model from `.kilo/agents/{agent}.md`
2. Get previous model from `agent-evolution/data/agent-versions.json`
3. Calculate improvement (IF score, context window)
4. Write to evolution data
5. Post Gitea comment

# Show all changes
/evolution history
### `report` — Generate Evolution Report

# Get recommendations
Generate a comprehensive report for one agent or for all agents:

```bash
/evolution report              # all agents
/evolution report planner      # specific agent
```

Output includes:
- Total agents
- Model changes this month
- Average quality improvement
- Recent changes table
- Performance metrics
- Model distribution
- Recommendations

### `history` — Show Fitness Trend

Print fitness trend chart:

```bash
/evolution --history
```

Output:
```
Fitness Trend (Last 30 days):

1.00 ┤
0.90 ┤ ╭─╮ ╭──╮
0.80 ┤ ╭─╯ ╰─╮ ╭─╯ ╰──╮
0.70 ┤ ╭─╯ ╰─╯ ╰──╮
0.60 ┤ │ ╰─╮
0.50 ┼─┴───────────────────────────┴──
Apr 1 Apr 8 Apr 15 Apr 22 Apr 29

Avg fitness: 0.82
Trend: ↑ improving
```
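
The summary lines under the chart (average and trend) could be derived from `fitness-history.jsonl` along these lines. This is a sketch under stated assumptions: each record carries a numeric `fitness` field, as in the sample records, and the trend rule (mean of the newer half versus mean of the older half) is an invented heuristic, since the document does not define how the trend is computed.

```python
import json
from statistics import mean


def load_fitness(lines):
    """Extract the fitness value from each non-empty JSON line."""
    return [json.loads(line)["fitness"] for line in lines if line.strip()]


def summarize(scores):
    """Return (average, trend) for a chronologically ordered score list.

    The trend heuristic is an assumption, not the documented behaviour.
    """
    if not scores:
        raise ValueError("no fitness records")
    half = len(scores) // 2
    if half == 0:
        return mean(scores), "flat"
    older, newer = mean(scores[:half]), mean(scores[-half:])
    if newer > older:
        trend = "improving"
    elif newer < older:
        trend = "declining"
    else:
        trend = "flat"
    return mean(scores), trend


if __name__ == "__main__":
    history = [
        '{"ts":"2026-04-06T00:00:00Z","issue":42,"fitness":0.80}',
        '{"ts":"2026-04-06T01:30:00Z","issue":43,"fitness":0.90}',
    ]
    average, trend = summarize(load_fitness(history))
    print(f"Avg fitness: {average:.2f}")
    print(f"Trend: {trend}")
```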

### `recommend` — Get Model Recommendations

```bash
/evolution recommend
```

## Workflow Steps

### Step 1: Parse Command

```bash
action=$1
agent=$2
message=$3
```

### Step 2: Execute Action

#### Log Action

When logging an improvement:

1. **Read current model**
```bash
# From .kilo/agents/{agent}.md
current_model=$(grep "^model:" ".kilo/agents/${agent}.md" | cut -d' ' -f2)

# From .kilo/capability-index.yaml
# awk is used here because an indented "model:" line would yield an empty
# second field with cut -d' ' -f2.
yaml_model=$(grep -A1 "${agent}:" .kilo/capability-index.yaml | grep "model:" | awk '{print $2}')
```

2. **Get previous model from history**
```bash
# Read from agent-evolution/data/agent-versions.json
previous_model=$(cat agent-evolution/data/agent-versions.json | ...)
```

3. **Calculate improvement**
   - Look up model scores from capability-index.yaml
   - Compare IF scores
   - Compare context windows

4. **Write to evolution data**
```json
{
  "agent": "capability-analyst",
  "timestamp": "2026-04-05T22:20:00Z",
  "type": "model_change",
  "from": "ollama-cloud/nemotron-3-super",
  "to": "qwen/qwen3.6-plus:free",
  "improvement": {
    "quality": "+23%",
    "context_window": "130K→1M",
    "if_score": "85→90"
  },
  "rationale": "Better structured output, FREE via OpenRouter"
}
```

5. **Post Gitea comment**
```markdown
## 🚀 Agent Evolution: {agent}

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Model | {old} | {new} | ⬆️ |
| IF Score | 85 | 90 | +5 |
| Quality | 64 | 79 | +23% |
| Context | 130K | 1M | +870K |

**Rationale**: {message}
```
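
The improvement calculation and comment body from the steps above can be sketched together in Python. This is an illustration only: the input dictionaries (`model`, `if_score`, `quality`, `context`) are a hypothetical shape, context windows are assumed to be token counts, and posting the result to Gitea is left out.

```python
def percent_change(before, after):
    """Relative change in percent, e.g. 64 -> 79 is about +23.4."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


def fmt_tokens(count):
    """Format a token count as 130K or 1M (whole millions only)."""
    if count >= 1_000_000 and count % 1_000_000 == 0:
        return f"{count // 1_000_000}M"
    return f"{count // 1_000}K"


def render_comment(agent, old, new, rationale):
    """Build the Markdown comment body for one model change."""
    delta = new["context"] - old["context"]
    sign = "+" if delta >= 0 else "-"
    rows = [
        ("Model", old["model"], new["model"], "⬆️"),
        ("IF Score", old["if_score"], new["if_score"],
         f"{new['if_score'] - old['if_score']:+d}"),
        ("Quality", old["quality"], new["quality"],
         f"{percent_change(old['quality'], new['quality']):+.0f}%"),
        ("Context", fmt_tokens(old["context"]), fmt_tokens(new["context"]),
         f"{sign}{fmt_tokens(abs(delta))}"),
    ]
    lines = [
        f"## 🚀 Agent Evolution: {agent}",
        "",
        "| Metric | Before | After | Change |",
        "|--------|--------|-------|--------|",
    ]
    lines += [f"| {m} | {b} | {a} | {c} |" for m, b, a, c in rows]
    lines += ["", f"**Rationale**: {rationale}"]
    return "\n".join(lines)


if __name__ == "__main__":
    old = {"model": "ollama-cloud/nemotron-3-super", "if_score": 85,
           "quality": 64, "context": 130_000}
    new = {"model": "qwen/qwen3.6-plus:free", "if_score": 90,
           "quality": 79, "context": 1_000_000}
    print(render_comment("capability-analyst", old, new, "Better structured output"))
```

Deriving the Change column from the Before and After values, instead of typing it by hand, keeps the three columns consistent with each other.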

#### Report Action

Generate comprehensive report:

````markdown
# Agent Evolution Report

## Overview

- Total agents: 28
- Model changes this month: 4
- Average quality improvement: +18%

## Recent Changes

| Date | Agent | Old Model | New Model | Impact |
|------|-------|-----------|-----------|--------|
| 2026-04-05 | capability-analyst | nemotron-3-super | qwen3.6-plus | +23% |
| 2026-04-05 | requirement-refiner | nemotron-3-super | glm-5 | +33% |
| ... | ... | ... | ... | ... |

## Performance Metrics

### Agent Scores Over Time

```
capability-analyst: 64 → 79 (+23%)
requirement-refiner: 60 → 80 (+33%)
agent-architect: 67 → 82 (+22%)
evaluator: 78 → 81 (+4%)
```

### Model Distribution

- qwen3.6-plus: 5 agents
- nemotron-3-super: 8 agents
- glm-5: 3 agents
- minimax-m2.5: 1 agent
- ...

## Recommendations

1. Consider updating history-miner to nemotron-3-super-120b
2. code-skeptic optimal with minimax-m2.5
3. ...
````

### Step 3: Update Files

After logging:

1. Update `agent-evolution/data/agent-versions.json`
2. Post comment to related Gitea issue
3. Update capability-index.yaml metrics
Shows:
- Agents with fitness < 0.70 (need optimization)
- Agents consuming > 30% of token budget (bottlenecks)
- Model upgrade recommendations
- Priority order
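
The first two items in this list can be sketched as a filter over per-agent statistics. Everything about the input shape is an assumption: the records with `name`, `fitness`, and `tokens` keys are hypothetical, the token share is measured against the total tokens used in the run (the document also speaks of a token budget), and "priority order" is read here as lowest fitness first.

```python
def flag_agents(agents, fitness_threshold=0.70, token_share_limit=0.30):
    """Return agents needing optimization and agents that are token bottlenecks.

    `agents` is a list of dicts with hypothetical keys: name, fitness, tokens.
    """
    total_tokens = sum(agent["tokens"] for agent in agents)
    # Lowest fitness first, as an assumed priority order.
    weak = sorted(
        (agent for agent in agents if agent["fitness"] < fitness_threshold),
        key=lambda agent: agent["fitness"],
    )
    bottlenecks = [
        agent["name"]
        for agent in agents
        if total_tokens > 0 and agent["tokens"] / total_tokens > token_share_limit
    ]
    return {
        "needs_optimization": [agent["name"] for agent in weak],
        "bottlenecks": bottlenecks,
    }


if __name__ == "__main__":
    run = [
        {"name": "planner", "fitness": 0.62, "tokens": 9_000},
        {"name": "lead-developer", "fitness": 0.88, "tokens": 12_000},
        {"name": "requirement-refiner", "fitness": 0.55, "tokens": 8_000},
        {"name": "evaluator", "fitness": 0.90, "tokens": 9_400},
    ]
    print(flag_agents(run))
```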

## Data Storage

### fitness-history.jsonl

```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
```

### agent-versions.json

```json
@@ -186,22 +190,6 @@ After logging:
}
```

### Gitea Issue Comments

Each evolution log posts a formatted comment:

```markdown
## 🚀 Agent Evolution Log

### {agent}
- **Model**: {old} → {new}
- **Quality**: {old_score} → {new_score} ({change}%)
- **Context**: {old_ctx} → {new_ctx}
- **Rationale**: {reason}

_This change was tracked by /evolution workflow._
```

## Integration Points

- **After `/pipeline`**: Evaluator scores logged
@@ -209,29 +197,52 @@ _This change was tracked by /evolution workflow._
- **Weekly**: Performance report generated
- **On request**: Recommendations provided

## Configuration

```yaml
# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true            # trigger after every workflow
  fitness_threshold: 0.70       # below this → auto-optimize
  max_evolution_attempts: 3     # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```
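
One way to read `fitness_threshold` and `max_evolution_attempts` together is as a bounded retry loop: judge, optimize while the score stays below the threshold, and stop after the configured number of attempts. The sketch below assumes that reading. `judge` and `optimize` are hypothetical callables standing in for the pipeline-judge and prompt-optimizer tasks, and the commit-or-revert handling from the re-test step is omitted for brevity.

```python
def run_evolution_cycle(judge, optimize, config):
    """Run judge/optimize rounds until fitness reaches the threshold.

    Returns (final_fitness, attempts_used). If evolution is disabled,
    nothing is run and (None, 0) is returned.
    """
    if not config.get("enabled", True):
        return None, 0
    threshold = config["fitness_threshold"]
    max_attempts = config["max_evolution_attempts"]

    fitness = judge()
    attempts = 0
    while fitness < threshold and attempts < max_attempts:
        optimize()
        attempts += 1
        fitness = judge()  # re-evaluate after each optimization round
    return fitness, attempts


if __name__ == "__main__":
    config = {"enabled": True, "fitness_threshold": 0.70, "max_evolution_attempts": 3}
    simulated_scores = iter([0.60, 0.65, 0.75])
    result = run_evolution_cycle(lambda: next(simulated_scores), lambda: None, config)
    print(result)
```

Capping the number of rounds guards against a loop in which repeated optimization never lifts the score above the threshold.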

## Metrics Tracked

| Metric | Source | Purpose |
|--------|--------|---------|
| IF Score | KILO_SPEC.md | Instruction Following |
| Quality Score | Research | Overall performance |
| Context Window | Model spec | Max tokens |
| Provider | Config | API endpoint |
| Cost | Pricing | Resource planning |
| SWE-bench | Research | Code benchmark |
| RULER | Research | Long-context benchmark |
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |

## Example Session

```bash
$ /evolution log capability-analyst "Updated to qwen3.6-plus for FREE tier and better IF"
$ /evolution

✅ Logged evolution for capability-analyst
📊 Quality improvement: +23%
📄 Posted comment to Issue #27
📝 Updated agent-versions.json
## Pipeline Judgment: Issue #42

**Fitness: 0.82/1.00** [PASS]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range

✅ Logged to .kilo/logs/fitness-history.jsonl
```

---

_Evolution workflow v1.0 - Track agent improvements_
*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*