- Integrate apaw_agent_model_research_v3.html as standalone dashboard
- Add model-benchmarks.json with 32 agents, 11 scored models, 11 recommendations
- Add build-research-dashboard.ts: inject live data into template → standalone HTML
- Add rebuild-template.cjs: regenerate template from v3.html source
- Add sync-benchmarks-from-yaml.cjs: sync YAML → JSON round-trip
- Add sync-model-research.ts: apply recommendation matrix to config files
- Add model-benchmarks.schema.json and model-research.schema.json for validation
- Add bidirectional-data-flow.md architecture documentation
- Add log-execution.cjs pipeline hook
- Update capability-index.yaml: add fallback_models, failover_strategy
- Update kilo-meta.json, kilo.jsonc, KILO_SPEC.md with synced models
- Update evolution.md / research.md / self-evolution.md / evolutionary-sync.md docs
- Fix security-auditor.md: quote YAML color (#DC2626)
- Fix orchestrator.md: remove duplicate devops-engineer key
- Build research-dashboard.html (106KB standalone) + dated archive
380 lines
11 KiB
Markdown
---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---

# /evolution — Pipeline Evolution Command

Runs the automated evolution cycle on the most recent (or specified) workflow.

## Usage

```
/evolution                    # evolve last completed workflow
/evolution --issue 42         # evolve workflow for issue #42
/evolution --agent planner    # focus evolution on one agent
/evolution --dry-run          # show what would change without applying
/evolution --history          # print fitness trend chart
/evolution --fitness          # run fitness evaluation (alias for /evolve)
```

## Aliases

- `/evolve` — same as `/evolution --fitness`
- `/evolution log` — log agent model change to Gitea

## Execution

### Step 0: Model Research

```
Check if model benchmarks are stale (older than 7 days):
  READ agent-evolution/data/model-benchmarks.json → metadata.generated

  IF metadata.generated is more than 7 days old OR file is missing:
    Task(subagent_type: "capability-analyst")
      → research latest model benchmarks, IF (instruction-following) scores, availability
      → output to agent-evolution/data/model-research-latest.json
      → validate against agent-evolution/data/model-research.schema.json

Read agent-evolution/data/model-benchmarks.json
  → load heatmap scores per agent
  → load recommendations
  → identify agents where current model != best-fit model (score gap > 5)
```

This step ensures the evolution cycle works with fresh model data. If benchmarks are stale, the capability-analyst researches current model capabilities and pricing.

The research output follows the schema `agent-evolution/data/model-research.schema.json`.

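The staleness check in Step 0 can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the `isStale` helper and the `BenchmarkMetadata` type are hypothetical, and it assumes `metadata.generated` holds an ISO 8601 timestamp.

```typescript
// Hypothetical helper: decides whether benchmark data needs a refresh.
// Assumes metadata.generated is an ISO 8601 timestamp string.
const STALE_AFTER_DAYS = 7; // threshold taken from Step 0
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface BenchmarkMetadata {
  generated: string;
}

// Pass null when model-benchmarks.json does not exist.
function isStale(metadata: BenchmarkMetadata | null, now: Date): boolean {
  if (metadata === null) return true; // a missing file counts as stale
  const generatedMs = Date.parse(metadata.generated);
  if (Number.isNaN(generatedMs)) return true; // unreadable timestamp: refresh to be safe
  const ageDays = (now.getTime() - generatedMs) / MS_PER_DAY;
  return ageDays > STALE_AFTER_DAYS; // strictly older than 7 days
}
```

Passing `now` explicitly keeps the function deterministic, which makes the threshold easy to test without touching the clock.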
### Step 1: Judge (Fitness Evaluation)

```
Task(subagent_type: "pipeline-judge")
  → produces fitness report
```

### Step 2: Decide (Threshold Routing)

```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT

IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)

IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```

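The threshold routing above maps a fitness score to one of four outcomes. The sketch below expresses that mapping as a function; the `Action` names and `routeByFitness` are hypothetical labels for illustration, and only the thresholds (0.85, 0.70, 0.50) come from Step 2.

```typescript
// Hypothetical action labels for the four branches of Step 2.
type Action =
  | "none"                       // fitness >= 0.85: healthy, log and exit
  | "optimize_weak"              // 0.70 <= fitness < 0.85: optimize weakest agents
  | "optimize_all"               // 0.50 <= fitness < 0.70: optimize all flagged agents
  | "optimize_all_and_redesign"; // fitness < 0.50: also redesign the worst agent

function routeByFitness(fitness: number): Action {
  if (fitness >= 0.85) return "none";
  if (fitness >= 0.70) return "optimize_weak";
  if (fitness >= 0.50) return "optimize_all";
  return "optimize_all_and_redesign";
}
```

Ordering the checks from highest to lowest threshold means each branch only needs a lower bound, so the ranges cannot overlap.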
### Step 3: Re-test (After Optimization)

```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after

IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```

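The commit-or-revert rule in Step 3 is a strict comparison: an equal score counts as no improvement. A minimal sketch follows, assuming a hypothetical `decideAfterRetest` helper that returns the decision along with the message to print.

```typescript
// Hypothetical result shape for the Step 3 decision.
interface RetestOutcome {
  decision: "commit" | "revert";
  message: string;
}

function decideAfterRetest(fitnessBefore: number, fitnessAfter: number): RetestOutcome {
  // Strictly greater: a tie does not justify keeping the prompt changes.
  if (fitnessAfter > fitnessBefore) {
    return {
      decision: "commit",
      message: `Fitness improved: ${fitnessBefore.toFixed(2)} → ${fitnessAfter.toFixed(2)}`,
    };
  }
  return { decision: "revert", message: "No improvement. Reverting." };
}
```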
### Step 4: Log + Dashboard

Append to `.kilo/logs/fitness-history.jsonl`:

```json
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```

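Because the history file is JSONL, each entry has to be serialized onto a single line. The sketch below shows one way to build that line; the `EvolutionLogEntry` type mirrors the fields listed above, while `toJsonlLine` and the sample values are hypothetical.

```typescript
// Field names follow the Step 4 record; the type itself is an assumption.
interface EvolutionLogEntry {
  ts: string; // ISO 8601 timestamp
  issue: number;
  workflow: string;
  fitness_before: number;
  fitness_after: number;
  agents_optimized: string[];
  tokens_saved: number;
  time_saved_ms: number;
}

// JSON.stringify without an indent argument emits no raw newlines,
// so the result is exactly one JSONL record.
function toJsonlLine(entry: EvolutionLogEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Illustrative values only, not real pipeline output.
const sampleLine = toJsonlLine({
  ts: "2026-04-06T00:00:00Z",
  issue: 42,
  workflow: "feature",
  fitness_before: 0.72,
  fitness_after: 0.81,
  agents_optimized: ["planner", "requirement-refiner"],
  tokens_saved: 1200,
  time_saved_ms: 15000,
});
```

The resulting string can be appended to `.kilo/logs/fitness-history.jsonl` with whatever file API the runtime provides.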
After logging, rebuild the research dashboard:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```

This ensures the dashboard reflects any model changes that occurred during evolution.

## Subcommands

### `log` — Log Model Change

Log an agent model improvement to Gitea and evolution data.

```bash
/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
```

Steps:
1. Read current model from `.kilo/agents/{agent}.md`
2. Get previous model from `agent-evolution/data/agent-versions.json`
3. Calculate improvement (IF score, context window)
4. Write to evolution data
5. Post Gitea comment

### `report` — Generate Evolution Report

Generate a comprehensive report for one agent or for all agents:

```bash
/evolution report              # all agents
/evolution report planner      # specific agent
```

Output includes:
- Total agents
- Model changes this month
- Average quality improvement
- Recent changes table
- Performance metrics
- Model distribution
- Recommendations

### `history` — Show Fitness Trend

Print fitness trend chart:

```bash
/evolution --history
```

Output:
```
Fitness Trend (Last 30 days):

1.00 ┤
0.90 ┤ ╭─╮ ╭──╮
0.80 ┤ ╭─╯ ╰─╮ ╭─╯ ╰──╮
0.70 ┤ ╭─╯ ╰─╯ ╰──╮
0.60 ┤ │ ╰─╮
0.50 ┼─┴───────────────────────────┴──
Apr 1 Apr 8 Apr 15 Apr 22 Apr 29

Avg fitness: 0.82
Trend: ↑ improving
```

### `recommend` — Get Model Recommendations

```bash
/evolution recommend
```

Shows:
- Agents with fitness < 0.70 (need optimization)
- Agents consuming > 30% of token budget (bottlenecks)
- Model upgrade recommendations
- Priority order

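The bottleneck rule in the list above (more than 30% of the token budget) can be sketched as a share-of-total filter. This assumes "budget" means the total tokens spent across agents in one workflow, which the document does not state explicitly; `findBottlenecks` and the sample figures are hypothetical.

```typescript
// Returns the agents whose share of total tokens exceeds maxShare.
// Assumption: the "token budget" is the sum of tokens used by all agents.
function findBottlenecks(
  tokensByAgent: Record<string, number>,
  maxShare: number = 0.30,
): string[] {
  const total = Object.values(tokensByAgent).reduce((sum, tokens) => sum + tokens, 0);
  if (total === 0) return []; // nothing was spent, so nothing can be a bottleneck
  return Object.entries(tokensByAgent)
    .filter(([, tokens]) => tokens / total > maxShare)
    .map(([agent]) => agent);
}

// Illustrative token counts only.
const bottlenecks = findBottlenecks({
  "lead-developer": 31,
  planner: 29,
  "code-reviewer": 20,
  tester: 20,
});
```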
### `research` — Research Model Updates

```bash
/evolution research                          # research all models
/evolution research --agent planner          # research models for specific agent
/evolution research --provider ollama-cloud  # research specific provider
```

Steps:
1. Read current agents from `.kilo/capability-index.yaml`
2. Read existing benchmarks from `agent-evolution/data/model-benchmarks.json`
3. Fetch latest model info from provider APIs/docs
4. Score each model against each agent role (using IF-adjusted formula)
5. Generate recommendations where score improvement > 5 points
6. Output to `agent-evolution/data/model-research-latest.json`
7. Validate against `agent-evolution/data/model-research.schema.json`
8. If validation passes, update `agent-evolution/data/model-benchmarks.json`

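Step 5 of the list above (recommend only when the improvement exceeds 5 points) can be sketched as a scan over a per-agent score table. The data shapes here are assumptions, not the actual `model-benchmarks.json` schema, and `recommend` is a hypothetical name; the IF-adjusted scoring formula itself is not reproduced.

```typescript
// Assumed shapes: heatmap[agent][model] is a fit score on a 0-100 scale.
type ScoreRow = Partial<Record<string, number>>;
type Heatmap = Partial<Record<string, ScoreRow>>;

interface Recommendation {
  agent: string;
  current: string;
  recommended: string;
  delta: number;
}

function recommend(
  heatmap: Heatmap,
  currentModels: Record<string, string>,
  minGap: number = 5,
): Recommendation[] {
  const result: Recommendation[] = [];
  for (const [agent, current] of Object.entries(currentModels)) {
    const row = heatmap[agent];
    if (row === undefined) continue; // agent has no scores yet
    const currentScore = row[current];
    if (currentScore === undefined) continue; // current model was never scored
    let best = current;
    let bestScore = currentScore;
    for (const [model, score] of Object.entries(row)) {
      if (score !== undefined && score > bestScore) {
        best = model;
        bestScore = score;
      }
    }
    const delta = bestScore - currentScore;
    if (delta > minGap) result.push({ agent, current, recommended: best, delta });
  }
  return result;
}

// Scores for illustration; a +3 gap stays below the threshold.
const recommendations = recommend(
  {
    planner: { "nemotron-3-super": 80, "deepseek-v4-pro-max": 88 },
    "go-developer": { "qwen3-coder": 85, "deepseek-v4-pro-max": 88 },
  },
  { planner: "nemotron-3-super", "go-developer": "qwen3-coder" },
);
```

Skipping agents whose current model has no score avoids recommending a change based on a missing baseline.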
## Data Storage

### fitness-history.jsonl

```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
```

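Records in this format can be read back one line at a time to produce summary figures such as the average fitness and trend. The sketch below is a minimal reader under stated assumptions: `parseHistory`, `averageFitness`, and `trend` are hypothetical names, only a subset of fields is typed, and the first-versus-last trend rule is a simplification, not necessarily what the command computes.

```typescript
// Only the fields used here are typed; real records carry more.
interface FitnessRecord {
  ts: string;
  issue: number;
  fitness: number;
}

function parseHistory(jsonl: string): FitnessRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank and trailing lines
    .map((line) => JSON.parse(line) as FitnessRecord);
}

function averageFitness(records: FitnessRecord[]): number {
  if (records.length === 0) return NaN;
  return records.reduce((sum, r) => sum + r.fitness, 0) / records.length;
}

// Simplified trend: compares the last record with the first.
function trend(records: FitnessRecord[]): "improving" | "declining" | "flat" {
  if (records.length < 2) return "flat";
  const delta = records[records.length - 1].fitness - records[0].fitness;
  if (delta > 0) return "improving";
  if (delta < 0) return "declining";
  return "flat";
}

// Shortened versions of the two sample records.
const history = parseHistory(
  '{"ts":"2026-04-06T00:00:00Z","issue":42,"fitness":0.82}\n' +
    '{"ts":"2026-04-06T01:30:00Z","issue":43,"fitness":0.91}\n',
);
```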
### agent-versions.json

```json
{
  "version": "1.0",
  "agents": {
    "capability-analyst": {
      "current": {
        "model": "qwen/qwen3.6-plus:free",
        "provider": "openrouter",
        "if_score": 90,
        "quality_score": 79,
        "context_window": "1M"
      },
      "history": [
        {
          "date": "2026-04-05T22:20:00Z",
          "type": "model_change",
          "from": "ollama-cloud/nemotron-3-super",
          "to": "qwen/qwen3.6-plus:free",
          "rationale": "Better IF score, FREE via OpenRouter"
        }
      ]
    }
  }
}
```

### model-benchmarks.json

Static benchmark data extracted from research. Contains:
- Model capabilities (SWE-bench, IF scores, context windows)
- Agent × Model compatibility heatmap scores
- Groq/OpenRouter free tier availability
- Current agent configuration snapshot
- Recommendations (applied + pending)
- Impact analysis data

Path: `agent-evolution/data/model-benchmarks.json`
Schema: `agent-evolution/data/model-benchmarks.schema.json`
Refresh: when `/evolution research` runs, or automatically when stale (>7 days)

### model-research-latest.json

Latest research output from `/evolution research` or Step 0.
Dynamic file — overwritten each research cycle.

Path: `agent-evolution/data/model-research-latest.json`
Schema: `agent-evolution/data/model-research.schema.json`

## Integration Points

- **After `/pipeline`**: Evaluator scores logged
- **After model update**: Evolution logged
- **Weekly**: Performance report generated
- **On request**: Recommendations provided

## Configuration

```yaml
# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```

## Metrics Tracked

| Metric | Source | Purpose |
|--------|--------|---------|
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |
| Model IF Score | model-benchmarks.json | Prompt adherence per model |
| Model Fit Score | heatmap data | Agent-model compatibility |
| Model Availability | provider APIs | Rate limits, free tier status |
| Staleness | metadata.generated | How fresh is benchmark data |

## Example Session

```bash
$ /evolution

## Pipeline Judgment: Issue #42

**Fitness: 0.82/1.00** [PASS]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range

✅ Logged to .kilo/logs/fitness-history.jsonl
```

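The Contribution column in the table above is each metric's value multiplied by its weight. The sketch below reproduces that arithmetic, assuming the Tests, Gates, and Cost rows correspond to the `test_pass_rate`, `quality_gates_rate`, and `efficiency_score` fields of the history record; `weightedFitness` is a hypothetical name. The plain weighted sum of these three values is 0.8375, slightly above the 0.82 shown in the session, so pipeline-judge may apply rounding or adjustments that are not described here.

```typescript
interface Breakdown {
  test_pass_rate: number;     // "Tests" row
  quality_gates_rate: number; // "Gates" row
  efficiency_score: number;   // "Cost" row (assumed mapping)
}

// Weights from the example table: 50% / 25% / 25%.
const WEIGHTS: Breakdown = {
  test_pass_rate: 0.5,
  quality_gates_rate: 0.25,
  efficiency_score: 0.25,
};

// Plain weighted sum; does not model any extra adjustment by the judge.
function weightedFitness(b: Breakdown): number {
  return (
    WEIGHTS.test_pass_rate * b.test_pass_rate +
    WEIGHTS.quality_gates_rate * b.quality_gates_rate +
    WEIGHTS.efficiency_score * b.efficiency_score
  );
}

const issue42 = weightedFitness({
  test_pass_rate: 0.95,
  quality_gates_rate: 0.80,
  efficiency_score: 0.65,
});
```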
## Example: Model Research Session

```bash
$ /evolution research

## Model Research: All Agents

**Benchmarks last updated**: 2026-04-20 (7 days ago — refreshing...)

### Research Phase
  → Fetching Ollama Cloud model list... 20 models found
  → Fetching OpenRouter free tier... 3 models found
  → Fetching Groq free tier... 5 models found
  → Scoring 28 models × 36 agents... 1008 scores computed

### Top Recommendations (score gap > 5)

| Agent | Current | Score | Recommended | Score | Δ | Impact |
|-------|---------|-------|-------------|-------|---|--------|
| planner | nemotron-3-super | 80 | deepseek-v4-pro-max | 88 | +8 | high |
| go-developer | qwen3-coder | 85 | deepseek-v4-pro-max | 88 | +3 | medium |
| [built-in] debug | glm-5.1 | 88 | kimi-k2.6:cloud | 90 | +2 | high |

### Output
  ✅ agent-evolution/data/model-research-latest.json (28 models, 11 recommendations)
  ✅ agent-evolution/data/model-benchmarks.json refreshed (36 agents)

### Next Steps
  Run `/evolution` to apply recommendations and re-test
  Or `/evolution --dry-run` to preview changes
```

### Dashboard Rebuild

After model research or applying recommendations, rebuild the dashboard:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```

Output:
- `agent-evolution/research-dashboard.html` — latest interactive dashboard
- `agent-evolution/dist/research-dashboard-YYYY_MM_DD.html` — dated archive

The dashboard reads from `agent-evolution/data/model-benchmarks.json` and renders:
- Current agent-model configuration table
- Model comparison cards with SWE-bench and IF scores
- Agent × Model heatmap with IF adjustment
- Selectable recommendations with JSON export
- Before/after impact analysis

Watch mode for continuous rebuild during research:
```bash
bun run agent-evolution/scripts/build-research-dashboard.ts --watch
```
With `--watch`, the build re-runs automatically when `model-benchmarks.json` or the template changes.

---

*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*