---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---

# /evolution - Pipeline Evolution Command

Runs the automated evolution cycle on the most recent (or specified) workflow.

## Usage

```
/evolution                    # evolve last completed workflow
/evolution --issue 42         # evolve workflow for issue #42
/evolution --agent planner    # focus evolution on one agent
/evolution --dry-run          # show what would change without applying
/evolution --history          # print fitness trend chart
/evolution --fitness          # run fitness evaluation (alias for /evolve)
```

## Aliases

- `/evolve` - same as `/evolution --fitness`
- `/evolution log` - log agent model change to Gitea

## Execution

### Step 0: Model Research

```
Check if model benchmarks are stale (older than 7 days):

  READ agent-evolution/data/model-benchmarks.json → metadata.generated

  IF metadata.generated is older than 7 days OR file missing:
    Task(subagent_type: "capability-analyst")
    → research latest model benchmarks, instruction-following (IF) scores, availability
    → output to agent-evolution/data/model-research-latest.json
    → validate against agent-evolution/data/model-research.schema.json

  READ agent-evolution/data/model-benchmarks.json
    → load heatmap scores per agent
    → load recommendations
    → identify agents where current model != best-fit model (score gap > 5)
```

This step ensures the evolution cycle works with fresh model data. If the benchmarks are stale, the capability-analyst researches current model capabilities and pricing.

The research output follows the schema `agent-evolution/data/model-research.schema.json`.

### Step 1: Judge (Fitness Evaluation)

```
Task(subagent_type: "pipeline-judge")
→ produces fitness report
```

### Step 2: Decide (Threshold Routing)

```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT

IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)

IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```

### Step 3: Re-test (After Optimization)

```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after

IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```

### Step 4: Log + Dashboard

Append to `.kilo/logs/fitness-history.jsonl`:

```json
{
  "ts": "",
  "issue": ,
  "workflow": "",
  "fitness_before": ,
  "fitness_after": ,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": ,
  "time_saved_ms":
}
```

After logging, rebuild the research dashboard:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```

This ensures the dashboard reflects any model changes that occurred during evolution.

## Subcommands

### `log` - Log Model Change

Log an agent model improvement to Gitea and to the evolution data.

```bash
/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
```

Steps:

1. Read current model from `.kilo/agents/{agent}.md`
2. Get previous model from `agent-evolution/data/agent-versions.json`
3. Calculate improvement (IF score, context window)
4. Write to evolution data
5. Post Gitea comment

### `report` - Generate Evolution Report

Generate a comprehensive report for one agent or for all agents:

```bash
/evolution report           # all agents
/evolution report planner   # specific agent
```

Output includes:

- Total agents
- Model changes this month
- Average quality improvement
- Recent changes table
- Performance metrics
- Model distribution
- Recommendations

### `history` - Show Fitness Trend

Print the fitness trend chart:

```bash
/evolution --history
```

Output:

```
Fitness Trend (Last 30 days):

1.00 ┤
0.90 ┤     ╭─╮     ╭──╮
0.80 ┤   ╭─╯ ╰─╮ ╭─╯  ╰──╮
0.70 ┤ ╭─╯     ╰─╯       ╰──╮
0.60 ┤ │                    ╰─╮
0.50 ┼─┴───────────────────────────┴──
     Apr 1   Apr 8   Apr 15  Apr 22  Apr 29

Avg fitness: 0.82
Trend: ↑ improving
```

### `recommend` - Get Model Recommendations

```bash
/evolution recommend
```

Shows:

- Agents with fitness < 0.70 (need optimization)
- Agents consuming > 30% of token budget (bottlenecks)
- Model upgrade recommendations
- Priority order

### `research` - Research Model Updates

```bash
/evolution research                          # research all models
/evolution research --agent planner          # research models for specific agent
/evolution research --provider ollama-cloud  # research specific provider
```

Steps:

1. Read current agents from `.kilo/capability-index.yaml`
2. Read existing benchmarks from `agent-evolution/data/model-benchmarks.json`
3. Fetch latest model info from provider APIs/docs
4. Score each model against each agent role (using IF-adjusted formula)
5. Generate recommendations where score improvement > 5 points
6. Output to `agent-evolution/data/model-research-latest.json`
7. Validate against `agent-evolution/data/model-research.schema.json`
8. If validation passes, update `agent-evolution/data/model-benchmarks.json`

## Data Storage

### fitness-history.jsonl

```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
```

### agent-versions.json

```json
{
  "version": "1.0",
  "agents": {
    "capability-analyst": {
      "current": {
        "model": "qwen/qwen3.6-plus:free",
        "provider": "openrouter",
        "if_score": 90,
        "quality_score": 79,
        "context_window": "1M"
      },
      "history": [
        {
          "date": "2026-04-05T22:20:00Z",
          "type": "model_change",
          "from": "ollama-cloud/nemotron-3-super",
          "to": "qwen/qwen3.6-plus:free",
          "rationale": "Better IF score, FREE via OpenRouter"
        }
      ]
    }
  }
}
```

### model-benchmarks.json

Static benchmark data extracted from research. Contains:

- Model capabilities (SWE-bench, IF scores, context windows)
- Agent × Model compatibility heatmap scores
- Groq/OpenRouter free tier availability
- Current agent configuration snapshot
- Recommendations (applied + pending)
- Impact analysis data

Path: `agent-evolution/data/model-benchmarks.json`
Schema: `agent-evolution/data/model-benchmarks.schema.json`
Refresh: when `/evolution research` runs, or automatically when stale (>7 days)

### model-research-latest.json

Latest research output from `/evolution research` or Step 0. Dynamic file, overwritten each research cycle.
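The staleness rule used in Step 0 and in the refresh policy for `model-benchmarks.json` (data older than 7 days, or a missing file) could be checked along the following lines. This is a minimal sketch, not the project's implementation: `isStale` and the `BenchmarkFile` shape are hypothetical names, and it assumes `metadata.generated` holds an ISO 8601 timestamp, which this document does not confirm.

```typescript
// Hypothetical helper illustrating the "stale after 7 days" rule.
// Assumption: metadata.generated is an ISO 8601 timestamp string.
const STALE_AFTER_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface BenchmarkFile {
  metadata?: { generated?: string };
}

// Pass null when the benchmarks file could not be read (missing file).
function isStale(file: BenchmarkFile | null, now: Date = new Date()): boolean {
  const generated = file?.metadata?.generated;
  if (!generated) return true; // missing file or missing timestamp

  const generatedMs = Date.parse(generated);
  if (Number.isNaN(generatedMs)) return true; // unparseable timestamp

  // Stale only when strictly older than the 7-day window.
  return now.getTime() - generatedMs > STALE_AFTER_DAYS * MS_PER_DAY;
}
```

Treating a missing or unparseable timestamp as stale errs on the side of re-running research, which matches the "OR file missing" branch in Step 0.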
Path: `agent-evolution/data/model-research-latest.json`
Schema: `agent-evolution/data/model-research.schema.json`

## Integration Points

- **After `/pipeline`**: Evaluator scores logged
- **After model update**: Evolution logged
- **Weekly**: Performance report generated
- **On request**: Recommendations provided

## Configuration

```yaml
# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```

## Metrics Tracked

| Metric | Source | Purpose |
|--------|--------|---------|
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |
| Model IF Score | model-benchmarks.json | Prompt adherence per model |
| Model Fit Score | heatmap data | Agent-model compatibility |
| Model Availability | provider APIs | Rate limits, free tier status |
| Staleness | metadata.generated | Freshness of benchmark data |

## Example Session

```bash
$ /evolution

## Pipeline Judgment: Issue #42

**Fitness: 0.82/1.00** [PASS]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range

✅ Logged to .kilo/logs/fitness-history.jsonl
```

## Example: Model Research Session

```bash
$ /evolution research

## Model Research: All Agents

**Benchmarks last updated**: 2026-04-20 (7 days ago - refreshing...)

### Research Phase
→ Fetching Ollama Cloud model list... 20 models found
→ Fetching OpenRouter free tier... 3 models found
→ Fetching Groq free tier... 5 models found
→ Scoring 28 models × 36 agents... 1008 scores computed

### Top Recommendations (score gap > 5)

| Agent | Current | Score | Recommended | Score | Δ | Impact |
|-------|---------|-------|-------------|-------|---|--------|
| planner | nemotron-3-super | 80 | deepseek-v4-pro-max | 88 | +8 | high |
| go-developer | qwen3-coder | 85 | deepseek-v4-pro-max | 88 | +3 | medium |
| [built-in] debug | glm-5.1 | 88 | kimi-k2.6:cloud | 90 | +2 | high |

### Output
✅ agent-evolution/data/model-research-latest.json (28 models, 11 recommendations)
✅ agent-evolution/data/model-benchmarks.json refreshed (36 agents)

### Next Steps
Run `/evolution` to apply recommendations and re-test
Or `/evolution --dry-run` to preview changes
```

### Dashboard Rebuild

After model research or after applying recommendations, rebuild the dashboard:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```

Output:

- `agent-evolution/research-dashboard.html` - latest interactive dashboard
- `agent-evolution/dist/research-dashboard-YYYY_MM_DD.html` - dated archive

The dashboard reads from `agent-evolution/data/model-benchmarks.json` and renders:

- Current agent-model configuration table
- Model comparison cards with SWE-bench and IF scores
- Agent × Model heatmap with IF adjustment
- Selectable recommendations with JSON export
- Before/after impact analysis

Watch mode for continuous rebuild during research:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts --watch
```

In `--watch` mode, the build re-runs automatically when `model-benchmarks.json` or the template changes.

---

*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*
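For reference, the threshold routing from Step 2 and the commit-or-revert rule from Step 3 can be expressed as two small pure functions. This is a sketch under stated assumptions: the function names and the `PlannedTask` type are hypothetical, and only the thresholds (0.85, 0.70, 0.50), subagent names, and targets are taken from this document.

```typescript
// Hypothetical types and names; the thresholds, subagents, and targets
// mirror the Step 2 and Step 3 pseudocode in this document.
interface PlannedTask {
  subagent: "prompt-optimizer" | "agent-architect";
  target: "weak_agents" | "all_flagged_agents" | "worst_agent";
}

// Step 2: map a fitness score to the subagent tasks that would be dispatched.
function planOptimization(fitness: number): PlannedTask[] {
  if (fitness >= 0.85) return []; // healthy: log and exit, nothing to optimize
  if (fitness >= 0.70) {
    // marginal: optimize only the weakest agents
    return [{ subagent: "prompt-optimizer", target: "weak_agents" }];
  }
  // underperforming: optimize every flagged agent
  const tasks: PlannedTask[] = [
    { subagent: "prompt-optimizer", target: "all_flagged_agents" },
  ];
  if (fitness < 0.50) {
    // severe: additionally redesign the worst agent
    tasks.push({ subagent: "agent-architect", target: "worst_agent" });
  }
  return tasks;
}

// Step 3: keep the prompt changes only on a strict improvement.
function retestDecision(before: number, after: number): "commit" | "revert" {
  return after > before ? "commit" : "revert";
}
```

Keeping the routing free of side effects makes the thresholds easy to unit-test separately from the `Task(...)` dispatch; an equal score on re-test falls into the revert branch, as in the Step 3 pseudocode.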