---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---
# /evolution — Pipeline Evolution Command
Runs the automated evolution cycle on the most recent (or specified) workflow.
## Usage
```
/evolution # evolve last completed workflow
/evolution --issue 42 # evolve workflow for issue #42
/evolution --agent planner # focus evolution on one agent
/evolution --dry-run # show what would change without applying
/evolution --history # print fitness trend chart
/evolution --fitness # run fitness evaluation (alias for /evolve)
```
## Aliases
- `/evolve` — same as `/evolution --fitness`
- `/evolution log` — log agent model change to Gitea
## Execution
### Step 0: Model Research
```
Check whether model benchmarks are stale (older than 7 days):
  READ agent-evolution/data/model-benchmarks.json → metadata.generated
  IF metadata.generated is older than 7 days OR the file is missing:
    Task(subagent_type: "capability-analyst")
      → research latest model benchmarks, instruction-following (IF) scores, availability
      → output to agent-evolution/data/model-research-latest.json
      → validate against agent-evolution/data/model-research.schema.json
  READ agent-evolution/data/model-benchmarks.json
    → load heatmap scores per agent
    → load recommendations
    → identify agents where current model != best-fit model (score gap > 5)
```
This step ensures the evolution cycle works with fresh model data. If the benchmarks are stale,
the capability-analyst researches current model capabilities and pricing.
The research output must conform to the schema `agent-evolution/data/model-research.schema.json`.
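
The staleness check in Step 0 could be sketched roughly as below. The 7-day limit comes from the step itself; the function name, its signature, and the choice to treat an unparseable timestamp as stale are assumptions made for illustration, not the project's actual implementation.

```typescript
// Hypothetical helper for the Step 0 staleness rule (names are illustrative).
const MAX_BENCHMARK_AGE_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isBenchmarkStale(
  generated: string | undefined, // value of metadata.generated, or undefined if the file is missing
  now: Date,
  maxAgeDays: number = MAX_BENCHMARK_AGE_DAYS,
): boolean {
  // A missing file or timestamp is handled like stale data.
  if (generated === undefined) return true;
  const generatedMs = Date.parse(generated);
  // Assumption: a timestamp that cannot be parsed also forces a refresh.
  if (isNaN(generatedMs)) return true;
  // "Older than 7 days" is read as strictly more than 7 days.
  return now.getTime() - generatedMs > maxAgeDays * MS_PER_DAY;
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test against fixed dates.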
### Step 1: Judge (Fitness Evaluation)
```
Task(subagent_type: "pipeline-judge")
  → produces fitness report
```
### Step 2: Decide (Threshold Routing)
```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT
ELSE IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)
ELSE:  # fitness < 0.70
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```
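
As a rough illustration, the threshold routing above can be expressed as a pure function. The thresholds (0.85, 0.70, 0.50) are taken from Step 2; the `RoutingDecision` type and the function name are hypothetical and exist only for this sketch.

```typescript
// Hypothetical result type: which agents go to the prompt-optimizer,
// and whether the agent-architect should also redesign the worst agent.
interface RoutingDecision {
  optimize: "none" | "weak_agents" | "all_flagged_agents";
  redesignWorstAgent: boolean;
}

function routeByFitness(fitness: number): RoutingDecision {
  // Healthy pipeline: log and exit, nothing to optimize.
  if (fitness >= 0.85) {
    return { optimize: "none", redesignWorstAgent: false };
  }
  // Marginal pipeline: optimize only the lowest-scoring agents.
  if (fitness >= 0.70) {
    return { optimize: "weak_agents", redesignWorstAgent: false };
  }
  // Underperforming pipeline: optimize every flagged agent, and below 0.50
  // additionally request a redesign of the worst agent.
  return { optimize: "all_flagged_agents", redesignWorstAgent: fitness < 0.50 };
}
```

Checking the bands from highest to lowest means each score falls into exactly one branch, so a healthy run can never also trigger optimization.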
### Step 3: Re-test (After Optimization)
```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after
IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```
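
The commit-or-revert rule in Step 3 is small enough to state directly. This is a minimal sketch with a hypothetical function name; it mirrors only the comparison shown above, so an unchanged score is reverted.

```typescript
// Hypothetical helper: decide what to do with prompt changes after the re-test.
function resolveRetest(fitnessBefore: number, fitnessAfter: number): "commit" | "revert" {
  // Only a strict improvement keeps the new prompts; a tie is reverted.
  return fitnessAfter > fitnessBefore ? "commit" : "revert";
}
```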
### Step 4: Log + Dashboard
Append to `.kilo/logs/fitness-history.jsonl`:
```json
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```
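
One way to turn that record into a JSONL line is sketched below. The field names follow the template above; the `EvolutionLogEntry` interface and `toJsonlLine` function are illustrative names, and the actual file append is left to the caller.

```typescript
// Field names follow the Step 4 template; the interface itself is hypothetical.
interface EvolutionLogEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness_before: number;
  fitness_after: number;
  agents_optimized: string[];
  tokens_saved: number;
  time_saved_ms: number;
}

function toJsonlLine(entry: EvolutionLogEntry): string {
  // JSON.stringify without an indent argument emits no raw newlines,
  // so every record occupies exactly one line of the .jsonl file.
  return JSON.stringify(entry) + "\n";
}
```

Appending the returned string to `.kilo/logs/fitness-history.jsonl` keeps the file parseable one line at a time.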
After logging, rebuild the research dashboard:
```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```
This ensures the dashboard reflects any model changes that occurred during evolution.
## Subcommands
### `log` — Log Model Change
Log an agent model improvement to Gitea and evolution data.
```bash
/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
```
Steps:
1. Read current model from `.kilo/agents/{agent}.md`
2. Get previous model from `agent-evolution/data/agent-versions.json`
3. Calculate improvement (IF score, context window)
4. Write to evolution data
5. Post Gitea comment
### `report` — Generate Evolution Report
Generate a comprehensive report for one agent or for all agents:
```bash
/evolution report # all agents
/evolution report planner # specific agent
```
Output includes:
- Total agents
- Model changes this month
- Average quality improvement
- Recent changes table
- Performance metrics
- Model distribution
- Recommendations
### `history` — Show Fitness Trend
Print fitness trend chart:
```bash
/evolution --history
```
Output:
```
Fitness Trend (Last 30 days):
1.00 ┤
0.90 ┤ ╭─╮ ╭──╮
0.80 ┤ ╭─╯ ╰─╮ ╭─╯ ╰──╮
0.70 ┤ ╭─╯ ╰─╯ ╰──╮
0.60 ┤ │ ╰─╮
0.50 ┼─┴───────────────────────────┴──
Apr 1 Apr 8 Apr 15 Apr 22 Apr 29
Avg fitness: 0.82
Trend: ↑ improving
```
### `recommend` — Get Model Recommendations
```bash
/evolution recommend
```
Shows:
- Agents with fitness < 0.70 (need optimization)
- Agents consuming > 30% of token budget (bottlenecks)
- Model upgrade recommendations
- Priority order
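
The first two filters in that list could look roughly like this. The 0.70 and 30% thresholds come from the list above; the `AgentStats` shape, the function name, and the choice to measure the share against a caller-supplied token budget (rather than the tokens actually consumed) are assumptions.

```typescript
// Hypothetical per-agent summary used only for this sketch.
interface AgentStats {
  name: string;
  fitness: number;
  tokens: number;
}

function flagAgents(
  agents: AgentStats[],
  tokenBudget: number, // assumption: the denominator for the 30% rule
): { needOptimization: string[]; bottlenecks: string[] } {
  // Agents scoring below 0.70 need optimization.
  const needOptimization = agents
    .filter((a) => a.fitness < 0.7)
    .map((a) => a.name);
  // Agents using more than 30% of the budget count as bottlenecks.
  const bottlenecks = tokenBudget > 0
    ? agents.filter((a) => a.tokens / tokenBudget > 0.3).map((a) => a.name)
    : [];
  return { needOptimization, bottlenecks };
}
```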
### `research` — Research Model Updates
```bash
/evolution research # research all models
/evolution research --agent planner # research models for specific agent
/evolution research --provider ollama-cloud # research specific provider
```
Steps:
1. Read current agents from `.kilo/capability-index.yaml`
2. Read existing benchmarks from `agent-evolution/data/model-benchmarks.json`
3. Fetch latest model info from provider APIs/docs
4. Score each model against each agent role (using IF-adjusted formula)
5. Generate recommendations where score improvement > 5 points
6. Output to `agent-evolution/data/model-research-latest.json`
7. Validate against `agent-evolution/data/model-research.schema.json`
8. If validation passes, update `agent-evolution/data/model-benchmarks.json`
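
Step 5 of this list (recommend only where the improvement exceeds 5 points) might be sketched as follows. It assumes the scores have already been computed; the IF-adjusted formula itself is not reproduced here, and the type and function names are hypothetical.

```typescript
// Hypothetical recommendation record for this sketch.
interface ModelRecommendation {
  agent: string;
  current: string;
  recommended: string;
  delta: number;
}

function recommendModelChanges(
  currentModel: Record<string, string>,           // agent -> model in use
  scores: Record<string, Record<string, number>>, // scores[agent][model], precomputed
  minGap: number = 5,
): ModelRecommendation[] {
  const out: ModelRecommendation[] = [];
  for (const agent of Object.keys(currentModel)) {
    const row = scores[agent];
    const current = currentModel[agent];
    // Skip agents whose current model has no score to compare against.
    if (row === undefined || row[current] === undefined) continue;
    // Find the best-scoring model for this agent.
    let best = current;
    for (const model of Object.keys(row)) {
      if (row[model] > row[best]) best = model;
    }
    const delta = row[best] - row[current];
    // Recommend only when the gap strictly exceeds the threshold.
    if (delta > minGap) {
      out.push({ agent, current, recommended: best, delta });
    }
  }
  return out;
}
```

With a threshold of 5, an agent whose best alternative is only 3 points ahead produces no recommendation.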
## Data Storage
### fitness-history.jsonl
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
```
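
Reading this file back, for example to compute the average shown by `/evolution --history`, could be done along these lines. This is a minimal sketch: the function name is hypothetical, and it assumes every non-empty line is a JSON object with a numeric `fitness` field, as in the sample above.

```typescript
// Hypothetical helper: average the "fitness" field across a JSONL document.
function averageFitnessFromJsonl(jsonl: string): number {
  const values = jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => (JSON.parse(line) as { fitness: number }).fitness);
  // No records: there is no meaningful average.
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```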
### agent-versions.json
```json
{
"version": "1.0",
"agents": {
"capability-analyst": {
"current": {
"model": "qwen/qwen3.6-plus:free",
"provider": "openrouter",
"if_score": 90,
"quality_score": 79,
"context_window": "1M"
},
"history": [
{
"date": "2026-04-05T22:20:00Z",
"type": "model_change",
"from": "ollama-cloud/nemotron-3-super",
"to": "qwen/qwen3.6-plus:free",
"rationale": "Better IF score, FREE via OpenRouter"
}
]
}
}
}
```
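
A model change such as the one recorded above could be applied to an agent entry as sketched below. The field names mirror the JSON sample; the interfaces and the `recordModelChange` function are illustrative and not part of the project's scripts.

```typescript
// Types mirror the agent-versions.json sample; the names are hypothetical.
interface ModelSnapshot {
  model: string;
  provider: string;
  if_score: number;
  quality_score: number;
  context_window: string;
}

interface ModelHistoryEntry {
  date: string;
  type: "model_change";
  from: string;
  to: string;
  rationale: string;
}

interface AgentVersionRecord {
  current: ModelSnapshot;
  history: ModelHistoryEntry[];
}

function recordModelChange(
  record: AgentVersionRecord,
  next: ModelSnapshot,
  rationale: string,
  date: string,
): AgentVersionRecord {
  const entry: ModelHistoryEntry = {
    date,
    type: "model_change",
    from: record.current.model,
    to: next.model,
    rationale,
  };
  // Return a new record instead of mutating the input, so a failed write
  // leaves the previously loaded data untouched.
  return { current: next, history: record.history.concat([entry]) };
}
```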
### model-benchmarks.json
Static benchmark data extracted from research. Contains:
- Model capabilities (SWE-bench, IF scores, context windows)
- Agent × Model compatibility heatmap scores
- Groq/OpenRouter free tier availability
- Current agent configuration snapshot
- Recommendations (applied + pending)
- Impact analysis data
Path: `agent-evolution/data/model-benchmarks.json`
Schema: `agent-evolution/data/model-benchmarks.schema.json`
Refresh: whenever `/evolution research` runs, or automatically once the data is stale (older than 7 days)
### model-research-latest.json
Latest research output from `/evolution research` or Step 0.
Dynamic file — overwritten each research cycle.
Path: `agent-evolution/data/model-research-latest.json`
Schema: `agent-evolution/data/model-research.schema.json`
## Integration Points
- **After `/pipeline`**: Evaluator scores logged
- **After model update**: Evolution logged
- **Weekly**: Performance report generated
- **On request**: Recommendations provided
## Configuration
```yaml
# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true # trigger after every workflow
  fitness_threshold: 0.70 # below this → auto-optimize
  max_evolution_attempts: 3 # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```
## Metrics Tracked
| Metric | Source | Purpose |
|--------|--------|---------|
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |
| Model IF Score | model-benchmarks.json | Prompt adherence per model |
| Model Fit Score | heatmap data | Agent-model compatibility |
| Model Availability | provider APIs | Rate limits, free tier status |
| Staleness | metadata.generated | How fresh is benchmark data |
## Example Session
```bash
$ /evolution
## Pipeline Judgment: Issue #42
**Fitness: 0.82/1.00** [PASS]
| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |
**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range
✅ Logged to .kilo/logs/fitness-history.jsonl
```
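
The Contribution column in that table is each metric multiplied by its weight, which the sketch below reproduces. The weights (50% / 25% / 25%) are read off the table and the 0.65 efficiency score comes from the matching `fitness-history.jsonl` sample; the function and type names are hypothetical. Note that the plain weighted sum of these inputs is 0.8375, slightly above the reported 0.82, so the pipeline-judge presumably applies a term that this example does not show.

```typescript
// Weights as shown in the example session table.
const FITNESS_WEIGHTS = { tests: 0.5, gates: 0.25, cost: 0.25 };

interface FitnessBreakdown {
  test_pass_rate: number;
  quality_gates_rate: number;
  efficiency_score: number;
}

function weightedContributions(b: FitnessBreakdown) {
  const tests = FITNESS_WEIGHTS.tests * b.test_pass_rate;
  const gates = FITNESS_WEIGHTS.gates * b.quality_gates_rate;
  const cost = FITNESS_WEIGHTS.cost * b.efficiency_score;
  // The plain sum; the real judge may add adjustments not modeled here.
  return { tests, gates, cost, total: tests + gates + cost };
}

// Inputs for issue #42: 45/47 tests (rounded to 0.95), 4/5 gates, efficiency 0.65.
const issue42 = weightedContributions({
  test_pass_rate: 0.95,
  quality_gates_rate: 0.8,
  efficiency_score: 0.65,
});
```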
## Example: Model Research Session
```bash
$ /evolution research
## Model Research: All Agents
**Benchmarks last updated**: 2026-04-20 (7 days ago — refreshing...)
### Research Phase
→ Fetching Ollama Cloud model list... 20 models found
→ Fetching OpenRouter free tier... 3 models found
→ Fetching Groq free tier... 5 models found
→ Scoring 28 models × 36 agents... 1008 scores computed
### Top Recommendations (score gap > 5)
| Agent | Current | Score | Recommended | Score | Δ | Impact |
|-------|---------|-------|-------------|-------|---|--------|
| planner | nemotron-3-super | 80 | deepseek-v4-pro-max | 88 | +8 | high |
| go-developer | qwen3-coder | 85 | deepseek-v4-pro-max | 88 | +3 | medium |
| [built-in] debug | glm-5.1 | 88 | kimi-k2.6:cloud | 90 | +2 | high |
### Output
✅ agent-evolution/data/model-research-latest.json (28 models, 11 recommendations)
✅ agent-evolution/data/model-benchmarks.json refreshed (36 agents)
### Next Steps
Run `/evolution` to apply recommendations and re-test
Or `/evolution --dry-run` to preview changes
```
### Dashboard Rebuild
After model research or applying recommendations, rebuild the dashboard:
```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```
Output:
- `agent-evolution/research-dashboard.html`: latest interactive dashboard
- `agent-evolution/dist/research-dashboard-YYYY_MM_DD.html`: dated archive

The dashboard reads from `agent-evolution/data/model-benchmarks.json` and renders:
- Current agent-model configuration table
- Model comparison cards with SWE-bench and IF scores
- Agent × Model heatmap with IF adjustment
- Selectable recommendations with JSON export
- Before/after impact analysis

Watch mode for continuous rebuild during research:
```bash
bun run agent-evolution/scripts/build-research-dashboard.ts --watch
```
With `--watch`, the build re-runs automatically whenever `model-benchmarks.json` or the template changes.
---
*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*