---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---

/evolution — Pipeline Evolution Command

Runs the automated evolution cycle on the most recent (or specified) workflow.

Usage

/evolution                     # evolve last completed workflow
/evolution --issue 42          # evolve workflow for issue #42
/evolution --agent planner     # focus evolution on one agent
/evolution --dry-run           # show what would change without applying
/evolution --history           # print fitness trend chart
/evolution --fitness           # run fitness evaluation (same as /evolve)

Aliases

  • /evolve — same as /evolution --fitness
  • /evolution log — log agent model change to Gitea

Execution

Step 0: Model Research

Check if model benchmarks are stale (older than 7 days):
  READ agent-evolution/data/model-benchmarks.json → metadata.generated
  
  IF metadata.generated is older than 7 days OR file missing:
    Task(subagent_type: "capability-analyst")
    → research latest model benchmarks, IF scores, availability
    → output to agent-evolution/data/model-research-latest.json
    → validates against agent-evolution/data/model-research.schema.json
    
  Read agent-evolution/data/model-benchmarks.json
  → load heatmap scores per agent
  → load recommendations
  → identify agents where current model != best-fit model (score gap > 5)

This step ensures the evolution cycle works with fresh model data. If benchmarks are stale, the capability-analyst researches current model capabilities and pricing.

The research output follows the schema: agent-evolution/data/model-research.schema.json
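The Step 0 staleness gate can be sketched as a small helper (illustrative only: `benchmarksAreStale` is a hypothetical function, but the `metadata.generated` field and the 7-day window follow the description above):

```typescript
import * as fs from "fs";

const STALE_AFTER_MS = 7 * 24 * 60 * 60 * 1000; // 7 days, per Step 0

// Hypothetical helper: decide whether model benchmarks need refreshing.
function benchmarksAreStale(path: string, now: Date = new Date()): boolean {
  if (!fs.existsSync(path)) return true; // missing file always forces research
  const data = JSON.parse(fs.readFileSync(path, "utf8"));
  const generated = Date.parse(data?.metadata?.generated ?? "");
  if (Number.isNaN(generated)) return true; // unparseable timestamp: treat as stale
  return now.getTime() - generated > STALE_AFTER_MS;
}
```

If this returns true, the capability-analyst research task runs; otherwise the cycle proceeds directly to loading heatmap scores.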

Step 1: Judge (Fitness Evaluation)

Task(subagent_type: "pipeline-judge")
→ produces fitness report

Step 2: Decide (Threshold Routing)

IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT

IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)

IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
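The routing above collapses to a single pure function (a minimal sketch; the thresholds are the ones listed, while the action names are illustrative):

```typescript
// Map a fitness score to the evolution actions described in Step 2.
function routeByFitness(fitness: number): string[] {
  if (fitness >= 0.85) return [];                        // healthy: log and exit
  if (fitness >= 0.7) return ["optimize-weak-agents"];   // marginal: weakest agents only
  const actions = ["optimize-all-flagged-agents"];       // underperforming
  if (fitness < 0.5) actions.push("redesign-worst-agent"); // critical: agent-architect redesign
  return actions;
}
```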

Step 3: Re-test (After Optimization)

Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after

IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."

Step 4: Log + Dashboard

Append to .kilo/logs/fitness-history.jsonl:

{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
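A minimal append helper for this log might look like the following (the record shape mirrors the example entry above; the helper name is hypothetical):

```typescript
import * as fs from "fs";

// Mirrors the fitness-history.jsonl entry shown above.
interface FitnessRecord {
  ts: string;
  issue: number;
  workflow: string;
  fitness_before: number;
  fitness_after: number;
  agents_optimized: string[];
  tokens_saved: number;
  time_saved_ms: number;
}

// JSONL convention: one JSON object per line, appended, never rewritten.
function appendFitnessRecord(logPath: string, rec: FitnessRecord): void {
  fs.appendFileSync(logPath, JSON.stringify(rec) + "\n", "utf8");
}
```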

After logging, rebuild the research dashboard:

bun run agent-evolution/scripts/build-research-dashboard.ts

This ensures the dashboard reflects any model changes that occurred during evolution.

Subcommands

log — Log Model Change

Log an agent model improvement to Gitea and evolution data.

/evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"

Steps:

  1. Read current model from .kilo/agents/{agent}.md
  2. Get previous model from agent-evolution/data/agent-versions.json
  3. Calculate improvement (IF score, context window)
  4. Write to evolution data
  5. Post Gitea comment

report — Generate Evolution Report

Generate comprehensive report for agent or all agents:

/evolution report           # all agents
/evolution report planner   # specific agent

Output includes:

  • Total agents
  • Model changes this month
  • Average quality improvement
  • Recent changes table
  • Performance metrics
  • Model distribution
  • Recommendations

history — Show Fitness Trend

Print fitness trend chart:

/evolution --history

Output:

Fitness Trend (Last 30 days):

1.00 ┤
0.90 ┤     ╭─╮     ╭──╮
0.80 ┤   ╭─╯ ╰─╮ ╭─╯  ╰──╮
0.70 ┤ ╭─╯     ╰─╯        ╰──╮
0.60 ┤ │                         ╰─╮
0.50 ┼─┴───────────────────────────┴──
     Apr 1  Apr 8  Apr 15  Apr 22  Apr 29

Avg fitness: 0.82
Trend: ↑ improving
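The two summary lines can be derived from the JSONL entries roughly like this (a sketch; the trend heuristic, comparing the mean of the first half against the second half, is an assumption rather than the documented algorithm):

```typescript
// Summarize fitness-history entries into the avg/trend lines shown above.
function summarizeFitness(history: { fitness: number }[]): { avg: number; trend: string } {
  const vals = history.map((h) => h.fitness);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mid = Math.floor(vals.length / 2);
  const delta = mean(vals.slice(mid)) - mean(vals.slice(0, mid)); // recent vs older
  const trend = delta > 0.01 ? "↑ improving" : delta < -0.01 ? "↓ declining" : "→ stable";
  return { avg: Number(mean(vals).toFixed(2)), trend };
}
```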

recommend — Get Model Recommendations

/evolution recommend

Shows:

  • Agents with fitness < 0.70 (need optimization)
  • Agents consuming > 30% of token budget (bottlenecks)
  • Model upgrade recommendations
  • Priority order

research — Research Model Updates

/evolution research            # research all models
/evolution research --agent planner  # research models for specific agent
/evolution research --provider ollama-cloud  # research specific provider

Steps:

  1. Read current agents from .kilo/capability-index.yaml
  2. Read existing benchmarks from agent-evolution/data/model-benchmarks.json
  3. Fetch latest model info from provider APIs/docs
  4. Score each model against each agent role (using IF-adjusted formula)
  5. Generate recommendations where score improvement > 5 points
  6. Output to agent-evolution/data/model-research-latest.json
  7. Validate against agent-evolution/data/model-research.schema.json
  8. If validation passes, update agent-evolution/data/model-benchmarks.json
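The exact IF-adjusted formula in step 4 is not defined in this document. One plausible shape, shown purely as an assumption, blends a coding benchmark with the IF score and then scales the blend by instruction-following so low-IF models are penalized:

```typescript
interface ModelBench {
  swe_bench: number; // 0-100
  if_score: number;  // 0-100
}

// Hypothetical IF-adjusted score: weighted capability, scaled by IF.
function ifAdjustedScore(m: ModelBench, codingWeight = 0.6): number {
  const capability = codingWeight * m.swe_bench + (1 - codingWeight) * m.if_score;
  return Math.round(capability * (m.if_score / 100));
}
```

Whatever the real formula, step 5 then compares this per-agent score against the current model and recommends a switch only when the gap exceeds 5 points.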

Data Storage

fitness-history.jsonl

{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}

agent-versions.json

{
  "version": "1.0",
  "agents": {
    "capability-analyst": {
      "current": {
        "model": "qwen/qwen3.6-plus:free",
        "provider": "openrouter",
        "if_score": 90,
        "quality_score": 79,
        "context_window": "1M"
      },
      "history": [
        {
          "date": "2026-04-05T22:20:00Z",
          "type": "model_change",
          "from": "ollama-cloud/nemotron-3-super",
          "to": "qwen/qwen3.6-plus:free",
          "rationale": "Better IF score, FREE via OpenRouter"
        }
      ]
    }
  }
}
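The structure above can be captured as TypeScript types, with a small helper to read the most recent change (the types mirror the JSON shown; `lastModelChange` is a hypothetical utility):

```typescript
interface ModelConfig {
  model: string;
  provider: string;
  if_score: number;
  quality_score: number;
  context_window: string;
}

interface AgentVersionEntry {
  current: ModelConfig;
  history: { date: string; type: string; from: string; to: string; rationale: string }[];
}

interface AgentVersions {
  version: string;
  agents: Record<string, AgentVersionEntry>;
}

// Return the target model of the most recent change, if any.
function lastModelChange(v: AgentVersions, agent: string): string | undefined {
  const hist = v.agents[agent]?.history ?? [];
  return hist.length ? hist[hist.length - 1].to : undefined;
}
```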

model-benchmarks.json

Static benchmark data extracted from research. Contains:

  • Model capabilities (SWE-bench, IF scores, context windows)
  • Agent × Model compatibility heatmap scores
  • Groq/OpenRouter free tier availability
  • Current agent configuration snapshot
  • Recommendations (applied + pending)
  • Impact analysis data

Path: agent-evolution/data/model-benchmarks.json
Schema: agent-evolution/data/model-benchmarks.schema.json
Refresh: when /evolution research runs, or automatically when stale (>7 days)

model-research-latest.json

Latest research output from /evolution research or Step 0. Dynamic file — overwritten each research cycle.

Path: agent-evolution/data/model-research-latest.json
Schema: agent-evolution/data/model-research.schema.json

Integration Points

  • After /pipeline: Evaluator scores logged
  • After model update: Evolution logged
  • Weekly: Performance report generated
  • On request: Recommendations provided

Configuration

# In capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300

Metrics Tracked

| Metric | Source | Purpose |
|--------|--------|---------|
| Fitness Score | pipeline-judge | Overall pipeline health |
| Test Pass Rate | bun test | Code quality |
| Quality Gates | build/lint/typecheck | Standards compliance |
| Token Cost | pipeline logs | Resource efficiency |
| Wall-Clock Time | pipeline logs | Speed |
| Agent ROI | history analysis | Cost/benefit |
| Model IF Score | model-benchmarks.json | Prompt adherence per model |
| Model Fit Score | heatmap data | Agent-model compatibility |
| Model Availability | provider APIs | Rate limits, free tier status |
| Staleness | metadata.generated | Freshness of benchmark data |

Example Session

$ /evolution

## Pipeline Judgment: Issue #42

**Fitness: 0.82/1.00** [PASS]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests  | 95% (45/47) | 50% | 0.475 |
| Gates  | 80% (4/5) | 25% | 0.200 |
| Cost   | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Verdict:** PASS - within acceptable range

✅ Logged to .kilo/logs/fitness-history.jsonl

Example: Model Research Session

$ /evolution research

## Model Research: All Agents

**Benchmarks last updated**: 2026-04-20 (7 days ago — refreshing...)

### Research Phase
→ Fetching Ollama Cloud model list... 20 models found
→ Fetching OpenRouter free tier... 3 models found
→ Fetching Groq free tier... 5 models found
→ Scoring 28 models × 36 agents... 1008 scores computed

### Top Recommendations (score gap > 5)

| Agent | Current | Score | Recommended | Score | Δ | Impact |
|-------|---------|-------|-------------|-------|---|--------|
| planner | nemotron-3-super | 80 | deepseek-v4-pro-max | 88 | +8 | high |
| go-developer | qwen3-coder | 85 | deepseek-v4-pro-max | 88 | +3 | medium |
| [built-in] debug | glm-5.1 | 88 | kimi-k2.6:cloud | 90 | +2 | high |

### Output
✅ agent-evolution/data/model-research-latest.json (28 models, 11 recommendations)
✅ agent-evolution/data/model-benchmarks.json refreshed (36 agents)

### Next Steps
Run `/evolution` to apply recommendations and re-test
Or `/evolution --dry-run` to preview changes

### Dashboard Rebuild

After model research or applying recommendations, rebuild the dashboard:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts
```

Output:

  • agent-evolution/research-dashboard.html — latest interactive dashboard
  • agent-evolution/dist/research-dashboard-YYYY_MM_DD.html — dated archive

The dashboard reads from agent-evolution/data/model-benchmarks.json and renders:

  • Current agent-model configuration table
  • Model comparison cards with SWE-bench and IF scores
  • Agent × Model heatmap with IF adjustment
  • Selectable recommendations with JSON export
  • Before/after impact analysis

Watch mode for continuous rebuild during research:

```bash
bun run agent-evolution/scripts/build-research-dashboard.ts --watch
```

With --watch, the build re-runs automatically whenever model-benchmarks.json or the template changes.


---

*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*