
fb552e0020 feat: v3 optimal model assignments + fitness gate

- Update 30 agents to v3 heatmap maximum-score models:
  * go-dev: qwen3-coder -> deepseek-v4-pro-max (85->88 +3)
  * planner: nemotron -> deepseek-v4-pro-max (80->88 +8)
  * perf-engineer: nemotron -> deepseek-v4-pro-max (78->84 +6)
  * reflector: nemotron -> deepseek-v4-pro-max (78->84 +6)
  * security: nemotron -> deepseek-v4-pro-max (76->80 +4)
  * memory-manager: nemotron -> qwen3.6-plus (86->87 +1)
  * frontend: kimi-k2.5 -> minimax-m2.5 (92)
  * the-fixer: minimax-m2.5 -> kimi-k2.6 (88->90 +2)
  * browser-auto: kimi-k2.6 -> qwen3-coder (86->87 +1)
  * prompt-opt: glm-5.1 -> qwen3.6-plus (82->83 +1)
  * backend: deepseek-v3.2 -> qwen3-coder (91)
  * capability-analyst: nemotron -> glm-5.1 (85)
  * release-man: devstral-2 -> glm-5.1 (82)
  * evaluator: nemotron -> glm-5.1 (86)
  * workflow-arch: gpt-oss -> glm-5.1 (84)

- Add Model Evolution Guard:
  * fitness-gate.cjs: rejects downgrades of more than 3 points and any score below 75
  * Normalized model ID lookup (treats ":" and "-" as equivalent)
  * Diff report before any file modifications
- Update sync-benchmarks-from-yaml.cjs with fitness gate
- Sync kilo-meta.json, kilo.jsonc, .md agent files
- Rebuild research-dashboard.html (104KB, 30 agents, 11 models)

Total improvement: +105 points across 11 agents
Source: v3.html heatmap IF-adjusted composite scores
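
The fitness gate described in the commit message can be pictured with a short sketch. This is not the contents of fitness-gate.cjs, which are not shown here. The function names, the shape of the `scores` object, and the exact normalization rule are assumptions made for illustration; only the two thresholds (a drop of more than 3 points, a score below 75) come from the commit message.

```javascript
// Hypothetical sketch of a fitness gate; NOT the real fitness-gate.cjs.
const MAX_DROP = 3;   // from the commit message: reject downgrades > 3 points
const MIN_SCORE = 75; // from the commit message: reject scores < 75

// Assumed normalization: ":" and "-" are interchangeable in model IDs.
function normalizeModelId(id) {
  return id.trim().toLowerCase().replace(/:/g, "-");
}

// scores: { [agent]: { [modelId]: number } } (hypothetical shape).
function lookupScore(scores, agent, modelId) {
  const wanted = normalizeModelId(modelId);
  const perAgent = scores[agent] || {};
  for (const [id, score] of Object.entries(perAgent)) {
    if (normalizeModelId(id) === wanted) return score;
  }
  return undefined;
}

// Decide whether moving `agent` from one model to another passes the gate.
function checkChange(scores, agent, fromModel, toModel) {
  const before = lookupScore(scores, agent, fromModel);
  const after = lookupScore(scores, agent, toModel);
  if (after === undefined) {
    return { ok: false, reason: "no score for target model" };
  }
  if (after < MIN_SCORE) {
    return { ok: false, reason: `score ${after} is below ${MIN_SCORE}` };
  }
  if (before !== undefined && before - after > MAX_DROP) {
    return {
      ok: false,
      reason: `downgrade of ${before - after} points exceeds ${MAX_DROP}`,
    };
  }
  return { ok: true, delta: before === undefined ? null : after - before };
}
```

With the go-dev scores listed above (85 for qwen3-coder, 88 for deepseek-v4-pro-max), `checkChange` would accept the change and report a delta of 3. Returning a result object rather than throwing lets a caller collect every rejection into a diff report before touching any file.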

2026-04-30 08:42:10 +01:00

1.7 KiB

Executable File


Field	Value
description	Scores agent effectiveness after task completion for continuous improvement
mode	subagent
model	ollama-cloud/glm-5.1
variant	thinking
color	#047857
permission	read, glob, grep: allow; task: per-agent rules below

permission.task
*	prompt-optimizer	product-owner	orchestrator
deny	allow	allow	allow

Evaluator

Role

Performance scorer: objectively evaluate each agent's effectiveness after issue completion.

Behavior

Score objectively based on metrics, not feelings
Count iterations: how many fix loops were needed
Measure efficiency: time to completion
Identify patterns: recurring issues across runs
Be constructive: focus on improvement, not blame

Delegates

Agent	When
prompt-optimizer	Any agent scores below 7
product-owner	Process improvement suggestions

Output

Scoring

Score	Meaning
9-10	Excellent, no issues
7-8	Good, minor improvements
5-6	Acceptable, needs improvement
3-4	Poor, significant issues
1-2	Failed, critical problems
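
As an illustration of the scale above, the bands can be expressed as a small lookup. The labels are copied from the table; the function name `describeScore` and the choice to reject non-integer or out-of-range scores are assumptions, not part of the agent definition.

```javascript
// Bands copied from the Scoring table, highest first.
const BANDS = [
  { min: 9, label: "Excellent, no issues" },
  { min: 7, label: "Good, minor improvements" },
  { min: 5, label: "Acceptable, needs improvement" },
  { min: 3, label: "Poor, significant issues" },
  { min: 1, label: "Failed, critical problems" },
];

// Hypothetical helper: map an integer score (1 to 10) to its meaning.
function describeScore(score) {
  if (!Number.isInteger(score) || score < 1 || score > 10) {
    throw new RangeError(`score must be an integer from 1 to 10, got ${score}`);
  }
  // The first band whose lower bound the score reaches is the match.
  return BANDS.find((band) => score >= band.min).label;
}
```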

Handoff

If any score < 7: delegate to prompt-optimizer
Document all findings
Store scores in .kilo/logs/efficiency_score.json
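
The handoff rules above could be sketched as follows. The threshold of 7 and the path .kilo/logs/efficiency_score.json come from this file; the function names, the flat `{ agent: score }` shape, and the decision to overwrite the file with pretty-printed JSON are assumptions, since the actual file format is not specified here.

```javascript
const fs = require("fs");
const path = require("path");

const THRESHOLD = 7; // scores below 7 are delegated to prompt-optimizer

// scores: { [agentName]: number } (hypothetical shape).
// Returns the agents that should be handed to prompt-optimizer.
function agentsNeedingOptimization(scores) {
  return Object.entries(scores)
    .filter(([, score]) => score < THRESHOLD)
    .map(([agent]) => agent);
}

// Assumed storage format: one JSON object, overwritten on each run.
function storeScores(scores, file = ".kilo/logs/efficiency_score.json") {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(scores, null, 2) + "\n");
}
```

A score of exactly 7 is not delegated, which matches both the "below 7" wording in the Delegates table and the "< 7" condition in the handoff rule.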
