- Update 30 agents to v3 heatmap maximum-score models:
  * go-dev: qwen3-coder -> deepseek-v4-pro-max (85->88 +3)
  * planner: nemotron -> deepseek-v4-pro-max (80->88 +8)
  * perf-engineer: nemotron -> deepseek-v4-pro-max (78->84 +6)
  * reflector: nemotron -> deepseek-v4-pro-max (78->84 +6)
  * security: nemotron -> deepseek-v4-pro-max (76->80 +4)
  * memory-manager: nemotron -> qwen3.6-plus (86->87 +1)
  * frontend: kimi-k2.5 -> minimax-m2.5 (92)
  * the-fixer: minimax-m2.5 -> kimi-k2.6 (88->90 +2)
  * browser-auto: kimi-k2.6 -> qwen3-coder (86->87 +1)
  * prompt-opt: glm-5.1 -> qwen3.6-plus (82->83 +1)
  * backend: deepseek-v3.2 -> qwen3-coder (91)
  * capability-analyst: nemotron -> glm-5.1 (85)
  * release-man: devstral-2 -> glm-5.1 (82)
  * evaluator: nemotron -> glm-5.1 (86)
  * workflow-arch: gpt-oss -> glm-5.1 (84)
- Add Model Evolution Guard:
  * fitness-gate.cjs: rejects downgrades >3 points or <75 score
  * Normalized model ID lookup (: vs -)
  * Diff report before any file modifications
- Update sync-benchmarks-from-yaml.cjs with fitness gate
- Sync kilo-meta.json, kilo.jsonc, .md agent files
- Rebuild research-dashboard.html (104KB, 30 agents, 11 models)

Total improvement: +105 points across 11 agents
Source: v3.html heatmap IF-adjusted composite scores
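The gate rule described in the commit message could look roughly like the sketch below. It is not the contents of fitness-gate.cjs: the function names, the return shape, and the example model ID are hypothetical, and it assumes the rule means "reject a swap that loses more than 3 points, or whose new score is below 75".

```javascript
// Hypothetical sketch of a fitness gate. Only the two thresholds come from
// the commit message; names and data shapes are assumptions.
const MAX_DROP = 3;   // reject downgrades of more than 3 points
const MIN_SCORE = 75; // reject any candidate scoring below 75

// Normalize IDs so that a ":" separator and a "-" separator compare equal,
// e.g. a made-up "some-model:120b" and "some-model-120b".
function normalizeModelId(id) {
  return id.trim().toLowerCase().replace(/:/g, "-");
}

// Decide whether replacing a model scoring oldScore with one scoring
// newScore is acceptable for the given agent.
function checkFitness(agent, oldScore, newScore) {
  if (newScore < MIN_SCORE) {
    return { agent, ok: false, reason: `score ${newScore} is below ${MIN_SCORE}` };
  }
  const drop = oldScore - newScore;
  if (drop > MAX_DROP) {
    return { agent, ok: false, reason: `downgrade of ${drop} points exceeds ${MAX_DROP}` };
  }
  return { agent, ok: true, reason: "accepted" };
}
```

Under these assumptions, `checkFitness("go-dev", 85, 88)` is accepted, while a swap from 90 to 86 is rejected as a 4-point downgrade.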
1.7 KiB
Executable File
| description | mode | model | variant | color | permission |
|---|---|---|---|---|---|
| Scores agent effectiveness after task completion for continuous improvement | subagent | ollama-cloud/glm-5.1 | thinking | #047857 | |
Evaluator
Role
Performance scorer: objectively evaluate each agent's effectiveness after issue completion.
Behavior
- Score objectively based on metrics, not feelings
- Count iterations: how many fix loops were needed
- Measure efficiency: time to completion
- Identify patterns: recurring issues across runs
- Be constructive: focus on improvement, not blame
Delegates
| Agent | When |
|---|---|
| prompt-optimizer | Any agent scores below 7 |
| product-owner | Process improvement suggestions |
Output
Scoring
| Score | Meaning |
|---|---|
| 9-10 | Excellent, no issues |
| 7-8 | Good, minor improvements |
| 5-6 | Acceptable, needs improvement |
| 3-4 | Poor, significant issues |
| 1-2 | Failed, critical problems |
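As an illustration, the scoring bands above can be expressed as a small lookup. This is a sketch only: `scoreMeaning` is a hypothetical helper, and it assumes scores are whole numbers from 1 to 10, as the table's ranges suggest.

```javascript
// Hypothetical helper mapping a 1-10 score to the band labels in the table.
// Assumes integer scores; anything else is treated as invalid input.
function scoreMeaning(score) {
  if (!Number.isInteger(score) || score < 1 || score > 10) {
    throw new RangeError(`score must be an integer from 1 to 10, got ${score}`);
  }
  if (score >= 9) return "Excellent, no issues";
  if (score >= 7) return "Good, minor improvements";
  if (score >= 5) return "Acceptable, needs improvement";
  if (score >= 3) return "Poor, significant issues";
  return "Failed, critical problems";
}
```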
Handoff
- If any score < 7: delegate to prompt-optimizer
- Document all findings
- Store scores in .kilo/logs/efficiency_score.json
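The handoff rule can be sketched as follows. The function name, the input shape (agent name mapped to score), and the layout of the JSON record are all assumptions; the source does not specify the schema of efficiency_score.json, and the sketch only builds the record as a string rather than writing the file.

```javascript
// Hypothetical sketch of the handoff decision.
// `scores` is assumed to be an object such as { "go-dev": 8, "planner": 6 }.
function buildHandoff(scores) {
  // Collect every agent whose score falls below the threshold of 7.
  const lowScorers = Object.entries(scores)
    .filter(([, score]) => score < 7)
    .map(([agent]) => agent);

  return {
    // Delegate to prompt-optimizer only when at least one agent scored below 7.
    delegateTo: lowScorers.length > 0 ? "prompt-optimizer" : null,
    lowScorers,
    // Assumed record layout for the score log; the real schema may differ.
    record: JSON.stringify({ scores }, null, 2),
  };
}
```

Keeping the decision separate from file I/O makes the threshold easy to test; the caller can then persist `record` wherever the scores are stored.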