- Add 9 missing agents to orchestrator task whitelist (20→28 agents) - Fix 2 broken agents: debug (gpt-oss:20b→qwen3.6-plus), release-manager (devstral-2→qwen3.6-plus) - Upgrade orchestrator (glm-5→qwen3.6-plus, IF:80→90, 128K→1M context) - Upgrade pipeline-judge (nemotron→qwen3.6-plus, IF:85→90) - Add orchestrator escalation path to 7 agents (lead-dev, sdet, skeptic, perf, security, evaluator, devops) - Create self-evolution protocol (.kilo/rules/orchestrator-self-evolution.md) - Create evolution log (.kilo/EVOLUTION_LOG.md) - Full audit of all 29 agents with verification tests
8.6 KiB
8.6 KiB
Model Evolution Proposal Analysis
Date: 2026-04-06T22:28:00+01:00 Source: APAW Agent Model Research v3 Analyst: Orchestrator
Executive Summary
Critical Issues Found 🔴
| Agent | Current Model | Status | Action Required |
|---|---|---|---|
debug (built-in) |
gpt-oss:20b | BROKEN | Fix immediately |
release-manager |
devstral-2:123b | BROKEN | Fix immediately |
Recommended Changes
| Priority | Agent | Change | Impact |
|---|---|---|---|
| P0 | debug | gpt-oss:20b → gemma4:31b | +29% quality |
| P0 | release-manager | devstral-2:123b → qwen3.6-plus:free | Fix broken agent |
| P1 | orchestrator | glm-5 → qwen3.6-plus:free | +2% quality, +3x speed |
| P1 | pipeline-judge | nemotron-3-super → qwen3.6-plus:free | +3% quality |
| P2 | evaluator | Add Groq burst for fast scoring | +6x speed |
| P3 | Others | Keep current | No change needed |
Detailed Analysis
1. CRITICAL: Debug Agent (Built-in)
Current State:
debug:
model: ollama-cloud/gpt-oss:20b
status: BROKEN
IF: ~65 (underwhelming)
Recommendation:
debug:
model: ollama-cloud/gemma4:31b
provider: ollama
IF: 83
context: 256K
features: thinking mode, vision
license: Apache 2.0
Rationale:
- gpt-oss:20b is BROKEN on Ollama Cloud
- Gemma 4 31B has IF:83 vs gpt-oss IF:65 = +29% improvement
- 256K context (vs 8K) = 32x more context
- Thinking mode enables better debugging
- Alternative: Nemotron-Cascade-2 (IF:82.9, LiveCodeBench 87.2)
Action: Apply immediately
2. CRITICAL: Release Manager
Current State:
release-manager:
model: ollama-cloud/devstral-2:123b
status: BROKEN
IF: ~75
Recommendation:
release-manager:
model: openrouter/qwen/qwen3.6-plus:free
provider: openrouter
IF: 90
score: 76★
context: 1M
cost: FREE
Rationale:
- devstral-2:123b NOT WORKING on Ollama Cloud
- Comparison matrix shows Qwen 3.6+ = 76, GLM-5 = 76 (tie)
- BUT Qwen has IF:90 vs GLM-5 IF:80 = better for git operations
- 1M context for complex changelogs
- FREE via OpenRouter
- Fallback: nemotron-3-super (IF:85, 1M context) for heavy tasks
Action: Apply immediately
3. HIGH: Orchestrator
Current State:
orchestrator:
model: ollama-cloud/glm-5
IF: 80
score: 82
context: 128K
Recommendation:
orchestrator:
model: openrouter/qwen/qwen3.6-plus:free
provider: openrouter
IF: 90
score: 84★
context: 1M
cost: FREE
Rationale:
- Orchestrator is CRITICAL agent - needs best possible IF for routing
- IF:90 vs IF:80 = +12.5% improvement in instruction following
- 1M context for complex workflow state management
- Score: 84 vs 82 = +2% overall
- +3x speed improvement
- FREE via OpenRouter
Action: Apply after critical fixes
4. HIGH: Pipeline Judge
Current State:
pipeline-judge:
model: ollama-cloud/nemotron-3-super
IF: 85
score: 78
context: 1M
Recommendation:
pipeline-judge:
model: openrouter/qwen/qwen3.6-plus:free
provider: openrouter
IF: 90
score: 80★
context: 1M
cost: FREE
Rationale:
- Judge needs IF:90 for accurate fitness scoring
- Score: 80 vs 78 = +3% improvement
- Same 1M context as Nemotron
- FREE via OpenRouter
- Keep Nemotron as fallback for heavy parsing tasks
Action: Apply after critical fixes
5. MEDIUM: Evaluator (Burst Mode)
Current State:
evaluator:
model: openrouter/qwen/qwen3.6-plus:free
IF: 90
score: 81
Recommendation: TWO-TIER APPROACH
# Primary: Qwen 3.6+ (for detailed scoring)
evaluator:
model: openrouter/qwen/qwen3.6-plus:free
IF: 90
score: 81
use: detailed_scoring
# Burst: Groq gpt-oss:120b (for fast numeric scoring)
evaluator-burst:
model: groq/gpt-oss-120b
speed: 500 t/s
IF: 72
use: quick_numeric_scoring
limit: 50-100 calls/day
Rationale:
- Qwen 3.6+ score: 81 is already optimal
- Groq gpt-oss:120b: 500 tokens/sec = +6x speed for quick scoring
- IF:72 is sufficient for numeric evaluation
- Use burst for simple: "Score: 8/10" responses
- Use Qwen for complex: full report with recommendations
Action: Optional enhancement
6. LOW: Keep Current Models
These agents are ALREADY OPTIMAL:
| Agent | Current Model | Score | Reason to Keep |
|---|---|---|---|
requirement-refiner |
glm-5 | 80★ | Best score for system analysis |
security-auditor |
nemotron-3-super | 76 | Best for 1M ctx security scans |
markdown-validator |
nemotron-3-nano | 70★ | Lightweight validation |
code-skeptic |
minimax-m2.5 | 85★ | Absolute LEADER in code review |
the-fixer |
minimax-m2.5 | 88★ | Absolute LEADER in bug fixing |
lead-developer |
qwen3-coder:480b | 92 | SWE-bench 66.5%, best coding model |
frontend-developer |
qwen3-coder:480b | 90 | Excellent for UI |
backend-developer |
qwen3-coder:480b | 91 | Excellent for API |
Action: No changes needed
Implementation Plan
Phase 1: CRITICAL Fixes (Immediately)
# 1. Fix debug agent
kilo.jsonc:
agent.debug.model: "ollama-cloud/gemma4:31b"
# 2. Fix release-manager
capability-index.yaml:
agents.release-manager.model: "openrouter/qwen/qwen3.6-plus:free"
Phase 2: HIGH Priority (Within 24h)
# 3. Upgrade orchestrator
kilo.jsonc:
agent.orchestrator.model: "openrouter/qwen/qwen3.6-plus:free"
# 4. Upgrade pipeline-judge
capability-index.yaml:
agents.pipeline-judge.model: "openrouter/qwen/qwen3.6-plus:free"
Phase 3: MEDIUM Priority (Within 1 week)
# 5. Add evaluator burst mode
# Create new agent: evaluator-burst
agents.evaluator-burst.model: "groq/gpt-oss-120b"
agents.evaluator-burst.mode: "subagent"
agents.evaluator-burst.permission.task: ["evaluator"]
Phase 4: LOW Priority (No changes)
# 6-10. Keep current models
# No action needed
Risk Assessment
High Risk
| Change | Risk | Mitigation |
|---|---|---|
| orchestrator to openrouter | Provider dependency | Keep GLM-5 as fallback |
| release-manager to openrouter | Provider dependency | Keep Nemotron as fallback |
Medium Risk
| Change | Risk | Mitigation |
|---|---|---|
| debug to gemma4 | New model | Test with sample debug tasks |
| pipeline-judge to openrouter | Provider dependency | Keep Nemotron fallback |
Low Risk
| Change | Risk | Mitigation |
|---|---|---|
| evaluator burst mode | Rate limits | Limit to 100 calls/day |
Quality Metrics
Expected Improvement
| Agent | Before IF | After IF | Δ | Before Score | After Score | Δ |
|---|---|---|---|---|---|---|
| debug | 65 | 83 | +18 | - | - | - |
| release-manager | 75 | 90 | +15 | 75 | 76 | +1 |
| orchestrator | 80 | 90 | +10 | 82 | 84 | +2 |
| pipeline-judge | 85 | 90 | +5 | 78 | 80 | +2 |
| evaluator | 90 | 90 | 0 | 81 | 81 | 0 |
Overall System Impact
- Broken agents fixed: 2 → 0
- Average IF improvement: +18% (weighted by usage)
- Average score improvement: +1.25%
- Context window improvement: 128K → 1M for key agents
Verification Checklist
Before applying changes:
- Backup current configuration
- Test new models with sample tasks
- Verify OpenRouter API key configured
- Verify Groq API key configured (for burst mode)
- Document fallback models
- Update agent-versions.json after changes
- Run sync:evolution to update dashboard
Recommendation
Apply Immediately:
- debug: gpt-oss:20b → gemma4:31b (fixes broken agent)
- release-manager: devstral-2:123b → qwen3.6-plus:free (fixes broken agent)
Apply Within 24h:
- orchestrator: glm-5 → qwen3.6-plus:free (+2% score, +10 IF)
- pipeline-judge: nemotron-3-super → qwen3.6-plus:free (+2% score)
Consider:
- evaluator: Add Groq burst mode for +6x speed
Keep Unchanged:
6-10. All other agents are already optimal
Files to Modify
Phase 1 (Critical)
# kilo.jsonc - Fix debug agent
.agent.debug.model = "ollama-cloud/gemma4:31b"
# capability-index.yaml - Fix release-manager
agents.release-manager.model = "openrouter/qwen/qwen3.6-plus:free"
Phase 2 (High)
# kilo.jsonc - Upgrade orchestrator
.agent.orchestrator.model = "openrouter/qwen/qwen3.6-plus:free"
# capability-index.yaml - Upgrade pipeline-judge
agents.pipeline-judge.model = "openrouter/qwen/qwen3.6-plus:free"
Analysis Status: ✅ COMPLETE Recommendation: Apply Phase 1 immediately (2 broken agents)