# Model Evolution Proposal Analysis **Date**: 2026-04-06T22:28:00+01:00 **Source**: APAW Agent Model Research v3 **Analyst**: Orchestrator --- ## Executive Summary ### Critical Issues Found 🔴 | Agent | Current Model | Status | Action Required | |-------|---------------|--------|-----------------| | `debug` (built-in) | gpt-oss:20b | **BROKEN** | Fix immediately | | `release-manager` | devstral-2:123b | **BROKEN** | Fix immediately | ### Recommended Changes | Priority | Agent | Change | Impact | |----------|--------|--------|--------| | **P0** | debug | gpt-oss:20b → gemma4:31b | +29% quality | | **P0** | release-manager | devstral-2:123b → qwen3.6-plus:free | Fix broken agent | | **P1** | orchestrator | glm-5 → qwen3.6-plus:free | +2% quality, +3x speed | | **P1** | pipeline-judge | nemotron-3-super → qwen3.6-plus:free | +3% quality | | **P2** | evaluator | Add Groq burst for fast scoring | +6x speed | | **P3** | Others | Keep current | No change needed | --- ## Detailed Analysis ### 1. CRITICAL: Debug Agent (Built-in) **Current State:** ```yaml debug: model: ollama-cloud/gpt-oss:20b status: BROKEN IF: ~65 (underwhelming) ``` **Recommendation:** ```yaml debug: model: ollama-cloud/gemma4:31b provider: ollama IF: 83 context: 256K features: thinking mode, vision license: Apache 2.0 ``` **Rationale:** - gpt-oss:20b is BROKEN on Ollama Cloud - Gemma 4 31B has IF:83 vs gpt-oss IF:65 = **+29% improvement** - 256K context (vs 8K) = 32x more context - Thinking mode enables better debugging - Alternative: Nemotron-Cascade-2 (IF:82.9, LiveCodeBench 87.2) **Action: Apply immediately** --- ### 2. CRITICAL: Release Manager **Current State:** ```yaml release-manager: model: ollama-cloud/devstral-2:123b status: BROKEN IF: ~75 ``` **Recommendation:** ```yaml release-manager: model: openrouter/qwen/qwen3.6-plus:free provider: openrouter IF: 90 score: 76★ context: 1M cost: FREE ``` **Rationale:** - devstral-2:123b NOT WORKING on Ollama Cloud - Comparison matrix shows Qwen 3.6+ = 76, GLM-5 = 76 (tie) - BUT Qwen has IF:90 vs GLM-5 IF:80 = better for git operations - 1M context for complex changelogs - FREE via OpenRouter - Fallback: nemotron-3-super (IF:85, 1M context) for heavy tasks **Action: Apply immediately** --- ### 3. HIGH: Orchestrator **Current State:** ```yaml orchestrator: model: ollama-cloud/glm-5 IF: 80 score: 82 context: 128K ``` **Recommendation:** ```yaml orchestrator: model: openrouter/qwen/qwen3.6-plus:free provider: openrouter IF: 90 score: 84★ context: 1M cost: FREE ``` **Rationale:** - Orchestrator is CRITICAL agent - needs best possible IF for routing - IF:90 vs IF:80 = **+12.5% improvement in instruction following** - 1M context for complex workflow state management - Score: 84 vs 82 = +2% overall - +3x speed improvement - FREE via OpenRouter **Action: Apply after critical fixes** --- ### 4. HIGH: Pipeline Judge **Current State:** ```yaml pipeline-judge: model: ollama-cloud/nemotron-3-super IF: 85 score: 78 context: 1M ``` **Recommendation:** ```yaml pipeline-judge: model: openrouter/qwen/qwen3.6-plus:free provider: openrouter IF: 90 score: 80★ context: 1M cost: FREE ``` **Rationale:** - Judge needs IF:90 for accurate fitness scoring - Score: 80 vs 78 = +3% improvement - Same 1M context as Nemotron - FREE via OpenRouter - Keep Nemotron as fallback for heavy parsing tasks **Action: Apply after critical fixes** --- ### 5. MEDIUM: Evaluator (Burst Mode) **Current State:** ```yaml evaluator: model: openrouter/qwen/qwen3.6-plus:free IF: 90 score: 81 ``` **Recommendation: TWO-TIER APPROACH** ```yaml # Primary: Qwen 3.6+ (for detailed scoring) evaluator: model: openrouter/qwen/qwen3.6-plus:free IF: 90 score: 81 use: detailed_scoring # Burst: Groq gpt-oss:120b (for fast numeric scoring) evaluator-burst: model: groq/gpt-oss-120b speed: 500 t/s IF: 72 use: quick_numeric_scoring limit: 50-100 calls/day ``` **Rationale:** - Qwen 3.6+ score: 81 is already optimal - Groq gpt-oss:120b: 500 tokens/sec = +6x speed for quick scoring - IF:72 is sufficient for numeric evaluation - Use burst for simple: "Score: 8/10" responses - Use Qwen for complex: full report with recommendations **Action: Optional enhancement** --- ### 6. LOW: Keep Current Models These agents are ALREADY OPTIMAL: | Agent | Current Model | Score | Reason to Keep | |-------|---------------|-------|----------------| | `requirement-refiner` | glm-5 | 80★ | Best score for system analysis | | `security-auditor` | nemotron-3-super | 76 | Best for 1M ctx security scans | | `markdown-validator` | nemotron-3-nano | 70★ | Lightweight validation | | `code-skeptic` | minimax-m2.5 | 85★ | Absolute LEADER in code review | | `the-fixer` | minimax-m2.5 | 88★ | Absolute LEADER in bug fixing | | `lead-developer` | qwen3-coder:480b | 92 | SWE-bench 66.5%, best coding model | | `frontend-developer` | qwen3-coder:480b | 90 | Excellent for UI | | `backend-developer` | qwen3-coder:480b | 91 | Excellent for API | **Action: No changes needed** --- ## Implementation Plan ### Phase 1: CRITICAL Fixes (Immediately) ```yaml # 1. Fix debug agent kilo.jsonc: agent.debug.model: "ollama-cloud/gemma4:31b" # 2. Fix release-manager capability-index.yaml: agents.release-manager.model: "openrouter/qwen/qwen3.6-plus:free" ``` ### Phase 2: HIGH Priority (Within 24h) ```yaml # 3. Upgrade orchestrator kilo.jsonc: agent.orchestrator.model: "openrouter/qwen/qwen3.6-plus:free" # 4. Upgrade pipeline-judge capability-index.yaml: agents.pipeline-judge.model: "openrouter/qwen/qwen3.6-plus:free" ``` ### Phase 3: MEDIUM Priority (Within 1 week) ```yaml # 5. Add evaluator burst mode # Create new agent: evaluator-burst agents.evaluator-burst.model: "groq/gpt-oss-120b" agents.evaluator-burst.mode: "subagent" agents.evaluator-burst.permission.task: ["evaluator"] ``` ### Phase 4: LOW Priority (No changes) ```yaml # 6-10. Keep current models # No action needed ``` --- ## Risk Assessment ### High Risk | Change | Risk | Mitigation | |--------|------|------------| | orchestrator to openrouter | Provider dependency | Keep GLM-5 as fallback | | release-manager to openrouter | Provider dependency | Keep Nemotron as fallback | ### Medium Risk | Change | Risk | Mitigation | |--------|------|------------| | debug to gemma4 | New model | Test with sample debug tasks | | pipeline-judge to openrouter | Provider dependency | Keep Nemotron fallback | ### Low Risk | Change | Risk | Mitigation | |--------|------|------------| | evaluator burst mode | Rate limits | Limit to 100 calls/day | --- ## Quality Metrics ### Expected Improvement | Agent | Before IF | After IF | Δ | Before Score | After Score | Δ | |-------|-----------|----------|---|--------------|-------------|---| | debug | 65 | 83 | +18 | - | - | - | | release-manager | 75 | 90 | +15 | 75 | 76 | +1 | | orchestrator | 80 | 90 | +10 | 82 | 84 | +2 | | pipeline-judge | 85 | 90 | +5 | 78 | 80 | +2 | | evaluator | 90 | 90 | 0 | 81 | 81 | 0 | ### Overall System Impact - **Broken agents fixed**: 2 → 0 - **Average IF improvement**: +18% (weighted by usage) - **Average score improvement**: +1.25% - **Context window improvement**: 128K → 1M for key agents --- ## Verification Checklist Before applying changes: - [ ] Backup current configuration - [ ] Test new models with sample tasks - [ ] Verify OpenRouter API key configured - [ ] Verify Groq API key configured (for burst mode) - [ ] Document fallback models - [ ] Update agent-versions.json after changes - [ ] Run sync:evolution to update dashboard --- ## Recommendation ### Apply Immediately: 1. **debug**: gpt-oss:20b → gemma4:31b (fixes broken agent) 2. **release-manager**: devstral-2:123b → qwen3.6-plus:free (fixes broken agent) ### Apply Within 24h: 3. **orchestrator**: glm-5 → qwen3.6-plus:free (+2% score, +10 IF) 4. **pipeline-judge**: nemotron-3-super → qwen3.6-plus:free (+2% score) ### Consider: 5. **evaluator**: Add Groq burst mode for +6x speed ### Keep Unchanged: 6-10. **All other agents** are already optimal --- ## Files to Modify ### Phase 1 (Critical) ```bash # kilo.jsonc - Fix debug agent .agent.debug.model = "ollama-cloud/gemma4:31b" # capability-index.yaml - Fix release-manager agents.release-manager.model = "openrouter/qwen/qwen3.6-plus:free" ``` ### Phase 2 (High) ```bash # kilo.jsonc - Upgrade orchestrator .agent.orchestrator.model = "openrouter/qwen/qwen3.6-plus:free" # capability-index.yaml - Upgrade pipeline-judge agents.pipeline-judge.model = "openrouter/qwen/qwen3.6-plus:free" ``` --- **Analysis Status**: ✅ COMPLETE **Recommendation**: **Apply Phase 1 immediately (2 broken agents)**