- markdown-validator: deepseek-v4-pro-max → nemotron-3-nano (90% cost cut)
- release-manager: glm-5.1 → kimi-k2.6 (+2 matrix, 1M context for diffs)
- capability-analyst: glm-5.1 → deepseek-v4-pro-max (+4 matrix, 1M ctx)
- browser-automation: qwen3-coder → deepseek-v4-flash (3× faster inference)
- history-miner: nemotron-3-super → qwen3.5-122b (+14 IF, 12.4M pulls)
Agent Model Research Report — 2026-05-24
Executive Summary
13 model changes are recommended across 38 agents: 2 CRITICAL (prompt-optimizer and memory-manager run on qwen3.6-plus, which is not in Ollama Cloud, so they must migrate), 4 HIGH, 5 MEDIUM, and 2 LOW.
9 benchmarked models are assigned to zero agents, which is wasted potential.
Composite Score Formula
composite = (IF_score * 0.5) + (SWE_bench * 0.3) + (context_kb / 1000 * 0.2)
| Model | IF | SWE | Ctx(K) | Composite | Pulls | Assigned |
|---|---|---|---|---|---|---|
| kimi-k2.6 | 91 | 80.2 | 1000 | 69.76 | 259.7K | 7 agents |
| deepseek-v4-pro-max | 89 | 80.6 | 1000 | 68.88 | 71.6K | 4 agents |
| kimi-k2.5 | 90 | 78.0 | 256 | 68.45 | 293.2K | 0 |
| deepseek-v4-flash | 86 | 79.0 | 1000 | 66.90 | 84.4K | 0 |
| minimax-m2.5 | 82 | 80.2 | 128 | 65.09 | 2.2M | 2 agents |
| qwen3-coder-480b | 88 | 66.5 | 1000 | 64.15 | N/A | 7 agents |
| minimax-m2.7 | 80 | 78.0 | 128 | 63.43 | 2.2M | 0 |
| nemotron-3-super | 78 | 60.5 | 1000 | 57.35 | 2.4M | 2 agents |
| glm-5.1 | 90 | null | 128 | 45.03* | 2.2M | 8 agents |
| glm-5 | 90 | null | 128 | 45.03* | 2.3M | 0 |
| qwen3.5-122b | 92 | null | 128 | 46.03* | 12.4M | 0 |
| gemma4-27b | 85 | null | 128 | 42.53* | 10.1M | 0 |
| devstral-2 | 80 | null | 128 | 40.03* | 223.2K | 0 |
| devstral-small-2 | 75 | null | 128 | 37.53* | 838.8K | 0 |
| nemotron-3-nano | 68 | null | 128 | 34.03* | 453K | 0 |
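* SWE score missing, so the SWE term contributes 0 and the composite is artificially low. An estimated SWE of about 75 would add roughly 20-25 points (75 × 0.3 = 22.5).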
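The formula and the missing-SWE footnote can be checked with a minimal sketch. The function name `composite` and the `swe_estimate` parameter are illustrative, not part of any existing tool. The sketch assumes the Ctx(K) column is the `context_kb` input, and that a `null` SWE counts as 0 unless an estimate is supplied.

```python
from typing import Optional

# Weights taken from the composite formula above.
W_IF, W_SWE, W_CTX = 0.5, 0.3, 0.2


def composite(if_score: float, swe_bench: Optional[float], context_k: float,
              swe_estimate: float = 0.0) -> float:
    """Composite score; a missing SWE value falls back to swe_estimate."""
    swe = swe_bench if swe_bench is not None else swe_estimate
    return round(if_score * W_IF + swe * W_SWE + context_k / 1000 * W_CTX, 2)


# Inputs copied from the table rows: IF, SWE, Ctx(K).
print(composite(91, 80.2, 1000))        # kimi-k2.6
print(composite(92, None, 128))         # qwen3.5-122b, SWE missing
print(composite(92, None, 128, 75.0))   # qwen3.5-122b with the SWE~75 estimate
```

Under these assumptions the first two calls reproduce the table values (69.76 and 46.03), and the third gives 68.53, in line with the ~68.5 estimate quoted for qwen3.5-122b.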
Concentration Risks
| Model | Agents | Risk |
|---|---|---|
| glm-5.1 | 8 | 8 agents depend on a model with no SWE score |
| kimi-k2.6 | 7 | Highest-quality model is over-concentrated |
| qwen3-coder-480b | 7 | SWE=66.5, below deepseek-v4-flash (79) |
| deepseek-v4-pro-max | 4 | Expensive (49B active parameters) |
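A concentration check like the one behind this table could be sketched as follows. The 15% threshold (`MAX_SHARE`) and the function name `concentrated` are hypothetical; the report does not state how the risk cut-off was chosen. Agent counts come from the composite table and the total of 38 from the executive summary.

```python
TOTAL_AGENTS = 38  # from the executive summary

# Agent counts per model, copied from the composite table.
assigned = {
    "glm-5.1": 8,
    "kimi-k2.6": 7,
    "qwen3-coder-480b": 7,
    "deepseek-v4-pro-max": 4,
    "minimax-m2.5": 2,
    "nemotron-3-super": 2,
}

# Hypothetical threshold: flag any model serving 15% or more of all agents.
MAX_SHARE = 0.15


def concentrated(counts, total, max_share):
    """Return (model, share) pairs at or above max_share, largest first."""
    flagged = [(m, n / total) for m, n in counts.items() if n / total >= max_share]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for model, share in concentrated(assigned, TOTAL_AGENTS, MAX_SHARE):
    print(f"{model}: {share:.1%} of agents")
```

With this threshold the sketch flags glm-5.1, kimi-k2.6, and qwen3-coder-480b. deepseek-v4-pro-max is not flagged by agent count; the table lists it for cost rather than concentration.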
Idle Models (0 agents assigned — wasted potential)
| Model | Composite (~ = estimated, * = SWE missing) | Pulls | Why Idle |
|---|---|---|---|
| qwen3.5-122b | ~68.5* | 12.4M | Newest, highest IF=92, needs integration |
| gemma4-27b | ~62* | 10.1M | Multimodal, needs A/B for coding |
| deepseek-v4-flash | 66.90 | 84.4K | Best efficiency, 13B active |
| minimax-m2.7 | 63.43 | 2.2M | Self-evolving, could suit meta-agents |
| glm-5 | ~67* | 2.3M | Superseded by glm-5.1 |
| devstral-2 | 40.03* | 223.2K | Code exploration, alternative for coding |
| devstral-small-2 | 37.53* | 838.8K | Lightweight, IF too low |
| kimi-k2.5 | 68.45 | 293.2K | Superseded by kimi-k2.6 |
| nemotron-3-nano | 34.03* | 453K | Ultra-lightweight for simple tasks |
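The idle list can be derived mechanically from the composite table, which is also one way the idle-model dashboard alert could be fed. This is a sketch only: the `IDLE:` message format and the helper names `parse_pulls` and `idle_alerts` are hypothetical, and ordering by pulls is an assumption about what a reviewer would want to see first.

```python
# (model, agents assigned, pulls) copied from the composite table.
models = [
    ("kimi-k2.6", 7, "259.7K"), ("deepseek-v4-pro-max", 4, "71.6K"),
    ("kimi-k2.5", 0, "293.2K"), ("deepseek-v4-flash", 0, "84.4K"),
    ("minimax-m2.5", 2, "2.2M"), ("qwen3-coder-480b", 7, "N/A"),
    ("minimax-m2.7", 0, "2.2M"), ("nemotron-3-super", 2, "2.4M"),
    ("glm-5.1", 8, "2.2M"), ("glm-5", 0, "2.3M"),
    ("qwen3.5-122b", 0, "12.4M"), ("gemma4-27b", 0, "10.1M"),
    ("devstral-2", 0, "223.2K"), ("devstral-small-2", 0, "838.8K"),
    ("nemotron-3-nano", 0, "453K"),
]


def parse_pulls(text):
    """Convert '12.4M' or '453K' to a number; anything else (e.g. 'N/A') is 0."""
    scale = {"K": 1e3, "M": 1e6}
    if text[-1] in scale:
        return float(text[:-1]) * scale[text[-1]]
    return 0.0


def idle_alerts(rows):
    """Build one alert line per model with zero agents, most-pulled first."""
    idle = [(name, pulls) for name, agents, pulls in rows if agents == 0]
    idle.sort(key=lambda row: parse_pulls(row[1]), reverse=True)
    return [f"IDLE: {name} ({pulls} pulls, 0 agents)" for name, pulls in idle]


for line in idle_alerts(models):
    print(line)
```

Sorting by pulls puts qwen3.5-122b and gemma4-27b at the top, which matches the two models the report singles out for A/B testing.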
Recommendations
CRITICAL
| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| prompt-optimizer | qwen3.6-plus (not Ollama Cloud) | qwen3.5-122b (IF=92) | +10 | Must migrate: qwen3.6-plus is not in Ollama Cloud. qwen3.5-122b has the highest IF (92) and 12.4M pulls. |
| memory-manager | qwen3.6-plus (not Ollama Cloud) | deepseek-v4-pro-max (IF=89, 1M ctx) | +1 | Must migrate: qwen3.6-plus is not in Ollama Cloud. memory-manager needs long context (1M), which deepseek-v4-pro-max provides. |
HIGH
| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| system-analyst | glm-5.1 (matrix=82) | deepseek-v4-pro-max (matrix=88) | +6 | IF=89, SWE=80.6, 1M context for architecture docs. glm-5.1 has no SWE score. |
| evaluator | glm-5.1 (matrix=78) | qwen3.5-122b (IF=92, est=82) | +4 | IF-critical role. qwen3.5-122b has highest IF=92. 12.4M pulls. |
| pipeline-judge | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=84) | +8 | Needs long context (pipeline logs). kimi-k2.6 IF=91, SWE=80.2, 1M ctx. |
| workflow-architect | glm-5.1 (matrix=76) | qwen3.5-122b (est=80) | +4 | High IF for YAML/structured output. qwen3.5-122b IF=92. |
MEDIUM
| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| markdown-validator | deepseek-v4-pro-max (matrix=68, expensive) | nemotron-3-nano (matrix=70, cheap, 4B) | +2 | A 49B-active model is overkill for markdown validation. nemotron-3-nano is cheaper and has a higher matrix score. |
| release-manager | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=78) | +2 | 1M context for large git diffs. IF=91 vs 90. |
| capability-analyst | glm-5.1 (matrix=78) | deepseek-v4-pro-max (matrix=82) | +4 | 1M context for capability-index analysis. |
| visual-tester | qwen3-coder-480b (matrix=82, no vision) | kimi-k2.6 (matrix=82, vision) | +0 (capabilities+) | Same matrix score, but kimi-k2.6 accepts image input. Multimodal advantage. |
| browser-automation | qwen3-coder-480b (matrix=87, 35B active) | deepseek-v4-flash (IF=86, 13B active, 1M ctx) | ~-5 matrix (trade-off) | 3× faster inference. 1M context for complex DOM. |
LOW
| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| history-miner | nemotron-3-super (IF=78, composite=57.35) | qwen3.5-122b (IF=92, 12.4M pulls) | +14 IF | Lowest model quality in pipeline. Easy upgrade. |
| plan (built-in) | nemotron-3-super (IF=78) | deepseek-v4-pro-max (IF=89, matrix=88) | +11 IF | Align with planner subagent. |
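For rows that give a matrix score on both sides, the Delta column is the target score minus the current score. A small consistency check, assuming that reading of the column, might look like this. The tuple layout and the name `delta_mismatches` are illustrative only.

```python
# (agent, from_matrix, to_matrix, stated_delta) copied from the HIGH and
# MEDIUM tables. Rows whose target is marked "est" use the estimated score.
recommendations = [
    ("system-analyst", 82, 88, 6),
    ("evaluator", 78, 82, 4),            # target is an estimate
    ("pipeline-judge", 76, 84, 8),
    ("workflow-architect", 76, 80, 4),   # target is an estimate
    ("markdown-validator", 68, 70, 2),
    ("release-manager", 76, 78, 2),
    ("capability-analyst", 78, 82, 4),
    ("visual-tester", 82, 82, 0),
]


def delta_mismatches(rows):
    """Return agents whose stated delta differs from to_matrix - from_matrix."""
    return [agent for agent, src, dst, stated in rows if dst - src != stated]


print(delta_mismatches(recommendations))
```

An empty list means every stated delta agrees with the two scores in its row. Rows that quote IF deltas or omit a matrix score (the CRITICAL and LOW tables, browser-automation) are left out because the inputs are not comparable.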
Data Gaps
| Model | Missing | Impact |
|---|---|---|
| qwen3.5-122b | SWE-bench | Coding ability unconfirmed. Safe for IF-only roles. |
| gemma4-27b | SWE-bench | Newest release. Needs A/B test for coding. |
| glm-5.1 | SWE-bench | 8 agents depend on unverified coding capability. |
| devstral-2 | SWE-bench | Code model with no coding benchmark, which is risky. |
| nemotron-3-nano | SWE-bench | Not needed: lightweight tasks only. |
Recently Updated Models (2 days old)
- qwen3.5-122b (2026-05-22): 12.4M pulls since launch
- gemma4-27b (2026-05-22): 10.1M pulls since launch, announced as "frontier at each size"
Next Actions
- Apply CRITICAL: migrate prompt-optimizer + memory-manager
- Apply HIGH: system-analyst + evaluator + pipeline-judge + workflow-architect
- Run pipeline A/B test on qwen3.5-122b and deepseek-v4-flash
- Fill data gaps: collect SWE-bench for qwen3.5-122b and gemma4-27b
- Update dashboard to show idle model alerts