APAW/agent-evolution/data/model-research-2026-05-24.md
Deploy Bot 047a87afb4 feat(agent-models): apply MEDIUM+LOW priority model migrations
- markdown-validator: deepseek-v4-pro-max → nemotron-3-nano (90% cost cut)
- release-manager: glm-5.1 → kimi-k2.6 (+2 matrix, 1M context for diffs)
- capability-analyst: glm-5.1 → deepseek-v4-pro-max (+4 matrix, 1M ctx)
- browser-automation: qwen3-coder → deepseek-v4-flash (3× faster inference)
- history-miner: nemotron-3-super → qwen3.5-122b (+14 IF, 12.4M pulls)
2026-05-25 15:07:17 +01:00

# Agent Model Research Report — 2026-05-24
## Executive Summary
13 model changes are recommended across 38 agents: 2 CRITICAL (prompt-optimizer and memory-manager run on a model that is not on Ollama Cloud and must migrate), 4 HIGH, 5 MEDIUM, and 2 LOW.
9 benchmarked models are assigned to zero agents, which is wasted potential.
## Composite Score Formula
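`composite = (IF_score * 0.5) + (SWE_bench * 0.3) + (context_kb / 1000 * 0.2)`

Here `context_kb` is the context window in thousands of tokens (the Ctx(K) column), so even a 1M-token window adds only 0.2 points. A `null` SWE score is counted as 0.
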
| Model | IF | SWE | Ctx(K) | Composite | Pulls | Assigned |
|-------|-----|------|--------|-----------|-------|----------|
| kimi-k2.6 | 91 | 80.2 | 1000 | **69.76** | 259.7K | 7 agents |
| deepseek-v4-pro-max | 89 | 80.6 | 1000 | **68.88** | 71.6K | 4 agents |
| kimi-k2.5 | 90 | 78.0 | 256 | **68.45** | 293.2K | **0** |
| deepseek-v4-flash | 86 | 79.0 | 1000 | **66.90** | 84.4K | **0** |
| minimax-m2.5 | 82 | 80.2 | 128 | **65.09** | 2.2M | 2 agents |
| qwen3-coder-480b | 88 | 66.5 | 1000 | **64.15** | N/A | 7 agents |
| minimax-m2.7 | 80 | 78.0 | 128 | **63.43** | 2.2M | **0** |
| nemotron-3-super | 78 | 60.5 | 1000 | **57.35** | 2.4M | 2 agents |
| glm-5.1 | 90 | null | 128 | 45.03* | 2.2M | 8 agents |
| glm-5 | 90 | null | 128 | 45.03* | 2.3M | **0** |
| qwen3.5-122b | 92 | null | 128 | 46.03* | **12.4M** | **0** |
| gemma4-27b | 85 | null | 128 | 42.53* | **10.1M** | **0** |
| devstral-2 | 80 | null | 128 | 40.03* | 223.2K | **0** |
| devstral-small-2 | 75 | null | 128 | 37.53* | 838.8K | **0** |
| nemotron-3-nano | 68 | null | 128 | 34.03* | 453K | **0** |

\* SWE score missing and counted as 0, so the composite is artificially low. Estimated uplift: +20-25 points with SWE ≈ 75.
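
The composite values in the table can be reproduced with a short script. This is a minimal sketch, not the tooling that produced the report: the function name `composite` and the choice to treat a missing SWE score as 0 are assumptions, although the second one matches the starred values in the table.

```python
from typing import Optional

# Weights copied from the composite formula in this report.
W_IF, W_SWE, W_CTX = 0.5, 0.3, 0.2


def composite(if_score: float, swe_bench: Optional[float], context_k: float) -> float:
    """Return the composite score, counting a missing SWE result (None) as 0."""
    swe = 0.0 if swe_bench is None else swe_bench
    return round(if_score * W_IF + swe * W_SWE + context_k / 1000 * W_CTX, 2)


# (IF, SWE, Ctx(K)) rows taken from the table above.
rows = {
    "kimi-k2.6": (91, 80.2, 1000),
    "kimi-k2.5": (90, 78.0, 256),
    "glm-5.1": (90, None, 128),
}

for model, (if_score, swe, ctx) in rows.items():
    print(f"{model}: {composite(if_score, swe, ctx):.2f}")
```

Because the context term is divided by 1000 before weighting, the score is driven almost entirely by IF and SWE. A model without an SWE result therefore loses up to 30 points relative to its peers.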
## Concentration Risks
| Model | Agents | Risk |
|-------|--------|------|
| glm-5.1 | 8 | All agents on model with NO SWE score |
| kimi-k2.6 | 7 | Highest-quality model over-concentrated |
| qwen3-coder-480b | 7 | SWE=66.5 below deepseek-v4-flash (79) |
| deepseek-v4-pro-max | 4 | Expensive (49B active) |
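
A concentration check like the one behind this table could be sketched as below. The per-model counts come from the Assigned column of the composite table. The threshold `MAX_AGENTS_PER_MODEL = 3` and the function name `concentration_risks` are hypothetical, chosen only so that the output matches the four models flagged above.

```python
from typing import Dict, List, Tuple

TOTAL_AGENTS = 38  # from the executive summary

# Agents per model, copied from the Assigned column of the composite table.
assigned: Dict[str, int] = {
    "kimi-k2.6": 7,
    "deepseek-v4-pro-max": 4,
    "minimax-m2.5": 2,
    "qwen3-coder-480b": 7,
    "nemotron-3-super": 2,
    "glm-5.1": 8,
}

MAX_AGENTS_PER_MODEL = 3  # hypothetical threshold, not stated in the report


def concentration_risks(counts: Dict[str, int], limit: int) -> List[Tuple[str, int]]:
    """Return (model, agents) pairs above the limit, most concentrated first."""
    flagged = [(model, n) for model, n in counts.items() if n > limit]
    return sorted(flagged, key=lambda pair: (-pair[1], pair[0]))


for model, n in concentration_risks(assigned, MAX_AGENTS_PER_MODEL):
    print(f"{model}: {n} agents ({n / TOTAL_AGENTS:.0%} of all agents)")
```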
## Idle Models (0 agents assigned — wasted potential)
| Model | Composite | Pulls | Why Idle |
|-------|-----------|-------|----------|
| qwen3.5-122b | ~68.5* | **12.4M** | Newest, highest IF=92, needs integration |
| gemma4-27b | ~62* | **10.1M** | Multimodal, needs A/B for coding |
| deepseek-v4-flash | 66.90 | 84.4K | Best efficiency, 13B active |
| minimax-m2.7 | 63.43 | 2.2M | Self-evolving, could suit meta-agents |
| glm-5 | ~67* | 2.3M | Superseded by glm-5.1 |
| devstral-2 | 40.03* | 223.2K | Code exploration, alternative for coding |
| devstral-small-2 | 37.53* | 838.8K | Lightweight, IF too low |
| kimi-k2.5 | 68.45 | 293.2K | Superseded by k2.6 |
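| nemotron-3-nano | 34.03* | 453K | Ultra-lightweight for simple tasks |

\* SWE score missing. Values prefixed with ~ are estimates under an assumed SWE score; unprefixed starred values are raw composites with SWE counted as 0.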
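
The idle list is the set of benchmarked models with no agent assigned, which could also drive a dashboard alert. A minimal sketch follows, assuming the benchmark list and assignment counts are available as plain Python structures. The function name `idle_models` is hypothetical.

```python
from typing import Dict, List

# All benchmarked models, in the order of the composite table.
benchmarked: List[str] = [
    "kimi-k2.6", "deepseek-v4-pro-max", "kimi-k2.5", "deepseek-v4-flash",
    "minimax-m2.5", "qwen3-coder-480b", "minimax-m2.7", "nemotron-3-super",
    "glm-5.1", "glm-5", "qwen3.5-122b", "gemma4-27b",
    "devstral-2", "devstral-small-2", "nemotron-3-nano",
]

# Agents per model; models missing from this mapping have no agents.
assigned_counts: Dict[str, int] = {
    "kimi-k2.6": 7,
    "deepseek-v4-pro-max": 4,
    "minimax-m2.5": 2,
    "qwen3-coder-480b": 7,
    "nemotron-3-super": 2,
    "glm-5.1": 8,
}


def idle_models(models: List[str], counts: Dict[str, int]) -> List[str]:
    """Return benchmarked models that no agent currently uses."""
    return [m for m in models if counts.get(m, 0) == 0]


idle = idle_models(benchmarked, assigned_counts)
print(f"{len(idle)} idle models: {', '.join(idle)}")
```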
## Recommendations
### CRITICAL
| Agent | From | To | Delta | Rationale |
|-------|------|-----|-------|-----------|
| prompt-optimizer | qwen3.6-plus (**not Ollama Cloud**) | qwen3.5-122b (IF=92) | +10 | Must migrate: qwen3.6-plus is not on Ollama Cloud. qwen3.5-122b has the highest IF (92) and 12.4M pulls. |
| memory-manager | qwen3.6-plus (**not Ollama Cloud**) | deepseek-v4-pro-max (IF=89, 1M ctx) | +1 | Must migrate: qwen3.6-plus is not on Ollama Cloud. memory-manager needs long context (1M), and deepseek-v4-pro-max is the best fit. |
### HIGH
| Agent | From | To | Delta | Rationale |
|-------|------|-----|-------|-----------|
| system-analyst | glm-5.1 (matrix=82) | deepseek-v4-pro-max (matrix=88) | +6 | IF=89, SWE=80.6, 1M context for architecture docs. glm-5.1 has no SWE score. |
| evaluator | glm-5.1 (matrix=78) | qwen3.5-122b (IF=92, est=82) | +4 | IF-critical role. qwen3.5-122b has highest IF=92. 12.4M pulls. |
| pipeline-judge | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=84) | +8 | Needs long context (pipeline logs). kimi-k2.6 IF=91, SWE=80.2, 1M ctx. |
| workflow-architect | glm-5.1 (matrix=76) | qwen3.5-122b (est=80) | +4 | High IF for YAML/structured output. qwen3.5-122b IF=92. |
### MEDIUM
| Agent | From | To | Delta | Rationale |
|-------|------|-----|-------|-----------|
| markdown-validator | deepseek-v4-pro-max (matrix=68, expensive) | nemotron-3-nano (matrix=70, cheap, 4B) | +2 | A 49B-active model is overkill for markdown validation. nemotron-3-nano is cheaper and has a higher matrix score. |
| release-manager | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=78) | +2 | 1M context for large git diffs. IF=91 vs 90. |
| capability-analyst | glm-5.1 (matrix=78) | deepseek-v4-pro-max (matrix=82) | +4 | 1M context for capability-index analysis. |
| visual-tester | qwen3-coder-480b (matrix=82, no vision) | kimi-k2.6 (matrix=82, vision) | +0 (capabilities+) | Same matrix but kimi-k2.6 can SEE images. Multimodal advantage. |
| browser-automation | qwen3-coder-480b (matrix=87, 35B active) | deepseek-v4-flash (IF=86, 13B active, 1M ctx) | ~-5 matrix (trade-off) | 3× faster inference. 1M context for complex DOM. |
### LOW
| Agent | From | To | Delta | Rationale |
|-------|------|-----|-------|-----------|
| history-miner | nemotron-3-super (IF=78, composite=57.35) | qwen3.5-122b (IF=92, 12.4M pulls) | +14 IF | Lowest model quality in pipeline. Easy upgrade. |
| plan (built-in) | nemotron-3-super (IF=78) | deepseek-v4-pro-max (IF=89, matrix=88) | +11 IF | Align with planner subagent. |
## Data Gaps
| Model | Missing | Impact |
|-------|---------|--------|
| qwen3.5-122b | SWE-bench | Cannot confirm coding ability. Safe for IF-only roles. |
| gemma4-27b | SWE-bench | Newest release. Needs A/B for coding. |
| glm-5.1 | SWE-bench | 8 agents! Unverified coding capability. |
| devstral-2 | SWE-bench | Code model with no coding benchmark: risky. |
| nemotron-3-nano | SWE-bench | Not needed: lightweight tasks only. |
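
Until the missing SWE-bench results are collected, the effect of the gap can be bounded by recomputing the composite under an assumed score. The sketch below uses the SWE ≈ 75 assumption from the composite table footnote. That value is a placeholder rather than a measurement, and the names `composite` and `ASSUMED_SWE` are illustrative.

```python
from typing import Dict, Tuple

ASSUMED_SWE = 75.0  # placeholder from the report's footnote, not a measured score


def composite(if_score: float, swe_bench: float, context_k: float) -> float:
    """Composite score using the weights from this report's formula."""
    return round(if_score * 0.5 + swe_bench * 0.3 + context_k / 1000 * 0.2, 2)


# (IF, Ctx(K)) for the models listed in the data-gaps table.
missing_swe: Dict[str, Tuple[float, float]] = {
    "qwen3.5-122b": (92, 128),
    "gemma4-27b": (85, 128),
    "glm-5.1": (90, 128),
    "devstral-2": (80, 128),
    "nemotron-3-nano": (68, 128),
}

# Map each model to (raw composite with SWE=0, estimate with the assumed SWE).
estimates: Dict[str, Tuple[float, float]] = {
    name: (composite(if_score, 0.0, ctx), composite(if_score, ASSUMED_SWE, ctx))
    for name, (if_score, ctx) in missing_swe.items()
}

for name, (raw, est) in estimates.items():
    print(f"{name}: raw={raw:.2f}, estimated={est:.2f}")
```

Under this assumption every model gains 22.5 points (0.3 × 75), so the ranking among the models with missing data does not change. Only their position relative to fully benchmarked models does.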
## Recently Updated Models (2 days old)
- **qwen3.5-122b** (2026-05-22): 12.4M pulls since launch
- **gemma4-27b** (2026-05-22): 10.1M pulls since launch, announced "frontier at each size"
## Next Actions
1. Apply CRITICAL: migrate prompt-optimizer + memory-manager
2. Apply HIGH: system-analyst + evaluator + pipeline-judge + workflow-architect
3. Run pipeline A/B test on qwen3.5-122b and deepseek-v4-flash
4. Fill data gaps: collect SWE-bench for qwen3.5-122b and gemma4-27b
5. Update dashboard to show idle model alerts
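
Steps 1 and 2 above apply migrations in priority order. One way to make that order explicit is sketched below. The `Migration` type, the `apply_migrations` helper, and the in-memory `assignments` mapping are all hypothetical; the report does not describe how agent configuration is actually stored or updated.

```python
from dataclasses import dataclass
from typing import Dict, List

PRIORITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass(frozen=True)
class Migration:
    agent: str
    from_model: str
    to_model: str
    priority: str


def apply_migrations(assignments: Dict[str, str], migrations: List[Migration]) -> List[str]:
    """Apply migrations in priority order, updating assignments in place.

    A migration is skipped when the agent is not currently on the expected
    source model. The names of skipped agents are returned.
    """
    skipped = []
    for m in sorted(migrations, key=lambda m: PRIORITY_ORDER[m.priority]):
        if assignments.get(m.agent) != m.from_model:
            skipped.append(m.agent)
            continue
        assignments[m.agent] = m.to_model
    return skipped


# A subset of the recommendations, deliberately listed out of order.
assignments = {
    "prompt-optimizer": "qwen3.6-plus",
    "memory-manager": "qwen3.6-plus",
    "system-analyst": "glm-5.1",
    "history-miner": "nemotron-3-super",
}
migrations = [
    Migration("history-miner", "nemotron-3-super", "qwen3.5-122b", "LOW"),
    Migration("system-analyst", "glm-5.1", "deepseek-v4-pro-max", "HIGH"),
    Migration("prompt-optimizer", "qwen3.6-plus", "qwen3.5-122b", "CRITICAL"),
    Migration("memory-manager", "qwen3.6-plus", "deepseek-v4-pro-max", "CRITICAL"),
]

skipped = apply_migrations(assignments, migrations)
print(assignments, skipped)
```

Checking the current model before switching keeps the script safe to rerun: a migration that has already been applied is reported as skipped instead of being applied twice.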