APAW/agent-evolution/data/model-research-2026-05-24.md
Deploy Bot 047a87afb4 feat(agent-models): apply MEDIUM+LOW priority model migrations
- markdown-validator: deepseek-v4-pro-max → nemotron-3-nano (90% cost cut)
- release-manager: glm-5.1 → kimi-k2.6 (+2 matrix, 1M context for diffs)
- capability-analyst: glm-5.1 → deepseek-v4-pro-max (+4 matrix, 1M ctx)
- browser-automation: qwen3-coder → deepseek-v4-flash (3× faster inference)
- history-miner: nemotron-3-super → qwen3.5-122b (+14 IF, 12.4M pulls)
2026-05-25 15:07:17 +01:00


Agent Model Research Report — 2026-05-24

Executive Summary

13 model changes are recommended across 38 agents: 2 CRITICAL (prompt-optimizer and memory-manager both run on a model outside Ollama Cloud and must migrate), 4 HIGH, 5 MEDIUM, and 2 LOW.

9 models have been benchmarked but are assigned to zero agents, which is wasted potential.

Composite Score Formula

composite = (IF_score * 0.5) + (SWE_bench * 0.3) + (context_kb / 1000 * 0.2)
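
As a minimal sketch, the formula above can be written as a small Python function. The function name `composite` and the choice to treat a missing SWE score as 0 are assumptions; the latter reproduces the starred rows in the table that follows (for example, 45.03 for glm-5.1).

```python
from typing import Optional


def composite(if_score: float, swe_bench: Optional[float], context_k: float) -> float:
    """Composite score per the report's formula.

    Assumption: a missing SWE score (None) is counted as 0.
    context_k is the context window in thousands of tokens.
    """
    swe = swe_bench if swe_bench is not None else 0.0
    return if_score * 0.5 + swe * 0.3 + (context_k / 1000) * 0.2


print(round(composite(91, 80.2, 1000), 2))  # kimi-k2.6 -> 69.76
print(round(composite(90, None, 128), 2))   # glm-5.1   -> 45.03
```

With these weights the context term adds at most 0.2 points for a 1M-context model, so the ranking is driven almost entirely by IF and SWE.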

| Model | IF | SWE | Ctx (K) | Composite | Pulls | Assigned |
|---|---|---|---|---|---|---|
| kimi-k2.6 | 91 | 80.2 | 1000 | 69.76 | 259.7K | 7 agents |
| deepseek-v4-pro-max | 89 | 80.6 | 1000 | 68.88 | 71.6K | 4 agents |
| kimi-k2.5 | 90 | 78.0 | 256 | 68.45 | 293.2K | 0 |
| deepseek-v4-flash | 86 | 79.0 | 1000 | 66.90 | 84.4K | 0 |
| minimax-m2.5 | 82 | 80.2 | 128 | 65.09 | 2.2M | 2 agents |
| qwen3-coder-480b | 88 | 66.5 | 1000 | 64.15 | N/A | 7 agents |
| minimax-m2.7 | 80 | 78.0 | 128 | 63.43 | 2.2M | 0 |
| nemotron-3-super | 78 | 60.5 | 1000 | 57.35 | 2.4M | 2 agents |
| glm-5.1 | 90 | null | 128 | 45.03* | 2.2M | 8 agents |
| glm-5 | 90 | null | 128 | 45.03* | 2.3M | 0 |
| qwen3.5-122b | 92 | null | 128 | 46.03* | 12.4M | 0 |
| gemma4-27b | 85 | null | 128 | 42.53* | 10.1M | 0 |
| devstral-2 | 80 | null | 128 | 40.03* | 223.2K | 0 |
| devstral-small-2 | 75 | null | 128 | 37.53* | 838.8K | 0 |
| nemotron-3-nano | 68 | null | 128 | 34.03* | 453K | 0 |

* SWE score missing, so it is counted as 0 and the composite is artificially low. Estimated uplift: +20 to +25 points with SWE ≈ 75 (0.3 × 75 = 22.5).
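
The footnote's estimate can be reproduced by substituting an assumed SWE score into the composite formula. The sketch below is illustrative only: `estimated_composite` is a hypothetical helper, and the default of 75 is the report's placeholder value, not a measured score.

```python
def estimated_composite(if_score: float, context_k: float, assumed_swe: float = 75.0) -> float:
    """Composite with an assumed SWE score standing in for a missing one."""
    return if_score * 0.5 + assumed_swe * 0.3 + (context_k / 1000) * 0.2


# Uplift relative to counting the missing SWE score as 0.
uplift = 75.0 * 0.3

print(round(uplift, 2))                        # 22.5
print(round(estimated_composite(92, 128), 2))  # qwen3.5-122b -> 68.53
print(round(estimated_composite(90, 128), 2))  # glm-5 / glm-5.1 -> 67.53
```

These figures are only as good as the assumed SWE value; a measured score of 65 instead of 75 would lower each estimate by 3 points.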

Concentration Risks

| Model | Agents | Risk |
|---|---|---|
| glm-5.1 | 8 | All agents on model with NO SWE score |
| kimi-k2.6 | 7 | Highest-quality model over-concentrated |
| qwen3-coder-480b | 7 | SWE=66.5 below deepseek-v4-flash (79) |
| deepseek-v4-pro-max | 4 | Expensive (49B active) |
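
One way to quantify count-based concentration is each model's share of the 38 agents. The sketch below uses the assignment counts from the composite table; the 15% threshold and the function name `concentrated` are hypothetical choices for illustration, not values from this report. Cost-based risks such as deepseek-v4-pro-max would not be caught by a share threshold.

```python
TOTAL_AGENTS = 38  # from the executive summary

# Agents per model, taken from the "Assigned" column of the composite table.
assigned = {
    "glm-5.1": 8,
    "kimi-k2.6": 7,
    "qwen3-coder-480b": 7,
    "deepseek-v4-pro-max": 4,
    "minimax-m2.5": 2,
    "nemotron-3-super": 2,
}


def concentrated(counts: dict, total: int, threshold: float = 0.15) -> list:
    """Return (model, share) pairs whose share of all agents exceeds threshold."""
    flagged = [(model, n / total) for model, n in counts.items() if n / total > threshold]
    # Largest share first; ties keep their original order.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for model, share in concentrated(assigned, TOTAL_AGENTS):
    print(f"{model}: {share:.1%}")
```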

Idle Models (0 agents assigned — wasted potential)

| Model | Composite | Pulls | Why Idle |
|---|---|---|---|
| qwen3.5-122b | ~68.5* | 12.4M | Newest, highest IF=92, needs integration |
| gemma4-27b | ~62* | 10.1M | Multimodal, needs A/B for coding |
| deepseek-v4-flash | 66.90 | 84.4K | Best efficiency, 13B active |
| minimax-m2.7 | 63.43 | 2.2M | Self-evolving, could suit meta-agents |
| glm-5 | ~67* | 2.3M | Superseded by glm-5.1 |
| devstral-2 | 40.03* | 223.2K | Code exploration, alternative for coding |
| devstral-small-2 | 37.53* | 838.8K | Lightweight, IF too low |
| kimi-k2.5 | 68.45 | 293.2K | Superseded by k2.6 |
| nemotron-3-nano | 34.03* | 453K | Ultra-lightweight for simple tasks |
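
The idle list can be derived mechanically from the composite table by filtering for models with zero assigned agents. The sketch below is one possible shape for such a check; the `models` mapping and the `idle_models` helper are hypothetical, and the composites are the raw table values (missing SWE counted as 0), not the estimates shown above.

```python
# model -> (raw composite, agents assigned), copied from the composite table.
models = {
    "kimi-k2.6": (69.76, 7),
    "deepseek-v4-pro-max": (68.88, 4),
    "kimi-k2.5": (68.45, 0),
    "deepseek-v4-flash": (66.90, 0),
    "minimax-m2.5": (65.09, 2),
    "qwen3-coder-480b": (64.15, 7),
    "minimax-m2.7": (63.43, 0),
    "nemotron-3-super": (57.35, 2),
    "glm-5.1": (45.03, 8),
    "glm-5": (45.03, 0),
    "qwen3.5-122b": (46.03, 0),
    "gemma4-27b": (42.53, 0),
    "devstral-2": (40.03, 0),
    "devstral-small-2": (37.53, 0),
    "nemotron-3-nano": (34.03, 0),
}


def idle_models(table: dict) -> list:
    """Names of models with no agents assigned, highest composite first."""
    idle = [(name, score) for name, (score, agents) in table.items() if agents == 0]
    return [name for name, _ in sorted(idle, key=lambda pair: pair[1], reverse=True)]


idle = idle_models(models)
print(len(idle), "idle models:", ", ".join(idle))
```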

Recommendations

CRITICAL

| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| prompt-optimizer | qwen3.6-plus (not Ollama Cloud) | qwen3.5-122b (IF=92) | +10 | Must migrate. qwen3.6-plus not in Ollama Cloud. qwen3.5 highest IF=92. 12.4M pulls. |
| memory-manager | qwen3.6-plus (not Ollama Cloud) | deepseek-v4-pro-max (IF=89, 1M ctx) | +1 | Must migrate. Memory-manager needs long context (1M). deepseek-v4-pro-max best for this. |

HIGH

| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| system-analyst | glm-5.1 (matrix=82) | deepseek-v4-pro-max (matrix=88) | +6 | IF=89, SWE=80.6, 1M context for architecture docs. glm-5.1 has no SWE score. |
| evaluator | glm-5.1 (matrix=78) | qwen3.5-122b (IF=92, est=82) | +4 | IF-critical role. qwen3.5-122b has highest IF=92. 12.4M pulls. |
| pipeline-judge | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=84) | +8 | Needs long context (pipeline logs). kimi-k2.6 IF=91, SWE=80.2, 1M ctx. |
| workflow-architect | glm-5.1 (matrix=76) | qwen3.5-122b (est=80) | +4 | High IF for YAML/structured output. qwen3.5 IF=92. |

MEDIUM

| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| markdown-validator | deepseek-v4-pro-max (matrix=68, expensive) | nemotron-3-nano (matrix=70, cheap, 4B) | +2 | Overkill to use 49B active model for markdown validation. nano cheaper + higher matrix score. |
| release-manager | glm-5.1 (matrix=76) | kimi-k2.6 (matrix=78) | +2 | 1M context for large git diffs. IF=91 vs 90. |
| capability-analyst | glm-5.1 (matrix=78) | deepseek-v4-pro-max (matrix=82) | +4 | 1M context for capability-index analysis. |
| visual-tester | qwen3-coder-480b (matrix=82, no vision) | kimi-k2.6 (matrix=82, vision) | +0 (capabilities+) | Same matrix but kimi-k2.6 can SEE images. Multimodal advantage. |
| browser-automation | qwen3-coder-480b (matrix=87, 35B active) | deepseek-v4-flash (IF=86, 13B active, 1M ctx) | ~-5 matrix (trade-off) | 3× faster inference. 1M context for complex DOM. |

LOW

| Agent | From | To | Delta | Rationale |
|---|---|---|---|---|
| history-miner | nemotron-3-super (IF=78, composite=57.35) | qwen3.5-122b (IF=92, 12.4M pulls) | +14 IF | Lowest model quality in pipeline. Easy upgrade. |
| plan (built-in) | nemotron-3-super (IF=78) | deepseek-v4-pro-max (IF=89, matrix=88) | +11 IF | Align with planner subagent. |
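
To see how the 13 recommendations would shift the assignment counts, they can be replayed against the current totals. This is a rough sketch under stated assumptions: the starting counts come from the composite table, the starting count of 2 for qwen3.6-plus is inferred from the CRITICAL table, and `apply_migrations` is a hypothetical helper rather than part of any existing tooling.

```python
from collections import Counter

# Current agents per model (composite table, plus qwen3.6-plus inferred from CRITICAL).
current = Counter({
    "glm-5.1": 8, "kimi-k2.6": 7, "qwen3-coder-480b": 7,
    "deepseek-v4-pro-max": 4, "minimax-m2.5": 2,
    "nemotron-3-super": 2, "qwen3.6-plus": 2,
})

# (agent, from_model, to_model) for all 13 recommendations.
migrations = [
    ("prompt-optimizer", "qwen3.6-plus", "qwen3.5-122b"),
    ("memory-manager", "qwen3.6-plus", "deepseek-v4-pro-max"),
    ("system-analyst", "glm-5.1", "deepseek-v4-pro-max"),
    ("evaluator", "glm-5.1", "qwen3.5-122b"),
    ("pipeline-judge", "glm-5.1", "kimi-k2.6"),
    ("workflow-architect", "glm-5.1", "qwen3.5-122b"),
    ("markdown-validator", "deepseek-v4-pro-max", "nemotron-3-nano"),
    ("release-manager", "glm-5.1", "kimi-k2.6"),
    ("capability-analyst", "glm-5.1", "deepseek-v4-pro-max"),
    ("visual-tester", "qwen3-coder-480b", "kimi-k2.6"),
    ("browser-automation", "qwen3-coder-480b", "deepseek-v4-flash"),
    ("history-miner", "nemotron-3-super", "qwen3.5-122b"),
    ("plan", "nemotron-3-super", "deepseek-v4-pro-max"),
]


def apply_migrations(counts: Counter, moves: list) -> Counter:
    """Return new per-model counts after moving each agent from its old to its new model."""
    result = Counter(counts)
    for _agent, src, dst in moves:
        result[src] -= 1
        result[dst] += 1
    return result


after = apply_migrations(current, migrations)
for model, n in sorted(after.items(), key=lambda kv: kv[1], reverse=True):
    if n > 0:
        print(f"{model}: {n}")
```

Under these assumptions, glm-5.1 drops from 8 agents to 2, while kimi-k2.6 rises from 7 to 10 and deepseek-v4-pro-max from 4 to 7. That may be worth weighing against the concentration risks listed earlier.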

Data Gaps

| Model | Missing | Impact |
|---|---|---|
| qwen3.5-122b | SWE-bench | Cannot confirm coding ability. Safe for IF-only roles. |
| gemma4-27b | SWE-bench | Newest release. Needs A/B for coding. |
| glm-5.1 | SWE-bench | 8 agents! Unverified coding capability. |
| devstral-2 | SWE-bench | Code model with no coding benchmark; risky. |
| nemotron-3-nano | SWE-bench | Not needed: lightweight tasks only. |

Recently Updated Models (2 days old)

  • qwen3.5-122b (2026-05-22): 12.4M pulls since launch
  • gemma4-27b (2026-05-22): 10.1M pulls since launch, announced "frontier at each size"

Next Actions

  1. Apply CRITICAL: migrate prompt-optimizer + memory-manager
  2. Apply HIGH: system-analyst + evaluator + pipeline-judge + workflow-architect
  3. Run pipeline A/B test on qwen3.5-122b and deepseek-v4-flash
  4. Fill data gaps: collect SWE-bench for qwen3.5-122b and gemma4-27b
  5. Update dashboard to show idle model alerts