Deploy Bot 047a87afb4 feat(agent-models): apply MEDIUM+LOW priority model migrations

- markdown-validator: deepseek-v4-pro-max → nemotron-3-nano (90% cost cut)
- release-manager: glm-5.1 → kimi-k2.6 (+2 matrix, 1M context for diffs)
- capability-analyst: glm-5.1 → deepseek-v4-pro-max (+4 matrix, 1M ctx)
- browser-automation: qwen3-coder → deepseek-v4-flash (3× faster inference)
- history-miner: nemotron-3-super → qwen3.5-122b (+14 IF, 12.4M pulls)

2026-05-25 15:07:17 +01:00

Agent Model Research Report — 2026-05-24

Executive Summary

13 model changes are recommended across 38 agents: 2 CRITICAL (prompt-optimizer and memory-manager run on qwen3.6-plus, which is not in Ollama Cloud, so both must migrate), 4 HIGH, 5 MEDIUM, and 2 LOW.

9 benchmarked models are assigned to zero agents, which is wasted potential.

Composite Score Formula

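composite = (IF_score * 0.5) + (SWE_bench * 0.3) + (context_kb / 1000 * 0.2)

context_kb is the context window in thousands of tokens, shown as Ctx(K) in the table. A missing SWE score (null) is counted as 0.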

Model	IF	SWE	Ctx(K)	Composite	Pulls	Assigned
kimi-k2.6	91	80.2	1000	69.76	259.7K	7 agents
deepseek-v4-pro-max	89	80.6	1000	68.88	71.6K	4 agents
kimi-k2.5	90	78.0	256	68.45	293.2K	0
deepseek-v4-flash	86	79.0	1000	66.90	84.4K	0
minimax-m2.5	82	80.2	128	65.09	2.2M	2 agents
qwen3-coder-480b	88	66.5	1000	64.15	N/A	7 agents
minimax-m2.7	80	78.0	128	63.43	2.2M	0
nemotron-3-super	78	60.5	1000	57.35	2.4M	2 agents
glm-5.1	90	null	128	45.03*	2.2M	8 agents
glm-5	90	null	128	45.03*	2.3M	0
qwen3.5-122b	92	null	128	46.03*	12.4M	0
gemma4-27b	85	null	128	42.53*	10.1M	0
devstral-2	80	null	128	40.03*	223.2K	0
devstral-small-2	75	null	128	37.53*	838.8K	0
nemotron-3-nano	68	null	128	34.03*	453K	0

* SWE score missing, so the SWE term counts as 0 and the composite is artificially low. Estimated uplift: +20 to +25 points, assuming SWE ≈ 75 (75 × 0.3 = 22.5).
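
The formula and the footnote can be checked with a short script. This is a minimal sketch, not the tooling that produced the report: the function name `composite` and the choice to treat a null SWE score as 0 are assumptions inferred from the starred rows, and only three rows are copied from the table.

```python
from typing import Optional

# Weights taken from the composite formula in this report.
W_IF, W_SWE, W_CTX = 0.5, 0.3, 0.2


def composite(if_score: float, swe_bench: Optional[float], ctx_k: float) -> float:
    """Composite score; a missing SWE-bench value (None) contributes 0 (assumption)."""
    swe = 0.0 if swe_bench is None else swe_bench
    return if_score * W_IF + swe * W_SWE + ctx_k / 1000 * W_CTX


# (IF, SWE, Ctx(K)) copied from three rows of the table above.
rows = {
    "kimi-k2.6": (91, 80.2, 1000),
    "kimi-k2.5": (90, 78.0, 256),
    "glm-5.1": (90, None, 128),
}
scores = {name: round(composite(*vals), 2) for name, vals in rows.items()}

# Footnote estimate: re-score glm-5.1 with an assumed SWE of 75.
glm_estimated = round(composite(90, 75, 128), 2)

for name, score in scores.items():
    print(f"{name}: {score}")
print(f"glm-5.1 with assumed SWE=75: {glm_estimated}")
```

With these inputs the script reproduces the tabulated values 69.76, 68.45, and 45.03, and the assumed SWE of 75 lifts glm-5.1 to 67.53. Note that the context term adds at most 0.2 points even at 1000K, so the ranking is driven almost entirely by IF and SWE.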

Concentration Risks

Model	Agents	Risk
glm-5.1	8	All 8 agents run on a model with no SWE score
kimi-k2.6	7	Highest-quality model is over-concentrated
qwen3-coder-480b	7	SWE=66.5, below deepseek-v4-flash (SWE=79.0)
deepseek-v4-pro-max	4	Expensive (49B active)
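
The concentration view above could be derived from the assignment counts rather than maintained by hand. The sketch below assumes a hypothetical cutoff of 4 agents (`min_agents`), chosen only because it reproduces the four rows listed; the report does not state the threshold it used.

```python
# Agent counts per model, copied from the "Assigned" column of the score table.
AGENT_COUNTS = {
    "kimi-k2.6": 7,
    "deepseek-v4-pro-max": 4,
    "minimax-m2.5": 2,
    "qwen3-coder-480b": 7,
    "nemotron-3-super": 2,
    "glm-5.1": 8,
}
TOTAL_AGENTS = 38  # from the Executive Summary


def concentration_report(counts, total, min_agents=4):
    """Return (model, agents, share) for models at or above min_agents, largest first.

    min_agents=4 is an assumed cutoff, not a value stated in the report.
    """
    flagged = [(m, n, n / total) for m, n in counts.items() if n >= min_agents]
    # sorted() is stable, so models with equal counts keep their input order.
    return sorted(flagged, key=lambda row: row[1], reverse=True)


for model, agents, share in concentration_report(AGENT_COUNTS, TOTAL_AGENTS):
    print(f"{model}: {agents} agents ({share:.0%} of all agents)")
```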

Idle Models (0 agents assigned — wasted potential)

Model	Composite	Pulls	Why Idle
qwen3.5-122b	~68.5*	12.4M	Newest, highest IF=92, needs integration
gemma4-27b	~62*	10.1M	Multimodal, needs A/B for coding
deepseek-v4-flash	66.90	84.4K	Best efficiency, 13B active
minimax-m2.7	63.43	2.2M	Self-evolving, could suit meta-agents
glm-5	~67*	2.3M	Superseded by glm-5.1
devstral-2	40.03*	223.2K	Code exploration, alternative for coding
devstral-small-2	37.53*	838.8K	Lightweight, IF too low
kimi-k2.5	68.45	293.2K	Superseded by k2.6
nemotron-3-nano	34.03*	453K	Ultra-lightweight for simple tasks
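
An idle-model check like the one tabulated above is simple to automate. The following is a sketch under stated assumptions: `parse_pulls` and `idle_alerts` are hypothetical helper names, the rows are copied from the score table, and ordering by pull count is just one possible way to prioritise the alerts.

```python
from typing import Optional

_SUFFIX = {"K": 1_000, "M": 1_000_000}


def parse_pulls(text: str) -> Optional[int]:
    """Convert a pull count such as '12.4M' or '453K' to an int; 'N/A' gives None."""
    text = text.strip()
    if text.upper() == "N/A":
        return None
    if text[-1].upper() in _SUFFIX:
        return round(float(text[:-1]) * _SUFFIX[text[-1].upper()])
    return round(float(text))


# (model, pulls, assigned agents) copied from the score table.
MODELS = [
    ("kimi-k2.6", "259.7K", 7),
    ("qwen3-coder-480b", "N/A", 7),
    ("kimi-k2.5", "293.2K", 0),
    ("deepseek-v4-flash", "84.4K", 0),
    ("minimax-m2.7", "2.2M", 0),
    ("glm-5", "2.3M", 0),
    ("qwen3.5-122b", "12.4M", 0),
    ("gemma4-27b", "10.1M", 0),
    ("devstral-2", "223.2K", 0),
    ("devstral-small-2", "838.8K", 0),
    ("nemotron-3-nano", "453K", 0),
]


def idle_alerts(models):
    """Return idle models (0 agents) as (name, pulls), most-pulled first."""
    idle = [(name, parse_pulls(pulls) or 0) for name, pulls, agents in models if agents == 0]
    return sorted(idle, key=lambda row: row[1], reverse=True)


alerts = idle_alerts(MODELS)
for name, pulls in alerts:
    print(f"IDLE: {name} ({pulls:,} pulls, 0 agents)")
```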

Recommendations

CRITICAL

Agent	From	To	Delta	Rationale
prompt-optimizer	qwen3.6-plus (not Ollama Cloud)	qwen3.5-122b (IF=92)	+10	Must migrate: qwen3.6-plus is not in Ollama Cloud. qwen3.5-122b has the highest IF (92) and 12.4M pulls.
memory-manager	qwen3.6-plus (not Ollama Cloud)	deepseek-v4-pro-max (IF=89, 1M ctx)	+1	Must migrate: qwen3.6-plus is not in Ollama Cloud. memory-manager needs long context, and deepseek-v4-pro-max offers 1M.

HIGH

Agent	From	To	Delta	Rationale
system-analyst	glm-5.1 (matrix=82)	deepseek-v4-pro-max (matrix=88)	+6	IF=89, SWE=80.6, 1M context for architecture docs. glm-5.1 has no SWE score.
evaluator	glm-5.1 (matrix=78)	qwen3.5-122b (IF=92, est=82)	+4	IF-critical role. qwen3.5-122b has highest IF=92. 12.4M pulls.
pipeline-judge	glm-5.1 (matrix=76)	kimi-k2.6 (matrix=84)	+8	Needs long context (pipeline logs). kimi-k2.6 IF=91, SWE=80.2, 1M ctx.
workflow-architect	glm-5.1 (matrix=76)	qwen3.5-122b (est=80)	+4	High IF for YAML/structured output. qwen3.5-122b IF=92.

MEDIUM

Agent	From	To	Delta	Rationale
markdown-validator	deepseek-v4-pro-max (matrix=68, expensive)	nemotron-3-nano (matrix=70, cheap, 4B)	+2	A 49B-active model is overkill for markdown validation. nemotron-3-nano is cheaper and has a higher matrix score.
release-manager	glm-5.1 (matrix=76)	kimi-k2.6 (matrix=78)	+2	1M context for large git diffs. IF=91 vs 90.
capability-analyst	glm-5.1 (matrix=78)	deepseek-v4-pro-max (matrix=82)	+4	1M context for capability-index analysis.
visual-tester	qwen3-coder-480b (matrix=82, no vision)	kimi-k2.6 (matrix=82, vision)	+0 (capabilities+)	Same matrix score, but kimi-k2.6 accepts image input. Multimodal advantage.
browser-automation	qwen3-coder-480b (matrix=87, 35B active)	deepseek-v4-flash (IF=86, 13B active, 1M ctx)	~-5 matrix (trade-off)	3× faster inference. 1M context for complex DOM.

LOW

Agent	From	To	Delta	Rationale
history-miner	nemotron-3-super (IF=78, composite=57.35)	qwen3.5-122b (IF=92, 12.4M pulls)	+14 IF	Lowest model quality in pipeline. Easy upgrade.
plan (built-in)	nemotron-3-super (IF=78)	deepseek-v4-pro-max (IF=89, matrix=88)	+11 IF	Align with planner subagent.
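
The recommendations above can be expressed as data and applied one priority tier at a time. This is a hypothetical sketch: the report does not show how agent assignments are stored, so the plain agent-to-model dictionary and the `apply_migrations` helper are assumptions, and the built-in plan agent is keyed simply as `plan`.

```python
# (priority, agent, current model, target model) copied from the tables above.
MIGRATIONS = [
    ("CRITICAL", "prompt-optimizer", "qwen3.6-plus", "qwen3.5-122b"),
    ("CRITICAL", "memory-manager", "qwen3.6-plus", "deepseek-v4-pro-max"),
    ("HIGH", "system-analyst", "glm-5.1", "deepseek-v4-pro-max"),
    ("HIGH", "evaluator", "glm-5.1", "qwen3.5-122b"),
    ("HIGH", "pipeline-judge", "glm-5.1", "kimi-k2.6"),
    ("HIGH", "workflow-architect", "glm-5.1", "qwen3.5-122b"),
    ("MEDIUM", "markdown-validator", "deepseek-v4-pro-max", "nemotron-3-nano"),
    ("MEDIUM", "release-manager", "glm-5.1", "kimi-k2.6"),
    ("MEDIUM", "capability-analyst", "glm-5.1", "deepseek-v4-pro-max"),
    ("MEDIUM", "visual-tester", "qwen3-coder-480b", "kimi-k2.6"),
    ("MEDIUM", "browser-automation", "qwen3-coder-480b", "deepseek-v4-flash"),
    ("LOW", "history-miner", "nemotron-3-super", "qwen3.5-122b"),
    ("LOW", "plan", "nemotron-3-super", "deepseek-v4-pro-max"),
]


def apply_migrations(assignments, migrations, priorities):
    """Return a new agent->model mapping with the selected priority tiers applied."""
    updated = dict(assignments)
    for priority, agent, current, target in migrations:
        if priority not in priorities:
            continue
        if updated.get(agent) != current:
            # Refuse to overwrite an assignment the report did not expect.
            raise ValueError(f"{agent}: expected {current}, found {updated.get(agent)}")
        updated[agent] = target
    return updated


# Assumed starting state: every agent is still on its "From" model.
before = {agent: current for _, agent, current, _ in MIGRATIONS}
after = apply_migrations(before, MIGRATIONS, {"CRITICAL", "HIGH"})
changed = sorted(agent for agent in before if before[agent] != after[agent])
print(changed)
```

Checking the current model before overwriting is a deliberate choice here: it makes the script fail loudly if a tier has already been applied or an agent was changed outside the report.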

Data Gaps

Model	Missing	Impact
qwen3.5-122b	SWE-bench	Cannot confirm coding. IF-only role safe.
gemma4-27b	SWE-bench	Newest release. Needs A/B for coding.
glm-5.1	SWE-bench	8 agents! Unverified coding capability.
devstral-2	SWE-bench	Code model with no coding benchmark; risky.
nemotron-3-nano	SWE-bench	Not needed: lightweight tasks only.

Next Actions

Apply CRITICAL: migrate prompt-optimizer + memory-manager
Apply HIGH: system-analyst + evaluator + pipeline-judge + workflow-architect
Run pipeline A/B test on qwen3.5-122b and deepseek-v4-flash
Fill data gaps: collect SWE-bench for qwen3.5-122b and gemma4-27b
Update dashboard to show idle model alerts
