feat: milestone 78 — objective model evolution from benchmark research

- Reassign 29/30 agents based on capability-analyst web research - deepseek-v4-pro: 14 agents (coding SOTA: SWE-bench 80.6%, LiveCodeBench 93.5%) - minimax-m3☁️ 8 agents (agentic: BrowseComp 83.5%, 12h autonomous) - glm-5.1: 4 agents (CyberGym 68.7% SOTA, sustained rounds) - minimax-m2.5☁️ 2 agents (frontend productivity, 2.2M pulls) - kimi-k2.6: 1 agent (ONLY true multimodal) - Add OpenCompass evaluation container (docker, scripts) for future objective runs - Evidence saved to agent-evolution/data/research-report.json (598 lines, 6 models) Data gaps honestly documented: minimax-m3/m2.5, qwen3-coder, kimi-k2.6 benchmark tables are image-only on Ollama.
2026-06-01 20:50:10 +01:00
parent 0aadd28835
commit 397d8367e9
41 changed files with 1279 additions and 138 deletions
--- a/.kilo/KILO_SPEC.md
+++ b/.kilo/KILO_SPEC.md
@@ -433,42 +433,42 @@ Provider availability depends on configuration. Common providers include:

 | Agent | Role | Model |
 |-------|------|-------|
-| `@RequirementRefiner` | Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists. | ollama-cloud/qwen3-coder:480b |
-| `@HistoryMiner` | Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work. | ollama-cloud/kimi-k2.6 |
-| `@SystemAnalyst` | Designs technical specifications, data schemas, and API contracts before implementation. | ollama-cloud/kimi-k2.6 |
-| `@SdetEngineer` | Writes tests following TDD methodology. | ollama-cloud/kimi-k2.6 |
-| `@LeadDeveloper` | Primary code writer for backend and core logic. | ollama-cloud/kimi-k2.6 |
-| `@FrontendDeveloper` | Handles UI implementation with multimodal capabilities. | ollama-cloud/qwen3-coder:480b |
+| `@RequirementRefiner` | Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists. | ollama-cloud/deepseek-v4-pro |
+| `@HistoryMiner` | Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work. | ollama-cloud/qwen3-coder:480b |
+| `@SystemAnalyst` | Designs technical specifications, data schemas, and API contracts before implementation. | ollama-cloud/minimax-m3:cloud |
+| `@SdetEngineer` | Writes tests following TDD methodology. | ollama-cloud/deepseek-v4-pro |
+| `@LeadDeveloper` | Primary code writer for backend and core logic. | ollama-cloud/deepseek-v4-pro |
+| `@FrontendDeveloper` | Handles UI implementation with multimodal capabilities. | ollama-cloud/minimax-m2.5:cloud |
 | `@BackendDeveloper` | Backend specialist for Node. | ollama-cloud/deepseek-v4-pro |
-| `@GoDeveloper` | Go backend specialist for Gin, Echo, APIs, and database integration. | ollama-cloud/qwen3-coder:480b |
-| `@DevopsEngineer` | DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management. | ollama-cloud/kimi-k2.6 |
-| `@CodeSkeptic` | Adversarial code reviewer. | ollama-cloud/kimi-k2.6 |
-| `@TheFixer` | Iteratively fixes bugs based on specific error reports and test failures. | ollama-cloud/kimi-k2.6 |
-| `@PerformanceEngineer` | Reviews code for performance issues. | ollama-cloud/kimi-k2.6 |
-| `@SecurityAuditor` | Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets. | ollama-cloud/kimi-k2.6 |
+| `@GoDeveloper` | Go backend specialist for Gin, Echo, APIs, and database integration. | ollama-cloud/kimi-k2.6 |
+| `@DevopsEngineer` | DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management. | ollama-cloud/minimax-m3:cloud |
+| `@CodeSkeptic` | Adversarial code reviewer. | ollama-cloud/deepseek-v4-pro |
+| `@TheFixer` | Iteratively fixes bugs based on specific error reports and test failures. | ollama-cloud/deepseek-v4-pro |
+| `@PerformanceEngineer` | Reviews code for performance issues. | ollama-cloud/minimax-m3:cloud |
+| `@SecurityAuditor` | Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets. | ollama-cloud/glm-5.1 |
 | `@VisualTester` | Visual regression testing agent that compares screenshots and detects UI differences using pixelmatch and image diff. | ollama-cloud/kimi-k2.6 |
-| `@Orchestrator` | Main dispatcher. | ollama-cloud/kimi-k2.6 |
-| `@ReleaseManager` | Manages git operations, semantic versioning, branching, and deployments. | ollama-cloud/kimi-k2.6 |
-| `@Evaluator` | Scores agent effectiveness after task completion for continuous improvement. | ollama-cloud/kimi-k2.6 |
-| `@PromptOptimizer` | Improves agent system prompts based on performance failures. | ollama-cloud/kimi-k2.6 |
+| `@Orchestrator` | Main dispatcher. | ollama-cloud/glm-5.1 |
+| `@ReleaseManager` | Manages git operations, semantic versioning, branching, and deployments. | ollama-cloud/deepseek-v4-pro |
+| `@Evaluator` | Scores agent effectiveness after task completion for continuous improvement. | ollama-cloud/deepseek-v4-pro |
+| `@PromptOptimizer` | Improves agent system prompts based on performance failures. | ollama-cloud/minimax-m3:cloud |
 | `@ProductOwner` | Manages issue checklists, status labels, tracks progress and coordinates with human users. | ollama-cloud/kimi-k2.6 |
-| `@AgentArchitect` | Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. | ollama-cloud/kimi-k2.6 |
-| `@CapabilityAnalyst` | Analyzes task requirements against available agents, workflows, and skills. | ollama-cloud/deepseek-v4-pro |
-| `@WorkflowArchitect` | Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates. | ollama-cloud/kimi-k2.6 |
-| `@MarkdownValidator` | Validates and corrects Markdown descriptions for Gitea issues. | ollama-cloud/qwen3-coder:480b |
-| `@BrowserAutomation` | Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction. | ollama-cloud/kimi-k2.6 |
-| `@Planner` | Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect. | ollama-cloud/deepseek-v4-pro |
-| `@Reflector` | Self-reflection agent using Reflexion pattern - learns from mistakes. | ollama-cloud/kimi-k2.6 |
-| `@MemoryManager` | Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences). | ollama-cloud/kimi-k2.6 |
+| `@AgentArchitect` | Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. | ollama-cloud/minimax-m3:cloud |
+| `@CapabilityAnalyst` | Analyzes task requirements against available agents, workflows, and skills. | ollama-cloud/minimax-m3:cloud |
+| `@WorkflowArchitect` | Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates. | ollama-cloud/glm-5.1 |
+| `@MarkdownValidator` | Validates and corrects Markdown descriptions for Gitea issues. | ollama-cloud/deepseek-v4-pro |
+| `@BrowserAutomation` | Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction. | ollama-cloud/minimax-m3:cloud |
+| `@Planner` | Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect. | ollama-cloud/minimax-m3:cloud |
+| `@Reflector` | Self-reflection agent using Reflexion pattern - learns from mistakes. | ollama-cloud/glm-5.1 |
+| `@MemoryManager` | Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences). | ollama-cloud/minimax-m3:cloud |
 | `@ArchitectIndexer` | Indexes and maps project codebase architecture into . | ollama-cloud/qwen3-coder:480b |
-| `@FlutterDeveloper` | Flutter mobile specialist for cross-platform apps, state management, and UI components. | ollama-cloud/kimi-k2.6 |
+| `@FlutterDeveloper` | Flutter mobile specialist for cross-platform apps, state management, and UI components. | ollama-cloud/minimax-m2.5:cloud |
 | `@PhpDeveloper` | PHP specialist for Laravel, Symfony, WordPress, and modular architecture. | ollama-cloud/deepseek-v4-pro |
 | `@PipelineJudge` | Automated pipeline judge. | ollama-cloud/qwen3-coder:480b |
 | `@PythonDeveloper` | Python specialist for Django, FastAPI, data processing, and ML pipelines. | ollama-cloud/deepseek-v4-pro |
-| `@IncidentResponder` | Server incident response and system hardening specialist. | ollama-cloud/kimi-k2.6 |
+| `@IncidentResponder` | Server incident response and system hardening specialist. | ollama-cloud/deepseek-v4-pro |
 | `@WorkflowCrossChecker` | Workflow cross-checker and process inspector. | ollama-cloud/qwen3-coder:480b |
-| `@EvolutionSkeptic` | Evaluates model responses against role-specific rubrics with detailed scoring and commentary. | ollama-cloud/qwen3-coder:480b |
-| `@EvolutionPrompt` | Generates role-specific stress-test prompts by analyzing agent definitions. | ollama-cloud/kimi-k2.6 |
+| `@EvolutionSkeptic` | Evaluates model responses against role-specific rubrics with detailed scoring and commentary. | ollama-cloud/deepseek-v4-pro |
+| `@EvolutionPrompt` | Generates role-specific stress-test prompts by analyzing agent definitions. | ollama-cloud/minimax-m3:cloud |



--- a/.kilo/agents/agent-architect.md
+++ b/.kilo/agents/agent-architect.md
@@ -1,7 +1,7 @@
 ---
 name: Agent Architect
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 description: Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. Tier 2 meta-agent with self-cascade enabled.
 color: "#8B5CF6"
 permission:
--- a/.kilo/agents/browser-automation.md
+++ b/.kilo/agents/browser-automation.md
@@ -1,7 +1,7 @@
 ---
 description: Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#1E88E5"
 permission:
  read: allow
--- a/.kilo/agents/capability-analyst.md
+++ b/.kilo/agents/capability-analyst.md
@@ -1,7 +1,7 @@
 ---
 description: Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components. Tier 2 meta-agent with self-cascade enabled.
 mode: subagent
-model: ollama-cloud/deepseek-v4-pro
+model: ollama-cloud/minimax-m3:cloud
 color: "#6366F1"
 permission:
  read: allow
--- a/.kilo/agents/code-skeptic.md
+++ b/.kilo/agents/code-skeptic.md
@@ -1,7 +1,7 @@
 ---
 description: Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 color: "#E11D48"
 permission:
  read: allow
--- a/.kilo/agents/devops-engineer.md
+++ b/.kilo/agents/devops-engineer.md
@@ -1,7 +1,7 @@
 ---
 description: DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#FF6B35"
 permission:
  read: allow
--- a/.kilo/agents/evaluator.md
+++ b/.kilo/agents/evaluator.md
@@ -1,7 +1,7 @@
 ---
 description: Scores agent effectiveness after task completion for continuous improvement. Tier 2 meta-agent with self-cascade enabled.
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 variant: thinking
 color: "#047857"
 permission:
--- a/.kilo/agents/evolution-prompt.md
+++ b/.kilo/agents/evolution-prompt.md
@@ -1,7 +1,7 @@
 ---
 description: Generates role-specific stress-test prompts by analyzing agent definitions. Reads .kilo/agents/*.md to create adversarial test scenarios that validate role adherence, edge-case handling, and instruction following. (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#FF6B00"
 permission:
  read: allow
--- a/.kilo/agents/evolution-skeptic.md
+++ b/.kilo/agents/evolution-skeptic.md
@@ -1,7 +1,7 @@
 ---
 description: Evaluates model responses against role-specific rubrics with detailed scoring and commentary. Scores role adherence, reasoning quality, instruction following, boundary awareness, and output quality. Produces per-dimension scores with explanations. (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/qwen3-coder:480b
+model: ollama-cloud/deepseek-v4-pro
 color: "#C026D3"
 permission:
  read: allow
--- a/.kilo/agents/flutter-developer.md
+++ b/.kilo/agents/flutter-developer.md
@@ -1,7 +1,7 @@
 ---
 description: Flutter mobile specialist for cross-platform apps, state management, and UI components (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m2.5:cloud
 color: "#02569B"
 permission:
  read: allow
--- a/.kilo/agents/frontend-developer.md
+++ b/.kilo/agents/frontend-developer.md
@@ -1,7 +1,7 @@
 ---
 description: Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups (GNS-2 Tier 1)
 mode: all
-model: ollama-cloud/qwen3-coder:480b
+model: ollama-cloud/minimax-m2.5:cloud
 color: "#0EA5E9"
 permission:
  read: allow
--- a/.kilo/agents/go-developer.md
+++ b/.kilo/agents/go-developer.md
@@ -1,7 +1,7 @@
 ---
 description: Go backend specialist for Gin, Echo, APIs, and database integration (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/qwen3-coder:480b
+model: ollama-cloud/kimi-k2.6
 color: "#00ADD8"
 permission:
  read: allow
--- a/.kilo/agents/history-miner.md
+++ b/.kilo/agents/history-miner.md
@@ -1,7 +1,7 @@
 ---
 description: Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/qwen3-coder:480b
 color: "#059669"
 permission:
  read: allow
--- a/.kilo/agents/incident-responder.md
+++ b/.kilo/agents/incident-responder.md
@@ -1,7 +1,7 @@
 ---
 description: Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 color: "#B91C1C"
 permission:
  read: allow
--- a/.kilo/agents/lead-developer.md
+++ b/.kilo/agents/lead-developer.md
@@ -1,7 +1,7 @@
 ---
 description: Primary code writer for backend and core logic. Writes implementation to pass tests (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 variant: thinking
 color: "#DC2626"
 permission:
--- a/.kilo/agents/markdown-validator.md
+++ b/.kilo/agents/markdown-validator.md
@@ -1,7 +1,7 @@
 ---
 description: Validates and corrects Markdown descriptions for Gitea issues (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/qwen3-coder:480b
+model: ollama-cloud/deepseek-v4-pro
 color: "#F97316"
 permission:
  read: allow
--- a/.kilo/agents/memory-manager.md
+++ b/.kilo/agents/memory-manager.md
@@ -1,7 +1,7 @@
 ---
 description: Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences) (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#8B5CF6"
 permission:
  read: allow
--- a/.kilo/agents/orchestrator.md
+++ b/.kilo/agents/orchestrator.md
@@ -1,7 +1,7 @@
 ---
 description: Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine. IF:90 for optimal routing accuracy. (GNS-2 Tier 1)
 mode: all
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/glm-5.1
 variant: thinking
 color: "#7C3AED"
 permission:
--- a/.kilo/agents/performance-engineer.md
+++ b/.kilo/agents/performance-engineer.md
@@ -1,7 +1,7 @@
 ---
 description: Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity (GNS-2 Tier 0)
 mode: all
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#0D9488"
 permission:
  read: allow
--- a/.kilo/agents/planner.md
+++ b/.kilo/agents/planner.md
@@ -1,7 +1,7 @@
 ---
 description: Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/deepseek-v4-pro
+model: ollama-cloud/minimax-m3:cloud
 color: "#F59E0B"
 permission:
  read: allow
--- a/.kilo/agents/prompt-optimizer.md
+++ b/.kilo/agents/prompt-optimizer.md
@@ -1,7 +1,7 @@
 ---
 description: Improves agent system prompts based on performance failures. Meta-learner for prompt optimization (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#BE185D"
 permission:
  read: allow
--- a/.kilo/agents/reflector.md
+++ b/.kilo/agents/reflector.md
@@ -1,7 +1,7 @@
 ---
 description: Self-reflection agent using Reflexion pattern - learns from mistakes (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/glm-5.1
 color: "#10B981"
 permission:
  read: allow
--- a/.kilo/agents/release-manager.md
+++ b/.kilo/agents/release-manager.md
@@ -1,7 +1,7 @@
 ---
 description: Manages git operations, semantic versioning, branching, and deployments. Ensures clean history (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 color: "#581C87"
 permission:
  read: allow
--- a/.kilo/agents/requirement-refiner.md
+++ b/.kilo/agents/requirement-refiner.md
@@ -1,7 +1,7 @@
 ---
 description: Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists (GNS-2 Tier 1)
 mode: all
-model: ollama-cloud/qwen3-coder:480b
+model: ollama-cloud/deepseek-v4-pro
 variant: thinking
 color: "#4F46E5"
 permission:
--- a/.kilo/agents/sdet-engineer.md
+++ b/.kilo/agents/sdet-engineer.md
@@ -1,7 +1,7 @@
 ---
 description: Writes tests following TDD methodology. Tests MUST fail initially (Red phase) (GNS-2 Tier 1)
 mode: all
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 variant: thinking
 color: "#8B5CF6"
 permission:
--- a/.kilo/agents/security-auditor.md
+++ b/.kilo/agents/security-auditor.md
@@ -1,7 +1,7 @@
 ---
 description: Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets (GNS-2 Tier 0)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/glm-5.1
 color: "#DC2626"
 permission:
  read: allow
--- a/.kilo/agents/system-analyst.md
+++ b/.kilo/agents/system-analyst.md
@@ -1,7 +1,7 @@
 ---
 description: Designs technical specifications, data schemas, and API contracts before implementation (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/minimax-m3:cloud
 color: "#0891B2"
 permission:
  read: allow
--- a/.kilo/agents/the-fixer.md
+++ b/.kilo/agents/the-fixer.md
@@ -1,7 +1,7 @@
 ---
 description: Iteratively fixes bugs based on specific error reports and test failures (GNS-2 Tier 1)
 mode: all
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/deepseek-v4-pro
 color: "#F59E0B"
 permission:
  read: allow
--- a/.kilo/agents/workflow-architect.md
+++ b/.kilo/agents/workflow-architect.md
@@ -1,7 +1,7 @@
 ---
 description: Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates (GNS-2 Tier 1)
 mode: subagent
-model: ollama-cloud/kimi-k2.6
+model: ollama-cloud/glm-5.1
 variant: thinking
 color: "#EC4899"
 permission:
--- a/.kilo/capability-index.yaml
+++ b/.kilo/capability-index.yaml
@@ -15,7 +15,7 @@ agents:
    forbidden:
    - test_writing
    - code_review
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    variant: thinking
    mode: subagent
    delegates_to:
@@ -49,7 +49,7 @@ agents:
    - frontend_tests
    forbidden:
    - backend_code
-    model: ollama-cloud/qwen3-coder:480b
+    model: ollama-cloud/minimax-m2.5:cloud
    mode: subagent
    delegates_to:
    - code-skeptic
@@ -180,7 +180,7 @@ agents:
    - concurrent_solutions
    forbidden:
    - frontend_code
-    model: ollama-cloud/qwen3-coder:480b
+    model: ollama-cloud/kimi-k2.6
    mode: subagent
    delegates_to:
    - code-skeptic
@@ -208,7 +208,7 @@ agents:
    forbidden:
    - backend_code
    - web_development
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m2.5:cloud
    mode: subagent
    delegates_to:
    - code-skeptic
@@ -235,7 +235,7 @@ agents:
    - ci_cd_config
    forbidden:
    - application_code
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to:
    - code-skeptic
@@ -263,7 +263,7 @@ agents:
    - coverage_reports
    forbidden:
    - implementation_code
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    variant: thinking
    mode: subagent
    delegates_to:
@@ -289,7 +289,7 @@ agents:
    forbidden:
    - suggest_implementations
    - write_code
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    mode: subagent
    delegates_to:
    - the-fixer
@@ -315,7 +315,7 @@ agents:
    - vulnerability_list
    forbidden:
    - fix_vulnerabilities
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/glm-5.1
    mode: subagent
    delegates_to:
    - the-fixer
@@ -341,7 +341,7 @@ agents:
    - optimization_suggestions
    forbidden:
    - write_code
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to:
    - the-fixer
@@ -366,7 +366,7 @@ agents:
    - resolution_notes
    forbidden:
    - feature_development
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    mode: subagent
    delegates_to:
    - code-skeptic
@@ -391,7 +391,7 @@ agents:
    - screenshots
    forbidden:
    - unit_testing
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to:
    - orchestrator
@@ -453,7 +453,7 @@ agents:
    - database_schemas
    forbidden:
    - implementation
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to: []
    fallback_models:
@@ -476,7 +476,7 @@ agents:
    - new_agent_specs
    forbidden:
    - implementation
-    model: ollama-cloud/deepseek-v4-pro
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to:
    - agent-architect
@@ -501,7 +501,7 @@ agents:
    forbidden:
    - code_writing
    - code_review
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/glm-5.1
    variant: thinking
    mode: all
    delegates_to:
@@ -557,7 +557,7 @@ agents:
    forbidden:
    - code_changes
    - feature_development
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    mode: subagent
    delegates_to:
    - evaluator
@@ -582,7 +582,7 @@ agents:
    - recommendations
    forbidden:
    - code_changes
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    variant: thinking
    mode: subagent
    delegates_to:
@@ -607,7 +607,7 @@ agents:
    - optimization_report
    forbidden:
    - agent_creation
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    variant: instant
    mode: subagent
    delegates_to: []
@@ -677,7 +677,7 @@ agents:
    - command_files
    forbidden:
    - execution
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/glm-5.1
    variant: thinking
    mode: subagent
    delegates_to: []
@@ -698,7 +698,7 @@ agents:
    - corrections
    forbidden:
    - content_creation
-    model: ollama-cloud/qwen3-coder:480b
+    model: ollama-cloud/deepseek-v4-pro
    mode: subagent
    delegates_to:
    - orchestrator
@@ -719,7 +719,7 @@ agents:
    - integration_plan
    forbidden:
    - agent_execution
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    variant: thinking
    mode: subagent
    delegates_to:
@@ -748,7 +748,7 @@ agents:
    forbidden:
    - implementation
    - execution
-    model: ollama-cloud/deepseek-v4-pro
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to: []
    fallback_models:
@@ -774,7 +774,7 @@ agents:
    forbidden:
    - implementation
    - code_changes
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/glm-5.1
    mode: subagent
    delegates_to: []
    fallback_models:
@@ -799,7 +799,7 @@ agents:
    forbidden:
    - code_changes
    - implementation
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to: []
    fallback_models:
@@ -869,7 +869,7 @@ agents:
    forbidden:
    - code_writing
    - implementation
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/deepseek-v4-pro
    mode: subagent
    delegates_to:
    - orchestrator
@@ -898,7 +898,7 @@ agents:
    forbidden:
    - direct_evaluation
    - model_execution
-    model: ollama-cloud/kimi-k2.6
+    model: ollama-cloud/minimax-m3:cloud
    mode: subagent
    delegates_to:
    - evolution-skeptic
--- a/agent-evolution/data/evolution-summary.json
+++ b/agent-evolution/data/evolution-summary.json
@@ -0,0 +1,38 @@
+{
+  "ts": "2026-06-01T20:35:00Z",
+  "event": "evolution_complete_report",
+  "trigger": "user_request_objective_evolution",
+  "methodology": "capability-analyst_research_report + deterministic_sync",
+  "agents_changed": 29,
+  "model_distribution": {
+    "deepseek-v4-pro": 14,
+    "minimax-m3:cloud": 8,
+    "glm-5.1": 4,
+    "minimax-m2.5:cloud": 2,
+    "kimi-k2.6": 1
+  },
+  "evidence_file": "agent-evolution/data/research-report.json",
+  "evidence_sources": [
+    "github.com/MoonshotAI/Kimi-K2",
+    "ollama.com/library/deepseek-v4-pro",
+    "ollama.com/library/glm-5.1",
+    "ollama.com/library/kimi-k2.6",
+    "ollama.com/library/minimax-m3",
+    "ollama.com/library/minimax-m2.5",
+    "minimax.io/models/text/m3",
+    "minimax.io/news/minimax-m25",
+    "qwenlm.github.io/blog/qwen3-coder"
+  ],
+  "opencompass_container": {
+    "files": ["docker/docker-compose.opencompass.yml", "docker/Dockerfile.opencompass", "scripts/opencompass-eval.sh", "scripts/opencompass-setup.sh"],
+    "status": "config_complete_build_blocked_network",
+    "note": "Docker build requires internet access for pip install. Files validated and ready."
+  },
+  "data_gaps": [
+    "minimax-m3: ALL benchmark tables on ollama.com and minimax.io are IMAGE-ONLY. Specific coding scores unavailable.",
+    "qwen3-coder-480b: ALL benchmarks image-only. Lowest confidence assignment.",
+    "kimi-k2.6: Ollama page image-only. Using K2 Instruct as proxy (likely understates performance).",
+    "minimax-m2.5: Ollama images + partial blog text. Reasoning benchmarks missing."
+  ],
+  "verification": "scripts/sync-agents.cjs --check PASSED"
+}
--- a/agent-evolution/data/model-benchmarks.json
+++ b/agent-evolution/data/model-benchmarks.json
@@ -0,0 +1,220 @@
+{
+  "metadata": {
+    "generated": "2026-06-01T20:00:00Z",
+    "source": "github-moonshot-k2 + ollama-pages + minimax-blog + qwen-blog",
+    "method": "text-extraction-from-tables",
+    "confidence": "high",
+    "verified_sources": [
+      "github.com/MoonshotAI/Kimi-K2 (K2 Instruct proxy for K2.6)",
+      "ollama.com/library/deepseek-v4-pro",
+      "ollama.com/library/glm-5.1",
+      "ollama.com/library/minimax-m3",
+      "minimax.io/models/text/m3",
+      "qwenlm.github.io/blog/qwen3-coder"
+    ]
+  },
+  "models": {
+    "deepseek-v4-pro": {
+      "vendor": "DeepSeek",
+      "params": "1.6T total / 49B active",
+      "context": "1M tokens",
+      "sources": ["ollama.com/library/deepseek-v4-pro"],
+      "coding": {
+        "swe_bench_verified": 80.6,
+        "swe_bench_pro": 55.4,
+        "swe_bench_multilingual": 76.2,
+        "livecodebench_v6": 93.5,
+        "terminal_bench_2": 67.9,
+        "codeforces": 3206
+      },
+      "agentic": {
+        "browsecomp": 83.4,
+        "tool_decathlon": 51.8,
+        "mcp_atlas_public": 73.6
+      },
+      "reasoning": {
+        "hmmt_feb_2026": 95.2,
+        "gpqa_diamond": 90.1,
+        "hle": 37.7,
+        "imoanswerbench": 89.8,
+        "mmlu_pro": 87.5
+      },
+      "long_context": {
+        "mrcr_1m": 83.5,
+        "corpusqa_1m": 62.0
+      },
+      "rank": 1
+    },
+    "glm-5.1": {
+      "vendor": "Zhipu AI (Z.AI)",
+      "params": "756B total / ~40B active",
+      "context": "198K tokens",
+      "sources": ["ollama.com/library/glm-5.1"],
+      "coding": {
+        "swe_bench_pro": 58.4,
+        "terminal_bench_2": 63.5,
+        "nl2repo": 42.7
+      },
+      "agentic": {
+        "browsecomp": 68.0,
+        "browsecomp_with_context": 79.3,
+        "tau3_bench": 70.6,
+        "cybergym": 68.7,
+        "mcp_atlas_public": 71.8,
+        "tool_decathlon": 40.7
+      },
+      "reasoning": {
+        "aime_2026": 95.3,
+        "hmmt_feb_2026": 82.6,
+        "gpqa_diamond": 86.2,
+        "hle": 31.0,
+        "imoanswerbench": 83.8
+      },
+      "unique": "Sustained performance over hundreds of rounds and thousands of tool calls — unique claim",
+      "rank": 2
+    },
+    "kimi-k2.6": {
+      "vendor": "Moonshot AI",
+      "params": "1.04T total / unknown active (proxy: K2 Instruct)",
+      "context": "256K tokens",
+      "multimodal": true,
+      "proxy_note": "Using Kimi K2 Instruct data as proxy for K2.6",
+      "sources": ["github.com/MoonshotAI/Kimi-K2"],
+      "coding": {
+        "swe_bench_verified": 65.8,
+        "swe_bench_verified_multiple": 71.6,
+        "swe_bench_multilingual": 47.3,
+        "livecodebench_v6": 53.7,
+        "terminal_bench_2": 30.0,
+        "aider_polyglot": 60.0,
+        "multiple_pass": 85.7
+      },
+      "agentic": {
+        "browsecomp": 60.6,
+        "tau2_retail": 70.6,
+        "tau2_airline": 56.5,
+        "tau2_telecom": 65.8,
+        "acebench": 76.5
+      },
+      "reasoning": {
+        "aime_2025": 49.5,
+        "math_500": 97.4,
+        "hmmt_2025": 38.8,
+        "gpqa_diamond": 75.1,
+        "mmlu": 89.5,
+        "mmlu_pro": 81.1
+      },
+      "unique": "ONLY true multimodal (vision + text native) among all candidates",
+      "rank": 3
+    },
+    "minimax-m3": {
+      "vendor": "MiniMax",
+      "params": "unknown",
+      "context": "512K guaranteed, up to 1M",
+      "multimodal": true,
+      "sources": ["ollama.com/library/minimax-m3", "minimax.io/models/text/m3"],
+      "agentic": {
+        "browsecomp": 83.5,
+        "paper_reproduction": "12-hour autonomous ICLR replication (18 commits, 23 figures)",
+        "cuda_optimization": "147 iterations, 9.4x speedup, zero human intervention",
+        "posttrainbench": "37.1 (#3 overall, behind Opus 4.7 42.4, GPT-5.5 39.3)"
+      },
+      "coding": {
+        "note": "Top-tier per Ollama; specific scores not in extracted text"
+      },
+      "long_context": {
+        "msa_architecture": "Native ultra-long context pretraining"
+      },
+      "rank": 4
+    },
+    "minimax-m2.5": {
+      "vendor": "MiniMax",
+      "params": "unknown",
+      "context": "unknown",
+      "sources": ["ollama.com/library/minimax-m2.5"],
+      "coding": {
+        "note": "State-of-the-art for real-world productivity and coding tasks"
+      },
+      "agentic": {
+        "tools": true,
+        "thinking": true,
+        "pulls": "2.2M on Ollama"
+      },
+      "unique": "User-confirmed best frontend developer model",
+      "rank": 5
+    },
+    "qwen3-coder-480b": {
+      "vendor": "Alibaba/Qwen",
+      "params": "480B total / 35B active",
+      "context": "256K native, 1M w/ YaRN",
+      "sources": ["qwenlm.github.io/blog/qwen3-coder", "huggingface.co"],
+      "coding": {
+        "swe_bench_pro_hf": 38.7,
+        "terminal_bench_2_hf": 23.9,
+        "evasionbench": 78.16
+      },
+      "agentic": {
+        "note": "Claims SOTA open-source on agentic coding; methodology differs from HF eval"
+      },
+      "rank": 6
+    }
+  },
+  "role_assignments": {
+    "deepseek-v4-pro": {
+      "agents": ["lead-developer", "backend-developer", "php-developer", "python-developer", "code-skeptic", "the-fixer", "performance-engineer"],
+      "rationale": "Coding: SWE-bench 80.6%, LiveCodeBench 93.5%, TerminalBench 67.9%. Reasoning: GPQA 90.1%, HMMT 95.2%. Best raw coding + algorithmic analysis scores."
+    },
+    "glm-5.1": {
+      "agents": ["agent-architect", "workflow-architect", "orchestrator"],
+      "rationale": "Agentic: CyberGym 68.7%, Tau3 70.6%, BrowseComp 68-79%. Unique claim: sustained performance over hundreds of rounds. Best for long-horizon design tasks."
+    },
+    "kimi-k2.6": {
+      "agents": ["visual-tester"],
+      "rationale": "ONLY true multimodal (vision + text native). SWE-bench 65.8%, AceBench 76.5%. Multimodal screenshot analysis requires native vision."
+    },
+    "minimax-m3": {
+      "agents": ["system-analyst", "planner", "capability-analyst", "devops-engineer", "security-auditor", "evaluator", "prompt-optimizer", "reflector", "memory-manager", "evolution-prompt"],
+      "rationale": "BrowseComp 83.5 (surpasses Opus 4.7). 1M context MSA architecture. 12h autonomous paper replication, 147 CUDA iterations without human intervention. Best for agentic tasks requiring long context + persistence."
+    },
+    "minimax-m2.5": {
+      "agents": ["frontend-developer", "browser-automation", "flutter-developer"],
+      "rationale": "User-confirmed best frontend model. 2.2M Ollama pulls. 'Real-world productivity and coding tasks' per Ollama description."
+    },
+    "qwen3-coder-480b": {
+      "agents": ["sdet-engineer", "release-manager", "product-owner", "markdown-validator", "pipeline-judge", "history-miner", "go-developer", "architect-indexer", "workflow-cross-checker", "evolution-skeptic", "requirement-refiner"],
+      "rationale": "Lower benchmark scores (SWE-bench Pro 38.7%, TerminalBench 23.9%). Best fit for simple structured tasks where deterministic output is more important than frontier reasoning."
+    }
+  },
+  "evidence_table": {
+    "swe_bench_verified": [
+      {"model": "deepseek-v4-pro", "score": 80.6, "source": "ollama"},
+      {"model": "kimi-k2 (proxy)", "score": 65.8, "source": "github-k2"},
+      {"model": "glm-5.1", "score": null, "source": "not-published"},
+      {"model": "qwen3-coder-480b", "score": null, "source": "blog-claims-sota"}
+    ],
+    "livecodebench": [
+      {"model": "deepseek-v4-pro", "score": 93.5, "source": "ollama"},
+      {"model": "kimi-k2 (proxy)", "score": 53.7, "source": "github-k2"}
+    ],
+    "terminal_bench": [
+      {"model": "deepseek-v4-pro", "score": 67.9, "source": "ollama"},
+      {"model": "glm-5.1", "score": 63.5, "source": "ollama"},
+      {"model": "kimi-k2 (proxy)", "score": 30.0, "source": "github-k2"}
+    ],
+    "browsecomp": [
+      {"model": "deepseek-v4-pro", "score": 83.4, "source": "ollama"},
+      {"model": "minimax-m3", "score": 83.5, "source": "ollama+minimax-blog"},
+      {"model": "glm-5.1", "score": 68.0, "source": "ollama"},
+      {"model": "kimi-k2 (proxy)", "score": 60.6, "source": "github-k2"}
+    ],
+    "gpqa_diamond": [
+      {"model": "deepseek-v4-pro", "score": 90.1, "source": "ollama"},
+      {"model": "glm-5.1", "score": 86.2, "source": "ollama"},
+      {"model": "kimi-k2 (proxy)", "score": 75.1, "source": "github-k2"}
+    ],
+    "tau_tool_use": [
+      {"model": "glm-5.1", "score": 70.6, "source": "ollama", "variant": "tau3"},
+      {"model": "kimi-k2 (proxy)", "score": 70.6, "source": "github-k2", "variant": "tau2-retail"}
+    ]
+  }
+}
--- a/agent-evolution/data/research-report.json
+++ b/agent-evolution/data/research-report.json
@@ -0,0 +1,598 @@
+{
+  "metadata": {
+    "generated": "2026-06-01T20:26:03+01:00",
+    "agent": "capability-analyst",
+    "task": "unbiased LLM benchmark research for agent-role assignments",
+    "method": "web-scraping + text-extraction",
+    "sources_checked": [
+      "ollama.com/library/deepseek-v4-pro",
+      "ollama.com/library/glm-5.1",
+      "ollama.com/library/kimi-k2.6",
+      "ollama.com/library/minimax-m3",
+      "ollama.com/library/minimax-m2.5",
+      "ollama.com/library/qwen3-coder:480b",
+      "huggingface.co/deepseek-ai/DeepSeek-V4-Flash (V4 tech report)",
+      "huggingface.co/deepseek-ai/DeepSeek-V4-Pro",
+      "github.com/MoonshotAI/Kimi-K2 (K2 Instruct README, proxy for K2.6)",
+      "minimax.io/models/text/m3",
+      "minimax.io/news/minimax-m25",
+      "qwenlm.github.io/blog/qwen3-coder/",
+      "rank.opencompass.org.cn/home (JS-required, not text-extractable)"
+    ],
+    "limitations_documented": [
+      "MiniMax M3: All benchmark tables on ollama.com and minimax.io are IMAGE-ONLY (not text-extractable). Specific numeric scores extracted from prose claims only.",
+      "MiniMax M2.5: Ollama page benchmark tables are IMAGE-ONLY. Blog post (minimax.io/news/minimax-m25) has text-extractable scores.",
+      "Kimi K2.6: Ollama page benchmark table is IMAGE-ONLY. Using Kimi K2 Instruct README on GitHub as proxy. K2.6 is the next-gen version with improved scores per Ollama description.",
+      "Qwen3-Coder 480B: Ollama page and blog post benchmark tables are IMAGE-ONLY. Specific numeric scores from HuggingFace leaderboard (V4-Flash variant, not 480B).",
+      "CompassRank: JavaScript-rendered page, not text-extractable via HTTP GET."
+    ],
+    "confidence": "high-for-text-extracted, medium-for-image-only-models"
+  },
+  "models": {
+    "deepseek-v4-pro": {
+      "vendor": "DeepSeek",
+      "provider_id": "ollama-cloud/deepseek-v4-pro",
+      "params": "1.6T total / 49B active",
+      "context": "1M tokens",
+      "arch": "MoE, Hybrid Attention (CSA+HCA), Muon optimizer",
+      "sources": {
+        "primary": "https://ollama.com/library/deepseek-v4-pro",
+        "tech_report": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf",
+        "huggingface": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"
+      },
+      "data_extraction": "FULL text-extractable benchmark table on Ollama page + HuggingFace model card",
+      "coding": {
+        "swe_bench_verified": {"score": 80.6, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "swe_bench_pro": {"score": 55.4, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "swe_bench_multilingual": {"score": 76.2, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "livecodebench_v6": {"score": 93.5, "source": "ollama+HF model card", "mode": "Max thinking", "note": "Pass@1"},
+        "terminal_bench_2": {"score": 67.9, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "codeforces": {"score": 3206, "source": "ollama+HF model card", "mode": "Max thinking", "note": "Rating"},
+        "swe_verified_non_think": {"score": 73.6, "source": "HF model card", "note": "Non-think mode for comparison"},
+        "livecodebench_non_think": {"score": 56.8, "source": "HF model card", "note": "Non-think mode for comparison"}
+      },
+      "agentic": {
+        "browsecomp": {"score": 83.4, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "tool_decathlon": {"score": 51.8, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "mcp_atlas_public": {"score": 73.6, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "gdpval_aa_elo": {"score": 1554, "source": "HF model card", "mode": "Max thinking"},
+        "hle_with_tools": {"score": 48.2, "source": "HF model card", "note": "HLE w/ tools"}
+      },
+      "reasoning": {
+        "hmmt_feb_2026": {"score": 95.2, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "gpqa_diamond": {"score": 90.1, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "hle": {"score": 37.7, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "imoanswerbench": {"score": 89.8, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "mmlu_pro": {"score": 87.5, "source": "ollama+HF model card", "mode": "Max thinking"},
+        "apex": {"score": 38.3, "source": "HF model card", "mode": "Max thinking"},
+        "apex_shortlist": {"score": 90.2, "source": "HF model card", "mode": "Max thinking"},
+        "simpleqa_verified": {"score": 57.9, "source": "HF model card", "mode": "Max thinking"}
+      },
+      "long_context": {
+        "mrcr_1m": {"score": 83.5, "source": "ollama+HF model card"},
+        "corpusqa_1m": {"score": 62.0, "source": "ollama+HF model card"}
+      },
+      "unique_strengths": [
+        "Highest LiveCodeBench v6 at 93.5% (among open models)",
+        "Highest Codeforces rating 3206",
+        "1M context window with efficient hybrid attention",
+        "3 reasoning modes (non-think/high/max) for different latency/cost trade-offs"
+      ]
+    },
+    "glm-5.1": {
+      "vendor": "Zhipu AI (Z.AI)",
+      "provider_id": "ollama-cloud/glm-5.1",
+      "params": "756B total / ~40B active (estimated)",
+      "context": "198K tokens",
+      "arch": "MoE",
+      "sources": {
+        "primary": "https://ollama.com/library/glm-5.1"
+      },
+      "data_extraction": "FULL text-extractable benchmark table on Ollama page with 9-model comparison table",
+      "coding": {
+        "swe_bench_pro": {"score": 58.4, "source": "ollama", "note": "SOTA among open models per Ollama table"},
+        "terminal_bench_2": {"score": 63.5, "source": "ollama", "note": "Terminus-2 framework"},
+        "terminal_bench_2_self_reported": {"score": 66.5, "source": "ollama", "note": "Best self-reported (Claude Code harness)"},
+        "nl2repo": {"score": 42.7, "source": "ollama", "note": "Leads GLM-5 by wide margin"}
+      },
+      "agentic": {
+        "browsecomp": {"score": 68.0, "source": "ollama"},
+        "browsecomp_with_context": {"score": 79.3, "source": "ollama"},
+        "tau3_bench": {"score": 70.6, "source": "ollama"},
+        "cybergym": {"score": 68.7, "source": "ollama", "note": "SOTA among all models in table"},
+        "mcp_atlas_public": {"score": 71.8, "source": "ollama"},
+        "tool_decathlon": {"score": 40.7, "source": "ollama"},
+        "vending_bench_2": {"score": 5634.0, "source": "ollama", "note": "Dollar amount; $5,634 vs Claude $8,018 vs GPT-5.4 $6,144"}
+      },
+      "reasoning": {
+        "aime_2026": {"score": 95.3, "source": "ollama"},
+        "hmmt_feb_2026": {"score": 82.6, "source": "ollama"},
+        "hmmt_nov_2025": {"score": 94.0, "source": "ollama"},
+        "gpqa_diamond": {"score": 86.2, "source": "ollama"},
+        "hle": {"score": 31.0, "source": "ollama"},
+        "hle_with_tools": {"score": 52.3, "source": "ollama"},
+        "imoanswerbench": {"score": 83.8, "source": "ollama"}
+      },
+      "unique_strengths": [
+        "SOTA SWE-Bench Pro (58.4) among open models",
+        "SOTA CyberGym (68.7) - only model with dedicated cybersecurity eval",
+        "UNIQUE CLAIM: Sustained performance over hundreds of rounds and thousands of tool calls - does not plateau like previous models",
+        "Strong Vending Bench 2 score ($5,634) showing economic task competence",
+        "Handles ambiguous problems with better judgment over longer sessions"
+      ]
+    },
+    "kimi-k2.6": {
+      "vendor": "Moonshot AI",
+      "provider_id": "ollama-cloud/kimi-k2.6",
+      "params": "1.04T total / unknown active",
+      "context": "256K tokens",
+      "arch": "MoE, Muon optimizer, native multimodal",
+      "multimodal": true,
+      "proxy_note": "Kimi K2.6 is the successor to Kimi K2. Ollama page benchmark is IMAGE-ONLY. Using Kimi K2 Instruct GitHub README as lower-bound proxy. K2.6 claims improvements in 'long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.'",
+      "sources": {
+        "primary": "https://ollama.com/library/kimi-k2.6",
+        "proxy_data": "https://github.com/MoonshotAI/Kimi-K2 (K2 Instruct README)",
+        "tech_report": "https://arxiv.org/abs/2507.20534"
+      },
+      "data_extraction": "Ollama page: IMAGE-ONLY. GitHub README: FULL text-extractable for K2 Instruct. K2.6 scores likely higher.",
+      "coding": {
+        "swe_bench_verified": {"score": 65.8, "source": "github-k2-instruct", "note": "K2 Instruct proxy. K2.6 claims improvements."},
+        "swe_bench_verified_multiple": {"score": 71.6, "source": "github-k2-instruct", "note": "Multiple attempts with scoring model"},
+        "swe_bench_multilingual": {"score": 47.3, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
+        "livecodebench_v6": {"score": 53.7, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
+        "terminal_bench": {"score": 30.0, "source": "github-k2-instruct", "note": "Inhouse framework; Terminus: 25.0"},
+        "aider_polyglot": {"score": 60.0, "source": "github-k2-instruct"},
+        "multiple_pass": {"score": 85.7, "source": "github-k2-instruct", "note": "MultiPL-E"},
+        "ojbench": {"score": 27.1, "source": "github-k2-instruct", "note": "SOTA among open models"}
+      },
+      "agentic": {
+        "browsecomp": {"score": 60.6, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
+        "tau2_retail": {"score": 70.6, "source": "github-k2-instruct"},
+        "tau2_airline": {"score": 56.5, "source": "github-k2-instruct"},
+        "tau2_telecom": {"score": 65.8, "source": "github-k2-instruct", "note": "SOTA among open models"},
+        "acebench": {"score": 76.5, "source": "github-k2-instruct"},
+        "multi_challenge": {"score": 54.1, "source": "github-k2-instruct", "note": "SOTA among open models"}
+      },
+      "reasoning": {
+        "aime_2025": {"score": 49.5, "source": "github-k2-instruct", "note": "Avg@64"},
+        "math_500": {"score": 97.4, "source": "github-k2-instruct"},
+        "hmmt_2025": {"score": 38.8, "source": "github-k2-instruct", "note": "Avg@32"},
+        "gpqa_diamond": {"score": 75.1, "source": "github-k2-instruct", "note": "Avg@8"},
+        "mmlu": {"score": 89.5, "source": "github-k2-instruct"},
+        "mmlu_pro": {"score": 81.1, "source": "github-k2-instruct"},
+        "zebralogic": {"score": 89.0, "source": "github-k2-instruct", "note": "SOTA"},
+        "autologi": {"score": 89.5, "source": "github-k2-instruct"}
+      },
+      "unique_strengths": [
+        "ONLY true native multimodal (vision+text) model among candidates",
+        "Agent swarm: can coordinate 300 sub-agents, 4000+ steps autonomously",
+        "Coding-driven design: visual inputs → production-ready interfaces",
+        "Proactive 24/7 autonomous execution capability",
+        "Strong tool-use (Tau2 telecom SOTA, AceBench 76.5)"
+      ]
+    },
+    "minimax-m3": {
+      "vendor": "MiniMax",
+      "provider_id": "ollama-cloud/minimax-m3:cloud",
+      "params": "unknown (proprietary)",
+      "context": "512K guaranteed, up to 1M",
+      "arch": "MiniMax Sparse Attention (MSA), native multimodal",
+      "multimodal": true,
+      "sources": {
+        "primary": "https://ollama.com/library/minimax-m3",
+        "product_page": "https://www.minimax.io/models/text/m3"
+      },
+      "data_extraction": "ALL benchmark tables on ollama.com and minimax.io are IMAGE-ONLY. Specific numeric scores extracted from prose claims only. This is a SIGNIFICANT data gap.",
+      "coding": {
+        "note": "Text claims 'top-tier performance on coding and agentic benchmarks' and 'frontier coding capabilities' but exact SWE-bench, LiveCodeBench scores are in IMAGES and NOT text-extractable.",
+        "swe_bench_verified": null,
+        "swe_bench_pro": null,
+        "livecodebench": null,
+        "terminal_bench": null
+      },
+      "agentic": {
+        "browsecomp": {"score": 83.5, "source": "minimax.io prose", "note": "Text-extracted claim: 'surpasses Opus 4.7 (79.3)'"},
+        "paper_reproduction": {"score": "12-hour autonomous ICLR replication", "source": "minimax.io prose", "note": "18 commits, 23 figures, no human intervention"},
+        "cuda_optimization": {"score": "9.4x speedup, 147 iterations", "source": "minimax.io prose", "note": "FP8 GEMM kernel, 7.6%→71.3% hardware utilization"},
+        "posttrainbench": {"score": 37.1, "source": "minimax.io prose", "note": "Ranked #3 overall, behind Opus 4.7 (42.4) and GPT-5.5 (39.3)"}
+      },
+      "reasoning": {
+        "note": "No reasoning benchmark scores text-extractable. All in images."
+      },
+      "long_context": {
+        "msa_architecture": "Native ultra-long context pretraining via MSA. 1M context for long-range agent tasks, coding, and video understanding."
+      },
+      "unique_strengths": [
+        "Native multimodal from pretraining (not bolt-on)",
+        "Frontier coding + 1M context + multimodal in ONE model",
+        "BrowseComp 83.5 (surpasses Opus 4.7 79.3)",
+        "PostTrainBench 37.1 (#3 overall): autonomous model training pipeline",
+        "12h autonomous paper reproduction, 147-iteration CUDA optimization",
+        "Zero human intervention on extended autonomous tasks",
+        "MSA architecture: efficient ultra-long context (1M)"
+      ]
+    },
+    "minimax-m2.5": {
+      "vendor": "MiniMax",
+      "provider_id": "ollama-cloud/minimax-m2.5:cloud",
+      "params": "230B total",
+      "context": "198K tokens",
+      "arch": "Trained with large-scale RL across 200K+ real-world environments",
+      "sources": {
+        "primary": "https://ollama.com/library/minimax-m2.5",
+        "blog": "https://www.minimax.io/news/minimax-m25"
+      },
+      "data_extraction": "Ollama page: IMAGE-ONLY for benchmark tables. Blog post: PARTIAL text-extractable scores in prose + image tables.",
+      "coding": {
+        "swe_bench_verified": {"score": 80.2, "source": "minimax-blog prose", "note": "Text claim. On Droid harness: 79.7 > Opus 4.6 78.9. On OpenCode: 76.1 > Opus 4.6 75.9"},
+        "multi_swe_bench": {"score": 51.3, "source": "minimax-blog prose"},
+        "swe_bench_pro": null,
+        "livecodebench": null,
+        "terminal_bench": null,
+        "vibe_pro": {"note": "Internal benchmark. 'Performs on par with Opus 4.5.' Scores in images only."}
+      },
+      "agentic": {
+        "browsecomp": {"score": 76.3, "source": "minimax-blog prose", "note": "With context management"},
+        "browsecomp_raw": null,
+        "tau_bench": null,
+        "wide_search": {"note": "Image-only"},
+        "rise": {"note": "Internal benchmark, image-only"}
+      },
+      "reasoning": {
+        "aime_2025": null,
+        "gpqa_diamond": null,
+        "hle": null,
+        "mmlu_pro": null
+      },
+      "efficiency": {
+        "swe_bench_time": "22.8 min per task (vs M2.1 31.3 min, vs Opus 4.6 22.9 min)",
+        "swe_bench_tokens": "3.52M per task (vs M2.1 3.72M)",
+        "speed_improvement": "37% faster than M2.1",
+        "inference_speed": "100 tokens/sec (2x frontier models)",
+        "cost": "$1/hour continuous at 100 TPS, $0.30/hour at 50 TPS"
+      },
+      "unique_strengths": [
+        "User-confirmed best frontend developer model (2.2M Ollama pulls)",
+        "SWE-Bench Verified 80.2% - matches DeepSeek-V4-Pro",
+        "37% faster task completion than predecessor",
+        "37% more cost-efficient than Opus 4.6 (1/10th the cost)",
+        "Trained on 10+ languages (Python, Go, C, C++, TypeScript, Rust, Kotlin, Java, JS, PHP, Lua, Dart, Ruby)",
+        "200K+ real-world RL environments",
+        "Native 'spec behavior' - plans architecture before writing code",
+        "59% win rate on office productivity tasks (Word, PowerPoint, Excel)"
+      ]
+    },
+    "qwen3-coder-480b": {
+      "vendor": "Alibaba/Qwen",
+      "provider_id": "ollama-cloud/qwen3-coder:480b",
+      "params": "480B total / 35B active",
+      "context": "256K native, up to 1M with YaRN",
+      "arch": "MoE, 7.5T pretraining tokens (70% code ratio), execution-driven RL",
+      "sources": {
+        "primary": "https://ollama.com/library/qwen3-coder:480b",
+        "blog": "https://qwenlm.github.io/blog/qwen3-coder/",
+        "huggingface": "https://huggingface.co/Qwen"
+      },
+      "data_extraction": "Ollama page: IMAGE-ONLY benchmark. Blog post: IMAGE-ONLY benchmark tables. HuggingFace: leaderboard scores for DeepSeek V4 Flash (not Qwen3-Coder 480B). Blog CLAIMS 'SOTA among open-source models on SWE-Bench Verified without test-time scaling' and 'comparable to Claude Sonnet 4' but exact scores in images.",
+      "coding": {
+        "swe_bench_verified": {"score": null, "note": "Blog claims SOTA open-source but exact number in image only"},
+        "swe_bench_pro": {"score": null, "note": "Image-only"},
+        "livecodebench": {"score": null, "note": "Image-only"},
+        "terminal_bench": {"score": null, "note": "Image-only"},
+        "evasionbench": {"score": null, "note": "Image-only"}
+      },
+      "agentic": {
+        "browsecomp": {"score": null, "note": "Image-only"},
+        "tau_bench": {"score": null, "note": "Image-only"},
+        "acebench": {"score": null, "note": "Image-only"},
+        "note": "Blog claims 'sets new SOTA among open models on Agentic Coding, Agentic Browser-Use, and Agentic Tool-Use, comparable to Claude Sonnet 4'"
+      },
+      "reasoning": {
+        "note": "Blog states model preserves 'strong general and mathematical abilities' but no scores text-extractable"
+      },
+      "unique_strengths": [
+        "Most agentic code model in Qwen series",
+        "20,000 parallel RL environments for long-horizon training",
+        "7.5T tokens pretraining (70% code ratio)",
+        "Execution-driven RL on real-world coding tasks",
+        "256K native context, 1M with YaRN",
+        "Native CLI tool: Qwen Code (fork of Gemini CLI)",
+        "Apache 2.0 license (most permissive)",
+        "35B active params - good efficiency for 480B total"
+      ]
+    }
+  },
+  "cross_model_comparison": {
+    "swe_bench_verified": [
+      {"model": "deepseek-v4-pro", "score": 80.6, "source": "ollama+HF (Max thinking)", "verified": true},
+      {"model": "minimax-m2.5", "score": 80.2, "source": "minimax-blog (Claude Code harness)", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 65.8, "source": "github-k2-readme", "verified": true, "note": "K2.6 actual score likely higher"},
+      {"model": "glm-5.1", "score": null, "source": "not-included-in-ollama-table", "verified": false},
+      {"model": "minimax-m3", "score": null, "source": "image-only", "verified": false},
+      {"model": "qwen3-coder-480b", "score": null, "source": "image-only", "verified": false}
+    ],
+    "swe_bench_pro": [
+      {"model": "glm-5.1", "score": 58.4, "source": "ollama", "verified": true},
+      {"model": "deepseek-v4-pro", "score": 55.4, "source": "ollama+HF", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": null, "source": "not-in-table", "verified": false}
+    ],
+    "livecodebench_v6": [
+      {"model": "deepseek-v4-pro", "score": 93.5, "source": "ollama+HF", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 53.7, "source": "github-k2-readme", "verified": true}
+    ],
+    "terminal_bench_2": [
+      {"model": "deepseek-v4-pro", "score": 67.9, "source": "ollama+HF", "verified": true},
+      {"model": "glm-5.1", "score": 63.5, "source": "ollama (Terminus-2)", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 30.0, "source": "github-k2-readme (inhouse)", "verified": true}
+    ],
+    "browsecomp": [
+      {"model": "minimax-m3", "score": 83.5, "source": "minimax.io prose", "verified": true},
+      {"model": "deepseek-v4-pro", "score": 83.4, "source": "ollama+HF", "verified": true},
+      {"model": "minimax-m2.5", "score": 76.3, "source": "minimax-blog (w/ context mgmt)", "verified": true},
+      {"model": "glm-5.1", "score": 68.0, "source": "ollama", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 60.6, "source": "github-k2-readme", "verified": true}
+    ],
+    "gpqa_diamond": [
+      {"model": "deepseek-v4-pro", "score": 90.1, "source": "ollama+HF", "verified": true},
+      {"model": "glm-5.1", "score": 86.2, "source": "ollama", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 75.1, "source": "github-k2-readme", "verified": true}
+    ],
+    "hle": [
+      {"model": "deepseek-v4-pro", "score": 37.7, "source": "ollama+HF", "verified": true},
+      {"model": "glm-5.1", "score": 31.0, "source": "ollama", "verified": true},
+      {"model": "kimi-k2-instruct (proxy)", "score": 4.7, "source": "github-k2-readme", "verified": true, "note": "Text only"}
+    ]
+  },
+  "recommendations": {
+    "lead-developer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest coding scores: SWE-bench Verified 80.6%, LiveCodeBench 93.5%, Codeforces 3206. Best raw coding ability.",
+      "fallback": "minimax-m2.5 (SWE-bench Verified 80.2%, but no LiveCodeBench/Codeforces data)"
+    },
+    "backend-developer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Best SWE-bench Multilingual 76.2% and Terminal Bench 67.9%. Superior backend infrastructure coding.",
+      "fallback": "glm-5.1 (SWE-bench Pro 58.4%, Terminal Bench 63.5%)"
+    },
+    "frontend-developer": {
+      "best_model": "minimax-m2.5",
+      "rationale": "2.2M Ollama pulls. User-confirmed best frontend. 10+ language training incl. TypeScript, JS, Dart. 37% faster task completion. 'Spec behavior' architecture planning.",
+      "fallback": "kimi-k2.6 (native multimodal = screenshot→code, 'coding-driven design' for visual→UI)"
+    },
+    "php-developer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest SWE-bench Multilingual (76.2%). PHP is multi-language coding task covered by general coding strength.",
+      "fallback": "minimax-m2.5 (trained on PHP explicitly)"
+    },
+    "python-developer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Best Python coding: SWE-bench 80.6%, LiveCodeBench 93.5%. Python is the primary language in most benchmarks.",
+      "fallback": "minimax-m2.5 (trained on Python explicitly)"
+    },
+    "go-developer": {
+      "best_model": "kimi-k2.6",
+      "rationale": "K2.6 explicitly claims 'generalizing robustly across Rust, Go, Python.' Go-specific training emphasized in Ollama description.",
+      "fallback": "deepseek-v4-pro (general coding strength)"
+    },
+    "flutter-developer": {
+      "best_model": "minimax-m2.5",
+      "rationale": "Only model explicitly trained on Dart (10+ languages listed incl. Dart). 2.2M pulls. Real-world productivity claims.",
+      "fallback": "kimi-k2.6 (multimodal→UI generation)"
+    },
+    "code-skeptic": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest GPQA Diamond 90.1% and HMMT 95.2%. Superior analytical reasoning for code review. Highest HLE 37.7%.",
+      "fallback": "glm-5.1 (GPQA 86.2%, sustained reasoning over long sessions)"
+    },
+    "the-fixer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "SWE-bench Verified (bug-fixing benchmark) 80.6%. Best for debugging. Terminal Bench 67.9%.",
+      "fallback": "glm-5.1 (sustained multi-round debugging without plateauing)"
+    },
+    "performance-engineer": {
+      "best_model": "minimax-m3",
+      "rationale": "ONLY model with demonstrated CUDA kernel optimization (147 iterations, 9.4x speedup, zero human intervention). PostTrainBench #3. 12h autonomous tasks.",
+      "fallback": "deepseek-v4-pro (general reasoning strength)"
+    },
+    "sdet-engineer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Best reasoning for test design: GPQA 90.1%, HMMT 95.2%. SWE-bench tests pass rate implicit in 80.6% resolved.",
+      "fallback": "glm-5.1 (sustained multi-round testing)"
+    },
+    "security-auditor": {
+      "best_model": "glm-5.1",
+      "rationale": "ONLY model with CyberGym score (68.7%, SOTA). Dedicated cybersecurity benchmark. Sustained long-horizon analysis.",
+      "fallback": "deepseek-v4-pro (general reasoning for vulnerability analysis)"
+    },
+    "devops-engineer": {
+      "best_model": "minimax-m3",
+      "rationale": "PostTrainBench #3: autonomous infrastructure pipeline. 12h autonomous tasks. 1M context (fit entire deployment configs). BrowseComp 83.5% (infra research).",
+      "fallback": "glm-5.1 (sustained multi-round ops tasks)"
+    },
+    "system-analyst": {
+      "best_model": "minimax-m3",
+      "rationale": "BrowseComp 83.5% (best among all). 1M context MSA (process full codebases). 12h autonomous paper reproduction (complex system analysis).",
+      "fallback": "glm-5.1 (BrowseComp w/ context 79.3%, sustained analysis)"
+    },
+    "planner": {
+      "best_model": "minimax-m3",
+      "rationale": "PostTrainBench #3 demonstrates autonomous planning + execution. 12h autonomous tasks. 300-agent swarm coordination. Best for complex task decomposition.",
+      "fallback": "glm-5.1 (sustained multi-round planning without plateauing)"
+    },
+    "orchestrator": {
+      "best_model": "glm-5.1",
+      "rationale": "UNIQUE CLAIM: sustained performance over hundreds of rounds, thousands of tool calls. Does not plateau. Designed for agentic engineering. Vending Bench $5,634 (economic task competence).",
+      "fallback": "minimax-m3 (agent swarm coordination, 12h autonomous runs)"
+    },
+    "agent-architect": {
+      "best_model": "minimax-m3",
+      "rationale": "12h autonomous paper reproduction. PostTrainBench autonomous pipeline. BrowseComp 83.5%. Best for architecting new agents/systems autonomously.",
+      "fallback": "glm-5.1 (sustained design sessions)"
+    },
+    "workflow-architect": {
+      "best_model": "glm-5.1",
+      "rationale": "SWE-bench Pro 58.4 (repo-level code generation). NL2Repo 42.7 (full repo from natural language). Designed for agentic engineering workflows.",
+      "fallback": "minimax-m3 (autonomous pipeline design)"
+    },
+    "visual-tester": {
+      "best_model": "kimi-k2.6",
+      "rationale": "ONLY model with native multimodal (vision + text). Screenshot comparison requires vision. Coding-driven design (visual→code).",
+      "fallback": "minimax-m3 (native multimodal from pretraining)"
+    },
+    "browser-automation": {
+      "best_model": "minimax-m3",
+      "rationale": "BrowseComp 83.5% (SOTA). 12h autonomous tasks. Zero human intervention on extended runs.",
+      "fallback": "kimi-k2.6 (native multimodal for browser screenshots, BrowseComp 60.6 proxy)"
+    },
+    "capability-analyst": {
+      "best_model": "minimax-m3",
+      "rationale": "BrowseComp 83.5% (best research capability). PostTrainBench #3 (systematic evaluation). 12h autonomous analysis tasks. 1M context (process entire codebase).",
+      "fallback": "deepseek-v4-pro (GPQA 90.1%, HLE 37.7% for deep analytical reasoning)"
+    },
+    "evaluator": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest GPQA Diamond 90.1% and HLE 37.7%. Best for evaluation rubrics and judgment calls. Apex 38.3%, Apex Shortlist 90.2%.",
+      "fallback": "glm-5.1 (sustained evaluation over long sessions)"
+    },
+    "prompt-optimizer": {
+      "best_model": "minimax-m3",
+      "rationale": "PostTrainBench (autonomous model training pipeline). Can analyze failures and generate improvements. BrowseComp 83.5% for research.",
+      "fallback": "deepseek-v4-pro (analytical reasoning for prompt analysis)"
+    },
+    "reflector": {
+      "best_model": "glm-5.1",
+      "rationale": "Sustained multi-round reasoning. Does not plateau on iterative tasks. Designed for 'revisiting reasoning and revising strategy through repeated iteration.'",
+      "fallback": "minimax-m3 (autonomous iteration capability)"
+    },
+    "memory-manager": {
+      "best_model": "minimax-m3",
+      "rationale": "MSA architecture: native ultra-long context pretraining. 1M context window. Best understanding of context management architectures.",
+      "fallback": "deepseek-v4-pro (1M context, hybrid attention efficiency)"
+    },
+    "history-miner": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Best code search and analysis: SWE-bench 80.6%, LiveCodeBench 93.5%. GPQA 90.1% for analyzing git history patterns.",
+      "fallback": "glm-5.1 (sustained deep search)"
+    },
+    "requirement-refiner": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest reasoning scores (GPQA 90.1%, HMMT 95.2%, HLE 37.7%). Best for precise requirement refinement and validation.",
+      "fallback": "glm-5.1 (sustained refinement without plateauing)"
+    },
+    "release-manager": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Best coding + reasoning for git operations and semantic versioning decisions.",
+      "fallback": "minimax-m2.5 (efficient task completion, cost-effective)"
+    },
+    "product-owner": {
+      "best_model": "minimax-m2.5",
+      "rationale": "59% win rate on office productivity tasks. Excel financial modeling. Professional deliverable output. Cost-effective for frequent management tasks.",
+      "fallback": "glm-5.1 (sustained multi-round management)"
+    },
+    "markdown-validator": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Simple rule-based task. Any model sufficient. deepseek-v4-pro for accuracy.",
+      "fallback": "minimax-m2.5 (fast and cost-effective)"
+    },
+    "pipeline-judge": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest Apex Shortlist 90.2%. GPQA 90.1%. Best for objective evaluation criteria.",
+      "fallback": "glm-5.1 (sustained evaluation)"
+    },
+    "workflow-cross-checker": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest analytical reasoning. Best for systematic cross-checking.",
+      "fallback": "glm-5.1"
+    },
+    "evolution-skeptic": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "Highest reasoning scores for adversarial analysis. GPQA 90.1%, HLE 37.7%.",
+      "fallback": "glm-5.1 (sustained analysis)"
+    },
+    "evolution-prompt": {
+      "best_model": "minimax-m3",
+      "rationale": "PostTrainBench demonstrates autonomous model improvement pipeline. BrowseComp 83.5% for stress-test research.",
+      "fallback": "deepseek-v4-pro (analytical prompt generation)"
+    },
+    "architect-indexer": {
+      "best_model": "deepseek-v4-pro",
+      "rationale": "1M context (process full codebase). LiveCodeBench 93.5%. Best at understanding code structure.",
+      "fallback": "minimax-m3 (1M context MSA, native codebase understanding)"
+    },
+    "incident-responder": {
+      "best_model": "glm-5.1",
+      "rationale": "CyberGym 68.7% (cybersecurity, incident response). Sustained multi-round response. Terminal Bench 63.5% (system administration).",
+      "fallback": "deepseek-v4-pro (Terminal Bench 67.9%)"
+    }
+  },
+  "model_rankings": {
+    "best_coding": {
+      "rank": 1,
+      "model": "deepseek-v4-pro",
+      "composite_evidence": "SWE-bench Verified 80.6%, LiveCodeBench 93.5%, SWE-bench Pro 55.4%, Codeforces 3206, SWE-bench Multilingual 76.2%"
+    },
+    "best_agentic": {
+      "rank": 1,
+      "model": "minimax-m3",
+      "composite_evidence": "BrowseComp 83.5%, 12h autonomous tasks, PostTrainBench #3, 147 CUDA iterations auto. NOTE: many coding scores image-only, ranking may shift if extracted."
+    },
+    "best_reasoning": {
+      "rank": 1,
+      "model": "deepseek-v4-pro",
+      "composite_evidence": "HMMT 95.2%, GPQA 90.1%, IMOAnswerBench 89.8%, Apex Shortlist 90.2%, HLE 37.7%"
+    },
+    "best_multimodal": {
+      "rank": 1,
+      "model": "kimi-k2.6",
+      "composite_evidence": "Native multimodal (vision+text). Coding-driven design for visual→UI. Swarm orchestration with vision agents."
+    },
+    "best_long_context": {
+      "rank": 1,
+      "model": "minimax-m3",
+      "composite_evidence": "MSA architecture: native 1M context via pretraining (not extrapolation). 512K guaranteed minimum."
+    },
+    "best_efficiency": {
+      "rank": 1,
+      "model": "minimax-m2.5",
+      "composite_evidence": "100 TPS, $1/hr continuous, 22.8 min SWE-bench (37% faster), 3.52M tokens/task"
+    },
+    "best_cybersecurity": {
+      "rank": 1,
+      "model": "glm-5.1",
+      "composite_evidence": "CyberGym 68.7% (SOTA among all models in comparison table). Only model with dedicated security eval."
+    }
+  },
+  "data_gaps_critical": [
+    {
+      "model": "minimax-m3",
+      "gap": "ALL benchmark tables are images. No text-extractable coding scores (SWE-bench, LiveCodeBench, Terminal Bench).",
+      "impact": "Cannot compare M3's coding ability quantitatively against deepseek-v4-pro. Relying on prose claims only.",
+      "recommendation": "If M3 benchmark images can be OCR'd or vendor provides text table, re-evaluate coding ranking."
+    },
+    {
+      "model": "qwen3-coder-480b",
+      "gap": "ALL benchmark tables are images on both Ollama and blog. No specific text-extractable scores.",
+      "impact": "Cannot validate 'SOTA open-source' claims. Lowest confidence assignment.",
+      "recommendation": "Qwen blog provides detailed methodology but scores in images. HuggingFace leaderboard has V4-Flash (different model) scores only."
+    },
+    {
+      "model": "kimi-k2.6",
+      "gap": "Ollama benchmark table is image-only. Using K2 Instruct as proxy understates K2.6 performance.",
+      "impact": "K2.6 is described as significant improvement over K2. K2 Instruct proxy scores may be 10-20% lower than actual K2.6.",
+      "recommendation": "K2.6 Ollama README mentions features (swarm, coding-driven design) but image-only table. Seek vendor blog or tech report for text scores."
+    },
+    {
+      "model": "minimax-m2.5",
+      "gap": "Ollama benchmark tables are images. Blog has partial text scores but reasoning benchmarks missing.",
+      "impact": "Cannot compare M2.5 reasoning ability against deepseek-v4-pro or glm-5.1.",
+      "recommendation": "Blog appendix has benchmark table as image. Seek text version."
+    }
+  ],
+  "methodology_notes": {
+    "kimi_k2_instruct_vs_k26": "Kimi K2 Instruct (GitHub) is the predecessor. K2.6 Ollama description claims improvements in long-horizon coding, coding-driven design, and swarm orchestration. K2.6 actual scores are likely HIGHER than K2 Instruct proxy scores.",
+    "thinking_mode_comparison": "deepseek-v4-pro has 3 modes (non-think/high/max). Max scores reported. Non-think coding: SWE-bench 73.6%, LiveCodeBench 56.8%. Important: in agent pipeline, models may use non-think or high mode for cost efficiency.",
+    "harness_variability": "Scores vary by evaluation harness. For example, minimax-m2.5: Droid harness 79.7 vs OpenCode 76.1 on same SWE-bench Verified. Cross-model comparison only valid when same harness used.",
+    "compass_rank_limitation": "rank.opencompass.org.cn is JavaScript-rendered SPA. Requires browser automation to extract. Not text-extractable via HTTP GET.",
+    "image_only_warning": "4 out of 6 models have image-only benchmark tables on their primary Ollama pages. Only deepseek-v4-pro and glm-5.1 have FULL text-extractable benchmark data."
+  }
+}
--- a/docker/Dockerfile.opencompass
+++ b/docker/Dockerfile.opencompass
@@ -0,0 +1,5 @@
+FROM python:3.10
+
+RUN pip install --no-cache-dir -U opencompass
+
+WORKDIR /data
--- a/docker/docker-compose.ollama.yml
+++ b/docker/docker-compose.ollama.yml
@@ -34,3 +34,4 @@ volumes:
 networks:
  ollama-net:
    driver: bridge
+    name: docker_ollama-net
--- a/docker/docker-compose.opencompass.yml
+++ b/docker/docker-compose.opencompass.yml
@@ -0,0 +1,28 @@
+version: "3.8"
+
+services:
+  opencompass:
+    build:
+      context: ..
+      dockerfile: docker/Dockerfile.opencompass
+    container_name: opencompass
+    environment:
+      - OLLAMA_API_URL=http://ollama:11434
+    volumes:
+      - opencompass-data:/data
+      - ../scripts/opencompass-setup.sh:/setup.sh:ro
+      - ../scripts/opencompass-eval.sh:/eval.sh:ro
+    networks:
+      - ollama-net
+    entrypoint: ["/bin/bash", "/eval.sh"]
+    profiles:
+      - eval
+
+volumes:
+  opencompass-data:
+    driver: local
+
+networks:
+  ollama-net:
+    external: true
+    name: docker_ollama-net
--- a/kilo-meta.json
+++ b/kilo-meta.json
@@ -1,12 +1,12 @@
 {
  "$schema": "https://app.kilo.ai/config.json",
  "metaVersion": "1.0.0",
-  "lastSync": "2026-06-01T10:47:18.047Z",
+  "lastSync": "2026-06-01T19:50:01.425Z",
  "agents": {
    "requirement-refiner": {
      "file": ".kilo/agents/requirement-refiner.md",
      "description": "Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists",
-      "model": "ollama-cloud/qwen3-coder:480b",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "all",
      "color": "#4F46E5",
      "category": "core"
@@ -14,21 +14,21 @@
    "history-miner": {
      "file": ".kilo/agents/history-miner.md",
      "description": "Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/qwen3-coder:480b",
      "mode": "subagent",
      "category": "core"
    },
    "system-analyst": {
      "file": ".kilo/agents/system-analyst.md",
      "description": "Designs technical specifications, data schemas, and API contracts before implementation",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "category": "core"
    },
    "sdet-engineer": {
      "file": ".kilo/agents/sdet-engineer.md",
      "description": "Writes tests following TDD methodology. Tests MUST fail initially (Red phase)",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "all",
      "color": "#8B5CF6",
      "category": "core"
@@ -36,7 +36,7 @@
    "lead-developer": {
      "file": ".kilo/agents/lead-developer.md",
      "description": "Primary code writer for backend and core logic. Writes implementation to pass tests",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "color": "#DC2626",
      "category": "core"
@@ -44,7 +44,7 @@
    "frontend-developer": {
      "file": ".kilo/agents/frontend-developer.md",
      "description": "Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups",
-      "model": "ollama-cloud/qwen3-coder:480b",
+      "model": "ollama-cloud/minimax-m2.5:cloud",
      "mode": "all",
      "color": "#0EA5E9",
      "category": "core"
@@ -60,7 +60,7 @@
    "go-developer": {
      "file": ".kilo/agents/go-developer.md",
      "description": "Go backend specialist for Gin, Echo, APIs, and database integration",
-      "model": "ollama-cloud/qwen3-coder:480b",
+      "model": "ollama-cloud/kimi-k2.6",
      "mode": "subagent",
      "color": "#00ADD8",
      "category": "core"
@@ -68,7 +68,7 @@
    "devops-engineer": {
      "file": ".kilo/agents/devops-engineer.md",
      "description": "DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "color": "#FF6B35",
      "category": "core"
@@ -76,7 +76,7 @@
    "code-skeptic": {
      "file": ".kilo/agents/code-skeptic.md",
      "description": "Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "color": "#E11D48",
      "category": "quality"
@@ -84,7 +84,7 @@
    "the-fixer": {
      "file": ".kilo/agents/the-fixer.md",
      "description": "Iteratively fixes bugs based on specific error reports and test failures",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "all",
      "color": "#F59E0B",
      "category": "quality"
@@ -92,7 +92,7 @@
    "performance-engineer": {
      "file": ".kilo/agents/performance-engineer.md",
      "description": "Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "all",
      "color": "#0D9488",
      "category": "quality"
@@ -100,7 +100,7 @@
    "security-auditor": {
      "file": ".kilo/agents/security-auditor.md",
      "description": "Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/glm-5.1",
      "mode": "subagent",
      "color": "#DC2626",
      "category": "quality"
@@ -115,7 +115,7 @@
    "orchestrator": {
      "file": ".kilo/agents/orchestrator.md",
      "description": "Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/glm-5.1",
      "mode": "all",
      "color": "#7C3AED",
      "category": "meta"
@@ -123,14 +123,14 @@
    "release-manager": {
      "file": ".kilo/agents/release-manager.md",
      "description": "Manages git operations, semantic versioning, branching, and deployments. Ensures clean history",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "category": "meta"
    },
    "evaluator": {
      "file": ".kilo/agents/evaluator.md",
      "description": "Scores agent effectiveness after task completion for continuous improvement",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "color": "#047857",
      "category": "meta"
@@ -138,7 +138,7 @@
    "prompt-optimizer": {
      "file": ".kilo/agents/prompt-optimizer.md",
      "description": "Improves agent system prompts based on performance failures. Meta-learner for prompt optimization",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "category": "meta"
    },
@@ -152,42 +152,42 @@
    "agent-architect": {
      "file": ".kilo/agents/agent-architect.md",
      "description": "Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "category": "meta"
    },
    "capability-analyst": {
      "file": ".kilo/agents/capability-analyst.md",
      "description": "Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components.",
-      "model": "ollama-cloud/deepseek-v4-pro",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "category": "meta"
    },
    "workflow-architect": {
      "file": ".kilo/agents/workflow-architect.md",
      "description": "Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/glm-5.1",
      "mode": "subagent",
      "category": "meta"
    },
    "markdown-validator": {
      "file": ".kilo/agents/markdown-validator.md",
      "description": "Validates and corrects Markdown descriptions for Gitea issues",
-      "model": "ollama-cloud/qwen3-coder:480b",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "category": "meta"
    },
    "browser-automation": {
      "file": ".kilo/agents/browser-automation.md",
      "description": "Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "category": "testing"
    },
    "planner": {
      "file": ".kilo/agents/planner.md",
      "description": "Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect",
-      "model": "ollama-cloud/deepseek-v4-pro",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "color": "#F59E0B",
      "category": "cognitive"
@@ -195,7 +195,7 @@
    "reflector": {
      "file": ".kilo/agents/reflector.md",
      "description": "Self-reflection agent using Reflexion pattern - learns from mistakes",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/glm-5.1",
      "mode": "subagent",
      "color": "#10B981",
      "category": "cognitive"
@@ -203,7 +203,7 @@
    "memory-manager": {
      "file": ".kilo/agents/memory-manager.md",
      "description": "Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences)",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "color": "#8B5CF6",
      "category": "cognitive"
@@ -219,7 +219,7 @@
    "flutter-developer": {
      "file": ".kilo/agents/flutter-developer.md",
      "description": "Flutter mobile specialist for cross-platform apps, state management, and UI components",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m2.5:cloud",
      "mode": "subagent",
      "color": "#02569B",
      "category": "core"
@@ -251,7 +251,7 @@
    "incident-responder": {
      "file": ".kilo/agents/incident-responder.md",
      "description": "Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "color": "#B91C1C",
      "category": "core"
@@ -267,7 +267,7 @@
    "evolution-skeptic": {
      "file": ".kilo/agents/evolution-skeptic.md",
      "description": "Evaluates model responses against role-specific rubrics with detailed scoring and commentary",
-      "model": "ollama-cloud/qwen3-coder:480b",
+      "model": "ollama-cloud/deepseek-v4-pro",
      "mode": "subagent",
      "color": "#C026D3",
      "category": "meta"
@@ -275,7 +275,7 @@
    "evolution-prompt": {
      "file": ".kilo/agents/evolution-prompt.md",
      "description": "Generates role-specific stress-test prompts by analyzing agent definitions",
-      "model": "ollama-cloud/kimi-k2.6",
+      "model": "ollama-cloud/minimax-m3:cloud",
      "mode": "subagent",
      "color": "#FF6B00",
      "category": "meta"
--- a/kilo.jsonc
+++ b/kilo.jsonc
@@ -23,7 +23,7 @@
      "requirement-refiner": {
        "description": "Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists",
        "mode": "all",
-        "model": "ollama-cloud/qwen3-coder:480b",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#4F46E5",
        "permission": {
          "read": "allow",
@@ -43,7 +43,7 @@
      "history-miner": {
        "description": "Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/qwen3-coder:480b",
        "permission": {
          "task": {
            "*": "deny",
@@ -54,7 +54,7 @@
      "system-analyst": {
        "description": "Designs technical specifications, data schemas, and API contracts before implementation",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "permission": {
          "task": {
            "*": "deny",
@@ -65,7 +65,7 @@
      "sdet-engineer": {
        "description": "Writes tests following TDD methodology. Tests MUST fail initially (Red phase)",
        "mode": "all",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#8B5CF6",
        "permission": {
          "read": "allow",
@@ -84,7 +84,7 @@
      "lead-developer": {
        "description": "Primary code writer for backend and core logic. Writes implementation to pass tests",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#DC2626",
        "permission": {
          "read": "allow",
@@ -103,7 +103,7 @@
      "frontend-developer": {
        "description": "Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups",
        "mode": "all",
-        "model": "ollama-cloud/qwen3-coder:480b",
+        "model": "ollama-cloud/minimax-m2.5:cloud",
        "color": "#0EA5E9",
        "permission": {
          "read": "allow",
@@ -141,7 +141,7 @@
      "go-developer": {
        "description": "Go backend specialist for Gin, Echo, APIs, and database integration",
        "mode": "subagent",
-        "model": "ollama-cloud/qwen3-coder:480b",
+        "model": "ollama-cloud/kimi-k2.6",
        "color": "#00ADD8",
        "permission": {
          "read": "allow",
@@ -160,7 +160,7 @@
      "devops-engineer": {
        "description": "DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "color": "#FF6B35",
        "permission": {
          "read": "allow",
@@ -180,7 +180,7 @@
      "code-skeptic": {
        "description": "Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#E11D48",
        "permission": {
          "read": "allow",
@@ -198,7 +198,7 @@
      "the-fixer": {
        "description": "Iteratively fixes bugs based on specific error reports and test failures",
        "mode": "all",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#F59E0B",
        "permission": {
          "read": "allow",
@@ -218,7 +218,7 @@
      "performance-engineer": {
        "description": "Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity",
        "mode": "all",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "color": "#0D9488",
        "permission": {
          "read": "allow",
@@ -236,7 +236,7 @@
      "security-auditor": {
        "description": "Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/glm-5.1",
        "color": "#DC2626",
        "permission": {
          "read": "allow",
@@ -269,7 +269,7 @@
      "orchestrator": {
        "description": "Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine",
        "mode": "all",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/glm-5.1",
        "color": "#7C3AED",
        "permission": {
          "read": "allow",
@@ -307,7 +307,7 @@
      "release-manager": {
        "description": "Manages git operations, semantic versioning, branching, and deployments. Ensures clean history",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -325,7 +325,7 @@
      "evaluator": {
        "description": "Scores agent effectiveness after task completion for continuous improvement",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#047857",
        "permission": {
          "read": "allow",
@@ -342,7 +342,7 @@
      "prompt-optimizer": {
        "description": "Improves agent system prompts based on performance failures. Meta-learner for prompt optimization",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -376,7 +376,7 @@
      "agent-architect": {
        "description": "Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -392,7 +392,7 @@
      "capability-analyst": {
        "description": "Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components.",
        "mode": "subagent",
-        "model": "ollama-cloud/deepseek-v4-pro",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "permission": {
          "read": "allow",
          "glob": "allow",
@@ -406,7 +406,7 @@
      "workflow-architect": {
        "description": "Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/glm-5.1",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -422,7 +422,7 @@
      "markdown-validator": {
        "description": "Validates and corrects Markdown descriptions for Gitea issues",
        "mode": "subagent",
-        "model": "ollama-cloud/qwen3-coder:480b",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -438,7 +438,7 @@
      "browser-automation": {
        "description": "Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "permission": {
          "read": "allow",
          "edit": "allow",
@@ -455,7 +455,7 @@
      "planner": {
        "description": "Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect",
        "mode": "subagent",
-        "model": "ollama-cloud/deepseek-v4-pro",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "color": "#F59E0B",
        "permission": {
          "read": "allow",
@@ -471,7 +471,7 @@
      "reflector": {
        "description": "Self-reflection agent using Reflexion pattern - learns from mistakes",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/glm-5.1",
        "color": "#10B981",
        "permission": {
          "read": "allow",
@@ -486,7 +486,7 @@
      "memory-manager": {
        "description": "Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences)",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "color": "#8B5CF6",
        "permission": {
          "read": "allow",
@@ -508,7 +508,7 @@
      "flutter-developer": {
        "description": "Flutter mobile specialist for cross-platform apps, state management, and UI components",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m2.5:cloud",
        "color": "#02569B"
      },
      "php-developer": {
@@ -532,7 +532,7 @@
      "incident-responder": {
        "description": "Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#B91C1C",
        "permission": {
          "read": "allow",
@@ -571,13 +571,13 @@
      "evolution-skeptic": {
        "description": "Evaluates model responses against role-specific rubrics with detailed scoring and commentary",
        "mode": "subagent",
-        "model": "ollama-cloud/qwen3-coder:480b",
+        "model": "ollama-cloud/deepseek-v4-pro",
        "color": "#C026D3"
      },
      "evolution-prompt": {
        "description": "Generates role-specific stress-test prompts by analyzing agent definitions",
        "mode": "subagent",
-        "model": "ollama-cloud/kimi-k2.6",
+        "model": "ollama-cloud/minimax-m3:cloud",
        "color": "#FF6B00"
      }
  }
--- a/scripts/init-evolve-db.py
+++ b/scripts/init-evolve-db.py
@@ -0,0 +1,135 @@
+#!/usr/bin/env python3
+"""
+Create evolution evaluation SQLite DB with real benchmark data.
+Honest approach: only score assignments with verifiable data.
+Pending assignments marked as 'needs_evolution_api'.
+"""
+
+import sqlite3, json, os
+
+db_path = '.kilo/logs/evolve-agent.db'
+os.makedirs(os.path.dirname(db_path), exist_ok=True)
+
+conn = sqlite3.connect(db_path)
+c = conn.cursor()
+
+c.execute('''
+CREATE TABLE IF NOT EXISTS fit_scores (
+    id INTEGER PRIMARY KEY,
+    agent_name TEXT,
+    model TEXT,
+    fit_score REAL,
+    confidence TEXT,
+    data_source TEXT,
+    benchmark_ref TEXT,
+    status TEXT,
+    updated_at TEXT
+)
+''')
+
+c.execute('''
+CREATE TABLE IF NOT EXISTS benchmark_data (
+    id INTEGER PRIMARY KEY,
+    model TEXT,
+    benchmark_name TEXT,
+    score REAL,
+    source_url TEXT,
+    extracted_at TEXT
+)
+''')
+
+c.execute('''
+CREATE TABLE IF NOT EXISTS pending_evaluations (
+    id INTEGER PRIMARY KEY,
+    agent_name TEXT,
+    current_model TEXT,
+    candidate_models TEXT,
+    reason TEXT,
+    blocked_by TEXT,
+    priority INTEGER
+)
+''')
+
+# Insert REAL benchmark data from capability-analyst research
+benchmarks = [
+    ("deepseek-v4-pro", "SWE-bench Verified", 80.6, "ollama.com/library/deepseek-v4-pro"),
+    ("deepseek-v4-pro", "LiveCodeBench v6", 93.5, "ollama.com/library/deepseek-v4-pro"),
+    ("deepseek-v4-pro", "Terminal-Bench 2.0", 67.9, "ollama.com/library/deepseek-v4-pro"),
+    ("deepseek-v4-pro", "BrowseComp", 83.4, "ollama.com/library/deepseek-v4-pro"),
+    ("deepseek-v4-pro", "GPQA-Diamond", 90.1, "ollama.com/library/deepseek-v4-pro"),
+    ("deepseek-v4-pro", "MRCR 1M", 83.5, "ollama.com/library/deepseek-v4-pro"),
+    ("glm-5.1", "SWE-bench Pro", 58.4, "ollama.com/library/glm-5.1"),
+    ("glm-5.1", "BrowseComp", 68.0, "ollama.com/library/glm-5.1"),
+    ("glm-5.1", "CyberGym", 68.7, "ollama.com/library/glm-5.1"),
+    ("minimax-m3", "BrowseComp", 83.5, "ollama.com/library/minimax-m3"),
+    ("minimax-m2.5", "Ollama pulls", 2.2, "ollama.com/search?q=minimax"),
+    ("qwen3-coder-480b", "Terminal-Bench 2", 23.9, "huggingface.co"),
+    ("qwen3-coder-480b", "SWE-bench Pro", 38.7, "huggingface.co"),
+]
+
+c.executemany('''
+INSERT INTO benchmark_data (model, benchmark_name, score, source_url, extracted_at)
+VALUES (?, ?, ?, ?, datetime('now'))
+''', benchmarks)
+
+# Insert APPLIED assignments with confidence
+applied = [
+    ("lead-developer", "deepseek-v4-pro", 94.0, "high", "SWE-bench Verified 80.6%, LiveCodeBench 93.5%", "applied"),
+    ("backend-developer", "deepseek-v4-pro", 93.0, "high", "Same coding benchmarks as lead-developer", "already_set"),
+    ("php-developer", "deepseek-v4-pro", 88.0, "medium", "No PHP-specific benchmarks; extrapolated from coding scores", "already_set"),
+    ("python-developer", "deepseek-v4-pro", 88.0, "medium", "No Python-specific benchmarks; extrapolated from coding scores", "already_set"),
+    ("code-skeptic", "deepseek-v4-pro", 91.0, "high", "GPQA-Diamond 90.1% reasoning + LiveCodeBench 93.5% code analysis", "applied"),
+    ("the-fixer", "deepseek-v4-pro", 90.0, "high", "Terminal-Bench 67.9% (terminal/code interaction) + SWE-bench 80.6%", "applied"),
+    ("performance-engineer", "deepseek-v4-pro", 88.0, "medium", "Algorithmic reasoning from HMMT 95.2% + GPQA 90.1%", "applied"),
+    ("frontend-developer", "minimax-m2.5:cloud", 92.0, "high", "User-confirmed best frontend model + 2.2M pulls + productivity focus", "applied"),
+    ("browser-automation", "minimax-m2.5:cloud", 80.0, "medium", "Real-world task execution + productivity alignment", "applied"),
+    ("flutter-developer", "minimax-m2.5:cloud", 78.0, "medium", "UI/productivity alignment; no Flutter-specific benchmarks", "applied"),
+]
+
+c.executemany('''
+INSERT INTO fit_scores (agent_name, model, fit_score, confidence, benchmark_ref, status, updated_at)
+VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
+''', applied)
+
+# Insert PENDING assignments — need real API evaluation
+pending = [
+    ("orchestrator", "ollama-cloud/kimi-k2.6", "minimax-m3:cloud,glm-5.1,deepseek-v4-pro", "Agentic routing + 1M context needed", "No agentic routing benchmark data", 1),
+    ("planner", "ollama-cloud/deepseek-v4-pro", "minimax-m3:cloud,glm-5.1,deepseek-v4-pro", "CoT/ToT planning benchmark gap", "No planning-specific benchmarks published", 1),
+    ("system-analyst", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud,glm-5.1", "Architecture design + 1M context", "No architecture-specific benchmarks", 2),
+    ("capability-analyst", "ollama-cloud/deepseek-v4-pro", "minimax-m3:cloud,deepseek-v4-pro,glm-5.1", "Gap analysis needs multi-model comparison", "No capability-analysis benchmarks", 2),
+    ("security-auditor", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Security scan + CVE detection", "No security-specific benchmarks published", 3),
+    ("visual-tester", "ollama-cloud/kimi-k2.6", "kimi-k2.6,minimax-m3:cloud", "Multimodal screenshot analysis", "kimi-k2.6 has native vision but no scores; minimax-m3 has multimodal", 3),
+    ("evaluator", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Scoring reasoning", "No evaluator-specific benchmarks", 4),
+    ("prompt-optimizer", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Meta-learning", "No prompt-optimization benchmarks", 4),
+    ("devops-engineer", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud", "Docker/K8s config generation", "No DevOps-specific benchmarks", 5),
+    ("incident-responder", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1", "Security forensics", "No incident-response benchmarks", 5),
+    ("sdet-engineer", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b,deepseek-v4-pro", "Test generation quality", "Terminal-Bench 23.9% for qwen3-coder vs 67.9% deepseek", 5),
+    ("reflector", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Self-reflection quality", "No self-reflection benchmarks", 6),
+    ("memory-manager", "ollama-cloud/kimi-k2.6", "minimax-m3:cloud,deepseek-v4-pro", "1M context for memory", "MRCR 83.5% deepseek vs minimax-m3 512K-1M", 6),
+    ("agent-architect", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Agent design", "GLM-5.1 claims long-horizon persistence", 7),
+    ("workflow-architect", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Workflow design", "No workflow-specific benchmarks", 7),
+    ("evolution-prompt", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud", "Stress-test generation", "No benchmark data", 8),
+    ("history-miner", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Git history search", "Simple task; no benchmark needed", 8),
+    ("product-owner", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Issue management", "Simple task; no benchmark needed", 9),
+    ("release-manager", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Git operations", "Simple task; no benchmark needed", 9),
+    ("requirement-refiner", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "User story formatting", "Simple task; already optimal", 10),
+    ("markdown-validator", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "Markdown validation", "Simple task; already optimal", 10),
+    ("pipeline-judge", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "Fitness scoring", "Simple deterministic; already optimal", 10),
+    ("go-developer", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro", "Go coding", "No Go-specific benchmarks", 10),
+    ("architect-indexer", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro,minimax-m3:cloud", "Codebase indexing", "No indexing benchmarks", 10),
+    ("workflow-cross-checker", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro,glm-5.1", "Process inspection", "No process-specific benchmarks", 10),
+    ("evolution-skeptic", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro", "Rubric scoring", "No scoring-specific benchmarks", 10),
+]
+
+c.executemany('''
+INSERT INTO pending_evaluations (agent_name, current_model, candidate_models, reason, blocked_by, priority)
+VALUES (?, ?, ?, ?, ?, ?)
+''', pending)
+
+conn.commit()
+conn.close()
+
+print(f"✅ SQLite DB created: {db_path}")
+print(f"   Benchmark entries: {len(benchmarks)}")
+print(f"   Applied assignments: {len(applied)}")
+print(f"   Pending evaluations: {len(pending)}")
--- a/scripts/opencompass-eval.sh
+++ b/scripts/opencompass-eval.sh
@@ -0,0 +1,79 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# OpenCompass evaluation wrapper for Ollama models
+# Usage: /eval.sh --model MODEL_ID --datasets DATASET_LIST --output OUTPUT_FILE
+
+MODEL=""
+DATASETS=""
+OUTPUT=""
+
+usage() {
+    cat <<EOF
+Usage: $0 --model MODEL_ID --datasets DATASET_LIST --output OUTPUT_FILE
+
+Example:
+  $0 --model ollama-cloud/deepseek-v4-pro --datasets mmlu hellaswag gsm8k --output /data/results.json
+EOF
+    exit 1
+}
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --model)
+            MODEL="${2:-}"
+            shift 2
+            ;;
+        --datasets)
+            shift
+            DATASETS="$*"
+            break
+            ;;
+        --output)
+            OUTPUT="${2:-}"
+            shift 2
+            ;;
+        --help|-h)
+            usage
+            ;;
+        *)
+            echo "Unknown option: $1" >&2
+            usage
+            ;;
+    esac
+done
+
+if [[ -z "$MODEL" || -z "$OUTPUT" ]]; then
+    echo "Error: --model and --output are required." >&2
+    usage
+fi
+
+OLLAMA_API_URL="${OLLAMA_API_URL:-http://ollama:11434}"
+
+# Verify Ollama connectivity
+echo "Checking Ollama API at ${OLLAMA_API_URL} ..."
+if ! wget -q --spider "${OLLAMA_API_URL}/api/tags"; then
+    echo "Error: Ollama not reachable at ${OLLAMA_API_URL}" >&2
+    exit 1
+fi
+
+echo "Model: ${MODEL}"
+echo "Datasets: ${DATASETS}"
+echo "Output: ${OUTPUT}"
+
+# Setup datasets if needed
+if [[ -x /setup.sh ]]; then
+    /setup.sh
+fi
+
+# Run OpenCompass with Ollama backend via OpenAI-compatible API
+opencompass \
+    --models ollama_api \
+    --datasets ${DATASETS} \
+    --work-dir /data \
+    --max-num-workers 1 \
+    --cfg-options \
+        model=dict(path="${MODEL}",openai_api_base="${OLLAMA_API_URL}/v1") \
+    | tee "${OUTPUT}"
+
+echo "Evaluation complete. Results written to ${OUTPUT}"
--- a/scripts/opencompass-setup.sh
+++ b/scripts/opencompass-setup.sh
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# OpenCompass dataset setup script
+# Downloads required datasets on first run
+
+DATA_DIR="/data"
+ZIP_URL="https://github.com/InternLM/opencompass/releases/download/0.2.2/OpenCompassData-core-20240207.zip"
+ZIP_FILE="${DATA_DIR}/OpenCompassData-core-20240207.zip"
+MARKER="${DATA_DIR}/.datasets_ready"
+
+if [[ -f "$MARKER" ]]; then
+    echo "Datasets already present (${MARKER} exists). Skipping download."
+    exit 0
+fi
+
+echo "Downloading OpenCompass core datasets ..."
+mkdir -p "$DATA_DIR"
+
+if command -v wget >/dev/null 2>&1; then
+    wget -q --show-progress -O "$ZIP_FILE" "$ZIP_URL" || {
+        echo "Error: Failed to download datasets from ${ZIP_URL}" >&2
+        exit 1
+    }
+else
+    echo "Error: wget not found. Cannot download datasets." >&2
+    exit 1
+fi
+
+echo "Extracting datasets ..."
+unzip -q "$ZIP_FILE" -d "$DATA_DIR" || {
+    echo "Error: Failed to extract ${ZIP_FILE}" >&2
+    exit 1
+}
+
+touch "$MARKER"
+echo "Datasets ready in ${DATA_DIR}."