feat: milestone 78 — objective model evolution from benchmark research
- Reassign 29/30 agents based on capability-analyst web research - deepseek-v4-pro: 14 agents (coding SOTA: SWE-bench 80.6%, LiveCodeBench 93.5%) - minimax-m3☁️ 8 agents (agentic: BrowseComp 83.5%, 12h autonomous) - glm-5.1: 4 agents (CyberGym 68.7% SOTA, sustained rounds) - minimax-m2.5☁️ 2 agents (frontend productivity, 2.2M pulls) - kimi-k2.6: 1 agent (ONLY true multimodal) - Add OpenCompass evaluation container (docker, scripts) for future objective runs - Evidence saved to agent-evolution/data/research-report.json (598 lines, 6 models) Data gaps honestly documented: minimax-m3/m2.5, qwen3-coder, kimi-k2.6 benchmark tables are image-only on Ollama.
This commit is contained in:
@@ -433,42 +433,42 @@ Provider availability depends on configuration. Common providers include:
|
||||
|
||||
| Agent | Role | Model |
|
||||
|-------|------|-------|
|
||||
| `@RequirementRefiner` | Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@HistoryMiner` | Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work. | ollama-cloud/kimi-k2.6 |
|
||||
| `@SystemAnalyst` | Designs technical specifications, data schemas, and API contracts before implementation. | ollama-cloud/kimi-k2.6 |
|
||||
| `@SdetEngineer` | Writes tests following TDD methodology. | ollama-cloud/kimi-k2.6 |
|
||||
| `@LeadDeveloper` | Primary code writer for backend and core logic. | ollama-cloud/kimi-k2.6 |
|
||||
| `@FrontendDeveloper` | Handles UI implementation with multimodal capabilities. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@RequirementRefiner` | Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@HistoryMiner` | Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@SystemAnalyst` | Designs technical specifications, data schemas, and API contracts before implementation. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@SdetEngineer` | Writes tests following TDD methodology. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@LeadDeveloper` | Primary code writer for backend and core logic. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@FrontendDeveloper` | Handles UI implementation with multimodal capabilities. | ollama-cloud/minimax-m2.5:cloud |
|
||||
| `@BackendDeveloper` | Backend specialist for Node. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@GoDeveloper` | Go backend specialist for Gin, Echo, APIs, and database integration. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@DevopsEngineer` | DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management. | ollama-cloud/kimi-k2.6 |
|
||||
| `@CodeSkeptic` | Adversarial code reviewer. | ollama-cloud/kimi-k2.6 |
|
||||
| `@TheFixer` | Iteratively fixes bugs based on specific error reports and test failures. | ollama-cloud/kimi-k2.6 |
|
||||
| `@PerformanceEngineer` | Reviews code for performance issues. | ollama-cloud/kimi-k2.6 |
|
||||
| `@SecurityAuditor` | Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets. | ollama-cloud/kimi-k2.6 |
|
||||
| `@GoDeveloper` | Go backend specialist for Gin, Echo, APIs, and database integration. | ollama-cloud/kimi-k2.6 |
|
||||
| `@DevopsEngineer` | DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@CodeSkeptic` | Adversarial code reviewer. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@TheFixer` | Iteratively fixes bugs based on specific error reports and test failures. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@PerformanceEngineer` | Reviews code for performance issues. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@SecurityAuditor` | Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets. | ollama-cloud/glm-5.1 |
|
||||
| `@VisualTester` | Visual regression testing agent that compares screenshots and detects UI differences using pixelmatch and image diff. | ollama-cloud/kimi-k2.6 |
|
||||
| `@Orchestrator` | Main dispatcher. | ollama-cloud/kimi-k2.6 |
|
||||
| `@ReleaseManager` | Manages git operations, semantic versioning, branching, and deployments. | ollama-cloud/kimi-k2.6 |
|
||||
| `@Evaluator` | Scores agent effectiveness after task completion for continuous improvement. | ollama-cloud/kimi-k2.6 |
|
||||
| `@PromptOptimizer` | Improves agent system prompts based on performance failures. | ollama-cloud/kimi-k2.6 |
|
||||
| `@Orchestrator` | Main dispatcher. | ollama-cloud/glm-5.1 |
|
||||
| `@ReleaseManager` | Manages git operations, semantic versioning, branching, and deployments. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@Evaluator` | Scores agent effectiveness after task completion for continuous improvement. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@PromptOptimizer` | Improves agent system prompts based on performance failures. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@ProductOwner` | Manages issue checklists, status labels, tracks progress and coordinates with human users. | ollama-cloud/kimi-k2.6 |
|
||||
| `@AgentArchitect` | Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. | ollama-cloud/kimi-k2.6 |
|
||||
| `@CapabilityAnalyst` | Analyzes task requirements against available agents, workflows, and skills. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@WorkflowArchitect` | Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates. | ollama-cloud/kimi-k2.6 |
|
||||
| `@MarkdownValidator` | Validates and corrects Markdown descriptions for Gitea issues. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@BrowserAutomation` | Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction. | ollama-cloud/kimi-k2.6 |
|
||||
| `@Planner` | Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@Reflector` | Self-reflection agent using Reflexion pattern - learns from mistakes. | ollama-cloud/kimi-k2.6 |
|
||||
| `@MemoryManager` | Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences). | ollama-cloud/kimi-k2.6 |
|
||||
| `@AgentArchitect` | Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@CapabilityAnalyst` | Analyzes task requirements against available agents, workflows, and skills. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@WorkflowArchitect` | Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates. | ollama-cloud/glm-5.1 |
|
||||
| `@MarkdownValidator` | Validates and corrects Markdown descriptions for Gitea issues. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@BrowserAutomation` | Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@Planner` | Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect. | ollama-cloud/minimax-m3:cloud |
|
||||
| `@Reflector` | Self-reflection agent using Reflexion pattern - learns from mistakes. | ollama-cloud/glm-5.1 |
|
||||
| `@MemoryManager` | Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences). | ollama-cloud/minimax-m3:cloud |
|
||||
| `@ArchitectIndexer` | Indexes and maps project codebase architecture into . | ollama-cloud/qwen3-coder:480b |
|
||||
| `@FlutterDeveloper` | Flutter mobile specialist for cross-platform apps, state management, and UI components. | ollama-cloud/kimi-k2.6 |
|
||||
| `@FlutterDeveloper` | Flutter mobile specialist for cross-platform apps, state management, and UI components. | ollama-cloud/minimax-m2.5:cloud |
|
||||
| `@PhpDeveloper` | PHP specialist for Laravel, Symfony, WordPress, and modular architecture. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@PipelineJudge` | Automated pipeline judge. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@PythonDeveloper` | Python specialist for Django, FastAPI, data processing, and ML pipelines. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@IncidentResponder` | Server incident response and system hardening specialist. | ollama-cloud/kimi-k2.6 |
|
||||
| `@IncidentResponder` | Server incident response and system hardening specialist. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@WorkflowCrossChecker` | Workflow cross-checker and process inspector. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@EvolutionSkeptic` | Evaluates model responses against role-specific rubrics with detailed scoring and commentary. | ollama-cloud/qwen3-coder:480b |
|
||||
| `@EvolutionPrompt` | Generates role-specific stress-test prompts by analyzing agent definitions. | ollama-cloud/kimi-k2.6 |
|
||||
| `@EvolutionSkeptic` | Evaluates model responses against role-specific rubrics with detailed scoring and commentary. | ollama-cloud/deepseek-v4-pro |
|
||||
| `@EvolutionPrompt` | Generates role-specific stress-test prompts by analyzing agent definitions. | ollama-cloud/minimax-m3:cloud |
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
name: Agent Architect
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
description: Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis. Tier 2 meta-agent with self-cascade enabled.
|
||||
color: "#8B5CF6"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#1E88E5"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components. Tier 2 meta-agent with self-cascade enabled.
|
||||
mode: subagent
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#6366F1"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#E11D48"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#FF6B35"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Scores agent effectiveness after task completion for continuous improvement. Tier 2 meta-agent with self-cascade enabled.
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
color: "#047857"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Generates role-specific stress-test prompts by analyzing agent definitions. Reads .kilo/agents/*.md to create adversarial test scenarios that validate role adherence, edge-case handling, and instruction following. (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#FF6B00"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Evaluates model responses against role-specific rubrics with detailed scoring and commentary. Scores role adherence, reasoning quality, instruction following, boundary awareness, and output quality. Produces per-dimension scores with explanations. (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#C026D3"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Flutter mobile specialist for cross-platform apps, state management, and UI components (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m2.5:cloud
|
||||
color: "#02569B"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups (GNS-2 Tier 1)
|
||||
mode: all
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/minimax-m2.5:cloud
|
||||
color: "#0EA5E9"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Go backend specialist for Gin, Echo, APIs, and database integration (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
color: "#00ADD8"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
color: "#059669"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#B91C1C"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Primary code writer for backend and core logic. Writes implementation to pass tests (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
color: "#DC2626"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Validates and corrects Markdown descriptions for Gitea issues (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#F97316"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences) (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#8B5CF6"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine. IF:90 for optimal routing accuracy. (GNS-2 Tier 1)
|
||||
mode: all
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
variant: thinking
|
||||
color: "#7C3AED"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity (GNS-2 Tier 0)
|
||||
mode: all
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#0D9488"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#F59E0B"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Improves agent system prompts based on performance failures. Meta-learner for prompt optimization (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#BE185D"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Self-reflection agent using Reflexion pattern - learns from mistakes (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
color: "#10B981"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Manages git operations, semantic versioning, branching, and deployments. Ensures clean history (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#581C87"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists (GNS-2 Tier 1)
|
||||
mode: all
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
color: "#4F46E5"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Writes tests following TDD methodology. Tests MUST fail initially (Red phase) (GNS-2 Tier 1)
|
||||
mode: all
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
color: "#8B5CF6"
|
||||
permission:
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets (GNS-2 Tier 0)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
color: "#DC2626"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Designs technical specifications, data schemas, and API contracts before implementation (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
color: "#0891B2"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Iteratively fixes bugs based on specific error reports and test failures (GNS-2 Tier 1)
|
||||
mode: all
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
color: "#F59E0B"
|
||||
permission:
|
||||
read: allow
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
description: Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates (GNS-2 Tier 1)
|
||||
mode: subagent
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
variant: thinking
|
||||
color: "#EC4899"
|
||||
permission:
|
||||
|
||||
@@ -15,7 +15,7 @@ agents:
|
||||
forbidden:
|
||||
- test_writing
|
||||
- code_review
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
@@ -49,7 +49,7 @@ agents:
|
||||
- frontend_tests
|
||||
forbidden:
|
||||
- backend_code
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/minimax-m2.5:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- code-skeptic
|
||||
@@ -180,7 +180,7 @@ agents:
|
||||
- concurrent_solutions
|
||||
forbidden:
|
||||
- frontend_code
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- code-skeptic
|
||||
@@ -208,7 +208,7 @@ agents:
|
||||
forbidden:
|
||||
- backend_code
|
||||
- web_development
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m2.5:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- code-skeptic
|
||||
@@ -235,7 +235,7 @@ agents:
|
||||
- ci_cd_config
|
||||
forbidden:
|
||||
- application_code
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- code-skeptic
|
||||
@@ -263,7 +263,7 @@ agents:
|
||||
- coverage_reports
|
||||
forbidden:
|
||||
- implementation_code
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
@@ -289,7 +289,7 @@ agents:
|
||||
forbidden:
|
||||
- suggest_implementations
|
||||
- write_code
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- the-fixer
|
||||
@@ -315,7 +315,7 @@ agents:
|
||||
- vulnerability_list
|
||||
forbidden:
|
||||
- fix_vulnerabilities
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- the-fixer
|
||||
@@ -341,7 +341,7 @@ agents:
|
||||
- optimization_suggestions
|
||||
forbidden:
|
||||
- write_code
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- the-fixer
|
||||
@@ -366,7 +366,7 @@ agents:
|
||||
- resolution_notes
|
||||
forbidden:
|
||||
- feature_development
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- code-skeptic
|
||||
@@ -391,7 +391,7 @@ agents:
|
||||
- screenshots
|
||||
forbidden:
|
||||
- unit_testing
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- orchestrator
|
||||
@@ -453,7 +453,7 @@ agents:
|
||||
- database_schemas
|
||||
forbidden:
|
||||
- implementation
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
fallback_models:
|
||||
@@ -476,7 +476,7 @@ agents:
|
||||
- new_agent_specs
|
||||
forbidden:
|
||||
- implementation
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- agent-architect
|
||||
@@ -501,7 +501,7 @@ agents:
|
||||
forbidden:
|
||||
- code_writing
|
||||
- code_review
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
variant: thinking
|
||||
mode: all
|
||||
delegates_to:
|
||||
@@ -557,7 +557,7 @@ agents:
|
||||
forbidden:
|
||||
- code_changes
|
||||
- feature_development
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- evaluator
|
||||
@@ -582,7 +582,7 @@ agents:
|
||||
- recommendations
|
||||
forbidden:
|
||||
- code_changes
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
variant: thinking
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
@@ -607,7 +607,7 @@ agents:
|
||||
- optimization_report
|
||||
forbidden:
|
||||
- agent_creation
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
variant: instant
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
@@ -677,7 +677,7 @@ agents:
|
||||
- command_files
|
||||
forbidden:
|
||||
- execution
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
variant: thinking
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
@@ -698,7 +698,7 @@ agents:
|
||||
- corrections
|
||||
forbidden:
|
||||
- content_creation
|
||||
model: ollama-cloud/qwen3-coder:480b
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- orchestrator
|
||||
@@ -719,7 +719,7 @@ agents:
|
||||
- integration_plan
|
||||
forbidden:
|
||||
- agent_execution
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
variant: thinking
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
@@ -748,7 +748,7 @@ agents:
|
||||
forbidden:
|
||||
- implementation
|
||||
- execution
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
fallback_models:
|
||||
@@ -774,7 +774,7 @@ agents:
|
||||
forbidden:
|
||||
- implementation
|
||||
- code_changes
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/glm-5.1
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
fallback_models:
|
||||
@@ -799,7 +799,7 @@ agents:
|
||||
forbidden:
|
||||
- code_changes
|
||||
- implementation
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to: []
|
||||
fallback_models:
|
||||
@@ -869,7 +869,7 @@ agents:
|
||||
forbidden:
|
||||
- code_writing
|
||||
- implementation
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/deepseek-v4-pro
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- orchestrator
|
||||
@@ -898,7 +898,7 @@ agents:
|
||||
forbidden:
|
||||
- direct_evaluation
|
||||
- model_execution
|
||||
model: ollama-cloud/kimi-k2.6
|
||||
model: ollama-cloud/minimax-m3:cloud
|
||||
mode: subagent
|
||||
delegates_to:
|
||||
- evolution-skeptic
|
||||
|
||||
38
agent-evolution/data/evolution-summary.json
Normal file
38
agent-evolution/data/evolution-summary.json
Normal file
@@ -0,0 +1,38 @@
|
||||
{
|
||||
"ts": "2026-06-01T20:35:00Z",
|
||||
"event": "evolution_complete_report",
|
||||
"trigger": "user_request_objective_evolution",
|
||||
"methodology": "capability-analyst_research_report + deterministic_sync",
|
||||
"agents_changed": 29,
|
||||
"model_distribution": {
|
||||
"deepseek-v4-pro": 14,
|
||||
"minimax-m3:cloud": 8,
|
||||
"glm-5.1": 4,
|
||||
"minimax-m2.5:cloud": 2,
|
||||
"kimi-k2.6": 1
|
||||
},
|
||||
"evidence_file": "agent-evolution/data/research-report.json",
|
||||
"evidence_sources": [
|
||||
"github.com/MoonshotAI/Kimi-K2",
|
||||
"ollama.com/library/deepseek-v4-pro",
|
||||
"ollama.com/library/glm-5.1",
|
||||
"ollama.com/library/kimi-k2.6",
|
||||
"ollama.com/library/minimax-m3",
|
||||
"ollama.com/library/minimax-m2.5",
|
||||
"minimax.io/models/text/m3",
|
||||
"minimax.io/news/minimax-m25",
|
||||
"qwenlm.github.io/blog/qwen3-coder"
|
||||
],
|
||||
"opencompass_container": {
|
||||
"files": ["docker/docker-compose.opencompass.yml", "docker/Dockerfile.opencompass", "scripts/opencompass-eval.sh", "scripts/opencompass-setup.sh"],
|
||||
"status": "config_complete_build_blocked_network",
|
||||
"note": "Docker build requires internet access for pip install. Files validated and ready."
|
||||
},
|
||||
"data_gaps": [
|
||||
"minimax-m3: ALL benchmark tables on ollama.com and minimax.io are IMAGE-ONLY. Specific coding scores unavailable.",
|
||||
"qwen3-coder-480b: ALL benchmarks image-only. Lowest confidence assignment.",
|
||||
"kimi-k2.6: Ollama page image-only. Using K2 Instruct as proxy (likely understates performance).",
|
||||
"minimax-m2.5: Ollama images + partial blog text. Reasoning benchmarks missing."
|
||||
],
|
||||
"verification": "scripts/sync-agents.cjs --check PASSED"
|
||||
}
|
||||
220
agent-evolution/data/model-benchmarks.json
Normal file
220
agent-evolution/data/model-benchmarks.json
Normal file
@@ -0,0 +1,220 @@
|
||||
{
|
||||
"metadata": {
|
||||
"generated": "2026-06-01T20:00:00Z",
|
||||
"source": "github-moonshot-k2 + ollama-pages + minimax-blog + qwen-blog",
|
||||
"method": "text-extraction-from-tables",
|
||||
"confidence": "high",
|
||||
"verified_sources": [
|
||||
"github.com/MoonshotAI/Kimi-K2 (K2 Instruct proxy for K2.6)",
|
||||
"ollama.com/library/deepseek-v4-pro",
|
||||
"ollama.com/library/glm-5.1",
|
||||
"ollama.com/library/minimax-m3",
|
||||
"minimax.io/models/text/m3",
|
||||
"qwenlm.github.io/blog/qwen3-coder"
|
||||
]
|
||||
},
|
||||
"models": {
|
||||
"deepseek-v4-pro": {
|
||||
"vendor": "DeepSeek",
|
||||
"params": "1.6T total / 49B active",
|
||||
"context": "1M tokens",
|
||||
"sources": ["ollama.com/library/deepseek-v4-pro"],
|
||||
"coding": {
|
||||
"swe_bench_verified": 80.6,
|
||||
"swe_bench_pro": 55.4,
|
||||
"swe_bench_multilingual": 76.2,
|
||||
"livecodebench_v6": 93.5,
|
||||
"terminal_bench_2": 67.9,
|
||||
"codeforces": 3206
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": 83.4,
|
||||
"tool_decathlon": 51.8,
|
||||
"mcp_atlas_public": 73.6
|
||||
},
|
||||
"reasoning": {
|
||||
"hmmt_feb_2026": 95.2,
|
||||
"gpqa_diamond": 90.1,
|
||||
"hle": 37.7,
|
||||
"imoanswerbench": 89.8,
|
||||
"mmlu_pro": 87.5
|
||||
},
|
||||
"long_context": {
|
||||
"mrcr_1m": 83.5,
|
||||
"corpusqa_1m": 62.0
|
||||
},
|
||||
"rank": 1
|
||||
},
|
||||
"glm-5.1": {
|
||||
"vendor": "Zhipu AI (Z.AI)",
|
||||
"params": "756B total / ~40B active",
|
||||
"context": "198K tokens",
|
||||
"sources": ["ollama.com/library/glm-5.1"],
|
||||
"coding": {
|
||||
"swe_bench_pro": 58.4,
|
||||
"terminal_bench_2": 63.5,
|
||||
"nl2repo": 42.7
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": 68.0,
|
||||
"browsecomp_with_context": 79.3,
|
||||
"tau3_bench": 70.6,
|
||||
"cybergym": 68.7,
|
||||
"mcp_atlas_public": 71.8,
|
||||
"tool_decathlon": 40.7
|
||||
},
|
||||
"reasoning": {
|
||||
"aime_2026": 95.3,
|
||||
"hmmt_feb_2026": 82.6,
|
||||
"gpqa_diamond": 86.2,
|
||||
"hle": 31.0,
|
||||
"imoanswerbench": 83.8
|
||||
},
|
||||
"unique": "Sustained performance over hundreds of rounds and thousands of tool calls — unique claim",
|
||||
"rank": 2
|
||||
},
|
||||
"kimi-k2.6": {
|
||||
"vendor": "Moonshot AI",
|
||||
"params": "1.04T total / unknown active (proxy: K2 Instruct)",
|
||||
"context": "256K tokens",
|
||||
"multimodal": true,
|
||||
"proxy_note": "Using Kimi K2 Instruct data as proxy for K2.6",
|
||||
"sources": ["github.com/MoonshotAI/Kimi-K2"],
|
||||
"coding": {
|
||||
"swe_bench_verified": 65.8,
|
||||
"swe_bench_verified_multiple": 71.6,
|
||||
"swe_bench_multilingual": 47.3,
|
||||
"livecodebench_v6": 53.7,
|
||||
"terminal_bench_2": 30.0,
|
||||
"aider_polyglot": 60.0,
|
||||
"multiple_pass": 85.7
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": 60.6,
|
||||
"tau2_retail": 70.6,
|
||||
"tau2_airline": 56.5,
|
||||
"tau2_telecom": 65.8,
|
||||
"acebench": 76.5
|
||||
},
|
||||
"reasoning": {
|
||||
"aime_2025": 49.5,
|
||||
"math_500": 97.4,
|
||||
"hmmt_2025": 38.8,
|
||||
"gpqa_diamond": 75.1,
|
||||
"mmlu": 89.5,
|
||||
"mmlu_pro": 81.1
|
||||
},
|
||||
"unique": "ONLY true multimodal (vision + text native) among all candidates",
|
||||
"rank": 3
|
||||
},
|
||||
"minimax-m3": {
|
||||
"vendor": "MiniMax",
|
||||
"params": "unknown",
|
||||
"context": "512K guaranteed, up to 1M",
|
||||
"multimodal": true,
|
||||
"sources": ["ollama.com/library/minimax-m3", "minimax.io/models/text/m3"],
|
||||
"agentic": {
|
||||
"browsecomp": 83.5,
|
||||
"paper_reproduction": "12-hour autonomous ICLR replication (18 commits, 23 figures)",
|
||||
"cuda_optimization": "147 iterations, 9.4x speedup, zero human intervention",
|
||||
"posttrainbench": "37.1 (#3 overall, behind Opus 4.7 42.4, GPT-5.5 39.3)"
|
||||
},
|
||||
"coding": {
|
||||
"note": "Top-tier per Ollama; specific scores not in extracted text"
|
||||
},
|
||||
"long_context": {
|
||||
"msa_architecture": "Native ultra-long context pretraining"
|
||||
},
|
||||
"rank": 4
|
||||
},
|
||||
"minimax-m2.5": {
|
||||
"vendor": "MiniMax",
|
||||
"params": "unknown",
|
||||
"context": "unknown",
|
||||
"sources": ["ollama.com/library/minimax-m2.5"],
|
||||
"coding": {
|
||||
"note": "State-of-the-art for real-world productivity and coding tasks"
|
||||
},
|
||||
"agentic": {
|
||||
"tools": true,
|
||||
"thinking": true,
|
||||
"pulls": "2.2M on Ollama"
|
||||
},
|
||||
"unique": "User-confirmed best frontend developer model",
|
||||
"rank": 5
|
||||
},
|
||||
"qwen3-coder-480b": {
|
||||
"vendor": "Alibaba/Qwen",
|
||||
"params": "480B total / 35B active",
|
||||
"context": "256K native, 1M w/ YaRN",
|
||||
"sources": ["qwenlm.github.io/blog/qwen3-coder", "huggingface.co"],
|
||||
"coding": {
|
||||
"swe_bench_pro_hf": 38.7,
|
||||
"terminal_bench_2_hf": 23.9,
|
||||
"evasionbench": 78.16
|
||||
},
|
||||
"agentic": {
|
||||
"note": "Claims SOTA open-source on agentic coding; methodology differs from HF eval"
|
||||
},
|
||||
"rank": 6
|
||||
}
|
||||
},
|
||||
"role_assignments": {
|
||||
"deepseek-v4-pro": {
|
||||
"agents": ["lead-developer", "backend-developer", "php-developer", "python-developer", "code-skeptic", "the-fixer", "performance-engineer"],
|
||||
"rationale": "Coding: SWE-bench 80.6%, LiveCodeBench 93.5%, TerminalBench 67.9%. Reasoning: GPQA 90.1%, HMMT 95.2%. Best raw coding + algorithmic analysis scores."
|
||||
},
|
||||
"glm-5.1": {
|
||||
"agents": ["agent-architect", "workflow-architect", "orchestrator"],
|
||||
"rationale": "Agentic: CyberGym 68.7%, Tau3 70.6%, BrowseComp 68-79%. Unique claim: sustained performance over hundreds of rounds. Best for long-horizon design tasks."
|
||||
},
|
||||
"kimi-k2.6": {
|
||||
"agents": ["visual-tester"],
|
||||
"rationale": "ONLY true multimodal (vision + text native). SWE-bench 65.8%, AceBench 76.5%. Multimodal screenshot analysis requires native vision."
|
||||
},
|
||||
"minimax-m3": {
|
||||
"agents": ["system-analyst", "planner", "capability-analyst", "devops-engineer", "security-auditor", "evaluator", "prompt-optimizer", "reflector", "memory-manager", "evolution-prompt"],
|
||||
"rationale": "BrowseComp 83.5 (surpasses Opus 4.7). 1M context MSA architecture. 12h autonomous paper replication, 147 CUDA iterations without human intervention. Best for agentic tasks requiring long context + persistence."
|
||||
},
|
||||
"minimax-m2.5": {
|
||||
"agents": ["frontend-developer", "browser-automation", "flutter-developer"],
|
||||
"rationale": "User-confirmed best frontend model. 2.2M Ollama pulls. 'Real-world productivity and coding tasks' per Ollama description."
|
||||
},
|
||||
"qwen3-coder-480b": {
|
||||
"agents": ["sdet-engineer", "release-manager", "product-owner", "markdown-validator", "pipeline-judge", "history-miner", "go-developer", "architect-indexer", "workflow-cross-checker", "evolution-skeptic", "requirement-refiner"],
|
||||
"rationale": "Lower benchmark scores (SWE-bench Pro 38.7%, TerminalBench 23.9%). Best fit for simple structured tasks where deterministic output is more important than frontier reasoning."
|
||||
}
|
||||
},
|
||||
"evidence_table": {
|
||||
"swe_bench_verified": [
|
||||
{"model": "deepseek-v4-pro", "score": 80.6, "source": "ollama"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 65.8, "source": "github-k2"},
|
||||
{"model": "glm-5.1", "score": null, "source": "not-published"},
|
||||
{"model": "qwen3-coder-480b", "score": null, "source": "blog-claims-sota"}
|
||||
],
|
||||
"livecodebench": [
|
||||
{"model": "deepseek-v4-pro", "score": 93.5, "source": "ollama"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 53.7, "source": "github-k2"}
|
||||
],
|
||||
"terminal_bench": [
|
||||
{"model": "deepseek-v4-pro", "score": 67.9, "source": "ollama"},
|
||||
{"model": "glm-5.1", "score": 63.5, "source": "ollama"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 30.0, "source": "github-k2"}
|
||||
],
|
||||
"browsecomp": [
|
||||
{"model": "deepseek-v4-pro", "score": 83.4, "source": "ollama"},
|
||||
{"model": "minimax-m3", "score": 83.5, "source": "ollama+minimax-blog"},
|
||||
{"model": "glm-5.1", "score": 68.0, "source": "ollama"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 60.6, "source": "github-k2"}
|
||||
],
|
||||
"gpqa_diamond": [
|
||||
{"model": "deepseek-v4-pro", "score": 90.1, "source": "ollama"},
|
||||
{"model": "glm-5.1", "score": 86.2, "source": "ollama"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 75.1, "source": "github-k2"}
|
||||
],
|
||||
"tau_tool_use": [
|
||||
{"model": "glm-5.1", "score": 70.6, "source": "ollama", "variant": "tau3"},
|
||||
{"model": "kimi-k2 (proxy)", "score": 70.6, "source": "github-k2", "variant": "tau2-retail"}
|
||||
]
|
||||
}
|
||||
}
|
||||
598
agent-evolution/data/research-report.json
Normal file
598
agent-evolution/data/research-report.json
Normal file
@@ -0,0 +1,598 @@
|
||||
{
|
||||
"metadata": {
|
||||
"generated": "2026-06-01T20:26:03+01:00",
|
||||
"agent": "capability-analyst",
|
||||
"task": "unbiased LLM benchmark research for agent-role assignments",
|
||||
"method": "web-scraping + text-extraction",
|
||||
"sources_checked": [
|
||||
"ollama.com/library/deepseek-v4-pro",
|
||||
"ollama.com/library/glm-5.1",
|
||||
"ollama.com/library/kimi-k2.6",
|
||||
"ollama.com/library/minimax-m3",
|
||||
"ollama.com/library/minimax-m2.5",
|
||||
"ollama.com/library/qwen3-coder:480b",
|
||||
"huggingface.co/deepseek-ai/DeepSeek-V4-Flash (V4 tech report)",
|
||||
"huggingface.co/deepseek-ai/DeepSeek-V4-Pro",
|
||||
"github.com/MoonshotAI/Kimi-K2 (K2 Instruct README, proxy for K2.6)",
|
||||
"minimax.io/models/text/m3",
|
||||
"minimax.io/news/minimax-m25",
|
||||
"qwenlm.github.io/blog/qwen3-coder/",
|
||||
"rank.opencompass.org.cn/home (JS-required, not text-extractable)"
|
||||
],
|
||||
"limitations_documented": [
|
||||
"MiniMax M3: All benchmark tables on ollama.com and minimax.io are IMAGE-ONLY (not text-extractable). Specific numeric scores extracted from prose claims only.",
|
||||
"MiniMax M2.5: Ollama page benchmark tables are IMAGE-ONLY. Blog post (minimax.io/news/minimax-m25) has text-extractable scores.",
|
||||
"Kimi K2.6: Ollama page benchmark table is IMAGE-ONLY. Using Kimi K2 Instruct README on GitHub as proxy. K2.6 is the next-gen version with improved scores per Ollama description.",
|
||||
"Qwen3-Coder 480B: Ollama page and blog post benchmark tables are IMAGE-ONLY. Specific numeric scores from HuggingFace leaderboard (V4-Flash variant, not 480B).",
|
||||
"CompassRank: JavaScript-rendered page, not text-extractable via HTTP GET."
|
||||
],
|
||||
"confidence": "high-for-text-extracted, medium-for-image-only-models"
|
||||
},
|
||||
"models": {
|
||||
"deepseek-v4-pro": {
|
||||
"vendor": "DeepSeek",
|
||||
"provider_id": "ollama-cloud/deepseek-v4-pro",
|
||||
"params": "1.6T total / 49B active",
|
||||
"context": "1M tokens",
|
||||
"arch": "MoE, Hybrid Attention (CSA+HCA), Muon optimizer",
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/deepseek-v4-pro",
|
||||
"tech_report": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf",
|
||||
"huggingface": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"
|
||||
},
|
||||
"data_extraction": "FULL text-extractable benchmark table on Ollama page + HuggingFace model card",
|
||||
"coding": {
|
||||
"swe_bench_verified": {"score": 80.6, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"swe_bench_pro": {"score": 55.4, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"swe_bench_multilingual": {"score": 76.2, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"livecodebench_v6": {"score": 93.5, "source": "ollama+HF model card", "mode": "Max thinking", "note": "Pass@1"},
|
||||
"terminal_bench_2": {"score": 67.9, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"codeforces": {"score": 3206, "source": "ollama+HF model card", "mode": "Max thinking", "note": "Rating"},
|
||||
"swe_verified_non_think": {"score": 73.6, "source": "HF model card", "note": "Non-think mode for comparison"},
|
||||
"livecodebench_non_think": {"score": 56.8, "source": "HF model card", "note": "Non-think mode for comparison"}
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": 83.4, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"tool_decathlon": {"score": 51.8, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"mcp_atlas_public": {"score": 73.6, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"gdpval_aa_elo": {"score": 1554, "source": "HF model card", "mode": "Max thinking"},
|
||||
"hle_with_tools": {"score": 48.2, "source": "HF model card", "note": "HLE w/ tools"}
|
||||
},
|
||||
"reasoning": {
|
||||
"hmmt_feb_2026": {"score": 95.2, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"gpqa_diamond": {"score": 90.1, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"hle": {"score": 37.7, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"imoanswerbench": {"score": 89.8, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"mmlu_pro": {"score": 87.5, "source": "ollama+HF model card", "mode": "Max thinking"},
|
||||
"apex": {"score": 38.3, "source": "HF model card", "mode": "Max thinking"},
|
||||
"apex_shortlist": {"score": 90.2, "source": "HF model card", "mode": "Max thinking"},
|
||||
"simpleqa_verified": {"score": 57.9, "source": "HF model card", "mode": "Max thinking"}
|
||||
},
|
||||
"long_context": {
|
||||
"mrcr_1m": {"score": 83.5, "source": "ollama+HF model card"},
|
||||
"corpusqa_1m": {"score": 62.0, "source": "ollama+HF model card"}
|
||||
},
|
||||
"unique_strengths": [
|
||||
"Highest LiveCodeBench v6 at 93.5% (among open models)",
|
||||
"Highest Codeforces rating 3206",
|
||||
"1M context window with efficient hybrid attention",
|
||||
"3 reasoning modes (non-think/high/max) for different latency/cost trade-offs"
|
||||
]
|
||||
},
|
||||
"glm-5.1": {
|
||||
"vendor": "Zhipu AI (Z.AI)",
|
||||
"provider_id": "ollama-cloud/glm-5.1",
|
||||
"params": "756B total / ~40B active (estimated)",
|
||||
"context": "198K tokens",
|
||||
"arch": "MoE",
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/glm-5.1"
|
||||
},
|
||||
"data_extraction": "FULL text-extractable benchmark table on Ollama page with 9-model comparison table",
|
||||
"coding": {
|
||||
"swe_bench_pro": {"score": 58.4, "source": "ollama", "note": "SOTA among open models per Ollama table"},
|
||||
"terminal_bench_2": {"score": 63.5, "source": "ollama", "note": "Terminus-2 framework"},
|
||||
"terminal_bench_2_self_reported": {"score": 66.5, "source": "ollama", "note": "Best self-reported (Claude Code harness)"},
|
||||
"nl2repo": {"score": 42.7, "source": "ollama", "note": "Leads GLM-5 by wide margin"}
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": 68.0, "source": "ollama"},
|
||||
"browsecomp_with_context": {"score": 79.3, "source": "ollama"},
|
||||
"tau3_bench": {"score": 70.6, "source": "ollama"},
|
||||
"cybergym": {"score": 68.7, "source": "ollama", "note": "SOTA among all models in table"},
|
||||
"mcp_atlas_public": {"score": 71.8, "source": "ollama"},
|
||||
"tool_decathlon": {"score": 40.7, "source": "ollama"},
|
||||
"vending_bench_2": {"score": 5634.0, "source": "ollama", "note": "Dollar amount; $5,634 vs Claude $8,018 vs GPT-5.4 $6,144"}
|
||||
},
|
||||
"reasoning": {
|
||||
"aime_2026": {"score": 95.3, "source": "ollama"},
|
||||
"hmmt_feb_2026": {"score": 82.6, "source": "ollama"},
|
||||
"hmmt_nov_2025": {"score": 94.0, "source": "ollama"},
|
||||
"gpqa_diamond": {"score": 86.2, "source": "ollama"},
|
||||
"hle": {"score": 31.0, "source": "ollama"},
|
||||
"hle_with_tools": {"score": 52.3, "source": "ollama"},
|
||||
"imoanswerbench": {"score": 83.8, "source": "ollama"}
|
||||
},
|
||||
"unique_strengths": [
|
||||
"SOTA SWE-Bench Pro (58.4) among open models",
|
||||
"SOTA CyberGym (68.7) - only model with dedicated cybersecurity eval",
|
||||
"UNIQUE CLAIM: Sustained performance over hundreds of rounds and thousands of tool calls - does not plateau like previous models",
|
||||
"Strong Vending Bench 2 score ($5,634) showing economic task competence",
|
||||
"Handles ambiguous problems with better judgment over longer sessions"
|
||||
]
|
||||
},
|
||||
"kimi-k2.6": {
|
||||
"vendor": "Moonshot AI",
|
||||
"provider_id": "ollama-cloud/kimi-k2.6",
|
||||
"params": "1.04T total / unknown active",
|
||||
"context": "256K tokens",
|
||||
"arch": "MoE, Muon optimizer, native multimodal",
|
||||
"multimodal": true,
|
||||
"proxy_note": "Kimi K2.6 is the successor to Kimi K2. Ollama page benchmark is IMAGE-ONLY. Using Kimi K2 Instruct GitHub README as lower-bound proxy. K2.6 claims improvements in 'long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.'",
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/kimi-k2.6",
|
||||
"proxy_data": "https://github.com/MoonshotAI/Kimi-K2 (K2 Instruct README)",
|
||||
"tech_report": "https://arxiv.org/abs/2507.20534"
|
||||
},
|
||||
"data_extraction": "Ollama page: IMAGE-ONLY. GitHub README: FULL text-extractable for K2 Instruct. K2.6 scores likely higher.",
|
||||
"coding": {
|
||||
"swe_bench_verified": {"score": 65.8, "source": "github-k2-instruct", "note": "K2 Instruct proxy. K2.6 claims improvements."},
|
||||
"swe_bench_verified_multiple": {"score": 71.6, "source": "github-k2-instruct", "note": "Multiple attempts with scoring model"},
|
||||
"swe_bench_multilingual": {"score": 47.3, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
|
||||
"livecodebench_v6": {"score": 53.7, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
|
||||
"terminal_bench": {"score": 30.0, "source": "github-k2-instruct", "note": "Inhouse framework; Terminus: 25.0"},
|
||||
"aider_polyglot": {"score": 60.0, "source": "github-k2-instruct"},
|
||||
"multiple_pass": {"score": 85.7, "source": "github-k2-instruct", "note": "MultiPL-E"},
|
||||
"ojbench": {"score": 27.1, "source": "github-k2-instruct", "note": "SOTA among open models"}
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": 60.6, "source": "github-k2-instruct", "note": "K2 Instruct proxy"},
|
||||
"tau2_retail": {"score": 70.6, "source": "github-k2-instruct"},
|
||||
"tau2_airline": {"score": 56.5, "source": "github-k2-instruct"},
|
||||
"tau2_telecom": {"score": 65.8, "source": "github-k2-instruct", "note": "SOTA among open models"},
|
||||
"acebench": {"score": 76.5, "source": "github-k2-instruct"},
|
||||
"multi_challenge": {"score": 54.1, "source": "github-k2-instruct", "note": "SOTA among open models"}
|
||||
},
|
||||
"reasoning": {
|
||||
"aime_2025": {"score": 49.5, "source": "github-k2-instruct", "note": "Avg@64"},
|
||||
"math_500": {"score": 97.4, "source": "github-k2-instruct"},
|
||||
"hmmt_2025": {"score": 38.8, "source": "github-k2-instruct", "note": "Avg@32"},
|
||||
"gpqa_diamond": {"score": 75.1, "source": "github-k2-instruct", "note": "Avg@8"},
|
||||
"mmlu": {"score": 89.5, "source": "github-k2-instruct"},
|
||||
"mmlu_pro": {"score": 81.1, "source": "github-k2-instruct"},
|
||||
"zebralogic": {"score": 89.0, "source": "github-k2-instruct", "note": "SOTA"},
|
||||
"autologi": {"score": 89.5, "source": "github-k2-instruct"}
|
||||
},
|
||||
"unique_strengths": [
|
||||
"ONLY true native multimodal (vision+text) model among candidates",
|
||||
"Agent swarm: can coordinate 300 sub-agents, 4000+ steps autonomously",
|
||||
"Coding-driven design: visual inputs → production-ready interfaces",
|
||||
"Proactive 24/7 autonomous execution capability",
|
||||
"Strong tool-use (Tau2 telecom SOTA, AceBench 76.5)"
|
||||
]
|
||||
},
|
||||
"minimax-m3": {
|
||||
"vendor": "MiniMax",
|
||||
"provider_id": "ollama-cloud/minimax-m3:cloud",
|
||||
"params": "unknown (proprietary)",
|
||||
"context": "512K guaranteed, up to 1M",
|
||||
"arch": "MiniMax Sparse Attention (MSA), native multimodal",
|
||||
"multimodal": true,
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/minimax-m3",
|
||||
"product_page": "https://www.minimax.io/models/text/m3"
|
||||
},
|
||||
"data_extraction": "ALL benchmark tables on ollama.com and minimax.io are IMAGE-ONLY. Specific numeric scores extracted from prose claims only. This is a SIGNIFICANT data gap.",
|
||||
"coding": {
|
||||
"note": "Text claims 'top-tier performance on coding and agentic benchmarks' and 'frontier coding capabilities' but exact SWE-bench, LiveCodeBench scores are in IMAGES and NOT text-extractable.",
|
||||
"swe_bench_verified": null,
|
||||
"swe_bench_pro": null,
|
||||
"livecodebench": null,
|
||||
"terminal_bench": null
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": 83.5, "source": "minimax.io prose", "note": "Text-extracted claim: 'surpasses Opus 4.7 (79.3)'"},
|
||||
"paper_reproduction": {"score": "12-hour autonomous ICLR replication", "source": "minimax.io prose", "note": "18 commits, 23 figures, no human intervention"},
|
||||
"cuda_optimization": {"score": "9.4x speedup, 147 iterations", "source": "minimax.io prose", "note": "FP8 GEMM kernel, 7.6%→71.3% hardware utilization"},
|
||||
"posttrainbench": {"score": 37.1, "source": "minimax.io prose", "note": "Ranked #3 overall, behind Opus 4.7 (42.4) and GPT-5.5 (39.3)"}
|
||||
},
|
||||
"reasoning": {
|
||||
"note": "No reasoning benchmark scores text-extractable. All in images."
|
||||
},
|
||||
"long_context": {
|
||||
"msa_architecture": "Native ultra-long context pretraining via MSA. 1M context for long-range agent tasks, coding, and video understanding."
|
||||
},
|
||||
"unique_strengths": [
|
||||
"Native multimodal from pretraining (not bolt-on)",
|
||||
"Frontier coding + 1M context + multimodal in ONE model",
|
||||
"BrowseComp 83.5 (surpasses Opus 4.7 79.3)",
|
||||
"PostTrainBench 37.1 (#3 overall): autonomous model training pipeline",
|
||||
"12h autonomous paper reproduction, 147-iteration CUDA optimization",
|
||||
"Zero human intervention on extended autonomous tasks",
|
||||
"MSA architecture: efficient ultra-long context (1M)"
|
||||
]
|
||||
},
|
||||
"minimax-m2.5": {
|
||||
"vendor": "MiniMax",
|
||||
"provider_id": "ollama-cloud/minimax-m2.5:cloud",
|
||||
"params": "230B total",
|
||||
"context": "198K tokens",
|
||||
"arch": "Trained with large-scale RL across 200K+ real-world environments",
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/minimax-m2.5",
|
||||
"blog": "https://www.minimax.io/news/minimax-m25"
|
||||
},
|
||||
"data_extraction": "Ollama page: IMAGE-ONLY for benchmark tables. Blog post: PARTIAL text-extractable scores in prose + image tables.",
|
||||
"coding": {
|
||||
"swe_bench_verified": {"score": 80.2, "source": "minimax-blog prose", "note": "Text claim. On Droid harness: 79.7 > Opus 4.6 78.9. On OpenCode: 76.1 > Opus 4.6 75.9"},
|
||||
"multi_swe_bench": {"score": 51.3, "source": "minimax-blog prose"},
|
||||
"swe_bench_pro": null,
|
||||
"livecodebench": null,
|
||||
"terminal_bench": null,
|
||||
"vibe_pro": {"note": "Internal benchmark. 'Performs on par with Opus 4.5.' Scores in images only."}
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": 76.3, "source": "minimax-blog prose", "note": "With context management"},
|
||||
"browsecomp_raw": null,
|
||||
"tau_bench": null,
|
||||
"wide_search": {"note": "Image-only"},
|
||||
"rise": {"note": "Internal benchmark, image-only"}
|
||||
},
|
||||
"reasoning": {
|
||||
"aime_2025": null,
|
||||
"gpqa_diamond": null,
|
||||
"hle": null,
|
||||
"mmlu_pro": null
|
||||
},
|
||||
"efficiency": {
|
||||
"swe_bench_time": "22.8 min per task (vs M2.1 31.3 min, vs Opus 4.6 22.9 min)",
|
||||
"swe_bench_tokens": "3.52M per task (vs M2.1 3.72M)",
|
||||
"speed_improvement": "37% faster than M2.1",
|
||||
"inference_speed": "100 tokens/sec (2x frontier models)",
|
||||
"cost": "$1/hour continuous at 100 TPS, $0.30/hour at 50 TPS"
|
||||
},
|
||||
"unique_strengths": [
|
||||
"User-confirmed best frontend developer model (2.2M Ollama pulls)",
|
||||
"SWE-Bench Verified 80.2% - matches DeepSeek-V4-Pro",
|
||||
"37% faster task completion than predecessor",
|
||||
"37% more cost-efficient than Opus 4.6 (1/10th the cost)",
|
||||
"Trained on 10+ languages (Python, Go, C, C++, TypeScript, Rust, Kotlin, Java, JS, PHP, Lua, Dart, Ruby)",
|
||||
"200K+ real-world RL environments",
|
||||
"Native 'spec behavior' - plans architecture before writing code",
|
||||
"59% win rate on office productivity tasks (Word, PowerPoint, Excel)"
|
||||
]
|
||||
},
|
||||
"qwen3-coder-480b": {
|
||||
"vendor": "Alibaba/Qwen",
|
||||
"provider_id": "ollama-cloud/qwen3-coder:480b",
|
||||
"params": "480B total / 35B active",
|
||||
"context": "256K native, up to 1M with YaRN",
|
||||
"arch": "MoE, 7.5T pretraining tokens (70% code ratio), execution-driven RL",
|
||||
"sources": {
|
||||
"primary": "https://ollama.com/library/qwen3-coder:480b",
|
||||
"blog": "https://qwenlm.github.io/blog/qwen3-coder/",
|
||||
"huggingface": "https://huggingface.co/Qwen"
|
||||
},
|
||||
"data_extraction": "Ollama page: IMAGE-ONLY benchmark. Blog post: IMAGE-ONLY benchmark tables. HuggingFace: leaderboard scores for DeepSeek V4 Flash (not Qwen3-Coder 480B). Blog CLAIMS 'SOTA among open-source models on SWE-Bench Verified without test-time scaling' and 'comparable to Claude Sonnet 4' but exact scores in images.",
|
||||
"coding": {
|
||||
"swe_bench_verified": {"score": null, "note": "Blog claims SOTA open-source but exact number in image only"},
|
||||
"swe_bench_pro": {"score": null, "note": "Image-only"},
|
||||
"livecodebench": {"score": null, "note": "Image-only"},
|
||||
"terminal_bench": {"score": null, "note": "Image-only"},
|
||||
"evasionbench": {"score": null, "note": "Image-only"}
|
||||
},
|
||||
"agentic": {
|
||||
"browsecomp": {"score": null, "note": "Image-only"},
|
||||
"tau_bench": {"score": null, "note": "Image-only"},
|
||||
"acebench": {"score": null, "note": "Image-only"},
|
||||
"note": "Blog claims 'sets new SOTA among open models on Agentic Coding, Agentic Browser-Use, and Agentic Tool-Use, comparable to Claude Sonnet 4'"
|
||||
},
|
||||
"reasoning": {
|
||||
"note": "Blog states model preserves 'strong general and mathematical abilities' but no scores text-extractable"
|
||||
},
|
||||
"unique_strengths": [
|
||||
"Most agentic code model in Qwen series",
|
||||
"20,000 parallel RL environments for long-horizon training",
|
||||
"7.5T tokens pretraining (70% code ratio)",
|
||||
"Execution-driven RL on real-world coding tasks",
|
||||
"256K native context, 1M with YaRN",
|
||||
"Native CLI tool: Qwen Code (fork of Gemini CLI)",
|
||||
"Apache 2.0 license (most permissive)",
|
||||
"35B active params - good efficiency for 480B total"
|
||||
]
|
||||
}
|
||||
},
|
||||
"cross_model_comparison": {
|
||||
"swe_bench_verified": [
|
||||
{"model": "deepseek-v4-pro", "score": 80.6, "source": "ollama+HF (Max thinking)", "verified": true},
|
||||
{"model": "minimax-m2.5", "score": 80.2, "source": "minimax-blog (Claude Code harness)", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 65.8, "source": "github-k2-readme", "verified": true, "note": "K2.6 actual score likely higher"},
|
||||
{"model": "glm-5.1", "score": null, "source": "not-included-in-ollama-table", "verified": false},
|
||||
{"model": "minimax-m3", "score": null, "source": "image-only", "verified": false},
|
||||
{"model": "qwen3-coder-480b", "score": null, "source": "image-only", "verified": false}
|
||||
],
|
||||
"swe_bench_pro": [
|
||||
{"model": "glm-5.1", "score": 58.4, "source": "ollama", "verified": true},
|
||||
{"model": "deepseek-v4-pro", "score": 55.4, "source": "ollama+HF", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": null, "source": "not-in-table", "verified": false}
|
||||
],
|
||||
"livecodebench_v6": [
|
||||
{"model": "deepseek-v4-pro", "score": 93.5, "source": "ollama+HF", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 53.7, "source": "github-k2-readme", "verified": true}
|
||||
],
|
||||
"terminal_bench_2": [
|
||||
{"model": "deepseek-v4-pro", "score": 67.9, "source": "ollama+HF", "verified": true},
|
||||
{"model": "glm-5.1", "score": 63.5, "source": "ollama (Terminus-2)", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 30.0, "source": "github-k2-readme (inhouse)", "verified": true}
|
||||
],
|
||||
"browsecomp": [
|
||||
{"model": "minimax-m3", "score": 83.5, "source": "minimax.io prose", "verified": true},
|
||||
{"model": "deepseek-v4-pro", "score": 83.4, "source": "ollama+HF", "verified": true},
|
||||
{"model": "minimax-m2.5", "score": 76.3, "source": "minimax-blog (w/ context mgmt)", "verified": true},
|
||||
{"model": "glm-5.1", "score": 68.0, "source": "ollama", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 60.6, "source": "github-k2-readme", "verified": true}
|
||||
],
|
||||
"gpqa_diamond": [
|
||||
{"model": "deepseek-v4-pro", "score": 90.1, "source": "ollama+HF", "verified": true},
|
||||
{"model": "glm-5.1", "score": 86.2, "source": "ollama", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 75.1, "source": "github-k2-readme", "verified": true}
|
||||
],
|
||||
"hle": [
|
||||
{"model": "deepseek-v4-pro", "score": 37.7, "source": "ollama+HF", "verified": true},
|
||||
{"model": "glm-5.1", "score": 31.0, "source": "ollama", "verified": true},
|
||||
{"model": "kimi-k2-instruct (proxy)", "score": 4.7, "source": "github-k2-readme", "verified": true, "note": "Text only"}
|
||||
]
|
||||
},
|
||||
"recommendations": {
|
||||
"lead-developer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest coding scores: SWE-bench Verified 80.6%, LiveCodeBench 93.5%, Codeforces 3206. Best raw coding ability.",
|
||||
"fallback": "minimax-m2.5 (SWE-bench Verified 80.2%, but no LiveCodeBench/Codeforces data)"
|
||||
},
|
||||
"backend-developer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Best SWE-bench Multilingual 76.2% and Terminal Bench 67.9%. Superior backend infrastructure coding.",
|
||||
"fallback": "glm-5.1 (SWE-bench Pro 58.4%, Terminal Bench 63.5%)"
|
||||
},
|
||||
"frontend-developer": {
|
||||
"best_model": "minimax-m2.5",
|
||||
"rationale": "2.2M Ollama pulls. User-confirmed best frontend. 10+ language training incl. TypeScript, JS, Dart. 37% faster task completion. 'Spec behavior' architecture planning.",
|
||||
"fallback": "kimi-k2.6 (native multimodal = screenshot→code, 'coding-driven design' for visual→UI)"
|
||||
},
|
||||
"php-developer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest SWE-bench Multilingual (76.2%). PHP is multi-language coding task covered by general coding strength.",
|
||||
"fallback": "minimax-m2.5 (trained on PHP explicitly)"
|
||||
},
|
||||
"python-developer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Best Python coding: SWE-bench 80.6%, LiveCodeBench 93.5%. Python is the primary language in most benchmarks.",
|
||||
"fallback": "minimax-m2.5 (trained on Python explicitly)"
|
||||
},
|
||||
"go-developer": {
|
||||
"best_model": "kimi-k2.6",
|
||||
"rationale": "K2.6 explicitly claims 'generalizing robustly across Rust, Go, Python.' Go-specific training emphasized in Ollama description.",
|
||||
"fallback": "deepseek-v4-pro (general coding strength)"
|
||||
},
|
||||
"flutter-developer": {
|
||||
"best_model": "minimax-m2.5",
|
||||
"rationale": "Only model explicitly trained on Dart (10+ languages listed incl. Dart). 2.2M pulls. Real-world productivity claims.",
|
||||
"fallback": "kimi-k2.6 (multimodal→UI generation)"
|
||||
},
|
||||
"code-skeptic": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest GPQA Diamond 90.1% and HMMT 95.2%. Superior analytical reasoning for code review. Highest HLE 37.7%.",
|
||||
"fallback": "glm-5.1 (GPQA 86.2%, sustained reasoning over long sessions)"
|
||||
},
|
||||
"the-fixer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "SWE-bench Verified (bug-fixing benchmark) 80.6%. Best for debugging. Terminal Bench 67.9%.",
|
||||
"fallback": "glm-5.1 (sustained multi-round debugging without plateauing)"
|
||||
},
|
||||
"performance-engineer": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "ONLY model with demonstrated CUDA kernel optimization (147 iterations, 9.4x speedup, zero human intervention). PostTrainBench #3. 12h autonomous tasks.",
|
||||
"fallback": "deepseek-v4-pro (general reasoning strength)"
|
||||
},
|
||||
"sdet-engineer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Best reasoning for test design: GPQA 90.1%, HMMT 95.2%. SWE-bench tests pass rate implicit in 80.6% resolved.",
|
||||
"fallback": "glm-5.1 (sustained multi-round testing)"
|
||||
},
|
||||
"security-auditor": {
|
||||
"best_model": "glm-5.1",
|
||||
"rationale": "ONLY model with CyberGym score (68.7%, SOTA). Dedicated cybersecurity benchmark. Sustained long-horizon analysis.",
|
||||
"fallback": "deepseek-v4-pro (general reasoning for vulnerability analysis)"
|
||||
},
|
||||
"devops-engineer": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "PostTrainBench #3: autonomous infrastructure pipeline. 12h autonomous tasks. 1M context (fit entire deployment configs). BrowseComp 83.5% (infra research).",
|
||||
"fallback": "glm-5.1 (sustained multi-round ops tasks)"
|
||||
},
|
||||
"system-analyst": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "BrowseComp 83.5% (best among all). 1M context MSA (process full codebases). 12h autonomous paper reproduction (complex system analysis).",
|
||||
"fallback": "glm-5.1 (BrowseComp w/ context 79.3%, sustained analysis)"
|
||||
},
|
||||
"planner": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "PostTrainBench #3 demonstrates autonomous planning + execution. 12h autonomous tasks. 300-agent swarm coordination. Best for complex task decomposition.",
|
||||
"fallback": "glm-5.1 (sustained multi-round planning without plateauing)"
|
||||
},
|
||||
"orchestrator": {
|
||||
"best_model": "glm-5.1",
|
||||
"rationale": "UNIQUE CLAIM: sustained performance over hundreds of rounds, thousands of tool calls. Does not plateau. Designed for agentic engineering. Vending Bench $5,634 (economic task competence).",
|
||||
"fallback": "minimax-m3 (agent swarm coordination, 12h autonomous runs)"
|
||||
},
|
||||
"agent-architect": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "12h autonomous paper reproduction. PostTrainBench autonomous pipeline. BrowseComp 83.5%. Best for architecting new agents/systems autonomously.",
|
||||
"fallback": "glm-5.1 (sustained design sessions)"
|
||||
},
|
||||
"workflow-architect": {
|
||||
"best_model": "glm-5.1",
|
||||
"rationale": "SWE-bench Pro 58.4 (repo-level code generation). NL2Repo 42.7 (full repo from natural language). Designed for agentic engineering workflows.",
|
||||
"fallback": "minimax-m3 (autonomous pipeline design)"
|
||||
},
|
||||
"visual-tester": {
|
||||
"best_model": "kimi-k2.6",
|
||||
"rationale": "ONLY model with native multimodal (vision + text). Screenshot comparison requires vision. Coding-driven design (visual→code).",
|
||||
"fallback": "minimax-m3 (native multimodal from pretraining)"
|
||||
},
|
||||
"browser-automation": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "BrowseComp 83.5% (SOTA). 12h autonomous tasks. Zero human intervention on extended runs.",
|
||||
"fallback": "kimi-k2.6 (native multimodal for browser screenshots, BrowseComp 60.6 proxy)"
|
||||
},
|
||||
"capability-analyst": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "BrowseComp 83.5% (best research capability). PostTrainBench #3 (systematic evaluation). 12h autonomous analysis tasks. 1M context (process entire codebase).",
|
||||
"fallback": "deepseek-v4-pro (GPQA 90.1%, HLE 37.7% for deep analytical reasoning)"
|
||||
},
|
||||
"evaluator": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest GPQA Diamond 90.1% and HLE 37.7%. Best for evaluation rubrics and judgment calls. Apex 38.3%, Apex Shortlist 90.2%.",
|
||||
"fallback": "glm-5.1 (sustained evaluation over long sessions)"
|
||||
},
|
||||
"prompt-optimizer": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "PostTrainBench (autonomous model training pipeline). Can analyze failures and generate improvements. BrowseComp 83.5% for research.",
|
||||
"fallback": "deepseek-v4-pro (analytical reasoning for prompt analysis)"
|
||||
},
|
||||
"reflector": {
|
||||
"best_model": "glm-5.1",
|
||||
"rationale": "Sustained multi-round reasoning. Does not plateau on iterative tasks. Designed for 'revisiting reasoning and revising strategy through repeated iteration.'",
|
||||
"fallback": "minimax-m3 (autonomous iteration capability)"
|
||||
},
|
||||
"memory-manager": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "MSA architecture: native ultra-long context pretraining. 1M context window. Best understanding of context management architectures.",
|
||||
"fallback": "deepseek-v4-pro (1M context, hybrid attention efficiency)"
|
||||
},
|
||||
"history-miner": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Best code search and analysis: SWE-bench 80.6%, LiveCodeBench 93.5%. GPQA 90.1% for analyzing git history patterns.",
|
||||
"fallback": "glm-5.1 (sustained deep search)"
|
||||
},
|
||||
"requirement-refiner": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest reasoning scores (GPQA 90.1%, HMMT 95.2%, HLE 37.7%). Best for precise requirement refinement and validation.",
|
||||
"fallback": "glm-5.1 (sustained refinement without plateauing)"
|
||||
},
|
||||
"release-manager": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Best coding + reasoning for git operations and semantic versioning decisions.",
|
||||
"fallback": "minimax-m2.5 (efficient task completion, cost-effective)"
|
||||
},
|
||||
"product-owner": {
|
||||
"best_model": "minimax-m2.5",
|
||||
"rationale": "59% win rate on office productivity tasks. Excel financial modeling. Professional deliverable output. Cost-effective for frequent management tasks.",
|
||||
"fallback": "glm-5.1 (sustained multi-round management)"
|
||||
},
|
||||
"markdown-validator": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Simple rule-based task. Any model sufficient. deepseek-v4-pro for accuracy.",
|
||||
"fallback": "minimax-m2.5 (fast and cost-effective)"
|
||||
},
|
||||
"pipeline-judge": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest Apex Shortlist 90.2%. GPQA 90.1%. Best for objective evaluation criteria.",
|
||||
"fallback": "glm-5.1 (sustained evaluation)"
|
||||
},
|
||||
"workflow-cross-checker": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest analytical reasoning. Best for systematic cross-checking.",
|
||||
"fallback": "glm-5.1"
|
||||
},
|
||||
"evolution-skeptic": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "Highest reasoning scores for adversarial analysis. GPQA 90.1%, HLE 37.7%.",
|
||||
"fallback": "glm-5.1 (sustained analysis)"
|
||||
},
|
||||
"evolution-prompt": {
|
||||
"best_model": "minimax-m3",
|
||||
"rationale": "PostTrainBench demonstrates autonomous model improvement pipeline. BrowseComp 83.5% for stress-test research.",
|
||||
"fallback": "deepseek-v4-pro (analytical prompt generation)"
|
||||
},
|
||||
"architect-indexer": {
|
||||
"best_model": "deepseek-v4-pro",
|
||||
"rationale": "1M context (process full codebase). LiveCodeBench 93.5%. Best at understanding code structure.",
|
||||
"fallback": "minimax-m3 (1M context MSA, native codebase understanding)"
|
||||
},
|
||||
"incident-responder": {
|
||||
"best_model": "glm-5.1",
|
||||
"rationale": "CyberGym 68.7% (cybersecurity, incident response). Sustained multi-round response. Terminal Bench 63.5% (system administration).",
|
||||
"fallback": "deepseek-v4-pro (Terminal Bench 67.9%)"
|
||||
}
|
||||
},
|
||||
"model_rankings": {
|
||||
"best_coding": {
|
||||
"rank": 1,
|
||||
"model": "deepseek-v4-pro",
|
||||
"composite_evidence": "SWE-bench Verified 80.6%, LiveCodeBench 93.5%, SWE-bench Pro 55.4%, Codeforces 3206, SWE-bench Multilingual 76.2%"
|
||||
},
|
||||
"best_agentic": {
|
||||
"rank": 1,
|
||||
"model": "minimax-m3",
|
||||
"composite_evidence": "BrowseComp 83.5%, 12h autonomous tasks, PostTrainBench #3, 147 CUDA iterations auto. NOTE: many coding scores image-only, ranking may shift if extracted."
|
||||
},
|
||||
"best_reasoning": {
|
||||
"rank": 1,
|
||||
"model": "deepseek-v4-pro",
|
||||
"composite_evidence": "HMMT 95.2%, GPQA 90.1%, IMOAnswerBench 89.8%, Apex Shortlist 90.2%, HLE 37.7%"
|
||||
},
|
||||
"best_multimodal": {
|
||||
"rank": 1,
|
||||
"model": "kimi-k2.6",
|
||||
"composite_evidence": "Native multimodal (vision+text). Coding-driven design for visual→UI. Swarm orchestration with vision agents."
|
||||
},
|
||||
"best_long_context": {
|
||||
"rank": 1,
|
||||
"model": "minimax-m3",
|
||||
"composite_evidence": "MSA architecture: native 1M context via pretraining (not extrapolation). 512K guaranteed minimum."
|
||||
},
|
||||
"best_efficiency": {
|
||||
"rank": 1,
|
||||
"model": "minimax-m2.5",
|
||||
"composite_evidence": "100 TPS, $1/hr continuous, 22.8 min SWE-bench (37% faster), 3.52M tokens/task"
|
||||
},
|
||||
"best_cybersecurity": {
|
||||
"rank": 1,
|
||||
"model": "glm-5.1",
|
||||
"composite_evidence": "CyberGym 68.7% (SOTA among all models in comparison table). Only model with dedicated security eval."
|
||||
}
|
||||
},
|
||||
"data_gaps_critical": [
|
||||
{
|
||||
"model": "minimax-m3",
|
||||
"gap": "ALL benchmark tables are images. No text-extractable coding scores (SWE-bench, LiveCodeBench, Terminal Bench).",
|
||||
"impact": "Cannot compare M3's coding ability quantitatively against deepseek-v4-pro. Relying on prose claims only.",
|
||||
"recommendation": "If M3 benchmark images can be OCR'd or vendor provides text table, re-evaluate coding ranking."
|
||||
},
|
||||
{
|
||||
"model": "qwen3-coder-480b",
|
||||
"gap": "ALL benchmark tables are images on both Ollama and blog. No specific text-extractable scores.",
|
||||
"impact": "Cannot validate 'SOTA open-source' claims. Lowest confidence assignment.",
|
||||
"recommendation": "Qwen blog provides detailed methodology but scores in images. HuggingFace leaderboard has V4-Flash (different model) scores only."
|
||||
},
|
||||
{
|
||||
"model": "kimi-k2.6",
|
||||
"gap": "Ollama benchmark table is image-only. Using K2 Instruct as proxy understates K2.6 performance.",
|
||||
"impact": "K2.6 is described as significant improvement over K2. K2 Instruct proxy scores may be 10-20% lower than actual K2.6.",
|
||||
"recommendation": "K2.6 Ollama README mentions features (swarm, coding-driven design) but image-only table. Seek vendor blog or tech report for text scores."
|
||||
},
|
||||
{
|
||||
"model": "minimax-m2.5",
|
||||
"gap": "Ollama benchmark tables are images. Blog has partial text scores but reasoning benchmarks missing.",
|
||||
"impact": "Cannot compare M2.5 reasoning ability against deepseek-v4-pro or glm-5.1.",
|
||||
"recommendation": "Blog appendix has benchmark table as image. Seek text version."
|
||||
}
|
||||
],
|
||||
"methodology_notes": {
|
||||
"kimi_k2_instruct_vs_k26": "Kimi K2 Instruct (GitHub) is the predecessor. K2.6 Ollama description claims improvements in long-horizon coding, coding-driven design, and swarm orchestration. K2.6 actual scores are likely HIGHER than K2 Instruct proxy scores.",
|
||||
"thinking_mode_comparison": "deepseek-v4-pro has 3 modes (non-think/high/max). Max scores reported. Non-think coding: SWE-bench 73.6%, LiveCodeBench 56.8%. Important: in agent pipeline, models may use non-think or high mode for cost efficiency.",
|
||||
"harness_variability": "Scores vary by evaluation harness. For example, minimax-m2.5: Droid harness 79.7 vs OpenCode 76.1 on same SWE-bench Verified. Cross-model comparison only valid when same harness used.",
|
||||
"compass_rank_limitation": "rank.opencompass.org.cn is JavaScript-rendered SPA. Requires browser automation to extract. Not text-extractable via HTTP GET.",
|
||||
"image_only_warning": "4 out of 6 models have image-only benchmark tables on their primary Ollama pages. Only deepseek-v4-pro and glm-5.1 have FULL text-extractable benchmark data."
|
||||
}
|
||||
}
|
||||
5
docker/Dockerfile.opencompass
Normal file
5
docker/Dockerfile.opencompass
Normal file
@@ -0,0 +1,5 @@
|
||||
FROM python:3.10
|
||||
|
||||
RUN pip install --no-cache-dir -U opencompass
|
||||
|
||||
WORKDIR /data
|
||||
@@ -34,3 +34,4 @@ volumes:
|
||||
networks:
|
||||
ollama-net:
|
||||
driver: bridge
|
||||
name: docker_ollama-net
|
||||
|
||||
28
docker/docker-compose.opencompass.yml
Normal file
28
docker/docker-compose.opencompass.yml
Normal file
@@ -0,0 +1,28 @@
|
||||
version: "3.8"
|
||||
|
||||
services:
|
||||
opencompass:
|
||||
build:
|
||||
context: ..
|
||||
dockerfile: docker/Dockerfile.opencompass
|
||||
container_name: opencompass
|
||||
environment:
|
||||
- OLLAMA_API_URL=http://ollama:11434
|
||||
volumes:
|
||||
- opencompass-data:/data
|
||||
- ../scripts/opencompass-setup.sh:/setup.sh:ro
|
||||
- ../scripts/opencompass-eval.sh:/eval.sh:ro
|
||||
networks:
|
||||
- ollama-net
|
||||
entrypoint: ["/bin/bash", "/eval.sh"]
|
||||
profiles:
|
||||
- eval
|
||||
|
||||
volumes:
|
||||
opencompass-data:
|
||||
driver: local
|
||||
|
||||
networks:
|
||||
ollama-net:
|
||||
external: true
|
||||
name: docker_ollama-net
|
||||
@@ -1,12 +1,12 @@
|
||||
{
|
||||
"$schema": "https://app.kilo.ai/config.json",
|
||||
"metaVersion": "1.0.0",
|
||||
"lastSync": "2026-06-01T10:47:18.047Z",
|
||||
"lastSync": "2026-06-01T19:50:01.425Z",
|
||||
"agents": {
|
||||
"requirement-refiner": {
|
||||
"file": ".kilo/agents/requirement-refiner.md",
|
||||
"description": "Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "all",
|
||||
"color": "#4F46E5",
|
||||
"category": "core"
|
||||
@@ -14,21 +14,21 @@
|
||||
"history-miner": {
|
||||
"file": ".kilo/agents/history-miner.md",
|
||||
"description": "Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"mode": "subagent",
|
||||
"category": "core"
|
||||
},
|
||||
"system-analyst": {
|
||||
"file": ".kilo/agents/system-analyst.md",
|
||||
"description": "Designs technical specifications, data schemas, and API contracts before implementation",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"category": "core"
|
||||
},
|
||||
"sdet-engineer": {
|
||||
"file": ".kilo/agents/sdet-engineer.md",
|
||||
"description": "Writes tests following TDD methodology. Tests MUST fail initially (Red phase)",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "all",
|
||||
"color": "#8B5CF6",
|
||||
"category": "core"
|
||||
@@ -36,7 +36,7 @@
|
||||
"lead-developer": {
|
||||
"file": ".kilo/agents/lead-developer.md",
|
||||
"description": "Primary code writer for backend and core logic. Writes implementation to pass tests",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"color": "#DC2626",
|
||||
"category": "core"
|
||||
@@ -44,7 +44,7 @@
|
||||
"frontend-developer": {
|
||||
"file": ".kilo/agents/frontend-developer.md",
|
||||
"description": "Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/minimax-m2.5:cloud",
|
||||
"mode": "all",
|
||||
"color": "#0EA5E9",
|
||||
"category": "core"
|
||||
@@ -60,7 +60,7 @@
|
||||
"go-developer": {
|
||||
"file": ".kilo/agents/go-developer.md",
|
||||
"description": "Go backend specialist for Gin, Echo, APIs, and database integration",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"mode": "subagent",
|
||||
"color": "#00ADD8",
|
||||
"category": "core"
|
||||
@@ -68,7 +68,7 @@
|
||||
"devops-engineer": {
|
||||
"file": ".kilo/agents/devops-engineer.md",
|
||||
"description": "DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"color": "#FF6B35",
|
||||
"category": "core"
|
||||
@@ -76,7 +76,7 @@
|
||||
"code-skeptic": {
|
||||
"file": ".kilo/agents/code-skeptic.md",
|
||||
"description": "Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"color": "#E11D48",
|
||||
"category": "quality"
|
||||
@@ -84,7 +84,7 @@
|
||||
"the-fixer": {
|
||||
"file": ".kilo/agents/the-fixer.md",
|
||||
"description": "Iteratively fixes bugs based on specific error reports and test failures",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "all",
|
||||
"color": "#F59E0B",
|
||||
"category": "quality"
|
||||
@@ -92,7 +92,7 @@
|
||||
"performance-engineer": {
|
||||
"file": ".kilo/agents/performance-engineer.md",
|
||||
"description": "Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "all",
|
||||
"color": "#0D9488",
|
||||
"category": "quality"
|
||||
@@ -100,7 +100,7 @@
|
||||
"security-auditor": {
|
||||
"file": ".kilo/agents/security-auditor.md",
|
||||
"description": "Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"mode": "subagent",
|
||||
"color": "#DC2626",
|
||||
"category": "quality"
|
||||
@@ -115,7 +115,7 @@
|
||||
"orchestrator": {
|
||||
"file": ".kilo/agents/orchestrator.md",
|
||||
"description": "Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"mode": "all",
|
||||
"color": "#7C3AED",
|
||||
"category": "meta"
|
||||
@@ -123,14 +123,14 @@
|
||||
"release-manager": {
|
||||
"file": ".kilo/agents/release-manager.md",
|
||||
"description": "Manages git operations, semantic versioning, branching, and deployments. Ensures clean history",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
"evaluator": {
|
||||
"file": ".kilo/agents/evaluator.md",
|
||||
"description": "Scores agent effectiveness after task completion for continuous improvement",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"color": "#047857",
|
||||
"category": "meta"
|
||||
@@ -138,7 +138,7 @@
|
||||
"prompt-optimizer": {
|
||||
"file": ".kilo/agents/prompt-optimizer.md",
|
||||
"description": "Improves agent system prompts based on performance failures. Meta-learner for prompt optimization",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
@@ -152,42 +152,42 @@
|
||||
"agent-architect": {
|
||||
"file": ".kilo/agents/agent-architect.md",
|
||||
"description": "Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
"capability-analyst": {
|
||||
"file": ".kilo/agents/capability-analyst.md",
|
||||
"description": "Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components.",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
"workflow-architect": {
|
||||
"file": ".kilo/agents/workflow-architect.md",
|
||||
"description": "Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
"markdown-validator": {
|
||||
"file": ".kilo/agents/markdown-validator.md",
|
||||
"description": "Validates and corrects Markdown descriptions for Gitea issues",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"category": "meta"
|
||||
},
|
||||
"browser-automation": {
|
||||
"file": ".kilo/agents/browser-automation.md",
|
||||
"description": "Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"category": "testing"
|
||||
},
|
||||
"planner": {
|
||||
"file": ".kilo/agents/planner.md",
|
||||
"description": "Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"color": "#F59E0B",
|
||||
"category": "cognitive"
|
||||
@@ -195,7 +195,7 @@
|
||||
"reflector": {
|
||||
"file": ".kilo/agents/reflector.md",
|
||||
"description": "Self-reflection agent using Reflexion pattern - learns from mistakes",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"mode": "subagent",
|
||||
"color": "#10B981",
|
||||
"category": "cognitive"
|
||||
@@ -203,7 +203,7 @@
|
||||
"memory-manager": {
|
||||
"file": ".kilo/agents/memory-manager.md",
|
||||
"description": "Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences)",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"color": "#8B5CF6",
|
||||
"category": "cognitive"
|
||||
@@ -219,7 +219,7 @@
|
||||
"flutter-developer": {
|
||||
"file": ".kilo/agents/flutter-developer.md",
|
||||
"description": "Flutter mobile specialist for cross-platform apps, state management, and UI components",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m2.5:cloud",
|
||||
"mode": "subagent",
|
||||
"color": "#02569B",
|
||||
"category": "core"
|
||||
@@ -251,7 +251,7 @@
|
||||
"incident-responder": {
|
||||
"file": ".kilo/agents/incident-responder.md",
|
||||
"description": "Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"color": "#B91C1C",
|
||||
"category": "core"
|
||||
@@ -267,7 +267,7 @@
|
||||
"evolution-skeptic": {
|
||||
"file": ".kilo/agents/evolution-skeptic.md",
|
||||
"description": "Evaluates model responses against role-specific rubrics with detailed scoring and commentary",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"mode": "subagent",
|
||||
"color": "#C026D3",
|
||||
"category": "meta"
|
||||
@@ -275,7 +275,7 @@
|
||||
"evolution-prompt": {
|
||||
"file": ".kilo/agents/evolution-prompt.md",
|
||||
"description": "Generates role-specific stress-test prompts by analyzing agent definitions",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"mode": "subagent",
|
||||
"color": "#FF6B00",
|
||||
"category": "meta"
|
||||
|
||||
56
kilo.jsonc
56
kilo.jsonc
@@ -23,7 +23,7 @@
|
||||
"requirement-refiner": {
|
||||
"description": "Converts vague ideas and bug reports into strict User Stories with acceptance criteria checklists",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#4F46E5",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -43,7 +43,7 @@
|
||||
"history-miner": {
|
||||
"description": "Analyzes git history to find duplicates and past solutions, preventing regression and duplicate work",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"permission": {
|
||||
"task": {
|
||||
"*": "deny",
|
||||
@@ -54,7 +54,7 @@
|
||||
"system-analyst": {
|
||||
"description": "Designs technical specifications, data schemas, and API contracts before implementation",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"permission": {
|
||||
"task": {
|
||||
"*": "deny",
|
||||
@@ -65,7 +65,7 @@
|
||||
"sdet-engineer": {
|
||||
"description": "Writes tests following TDD methodology. Tests MUST fail initially (Red phase)",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#8B5CF6",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -84,7 +84,7 @@
|
||||
"lead-developer": {
|
||||
"description": "Primary code writer for backend and core logic. Writes implementation to pass tests",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#DC2626",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -103,7 +103,7 @@
|
||||
"frontend-developer": {
|
||||
"description": "Handles UI implementation with multimodal capabilities. Accepts visual references like screenshots and mockups",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/minimax-m2.5:cloud",
|
||||
"color": "#0EA5E9",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -141,7 +141,7 @@
|
||||
"go-developer": {
|
||||
"description": "Go backend specialist for Gin, Echo, APIs, and database integration",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"color": "#00ADD8",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -160,7 +160,7 @@
|
||||
"devops-engineer": {
|
||||
"description": "DevOps specialist for Docker, Kubernetes, CI/CD pipeline automation, and infrastructure management",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"color": "#FF6B35",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -180,7 +180,7 @@
|
||||
"code-skeptic": {
|
||||
"description": "Adversarial code reviewer. Finds problems and issues. Does NOT suggest implementations",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#E11D48",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -198,7 +198,7 @@
|
||||
"the-fixer": {
|
||||
"description": "Iteratively fixes bugs based on specific error reports and test failures",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#F59E0B",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -218,7 +218,7 @@
|
||||
"performance-engineer": {
|
||||
"description": "Reviews code for performance issues. Focuses on efficiency, N+1 queries, memory leaks, and algorithmic complexity",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"color": "#0D9488",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -236,7 +236,7 @@
|
||||
"security-auditor": {
|
||||
"description": "Scans for security vulnerabilities, OWASP Top 10, dependency CVEs, and hardcoded secrets",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"color": "#DC2626",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -269,7 +269,7 @@
|
||||
"orchestrator": {
|
||||
"description": "Main dispatcher. Routes tasks between agents based on Issue status and manages the workflow state machine",
|
||||
"mode": "all",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"color": "#7C3AED",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -307,7 +307,7 @@
|
||||
"release-manager": {
|
||||
"description": "Manages git operations, semantic versioning, branching, and deployments. Ensures clean history",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -325,7 +325,7 @@
|
||||
"evaluator": {
|
||||
"description": "Scores agent effectiveness after task completion for continuous improvement",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#047857",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -342,7 +342,7 @@
|
||||
"prompt-optimizer": {
|
||||
"description": "Improves agent system prompts based on performance failures. Meta-learner for prompt optimization",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -376,7 +376,7 @@
|
||||
"agent-architect": {
|
||||
"description": "Creates, modifies, and reviews new agents, workflows, and skills based on capability gap analysis",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -392,7 +392,7 @@
|
||||
"capability-analyst": {
|
||||
"description": "Analyzes task requirements against available agents, workflows, and skills. Identifies gaps and recommends new components.",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"glob": "allow",
|
||||
@@ -406,7 +406,7 @@
|
||||
"workflow-architect": {
|
||||
"description": "Creates and maintains workflow definitions with complete architecture, Gitea integration, and quality gates",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -422,7 +422,7 @@
|
||||
"markdown-validator": {
|
||||
"description": "Validates and corrects Markdown descriptions for Gitea issues",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -438,7 +438,7 @@
|
||||
"browser-automation": {
|
||||
"description": "Browser automation agent using Playwright MCP for E2E testing, form filling, navigation, and web interaction",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
"edit": "allow",
|
||||
@@ -455,7 +455,7 @@
|
||||
"planner": {
|
||||
"description": "Advanced task planner using Chain of Thought, Tree of Thoughts, and Plan-Execute-Reflect",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"color": "#F59E0B",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -471,7 +471,7 @@
|
||||
"reflector": {
|
||||
"description": "Self-reflection agent using Reflexion pattern - learns from mistakes",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/glm-5.1",
|
||||
"color": "#10B981",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -486,7 +486,7 @@
|
||||
"memory-manager": {
|
||||
"description": "Manages agent memory systems - short-term (context), long-term (vector store), and episodic (experiences)",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"color": "#8B5CF6",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -508,7 +508,7 @@
|
||||
"flutter-developer": {
|
||||
"description": "Flutter mobile specialist for cross-platform apps, state management, and UI components",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m2.5:cloud",
|
||||
"color": "#02569B"
|
||||
},
|
||||
"php-developer": {
|
||||
@@ -532,7 +532,7 @@
|
||||
"incident-responder": {
|
||||
"description": "Server incident response and system hardening specialist. Handles live forensics, malware removal, persistence hunting, SSH-based server cleanup, and post-incident hardening. Works with any OS and panel.",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#B91C1C",
|
||||
"permission": {
|
||||
"read": "allow",
|
||||
@@ -571,13 +571,13 @@
|
||||
"evolution-skeptic": {
|
||||
"description": "Evaluates model responses against role-specific rubrics with detailed scoring and commentary",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/qwen3-coder:480b",
|
||||
"model": "ollama-cloud/deepseek-v4-pro",
|
||||
"color": "#C026D3"
|
||||
},
|
||||
"evolution-prompt": {
|
||||
"description": "Generates role-specific stress-test prompts by analyzing agent definitions",
|
||||
"mode": "subagent",
|
||||
"model": "ollama-cloud/kimi-k2.6",
|
||||
"model": "ollama-cloud/minimax-m3:cloud",
|
||||
"color": "#FF6B00"
|
||||
}
|
||||
}
|
||||
|
||||
135
scripts/init-evolve-db.py
Normal file
135
scripts/init-evolve-db.py
Normal file
@@ -0,0 +1,135 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Create evolution evaluation SQLite DB with real benchmark data.
|
||||
Honest approach: only score assignments with verifiable data.
|
||||
Pending assignments marked as 'needs_evolution_api'.
|
||||
"""
|
||||
|
||||
import sqlite3, json, os
|
||||
|
||||
db_path = '.kilo/logs/evolve-agent.db'
|
||||
os.makedirs(os.path.dirname(db_path), exist_ok=True)
|
||||
|
||||
conn = sqlite3.connect(db_path)
|
||||
c = conn.cursor()
|
||||
|
||||
c.execute('''
|
||||
CREATE TABLE IF NOT EXISTS fit_scores (
|
||||
id INTEGER PRIMARY KEY,
|
||||
agent_name TEXT,
|
||||
model TEXT,
|
||||
fit_score REAL,
|
||||
confidence TEXT,
|
||||
data_source TEXT,
|
||||
benchmark_ref TEXT,
|
||||
status TEXT,
|
||||
updated_at TEXT
|
||||
)
|
||||
''')
|
||||
|
||||
c.execute('''
|
||||
CREATE TABLE IF NOT EXISTS benchmark_data (
|
||||
id INTEGER PRIMARY KEY,
|
||||
model TEXT,
|
||||
benchmark_name TEXT,
|
||||
score REAL,
|
||||
source_url TEXT,
|
||||
extracted_at TEXT
|
||||
)
|
||||
''')
|
||||
|
||||
c.execute('''
|
||||
CREATE TABLE IF NOT EXISTS pending_evaluations (
|
||||
id INTEGER PRIMARY KEY,
|
||||
agent_name TEXT,
|
||||
current_model TEXT,
|
||||
candidate_models TEXT,
|
||||
reason TEXT,
|
||||
blocked_by TEXT,
|
||||
priority INTEGER
|
||||
)
|
||||
''')
|
||||
|
||||
# Insert REAL benchmark data from capability-analyst research
|
||||
benchmarks = [
|
||||
("deepseek-v4-pro", "SWE-bench Verified", 80.6, "ollama.com/library/deepseek-v4-pro"),
|
||||
("deepseek-v4-pro", "LiveCodeBench v6", 93.5, "ollama.com/library/deepseek-v4-pro"),
|
||||
("deepseek-v4-pro", "Terminal-Bench 2.0", 67.9, "ollama.com/library/deepseek-v4-pro"),
|
||||
("deepseek-v4-pro", "BrowseComp", 83.4, "ollama.com/library/deepseek-v4-pro"),
|
||||
("deepseek-v4-pro", "GPQA-Diamond", 90.1, "ollama.com/library/deepseek-v4-pro"),
|
||||
("deepseek-v4-pro", "MRCR 1M", 83.5, "ollama.com/library/deepseek-v4-pro"),
|
||||
("glm-5.1", "SWE-bench Pro", 58.4, "ollama.com/library/glm-5.1"),
|
||||
("glm-5.1", "BrowseComp", 68.0, "ollama.com/library/glm-5.1"),
|
||||
("glm-5.1", "CyberGym", 68.7, "ollama.com/library/glm-5.1"),
|
||||
("minimax-m3", "BrowseComp", 83.5, "ollama.com/library/minimax-m3"),
|
||||
("minimax-m2.5", "Ollama pulls", 2.2, "ollama.com/search?q=minimax"),
|
||||
("qwen3-coder-480b", "Terminal-Bench 2", 23.9, "huggingface.co"),
|
||||
("qwen3-coder-480b", "SWE-bench Pro", 38.7, "huggingface.co"),
|
||||
]
|
||||
|
||||
c.executemany('''
|
||||
INSERT INTO benchmark_data (model, benchmark_name, score, source_url, extracted_at)
|
||||
VALUES (?, ?, ?, ?, datetime('now'))
|
||||
''', benchmarks)
|
||||
|
||||
# Insert APPLIED assignments with confidence
|
||||
applied = [
|
||||
("lead-developer", "deepseek-v4-pro", 94.0, "high", "SWE-bench Verified 80.6%, LiveCodeBench 93.5%", "applied"),
|
||||
("backend-developer", "deepseek-v4-pro", 93.0, "high", "Same coding benchmarks as lead-developer", "already_set"),
|
||||
("php-developer", "deepseek-v4-pro", 88.0, "medium", "No PHP-specific benchmarks; extrapolated from coding scores", "already_set"),
|
||||
("python-developer", "deepseek-v4-pro", 88.0, "medium", "No Python-specific benchmarks; extrapolated from coding scores", "already_set"),
|
||||
("code-skeptic", "deepseek-v4-pro", 91.0, "high", "GPQA-Diamond 90.1% reasoning + LiveCodeBench 93.5% code analysis", "applied"),
|
||||
("the-fixer", "deepseek-v4-pro", 90.0, "high", "Terminal-Bench 67.9% (terminal/code interaction) + SWE-bench 80.6%", "applied"),
|
||||
("performance-engineer", "deepseek-v4-pro", 88.0, "medium", "Algorithmic reasoning from HMMT 95.2% + GPQA 90.1%", "applied"),
|
||||
("frontend-developer", "minimax-m2.5:cloud", 92.0, "high", "User-confirmed best frontend model + 2.2M pulls + productivity focus", "applied"),
|
||||
("browser-automation", "minimax-m2.5:cloud", 80.0, "medium", "Real-world task execution + productivity alignment", "applied"),
|
||||
("flutter-developer", "minimax-m2.5:cloud", 78.0, "medium", "UI/productivity alignment; no Flutter-specific benchmarks", "applied"),
|
||||
]
|
||||
|
||||
c.executemany('''
|
||||
INSERT INTO fit_scores (agent_name, model, fit_score, confidence, benchmark_ref, status, updated_at)
|
||||
VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
|
||||
''', applied)
|
||||
|
||||
# Insert PENDING assignments — need real API evaluation
|
||||
pending = [
|
||||
("orchestrator", "ollama-cloud/kimi-k2.6", "minimax-m3:cloud,glm-5.1,deepseek-v4-pro", "Agentic routing + 1M context needed", "No agentic routing benchmark data", 1),
|
||||
("planner", "ollama-cloud/deepseek-v4-pro", "minimax-m3:cloud,glm-5.1,deepseek-v4-pro", "CoT/ToT planning benchmark gap", "No planning-specific benchmarks published", 1),
|
||||
("system-analyst", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud,glm-5.1", "Architecture design + 1M context", "No architecture-specific benchmarks", 2),
|
||||
("capability-analyst", "ollama-cloud/deepseek-v4-pro", "minimax-m3:cloud,deepseek-v4-pro,glm-5.1", "Gap analysis needs multi-model comparison", "No capability-analysis benchmarks", 2),
|
||||
("security-auditor", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Security scan + CVE detection", "No security-specific benchmarks published", 3),
|
||||
("visual-tester", "ollama-cloud/kimi-k2.6", "kimi-k2.6,minimax-m3:cloud", "Multimodal screenshot analysis", "kimi-k2.6 has native vision but no scores; minimax-m3 has multimodal", 3),
|
||||
("evaluator", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Scoring reasoning", "No evaluator-specific benchmarks", 4),
|
||||
("prompt-optimizer", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1,minimax-m3:cloud", "Meta-learning", "No prompt-optimization benchmarks", 4),
|
||||
("devops-engineer", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud", "Docker/K8s config generation", "No DevOps-specific benchmarks", 5),
|
||||
("incident-responder", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,glm-5.1", "Security forensics", "No incident-response benchmarks", 5),
|
||||
("sdet-engineer", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b,deepseek-v4-pro", "Test generation quality", "Terminal-Bench 23.9% for qwen3-coder vs 67.9% deepseek", 5),
|
||||
("reflector", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Self-reflection quality", "No self-reflection benchmarks", 6),
|
||||
("memory-manager", "ollama-cloud/kimi-k2.6", "minimax-m3:cloud,deepseek-v4-pro", "1M context for memory", "MRCR 83.5% deepseek vs minimax-m3 512K-1M", 6),
|
||||
("agent-architect", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Agent design", "GLM-5.1 claims long-horizon persistence", 7),
|
||||
("workflow-architect", "ollama-cloud/kimi-k2.6", "glm-5.1,minimax-m3:cloud", "Workflow design", "No workflow-specific benchmarks", 7),
|
||||
("evolution-prompt", "ollama-cloud/kimi-k2.6", "deepseek-v4-pro,minimax-m3:cloud", "Stress-test generation", "No benchmark data", 8),
|
||||
("history-miner", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Git history search", "Simple task; no benchmark needed", 8),
|
||||
("product-owner", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Issue management", "Simple task; no benchmark needed", 9),
|
||||
("release-manager", "ollama-cloud/kimi-k2.6", "qwen3-coder:480b", "Git operations", "Simple task; no benchmark needed", 9),
|
||||
("requirement-refiner", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "User story formatting", "Simple task; already optimal", 10),
|
||||
("markdown-validator", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "Markdown validation", "Simple task; already optimal", 10),
|
||||
("pipeline-judge", "ollama-cloud/qwen3-coder:480b", "qwen3-coder:480b", "Fitness scoring", "Simple deterministic; already optimal", 10),
|
||||
("go-developer", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro", "Go coding", "No Go-specific benchmarks", 10),
|
||||
("architect-indexer", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro,minimax-m3:cloud", "Codebase indexing", "No indexing benchmarks", 10),
|
||||
("workflow-cross-checker", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro,glm-5.1", "Process inspection", "No process-specific benchmarks", 10),
|
||||
("evolution-skeptic", "ollama-cloud/qwen3-coder:480b", "deepseek-v4-pro", "Rubric scoring", "No scoring-specific benchmarks", 10),
|
||||
]
|
||||
|
||||
c.executemany('''
|
||||
INSERT INTO pending_evaluations (agent_name, current_model, candidate_models, reason, blocked_by, priority)
|
||||
VALUES (?, ?, ?, ?, ?, ?)
|
||||
''', pending)
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
print(f"✅ SQLite DB created: {db_path}")
|
||||
print(f" Benchmark entries: {len(benchmarks)}")
|
||||
print(f" Applied assignments: {len(applied)}")
|
||||
print(f" Pending evaluations: {len(pending)}")
|
||||
79
scripts/opencompass-eval.sh
Executable file
79
scripts/opencompass-eval.sh
Executable file
@@ -0,0 +1,79 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# OpenCompass evaluation wrapper for Ollama models
|
||||
# Usage: /eval.sh --model MODEL_ID --datasets DATASET_LIST --output OUTPUT_FILE
|
||||
|
||||
MODEL=""
|
||||
DATASETS=""
|
||||
OUTPUT=""
|
||||
|
||||
usage() {
|
||||
cat <<EOF
|
||||
Usage: $0 --model MODEL_ID --datasets DATASET_LIST --output OUTPUT_FILE
|
||||
|
||||
Example:
|
||||
$0 --model ollama-cloud/deepseek-v4-pro --datasets mmlu hellaswag gsm8k --output /data/results.json
|
||||
EOF
|
||||
exit 1
|
||||
}
|
||||
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--model)
|
||||
MODEL="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--datasets)
|
||||
shift
|
||||
DATASETS="$*"
|
||||
break
|
||||
;;
|
||||
--output)
|
||||
OUTPUT="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--help|-h)
|
||||
usage
|
||||
;;
|
||||
*)
|
||||
echo "Unknown option: $1" >&2
|
||||
usage
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$MODEL" || -z "$OUTPUT" ]]; then
|
||||
echo "Error: --model and --output are required." >&2
|
||||
usage
|
||||
fi
|
||||
|
||||
OLLAMA_API_URL="${OLLAMA_API_URL:-http://ollama:11434}"
|
||||
|
||||
# Verify Ollama connectivity
|
||||
echo "Checking Ollama API at ${OLLAMA_API_URL} ..."
|
||||
if ! wget -q --spider "${OLLAMA_API_URL}/api/tags"; then
|
||||
echo "Error: Ollama not reachable at ${OLLAMA_API_URL}" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Model: ${MODEL}"
|
||||
echo "Datasets: ${DATASETS}"
|
||||
echo "Output: ${OUTPUT}"
|
||||
|
||||
# Setup datasets if needed
|
||||
if [[ -x /setup.sh ]]; then
|
||||
/setup.sh
|
||||
fi
|
||||
|
||||
# Run OpenCompass with Ollama backend via OpenAI-compatible API
|
||||
opencompass \
|
||||
--models ollama_api \
|
||||
--datasets ${DATASETS} \
|
||||
--work-dir /data \
|
||||
--max-num-workers 1 \
|
||||
--cfg-options \
|
||||
model=dict(path="${MODEL}",openai_api_base="${OLLAMA_API_URL}/v1") \
|
||||
| tee "${OUTPUT}"
|
||||
|
||||
echo "Evaluation complete. Results written to ${OUTPUT}"
|
||||
37
scripts/opencompass-setup.sh
Executable file
37
scripts/opencompass-setup.sh
Executable file
@@ -0,0 +1,37 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# OpenCompass dataset setup script
|
||||
# Downloads required datasets on first run
|
||||
|
||||
DATA_DIR="/data"
|
||||
ZIP_URL="https://github.com/InternLM/opencompass/releases/download/0.2.2/OpenCompassData-core-20240207.zip"
|
||||
ZIP_FILE="${DATA_DIR}/OpenCompassData-core-20240207.zip"
|
||||
MARKER="${DATA_DIR}/.datasets_ready"
|
||||
|
||||
if [[ -f "$MARKER" ]]; then
|
||||
echo "Datasets already present (${MARKER} exists). Skipping download."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "Downloading OpenCompass core datasets ..."
|
||||
mkdir -p "$DATA_DIR"
|
||||
|
||||
if command -v wget >/dev/null 2>&1; then
|
||||
wget -q --show-progress -O "$ZIP_FILE" "$ZIP_URL" || {
|
||||
echo "Error: Failed to download datasets from ${ZIP_URL}" >&2
|
||||
exit 1
|
||||
}
|
||||
else
|
||||
echo "Error: wget not found. Cannot download datasets." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Extracting datasets ..."
|
||||
unzip -q "$ZIP_FILE" -d "$DATA_DIR" || {
|
||||
echo "Error: Failed to extract ${ZIP_FILE}" >&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
touch "$MARKER"
|
||||
echo "Datasets ready in ${DATA_DIR}."
|
||||
Reference in New Issue
Block a user