- Integrate apaw_agent_model_research_v3.html as standalone dashboard
- Add model-benchmarks.json with 32 agents, 11 scored models, 11 recommendations
- Add build-research-dashboard.ts: inject live data into template → standalone HTML
- Add rebuild-template.cjs: regenerate template from v3.html source
- Add sync-benchmarks-from-yaml.cjs: sync YAML → JSON round-trip
- Add sync-model-research.ts: apply recommendation matrix to config files
- Add model-benchmarks.schema.json and model-research.schema.json for validation
- Add bidirectional-data-flow.md architecture documentation
- Add log-execution.cjs pipeline hook
- Update capability-index.yaml: add fallback_models, failover_strategy
- Update kilo-meta.json, kilo.jsonc, KILO_SPEC.md with synced models
- Update evolution.md / research.md / self-evolution.md / evolutionary-sync.md docs
- Fix security-auditor.md: quote YAML color (#DC2626)
- Fix orchestrator.md: remove duplicate devops-engineer key
- Build research-dashboard.html (106KB standalone) + dated archive
4.6 KiB
| description | mode | model | color | permission |
|---|---|---|---|---|
| Run continuous research and self-improvement cycle | workflow | ollama-cloud/glm-5 | #8B5CF6 | |
Research Cycle Command
Runs a continuous research and self-improvement cycle driven by the latest research findings and agent performance data.
Usage
/research [topic] [--auto]
/research models # research latest AI models for agent optimization
/research models --agent planner # research models for specific agent role
/research models --provider ollama-cloud # filter by provider
Parameters
- topic: Optional specific research topic
- --auto: Automatic mode (no user input)
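
The usage lines and parameters above could be parsed roughly as follows. This is a minimal sketch, not the command's actual parser: `ResearchArgs` and `parseResearchArgs` are hypothetical names, and the handling of `--agent` and `--provider` is inferred only from the usage examples.

```typescript
// Hypothetical shape for parsed /research arguments (names are assumptions).
interface ResearchArgs {
  topic?: string;    // free-text topic, e.g. "models" or "multi-agent systems"
  auto: boolean;     // true when --auto is present
  agent?: string;    // value of --agent, if given
  provider?: string; // value of --provider, if given
}

function parseResearchArgs(input: string): ResearchArgs {
  const tokens = input.trim().split(/\s+/).filter((t) => t.length > 0);
  // Drop the leading command name if present.
  if (tokens[0] === "/research") tokens.shift();

  const args: ResearchArgs = { auto: false };
  const topicWords: string[] = [];

  let token: string | undefined;
  while ((token = tokens.shift()) !== undefined) {
    if (token === "--auto") {
      args.auto = true;
    } else if (token === "--agent" || token === "--provider") {
      // These flags take exactly one value.
      const value = tokens.shift();
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`Missing value for ${token}`);
      }
      if (token === "--agent") args.agent = value;
      else args.provider = value;
    } else if (token.startsWith("--")) {
      throw new Error(`Unknown flag: ${token}`);
    } else {
      // Any non-flag word is treated as part of the topic.
      topicWords.push(token);
    }
  }

  if (topicWords.length > 0) args.topic = topicWords.join(" ");
  return args;
}
```

Treating every non-flag word as part of the topic lets multi-word topics such as `multi-agent systems` pass through without quoting.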
Execution
Step 1: Performance Monitoring
Check .kilo/logs/efficiency_score.json for low-performing agents.
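
A minimal sketch of the low-performer check follows. The layout of efficiency_score.json is not documented here, so the flat `{ agent: score }` shape is an assumption, as are the names `EfficiencyScores` and `findLowPerformers`; the threshold of 7 comes from the "agent score < 7" condition in Step 1.5.

```typescript
// Assumed file shape: { "<agent-name>": <score on a 0-10 scale>, ... }.
// The real efficiency_score.json schema may differ.
type EfficiencyScores = Record<string, number>;

// Threshold taken from the "any agent score < 7" condition in Step 1.5.
const LOW_SCORE_THRESHOLD = 7;

// Returns the names of agents scoring below the threshold, lowest score first.
function findLowPerformers(
  scores: EfficiencyScores,
  threshold: number = LOW_SCORE_THRESHOLD,
): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < threshold)
    .sort(([, a], [, b]) => a - b)
    .map(([agent]) => agent);
}

// Illustrative data only; these scores are made up for the example.
const sampleScores: EfficiencyScores = JSON.parse(
  '{"planner": 8.2, "reflector": 6.1, "memory-manager": 4.9}',
);
const lowPerformers = findLowPerformers(sampleScores);
console.log(lowPerformers);
```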
Step 1.5: Model Research (when topic is "models" or agent scores are low)
IF topic === "models" OR any agent score < 7:
1. Read agent-evolution/data/model-benchmarks.json
→ Check metadata.generated staleness
2. Fetch latest model data from providers:
- Ollama Cloud: https://ollama.com/models (via webfetch)
- OpenRouter: https://openrouter.ai/models (via webfetch)
- Groq: https://console.groq.com/docs/models (via webfetch)
3. For each model, compute:
- IF (instruction-following) score, from IFEval/IFBench benchmarks
- Role fitness (SWE-bench for coding, GPQA for reasoning, etc.)
- Context window and cost
4. Build heatmap: score each model against each agent
Formula: role_fitness * (0.7 + 0.3 * IF/100)
5. Generate recommendations for agents where best-scored model ≠ current
6. Output to agent-evolution/data/model-research-latest.json
7. Validate against agent-evolution/data/model-research.schema.json
8. Update model-benchmarks.json with fresh data
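
The heatmap formula and the "best-scored model ≠ current" rule from items 4 and 5 can be sketched as below. The type names, field names, and the two sample models are hypothetical; only the formula `role_fitness * (0.7 + 0.3 * IF/100)` comes from the steps above, and both inputs are assumed to be on a 0-100 scale.

```typescript
// Hypothetical benchmark record; field names are assumptions for illustration.
interface ModelBenchmark {
  name: string;
  ifScore: number;                     // instruction-following score, 0-100
  roleFitness: Record<string, number>; // per-agent role fitness, 0-100
}

interface Recommendation {
  agent: string;
  current: string;
  suggested: string;
  score: number;
}

// Formula from Step 1.5: role_fitness * (0.7 + 0.3 * IF/100).
function heatmapScore(roleFitness: number, ifScore: number): number {
  return roleFitness * (0.7 + 0.3 * (ifScore / 100));
}

// Returns a swap recommendation only when the best-scored model differs
// from the agent's current model; otherwise returns null.
function recommend(
  agent: string,
  current: string,
  models: ModelBenchmark[],
): Recommendation | null {
  let best: { name: string; score: number } | null = null;
  for (const model of models) {
    const fitness = model.roleFitness[agent];
    if (fitness === undefined) continue; // model not scored for this agent
    const score = heatmapScore(fitness, model.ifScore);
    if (best === null || score > best.score) {
      best = { name: model.name, score };
    }
  }
  if (best === null || best.name === current) return null;
  return { agent, current, suggested: best.name, score: best.score };
}

// Made-up models: model-b has higher role fitness but a weaker IF score.
const sampleModels: ModelBenchmark[] = [
  { name: "model-a", ifScore: 91, roleFitness: { planner: 80 } },
  { name: "model-b", ifScore: 70, roleFitness: { planner: 85 } },
];

console.log(heatmapScore(80, 91).toFixed(2)); // "77.84"
console.log(heatmapScore(85, 70).toFixed(2)); // "77.35"
console.log(recommend("planner", "model-b", sampleModels)?.suggested); // "model-a"
```

Because the IF term only scales the score between 70% and 100% of role fitness, a model needs a large IF advantage to overtake one with clearly higher role fitness; in the example a 21-point IF gap narrowly outweighs a 5-point fitness gap.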
Step 2: Gap Identification
Analyze capability-index.yaml for missing capabilities.
Step 3: Research Fetching
Fetch latest research from:
- Anthropic: https://www.anthropic.com/research
- OpenAI: https://platform.openai.com/docs/guides/agents
- Lilian Weng: https://lilianweng.github.io
Model Research Sources
- Ollama Model Library (https://ollama.com/models)
- OpenRouter Models (https://openrouter.ai/models)
- Groq Console (https://console.groq.com/docs/models)
- SWE-Bench Leaderboard (https://www.swebench.com)
- Terminal-Bench (https://marc0.dev/terminal-bench)
- LMSYS Chatbot Arena (https://chat.lmsys.org)
- Artificial Analysis (https://artificialanalysis.ai)
Step 4: Implementation
Create new agents, skills, or rules based on findings.
Step 5: Evolution Tracking
Post findings to Gitea Issue #25 (Research Milestone).
Example
/research multi-agent systems
# Output:
## Research: multi-agent systems
### Sources Fetched
- Anthropic: Building Effective Agents
- OpenAI: Agents Overview
- Lilian Weng: LLM Powered Agents
### Key Findings
- Prompt Chaining pattern for sequential tasks
- Routing for specialized agents
- Parallelization for independent tasks
### Implementations
- Created: @planner agent (CoT, ToT)
- Created: @reflector agent (Reflexion)
- Created: @memory-manager agent
### Evolution Tracked
- Issue: #25
- Commit: abc1234
Model Research Example
/research models
# Output:
## Research: model optimization
### Models Analyzed
- Ollama Cloud: 20 models
- OpenRouter Free: 3 models
- Groq Free: 5 models
### Key Findings
- DeepSeek V4-Pro Max now available (SWE-V 80.6, IF:88)
- Kimi K2.6 IF score confirmed: 91 (best for orchestration)
- Nemotron 3 Super IF:78 — weak for prompt-heavy roles
- Qwen 3.6 Plus FREE remains best IF/cost ratio (91, $0)
### Recommendations Generated
- 11 model swap recommendations
- 4 high impact, 3 medium, 4 low
- Average expected improvement: +12 points
### Files Updated
- agent-evolution/data/model-research-latest.json
- agent-evolution/data/model-benchmarks.json (refreshed)
### Evolution Tracked
- Issue: #25
- Next: /evolution to apply recommendations
Model Research Output Format
All model research output follows the schema:
agent-evolution/data/model-research.schema.json
Key fields:
- models[]: model capabilities, benchmarks, IF scores
- recommendations[]: agent-specific model swap suggestions
- heatmap: agent × model compatibility matrix
- capability_index_patch[]: ready-to-apply YAML patches
- summary: aggregate improvement metrics
This format is consumed by:
- /evolution command for auto-apply
- agent-evolution/scripts/sync-model-research.ts for propagation
- Evolution dashboard for visualization
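
As a rough illustration of the key fields listed above, a consumer might run a cheap top-level check before full schema validation. Only the five top-level key names come from this document; `missingTopLevelKeys` is a hypothetical helper, and it does not replace validation against model-research.schema.json.

```typescript
// The five top-level keys named under "Key fields"; nested structure is not assumed.
const REQUIRED_KEYS = [
  "models",
  "recommendations",
  "heatmap",
  "capability_index_patch",
  "summary",
] as const;

// Returns the required top-level keys that are absent from a parsed document.
// An empty result means only that the keys exist, not that the document is schema-valid.
function missingTopLevelKeys(doc: unknown): string[] {
  if (typeof doc !== "object" || doc === null) return [...REQUIRED_KEYS];
  const record = doc as Record<string, unknown>;
  return REQUIRED_KEYS.filter((key) => !(key in record));
}

// Illustrative document with two keys deliberately left out.
const partialDoc: unknown = JSON.parse(
  '{"models": [], "recommendations": [], "heatmap": {}}',
);
console.log(missingTopLevelKeys(partialDoc));
```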