
¨NW¨ 3badb259cc feat: bidirectional research dashboard + agent config fixes

- Integrate apaw_agent_model_research_v3.html as standalone dashboard
- Add model-benchmarks.json with 32 agents, 11 scored models, 11 recommendations
- Add build-research-dashboard.ts: inject live data into template → standalone HTML
- Add rebuild-template.cjs: regenerate template from v3.html source
- Add sync-benchmarks-from-yaml.cjs: sync YAML → JSON round-trip
- Add sync-model-research.ts: apply recommendation matrix to config files
- Add model-benchmarks.schema.json and model-research.schema.json for validation
- Add bidirectional-data-flow.md architecture documentation
- Add log-execution.cjs pipeline hook
- Update capability-index.yaml: add fallback_models, failover_strategy
- Update kilo-meta.json, kilo.jsonc, KILO_SPEC.md with synced models
- Update evolution.md / research.md / self-evolution.md / evolutionary-sync.md docs
- Fix security-auditor.md: quote YAML color (#DC2626)
- Fix orchestrator.md: remove duplicate devops-engineer key
- Build research-dashboard.html (106KB standalone) + dated archive

2026-04-29 21:04:22 +01:00

4.6 KiB


description: Run continuous research and self-improvement cycle
mode: workflow
model: ollama-cloud/glm-5
color: #8B5CF6
permission:
  read, edit, write, bash, webfetch: allow
  task:
    capability-analyst: allow
    agent-architect: allow

Research Cycle Command

Runs a continuous research and self-improvement cycle based on the latest findings.

Usage

/research [topic] [--auto]
/research models           # research latest AI models for agent optimization
/research models --agent planner  # research models for specific agent role
/research models --provider ollama-cloud  # filter by provider

Parameters

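topic: Optional research topic. The value models triggers model research (Step 1.5).
--auto: Automatic mode; runs without prompting for user input.
--agent <role>: With topic models, research models for one agent role only (for example, planner).
--provider <name>: With topic models, filter results by provider (for example, ollama-cloud).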
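
The invocation forms listed under Usage could be parsed roughly as follows. This is a minimal sketch, not the command's actual implementation: the ResearchArgs type and the parseResearchArgs function are hypothetical names, and the parser assumes every non-flag token belongs to the topic.

```typescript
// Hypothetical result shape; the real command may represent its arguments differently.
interface ResearchArgs {
  topic?: string;
  auto: boolean;
  agent?: string;
  provider?: string;
}

function parseResearchArgs(input: string): ResearchArgs {
  const tokens = input.trim().split(/\s+/).filter((t) => t.length > 0);
  // Drop the leading "/research" token if present.
  if (tokens[0] === "/research") tokens.shift();

  const args: ResearchArgs = { auto: false };
  const topicWords: string[] = [];

  while (tokens.length > 0) {
    const token = tokens.shift() as string;
    if (token === "--auto") {
      args.auto = true;
    } else if (token === "--agent" || token === "--provider") {
      // These flags take exactly one value.
      const value = tokens.shift();
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`Missing value for ${token}`);
      }
      if (token === "--agent") args.agent = value;
      else args.provider = value;
    } else if (token.startsWith("--")) {
      throw new Error(`Unknown flag: ${token}`);
    } else {
      // Any other token is treated as part of the topic.
      topicWords.push(token);
    }
  }

  if (topicWords.length > 0) args.topic = topicWords.join(" ");
  return args;
}
```

For example, parseResearchArgs("/research models --agent planner") yields a topic of "models" and an agent of "planner", with auto left false.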

Execution

Step 1: Performance Monitoring

Check .kilo/logs/efficiency_score.json for low-performing agents. Step 1.5 treats any agent score below 7 as low.
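
A minimal sketch of the low-performer check, assuming the log parses to an object that maps agent names to a numeric score. That shape, the EfficiencyLog type, and the findLowPerformers name are assumptions for illustration; the threshold of 7 comes from the condition in Step 1.5.

```typescript
// Assumed shape of efficiency_score.json after JSON.parse; the real file may differ.
type EfficiencyLog = { agents: Record<string, { score: number }> };

// Step 1.5 triggers model research when any agent score is below 7.
const LOW_SCORE_THRESHOLD = 7;

// Returns the names of agents scoring below the threshold, worst first.
function findLowPerformers(
  log: EfficiencyLog,
  threshold: number = LOW_SCORE_THRESHOLD,
): string[] {
  return Object.entries(log.agents)
    .filter(([, entry]) => entry.score < threshold)
    .sort(([, a], [, b]) => a.score - b.score)
    .map(([name]) => name);
}
```

The function takes already-parsed data so that file reading stays with the caller.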

Step 1.5: Model Research (when topic is "models" or agent scores are low)

IF topic === "models" OR any agent score < 7:
  1. Read agent-evolution/data/model-benchmarks.json
     → Check metadata.generated staleness
  2. Fetch latest model data from providers:
     - Ollama Cloud: https://ollama.com/models (via webfetch)
     - OpenRouter: https://openrouter.ai/models (via webfetch)  
     - Groq: https://console.groq.com/docs/models (via webfetch)
  3. For each model, compute:
     - IF (instruction-following) score, from IFEval/IFBench benchmarks
     - Role fitness (SWE-bench for coding, GPQA for reasoning, etc.)
     - Context window and cost
  4. Build heatmap: score each model against each agent
     Formula: role_fitness * (0.7 + 0.3 * IF/100)
  5. Generate recommendations for agents where best-scored model ≠ current
  6. Output to agent-evolution/data/model-research-latest.json
  7. Validate against agent-evolution/data/model-research.schema.json
  8. Update model-benchmarks.json with fresh data
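
Items 4 and 5 above can be sketched as follows. The formula is the one given in item 4; everything else is an assumption made for illustration: the type and function names, the 0 to 100 scale for role fitness, and the choice to score a missing role fitness as 0.

```typescript
// Hypothetical input shapes; the real benchmark files may use different keys.
interface ModelScores {
  id: string;
  ifScore: number; // instruction-following score, 0-100
  roleFitness: Record<string, number>; // role -> fitness (0-100 assumed)
}

interface Agent {
  name: string;
  role: string;
  currentModel: string;
}

interface Recommendation {
  agent: string;
  from: string;
  to: string;
  score: number;
}

// Formula from item 4: role_fitness * (0.7 + 0.3 * IF/100).
function heatmapScore(roleFitness: number, ifScore: number): number {
  return roleFitness * (0.7 + 0.3 * (ifScore / 100));
}

// Scores each model against each agent: heatmap[agentName][modelId] = score.
function buildHeatmap(
  agents: Agent[],
  models: ModelScores[],
): Record<string, Record<string, number>> {
  const heatmap: Record<string, Record<string, number>> = {};
  for (const agent of agents) {
    const row: Record<string, number> = {};
    for (const model of models) {
      const fitness = model.roleFitness[agent.role] ?? 0; // assumption: unknown role scores 0
      row[model.id] = heatmapScore(fitness, model.ifScore);
    }
    heatmap[agent.name] = row;
  }
  return heatmap;
}

// Emits a recommendation only where the best-scored model differs from the current one.
function recommend(
  agents: Agent[],
  heatmap: Record<string, Record<string, number>>,
): Recommendation[] {
  const out: Recommendation[] = [];
  for (const agent of agents) {
    let best: [string, number] | undefined;
    for (const [modelId, score] of Object.entries(heatmap[agent.name] ?? {})) {
      if (best === undefined || score > best[1]) best = [modelId, score];
    }
    if (best !== undefined && best[0] !== agent.currentModel) {
      out.push({ agent: agent.name, from: agent.currentModel, to: best[0], score: best[1] });
    }
  }
  return out;
}
```

With this weighting, a role fitness of 80 and an IF score of 90 give roughly 77.6, while a role fitness of 85 and an IF score of 50 give roughly 72.25. The IF term can therefore outweigh a small lead in raw role fitness.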

Step 2: Gap Identification

Analyze capability-index.yaml for missing capabilities.
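
One way to frame the gap check is as a set difference between the capabilities the system needs and those some agent already provides. The sketch below assumes capability-index.yaml has been parsed into an object with a capabilities list per agent and that a list of required capabilities is supplied separately. Both the shape and the required list are hypothetical, not taken from the real file.

```typescript
// Assumed parsed shape of capability-index.yaml; real keys may differ.
interface CapabilityIndex {
  agents: Record<string, { capabilities: string[] }>;
}

// Returns the required capabilities that no agent in the index provides.
function findMissingCapabilities(
  index: CapabilityIndex,
  required: string[],
): string[] {
  const covered = new Set<string>();
  for (const agent of Object.values(index.agents)) {
    for (const capability of agent.capabilities) covered.add(capability);
  }
  return required.filter((capability) => !covered.has(capability));
}
```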

Step 3: Research Fetching

Fetch latest research from:

Anthropic: https://www.anthropic.com/research
OpenAI: https://platform.openai.com/docs/guides/agents
Lilian Weng: https://lilianweng.github.io
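
The fetch step might look like the sketch below, which requests all three sources concurrently and records a failure per source rather than aborting the whole cycle. The fetchText parameter is a stand-in for whatever the runtime provides (the command itself uses the webfetch tool); it and the FetchResult type are assumptions for illustration.

```typescript
// Injected fetcher so the sketch does not depend on a specific HTTP client.
type FetchText = (url: string) => Promise<string>;

// Source list copied from the step above.
const RESEARCH_SOURCES: Record<string, string> = {
  "Anthropic": "https://www.anthropic.com/research",
  "OpenAI": "https://platform.openai.com/docs/guides/agents",
  "Lilian Weng": "https://lilianweng.github.io",
};

interface FetchResult {
  source: string;
  url: string;
  ok: boolean;
  text?: string;
  error?: string;
}

// Fetches every source in parallel; one failing source does not reject the batch.
async function fetchSources(
  fetchText: FetchText,
  sources: Record<string, string> = RESEARCH_SOURCES,
): Promise<FetchResult[]> {
  return Promise.all(
    Object.entries(sources).map(async ([source, url]): Promise<FetchResult> => {
      try {
        return { source, url, ok: true, text: await fetchText(url) };
      } catch (err) {
        return { source, url, ok: false, error: String(err) };
      }
    }),
  );
}
```

In a runtime with a global fetch, the fetcher could be as simple as (url) => fetch(url).then((r) => r.text()).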

Model Research Sources

Ollama Model Library (https://ollama.com/models)
OpenRouter Models (https://openrouter.ai/models)
Groq Console (https://console.groq.com/docs/models)
SWE-Bench Leaderboard (https://www.swebench.com)
Terminal-Bench (https://marc0.dev/terminal-bench)
LMSYS Chatbot Arena (https://chat.lmsys.org)
Artificial Analysis (https://artificialanalysis.ai)

Step 4: Implementation

Create new agents, skills, or rules based on findings.

Step 5: Evolution Tracking

Post findings to Gitea Issue #25 (Research Milestone).
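
Posting to the issue can be done through Gitea's REST API, which accepts a comment via POST on /api/v1/repos/{owner}/{repo}/issues/{index}/comments with a JSON body containing a body field. The sketch below only builds the request. The GiteaTarget type, the function name, and any base URL, owner, repository, or token passed in are placeholders to be replaced with the project's real values.

```typescript
// Placeholder target description; fill in the real instance, owner, and repo.
interface GiteaTarget {
  baseUrl: string;
  owner: string;
  repo: string;
  issue: number;
}

// Builds the URL and fetch options for adding a comment to an issue.
function buildIssueCommentRequest(target: GiteaTarget, token: string, markdown: string) {
  const base = target.baseUrl.replace(/\/+$/, ""); // strip trailing slashes
  const url =
    `${base}/api/v1/repos/${encodeURIComponent(target.owner)}` +
    `/${encodeURIComponent(target.repo)}/issues/${target.issue}/comments`;
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `token ${token}`,
      },
      body: JSON.stringify({ body: markdown }),
    },
  };
}
```

The result can be passed to fetch as fetch(request.url, request.init); keeping construction separate from sending makes the request easy to inspect or log first.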

Example

/research multi-agent systems

# Output:
## Research: multi-agent systems

### Sources Fetched
- Anthropic: Building Effective Agents
- OpenAI: Agents Overview
- Lilian Weng: LLM Powered Agents

### Key Findings
- Prompt Chaining pattern for sequential tasks
- Routing for specialized agents
- Parallelization for independent tasks

### Implementations
- Created: @planner agent (CoT, ToT)
- Created: @reflector agent (Reflexion)
- Created: @memory-manager agent

### Evolution Tracked
- Issue: #25
- Commit: abc1234

Model Research Example

/research models

# Output:
## Research: model optimization

### Models Analyzed
- Ollama Cloud: 20 models
- OpenRouter Free: 3 models
- Groq Free: 5 models

### Key Findings
- DeepSeek V4-Pro Max now available (SWE-V 80.6, IF:88)
- Kimi K2.6 IF score confirmed: 91 (best for orchestration)
- Nemotron 3 Super IF:78 — weak for prompt-heavy roles
- Qwen 3.6 Plus FREE remains best IF/cost ratio (91, $0)

### Recommendations Generated
- 11 model swap recommendations
- 4 high impact, 3 medium, 4 low
- Average expected improvement: +12 points

### Files Updated
- agent-evolution/data/model-research-latest.json
- agent-evolution/data/model-benchmarks.json (refreshed)

### Evolution Tracked
- Issue: #25
- Next: /evolution to apply recommendations

Model Research Output Format

All model research output follows the schema: agent-evolution/data/model-research.schema.json

Key fields:

models[] — model capabilities, benchmarks, IF scores
recommendations[] — agent-specific model swap suggestions
heatmap — agent × model compatibility matrix
capability_index_patch[] — ready-to-apply YAML patches
summary — aggregate improvement metrics
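
As a rough illustration, the key fields listed above could be typed as follows. Only the five top-level names come from this document; the inner types are deliberately loose guesses, and model-research.schema.json remains the authoritative definition. The missingTopLevelKeys helper is a hypothetical pre-check, not a substitute for schema validation.

```typescript
// Top-level names from the list above; inner shapes are assumptions.
interface ModelResearchOutput {
  models: unknown[];
  recommendations: unknown[];
  heatmap: Record<string, Record<string, number>>; // agent -> model -> score (assumed nesting)
  capability_index_patch: unknown[];
  summary: Record<string, unknown>;
}

const REQUIRED_KEYS = [
  "models",
  "recommendations",
  "heatmap",
  "capability_index_patch",
  "summary",
] as const;

// Returns the required top-level keys that are absent from a parsed document.
function missingTopLevelKeys(doc: unknown): string[] {
  if (typeof doc !== "object" || doc === null) return [...REQUIRED_KEYS];
  return REQUIRED_KEYS.filter((key) => !(key in doc));
}
```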

This format is consumed by:

/evolution command for auto-apply
agent-evolution/scripts/sync-model-research.ts for propagation
Evolution dashboard for visualization
