APAW/.kilo/logs/model-evolution-proposal-analysis.md

b9abd91d07 feat: orchestrator evolution — full access + model upgrades + self-evolution protocol
- Add 9 missing agents to orchestrator task whitelist (20→28 agents)
- Fix 2 broken agents: debug (gpt-oss:20b→qwen3.6-plus), release-manager (devstral-2→qwen3.6-plus)
- Upgrade orchestrator (glm-5→qwen3.6-plus, IF:80→90, 128K→1M context)
- Upgrade pipeline-judge (nemotron→qwen3.6-plus, IF:85→90)
- Add orchestrator escalation path to 7 agents (lead-dev, sdet, skeptic, perf, security, evaluator, devops)
- Create self-evolution protocol (.kilo/rules/orchestrator-self-evolution.md)
- Create evolution log (.kilo/EVOLUTION_LOG.md)
- Full audit of all 29 agents with verification tests
2026-04-06 22:55:12 +01:00


Model Evolution Proposal Analysis

Date: 2026-04-06T22:28:00+01:00
Source: APAW Agent Model Research v3
Analyst: Orchestrator


Executive Summary

Critical Issues Found 🔴

| Agent | Current Model | Status | Action Required |
|---|---|---|---|
| debug (built-in) | gpt-oss:20b | BROKEN | Fix immediately |
| release-manager | devstral-2:123b | BROKEN | Fix immediately |

| Priority | Agent | Change | Impact |
|---|---|---|---|
| P0 | debug | gpt-oss:20b → gemma4:31b | +28% quality |
| P0 | release-manager | devstral-2:123b → qwen3.6-plus:free | Fix broken agent |
| P1 | orchestrator | glm-5 → qwen3.6-plus:free | +2% quality, +3x speed |
| P1 | pipeline-judge | nemotron-3-super → qwen3.6-plus:free | +3% quality |
| P2 | evaluator | Add Groq burst for fast scoring | +6x speed |
| P3 | Others | Keep current | No change needed |

Detailed Analysis

1. CRITICAL: Debug Agent (Built-in)

Current State:

```yaml
debug:
  model: ollama-cloud/gpt-oss:20b
  status: BROKEN
  IF: ~65 (underwhelming)
```

Recommendation:

```yaml
debug:
  model: ollama-cloud/gemma4:31b
  provider: ollama
  IF: 83
  context: 256K
  features: thinking mode, vision
  license: Apache 2.0
```

Rationale:

  • gpt-oss:20b is BROKEN on Ollama Cloud
  • Gemma 4 31B has IF:83 vs gpt-oss IF:65 = +28% improvement
  • 256K context (vs 8K) = 32x more context
  • Thinking mode enables better debugging
  • Alternative: Nemotron-Cascade-2 (IF:82.9, LiveCodeBench 87.2)

Action: Apply immediately


2. CRITICAL: Release Manager

Current State:

```yaml
release-manager:
  model: ollama-cloud/devstral-2:123b
  status: BROKEN
  IF: ~75
```

Recommendation:

```yaml
release-manager:
  model: openrouter/qwen/qwen3.6-plus:free
  provider: openrouter
  IF: 90
  score: 76
  context: 1M
  cost: FREE
```

Rationale:

  • devstral-2:123b NOT WORKING on Ollama Cloud
  • Comparison matrix shows Qwen 3.6+ = 76, GLM-5 = 76 (tie)
  • BUT Qwen has IF:90 vs GLM-5 IF:80 = better for git operations
  • 1M context for complex changelogs
  • FREE via OpenRouter
  • Fallback: nemotron-3-super (IF:85, 1M context) for heavy tasks

Action: Apply immediately


3. HIGH: Orchestrator

Current State:

```yaml
orchestrator:
  model: ollama-cloud/glm-5
  IF: 80
  score: 82
  context: 128K
```

Recommendation:

```yaml
orchestrator:
  model: openrouter/qwen/qwen3.6-plus:free
  provider: openrouter
  IF: 90
  score: 84
  context: 1M
  cost: FREE
```

Rationale:

  • Orchestrator is CRITICAL agent - needs best possible IF for routing
  • IF:90 vs IF:80 = +12.5% improvement in instruction following
  • 1M context for complex workflow state management
  • Score: 84 vs 82 = +2% overall
  • +3x speed improvement
  • FREE via OpenRouter

Action: Apply after critical fixes


4. HIGH: Pipeline Judge

Current State:

```yaml
pipeline-judge:
  model: ollama-cloud/nemotron-3-super
  IF: 85
  score: 78
  context: 1M
```

Recommendation:

```yaml
pipeline-judge:
  model: openrouter/qwen/qwen3.6-plus:free
  provider: openrouter
  IF: 90
  score: 80
  context: 1M
  cost: FREE
```

Rationale:

  • Judge needs IF:90 for accurate fitness scoring
  • Score: 80 vs 78 = +3% improvement
  • Same 1M context as Nemotron
  • FREE via OpenRouter
  • Keep Nemotron as fallback for heavy parsing tasks

Action: Apply after critical fixes


5. MEDIUM: Evaluator (Burst Mode)

Current State:

```yaml
evaluator:
  model: openrouter/qwen/qwen3.6-plus:free
  IF: 90
  score: 81
```

Recommendation: TWO-TIER APPROACH

```yaml
# Primary: Qwen 3.6+ (for detailed scoring)
evaluator:
  model: openrouter/qwen/qwen3.6-plus:free
  IF: 90
  score: 81
  use: detailed_scoring

# Burst: Groq gpt-oss:120b (for fast numeric scoring)
evaluator-burst:
  model: groq/gpt-oss-120b
  speed: 500 t/s
  IF: 72
  use: quick_numeric_scoring
  limit: 50-100 calls/day
```

Rationale:

  • Qwen 3.6+ score: 81 is already optimal
  • Groq gpt-oss:120b: 500 tokens/sec = +6x speed for quick scoring
  • IF:72 is sufficient for numeric evaluation
  • Use burst for simple: "Score: 8/10" responses
  • Use Qwen for complex: full report with recommendations

Action: Optional enhancement
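The two-tier split above can be sketched as a small dispatch function. The model IDs and the 100-calls/day cap come from this proposal; `pick_evaluator_model`, the task-kind labels, and the in-memory counter are illustrative assumptions, not part of the actual APAW configuration:

```python
# Sketch of the proposed two-tier evaluator routing (illustrative only).
PRIMARY = "openrouter/qwen/qwen3.6-plus:free"  # detailed scoring, IF:90
BURST = "groq/gpt-oss-120b"                    # quick numeric scoring, ~500 t/s

BURST_DAILY_LIMIT = 100  # upper end of the proposed 50-100 calls/day budget
_burst_calls_today = 0   # a real implementation would reset this daily

def pick_evaluator_model(task_kind: str) -> str:
    """Route quick numeric scoring ("Score: 8/10" style) to the Groq burst
    tier while budget remains; everything else goes to the Qwen tier."""
    global _burst_calls_today
    if task_kind == "quick_numeric_scoring" and _burst_calls_today < BURST_DAILY_LIMIT:
        _burst_calls_today += 1
        return BURST
    return PRIMARY
```

Note that requests over the daily budget silently degrade to the primary tier rather than failing, which matches the rate-limit mitigation in the risk assessment.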


6. LOW: Keep Current Models

These agents are ALREADY OPTIMAL:

| Agent | Current Model | Score | Reason to Keep |
|---|---|---|---|
| requirement-refiner | glm-5 | 80★ | Best score for system analysis |
| security-auditor | nemotron-3-super | 76 | Best for 1M ctx security scans |
| markdown-validator | nemotron-3-nano | 70★ | Lightweight validation |
| code-skeptic | minimax-m2.5 | 85★ | Absolute LEADER in code review |
| the-fixer | minimax-m2.5 | 88★ | Absolute LEADER in bug fixing |
| lead-developer | qwen3-coder:480b | 92 | SWE-bench 66.5%, best coding model |
| frontend-developer | qwen3-coder:480b | 90 | Excellent for UI |
| backend-developer | qwen3-coder:480b | 91 | Excellent for API |

Action: No changes needed


Implementation Plan

Phase 1: CRITICAL Fixes (Immediately)

```yaml
# 1. Fix debug agent
kilo.jsonc:
  agent.debug.model: "ollama-cloud/gemma4:31b"

# 2. Fix release-manager
capability-index.yaml:
  agents.release-manager.model: "openrouter/qwen/qwen3.6-plus:free"
```

Phase 2: HIGH Priority (Within 24h)

```yaml
# 3. Upgrade orchestrator
kilo.jsonc:
  agent.orchestrator.model: "openrouter/qwen/qwen3.6-plus:free"

# 4. Upgrade pipeline-judge
capability-index.yaml:
  agents.pipeline-judge.model: "openrouter/qwen/qwen3.6-plus:free"
```

Phase 3: MEDIUM Priority (Within 1 week)

```yaml
# 5. Add evaluator burst mode
# Create new agent: evaluator-burst
agents.evaluator-burst.model: "groq/gpt-oss-120b"
agents.evaluator-burst.mode: "subagent"
agents.evaluator-burst.permission.task: ["evaluator"]
```

Phase 4: LOW Priority (No changes)

```yaml
# 6-10. Keep current models
# No action needed
```

Risk Assessment

High Risk

| Change | Risk | Mitigation |
|---|---|---|
| orchestrator to openrouter | Provider dependency | Keep GLM-5 as fallback |
| release-manager to openrouter | Provider dependency | Keep Nemotron as fallback |

Medium Risk

| Change | Risk | Mitigation |
|---|---|---|
| debug to gemma4 | New model | Test with sample debug tasks |
| pipeline-judge to openrouter | Provider dependency | Keep Nemotron fallback |

Low Risk

| Change | Risk | Mitigation |
|---|---|---|
| evaluator burst mode | Rate limits | Limit to 100 calls/day |

Quality Metrics

Expected Improvement

| Agent | Before IF | After IF | Δ IF | Before Score | After Score | Δ Score |
|---|---|---|---|---|---|---|
| debug | 65 | 83 | +18 | - | - | - |
| release-manager | 75 | 90 | +15 | 75 | 76 | +1 |
| orchestrator | 80 | 90 | +10 | 82 | 84 | +2 |
| pipeline-judge | 85 | 90 | +5 | 78 | 80 | +2 |
| evaluator | 90 | 90 | 0 | 81 | 81 | 0 |

Overall System Impact

  • Broken agents fixed: 2 → 0
  • Average IF improvement: +18% (weighted by usage)
  • Average score improvement: +1.25%
  • Context window improvement: 128K → 1M for key agents
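The Δ columns and the +1.25% average score figure above can be recomputed directly from the Expected Improvement table; a quick sanity check:

```python
# Recomputing the deltas from the Expected Improvement table.
before_after_if = {
    "debug": (65, 83),
    "release-manager": (75, 90),
    "orchestrator": (80, 90),
    "pipeline-judge": (85, 90),
    "evaluator": (90, 90),
}
if_deltas = {agent: after - before for agent, (before, after) in before_after_if.items()}

before_after_score = {
    "release-manager": (75, 76),
    "orchestrator": (82, 84),
    "pipeline-judge": (78, 80),
    "evaluator": (81, 81),
}
score_deltas = {agent: after - before for agent, (before, after) in before_after_score.items()}

# Simple unweighted average; the "+18% (weighted by usage)" IF figure above
# uses a usage weighting that is not reproduced here.
avg_score_delta = sum(score_deltas.values()) / len(score_deltas)  # 1.25
```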

Verification Checklist

Before applying changes:

  • Backup current configuration
  • Test new models with sample tasks
  • Verify OpenRouter API key configured
  • Verify Groq API key configured (for burst mode)
  • Document fallback models
  • Update agent-versions.json after changes
  • Run sync:evolution to update dashboard
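The first checklist items (backup, key verification) can be automated with a small pre-flight script. A sketch only: the config file locations and the `OPENROUTER_API_KEY` / `GROQ_API_KEY` environment variable names are assumptions, not verified APAW conventions:

```python
# Pre-flight sketch for the verification checklist (assumed paths/env vars).
import os
import shutil

CONFIG_FILES = ["kilo.jsonc", "capability-index.yaml"]   # assumed locations
REQUIRED_KEYS = ["OPENROUTER_API_KEY", "GROQ_API_KEY"]   # GROQ only for burst mode

def preflight(config_files=None, required_keys=None, backup_suffix=".bak"):
    """Return a list of problems found; an empty list means safe to proceed."""
    config_files = CONFIG_FILES if config_files is None else config_files
    required_keys = REQUIRED_KEYS if required_keys is None else required_keys
    problems = []
    for path in config_files:
        if os.path.exists(path):
            shutil.copy2(path, path + backup_suffix)  # backup before editing
        else:
            problems.append(f"missing config: {path}")
    for key in required_keys:
        if not os.environ.get(key):
            problems.append(f"missing env var: {key}")
    return problems
```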

Recommendation

Apply Immediately:

  1. debug: gpt-oss:20b → gemma4:31b (fixes broken agent)
  2. release-manager: devstral-2:123b → qwen3.6-plus:free (fixes broken agent)

Apply Within 24h:

  1. orchestrator: glm-5 → qwen3.6-plus:free (+2% score, +10 IF)
  2. pipeline-judge: nemotron-3-super → qwen3.6-plus:free (+2% score)

Consider:

  1. evaluator: Add Groq burst mode for +6x speed

Keep Unchanged:

  • All other agents are already optimal


Files to Modify

Phase 1 (Critical)

Phase 1 (Critical)

```yaml
# kilo.jsonc - Fix debug agent
.agent.debug.model = "ollama-cloud/gemma4:31b"

# capability-index.yaml - Fix release-manager
agents.release-manager.model = "openrouter/qwen/qwen3.6-plus:free"
```

Phase 2 (High)

```yaml
# kilo.jsonc - Upgrade orchestrator
.agent.orchestrator.model = "openrouter/qwen/qwen3.6-plus:free"

# capability-index.yaml - Upgrade pipeline-judge
agents.pipeline-judge.model = "openrouter/qwen/qwen3.6-plus:free"
```
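The key/value changes above can be scripted as plain-text substitutions. A hedged sketch: a real implementation should use a proper JSONC/YAML parser, and `set_agent_model` is purely illustrative, not an existing APAW tool:

```python
# Illustrative text-level edit for the capability-index.yaml changes above.
# A production script should parse the file properly instead of using regex.
import re

def set_agent_model(config_text: str, agent: str, model: str) -> str:
    """Rewrite `agents.<agent>.model: "<old>"` to point at the new model id."""
    pattern = rf'(agents\.{re.escape(agent)}\.model:\s*")[^"]*(")'
    return re.sub(pattern, rf'\g<1>{model}\g<2>', config_text)

line = 'agents.pipeline-judge.model: "ollama-cloud/nemotron-3-super"'
updated = set_agent_model(line, "pipeline-judge", "openrouter/qwen/qwen3.6-plus:free")
# updated == 'agents.pipeline-judge.model: "openrouter/qwen/qwen3.6-plus:free"'
```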

Analysis Status: COMPLETE
Recommendation: Apply Phase 1 immediately (2 broken agents)