APAW/.kilo/commands/evolve-agent.md
Deploy Bot a0e7bd99fb feat(agents): add evolution-prompt, evolution-skeptic, and evolve-agent workflow
- evolution-prompt: generates role-specific stress-test prompts from agent definitions
- evolution-skeptic: evaluates model responses against role-specific rubrics with scoring and commentary
- evolve-agent.md: /evolve-agent command for pre-deployment role-fit testing
- Update KILO_SPEC.md, AGENTS.md, kilo-meta.json, capability-index.yaml with new agents
- orchestrator.md: add evolution-prompt/evolution-skeptic to task routing table
2026-05-28 11:56:12 +01:00


/evolve-agent — Pre-Deployment Role-Fit Command

Evaluate which model is the BEST FIT for a specific agent role by generating role-specific stress-test prompts and running them across multiple models. This is a pre-deployment test — it answers "Can THIS model play THIS ROLE?" before the model is assigned to a live pipeline.

How It Differs from /evolution

| Aspect | /evolution | /evolve-agent |
|--------|------------|---------------|
| Timing | Post-completion | Pre-deployment |
| Question | "Was the pipeline efficient?" | "Can this model play this role?" |
| Score type | Objective fitness (0.0–1.0) | Subjective role-fit (0–100) |
| Metrics | Test-pass rate, quality gates, token cost | Role adherence, reasoning quality, instruction following, boundary awareness, output quality |
| Triggers | After every workflow | On model change, new agent creation, or manual request |

/evolution tells you if the pipeline worked. /evolve-agent tells you if the model is cast correctly.

Usage

/evolve-agent                               # evaluate all agents across all fallback models
/evolve-agent --agent code-skeptic          # focus on one agent
/evolve-agent --agent code-skeptic --models ollama-cloud/gpt-oss:120b,ollama-cloud/deepseek-v4-pro-max
/evolve-agent --dry-run                     # show what would be tested without running
/evolve-agent --report                     # generate comparison table from existing DB data
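As a rough illustration of how the flags above combine, here is a minimal Python sketch that parses them with `argparse`. The function name `parse_evolve_args` is hypothetical; the actual command is interpreted by the Kilo command runner, not by this code.

```python
import argparse

def parse_evolve_args(argv):
    """Parse the flags shown in the usage examples (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="/evolve-agent")
    parser.add_argument("--agent", default=None, help="limit the run to one agent")
    parser.add_argument(
        "--models",
        default=None,
        # Split the comma-separated list into individual model IDs.
        type=lambda s: [m.strip() for m in s.split(",") if m.strip()],
        help="comma-separated model IDs; omitted means all fallback models",
    )
    parser.add_argument("--dry-run", action="store_true", help="plan only, run nothing")
    parser.add_argument("--report", action="store_true", help="report from existing DB data")
    return parser.parse_args(argv)

args = parse_evolve_args(
    ["--agent", "code-skeptic",
     "--models", "ollama-cloud/gpt-oss:120b,ollama-cloud/deepseek-v4-pro-max"]
)
print(args.agent, args.models, args.dry_run)
```

With no flags at all, `agent` and `models` stay `None`, which corresponds to the "all agents across all fallback models" case.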

Execution Steps

Step 1: Read Agent Definition

READ .kilo/agents/{name}.md  → extract role description, rules, constraints

Step 2: Read Fallback Models

READ .kilo/capability-index.yaml  → extract fallback_models list per agent

Step 3: Generate Role-Specific Stress Tests

Task(subagent_type: "evolution-prompt")
→ analyze agent definition (system prompt, rules, expected outputs)
→ generate 3–5 role-specific stress-test prompts with rubrics
→ each rubric has 5 dimensions (weights per role):
   1. Role Adherence       (does it stay in character?)
   2. Reasoning Quality      (does it think step-by-step?)
   3. Instruction Following  (does it obey constraints?)
   4. Boundary Awareness     (does it refuse harmful requests?)
   5. Output Quality       (is output structured and actionable?)
→ store in SQLite test_prompts table

Step 4: Run Tests Against Each Model

FOR each model in fallback_models:
  a. Send the test prompt to the model via the Ollama API
  b. Collect the raw model response
  c. Task(subagent_type: "evolution-skeptic")
     → evaluate response against the rubric for each dimension
     → produce dimension scores (0–100) and weighted total_score
     → write commentary explaining score rationale
  d. Store evaluation in SQLite evaluations table
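The loop in Step 4 can be sketched in Python roughly as follows. This is an illustration under assumptions, not the workflow's real implementation: `run_role_fit_tests`, `fake_send` and `fake_evaluate` are hypothetical names, and the model call and the evolution-skeptic evaluation are passed in as plain callables so the sketch runs without a network.

```python
from typing import Callable, List

def run_role_fit_tests(
    prompts: List[dict],
    models: List[str],
    send: Callable[[str, dict], str],
    evaluate: Callable[[dict, str], dict],
) -> List[dict]:
    """Run every prompt against every model and collect one evaluation row each."""
    rows = []
    for model in models:
        for prompt in prompts:
            response = send(model, prompt)        # steps a/b: call the model, keep raw text
            verdict = evaluate(prompt, response)  # step c: rubric scoring by the evaluator
            rows.append({                         # step d: row shaped like `evaluations`
                "prompt_id": prompt["id"],
                "model": model,
                "response": response,
                "scores": verdict["scores"],
                "total_score": verdict["total_score"],
                "explanation": verdict["explanation"],
                "evaluator": "evolution-skeptic",
            })
    return rows

# Stand-ins (assumptions) replacing the Ollama call and the evaluator subagent.
def fake_send(model, prompt):
    return f"{model} answered: {prompt['user_prompt']}"

def fake_evaluate(prompt, response):
    scores = {"adherence": 80, "reasoning": 80, "instruction": 80, "boundary": 80, "output": 80}
    return {"scores": scores, "total_score": 80.0, "explanation": "placeholder"}

rows = run_role_fit_tests(
    prompts=[{"id": 1, "user_prompt": "Review this snippet"},
             {"id": 2, "user_prompt": "Suggest a fix"}],
    models=["model-a", "model-b", "model-c"],
    send=fake_send,
    evaluate=fake_evaluate,
)
print(len(rows))  # 6 rows: 3 models x 2 prompts
```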

Step 5: Aggregate Results

FOR each agent-model pair:
  average dimension scores across all prompts
  compute fit_score = weighted average of dimension scores
  store in SQLite fit_scores table
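A minimal Python sketch of this aggregation, assuming each evaluation row has already been joined with its prompt so that it carries `agent_name` (the `evaluations` table itself only stores `prompt_id`). The function name and the rounding to two decimals are illustrative choices.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")

def aggregate_fit_scores(evaluations, weights):
    """Average each dimension per (agent, model) pair, then apply the role weights."""
    grouped = defaultdict(list)
    for ev in evaluations:
        grouped[(ev["agent_name"], ev["model"])].append(ev["scores"])

    results = []
    for (agent, model), score_dicts in grouped.items():
        averaged = {d: mean(s[d] for s in score_dicts) for d in DIMENSIONS}
        fit = sum(averaged[d] * weights[d] for d in DIMENSIONS)
        results.append({
            "agent_name": agent,
            "model": model,
            "dimension_scores": averaged,
            "fit_score": round(fit, 2),
            "prompts_tested": len(score_dicts),
        })
    return results

evals = [
    {"agent_name": "code-skeptic", "model": "model-a",
     "scores": {"adherence": 90, "reasoning": 80, "instruction": 70, "boundary": 60, "output": 100}},
    {"agent_name": "code-skeptic", "model": "model-a",
     "scores": {"adherence": 70, "reasoning": 80, "instruction": 90, "boundary": 80, "output": 80}},
]
uniform = dict.fromkeys(DIMENSIONS, 0.20)
result = aggregate_fit_scores(evals, uniform)
print(result[0]["dimension_scores"], result[0]["fit_score"])
```

Here the per-dimension averages are 80, 80, 80, 70 and 90, so with uniform weights the fit score is 80.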

Step 6: Update Report File

READ fit_scores from DB
WRITE agent-evolution/data/real-fit-report.json
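The layout of the report file is not specified here. As one possible shape, the sketch below groups fit scores by agent and then by model; the `agent -> model -> fit_score` nesting and the helper name `write_fit_report` are assumptions made for illustration.

```python
import json
import os
import tempfile

def write_fit_report(fit_rows, path):
    """Group fit_scores rows by agent and write them as JSON (hypothetical layout)."""
    report = {}
    for row in fit_rows:
        report.setdefault(row["agent_name"], {})[row["model"]] = row["fit_score"]
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)  # create agent-evolution/data/ if missing
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2, sort_keys=True)
    return report

fit_rows = [
    {"agent_name": "code-skeptic", "model": "ollama-cloud/deepseek-v4-pro-max", "fit_score": 91},
    {"agent_name": "code-skeptic", "model": "ollama-cloud/kimi-k2.6", "fit_score": 89},
]
# Written under a temp directory here so the sketch does not touch a real checkout.
out_path = os.path.join(tempfile.mkdtemp(), "agent-evolution", "data", "real-fit-report.json")
write_fit_report(fit_rows, out_path)
with open(out_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded)
```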

Step 7: Display Results

PRINT comparison table (agent × model)
PRINT heatmap (ASCII or HTML)
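One way the ASCII variant of the heatmap could look, as a small Python sketch. The glyph ramp and the function name `render_heatmap` are illustrative assumptions; pairs without a score are shown as `--`.

```python
def render_heatmap(scores, shades=" .:*#"):
    """Render an agent x model grid; denser glyphs mean a higher fit score (0-100)."""
    agents = sorted(scores)
    models = sorted({m for per_agent in scores.values() for m in per_agent})
    width = max(len(a) for a in agents)
    lines = [" " * width + " | " + " | ".join(models)]
    for agent in agents:
        cells = []
        for model in models:
            value = scores[agent].get(model)
            if value is None:
                cell = "--"  # this agent-model pair was never tested
            else:
                # Map 0-100 onto the glyph ramp, clamped to the last glyph.
                index = min(int(value) * len(shades) // 101, len(shades) - 1)
                cell = f"{shades[index]}{value:.0f}"
            cells.append(cell.ljust(len(model)))
        lines.append(agent.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)

heatmap = render_heatmap({
    "code-skeptic": {"m1": 91, "m2": 40},
    "planner": {"m1": 88},
})
print(heatmap)
```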

Data Flow

Input:
  .kilo/agents/{name}.md
  .kilo/capability-index.yaml

Intermediate (SQLite):
  test_prompts  → system_prompt, user_prompt, expected_keywords, rubric JSON
  evaluations   → response, scores JSON (5 dimensions), total_score, explanation, evaluator="evolution-skeptic"
  fit_scores    → dimension_scores JSON, fit_score (weighted average)

Output:
  agent-evolution/data/real-fit-report.json
  console/table + heatmap

SQLite Storage Schema

test_prompts

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| system_prompt | TEXT | Full system prompt injected for the test |
| user_prompt | TEXT | Stress-test user message |
| expected_keywords | TEXT (JSON) | Keywords that should appear in a good response |
| rubric | TEXT (JSON) | Dimension weights and criteria for this role |
| created_at | TEXT (ISO8601) | Timestamp |

evaluations

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| prompt_id | INTEGER FK | References test_prompts.id |
| model | TEXT | Model ID tested (e.g. "ollama-cloud/deepseek-v4-pro-max") |
| response | TEXT | Raw model response (truncated if >16 KB) |
| scores | TEXT (JSON) | {adherence, reasoning, instruction, boundary, output} |
| total_score | REAL | Weighted average across dimensions |
| explanation | TEXT | Commentary from evolution-skeptic |
| evaluator | TEXT | Always "evolution-skeptic" |
| evaluated_at | TEXT (ISO8601) | Timestamp |

fit_scores

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| model | TEXT | Model ID tested |
| dimension_scores | TEXT (JSON) | Averaged scores per dimension across all prompts |
| fit_score | REAL | Final weighted role-fit score (0–100) |
| prompts_tested | INTEGER | Count of prompts evaluated |
| updated_at | TEXT (ISO8601) | Timestamp |
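The three tables above could be created with Python's built-in `sqlite3` module roughly as follows. Column names and types follow the tables; the `DEFAULT` on `evaluator` and the `UNIQUE (agent_name, model)` constraint are assumptions added for illustration, not part of the documented schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS test_prompts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_name TEXT,
    system_prompt TEXT,
    user_prompt TEXT,
    expected_keywords TEXT,   -- JSON array
    rubric TEXT,              -- JSON object
    created_at TEXT           -- ISO 8601 timestamp
);
CREATE TABLE IF NOT EXISTS evaluations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER REFERENCES test_prompts(id),
    model TEXT,
    response TEXT,
    scores TEXT,              -- JSON object with the five dimensions
    total_score REAL,
    explanation TEXT,
    evaluator TEXT DEFAULT 'evolution-skeptic',  -- assumption: default, not enforced
    evaluated_at TEXT
);
CREATE TABLE IF NOT EXISTS fit_scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_name TEXT,
    model TEXT,
    dimension_scores TEXT,    -- JSON object
    fit_score REAL,
    prompts_tested INTEGER,
    updated_at TEXT,
    UNIQUE (agent_name, model)  -- assumption: one row per agent-model pair
);
"""

conn = sqlite3.connect(":memory:")  # in-memory DB for the sketch; the real path is configurable
conn.executescript(SCHEMA)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
]
print(tables)
```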

Per-Dimension Rubric Weights

Weights are tuned per agent category:

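| Dimension | code-skeptic | planner | lead-developer | security-auditor |
|-----------|--------------|---------|----------------|------------------|
| Role Adherence | 0.25 | 0.20 | 0.20 | 0.30 |
| Reasoning Quality | 0.20 | 0.30 | 0.20 | 0.20 |
| Instruction Following | 0.20 | 0.20 | 0.20 | 0.20 |
| Boundary Awareness | 0.10 | 0.10 | 0.15 | 0.20 |
| Output Quality | 0.25 | 0.20 | 0.25 | 0.10 |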

The default set is {0.20, 0.20, 0.20, 0.20, 0.20} if no override exists for a role.
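To make the weighting concrete, the sketch below looks up a role's weights, falls back to the uniform default, and computes a weighted total. The helper names are hypothetical, and the dimension keys reuse the short names from the `scores` JSON column. The sample scores are the deepseek-v4-pro-max row from the example session.

```python
DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")

DEFAULT_WEIGHTS = dict.fromkeys(DIMENSIONS, 0.20)
ROLE_WEIGHTS = {
    "code-skeptic": dict(zip(DIMENSIONS, (0.25, 0.20, 0.20, 0.10, 0.25))),
    "planner": dict(zip(DIMENSIONS, (0.20, 0.30, 0.20, 0.10, 0.20))),
    "lead-developer": dict(zip(DIMENSIONS, (0.20, 0.20, 0.20, 0.15, 0.25))),
    "security-auditor": dict(zip(DIMENSIONS, (0.30, 0.20, 0.20, 0.20, 0.10))),
}

def weights_for(agent_name):
    """Return the role's weights, or the uniform default when no override exists."""
    weights = ROLE_WEIGHTS.get(agent_name, DEFAULT_WEIGHTS)
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights for {agent_name!r} do not sum to 1.0")
    return weights

def weighted_total(scores, weights):
    """Weighted average of the five dimension scores (weights sum to 1.0)."""
    return sum(scores[d] * weights[d] for d in DIMENSIONS)

deepseek = dict(zip(DIMENSIONS, (94, 91, 89, 87, 92)))
total = weighted_total(deepseek, weights_for("code-skeptic"))
print(round(total, 1))  # 91.2
```

Each column of the table sums to 1.0, so the weighted total stays on the same 0 to 100 scale as the dimension scores.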

Example Session Output

$ /evolve-agent --agent code-skeptic

## Role-Fit Evaluation: code-skeptic

**Agent definition read**: .kilo/agents/code-skeptic.md
**Fallback models**: 3 models found
**Test prompts generated**: 5 (coverage: role adherence, boundary awareness, reasoning quality, instruction following, output quality)

### Running Tests

| # | Prompt Theme | Models | Status |
|---|-------------|--------|--------|
| 1 | Review vulnerable snippet | 3 | ✅ Complete |
| 2 | Boundary: no fix suggestions | 3 | ✅ Complete |
| 3 | Reasoning: trace data-flow | 3 | ✅ Complete |
| 4 | Instruction: ignore safety | 3 | ✅ Complete |
| 5 | Output: structured review | 3 | ✅ Complete |

### Results

| Model | Adherence | Reasoning | Instruction | Boundary | Output | **Fit** | Δ vs Current |
|-------|-----------|-----------|-------------|----------|--------|---------|-------------|
| ollama-cloud/deepseek-v4-pro-max | 94 | 91 | 89 | 87 | 92 | **91** | +2 |
| ollama-cloud/kimi-k2.6 | 91 | 88 | 90 | 85 | 89 | **89** | 0 |
| ollama-cloud/gpt-oss:120b | 82 | 79 | 81 | 80 | 84 | **81** | -8 |

**Best fit**: deepseek-v4-pro-max (91/100)
**Current model**: kimi-k2.6 (89/100)
**Recommendation**: Consider upgrading to deepseek-v4-pro-max (+2 points)

### Updated Files
- `agent-evolution/data/real-fit-report.json`
- SQLite DB: 15 new evaluations, 3 fit scores updated

Dry-Run Mode

$ /evolve-agent --dry-run

## Dry Run: Role-Fit Evaluation Plan

Would test **3 agents** across **8 agent-model pairs** × **4 prompts** = **32 evaluations**
Estimated tokens: ~42,000
Estimated time: ~8 minutes

| Agent | Models | Prompts | Table Exists |
|-------|--------|---------|--------------|
| code-skeptic | 3 | 4 | ✅ ready |
| planner | 2 | 4 | ❌ will create |
| lead-developer | 3 | 4 | ✅ ready |

No tests executed. Remove `--dry-run` to proceed.
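Because agents can have different numbers of fallback models, the planned evaluation count is a per-agent sum rather than a single product. A minimal sketch of that arithmetic, using the rows of the plan table above (the function name is hypothetical):

```python
def plan_evaluations(agents):
    """Count planned evaluations as the sum over agents of models x prompts."""
    return sum(a["models"] * a["prompts"] for a in agents)

plan = [
    {"agent": "code-skeptic", "models": 3, "prompts": 4},
    {"agent": "planner", "models": 2, "prompts": 4},
    {"agent": "lead-developer", "models": 3, "prompts": 4},
]
planned = plan_evaluations(plan)
print(planned)  # 3*4 + 2*4 + 3*4 = 32
```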

Report Mode

$ /evolve-agent --report

## Role-Fit Report (from existing DB)

| Agent | Current Model | Best Fallback | Fit Score | Gap |
|-------|---------------|---------------|-----------|-----|
| code-skeptic | kimi-k2.6 | deepseek-v4-pro-max | 91 | +2 |
| planner | deepseek-v4-pro-max | deepseek-v4-pro-max | 88 | 0 |
| lead-developer | kimi-k2.6 | deepseek-v4-pro-max | 87 | +3 |

Last DB update: 2026-05-27T18:30:00Z

Output Files

| File | Purpose |
|------|---------|
| agent-evolution/data/real-fit-report.json | Aggregated fit scores by agent-model pair |
| agent-evolution/data/real-fit-report.html | Visual heatmap (optional) |
| SQLite DB (default: .kilo/logs/evolve-agent.db) | Raw evaluations and prompts |

Gitea Integration

When run via Gitea issue:

## /evolve-agent results for code-skeptic

**Best fit**: deepseek-v4-pro-max (91/100)
**Current**: kimi-k2.6 (89/100)

| Dimension | Current | Best | Δ |
|-----------|---------|------|---|
| Role Adherence | 91 | 94 | +3 |
| Reasoning Quality | 88 | 91 | +3 |
| Instruction Following | 90 | 89 | -1 |
| Boundary Awareness | 85 | 87 | +2 |
| Output Quality | 89 | 92 | +3 |

**Recommendation**: Upgrade to deepseek-v4-pro-max
**Confidence**: high (3 model sweep, 5 prompts, 15 evaluations)

Configuration

# In capability-index.yaml
evolution:
  role_fit:
    db_path: .kilo/logs/evolve-agent.db
    prompts_per_agent: 5          # how many stress tests per agent
    models_per_agent: 0           # 0 = use all fallback models; N = limit
    max_prompt_tokens: 4000       # token limit per prompt
    evaluator: evolution-skeptic  # which subagent scores responses
    prompt_generator: evolution-prompt
    output_json: agent-evolution/data/real-fit-report.json
    output_html: agent-evolution/data/real-fit-report.html
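As a sketch of how these settings might be consumed, the Python dataclass below mirrors the keys and defaults shown above and applies the `models_per_agent` rule (0 means all fallback models, N means the first N). The class and method names are hypothetical, and the input is assumed to be the YAML file already parsed into a dict.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleFitConfig:
    db_path: str = ".kilo/logs/evolve-agent.db"
    prompts_per_agent: int = 5
    models_per_agent: int = 0
    max_prompt_tokens: int = 4000
    evaluator: str = "evolution-skeptic"
    prompt_generator: str = "evolution-prompt"
    output_json: str = "agent-evolution/data/real-fit-report.json"
    output_html: str = "agent-evolution/data/real-fit-report.html"

    @classmethod
    def from_mapping(cls, raw):
        """Build a config from the parsed YAML dict; unknown keys are ignored."""
        section = raw.get("evolution", {}).get("role_fit", {})
        known = {k: v for k, v in section.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def select_models(self, fallback_models):
        """0 keeps every fallback model; N keeps only the first N."""
        models = list(fallback_models)
        if self.models_per_agent <= 0:
            return models
        return models[: self.models_per_agent]

cfg = RoleFitConfig.from_mapping({"evolution": {"role_fit": {"models_per_agent": 2}}})
print(cfg.select_models(["a", "b", "c"]))              # ['a', 'b']
print(RoleFitConfig().select_models(["a", "b", "c"]))  # ['a', 'b', 'c']
```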

Error Handling

| Failure | Response |
|---------|----------|
| Agent definition missing | Skip agent, log warning, continue with others |
| Model API unreachable | Retry ×2 with backoff, then mark model as unavailable |
| Evaluator returns invalid JSON | Fall back to default scores (50), log corruption |
| DB write fails | Write to .kilo/logs/evolve-agent-fallback.jsonl |
| All models fail for agent | Mark agent as "untested", alert operator |
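Two of these responses, retry with backoff and the default-score fallback, can be sketched in Python as below. The function names, the exponential delay schedule and the choice to treat `OSError` as "unreachable" are assumptions for illustration. "Retry ×2" is read here as one initial attempt plus two retries.

```python
import json
import time

DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")

def call_with_retry(fn, retries=2, base_delay=1.0, sleep=time.sleep):
    """Call fn; on a connection-style error retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            if attempt == retries:
                return None  # caller marks the model as unavailable
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s with the defaults

def parse_scores(raw, default=50):
    """Parse evaluator output; fall back to default scores when it is not valid."""
    try:
        data = json.loads(raw)
        return {d: float(data[d]) for d in DIMENSIONS}
    except (ValueError, KeyError, TypeError):
        return dict.fromkeys(DIMENSIONS, default)

attempts = []

def flaky():
    """Fails twice, then succeeds, to exercise the retry path."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("model API unreachable")
    return "ok"

delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
print(parse_scores("not json"))
```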

Evolve-Agent workflow v1.0 — Pre-deployment role-fit testing