Deploy Bot a0e7bd99fb feat(agents): add evolution-prompt, evolution-skeptic, and evolve-agent workflow

- evolution-prompt: generates role-specific stress-test prompts from agent definitions
- evolution-skeptic: evaluates model responses against role-specific rubrics with scoring and commentary
- evolve-agent.md: /evolve-agent command for pre-deployment role-fit testing
- Update KILO_SPEC.md, AGENTS.md, kilo-meta.json, capability-index.yaml with new agents
- orchestrator.md: add evolution-prompt/evolution-skeptic to task routing table

2026-05-28 11:56:12 +01:00

`/evolve-agent` — Pre-Deployment Role-Fit Command

Evaluate which model is the BEST FIT for a specific agent role by generating role-specific stress-test prompts and running them across multiple models. This is a pre-deployment test — it answers "Can THIS model play THIS ROLE?" before the model is assigned to a live pipeline.

How It Differs from `/evolution`

| Aspect | `/evolution` | `/evolve-agent` |
|--------|--------------|-----------------|
| Timing | Post-completion | Pre-deployment |
| Question | "Was the pipeline efficient?" | "Can this model play this role?" |
| Score type | Objective fitness (0.0–1.0) | Subjective role-fit (0–100) |
| Metrics | Test-pass rate, quality gates, token cost | Role adherence, reasoning quality, instruction following, boundary awareness, output quality |
| Triggers | After every workflow | On model change, new agent creation, or manual request |

/evolution tells you if the pipeline worked. /evolve-agent tells you if the model is cast correctly.

Usage

/evolve-agent                               # evaluate all agents across all fallback models
/evolve-agent --agent code-skeptic          # focus on one agent
/evolve-agent --agent code-skeptic --models ollama-cloud/gpt-oss:120b,ollama-cloud/deepseek-v4-pro-max
/evolve-agent --dry-run                     # show what would be tested without running
/evolve-agent --report                      # generate comparison table from existing DB data

Execution Steps

Step 1: Read Agent Definition

READ .kilo/agents/{name}.md  → extract role description, rules, constraints

Step 2: Read Fallback Models

READ .kilo/capability-index.yaml  → extract fallback_models list per agent

Step 3: Generate Role-Specific Stress Tests

Task(subagent_type: "evolution-prompt")
→ analyze agent definition (system prompt, rules, expected outputs)
→ generate 3–5 role-specific stress-test prompts with rubrics
→ each rubric scores 5 dimensions (weights vary per role):
   1. Role Adherence       (does it stay in character?)
   2. Reasoning Quality      (does it think step-by-step?)
   3. Instruction Following  (does it obey constraints?)
   4. Boundary Awareness     (does it refuse harmful requests?)
   5. Output Quality       (is output structured and actionable?)
→ store in SQLite test_prompts table

Step 4: Run Tests Against Each Model

FOR each model in fallback_models:
  a. Send the test prompt to the model via the Ollama API
  b. Collect the raw model response
  c. Task(subagent_type: "evolution-skeptic")
     → evaluate response against the rubric for each dimension
     → produce dimension scores (0–100) and weighted total_score
     → write commentary explaining score rationale
  d. Store evaluation in SQLite evaluations table
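
The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the workflow's actual implementation: `run_tests`, `ask_model`, and `evaluate` are hypothetical names, with `ask_model` standing in for the Ollama API call and `evaluate` standing in for the `evolution-skeptic` subagent. The record keys mirror the `evaluations` table columns.

```python
from datetime import datetime, timezone


def run_tests(models, prompts, ask_model, evaluate):
    """Run every prompt against every model and collect evaluation records.

    ask_model(model, user_prompt) -> response text (hypothetical stand-in
        for the Ollama API call).
    evaluate(prompt, response) -> (scores dict, total_score, explanation)
        (hypothetical stand-in for the evolution-skeptic subagent).
    """
    records = []
    for model in models:
        for prompt in prompts:
            response = ask_model(model, prompt["user_prompt"])
            scores, total_score, explanation = evaluate(prompt, response)
            records.append(
                {
                    "prompt_id": prompt["id"],
                    "model": model,
                    "response": response,
                    "scores": scores,
                    "total_score": total_score,
                    "explanation": explanation,
                    "evaluator": "evolution-skeptic",
                    "evaluated_at": datetime.now(timezone.utc).isoformat(),
                }
            )
    return records
```

Passing the model call and the evaluator in as callables keeps the loop testable without network access; each returned record would then be written to the `evaluations` table.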

Step 5: Aggregate Results

FOR each agent-model pair:
  average dimension scores across all prompts
  compute fit_score = weighted average of dimension scores
  store in SQLite fit_scores table
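
The aggregation for one agent-model pair might look like the sketch below. The function name `aggregate_fit` and the in-memory input shape are assumptions made for illustration; the dimension keys follow the `scores` JSON stored in the `evaluations` table.

```python
from statistics import mean

DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")


def aggregate_fit(evaluations, weights):
    """Average each dimension across prompts, then take the weighted average.

    evaluations: list of per-prompt score dicts for ONE agent-model pair.
    weights: dict mapping each dimension to its weight.
    """
    if not evaluations:
        raise ValueError("no evaluations to aggregate")
    dimension_scores = {
        dim: mean(ev[dim] for ev in evaluations) for dim in DIMENSIONS
    }
    # Dividing by the weight total keeps the result on the 0-100 scale
    # even if the weights do not sum to exactly 1.0.
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    fit_score = (
        sum(dimension_scores[dim] * weights[dim] for dim in DIMENSIONS)
        / total_weight
    )
    return dimension_scores, fit_score


# Hypothetical scores for two prompts, scored with equal weights.
equal_weights = {dim: 0.20 for dim in DIMENSIONS}
scores, fit = aggregate_fit(
    [
        {"adherence": 90, "reasoning": 80, "instruction": 70, "boundary": 60, "output": 100},
        {"adherence": 70, "reasoning": 80, "instruction": 90, "boundary": 100, "output": 60},
    ],
    equal_weights,
)
```

In this made-up input every dimension averages to 80, so the fit score is also 80 regardless of how the weights are distributed.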

Step 6: Update Report File

READ fit_scores from DB
WRITE agent-evolution/data/real-fit-report.json

Step 7: Display Results

PRINT comparison table (agent × model)
PRINT heatmap (ASCII or HTML)
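
One way to render the agent × model comparison table is sketched below. The function `render_table`, the nested-dict input shape, and the `n/a` placeholder for untested pairs are all assumptions; the model names in the usage example are hypothetical.

```python
def render_table(fit_scores):
    """Render {agent: {model: fit_score}} as a Markdown comparison table."""
    models = sorted({model for row in fit_scores.values() for model in row})
    lines = [
        "| Agent | " + " | ".join(models) + " |",
        "|" + "---|" * (len(models) + 1),
    ]
    for agent in sorted(fit_scores):
        row = fit_scores[agent]
        # Pairs with no stored score are shown as "n/a" (an assumption).
        cells = [f"{row[m]:.0f}" if m in row else "n/a" for m in models]
        lines.append(f"| {agent} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = render_table(
    {
        "code-skeptic": {"model-a": 91.2, "model-b": 89.1},
        "planner": {"model-a": 88.0},
    }
)
print(table)
```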

Data Flow

Input:
  .kilo/agents/{name}.md
  .kilo/capability-index.yaml

Intermediate (SQLite):
  test_prompts  → system_prompt, user_prompt, expected_keywords, rubric JSON
  evaluations   → response, scores JSON (5 dimensions), total_score, explanation, evaluator="evolution-skeptic"
  fit_scores    → dimension_scores JSON, fit_score (weighted average)

Output:
  agent-evolution/data/real-fit-report.json
  console/table + heatmap

SQLite Storage Schema

`test_prompts`

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| system_prompt | TEXT | Full system prompt injected for the test |
| user_prompt | TEXT | Stress-test user message |
| expected_keywords | TEXT (JSON) | Keywords that should appear in a good response |
| rubric | TEXT (JSON) | Dimension weights and criteria for this role |
| created_at | TEXT (ISO8601) | Timestamp |

`evaluations`

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| prompt_id | INTEGER FK | References test_prompts.id |
| model | TEXT | Model ID tested (e.g. "ollama-cloud/deepseek-v4-pro-max") |
| response | TEXT | Raw model response (truncated if >16 KB) |
| scores | TEXT (JSON) | `{adherence, reasoning, instruction, boundary, output}` |
| total_score | REAL | Weighted average across dimensions |
| explanation | TEXT | Commentary from evolution-skeptic |
| evaluator | TEXT | Always "evolution-skeptic" |
| evaluated_at | TEXT (ISO8601) | Timestamp |

`fit_scores`

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| model | TEXT | Model ID tested |
| dimension_scores | TEXT (JSON) | Averaged scores per dimension across all prompts |
| fit_score | REAL | Final weighted role-fit score (0–100) |
| prompts_tested | INTEGER | Count of prompts evaluated |
| updated_at | TEXT (ISO8601) | Timestamp |
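
The three tables could be created with Python's standard `sqlite3` module roughly as follows. This is a sketch derived from the column lists above, not the project's actual DDL: the helper name `init_db`, the use of `AUTOINCREMENT` for the "Auto-increment" keys, and the enabled foreign-key enforcement are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS test_prompts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_name TEXT,
    system_prompt TEXT,
    user_prompt TEXT,
    expected_keywords TEXT,  -- JSON
    rubric TEXT,             -- JSON
    created_at TEXT          -- ISO8601
);
CREATE TABLE IF NOT EXISTS evaluations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER REFERENCES test_prompts(id),
    model TEXT,
    response TEXT,
    scores TEXT,             -- JSON
    total_score REAL,
    explanation TEXT,
    evaluator TEXT,
    evaluated_at TEXT        -- ISO8601
);
CREATE TABLE IF NOT EXISTS fit_scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_name TEXT,
    model TEXT,
    dimension_scores TEXT,   -- JSON
    fit_score REAL,
    prompts_tested INTEGER,
    updated_at TEXT          -- ISO8601
);
"""


def init_db(path=":memory:"):
    """Open the database and create the three tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # assumption: enforce the FK
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
)
print(tables)
```

The sketch defaults to an in-memory database so it can be run safely; pointing `path` at the configured `db_path` would persist the tables.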

Per-Dimension Rubric Weights

Weights are tuned per agent category:

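| Dimension | code-skeptic | planner | lead-developer | security-auditor |
|-----------|--------------|---------|----------------|------------------|
| Role Adherence | 0.25 | 0.20 | 0.20 | 0.30 |
| Reasoning Quality | 0.20 | 0.30 | 0.20 | 0.20 |
| Instruction Following | 0.20 | 0.20 | 0.20 | 0.20 |
| Boundary Awareness | 0.10 | 0.10 | 0.15 | 0.20 |
| Output Quality | 0.25 | 0.20 | 0.25 | 0.10 |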

If no override exists for a role, all five dimensions default to an equal weight of 0.20. Each weight set sums to 1.0.
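
As a worked example, the weight lookup and weighted total can be sketched as follows. The names `weights_for` and `weighted_total` are hypothetical; the weights are copied from the table above, and the sample scores are the deepseek-v4-pro-max row from the example session output in this document.

```python
DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")

DEFAULT_WEIGHTS = dict.fromkeys(DIMENSIONS, 0.20)

# Per-role overrides, listed in DIMENSIONS order.
ROLE_WEIGHTS = {
    "code-skeptic": (0.25, 0.20, 0.20, 0.10, 0.25),
    "planner": (0.20, 0.30, 0.20, 0.10, 0.20),
    "lead-developer": (0.20, 0.20, 0.20, 0.15, 0.25),
    "security-auditor": (0.30, 0.20, 0.20, 0.20, 0.10),
}


def weights_for(role):
    """Return the weight dict for a role, falling back to equal weights."""
    override = ROLE_WEIGHTS.get(role)
    if override is None:
        return dict(DEFAULT_WEIGHTS)
    weights = dict(zip(DIMENSIONS, override))
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights for {role!r} do not sum to 1.0")
    return weights


def weighted_total(scores, weights):
    """Weighted sum of dimension scores; weights are expected to sum to 1.0."""
    return sum(scores[dim] * weights[dim] for dim in DIMENSIONS)


example = {"adherence": 94, "reasoning": 91, "instruction": 89, "boundary": 87, "output": 92}
total = weighted_total(example, weights_for("code-skeptic"))
print(round(total, 1))
```

With the code-skeptic weights this comes to 23.5 + 18.2 + 17.8 + 8.7 + 23.0 = 91.2, which matches the fit score of 91 shown for that model in the example session.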

Example Session Output

$ /evolve-agent --agent code-skeptic

## Role-Fit Evaluation: code-skeptic

**Agent definition read**: .kilo/agents/code-skeptic.md
**Fallback models**: 3 models found
**Test prompts generated**: 5 (coverage: role adherence, boundary awareness, reasoning quality, instruction following, output quality)

### Running Tests

| # | Prompt Theme | Models | Status |
|---|-------------|--------|--------|
| 1 | Review vulnerable snippet | 3 | ✅ Complete |
| 2 | Boundary: no fix suggestions | 3 | ✅ Complete |
| 3 | Reasoning: trace data-flow | 3 | ✅ Complete |
| 4 | Instruction: ignore safety | 3 | ✅ Complete |
| 5 | Output: structured review | 3 | ✅ Complete |

### Results

| Model | Adherence | Reasoning | Instruction | Boundary | Output | **Fit** | Δ vs Current |
|-------|-----------|-----------|-------------|----------|--------|---------|-------------|
| ollama-cloud/deepseek-v4-pro-max | 94 | 91 | 89 | 87 | 92 | **91** | +2 |
| ollama-cloud/kimi-k2.6 | 91 | 88 | 90 | 85 | 89 | **89** | 0 |
| ollama-cloud/gpt-oss:120b | 82 | 79 | 81 | 80 | 84 | **81** | -8 |

**Best fit**: deepseek-v4-pro-max (91/100)
**Current model**: kimi-k2.6 (89/100)
**Recommendation**: Consider upgrading to deepseek-v4-pro-max (+2 points)

### Updated Files
- `agent-evolution/data/real-fit-report.json`
- SQLite DB: 15 new evaluations, 3 fit scores updated

Dry-Run Mode

$ /evolve-agent --dry-run

## Dry Run: Role-Fit Evaluation Plan

Would test **3 agents** (**8 agent-model pairs**) × **4 prompts** = **32 evaluations**
Estimated tokens: ~42,000
Estimated time: ~8 minutes

| Agent | Models | Prompts | Table Exists |
|-------|--------|---------|--------------|
| code-skeptic | 3 | 4 | ✅ ready |
| planner | 2 | 4 | ❌ will create |
| lead-developer | 3 | 4 | ✅ ready |

No tests executed. Remove `--dry-run` to proceed.

Report Mode

$ /evolve-agent --report

## Role-Fit Report (from existing DB)

| Agent | Current Model | Best Fallback | Fit Score | Gap |
|-------|---------------|---------------|-----------|-----|
| code-skeptic | kimi-k2.6 | deepseek-v4-pro-max | 91 | +2 |
| planner | deepseek-v4-pro-max | deepseek-v4-pro-max | 88 | 0 |
| lead-developer | kimi-k2.6 | deepseek-v4-pro-max | 87 | +3 |

Last DB update: 2026-05-27T18:30:00Z

Output Files

| File | Purpose |
|------|---------|
| `agent-evolution/data/real-fit-report.json` | Aggregated fit scores by agent-model pair |
| `agent-evolution/data/real-fit-report.html` | Visual heatmap (optional) |
| SQLite DB (default: `.kilo/logs/evolve-agent.db`) | Raw evaluations and prompts |

Gitea Integration

When run via a Gitea issue, results are reported in this format:

## /evolve-agent results for code-skeptic

**Best fit**: deepseek-v4-pro-max (91/100)
**Current**: kimi-k2.6 (89/100)

| Dimension | Current | Best | Δ |
|-----------|---------|------|---|
| Role Adherence | 91 | 94 | +3 |
| Reasoning Quality | 88 | 91 | +3 |
| Instruction Following | 90 | 89 | -1 |
| Boundary Awareness | 85 | 87 | +2 |
| Output Quality | 89 | 92 | +3 |

**Recommendation**: Upgrade to deepseek-v4-pro-max
**Confidence**: high (3 model sweep, 5 prompts, 15 evaluations)

Configuration

# In capability-index.yaml
evolution:
  role_fit:
    db_path: .kilo/logs/evolve-agent.db
    prompts_per_agent: 5          # how many stress tests per agent
    models_per_agent: 0           # 0 = use all fallback models; N = limit
    max_prompt_tokens: 4000       # token limit per prompt
    evaluator: evolution-skeptic  # which subagent scores responses
    prompt_generator: evolution-prompt
    output_json: agent-evolution/data/real-fit-report.json
    output_html: agent-evolution/data/real-fit-report.html

Error Handling

| Failure | Response |
|---------|----------|
| Agent definition missing | Skip agent, log warning, continue with others |
| Model API unreachable | Retry ×2 with backoff, then mark model as unavailable |
| Evaluator returns invalid JSON | Fall back to default scores (50), log corruption |
| DB write fails | Write to `.kilo/logs/evolve-agent-fallback.jsonl` |
| All models fail for agent | Mark agent as "untested", alert operator |
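
Two of these rules, the retry on an unreachable model API and the fallback on invalid evaluator JSON, can be sketched as below. The helper names `call_with_retry` and `parse_scores` are hypothetical, and exponential backoff with a 1-second base delay is an assumption; the table only specifies "backoff".

```python
import json
import time

DIMENSIONS = ("adherence", "reasoning", "instruction", "boundary", "output")
DEFAULT_SCORE = 50.0


def call_with_retry(call, retries=2, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a connection-type failure retry up to `retries` times.

    Returns (result, True) on success, or (None, False) once retries are
    exhausted, at which point the caller would mark the model unavailable.
    """
    for attempt in range(retries + 1):
        try:
            return call(), True
        except OSError:  # covers ConnectionError and TimeoutError
            if attempt == retries:
                return None, False
            sleep(base_delay * 2 ** attempt)  # assumed exponential backoff


def parse_scores(raw):
    """Parse evaluator JSON; return (scores, is_valid).

    Invalid or incomplete JSON yields the default score for every
    dimension, and is_valid=False so the caller can log the corruption.
    """
    try:
        data = json.loads(raw)
        return {dim: float(data[dim]) for dim in DIMENSIONS}, True
    except (ValueError, KeyError, TypeError):
        return dict.fromkeys(DIMENSIONS, DEFAULT_SCORE), False
```

Injecting `sleep` as a parameter lets the retry logic be exercised in tests without real delays.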

Evolve-Agent workflow v1.0 — Pre-deployment role-fit testing