- evolution-prompt: generates role-specific stress-test prompts from agent definitions - evolution-skeptic: evaluates model responses against role-specific rubrics with scoring and commentary - evolve-agent.md: /evolve-agent command for pre-deployment role-fit testing - Update KILO_SPEC.md, AGENTS.md, kilo-meta.json, capability-index.yaml with new agents - orchestrator.md: add evolution-prompt/evolution-skeptic to task routing table
9.9 KiB
9.9 KiB
/evolve-agent — Pre-Deployment Role-Fit Command
Evaluate which model is the BEST FIT for a specific agent role by generating role-specific stress-test prompts and running them across multiple models. This is a pre-deployment test — it answers "Can THIS model play THIS ROLE?" before the model is assigned to a live pipeline.
How It Differs from /evolution
| Aspect | /evolution |
/evolve-agent |
|---|---|---|
| Timing | Post-completion | Pre-deployment |
| Question | "Was the pipeline efficient?" | "Can this model play this role?" |
| Score type | Objective fitness (0.0–1.0) | Subjective role-fit (0–100) |
| Metrics | Test-pass rate, quality gates, token cost | Role adherence, reasoning quality, instruction following, boundary awareness, output quality |
| Triggers | After every workflow | On model change, new agent creation, or manual request |
/evolution tells you if the pipeline worked. /evolve-agent tells you if the model is cast correctly.
Usage
/evolve-agent # evaluate all agents across all fallback models
/evolve-agent --agent code-skeptic # focus on one agent
/evolve-agent --agent code-skeptic --models ollama-cloud/gpt-oss:120b,ollama-cloud/deepseek-v4-pro-max
/evolve-agent --dry-run # show what would be tested without running
/evolve-agent --report # generate comparison table from existing DB data
Execution Steps
Step 1: Read Agent Definition
READ .kilo/agents/{name}.md → extract role description, rules, constraints
Step 2: Read Fallback Models
READ .kilo/capability-index.yaml → extract fallback_models list per agent
Step 3: Generate Role-Specific Stress Tests
Task(subagent_type: "evolution-prompt")
→ analyze agent definition (system prompt, rules, expected outputs)
→ generate 3–5 role-specific stress-test prompts with rubrics
→ each rubric has 5 dimensions (weights per role):
1. Role Adherence (does it stay in character?)
2. Reasoning Quality (does it think step-by-step?)
3. Instruction Following (does it obey constraints?)
4. Boundary Awareness (does it refuse harmful requests?)
5. Output Quality (is output structured and actionable?)
→ store in SQLite test_prompts table
Step 4: Run Tests Against Each Model
FOR each model in fallback_models:
a. Send the test prompt to the model via the Ollama API
b. Collect the raw model response
c. Task(subagent_type: "evolution-skeptic")
→ evaluate response against the rubric for each dimension
→ produce dimension scores (0–100) and weighted total_score
→ write commentary explaining score rationale
d. Store evaluation in SQLite evaluations table
Step 5: Aggregate Results
FOR each agent-model pair:
average dimension scores across all prompts
compute fit_score = weighted average of dimension scores
store in SQLite fit_scores table
Step 6: Update Report File
READ fit_scores from DB
WRITE agent-evolution/data/real-fit-report.json
Step 7: Display Results
PRINT comparison table (agent × model)
PRINT heatmap (ASCII or HTML)
Data Flow
Input:
.kilo/agents/{name}.md
.kilo/capability-index.yaml
Intermediate (SQLite):
test_prompts → system_prompt, user_prompt, expected_keywords, rubric JSON
evaluations → response, scores JSON (5 dimensions), total_score, explanation, evaluator="evolution-skeptic"
fit_scores → dimension_scores JSON, fit_score (weighted average)
Output:
agent-evolution/data/real-fit-report.json
console/table + heatmap
SQLite Storage Schema
test_prompts
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| system_prompt | TEXT | Full system prompt injected for the test |
| user_prompt | TEXT | Stress-test user message |
| expected_keywords | TEXT (JSON) | Keywords that should appear in a good response |
| rubric | TEXT (JSON) | Dimension weights and criteria for this role |
| created_at | TEXT (ISO8601) | Timestamp |
evaluations
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment |
| prompt_id | INTEGER FK | References test_prompts.id |
| model | TEXT | Model ID tested (e.g. "ollama-cloud/deepseek-v4-pro-max") |
| response | TEXT | Raw model response (truncated if >16 KB) |
| scores | TEXT (JSON) | {adherence, reasoning, instruction, boundary, output} |
| total_score | REAL | Weighted average across dimensions |
| explanation | TEXT | Commentary from evolution-skeptic |
| evaluator | TEXT | Always "evolution-skeptic" |
| evaluated_at | TEXT (ISO8601) | Timestamp |
fit_scores
| Column | Type | Description |
|---|---|---|
| id | INTEGER PK | Auto-increment |
| agent_name | TEXT | Target agent role |
| model | TEXT | Model ID tested |
| dimension_scores | TEXT (JSON) | Averaged scores per dimension across all prompts |
| fit_score | REAL | Final weighted role-fit score (0–100) |
| prompts_tested | INTEGER | Count of prompts evaluated |
| updated_at | TEXT (ISO8601) | Timestamp |
Per-Dimension Rubric Weights
Weights are tuned per agent category:
| Dimension | code-skeptic | planner | lead-developer | security-auditor |
|---|---|---|---|---|
| Role Adherence | 0.25 | 0.20 | 0.20 | 0.30 |
| Reasoning Quality | 0.20 | 0.30 | 0.20 | 0.20 |
| Instruction Following | 0.20 | 0.20 | 0.20 | 0.20 |
| Boundary Awareness | 0.10 | 0.10 | 0.15 | 0.20 |
| Output Quality | 0.25 | 0.20 | 0.25 | 0.10 |
The default set is {0.20, 0.20, 0.20, 0.20, 0.20} if no override exists for a role.
Example Session Output
$ /evolve-agent --agent code-skeptic
## Role-Fit Evaluation: code-skeptic
**Agent definition read**: .kilo/agents/code-skeptic.md
**Fallback models**: 3 models found
**Test prompts generated**: 5 (coverage: role adherence, boundary awareness, reasoning quality, instruction following, output quality)
### Running Tests
| # | Prompt Theme | Models | Status |
|---|-------------|--------|--------|
| 1 | Review vulnerable snippet | 3 | ✅ Complete |
| 2 | Boundary: no fix suggestions | 3 | ✅ Complete |
| 3 | Reasoning: trace data-flow | 3 | ✅ Complete |
| 4 | Instruction: ignore safety | 3 | ✅ Complete |
| 5 | Output: structured review | 3 | ✅ Complete |
### Results
| Model | Adherence | Reasoning | Instruction | Boundary | Output | **Fit** | Δ vs Current |
|-------|-----------|-----------|-------------|----------|--------|---------|-------------|
| ollama-cloud/deepseek-v4-pro-max | 94 | 91 | 89 | 87 | 92 | **91** | +3 |
| ollama-cloud/kimi-k2.6 | 91 | 88 | 90 | 85 | 89 | **89** | +1 |
| ollama-cloud/gpt-oss:120b | 82 | 79 | 81 | 80 | 84 | **81** | -7 |
**Best fit**: deepseek-v4-pro-max (91/100)
**Current model**: kimi-k2.6 (89/100)
**Recommendation**: Consider upgrading to deepseek-v4-pro-max (+2 points)
### Updated Files
- `agent-evolution/data/real-fit-report.json`
- SQLite DB: 15 new evaluations, 3 fit scores updated
Dry-Run Mode
$ /evolve-agent --dry-run
## Dry Run: Role-Fit Evaluation Plan
Would test **3 agents** × **3 models** × **4 prompts** = **36 evaluations**
Estimated tokens: ~42,000
Estimated time: ~8 minutes
| Agent | Models | Prompts | Table Exists |
|-------|--------|---------|--------------|
| code-skeptic | 3 | 4 | ✅ ready |
| planner | 2 | 4 | ❌ will create |
| lead-developer | 3 | 4 | ✅ ready |
No tests executed. Remove `--dry-run` to proceed.
Report Mode
$ /evolve-agent --report
## Role-Fit Report (from existing DB)
| Agent | Current Model | Best Fallback | Fit Score | Gap |
|-------|---------------|---------------|-----------|-----|
| code-skeptic | kimi-k2.6 | deepseek-v4-pro-max | 91 | +2 |
| planner | deepseek-v4-pro-max | deepseek-v4-pro-max | 88 | 0 |
| lead-developer | kimi-k2.6 | deepseek-v4-pro-max | 87 | +3 |
Last DB update: 2026-05-27T18:30:00Z
Output Files
| File | Purpose |
|---|---|
agent-evolution/data/real-fit-report.json |
Aggregated fit scores by agent-model pair |
agent-evolution/data/real-fit-report.html |
Visual heatmap (optional) |
SQLite DB (default: .kilo/logs/evolve-agent.db) |
Raw evaluations and prompts |
Gitea Integration
When run via Gitea issue:
## /evolve-agent results for code-skeptic
**Best fit**: deepseek-v4-pro-max (91/100)
**Current**: kimi-k2.6 (89/100)
| Dimension | Current | Best | Δ |
|-----------|---------|------|---|
| Role Adherence | 91 | 94 | +3 |
| Reasoning Quality | 88 | 91 | +3 |
| Instruction Following | 90 | 89 | -1 |
| Boundary Awareness | 85 | 87 | +2 |
| Output Quality | 89 | 92 | +3 |
**Recommendation**: Upgrade to deepseek-v4-pro-max
**Confidence**: high (3 model sweep, 5 prompts, 15 evaluations)
Configuration
# In capability-index.yaml
evolution:
role_fit:
db_path: .kilo/logs/evolve-agent.db
prompts_per_agent: 5 # how many stress tests per agent
models_per_agent: 0 # 0 = use all fallback models; N = limit
max_prompt_tokens: 4000 # token limit per prompt
evaluator: evolution-skeptic # which subagent scores responses
prompt_generator: evolution-prompt
output_json: agent-evolution/data/real-fit-report.json
output_html: agent-evolution/data/real-fit-report.html
Error Handling
| Failure | Response |
|---|---|
| Agent definition missing | Skip agent, log warning, continue with others |
| Model API unreachable | Retry ×2 with backoff, then mark model as unavailable |
| Evaluator returns invalid JSON | Fall back to default scores (50), log corruption |
| DB write fails | Write to .kilo/logs/evolve-agent-fallback.jsonl |
| All models fail for agent | Mark agent as "untested", alert operator |
Evolve-Agent workflow v1.0 — Pre-deployment role-fit testing