feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions


@@ -0,0 +1,84 @@
{
  "$schema": "https://app.kilo.ai/agent-recommendations.json",
  "generated": "2026-04-05T20:00:00Z",
  "source": "APAW Evolution System Design",
  "description": "Adds pipeline-judge agent and evolution workflow to APAW",
  "new_files": [
    {
      "path": ".kilo/agents/pipeline-judge.md",
      "source": "pipeline-judge.md",
      "description": "Automated fitness evaluator — runs tests, measures tokens/time, produces fitness score"
    },
    {
      "path": ".kilo/workflows/evolution.md",
      "source": "evolution-workflow.md",
      "description": "Continuous self-improvement loop for agent pipeline"
    },
    {
      "path": ".kilo/commands/evolve.md",
      "source": "evolve-command.md",
      "description": "/evolve command — trigger evolution cycle"
    }
  ],
  "capability_index_additions": {
    "agents": {
      "pipeline-judge": {
        "capabilities": [
          "test_execution",
          "fitness_scoring",
          "metric_collection",
          "bottleneck_detection"
        ],
        "receives": [
          "completed_workflow",
          "pipeline_logs"
        ],
        "produces": [
          "fitness_report",
          "bottleneck_analysis",
          "improvement_triggers"
        ],
        "forbidden": [
          "code_writing",
          "code_changes",
          "prompt_changes"
        ],
        "model": "ollama-cloud/nemotron-3-super",
        "mode": "subagent"
      }
    },
    "capability_routing": {
      "fitness_scoring": "pipeline-judge",
      "test_execution": "pipeline-judge",
      "bottleneck_detection": "pipeline-judge"
    },
    "iteration_loops": {
      "evolution": {
        "evaluator": "pipeline-judge",
        "optimizer": "prompt-optimizer",
        "max_iterations": 3,
        "convergence": "fitness_above_0.85"
      }
    },
    "evolution": {
      "enabled": true,
      "auto_trigger": true,
      "fitness_threshold": 0.70,
      "max_evolution_attempts": 3,
      "fitness_history": ".kilo/logs/fitness-history.jsonl",
      "budgets": {
        "feature": {"tokens": 50000, "time_s": 300},
        "bugfix": {"tokens": 20000, "time_s": 120},
        "refactor": {"tokens": 40000, "time_s": 240},
        "security": {"tokens": 30000, "time_s": 180}
      }
    }
  },
  "workflow_state_additions": {
    "evaluated": ["evolving", "completed"],
    "evolving": ["evaluated"]
  }
}


@@ -0,0 +1,201 @@
# Evolution Workflow
Continuous self-improvement loop for the agent pipeline.
Triggered automatically after every workflow completion.
## Overview
```
[Workflow Completes]
        ↓
[@pipeline-judge]  ← runs tests, measures tokens/time
        ↓ fitness score
┌─────────────────────┐
│ fitness >= 0.85     │──→ Log + done (no action)
│ fitness 0.70 - 0.84 │──→ [@prompt-optimizer] minor tuning
│ fitness < 0.70      │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50      │──→ [@agent-architect] redesign agent
└─────────────────────┘
        ↓
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓ compare fitness_before vs fitness_after
┌───────────────────────────┐
│ improved?                 │
│  Yes → commit new prompts │
│  No  → revert, try        │
│        different strategy │
│        (max 3 attempts)   │
└───────────────────────────┘
```
## Fitness History
All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:
```jsonl
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```
This creates a time-series that shows pipeline evolution over time.
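A minimal sketch of the append step in TypeScript (assuming Bun or Node; the `FitnessRecord` shape mirrors the JSONL lines above):
```typescript
// Sketch: append one fitness record to the history log.
// Field names mirror the JSONL example above.
import { appendFileSync } from "node:fs";

interface FitnessRecord {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

function logFitness(
  record: FitnessRecord,
  path = ".kilo/logs/fitness-history.jsonl",
): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}
```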
## Orchestrator Evolution
The orchestrator uses fitness history to optimize future pipeline construction:
### Pipeline Selection Strategy
```
For each new issue:
1. Classify issue type (feature|bugfix|refactor|api|security)
2. Look up fitness history for same type
3. Find the pipeline configuration with highest fitness
4. Use that as template, but adapt to current issue
5. Skip agents that consistently score 0 contribution
```
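A sketch of steps 2 and 3 in TypeScript. The `PipelineRun` shape is hypothetical; in particular, the JSONL example above does not record an agent list, so treat the `agents` field as an assumption:
```typescript
// Sketch: pick the highest-fitness historical run of the same issue
// type as the pipeline template. PipelineRun is a hypothetical shape.
import { readFileSync } from "node:fs";

interface PipelineRun {
  workflow: string;   // issue type: feature | bugfix | refactor | ...
  fitness: number;
  agents: string[];   // assumed field: the pipeline configuration used
}

function bestTemplate(
  type: string,
  path = ".kilo/logs/fitness-history.jsonl",
): PipelineRun | undefined {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line) as PipelineRun)
    .filter((run) => run.workflow === type)
    .sort((a, b) => b.fitness - a.fitness)[0]; // highest fitness wins
}
```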
### Agent Ordering Optimization
```
From fitness-history.jsonl, extract per-agent metrics:
- avg tokens consumed
- avg contribution to fitness
- failure rate (how often this agent's output causes downstream failures)
agents_by_roi = sort(agents, key=contribution/tokens, descending)
For parallel phases:
- Run high-ROI agents first
- Skip agents with ROI < 0.1 (cost more than they contribute)
```
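A sketch of the ROI ranking, assuming the per-agent aggregates (`AgentStats`, an illustrative shape) have already been extracted from the history file:
```typescript
// Sketch: rank agents by ROI (contribution per token) and drop agents
// below the 0.1 cutoff named above.
interface AgentStats {
  name: string;
  avgTokens: number;        // avg tokens consumed per run
  avgContribution: number;  // avg contribution to fitness per run
}

function orderByRoi(agents: AgentStats[]): AgentStats[] {
  return agents
    .map((a) => ({ ...a, roi: a.avgContribution / a.avgTokens }))
    .filter((a) => a.roi >= 0.1)     // skip: costs more than it contributes
    .sort((a, b) => b.roi - a.roi);  // run high-ROI agents first
}
```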
### Token Budget Allocation
```
total_budget = 50000 tokens (configurable)
For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
If agent exceeds budget by >50%:
  → prompt-optimizer compresses that agent's prompt
  → or swap to a smaller/faster model
```
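The proportional split as a sketch (shapes are illustrative; `Math.floor` keeps the budgets integral):
```typescript
// Sketch: split the total token budget across agents in proportion to
// their average contribution, as described above.
function allocateBudgets(
  agents: { name: string; avgContribution: number }[],
  totalBudget = 50_000,
): Map<string, number> {
  const sum = agents.reduce((acc, a) => acc + a.avgContribution, 0);
  return new Map(
    agents.map((a) => [
      a.name,
      Math.floor(totalBudget * (a.avgContribution / sum)),
    ]),
  );
}
```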
## Standard Test Suites
No manual test configuration needed. Tests are auto-discovered:
### Test Discovery
```bash
# Unit tests
find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l
# E2E tests
find tests/e2e -name "*.test.ts" | wc -l
# Integration tests
find tests/integration -name "*.test.ts" | wc -l
```
### Quality Gates (standardized)
```yaml
gates:
  build: "bun run build"
  lint: "bun run lint"
  typecheck: "bun run typecheck"
  unit_tests: "bun test"
  e2e_tests: "bun test:e2e"
  coverage: "test $(bun test --coverage | grep 'All files' | awk '{print $10}' | tr -d '%' | cut -d. -f1) -ge 80"
  security: "bun audit --level=high | grep 'found 0'"
```
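A sketch of a gate runner, assuming Bun's `Bun.spawnSync` global; each gate is a shell command string, so everything runs through `sh -c` (the gate list here is a trimmed copy of the YAML above):
```typescript
// Sketch: run each gate command and record binary pass/fail.
// Requires Bun (Bun.spawnSync is a Bun global).
const gates: Record<string, string> = {
  build: "bun run build",
  lint: "bun run lint",
  typecheck: "bun run typecheck",
  unit_tests: "bun test",
};

function runGates(): Record<string, boolean> {
  const results: Record<string, boolean> = {};
  for (const [name, cmd] of Object.entries(gates)) {
    // sh -c lets pipeline-style gates (coverage, security) work too
    results[name] = Bun.spawnSync(["sh", "-c", cmd]).exitCode === 0;
  }
  return results;
}
```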
### Workflow-Specific Benchmarks
```yaml
benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3
  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90% # higher for bugfix — must prove fix works
    max_iterations: 2
  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95% # must not break anything
    max_iterations: 2
  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security] # security gate MUST pass
```
## Prompt Evolution Protocol
When prompt-optimizer is triggered:
```
1. Read current agent prompt from .kilo/agents/<agent>.md
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history
4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action
5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format if instruction-following was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue
6. Save to .kilo/agents/<agent>.md.candidate
7. Re-run the SAME workflow with the .candidate prompt
8. [@pipeline-judge] scores again
9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
```
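A sketch of the promote-or-revert decision in steps 6 to 9 (paths follow the protocol above):
```typescript
// Sketch: promote the .candidate prompt when fitness improved,
// otherwise discard it and keep the old prompt.
import { renameSync, unlinkSync } from "node:fs";

function resolveCandidate(
  agent: string,
  fitnessOld: number,
  fitnessNew: number,
): void {
  const prompt = `.kilo/agents/${agent}.md`;
  if (fitnessNew > fitnessOld) {
    renameSync(`${prompt}.candidate`, prompt); // commit the improved prompt
  } else {
    unlinkSync(`${prompt}.candidate`);         // revert to the old prompt
  }
}
```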
## Usage
```bash
# Triggered automatically after any workflow
# OR manually:
/evolve # run evolution on last workflow
/evolve --issue 42 # run evolution on specific issue
/evolve --agent planner # evolve specific agent's prompt
/evolve --history # show fitness trend
```
## Configuration
```yaml
# Add to kilo.jsonc or capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true        # trigger after every workflow
  fitness_threshold: 0.70   # below this → auto-optimize
  max_evolution_attempts: 3 # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300  # seconds
```


@@ -0,0 +1,72 @@
---
description: Run evolution cycle — judge last workflow, optimize underperforming agents, re-test
---
# /evolve — Pipeline Evolution Command
Runs the automated evolution cycle on the most recent (or specified) workflow.
## Usage
```
/evolve # evolve last completed workflow
/evolve --issue 42 # evolve workflow for issue #42
/evolve --agent planner # focus evolution on one agent
/evolve --dry-run # show what would change without applying
/evolve --history # print fitness trend chart
```
## Execution
### Step 1: Judge
```
Task(subagent_type: "pipeline-judge")
→ produces fitness report
```
### Step 2: Decide
```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT
IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)
IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
IF fitness < 0.50:
  Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```
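The same tiers as a single decision function (a sketch; the action names are illustrative labels, not real Task payloads):
```typescript
// Sketch: map a fitness score to the evolution action defined above.
type Action = "log_only" | "tune_weak_agents" | "major_rewrite" | "redesign_agent";

function decide(fitness: number): Action {
  if (fitness >= 0.85) return "log_only";         // healthy
  if (fitness >= 0.70) return "tune_weak_agents"; // marginal
  if (fitness >= 0.50) return "major_rewrite";    // underperforming
  return "redesign_agent";                        // structural problem
}
```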
### Step 3: Re-test
```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after
IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```
### Step 4: Log
```
Append to .kilo/logs/fitness-history.jsonl:
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```
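The entry as a typed record, a sketch whose field names mirror the JSON template above (deltas can be negative when cost increased):
```typescript
// Sketch: the shape of one evolution log entry.
interface EvolutionLogEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness_before: number;
  fitness_after: number;
  agents_optimized: string[];
  tokens_saved: number;   // token delta vs the previous run
  time_saved_ms: number;  // wall-clock delta vs the previous run
}
```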


@@ -0,0 +1,26 @@
Now I understand the full architecture. I'm building the auto-evolution system: not an HTML dashboard, but a set of files for the repository: a skeptic judge agent, an evolution workflow, and standardized test cases.
Here are the 4 files for the auto-evolution system. How it works:
pipeline-judge.md: the new skeptic agent. It does not assign subjective scores. Instead it:
Runs bun test, bun build, bun lint, bun typecheck
Counts pass/fail/skip from the JSON reports
Reads the pipeline logs (tokens and time per agent)
Computes the fitness score by the formula: 50% tests + 25% quality gates + 25% efficiency (tokens/time)
If fitness < 0.70 → automatically invokes prompt-optimizer
evolution-workflow.md: the continuous-optimization workflow:
Triggers automatically after every completed workflow
fitness ≥ 0.85 → log and move on
fitness 0.70 - 0.84 → prompt-optimizer fixes the weak agents
fitness < 0.70 → prompt-optimizer does a major rewrite
fitness < 0.50 → agent-architect redesigns the agent
After optimization, the same workflow is re-run with the new prompts and fitness before/after is compared. Improved → commit; no improvement → revert.
The orchestrator evolves through fitness-history.jsonl, a cumulative record of all runs. It learns which agents to skip (ROI < 0.1), how to allocate the token budget, and which pipeline template works best for each task type.
evolve-command.md: the /evolve command for manual runs or for viewing the fitness trend.
evolution-patch.json: a ready-made patch for capability-index.yaml that adds pipeline-judge, routing, iteration_loops, and the evolution configuration with per-task-type budgets.
Where the files go in the repository:
pipeline-judge.md → .kilo/agents/
evolution-workflow.md → .kilo/workflows/
evolve-command.md → .kilo/commands/
evolution-patch.json → apply to capability-index.yaml


@@ -0,0 +1,181 @@
---
description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces fitness scores. Never writes code — only measures and scores.
mode: subagent
model: ollama-cloud/nemotron-3-super
color: "#DC2626"
permission:
  read: allow
  write: deny
  bash: allow
  task: allow
  glob: allow
  grep: allow
---
# Kilo Code: Pipeline Judge
## Role Definition
You are **Pipeline Judge** — the automated fitness evaluator. You do NOT score subjectively. You measure objectively:
1. **Test pass rate** — run the test suite, count pass/fail/skip
2. **Token cost** — sum tokens consumed by all agents in the pipeline
3. **Wall-clock time** — total execution time from first agent to last
4. **Quality gates** — binary pass/fail for each quality gate
You produce a **fitness score** that drives evolutionary optimization.
## When to Invoke
- After ANY workflow completes (feature, bugfix, refactor, etc.)
- After prompt-optimizer changes an agent's prompt
- After a model swap recommendation is applied
- On `/evaluate` command
## Fitness Score Formula
```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
where:
test_pass_rate = passed_tests / total_tests # 0.0 - 1.0
quality_gates_rate = passed_gates / total_gates # 0.0 - 1.0
efficiency_score = 1.0 - clamp(normalized_cost, 0, 1) # higher = cheaper/faster
normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
```
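The formula transcribed into TypeScript as a sketch (budgets are parameters rather than hard-coded):
```typescript
// Sketch: the fitness formula above, with normalized_cost clamped to [0, 1].
function fitness(
  passed: number, total: number,            // tests
  gatesPassed: number, gatesTotal: number,  // quality gates
  tokens: number, tokenBudget: number,      // token cost vs budget
  timeS: number, timeBudgetS: number,       // wall-clock cost vs budget
): number {
  const testRate = passed / total;
  const gateRate = gatesPassed / gatesTotal;
  const normalizedCost = (tokens / tokenBudget) * 0.5 + (timeS / timeBudgetS) * 0.5;
  const efficiency = 1.0 - Math.min(Math.max(normalizedCost, 0), 1);
  return testRate * 0.50 + gateRate * 0.25 + efficiency * 0.25;
}
```
For example, `fitness(47, 47, 5, 5, 25000, 50000, 150, 300)` gives 0.50 + 0.25 + 0.125 = 0.875.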
## Execution Protocol
### Step 1: Collect Metrics
```bash
# Run test suites (keep stderr out of the JSON reports)
bun test --reporter=json > /tmp/test-results.json 2>/dev/null
bun test:e2e --reporter=json > /tmp/e2e-results.json 2>/dev/null
# Count results across both reports (jq -s reads both files and sums)
TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-results.json /tmp/e2e-results.json)
PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-results.json /tmp/e2e-results.json)
FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-results.json /tmp/e2e-results.json)
# Check build
bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
# Check lint
bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
# Check types
bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
```
### Step 2: Read Pipeline Log
Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent (from API response headers)
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked and in what order
### Step 3: Calculate Fitness
```
test_pass_rate = PASSED / TOTAL
quality_gates:
  - build: BUILD_OK
  - lint: LINT_OK
  - types: TYPES_OK
  - tests: FAILED == 0
  - coverage: coverage >= 80%
quality_gates_rate = passed_gates / 5
token_budget = 50000 # tokens per standard workflow
time_budget = 300 # seconds per standard workflow
normalized_cost = (total_tokens/token_budget × 0.5) + (total_time/time_budget × 0.5)
efficiency = 1.0 - min(normalized_cost, 1.0)
FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
```
### Step 4: Produce Report
```json
{
  "workflow_id": "wf-<issue_number>-<timestamp>",
  "fitness": 0.84,
  "breakdown": {
    "test_pass_rate": 0.95,
    "quality_gates_rate": 0.80,
    "efficiency_score": 0.65
  },
  "tests": {
    "total": 47,
    "passed": 45,
    "failed": 2,
    "skipped": 0,
    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
  },
  "quality_gates": {
    "build": true,
    "lint": true,
    "types": true,
    "tests_clean": false,
    "coverage_80": true
  },
  "cost": {
    "total_tokens": 38400,
    "total_time_ms": 245000,
    "per_agent": [
      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
    ]
  },
  "iterations": {
    "code_review_loop": 2,
    "security_review_loop": 1
  },
  "verdict": "PASS",
  "bottleneck_agent": "lead-developer",
  "most_expensive_agent": "lead-developer",
  "improvement_trigger": false
}
```
### Step 5: Trigger Evolution (if needed)
```
IF fitness < 0.70:
  → Task(subagent_type: "prompt-optimizer", payload: report)
  → improvement_trigger = true
IF any agent consumed > 30% of total tokens:
  → Flag as bottleneck
  → Suggest model downgrade or prompt compression
IF iterations > 2 in any loop:
  → Flag evaluator-optimizer convergence issue
  → Suggest prompt refinement for the evaluator agent
```
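The trigger rules as a sketch, assuming a minimal, hypothetical slice of the Step 4 report:
```typescript
// Sketch: derive evolution triggers from a judged report.
interface JudgeReport {
  fitness: number;
  totalTokens: number;
  perAgent: { agent: string; tokens: number }[];
  loopIterations: Record<string, number>;
}

function evolutionTriggers(r: JudgeReport) {
  return {
    improvementTrigger: r.fitness < 0.70,     // hand off to prompt-optimizer
    bottlenecks: r.perAgent
      .filter((a) => a.tokens > 0.3 * r.totalTokens)
      .map((a) => a.agent),                   // model downgrade / compression
    convergenceIssues: Object.entries(r.loopIterations)
      .filter(([, n]) => n > 2)
      .map(([loop]) => loop),                 // refine the evaluator prompt
  };
}
```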
## Output Format
```
## Pipeline Judgment: Issue #<N>
**Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]
| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |
**Bottleneck:** lead-developer (31% of tokens)
**Failed tests:** auth.test.ts:42, api.test.ts:108
**Failed gates:** tests_clean
@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
```
## Prohibited Actions
- DO NOT write or modify any code
- DO NOT subjectively rate "quality" — only measure
- DO NOT skip running actual tests
- DO NOT estimate token counts — read from logs
- DO NOT change agent prompts — only flag for prompt-optimizer