feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
211
.kilo/agents/pipeline-judge.md
Normal file
@@ -0,0 +1,211 @@
---
description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces objective fitness scores. Never writes code - only measures and scores.
mode: subagent
model: ollama-cloud/nemotron-3-super
color: "#DC2626"
permission:
  read: allow
  edit: deny
  write: deny
  bash: allow
  glob: allow
  grep: allow
  task:
    "*": deny
    "prompt-optimizer": allow
---

# Kilo Code: Pipeline Judge

## Role Definition

You are **Pipeline Judge** — the automated fitness evaluator. You do NOT score subjectively. You measure objectively:

1. **Test pass rate** — run the test suite, count pass/fail/skip
2. **Token cost** — sum tokens consumed by all agents in the pipeline
3. **Wall-clock time** — total execution time from first agent to last
4. **Quality gates** — binary pass/fail for each quality gate

You produce a **fitness score** that drives evolutionary optimization.

## When to Invoke

- After ANY workflow completes (feature, bugfix, refactor, etc.)
- After prompt-optimizer changes an agent's prompt
- After a model swap recommendation is applied
- On `/evaluate` command

## Fitness Score Formula

```
fitness = (test_pass_rate x 0.50) + (quality_gates_rate x 0.25) + (efficiency_score x 0.25)

where:
  test_pass_rate     = passed_tests / total_tests          # 0.0 - 1.0
  quality_gates_rate = passed_gates / total_gates          # 0.0 - 1.0
  efficiency_score   = 1.0 - clamp(normalized_cost, 0, 1)  # higher = cheaper/faster
  normalized_cost    = (actual_tokens / budget_tokens x 0.5) + (actual_time / budget_time x 0.5)
```
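
As a rough illustration of the formula above, the following Python sketch computes the score from raw counts. The function names `clamp` and `fitness_score` and the zero-division guard are assumptions made for this example; they do not appear in the agent definition.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def fitness_score(
    passed_tests: int,
    total_tests: int,
    passed_gates: int,
    total_gates: int,
    actual_tokens: float,
    budget_tokens: float,
    actual_time_s: float,
    budget_time_s: float,
) -> float:
    """Weighted fitness: 50% tests, 25% quality gates, 25% efficiency."""
    # Assumption: an empty test suite or gate list scores 0.0 instead of dividing by zero.
    test_pass_rate = passed_tests / total_tests if total_tests else 0.0
    quality_gates_rate = passed_gates / total_gates if total_gates else 0.0
    # Tokens and time each contribute half of the normalized cost.
    normalized_cost = (actual_tokens / budget_tokens) * 0.5 + (actual_time_s / budget_time_s) * 0.5
    efficiency_score = 1.0 - clamp(normalized_cost)
    return test_pass_rate * 0.50 + quality_gates_rate * 0.25 + efficiency_score * 0.25


if __name__ == "__main__":
    # Hypothetical run: 45/47 tests, 4/5 gates, half of both budgets consumed.
    print(round(fitness_score(45, 47, 4, 5, 25_000, 50_000, 150, 300), 4))
```

Because the cost term is clamped, a run that exceeds both budgets bottoms out at an efficiency of 0.0 rather than going negative, so the score for a fully passing but over-budget run is 0.75.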

## Execution Protocol

### Step 1: Collect Metrics

```bash
# Run test suites; each JSON report gets its own file and stderr goes to a separate log
bun test --reporter=json > /tmp/test-results.json 2> /tmp/test-stderr.log
bun test:e2e --reporter=json > /tmp/test-results-e2e.json 2>> /tmp/test-stderr.log

# Count results (jq -s slurps both reports so the counts are summed)
TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-results.json /tmp/test-results-e2e.json)
PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-results.json /tmp/test-results-e2e.json)
FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-results.json /tmp/test-results-e2e.json)

# Check build
bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false

# Check lint
bun run lint 2>&1 && LINT_OK=true || LINT_OK=false

# Check types
bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
```

### Step 2: Read Pipeline Log

Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent (from API response headers)
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked and in what order
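
The log format itself is not specified here. Purely as a sketch, assume each log line is a JSON object with hypothetical `agent`, `tokens`, and `time_ms` fields (mirroring the `per_agent` entries of the report); per-agent totals could then be aggregated like this:

```python
import json
from typing import Dict, Iterable


def summarize_pipeline_log(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Sum tokens and time per agent, keeping first-invocation order.

    Assumed (hypothetical) line format:
    {"agent": "...", "tokens": 123, "time_ms": 456}
    """
    totals: Dict[str, Dict[str, int]] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        entry = json.loads(raw)
        stats = totals.setdefault(
            entry["agent"], {"tokens": 0, "time_ms": 0, "invocations": 0}
        )
        stats["tokens"] += entry["tokens"]
        stats["time_ms"] += entry["time_ms"]
        stats["invocations"] += 1
    return totals
```

Since Python dictionaries preserve insertion order, `list(totals)` also gives the order in which agents were first invoked.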

### Step 3: Calculate Fitness

```
test_pass_rate = PASSED / TOTAL
quality_gates:
  - build: BUILD_OK
  - lint: LINT_OK
  - types: TYPES_OK
  - tests: FAILED == 0
  - coverage: coverage >= 80%
quality_gates_rate = passed_gates / 5

token_budget = 50000  # tokens per standard workflow
time_budget = 300     # seconds per standard workflow
normalized_cost = (total_tokens/token_budget x 0.5) + (total_time/time_budget x 0.5)
efficiency = 1.0 - min(normalized_cost, 1.0)

FITNESS = test_pass_rate x 0.50 + quality_gates_rate x 0.25 + efficiency x 0.25
```
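
The five binary gates above can be folded into a rate as in the following sketch. The function name `quality_gates` and its parameters are illustrative assumptions; the gate names follow the list in the pseudocode.

```python
from typing import Dict, Tuple


def quality_gates(
    build_ok: bool,
    lint_ok: bool,
    types_ok: bool,
    failed_tests: int,
    coverage_pct: float,
    min_coverage_pct: float = 80.0,
) -> Tuple[Dict[str, bool], float]:
    """Return each gate's pass/fail status and the fraction of gates passed."""
    gates = {
        "build": build_ok,
        "lint": lint_ok,
        "types": types_ok,
        "tests": failed_tests == 0,
        "coverage": coverage_pct >= min_coverage_pct,
    }
    # True counts as 1 and False as 0, so the sum is the number of passed gates.
    rate = sum(gates.values()) / len(gates)
    return gates, rate
```

A run with two failing tests but a clean build, lint, type check, and sufficient coverage passes four of five gates, giving a rate of 0.8.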

### Step 4: Produce Report

```json
{
  "workflow_id": "wf-<issue_number>-<timestamp>",
  "fitness": 0.82,
  "breakdown": {
    "test_pass_rate": 0.95,
    "quality_gates_rate": 0.80,
    "efficiency_score": 0.65
  },
  "tests": {
    "total": 47,
    "passed": 45,
    "failed": 2,
    "skipped": 0,
    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
  },
  "quality_gates": {
    "build": true,
    "lint": true,
    "types": true,
    "tests_clean": false,
    "coverage_80": true
  },
  "cost": {
    "total_tokens": 38400,
    "total_time_ms": 245000,
    "per_agent": [
      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
    ]
  },
  "iterations": {
    "code_review_loop": 2,
    "security_review_loop": 1
  },
  "verdict": "PASS",
  "bottleneck_agent": "lead-developer",
  "most_expensive_agent": "lead-developer",
  "improvement_trigger": false
}
```

### Step 5: Trigger Evolution (if needed)

```
IF fitness < 0.70:
  -> Task(subagent_type: "prompt-optimizer", payload: report)
  -> improvement_trigger = true

IF any agent consumed > 30% of total tokens:
  -> Flag as bottleneck
  -> Suggest model downgrade or prompt compression

IF iterations > 2 in any loop:
  -> Flag evaluator-optimizer convergence issue
  -> Suggest prompt refinement for the evaluator agent
```
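
The three trigger rules above could be evaluated together as in this sketch. The function `evolution_flags`, its parameter names, and the shape of the returned dictionary are assumptions for illustration; only the thresholds (0.70, 30%, 2 iterations) come from the rules themselves.

```python
from typing import Dict, List, Union


def evolution_flags(
    fitness: float,
    per_agent_tokens: Dict[str, int],
    loop_iterations: Dict[str, int],
    fitness_threshold: float = 0.70,
    token_share_limit: float = 0.30,
    max_loop_iterations: int = 2,
) -> Dict[str, Union[bool, List[str]]]:
    """Evaluate the three trigger rules and report which ones fired."""
    total_tokens = sum(per_agent_tokens.values())
    # An agent is a bottleneck when it used more than the allowed share of all tokens.
    bottlenecks = [
        agent
        for agent, tokens in per_agent_tokens.items()
        if total_tokens and tokens / total_tokens > token_share_limit
    ]
    # A loop has a convergence issue when it needed more iterations than allowed.
    slow_loops = [
        name for name, count in loop_iterations.items() if count > max_loop_iterations
    ]
    return {
        "improvement_trigger": fitness < fitness_threshold,
        "bottleneck_agents": bottlenecks,
        "convergence_issues": slow_loops,
    }
```

With a hypothetical split of 38,400 tokens in which `lead-developer` used 12,000 (about 31%), only that agent is flagged, and a fitness of 0.82 does not set `improvement_trigger`.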

## Output Format

```
## Pipeline Judgment: Issue #<N>

**Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Failed tests:** auth.test.ts:42, api.test.ts:108
**Failed gates:** tests_clean

@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
```

## Workflow-Specific Budgets

| Workflow | Token Budget | Time Budget (s) | Min Coverage |
|----------|--------------|-----------------|--------------|
| feature | 50000 | 300 | 80% |
| bugfix | 20000 | 120 | 90% |
| refactor | 40000 | 240 | 95% |
| security | 30000 | 180 | 80% |
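
To show how the per-workflow budgets feed into the efficiency term, here is a small sketch. The `Budget` dataclass, the `BUDGETS` mapping, and the fallback to the `feature` budget for unknown workflow types are assumptions for this example; the numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    tokens: int
    time_s: int
    min_coverage: int  # percent


# Values copied from the budget table above.
BUDGETS = {
    "feature": Budget(tokens=50_000, time_s=300, min_coverage=80),
    "bugfix": Budget(tokens=20_000, time_s=120, min_coverage=90),
    "refactor": Budget(tokens=40_000, time_s=240, min_coverage=95),
    "security": Budget(tokens=30_000, time_s=180, min_coverage=80),
}

# Assumption: unknown workflow types fall back to the "feature" budget,
# which matches the 50000-token / 300-second defaults.
DEFAULT_BUDGET = BUDGETS["feature"]


def efficiency_for(workflow: str, actual_tokens: float, actual_time_s: float) -> float:
    """Efficiency score in [0, 1] for a run, relative to its workflow budget."""
    budget = BUDGETS.get(workflow, DEFAULT_BUDGET)
    normalized_cost = 0.5 * actual_tokens / budget.tokens + 0.5 * actual_time_s / budget.time_s
    return 1.0 - min(max(normalized_cost, 0.0), 1.0)
```

The same absolute cost scores differently per workflow: 10,000 tokens in 60 seconds is half of the bugfix budget (efficiency 0.5) but only a fifth of the feature budget (efficiency 0.8).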

## Prohibited Actions

- DO NOT write or modify any code
- DO NOT subjectively rate "quality" — only measure
- DO NOT skip running actual tests
- DO NOT estimate token counts — read from logs
- DO NOT change agent prompts — only flag for prompt-optimizer

## Gitea Commenting (MANDATORY)

**You MUST post a comment to the Gitea issue after completing your work.**

Post a comment with:
1. Fitness score with breakdown
2. Bottleneck identification
3. Improvement triggers (if any)

Use the `post_comment` function from `.kilo/skills/gitea-commenting/SKILL.md`.

**NO EXCEPTIONS** - Always comment to Gitea.
@@ -521,6 +521,26 @@ agents:
     model: ollama-cloud/nemotron-3-super
     mode: subagent
 
+  pipeline-judge:
+    capabilities:
+      - test_execution
+      - fitness_scoring
+      - metric_collection
+      - bottleneck_detection
+    receives:
+      - completed_workflow
+      - pipeline_logs
+    produces:
+      - fitness_report
+      - bottleneck_analysis
+      - improvement_triggers
+    forbidden:
+      - code_writing
+      - code_changes
+      - prompt_changes
+    model: ollama-cloud/nemotron-3-super
+    mode: subagent
+
 # Capability Routing Map
 capability_routing:
   code_writing: lead-developer
@@ -559,6 +579,10 @@ agents:
   memory_retrieval: memory-manager
   chain_of_thought: planner
   tree_of_thoughts: planner
+  # Fitness & Evolution
+  fitness_scoring: pipeline-judge
+  test_execution: pipeline-judge
+  bottleneck_detection: pipeline-judge
   # Go Development
   go_api_development: go-developer
   go_database_design: go-developer
@@ -597,6 +621,13 @@ iteration_loops:
     max_iterations: 2
     convergence: all_perf_issues_resolved
 
+  # Evolution loop for continuous improvement
+  evolution:
+    evaluator: pipeline-judge
+    optimizer: prompt-optimizer
+    max_iterations: 3
+    convergence: fitness_above_0.85
+
 # Quality Gates
 quality_gates:
   requirements:
@@ -647,4 +678,33 @@ workflow_states:
   perf_check: [security_check]
   security_check: [releasing]
   releasing: [evaluated]
-  evaluated: [completed]
+  evaluated: [evolving, completed]
+  evolving: [evaluated]
+  completed: []
+
+# Evolution Configuration
+evolution:
+  enabled: true
+  auto_trigger: true          # trigger after every workflow
+  fitness_threshold: 0.70     # below this → auto-optimize
+  max_evolution_attempts: 3   # max retries per cycle
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  token_budget_default: 50000
+  time_budget_default: 300
+  budgets:
+    feature:
+      tokens: 50000
+      time_s: 300
+      min_coverage: 80
+    bugfix:
+      tokens: 20000
+      time_s: 120
+      min_coverage: 90
+    refactor:
+      tokens: 40000
+      time_s: 240
+      min_coverage: 95
+    security:
+      tokens: 30000
+      time_s: 180
+      min_coverage: 80
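
The `workflow_states` change in the hunk above adds an `evaluated → evolving → evaluated` cycle. The sketch below models only the states visible in this hunk. The `next_state` rule, which combines `fitness_threshold` and `max_evolution_attempts`, is an assumption about how the two settings might interact, not something the configuration states.

```python
from typing import Dict, List

# Transitions visible in the hunk above; earlier states are omitted.
TRANSITIONS: Dict[str, List[str]] = {
    "perf_check": ["security_check"],
    "security_check": ["releasing"],
    "releasing": ["evaluated"],
    "evaluated": ["evolving", "completed"],
    "evolving": ["evaluated"],
    "completed": [],
}


def can_transition(current: str, target: str) -> bool:
    """True when the state machine allows moving from current to target."""
    return target in TRANSITIONS.get(current, [])


def next_state(
    fitness: float,
    attempts: int,
    fitness_threshold: float = 0.70,
    max_evolution_attempts: int = 3,
) -> str:
    """Pick the successor of 'evaluated' (assumed rule, see lead-in)."""
    if fitness < fitness_threshold and attempts < max_evolution_attempts:
        return "evolving"
    return "completed"
```

Capping the attempts keeps the `evaluated`/`evolving` cycle from looping indefinitely when optimization never lifts the score above the threshold.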
@@ -1,163 +1,167 @@
-# Agent Evolution Workflow
-
-Tracks and records agent model improvements, capability changes, and performance metrics.
+---
+description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
+---
+
+# /evolution — Pipeline Evolution Command
+
+Runs the automated evolution cycle on the most recent (or specified) workflow.
 
 ## Usage
 
 ```
-/evolution [action] [agent]
+/evolution                      # evolve last completed workflow
+/evolution --issue 42           # evolve workflow for issue #42
+/evolution --agent planner      # focus evolution on one agent
+/evolution --dry-run            # show what would change without applying
+/evolution --history            # print fitness trend chart
+/evolution --fitness            # run fitness evaluation (alias for /evolve)
 ```
 
-### Actions
-
-| Action | Description |
-|--------|-------------|
-| `log` | Log an agent improvement to Gitea and evolution data |
-| `report` | Generate evolution report for agent or all agents |
-| `history` | Show model change history |
-| `metrics` | Display performance metrics |
-| `recommend` | Get model recommendations |
-
-### Examples
+## Aliases
+
+- `/evolve` — same as `/evolution --fitness`
+- `/evolution log` — log agent model change to Gitea
+
+## Execution
+
+### Step 1: Judge (Fitness Evaluation)
+
+```bash
+Task(subagent_type: "pipeline-judge")
+  → produces fitness report
+```
+
+### Step 2: Decide (Threshold Routing)
+
+```
+IF fitness >= 0.85:
+  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
+  append to fitness-history.jsonl
+  EXIT
+
+IF fitness >= 0.70:
+  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
+  identify agents with lowest per-agent scores
+  Task(subagent_type: "prompt-optimizer", target: weak_agents)
+
+IF fitness < 0.70:
+  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
+  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
+  IF fitness < 0.50:
+    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
+```
+
+### Step 3: Re-test (After Optimization)
+
+```
+Re-run the SAME workflow with updated prompts
+Task(subagent_type: "pipeline-judge") → fitness_after
+
+IF fitness_after > fitness_before:
+  commit prompt changes
+  echo "📈 Fitness improved: {before} → {after}"
+ELSE:
+  revert prompt changes
+  echo "📉 No improvement. Reverting."
+```
+
+### Step 4: Log
+
+Append to `.kilo/logs/fitness-history.jsonl`:
+
+```json
+{
+  "ts": "<now>",
+  "issue": <N>,
+  "workflow": "<type>",
+  "fitness_before": <score>,
+  "fitness_after": <score>,
+  "agents_optimized": ["planner", "requirement-refiner"],
+  "tokens_saved": <delta>,
+  "time_saved_ms": <delta>
+}
+```
+
+## Subcommands
+
+### `log` — Log Model Change
+
+Log an agent model improvement to Gitea and evolution data.
 
 ```bash
-# Log improvement
 /evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
-
-# Generate report
-/evolution report capability-analyst
-
-# Show all changes
-/evolution history
-
-# Get recommendations
+```
+
+Steps:
+1. Read current model from `.kilo/agents/{agent}.md`
+2. Get previous model from `agent-evolution/data/agent-versions.json`
+3. Calculate improvement (IF score, context window)
+4. Write to evolution data
+5. Post Gitea comment
+
+### `report` — Generate Evolution Report
+
+Generate comprehensive report for agent or all agents:
+
+```bash
+/evolution report                # all agents
+/evolution report planner        # specific agent
+```
+
+Output includes:
+- Total agents
+- Model changes this month
+- Average quality improvement
+- Recent changes table
+- Performance metrics
+- Model distribution
+- Recommendations
+
+### `history` — Show Fitness Trend
+
+Print fitness trend chart:
+
+```bash
+/evolution --history
+```
+
+Output:
+```
+Fitness Trend (Last 30 days):
+
+1.00 ┤
+0.90 ┤ ╭─╮ ╭──╮
+0.80 ┤ ╭─╯ ╰─╮ ╭─╯ ╰──╮
+0.70 ┤ ╭─╯ ╰─╯ ╰──╮
+0.60 ┤ │ ╰─╮
+0.50 ┼─┴───────────────────────────┴──
+Apr 1 Apr 8 Apr 15 Apr 22 Apr 29
+
+Avg fitness: 0.82
+Trend: ↑ improving
+```
+
+### `recommend` — Get Model Recommendations
+
+```bash
 /evolution recommend
 ```
 
-## Workflow Steps
-
-### Step 1: Parse Command
-
-```bash
-action=$1
-agent=$2
-message=$3
-```
-
-### Step 2: Execute Action
-
-#### Log Action
-
-When logging an improvement:
-
-1. **Read current model**
-```bash
-# From .kilo/agents/{agent}.md
-current_model=$(grep "^model:" .kilo/agents/${agent}.md | cut -d' ' -f2)
-
-# From .kilo/capability-index.yaml
-yaml_model=$(grep -A1 "${agent}:" .kilo/capability-index.yaml | grep "model:" | cut -d' ' -f2)
-```
-
-2. **Get previous model from history**
-```bash
-# Read from agent-evolution/data/agent-versions.json
-previous_model=$(cat agent-evolution/data/agent-versions.json | ...)
-```
-
-3. **Calculate improvement**
-- Look up model scores from capability-index.yaml
-- Compare IF scores
-- Compare context windows
-
-4. **Write to evolution data**
-```json
-{
-  "agent": "capability-analyst",
-  "timestamp": "2026-04-05T22:20:00Z",
-  "type": "model_change",
-  "from": "ollama-cloud/nemotron-3-super",
-  "to": "qwen/qwen3.6-plus:free",
-  "improvement": {
-    "quality": "+23%",
-    "context_window": "130K→1M",
-    "if_score": "85→90"
-  },
-  "rationale": "Better structured output, FREE via OpenRouter"
-}
-```
-
-5. **Post Gitea comment**
-```markdown
-## 🚀 Agent Evolution: {agent}
-
-| Metric | Before | After | Change |
-|--------|--------|-------|--------|
-| Model | {old} | {new} | ⬆️ |
-| IF Score | 85 | 90 | +5 |
-| Quality | 64 | 79 | +23% |
-| Context | 130K | 1M | +670K |
-
-**Rationale**: {message}
-```
-
-#### Report Action
-
-Generate comprehensive report:
-
-```markdown
-# Agent Evolution Report
-
-## Overview
-
-- Total agents: 28
-- Model changes this month: 4
-- Average quality improvement: +18%
-
-## Recent Changes
-
-| Date | Agent | Old Model | New Model | Impact |
-|------|-------|-----------|-----------|--------|
-| 2026-04-05 | capability-analyst | nemotron-3-super | qwen3.6-plus | +23% |
-| 2026-04-05 | requirement-refiner | nemotron-3-super | glm-5 | +33% |
-| ... | ... | ... | ... | ... |
-
-## Performance Metrics
-
-### Agent Scores Over Time
-
-```
-capability-analyst: 64 → 79 (+23%)
-requirement-refiner: 60 → 80 (+33%)
-agent-architect: 67 → 82 (+22%)
-evaluator: 78 → 81 (+4%)
-```
-
-### Model Distribution
-
-- qwen3.6-plus: 5 agents
-- nemotron-3-super: 8 agents
-- glm-5: 3 agents
-- minimax-m2.5: 1 agent
-- ...
-
-## Recommendations
-
-1. Consider updating history-miner to nemotron-3-super-120b
-2. code-skeptic optimal with minimax-m2.5
-3. ...
-```
-
-### Step 3: Update Files
-
-After logging:
-
-1. Update `agent-evolution/data/agent-versions.json`
-2. Post comment to related Gitea issue
-3. Update capability-index.yaml metrics
+Shows:
+- Agents with fitness < 0.70 (need optimization)
+- Agents consuming > 30% of token budget (bottlenecks)
+- Model upgrade recommendations
+- Priority order
 
 ## Data Storage
 
+### fitness-history.jsonl
+
+```jsonl
+{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
+{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
+```
+
 ### agent-versions.json
 
 ```json
@@ -186,22 +190,6 @@ After logging:
 }
 ```
 
-### Gitea Issue Comments
-
-Each evolution log posts a formatted comment:
-
-```markdown
-## 🚀 Agent Evolution Log
-
-### {agent}
-- **Model**: {old} → {new}
-- **Quality**: {old_score} → {new_score} ({change}%)
-- **Context**: {old_ctx} → {new_ctx}
-- **Rationale**: {reason}
-
-_This change was tracked by /evolution workflow._
-```
-
 ## Integration Points
 
 - **After `/pipeline`**: Evaluator scores logged
@@ -209,29 +197,52 @@ _This change was tracked by /evolution workflow._
 - **Weekly**: Performance report generated
 - **On request**: Recommendations provided
 
+## Configuration
+
+```yaml
+# In capability-index.yaml
+evolution:
+  enabled: true
+  auto_trigger: true          # trigger after every workflow
+  fitness_threshold: 0.70     # below this → auto-optimize
+  max_evolution_attempts: 3   # max retries per cycle
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  token_budget_default: 50000
+  time_budget_default: 300
+```
+
 ## Metrics Tracked
 
 | Metric | Source | Purpose |
 |--------|--------|---------|
-| IF Score | KILO_SPEC.md | Instruction Following |
-| Quality Score | Research | Overall performance |
-| Context Window | Model spec | Max tokens |
-| Provider | Config | API endpoint |
-| Cost | Pricing | Resource planning |
-| SWE-bench | Research | Code benchmark |
-| RULER | Research | Long-context benchmark |
+| Fitness Score | pipeline-judge | Overall pipeline health |
+| Test Pass Rate | bun test | Code quality |
+| Quality Gates | build/lint/typecheck | Standards compliance |
+| Token Cost | pipeline logs | Resource efficiency |
+| Wall-Clock Time | pipeline logs | Speed |
+| Agent ROI | history analysis | Cost/benefit |
 
 ## Example Session
 
 ```bash
-$ /evolution log capability-analyst "Updated to qwen3.6-plus for FREE tier and better IF"
+$ /evolution
 
-✅ Logged evolution for capability-analyst
-📊 Quality improvement: +23%
-📄 Posted comment to Issue #27
-📝 Updated agent-versions.json
+## Pipeline Judgment: Issue #42
+
+**Fitness: 0.82/1.00** [PASS]
+
+| Metric | Value | Weight | Contribution |
+|--------|-------|--------|-------------|
+| Tests | 95% (45/47) | 50% | 0.475 |
+| Gates | 80% (4/5) | 25% | 0.200 |
+| Cost | 38.4K tok / 245s | 25% | 0.163 |
+
+**Bottleneck:** lead-developer (31% of tokens)
+**Verdict:** PASS - within acceptable range
+
+✅ Logged to .kilo/logs/fitness-history.jsonl
 ```
 
 ---
 
-_Evolution workflow v1.0 - Track agent improvements_
+*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*
1
.kilo/logs/fitness-history.jsonl
Normal file
@@ -0,0 +1 @@
{"ts":"2026-04-04T02:30:00Z","issue":5,"workflow":"feature","fitness":0.85,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.78},"tokens":38400,"time_ms":245000,"tests_passed":9,"tests_total":10,"agents":["requirement-refiner","history-miner","system-analyst","sdet-engineer","lead-developer"],"verdict":"PASS"}
259
.kilo/workflows/fitness-evaluation.md
Normal file
@@ -0,0 +1,259 @@
# Fitness Evaluation Workflow

Post-workflow fitness evaluation and automatic optimization loop.

## Overview

This workflow runs after every completed workflow to:
1. Evaluate fitness objectively via `pipeline-judge`
2. Trigger optimization if fitness < threshold
3. Re-run and compare before/after
4. Log results to fitness-history.jsonl

## Flow

```
[Workflow Completes]
        ↓
[@pipeline-judge] ← runs tests, measures tokens/time
        ↓
   fitness score
        ↓
┌──────────────────────────────────┐
│ fitness >= 0.85                  │──→ Log + done (no action)
│ fitness 0.70 - 0.84              │──→ [@prompt-optimizer] minor tuning
│ fitness < 0.70                   │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50                   │──→ [@agent-architect] redesign agent
└──────────────────────────────────┘
        ↓
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
compare fitness_before vs fitness_after
        ↓
┌──────────────────────────────────┐
│ improved?                        │
│   Yes → commit new prompts       │
│   No  → revert, try              │
│         different strategy       │
│         (max 3 attempts)         │
└──────────────────────────────────┘
```

## Fitness Score Formula

```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate     = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates
  efficiency_score   = 1.0 - clamp(normalized_cost, 0, 1)
  normalized_cost    = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
```

## Quality Gates

Each gate is binary (pass/fail):

| Gate | Command | Weight |
|------|---------|--------|
| build | `bun run build` | 1/5 |
| lint | `bun run lint` | 1/5 |
| types | `bun run typecheck` | 1/5 |
| tests | `bun test` | 1/5 |
| coverage | `bun test --coverage` (coverage >= 80%) | 1/5 |

## Budget Defaults

| Workflow | Token Budget | Time Budget (s) | Min Coverage |
|----------|--------------|-----------------|--------------|
| feature | 50000 | 300 | 80% |
| bugfix | 20000 | 120 | 90% |
| refactor | 40000 | 240 | 95% |
| security | 30000 | 180 | 80% |

## Workflow-Specific Benchmarks

```yaml
benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3

  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix - must prove fix works
    max_iterations: 2

  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2

  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
```

## Execution Steps

### Step 1: Collect Metrics

Agent: `pipeline-judge`

```bash
# Run test suite (stderr goes to a separate log so the JSON report stays parseable)
bun test --reporter=json > /tmp/test-results.json 2> /tmp/test-stderr.log

# Count results
TOTAL=$(jq '.numTotalTests' /tmp/test-results.json)
PASSED=$(jq '.numPassedTests' /tmp/test-results.json)
FAILED=$(jq '.numFailedTests' /tmp/test-results.json)

# Check quality gates
bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
```

### Step 2: Read Pipeline Log

Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked

### Step 3: Calculate Fitness

```
test_pass_rate = PASSED / TOTAL
quality_gates_rate = (BUILD_OK + LINT_OK + TYPES_OK + TESTS_CLEAN + COVERAGE_OK) / 5
efficiency = 1.0 - min((tokens/50000 + time/300) / 2, 1.0)

FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
```
|
||||||
|
|
||||||
|
### Step 4: Decide Action
|
||||||
|
|
||||||
|
| Fitness | Action |
|
||||||
|
|---------|--------|
|
||||||
|
| >= 0.85 | Log to fitness-history.jsonl, done |
|
||||||
|
| 0.70-0.84 | Call `prompt-optimizer` for minor tuning |
|
||||||
|
| 0.50-0.69 | Call `prompt-optimizer` for major rewrite |
|
||||||
|
| < 0.50 | Call `agent-architect` to redesign agent |
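
The routing table above can be expressed as a small function. This is a sketch under assumptions: the action names are hypothetical labels, and the bands are treated as half-open intervals so that a value such as 0.845 (between the 0.84 and 0.85 shown in the table) still maps to exactly one action.

```typescript
// Hypothetical action labels for the four rows of the table.
type EvolutionAction = "log_only" | "minor_tuning" | "major_rewrite" | "redesign";

function decideAction(fitness: number): EvolutionAction {
  if (fitness >= 0.85) return "log_only";      // log to fitness-history.jsonl, done
  if (fitness >= 0.70) return "minor_tuning";  // prompt-optimizer, minor tuning
  if (fitness >= 0.50) return "major_rewrite"; // prompt-optimizer, major rewrite
  return "redesign";                           // agent-architect redesign
}
```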
|
||||||
|
|
||||||
|
### Step 5: Re-test After Optimization
|
||||||
|
|
||||||
|
If optimization was triggered:
|
||||||
|
1. Re-run the same workflow with new prompts
|
||||||
|
2. Call `pipeline-judge` again
|
||||||
|
3. Compare fitness_before vs fitness_after
|
||||||
|
4. If improved: commit prompts
|
||||||
|
5. If not improved: revert
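
The re-test loop above might look like the sketch below. The `runCandidate` callback is hypothetical: it stands in for "re-run the workflow with new prompts and call `pipeline-judge`". The default of three attempts mirrors the `max_evolution_attempts: 3` setting found in the evolution config elsewhere in this change.

```typescript
// Hypothetical callback: re-runs the workflow with a candidate prompt and returns its fitness.
type RunCandidate = (attempt: number) => number;

interface EvolutionResult {
  committed: boolean; // true if a candidate beat the baseline
  fitness: number;    // fitness that is in effect after the loop
  attempts: number;   // number of candidates that were tried
}

function evolveWithRetries(
  fitnessBefore: number,
  runCandidate: RunCandidate,
  maxAttempts = 3,
): EvolutionResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fitnessAfter = runCandidate(attempt);
    if (fitnessAfter > fitnessBefore) {
      // Improved: keep the new prompts.
      return { committed: true, fitness: fitnessAfter, attempts: attempt };
    }
    // Not improved: the candidate is discarded (revert) before the next attempt.
  }
  return { committed: false, fitness: fitnessBefore, attempts: maxAttempts };
}
```

Requiring a strict improvement means a candidate that merely ties the baseline is reverted, which keeps prompt churn down.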
|
||||||
|
|
||||||
|
### Step 6: Log Results
|
||||||
|
|
||||||
|
Append to `.kilo/logs/fitness-history.jsonl`:
|
||||||
|
|
||||||
|
```jsonl
|
||||||
|
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
|
||||||
|
```
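
A sketch of how one log record could be serialized. The `FitnessEntry` fields follow the sample record above; the helper names are hypothetical. The code works on strings only, so writing the result to `.kilo/logs/fitness-history.jsonl` is left to whatever file API the runtime provides.

```typescript
// Fields mirror the sample record shown above.
interface FitnessEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

// One compact JSON object per line, terminated by a newline.
function toJsonlLine(entry: FitnessEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Returns the new log content; guards against a missing trailing newline in the existing log.
function appendEntry(existingLog: string, entry: FitnessEntry): string {
  const base =
    existingLog === "" || existingLog.endsWith("\n") ? existingLog : existingLog + "\n";
  return base + toJsonlLine(entry);
}
```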
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Automatic (post-pipeline)
|
||||||
|
|
||||||
|
Fitness evaluation runs automatically after any workflow completes.
|
||||||
|
|
||||||
|
### Manual
|
||||||
|
|
||||||
|
```bash
|
||||||
|
/evolve # evolve last completed workflow
|
||||||
|
/evolve --issue 42 # evolve workflow for issue #42
|
||||||
|
/evolve --agent planner # focus evolution on one agent
|
||||||
|
/evolve --dry-run # show what would change without applying
|
||||||
|
/evolve --history # print fitness trend chart
|
||||||
|
```
|
||||||
|
|
||||||
|
## Integration Points
|
||||||
|
|
||||||
|
- **After `/pipeline`**: pipeline-judge scores the workflow
|
||||||
|
- **After prompt update**: evolution loop retries
|
||||||
|
- **Weekly**: Performance trend analysis
|
||||||
|
- **On request**: Recommendation generation
|
||||||
|
|
||||||
|
## Orchestrator Learning
|
||||||
|
|
||||||
|
The orchestrator uses fitness history to optimize future pipeline construction:
|
||||||
|
|
||||||
|
### Pipeline Selection Strategy
|
||||||
|
|
||||||
|
```
|
||||||
|
For each new issue:
|
||||||
|
1. Classify issue type (feature|bugfix|refactor|api|security)
|
||||||
|
2. Look up fitness history for same type
|
||||||
|
3. Find pipeline configuration with highest fitness
|
||||||
|
4. Use that as template, but adapt to current issue
|
||||||
|
5. Skip agents that consistently score 0 contribution
|
||||||
|
```
|
||||||
|
|
||||||
|
### Agent Ordering Optimization
|
||||||
|
|
||||||
|
```
|
||||||
|
From fitness-history.jsonl, extract per-agent metrics:
|
||||||
|
- avg tokens consumed
|
||||||
|
- avg contribution to fitness
|
||||||
|
- failure rate (how often this agent's output causes downstream failures)
|
||||||
|
|
||||||
|
agents_by_roi = sort(agents, key=contribution/tokens, descending)
|
||||||
|
|
||||||
|
For parallel phases:
|
||||||
|
- Run high-ROI agents first
|
||||||
|
- Skip agents with ROI < 0.1 (cost more than they contribute)
|
||||||
|
```
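
The ROI ranking above can be sketched as follows. The source does not define the units of ROI, so this example assumes contribution divided by tokens expressed in thousands (the `tokensUnit` parameter), which is what makes a 0.1 cutoff meaningful here. The `AgentStats` shape is also an assumption.

```typescript
// Hypothetical per-agent aggregates extracted from the fitness history.
interface AgentStats {
  name: string;
  avgTokens: number;
  avgContribution: number;
}

function rankByRoi(
  agents: AgentStats[],
  minRoi = 0.1,
  tokensUnit = 1000, // assumption: ROI is measured per 1000 tokens
): { run: string[]; skip: string[] } {
  const scored = agents.map((a) => ({
    name: a.name,
    roi: a.avgTokens > 0 ? a.avgContribution / (a.avgTokens / tokensUnit) : 0,
  }));
  // Highest ROI first.
  scored.sort((x, y) => y.roi - x.roi);
  return {
    run: scored.filter((s) => s.roi >= minRoi).map((s) => s.name),
    skip: scored.filter((s) => s.roi < minRoi).map((s) => s.name),
  };
}
```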
|
||||||
|
|
||||||
|
### Token Budget Allocation
|
||||||
|
|
||||||
|
```
|
||||||
|
total_budget = 50000 tokens (configurable)
|
||||||
|
|
||||||
|
For each agent in pipeline:
|
||||||
|
agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
|
||||||
|
|
||||||
|
If agent exceeds budget by >50%:
|
||||||
|
→ prompt-optimizer compresses that agent's prompt
|
||||||
|
→ or swap to a smaller/faster model
|
||||||
|
```
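
A minimal sketch of the proportional allocation above. The function names are hypothetical, budgets are rounded down to whole tokens by choice, and "exceeds budget by >50%" is read as usage above 1.5 times the budget.

```typescript
// Splits the total budget in proportion to each agent's average contribution.
function allocateBudgets(
  contributions: Record<string, number>,
  totalBudget = 50000,
): Record<string, number> {
  const sum = Object.values(contributions).reduce((acc, c) => acc + c, 0);
  const budgets: Record<string, number> = {};
  for (const [agent, contribution] of Object.entries(contributions)) {
    // With no recorded contribution at all, every agent gets 0 (assumption).
    budgets[agent] = sum > 0 ? Math.floor(totalBudget * (contribution / sum)) : 0;
  }
  return budgets;
}

// True when usage is more than 50% over the allocated budget.
function exceedsBudget(used: number, budget: number): boolean {
  return used > budget * 1.5;
}
```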
|
||||||
|
|
||||||
|
## Prompt Evolution Protocol
|
||||||
|
|
||||||
|
When prompt-optimizer is triggered:
|
||||||
|
|
||||||
|
1. Read current agent prompt from `.kilo/agents/<agent>.md`
|
||||||
|
2. Read fitness report identifying the problem
|
||||||
|
3. Read last 5 fitness entries for this agent from history
|
||||||
|
4. Analyze pattern:
|
||||||
|
- IF consistently low → systemic prompt issue
|
||||||
|
- IF regression after change → revert
|
||||||
|
- IF one-time failure → might be task-specific, no action
|
||||||
|
5. Generate improved prompt:
|
||||||
|
- Keep same structure (description, mode, model, permissions)
|
||||||
|
- Modify ONLY the instruction body
|
||||||
|
- Add explicit output format IF format was the issue
|
||||||
|
- Add few-shot examples IF quality was the issue
|
||||||
|
- Compress verbose sections IF tokens were the issue
|
||||||
|
6. Save to `.kilo/agents/<agent>.md.candidate`
|
||||||
|
7. Re-run workflow with .candidate prompt
|
||||||
|
8. `@pipeline-judge` scores again
|
||||||
|
9. IF fitness_new > fitness_old: mv .candidate → .md (commit)
|
||||||
|
ELSE: rm .candidate (revert)
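
Step 4 of the protocol (pattern analysis) could be approximated as below. This is only a sketch: the 0.70 threshold is taken from the optimizer trigger, while the `promptChangedBeforeLast` flag and the pattern labels are hypothetical, since the source does not say how a prompt change is detected.

```typescript
type Pattern = "systemic" | "regression" | "one_time" | "healthy";

// `recent` holds the last few fitness values for one agent, oldest first.
function classifyPattern(
  recent: number[],
  promptChangedBeforeLast: boolean, // hypothetical flag: was the prompt edited before the latest run?
  threshold = 0.7,
): Pattern {
  if (recent.length === 0) return "healthy";
  const last = recent[recent.length - 1];
  if (last >= threshold) return "healthy";

  const earlier = recent.slice(0, -1);
  const allEarlierLow = earlier.length > 0 && earlier.every((f) => f < threshold);
  if (allEarlierLow) return "systemic";            // consistently low: systemic prompt issue
  if (promptChangedBeforeLast) return "regression"; // dropped right after a change: revert
  return "one_time";                                // isolated failure: no action
}
```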
|
||||||
71
AGENTS.md
@@ -17,12 +17,15 @@ Agent: Runs full pipeline for issue #42 with Gitea logging
|
|||||||
|---------|-------------|-------|
|
|---------|-------------|-------|
|
||||||
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
|
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
|
||||||
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
|
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
|
||||||
|
| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
|
||||||
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
|
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
|
||||||
| `/plan` | Creates detailed task plans | `/plan feature X` |
|
| `/plan` | Creates detailed task plans | `/plan feature X` |
|
||||||
| `/ask` | Answers codebase questions | `/ask how does auth work` |
|
| `/ask` | Answers codebase questions | `/ask how does auth work` |
|
||||||
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
|
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
|
||||||
| `/code` | Quick code generation | `/code add validation` |
|
| `/code` | Quick code generation | `/code add validation` |
|
||||||
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
|
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
|
||||||
|
| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
|
||||||
|
| `/evolution report` | Generate evolution report | `/evolution report` |
|
||||||
|
|
||||||
## Pipeline Agents (Subagents)
|
## Pipeline Agents (Subagents)
|
||||||
|
|
||||||
@@ -62,7 +65,8 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
|
|||||||
|-------|------|--------------|
|
|-------|------|--------------|
|
||||||
| `@release-manager` | Git operations | Status: releasing |
|
| `@release-manager` | Git operations | Status: releasing |
|
||||||
| `@evaluator` | Scores effectiveness | Status: evaluated |
|
| `@evaluator` | Scores effectiveness | Status: evaluated |
|
||||||
| `@prompt-optimizer` | Improves prompts | When score < 7 |
|
| `@pipeline-judge` | Objective fitness scoring | After workflow completes |
|
||||||
|
| `@prompt-optimizer` | Improves prompts | When fitness < 0.70 |
|
||||||
| `@capability-analyst` | Analyzes task coverage | When starting new task |
|
| `@capability-analyst` | Analyzes task coverage | When starting new task |
|
||||||
| `@agent-architect` | Creates new agents | When gaps identified |
|
| `@agent-architect` | Creates new agents | When gaps identified |
|
||||||
| `@workflow-architect` | Creates workflows | New workflow needed |
|
| `@workflow-architect` | Creates workflows | New workflow needed |
|
||||||
@@ -94,9 +98,27 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
|
|||||||
[releasing]
|
[releasing]
|
||||||
↓ @release-manager
|
↓ @release-manager
|
||||||
[evaluated]
|
[evaluated]
|
||||||
↓ @evaluator
|
↓ @evaluator (subjective score 1-10)
|
||||||
├── [score ≥ 7] → [completed]
|
├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
|
||||||
└── [score < 7] → @prompt-optimizer → [completed]
|
└── [score < 7] → @prompt-optimizer → [evaluated]
|
||||||
|
↓
|
||||||
|
[@pipeline-judge] ← runs tests, measures tokens/time
|
||||||
|
↓
|
||||||
|
fitness score
|
||||||
|
↓
|
||||||
|
┌──────────────────────────────────────┐
|
||||||
|
│ fitness >= 0.85 │──→ [completed]
|
||||||
|
│ fitness 0.70-0.84 │──→ @prompt-optimizer → [evolving]
|
||||||
|
│ fitness 0.50-0.69 │──→ @prompt-optimizer (major) → [evolving]
|
||||||
|
│ fitness < 0.50 │──→ @agent-architect → redesign
|
||||||
|
└──────────────────────────────────────┘
|
||||||
|
↓
|
||||||
|
[evolving] → re-run workflow → [@pipeline-judge]
|
||||||
|
↓
|
||||||
|
compare fitness_before vs fitness_after
|
||||||
|
↓
|
||||||
|
[improved?] → commit prompts → [completed]
|
||||||
|
└─ [not improved?] → revert → try different strategy
|
||||||
```
|
```
|
||||||
|
|
||||||
## Capability Analysis Flow
|
## Capability Analysis Flow
|
||||||
@@ -167,6 +189,14 @@ Scores saved to `.kilo/logs/efficiency_score.json`:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Fitness Tracking
|
||||||
|
|
||||||
|
Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
|
||||||
|
```jsonl
|
||||||
|
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
|
||||||
|
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
|
||||||
|
```
|
||||||
|
|
||||||
## Manual Agent Invocation
|
## Manual Agent Invocation
|
||||||
|
|
||||||
```typescript
|
```typescript
|
||||||
@@ -192,11 +222,34 @@ GITEA_TOKEN=your-token-here
|
|||||||
## Self-Improvement Cycle
|
## Self-Improvement Cycle
|
||||||
|
|
||||||
1. **Pipeline runs** for each issue
|
1. **Pipeline runs** for each issue
|
||||||
2. **Evaluator scores** each agent (1-10)
|
2. **Evaluator scores** each agent (1-10) - subjective
|
||||||
3. **Low scores (<7)** trigger prompt-optimizer
|
3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
|
||||||
4. **Prompt optimizer** analyzes failures and improves prompts
|
4. **Low fitness (<0.70)** triggers prompt-optimizer
|
||||||
5. **New prompts** saved to `.kilo/agents/`
|
5. **Prompt optimizer** analyzes failures and improves prompts
|
||||||
6. **Next run** uses improved prompts
|
6. **Re-run workflow** with improved prompts
|
||||||
|
7. **Compare fitness** before/after - commit if improved
|
||||||
|
8. **Log results** to `.kilo/logs/fitness-history.jsonl`
|
||||||
|
|
||||||
|
### Evaluator vs Pipeline Judge
|
||||||
|
|
||||||
|
| Aspect | Evaluator | Pipeline Judge |
|
||||||
|
|--------|-----------|----------------|
|
||||||
|
| Type | Subjective | Objective |
|
||||||
|
| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
|
||||||
|
| Metrics | Observations | Tests, tokens, time |
|
||||||
|
| Trigger | After workflow | After evaluator |
|
||||||
|
| Action | Logs to Gitea | Triggers optimization |
|
||||||
|
|
||||||
|
### Fitness Score Components
|
||||||
|
|
||||||
|
```
|
||||||
|
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
|
||||||
|
|
||||||
|
where:
|
||||||
|
test_pass_rate = passed_tests / total_tests
|
||||||
|
quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
|
||||||
|
efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
|
||||||
|
```
|
||||||
|
|
||||||
## Architecture Files
|
## Architecture Files
|
||||||
|
|
||||||
|
|||||||
@@ -151,25 +151,314 @@ docker-compose -f docker-compose.evolution.yml up -d
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Direction Status
|
## NEW: Pipeline Fitness & Auto-Evolution Issues
|
||||||
|
|
||||||
**Current status:** `PAUSED` - on hold until the next sprint
|
### Issue 6: Pipeline Judge Agent - objective fitness scoring
|
||||||
|
|
||||||
**Reason for the pause:**
|
**Title:** Create the pipeline-judge agent for objective workflow evaluation
|
||||||
The base infrastructure is in place:
|
**Labels:** `agent`, `fitness`, `high-priority`
|
||||||
- ✅ `agent-evolution/` directory structure
|
**Milestone:** Agent Evolution Dashboard
|
||||||
- ✅ Data integrated into the HTML
|
|
||||||
- ✅ Sync scripts created
|
|
||||||
- ✅ Docker container configured
|
|
||||||
- ✅ Documentation written
|
|
||||||
|
|
||||||
**What remains:**
|
**Description:**
|
||||||
- 🔄 Issue #2: Gitea API integration (requires a backend)
|
Create a `pipeline-judge` agent that scores the quality of a completed workflow objectively, from metrics rather than subjective ratings.
|
||||||
- 🔄 Issue #3: Full synchronization (requires testing)
|
|
||||||
- 🔄 Issue #4: Extended documentation
|
|
||||||
|
|
||||||
**Work summary:**
|
**Difference from evaluator:**
|
||||||
A complete infrastructure for tracking the evolution of the agent system has been built. The dashboard runs standalone without a server and includes data on 28 agents and 8 models, plus optimization recommendations. The foundation for future Gitea integration is prepared.
|
- `evaluator`: subjective 1-10 scores based on observations
|
||||||
|
- `pipeline-judge`: objective metrics (tests, tokens, time, quality gates)
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- `.kilo/agents/pipeline-judge.md`: ✅ created
|
||||||
|
|
||||||
|
**Fitness Formula:**
|
||||||
|
```
|
||||||
|
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Metrics:**
|
||||||
|
- Test pass rate: passed/total tests
|
||||||
|
- Quality gates: build, lint, typecheck, tests_clean, coverage
|
||||||
|
- Efficiency: tokens and time relative to budgets
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [x] Agent created in `.kilo/agents/pipeline-judge.md`
|
||||||
|
- [ ] Added to `capability-index.yaml`
|
||||||
|
- [ ] Integrated into the workflow after pipeline completion
|
||||||
|
- [ ] Logs results to `.kilo/logs/fitness-history.jsonl`
|
||||||
|
- [ ] Triggers `prompt-optimizer` when fitness < 0.70
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Issue 7: Fitness History Logging - accumulating metrics
|
||||||
|
|
||||||
|
**Title:** Build a logging system for fitness metrics
|
||||||
|
**Labels:** `logging`, `metrics`, `high-priority`
|
||||||
|
**Milestone:** Agent Evolution Dashboard
|
||||||
|
|
||||||
|
**Description:**
|
||||||
|
Build a system that accumulates fitness metrics to track how the pipeline evolves over time.
|
||||||
|
|
||||||
|
**Log format (`.kilo/logs/fitness-history.jsonl`):**
|
||||||
|
```jsonl
|
||||||
|
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
|
||||||
|
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Actions:**
|
||||||
|
1. ✅ Create the `.kilo/logs/` directory if it does not exist
|
||||||
|
2. 🔄 Create `.kilo/logs/fitness-history.jsonl`
|
||||||
|
3. 🔄 Update `pipeline-judge.md` to write to the log
|
||||||
|
4. 🔄 Create the script `agent-evolution/scripts/sync-fitness-history.ts`
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [ ] File `.kilo/logs/fitness-history.jsonl` created
|
||||||
|
- [ ] pipeline-judge writes to the log after every workflow
|
||||||
|
- [ ] Sync script integrated into `sync:evolution`
|
||||||
|
- [ ] Dashboard displays fitness trends
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Issue 8: Evolution Workflow - automatic self-improvement
|
||||||
|
|
||||||
|
**Title:** Implement an evolution workflow for automatic optimization
|
||||||
|
**Labels:** `workflow`, `automation`, `high-priority`
|
||||||
|
**Milestone:** Agent Evolution Dashboard
|
||||||
|
|
||||||
|
**Description:**
|
||||||
|
Implement a continuous self-improvement loop for the pipeline, driven by fitness metrics.
|
||||||
|
|
||||||
|
**Workflow:**
|
||||||
|
```
|
||||||
|
[Workflow Completes]
|
||||||
|
↓
|
||||||
|
[pipeline-judge] → fitness score
|
||||||
|
↓
|
||||||
|
┌───────────────────────────┐
|
||||||
|
│ fitness >= 0.85 │──→ Log + done
|
||||||
|
│ fitness 0.70-0.84 │──→ [prompt-optimizer] minor tuning
|
||||||
|
│ fitness 0.50-0.69 │──→ [prompt-optimizer] major rewrite
|
||||||
|
│ fitness < 0.50 │──→ [agent-architect] redesign
|
||||||
|
└───────────────────────────┘
|
||||||
|
↓
|
||||||
|
[Re-run workflow with new prompts]
|
||||||
|
↓
|
||||||
|
[pipeline-judge] again
|
||||||
|
↓
|
||||||
|
[Compare before/after]
|
||||||
|
↓
|
||||||
|
[Commit or revert]
|
||||||
|
```
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- `.kilo/workflows/fitness-evaluation.md`: workflow documentation
|
||||||
|
- Update `capability-index.yaml`: add `iteration_loops.evolution`
|
||||||
|
|
||||||
|
**Configuration:**
|
||||||
|
```yaml
|
||||||
|
evolution:
|
||||||
|
enabled: true
|
||||||
|
auto_trigger: true
|
||||||
|
fitness_threshold: 0.70
|
||||||
|
max_evolution_attempts: 3
|
||||||
|
fitness_history: .kilo/logs/fitness-history.jsonl
|
||||||
|
budgets:
|
||||||
|
feature: {tokens: 50000, time_s: 300}
|
||||||
|
bugfix: {tokens: 20000, time_s: 120}
|
||||||
|
refactor: {tokens: 40000, time_s: 240}
|
||||||
|
security: {tokens: 30000, time_s: 180}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [ ] Workflow defined in `.kilo/workflows/`
|
||||||
|
- [ ] Integrated into the main pipeline
|
||||||
|
- [ ] Triggers prompt-optimizer automatically
|
||||||
|
- [ ] Compares before/after fitness
|
||||||
|
- [ ] Commits only improvements
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Issue 9: /evolve Command - manual evolution trigger
|
||||||
|
|
||||||
|
**Title:** Update the /evolve command to work with fitness
|
||||||
|
**Labels:** `command`, `cli`, `medium-priority`
|
||||||
|
**Milestone:** Agent Evolution Dashboard
|
||||||
|
|
||||||
|
**Description:**
|
||||||
|
Extend the existing `/evolution` command (model-change logging) into a full `/evolve` command with fitness analysis.
|
||||||
|
|
||||||
|
**Current `/evolution`:**
|
||||||
|
- Logs model changes
|
||||||
|
- Generates reports
|
||||||
|
|
||||||
|
**New `/evolve`:**
|
||||||
|
```bash
|
||||||
|
/evolve # evolve last completed workflow
|
||||||
|
/evolve --issue 42 # evolve workflow for issue #42
|
||||||
|
/evolve --agent planner # focus evolution on one agent
|
||||||
|
/evolve --dry-run # show what would change without applying
|
||||||
|
/evolve --history # print fitness trend chart
|
||||||
|
```
|
||||||
|
|
||||||
|
**Execution:**
|
||||||
|
1. Judge: `Task(subagent_type: "pipeline-judge")` → fitness report
|
||||||
|
2. Decide: threshold-based routing
|
||||||
|
3. Re-test: the same workflow with the updated prompts
|
||||||
|
4. Log: append to fitness-history.jsonl
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Update `.kilo/commands/evolution.md`: add fitness logic
|
||||||
|
- Create the alias `/evolve` → `/evolution --fitness`
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [ ] The `/evolve` command works with fitness
|
||||||
|
- [ ] Options `--issue`, `--agent`, `--dry-run`, `--history`
|
||||||
|
- [ ] Integrated with `pipeline-judge`
|
||||||
|
- [ ] Displays the fitness trend
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Issue 10: Update Capability Index - pipeline-judge integration
|
||||||
|
|
||||||
|
**Title:** Add pipeline-judge and the evolution configuration to capability-index.yaml
|
||||||
|
**Labels:** `config`, `integration`, `high-priority`
|
||||||
|
**Milestone:** Agent Evolution Dashboard
|
||||||
|
|
||||||
|
**Description:**
|
||||||
|
Update `capability-index.yaml` to support the new evolution workflow.
|
||||||
|
|
||||||
|
**Add:**
|
||||||
|
```yaml
|
||||||
|
agents:
|
||||||
|
pipeline-judge:
|
||||||
|
capabilities:
|
||||||
|
- test_execution
|
||||||
|
- fitness_scoring
|
||||||
|
- metric_collection
|
||||||
|
- bottleneck_detection
|
||||||
|
receives:
|
||||||
|
- completed_workflow
|
||||||
|
- pipeline_logs
|
||||||
|
produces:
|
||||||
|
- fitness_report
|
||||||
|
- bottleneck_analysis
|
||||||
|
- improvement_triggers
|
||||||
|
forbidden:
|
||||||
|
- code_writing
|
||||||
|
- code_changes
|
||||||
|
- prompt_changes
|
||||||
|
model: ollama-cloud/nemotron-3-super
|
||||||
|
mode: subagent
|
||||||
|
|
||||||
|
capability_routing:
|
||||||
|
fitness_scoring: pipeline-judge
|
||||||
|
test_execution: pipeline-judge
|
||||||
|
bottleneck_detection: pipeline-judge
|
||||||
|
|
||||||
|
iteration_loops:
|
||||||
|
evolution:
|
||||||
|
evaluator: pipeline-judge
|
||||||
|
optimizer: prompt-optimizer
|
||||||
|
max_iterations: 3
|
||||||
|
convergence: fitness_above_0.85
|
||||||
|
|
||||||
|
workflow_states:
|
||||||
|
evaluated: [evolving, completed]
|
||||||
|
evolving: [evaluated]
|
||||||
|
|
||||||
|
evolution:
|
||||||
|
enabled: true
|
||||||
|
auto_trigger: true
|
||||||
|
fitness_threshold: 0.70
|
||||||
|
max_evolution_attempts: 3
|
||||||
|
fitness_history: .kilo/logs/fitness-history.jsonl
|
||||||
|
budgets:
|
||||||
|
feature: {tokens: 50000, time_s: 300}
|
||||||
|
bugfix: {tokens: 20000, time_s: 120}
|
||||||
|
refactor: {tokens: 40000, time_s: 240}
|
||||||
|
security: {tokens: 30000, time_s: 180}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [ ] pipeline-judge added to the agents section
|
||||||
|
- [ ] capability_routing updated
|
||||||
|
- [ ] iteration_loops.evolution added
|
||||||
|
- [ ] workflow_states updated
|
||||||
|
- [ ] evolution section configured
|
||||||
|
- [ ] YAML is valid
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Issue 11: Dashboard Evolution Tab - fitness visualization
|
||||||
|
|
||||||
|
**Title:** Add a Fitness Evolution tab to the dashboard
|
||||||
|
**Labels:** `dashboard`, `visualization`, `medium-priority`
|
||||||
|
**Milestone:** Agent Evolution Dashboard
|
||||||
|
|
||||||
|
**Description:**
|
||||||
|
Extend the dashboard to display fitness metrics and evolution trends.
|
||||||
|
|
||||||
|
**New "Evolution" tab:**
|
||||||
|
- **Fitness Trend Chart**: fitness over time
|
||||||
|
- **Workflow Comparison**: fitness compared across workflow types
|
||||||
|
- **Agent Bottlenecks**: agents with the highest token consumption
|
||||||
|
- **Optimization History**: history of prompt optimizations
|
||||||
|
|
||||||
|
**Data Source:**
|
||||||
|
- `.kilo/logs/fitness-history.jsonl`
|
||||||
|
- `.kilo/logs/efficiency_score.json`
|
||||||
|
|
||||||
|
**UI Components:**
|
||||||
|
```javascript
|
||||||
|
// Fitness Trend Chart
|
||||||
|
// X-axis: timestamp
|
||||||
|
// Y-axis: fitness score (0.0 - 1.0)
|
||||||
|
// Series: issues by type (feature, bugfix, refactor)
|
||||||
|
|
||||||
|
// Agent Heatmap
|
||||||
|
// Rows: agents
|
||||||
|
// Cols: metrics (tokens, time, contribution)
|
||||||
|
// Color: intensity
|
||||||
|
```
|
||||||
|
|
||||||
|
**Acceptance criteria:**
|
||||||
|
- [ ] "Evolution" tab added to the dashboard
|
||||||
|
- [ ] Fitness trend chart works
|
||||||
|
- [ ] Agent bottlenecks are displayed
|
||||||
|
- [ ] Data is loaded from fitness-history.jsonl
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Direction Status
|
||||||
|
|
||||||
|
**Current status:** `ACTIVE` - new issues for integrating the fitness system
|
||||||
|
|
||||||
|
**Sprint priorities:**
|
||||||
|
| Priority | Issue | Effort | Impact |
|
||||||
|
|----------|-------|--------|--------|
|
||||||
|
| **P0** | #6 Pipeline Judge Agent | Low | High |
|
||||||
|
| **P0** | #7 Fitness History Logging | Low | High |
|
||||||
|
| **P0** | #10 Capability Index Update | Low | High |
|
||||||
|
| **P1** | #8 Evolution Workflow | Medium | High |
|
||||||
|
| **P1** | #9 /evolve Command | Medium | Medium |
|
||||||
|
| **P2** | #11 Dashboard Evolution Tab | Medium | Medium |
|
||||||
|
|
||||||
|
**Dependencies:**
|
||||||
|
```
|
||||||
|
#6 (pipeline-judge) ──► #7 (fitness-history) ──► #11 (dashboard)
|
||||||
|
│
|
||||||
|
└──► #10 (capability-index)
|
||||||
|
│
|
||||||
|
┌───────────────┘
|
||||||
|
▼
|
||||||
|
#8 (evolution-workflow) ──► #9 (evolve-command)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Recommended order of execution:**
|
||||||
|
1. Issue #6: Create `pipeline-judge.md` ✅ DONE
|
||||||
|
2. Issue #10: Update `capability-index.yaml`
|
||||||
|
3. Issue #7: Create `fitness-history.jsonl` and integrate logging
|
||||||
|
4. Issue #8: Create the `fitness-evaluation.md` workflow
|
||||||
|
5. Issue #9: Update the `/evolution` command
|
||||||
|
6. Issue #11: Add the tab to the dashboard
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -180,3 +469,15 @@ docker-compose -f docker-compose.evolution.yml up -d
|
|||||||
- Build Script: `agent-evolution/scripts/build-standalone.cjs`
|
- Build Script: `agent-evolution/scripts/build-standalone.cjs`
|
||||||
- Docker: `docker-compose -f docker-compose.evolution.yml up -d`
|
- Docker: `docker-compose -f docker-compose.evolution.yml up -d`
|
||||||
- NPM: `bun run sync:evolution`
|
- NPM: `bun run sync:evolution`
|
||||||
|
- **NEW** Pipeline Judge: `.kilo/agents/pipeline-judge.md`
|
||||||
|
- **NEW** Fitness Log: `.kilo/logs/fitness-history.jsonl`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Changelog
|
||||||
|
|
||||||
|
### 2026-04-06
|
||||||
|
- ✅ Created `pipeline-judge.md` agent
|
||||||
|
- ✅ Updated MILESTONE_ISSUES.md with 6 new issues (#6-#11)
|
||||||
|
- ✅ Added dependency graph and priority matrix
|
||||||
|
- ✅ Changed status from PAUSED to ACTIVE
|
||||||
84
agent-evolution/ideas/evolution-patch.json
Normal file
@@ -0,0 +1,84 @@
|
|||||||
|
{
|
||||||
|
"$schema": "https://app.kilo.ai/agent-recommendations.json",
|
||||||
|
"generated": "2026-04-05T20:00:00Z",
|
||||||
|
"source": "APAW Evolution System Design",
|
||||||
|
"description": "Adds pipeline-judge agent and evolution workflow to APAW",
|
||||||
|
|
||||||
|
"new_files": [
|
||||||
|
{
|
||||||
|
"path": ".kilo/agents/pipeline-judge.md",
|
||||||
|
"source": "pipeline-judge.md",
|
||||||
|
"description": "Automated fitness evaluator — runs tests, measures tokens/time, produces fitness score"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": ".kilo/workflows/evolution.md",
|
||||||
|
"source": "evolution-workflow.md",
|
||||||
|
"description": "Continuous self-improvement loop for agent pipeline"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"path": ".kilo/commands/evolve.md",
|
||||||
|
"source": "evolve-command.md",
|
||||||
|
"description": "/evolve command — trigger evolution cycle"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
|
||||||
|
"capability_index_additions": {
|
||||||
|
"agents": {
|
||||||
|
"pipeline-judge": {
|
||||||
|
"capabilities": [
|
||||||
|
"test_execution",
|
||||||
|
"fitness_scoring",
|
||||||
|
"metric_collection",
|
||||||
|
"bottleneck_detection"
|
||||||
|
],
|
||||||
|
"receives": [
|
||||||
|
"completed_workflow",
|
||||||
|
"pipeline_logs"
|
||||||
|
],
|
||||||
|
"produces": [
|
||||||
|
"fitness_report",
|
||||||
|
"bottleneck_analysis",
|
||||||
|
"improvement_triggers"
|
||||||
|
],
|
||||||
|
"forbidden": [
|
||||||
|
"code_writing",
|
||||||
|
"code_changes",
|
||||||
|
"prompt_changes"
|
||||||
|
],
|
||||||
|
"model": "ollama-cloud/nemotron-3-super",
|
||||||
|
"mode": "subagent"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"capability_routing": {
|
||||||
|
"fitness_scoring": "pipeline-judge",
|
||||||
|
"test_execution": "pipeline-judge",
|
||||||
|
"bottleneck_detection": "pipeline-judge"
|
||||||
|
},
|
||||||
|
"iteration_loops": {
|
||||||
|
"evolution": {
|
||||||
|
"evaluator": "pipeline-judge",
|
||||||
|
"optimizer": "prompt-optimizer",
|
||||||
|
"max_iterations": 3,
|
||||||
|
"convergence": "fitness_above_0.85"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"evolution": {
|
||||||
|
"enabled": true,
|
||||||
|
"auto_trigger": true,
|
||||||
|
"fitness_threshold": 0.70,
|
||||||
|
"max_evolution_attempts": 3,
|
||||||
|
"fitness_history": ".kilo/logs/fitness-history.jsonl",
|
||||||
|
"budgets": {
|
||||||
|
"feature": {"tokens": 50000, "time_s": 300},
|
||||||
|
"bugfix": {"tokens": 20000, "time_s": 120},
|
||||||
|
"refactor": {"tokens": 40000, "time_s": 240},
|
||||||
|
"security": {"tokens": 30000, "time_s": 180}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
"workflow_state_additions": {
|
||||||
|
"evaluated": ["evolving", "completed"],
|
||||||
|
"evolving": ["evaluated"]
|
||||||
|
}
|
||||||
|
}
|
||||||
201
agent-evolution/ideas/evolution-workflow.md
Normal file
@@ -0,0 +1,201 @@
|
|||||||
|
# Evolution Workflow
|
||||||
|
|
||||||
|
Continuous self-improvement loop for the agent pipeline.
|
||||||
|
Triggered automatically after every workflow completion.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
[Workflow Completes]
|
||||||
|
↓
|
||||||
|
[@pipeline-judge] ← runs tests, measures tokens/time
|
||||||
|
↓
|
||||||
|
fitness score
|
||||||
|
↓
|
||||||
|
┌──────────────────────────┐
|
||||||
|
│ fitness >= 0.85 │──→ Log + done (no action)
|
||||||
|
│ fitness 0.70 - 0.84 │──→ [@prompt-optimizer] minor tuning
|
||||||
|
│ fitness 0.50-0.69 │──→ [@prompt-optimizer] major rewrite
|
||||||
|
│ fitness < 0.50 │──→ [@agent-architect] redesign agent
|
||||||
|
└──────────────────────────┘
|
||||||
|
↓
|
||||||
|
[Re-run same workflow with new prompts]
|
||||||
|
↓
|
||||||
|
[@pipeline-judge] again
|
||||||
|
↓
|
||||||
|
compare fitness_before vs fitness_after
|
||||||
|
↓
|
||||||
|
┌──────────────────────────┐
|
||||||
|
│ improved? │
|
||||||
|
│ Yes → commit new prompts│
|
||||||
|
│ No → revert, try │
|
||||||
|
│ different strategy │
|
||||||
|
│ (max 3 attempts) │
|
||||||
|
└──────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Fitness History
|
||||||
|
|
||||||
|
All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:
|
||||||
|
|
||||||
|
```jsonl
|
||||||
|
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
|
||||||
|
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
|
||||||
|
```
|
||||||
|
|
||||||
|
This creates a time-series that shows pipeline evolution over time.
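
As an illustration of reading that time series, the sketch below averages fitness per workflow type from the raw JSONL text. The function name is hypothetical, and only the `workflow` and `fitness` fields from the records above are used.

```typescript
// Parses JSONL text and returns the mean fitness for each workflow type.
function averageFitnessByWorkflow(jsonl: string): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. after the final newline
    const entry = JSON.parse(line) as { workflow: string; fitness: number };
    const bucket = sums[entry.workflow] ?? { total: 0, count: 0 };
    bucket.total += entry.fitness;
    bucket.count += 1;
    sums[entry.workflow] = bucket;
  }
  const averages: Record<string, number> = {};
  for (const [workflow, { total, count }] of Object.entries(sums)) {
    averages[workflow] = total / count;
  }
  return averages;
}
```

A malformed line makes `JSON.parse` throw; whether to skip or fail on such lines is a design choice this sketch leaves open.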
|
||||||
|
|
||||||
|
## Orchestrator Evolution
|
||||||
|
|
||||||
|
The orchestrator uses fitness history to optimize future pipeline construction:
|
||||||
|
|
||||||
|
### Pipeline Selection Strategy
|
||||||
|
```
|
||||||
|
For each new issue:
|
||||||
|
1. Classify issue type (feature|bugfix|refactor|api|security)
|
||||||
|
2. Look up fitness history for same type
|
||||||
|
3. Find the pipeline configuration with highest fitness
|
||||||
|
4. Use that as template, but adapt to current issue
|
||||||
|
5. Skip agents that consistently score 0 contribution
|
||||||
|
```
|
||||||
|
|
||||||
|
### Agent Ordering Optimization
|
||||||
|
```
|
||||||
|
From fitness-history.jsonl, extract per-agent metrics:
|
||||||
|
- avg tokens consumed
|
||||||
|
- avg contribution to fitness
|
||||||
|
- failure rate (how often this agent's output causes downstream failures)
|
||||||
|
|
||||||
|
agents_by_roi = sort(agents, key=contribution/tokens, descending)
|
||||||
|
|
||||||
|
For parallel phases:
|
||||||
|
- Run high-ROI agents first
|
||||||
|
- Skip agents with ROI < 0.1 (cost more than they contribute)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Token Budget Allocation
|
||||||
|
```
|
||||||
|
total_budget = 50000 tokens (configurable)
|
||||||
|
|
||||||
|
For each agent in pipeline:
|
||||||
|
agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
|
||||||
|
|
||||||
|
If agent exceeds budget by >50%:
|
||||||
|
→ prompt-optimizer compresses that agent's prompt
|
||||||
|
→ or swap to a smaller/faster model
|
||||||
|
```
|
||||||
|
|
||||||
|
## Standard Test Suites

No manual test configuration needed. Tests are auto-discovered:

### Test Discovery
```bash
# Unit tests
find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l

# E2E tests
find tests/e2e -name "*.test.ts" | wc -l

# Integration tests
find tests/integration -name "*.test.ts" | wc -l
```

### Quality Gates (standardized)
```yaml
gates:
  build: "bun run build"
  lint: "bun run lint"
  typecheck: "bun run typecheck"
  unit_tests: "bun test"
  e2e_tests: "bun test:e2e"
  # Compare inside awk so the gate exits non-zero below 80; a bare `>= 80` would be parsed as a shell redirect.
  coverage: "bun test --coverage | grep 'All files' | awk '{ exit !($10 >= 80) }'"
  security: "bun audit --level=high | grep 'found 0'"
```

### Workflow-Specific Benchmarks
```yaml
benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3

  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix: must prove fix works
    max_iterations: 2

  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2

  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
```

## Prompt Evolution Protocol

When prompt-optimizer is triggered:

```
1. Read current agent prompt from .kilo/agents/<agent>.md
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history

4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action

5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format if IF was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue

6. Save to .kilo/agents/<agent>.md.candidate

7. Re-run the SAME workflow with .candidate prompt

8. [@pipeline-judge] scores again

9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
```

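Step 9 of the protocol, promoting or discarding the candidate prompt, might look roughly like this. It is a sketch under stated assumptions: `resolve_candidate` is a hypothetical helper, the demo runs in a temporary directory with made-up prompt text and scores, and the version-control commit that would follow a promotion is left out.

```python
import tempfile
from pathlib import Path


def resolve_candidate(prompt_path, fitness_old, fitness_new):
    """Promote <prompt>.candidate if it scored higher, otherwise discard it."""
    prompt_path = Path(prompt_path)
    candidate = prompt_path.with_name(prompt_path.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(prompt_path)  # overwrite the live prompt with the candidate
        return True
    candidate.unlink()  # revert: drop the candidate, keep the live prompt
    return False


# Demo in a throwaway directory with hypothetical prompt contents.
with tempfile.TemporaryDirectory() as tmp:
    prompt = Path(tmp) / "planner.md"
    prompt.write_text("old prompt")
    Path(str(prompt) + ".candidate").write_text("new prompt")
    promoted = resolve_candidate(prompt, fitness_old=0.66, fitness_new=0.74)
    final_text = prompt.read_text()
    candidate_left = Path(str(prompt) + ".candidate").exists()
print(promoted, final_text, candidate_left)
```

The strict `>` comparison follows the protocol text: a candidate that only ties the old score is discarded.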
## Usage

```bash
# Triggered automatically after any workflow
# OR manually:
/evolve                   # run evolution on last workflow
/evolve --issue 42        # run evolution on specific issue
/evolve --agent planner   # evolve specific agent's prompt
/evolve --history         # show fitness trend
```

## Configuration

```yaml
# Add to kilo.jsonc or capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
```
72
agent-evolution/ideas/evolve-command.md
Normal file
@@ -0,0 +1,72 @@
---
description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
---

# /evolve: Pipeline Evolution Command

Runs the automated evolution cycle on the most recent (or specified) workflow.

## Usage

```
/evolve                    # evolve last completed workflow
/evolve --issue 42         # evolve workflow for issue #42
/evolve --agent planner    # focus evolution on one agent
/evolve --dry-run          # show what would change without applying
/evolve --history          # print fitness trend chart
```

## Execution

### Step 1: Judge
```
Task(subagent_type: "pipeline-judge")
→ produces fitness report
```

### Step 2: Decide
```
IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT

IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)

IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
```

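The Step 2 thresholds can be read as a small pure function. The sketch below only encodes the branching; the action labels are illustrative placeholders rather than real task names, and the actual step dispatches subagents instead of returning strings.

```python
def decide(fitness):
    """Map a fitness score to the Step 2 actions (labels are illustrative)."""
    if fitness >= 0.85:
        return ["log_and_exit"]
    if fitness >= 0.70:
        return ["optimize_weak_agents"]
    # Below 0.70: optimize every flagged agent, and redesign the worst one
    # when the score also falls below 0.50.
    actions = ["optimize_all_flagged_agents"]
    if fitness < 0.50:
        actions.append("redesign_worst_agent")
    return actions


for score in (0.91, 0.78, 0.62, 0.41):
    print(score, decide(score))
```

Returning a list keeps the sub-0.50 case explicit: it adds a redesign on top of the major optimization rather than replacing it.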
### Step 3: Re-test
```
Re-run the SAME workflow with updated prompts
Task(subagent_type: "pipeline-judge") → fitness_after

IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
```

### Step 4: Log
```
Append to .kilo/logs/fitness-history.jsonl:
{
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
}
```
26
agent-evolution/ideas/initial.txt
Normal file
@@ -0,0 +1,26 @@
Now I understand the full architecture. I am creating an auto-evolution system: not an HTML dashboard, but a set of files for the repository: a skeptical judge agent, an evolution workflow, and standardized test cases. Here are 4 files for the auto-evolution system. How it works:
pipeline-judge.md: a new skeptical judge agent. It does not assign subjective scores. Instead it:

Runs bun test, bun build, bun lint, bun typecheck
Counts pass/fail/skip from JSON reports
Reads the pipeline logs (tokens and time for each agent)
Computes a fitness score with the formula: 50% tests + 25% quality gates + 25% efficiency (tokens/time)
If fitness < 0.70 → automatically invokes prompt-optimizer

evolution-workflow.md: a continuous optimization workflow:

Triggers automatically after every completed workflow
fitness ≥ 0.85 → log and move on
fitness 0.70–0.84 → prompt-optimizer fixes the weak agents
fitness < 0.50 → agent-architect redesigns the agent
After optimization, the same workflow is re-run with the new prompts and fitness is compared before and after. Improved → commit; otherwise → revert

The orchestrator evolves through fitness-history.jsonl, a cumulative record of all runs. The orchestrator learns which agents to skip (ROI < 0.1), how to allocate the token budget, and which pipeline template works best for each task type.
evolve-command.md: the /evolve command for manual runs or for viewing the trend.
evolution-patch.json: a ready-made patch for capability-index.yaml that adds pipeline-judge, routing, iteration_loops, and the evolution configuration with budgets per task type.
The files need to be placed in the repository:

pipeline-judge.md → .kilo/agents/
evolution-workflow.md → .kilo/workflows/
evolve-command.md → .kilo/commands/
evolution-patch.json → apply to capability-index.yaml
181
agent-evolution/ideas/pipeline-judge.md
Normal file
@@ -0,0 +1,181 @@
---
description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces fitness scores. Never writes code - only measures and scores.
mode: subagent
model: ollama-cloud/nemotron-3-super
color: "#DC2626"
permission:
  read: allow
  write: deny
  bash: allow
  task: allow
  glob: allow
  grep: allow
---

# Kilo Code: Pipeline Judge

## Role Definition

You are **Pipeline Judge**, the automated fitness evaluator. You do NOT score subjectively. You measure objectively:

1. **Test pass rate**: run the test suite, count pass/fail/skip
2. **Token cost**: sum tokens consumed by all agents in the pipeline
3. **Wall-clock time**: total execution time from first agent to last
4. **Quality gates**: binary pass/fail for each quality gate

You produce a **fitness score** that drives evolutionary optimization.

## When to Invoke

- After ANY workflow completes (feature, bugfix, refactor, etc.)
- After prompt-optimizer changes an agent's prompt
- After a model swap recommendation is applied
- On `/evaluate` command

## Fitness Score Formula

```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate     = passed_tests / total_tests           # 0.0 - 1.0
  quality_gates_rate = passed_gates / total_gates           # 0.0 - 1.0
  efficiency_score   = 1.0 - clamp(normalized_cost, 0, 1)   # higher = cheaper/faster
  normalized_cost    = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
```

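To make the formula concrete, here is a direct transcription as a function. It is a sketch, not the judge's implementation: the function and parameter names are hypothetical, the inputs in the example are invented round numbers, and treating an empty test suite or gate list as a rate of 0.0 is an assumption the formula itself does not address.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def fitness_score(passed_tests, total_tests, passed_gates, total_gates,
                  actual_tokens, budget_tokens, actual_time_s, budget_time_s):
    """Weighted fitness: 50% tests, 25% quality gates, 25% efficiency."""
    # Assumption: an empty suite or gate list counts as 0.0 rather than raising.
    test_pass_rate = passed_tests / total_tests if total_tests else 0.0
    quality_gates_rate = passed_gates / total_gates if total_gates else 0.0
    normalized_cost = ((actual_tokens / budget_tokens) * 0.5
                       + (actual_time_s / budget_time_s) * 0.5)
    efficiency_score = 1.0 - clamp(normalized_cost, 0.0, 1.0)
    return (test_pass_rate * 0.50
            + quality_gates_rate * 0.25
            + efficiency_score * 0.25)


# Invented inputs: 18/20 tests, 4/5 gates, half of both budgets used.
score = fitness_score(18, 20, 4, 5, 25_000, 50_000, 150, 300)
print(round(score, 3))
```

With these inputs the three terms are 0.45, 0.20, and 0.125. Because of the clamp, a run that reaches or exceeds both budgets contributes nothing from efficiency, so its fitness cannot exceed 0.75.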
## Execution Protocol

### Step 1: Collect Metrics
```bash
# Run test suites; one report file per suite so each stays valid JSON
bun test --reporter=json > /tmp/test-unit.json 2>/dev/null
bun test:e2e --reporter=json > /tmp/test-e2e.json 2>/dev/null

# Count results, summed across both reports
TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)

# Check build
bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false

# Check lint
bun run lint 2>&1 && LINT_OK=true || LINT_OK=false

# Check types
bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
```

### Step 2: Read Pipeline Log
Read `.kilo/logs/pipeline-*.log` for:
- Token counts per agent (from API response headers)
- Execution time per agent
- Number of iterations in evaluator-optimizer loops
- Which agents were invoked and in what order

### Step 3: Calculate Fitness
```
test_pass_rate = PASSED / TOTAL
quality_gates:
  - build: BUILD_OK
  - lint: LINT_OK
  - types: TYPES_OK
  - tests: FAILED == 0
  - coverage: coverage >= 80%
quality_gates_rate = passed_gates / 5

token_budget = 50000   # tokens per standard workflow
time_budget = 300      # seconds per standard workflow
normalized_cost = (total_tokens/token_budget × 0.5) + (total_time/time_budget × 0.5)
efficiency = 1.0 - min(normalized_cost, 1.0)

FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
```

### Step 4: Produce Report
```json
{
  "workflow_id": "wf-<issue_number>-<timestamp>",
  "fitness": 0.84,
  "breakdown": {
    "test_pass_rate": 0.95,
    "quality_gates_rate": 0.80,
    "efficiency_score": 0.65
  },
  "tests": {
    "total": 47,
    "passed": 45,
    "failed": 2,
    "skipped": 0,
    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
  },
  "quality_gates": {
    "build": true,
    "lint": true,
    "types": true,
    "tests_clean": false,
    "coverage_80": true
  },
  "cost": {
    "total_tokens": 38400,
    "total_time_ms": 245000,
    "per_agent": [
      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
    ]
  },
  "iterations": {
    "code_review_loop": 2,
    "security_review_loop": 1
  },
  "verdict": "PASS",
  "bottleneck_agent": "lead-developer",
  "most_expensive_agent": "lead-developer",
  "improvement_trigger": false
}
```

### Step 5: Trigger Evolution (if needed)
```
IF fitness < 0.70:
  → Task(subagent_type: "prompt-optimizer", payload: report)
  → improvement_trigger = true

IF any agent consumed > 30% of total tokens:
  → Flag as bottleneck
  → Suggest model downgrade or prompt compression

IF iterations > 2 in any loop:
  → Flag evaluator-optimizer convergence issue
  → Suggest prompt refinement for the evaluator agent
```

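The three Step 5 checks can be expressed against a report dictionary shaped like the Step 4 JSON. This is a hedged sketch: `evaluate_triggers` and the keys of its return value are hypothetical, and the sample report is a trimmed, illustrative one rather than real pipeline output.

```python
def evaluate_triggers(report, fitness_threshold=0.70,
                      token_share_limit=0.30, max_iterations=2):
    """Return the Step 5 flags for a fitness report."""
    total_tokens = report["cost"]["total_tokens"]
    # Agents whose share of all tokens exceeds the limit (30% by default).
    bottlenecks = [
        entry["agent"]
        for entry in report["cost"]["per_agent"]
        if total_tokens > 0 and entry["tokens"] / total_tokens > token_share_limit
    ]
    # Evaluator-optimizer loops that ran more than max_iterations times.
    slow_loops = [name for name, count in report["iterations"].items()
                  if count > max_iterations]
    return {
        "improvement_trigger": report["fitness"] < fitness_threshold,
        "bottleneck_agents": bottlenecks,
        "convergence_issues": slow_loops,
    }


# Illustrative report with only the fields the checks need.
report = {
    "fitness": 0.80,
    "cost": {"total_tokens": 38400, "per_agent": [
        {"agent": "lead-developer", "tokens": 12000},
        {"agent": "sdet-engineer", "tokens": 8500},
    ]},
    "iterations": {"code_review_loop": 2, "security_review_loop": 1},
}
flags = evaluate_triggers(report)
print(flags)
```

Here `lead-developer` uses 12000 of 38400 tokens (about 31%), so it is flagged as a bottleneck, while neither loop exceeds two iterations.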
## Output Format

```
## Pipeline Judgment: Issue #<N>

**Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]

| Metric | Value | Weight | Contribution |
|--------|-------|--------|-------------|
| Tests | 95% (45/47) | 50% | 0.475 |
| Gates | 80% (4/5) | 25% | 0.200 |
| Cost | 38.4K tok / 245s | 25% | 0.163 |

**Bottleneck:** lead-developer (31% of tokens)
**Failed tests:** auth.test.ts:42, api.test.ts:108
**Failed gates:** tests_clean

@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
```

## Prohibited Actions

- DO NOT write or modify any code
- DO NOT subjectively rate "quality"; only measure
- DO NOT skip running actual tests
- DO NOT estimate token counts; read them from logs
- DO NOT change agent prompts; only flag them for prompt-optimizer