feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions
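
The fitness formula in the commit message can be sketched as below. This is a minimal illustration, not the pipeline-judge implementation: the function name `compute_fitness` and the constant names are hypothetical. The commit does not say how `efficiency` is derived from token and time budgets, so it is taken here as an already-normalized input.

```python
# Weights as stated in the commit message.
TEST_WEIGHT = 0.50
GATES_WEIGHT = 0.25
EFFICIENCY_WEIGHT = 0.25


def compute_fitness(test_rate: float, gates: float, efficiency: float) -> float:
    """Return the weighted fitness score.

    Assumption: every input is a rate already normalized to [0, 1].
    """
    inputs = {"test_rate": test_rate, "gates": gates, "efficiency": efficiency}
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        TEST_WEIGHT * test_rate
        + GATES_WEIGHT * gates
        + EFFICIENCY_WEIGHT * efficiency
    )


if __name__ == "__main__":
    # Breakdown values taken from the sample fitness-history.jsonl entry.
    print(f"{compute_fitness(0.95, 0.80, 0.65):.4f}")
```

For the sample breakdown in this commit (0.95, 0.80, 0.65) the plain weighted sum comes to 0.8375, slightly above the 0.82 shown in the example session, so the judge may apply rounding or an adjustment that this commit does not describe.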
--- a/.kilo/commands/evolution.md
+++ b/.kilo/commands/evolution.md
@@ -1,163 +1,167 @@
-# Agent Evolution Workflow
+---
+description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
+---

-Tracks and records agent model improvements, capability changes, and performance metrics.
+# /evolution — Pipeline Evolution Command
+
+Runs the automated evolution cycle on the most recent (or specified) workflow.

 ## Usage

 ```
-/evolution [action] [agent]
+/evolution                     # evolve last completed workflow
+/evolution --issue 42          # evolve workflow for issue #42
+/evolution --agent planner     # focus evolution on one agent
+/evolution --dry-run           # show what would change without applying
+/evolution --history           # print fitness trend chart
+/evolution --fitness           # run fitness evaluation (alias for /evolve)
 ```

-### Actions
+## Aliases

-| Action | Description |
-|--------|-------------|
-| `log` | Log an agent improvement to Gitea and evolution data |
-| `report` | Generate evolution report for agent or all agents |
-| `history` | Show model change history |
-| `metrics` | Display performance metrics |
-| `recommend` | Get model recommendations |
+- `/evolve` — same as `/evolution --fitness`
+- `/evolution log` — log agent model change to Gitea

-### Examples
+## Execution
+
+### Step 1: Judge (Fitness Evaluation)
+
+```bash
+Task(subagent_type: "pipeline-judge")
+→ produces fitness report
+```
+
+### Step 2: Decide (Threshold Routing)
+
+```
+IF fitness >= 0.85:
+  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
+  append to fitness-history.jsonl
+  EXIT
+
+IF fitness >= 0.70:
+  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
+  identify agents with lowest per-agent scores
+  Task(subagent_type: "prompt-optimizer", target: weak_agents)
+
+IF fitness < 0.70:
+  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
+  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
+  IF fitness < 0.50:
+    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
+```
+
+### Step 3: Re-test (After Optimization)
+
+```
+Re-run the SAME workflow with updated prompts
+Task(subagent_type: "pipeline-judge") → fitness_after
+
+IF fitness_after > fitness_before:
+  commit prompt changes
+  echo "📈 Fitness improved: {before} → {after}"
+ELSE:
+  revert prompt changes
+  echo "📉 No improvement. Reverting."
+```
+
+### Step 4: Log
+
+Append to `.kilo/logs/fitness-history.jsonl`:
+
+```json
+{
+  "ts": "<now>",
+  "issue": <N>,
+  "workflow": "<type>",
+  "fitness_before": <score>,
+  "fitness_after": <score>,
+  "agents_optimized": ["planner", "requirement-refiner"],
+  "tokens_saved": <delta>,
+  "time_saved_ms": <delta>
+}
+```
+
+## Subcommands
+
+### `log` — Log Model Change
+
+Log an agent model improvement to Gitea and evolution data.

 ```bash
-# Log improvement
 /evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
+```

-# Generate report
-/evolution report capability-analyst
+Steps:
+1. Read current model from `.kilo/agents/{agent}.md`
+2. Get previous model from `agent-evolution/data/agent-versions.json`
+3. Calculate improvement (IF score, context window)
+4. Write to evolution data
+5. Post Gitea comment

-# Show all changes
-/evolution history
+### `report` — Generate Evolution Report

-# Get recommendations
+Generate comprehensive report for agent or all agents:
+
+```bash
+/evolution report           # all agents
+/evolution report planner   # specific agent
+```
+
+Output includes:
+- Total agents
+- Model changes this month
+- Average quality improvement
+- Recent changes table
+- Performance metrics
+- Model distribution
+- Recommendations
+
+### `history` — Show Fitness Trend
+
+Print fitness trend chart:
+
+```bash
+/evolution --history
+```
+
+Output:
+```
+Fitness Trend (Last 30 days):
+
+1.00 ┤
+0.90 ┤     ╭─╮     ╭──╮
+0.80 ┤   ╭─╯ ╰─╮ ╭─╯  ╰──╮
+0.70 ┤ ╭─╯     ╰─╯        ╰──╮
+0.60 ┤ │                         ╰─╮
+0.50 ┼─┴───────────────────────────┴──
+     Apr 1  Apr 8  Apr 15  Apr 22  Apr 29
+
+Avg fitness: 0.82
+Trend: ↑ improving
+```
+
+### `recommend` — Get Model Recommendations
+
+```bash
 /evolution recommend
 ```

-## Workflow Steps
-
-### Step 1: Parse Command
-
-```bash
-action=$1
-agent=$2
-message=$3
-```
-
-### Step 2: Execute Action
-
-#### Log Action
-
-When logging an improvement:
-
-1. **Read current model**
-   ```bash
-   # From .kilo/agents/{agent}.md
-   current_model=$(grep "^model:" .kilo/agents/${agent}.md | cut -d' ' -f2)
-   
-   # From .kilo/capability-index.yaml
-   yaml_model=$(grep -A1 "${agent}:" .kilo/capability-index.yaml | grep "model:" | cut -d' ' -f2)
-   ```
-
-2. **Get previous model from history**
-   ```bash
-   # Read from agent-evolution/data/agent-versions.json
-   previous_model=$(cat agent-evolution/data/agent-versions.json | ...)
-   ```
-
-3. **Calculate improvement**
-   - Look up model scores from capability-index.yaml
-   - Compare IF scores
-   - Compare context windows
-
-4. **Write to evolution data**
-   ```json
-   {
-     "agent": "capability-analyst",
-     "timestamp": "2026-04-05T22:20:00Z",
-     "type": "model_change",
-     "from": "ollama-cloud/nemotron-3-super",
-     "to": "qwen/qwen3.6-plus:free",
-     "improvement": {
-       "quality": "+23%",
-       "context_window": "130K→1M",
-       "if_score": "85→90"
-     },
-     "rationale": "Better structured output, FREE via OpenRouter"
-   }
-   ```
-
-5. **Post Gitea comment**
-   ```markdown
-   ## 🚀 Agent Evolution: {agent}
-
-   | Metric | Before | After | Change |
-   |--------|--------|-------|--------|
-   | Model | {old} | {new} | ⬆️ |
-   | IF Score | 85 | 90 | +5 |
-   | Quality | 64 | 79 | +23% |
-   | Context | 130K | 1M | +670K |
-
-   **Rationale**: {message}
-   ```
-
-#### Report Action
-
-Generate comprehensive report:
-
-```markdown
-# Agent Evolution Report
-
-## Overview
-
- Total agents: 28
- Model changes this month: 4
- Average quality improvement: +18%
-
-## Recent Changes
-
-| Date | Agent | Old Model | New Model | Impact |
-|------|-------|-----------|-----------|--------|
-| 2026-04-05 | capability-analyst | nemotron-3-super | qwen3.6-plus | +23% |
-| 2026-04-05 | requirement-refiner | nemotron-3-super | glm-5 | +33% |
-| ... | ... | ... | ... | ... |
-
-## Performance Metrics
-
-### Agent Scores Over Time
-
-```
-capability-analyst: 64 → 79 (+23%)
-requirement-refiner: 60 → 80 (+33%)
-agent-architect: 67 → 82 (+22%)
-evaluator: 78 → 81 (+4%)
-```
-
-### Model Distribution
-
- qwen3.6-plus: 5 agents
- nemotron-3-super: 8 agents
- glm-5: 3 agents
- minimax-m2.5: 1 agent
- ...
-
-## Recommendations
-
-1. Consider updating history-miner to nemotron-3-super-120b
-2. code-skeptic optimal with minimax-m2.5
-3. ...
-```
-
-### Step 3: Update Files
-
-After logging:
-
-1. Update `agent-evolution/data/agent-versions.json`
-2. Post comment to related Gitea issue
-3. Update capability-index.yaml metrics
+Shows:
+- Agents with fitness < 0.70 (need optimization)
+- Agents consuming > 30% of token budget (bottlenecks)
+- Model upgrade recommendations
+- Priority order

 ## Data Storage

+### fitness-history.jsonl
+
+```jsonl
+{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
+{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
+```
+
 ### agent-versions.json

 ```json
@@ -186,22 +190,6 @@ After logging:
 }
 ```

-### Gitea Issue Comments
-
-Each evolution log posts a formatted comment:
-
-```markdown
-## 🚀 Agent Evolution Log
-
-### {agent}
- **Model**: {old} → {new}
- **Quality**: {old_score} → {new_score} ({change}%)
- **Context**: {old_ctx} → {new_ctx}
- **Rationale**: {reason}
-
-_This change was tracked by /evolution workflow._
-```
-
 ## Integration Points

 - **After `/pipeline`**: Evaluator scores logged
@@ -209,29 +197,52 @@ _This change was tracked by /evolution workflow._
 - **Weekly**: Performance report generated
 - **On request**: Recommendations provided

+## Configuration
+
+```yaml
+# In capability-index.yaml
+evolution:
+  enabled: true
+  auto_trigger: true           # trigger after every workflow
+  fitness_threshold: 0.70      # below this → auto-optimize
+  max_evolution_attempts: 3    # max retries per cycle
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  token_budget_default: 50000
+  time_budget_default: 300
+```
+
 ## Metrics Tracked

 | Metric | Source | Purpose |
 |--------|--------|---------|
-| IF Score | KILO_SPEC.md | Instruction Following |
-| Quality Score | Research | Overall performance |
-| Context Window | Model spec | Max tokens |
-| Provider | Config | API endpoint |
-| Cost | Pricing | Resource planning |
-| SWE-bench | Research | Code benchmark |
-| RULER | Research | Long-context benchmark |
+| Fitness Score | pipeline-judge | Overall pipeline health |
+| Test Pass Rate | bun test | Code quality |
+| Quality Gates | build/lint/typecheck | Standards compliance |
+| Token Cost | pipeline logs | Resource efficiency |
+| Wall-Clock Time | pipeline logs | Speed |
+| Agent ROI | history analysis | Cost/benefit |

 ## Example Session

 ```bash
-$ /evolution log capability-analyst "Updated to qwen3.6-plus for FREE tier and better IF"
+$ /evolution

-✅ Logged evolution for capability-analyst
-📊 Quality improvement: +23%
-📄 Posted comment to Issue #27
-📝 Updated agent-versions.json
+## Pipeline Judgment: Issue #42
+
+**Fitness: 0.82/1.00** [PASS]
+
+| Metric | Value | Weight | Contribution |
+|--------|-------|--------|-------------|
+| Tests  | 95% (45/47) | 50% | 0.475 |
+| Gates  | 80% (4/5) | 25% | 0.200 |
+| Cost   | 38.4K tok / 245s | 25% | 0.163 |
+
+**Bottleneck:** lead-developer (31% of tokens)
+**Verdict:** PASS - within acceptable range
+
+✅ Logged to .kilo/logs/fitness-history.jsonl
 ```

 ---

-_Evolution workflow v1.0 - Track agent improvements_
+*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*
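
The threshold routing in Step 2 and the keep-or-revert rule in Step 3 of the command doc can be sketched as follows. The thresholds (0.85, 0.70, 0.50) come from the doc; the function names and the action labels are hypothetical stand-ins for the `Task(...)` calls, which are not modeled here.

```python
from typing import List

# Thresholds as documented in Step 2 of evolution.md.
HEALTHY = 0.85
MARGINAL = 0.70
REDESIGN = 0.50


def route(fitness: float) -> List[str]:
    """Map a fitness score to the actions Step 2 describes.

    The returned labels are illustrative placeholders, not real task names.
    """
    if fitness >= HEALTHY:
        # Healthy pipeline: only log the result.
        return ["log"]
    if fitness >= MARGINAL:
        # Marginal pipeline: optimize the lowest-scoring agents.
        return ["optimize_weak_agents"]
    # Underperforming pipeline: optimize every flagged agent.
    actions = ["optimize_all_flagged_agents"]
    if fitness < REDESIGN:
        # Very low fitness additionally triggers a redesign of the worst agent.
        actions.append("redesign_worst_agent")
    return actions


def keep_changes(fitness_before: float, fitness_after: float) -> bool:
    """Step 3: commit prompt changes only on a strict improvement."""
    return fitness_after > fitness_before


if __name__ == "__main__":
    for score in (0.91, 0.75, 0.60, 0.40):
        print(score, route(score))
```

Because the comparison in `keep_changes` is strict, an unchanged score counts as no improvement and the prompt changes are reverted, matching the `ELSE` branch in Step 3.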