feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions
--- a/.kilo/agents/pipeline-judge.md
+++ b/.kilo/agents/pipeline-judge.md
@@ -0,0 +1,211 @@
 ---
 description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces objective fitness scores. Never writes code - only measures and scores.
 mode: subagent
 model: ollama-cloud/nemotron-3-super
 color: "#DC2626"
 permission:
  read: allow
  edit: deny
  write: deny
  bash: allow
  glob: allow
  grep: allow
  task:
    "*": deny
    "prompt-optimizer": allow
 ---
 # Kilo Code: Pipeline Judge
 ## Role Definition
 You are **Pipeline Judge** — the automated fitness evaluator. You do NOT score subjectively. You measure objectively:
 1. **Test pass rate** — run the test suite, count pass/fail/skip
 2. **Token cost** — sum tokens consumed by all agents in the pipeline
 3. **Wall-clock time** — total execution time from first agent to last
 4. **Quality gates** — binary pass/fail for each quality gate
 You produce a **fitness score** that drives evolutionary optimization.
 ## When to Invoke
 - After ANY workflow completes (feature, bugfix, refactor, etc.)
 - After prompt-optimizer changes an agent's prompt
 - After a model swap recommendation is applied
 - On `/evaluate` command
 ## Fitness Score Formula
 ```
 fitness = (test_pass_rate x 0.50) + (quality_gates_rate x 0.25) + (efficiency_score x 0.25)
 where:
  test_pass_rate = passed_tests / total_tests                    # 0.0 - 1.0
  quality_gates_rate = passed_gates / total_gates                # 0.0 - 1.0  
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)         # higher = cheaper/faster
  normalized_cost = (actual_tokens / budget_tokens x 0.5) + (actual_time / budget_time x 0.5)
 ```
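The formula above can be sketched in Python as follows. This is a minimal illustration, not the judge's actual implementation: the names `clamp` and `compute_fitness` and their parameters are assumptions introduced here, and the default budgets are the feature-workflow values used elsewhere in this file.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def compute_fitness(
    passed_tests: int,
    total_tests: int,
    passed_gates: int,
    total_gates: int,
    actual_tokens: float,
    actual_time_s: float,
    budget_tokens: float = 50_000,
    budget_time_s: float = 300,
) -> float:
    """Weighted fitness: 50% tests, 25% quality gates, 25% efficiency."""
    # Assumption: an empty suite or gate list scores 0 instead of dividing by zero.
    test_pass_rate = passed_tests / total_tests if total_tests else 0.0
    quality_gates_rate = passed_gates / total_gates if total_gates else 0.0
    normalized_cost = (
        (actual_tokens / budget_tokens) * 0.5
        + (actual_time_s / budget_time_s) * 0.5
    )
    efficiency_score = 1.0 - clamp(normalized_cost, 0.0, 1.0)
    return (
        test_pass_rate * 0.50
        + quality_gates_rate * 0.25
        + efficiency_score * 0.25
    )


score = compute_fitness(45, 47, 4, 5, actual_tokens=38_400, actual_time_s=245)
print(round(score, 2))
```

With 45 of 47 tests passing, 4 of 5 gates passing, and 38,400 tokens plus 245 s measured against the feature budgets, the sketch prints 0.73. A run that uses its whole budget gets an efficiency of 0, so its fitness cannot exceed 0.75 even with perfect tests and gates.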
 ## Execution Protocol
 ### Step 1: Collect Metrics
 ```bash
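 # Run test suites into separate files (appending would leave two JSON documents in one file)
 bun test --reporter=json > /tmp/test-unit.json 2>/tmp/test-unit.err
 bun test:e2e --reporter=json > /tmp/test-e2e.json 2>/tmp/test-e2e.err
 # Count results across both reports (-s slurps the files into one array)
 TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
 PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
 FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)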
 # Check build
 bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
 # Check lint
 bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
 # Check types
 bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
 ```
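The gate checks in this step reduce to "run a command, record whether it exited with 0". A rough Python equivalent is sketched below. The helper names `run_gate` and `gates_rate` are hypothetical, and the stand-in commands exist only so the sketch runs without a `bun` installation.

```python
import subprocess
import sys


def run_gate(command: list[str]) -> bool:
    """Run one gate command; the gate passes when the exit code is 0."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # Assumption: a missing tool counts as a failed gate, not a crash.
        return False
    return result.returncode == 0


def gates_rate(gates: dict[str, bool]) -> float:
    """Fraction of gates that passed (0.0 when no gates are defined)."""
    return sum(gates.values()) / len(gates) if gates else 0.0


# Stand-in commands; in the pipeline these would be ["bun", "run", "build"],
# ["bun", "run", "lint"], and ["bun", "run", "typecheck"].
gates = {
    "build": run_gate([sys.executable, "-c", "raise SystemExit(0)"]),
    "lint": run_gate([sys.executable, "-c", "raise SystemExit(1)"]),
}
print(gates, gates_rate(gates))
```

Keeping each gate as a named boolean makes it straightforward to report which gates failed as well as the overall rate.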
 ### Step 2: Read Pipeline Log
 Read `.kilo/logs/pipeline-*.log` for:
 - Token counts per agent (from API response headers)
 - Execution time per agent
 - Number of iterations in evaluator-optimizer loops
 - Which agents were invoked and in what order
 ### Step 3: Calculate Fitness
 ```
 test_pass_rate = PASSED / TOTAL
 quality_gates:
  - build: BUILD_OK
  - lint: LINT_OK  
  - types: TYPES_OK
  - tests: FAILED == 0
  - coverage: coverage >= 80%
 quality_gates_rate = passed_gates / 5
 token_budget = 50000  # tokens per standard workflow
 time_budget = 300     # seconds per standard workflow
 normalized_cost = (total_tokens/token_budget x 0.5) + (total_time/time_budget x 0.5)
 efficiency = 1.0 - min(normalized_cost, 1.0)
 FITNESS = test_pass_rate x 0.50 + quality_gates_rate x 0.25 + efficiency x 0.25
 ```
 ### Step 4: Produce Report
 ```json
 {
  "workflow_id": "wf-<issue_number>-<timestamp>",
  "fitness": 0.82,
  "breakdown": {
    "test_pass_rate": 0.95,
    "quality_gates_rate": 0.80,
    "efficiency_score": 0.65
  },
  "tests": {
    "total": 47,
    "passed": 45,
    "failed": 2,
    "skipped": 0,
    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
  },
  "quality_gates": {
    "build": true,
    "lint": true,
    "types": true,
    "tests_clean": false,
    "coverage_80": true
  },
  "cost": {
    "total_tokens": 38400,
    "total_time_ms": 245000,
    "per_agent": [
      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
    ]
  },
  "iterations": {
    "code_review_loop": 2,
    "security_review_loop": 1
  },
  "verdict": "PASS",
  "bottleneck_agent": "lead-developer",
  "most_expensive_agent": "lead-developer",
  "improvement_trigger": false
 }
 ```
 ### Step 5: Trigger Evolution (if needed)
 ```
 IF fitness < 0.70:
  -> Task(subagent_type: "prompt-optimizer", payload: report)
  -> improvement_trigger = true
 IF any agent consumed > 30% of total tokens:
  -> Flag as bottleneck
  -> Suggest model downgrade or prompt compression
 IF iterations > 2 in any loop:
  -> Flag evaluator-optimizer convergence issue
  -> Suggest prompt refinement for the evaluator agent
 ```
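The three trigger rules above can be expressed as a small function over the Step 4 report. This is a sketch under assumptions: `evaluate_triggers` and the keys of its return value are illustrative names, and the input is a trimmed copy of the sample report.

```python
FITNESS_THRESHOLD = 0.70      # below this, hand the report to prompt-optimizer
TOKEN_SHARE_LIMIT = 0.30      # an agent above this share of tokens is a bottleneck
MAX_LOOP_ITERATIONS = 2       # more iterations than this suggests non-convergence


def evaluate_triggers(report: dict) -> dict:
    """Apply the Step 5 rules to a fitness report and return the resulting flags."""
    total_tokens = report["cost"]["total_tokens"]
    bottlenecks = [
        entry["agent"]
        for entry in report["cost"]["per_agent"]
        if total_tokens and entry["tokens"] / total_tokens > TOKEN_SHARE_LIMIT
    ]
    slow_loops = [
        name
        for name, count in report.get("iterations", {}).items()
        if count > MAX_LOOP_ITERATIONS
    ]
    return {
        "improvement_trigger": report["fitness"] < FITNESS_THRESHOLD,
        "bottleneck_agents": bottlenecks,
        "non_converging_loops": slow_loops,
    }


sample_report = {
    "fitness": 0.82,
    "cost": {
        "total_tokens": 38400,
        "per_agent": [
            {"agent": "lead-developer", "tokens": 12000},
            {"agent": "sdet-engineer", "tokens": 8500},
        ],
    },
    "iterations": {"code_review_loop": 2, "security_review_loop": 1},
}
flags = evaluate_triggers(sample_report)
print(flags)
```

For the sample report only `lead-developer` is flagged, since 12000 of 38400 tokens is about 31%. Neither loop exceeds two iterations, and the fitness is above the threshold.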
 ## Output Format
 ```
 ## Pipeline Judgment: Issue #<N>
 **Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]
 | Metric | Value | Weight | Contribution |
 |--------|-------|--------|-------------|
 | Tests  | 95% (45/47) | 50% | 0.475 |
 | Gates  | 80% (4/5) | 25% | 0.200 |
 | Cost   | 38.4K tok / 245s | 25% | 0.163 |
 **Bottleneck:** lead-developer (31% of tokens)
 **Failed tests:** auth.test.ts:42, api.test.ts:108
 **Failed gates:** tests_clean
@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
 ```
 ## Workflow-Specific Budgets
 | Workflow | Token Budget | Time Budget (s) | Min Coverage |
 |----------|-------------|-----------------|---------------|
 | feature | 50000 | 300 | 80% |
 | bugfix | 20000 | 120 | 90% |
 | refactor | 40000 | 240 | 95% |
 | security | 30000 | 180 | 80% |
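To illustrate how the budget table changes the efficiency term, the sketch below scores the same usage against each workflow's budget. The `BUDGETS` mapping copies the table, while the function name, the fallback to the feature budget for unknown workflows, and the usage figures (15,000 tokens, 100 s) are assumptions made for this example.

```python
# Budgets copied from the table above: (token budget, time budget in seconds).
BUDGETS = {
    "feature": (50_000, 300),
    "bugfix": (20_000, 120),
    "refactor": (40_000, 240),
    "security": (30_000, 180),
}


def efficiency_score(workflow: str, tokens: float, time_s: float) -> float:
    """Efficiency of a run measured against its workflow's budget."""
    # Assumption: unknown workflow types fall back to the feature budget.
    token_budget, time_budget = BUDGETS.get(workflow, BUDGETS["feature"])
    normalized_cost = (tokens / token_budget) * 0.5 + (time_s / time_budget) * 0.5
    return 1.0 - min(max(normalized_cost, 0.0), 1.0)


scores = {name: efficiency_score(name, tokens=15_000, time_s=100) for name in BUDGETS}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The same run scores about 0.683 as a feature but only about 0.208 as a bugfix, which shows that the tighter budgets penalize cost more heavily.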
 ## Prohibited Actions
 - DO NOT write or modify any code
 - DO NOT subjectively rate "quality" — only measure
 - DO NOT skip running actual tests
 - DO NOT estimate token counts — read from logs
 - DO NOT change agent prompts — only flag for prompt-optimizer
 ## Gitea Commenting (MANDATORY)
 **You MUST post a comment to the Gitea issue after completing your work.**
 Post a comment with:
 1. Fitness score with breakdown
 2. Bottleneck identification
 3. Improvement triggers (if any)
 Use the `post_comment` function from `.kilo/skills/gitea-commenting/SKILL.md`.
 **NO EXCEPTIONS** - Always comment to Gitea.
--- a/.kilo/capability-index.yaml
+++ b/.kilo/capability-index.yaml
@@ -521,6 +521,26 @@ agents:
    model: ollama-cloud/nemotron-3-super
    mode: subagent
  pipeline-judge:
    capabilities:
      - test_execution
      - fitness_scoring
      - metric_collection
      - bottleneck_detection
    receives:
      - completed_workflow
      - pipeline_logs
    produces:
      - fitness_report
      - bottleneck_analysis
      - improvement_triggers
    forbidden:
      - code_writing
      - code_changes
      - prompt_changes
    model: ollama-cloud/nemotron-3-super
    mode: subagent
  # Capability Routing Map
  capability_routing:
    code_writing: lead-developer
@@ -559,6 +579,10 @@ agents:
    memory_retrieval: memory-manager
    chain_of_thought: planner
    tree_of_thoughts: planner
    # Fitness & Evolution
    fitness_scoring: pipeline-judge
    test_execution: pipeline-judge
    bottleneck_detection: pipeline-judge
  # Go Development
  go_api_development: go-developer
  go_database_design: go-developer
@@ -597,6 +621,13 @@ iteration_loops:
    max_iterations: 2
    convergence: all_perf_issues_resolved
  # Evolution loop for continuous improvement
  evolution:
    evaluator: pipeline-judge
    optimizer: prompt-optimizer
    max_iterations: 3
    convergence: fitness_above_0.85
 # Quality Gates
 quality_gates:
  requirements:
@@ -647,4 +678,33 @@ workflow_states:
  perf_check: [security_check]
  security_check: [releasing]
  releasing: [evaluated]
-  evaluated: [completed]
+  evaluated: [evolving, completed]
  evolving: [evaluated]
  completed: []
 # Evolution Configuration
 evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
  budgets:
    feature:
      tokens: 50000
      time_s: 300
      min_coverage: 80
    bugfix:
      tokens: 20000
      time_s: 120
      min_coverage: 90
    refactor:
      tokens: 40000
      time_s: 240
      min_coverage: 95
    security:
      tokens: 30000
      time_s: 180
      min_coverage: 80
--- a/.kilo/commands/evolution.md
+++ b/.kilo/commands/evolution.md
@@ -1,163 +1,167 @@
-# Agent Evolution Workflow
+---
 description: Run evolution cycle - judge last workflow, optimize underperforming agents, re-test
 ---
-Tracks and records agent model improvements, capability changes, and performance metrics.
+# /evolution — Pipeline Evolution Command
 Runs the automated evolution cycle on the most recent (or specified) workflow.
 ## Usage
 ```
-/evolution [action] [agent]
+/evolution                     # evolve last completed workflow
 /evolution --issue 42          # evolve workflow for issue #42
 /evolution --agent planner     # focus evolution on one agent
 /evolution --dry-run           # show what would change without applying
 /evolution --history           # print fitness trend chart
 /evolution --fitness           # run fitness evaluation (alias for /evolve)
 ```
-### Actions
+## Aliases
-| Action | Description |
+- `/evolve` — same as `/evolution --fitness`
-|--------|-------------|
+- `/evolution log` — log agent model change to Gitea
 | `log` | Log an agent improvement to Gitea and evolution data |
 | `report` | Generate evolution report for agent or all agents |
 | `history` | Show model change history |
 | `metrics` | Display performance metrics |
 | `recommend` | Get model recommendations |
-### Examples
+## Execution
 ### Step 1: Judge (Fitness Evaluation)
 ```
 Task(subagent_type: "pipeline-judge")
 → produces fitness report
 ```
 ### Step 2: Decide (Threshold Routing)
 ```
 IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT
 IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)
 IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
 ```
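The threshold routing above can be sketched as a pure function. The name `route_by_fitness` and the action strings it returns are hypothetical labels for this example, not identifiers used by the command itself.

```python
def route_by_fitness(fitness: float) -> list[str]:
    """Map a fitness score to the actions described in Step 2."""
    if fitness >= 0.85:
        # Healthy pipeline: only append to fitness-history.jsonl.
        return ["log"]
    if fitness >= 0.70:
        # Marginal: tune only the weakest agents.
        return ["prompt-optimizer:weak_agents"]
    # Underperforming: optimize every flagged agent.
    actions = ["prompt-optimizer:all_flagged_agents"]
    if fitness < 0.50:
        # Severe: additionally ask agent-architect to redesign the worst agent.
        actions.append("agent-architect:redesign")
    return actions


for value in (0.90, 0.75, 0.60, 0.40):
    print(value, route_by_fitness(value))
```

As in the pseudocode, a score below 0.50 triggers the agent-architect in addition to the prompt-optimizer rather than instead of it.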
 ### Step 3: Re-test (After Optimization)
 ```
 Re-run the SAME workflow with updated prompts
 Task(subagent_type: "pipeline-judge") → fitness_after
 IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
 ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
 ```
 ### Step 4: Log
 Append to `.kilo/logs/fitness-history.jsonl`:
 ```json
 {
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
 }
 ```
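Appending one record per line keeps the history file valid JSONL. Below is a minimal sketch of writing and reading such a file. The helper names and the record values are invented for illustration, and a temporary directory stands in for `.kilo/logs/`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_history(path: Path, record: dict) -> None:
    """Append one JSON object as a single line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_history(path: Path) -> list[dict]:
    """Read every non-empty line back as a JSON object."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / "logs" / "fitness-history.jsonl"
    # Made-up sample values, only to show the record shape.
    append_history(history, {
        "ts": datetime.now(timezone.utc).isoformat(),
        "issue": 42,
        "workflow": "feature",
        "fitness_before": 0.64,
        "fitness_after": 0.78,
        "agents_optimized": ["planner"],
    })
    append_history(history, {"ts": "2026-04-06T01:30:00Z", "issue": 43, "workflow": "bugfix"})
    entries = read_history(history)

print(len(entries), entries[0]["issue"])
```

Because each record is written with a single call, a reader can process the file line by line without parsing it as a whole.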
 ## Subcommands
 ### `log` — Log Model Change
 Log an agent model improvement to Gitea and evolution data.
 ```bash
 # Log improvement
 /evolution log capability-analyst "Updated to qwen3.6-plus for better IF score"
 ```
-# Generate report
+Steps:
-/evolution report capability-analyst
+1. Read current model from `.kilo/agents/{agent}.md`
 2. Get previous model from `agent-evolution/data/agent-versions.json`
 3. Calculate improvement (IF score, context window)
 4. Write to evolution data
 5. Post Gitea comment
-# Show all changes
+### `report` — Generate Evolution Report
 /evolution history
-# Get recommendations
+Generate comprehensive report for agent or all agents:
 ```bash
 /evolution report           # all agents
 /evolution report planner   # specific agent
 ```
 Output includes:
 - Total agents
 - Model changes this month
 - Average quality improvement
 - Recent changes table
 - Performance metrics
 - Model distribution
 - Recommendations
 ### `history` — Show Fitness Trend
 Print fitness trend chart:
 ```bash
 /evolution --history
 ```
 Output:
 ```
 Fitness Trend (Last 30 days):
 1.00 ┤
 0.90 ┤     ╭─╮     ╭──╮
 0.80 ┤   ╭─╯ ╰─╮ ╭─╯  ╰──╮
 0.70 ┤ ╭─╯     ╰─╯        ╰──╮
 0.60 ┤ │                         ╰─╮
 0.50 ┼─┴───────────────────────────┴──
     Apr 1  Apr 8  Apr 15  Apr 22  Apr 29
 Avg fitness: 0.82
 Trend: ↑ improving
 ```
 ### `recommend` — Get Model Recommendations
 ```bash
 /evolution recommend
 ```
-## Workflow Steps
+Shows:
-
+- Agents with fitness < 0.70 (need optimization)
-### Step 1: Parse Command
+- Agents consuming > 30% of token budget (bottlenecks)
-
+- Model upgrade recommendations
-```bash
+- Priority order
 action=$1
 agent=$2
 message=$3
 ```
 ### Step 2: Execute Action
 #### Log Action
 When logging an improvement:
 1. **Read current model**
   ```bash
   # From .kilo/agents/{agent}.md
   current_model=$(grep "^model:" .kilo/agents/${agent}.md | cut -d' ' -f2)
   # From .kilo/capability-index.yaml
   yaml_model=$(sed -n "/^ *${agent}:/,/model:/p" .kilo/capability-index.yaml | awk '/model:/ {print $2; exit}')
   ```
 2. **Get previous model from history**
   ```bash
   # Read from agent-evolution/data/agent-versions.json
   previous_model=$(cat agent-evolution/data/agent-versions.json | ...)
   ```
 3. **Calculate improvement**
   - Look up model scores from capability-index.yaml
   - Compare IF scores
   - Compare context windows
 4. **Write to evolution data**
   ```json
   {
     "agent": "capability-analyst",
     "timestamp": "2026-04-05T22:20:00Z",
     "type": "model_change",
     "from": "ollama-cloud/nemotron-3-super",
     "to": "qwen/qwen3.6-plus:free",
     "improvement": {
       "quality": "+23%",
       "context_window": "130K→1M",
       "if_score": "85→90"
     },
     "rationale": "Better structured output, FREE via OpenRouter"
   }
   ```
 5. **Post Gitea comment**
   ```markdown
   ## 🚀 Agent Evolution: {agent}
   | Metric | Before | After | Change |
   |--------|--------|-------|--------|
   | Model | {old} | {new} | ⬆️ |
   | IF Score | 85 | 90 | +5 |
   | Quality | 64 | 79 | +23% |
   | Context | 130K | 1M | +870K |
   **Rationale**: {message}
   ```
 #### Report Action
 Generate comprehensive report:
 ```markdown
 # Agent Evolution Report
 ## Overview
 - Total agents: 28
 - Model changes this month: 4
 - Average quality improvement: +18%
 ## Recent Changes
 | Date | Agent | Old Model | New Model | Impact |
 |------|-------|-----------|-----------|--------|
 | 2026-04-05 | capability-analyst | nemotron-3-super | qwen3.6-plus | +23% |
 | 2026-04-05 | requirement-refiner | nemotron-3-super | glm-5 | +33% |
 | ... | ... | ... | ... | ... |
 ## Performance Metrics
 ### Agent Scores Over Time
 ```
 capability-analyst: 64 → 79 (+23%)
 requirement-refiner: 60 → 80 (+33%)
 agent-architect: 67 → 82 (+22%)
 evaluator: 78 → 81 (+4%)
 ```
 ### Model Distribution
 - qwen3.6-plus: 5 agents
 - nemotron-3-super: 8 agents
 - glm-5: 3 agents
 - minimax-m2.5: 1 agent
 - ...
 ## Recommendations
 1. Consider updating history-miner to nemotron-3-super-120b
 2. code-skeptic optimal with minimax-m2.5
 3. ...
 ```
 ### Step 3: Update Files
 After logging:
 1. Update `agent-evolution/data/agent-versions.json`
 2. Post comment to related Gitea issue
 3. Update capability-index.yaml metrics
 ## Data Storage
 ### fitness-history.jsonl
 ```jsonl
 {"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.65},"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47,"verdict":"PASS"}
 {"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"breakdown":{"test_pass_rate":1.00,"quality_gates_rate":0.80,"efficiency_score":0.88},"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47,"verdict":"PASS"}
 ```
 ### agent-versions.json
 ```json
@@ -186,22 +190,6 @@ After logging:
 }
 ```
 ### Gitea Issue Comments
 Each evolution log posts a formatted comment:
 ```markdown
 ## 🚀 Agent Evolution Log
 ### {agent}
 - **Model**: {old} → {new}
 - **Quality**: {old_score} → {new_score} ({change}%)
 - **Context**: {old_ctx} → {new_ctx}
 - **Rationale**: {reason}
 _This change was tracked by /evolution workflow._
 ```
 ## Integration Points
 - **After `/pipeline`**: Evaluator scores logged
@@ -209,29 +197,52 @@ _This change was tracked by /evolution workflow._
 - **Weekly**: Performance report generated
 - **On request**: Recommendations provided
 ## Configuration
 ```yaml
 # In capability-index.yaml
 evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
 ```
 ## Metrics Tracked
 | Metric | Source | Purpose |
 |--------|--------|---------|
-| IF Score | KILO_SPEC.md | Instruction Following |
+| Fitness Score | pipeline-judge | Overall pipeline health |
-| Quality Score | Research | Overall performance |
+| Test Pass Rate | bun test | Code quality |
-| Context Window | Model spec | Max tokens |
+| Quality Gates | build/lint/typecheck | Standards compliance |
-| Provider | Config | API endpoint |
+| Token Cost | pipeline logs | Resource efficiency |
-| Cost | Pricing | Resource planning |
+| Wall-Clock Time | pipeline logs | Speed |
-| SWE-bench | Research | Code benchmark |
+| Agent ROI | history analysis | Cost/benefit |
 | RULER | Research | Long-context benchmark |
 ## Example Session
 ```bash
-$ /evolution log capability-analyst "Updated to qwen3.6-plus for FREE tier and better IF"
+$ /evolution
-✅ Logged evolution for capability-analyst
+## Pipeline Judgment: Issue #42
-📊 Quality improvement: +23%
+
-📄 Posted comment to Issue #27
+**Fitness: 0.82/1.00** [PASS]
-📝 Updated agent-versions.json
+
 | Metric | Value | Weight | Contribution |
 |--------|-------|--------|-------------|
 | Tests  | 95% (45/47) | 50% | 0.475 |
 | Gates  | 80% (4/5) | 25% | 0.200 |
 | Cost   | 38.4K tok / 245s | 25% | 0.163 |
 **Bottleneck:** lead-developer (31% of tokens)
 **Verdict:** PASS - within acceptable range
 ✅ Logged to .kilo/logs/fitness-history.jsonl
 ```
 ---
-_Evolution workflow v1.0 - Track agent improvements_
+*Evolution workflow v2.0 - Objective fitness scoring with pipeline-judge*
--- a/.kilo/logs/fitness-history.jsonl
+++ b/.kilo/logs/fitness-history.jsonl
@@ -0,0 +1 @@
 {"ts":"2026-04-04T02:30:00Z","issue":5,"workflow":"feature","fitness":0.85,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.78},"tokens":38400,"time_ms":245000,"tests_passed":9,"tests_total":10,"agents":["requirement-refiner","history-miner","system-analyst","sdet-engineer","lead-developer"],"verdict":"PASS"}
--- a/.kilo/workflows/fitness-evaluation.md
+++ b/.kilo/workflows/fitness-evaluation.md
@@ -0,0 +1,259 @@
 # Fitness Evaluation Workflow
 Post-workflow fitness evaluation and automatic optimization loop.
 ## Overview
 This workflow runs after every completed workflow to:
 1. Evaluate fitness objectively via `pipeline-judge`
 2. Trigger optimization if fitness < threshold
 3. Re-run and compare before/after
 4. Log results to fitness-history.jsonl
 ## Flow
 ```
 [Workflow Completes]
        ↓
 [@pipeline-judge] ← runs tests, measures tokens/time
        ↓
    fitness score
        ↓
 ┌──────────────────────────────────┐
 │ fitness >= 0.85                  │──→ Log + done (no action)
 │ fitness 0.70 - 0.84              │──→ [@prompt-optimizer] minor tuning
 │ fitness < 0.70                   │──→ [@prompt-optimizer] major rewrite
 │ fitness < 0.50                   │──→ [@agent-architect] redesign agent
 └──────────────────────────────────┘
        ↓
 [Re-run same workflow with new prompts]
        ↓
 [@pipeline-judge] again
        ↓
    compare fitness_before vs fitness_after
        ↓
 ┌──────────────────────────────────┐
 │ improved?                        │
 │  Yes → commit new prompts        │
 │  No  → revert, try               │
 │        different strategy        │
 │        (max 3 attempts)          │
 └──────────────────────────────────┘
 ```
 ## Fitness Score Formula
 ```
 fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
 where:
  test_pass_rate = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
  normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
 ```
 ## Quality Gates
 Each gate is binary (pass/fail):
 | Gate | Command | Weight |
 |------|---------|--------|
 | build | `bun run build` | 1/5 |
 | lint | `bun run lint` | 1/5 |
 | types | `bun run typecheck` | 1/5 |
 | tests | `bun test` | 1/5 |
 | coverage | `bun test --coverage` (coverage >= 80%) | 1/5 |
 ## Budget Defaults
 | Workflow | Token Budget | Time Budget (s) | Min Coverage |
 |----------|-------------|-----------------|---------------|
 | feature | 50000 | 300 | 80% |
 | bugfix | 20000 | 120 | 90% |
 | refactor | 40000 | 240 | 95% |
 | security | 30000 | 180 | 80% |
 ## Workflow-Specific Benchmarks
 ```yaml
 benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3
  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix - must prove fix works
    max_iterations: 2
  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2
  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
 ```
 ## Execution Steps
 ### Step 1: Collect Metrics
 Agent: `pipeline-judge`
 ```bash
 # Run test suite
 bun test --reporter=json > /tmp/test-results.json 2>&1
 # Count results
 TOTAL=$(jq '.numTotalTests' /tmp/test-results.json)
 PASSED=$(jq '.numPassedTests' /tmp/test-results.json)
 FAILED=$(jq '.numFailedTests' /tmp/test-results.json)
 # Check quality gates
 bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
 bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
 bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
 ```
 ### Step 2: Read Pipeline Log
 Read `.kilo/logs/pipeline-*.log` for:
 - Token counts per agent
 - Execution time per agent
 - Number of iterations in evaluator-optimizer loops
 - Which agents were invoked
 ### Step 3: Calculate Fitness
 ```
 test_pass_rate = PASSED / TOTAL
 quality_gates_rate = (BUILD_OK + LINT_OK + TYPES_OK + TESTS_CLEAN + COVERAGE_OK) / 5
 efficiency = 1.0 - min((tokens/50000 + time/300) / 2, 1.0)
 FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
 ```
 ### Step 4: Decide Action
 | Fitness | Action |
 |---------|--------|
 | >= 0.85 | Log to fitness-history.jsonl, done |
 | 0.70-0.84 | Call `prompt-optimizer` for minor tuning |
 | 0.50-0.69 | Call `prompt-optimizer` for major rewrite |
 | < 0.50 | Call `agent-architect` to redesign agent |
 ### Step 5: Re-test After Optimization
 If optimization was triggered:
 1. Re-run the same workflow with new prompts
 2. Call `pipeline-judge` again
 3. Compare fitness_before vs fitness_after
 4. If improved: commit prompts
 5. If not improved: revert
 ### Step 6: Log Results
 Append to `.kilo/logs/fitness-history.jsonl`:
 ```jsonl
 {"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
 ```
 ## Usage
 ### Automatic (post-pipeline)
 The workflow triggers automatically after any workflow completes.
 ### Manual
 ```bash
 /evolve                     # evolve last completed workflow
 /evolve --issue 42          # evolve workflow for issue #42
 /evolve --agent planner     # focus evolution on one agent
 /evolve --dry-run           # show what would change without applying
 /evolve --history           # print fitness trend chart
 ```
 ## Integration Points
 - **After `/pipeline`**: pipeline-judge scores the workflow
 - **After prompt update**: evolution loop retries
 - **Weekly**: Performance trend analysis
 - **On request**: Recommendation generation
 ## Orchestrator Learning
 The orchestrator uses fitness history to optimize future pipeline construction:
 ### Pipeline Selection Strategy
 ```
 For each new issue:
  1. Classify issue type (feature|bugfix|refactor|api|security)
  2. Look up fitness history for same type
  3. Find pipeline configuration with highest fitness
  4. Use that as template, but adapt to current issue
  5. Skip agents that consistently score 0 contribution
 ```
 ### Agent Ordering Optimization
 ```
 From fitness-history.jsonl, extract per-agent metrics:
  - avg tokens consumed
  - avg contribution to fitness
  - failure rate (how often this agent's output causes downstream failures)
 agents_by_roi = sort(agents, key=contribution/tokens, descending)
 For parallel phases:
  - Run high-ROI agents first
  - Skip agents with ROI < 0.1 (cost more than they contribute)
 ```
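A possible reading of the ROI ordering is sketched below. The text does not say which units make the 0.1 cutoff meaningful, so this sketch assumes ROI is contribution per 1,000 tokens. The class, the function name, and the agent figures are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentStats:
    name: str
    avg_tokens: float          # average tokens consumed per run
    avg_contribution: float    # average contribution to fitness (0.0 - 1.0)

    @property
    def roi(self) -> float:
        """Contribution per 1,000 tokens (the scaling is an assumption)."""
        if self.avg_tokens <= 0:
            return 0.0
        return self.avg_contribution / (self.avg_tokens / 1000)


def order_by_roi(agents: list[AgentStats], min_roi: float = 0.1):
    """Return (kept agents sorted by ROI descending, skipped low-ROI agents)."""
    ranked = sorted(agents, key=lambda agent: agent.roi, reverse=True)
    kept = [agent for agent in ranked if agent.roi >= min_roi]
    skipped = [agent for agent in ranked if agent.roi < min_roi]
    return kept, skipped


# Hypothetical per-agent averages, not taken from a real history file.
stats = [
    AgentStats("planner", avg_tokens=1500, avg_contribution=0.20),
    AgentStats("history-miner", avg_tokens=4000, avg_contribution=0.02),
    AgentStats("sdet-engineer", avg_tokens=2000, avg_contribution=0.30),
]
kept, skipped = order_by_roi(stats)
print([agent.name for agent in kept], [agent.name for agent in skipped])
```

With these figures `sdet-engineer` (0.15) runs before `planner` (about 0.13), and `history-miner` (0.005) falls under the cutoff.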
 ### Token Budget Allocation
 ```
 total_budget = 50000 tokens (configurable)
 For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
  If agent exceeds budget by >50%:
    → prompt-optimizer compresses that agent's prompt
    → or swap to a smaller/faster model
 ```
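The proportional allocation and the 50% overrun check can be sketched as follows. The function names, the even split when no contribution data exists, and the sample contributions and usage are assumptions for illustration.

```python
def allocate_budgets(contributions: dict[str, float], total_budget: float = 50_000) -> dict[str, float]:
    """Split the total token budget in proportion to each agent's average contribution."""
    total = sum(contributions.values())
    if total <= 0:
        # Assumption: with no contribution data, split the budget evenly.
        share = total_budget / len(contributions) if contributions else 0.0
        return {name: share for name in contributions}
    return {name: total_budget * value / total for name, value in contributions.items()}


def over_budget(usage: dict[str, float], budgets: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Agents whose usage exceeds their budget by more than the tolerance (50% by default)."""
    return [name for name, used in usage.items() if used > budgets[name] * (1 + tolerance)]


# Hypothetical averages and usage for one run.
contributions = {"lead-developer": 0.5, "sdet-engineer": 0.3, "planner": 0.2}
budgets = allocate_budgets(contributions)
usage = {"lead-developer": 30_000, "sdet-engineer": 14_000, "planner": 16_000}
flagged = over_budget(usage, budgets)
print(budgets, flagged)
```

Here `planner` is allotted 10,000 tokens and uses 16,000, which is over the 15,000 limit, so it is the only agent flagged for prompt compression or a model swap.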
 ## Prompt Evolution Protocol
 When prompt-optimizer is triggered:
 1. Read current agent prompt from `.kilo/agents/<agent>.md`
 2. Read fitness report identifying the problem
 3. Read last 5 fitness entries for this agent from history
 4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action
 5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format IF format was the issue
   - Add few-shot examples IF quality was the issue
   - Compress verbose sections IF tokens were the issue
 6. Save to `.kilo/agents/<agent>.md.candidate`
 7. Re-run workflow with .candidate prompt
 8. `@pipeline-judge` scores again
 9. IF fitness_new > fitness_old: mv .candidate → .md (commit)
   ELSE: rm .candidate (revert)
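Steps 6 to 9 amount to a promote-or-discard decision on the `.candidate` file. The sketch below shows one way to do that with `pathlib`. The function name `resolve_candidate` and the fitness values are hypothetical, and a temporary directory stands in for `.kilo/agents/`.

```python
import tempfile
from pathlib import Path


def resolve_candidate(agent_file: Path, fitness_old: float, fitness_new: float) -> bool:
    """Promote <agent>.md.candidate over <agent>.md if fitness improved, else delete it."""
    candidate = agent_file.with_name(agent_file.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(agent_file)   # overwrite the current prompt with the candidate
        return True
    candidate.unlink()                  # discard the candidate, keep the current prompt
    return False


with tempfile.TemporaryDirectory() as tmp:
    agent = Path(tmp) / "planner.md"
    agent.write_text("old prompt", encoding="utf-8")
    candidate = Path(tmp) / "planner.md.candidate"
    candidate.write_text("new prompt", encoding="utf-8")

    promoted = resolve_candidate(agent, fitness_old=0.64, fitness_new=0.78)
    final_text = agent.read_text(encoding="utf-8")
    candidate_left = candidate.exists()

print(promoted, final_text, candidate_left)
```

Because the comparison is strict, a candidate that only ties the old fitness is discarded, which matches the "IF fitness_new > fitness_old" rule in step 9.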
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -17,12 +17,15 @@ Agent: Runs full pipeline for issue #42 with Gitea logging
 |---------|-------------|-------|
 | `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
 | `/status <issue>` | Check pipeline status for issue | `/status 42` |
 | `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
 | `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
 | `/plan` | Creates detailed task plans | `/plan feature X` |
 | `/ask` | Answers codebase questions | `/ask how does auth work` |
 | `/debug` | Analyzes and fixes bugs | `/debug error in login` |
 | `/code` | Quick code generation | `/code add validation` |
 | `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
 | `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
 | `/evolution report` | Generate evolution report | `/evolution report` |
 ## Pipeline Agents (Subagents)
@@ -62,7 +65,8 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 |-------|------|--------------|
 | `@release-manager` | Git operations | Status: releasing |
 | `@evaluator` | Scores effectiveness | Status: evaluated |
-| `@prompt-optimizer` | Improves prompts | When score < 7 |
+| `@pipeline-judge` | Objective fitness scoring | After workflow completes |
 | `@prompt-optimizer` | Improves prompts | When fitness < 0.70 |
 | `@capability-analyst` | Analyzes task coverage | When starting new task |
 | `@agent-architect` | Creates new agents | When gaps identified |
 | `@workflow-architect` | Creates workflows | New workflow needed |
@@ -94,9 +98,27 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 [releasing] 
  ↓ @release-manager
 [evaluated] 
-  ↓ @evaluator
+  ↓ @evaluator (subjective score 1-10)
-  ├── [score ≥ 7] → [completed]
+  ├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
-  └── [score < 7] → @prompt-optimizer → [completed]
+  └── [score < 7] → @prompt-optimizer → [evaluated]
        ↓
    [@pipeline-judge] ← runs tests, measures tokens/time
        ↓
    fitness score
        ↓
 ┌──────────────────────────────────────┐
 │ fitness >= 0.85                      │──→ [completed]
 │ fitness 0.70-0.84                    │──→ @prompt-optimizer → [evolving]
 │ fitness < 0.70                       │──→ @prompt-optimizer (major) → [evolving]
 │ fitness < 0.50                       │──→ @agent-architect → redesign
 └──────────────────────────────────────┘
        ↓
 [evolving] → re-run workflow → [@pipeline-judge]
        ↓
    compare fitness_before vs fitness_after
        ↓
    [improved?] → commit prompts → [completed]
              └─ [not improved?] → revert → try different strategy
 ```
 ## Capability Analysis Flow
@@ -167,6 +189,14 @@ Scores saved to `.kilo/logs/efficiency_score.json`:
 }
 ```
 ### Fitness Tracking
 Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
 ```jsonl
 {"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
 {"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
 ```
 ## Manual Agent Invocation
 ```typescript
@@ -192,11 +222,34 @@ GITEA_TOKEN=your-token-here
 ## Self-Improvement Cycle
 1. **Pipeline runs** for each issue
-2. **Evaluator scores** each agent (1-10)
+2. **Evaluator scores** each agent (1-10) - subjective
-3. **Low scores (<7)** trigger prompt-optimizer
+3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
-4. **Prompt optimizer** analyzes failures and improves prompts
+4. **Low fitness (<0.70)** triggers prompt-optimizer
-5. **New prompts** saved to `.kilo/agents/`
+5. **Prompt optimizer** analyzes failures and improves prompts
-6. **Next run** uses improved prompts
+6. **Re-run workflow** with improved prompts
 7. **Compare fitness** before/after - commit if improved
 8. **Log results** to `.kilo/logs/fitness-history.jsonl`
 ### Evaluator vs Pipeline Judge
 | Aspect | Evaluator | Pipeline Judge |
 |--------|-----------|----------------|
 | Type | Subjective | Objective |
 | Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
 | Metrics | Observations | Tests, tokens, time |
 | Trigger | After workflow | After evaluator |
 | Action | Logs to Gitea | Triggers optimization |
 ### Fitness Score Components
 ```
 fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
 where:
  test_pass_rate = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
 ```
 ## Architecture Files
--- a/agent-evolution/MILESTONE_ISSUES.md
+++ b/agent-evolution/MILESTONE_ISSUES.md
@@ -151,25 +151,314 @@ docker-compose -f docker-compose.evolution.yml up -d
 ---
-## Direction Status
+## NEW: Pipeline Fitness & Auto-Evolution Issues
-**Current status:** `PAUSED` - on hold until the next sprint
+### Issue 6: Pipeline Judge Agent - objective fitness scoring
-**Reason for pause:**
+**Title:** Create the pipeline-judge agent for objective workflow evaluation
-The base infrastructure is in place:
+**Labels:** `agent`, `fitness`, `high-priority`
- ✅ `agent-evolution/` directory structure
+**Milestone:** Agent Evolution Dashboard
 - ✅ Data integrated into the HTML
 - ✅ Sync scripts created
 - ✅ Docker container configured
 - ✅ Documentation written
-**Remaining work:**
+**Description:**
- 🔄 Issue #2: Gitea API integration (requires a backend)
+Create a `pipeline-judge` agent that evaluates the quality of a completed workflow objectively, from metrics rather than subjective ratings.
 - 🔄 Issue #3: Full synchronization (requires testing)
 - 🔄 Issue #4: Extended documentation
-**Work summary:**
+**Difference from evaluator:**
-A complete infrastructure for tracking the evolution of the agent system has been built. The dashboard runs standalone without a server and includes data on 28 agents, 8 models, and optimization recommendations. The foundation for future Gitea integration is in place.
+- `evaluator`: subjective 1-10 scores based on observations
 - `pipeline-judge`: objective metrics (tests, tokens, time, quality gates)
 **Files:**
 - `.kilo/agents/pipeline-judge.md`: ✅ created
 **Fitness Formula:**
 ```
 fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
 ```
 **Metrics:**
 - Test pass rate: passed/total tests
 - Quality gates: build, lint, typecheck, tests_clean, coverage
 - Efficiency: tokens and time relative to budgets
 **Acceptance criteria:**
 - [x] Agent created in `.kilo/agents/pipeline-judge.md`
 - [ ] Added to `capability-index.yaml`
 - [ ] Integrated into the workflow after pipeline completion
 - [ ] Logs results to `.kilo/logs/fitness-history.jsonl`
 - [ ] Triggers `prompt-optimizer` when fitness < 0.70
 ---
 ### Issue 7: Fitness History Logging - accumulating metrics
 **Title:** Create a logging system for fitness metrics
 **Labels:** `logging`, `metrics`, `high-priority`
 **Milestone:** Agent Evolution Dashboard
 **Description:**
 Create a system that accumulates fitness metrics to track how the pipeline evolves over time.
 **Log format (`.kilo/logs/fitness-history.jsonl`):**
 ```jsonl
 {"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
 {"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
 ```
 **Actions:**
 1. ✅ Create the `.kilo/logs/` directory if it does not exist
 2. 🔄 Create `.kilo/logs/fitness-history.jsonl`
 3. 🔄 Update `pipeline-judge.md` to write to the log
 4. 🔄 Create the script `agent-evolution/scripts/sync-fitness-history.ts`
 **Acceptance criteria:**
 - [ ] File `.kilo/logs/fitness-history.jsonl` created
 - [ ] pipeline-judge writes to the log after every workflow
 - [ ] Sync script integrated into `sync:evolution`
 - [ ] Dashboard displays fitness trends
 ---
 ### Issue 8: Evolution Workflow - automatic self-improvement
 **Title:** Implement an evolution workflow for automatic optimization
 **Labels:** `workflow`, `automation`, `high-priority`
 **Milestone:** Agent Evolution Dashboard
 **Description:**
 Implement a continuous self-improvement loop for the pipeline, driven by fitness metrics.
 **Workflow:**
 ```
 [Workflow Completes]
       ↓
 [pipeline-judge] → fitness score
       ↓
 ┌───────────────────────────┐
 │ fitness >= 0.85           │──→ Log + done
 │ fitness 0.70-0.84         │──→ [prompt-optimizer] minor tuning
 │ fitness < 0.70            │──→ [prompt-optimizer] major rewrite
 │ fitness < 0.50            │──→ [agent-architect] redesign
 └───────────────────────────┘
       ↓
 [Re-run workflow with new prompts]
       ↓
 [pipeline-judge] again
       ↓
 [Compare before/after]
       ↓
 [Commit or revert]
 ```
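The threshold routing in the diagram can be sketched as a pure function. This is one possible reading: the `< 0.70` and `< 0.50` rows overlap, so the sketch assumes a score below 0.50 triggers both the major rewrite and the redesign. The function name and the action labels are hypothetical.

```typescript
// Map a fitness score to the follow-up actions named in the diagram.
function routeByFitness(fitness: number): string[] {
  if (fitness >= 0.85) return ["log"];
  if (fitness >= 0.70) return ["prompt-optimizer:minor-tuning"];
  const actions = ["prompt-optimizer:major-rewrite"];
  // Assumption: the redesign is added on top of the rewrite, not instead of it.
  if (fitness < 0.50) actions.push("agent-architect:redesign");
  return actions;
}

console.log(routeByFitness(0.82));
console.log(routeByFitness(0.45));
```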
 **Files:**
 - `.kilo/workflows/fitness-evaluation.md`: workflow documentation
 - Update `capability-index.yaml`: add `iteration_loops.evolution`
 **Configuration:**
 ```yaml
 evolution:
  enabled: true
  auto_trigger: true
  fitness_threshold: 0.70
  max_evolution_attempts: 3
  fitness_history: .kilo/logs/fitness-history.jsonl
  budgets:
    feature: {tokens: 50000, time_s: 300}
    bugfix: {tokens: 20000, time_s: 120}
    refactor: {tokens: 40000, time_s: 240}
    security: {tokens: 30000, time_s: 180}
 ```
 **Acceptance criteria:**
 - [ ] Workflow defined in `.kilo/workflows/`
 - [ ] Integrated into the main pipeline
 - [ ] Automatically triggers prompt-optimizer
 - [ ] Compares before/after fitness
 - [ ] Commits only improvements
 ---
 ### Issue 9: /evolve Command - manual evolution trigger
 **Title:** Update the /evolve command to work with fitness
 **Labels:** `command`, `cli`, `medium-priority`
 **Milestone:** Agent Evolution Dashboard
 **Description:**
 Extend the existing `/evolution` command (model change logging) into a full `/evolve` command with fitness analysis.
 **Current `/evolution`:**
 - Logs model changes
 - Generates reports
 **New `/evolve`:**
 ```bash
 /evolve                     # evolve last completed workflow
 /evolve --issue 42          # evolve workflow for issue #42
 /evolve --agent planner     # focus evolution on one agent
 /evolve --dry-run           # show what would change without applying
 /evolve --history           # print fitness trend chart
 ```
 **Execution:**
 1. Judge: `Task(subagent_type: "pipeline-judge")` → fitness report
 2. Decide: threshold-based routing
 3. Re-test: the same workflow with the updated prompts
 4. Log: append to fitness-history.jsonl
 **Files:**
 - Update `.kilo/commands/evolution.md`: add fitness logic
 - Create alias `/evolve` → `/evolution --fitness`
 **Acceptance criteria:**
 - [ ] `/evolve` command works with fitness
 - [ ] Options `--issue`, `--agent`, `--dry-run`, `--history`
 - [ ] Integrated with `pipeline-judge`
 - [ ] Displays the fitness trend
 ---
 ### Issue 10: Update Capability Index - pipeline-judge integration
 **Title:** Add pipeline-judge and the evolution configuration to capability-index.yaml
 **Labels:** `config`, `integration`, `high-priority`
 **Milestone:** Agent Evolution Dashboard
 **Description:**
 Update `capability-index.yaml` to support the new evolution workflow.
 **Add:**
 ```yaml
 agents:
  pipeline-judge:
    capabilities:
      - test_execution
      - fitness_scoring
      - metric_collection
      - bottleneck_detection
    receives:
      - completed_workflow
      - pipeline_logs
    produces:
      - fitness_report
      - bottleneck_analysis
      - improvement_triggers
    forbidden:
      - code_writing
      - code_changes
      - prompt_changes
    model: ollama-cloud/nemotron-3-super
    mode: subagent
 capability_routing:
  fitness_scoring: pipeline-judge
  test_execution: pipeline-judge
  bottleneck_detection: pipeline-judge
 iteration_loops:
  evolution:
    evaluator: pipeline-judge
    optimizer: prompt-optimizer
    max_iterations: 3
    convergence: fitness_above_0.85
 workflow_states:
  evaluated: [evolving, completed]
  evolving: [evaluated]
 evolution:
  enabled: true
  auto_trigger: true
  fitness_threshold: 0.70
  max_evolution_attempts: 3
  fitness_history: .kilo/logs/fitness-history.jsonl
  budgets:
    feature: {tokens: 50000, time_s: 300}
    bugfix: {tokens: 20000, time_s: 120}
    refactor: {tokens: 40000, time_s: 240}
    security: {tokens: 30000, time_s: 180}
 ```
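The `workflow_states` entries read as an adjacency list of allowed transitions. A minimal sketch of a transition check under that assumption follows; `canTransition` is a hypothetical helper, and only the two states added in this issue are modeled.

```typescript
// Allowed transitions, copied from the workflow_states block above.
const workflowStates: Record<string, string[]> = {
  evaluated: ["evolving", "completed"],
  evolving: ["evaluated"],
};

// A transition is valid only if the target is listed for the source state.
function canTransition(from: string, to: string): boolean {
  const targets = workflowStates[from] || [];
  return targets.indexOf(to) !== -1;
}

console.log(canTransition("evaluated", "evolving"));
console.log(canTransition("evolving", "completed"));
```

Under this reading, `evolving` can only return to `evaluated`, so every optimization attempt has to be re-scored before the workflow can complete.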
 **Acceptance criteria:**
 - [ ] pipeline-judge added to the agents section
 - [ ] capability_routing updated
 - [ ] iteration_loops.evolution added
 - [ ] workflow_states updated
 - [ ] evolution section configured
 - [ ] YAML is valid
 ---
 ### Issue 11: Dashboard Evolution Tab - fitness visualization
 **Title:** Add a Fitness Evolution tab to the dashboard
 **Labels:** `dashboard`, `visualization`, `medium-priority`
 **Milestone:** Agent Evolution Dashboard
 **Description:**
 Extend the dashboard to display fitness metrics and evolution trends.
 **New "Evolution" tab:**
 - **Fitness Trend Chart**: fitness over time
 - **Workflow Comparison**: fitness compared across workflow types
 - **Agent Bottlenecks**: agents with the highest token consumption
 - **Optimization History**: history of prompt optimizations
 **Data Source:**
 - `.kilo/logs/fitness-history.jsonl`
 - `.kilo/logs/efficiency_score.json`
 **UI Components:**
 ```javascript
 // Fitness Trend Chart
 // X-axis: timestamp
 // Y-axis: fitness score (0.0 - 1.0)
 // Series: issues by type (feature, bugfix, refactor)
 // Agent Heatmap
 // Rows: agents
 // Cols: metrics (tokens, time, contribution)
 // Color: intensity
 ```
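A sketch of how the trend chart's series could be derived from log entries, assuming one series per workflow type with the timestamp on the X axis and fitness on the Y axis as described in the comments above. `buildTrendSeries` and the point shape are hypothetical and tied to no particular chart library.

```typescript
interface FitnessPoint {
  ts: string; // ISO 8601 UTC timestamp, as in fitness-history.jsonl
  workflow: string;
  fitness: number;
}

interface ChartPoint {
  x: string;
  y: number;
}

// Group entries by workflow type, ordered by timestamp within each series.
function buildTrendSeries(entries: FitnessPoint[]): Record<string, ChartPoint[]> {
  // ISO 8601 UTC strings of equal format sort correctly as plain strings.
  const sorted = entries
    .slice()
    .sort((a, b) => (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0));
  const series: Record<string, ChartPoint[]> = {};
  for (const e of sorted) {
    if (!series[e.workflow]) series[e.workflow] = [];
    series[e.workflow].push({ x: e.ts, y: e.fitness });
  }
  return series;
}

const trend = buildTrendSeries([
  { ts: "2026-04-06T01:30:00Z", workflow: "bugfix", fitness: 0.91 },
  { ts: "2026-04-06T00:00:00Z", workflow: "feature", fitness: 0.82 },
]);
console.log(Object.keys(trend));
```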
 **Acceptance criteria:**
 - [ ] "Evolution" tab added to the dashboard
 - [ ] Fitness trend chart works
 - [ ] Agent bottlenecks are displayed
 - [ ] Data is loaded from fitness-history.jsonl
 ---
 ## Direction Status
 **Current status:** `ACTIVE` - new issues for integrating the fitness system
 **Sprint priorities:**
 | Priority | Issue | Effort | Impact |
 |----------|-------|--------|--------|
 | **P0** | #6 Pipeline Judge Agent | Low | High |
 | **P0** | #7 Fitness History Logging | Low | High |
 | **P0** | #10 Capability Index Update | Low | High |
 | **P1** | #8 Evolution Workflow | Medium | High |
 | **P1** | #9 /evolve Command | Medium | Medium |
 | **P2** | #11 Dashboard Evolution Tab | Medium | Medium |
 **Dependencies:**
 ```
 #6 (pipeline-judge) ──► #7 (fitness-history) ──► #11 (dashboard)
        │
        └──► #10 (capability-index)
                        │
        ┌───────────────┘
        ▼
 #8 (evolution-workflow) ──► #9 (evolve-command)
 ```
 **Recommended execution order:**
 1. Issue #6: Create `pipeline-judge.md` ✅ DONE
 2. Issue #10: Update `capability-index.yaml`
 3. Issue #7: Create `fitness-history.jsonl` and integrate logging
 4. Issue #8: Create the `fitness-evaluation.md` workflow
 5. Issue #9: Update the `/evolution` command
 6. Issue #11: Add the tab to the dashboard
 ---
@@ -180,3 +469,15 @@ docker-compose -f docker-compose.evolution.yml up -d
 - Build Script: `agent-evolution/scripts/build-standalone.cjs`
 - Docker: `docker-compose -f docker-compose.evolution.yml up -d`
 - NPM: `bun run sync:evolution`
 - **NEW** Pipeline Judge: `.kilo/agents/pipeline-judge.md`
 - **NEW** Fitness Log: `.kilo/logs/fitness-history.jsonl`
 ---
 ## Changelog
 ### 2026-04-06
 - ✅ Created `pipeline-judge.md` agent
 - ✅ Updated MILESTONE_ISSUES.md with 6 new issues (#6-#11)
 - ✅ Added dependency graph and priority matrix
 - ✅ Changed status from PAUSED to ACTIVE
--- a/agent-evolution/ideas/evolution-patch.json
+++ b/agent-evolution/ideas/evolution-patch.json
@@ -0,0 +1,84 @@
 {
  "$schema": "https://app.kilo.ai/agent-recommendations.json",
  "generated": "2026-04-05T20:00:00Z",
  "source": "APAW Evolution System Design",
  "description": "Adds pipeline-judge agent and evolution workflow to APAW",
  "new_files": [
    {
      "path": ".kilo/agents/pipeline-judge.md",
      "source": "pipeline-judge.md",
      "description": "Automated fitness evaluator — runs tests, measures tokens/time, produces fitness score"
    },
    {
      "path": ".kilo/workflows/evolution.md",
      "source": "evolution-workflow.md", 
      "description": "Continuous self-improvement loop for agent pipeline"
    },
    {
      "path": ".kilo/commands/evolve.md",
      "source": "evolve-command.md",
      "description": "/evolve command — trigger evolution cycle"
    }
  ],
  "capability_index_additions": {
    "agents": {
      "pipeline-judge": {
        "capabilities": [
          "test_execution",
          "fitness_scoring",
          "metric_collection",
          "bottleneck_detection"
        ],
        "receives": [
          "completed_workflow",
          "pipeline_logs"
        ],
        "produces": [
          "fitness_report",
          "bottleneck_analysis",
          "improvement_triggers"
        ],
        "forbidden": [
          "code_writing",
          "code_changes",
          "prompt_changes"
        ],
        "model": "ollama-cloud/nemotron-3-super",
        "mode": "subagent"
      }
    },
    "capability_routing": {
      "fitness_scoring": "pipeline-judge",
      "test_execution": "pipeline-judge",
      "bottleneck_detection": "pipeline-judge"
    },
    "iteration_loops": {
      "evolution": {
        "evaluator": "pipeline-judge",
        "optimizer": "prompt-optimizer",
        "max_iterations": 3,
        "convergence": "fitness_above_0.85"
      }
    },
    "evolution": {
      "enabled": true,
      "auto_trigger": true,
      "fitness_threshold": 0.70,
      "max_evolution_attempts": 3,
      "fitness_history": ".kilo/logs/fitness-history.jsonl",
      "budgets": {
        "feature": {"tokens": 50000, "time_s": 300},
        "bugfix": {"tokens": 20000, "time_s": 120},
        "refactor": {"tokens": 40000, "time_s": 240},
        "security": {"tokens": 30000, "time_s": 180}
      }
    }
  },
  "workflow_state_additions": {
    "evaluated": ["evolving", "completed"],
    "evolving": ["evaluated"]
  }
 }
--- a/agent-evolution/ideas/evolution-workflow.md
+++ b/agent-evolution/ideas/evolution-workflow.md
@@ -0,0 +1,201 @@
 # Evolution Workflow
 Continuous self-improvement loop for the agent pipeline.
 Triggered automatically after every workflow completion.
 ## Overview
 ```
 [Workflow Completes]
       ↓
 [@pipeline-judge] ← runs tests, measures tokens/time
       ↓
   fitness score
       ↓
 ┌──────────────────────────┐
 │ fitness >= 0.85          │──→ Log + done (no action)
 │ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
 │ fitness < 0.70           │──→ [@prompt-optimizer] major rewrite
 │ fitness < 0.50           │──→ [@agent-architect] redesign agent
 └──────────────────────────┘
       ↓
   [Re-run same workflow with new prompts]
       ↓
   [@pipeline-judge] again
       ↓
   compare fitness_before vs fitness_after
       ↓
 ┌──────────────────────────┐
 │ improved?                │
 │  Yes → commit new prompts│
 │  No  → revert, try       │
 │        different strategy │
 │        (max 3 attempts)   │
 └──────────────────────────┘
 ```
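The commit-or-revert loop at the bottom of the diagram can be sketched as follows. This is an illustration under assumptions: `runEvolution` and `tryStrategy` are hypothetical names, and the callback stands in for "apply a strategy, re-run the workflow, and return the new fitness".

```typescript
interface EvolutionResult {
  committed: boolean;
  fitness: number;
  attempts: number;
}

// Try up to maxAttempts strategies; commit the first one that beats the baseline.
function runEvolution(
  fitnessBefore: number,
  tryStrategy: (attempt: number) => number,
  maxAttempts: number = 3
): EvolutionResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fitnessAfter = tryStrategy(attempt);
    if (fitnessAfter > fitnessBefore) {
      return { committed: true, fitness: fitnessAfter, attempts: attempt };
    }
    // No improvement: the candidate is reverted and the next strategy is tried.
  }
  return { committed: false, fitness: fitnessBefore, attempts: maxAttempts };
}

// Hypothetical scores: the first strategy regresses, the second improves.
const scores = [0.6, 0.75];
console.log(runEvolution(0.65, (n) => scores[n - 1]));
```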
 ## Fitness History
 All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:
 ```jsonl
 {"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
 {"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
 ```
 This creates a time-series that shows pipeline evolution over time.
 ## Orchestrator Evolution
 The orchestrator uses fitness history to optimize future pipeline construction:
 ### Pipeline Selection Strategy
 ```
 For each new issue:
  1. Classify issue type (feature|bugfix|refactor|api|security)
  2. Look up fitness history for same type
  3. Find the pipeline configuration with highest fitness
  4. Use that as template, but adapt to current issue
  5. Skip agents that consistently score 0 contribution
 ```
 ### Agent Ordering Optimization
 ```
 From fitness-history.jsonl, extract per-agent metrics:
  - avg tokens consumed
  - avg contribution to fitness
  - failure rate (how often this agent's output causes downstream failures)
 agents_by_roi = sort(agents, key=contribution/tokens, descending)
 For parallel phases:
  - Run high-ROI agents first
  - Skip agents with ROI < 0.1 (cost more than they contribute)
 ```
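A sketch of the ROI ranking described above. The source does not fix the units of `contribution/tokens`, so this example assumes `avgTokens` is expressed in thousands of tokens, which keeps the 0.1 cutoff meaningful. `AgentStats`, `roi` and `rankByRoi` are hypothetical names, and the sample values are invented for illustration.

```typescript
interface AgentStats {
  agent: string;
  avgTokens: number; // assumption: thousands of tokens
  avgContribution: number; // average contribution to fitness
}

function roi(s: AgentStats): number {
  return s.avgTokens > 0 ? s.avgContribution / s.avgTokens : 0;
}

// Drop agents below the ROI cutoff, then order the rest from highest ROI down.
function rankByRoi(stats: AgentStats[], minRoi: number = 0.1): string[] {
  return stats
    .filter((s) => roi(s) >= minRoi)
    .sort((a, b) => roi(b) - roi(a))
    .map((s) => s.agent);
}

console.log(
  rankByRoi([
    { agent: "planner", avgTokens: 2, avgContribution: 0.5 },
    { agent: "history-miner", avgTokens: 10, avgContribution: 0.5 },
    { agent: "sdet-engineer", avgTokens: 1, avgContribution: 0.4 },
  ])
);
```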
 ### Token Budget Allocation
 ```
 total_budget = 50000 tokens (configurable)
 For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
  If agent exceeds budget by >50%:
    → prompt-optimizer compresses that agent's prompt
    → or swap to a smaller/faster model
 ```
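The proportional allocation above can be sketched as below. Rounding each budget to a whole token and returning zeros when all contributions are zero are assumptions of this sketch; `allocateBudgets` and `exceedsBudget` are hypothetical names.

```typescript
// Split the total budget in proportion to each agent's average contribution.
function allocateBudgets(
  totalBudget: number,
  contributions: Record<string, number>
): Record<string, number> {
  const agents = Object.keys(contributions);
  const sum = agents.reduce((acc, a) => acc + contributions[a], 0);
  const budgets: Record<string, number> = {};
  for (const a of agents) {
    // Assumption: budgets are rounded to whole tokens.
    budgets[a] = sum > 0 ? Math.round(totalBudget * (contributions[a] / sum)) : 0;
  }
  return budgets;
}

// True when actual usage is more than 50% over the allocated budget.
function exceedsBudget(actualTokens: number, budget: number): boolean {
  return actualTokens > budget * 1.5;
}

const allocated = allocateBudgets(50000, { planner: 0.6, developer: 0.3, reviewer: 0.1 });
console.log(allocated, exceedsBudget(9000, allocated["reviewer"]));
```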
 ## Standard Test Suites
 No manual test configuration needed. Tests are auto-discovered:
 ### Test Discovery
 ```bash
 # Unit tests
 find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l
 # E2E tests  
 find tests/e2e -name "*.test.ts" | wc -l
 # Integration tests
 find tests/integration -name "*.test.ts" | wc -l
 ```
 ### Quality Gates (standardized)
 ```yaml
 gates:
  build:      "bun run build"
  lint:       "bun run lint"
  typecheck:  "bun run typecheck"  
  unit_tests: "bun test"
  e2e_tests:  "bun test:e2e"
  coverage:   "bun test --coverage | grep 'All files' | awk '$10 >= 80 {ok=1} END {exit !ok}'"
  security:   "bun audit --level=high | grep 'found 0'"
 ```
 ### Workflow-Specific Benchmarks
 ```yaml
 benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3
  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix — must prove fix works
    max_iterations: 2
  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2
  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
 ```
 ## Prompt Evolution Protocol
 When prompt-optimizer is triggered:
 ```
 1. Read current agent prompt from .kilo/agents/<agent>.md
 2. Read fitness report identifying the problem
 3. Read last 5 fitness entries for this agent from history
 4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action
 5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add explicit output format if instruction following (IF) was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue
 6. Save to .kilo/agents/<agent>.md.candidate
 7. Re-run the SAME workflow with .candidate prompt
 8. [@pipeline-judge] scores again
 9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
 ```
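Step 4 of the protocol (pattern analysis) can be sketched as a classifier over the recent fitness entries. The heuristics here are assumptions, not the protocol's definition: "consistently low" is taken to mean every recent entry is below the threshold, and a regression is a low latest entry that follows a prompt change. `classifyPattern` and its parameters are hypothetical.

```typescript
type Pattern = "systemic" | "regression" | "one_time" | "healthy";

function classifyPattern(
  recent: number[], // oldest first, e.g. the last 5 fitness values
  threshold: number,
  promptChangedBeforeLast: boolean
): Pattern {
  if (recent.length === 0) return "healthy";
  const last = recent[recent.length - 1];
  if (last >= threshold) return "healthy";
  const lowCount = recent.filter((f) => f < threshold).length;
  // Every recent run is low: treat as a systemic prompt issue.
  if (recent.length > 1 && lowCount === recent.length) return "systemic";
  // Earlier runs were fine and the prompt just changed: revert.
  if (promptChangedBeforeLast) return "regression";
  // Isolated low score with no prompt change: likely task-specific.
  return "one_time";
}

console.log(classifyPattern([0.8, 0.82, 0.85, 0.81, 0.6], 0.7, true));
```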
 ## Usage
 ```bash
 # Triggered automatically after any workflow
 # OR manually:
 /evolve                    # run evolution on last workflow
 /evolve --issue 42         # run evolution on specific issue
 /evolve --agent planner    # evolve specific agent's prompt
 /evolve --history          # show fitness trend
 ```
 ## Configuration
 ```yaml
 # Add to kilo.jsonc or capability-index.yaml
 evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
 ```
--- a/agent-evolution/ideas/evolve-command.md
+++ b/agent-evolution/ideas/evolve-command.md
@@ -0,0 +1,72 @@
 ---
 description: Run evolution cycle — judge last workflow, optimize underperforming agents, re-test
 ---
 # /evolve — Pipeline Evolution Command
 Runs the automated evolution cycle on the most recent (or specified) workflow.
 ## Usage
 ```
 /evolve                     # evolve last completed workflow
 /evolve --issue 42          # evolve workflow for issue #42
 /evolve --agent planner     # focus evolution on one agent
 /evolve --dry-run           # show what would change without applying
 /evolve --history           # print fitness trend chart
 ```
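A minimal sketch of parsing the options listed above, assuming the command receives its arguments as an array of strings. `EvolveOptions` and `parseEvolveArgs` are hypothetical; how Kilo actually passes arguments to a command is not specified here.

```typescript
interface EvolveOptions {
  issue?: number;
  agent?: string;
  dryRun: boolean;
  history: boolean;
}

function parseEvolveArgs(args: string[]): EvolveOptions {
  const opts: EvolveOptions = { dryRun: false, history: false };
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === "--issue") {
      const value = Number(args[++i]);
      // Rejects a missing value (NaN) and non-integer input.
      if (isNaN(value) || Math.floor(value) !== value) {
        throw new Error("--issue expects an integer");
      }
      opts.issue = value;
    } else if (arg === "--agent") {
      const value = args[++i];
      if (value === undefined) throw new Error("--agent expects a name");
      opts.agent = value;
    } else if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--history") {
      opts.history = true;
    } else {
      throw new Error("Unknown option: " + arg);
    }
  }
  return opts;
}

console.log(parseEvolveArgs(["--issue", "42", "--dry-run"]));
```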
 ## Execution
 ### Step 1: Judge
 ```
 Task(subagent_type: "pipeline-judge")
 → produces fitness report
 ```
 ### Step 2: Decide
 ```
 IF fitness >= 0.85:
  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
  append to fitness-history.jsonl
  EXIT
 IF fitness >= 0.70:
  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
  identify agents with lowest per-agent scores
  Task(subagent_type: "prompt-optimizer", target: weak_agents)
 IF fitness < 0.70:
  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
  IF fitness < 0.50:
    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
 ```
 ### Step 3: Re-test
 ```
 Re-run the SAME workflow with updated prompts
 Task(subagent_type: "pipeline-judge") → fitness_after
 IF fitness_after > fitness_before:
  commit prompt changes
  echo "📈 Fitness improved: {before} → {after}"
 ELSE:
  revert prompt changes
  echo "📉 No improvement. Reverting."
 ```
 ### Step 4: Log
 ```
 Append to .kilo/logs/fitness-history.jsonl:
 {
  "ts": "<now>",
  "issue": <N>,
  "workflow": "<type>",
  "fitness_before": <score>,
  "fitness_after": <score>,
  "agents_optimized": ["planner", "requirement-refiner"],
  "tokens_saved": <delta>,
  "time_saved_ms": <delta>
 }
 ```
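A sketch of serializing this entry as a single JSONL line, assuming the field names shown above. `EvolutionLogEntry` and `toJsonlLine` are hypothetical, and the example values are invented; the actual append to `.kilo/logs/fitness-history.jsonl` is left to the caller.

```typescript
interface EvolutionLogEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness_before: number;
  fitness_after: number;
  agents_optimized: string[];
  tokens_saved: number;
  time_saved_ms: number;
}

// JSON.stringify without indentation emits no raw newlines,
// so each entry stays on exactly one line.
function toJsonlLine(entry: EvolutionLogEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Hypothetical values for illustration only.
const exampleEntry: EvolutionLogEntry = {
  ts: new Date(0).toISOString(),
  issue: 42,
  workflow: "feature",
  fitness_before: 0.65,
  fitness_after: 0.75,
  agents_optimized: ["planner", "requirement-refiner"],
  tokens_saved: 4000,
  time_saved_ms: 30000,
};

const logLine = toJsonlLine(exampleEntry);
console.log(logLine);
```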
--- a/agent-evolution/ideas/initial.txt
+++ b/agent-evolution/ideas/initial.txt
@@ -0,0 +1,26 @@
 Now I understand the full architecture. I am building an auto-evolution system: not an HTML dashboard, but a set of files for the repository: a skeptical judge agent, an evolution workflow, and standardized test cases.
 Here are 4 files for the auto-evolution system. How it works:
 pipeline-judge.md: a new skeptic agent. It does not give subjective ratings. Instead it:
 Runs bun test, bun build, bun lint, bun typecheck
 Counts pass/fail/skip from the JSON reports
 Reads the pipeline logs (tokens and time for each agent)
 Computes the fitness score with the formula: 50% tests + 25% quality gates + 25% efficiency (tokens/time)
 If fitness < 0.70 → automatically invokes prompt-optimizer
 evolution-workflow.md: a continuous optimization workflow:
 Fires automatically after every completed workflow
 fitness ≥ 0.85 → log and move on
 fitness 0.70-0.84 → prompt-optimizer fixes the weak agents
 fitness < 0.50 → agent-architect redesigns the agent
 After optimization, the same workflow is re-run with the new prompts and fitness is compared before/after. Improved → commit, otherwise → revert
 The orchestrator evolves through fitness-history.jsonl, a cumulative record of all runs. The orchestrator learns which agents to skip (ROI < 0.1), how to allocate the token budget, and which pipeline template works best for each task type.
 evolve-command.md: the /evolve command for manual runs or for viewing the trend.
 evolution-patch.json: a ready-made patch for capability-index.yaml that adds pipeline-judge, routing, iteration_loops, and the evolution configuration with per-task-type budgets.
 The files need to be placed in the repository:
 pipeline-judge.md → .kilo/agents/
 evolution-workflow.md → .kilo/workflows/
 evolve-command.md → .kilo/commands/
 evolution-patch.json → apply to capability-index.yaml
--- a/agent-evolution/ideas/pipeline-judge.md
+++ b/agent-evolution/ideas/pipeline-judge.md
@@ -0,0 +1,181 @@
 ---
 description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces fitness scores. Never writes code — only measures and scores.
 mode: subagent
 model: ollama-cloud/nemotron-3-super
 color: "#DC2626"
 permission:
  read: allow
  write: deny
  bash: allow
  task: allow
  glob: allow
  grep: allow
 ---
 # Kilo Code: Pipeline Judge
 ## Role Definition
 You are **Pipeline Judge** — the automated fitness evaluator. You do NOT score subjectively. You measure objectively:
 1. **Test pass rate** — run the test suite, count pass/fail/skip
 2. **Token cost** — sum tokens consumed by all agents in the pipeline
 3. **Wall-clock time** — total execution time from first agent to last
 4. **Quality gates** — binary pass/fail for each quality gate
 You produce a **fitness score** that drives evolutionary optimization.
 ## When to Invoke
 - After ANY workflow completes (feature, bugfix, refactor, etc.)
 - After prompt-optimizer changes an agent's prompt
 - After a model swap recommendation is applied
 - On `/evaluate` command
 ## Fitness Score Formula
 ```
 fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
 where:
  test_pass_rate = passed_tests / total_tests                    # 0.0 - 1.0
  quality_gates_rate = passed_gates / total_gates                # 0.0 - 1.0  
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)         # higher = cheaper/faster
  normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
 ```
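The `normalized_cost` and `efficiency_score` terms can be sketched as below, using the per-workflow budgets from the evolution configuration. `Budget`, `budgets` and `efficiencyScore` are hypothetical names, and passing time in milliseconds (as in the log) while the budget is in seconds is this sketch's convention.

```typescript
interface Budget {
  tokens: number;
  time_s: number;
}

// Budgets per workflow type, copied from the evolution configuration.
const budgets: Record<string, Budget> = {
  feature: { tokens: 50000, time_s: 300 },
  bugfix: { tokens: 20000, time_s: 120 },
  refactor: { tokens: 40000, time_s: 240 },
  security: { tokens: 30000, time_s: 180 },
};

function efficiencyScore(tokens: number, timeMs: number, budget: Budget): number {
  // Tokens and time each contribute half of the normalized cost.
  const normalizedCost =
    (tokens / budget.tokens) * 0.5 + (timeMs / 1000 / budget.time_s) * 0.5;
  // clamp(normalized_cost, 0, 1), then invert so cheaper runs score higher.
  return 1.0 - Math.min(Math.max(normalizedCost, 0), 1);
}

// A feature run that used half of both budgets.
console.log(efficiencyScore(25000, 150000, budgets["feature"]));
```

Because the cost is clamped at 1, a run that uses exactly its full token and time budget already scores 0 on efficiency, and going further over budget does not lower the score any more.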
 ## Execution Protocol
 ### Step 1: Collect Metrics
 ```bash
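 # Run test suites into separate reports; stderr is kept out of the JSON
 bun test --reporter=json > /tmp/test-unit.json 2>/tmp/test-unit.err
 bun test:e2e --reporter=json > /tmp/test-e2e.json 2>/tmp/test-e2e.err
 # Sum results across both reports (-s slurps the files into one array)
 TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
 PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
 FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)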
 # Check build
 bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
 # Check lint
 bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
 # Check types
 bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
 ```
 ### Step 2: Read Pipeline Log
 Read `.kilo/logs/pipeline-*.log` for:
 - Token counts per agent (from API response headers)
 - Execution time per agent
 - Number of iterations in evaluator-optimizer loops
 - Which agents were invoked and in what order
 ### Step 3: Calculate Fitness
 ```
 test_pass_rate = PASSED / TOTAL
 quality_gates:
  - build: BUILD_OK
  - lint: LINT_OK  
  - types: TYPES_OK
  - tests: FAILED == 0
  - coverage: coverage >= 80%
 quality_gates_rate = passed_gates / 5
 token_budget = 50000  # tokens per standard workflow
 time_budget = 300     # seconds per standard workflow
 normalized_cost = (total_tokens/token_budget × 0.5) + (total_time/time_budget × 0.5)
 efficiency = 1.0 - min(normalized_cost, 1.0)
 FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
 ```
 ### Step 4: Produce Report
 ```json
 {
  "workflow_id": "wf-<issue_number>-<timestamp>",
  "fitness": 0.82,
  "breakdown": {
    "test_pass_rate": 0.95,
    "quality_gates_rate": 0.80,
    "efficiency_score": 0.65
  },
  "tests": {
    "total": 47,
    "passed": 45,
    "failed": 2,
    "skipped": 0,
    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
  },
  "quality_gates": {
    "build": true,
    "lint": true,
    "types": true,
    "tests_clean": false,
    "coverage_80": true
  },
  "cost": {
    "total_tokens": 38400,
    "total_time_ms": 245000,
    "per_agent": [
      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
    ]
  },
  "iterations": {
    "code_review_loop": 2,
    "security_review_loop": 1
  },
  "verdict": "PASS",
  "bottleneck_agent": "lead-developer",
  "most_expensive_agent": "lead-developer",
  "improvement_trigger": false
 }
 ```
 ### Step 5: Trigger Evolution (if needed)
 ```
 IF fitness < 0.70:
  → Task(subagent_type: "prompt-optimizer", payload: report)
  → improvement_trigger = true
 IF any agent consumed > 30% of total tokens:
  → Flag as bottleneck
  → Suggest model downgrade or prompt compression
 IF iterations > 2 in any loop:
  → Flag evaluator-optimizer convergence issue
  → Suggest prompt refinement for the evaluator agent
 ```
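The two flagging rules above (token share over 30%, more than 2 loop iterations) can be sketched as pure functions over the report's `cost` and `iterations` sections. `findBottlenecks` and `findSlowLoops` are hypothetical names.

```typescript
interface AgentCost {
  agent: string;
  tokens: number;
}

// Agents that consumed more than `share` of the pipeline's total tokens.
function findBottlenecks(
  perAgent: AgentCost[],
  totalTokens: number,
  share: number = 0.3
): string[] {
  if (totalTokens <= 0) return [];
  return perAgent
    .filter((a) => a.tokens / totalTokens > share)
    .map((a) => a.agent);
}

// Loops that ran more than `maxIterations` times.
function findSlowLoops(
  iterations: Record<string, number>,
  maxIterations: number = 2
): string[] {
  return Object.keys(iterations).filter((name) => iterations[name] > maxIterations);
}

// Values taken from the sample report in Step 4.
console.log(
  findBottlenecks(
    [
      { agent: "lead-developer", tokens: 12000 },
      { agent: "sdet-engineer", tokens: 8500 },
    ],
    38400
  )
);
console.log(findSlowLoops({ code_review_loop: 2, security_review_loop: 1 }));
```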
 ## Output Format
 ```
 ## Pipeline Judgment: Issue #<N>
 **Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]
 | Metric | Value | Weight | Contribution |
 |--------|-------|--------|-------------|
 | Tests  | 95% (45/47) | 50% | 0.475 |
 | Gates  | 80% (4/5) | 25% | 0.200 |
 | Cost   | 38.4K tok / 245s | 25% | 0.163 |
 **Bottleneck:** lead-developer (31% of tokens)
 **Failed tests:** auth.test.ts:42, api.test.ts:108
 **Failed gates:** tests_clean
@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
 ```
 ## Prohibited Actions
 - DO NOT write or modify any code
 - DO NOT subjectively rate "quality" — only measure
 - DO NOT skip running actual tests
 - DO NOT estimate token counts — read from logs
 - DO NOT change agent prompts — only flag for prompt-optimizer
@@ -0,0 +1 @@
 {"ts":"2026-04-04T02:30:00Z","issue":5,"workflow":"feature","fitness":0.85,"breakdown":{"test_pass_rate":0.95,"quality_gates_rate":0.80,"efficiency_score":0.78},"tokens":38400,"time_ms":245000,"tests_passed":9,"tests_total":10,"agents":["requirement-refiner","history-miner","system-analyst","sdet-engineer","lead-developer"],"verdict":"PASS"}