feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions
--- a/agent-evolution/MILESTONE_ISSUES.md
+++ b/agent-evolution/MILESTONE_ISSUES.md
@@ -151,25 +151,314 @@ docker-compose -f docker-compose.evolution.yml up -d

 ---

-## Workstream Status
+## NEW: Pipeline Fitness & Auto-Evolution Issues

-**Current status:** `PAUSED` - on hold until the next sprint
+### Issue 6: Pipeline Judge Agent (objective fitness scoring)

-**Reason for the pause:**
-The base infrastructure is in place:
- ✅ `agent-evolution/` directory structure
- ✅ Data integrated into the HTML
- ✅ Sync scripts created
- ✅ Docker container configured
- ✅ Documentation written
+**Title:** Create the pipeline-judge agent for objective workflow evaluation
+**Labels:** `agent`, `fitness`, `high-priority`
+**Milestone:** Agent Evolution Dashboard

-**Remaining work:**
- 🔄 Issue #2: Gitea API integration (requires a backend)
- 🔄 Issue #3: Full synchronization (requires testing)
- 🔄 Issue #4: Extended documentation
+**Description:**
+Create a `pipeline-judge` agent that scores the quality of a completed workflow objectively, from metrics rather than subjective ratings.

-**Work summary:**
-A complete infrastructure for tracking the evolution of the agent system has been built. The dashboard runs standalone without a server and includes data on 28 agents, 8 models, and optimization recommendations. The foundation for future Gitea integration is in place.
+**How it differs from evaluator:**
+- `evaluator`: subjective 1-10 ratings based on observations
+- `pipeline-judge`: objective metrics (tests, tokens, time, quality gates)
+
+**Files:**
+- `.kilo/agents/pipeline-judge.md` (✅ created)
+
+**Fitness Formula:**
+```
+fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
+```
+
+**Metrics:**
+- Test pass rate: passed/total tests
+- Quality gates: build, lint, typecheck, tests_clean, coverage
+- Efficiency: tokens and time relative to budgets
+
+**Acceptance criteria:**
+- [x] Agent created in `.kilo/agents/pipeline-judge.md`
+- [ ] Added to `capability-index.yaml`
+- [ ] Integrated into the workflow after the pipeline completes
+- [ ] Logs results to `.kilo/logs/fitness-history.jsonl`
+- [ ] Triggers `prompt-optimizer` when fitness < 0.70
+
+---
+
+### Issue 7: Fitness History Logging (accumulating metrics)
+
+**Title:** Create a fitness metrics logging system
+**Labels:** `logging`, `metrics`, `high-priority`
+**Milestone:** Agent Evolution Dashboard
+
+**Description:**
+Create a system that accumulates fitness metrics so that pipeline evolution can be tracked over time.
+
+**Log format (`.kilo/logs/fitness-history.jsonl`):**
+```jsonl
+{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.73,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
+{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
+```
+
+**Actions:**
+1. ✅ Create the `.kilo/logs/` directory if it does not exist
+2. 🔄 Create `.kilo/logs/fitness-history.jsonl`
+3. 🔄 Update `pipeline-judge.md` to write to the log
+4. 🔄 Create the `agent-evolution/scripts/sync-fitness-history.ts` script
+
+**Acceptance criteria:**
+- [ ] `.kilo/logs/fitness-history.jsonl` file created
+- [ ] pipeline-judge writes to the log after every workflow
+- [ ] Sync script integrated into `sync:evolution`
+- [ ] Dashboard displays fitness trends
+
+---
+
+### Issue 8: Evolution Workflow (automatic self-improvement)
+
+**Title:** Implement an evolution workflow for automatic optimization
+**Labels:** `workflow`, `automation`, `high-priority`
+**Milestone:** Agent Evolution Dashboard
+
+**Description:**
+Implement a continuous self-improvement loop for the pipeline, driven by fitness metrics.
+
+**Workflow:**
+```
+[Workflow Completes]
+       ↓
+[pipeline-judge] → fitness score
+       ↓
+┌───────────────────────────┐
+│ fitness >= 0.85           │──→ Log + done
+│ fitness 0.70-0.84         │──→ [prompt-optimizer] minor tuning
+│ fitness < 0.70            │──→ [prompt-optimizer] major rewrite
+│ fitness < 0.50            │──→ also [agent-architect] redesign
+└───────────────────────────┘
+       ↓
+[Re-run workflow with new prompts]
+       ↓
+[pipeline-judge] again
+       ↓
+[Compare before/after]
+       ↓
+[Commit or revert]
+```
+
+**Files:**
+- `.kilo/workflows/fitness-evaluation.md`: workflow documentation
+- Update `capability-index.yaml`: add `iteration_loops.evolution`
+
+**Configuration:**
+```yaml
+evolution:
+  enabled: true
+  auto_trigger: true
+  fitness_threshold: 0.70
+  max_evolution_attempts: 3
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  budgets:
+    feature: {tokens: 50000, time_s: 300}
+    bugfix: {tokens: 20000, time_s: 120}
+    refactor: {tokens: 40000, time_s: 240}
+    security: {tokens: 30000, time_s: 180}
+```
+
+**Acceptance criteria:**
+- [ ] Workflow defined in `.kilo/workflows/`
+- [ ] Integrated into the main pipeline
+- [ ] Triggers prompt-optimizer automatically
+- [ ] Compares before/after fitness
+- [ ] Commits only improvements
+
+---
+
+### Issue 9: /evolve Command (manual evolution trigger)
+
+**Title:** Update the /evolve command to work with fitness
+**Labels:** `command`, `cli`, `medium-priority`
+**Milestone:** Agent Evolution Dashboard
+
+**Description:**
+Extend the existing `/evolution` command (model change logging) into a full `/evolve` command with fitness analysis.
+
+**Current `/evolution`:**
+- Logs model changes
+- Generates reports
+
+**New `/evolve`:**
+```bash
+/evolve                     # evolve last completed workflow
+/evolve --issue 42          # evolve workflow for issue #42
+/evolve --agent planner     # focus evolution on one agent
+/evolve --dry-run           # show what would change without applying
+/evolve --history           # print fitness trend chart
+```
+
+**Execution:**
+1. Judge: `Task(subagent_type: "pipeline-judge")` → fitness report
+2. Decide: threshold-based routing
+3. Re-test: the same workflow with updated prompts
+4. Log: append to fitness-history.jsonl
+
+**Files:**
+- Update `.kilo/commands/evolution.md`: add fitness logic
+- Create alias `/evolve` → `/evolution --fitness`
+
+**Acceptance criteria:**
+- [ ] `/evolve` command works with fitness
+- [ ] Options `--issue`, `--agent`, `--dry-run`, `--history`
+- [ ] Integrated with `pipeline-judge`
+- [ ] Displays the fitness trend
+
+---
+
+### Issue 10: Update Capability Index (pipeline-judge integration)
+
+**Title:** Add pipeline-judge and the evolution configuration to capability-index.yaml
+**Labels:** `config`, `integration`, `high-priority`
+**Milestone:** Agent Evolution Dashboard
+
+**Description:**
+Update `capability-index.yaml` to support the new evolution workflow.
+
+**Add:**
+```yaml
+agents:
+  pipeline-judge:
+    capabilities:
+      - test_execution
+      - fitness_scoring
+      - metric_collection
+      - bottleneck_detection
+    receives:
+      - completed_workflow
+      - pipeline_logs
+    produces:
+      - fitness_report
+      - bottleneck_analysis
+      - improvement_triggers
+    forbidden:
+      - code_writing
+      - code_changes
+      - prompt_changes
+    model: ollama-cloud/nemotron-3-super
+    mode: subagent
+
+capability_routing:
+  fitness_scoring: pipeline-judge
+  test_execution: pipeline-judge
+  bottleneck_detection: pipeline-judge
+
+iteration_loops:
+  evolution:
+    evaluator: pipeline-judge
+    optimizer: prompt-optimizer
+    max_iterations: 3
+    convergence: fitness_above_0.85
+
+workflow_states:
+  evaluated: [evolving, completed]
+  evolving: [evaluated]
+
+evolution:
+  enabled: true
+  auto_trigger: true
+  fitness_threshold: 0.70
+  max_evolution_attempts: 3
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  budgets:
+    feature: {tokens: 50000, time_s: 300}
+    bugfix: {tokens: 20000, time_s: 120}
+    refactor: {tokens: 40000, time_s: 240}
+    security: {tokens: 30000, time_s: 180}
+```
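The `workflow_states` entries above read as a transition table: each key lists the states it may move to. A minimal sketch of checking a transition against that table follows; the helper name `canTransition` is hypothetical and not part of the commit.

```typescript
// Transition table copied from the workflow_states section above.
const workflowStates: Record<string, string[]> = {
  evaluated: ["evolving", "completed"],
  evolving: ["evaluated"],
};

// Hypothetical helper: true when the table allows moving from `from` to `to`.
// Unknown source states are treated as having no outgoing transitions.
function canTransition(from: string, to: string): boolean {
  const targets = workflowStates[from] || [];
  return targets.indexOf(to) !== -1;
}

console.log(canTransition("evaluated", "evolving"));
console.log(canTransition("evolving", "completed"));
```

Under this reading, an `evolving` workflow cannot complete directly; it has to be judged again first.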
+
+**Acceptance criteria:**
+- [ ] pipeline-judge added to the agents section
+- [ ] capability_routing updated
+- [ ] iteration_loops.evolution added
+- [ ] workflow_states updated
+- [ ] evolution section configured
+- [ ] YAML is valid
+
+---
+
+### Issue 11: Dashboard Evolution Tab (fitness visualization)
+
+**Title:** Add a Fitness Evolution tab to the dashboard
+**Labels:** `dashboard`, `visualization`, `medium-priority`
+**Milestone:** Agent Evolution Dashboard
+
+**Description:**
+Extend the dashboard to display fitness metrics and evolution trends.
+
+**New "Evolution" tab:**
+- **Fitness Trend Chart**: fitness over time
+- **Workflow Comparison**: fitness compared across workflow types
+- **Agent Bottlenecks**: agents with the highest token consumption
+- **Optimization History**: history of prompt optimizations
+
+**Data Source:**
+- `.kilo/logs/fitness-history.jsonl`
+- `.kilo/logs/efficiency_score.json`
+
+**UI Components:**
+```javascript
+// Fitness Trend Chart
+// X-axis: timestamp
+// Y-axis: fitness score (0.0 - 1.0)
+// Series: issues by type (feature, bugfix, refactor)
+
+// Agent Heatmap
+// Rows: agents
+// Cols: metrics (tokens, time, contribution)
+// Color: intensity
+```
+
+**Acceptance criteria:**
+- [ ] "Evolution" tab added to the dashboard
+- [ ] Fitness trend chart works
+- [ ] Agent bottlenecks are displayed
+- [ ] Data is loaded from fitness-history.jsonl
+
+---
+
+## Workstream Status
+
+**Current status:** `ACTIVE` - new issues for integrating the fitness system
+
+**Sprint priorities:**
+| Priority | Issue | Effort | Impact |
+|----------|-------|--------|--------|
+| **P0** | #6 Pipeline Judge Agent | Low | High |
+| **P0** | #7 Fitness History Logging | Low | High |
+| **P0** | #10 Capability Index Update | Low | High |
+| **P1** | #8 Evolution Workflow | Medium | High |
+| **P1** | #9 /evolve Command | Medium | Medium |
+| **P2** | #11 Dashboard Evolution Tab | Medium | Medium |
+
+**Dependencies:**
+```
+#6 (pipeline-judge) ──► #7 (fitness-history) ──► #11 (dashboard)
+        │
+        └──► #10 (capability-index)
+                        │
+        ┌───────────────┘
+        ▼
+#8 (evolution-workflow) ──► #9 (evolve-command)
+```
+
+**Recommended execution order:**
+1. Issue #6: Create `pipeline-judge.md` ✅ DONE
+2. Issue #10: Update `capability-index.yaml`
+3. Issue #7: Create `fitness-history.jsonl` and integrate logging
+4. Issue #8: Create the `fitness-evaluation.md` workflow
+5. Issue #9: Update the `/evolution` command
+6. Issue #11: Add the tab to the dashboard

 ---

@@ -180,3 +469,15 @@ docker-compose -f docker-compose.evolution.yml up -d
 - Build Script: `agent-evolution/scripts/build-standalone.cjs`
 - Docker: `docker-compose -f docker-compose.evolution.yml up -d`
 - NPM: `bun run sync:evolution`
+- **NEW** Pipeline Judge: `.kilo/agents/pipeline-judge.md`
+- **NEW** Fitness Log: `.kilo/logs/fitness-history.jsonl`
+
+---
+
+## Changelog
+
+### 2026-04-06
+- ✅ Created `pipeline-judge.md` agent
+- ✅ Updated MILESTONE_ISSUES.md with 6 new issues (#6-#11)
+- ✅ Added dependency graph and priority matrix
+- ✅ Changed status from PAUSED to ACTIVE
--- a/agent-evolution/ideas/evolution-patch.json
+++ b/agent-evolution/ideas/evolution-patch.json
@@ -0,0 +1,84 @@
+{
+  "$schema": "https://app.kilo.ai/agent-recommendations.json",
+  "generated": "2026-04-05T20:00:00Z",
+  "source": "APAW Evolution System Design",
+  "description": "Adds pipeline-judge agent and evolution workflow to APAW",
+  
+  "new_files": [
+    {
+      "path": ".kilo/agents/pipeline-judge.md",
+      "source": "pipeline-judge.md",
+      "description": "Automated fitness evaluator — runs tests, measures tokens/time, produces fitness score"
+    },
+    {
+      "path": ".kilo/workflows/evolution.md",
+      "source": "evolution-workflow.md", 
+      "description": "Continuous self-improvement loop for agent pipeline"
+    },
+    {
+      "path": ".kilo/commands/evolve.md",
+      "source": "evolve-command.md",
+      "description": "/evolve command — trigger evolution cycle"
+    }
+  ],
+
+  "capability_index_additions": {
+    "agents": {
+      "pipeline-judge": {
+        "capabilities": [
+          "test_execution",
+          "fitness_scoring",
+          "metric_collection",
+          "bottleneck_detection"
+        ],
+        "receives": [
+          "completed_workflow",
+          "pipeline_logs"
+        ],
+        "produces": [
+          "fitness_report",
+          "bottleneck_analysis",
+          "improvement_triggers"
+        ],
+        "forbidden": [
+          "code_writing",
+          "code_changes",
+          "prompt_changes"
+        ],
+        "model": "ollama-cloud/nemotron-3-super",
+        "mode": "subagent"
+      }
+    },
+    "capability_routing": {
+      "fitness_scoring": "pipeline-judge",
+      "test_execution": "pipeline-judge",
+      "bottleneck_detection": "pipeline-judge"
+    },
+    "iteration_loops": {
+      "evolution": {
+        "evaluator": "pipeline-judge",
+        "optimizer": "prompt-optimizer",
+        "max_iterations": 3,
+        "convergence": "fitness_above_0.85"
+      }
+    },
+    "evolution": {
+      "enabled": true,
+      "auto_trigger": true,
+      "fitness_threshold": 0.70,
+      "max_evolution_attempts": 3,
+      "fitness_history": ".kilo/logs/fitness-history.jsonl",
+      "budgets": {
+        "feature": {"tokens": 50000, "time_s": 300},
+        "bugfix": {"tokens": 20000, "time_s": 120},
+        "refactor": {"tokens": 40000, "time_s": 240},
+        "security": {"tokens": 30000, "time_s": 180}
+      }
+    }
+  },
+
+  "workflow_state_additions": {
+    "evaluated": ["evolving", "completed"],
+    "evolving": ["evaluated"]
+  }
+}
--- a/agent-evolution/ideas/evolution-workflow.md
+++ b/agent-evolution/ideas/evolution-workflow.md
@@ -0,0 +1,201 @@
+# Evolution Workflow
+
+Continuous self-improvement loop for the agent pipeline.
+Triggered automatically after every workflow completion.
+
+## Overview
+
+```
+[Workflow Completes]
+       ↓
+[@pipeline-judge] ← runs tests, measures tokens/time
+       ↓
+   fitness score
+       ↓
+┌──────────────────────────┐
+│ fitness >= 0.85          │──→ Log + done (no action)
+│ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
+│ fitness < 0.70           │──→ [@prompt-optimizer] major rewrite
+│ fitness < 0.50           │──→ also [@agent-architect] redesign agent
+└──────────────────────────┘
+       ↓
+   [Re-run same workflow with new prompts]
+       ↓
+   [@pipeline-judge] again
+       ↓
+   compare fitness_before vs fitness_after
+       ↓
+┌──────────────────────────┐
+│ improved?                │
+│  Yes → commit new prompts│
+│  No  → revert, try       │
+│        different strategy│
+│        (max 3 attempts)  │
+└──────────────────────────┘
+```
+
+## Fitness History
+
+All fitness scores are appended to `.kilo/logs/fitness-history.jsonl`:
+
+```jsonl
+{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.73,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
+{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
+```
+
+This creates a time-series that shows pipeline evolution over time.
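One way to consume this log is to parse it line by line and aggregate fitness per workflow type. The sketch below assumes the field names shown in the sample lines; the function names and the fitness values in the usage example are illustrative only, not real measurements.

```typescript
// Field names follow the sample log lines; everything else is hypothetical.
interface FitnessEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseFitnessHistory(jsonl: string): FitnessEntry[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as FitnessEntry);
}

// Average fitness for each workflow type (feature, bugfix, ...).
function meanFitnessByWorkflow(entries: FitnessEntry[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const e of entries) {
    const acc = totals[e.workflow] || { sum: 0, count: 0 };
    acc.sum += e.fitness;
    acc.count += 1;
    totals[e.workflow] = acc;
  }
  const means: Record<string, number> = {};
  for (const workflow of Object.keys(totals)) {
    means[workflow] = totals[workflow].sum / totals[workflow].count;
  }
  return means;
}

// Illustrative log content (made-up values).
const sampleLog = [
  '{"ts":"2026-04-05T12:00:00Z","issue":1,"workflow":"feature","fitness":0.80,"tokens":30000,"time_ms":200000,"tests_passed":40,"tests_total":42}',
  '{"ts":"2026-04-05T13:00:00Z","issue":2,"workflow":"feature","fitness":0.90,"tokens":25000,"time_ms":150000,"tests_passed":42,"tests_total":42}',
  '{"ts":"2026-04-05T14:00:00Z","issue":3,"workflow":"bugfix","fitness":0.70,"tokens":15000,"time_ms":90000,"tests_passed":41,"tests_total":42}',
].join("\n");

const means = meanFitnessByWorkflow(parseFitnessHistory(sampleLog));
console.log(means);
```

Because the file is append-only JSONL, a reader like this never needs to load or rewrite earlier entries to add new ones.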
+
+## Orchestrator Evolution
+
+The orchestrator uses fitness history to optimize future pipeline construction:
+
+### Pipeline Selection Strategy
+```
+For each new issue:
+  1. Classify issue type (feature|bugfix|refactor|api|security)
+  2. Look up fitness history for same type
+  3. Find the pipeline configuration with highest fitness
+  4. Use that as template, but adapt to current issue
+  5. Skip agents that consistently score 0 contribution
+```
+
+### Agent Ordering Optimization
+```
+From fitness-history.jsonl, extract per-agent metrics:
+  - avg tokens consumed
+  - avg contribution to fitness
+  - failure rate (how often this agent's output causes downstream failures)
+
+agents_by_roi = sort(agents, key=contribution/tokens, descending)
+
+For parallel phases:
+  - Run high-ROI agents first
+  - Skip agents with ROI < 0.1 (cost more than they contribute)
+```
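The ordering rule above can be sketched as a sort on contribution per token followed by a threshold split. The source does not define the units that make an ROI of 0.1 meaningful, so this sketch takes the threshold as a parameter; the type names, function name, and sample numbers are all hypothetical.

```typescript
interface AgentStats {
  agent: string;
  avgTokens: number;       // unit is an assumption; the source does not specify it
  avgContribution: number; // assumed: average contribution to fitness
}

interface RankedAgent extends AgentStats {
  roi: number;
}

// Sort agents by ROI (contribution / tokens), highest first,
// then split them into agents to run and agents to skip.
function rankByRoi(
  stats: AgentStats[],
  minRoi: number,
): { run: RankedAgent[]; skip: RankedAgent[] } {
  const ranked: RankedAgent[] = stats
    .map((s) => ({
      agent: s.agent,
      avgTokens: s.avgTokens,
      avgContribution: s.avgContribution,
      // Assumption: an agent with no recorded tokens gets ROI 0.
      roi: s.avgTokens > 0 ? s.avgContribution / s.avgTokens : 0,
    }))
    .sort((a, b) => b.roi - a.roi);
  return {
    run: ranked.filter((a) => a.roi >= minRoi),
    skip: ranked.filter((a) => a.roi < minRoi),
  };
}

// Illustrative numbers only.
const plan = rankByRoi(
  [
    { agent: "planner", avgTokens: 1.5, avgContribution: 0.3 },
    { agent: "sdet-engineer", avgTokens: 1.0, avgContribution: 0.05 },
    { agent: "lead-developer", avgTokens: 2.0, avgContribution: 0.6 },
  ],
  0.1,
);
console.log(plan.run.map((a) => a.agent), plan.skip.map((a) => a.agent));
```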
+
+### Token Budget Allocation
+```
+total_budget = 50000 tokens (configurable)
+
+For each agent in pipeline:
+  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
+  
+  If agent exceeds budget by >50%:
+    → prompt-optimizer compresses that agent's prompt
+    → or swap to a smaller/faster model
+```
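A minimal sketch of the proportional split and the 50% overrun check described above. The function names are hypothetical, and the even split used when no contribution data exists is an assumption the source does not cover.

```typescript
// Split a total token budget across agents in proportion to their
// average contribution, as in the allocation rule above.
function allocateBudgets(
  totalBudget: number,
  contributions: Record<string, number>,
): Record<string, number> {
  const agents = Object.keys(contributions);
  const sum = agents.reduce((acc, a) => acc + contributions[a], 0);
  const budgets: Record<string, number> = {};
  for (const a of agents) {
    // Assumption: with no contribution data at all, split evenly.
    budgets[a] = sum > 0
      ? totalBudget * (contributions[a] / sum)
      : totalBudget / agents.length;
  }
  return budgets;
}

// True when actual usage exceeds the agent's budget by more than 50%.
function exceedsBudget(actualTokens: number, agentBudget: number): boolean {
  return actualTokens > agentBudget * 1.5;
}

// Illustrative contribution weights only.
const budgets = allocateBudgets(50000, { "lead-developer": 3, "sdet-engineer": 1 });
console.log(budgets);
console.log(exceedsBudget(20000, budgets["sdet-engineer"]));
```

One consequence of this rule is that a low-contribution agent gets a small budget and is therefore flagged for compression sooner.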
+
+## Standard Test Suites
+
+No manual test configuration needed. Tests are auto-discovered:
+
+### Test Discovery
+```bash
+# Unit tests
+find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l
+
+# E2E tests  
+find tests/e2e -name "*.test.ts" | wc -l
+
+# Integration tests
+find tests/integration -name "*.test.ts" | wc -l
+```
+
+### Quality Gates (standardized)
+```yaml
+gates:
+  build:      "bun run build"
+  lint:       "bun run lint"
+  typecheck:  "bun run typecheck"  
+  unit_tests: "bun test"
+  e2e_tests:  "bun test:e2e"
+  coverage:   "bun test --coverage | grep 'All files' | awk '{ ok = ($10 + 0 >= 80) } END { exit !ok }'"
+  security:   "bun audit --level=high | grep 'found 0'"
+```
+
+### Workflow-Specific Benchmarks
+```yaml
+benchmarks:
+  feature:
+    token_budget: 50000
+    time_budget_s: 300
+    min_test_coverage: 80%
+    max_iterations: 3
+    
+  bugfix:
+    token_budget: 20000
+    time_budget_s: 120
+    min_test_coverage: 90%  # higher for bugfix — must prove fix works
+    max_iterations: 2
+    
+  refactor:
+    token_budget: 40000
+    time_budget_s: 240
+    min_test_coverage: 95%  # must not break anything
+    max_iterations: 2
+    
+  security:
+    token_budget: 30000
+    time_budget_s: 180
+    min_test_coverage: 80%
+    max_iterations: 2
+    required_gates: [security]  # security gate MUST pass
+```
+
+## Prompt Evolution Protocol
+
+When prompt-optimizer is triggered:
+
+```
+1. Read current agent prompt from .kilo/agents/<agent>.md
+2. Read fitness report identifying the problem
+3. Read last 5 fitness entries for this agent from history
+
+4. Analyze pattern:
+   - IF consistently low → systemic prompt issue
+   - IF regression after change → revert
+   - IF one-time failure → might be task-specific, no action
+
+5. Generate improved prompt:
+   - Keep same structure (description, mode, model, permissions)
+   - Modify ONLY the instruction body
+   - Add explicit output format if instruction following (IF) was the issue
+   - Add few-shot examples if quality was the issue
+   - Compress verbose sections if tokens were the issue
+
+6. Save to .kilo/agents/<agent>.md.candidate
+
+7. Re-run the SAME workflow with .candidate prompt
+
+8. [@pipeline-judge] scores again
+
+9. IF fitness_new > fitness_old:
+     mv .candidate → .md (commit)
+   ELSE:
+     rm .candidate (revert)
+```
+
+## Usage
+
+```bash
+# Triggered automatically after any workflow
+# OR manually:
+/evolve                    # run evolution on last workflow
+/evolve --issue 42         # run evolution on specific issue
+/evolve --agent planner    # evolve specific agent's prompt
+/evolve --history          # show fitness trend
+```
+
+## Configuration
+
+```yaml
+# Add to kilo.jsonc or capability-index.yaml
+evolution:
+  enabled: true
+  auto_trigger: true           # trigger after every workflow
+  fitness_threshold: 0.70      # below this → auto-optimize
+  max_evolution_attempts: 3    # max retries per cycle
+  fitness_history: .kilo/logs/fitness-history.jsonl
+  token_budget_default: 50000
+  time_budget_default: 300
+```
--- a/agent-evolution/ideas/evolve-command.md
+++ b/agent-evolution/ideas/evolve-command.md
@@ -0,0 +1,72 @@
+---
+description: Run evolution cycle — judge last workflow, optimize underperforming agents, re-test
+---
+
+# /evolve — Pipeline Evolution Command
+
+Runs the automated evolution cycle on the most recent (or specified) workflow.
+
+## Usage
+
+```
+/evolve                     # evolve last completed workflow
+/evolve --issue 42          # evolve workflow for issue #42
+/evolve --agent planner     # focus evolution on one agent
+/evolve --dry-run           # show what would change without applying
+/evolve --history           # print fitness trend chart
+```
+
+## Execution
+
+### Step 1: Judge
+```
+Task(subagent_type: "pipeline-judge")
+→ produces fitness report
+```
+
+### Step 2: Decide
+```
+IF fitness >= 0.85:
+  echo "✅ Pipeline healthy (fitness: {score}). No action needed."
+  append to fitness-history.jsonl
+  EXIT
+
+IF fitness >= 0.70:
+  echo "⚠ Pipeline marginal (fitness: {score}). Optimizing weak agents..."
+  identify agents with lowest per-agent scores
+  Task(subagent_type: "prompt-optimizer", target: weak_agents)
+
+IF fitness < 0.70:
+  echo "🔴 Pipeline underperforming (fitness: {score}). Major optimization..."
+  Task(subagent_type: "prompt-optimizer", target: all_flagged_agents)
+  IF fitness < 0.50:
+    Task(subagent_type: "agent-architect", action: "redesign", target: worst_agent)
+```
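The threshold routing in Step 2 can be written as a single pure function. This is a sketch under the thresholds stated above; the action names are hypothetical labels, not identifiers from the commit.

```typescript
// Hypothetical labels for the four outcomes of Step 2.
type EvolutionAction =
  | "none"                        // fitness >= 0.85: log and exit
  | "minor_tuning"                // 0.70 <= fitness < 0.85: optimize weak agents
  | "major_rewrite"               // 0.50 <= fitness < 0.70: optimize all flagged agents
  | "major_rewrite_and_redesign"; // fitness < 0.50: also call agent-architect

function decideAction(fitness: number): EvolutionAction {
  if (fitness >= 0.85) return "none";
  if (fitness >= 0.70) return "minor_tuning";
  if (fitness >= 0.50) return "major_rewrite";
  return "major_rewrite_and_redesign";
}

console.log(decideAction(0.91));
console.log(decideAction(0.73));
console.log(decideAction(0.42));
```

Checking the bands from highest to lowest keeps each score in exactly one branch, which matches the `EXIT` after the first case in the pseudocode.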
+
+### Step 3: Re-test
+```
+Re-run the SAME workflow with updated prompts
+Task(subagent_type: "pipeline-judge") → fitness_after
+
+IF fitness_after > fitness_before:
+  commit prompt changes
+  echo "📈 Fitness improved: {before} → {after}"
+ELSE:
+  revert prompt changes
+  echo "📉 No improvement. Reverting."
+```
+
+### Step 4: Log
+```
+Append to .kilo/logs/fitness-history.jsonl:
+{
+  "ts": "<now>",
+  "issue": <N>,
+  "workflow": "<type>",
+  "fitness_before": <score>,
+  "fitness_after": <score>,
+  "agents_optimized": ["planner", "requirement-refiner"],
+  "tokens_saved": <delta>,
+  "time_saved_ms": <delta>
+}
+```
--- a/agent-evolution/ideas/initial.txt
+++ b/agent-evolution/ideas/initial.txt
@@ -0,0 +1,26 @@
+Now I understand the full architecture. I am building an auto-evolution system: not an HTML dashboard, but a set of files for the repository, namely a skeptical judge agent, an evolution workflow, and standardized test cases. Here are the 4 files for the auto-evolution system. How it works:
+pipeline-judge.md is the new skeptic agent. It does not give subjective ratings. Instead it:
+
+Runs bun test, bun build, bun lint, bun typecheck
+Counts pass/fail/skip from the JSON reports
+Reads the pipeline logs (tokens and time per agent)
+Computes a fitness score by the formula: 50% tests + 25% quality gates + 25% efficiency (tokens/time)
+If fitness < 0.70, automatically invokes prompt-optimizer
+
+evolution-workflow.md is the continuous optimization workflow:
+
+Fires automatically after every completed workflow
+fitness ≥ 0.85 → log and move on
+fitness 0.70–0.84 → prompt-optimizer fixes the weak agents
+fitness < 0.50 → agent-architect redesigns the agent
+After optimization, the same workflow is re-run with the new prompts and fitness is compared before and after. Improved → commit, otherwise → revert
+
+The orchestrator evolves through fitness-history.jsonl, a cumulative record of all runs. The orchestrator learns which agents to skip (ROI < 0.1), how to allocate the token budget, and which pipeline template works best for each task type.
+evolve-command.md is the /evolve command for manual runs or for viewing the trend.
+evolution-patch.json is a ready-made patch for capability-index.yaml: it adds pipeline-judge, routing, iteration_loops, and the evolution configuration with per-task-type budgets.
+The files need to be placed in the repository:
+
+pipeline-judge.md → .kilo/agents/
+evolution-workflow.md → .kilo/workflows/
+evolve-command.md → .kilo/commands/
+evolution-patch.json → apply to capability-index.yaml
--- a/agent-evolution/ideas/pipeline-judge.md
+++ b/agent-evolution/ideas/pipeline-judge.md
@@ -0,0 +1,181 @@
+---
+description: Automated pipeline judge. Evaluates workflow execution by running tests, measuring token cost and wall-clock time. Produces fitness scores. Never writes code — only measures and scores.
+mode: subagent
+model: ollama-cloud/nemotron-3-super
+color: "#DC2626"
+permission:
+  read: allow
+  write: deny
+  bash: allow
+  task: allow
+  glob: allow
+  grep: allow
+---
+
+# Kilo Code: Pipeline Judge
+
+## Role Definition
+
+You are **Pipeline Judge** — the automated fitness evaluator. You do NOT score subjectively. You measure objectively:
+
+1. **Test pass rate** — run the test suite, count pass/fail/skip
+2. **Token cost** — sum tokens consumed by all agents in the pipeline
+3. **Wall-clock time** — total execution time from first agent to last
+4. **Quality gates** — binary pass/fail for each quality gate
+
+You produce a **fitness score** that drives evolutionary optimization.
+
+## When to Invoke
+
+- After ANY workflow completes (feature, bugfix, refactor, etc.)
+- After prompt-optimizer changes an agent's prompt
+- After a model swap recommendation is applied
+- On `/evaluate` command
+
+## Fitness Score Formula
+
+```
+fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
+
+where:
+  test_pass_rate = passed_tests / total_tests                    # 0.0 - 1.0
+  quality_gates_rate = passed_gates / total_gates                # 0.0 - 1.0  
+  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)         # higher = cheaper/faster
+  normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
+```
+
+## Execution Protocol
+
+### Step 1: Collect Metrics
+```bash
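+# Run the test suites. Assumes the runner prints a Jest-style JSON summary
+# (numTotalTests, numPassedTests, numFailedTests) to stdout; stderr goes to
+# separate logs so it cannot corrupt the JSON.
+bun test --reporter=json > /tmp/test-unit.json 2> /tmp/test-unit.log
+bun test:e2e --reporter=json > /tmp/test-e2e.json 2> /tmp/test-e2e.log
+
+# Count results, summed across both suites
+TOTAL=$(jq -s 'map(.numTotalTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
+PASSED=$(jq -s 'map(.numPassedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)
+FAILED=$(jq -s 'map(.numFailedTests) | add' /tmp/test-unit.json /tmp/test-e2e.json)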
+
+# Check build
+bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
+
+# Check lint
+bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
+
+# Check types
+bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
+```
+
+### Step 2: Read Pipeline Log
+Read `.kilo/logs/pipeline-*.log` for:
+- Token counts per agent (from API response headers)
+- Execution time per agent
+- Number of iterations in evaluator-optimizer loops
+- Which agents were invoked and in what order
+
+### Step 3: Calculate Fitness
+```
+test_pass_rate = PASSED / TOTAL
+quality_gates:
+  - build: BUILD_OK
+  - lint: LINT_OK  
+  - types: TYPES_OK
+  - tests: FAILED == 0
+  - coverage: coverage >= 80%
+quality_gates_rate = passed_gates / 5
+
+token_budget = 50000  # tokens per standard workflow
+time_budget = 300     # seconds per standard workflow
+normalized_cost = (total_tokens/token_budget × 0.5) + (total_time/time_budget × 0.5)
+efficiency = 1.0 - min(normalized_cost, 1.0)
+
+FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25
+```
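The calculation in Step 3 can be sketched as one function. Only the formula and weights come from the text above; the type names, the function name, and the handling of an empty test suite are assumptions. The sample inputs are 45 of 47 tests, 4 of 5 gates, 38,400 tokens, and 245 s against the 50,000-token / 300 s budget.

```typescript
// Hypothetical types; only the formula itself comes from the text above.
interface PipelineMetrics {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  totalTokens: number;
  totalTimeS: number;
}

interface Budget {
  tokens: number;
  timeS: number;
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(Math.max(x, lo), hi);

function computeFitness(m: PipelineMetrics, budget: Budget): number {
  // Assumption: a run with no tests (or no gates) scores 0 on that component.
  const testPassRate = m.totalTests > 0 ? m.passedTests / m.totalTests : 0;
  const gatesRate = m.totalGates > 0 ? m.passedGates / m.totalGates : 0;
  // Tokens and time each contribute half of the normalized cost.
  const normalizedCost =
    (m.totalTokens / budget.tokens) * 0.5 + (m.totalTimeS / budget.timeS) * 0.5;
  const efficiency = 1.0 - clamp(normalizedCost, 0, 1);
  return testPassRate * 0.5 + gatesRate * 0.25 + efficiency * 0.25;
}

const fitness = computeFitness(
  {
    passedTests: 45,
    totalTests: 47,
    passedGates: 4,
    totalGates: 5,
    totalTokens: 38400,
    totalTimeS: 245,
  },
  { tokens: 50000, timeS: 300 },
);
console.log(fitness.toFixed(2));
```

Because the cost term is clamped, a run that uses the full budget in both tokens and time scores 0 on efficiency, so a fully passing pipeline is capped at 0.75 in that case.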
+
+### Step 4: Produce Report
+```json
+{
+  "workflow_id": "wf-<issue_number>-<timestamp>",
+  "fitness": 0.73,
+  "breakdown": {
+    "test_pass_rate": 0.96,
+    "quality_gates_rate": 0.80,
+    "efficiency_score": 0.21
+  },
+  "tests": {
+    "total": 47,
+    "passed": 45,
+    "failed": 2,
+    "skipped": 0,
+    "failed_names": ["auth.test.ts:42", "api.test.ts:108"]
+  },
+  "quality_gates": {
+    "build": true,
+    "lint": true,
+    "types": true,
+    "tests_clean": false,
+    "coverage_80": true
+  },
+  "cost": {
+    "total_tokens": 38400,
+    "total_time_ms": 245000,
+    "per_agent": [
+      {"agent": "lead-developer", "tokens": 12000, "time_ms": 45000},
+      {"agent": "sdet-engineer", "tokens": 8500, "time_ms": 32000}
+    ]
+  },
+  "iterations": {
+    "code_review_loop": 2,
+    "security_review_loop": 1
+  },
+  "verdict": "MARGINAL",
+  "bottleneck_agent": "lead-developer",
+  "most_expensive_agent": "lead-developer",
+  "improvement_trigger": false
+}
+```
+
+### Step 5: Trigger Evolution (if needed)
+```
+IF fitness < 0.70:
+  → Task(subagent_type: "prompt-optimizer", payload: report)
+  → improvement_trigger = true
+
+IF any agent consumed > 30% of total tokens:
+  → Flag as bottleneck
+  → Suggest model downgrade or prompt compression
+
+IF iterations > 2 in any loop:
+  → Flag evaluator-optimizer convergence issue
+  → Suggest prompt refinement for the evaluator agent
+```
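The 30% token-share rule above can be sketched as a filter over the `per_agent` entries of the report. The function name is hypothetical; the total is passed in separately because `per_agent` in the sample report lists only some of the agents.

```typescript
// Shape of one per_agent entry in the report.
interface AgentCost {
  agent: string;
  tokens: number;
  time_ms: number;
}

// Return the agents whose share of total tokens exceeds `maxShare`.
function findBottlenecks(
  perAgent: AgentCost[],
  totalTokens: number,
  maxShare: number = 0.3,
): string[] {
  if (totalTokens <= 0) return [];
  return perAgent
    .filter((a) => a.tokens / totalTokens > maxShare)
    .map((a) => a.agent);
}

// Values taken from the sample report in Step 4.
const flagged = findBottlenecks(
  [
    { agent: "lead-developer", tokens: 12000, time_ms: 45000 },
    { agent: "sdet-engineer", tokens: 8500, time_ms: 32000 },
  ],
  38400,
);
console.log(flagged);
```

With those numbers, `lead-developer` holds 12000 / 38400 = 31.25% of the tokens, which is just over the threshold.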
+
+## Output Format
+
+```
+## Pipeline Judgment: Issue #<N>
+
+**Fitness: <score>/1.00** [PASS|MARGINAL|FAIL]
+
+| Metric | Value | Weight | Contribution |
+|--------|-------|--------|-------------|
+| Tests  | 96% (45/47) | 50% | 0.479 |
+| Gates  | 80% (4/5) | 25% | 0.200 |
+| Cost   | 38.4K tok / 245s | 25% | 0.052 |
+
+**Bottleneck:** lead-developer (31% of tokens)
+**Failed tests:** auth.test.ts:42, api.test.ts:108
+**Failed gates:** tests_clean
+
+@if fitness < 0.70: Task tool with subagent_type: "prompt-optimizer"
+@if fitness >= 0.70: Log to .kilo/logs/fitness-history.jsonl
+```
+
+## Prohibited Actions
+
+- DO NOT write or modify any code
+- DO NOT subjectively rate "quality" — only measure
+- DO NOT skip running actual tests
+- DO NOT estimate token counts — read from logs
+- DO NOT change agent prompts — only flag for prompt-optimizer