feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
¨NW¨
2026-04-06 00:23:50 +01:00
parent 1ab9939c92
commit fa68141d47
12 changed files with 1653 additions and 193 deletions


@@ -17,12 +17,15 @@ Agent: Runs full pipeline for issue #42 with Gitea logging
|---------|-------------|-------|
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
| `/plan` | Creates detailed task plans | `/plan feature X` |
| `/ask` | Answers codebase questions | `/ask how does auth work` |
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
| `/code` | Quick code generation | `/code add validation` |
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
| `/evolution report` | Generate evolution report | `/evolution report` |
## Pipeline Agents (Subagents)
@@ -62,7 +65,8 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
|-------|------|--------------|
| `@release-manager` | Git operations | Status: releasing |
| `@evaluator` | Scores effectiveness | Status: evaluated |
| `@prompt-optimizer` | Improves prompts | When score < 7 |
| `@pipeline-judge` | Objective fitness scoring | After workflow completes |
| `@prompt-optimizer` | Improves prompts | When fitness < 0.70 |
| `@capability-analyst` | Analyzes task coverage | When starting new task |
| `@agent-architect` | Creates new agents | When gaps identified |
| `@workflow-architect` | Creates workflows | New workflow needed |
@@ -94,9 +98,27 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
[releasing]
↓ @release-manager
[evaluated]
↓ @evaluator
├── [score ≥ 7] → [completed]
└── [score < 7] → @prompt-optimizer → [completed]
↓ @evaluator (subjective score 1-10)
├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
└── [score < 7] → @prompt-optimizer → [evaluated]
[@pipeline-judge] ← runs tests, measures tokens/time
fitness score
┌──────────────────────────────────────┐
│ fitness >= 0.85 │──→ [completed]
│ fitness 0.70-0.84 │──→ @prompt-optimizer → [evolving]
│ fitness 0.50-0.69 │──→ @prompt-optimizer (major) → [evolving]
│ fitness < 0.50 │──→ @agent-architect → redesign
└──────────────────────────────────────┘
[evolving] → re-run workflow → [@pipeline-judge]
compare fitness_before vs fitness_after
├── [improved] → commit prompts → [completed]
└── [not improved] → revert → try different strategy
```
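The fitness branch of the state machine can be sketched as a small routing function. This is a minimal illustration, not the `@pipeline-judge` implementation: `NextStep` and `route_by_fitness` are hypothetical names, and the sketch assumes the bands do not overlap (a fitness below 0.50 goes to `@agent-architect` rather than to the major prompt optimization) and that any value below 0.85 but at or above 0.70 falls in the middle band.

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical labels for the outcomes shown in the state machine."""
    COMPLETED = "completed"
    OPTIMIZE = "prompt-optimizer"
    OPTIMIZE_MAJOR = "prompt-optimizer (major)"
    REDESIGN = "agent-architect"


def route_by_fitness(fitness: float) -> NextStep:
    """Map a fitness score in [0.0, 1.0] to the next pipeline step."""
    if not 0.0 <= fitness <= 1.0:
        raise ValueError(f"fitness must be within [0.0, 1.0], got {fitness}")
    if fitness >= 0.85:
        return NextStep.COMPLETED
    if fitness >= 0.70:
        return NextStep.OPTIMIZE
    if fitness >= 0.50:
        # Assumption: scores below 0.50 skip this band and go to redesign.
        return NextStep.OPTIMIZE_MAJOR
    return NextStep.REDESIGN
```

Checking the thresholds from highest to lowest keeps each score in exactly one band.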
## Capability Analysis Flow
@@ -167,6 +189,14 @@ Scores saved to `.kilo/logs/efficiency_score.json`:
}
```
### Fitness Tracking
Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```
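A minimal sketch of how such a log could be written and read back, assuming one JSON object per line as in the sample above. The helper names `append_fitness_record` and `load_fitness_history` are hypothetical and are not part of the commit.

```python
import json
from pathlib import Path


def append_fitness_record(path: Path, record: dict) -> None:
    """Append one record as a single compact JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, separators=(",", ":")) + "\n")


def load_fitness_history(path: Path) -> list:
    """Return all records in file order; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in fh if line.strip()]
```

Appending keeps earlier runs intact, so the file can later be used to compare fitness across issues or workflows.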
## Manual Agent Invocation
```typescript
@@ -192,11 +222,34 @@ GITEA_TOKEN=your-token-here
## Self-Improvement Cycle
1. **Pipeline runs** for each issue
2. **Evaluator scores** each agent (1-10)
3. **Low scores (<7)** trigger prompt-optimizer
4. **Prompt optimizer** analyzes failures and improves prompts
5. **New prompts** saved to `.kilo/agents/`
6. **Next run** uses improved prompts
2. **Evaluator scores** each agent (1-10) - subjective
3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
4. **Low fitness (<0.70)** triggers prompt-optimizer
5. **Prompt optimizer** analyzes failures and improves prompts
6. **Re-run workflow** with improved prompts
7. **Compare fitness** before/after - commit if improved
8. **Log results** to `.kilo/logs/fitness-history.jsonl`
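The re-run and compare steps of the cycle can be sketched as follows. This is an illustration under stated assumptions: `evolve_once` and its callback parameters are hypothetical, "improved" is taken to mean a strictly higher fitness, and how prompts are actually committed or reverted is left to the caller.

```python
from typing import Callable, Tuple


def evolve_once(
    fitness_before: float,
    rerun_workflow: Callable[[], float],
    commit_prompts: Callable[[], None],
    revert_prompts: Callable[[], None],
) -> Tuple[bool, float]:
    """Re-run the workflow with optimized prompts and keep them only if fitness rises.

    Returns (improved, fitness_to_keep).
    """
    fitness_after = rerun_workflow()
    if fitness_after > fitness_before:
        commit_prompts()
        return True, fitness_after
    # No gain: restore the previous prompts and keep the old score.
    revert_prompts()
    return False, fitness_before
```

Passing the commit and revert actions as callbacks keeps the comparison logic independent of Git or Gitea specifics.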
### Evaluator vs Pipeline Judge
| Aspect | Evaluator | Pipeline Judge |
|--------|-----------|----------------|
| Type | Subjective | Objective |
| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
| Metrics | Observations | Tests, tokens, time |
| Trigger | After workflow | After evaluator |
| Action | Logs to Gitea | Triggers optimization |
### Fitness Score Components
```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
where:
test_pass_rate = passed_tests / total_tests
quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
```
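The formula above translates directly into code. The sketch below is illustrative only: `compute_fitness` is a hypothetical name, `normalized_cost` is taken as an input because its definition is not given here, and treating a zero denominator as a rate of 0.0 is an assumption.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def compute_fitness(
    passed_tests: int,
    total_tests: int,
    passed_gates: int,
    total_gates: int,
    normalized_cost: float,
) -> float:
    """Weighted fitness: 50% test pass rate, 25% quality gates, 25% efficiency."""
    # Assumption: with no tests or gates defined, the corresponding rate is 0.0.
    test_pass_rate = passed_tests / total_tests if total_tests else 0.0
    quality_gates_rate = passed_gates / total_gates if total_gates else 0.0
    efficiency_score = 1.0 - clamp(normalized_cost, 0.0, 1.0)
    return (
        test_pass_rate * 0.50
        + quality_gates_rate * 0.25
        + efficiency_score * 0.25
    )
```

For example, 45 of 47 tests passing, 4 of 5 gates passing, and a normalized cost of 0.4 give roughly 0.83, which would fall in the 0.70-0.84 band.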
## Architecture Files