feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
71
AGENTS.md
@@ -17,12 +17,15 @@ Agent: Runs full pipeline for issue #42 with Gitea logging
|---------|-------------|-------|
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
| `/plan` | Creates detailed task plans | `/plan feature X` |
| `/ask` | Answers codebase questions | `/ask how does auth work` |
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
| `/code` | Quick code generation | `/code add validation` |
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
| `/evolution report` | Generate evolution report | `/evolution report` |

## Pipeline Agents (Subagents)

@@ -62,7 +65,8 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 |-------|------|--------------|
 | `@release-manager` | Git operations | Status: releasing |
 | `@evaluator` | Scores effectiveness | Status: evaluated |
-| `@prompt-optimizer` | Improves prompts | When score < 7 |
+| `@pipeline-judge` | Objective fitness scoring | After workflow completes |
+| `@prompt-optimizer` | Improves prompts | When fitness < 0.70 |
 | `@capability-analyst` | Analyzes task coverage | When starting new task |
 | `@agent-architect` | Creates new agents | When gaps identified |
 | `@workflow-architect` | Creates workflows | New workflow needed |
@@ -94,9 +98,27 @@ These agents are invoked automatically by `/pipeline` or manually via `@mention`
 [releasing]
 ↓ @release-manager
 [evaluated]
-↓ @evaluator
-├── [score ≥ 7] → [completed]
-└── [score < 7] → @prompt-optimizer → [completed]
+↓ @evaluator (subjective score 1-10)
+├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
+└── [score < 7] → @prompt-optimizer → [@evaluated]
+↓
+[@pipeline-judge] ← runs tests, measures tokens/time
+↓
+fitness score
+↓
+┌──────────────────────────────────────┐
+│ fitness >= 0.85                      │──→ [completed]
+│ fitness 0.70-0.84                    │──→ @prompt-optimizer → [evolving]
+│ fitness < 0.70                       │──→ @prompt-optimizer (major) → [evolving]
+│ fitness < 0.50                       │──→ @agent-architect → redesign
+└──────────────────────────────────────┘
+↓
+[evolving] → re-run workflow → [@pipeline-judge]
+↓
+compare fitness_before vs fitness_after
+↓
+[improved?] → commit prompts → [completed]
+└─ [not improved?] → revert → try different strategy
 ```
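
The fitness routing box in the state machine can be read as a threshold function. The following is a minimal TypeScript sketch, not code from this commit: `routeByFitness` and the `Route` labels are hypothetical names, and it assumes the `< 0.50` row takes precedence over the overlapping `< 0.70` row, which the diagram does not state.

```typescript
// Hypothetical route labels mirroring the four outcomes in the diagram.
type Route =
  | "completed"
  | "prompt-optimizer"
  | "prompt-optimizer-major"
  | "agent-architect";

// Assumption: thresholds are checked from lowest to highest, so a fitness
// below 0.50 goes to @agent-architect rather than to the overlapping
// "fitness < 0.70" row.
function routeByFitness(fitness: number): Route {
  if (!isFinite(fitness) || fitness < 0 || fitness > 1) {
    throw new RangeError("fitness must be in [0, 1], got " + fitness);
  }
  if (fitness < 0.5) return "agent-architect";
  if (fitness < 0.7) return "prompt-optimizer-major";
  if (fitness < 0.85) return "prompt-optimizer";
  return "completed";
}
```

Checking the lowest band first is one way to resolve the overlap; if the intended precedence differs, only the order of the comparisons needs to change.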

 ## Capability Analysis Flow

@@ -167,6 +189,14 @@ Scores saved to `.kilo/logs/efficiency_score.json`:
 }
 ```

+### Fitness Tracking
+
+Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
+```jsonl
+{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
+{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
+```
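
Because the history file holds one JSON object per line, it can be parsed line by line. The sketch below is illustrative only: the `FitnessRecord` field names come from the two sample records, while `parseFitnessHistory` and `meanFitnessByWorkflow` are hypothetical helpers that are not part of this commit. It works on a string so that it stays independent of how the file is read.

```typescript
// Field names taken from the sample records in fitness-history.jsonl.
interface FitnessRecord {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

// Parse JSON Lines text: one JSON object per non-empty line.
function parseFitnessHistory(jsonl: string): FitnessRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as FitnessRecord);
}

// Hypothetical helper: mean fitness per workflow type.
function meanFitnessByWorkflow(records: FitnessRecord[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  records.forEach((r) => {
    const entry = sums[r.workflow] || { total: 0, count: 0 };
    entry.total += r.fitness;
    entry.count += 1;
    sums[r.workflow] = entry;
  });
  const means: Record<string, number> = {};
  Object.keys(sums).forEach((workflow) => {
    means[workflow] = sums[workflow].total / sums[workflow].count;
  });
  return means;
}
```

Appending a record is the reverse operation: serialize one object with `JSON.stringify` and write it followed by a newline.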
+
 ## Manual Agent Invocation

 ```typescript
@@ -192,11 +222,34 @@ GITEA_TOKEN=your-token-here
 ## Self-Improvement Cycle

 1. **Pipeline runs** for each issue
-2. **Evaluator scores** each agent (1-10)
-3. **Low scores (<7)** trigger prompt-optimizer
-4. **Prompt optimizer** analyzes failures and improves prompts
-5. **New prompts** saved to `.kilo/agents/`
-6. **Next run** uses improved prompts
+2. **Evaluator scores** each agent (1-10) - subjective
+3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
+4. **Low fitness (<0.70)** triggers prompt-optimizer
+5. **Prompt optimizer** analyzes failures and improves prompts
+6. **Re-run workflow** with improved prompts
+7. **Compare fitness** before/after - commit if improved
+8. **Log results** to `.kilo/logs/fitness-history.jsonl`
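
Step 7 of the cycle above reduces to a comparison of two fitness values. The following is a minimal sketch under stated assumptions: `decideEvolution` and its `minGain` parameter are hypothetical, and "improved" is taken to mean a strictly higher fitness, since the cycle does not define a margin.

```typescript
type EvolutionDecision = "commit" | "revert";

// Assumption: "improved" means fitnessAfter exceeds fitnessBefore by more
// than minGain. The default of 0 requires any strict improvement; a small
// positive margin could guard against run-to-run noise.
function decideEvolution(
  fitnessBefore: number,
  fitnessAfter: number,
  minGain: number = 0
): EvolutionDecision {
  return fitnessAfter - fitnessBefore > minGain ? "commit" : "revert";
}
```

An unchanged score counts as "not improved" here, so the prompts would be reverted and a different strategy tried.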
+
+### Evaluator vs Pipeline Judge
+
+| Aspect | Evaluator | Pipeline Judge |
+|--------|-----------|----------------|
+| Type | Subjective | Objective |
+| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
+| Metrics | Observations | Tests, tokens, time |
+| Trigger | After workflow | After evaluator |
+| Action | Logs to Gitea | Triggers optimization |
+
+### Fitness Score Components
+
+```
+fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)
+
+where:
+test_pass_rate = passed_tests / total_tests
+quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
+efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
+```
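
The formula above translates directly into code. This is a sketch rather than the pipeline-judge implementation: `FitnessInputs`, `rate`, and `computeFitness` are hypothetical names, `normalizedCost` is assumed to arrive already normalized (the source does not define the normalization), and a zero denominator is assumed to yield a rate of 0.

```typescript
// Hypothetical input shape; names follow the terms in the formula.
interface FitnessInputs {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  normalizedCost: number; // assumed pre-normalized; not defined in the source
}

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Assumption: an empty test suite or gate list counts as a rate of 0.
function rate(passed: number, total: number): number {
  return total > 0 ? passed / total : 0;
}

function computeFitness(m: FitnessInputs): number {
  const testPassRate = rate(m.passedTests, m.totalTests);
  const qualityGatesRate = rate(m.passedGates, m.totalGates);
  const efficiencyScore = 1.0 - clamp(m.normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}
```

With every test and gate passing and a `normalizedCost` of 0 this returns 1.0. A cost at or above 1 zeroes the efficiency term, which caps the score at 0.75.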

 ## Architecture Files
