fa68141d47 feat: add pipeline-judge agent and evolution workflow system

- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70

2026-04-06 00:23:50 +01:00

Evolution Workflow

Continuous self-improvement loop for the agent pipeline. Triggered automatically after every workflow completion.

Overview

[Workflow Completes]
       ↓
[@pipeline-judge] ← runs tests, measures tokens/time
       ↓
   fitness score
       ↓
┌──────────────────────────┐
│ fitness >= 0.85          │──→ Log + done (no action)
│ fitness 0.70 - 0.84      │──→ [@prompt-optimizer] minor tuning
│ fitness 0.50 - 0.69      │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50           │──→ [@agent-architect] redesign agent
└──────────────────────────┘
       ↓
   [Re-run same workflow with new prompts]
       ↓
   [@pipeline-judge] again
       ↓
   compare fitness_before vs fitness_after
       ↓
┌──────────────────────────┐
│ improved?                │
│  Yes → commit new prompts│
│  No  → revert, try       │
│        different strategy│
│        (max 3 attempts)  │
└──────────────────────────┘
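The routing in the diagram above can be sketched as a small function. This is a minimal sketch, not the pipeline-judge implementation: `compute_fitness` and `route` are hypothetical names, the weights are taken from the commit note (test_rate 0.5, gates 0.25, efficiency 0.25), and the sketch assumes the "major rewrite" band covers 0.50 to 0.69, so that scores below 0.50 go only to the agent architect.

```python
def compute_fitness(test_rate: float, gates: float, efficiency: float) -> float:
    """Weighted sum from the commit note; inputs are assumed to lie in [0, 1]."""
    return test_rate * 0.5 + gates * 0.25 + efficiency * 0.25


def route(fitness: float) -> str:
    """Map a fitness score to the follow-up action shown in the overview."""
    if fitness >= 0.85:
        return "log"                       # log + done, no action
    if fitness >= 0.70:
        return "prompt-optimizer:minor"    # minor tuning
    if fitness >= 0.50:
        return "prompt-optimizer:major"    # major rewrite (assumed band)
    return "agent-architect:redesign"      # redesign the agent
```

For example, a run with 80% of tests passing, all gates green, and an efficiency of 0.5 scores 0.775 and is routed to minor tuning.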

Fitness History

All fitness scores are appended to .kilo/logs/fitness-history.jsonl:

{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}

Because entries are only ever appended, the file forms a time series of pipeline fitness that can be queried per issue or per workflow type.
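A minimal sketch of reading that log, assuming only the fields shown in the two sample entries above. The helper names `parse_history` and `mean_fitness_by_workflow` are hypothetical, and the sample text stands in for the real file contents.

```python
import json
from collections import defaultdict
from statistics import mean

# The two sample entries from above, used in place of the real log file.
SAMPLE = """\
{"ts":"2026-04-05T12:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-05T14:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
"""


def parse_history(text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def mean_fitness_by_workflow(entries: list[dict]) -> dict[str, float]:
    """Average the fitness score for each workflow type."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for entry in entries:
        grouped[entry["workflow"]].append(entry["fitness"])
    return {workflow: mean(scores) for workflow, scores in grouped.items()}


entries = parse_history(SAMPLE)
trend = mean_fitness_by_workflow(entries)
```

In practice the text would come from reading `.kilo/logs/fitness-history.jsonl` rather than from an inline string.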

Orchestrator Evolution

The orchestrator uses fitness history to optimize future pipeline construction:

Pipeline Selection Strategy

For each new issue:
  1. Classify issue type (feature|bugfix|refactor|api|security)
  2. Look up fitness history for same type
  3. Find the pipeline configuration with highest fitness
  4. Use that as template, but adapt to current issue
  5. Skip agents that consistently score 0 contribution
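Steps 2 to 5 could look roughly like the sketch below. It rests on an assumption the sample log entries do not confirm: that each history entry also carries a `pipeline` field listing the agents that ran. That field, the function names, and every agent name other than `planner` are hypothetical.

```python
from typing import Optional


def best_pipeline(history: list[dict], issue_type: str) -> Optional[list[str]]:
    """Return the agent list of the highest-fitness past run for this issue type."""
    same_type = [
        entry for entry in history
        if entry["workflow"] == issue_type and "pipeline" in entry
    ]
    if not same_type:
        return None  # no history yet: the caller falls back to a default pipeline
    best = max(same_type, key=lambda entry: entry["fitness"])
    return list(best["pipeline"])


def drop_zero_contributors(pipeline: list[str], zero_agents: set[str]) -> list[str]:
    """Step 5: skip agents that consistently contribute nothing, keeping order."""
    return [agent for agent in pipeline if agent not in zero_agents]


# Hypothetical history with the assumed `pipeline` field.
history = [
    {"workflow": "feature", "fitness": 0.82, "pipeline": ["planner", "coder", "reviewer"]},
    {"workflow": "feature", "fitness": 0.74, "pipeline": ["planner", "coder"]},
    {"workflow": "bugfix", "fitness": 0.91, "pipeline": ["coder", "reviewer"]},
]
template = best_pipeline(history, "feature")
pruned = drop_zero_contributors(template, {"reviewer"})
```

Adapting the template to the current issue (step 4) is left out because the source does not describe how that adaptation works.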

Agent Ordering Optimization

From fitness-history.jsonl, extract per-agent metrics:
  - avg tokens consumed
  - avg contribution to fitness
  - failure rate (how often this agent's output causes downstream failures)

agents_by_roi = sort(agents, key=contribution/tokens, descending)

For parallel phases:
  - Run high-ROI agents first
  - Skip agents with ROI < 0.1 (cost more than they contribute)
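The ROI ordering above can be made concrete as follows. The source does not fix the unit of ROI, so this sketch assumes contribution per 1,000 tokens so that the 0.1 cutoff is meaningful. `AgentStats`, `order_by_roi`, and the agent names other than `planner` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    name: str
    avg_tokens: float        # average tokens consumed per run
    avg_contribution: float  # average contribution to fitness

    @property
    def roi(self) -> float:
        # Assumption: ROI = contribution per 1,000 tokens (unit not given in the source).
        if self.avg_tokens <= 0:
            return 0.0
        return self.avg_contribution / (self.avg_tokens / 1000)


def order_by_roi(agents: list[AgentStats], min_roi: float = 0.1) -> list[AgentStats]:
    """Drop agents below the ROI cutoff, then sort the rest from highest to lowest ROI."""
    kept = [agent for agent in agents if agent.roi >= min_roi]
    return sorted(kept, key=lambda agent: agent.roi, reverse=True)


agents = [
    AgentStats("planner", avg_tokens=2000, avg_contribution=0.3),   # ROI 0.15
    AgentStats("coder", avg_tokens=3000, avg_contribution=0.6),     # ROI 0.20
    AgentStats("reviewer", avg_tokens=8000, avg_contribution=0.4),  # ROI 0.05, skipped
]
ordered = [agent.name for agent in order_by_roi(agents)]
```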

Token Budget Allocation

total_budget = 50000 tokens (configurable)

For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
  
  If agent exceeds budget by >50%:
    → prompt-optimizer compresses that agent's prompt
    → or swap to a smaller/faster model
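A minimal sketch of the proportional allocation and the 50% overrun check described above. The function names, the even-split fallback when no contribution data exists, and the agent names other than `planner` are assumptions made for illustration.

```python
def allocate_budgets(contributions: dict[str, float], total_budget: int = 50_000) -> dict[str, int]:
    """Split the total token budget in proportion to each agent's average contribution."""
    total = sum(contributions.values())
    if total <= 0:
        # Assumption: with no contribution signal, split the budget evenly.
        share = total_budget // max(len(contributions), 1)
        return {name: share for name in contributions}
    return {name: int(total_budget * value / total) for name, value in contributions.items()}


def exceeds_budget(tokens_used: int, agent_budget: int, tolerance: float = 0.5) -> bool:
    """True when usage is more than 50% over budget, the trigger for compression or a model swap."""
    return tokens_used > agent_budget * (1 + tolerance)


budgets = allocate_budgets({"planner": 0.25, "coder": 0.5, "reviewer": 0.25})
```

With these inputs the planner receives 12,500 tokens, so it would be flagged only after using more than 18,750.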

Standard Test Suites

No manual test configuration needed. Tests are auto-discovered:

Test Discovery

# Unit tests
find src -name "*.test.ts" -o -name "*.spec.ts" | wc -l

# E2E tests  
find tests/e2e -name "*.test.ts" | wc -l

# Integration tests
find tests/integration -name "*.test.ts" | wc -l

Quality Gates (standardized)

gates:
  build:      "bun run build"
  lint:       "bun run lint"
  typecheck:  "bun run typecheck"  
  unit_tests: "bun test"
  e2e_tests:  "bun test:e2e"
  coverage:   "bun test --coverage | grep 'All files' | awk '{ exit !($10 >= 80) }'"
  security:   "bun audit --level=high | grep 'found 0'"
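One way to turn these gates into the `gates` term of the fitness formula is to run each command and take the fraction that exit with status 0. This is a sketch under that assumption: the source does not say how the gate score is computed, and `run_gates` and `gate_score` are hypothetical names. The demo uses trivial shell commands in place of the real `bun` commands.

```python
import subprocess


def run_gates(gates: dict[str, str]) -> dict[str, bool]:
    """Run each gate command in a shell; a gate passes when its exit status is 0."""
    results = {}
    for name, command in gates.items():
        completed = subprocess.run(command, shell=True, capture_output=True)
        results[name] = completed.returncode == 0
    return results


def gate_score(results: dict[str, bool]) -> float:
    """Fraction of gates that passed (assumed definition of the `gates` term)."""
    return sum(results.values()) / len(results) if results else 0.0


# Stand-in commands: one passing gate and one failing gate.
demo_results = run_gates({"build": "exit 0", "lint": "exit 1"})
demo_score = gate_score(demo_results)
```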

Workflow-Specific Benchmarks

benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3
    
  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix — must prove fix works
    max_iterations: 2
    
  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2
    
  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
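The token and time budgets above can be checked against a fitness-history entry as sketched below. Only the two budget fields are modeled, and `within_budget` is a hypothetical helper, not part of the pipeline.

```python
# Token and time budgets copied from the benchmark table above.
BENCHMARKS = {
    "feature": {"token_budget": 50000, "time_budget_s": 300},
    "bugfix": {"token_budget": 20000, "time_budget_s": 120},
    "refactor": {"token_budget": 40000, "time_budget_s": 240},
    "security": {"token_budget": 30000, "time_budget_s": 180},
}


def within_budget(entry: dict) -> bool:
    """Check a history entry against the budgets for its workflow type."""
    limits = BENCHMARKS[entry["workflow"]]
    seconds = entry["time_ms"] / 1000  # history stores milliseconds
    return entry["tokens"] <= limits["token_budget"] and seconds <= limits["time_budget_s"]


feature_run = {"workflow": "feature", "tokens": 38400, "time_ms": 245000}
slow_bugfix = {"workflow": "bugfix", "tokens": 12000, "time_ms": 150000}
```

The `feature_run` values match the first sample history entry, which stays inside both budgets. The `slow_bugfix` entry is made up to show a time overrun.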

Prompt Evolution Protocol

When prompt-optimizer is triggered:

1. Read current agent prompt from .kilo/agents/<agent>.md
2. Read fitness report identifying the problem
3. Read last 5 fitness entries for this agent from history

4. Analyze pattern:
   - IF consistently low → systemic prompt issue
   - IF regression after change → revert
   - IF one-time failure → might be task-specific, no action

5. Generate improved prompt:
   - Keep same structure (description, mode, model, permissions)
   - Modify ONLY the instruction body
   - Add an explicit output format if output formatting was the issue
   - Add few-shot examples if quality was the issue
   - Compress verbose sections if tokens were the issue

6. Save to .kilo/agents/<agent>.md.candidate

7. Re-run the SAME workflow with .candidate prompt

8. [@pipeline-judge] scores again

9. IF fitness_new > fitness_old:
     mv .candidate → .md (commit)
   ELSE:
     rm .candidate (revert)
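Step 9 can be sketched with `pathlib`. This assumes the candidate sits next to the live prompt as `<agent>.md.candidate`, as in step 6. `resolve_candidate` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_candidate(agent_file: Path, fitness_old: float, fitness_new: float) -> bool:
    """Promote or discard `<agent>.md.candidate`; return True if it was promoted."""
    candidate = agent_file.with_name(agent_file.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(agent_file)  # overwrite the live prompt with the candidate
        return True
    candidate.unlink()  # revert: the live prompt was never modified
    return False
```

Because the live prompt is only touched on promotion, a failed attempt leaves `.kilo/agents/<agent>.md` exactly as it was. A tie in fitness counts as no improvement and discards the candidate.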

Usage

# Triggered automatically after any workflow
# OR manually:
/evolve                    # run evolution on last workflow
/evolve --issue 42         # run evolution on specific issue
/evolve --agent planner    # evolve specific agent's prompt
/evolve --history          # show fitness trend

Configuration

# Add to kilo.jsonc or capability-index.yaml
evolution:
  enabled: true
  auto_trigger: true           # trigger after every workflow
  fitness_threshold: 0.70      # below this → auto-optimize
  max_evolution_attempts: 3    # max retries per cycle
  fitness_history: .kilo/logs/fitness-history.jsonl
  token_budget_default: 50000
  time_budget_default: 300
