APAW/.kilo/workflows/fitness-evaluation.md
fa68141d47 feat: add pipeline-judge agent and evolution workflow system
- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00


Fitness Evaluation Workflow

Post-workflow fitness evaluation and automatic optimization loop.

Overview

This workflow runs after every completed workflow to:

  1. Evaluate fitness objectively via pipeline-judge
  2. Trigger optimization if fitness < threshold
  3. Re-run and compare before/after
  4. Log results to fitness-history.jsonl

Flow

[Workflow Completes]
        ↓
[@pipeline-judge] ← runs tests, measures tokens/time
        ↓
    fitness score
        ↓
┌──────────────────────────────────┐
│ fitness >= 0.85                  │──→ Log + done (no action)
│ fitness 0.70 - 0.84              │──→ [@prompt-optimizer] minor tuning
│ fitness 0.50 - 0.69              │──→ [@prompt-optimizer] major rewrite
│ fitness < 0.50                   │──→ [@agent-architect] redesign agent
└──────────────────────────────────┘
        ↓
[Re-run same workflow with new prompts]
        ↓
[@pipeline-judge] again
        ↓
    compare fitness_before vs fitness_after
        ↓
┌──────────────────────────────────┐
│ improved?                        │
│  Yes → commit new prompts        │
│  No  → revert, try               │
│        different strategy        │
│        (max 3 attempts)          │
└──────────────────────────────────┘

Fitness Score Formula

fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
  normalized_cost = (actual_tokens / budget_tokens × 0.5) + (actual_time / budget_time × 0.5)
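
The formula above can be sketched as a small Python function. This is an illustration only, not the pipeline-judge implementation: the function and parameter names are hypothetical, and treating an empty test suite or an empty gate list as a rate of 0 is an assumption the formula itself does not state.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def fitness_score(passed_tests, total_tests, passed_gates, total_gates,
                  actual_tokens, budget_tokens, actual_time_s, budget_time_s):
    """Weighted fitness: 50% tests, 25% quality gates, 25% efficiency."""
    # Assumption: a zero denominator yields a rate of 0 instead of an error.
    test_pass_rate = passed_tests / total_tests if total_tests else 0.0
    quality_gates_rate = passed_gates / total_gates if total_gates else 0.0
    normalized_cost = ((actual_tokens / budget_tokens) * 0.5
                       + (actual_time_s / budget_time_s) * 0.5)
    efficiency_score = 1.0 - clamp(normalized_cost, 0.0, 1.0)
    return (test_pass_rate * 0.50
            + quality_gates_rate * 0.25
            + efficiency_score * 0.25)


# Hypothetical run: 9/10 tests, 4/5 gates, half of both budgets used.
score = fitness_score(9, 10, 4, 5, 25_000, 50_000, 150, 300)
print(round(score, 3))  # 0.45 + 0.20 + 0.125 = 0.775
```

Because normalized_cost is clamped, a run that exceeds both budgets contributes 0 efficiency rather than a negative value, so the total always stays in the range 0 to 1.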

Quality Gates

Each gate is binary (pass/fail):

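| Gate     | Command                        | Weight |
|----------|--------------------------------|--------|
| build    | bun run build                  | 1/5    |
| lint     | bun run lint                   | 1/5    |
| types    | bun run typecheck              | 1/5    |
| tests    | bun test                       | 1/5    |
| coverage | bun test --coverage (>= 80%)   | 1/5    |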

Budget Defaults

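| Workflow | Token Budget | Time Budget (s) | Min Coverage |
|----------|--------------|-----------------|--------------|
| feature  | 50000        | 300             | 80%          |
| bugfix   | 20000        | 120             | 90%          |
| refactor | 40000        | 240             | 95%          |
| security | 30000        | 180             | 80%          |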

Workflow-Specific Benchmarks

benchmarks:
  feature:
    token_budget: 50000
    time_budget_s: 300
    min_test_coverage: 80%
    max_iterations: 3
    
  bugfix:
    token_budget: 20000
    time_budget_s: 120
    min_test_coverage: 90%  # higher for bugfix - must prove fix works
    max_iterations: 2
    
  refactor:
    token_budget: 40000
    time_budget_s: 240
    min_test_coverage: 95%  # must not break anything
    max_iterations: 2
    
  security:
    token_budget: 30000
    time_budget_s: 180
    min_test_coverage: 80%
    max_iterations: 2
    required_gates: [security]  # security gate MUST pass
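
As a rough sketch of how these benchmarks might be consumed, the snippet below mirrors the YAML values in a Python mapping and reports which limits a run misses. The names Benchmark, BENCHMARKS and benchmark_violations are hypothetical, coverage is expressed as a fraction (0.80 for 80%), and treating a missing gate result as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    token_budget: int
    time_budget_s: int
    min_test_coverage: float  # fraction, e.g. 0.80 for 80%
    max_iterations: int
    required_gates: tuple = ()


# Values copied from the benchmarks block above.
BENCHMARKS = {
    "feature": Benchmark(50_000, 300, 0.80, 3),
    "bugfix": Benchmark(20_000, 120, 0.90, 2),
    "refactor": Benchmark(40_000, 240, 0.95, 2),
    "security": Benchmark(30_000, 180, 0.80, 2, required_gates=("security",)),
}


def benchmark_violations(workflow, coverage, gate_results):
    """Return reasons the run misses its benchmark (empty list means OK)."""
    bench = BENCHMARKS[workflow]
    problems = []
    if coverage < bench.min_test_coverage:
        problems.append(
            f"coverage {coverage:.0%} < {bench.min_test_coverage:.0%}")
    for gate in bench.required_gates:
        # Assumption: a gate with no recorded result counts as failed.
        if not gate_results.get(gate, False):
            problems.append(f"required gate '{gate}' did not pass")
    return problems


print(benchmark_violations("bugfix", 0.85, {}))
print(benchmark_violations("security", 0.90, {"security": False}))
```

Keeping the per-workflow limits in one mapping means the efficiency calculation and the coverage check can both look up the same record instead of hardcoding the feature defaults.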

Execution Steps

Step 1: Collect Metrics

Agent: pipeline-judge

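# Run test suite. Keep stderr out of the JSON file so jq can parse it.
# NOTE: assumes the reporter emits Jest-style JSON (numTotalTests, ...).
bun test --reporter=json > /tmp/test-results.json 2> /tmp/test-stderr.log

# Count results
TOTAL=$(jq '.numTotalTests' /tmp/test-results.json)
PASSED=$(jq '.numPassedTests' /tmp/test-results.json)
FAILED=$(jq '.numFailedTests' /tmp/test-results.json)

# Check quality gates (binary: exit code 0 means pass)
bun run build 2>&1 && BUILD_OK=true || BUILD_OK=false
bun run lint 2>&1 && LINT_OK=true || LINT_OK=false
bun run typecheck 2>&1 && TYPES_OK=true || TYPES_OK=false
[ "$FAILED" -eq 0 ] && TESTS_CLEAN=true || TESTS_CLEAN=false
# Assumes a coverage threshold is configured so a shortfall exits non-zero
bun test --coverage > /dev/null 2>&1 && COVERAGE_OK=true || COVERAGE_OK=false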

Step 2: Read Pipeline Log

Read .kilo/logs/pipeline-*.log for:

  • Token counts per agent
  • Execution time per agent
  • Number of iterations in evaluator-optimizer loops
  • Which agents were invoked

Step 3: Calculate Fitness

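test_pass_rate = PASSED / TOTAL
quality_gates_rate = count_true(BUILD_OK, LINT_OK, TYPES_OK, TESTS_CLEAN, COVERAGE_OK) / 5
# token_budget and time_budget_s come from the workflow's benchmark (feature: 50000 / 300)
efficiency = 1.0 - min((tokens/token_budget + time_s/time_budget_s) / 2, 1.0)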

FITNESS = test_pass_rate × 0.50 + quality_gates_rate × 0.25 + efficiency × 0.25

Step 4: Decide Action

| Fitness   | Action                                  |
|-----------|-----------------------------------------|
| >= 0.85   | Log to fitness-history.jsonl, done      |
| 0.70-0.84 | Call prompt-optimizer for minor tuning  |
| 0.50-0.69 | Call prompt-optimizer for major rewrite |
| < 0.50    | Call agent-architect to redesign agent  |
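
The decision table can be sketched as a chain of threshold checks. The function name and the returned action labels are hypothetical. Half-open intervals are used so that a score such as 0.845, which falls between the listed bands, is still assigned to exactly one action.

```python
def decide_action(fitness: float) -> str:
    """Map a fitness score to the follow-up action from the table above."""
    if fitness >= 0.85:
        return "log"                        # no optimization needed
    if fitness >= 0.70:
        return "prompt-optimizer:minor"     # minor tuning
    if fitness >= 0.50:
        return "prompt-optimizer:major"     # major rewrite
    return "agent-architect:redesign"       # redesign the agent


for value in (0.91, 0.82, 0.61, 0.42):
    print(value, decide_action(value))
```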

Step 5: Re-test After Optimization

If optimization was triggered:

  1. Re-run the same workflow with new prompts
  2. Call pipeline-judge again
  3. Compare fitness_before vs fitness_after
  4. If improved: commit prompts
  5. If not improved: revert and try a different strategy (max 3 attempts)
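
A minimal sketch of this re-test loop, combined with the retry limit from the Flow diagram (max 3 attempts). The names optimize_with_retries and run_and_score are hypothetical. run_and_score stands in for "re-run the workflow with the given strategy and have pipeline-judge score it", and the commit or revert itself is left to the caller.

```python
from typing import Callable, Optional, Sequence


def optimize_with_retries(fitness_before: float,
                          strategies: Sequence[str],
                          run_and_score: Callable[[str], float],
                          max_attempts: int = 3) -> Optional[str]:
    """Try strategies in order; return the first that improves fitness."""
    for strategy in strategies[:max_attempts]:
        fitness_after = run_and_score(strategy)
        if fitness_after > fitness_before:
            return strategy  # improved: caller commits the new prompts
        # Not improved: caller reverts, then the next strategy is tried.
    return None  # no attempt improved fitness; prompts stay as they were


# Hypothetical scores per strategy, used in place of real workflow runs.
fake_scores = {"compress": 0.60, "few-shot": 0.74, "output-format": 0.90}
winner = optimize_with_retries(
    0.65, ["compress", "few-shot", "output-format"], fake_scores.__getitem__)
print(winner)
```

The loop stops at the first improvement rather than searching for the best strategy, which keeps the number of expensive workflow re-runs low.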

Step 6: Log Results

Append to .kilo/logs/fitness-history.jsonl:

{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
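
Appending such an entry could look like the sketch below. The helper name append_fitness_entry is hypothetical, and the demo writes to a temporary directory rather than the real .kilo/logs/ path. The field names and order follow the sample entry above.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_fitness_entry(log_path, *, issue, workflow, fitness, tokens,
                         time_ms, tests_passed, tests_total, ts=None):
    """Append one compact JSON object as a single line and return it."""
    entry = {
        "ts": ts or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "issue": issue,
        "workflow": workflow,
        "fitness": round(fitness, 2),
        "tokens": tokens,
        "time_ms": time_ms,
        "tests_passed": tests_passed,
        "tests_total": tests_total,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries; one JSON object per line (JSONL).
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, separators=(",", ":")) + "\n")
    return entry


# Demo against a temporary file instead of the real log.
demo_log = Path(tempfile.mkdtemp()) / "fitness-history.jsonl"
append_fitness_entry(demo_log, issue=42, workflow="feature", fitness=0.82,
                     tokens=38400, time_ms=245000, tests_passed=45,
                     tests_total=47, ts="2026-04-06T00:00:00Z")
lines = demo_log.read_text(encoding="utf-8").splitlines()
print(lines[0])
```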

Usage

Automatic (post-pipeline)

The workflow triggers automatically after any workflow completes.

Manual

/evolve                     # evolve last completed workflow
/evolve --issue 42          # evolve workflow for issue #42
/evolve --agent planner     # focus evolution on one agent
/evolve --dry-run           # show what would change without applying
/evolve --history           # print fitness trend chart

Integration Points

  • After /pipeline: pipeline-judge scores the workflow
  • After prompt update: evolution loop retries
  • Weekly: Performance trend analysis
  • On request: Recommendation generation

Orchestrator Learning

The orchestrator uses fitness history to optimize future pipeline construction:

Pipeline Selection Strategy

For each new issue:
  1. Classify issue type (feature|bugfix|refactor|api|security)
  2. Look up fitness history for same type
  3. Find pipeline configuration with highest fitness
  4. Use that as template, but adapt to current issue
  5. Skip agents that consistently score 0 contribution
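
Steps 2 and 3 of this strategy can be sketched as a lookup over history entries. Note the assumption: the sample log entry shown earlier has no field describing which agents ran, so the "pipeline" key used here is hypothetical, as is the choice of mean fitness as the ranking criterion.

```python
from collections import defaultdict
from statistics import mean


def best_pipeline(history, issue_type):
    """Return the agent tuple with the highest mean fitness, or None."""
    by_pipeline = defaultdict(list)
    for entry in history:
        if entry["workflow"] == issue_type:
            # "pipeline" is a hypothetical field: ordered list of agent names.
            by_pipeline[tuple(entry["pipeline"])].append(entry["fitness"])
    if not by_pipeline:
        return None  # no history for this issue type yet
    return max(by_pipeline, key=lambda p: mean(by_pipeline[p]))


history = [
    {"workflow": "feature", "pipeline": ["planner", "coder"], "fitness": 0.70},
    {"workflow": "feature", "pipeline": ["planner", "coder", "reviewer"],
     "fitness": 0.88},
    {"workflow": "feature", "pipeline": ["planner", "coder"], "fitness": 0.80},
    {"workflow": "bugfix", "pipeline": ["coder"], "fitness": 0.95},
]
print(best_pipeline(history, "feature"))
```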

Agent Ordering Optimization

From fitness-history.jsonl, extract per-agent metrics:
  - avg tokens consumed
  - avg contribution to fitness
  - failure rate (how often this agent's output causes downstream failures)

agents_by_roi = sort(agents, key=contribution/tokens, descending)

For parallel phases:
  - Run high-ROI agents first
  - Skip agents with ROI < 0.1 (cost more than they contribute)
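
The ROI ranking might be sketched as follows. The document does not specify the units of contribution or tokens, so this example assumes both are normalized: contribution as a share of fitness and tokens as a share of the total budget. The function name and the sample agents are hypothetical.

```python
def rank_agents_by_roi(agents, min_roi=0.1):
    """agents maps name -> (avg_contribution, avg_token_share).

    Returns (ranked, skipped): agents at or above min_roi sorted by ROI,
    highest first, and the names of agents below the threshold.
    """
    # Assumption: an agent that consumed no tokens gets an ROI of 0.
    roi = {name: (c / t if t > 0 else 0.0) for name, (c, t) in agents.items()}
    ranked = sorted((n for n in roi if roi[n] >= min_roi),
                    key=roi.get, reverse=True)
    skipped = sorted(n for n in roi if roi[n] < min_roi)
    return ranked, skipped


agents = {
    "planner": (0.30, 0.20),   # ROI 1.5
    "coder": (0.50, 0.40),     # ROI 1.25
    "reviewer": (0.02, 0.40),  # ROI 0.05, below the 0.1 threshold
}
ranked, skipped = rank_agents_by_roi(agents)
print(ranked, skipped)
```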

Token Budget Allocation

total_budget = 50000 tokens (configurable)

For each agent in pipeline:
  agent_budget = total_budget × (agent_avg_contribution / sum_all_contributions)
  
  If agent exceeds budget by >50%:
    → prompt-optimizer compresses that agent's prompt
    → or swap to a smaller/faster model
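
A small sketch of the proportional allocation and the 50% overrun check. The function names are hypothetical, and splitting the budget evenly when no contribution data exists is an assumption that the rule above does not cover.

```python
def allocate_budgets(contributions, total_budget=50_000):
    """Split total_budget across agents in proportion to avg contribution."""
    total = sum(contributions.values())
    if total <= 0:
        # Assumption: with no contribution data, split the budget evenly.
        share = total_budget / len(contributions)
        return {agent: share for agent in contributions}
    return {agent: total_budget * c / total
            for agent, c in contributions.items()}


def over_budget(actual_tokens, budgets, tolerance=0.5):
    """Agents whose usage exceeds their budget by more than tolerance."""
    return [agent for agent, used in actual_tokens.items()
            if used > budgets[agent] * (1 + tolerance)]


# Hypothetical contribution weights in a 1:3:1 ratio.
budgets = allocate_budgets({"planner": 1, "coder": 3, "reviewer": 1})
flagged = over_budget(
    {"planner": 16_000, "coder": 31_000, "reviewer": 9_000}, budgets)
print(budgets, flagged)
```

Here planner is flagged because 16,000 tokens is more than 150% of its 10,000-token share, while coder stays within its 30,000-token share.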

Prompt Evolution Protocol

When prompt-optimizer is triggered:

  1. Read current agent prompt from .kilo/agents/<agent>.md
  2. Read fitness report identifying the problem
  3. Read last 5 fitness entries for this agent from history
  4. Analyze pattern:
    • IF consistently low → systemic prompt issue
    • IF regression after change → revert
    • IF one-time failure → might be task-specific, no action
  5. Generate improved prompt:
    • Keep same structure (description, mode, model, permissions)
    • Modify ONLY the instruction body
    • Add explicit output format IF output format was the issue
    • Add few-shot examples IF quality was the issue
    • Compress verbose sections IF tokens were the issue
  6. Save to .kilo/agents/<agent>.md.candidate
  7. Re-run workflow with .candidate prompt
  8. @pipeline-judge scores again
  9. Compare fitness_new against fitness_old:
    • IF fitness_new > fitness_old → mv .candidate → .md (commit)
    • ELSE → rm .candidate (revert)
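
The final commit-or-revert step could be sketched as below. The helper name resolve_candidate is hypothetical, and the demo uses a temporary directory in place of .kilo/agents/. Path.replace overwrites the target, which matches the mv behaviour described in the last step.

```python
import tempfile
from pathlib import Path


def resolve_candidate(agent_md: Path, fitness_old: float,
                      fitness_new: float) -> bool:
    """Promote <agent>.md.candidate if fitness improved, else delete it."""
    candidate = agent_md.with_name(agent_md.name + ".candidate")
    if fitness_new > fitness_old:
        candidate.replace(agent_md)  # mv .candidate -> .md (commit)
        return True
    candidate.unlink()               # rm .candidate (revert)
    return False


# Demo in a temporary directory standing in for .kilo/agents/.
agents_dir = Path(tempfile.mkdtemp())
prompt = agents_dir / "planner.md"
prompt.write_text("old prompt", encoding="utf-8")
(agents_dir / "planner.md.candidate").write_text("new prompt",
                                                 encoding="utf-8")
promoted = resolve_candidate(prompt, fitness_old=0.66, fitness_new=0.81)
print(promoted, prompt.read_text(encoding="utf-8"))
```

A strict greater-than comparison means a candidate that merely ties the old score is discarded, so prompts only change when there is a measured gain.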