APAW/AGENTS.md
fa68141d47 feat: add pipeline-judge agent and evolution workflow system
- Add pipeline-judge agent for objective fitness scoring
- Update capability-index.yaml with pipeline-judge, evolution config
- Add fitness-evaluation.md workflow for auto-optimization
- Update evolution.md command with /evolve CLI
- Create .kilo/logs/fitness-history.jsonl for metrics logging
- Update AGENTS.md with new workflow state machine
- Add 6 new issues to MILESTONE_ISSUES.md for evolution integration
- Preserve ideas in agent-evolution/ideas/

Pipeline Judge computes fitness = (test_rate*0.5) + (gates*0.25) + (efficiency*0.25)
Auto-triggers prompt-optimizer when fitness < 0.70
2026-04-06 00:23:50 +01:00


Kilo Code Agents Reference

This file configures AI agent behavior for the APAW project - a self-improving code pipeline with Gitea logging.

Pipeline Workflow

The main workflow is /pipeline - use it to process issues through all agents automatically.

User: /pipeline 42
Agent: Runs full pipeline for issue #42 with Gitea logging

Commands (Slash Commands)

| Command | Description | Usage |
|---------|-------------|-------|
| /pipeline <issue> | Run full agent pipeline for issue | /pipeline 42 |
| /status <issue> | Check pipeline status for issue | /status 42 |
| /evolve | Run evolution cycle with fitness scoring | /evolve --issue 42 |
| /evaluate <issue> | Generate performance report | /evaluate 42 |
| /plan | Creates detailed task plans | /plan feature X |
| /ask | Answers codebase questions | /ask how does auth work |
| /debug | Analyzes and fixes bugs | /debug error in login |
| /code | Quick code generation | /code add validation |
| /research [topic] | Run research and self-improvement | /research multi-agent |
| /evolution log | Log agent model change | /evolution log planner "reason" |
| /evolution report | Generate evolution report | /evolution report |

Pipeline Agents (Subagents)

These agents are invoked automatically by /pipeline or manually via @mention:

Core Development

| Agent | Role | When Invoked |
|-------|------|--------------|
| @requirement-refiner | Converts ideas to User Stories | Issue status: new |
| @history-miner | Finds duplicates in git | Status: planned |
| @system-analyst | Designs specifications | Status: researching |
| @sdet-engineer | Writes tests (TDD) | Status: designed |
| @lead-developer | Implements code | Status: testing (tests fail) |
| @frontend-developer | UI implementation | When UI work needed |
| @backend-developer | Node.js/Express/APIs | When backend needed |
| @flutter-developer | Flutter mobile apps | When mobile development |
| @go-developer | Go backend services | When Go backend needed |

Quality Assurance

| Agent | Role | When Invoked |
|-------|------|--------------|
| @code-skeptic | Adversarial review | Status: implementing |
| @the-fixer | Fixes issues | When review fails |
| @performance-engineer | Performance review | After code-skeptic |
| @security-auditor | Security audit | After performance |
| @visual-tester | Visual regression | When UI changes |

Cognitive Enhancement (New)

| Agent | Role | When Invoked |
|-------|------|--------------|
| @planner | Task decomposition (CoT/ToT) | Complex tasks |
| @reflector | Self-reflection (Reflexion) | After each agent |
| @memory-manager | Memory systems | Context management |

Meta & Process

| Agent | Role | When Invoked |
|-------|------|--------------|
| @release-manager | Git operations | Status: releasing |
| @evaluator | Scores effectiveness | Status: evaluated |
| @pipeline-judge | Objective fitness scoring | After workflow completes |
| @prompt-optimizer | Improves prompts | When fitness < 0.70 |
| @capability-analyst | Analyzes task coverage | When starting new task |
| @agent-architect | Creates new agents | When gaps identified |
| @workflow-architect | Creates workflows | New workflow needed |
| @markdown-validator | Validates Markdown | Before issue creation |

Workflow State Machine

[new] 
  ↓ @requirement-refiner
[planned] 
  ↓ @capability-analyst → (gaps?) → @agent-architect → create new agents
  ↓ @history-miner
[researching] 
  ↓ @system-analyst
[designed] 
  ↓ @sdet-engineer (writes failing tests)
[testing] 
  ↓ @lead-developer (makes tests pass)
[implementing] 
  ↓ @code-skeptic (review)
[reviewing] ──[fail]──→ [fixing] ──→ [reviewing]
  ↓ @review-watcher → (auto-validate) → create fix tasks
  ↓ [pass]
[perf-check] 
  ↓ @performance-engineer
[security-check] 
  ↓ @security-auditor
[releasing] 
  ↓ @release-manager
[evaluated] 
  ↓ @evaluator (subjective score 1-10)
  ├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
  └── [score < 7] → @prompt-optimizer → [@evaluated]
        ↓
    [@pipeline-judge] ← runs tests, measures tokens/time
        ↓
    fitness score
        ↓
┌──────────────────────────────────────┐
│ fitness >= 0.85                      │──→ [completed]
│ fitness 0.70-0.84                    │──→ @prompt-optimizer → [evolving]
│ fitness 0.50-0.69                    │──→ @prompt-optimizer (major) → [evolving]
│ fitness < 0.50                       │──→ @agent-architect → redesign
└──────────────────────────────────────┘
        ↓
[evolving] → re-run workflow → [@pipeline-judge]
        ↓
    compare fitness_before vs fitness_after
        ↓
    [improved?] → commit prompts → [completed]
              └─ [not improved?] → revert → try different strategy
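
The threshold routing in the decision box above can be sketched as a small pure function. The function and action names are illustrative, not part of the actual codebase:

```typescript
// Route a fitness score (0.0-1.0) to the next pipeline action.
// Thresholds mirror the decision box above; the names here are
// illustrative assumptions, not the real implementation.
type FitnessAction = "completed" | "optimize" | "optimize-major" | "redesign";

function routeFitness(fitness: number): FitnessAction {
  if (fitness >= 0.85) return "completed";
  if (fitness >= 0.7) return "optimize";       // minor prompt tuning
  if (fitness >= 0.5) return "optimize-major"; // major prompt rewrite
  return "redesign";                           // hand off to @agent-architect
}
```

Checking the highest threshold first keeps the ranges disjoint without encoding both bounds of each band.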

Capability Analysis Flow

When starting a complex task:

[User Request]
      ↓
[@capability-analyst] ← Analyzes requirements vs existing capabilities
      ↓
[Gap Analysis] ← Identifies missing agents, workflows, skills
      ↓
[Recommendations] → Create new or enhance existing?
      ↓
[Decision]
  ├── [Create New] → [@agent-architect] → Create component → Review
  └── [Enhance] → [@lead-developer] → Modify existing
      ↓
[Integration] ← Verify new component works with system
      ↓
[Complete] ← Task can now be handled

Gitea Integration

Status Labels

Pipeline uses Gitea labels to track progress:

  • status: new → status: planned → status: researching → ...
  • Agents add/remove labels automatically
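
As a sketch, a label-update request can be assembled against the Gitea API like this. The endpoint shape follows Gitea's issue-label API but should be verified against your Gitea version's docs; the owner/repo names and the helper itself are hypothetical:

```typescript
// Build (but do not send) a request that adds a status label to an issue.
// Endpoint shape is an assumption based on Gitea's issue-label API;
// owner/repo values and the helper name are placeholders.
function buildStatusLabelRequest(
  apiUrl: string,
  owner: string,
  repo: string,
  issue: number,
  label: string
): { url: string; method: string; body: string } {
  return {
    url: `${apiUrl}/repos/${owner}/${repo}/issues/${issue}/labels`,
    method: "POST",
    body: JSON.stringify({ labels: [label] }),
  };
}

const req = buildStatusLabelRequest(
  "https://git.softuniq.eu/api/v1", "apaw", "apaw", 42, "status: planned"
);
// Send with: fetch(req.url, { method: req.method, body: req.body,
//   headers: { Authorization: `token ${process.env.GITEA_TOKEN}` } })
```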

Performance Logging

Each agent logs to Gitea issue comments:

## ✅ lead-developer completed

**Score**: 8/10
**Duration**: 1.2h
**Files**: src/auth.ts, src/user.ts

### Notes
- Clean implementation
- Follows existing patterns
- Tests passing

Efficiency Tracking

Scores saved to .kilo/logs/efficiency_score.json:

{
  "version": "1.0",
  "history": [
    {
      "issue": 42,
      "date": "2024-01-02T10:00:00Z",
      "agents": {
        "lead-developer": 8,
        "code-skeptic": 7,
        "the-fixer": 9
      },
      "iterations": 2,
      "duration_hours": 1.5
    }
  ]
}
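
A score file like the one above can be aggregated per agent. This reader is a sketch: the record shape mirrors efficiency_score.json, but the helper name and its presence in the codebase are assumptions:

```typescript
// Average an agent's scores across all recorded pipeline runs.
// Record shape mirrors the efficiency_score.json example above;
// the function itself is an illustrative sketch.
interface EfficiencyRun {
  issue: number;
  date: string;
  agents: Record<string, number>;
  iterations: number;
  duration_hours: number;
}

function averageAgentScore(history: EfficiencyRun[], agent: string): number | null {
  const scores = history
    .map((run) => run.agents[agent])
    .filter((s): s is number => typeof s === "number");
  if (scores.length === 0) return null; // agent never ran
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```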

Fitness Tracking

Fitness scores saved to .kilo/logs/fitness-history.jsonl:

{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
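
Each JSONL line is an independent JSON object, so the log can be parsed line by line. A minimal reader (the interface and helper name are illustrative, not the project's actual API):

```typescript
// Parse fitness-history.jsonl content into typed records, skipping
// blank lines. Field names match the sample lines above; the helper
// itself is a sketch, not part of the codebase.
interface FitnessEntry {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

function parseFitnessHistory(jsonl: string): FitnessEntry[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as FitnessEntry);
}
```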

Manual Agent Invocation

// Use Task tool to invoke subagent
Task tool with:
  subagent_type: "lead-developer"
  prompt: "Implement authentication for issue #42"

Or via @mention:

@lead-developer implement authentication flow

Environment Variables

Required for Gitea integration:

GITEA_API_URL=https://git.softuniq.eu/api/v1
GITEA_TOKEN=your-token-here

Self-Improvement Cycle

  1. Pipeline runs for each issue
  2. Evaluator scores each agent (1-10) - subjective
  3. Pipeline Judge measures fitness objectively (0.0-1.0)
  4. Low fitness (<0.70) triggers prompt-optimizer
  5. Prompt optimizer analyzes failures and improves prompts
  6. Re-run workflow with improved prompts
  7. Compare fitness before/after - commit if improved
  8. Log results to .kilo/logs/fitness-history.jsonl
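
Step 7's commit-or-revert decision can be expressed as a tiny comparator. This is a sketch: the noise margin and the function name are assumptions, not the actual implementation:

```typescript
// Decide whether an evolved prompt set should be kept or reverted.
// The small margin avoids committing on measurement noise; the 0.01
// value and the function name are illustrative assumptions.
function decidePromptChange(
  fitnessBefore: number,
  fitnessAfter: number,
  margin = 0.01
): "commit" | "revert" {
  return fitnessAfter > fitnessBefore + margin ? "commit" : "revert";
}
```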

Evaluator vs Pipeline Judge

| Aspect | Evaluator | Pipeline Judge |
|--------|-----------|----------------|
| Type | Subjective | Objective |
| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
| Metrics | Observations | Tests, tokens, time |
| Trigger | After workflow | After evaluator |
| Action | Logs to Gitea | Triggers optimization |

Fitness Score Components

fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
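
Translated directly into code (the weights come from the formula above; the function name and zero-denominator handling are illustrative assumptions):

```typescript
// Compute pipeline fitness from raw counts, per the formula above.
// Weights 0.50/0.25/0.25 are from the spec; treating an empty test
// or gate set as a rate of 0 is an assumption for illustration.
function computeFitness(
  testsPassed: number,
  testsTotal: number,
  gatesPassed: number,
  gatesTotal: number,
  normalizedCost: number
): number {
  const testRate = testsTotal > 0 ? testsPassed / testsTotal : 0;
  const gateRate = gatesTotal > 0 ? gatesPassed / gatesTotal : 0;
  const efficiency = 1 - Math.min(Math.max(normalizedCost, 0), 1); // clamp to [0, 1]
  return testRate * 0.5 + gateRate * 0.25 + efficiency * 0.25;
}
```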

Architecture Files

| File | Purpose |
|------|---------|
| AGENTS.md | This file - main config |
| .kilo/agents/*.md | Agent definitions with prompts |
| .kilo/commands/*.md | Workflow commands |
| .kilo/rules/*.md | Custom rules loaded globally |
| .kilo/skills/ | Skill modules |
| src/kilocode/ | TypeScript API for programmatic use |

Using the TypeScript API

import { 
  createPipelineRunner, 
  GiteaClient, 
  decideRouting 
} from './src/kilocode/index.js'

const runner = await createPipelineRunner({
  giteaToken: process.env.GITEA_TOKEN
})

await runner.run({ issueNumber: 42 })

Agent Evolution Dashboard

Track agent model changes, performance, and recommendations in real-time.

Access

# Sync agent data
bun run sync:evolution

# Open dashboard
bun run evolution:dashboard
bun run evolution:open
# or visit http://localhost:3001

Dashboard Tabs

| Tab | Description |
|-----|-------------|
| Overview | Stats, recent changes, pending recommendations |
| All Agents | Filterable agent cards with history |
| Timeline | Full evolution history |
| Recommendations | Priority-based model suggestions |
| Model Matrix | Agent × Model mapping with fit scores |

Data Sources

| Source | What it tracks |
|--------|----------------|
| .kilo/agents/*.md | Model, description, capabilities |
| .kilo/kilo.jsonc | Model assignments |
| .kilo/capability-index.yaml | Capability routing |
| Git History | Model and prompt changes |
| Gitea Comments | Performance scores |

Evolution Data Structure

{
  "agents": {
    "lead-developer": {
      "current": { "model": "qwen3-coder:480b", "fit_score": 92 },
      "history": [{ "type": "model_change", "from": "deepseek", "to": "qwen3" }],
      "performance_log": [{ "issue": 42, "score": 8, "success": true }]
    }
  }
}

Recommendations Priority

| Priority | When | Example |
|----------|------|---------|
| Critical | Fit score < 70 | Immediate model change required |
| High | Model unavailable | Switch to fallback |
| Medium | Better model available | Consider upgrade |
| Low | Optimization possible | Optional improvement |
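
The table's ordering can be encoded as a first-match rule chain. This is a sketch: the input flags and names are invented for illustration and are not the dashboard's real API:

```typescript
// Map recommendation conditions to a priority, checking the most
// severe condition first. The input shape and function name are
// illustrative assumptions, not the dashboard's actual data model.
interface AgentModelStatus {
  fitScore: number;
  modelAvailable: boolean;
  betterModelExists: boolean;
}

function recommendationPriority(
  s: AgentModelStatus
): "critical" | "high" | "medium" | "low" {
  if (s.fitScore < 70) return "critical";  // immediate model change required
  if (!s.modelAvailable) return "high";    // switch to fallback
  if (s.betterModelExists) return "medium"; // consider upgrade
  return "low";                            // optional optimization only
}
```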

Code Style

  • Use TypeScript for new files
  • Follow existing patterns
  • Write tests before code (TDD)
  • Keep functions under 50 lines
  • Use early returns
  • No comments unless explicitly requested