APAW/AGENTS.md
3a8aa6b416 docs: update visual testing agent docs, remove test artifacts from git, add pipeline documentation
- Remove baseline screenshots from git tracking (test artifacts, not code)
- Add tests/visual/baseline/ to .gitignore
- Rewrite .kilo/agents/visual-tester.md: Docker-first pipeline, bbox extraction, console/network error detection
- Rewrite .kilo/commands/web-test.md: accurate commands, output format, agent flow
- Update .kilo/capability-index.yaml: add bbox_extraction, console_error_detection, button_overflow_detection to visual-tester
- Update AGENTS.md: add /web-test and /e2e-test commands, update visual-tester description
2026-04-16 22:48:46 +01:00


# Kilo Code Agents Reference

This file configures AI agent behavior for the APAW project: a self-improving code pipeline with Gitea logging.

## Pipeline Workflow

The main workflow is `/pipeline` - use it to process issues through all agents automatically.

```
User: /pipeline 42
Agent: Runs full pipeline for issue #42 with Gitea logging
```

## Commands (Slash Commands)

| Command | Description | Usage |
|---|---|---|
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
| `/plan` | Creates detailed task plans | `/plan feature X` |
| `/ask` | Answers codebase questions | `/ask how does auth work` |
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
| `/code` | Quick code generation | `/code add validation` |
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
| `/evolution report` | Generate evolution report | `/evolution report` |
| `/web-test <url>` | Visual regression testing in Docker | `/web-test https://bbox.wtf` |
| `/e2e-test <url>` | E2E browser automation tests | `/e2e-test https://my-app.com` |

## Pipeline Agents (Subagents)

These agents are invoked automatically by `/pipeline` or manually via @mention:

### Core Development

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @requirement-refiner | Converts ideas to User Stories | glm-5.1 | thinking | history-miner, system-analyst |
| @history-miner | Finds duplicates in git | nemotron-3-super | | (read-only) |
| @system-analyst | Designs specifications | glm-5.1 | thinking | sdet-engineer, orchestrator |
| @sdet-engineer | Writes tests (TDD) | qwen3-coder:480b | thinking | lead-developer, orchestrator |
| @lead-developer | Implements code | qwen3-coder:480b | thinking | code-skeptic, orchestrator |
| @frontend-developer | UI implementation | qwen3-coder:480b | | code-skeptic, orchestrator |
| @backend-developer | Node.js/Express/APIs | qwen3-coder:480b | | code-skeptic, orchestrator |
| @go-developer | Go backend services | qwen3-coder:480b | | code-skeptic, orchestrator |
| @flutter-developer | Flutter mobile apps | qwen3-coder:480b | | code-skeptic, orchestrator |

### Quality Assurance

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @code-skeptic | Adversarial review | minimax-m2.5 | | the-fixer, performance-engineer, orchestrator |
| @the-fixer | Fixes issues | minimax-m2.5 | | code-skeptic, orchestrator |
| @performance-engineer | Performance review | nemotron-3-super | | the-fixer, security-auditor, orchestrator |
| @security-auditor | Security audit | nemotron-3-super | | the-fixer, release-manager, orchestrator |
| @visual-tester | Visual regression + bbox extraction + console/network errors | qwen3-coder:480b | | the-fixer, orchestrator |
| @browser-automation | E2E testing | qwen3-coder:480b | | orchestrator |

### DevOps & Infrastructure

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @devops-engineer | Docker/K8s/CI-CD | nemotron-3-super | | code-skeptic, security-auditor, orchestrator |
| @release-manager | Git operations, releases | glm-5.1 | | evaluator |

### Meta & Process

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @evaluator | Scores effectiveness | glm-5.1 | thinking | prompt-optimizer, product-owner, orchestrator |
| @pipeline-judge | Objective fitness scoring | glm-5.1 | | prompt-optimizer |
| @prompt-optimizer | Improves prompts | glm-5.1 | instant | (edits files) |
| @product-owner | Manages issues/tracking | glm-5.1 | | (read-only) |

### Analysis & Design

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @capability-analyst | Analyzes task coverage | glm-5.1 | | agent-architect, orchestrator |
| @agent-architect | Creates new agents | glm-5.1 | thinking | capability-analyst, requirement-refiner, system-analyst |
| @workflow-architect | Creates workflows | glm-5.1 | thinking | (edits files) |
| @markdown-validator | Validates Markdown | nemotron-3-nano:30b | | orchestrator |

### Cognitive Enhancement

| Agent | Role | Model | Variant | Can Call |
|---|---|---|---|---|
| @planner | Task decomposition | nemotron-3-super | | (read-only) |
| @reflector | Self-reflection | nemotron-3-super | | (read-only) |
| @memory-manager | Memory systems | nemotron-3-super | | (read-only) |

## Workflow State Machine

```
[new]
  ↓ @requirement-refiner
[planned]
  ↓ @capability-analyst → (gaps?) → @agent-architect → create new agents
  ↓ @history-miner
[researching]
  ↓ @system-analyst
[designed]
  ↓ @sdet-engineer (writes failing tests)
[testing]
  ↓ @lead-developer (makes tests pass)
[implementing]
  ↓ @code-skeptic (review)
[reviewing] ──[fail]──→ [fixing] ──→ [reviewing]
  ↓ @review-watcher → (auto-validate) → create fix tasks
  ↓ [pass]
[perf-check]
  ↓ @performance-engineer
[security-check]
  ↓ @security-auditor
[releasing]
  ↓ @release-manager
[evaluated]
  ↓ @evaluator (subjective score 1-10)
  ├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
  └── [score < 7] → @prompt-optimizer → [evaluated]
        ↓
    [@pipeline-judge] ← runs tests, measures tokens/time
        ↓
    fitness score
        ↓
┌─────────────────────┐
│ fitness >= 0.85     │──→ [completed]
│ fitness 0.70-0.84   │──→ @prompt-optimizer → [evolving]
│ fitness < 0.70      │──→ @prompt-optimizer (major) → [evolving]
│ fitness < 0.50      │──→ @agent-architect → redesign
└─────────────────────┘
        ↓
[evolving] → re-run workflow → [@pipeline-judge]
        ↓
    compare fitness_before vs fitness_after
        ↓
    [improved?] → commit prompts → [completed]
              └─ [not improved?] → revert → try different strategy
```
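
The fitness thresholds in the diagram can be sketched as a routing function (the action names here are illustrative, not the real workflow identifiers):

```typescript
// Route a fitness score to the next pipeline action, per the thresholds above.
type FitnessAction =
  | "completed"
  | "optimize-prompts"        // minor prompt tweaks
  | "optimize-prompts-major"  // major prompt rework
  | "redesign-agent";         // hand off to @agent-architect

function routeByFitness(fitness: number): FitnessAction {
  // Check the most severe band last so the ranges stay non-overlapping.
  if (fitness >= 0.85) return "completed";
  if (fitness >= 0.7) return "optimize-prompts";
  if (fitness >= 0.5) return "optimize-prompts-major";
  return "redesign-agent";
}
```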

## Capability Analysis Flow

When starting a complex task:

```
[User Request]
      ↓
[@capability-analyst] ← Analyzes requirements vs existing capabilities
      ↓
[Gap Analysis] ← Identifies missing agents, workflows, skills
      ↓
[Recommendations] → Create new or enhance existing?
      ↓
[Decision]
  ├── [Create New] → [@agent-architect] → Create component → Review
  └── [Enhance] → [@lead-developer] → Modify existing
      ↓
[Integration] ← Verify new component works with system
      ↓
[Complete] ← Task can now be handled
```
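
The Decision step can be sketched as a gap check (a hypothetical heuristic; the 2-gap cutoff is an assumption, not project policy):

```typescript
// Compare required capabilities against what existing agents already cover.
interface GapDecision {
  missing: string[];
  action: "none" | "enhance-existing" | "create-new";
}

function analyzeGaps(required: string[], existing: Set<string>): GapDecision {
  const missing = required.filter((cap) => !existing.has(cap));
  if (missing.length === 0) return { missing, action: "none" };
  // A few gaps: extend an existing agent; many gaps: design a new one.
  return { missing, action: missing.length > 2 ? "create-new" : "enhance-existing" };
}
```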

## Gitea Integration

### Status Labels

Pipeline uses Gitea labels to track progress:

- `status: new` → `status: planned` → `status: researching` → ...
- Agents add/remove labels automatically
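
For illustration, the label convention can be centralized in a helper. The endpoint below is the standard Gitea issue-labels API, but `owner/repo` is a placeholder and the accepted payload can differ across Gitea versions:

```typescript
// Build a "status: <state>" label name, matching the convention above.
function statusLabel(state: string): string {
  return `status: ${state}`;
}

// Replace an issue's labels with the new status (sketch; PUT swaps the full set).
async function setStatus(
  apiUrl: string,
  token: string,
  issue: number,
  state: string,
): Promise<void> {
  await fetch(`${apiUrl}/repos/owner/repo/issues/${issue}/labels`, {
    method: "PUT",
    headers: {
      Authorization: `token ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ labels: [statusLabel(state)] }),
  });
}
```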

### Performance Logging

Each agent logs to Gitea issue comments:

```markdown
## ✅ lead-developer completed

**Score**: 8/10
**Duration**: 1.2h
**Files**: src/auth.ts, src/user.ts

### Notes
- Clean implementation
- Follows existing patterns
- Tests passing
```

### Efficiency Tracking

Scores saved to `.kilo/logs/efficiency_score.json`:

```json
{
  "version": "1.0",
  "history": [
    {
      "issue": 42,
      "date": "2024-01-02T10:00:00Z",
      "agents": {
        "lead-developer": 8,
        "code-skeptic": 7,
        "the-fixer": 9
      },
      "iterations": 2,
      "duration_hours": 1.5
    }
  ]
}
```
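
For illustration, per-agent averages can be computed from this history (the entry shape is assumed from the example above):

```typescript
// One entry in the "history" array of efficiency_score.json.
interface EfficiencyEntry {
  issue: number;
  agents: Record<string, number>;
  iterations: number;
  duration_hours: number;
}

// Average each agent's 1-10 score across all recorded issues.
function averageAgentScores(history: EfficiencyEntry[]): Record<string, number> {
  const sums: Record<string, { total: number; n: number }> = {};
  for (const entry of history) {
    for (const [agent, score] of Object.entries(entry.agents)) {
      const s = (sums[agent] ??= { total: 0, n: 0 });
      s.total += score;
      s.n += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([agent, s]) => [agent, s.total / s.n]),
  );
}
```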

### Fitness Tracking

Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:

```json
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```
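
A sketch of reading the JSONL back (field names taken from the example records; blank lines are skipped):

```typescript
// One line of fitness-history.jsonl.
interface FitnessRecord {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

// Parse JSONL text and return the mean fitness across all records.
function meanFitness(jsonl: string): number {
  const records: FitnessRecord[] = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  if (records.length === 0) return 0;
  return records.reduce((sum, r) => sum + r.fitness, 0) / records.length;
}
```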

## Manual Agent Invocation

Use the Task tool to invoke a subagent:

```
Task tool with:
  subagent_type: "lead-developer"
  prompt: "Implement authentication for issue #42"
```

Or via @mention:

```
@lead-developer implement authentication flow
```

## Environment Variables

Required for Gitea integration:

```sh
GITEA_API_URL=https://git.softuniq.eu/api/v1
GITEA_TOKEN=your-token-here
```
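
A minimal startup check for these variables might look like this (`requireGiteaEnv` is a hypothetical helper, not part of the project API):

```typescript
// Fail fast when the required Gitea settings are missing.
function requireGiteaEnv(
  env: Record<string, string | undefined>,
): { apiUrl: string; token: string } {
  const apiUrl = env.GITEA_API_URL;
  const token = env.GITEA_TOKEN;
  if (!apiUrl || !token) {
    throw new Error("GITEA_API_URL and GITEA_TOKEN must be set");
  }
  return { apiUrl, token };
}

// Typical call site: requireGiteaEnv(process.env)
```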

## Self-Improvement Cycle

  1. Pipeline runs for each issue
  2. Evaluator scores each agent (1-10) - subjective
  3. Pipeline Judge measures fitness objectively (0.0-1.0)
  4. Low fitness (<0.70) triggers prompt-optimizer
  5. Prompt optimizer analyzes failures and improves prompts
  6. Re-run workflow with improved prompts
  7. Compare fitness before/after - commit if improved
  8. Log results to .kilo/logs/fitness-history.jsonl
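
Step 7 can be sketched as a single comparison (the 0.01 improvement margin is an assumption, not a documented threshold):

```typescript
// Keep new prompts only if fitness improved by more than a small margin.
function shouldCommitPrompts(fitnessBefore: number, fitnessAfter: number): boolean {
  return fitnessAfter > fitnessBefore + 0.01;
}
```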

## Evaluator vs Pipeline Judge

| Aspect | Evaluator | Pipeline Judge |
|---|---|---|
| Type | Subjective | Objective |
| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
| Metrics | Observations | Tests, tokens, time |
| Trigger | After workflow | After evaluator |
| Action | Logs to Gitea | Triggers optimization |

## Fitness Score Components

```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate     = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
  efficiency_score   = 1.0 - clamp(normalized_cost, 0, 1)
```
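
The same formula as code, term by term (a sketch mirroring the definition above, not the project's actual implementation):

```typescript
const clamp = (x: number, lo: number, hi: number) => Math.min(Math.max(x, lo), hi);

// fitness = test_pass_rate*0.50 + quality_gates_rate*0.25 + efficiency_score*0.25
function fitnessScore(
  passedTests: number,
  totalTests: number,
  passedGates: number,
  totalGates: number,
  normalizedCost: number,
): number {
  const testPassRate = totalTests > 0 ? passedTests / totalTests : 0;
  const qualityGatesRate = totalGates > 0 ? passedGates / totalGates : 0;
  const efficiencyScore = 1.0 - clamp(normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}
```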

## Architecture Files

| File | Purpose |
|---|---|
| AGENTS.md | This file - main config |
| .kilo/agents/*.md | Agent definitions with prompts |
| .kilo/commands/*.md | Workflow commands |
| .kilo/rules/*.md | Custom rules loaded globally |
| .kilo/skills/ | Skill modules |
| src/kilocode/ | TypeScript API for programmatic use |

## Using the TypeScript API

```typescript
import {
  createPipelineRunner,
  GiteaClient,
  decideRouting
} from './src/kilocode/index.js'

const runner = await createPipelineRunner({
  giteaToken: process.env.GITEA_TOKEN
})

await runner.run({ issueNumber: 42 })
```

## Agent Evolution Dashboard

Track agent model changes, performance, and recommendations in real-time.

### Access

```sh
# Sync agent data
bun run sync:evolution

# Open dashboard
bun run evolution:dashboard
bun run evolution:open
# or visit http://localhost:3001
```

### Dashboard Tabs

| Tab | Description |
|---|---|
| Overview | Stats, recent changes, pending recommendations |
| All Agents | Filterable agent cards with history |
| Timeline | Full evolution history |
| Recommendations | Priority-based model suggestions |
| Model Matrix | Agent × Model mapping with fit scores |

### Data Sources

| Source | What it tracks |
|---|---|
| .kilo/agents/*.md | Model, description, capabilities |
| .kilo/kilo.jsonc | Model assignments |
| .kilo/capability-index.yaml | Capability routing |
| Git History | Model and prompt changes |
| Gitea Comments | Performance scores |

### Evolution Data Structure

```json
{
  "agents": {
    "lead-developer": {
      "current": { "model": "qwen3-coder:480b", "fit_score": 92 },
      "history": [{ "type": "model_change", "from": "deepseek", "to": "qwen3" }],
      "performance_log": [{ "issue": 42, "score": 8, "success": true }]
    }
  }
}
```
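
A sketch of matching TypeScript types, plus a helper that surfaces agents below a fit-score threshold (the default of 70 follows the Critical priority rule; the types are inferred from the example above):

```typescript
interface AgentEvolution {
  current: { model: string; fit_score: number };
  history: { type: string; from: string; to: string }[];
  performance_log: { issue: number; score: number; success: boolean }[];
}

// Names of agents whose current fit score is below the threshold.
function lowFitAgents(
  agents: Record<string, AgentEvolution>,
  threshold = 70,
): string[] {
  return Object.entries(agents)
    .filter(([, agent]) => agent.current.fit_score < threshold)
    .map(([name]) => name);
}
```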

### Recommendations Priority

| Priority | When | Example |
|---|---|---|
| Critical | Fit score < 70 | Immediate model change required |
| High | Model unavailable | Switch to fallback |
| Medium | Better model available | Consider upgrade |
| Low | Optimization possible | Optional improvement |
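
The priority rules above can be sketched as a cascade (the boolean flags are illustrative inputs, not real dashboard fields):

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Evaluate conditions in severity order, mirroring the priority rules.
function recommendPriority(opts: {
  fitScore: number;
  modelAvailable: boolean;
  betterModelExists: boolean;
}): Priority {
  if (opts.fitScore < 70) return "critical";     // immediate model change required
  if (!opts.modelAvailable) return "high";       // switch to fallback
  if (opts.betterModelExists) return "medium";   // consider upgrade
  return "low";                                  // optional improvement
}
```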

## Code Style

- Use TypeScript for new files
- Follow existing patterns
- Write tests before code (TDD)
- Keep functions under 50 lines
- Use early returns
- No comments unless explicitly requested