# Kilo Code Agents Reference

This file configures AI agent behavior for the APAW project - a self-improving code pipeline with Gitea logging.

## Pipeline Workflow

The main workflow is `/pipeline` - use it to process issues through all agents automatically.

```
User: /pipeline 42
Agent: Runs full pipeline for issue #42 with Gitea logging
```

## Commands (Slash Commands)

| Command | Description | Usage |
|---------|-------------|-------|
| `/pipeline <issue>` | Run full agent pipeline for issue | `/pipeline 42` |
| `/status <issue>` | Check pipeline status for issue | `/status 42` |
| `/evolve` | Run evolution cycle with fitness scoring | `/evolve --issue 42` |
| `/evaluate <issue>` | Generate performance report | `/evaluate 42` |
| `/plan` | Creates detailed task plans | `/plan feature X` |
| `/ask` | Answers codebase questions | `/ask how does auth work` |
| `/debug` | Analyzes and fixes bugs | `/debug error in login` |
| `/code` | Quick code generation | `/code add validation` |
| `/research [topic]` | Run research and self-improvement | `/research multi-agent` |
| `/evolution log` | Log agent model change | `/evolution log planner "reason"` |
| `/evolution report` | Generate evolution report | `/evolution report` |
| `/web-test <url>` | Visual regression testing in Docker | `/web-test https://bbox.wtf` |
| `/e2e-test <url>` | E2E browser automation tests | `/e2e-test https://my-app.com` |

## Pipeline Agents (Subagents)

These agents are invoked automatically by `/pipeline` or manually via `@mention`:

### Core Development

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@requirement-refiner` | Converts ideas to User Stories | glm-5.1 | thinking | history-miner, system-analyst |
| `@history-miner` | Finds duplicates in git | nemotron-3-super | — | *(read-only)* |
| `@system-analyst` | Designs specifications | glm-5.1 | thinking | sdet-engineer, orchestrator |
| `@sdet-engineer` | Writes tests (TDD) | qwen3-coder:480b | thinking | lead-developer, orchestrator |
| `@lead-developer` | Implements code | qwen3-coder:480b | thinking | code-skeptic, orchestrator |
| `@frontend-developer` | UI implementation | qwen3-coder:480b | — | code-skeptic, orchestrator |
| `@backend-developer` | Node.js/Express/APIs | qwen3-coder:480b | — | code-skeptic, orchestrator |
| `@go-developer` | Go backend services | qwen3-coder:480b | — | code-skeptic, orchestrator |
| `@flutter-developer` | Flutter mobile apps | qwen3-coder:480b | — | code-skeptic, orchestrator |

### Quality Assurance

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@code-skeptic` | Adversarial review | minimax-m2.5 | — | the-fixer, performance-engineer, orchestrator |
| `@the-fixer` | Fixes issues | minimax-m2.5 | — | code-skeptic, orchestrator |
| `@performance-engineer` | Performance review | nemotron-3-super | — | the-fixer, security-auditor, orchestrator |
| `@security-auditor` | Security audit | nemotron-3-super | — | the-fixer, release-manager, orchestrator |
| `@visual-tester` | Visual regression + bbox extraction + console/network errors | qwen3-coder:480b | — | the-fixer, orchestrator |
| `@browser-automation` | E2E testing | qwen3-coder:480b | — | orchestrator |

### DevOps & Infrastructure

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@devops-engineer` | Docker/K8s/CI-CD | nemotron-3-super | — | code-skeptic, security-auditor, orchestrator |
| `@release-manager` | Git operations, releases | glm-5.1 | — | evaluator |

### Meta & Process

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@evaluator` | Scores effectiveness | glm-5.1 | thinking | prompt-optimizer, product-owner, orchestrator |
| `@pipeline-judge` | Objective fitness scoring | glm-5.1 | — | prompt-optimizer |
| `@prompt-optimizer` | Improves prompts | glm-5.1 | instant | *(edits files)* |
| `@product-owner` | Manages issues/tracking | glm-5.1 | — | *(read-only)* |

### Analysis & Design

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@capability-analyst` | Analyzes task coverage | glm-5.1 | — | agent-architect, orchestrator |
| `@agent-architect` | Creates new agents | glm-5.1 | thinking | capability-analyst, requirement-refiner, system-analyst |
| `@workflow-architect` | Creates workflows | glm-5.1 | thinking | *(edits files)* |
| `@markdown-validator` | Validates Markdown | nemotron-3-nano:30b | — | orchestrator |

### Cognitive Enhancement

| Agent | Role | Model | Variant | Can Call |
|-------|------|-------|---------|----------|
| `@planner` | Task decomposition | nemotron-3-super | — | *(read-only)* |
| `@reflector` | Self-reflection | nemotron-3-super | — | *(read-only)* |
| `@memory-manager` | Memory systems | nemotron-3-super | — | *(read-only)* |

## Workflow State Machine

```
[new]
    ↓ @requirement-refiner
[planned]
    ↓ @capability-analyst → (gaps?) → @agent-architect → create new agents
    ↓ @history-miner
[researching]
    ↓ @system-analyst
[designed]
    ↓ @sdet-engineer (writes failing tests)
[testing]
    ↓ @lead-developer (makes tests pass)
[implementing]
    ↓ @code-skeptic (review)
[reviewing] ──[fail]──→ [fixing] ──→ [reviewing]
    ↓ @review-watcher → (auto-validate) → create fix tasks
    ↓ [pass]
[perf-check]
    ↓ @performance-engineer
[security-check]
    ↓ @security-auditor
[releasing]
    ↓ @release-manager
[evaluated]
    ↓ @evaluator (subjective score 1-10)
    ├── [score ≥ 7] → [@pipeline-judge] → fitness scoring
    └── [score < 7] → @prompt-optimizer → [evaluated]
    ↓
[@pipeline-judge] ← runs tests, measures tokens/time
    ↓
fitness score
    ↓
┌──────────────────────────────────────┐
│ fitness >= 0.85                      │──→ [completed]
│ fitness 0.70-0.84                    │──→ @prompt-optimizer → [evolving]
│ fitness 0.50-0.69                    │──→ @prompt-optimizer (major) → [evolving]
│ fitness < 0.50                       │──→ @agent-architect → redesign
└──────────────────────────────────────┘
    ↓
[evolving] → re-run workflow → [@pipeline-judge]
    ↓
compare fitness_before vs fitness_after
    ↓
[improved?] → commit prompts → [completed]
    └─ [not improved?] → revert → try different strategy
```
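
The fitness thresholds in the diagram can be read as a routing function. The following is a minimal sketch, not the project's implementation: `FitnessAction` and `routeByFitness` are hypothetical names, and it assumes the bands are checked from highest to lowest, so a score below 0.50 goes to `@agent-architect` rather than to a major prompt rewrite.

```typescript
// Hypothetical helper: maps a fitness score to the next pipeline action.
// The action names are illustrative labels, not identifiers from the codebase.
type FitnessAction =
  | "completed"
  | "prompt-optimizer"
  | "prompt-optimizer-major"
  | "agent-architect";

function routeByFitness(fitness: number): FitnessAction {
  if (!Number.isFinite(fitness) || fitness < 0 || fitness > 1) {
    throw new RangeError(`fitness must be within [0, 1], got ${fitness}`);
  }
  // Bands are checked from highest to lowest so each score matches exactly one.
  if (fitness >= 0.85) return "completed";
  if (fitness >= 0.7) return "prompt-optimizer";
  if (fitness >= 0.5) return "prompt-optimizer-major";
  return "agent-architect";
}
```

With the sample records shown under Fitness Tracking, a fitness of 0.82 would route to the prompt optimizer and 0.91 would complete.
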
## Capability Analysis Flow

When starting a complex task:

```
[User Request]
    ↓
[@capability-analyst] ← Analyzes requirements vs existing capabilities
    ↓
[Gap Analysis] ← Identifies missing agents, workflows, skills
    ↓
[Recommendations] → Create new or enhance existing?
    ↓
[Decision]
    ├── [Create New] → [@agent-architect] → Create component → Review
    └── [Enhance] → [@lead-developer] → Modify existing
    ↓
[Integration] ← Verify new component works with system
    ↓
[Complete] ← Task can now be handled
```

## Gitea Integration

### Status Labels

Pipeline uses Gitea labels to track progress:
- `status: new` → `status: planned` → `status: researching` → ...
- Agents add/remove labels automatically

### Performance Logging

Each agent logs to Gitea issue comments:
```markdown
## ✅ lead-developer completed

**Score**: 8/10
**Duration**: 1.2h
**Files**: src/auth.ts, src/user.ts

### Notes
- Clean implementation
- Follows existing patterns
- Tests passing
```
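
A comment in this format could be rendered from structured data. The sketch below is an assumption about how that might look: `AgentReport` and `renderAgentComment` are hypothetical names that do not come from the codebase, and the field names are chosen only to mirror the example comment.

```typescript
// Hypothetical shape for one agent's result; field names are illustrative.
interface AgentReport {
  agent: string;
  score: number; // 1-10, as in the example comment
  durationHours: number;
  files: string[];
  notes: string[];
}

// Builds the Markdown body of the issue comment shown above.
function renderAgentComment(report: AgentReport): string {
  const lines: string[] = [
    `## ✅ ${report.agent} completed`,
    "",
    `**Score**: ${report.score}/10`,
    `**Duration**: ${report.durationHours}h`,
    `**Files**: ${report.files.join(", ")}`,
    "",
    "### Notes",
    ...report.notes.map((note) => `- ${note}`),
  ];
  return lines.join("\n");
}
```

Keeping the rendering in one function means every agent produces the same headings, which makes the comments easier to parse back out later.
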
### Efficiency Tracking

Scores saved to `.kilo/logs/efficiency_score.json`:
```json
{
  "version": "1.0",
  "history": [
    {
      "issue": 42,
      "date": "2024-01-02T10:00:00Z",
      "agents": {
        "lead-developer": 8,
        "code-skeptic": 7,
        "the-fixer": 9
      },
      "iterations": 2,
      "duration_hours": 1.5
    }
  ]
}
```
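
To make the structure concrete, here is a small sketch that types the file and averages each agent's score across the history. The interfaces are inferred from the JSON example only, and `averageAgentScores` is a hypothetical helper, not part of the project's API.

```typescript
// Types inferred from the example JSON; the real file may carry more fields.
interface EfficiencyEntry {
  issue: number;
  date: string;
  agents: Record<string, number>;
  iterations: number;
  duration_hours: number;
}

interface EfficiencyLog {
  version: string;
  history: EfficiencyEntry[];
}

// Hypothetical helper: mean score per agent over all history entries.
function averageAgentScores(log: EfficiencyLog): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const entry of log.history) {
    for (const agent of Object.keys(entry.agents)) {
      sums[agent] = (sums[agent] ?? 0) + (entry.agents[agent] ?? 0);
      counts[agent] = (counts[agent] ?? 0) + 1;
    }
  }
  const averages: Record<string, number> = {};
  for (const agent of Object.keys(sums)) {
    averages[agent] = (sums[agent] ?? 0) / (counts[agent] ?? 1);
  }
  return averages;
}
```
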
### Fitness Tracking

Fitness scores saved to `.kilo/logs/fitness-history.jsonl`:
```jsonl
{"ts":"2026-04-06T00:00:00Z","issue":42,"workflow":"feature","fitness":0.82,"tokens":38400,"time_ms":245000,"tests_passed":45,"tests_total":47}
{"ts":"2026-04-06T01:30:00Z","issue":43,"workflow":"bugfix","fitness":0.91,"tokens":12000,"time_ms":85000,"tests_passed":47,"tests_total":47}
```
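
Because the history is JSON Lines (one JSON object per line), it can be parsed line by line. The sketch below assumes the field set shown in the two sample records; `FitnessRecord`, `parseFitnessHistory`, and `meanFitness` are hypothetical names introduced for illustration.

```typescript
// Fields taken from the sample records; treat this as an assumed schema.
interface FitnessRecord {
  ts: string;
  issue: number;
  workflow: string;
  fitness: number;
  tokens: number;
  time_ms: number;
  tests_passed: number;
  tests_total: number;
}

// Parses JSONL text, skipping blank lines (for example a trailing newline).
function parseFitnessHistory(jsonl: string): FitnessRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as FitnessRecord);
}

// Mean fitness over the parsed records; NaN when there are none.
function meanFitness(records: FitnessRecord[]): number {
  if (records.length === 0) return NaN;
  const total = records.reduce((sum, record) => sum + record.fitness, 0);
  return total / records.length;
}
```

The parser takes the file contents as a string, so reading `.kilo/logs/fitness-history.jsonl` from disk is left to the caller.
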
## Manual Agent Invocation

```typescript
// Use Task tool to invoke subagent
Task tool with:
  subagent_type: "lead-developer"
  prompt: "Implement authentication for issue #42"
```

Or via `@mention`:
```
@lead-developer implement authentication flow
```

## Environment Variables

Required for Gitea integration:
```bash
GITEA_API_URL=https://git.softuniq.eu/api/v1
GITEA_TOKEN=your-token-here
```
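
A startup check can fail fast when either variable is missing. The following is a sketch under assumptions: `GiteaConfig`, `loadGiteaConfig`, and `authHeaders` are hypothetical names, and the `Authorization: token ...` header follows Gitea's documented token authentication rather than anything confirmed in this repository.

```typescript
// Hypothetical config holder for the two variables listed above.
interface GiteaConfig {
  apiUrl: string;
  token: string;
}

// Reads the variables from an env-like map and throws if any are missing.
function loadGiteaConfig(env: Record<string, string | undefined>): GiteaConfig {
  const apiUrl = env["GITEA_API_URL"];
  const token = env["GITEA_TOKEN"];
  const missing: string[] = [];
  if (!apiUrl) missing.push("GITEA_API_URL");
  if (!token) missing.push("GITEA_TOKEN");
  if (!apiUrl || !token) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  // Strip trailing slashes so paths can be appended as `${apiUrl}/repos/...`.
  return { apiUrl: apiUrl.replace(/\/+$/, ""), token };
}

// Header for authenticated Gitea API requests (assumed token scheme).
function authHeaders(config: GiteaConfig): Record<string, string> {
  return { Authorization: `token ${config.token}` };
}
```

In a Node or Bun process this would typically be called as `loadGiteaConfig(process.env)`.
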
## Self-Improvement Cycle

1. **Pipeline runs** for each issue
2. **Evaluator scores** each agent (1-10) - subjective
3. **Pipeline Judge measures** fitness objectively (0.0-1.0)
4. **Fitness below 0.85** triggers prompt-optimizer (major rewrite below 0.70)
5. **Prompt optimizer** analyzes failures and improves prompts
6. **Re-run workflow** with improved prompts
7. **Compare fitness** before/after - commit if improved
8. **Log results** to `.kilo/logs/fitness-history.jsonl`

### Evaluator vs Pipeline Judge

| Aspect | Evaluator | Pipeline Judge |
|--------|-----------|----------------|
| Type | Subjective | Objective |
| Score | 1-10 (opinion) | 0.0-1.0 (metrics) |
| Metrics | Observations | Tests, tokens, time |
| Trigger | After workflow | After evaluator |
| Action | Logs to Gitea | Triggers optimization |

### Fitness Score Components

```
fitness = (test_pass_rate × 0.50) + (quality_gates_rate × 0.25) + (efficiency_score × 0.25)

where:
  test_pass_rate = passed_tests / total_tests
  quality_gates_rate = passed_gates / total_gates (build, lint, types, tests, coverage)
  efficiency_score = 1.0 - clamp(normalized_cost, 0, 1)
```
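
The formula translates directly into code. This is a minimal sketch, assuming `normalized_cost` is supplied by the caller (the document does not say how it is derived) and that a zero denominator yields a rate of 0; `FitnessInputs` and `computeFitness` are hypothetical names.

```typescript
// Inputs to the fitness formula; normalizedCost is assumed to be precomputed.
interface FitnessInputs {
  passedTests: number;
  totalTests: number;
  passedGates: number;
  totalGates: number;
  normalizedCost: number;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Weighted sum: 50% test pass rate, 25% quality gates, 25% efficiency.
function computeFitness(input: FitnessInputs): number {
  // Assumption: an empty test suite or gate list contributes a rate of 0.
  const testPassRate = input.totalTests > 0 ? input.passedTests / input.totalTests : 0;
  const qualityGatesRate = input.totalGates > 0 ? input.passedGates / input.totalGates : 0;
  const efficiencyScore = 1.0 - clamp(input.normalizedCost, 0, 1);
  return testPassRate * 0.5 + qualityGatesRate * 0.25 + efficiencyScore * 0.25;
}
```

For example, 8 of 10 tests passing, 4 of 5 gates passing, and a normalized cost of 0.2 gives 0.4 + 0.2 + 0.2, or roughly 0.8.
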
## Architecture Files

| File | Purpose |
|------|---------|
| `AGENTS.md` | This file - main config |
| `.kilo/agents/*.md` | Agent definitions with prompts |
| `.kilo/commands/*.md` | Workflow commands |
| `.kilo/rules/*.md` | Custom rules loaded globally |
| `.kilo/skills/` | Skill modules |
| `src/kilocode/` | TypeScript API for programmatic use |

## Using the TypeScript API

```typescript
// createPipelineRunner is assumed to be exported alongside PipelineRunner
import {
  createPipelineRunner,
  PipelineRunner,
  GiteaClient,
  decideRouting
} from './src/kilocode/index.js'

const runner = await createPipelineRunner({
  giteaToken: process.env.GITEA_TOKEN
})

await runner.run({ issueNumber: 42 })
```

## Agent Evolution Dashboard

Track agent model changes, performance, and recommendations in real-time.

### Access

```bash
# Sync agent data
bun run sync:evolution

# Open dashboard
bun run evolution:dashboard
bun run evolution:open
# or visit http://localhost:3001
```

### Dashboard Tabs

| Tab | Description |
|-----|-------------|
| **Overview** | Stats, recent changes, pending recommendations |
| **All Agents** | Filterable agent cards with history |
| **Timeline** | Full evolution history |
| **Recommendations** | Priority-based model suggestions |
| **Model Matrix** | Agent × Model mapping with fit scores |

### Data Sources

| Source | What it tracks |
|--------|----------------|
| `.kilo/agents/*.md` | Model, description, capabilities |
| `.kilo/kilo.jsonc` | Model assignments |
| `.kilo/capability-index.yaml` | Capability routing |
| Git History | Model and prompt changes |
| Gitea Comments | Performance scores |

### Evolution Data Structure

```json
{
  "agents": {
    "lead-developer": {
      "current": { "model": "qwen3-coder:480b", "fit_score": 92 },
      "history": [{ "type": "model_change", "from": "deepseek", "to": "qwen3" }],
      "performance_log": [{ "issue": 42, "score": 8, "success": true }]
    }
  }
}
```
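
For programmatic use, the structure above could be described with types like the ones below. They are inferred from the single JSON example, so optional or additional fields are unknown, and `successRate` is a hypothetical helper added only to show how the performance log might be consumed.

```typescript
// Types inferred from the example; the real data may contain more fields.
interface AgentEvolution {
  current: { model: string; fit_score: number };
  history: { type: string; from: string; to: string }[];
  performance_log: { issue: number; score: number; success: boolean }[];
}

interface EvolutionData {
  agents: Record<string, AgentEvolution>;
}

// Hypothetical helper: share of logged runs marked successful, or null if none.
function successRate(agent: AgentEvolution): number | null {
  const log = agent.performance_log;
  if (log.length === 0) return null;
  return log.filter((entry) => entry.success).length / log.length;
}
```
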
### Recommendations Priority

| Priority | When | Example |
|----------|------|---------|
| **Critical** | Fit score < 70 | Immediate model change required |
| **High** | Model unavailable | Switch to fallback |
| **Medium** | Better model available | Consider upgrade |
| **Low** | Optimization possible | Optional improvement |

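
The priority table can be expressed as a small decision function. This is a sketch under assumptions: the table does not say which rule wins when several apply, so the code checks them from most to least severe, and `AgentSignals` and `recommendPriority` are hypothetical names.

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Hypothetical inputs, one per "When" column in the table.
interface AgentSignals {
  fitScore: number; // 0-100 scale, as in the dashboard's fit_score
  modelAvailable: boolean;
  betterModelAvailable: boolean;
  optimizationPossible: boolean;
}

// Returns the highest applicable priority, or null if nothing applies.
function recommendPriority(signals: AgentSignals): Priority | null {
  if (signals.fitScore < 70) return "critical";
  if (!signals.modelAvailable) return "high";
  if (signals.betterModelAvailable) return "medium";
  if (signals.optimizationPossible) return "low";
  return null;
}
```
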
## Code Style

- Use TypeScript for new files
- Follow existing patterns
- Write tests before code (TDD)
- Keep functions under 50 lines
- Use early returns
- No comments unless explicitly requested