Design real-fit evaluation pipeline #122

NW · 2026-05-27T17:28:05Z

Problem

The current fit_score in .kilo/kilo-meta.json is just the model's IF (Intelligence Frontier) score, a generic benchmark. It does not measure how well a model performs in a specific agent role (e.g., lead-developer vs. code-skeptic).

This means:

- Agents may be routed to models that are theoretically "smart" but poor at the actual task.
- We have no empirical per-role evaluation data.
- The evolution dashboard shows static numbers, not real-world fitness.

Proposed Architecture

.kilo/agents/*.md frontmatter
               |
               v
┌──────────────────────────────┐
│  1. Prompt Generator         │  ← Extract task types from agent descriptions
│     → Generate test prompts  │    per agent role
└──────────────┬───────────────┘
               |
               v
┌──────────────────────────────┐
│  2. Batch Runner (Ollama)    │  ← Execute each prompt on N models
│     → Multi-model × agent    │
└──────────────┬───────────────┘
               |
               v
┌──────────────────────────────┐
│  3. Rubric Evaluator         │  ← Score outputs per role-specific rubric
│     → Structured scoring     │    (correctness, clarity, safety, etc.)
└──────────────┬───────────────┘
               |
               v
┌──────────────────────────────┐
│  4. Judge-Model Evaluator    │  ← Cross-model preference ranking
│     → Consensus score        │
└──────────────┬───────────────┘
               |
               v
┌──────────────────────────────┐
│  5. Persistence              │  ← real-fit-results.json
│     → Append results per run │
└──────────────┬───────────────┘
               |
               v
┌──────────────────────────────┐
│  6. Dashboard                │  ← Analysis tab with heatmap
│     → Cell drill-down        │
│     → "Why this score?"      │
└──────────────────────────────┘
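
As a starting point for discussion, stages 1 and 2 of the diagram could look roughly like the sketch below. It rests on several assumptions that are not settled in this issue: that each agent file starts with a flat `key: value` frontmatter block delimited by `---`, that the prompt templates in `generate_prompts` are placeholders (they are invented here for illustration), and that models are served by a local Ollama instance at its default address. The function names `parse_frontmatter`, `generate_prompts`, `run_batch`, and `ollama_generate` are hypothetical. The batch runner takes the generation function as a parameter so it can be exercised without a running server.

```python
import itertools
import json
import re
import urllib.request
from typing import Callable, Dict, List


def parse_frontmatter(markdown_text: str) -> Dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` block (assumed format)."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", markdown_text, re.DOTALL)
    if not match:
        return {}
    fields: Dict[str, str] = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def generate_prompts(role: str, description: str) -> List[str]:
    """Stage 1: build test prompts for one role (templates are placeholders)."""
    preamble = f"You are acting as the '{role}' agent ({description})."
    return [
        f"{preamble} Complete one typical task for this role.",
        f"{preamble} Explain how you would handle an ambiguous request.",
    ]


def ollama_generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Call Ollama's /api/generate endpoint without streaming (assumes a local server)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]


def run_batch(
    agents: Dict[str, str],
    models: List[str],
    generate: Callable[[str, str], str],
) -> List[dict]:
    """Stage 2: run every prompt of every agent role on every model."""
    runs = []
    for (role, description), model in itertools.product(agents.items(), models):
        for prompt in generate_prompts(role, description):
            runs.append(
                {
                    "agent": role,
                    "model": model,
                    "prompt": prompt,
                    "output": generate(model, prompt),
                }
            )
    return runs
```

In real use, `run_batch(agents, models, ollama_generate)` would hit the Ollama server; passing a stub instead keeps the agent × model fan-out testable offline.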

Acceptance Criteria

- [ ] Generate test prompts from .kilo/agents/*.md frontmatter
- [ ] Build batch runner (Ollama API × agents × models)
- [ ] Build rubric-based evaluator per agent role
- [ ] Build judge-model evaluator (cross-model scoring)
- [ ] Persist results in real-fit-results.json
- [ ] Dashboard Analysis tab with cell drill-down
- [ ] Click heatmap cell → show "why this score" explanation
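
To make the scoring and persistence criteria more concrete, here is a minimal sketch of how a rubric score and a judge consensus might be combined into one record per (agent, model) cell and appended to `real-fit-results.json`. Everything about the shape is an assumption for illustration: the rubric weights, the equal blend of rubric and judge scores, the use of a plain mean as "consensus", the record fields, and the file layout (a JSON list with one entry per run) are not specified in this issue. The names `rubric_score`, `consensus_score`, `build_result`, and `append_results` are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Dict, List

# Hypothetical weights; a real rubric would be defined per agent role.
RUBRIC_WEIGHTS = {"correctness": 0.5, "clarity": 0.3, "safety": 0.2}


def rubric_score(criteria: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each expected in [0, 1]."""
    return sum(weight * criteria[name] for name, weight in RUBRIC_WEIGHTS.items())


def consensus_score(judge_scores: List[float]) -> float:
    """Plain mean over judge-model scores (one simple notion of consensus)."""
    return mean(judge_scores)


def build_result(
    agent: str,
    model: str,
    criteria: Dict[str, float],
    judge_scores: List[float],
    judge_weight: float = 0.5,
) -> dict:
    """Combine both evaluators into one heatmap cell with a short explanation."""
    rubric = rubric_score(criteria)
    consensus = consensus_score(judge_scores)
    real_fit = (1 - judge_weight) * rubric + judge_weight * consensus
    weakest = min(criteria, key=criteria.get)
    return {
        "agent": agent,
        "model": model,
        "rubric": round(rubric, 3),
        "consensus": round(consensus, 3),
        "real_fit": round(real_fit, 3),
        # Text the dashboard could show when a heatmap cell is clicked.
        "why": f"rubric {rubric:.2f}, judge consensus {consensus:.2f}; "
               f"weakest criterion: {weakest}",
    }


def append_results(path: Path, run_id: str, results: List[dict]) -> None:
    """Append one run to the results file, creating the file on first use."""
    history = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    history.append({"run_id": run_id, "results": results})
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
```

Keeping the `why` string in the persisted record means the dashboard drill-down would not need to recompute anything; whether the explanation should instead be derived at display time is an open design choice.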

Estimated Effort

~20 hours

Priority

High. This is needed before the next agent evolution cycle, to stop routing agents to models that score well on generic IF but fail on real tasks.

NW added this to the Real-Fit Analysis Engine milestone 2026-05-27 17:28:05 +00:00

NW added the phase::researching priority::high type::feature labels 2026-05-27 17:28:12 +00:00

Reference: UniqueSoft/APAW#122