Files

Deploy Bot 4071551476 feat(scripts): add real-fit evaluation engine and supporting test scripts

- real-fit-engine.py: refactored to support --from-report, improved Ollama v1/chat/completions compatibility, agent name normalization
- run-focused-eval.py: run evaluations for specific agent/model pairs from CLI
- test_ollama_minimal.py/test_real_api.py: Ollama API connectivity tests
- real-fit-architecture.md: architecture overview document
- tests/scripts/: E2E landing test, analytics capture, evolution heatmap verification
- Remove real-fit-recalc.py (superseded by --from-report flag)

2026-05-28 11:57:46 +01:00

2.5 KiB

Raw Blame History

Real-Fit Analysis System Architecture

Problem

Current fit_score is just model_benchmarks.if_score — generic benchmark, NOT evaluated per-role. workflow-cross-checker gets 92 simply because kimi-k2.6 has IF=91, not because anyone tested if kimi is actually good at cross-checking workflows.

Solution: End-to-End Real Evaluation Pipeline

Phase 1: Test Prompt Generation

For each agent, extract role description + capabilities from .kilo/agents/{name}.md frontmatter + body rules. Generate 3 representative tasks that exercise agent's actual responsibilities.

Phase 2: Multi-Model Execution

Run each task through N top models (kimi, deepseek, glm, qwen, etc.) via Ollama API. Collect responses + latency + token count.

Phase 3: Role-Aware Evaluation

Judge each response against role-specific criteria:

code-skeptic: Did it find the bug? Depth of analysis? Actionable fixes?
workflow-cross-checker: Did it ask uncomfortable questions? Covered all gates?
lead-developer: Working code? Tests pass? Clean structure?

Using rubric-based scoring + model-as-judge (one model evaluates another).

Phase 4: Aggregation & Storage

Store per-agent-per-model scores with:

Overall fit_score (0-100)
Dimension scores: accuracy, completeness, relevance, role-adherence
Explanation text: "Model X scored 87 because it correctly identified the race condition but missed the SQL injection (see response #3)"
Raw responses for drill-down

Phase 5: Dashboard Integration

Heatmap cell = real fit_score per agent per model
Click cell → Analysis tab shows: score breakdown + explanation + raw response snippets
"Why this score?" panel

Data Schema

{
  "agent": "workflow-cross-checker",
  "model": "ollama-cloud/kimi-k2.6",
  "fit_score": 87,
  "dimensions": {
    "accuracy": 90,
    "completeness": 85,
    "role_adherence": 92,
    "actionability": 80
  },
  "explanation": "Strong at asking uncomfortable questions (gate protocol covered). Weak at suggesting concrete recovery actions.",
  "tests": [
    {
      "task_id": "wf-check-001",
      "prompt": "...",
      "response": "...",
      "scores": {"accuracy": 90, "completeness": 85},
      "judge_notes": "..."
    }
  ],
  "timestamp": "2026-05-27T18:00:00Z"
}

Next Steps

Build prompt generator (read .kilo/agents/*.md → extract role → generate tasks)
Build batch runner (call Ollama API for each agent×model×task)
Build evaluator (rubric scoring + judge model)
Build storage (JSON DB with drill-down)
Build dashboard tab (Analysis with cell drill-down)

2.5 KiB Raw Blame History Unescape Escape