- real-fit-engine.py: refactored to support --from-report, improved Ollama v1/chat/completions compatibility, agent name normalization - run-focused-eval.py: run evaluations for specific agent/model pairs from CLI - test_ollama_minimal.py/test_real_api.py: Ollama API connectivity tests - real-fit-architecture.md: architecture overview document - tests/scripts/: E2E landing test, analytics capture, evolution heatmap verification - Remove real-fit-recalc.py (superseded by --from-report flag)
2.5 KiB
2.5 KiB
Real-Fit Analysis System Architecture
Problem
Current fit_score is just model_benchmarks.if_score — generic benchmark, NOT evaluated per-role. workflow-cross-checker gets 92 simply because kimi-k2.6 has IF=91, not because anyone tested if kimi is actually good at cross-checking workflows.
Solution: End-to-End Real Evaluation Pipeline
Phase 1: Test Prompt Generation
For each agent, extract role description + capabilities from .kilo/agents/{name}.md frontmatter + body rules.
Generate 3 representative tasks that exercise agent's actual responsibilities.
Phase 2: Multi-Model Execution
Run each task through N top models (kimi, deepseek, glm, qwen, etc.) via Ollama API. Collect responses + latency + token count.
Phase 3: Role-Aware Evaluation
Judge each response against role-specific criteria:
code-skeptic: Did it find the bug? Depth of analysis? Actionable fixes?workflow-cross-checker: Did it ask uncomfortable questions? Covered all gates?lead-developer: Working code? Tests pass? Clean structure?
Using rubric-based scoring + model-as-judge (one model evaluates another).
Phase 4: Aggregation & Storage
Store per-agent-per-model scores with:
- Overall fit_score (0-100)
- Dimension scores: accuracy, completeness, relevance, role-adherence
- Explanation text: "Model X scored 87 because it correctly identified the race condition but missed the SQL injection (see response #3)"
- Raw responses for drill-down
Phase 5: Dashboard Integration
- Heatmap cell = real fit_score per agent per model
- Click cell → Analysis tab shows: score breakdown + explanation + raw response snippets
- "Why this score?" panel
Data Schema
{
"agent": "workflow-cross-checker",
"model": "ollama-cloud/kimi-k2.6",
"fit_score": 87,
"dimensions": {
"accuracy": 90,
"completeness": 85,
"role_adherence": 92,
"actionability": 80
},
"explanation": "Strong at asking uncomfortable questions (gate protocol covered). Weak at suggesting concrete recovery actions.",
"tests": [
{
"task_id": "wf-check-001",
"prompt": "...",
"response": "...",
"scores": {"accuracy": 90, "completeness": 85},
"judge_notes": "..."
}
],
"timestamp": "2026-05-27T18:00:00Z"
}
Next Steps
- Build prompt generator (read .kilo/agents/*.md → extract role → generate tasks)
- Build batch runner (call Ollama API for each agent×model×task)
- Build evaluator (rubric scoring + judge model)
- Build storage (JSON DB with drill-down)
- Build dashboard tab (Analysis with cell drill-down)