feat(scripts): add real-fit evaluation engine and supporting test scripts

- real-fit-engine.py: refactored to support --from-report, improved Ollama v1/chat/completions compatibility, agent name normalization - run-focused-eval.py: run evaluations for specific agent/model pairs from CLI - test_ollama_minimal.py/test_real_api.py: Ollama API connectivity tests - real-fit-architecture.md: architecture overview document - tests/scripts/: E2E landing test, analytics capture, evolution heatmap verification - Remove real-fit-recalc.py (superseded by --from-report flag)
2026-05-28 11:57:46 +01:00
parent a0e7bd99fb
commit 4071551476
10 changed files with 1219 additions and 312 deletions
--- a/agent-evolution/data/real-fit.html
+++ b/agent-evolution/data/real-fit.html
@@ -0,0 +1,93 @@
+<!DOCTYPE html>
+<html lang="ru">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Real-Fit Matrix — Agent × Model Performance</title>
+<style>
+:root{--bg:#0a0f1a;--bg2:#0f1525;--bg3:#141c2e;--bdr:#1e2d45;--txt:#e8f1ff;--txt2:#8ba3c0;--cyan:#00d4ff;--green:#00ff94;--red:#ff4757;--orange:#ff9f43;--purple:#a855f7;}
+*{margin:0;padding:0;box-sizing:border-box}
+body{font-family:system-ui,-apple-system,sans-serif;background:var(--bg);color:var(--txt);min-height:100vh;padding:24px}
+h1{font-size:1.6rem;background:linear-gradient(90deg,var(--cyan),var(--green));-webkit-background-clip:text;-webkit-text-fill-color:transparent;margin-bottom:8px}
+.sub{color:var(--txt2);font-size:.85rem;margin-bottom:20px}
+table{width:100%;border-collapse:collapse;font-size:.82rem}
+th,td{padding:8px 10px;border:1px solid var(--bdr);text-align:center}
+th{background:var(--bg2);color:var(--txt2);font-size:.72rem;text-transform:uppercase;letter-spacing:.5px;position:sticky;top:0}
+td:first-child{text-align:left;font-weight:700;white-space:nowrap}
+td.score{font-weight:700;font-family:monospace}
+.hm-cur{box-shadow:inset 0 0 0 2px var(--cyan)}
+.high{background:rgba(0,255,148,.18);color:var(--green)}
+.good{background:rgba(0,212,255,.14);color:var(--cyan)}
+.med{background:rgba(168,85,247,.15);color:var(--purple)}
+.low{background:rgba(255,71,87,.1);color:var(--red)}
+.na{background:transparent;color:var(--txt2);font-size:.9rem}
+.legend{display:flex;gap:12px;flex-wrap:wrap;margin-top:16px;font-size:.78rem;color:var(--txt2)}
+.legend span{display:flex;align-items:center;gap:4px}
+.dot{width:14px;height:14px;border-radius:3px}
+.meta{font-size:.72rem;color:var(--txt2);margin-top:12px}
+a{color:var(--cyan);text-decoration:none}
+</style>
+</head>
+<body>
+<h1>Real-Fit Matrix</h1>
+<div class="sub">Real agent × model evaluation scores via live Ollama API (28 calls, 4 models, 7 agents)</div>
+
+<div id="matrix"></div>
+<div class="legend">
+  <span><span class="dot high"></span> 90+ Excellent</span>
+  <span><span class="dot good"></span> 75–89 Good</span>
+  <span><span class="dot med"></span> 50–74 Average</span>
+  <span><span class="dot low"></span> &lt;50 Weak</span>
+  <span style="margin-left:auto">● = assigned model</span>
+</div>
+<div class="meta">Data source: <a href="data/real-fit-report.json" target="_blank">real-fit-report.json</a> | Updated: <span id="updated"></span></div>
+
+<script>
+async function load() {
+  const res = await fetch('data/real-fit-report.json');
+  const data = await res.json();
+  document.getElementById('updated').textContent = new Date(data.generated).toLocaleString('ru-RU');
+
+  // Extract focused agents (those with >0 evaluations on >1 model)
+  const agents = Object.values(data.agents).filter(a => {
+    const evs = Object.values(a.evaluations);
+    return evs.length > 0 && evs.some(s => s > 0);
+  });
+
+  // Get all models from any agent
+  const models = new Set();
+  agents.forEach(a => Object.keys(a.evaluations).forEach(m => models.add(m)));
+  const modelList = Array.from(models).sort();
+
+  // Build table
+  let html = '<table><thead><tr><th>Agent</th>';
+  modelList.forEach(m => html += `<th>${m}</th>`);
+  html += '<th>Best</th><th>Score</th></tr></thead><tbody>';
+
+  agents.forEach(a => {
+    html += `<tr><td>${a.name}</td>`;
+    modelList.forEach(m => {
+      const score = a.evaluations[m];
+      const isCur = a.info && a.info[2] && a.info[2].includes(m);
+      let cls = 'na';
+      let text = '—';
+      if (score !== undefined && score > 0) {
+        if (score >= 90) cls = 'score high';
+        else if (score >= 75) cls = 'score good';
+        else if (score >= 50) cls = 'score med';
+        else cls = 'score low';
+        text = Math.round(score);
+      }
+      const curCls = isCur ? ' hm-cur' : '';
+      html += `<td class="${cls}${curCls}">${text}${isCur ? ' ●' : ''}</td>`;
+    });
+    html += `<td>${a.best_model}</td><td style="font-weight:700">${Math.round(a.best_score)}</td></tr>`;
+  });
+
+  html += '</tbody></table>';
+  document.getElementById('matrix').innerHTML = html;
+}
+load().catch(e => document.getElementById('matrix').innerHTML = 'Error: ' + e);
+</script>
+</body>
+</html>
--- a/agent-evolution/docs/real-fit-architecture.md
+++ b/agent-evolution/docs/real-fit-architecture.md
@@ -0,0 +1,68 @@
+# Real-Fit Analysis System Architecture
+
+## Problem
+Current `fit_score` is just `model_benchmarks.if_score` — generic benchmark, NOT evaluated per-role. `workflow-cross-checker` gets 92 simply because `kimi-k2.6` has IF=91, not because anyone tested if kimi is actually good at cross-checking workflows.
+
+## Solution: End-to-End Real Evaluation Pipeline
+
+### Phase 1: Test Prompt Generation
+For each agent, extract role description + capabilities from `.kilo/agents/{name}.md` frontmatter + body rules.
+Generate 3 representative tasks that exercise agent's actual responsibilities.
+
+### Phase 2: Multi-Model Execution
+Run each task through N top models (kimi, deepseek, glm, qwen, etc.) via Ollama API.
+Collect responses + latency + token count.
+
+### Phase 3: Role-Aware Evaluation
+Judge each response against role-specific criteria:
+- `code-skeptic`: Did it find the bug? Depth of analysis? Actionable fixes?
+- `workflow-cross-checker`: Did it ask uncomfortable questions? Covered all gates?
+- `lead-developer`: Working code? Tests pass? Clean structure?
+
+Using rubric-based scoring + model-as-judge (one model evaluates another).
+
+### Phase 4: Aggregation & Storage
+Store per-agent-per-model scores with:
+- Overall fit_score (0-100)
+- Dimension scores: accuracy, completeness, relevance, role-adherence
+- Explanation text: "Model X scored 87 because it correctly identified the race condition but missed the SQL injection (see response #3)"
+- Raw responses for drill-down
+
+### Phase 5: Dashboard Integration
+- Heatmap cell = real fit_score per agent per model
+- Click cell → Analysis tab shows: score breakdown + explanation + raw response snippets
+- "Why this score?" panel
+
+## Data Schema
+
+```json
+{
+  "agent": "workflow-cross-checker",
+  "model": "ollama-cloud/kimi-k2.6",
+  "fit_score": 87,
+  "dimensions": {
+    "accuracy": 90,
+    "completeness": 85,
+    "role_adherence": 92,
+    "actionability": 80
+  },
+  "explanation": "Strong at asking uncomfortable questions (gate protocol covered). Weak at suggesting concrete recovery actions.",
+  "tests": [
+    {
+      "task_id": "wf-check-001",
+      "prompt": "...",
+      "response": "...",
+      "scores": {"accuracy": 90, "completeness": 85},
+      "judge_notes": "..."
+    }
+  ],
+  "timestamp": "2026-05-27T18:00:00Z"
+}
+```
+
+## Next Steps
+1. Build prompt generator (read .kilo/agents/*.md → extract role → generate tasks)
+2. Build batch runner (call Ollama API for each agent×model×task)
+3. Build evaluator (rubric scoring + judge model)
+4. Build storage (JSON DB with drill-down)
+5. Build dashboard tab (Analysis with cell drill-down)