Real-Fit Analysis Engine

End-to-end evaluation pipeline that measures LLM fitness per agent role, not just generic benchmark scores.

Unlike the current fit_score (a static model IF score), real-fit scores should reflect how well a model performs on actual agent-specific tasks. The pipeline covers:

  • Prompt generation from .kilo/agents/*.md frontmatter
  • Multi-model execution via Ollama API
  • Rubric-based evaluation per agent role
  • Cross-model judge scoring
  • Storage in real-fit-results.json
  • Dashboard with cell drill-down
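
The stages listed above could be wired together roughly as follows. This is only a minimal sketch under stated assumptions, not the planned implementation: the frontmatter keys `name` and `description`, the prompt template, the rubric weights, and the `{agent: {model: score}}` layout of real-fit-results.json are all hypothetical. The request in `run_model` uses Ollama's `/api/generate` endpoint on its default local port, and it needs a running Ollama server.

```python
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs between the leading `---` lines.

    Assumption: agent files use simple, non-nested frontmatter.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_prompt(meta: dict, task: str) -> str:
    """Build a role-specific prompt (hypothetical template and keys)."""
    role = meta.get("name", "agent")
    description = meta.get("description", "")
    return f"You are the {role} agent. {description}\n\nTask: {task}"


def run_model(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def rubric_score(criterion_scores: dict, weights: dict) -> float:
    """Weighted mean of per-criterion judge scores, each assumed to be in 0..1."""
    total = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total


def store_result(path: Path, agent: str, model: str, score: float) -> dict:
    """Merge one (agent, model) score into the results file (assumed schema)."""
    results = json.loads(path.read_text()) if path.exists() else {}
    results.setdefault(agent, {})[model] = round(score, 3)
    path.write_text(json.dumps(results, indent=2))
    return results
```

In a full run, a judge model would be called through `run_model` to produce the per-criterion scores that `rubric_score` aggregates, and the stored `{agent: {model: score}}` mapping would supply the dashboard cells.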

Deliverable: A complete evaluation pipeline integrated into the agent evolution workflow.

No due date
0% Completed