Real-Fit Analysis Engine
End-to-end evaluation pipeline that measures LLM fitness per agent role instead of relying on generic benchmark scores.
Unlike the current fit_score (a static model IF score), real-fit scores should reflect how well a model performs on actual agent-specific tasks. The pipeline covers:
- Prompt generation from .kilo/agents/*.md frontmatter
- Multi-model execution via Ollama API
- Rubric-based evaluation per agent role
- Cross-model judge scoring
- Storage in real-fit-results.json
- Dashboard with cell drill-down
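The stages listed above could be wired together roughly as in the following sketch. It is not the actual implementation: the frontmatter keys (`name`, `description`), the prompt template, the record layout, and all function names are assumptions made for illustration. The only external call is Ollama's `/api/generate` endpoint on its default local port, and "cross-model judge" is interpreted here as averaging the scores given by every model except the one being evaluated.

```python
import json
import urllib.request
from statistics import mean


def parse_frontmatter(text):
    """Read simple `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_prompt(meta, task):
    """Hypothetical template; the real prompt format is not specified."""
    role = meta.get("name", "agent")
    desc = meta.get("description", "")
    return f"You are the {role} agent. {desc}\nTask: {task}"


def run_ollama(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def cross_judge(candidate, judge_scores):
    """Average rubric scores from every judge except the candidate itself."""
    others = [s for judge, s in judge_scores.items() if judge != candidate]
    return mean(others) if others else None


def make_record(agent, model, judge_scores):
    """Assumed shape of one entry in real-fit-results.json (one dashboard cell)."""
    return {
        "agent": agent,
        "model": model,
        "real_fit": cross_judge(model, judge_scores),
        "judges": judge_scores,
    }


def save_results(records, path="real-fit-results.json"):
    """Write all agent/model records to the results file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
```

Keeping the per-judge scores in each record, not only the aggregate, is what would let a dashboard cell drill down from the real-fit score to the individual judgments behind it.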
Deliverable: A complete evaluation pipeline integrated into the agent evolution workflow.
No due date
0% Completed