A/B Benchmark: qwen3.5-122b vs glm-5.1 for evaluator #116

NW · 2026-05-25T14:08:23Z

Context

The evaluator model was switched from glm-5.1 (IF=90) to qwen3.5-122b (IF=92, 12.4M pulls).

Task

Run 10 evaluation cycles with each model on the same completed pipeline issue.

Expected

Roughly 4% higher score for qwen3.5-122b relative to glm-5.1, with lower instruction drift.

Refs: agent-evolution/data/model-research-2026-05-24.md
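
One way to check the expected ~4% improvement once the cycles have run is to compare mean scores per model. The sketch below is only an illustration: `compare_runs` and its field names are hypothetical, not part of the APAW pipeline, and it assumes each cycle yields a single positive numeric score and that "improvement" means the relative change in mean score.

```python
from statistics import mean, stdev


def compare_runs(baseline_scores, candidate_scores):
    """Summarize an A/B comparison of per-cycle evaluator scores.

    baseline_scores: scores from the old evaluator (here glm-5.1).
    candidate_scores: scores from the new evaluator (here qwen3.5-122b).
    Both are assumed to be lists of positive numbers, one per cycle.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both models need at least one scored cycle")

    base_mean = mean(baseline_scores)
    cand_mean = mean(candidate_scores)
    if base_mean == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")

    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        # Relative change of the candidate mean against the baseline mean.
        "relative_improvement_pct": (cand_mean - base_mean) / base_mean * 100.0,
        # Sample standard deviation needs at least two data points.
        "baseline_stdev": stdev(baseline_scores) if len(baseline_scores) > 1 else 0.0,
        "candidate_stdev": stdev(candidate_scores) if len(candidate_scores) > 1 else 0.0,
    }
```

Reporting the per-model standard deviation next to the means helps judge whether a difference of a few percent over 10 cycles is larger than the run-to-run noise.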

NW added this to the [Evolution] APAW Model Optimization May 2026 milestone 2026-05-25 14:08:23 +00:00

Reference: UniqueSoft/APAW#116