feat(retry): LLM retry-on-failure for orchestrator — never returns empty response

Problem: when the LLM returned empty content or a network error occurred, the
orchestrator stopped immediately with "(no response)", which the user saw as a
blank reply.

Solution — 4-layer retry system:

## Go Gateway (gateway/internal/orchestrator/orchestrator.go)
- Extracted shared runLoop() used by Chat(), ChatWithEvents(), ChatWithEventsAndRetry()
- Added RetryPolicy struct: MaxLLMRetries (default 3), InitialDelay (2s),
  MaxDelay (30s), RetryOnEmpty (true)
- callLLMWithRetry(): wraps every LLM call with exponential back-off:
  * retries on HTTP/network error
  * retries on empty choices array
  * retries when content=="" AND finish_reason!="tool_calls" (soft empty)
  * strips tools on attempt > 1 (avoids repeated tool-format errors)
  * logs each attempt; total attempts = MaxLLMRetries + 1 (default: 4)
- Added ChatWithEventsAndRetry() with onRetry callback for client visibility
- SetRetryPolicy() for runtime override
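The retry rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code from orchestrator.go: the RetryPolicy fields and defaults come from this message, while `callWithRetry`, `llmResult`, `errEmptyResponse`, `defaultPolicy`, and `noSleep` are hypothetical names, and doubling the delay on each retry is an assumed back-off factor.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryPolicy mirrors the fields named in the commit message.
type RetryPolicy struct {
	MaxLLMRetries int
	InitialDelay  time.Duration
	MaxDelay      time.Duration
	RetryOnEmpty  bool
}

// llmResult is a hypothetical, simplified stand-in for a chat completion.
type llmResult struct {
	Content      string
	FinishReason string
}

// errEmptyResponse is a hypothetical sentinel for the "soft empty" case.
var errEmptyResponse = errors.New("empty LLM response")

// defaultPolicy returns the defaults listed in the commit message.
func defaultPolicy() RetryPolicy {
	return RetryPolicy{MaxLLMRetries: 3, InitialDelay: 2 * time.Second, MaxDelay: 30 * time.Second, RetryOnEmpty: true}
}

// noSleep is a no-op sleeper so demos and tests do not actually wait.
func noSleep(time.Duration) {}

// callWithRetry invokes call up to MaxLLMRetries+1 times. The attempt number
// is 1-based so the caller can, for example, strip tools when attempt > 1.
// Between attempts it waits, doubling the delay (assumed factor) up to MaxDelay.
func callWithRetry(p RetryPolicy, call func(attempt int) (llmResult, error), sleep func(time.Duration)) (llmResult, error) {
	delay := p.InitialDelay
	var lastErr error
	for attempt := 1; attempt <= p.MaxLLMRetries+1; attempt++ {
		res, err := call(attempt)
		switch {
		case err != nil:
			lastErr = err // HTTP/network error
		case p.RetryOnEmpty && res.Content == "" && res.FinishReason != "tool_calls":
			lastErr = errEmptyResponse // soft empty
		default:
			return res, nil
		}
		if attempt <= p.MaxLLMRetries { // no wait after the final attempt
			sleep(delay)
			delay *= 2
			if delay > p.MaxDelay {
				delay = p.MaxDelay
			}
		}
	}
	return llmResult{}, fmt.Errorf("LLM call failed after %d attempts: %w", p.MaxLLMRetries+1, lastErr)
}

func main() {
	// Fake LLM: empty on the first two attempts, then a real answer.
	fake := func(attempt int) (llmResult, error) {
		if attempt < 3 {
			return llmResult{FinishReason: "stop"}, nil
		}
		return llmResult{Content: "hello", FinishReason: "stop"}, nil
	}
	res, err := callWithRetry(defaultPolicy(), fake, noSleep)
	fmt.Println(res.Content, err)
}
```

Passing the sleeper as a parameter is a choice made for this sketch so the loop can be exercised without real delays; the actual implementation may simply call time.Sleep or select on a context.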

## Config (gateway/config/config.go)
- New fields: MaxLLMRetries (GATEWAY_MAX_LLM_RETRIES, default 3)
             RetryDelaySecs (GATEWAY_RETRY_DELAY_SECS, default 2)
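As a rough illustration of how these two settings could be read with their defaults, here is a small sketch. The environment variable names and defaults come from this message; `retryConfig`, `intFrom`, and `loadRetryConfig` are hypothetical helpers, and falling back to the default on an invalid value is an assumption about the desired behavior, not a description of config.go.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// retryConfig is a hypothetical holder for the two new settings.
type retryConfig struct {
	MaxLLMRetries  int
	RetryDelaySecs int
}

// intFrom returns the integer stored under key, or def when the variable is
// unset, empty, not a number, or negative (assumed fallback behavior).
func intFrom(lookup func(string) (string, bool), key string, def int) int {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return def
	}
	return n
}

// loadRetryConfig applies the defaults named in the commit message.
func loadRetryConfig(lookup func(string) (string, bool)) retryConfig {
	return retryConfig{
		MaxLLMRetries:  intFrom(lookup, "GATEWAY_MAX_LLM_RETRIES", 3),
		RetryDelaySecs: intFrom(lookup, "GATEWAY_RETRY_DELAY_SECS", 2),
	}
}

func main() {
	// os.LookupEnv distinguishes "unset" from "set to empty".
	fmt.Printf("%+v\n", loadRetryConfig(os.LookupEnv))
}
```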

## main.go
- Wires retry policy from config into orchestrator

## docker-compose.yml
- GATEWAY_REQUEST_TIMEOUT_SECS: 120 → 300 (accommodates up to 4 attempts, i.e. 3 retries)
- GATEWAY_MAX_LLM_RETRIES=3, GATEWAY_RETRY_DELAY_SECS=2 env vars
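To check the timeout against the retry settings, the worst-case back-off can be worked out as below. This is only a back-of-the-envelope sketch: it assumes the delay doubles on each retry and is capped at MaxDelay (30s), and that the 300s request timeout has to cover all attempts together. `backoffSecs` and `sum` are hypothetical helpers.

```go
package main

import "fmt"

// backoffSecs lists the wait (in seconds) before each retry, assuming the
// delay starts at initialSecs, doubles each time, and is capped at maxSecs.
func backoffSecs(retries, initialSecs, maxSecs int) []int {
	if retries < 0 {
		retries = 0
	}
	delays := make([]int, 0, retries)
	d := initialSecs
	for i := 0; i < retries; i++ {
		if d > maxSecs {
			d = maxSecs
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// sum adds up a slice of ints.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	delays := backoffSecs(3, 2, 30) // defaults from this commit
	total := sum(delays)
	perAttempt := (300 - total) / 4 // seconds left per attempt, 4 attempts
	fmt.Println(delays, total, perAttempt) // prints: [2 4 8] 14 71
}
```

Under these assumptions the waits add up to 14s, leaving roughly 71s for each of the 4 LLM calls within the 300s budget.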

## API (handlers.go)
- StartChatSession goroutine now uses ChatWithEventsAndRetry
- onRetry callback emits "thinking" DB event with content "⟳ Retry N: reason"
  so the client sees retry progress in the console panel
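The retry event format can be illustrated with a short sketch. The "thinking" event type and the "⟳ Retry N: reason" content come from this message; `event`, `formatRetry`, and `isRetryEvent` are hypothetical names, and the in-memory slice stands in for the DB write. `isRetryEvent` mirrors, in Go, the prefix check the frontend is described as applying.

```go
package main

import (
	"fmt"
	"strings"
)

// retryPrefix is the marker clients can use to recognise retry events.
const retryPrefix = "⟳ Retry"

// event is a hypothetical, simplified session event record.
type event struct {
	Type    string
	Content string
}

// formatRetry builds the content string described in the commit message.
func formatRetry(attempt int, reason string) string {
	return fmt.Sprintf("%s %d: %s", retryPrefix, attempt, reason)
}

// isRetryEvent reports whether a "thinking" event carries a retry notice.
func isRetryEvent(eventType, content string) bool {
	return eventType == "thinking" && strings.HasPrefix(content, retryPrefix)
}

func main() {
	var events []event // stand-in for the DB event table

	// onRetry has the shape of a callback invoked before each retry.
	onRetry := func(attempt int, reason string) {
		events = append(events, event{Type: "thinking", Content: formatRetry(attempt, reason)})
	}

	onRetry(1, "empty response")
	onRetry(2, "network error")

	for _, e := range events {
		fmt.Println(e.Content, isRetryEvent(e.Type, e.Content))
	}
}
```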

## Frontend (client/src/lib/chatStore.ts + client/src/pages/Chat.tsx)
- ConsoleEntry gains content?: string and new type "retry"
- thinking events with content starting "⟳ Retry" → type=retry (amber)
- Chat ConsolePanel renders retry events in amber with RefreshCw icon
  and shows the retry reason string underneath

Author: bboxwtf
Date: 2026-03-21 20:01:26 +00:00
Parent: e228e7a655
Commit: 12b8332b2f
7 changed files with 314 additions and 199 deletions


@@ -109,8 +109,12 @@ services:
       DEFAULT_MODEL: "${DEFAULT_MODEL:-qwen2.5:7b}"
       DATABASE_URL: "${MYSQL_USER:-goclaw}:${MYSQL_PASSWORD:-goClawPass123}@tcp(db:3306)/${MYSQL_DATABASE:-goclaw}?parseTime=true"
       PROJECT_ROOT: "/app"
-      GATEWAY_REQUEST_TIMEOUT_SECS: "120"
+      # Request timeout — must be > (MaxLLMRetries * RetryDelay * 2 + actual LLM time)
+      GATEWAY_REQUEST_TIMEOUT_SECS: "300"
       GATEWAY_MAX_TOOL_ITERATIONS: "10"
+      # LLM retry policy: retry up to N times on empty response or network error
+      GATEWAY_MAX_LLM_RETRIES: "${GATEWAY_MAX_LLM_RETRIES:-3}"
+      GATEWAY_RETRY_DELAY_SECS: "${GATEWAY_RETRY_DELAY_SECS:-2}"
       LOG_LEVEL: "info"
     depends_on:
       db: