Problem: when LLM returned empty content or network error, the orchestrator
immediately stopped with (no response) — visible to user as blank reply.
Solution — 4-layer retry system:
## Go Gateway (gateway/internal/orchestrator/orchestrator.go)
- Extracted shared runLoop() used by Chat(), ChatWithEvents(), ChatWithEventsAndRetry()
- Added RetryPolicy struct: MaxLLMRetries (default 3), InitialDelay (2s),
MaxDelay (30s), RetryOnEmpty (true)
- callLLMWithRetry(): wraps every LLM call with exponential back-off:
* retries on HTTP/network error
* retries on empty choices array
* retries when content=="" AND finish_reason!="tool_calls" (soft empty)
* strips tools on attempt > 1 (avoids repeated tool-format errors)
* logs each attempt; total attempts = MaxLLMRetries + 1 (default: 4)
- Added ChatWithEventsAndRetry() with onRetry callback for client visibility
- SetRetryPolicy() for runtime override
## Config (gateway/config/config.go)
- New fields: MaxLLMRetries (GATEWAY_MAX_LLM_RETRIES, default 3)
RetryDelaySecs (GATEWAY_RETRY_DELAY_SECS, default 2)
## main.go — wires retry policy from config into orchestrator
## docker-compose.yml
- GATEWAY_REQUEST_TIMEOUT_SECS: 120 → 300 (accommodates up to 4 retries)
- GATEWAY_MAX_LLM_RETRIES=3, GATEWAY_RETRY_DELAY_SECS=2 env vars
## API (handlers.go)
- StartChatSession goroutine now uses ChatWithEventsAndRetry
- onRetry callback emits "thinking" DB event with content "⟳ Retry N: reason"
so the client sees retry progress in the console panel
## Frontend (client/src/lib/chatStore.ts + client/src/pages/Chat.tsx)
- ConsoleEntry gains content?: string and new type "retry"
- thinking events with content starting "⟳ Retry" → type=retry (amber)
- Chat ConsolePanel renders retry events in amber with RefreshCw icon
and shows the retry reason string underneath
- db.go: added SaveMetric(MetricInput) and SaveHistory(HistoryInput) methods
that write directly to MySQL; non-fatal (log-only on error)
- handlers.go (OrchestratorStream): after each SSE stream finishes, an async
goroutine saves agentMetrics (agentId, requestId, tokens, processingTimeMs,
model, toolsCalled, status) and agentHistory (userMessage, agentResponse);
both error and success paths covered; orchAgentID resolved from DB
- routers.ts (agents.chat): saveMetric() called for both success and error paths
in the Node.js direct-chat fallback (was only saving agentHistory before)
- Verified: agentMetrics row ID=2 shows processingTimeMs=2133, totalTokens=143,
model=minimax-m2.7, Cyrillic text stored correctly as UTF-8
- Chat.tsx: rewritten to use global chatStore singleton — SSE connection survives
page navigation; added StopCircle cancel button; scrolls only when near bottom
- chatStore.ts: new module-level singleton (EventTarget pattern) that holds all
conversation/console state; TextDecoder with stream:true for correct UTF-8
- handlers.go (ProvidersReload): now accepts decrypted key in request body from
Node.js so Go gateway can actually use the API key without sharing crypto logic
- providers.ts (activateProvider): sends decrypted key to gateway via
notifyGatewayReload(); seedDefaultProvider also calls notifyGatewayReload()
- seed.ts: on startup, after seeding, pushes active provider to gateway with
retry loop (5 retries × 3 s) to wait for gateway readiness
- index.ts (SSE proxy): TextDecoder('utf-8', {stream:true}) already correct;
confirmed Cyrillic text arrives ungarbled (e.g. 'Привет!' not '??????????')
Реализовано:
- gateway/internal/docker/client.go: Docker API клиент через unix socket (/var/run/docker.sock)
- IsSwarmActive(), GetSwarmInfo(), ListNodes(), ListContainers(), GetContainerStats()
- CalcCPUPercent() для расчёта CPU%
- gateway/internal/api/handlers.go: новые endpoints
- GET /api/nodes: список Swarm нод или standalone Docker хост
- GET /api/nodes/stats: live CPU/RAM статистика контейнеров
- POST /api/tools/execute: выполнение инструментов
- gateway/cmd/gateway/main.go: зарегистрированы новые маршруты
- server/gateway-proxy.ts: добавлены getGatewayNodes() и getGatewayNodeStats()
- server/routers.ts: добавлен nodes router (nodes.list, nodes.stats)
- client/src/pages/Nodes.tsx: полностью переписан на реальные данные
- Auto-refresh: 10s для нод, 15s для статистики контейнеров
- Swarm mode: показывает все ноды кластера
- Standalone mode: показывает локальный Docker хост + контейнеры
- CPU/RAM gauges из реальных docker stats
- Error state при недоступном Gateway
- Loading skeleton
- server/nodes.test.ts: 14 новых vitest тестов
- Все 51 тест пройдены