Commit Graph

12 Commits

Author SHA1 Message Date
bboxwtf
12b8332b2f feat(retry): LLM retry-on-failure for orchestrator — never returns empty response
Problem: when LLM returned empty content or network error, the orchestrator
immediately stopped with (no response) — visible to user as blank reply.

Solution — 4-layer retry system:

## Go Gateway (gateway/internal/orchestrator/orchestrator.go)
- Extracted shared runLoop() used by Chat(), ChatWithEvents(), ChatWithEventsAndRetry()
- Added RetryPolicy struct: MaxLLMRetries (default 3), InitialDelay (2s),
  MaxDelay (30s), RetryOnEmpty (true)
- callLLMWithRetry(): wraps every LLM call with exponential back-off:
  * retries on HTTP/network error
  * retries on empty choices array
  * retries when content=="" AND finish_reason!="tool_calls" (soft empty)
  * strips tools on attempt > 1 (avoids repeated tool-format errors)
  * logs each attempt; total attempts = MaxLLMRetries + 1 (default: 4)
- Added ChatWithEventsAndRetry() with onRetry callback for client visibility
- SetRetryPolicy() for runtime override

## Config (gateway/config/config.go)
- New fields: MaxLLMRetries (GATEWAY_MAX_LLM_RETRIES, default 3)
             RetryDelaySecs (GATEWAY_RETRY_DELAY_SECS, default 2)

## main.go — wires retry policy from config into orchestrator

## docker-compose.yml
- GATEWAY_REQUEST_TIMEOUT_SECS: 120 → 300 (accommodates up to 4 retries)
- GATEWAY_MAX_LLM_RETRIES=3, GATEWAY_RETRY_DELAY_SECS=2 env vars

## API (handlers.go)
- StartChatSession goroutine now uses ChatWithEventsAndRetry
- onRetry callback emits "thinking" DB event with content "⟳ Retry N: reason"
  so the client sees retry progress in the console panel

## Frontend (client/src/lib/chatStore.ts + client/src/pages/Chat.tsx)
- ConsoleEntry gains content?: string and new type "retry"
- thinking events with content starting "⟳ Retry" → type=retry (amber)
- Chat ConsolePanel renders retry events in amber with RefreshCw icon
  and shows the retry reason string underneath
2026-03-21 20:01:26 +00:00
bboxwtf
c57d694236 feat(phase21): real Docker Swarm management — live nodes, services, tasks, host shell, agent deployment
## What's implemented

### Go Gateway — New /api/swarm/* endpoints (handlers.go + docker/client.go + db.go)
- GET  /api/swarm/info          — swarm state, manager address, join tokens
- GET  /api/swarm/nodes         — live node list (hostname, IP, CPU, RAM, role, labels)
- POST /api/swarm/nodes/{id}/label        — add/update node label
- POST /api/swarm/nodes/{id}/availability — set node availability (active|pause|drain)
- GET  /api/swarm/services       — all swarm services with replica counts
- POST /api/swarm/services/create — deploy a new agent as a swarm service
- GET  /api/swarm/services/{id}/tasks  — tasks per service (which node runs which replica)
- POST /api/swarm/services/{id}/scale  — scale replicas
- GET  /api/swarm/join-token    — worker/manager join command with token + manager addr
- POST /api/swarm/shell         — execute commands on the HOST via nsenter PID 1

### Docker client (client.go)
- ListServices, GetService, ScaleService, ListServiceTasks, CreateAgentService
- AddNodeLabel, UpdateNodeAvailability (patch node spec via Docker API)
- ExecOnHost (nsenter -t 1 → falls back to container scope)

### DB persistence (db.go)
- UpsertSwarmNodes — stores live node state to swarmNodes table
- UpsertSwarmTokens / GetSwarmTokens — persist join tokens
- Startup goroutine in main.go syncs tokens to DB on gateway start

### Node.js tRPC wrappers (routers.ts + gateway-proxy.ts)
- nodes.swarmInfo, nodes.list, nodes.services, nodes.serviceTasks
- nodes.scaleService, nodes.joinToken, nodes.execShell
- nodes.addNodeLabel, nodes.setAvailability, nodes.deployAgentService

### Frontend — Nodes.tsx (complete rewrite)
- Real swarm overview cards (nodes, managers, services, running tasks)
- Join token cards with copy button for worker & manager tokens
- Node cards with inline availability selector (active/pause/drain) + add-label form
- Services table with Scale dialog + Tasks drawer (replica → node mapping)
- Deploy Agent dialog (image, replicas, env vars, published port)
- Host Shell tab with command history and quick-command buttons

### docker-compose.yml
- gateway now runs with privileged: true + pid: host
  → nsenter can access the host PID namespace for real host-level shell execution

## Verified end-to-end
- GET /api/swarm/info returns manager addr + join tokens ✓
- GET /api/swarm/nodes returns node wsm (2 cores, 3.9 GB) ✓
- POST /api/swarm/services/create → deployed goclaw-test-agent (2 replicas) ✓
- GET /api/swarm/services/{id}/tasks returns task list with nodeId ✓
- POST /api/swarm/services/{id}/scale → scale to 0 ✓
- POST /api/swarm/shell {command:'docker node ls'} → real host output ✓
- tRPC chain: browser → control-center → gateway → docker.sock ✓
2026-03-21 17:23:32 +00:00
bboxwtf
471ca42835 feat(phase20): persistent background chat sessions — DB-backed polling architecture
ARCHITECTURE:
- Replace SSE stream (breaks on page reload) with DB-backed background sessions
- Go Gateway runs orchestrator in detached goroutine using context.Background()
  (survives HTTP disconnect, page reload, and laptop sleep/shutdown)
- Every SSE event (thinking/tool_call/delta/done/error) is persisted to chatEvents table
- Session lifecycle stored in chatSessions table (running→done/error)
- Frontend polls GET /api/orchestrator/getEvents every 1.5 s until status=done

DB CHANGES:
- Migration 0005_chat_sessions.sql: chatSessions + chatEvents tables
- schema.ts: TypeScript types for chatSessions and chatEvents
- db.go: ChatSessionRow and ChatEventRow structs with proper json tags (camelCase)
- db.go: CreateSession, AppendEvent, MarkSessionDone, GetSession, GetEvents, GetRecentSessions

GO GATEWAY:
- handlers.go: StartChatSession — creates DB session, launches goroutine, returns {sessionId} immediately
- handlers.go: GetChatSession, GetChatEvents, ListChatSessions handlers
- main.go: routes POST /api/chat/session, GET /api/chat/session/{id}, GET /api/chat/session/{id}/events, GET /api/chat/sessions
- JSON tags added to ChatSessionRow/ChatEventRow so Go returns camelCase to frontend

NODE.JS SERVER:
- gateway-proxy.ts: startChatSession, getChatSession, getChatEvents, listChatSessions functions
- routers.ts: orchestrator.startSession, .getSession, .getEvents, .listSessions tRPC procedures

FRONTEND:
- chatStore.ts: completely rewritten — uses background sessions + localStorage-based polling resume
  * send() calls orchestrator.startSession via tRPC (returns immediately)
  * Stores sessionId in localStorage (goclaw-pending-sessions)
  * Polls getEvents every 1.5 s, applies events to UI incrementally
  * On page reload: _resumePendingSessions() checks pending sessions and resumes polling
  * cancel() stops all active polls
- chatStore.ts: conversations persisted to localStorage (v3 key, survives page reload)
- Chat.tsx: updated status texts to 'Фоновая обработка…', 'Обработка в фоне…'

VERIFIED:
- POST /api/chat/session → {sessionId, status:'running'} in <100ms
- Poll events → thinking, delta('Привет!'), done after ~2s
- chatSessions table has rows with status=done, model, totalTokens
- Cyrillic stored correctly in UTF-8
- JSON fields are camelCase: id, sessionId, seq, eventType, content, toolName...
2026-03-21 16:50:44 +00:00
bboxwtf
73bfa99c67 feat(metrics): persist orchestrator call stats to agentMetrics + agentHistory
- db.go: added SaveMetric(MetricInput) and SaveHistory(HistoryInput) methods
  that write directly to MySQL; non-fatal (log-only on error)
- handlers.go (OrchestratorStream): after each SSE stream finishes, an async
  goroutine saves agentMetrics (agentId, requestId, tokens, processingTimeMs,
  model, toolsCalled, status) and agentHistory (userMessage, agentResponse);
  both error and success paths covered; orchAgentID resolved from DB
- routers.ts (agents.chat): saveMetric() called for both success and error paths
  in the Node.js direct-chat fallback (was only saving agentHistory before)
- Verified: agentMetrics row ID=2 shows processingTimeMs=2133, totalTokens=143,
  model=minimax-m2.7, Cyrillic text stored correctly as UTF-8
2026-03-21 16:17:15 +00:00
bboxwtf
1b6b8bc2cb feat(phase19): background chat store, UTF-8 SSE fix, DB-backed provider push to gateway
- Chat.tsx: rewritten to use global chatStore singleton — SSE connection survives
  page navigation; added StopCircle cancel button; scrolls only when near bottom
- chatStore.ts: new module-level singleton (EventTarget pattern) that holds all
  conversation/console state; TextDecoder with stream:true for correct UTF-8
- handlers.go (ProvidersReload): now accepts decrypted key in request body from
  Node.js so Go gateway can actually use the API key without sharing crypto logic
- providers.ts (activateProvider): sends decrypted key to gateway via
  notifyGatewayReload(); seedDefaultProvider also calls notifyGatewayReload()
- seed.ts: on startup, after seeding, pushes active provider to gateway with
  retry loop (5 retries × 3 s) to wait for gateway readiness
- index.ts (SSE proxy): TextDecoder('utf-8', {stream:true}) already correct;
  confirmed Cyrillic text arrives ungarbled (e.g. 'Привет!' not '??????????')
2026-03-21 04:12:45 +00:00
bboxwtf
1ad62cf215 feat(phase18): DB-backed LLM providers, SSE streaming chat, left panel + console
Changes:
- drizzle/schema.ts: added llmProviders table (AES-256-GCM encrypted API keys)
- drizzle/0004_llm_providers.sql: migration for llmProviders
- server/providers.ts: full CRUD + AES-256-GCM encrypt/decrypt + seedDefaultProvider
- server/routers.ts: replaced hardcoded config.providers with DB-backed providers router;
  added providers.list/create/update/delete/activate tRPC endpoints
- server/seed.ts: calls seedDefaultProvider() on startup to seed from env if table empty
- server/_core/index.ts: added POST /api/orchestrator/stream SSE proxy route to Go Gateway
- gateway/internal/llm/client.go: added ChatStream (SSE) + UpdateCredentials
- gateway/internal/orchestrator/orchestrator.go: added ChatWithEvents (tool-call callbacks)
- gateway/internal/api/handlers.go: added OrchestratorStream (SSE) + ProvidersReload endpoints
- gateway/internal/db/db.go: added GetActiveProvider from llmProviders table
- gateway/cmd/gateway/main.go: registered /api/orchestrator/stream + /api/providers/reload routes
- client/src/pages/Chat.tsx: full rebuild — 3-panel layout (left: conversation list,
  centre: messages with SSE streaming + markdown, right: live tool-call console)
- client/src/pages/Settings.tsx: full rebuild — DB-backed provider CRUD (add/edit/activate/delete),
  no hardcoded keys, key shown masked from DB hint
2026-03-21 03:25:43 +00:00
bboxwtf
f08513d9a5 fix(phase16): model validation & agent editor improvements
- AgentDetailModal: load real models from API with loading indicator;
  fallback to current agent model when API unavailable; show count badge
- AgentCreateModal: remove broken provider-filter on models list;
  add loading indicator and disabled state during fetch; show count badge
- gateway/orchestrator: add resolveModel() — validates desired model
  against LLM API before use; auto-fallback to first available model
  to prevent 401/404 errors (fixes glm-5 unauthorized in chat)
- gateway/orchestrator: add ModelWarning field to ChatResult struct
- gateway-proxy.ts: add modelWarning field to GatewayChatResult
- Chat.tsx: display modelWarning as amber badge next to model name
- todo.md: add Phase 16 section with bug fixes and tech debt notes
2026-03-21 02:10:17 +00:00
NW
b1a3a994bc fix: upgrade Docker API version from v1.41 to v1.44 2026-03-20 20:19:54 -04:00
Manus
0dcae37a78 Checkpoint: Phase 12: Real-time Docker Swarm monitoring for /nodes page
Реализовано:
- gateway/internal/docker/client.go: Docker API клиент через unix socket (/var/run/docker.sock)
  - IsSwarmActive(), GetSwarmInfo(), ListNodes(), ListContainers(), GetContainerStats()
  - CalcCPUPercent() для расчёта CPU%
- gateway/internal/api/handlers.go: новые endpoints
  - GET /api/nodes: список Swarm нод или standalone Docker хост
  - GET /api/nodes/stats: live CPU/RAM статистика контейнеров
  - POST /api/tools/execute: выполнение инструментов
- gateway/cmd/gateway/main.go: зарегистрированы новые маршруты
- server/gateway-proxy.ts: добавлены getGatewayNodes() и getGatewayNodeStats()
- server/routers.ts: добавлен nodes router (nodes.list, nodes.stats)
- client/src/pages/Nodes.tsx: полностью переписан на реальные данные
  - Auto-refresh: 10s для нод, 15s для статистики контейнеров
  - Swarm mode: показывает все ноды кластера
  - Standalone mode: показывает локальный Docker хост + контейнеры
  - CPU/RAM gauges из реальных docker stats
  - Error state при недоступном Gateway
  - Loading skeleton
- server/nodes.test.ts: 14 новых vitest тестов
- Все 51 тест пройдены
2026-03-20 20:12:57 -04:00
Manus
2f87e18e85 Checkpoint: Phase 11 complete: Frontend connected to Go Gateway. All orchestrator/ollama tRPC calls go through gateway-proxy.ts with Node.js fallback. 37 vitest tests pass. End-to-end verified: chat, tool calling, health via Go Gateway. 2026-03-20 19:38:27 -04:00
Manus
dda99d7056 Checkpoint: Phase 10: LLM Provider Configuration — Ollama Cloud как дефолт, локальный Ollama закомментирован (GPU only), docker-compose/stack обновлены, .env.example с 4 провайдерами 2026-03-20 19:16:44 -04:00
Manus
02742f836c Checkpoint: Phase 9: Go Gateway — полный перенос оркестратора и tool executor на Go. Добавлены gateway/ (Go), docker/ (docker-compose + stack + Dockerfiles), server/gateway-proxy.ts 2026-03-20 18:43:49 -04:00