
9 Commits

Author SHA1 Message Date
bboxwtf
153399f41e feat(phase-A): agent-worker container — autonomous agent HTTP server
PHASE A COMPLETE: each agent can now live in its own Docker Swarm container as an autonomous unit.

- Agent HTTP server: GET /health, GET /info, POST /chat, POST /task, GET /tasks, GET /tasks/{id}, GET /memory
- Loads its config from the shared DB by the AGENT_ID env var (model, systemPrompt, allowedTools)
- 4 worker goroutines for parallel task processing
- In-memory task queue (buffered channel, depth=100) + ring buffer of the last 50 tasks
- Callback URL: POSTs the result when an async task completes
- Sliding-window memory: loads the last 20 messages from the DB on every request
- Isolated tools: the agent gets only the allowedTools from its own configuration
- The agent calls the LLM directly via LLM_BASE_URL (not through the Gateway)
- Graceful shutdown with a 15s timeout
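
The queue described in the list above (buffered channel, 4 workers, ring buffer of recent tasks) could look roughly like the sketch below. The `Task`, `Ring`, and `RunPool` names are hypothetical and not taken from the commit; only the sizes (100, 4, 50) come from the message.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work; the agent's real task type is not shown in the commit.
type Task struct {
	ID    string
	Input string
}

// Ring keeps the most recent tasks, overwriting the oldest once it is full.
type Ring struct {
	mu    sync.Mutex
	items []Task
	next  int
	full  bool
}

// NewRing allocates a ring with a fixed capacity (size must be > 0).
func NewRing(size int) *Ring { return &Ring{items: make([]Task, size)} }

// Add stores a task in the next slot and wraps around at the end.
func (r *Ring) Add(t Task) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.items[r.next] = t
	r.next = (r.next + 1) % len(r.items)
	if r.next == 0 {
		r.full = true
	}
}

// Len reports how many tasks are currently retained.
func (r *Ring) Len() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.full {
		return len(r.items)
	}
	return r.next
}

// RunPool starts `workers` goroutines that drain the queue until it is closed.
func RunPool(queue <-chan Task, workers int, handle func(Task), recent *Ring) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				handle(t)
				recent.Add(t)
			}
		}()
	}
	return wg
}

func main() {
	queue := make(chan Task, 100) // depth=100, as in the commit message
	recent := NewRing(50)         // last 50 tasks
	wg := RunPool(queue, 4, func(Task) {}, recent)

	for i := 0; i < 120; i++ {
		queue <- Task{ID: fmt.Sprintf("t%d", i)}
	}
	close(queue)
	wg.Wait()
	fmt.Println("tasks retained:", recent.Len())
}
```

A buffered channel gives back-pressure for free: once 100 tasks are waiting, a producer blocks (or can reject the request) instead of growing memory without bound.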

- 20 unit tests: all PASS
- Coverage: initialization, task queue, /health, /info, /task, /tasks, /memory, tools, lifecycle

- Multi-stage Go build: golang:1.23-alpine → alpine:3.21
- EXPOSE 8001, HEALTHCHECK on /health every 15s
- Agents are deployed dynamically by Swarm (not as a static service in the stack)

- New fields in the agents table: serviceName, servicePort, containerImage, containerStatus
- SQL migration: drizzle/migrations/0006_agent_container_fields.sql

- AgentConfig + AgentRow: new fields serviceName, servicePort, containerImage, containerStatus
- UpdateContainerStatus(): updates the status on deploy/stop
- GetAgentHistory(): the agent's sliding-window memory from the DB
- SaveHistory(): saves the agent's conversation to the DB

- delegate_to_agent: a real HTTP POST to the agent container via overlay DNS
  - sync: POST /chat (waits for the reply)
  - async: POST /task (returns task_id)
  - fallback: if the agent is not running, returns an informative message
- SetDatabase(): injects the DB used to resolve the addresses of live agents
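
A minimal sketch of the sync/async delegation described above, assuming the agent is addressed as `http://<serviceName>:<servicePort>` on the overlay network. The `agentURL` and `delegate` helpers and the `{"message": ...}` payload are hypothetical; the commit does not show the request schema. The demo uses a local test server in place of a real agent container.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// agentURL picks the endpoint: /chat for sync calls, /task for async ones.
func agentURL(serviceName string, port int, async bool) string {
	path := "/chat"
	if async {
		path = "/task"
	}
	return fmt.Sprintf("http://%s:%d%s", serviceName, port, path)
}

// delegate POSTs a message to the agent and returns the raw response body.
// The payload shape is an assumption made for this sketch.
func delegate(client *http.Client, url, message string) (string, error) {
	body, err := json.Marshal(map[string]string{"message": message})
	if err != nil {
		return "", err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		// Fallback path: report clearly that the agent could not be reached.
		return "", fmt.Errorf("agent is not reachable (is it running?): %w", err)
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return "", fmt.Errorf("agent returned %s", resp.Status)
	}
	return string(data), nil
}

func main() {
	// Stand-in for an agent container: echoes the path it was called on.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, `{"path":%q}`, r.URL.Path)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	reply, err := delegate(client, srv.URL+"/chat", "hello")
	fmt.Println(reply, err)

	// In a Swarm overlay network the service name resolves through DNS.
	fmt.Println(agentURL("agent-7", 8001, true))
}
```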

- The Orchestrator injects the DB into the Executor via SetDatabase() during initialization
2026-03-31 23:11:02 +00:00
bboxwtf
a8a8ea1ee2 feat(swarm): autonomous agent containers, Swarm Manager with auto-stop, /nodes UI overhaul
## 1. Fix /nodes Swarm Status Display
- Add SwarmStatusBanner component: clear green/red/loading state
- Shows nodeId, managerAddr, isManager badge
- Error state explains what to check (docker.sock mount)
- Header now shows 'swarm unreachable — check gateway' vs 'active'
- swarmOk now checks nodeId presence, not just data existence

## 2. Autonomous Agent Container
- New docker/Dockerfile.agent — builds Go agent binary from gateway/cmd/agent/
- New gateway/cmd/agent/main.go — standalone HTTP microservice:
  * GET /health — liveness probe with idle time info
  * POST /task — receives task, forwards to Gateway orchestrator
  * GET /info  — agent metadata (id, hostname, gateway url)
  * Idle watchdog: calls /api/swarm/agents/{name}/stop after IdleTimeoutMinutes
  * Connects to Swarm overlay network (goclaw-net) → reaches DB/Gateway by DNS
  * Env: AGENT_ID, GATEWAY_URL, DATABASE_URL, IDLE_TIMEOUT_MINUTES

## 3. Swarm Manager Agent (auto-stop after 15min idle)
- New gateway/internal/api/swarm_manager.go:
  * SwarmManager goroutine checks every 60s
  * Scales idle GoClaw agent services to 0 replicas after 15 min
  * Tracks lastActivity from task UpdatedAt timestamps
- New REST endpoints in gateway:
  * GET  /api/swarm/agents           — list agents with idleMinutes
  * POST /api/swarm/agents/{name}/start — scale up agent
  * POST /api/swarm/agents/{name}/stop  — scale to 0
  * DELETE /api/swarm/services/{id}     — remove service permanently
- SwarmManager started as background goroutine in main.go with context cancel
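
The idle check described in this section can be sketched as below: a ticker-driven loop that stops on context cancel and selects running services whose last activity is older than the timeout. The `AgentService` type and the `list`/`stop` callbacks are hypothetical stand-ins for the Docker client calls; only the 60s interval and the 15 min timeout come from the commit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// AgentService is a hypothetical view of one agent's Swarm service.
type AgentService struct {
	Name         string
	Replicas     int
	LastActivity time.Time
}

const (
	checkInterval = 60 * time.Second
	idleTimeout   = 15 * time.Minute
)

// demoNow is a fixed clock value used only by the demo data below.
var demoNow = time.Date(2026, 3, 21, 20, 0, 0, 0, time.UTC)

// idleAgents returns the names of running services idle for at least timeout.
func idleAgents(services []AgentService, now time.Time, timeout time.Duration) []string {
	var idle []string
	for _, s := range services {
		if s.Replicas > 0 && now.Sub(s.LastActivity) >= timeout {
			idle = append(idle, s.Name)
		}
	}
	return idle
}

// runManager checks on every tick and returns when ctx is cancelled.
// list and stop stand in for the real Docker API calls.
func runManager(ctx context.Context, interval time.Duration, list func() []AgentService, stop func(name string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			for _, name := range idleAgents(list(), now, idleTimeout) {
				stop(name) // e.g. scale the service to 0 replicas
			}
		}
	}
}

// demoServices returns sample data relative to demoNow.
func demoServices() []AgentService {
	return []AgentService{
		{Name: "agent-busy", Replicas: 1, LastActivity: demoNow.Add(-2 * time.Minute)},
		{Name: "agent-idle", Replicas: 1, LastActivity: demoNow.Add(-20 * time.Minute)},
		{Name: "agent-stopped", Replicas: 0, LastActivity: demoNow.Add(-2 * time.Hour)},
	}
}

func main() {
	fmt.Println(idleAgents(demoServices(), demoNow, idleTimeout))
}
```

In a real process the loop would be started once, for example `go runManager(ctx, checkInterval, list, stop)`, with `ctx` cancelled on shutdown.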

## 4. Docker Client Enhancements
- Added NetworkAttachment type and Networks field to ServiceSpec
- CreateAgentServiceFull(opts) — supports overlay networks, custom labels
- CreateAgentService() delegates to CreateAgentServiceFull for backward compat
- RemoveService(id) — DELETE /v1.44/services/{id}
- GetServiceLastActivity(id) — finds latest task UpdatedAt for idle detection

## 5. tRPC & Gateway Proxy
- New functions: removeSwarmService, listSwarmAgents, startSwarmAgent, stopSwarmAgent
- SwarmAgentInfo type with idleMinutes, lastActivity, desiredReplicas
- createAgentService now accepts networks[] parameter
- New tRPC endpoints: nodes.removeService, nodes.listAgents, nodes.startAgent, nodes.stopAgent

## 6. Nodes.tsx UI Overhaul
- SwarmStatusBanner component at top — no more silent 'connecting…'
- New 'Agents' tab with AgentManagerRow: idle time, auto-stop warning, start/stop/remove buttons
- IdleColor coding: green < 5m, yellow 5-10m, red 10m+ with countdown to auto-stop
- ServiceRow: added Remove button with confirmation dialog
- RemoveConfirmDialog component
- DeployAgentDialog: added overlay networks field, default env includes GATEWAY_URL
- All queries refetch after agent start/stop/remove
2026-03-21 20:37:21 +00:00
bboxwtf
12b8332b2f feat(retry): LLM retry-on-failure for orchestrator — never returns empty response
Problem: when LLM returned empty content or network error, the orchestrator
immediately stopped with (no response) — visible to user as blank reply.

Solution — 4-layer retry system:

## Go Gateway (gateway/internal/orchestrator/orchestrator.go)
- Extracted shared runLoop() used by Chat(), ChatWithEvents(), ChatWithEventsAndRetry()
- Added RetryPolicy struct: MaxLLMRetries (default 3), InitialDelay (2s),
  MaxDelay (30s), RetryOnEmpty (true)
- callLLMWithRetry(): wraps every LLM call with exponential back-off:
  * retries on HTTP/network error
  * retries on empty choices array
  * retries when content=="" AND finish_reason!="tool_calls" (soft empty)
  * strips tools on attempt > 1 (avoids repeated tool-format errors)
  * logs each attempt; total attempts = MaxLLMRetries + 1 (default: 4)
- Added ChatWithEventsAndRetry() with onRetry callback for client visibility
- SetRetryPolicy() for runtime override
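
A simplified sketch of the retry loop described above. The `RetryPolicy` field names and defaults come from this commit message; the doubling factor, the `callWithRetry` signature, and the injected `sleep` function are assumptions for illustration. The tool-stripping and `finish_reason` checks are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryPolicy mirrors the fields named in the commit message.
type RetryPolicy struct {
	MaxLLMRetries int
	InitialDelay  time.Duration
	MaxDelay      time.Duration
	RetryOnEmpty  bool
}

// DefaultRetryPolicy returns the defaults listed in the commit message.
func DefaultRetryPolicy() RetryPolicy {
	return RetryPolicy{MaxLLMRetries: 3, InitialDelay: 2 * time.Second, MaxDelay: 30 * time.Second, RetryOnEmpty: true}
}

// backoff returns the delay before retry n (1-based): InitialDelay doubled
// for each further retry and capped at MaxDelay. Doubling is an assumption.
func (p RetryPolicy) backoff(retry int) time.Duration {
	d := p.InitialDelay
	for i := 1; i < retry; i++ {
		d *= 2
		if d >= p.MaxDelay {
			return p.MaxDelay
		}
	}
	if d > p.MaxDelay {
		return p.MaxDelay
	}
	return d
}

// noSleep lets demos and tests skip real waiting.
func noSleep(time.Duration) {}

// callWithRetry runs call up to MaxLLMRetries+1 times and returns the content,
// the number of attempts made, and the last error if every attempt failed.
func callWithRetry(p RetryPolicy, sleep func(time.Duration), call func(attempt int) (string, error)) (string, int, error) {
	total := p.MaxLLMRetries + 1
	var lastErr error
	for attempt := 1; attempt <= total; attempt++ {
		if attempt > 1 {
			sleep(p.backoff(attempt - 1))
		}
		content, err := call(attempt)
		switch {
		case err != nil:
			lastErr = err
		case content == "" && p.RetryOnEmpty:
			lastErr = errors.New("empty response")
		default:
			return content, attempt, nil
		}
	}
	return "", total, lastErr
}

func main() {
	p := DefaultRetryPolicy()
	// Hypothetical LLM call: fails once, returns empty once, then succeeds.
	call := func(attempt int) (string, error) {
		switch attempt {
		case 1:
			return "", errors.New("network error")
		case 2:
			return "", nil
		default:
			return "ok", nil
		}
	}
	content, attempts, err := callWithRetry(p, noSleep, call)
	fmt.Println(content, attempts, err)
}
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real back-off delays; production code would pass `time.Sleep` or a context-aware wait.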

## Config (gateway/config/config.go)
- New fields: MaxLLMRetries (GATEWAY_MAX_LLM_RETRIES, default 3)
             RetryDelaySecs (GATEWAY_RETRY_DELAY_SECS, default 2)

## main.go — wires retry policy from config into orchestrator

## docker-compose.yml
- GATEWAY_REQUEST_TIMEOUT_SECS: 120 → 300 (accommodates up to 4 attempts: 1 initial call + 3 retries)
- GATEWAY_MAX_LLM_RETRIES=3, GATEWAY_RETRY_DELAY_SECS=2 env vars

## API (handlers.go)
- StartChatSession goroutine now uses ChatWithEventsAndRetry
- onRetry callback emits "thinking" DB event with content "⟳ Retry N: reason"
  so the client sees retry progress in the console panel

## Frontend (client/src/lib/chatStore.ts + client/src/pages/Chat.tsx)
- ConsoleEntry gains content?: string and new type "retry"
- thinking events with content starting "⟳ Retry" → type=retry (amber)
- Chat ConsolePanel renders retry events in amber with RefreshCw icon
  and shows the retry reason string underneath
2026-03-21 20:01:26 +00:00
bboxwtf
c57d694236 feat(phase21): real Docker Swarm management — live nodes, services, tasks, host shell, agent deployment
## What's implemented

### Go Gateway — New /api/swarm/* endpoints (handlers.go + docker/client.go + db.go)
- GET  /api/swarm/info          — swarm state, manager address, join tokens
- GET  /api/swarm/nodes         — live node list (hostname, IP, CPU, RAM, role, labels)
- POST /api/swarm/nodes/{id}/label        — add/update node label
- POST /api/swarm/nodes/{id}/availability — set node availability (active|pause|drain)
- GET  /api/swarm/services       — all swarm services with replica counts
- POST /api/swarm/services/create — deploy a new agent as a swarm service
- GET  /api/swarm/services/{id}/tasks  — tasks per service (which node runs which replica)
- POST /api/swarm/services/{id}/scale  — scale replicas
- GET  /api/swarm/join-token    — worker/manager join command with token + manager addr
- POST /api/swarm/shell         — execute commands on the HOST via nsenter PID 1

### Docker client (client.go)
- ListServices, GetService, ScaleService, ListServiceTasks, CreateAgentService
- AddNodeLabel, UpdateNodeAvailability (patch node spec via Docker API)
- ExecOnHost (nsenter -t 1 → falls back to container scope)

### DB persistence (db.go)
- UpsertSwarmNodes — stores live node state to swarmNodes table
- UpsertSwarmTokens / GetSwarmTokens — persist join tokens
- Startup goroutine in main.go syncs tokens to DB on gateway start

### Node.js tRPC wrappers (routers.ts + gateway-proxy.ts)
- nodes.swarmInfo, nodes.list, nodes.services, nodes.serviceTasks
- nodes.scaleService, nodes.joinToken, nodes.execShell
- nodes.addNodeLabel, nodes.setAvailability, nodes.deployAgentService

### Frontend — Nodes.tsx (complete rewrite)
- Real swarm overview cards (nodes, managers, services, running tasks)
- Join token cards with copy button for worker & manager tokens
- Node cards with inline availability selector (active/pause/drain) + add-label form
- Services table with Scale dialog + Tasks drawer (replica → node mapping)
- Deploy Agent dialog (image, replicas, env vars, published port)
- Host Shell tab with command history and quick-command buttons

### docker-compose.yml
- gateway now runs with privileged: true + pid: host
  → nsenter can access the host PID namespace for real host-level shell execution

## Verified end-to-end
- GET /api/swarm/info returns manager addr + join tokens ✓
- GET /api/swarm/nodes returns node wsm (2 cores, 3.9 GB) ✓
- POST /api/swarm/services/create → deployed goclaw-test-agent (2 replicas) ✓
- GET /api/swarm/services/{id}/tasks returns task list with nodeId ✓
- POST /api/swarm/services/{id}/scale → scale to 0 ✓
- POST /api/swarm/shell {command:'docker node ls'} → real host output ✓
- tRPC chain: browser → control-center → gateway → docker.sock ✓
2026-03-21 17:23:32 +00:00
bboxwtf
91684956bb fix(phase17): 401 auth, provider config from server, remove hardcoded PROVIDERS
Problems fixed:
1. 401 unauthorized on chat — OLLAMA_API_KEY was not set in containers
   - Created docker/.env with real API key
   - Added OLLAMA_BASE_URL + OLLAMA_API_KEY to control-center in docker-compose.yml

2. AgentDetailModal/AgentCreateModal showed hardcoded providers list
   (Ollama, OpenAI, Anthropic, Mistral, Groq) regardless of what is configured
   - Removed const PROVIDERS = [...] from both modals
   - Now loads providers via trpc.config.providers (server-side)
   - Only shows providers that are actually configured in env

3. Settings.tsx had API key hardcoded in frontend source code (security issue)
   - API key removed from frontend
   - New trpc.config.providers endpoint returns masked key (first 8 chars + ***)
   - Shows red warning badge 'NO KEY — chat will fail' if key is missing
   - Base URL read from server env, not hardcoded
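
The masking rule above (first 8 characters plus `***`) is small enough to sketch. The actual endpoint is a tRPC (TypeScript) procedure; this Go version is only an illustration, and the handling of empty or short keys is an assumption not stated in the commit.

```go
package main

import "fmt"

// maskKey keeps the first 8 characters and hides the rest.
// Assumption: an empty key stays empty (so hasKey can be false), and a key of
// 8 characters or fewer is fully hidden so it is never shown in full.
func maskKey(key string) string {
	const visible = 8
	if key == "" {
		return ""
	}
	if len(key) <= visible {
		return "***"
	}
	// Byte slicing is safe here on the assumption that API keys are ASCII.
	return key[:visible] + "***"
}

func main() {
	fmt.Println(maskKey("sk-1234567890abcdef"))
}
```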

New tRPC endpoint: config.providers
   - Returns list of configured providers with name, baseUrl, hasKey, maskedKey
   - Provider name auto-detected from URL (ollama.com → 'Ollama Cloud', etc.)
2026-03-21 02:55:05 +00:00
NW
6a033c2db0 fix: run gateway as root for docker.sock access (read-only Docker API) 2026-03-20 20:18:39 -04:00
NW
4a3530feb7 fix: add docker group to gateway container for docker.sock access 2026-03-20 20:17:43 -04:00
Manus
dda99d7056 Checkpoint: Phase 10: LLM Provider Configuration. Ollama Cloud is the default, local Ollama is commented out (GPU only), docker-compose/stack updated, .env.example with 4 providers 2026-03-20 19:16:44 -04:00
Manus
02742f836c Checkpoint: Phase 9: Go Gateway. Full port of the orchestrator and tool executor to Go. Added gateway/ (Go), docker/ (docker-compose + stack + Dockerfiles), server/gateway-proxy.ts 2026-03-20 18:43:49 -04:00