- Resolve service_healthy deadlock by using service_started instead
- Fix 172.28.0.0/16 network collision by removing ipam config
- Add HybridGiteaClient (mcp → rest → bash fallback)
- Create .kilo/rules/process-continuity.md with 5 operator-free principles:
  1. No service_healthy conditions
  2. No hardcoded networks
  3. Automatic fallback chains
  4. Pre-flight validation
  5. Self-documenting failures
- Update docker-compose.yml with resilient config:
  - start_period: 60s, retries: 5, restart: on-failure:3
  - /tools healthcheck (guaranteed endpoint)
  - tmpfs for Node.js /tmp
  - Resource limits: 256M RAM, 0.5 CPU
- MCP/REST integration test passed (issue #109)

Refs: Milestone #67, Issues #107, #109

# GNS-2: Process Continuity Rules

## Problem

The pipeline repeatedly broke in Phase 8 (MCP Docker integration) for four reasons:

1. **`service_healthy` deadlock** (docker-compose.yml): the container could not start because it was waiting for its own healthcheck to pass before it was running.
2. **Network overlap**: the hardcoded subnet 172.28.0.0/16 conflicted with existing Docker networks.
3. **Undocumented MCP transport**: the SSE (Server-Sent Events) transport is not supported by the current Kilo Code infrastructure, and there was no automated fallback.
4. **Operator dependency**: the process stopped whenever it hit a technical barrier and required a human decision to continue.

## Root Cause

| Failure | Why it happened | Operator-Free Fix |
|---------|-----------------|-------------------|
| `service_healthy` deadlock | Docker Compose blocked startup waiting for a healthcheck on a container that wasn't yet running | Use `condition: service_started` for `depends_on` |
| Subnet `172.28.0.0/16` conflict | Hardcoded subnet overlapped with host Docker networks | Remove `ipam` config, let Docker auto-assign |
| SSE transport unsupported | forgejo-mcp exposes MCP over SSE; the current agent infrastructure uses HTTP REST + bash curl | Hybrid client with MCP → REST → bash curl fallback |
| `/health` endpoint mismatch | The healthcheck probed `/health`, but the MCP server exposes a different URL | Probe `/tools` (guaranteed endpoint) instead |

## Operator-Free Design Principles

### 1. No `service_healthy` Conditions

```yaml
# PROBLEM: deadlock
depends_on:
  service:
    condition: service_healthy  # Container waits for itself

# FIX: allow startup, healthcheck as observer only
depends_on:
  service:
    condition: service_started
```

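With `service_started`, Compose no longer waits for readiness, so a dependent process may need to tolerate a dependency that is up but not yet answering. The following is a minimal sketch of such a client-side readiness loop, not the project's implementation: `waitUntilReady`, the `Probe` type, and the default of 5 retries with a 1 second delay are all assumptions chosen for illustration.

```typescript
// Hypothetical helper: names and defaults are illustrative, not from the project.
type Probe = () => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls `probe` up to `retries` times, waiting `delayMs` between attempts.
// Resolves to true on the first successful probe, false if every attempt fails.
async function waitUntilReady(
  probe: Probe,
  retries = 5,
  delayMs = 1000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // A thrown probe (e.g. connection refused) counts as "not ready yet".
    }
    if (attempt < retries) await sleep(delayMs);
  }
  return false;
}
```

A caller would pass a probe that requests the guaranteed endpoint (for example `/tools` on the MCP server) and, if the loop returns `false`, switch to the fallback path instead of stopping.
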
### 2. No Hardcoded Networks

```yaml
# PROBLEM: overlap
networks:
  gns-network:
    ipam:
      config:
        - subnet: 172.28.0.0/16  # May conflict

# FIX: Docker auto-assigns
networks:
  gns-network:
    driver: bridge
```

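To see why a fixed subnet such as 172.28.0.0/16 may collide with networks already present on a host, it helps to compare the address ranges directly. The sketch below is a hypothetical IPv4-only helper (`cidrRange` and `cidrsOverlap` are not part of the project); it could back a pre-flight check if a fixed subnet ever became unavoidable.

```typescript
// Hypothetical helper (not from the project): IPv4 CIDR overlap check.
const CIDR_PATTERN = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/;

// Convert "a.b.c.d/n" into the inclusive [first, last] address range.
function cidrRange(cidr: string): [number, number] {
  const match = CIDR_PATTERN.exec(cidr);
  if (!match) throw new Error(`Invalid CIDR: ${cidr}`);
  const [a, b, c, d, prefix] = match.slice(1).map(Number) as
    [number, number, number, number, number];
  if ([a, b, c, d].some((octet) => octet > 255) || prefix > 32) {
    throw new Error(`Invalid CIDR: ${cidr}`);
  }
  // Plain arithmetic instead of bit shifts keeps the values unsigned.
  const address = ((a * 256 + b) * 256 + c) * 256 + d;
  const size = 2 ** (32 - prefix);
  const first = Math.floor(address / size) * size;
  return [first, first + size - 1];
}

// Two subnets overlap when their address ranges intersect.
function cidrsOverlap(left: string, right: string): boolean {
  const [leftFirst, leftLast] = cidrRange(left);
  const [rightFirst, rightLast] = cidrRange(right);
  return leftFirst <= rightLast && rightFirst <= leftLast;
}
```

For instance, `cidrsOverlap("172.28.0.0/16", "172.16.0.0/12")` is true, because the /12 block spans 172.16.0.0 through 172.31.255.255 and therefore contains the hardcoded subnet.
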
### 3. Automatic Fallback Chains

```typescript
// Hybrid client: tries MCP first, falls back to REST, falls back to bash curl
let result
try {
  result = await mcpClient.createIssue(...)
} catch (mcpError) {
  console.warn(`MCP failed: ${mcpError}`)
  try {
    result = await restClient.createIssue(...)
  } catch (restError) {
    console.warn(`REST failed: ${restError}`)
    // Final fallback: bash curl (emergency only).
    // If this also throws, the error propagates to the caller.
    result = await bashCurl(...)
  }
}
```

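The nested try/catch above generalizes to an ordered list of strategies, which avoids deeper nesting each time a transport is added. This is a sketch under assumptions rather than the actual `HybridGiteaClient`: the `Strategy` and `FallbackResult` types and the `withFallback` function are hypothetical names introduced here.

```typescript
// Hypothetical generalization of the MCP → REST → bash curl chain.
interface Strategy<T> {
  name: string;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  value: T;
  usedStrategy: string; // which strategy finally succeeded
  failures: string[];   // one entry per strategy that failed before it
}

// Runs strategies in order and returns the first success.
// Throws only when every strategy has failed, listing each failure.
async function withFallback<T>(
  strategies: Strategy<T>[],
): Promise<FallbackResult<T>> {
  const failures: string[] = [];
  for (const { name, run } of strategies) {
    try {
      const value = await run();
      return { value, usedStrategy: name, failures };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${name}: ${message}`);
    }
  }
  throw new Error(`All strategies failed: ${failures.join("; ")}`);
}
```

Returning `usedStrategy` and `failures` alongside the value lets the caller report which fallback is active, which feeds directly into the failure messages described under principle 5.
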
### 4. Pre-flight Validation

Before starting containers, validate prerequisites:

```bash
# Check whether the port is already in use; if something answers, use another
PORT=3001
if curl -s -o /dev/null "http://localhost:${PORT}/"; then
  PORT=3002
fi

# Remove a stale gns-network left over from a previous run
docker network ls | grep -q gns-network && docker network rm gns-network

# Check env vars are set; fall back to a placeholder so startup is not blocked
if [ -z "${FORGEJO_TOKEN:-}" ]; then
  echo "WARNING: FORGEJO_TOKEN not set, using dummy value" >&2
  export FORGEJO_TOKEN="dummy"  # placeholder value (assumption); replace with a real token
fi
```

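The environment-variable check can also live inside the agent code, so that missing values produce a warning and a safe fallback rather than a stop. The sketch below is one possible shape for that, not the project's code: `resolveEnv`, `EnvResolution`, and the fallback values passed to it are hypothetical.

```typescript
// Hypothetical helper; variable names and fallback values are placeholders.
interface EnvResolution {
  values: Record<string, string>;
  warnings: string[];
}

// For each name in `fallbacks`, use the value from `env` when it is set and
// non-empty; otherwise use the fallback and record a warning with a next step.
function resolveEnv(
  env: Record<string, string | undefined>,
  fallbacks: Record<string, string>,
): EnvResolution {
  const values: Record<string, string> = {};
  const warnings: string[] = [];
  for (const [name, fallback] of Object.entries(fallbacks)) {
    const actual = env[name];
    if (actual !== undefined && actual !== "") {
      values[name] = actual;
    } else {
      values[name] = fallback;
      warnings.push(
        `WARNING: ${name} not set, using fallback "${fallback}". ` +
          `Next step: set ${name} before relying on this service.`,
      );
    }
  }
  return { values, warnings };
}
```

In Node.js this could be called as `resolveEnv(process.env, { FORGEJO_TOKEN: "dummy" })`, with the warnings printed to the console and attached to the failure event if the fallback turns out to be insufficient.
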
### 5. Self-Documenting Failures

If the process must stop, write an explicit "why" and "what to do" to both:

- Console output (human readable)
- Gitea issue comment (machine readable, includes `GNS_EVENT`)

```markdown
## 🚫 Agent Blocked

**Reason**: MCP server not reachable on localhost:3001
**Action**: Run `docker compose -f docker/mcp-gitea/docker-compose.yml up -d`
**Fallback**: Operations will use REST API until MCP is available
```

## Implementation Checklist

For every new container/service:

- [ ] Healthcheck probes a guaranteed endpoint (`/tools`, not `/health` if it is unstable)
- [ ] No `service_healthy` conditions in `depends_on`
- [ ] No hardcoded subnets or IPs
- [ ] Environment variables have safe fallbacks for startup
- [ ] Error boundaries in all async operations (try/catch)
- [ ] Error messages include both "what happened" and "next step"
- [ ] All operator-required steps are documented as a checklist in the issue body

## GNS-2 Event Format for Failures

```html
<!-- GNS_EVENT: {
  "type": "system_failure",
  "failure_point": "mcp_container_startup",
  "requires_operator": true,
  "reason": "FORGEJO_TOKEN not set, container cannot connect to Gitea; used dummy token",
  "recovery_steps": [
    "Set FORGEJO_TOKEN in docker/mcp-gitea/.env",
    "Restart: docker compose -f docker/mcp-gitea/docker-compose.yml up -d"
  ],
  "fallback_active": "REST API (gitea-client.ts)",
  "timestamp": "2026-05-08T22:23:00Z"
} -->
```

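Because the event is JSON wrapped in an HTML comment, it can be written and read back mechanically. The following sketch assumes the field set shown in the example above is complete; `GnsFailureEvent`, `formatGnsEvent`, and `parseGnsEvent` are hypothetical names, not an existing GNS-2 API.

```typescript
// Hypothetical types and helpers mirroring the example event above.
interface GnsFailureEvent {
  type: "system_failure";
  failure_point: string;
  requires_operator: boolean;
  reason: string;
  recovery_steps: string[];
  fallback_active: string;
  timestamp: string; // ISO 8601, e.g. new Date().toISOString()
}

// Serialize the event into an HTML comment for a Gitea issue comment.
// "-->" can only occur inside JSON string values, so escaping ">" as a
// JSON unicode escape keeps the surrounding comment from closing early.
function formatGnsEvent(event: GnsFailureEvent): string {
  const json = JSON.stringify(event, null, 2).replace(/-->/g, "--\\u003e");
  return `<!-- GNS_EVENT: ${json} -->`;
}

// Extract the first GNS_EVENT from a comment body, or null if none parses.
function parseGnsEvent(comment: string): GnsFailureEvent | null {
  const match = /<!--\s*GNS_EVENT:\s*([\s\S]*?)\s*-->/.exec(comment);
  const json = match?.[1];
  if (json === undefined) return null;
  try {
    return JSON.parse(json) as GnsFailureEvent;
  } catch {
    return null;
  }
}
```

The parser does not validate field types beyond what `JSON.parse` guarantees; a consumer that acts on `requires_operator` or `recovery_steps` would likely want an explicit shape check first.
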
## Reference

- Docker Compose `depends_on` behavior: https://docs.docker.com/compose/startup-order/
- MCP protocol transport: https://modelcontextprotocol.io/specification/2024-11-05/architecture/transports
- Gitea API fallback: `.kilo/shared/gitea-api.md`