Deployment Topology
:::info Source
Sourced from services/ai-gateway-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::
1. Containers
- ai-api: REST + SSE endpoint.
- ai-local-worker: local model inference (GPU-backed).
- ai-outbox-relay.
- ai-eval-worker: runs eval sets nightly + on PR.
- ai-budget-reaper: resets daily/monthly budgets.
- ai-audit-archiver: batch archives audit rows to cold S3.
2. Scaling
| Container | Min replicas | Max replicas | Scaling trigger |
|---|---|---|---|
| api | 5 | 50 | CPU>60% or in-flight completions > 200/pod |
| local-worker | 2 | 20 (GPU pool) | GPU util > 70% |
| outbox-relay | 2 | 8 | backlog > 5000 |
| eval-worker | 1 | 5 | cron-driven |
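
The thresholds in the table can be read through the standard Kubernetes HPA rule: desired replicas = ceil(current replicas × observed metric / target metric), taking the largest result across metrics and clamping to the min/max bounds. The sketch below is illustrative only; the function name and the sample numbers are assumptions, and it ignores the HPA tolerance band and stabilization windows.

```python
import math

def desired_replicas(current: int,
                     metrics: list[tuple[float, float]],
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Approximate the HPA decision for one container.

    metrics is a list of (observed, target) pairs, e.g. (75, 60) for
    CPU at 75% against a 60% target. The metric demanding the most
    replicas wins, then the result is clamped to [min, max].
    """
    if current <= 0 or not metrics:
        raise ValueError("need a positive replica count and at least one metric")
    wanted = max(math.ceil(current * observed / target)
                 for observed, target in metrics)
    return max(min_replicas, min(max_replicas, wanted))

# api row: 10 pods, CPU at 75% (target 60%), 150 in-flight completions
# per pod (target 200). CPU is the binding metric here.
api_replicas = desired_replicas(10, [(75, 60), (150, 200)], 5, 50)
```

With these sample inputs the CPU metric asks for 13 pods while the in-flight metric asks for 8, so the sketch settles on 13.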
3. Resources
- api: CPU 1000m/4000m, memory 1Gi/4Gi (request/limit).
- local-worker: GPU node (e.g., L4 / A100), 8-16 vCPU, 32Gi memory.
4. Provider Egress
- Dedicated egress NAT per region.
- Provider allowlist.
- Rate-limit outbound per provider (protect from burst DoS).
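
One way to combine the allowlist and the per-provider outbound limit is a token bucket per allowlisted host. This is a minimal sketch, not the gateway's actual implementation: the class names, the host names (`provider-a.example`), and the rate numbers are all hypothetical.

```python
import time
from typing import Callable

class TokenBucket:
    """Classic token bucket: refills at rate_per_s, holds at most burst tokens."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class EgressGuard:
    """Allowlist plus one bucket per provider host (hypothetical design)."""

    def __init__(self, limits: dict[str, tuple[float, int]],
                 clock: Callable[[], float] = time.monotonic):
        # The keys of limits double as the allowlist.
        self.buckets = {host: TokenBucket(rate, burst, clock)
                        for host, (rate, burst) in limits.items()}

    def allow(self, host: str) -> bool:
        bucket = self.buckets.get(host)
        if bucket is None:
            return False  # host is not on the allowlist
        return bucket.try_acquire()
```

The clock is injectable so the refill behaviour can be exercised in tests without sleeping.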
5. Cache
Redis (per region):
- AI output cache
- Rate-limit counters
- Budget counters (primary in Postgres; Redis for hot reads).
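
The split between a Postgres primary and Redis hot reads resembles a read-through cache with a short TTL. The sketch below shows that pattern under stated assumptions: plain Python callables and dicts stand in for Postgres and Redis, and `BudgetReader`, `ttl_s`, and the tenant names are hypothetical.

```python
import time
from typing import Callable

class BudgetReader:
    """Read-through cache: serve hot reads locally, fall back to the primary."""

    def __init__(self, load_from_primary: Callable[[str], int],
                 ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_primary   # stand-in for a Postgres query
        self._ttl = ttl_s
        self._clock = clock
        self._hot: dict[str, tuple[float, int]] = {}  # stand-in for Redis

    def remaining(self, tenant: str) -> int:
        now = self._clock()
        entry = self._hot.get(tenant)
        if entry is not None and entry[0] > now:
            return entry[1]              # fresh cached value
        value = self._load(tenant)       # miss or expired: hit the primary
        self._hot[tenant] = (now + self._ttl, value)
        return value

    def invalidate(self, tenant: str) -> None:
        """Drop the cached value, e.g. after a write to the primary."""
        self._hot.pop(tenant, None)
```

A short TTL bounds how stale a hot read can be; writes that must be visible immediately would call `invalidate`.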
6. Regional
Per region: us, eu, me, ap.
Provider routing respects tenant residency.
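
Residency-aware routing can be sketched as a filter over an ordered provider preference list. Everything specific here is an assumption for illustration: the `PROVIDER_REGIONS` mapping, the provider names, and the `route` function are hypothetical and do not describe which providers actually serve which regions.

```python
REGIONS = ("us", "eu", "me", "ap")

# Hypothetical mapping: where each provider may process tenant data.
PROVIDER_REGIONS: dict[str, set[str]] = {
    "provider-a": {"us", "eu"},
    "provider-b": {"us"},
    "local-worker": {"us", "eu", "me", "ap"},
}

def route(tenant_region: str,
          preference: list[str],
          provider_regions: dict[str, set[str]] = PROVIDER_REGIONS) -> str:
    """Return the first preferred provider allowed in the tenant's region."""
    if tenant_region not in REGIONS:
        raise ValueError(f"unknown region: {tenant_region}")
    for provider in preference:
        if tenant_region in provider_regions.get(provider, set()):
            return provider
    # Fail closed: never fall back to a provider outside the residency boundary.
    raise LookupError(f"no provider satisfies residency for {tenant_region}")
```

Failing closed when no provider qualifies keeps a misconfigured preference list from silently sending data out of region.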
7. Service Mesh
mTLS internal. Egress to providers through dedicated proxy.
8. Release
Blue/green for api. Local workers: drain before replace (GPU scheduling). Prompt changes ship as new prompt versions; no gateway restart.
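
"Drain before replace" for local workers could look like the following: stop accepting new inference jobs, then wait for in-flight jobs to finish before the pod is replaced. This is a sketch of the general pattern, not the service's code; `DrainableWorker` and its method names are hypothetical.

```python
import threading

class DrainableWorker:
    """Tracks in-flight jobs and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._draining = False

    def try_start(self) -> bool:
        """Admit a job unless the worker is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Mark one job done and wake any drain() waiter when idle."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Stop admitting jobs; return True once nothing is in flight."""
        with self._idle:
            self._draining = True
            return self._idle.wait_for(lambda: self._in_flight == 0,
                                       timeout=timeout_s)
```

A shutdown hook would call `drain` with a timeout shorter than the pod's termination grace period, so long-running GPU jobs are not cut off mid-inference.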
9. DR
RPO 5 min (audit outbox + cold archive). RTO 60 min.
10. Diagram
Service SDK (AIClient) ──mTLS──▶ ai-api
                                   │
                                   ├─▶ Postgres (prompts, completions, budgets, embeddings)
                                   ├─▶ Redis (cache, rate-limit)
                                   ├─▶ Safety pipeline (local classifiers)
                                   ├─▶ local-worker (GPU; on-prem or cloud)
                                   └─▶ Provider egress (NAT + allowlist)
                                         │
                                         ├─▶ OpenAI
                                         ├─▶ Anthropic
                                         ├─▶ Google
                                         ├─▶ Azure OpenAI (BAA)
                                         └─▶ Mistral / etc.

Audit firehose ──▶ analytics-service + audit sink (WORM S3)