Skip to main content

Deployment Topology

:::info Source Sourced from services/ai-gateway-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo. :::

1. Containers

  • ai-api — REST + SSE endpoint.
  • ai-local-worker — local model inference (GPU-backed).
  • ai-outbox-relay.
  • ai-eval-worker — runs eval sets nightly + on PR.
  • ai-budget-reaper — resets daily/monthly budgets.
  • ai-audit-archiver — batch archives audit rows to cold S3.

2. Scaling

ContainerMinMaxHPA
api550CPU>60% or in-flight completions > 200/pod
local-worker220 (GPU pool)GPU util > 70%
outbox-relay28backlog > 5000
eval-worker15cron-driven

3. Resources

api: 1000m/4000m, 1Gi/4Gi. local-worker: GPU node (e.g., L4 / A100), 8-16 vCPU, 32Gi.

4. Provider Egress

  • Dedicated egress NAT per region.
  • Provider allowlist.
  • Rate-limit outbound per provider (protect from burst DoS).

5. Cache

Redis (per region):

  • AI output cache
  • Rate-limit counters
  • Budget counters (primary in Postgres; Redis for hot reads).

6. Regional

Per region: us, eu, me, ap. Provider routing respects tenant residency.

7. Service Mesh

mTLS internal. Egress to providers through dedicated proxy.

8. Release

Blue/green for api. Local workers: drain before replace (GPU scheduling). Prompt version deploys versioned — no gateway restart.

9. DR

RPO 5 min (audit outbox + cold archive). RTO 60 min.

10. Diagram

Service SDK (AIClient) ──mTLS──▶ ai-api

├─▶ Postgres (prompts, completions, budgets, embeddings)
├─▶ Redis (cache, rate-limit)
├─▶ Safety pipeline (local classifiers)
├─▶ local-worker (GPU; on-prem or cloud)
└─▶ Provider egress (NAT + allowlist)

├─▶ OpenAI
├─▶ Anthropic
├─▶ Google
├─▶ Azure OpenAI (BAA)
└─▶ Mistral / etc.

Audit firehose ──▶ analytics-service + audit sink (WORM S3)