Deployment Topology

:::info Source
Sourced from services/assignment-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::

Companion: 01 Enterprise Architecture · 15 Observability


1. Deployment Unit

The service ships as a single OCI image with four roles selectable via APP_ROLE:

| Role | Process | Replicas (prod) | HPA trigger |
| --- | --- | --- | --- |
| api | Fastify HTTP API | 4–24 | CPU 60% / RPS |
| worker | Consumers + outbox publisher | 3–12 | Queue lag |
| scheduler | Materializer + sweepers + reminder planner | 2 (leader-elected) | |
| ai-suggest-worker | AI call handler (bulkheaded) | 1–4 | AI queue length |

All roles share the same image and codebase; they differ only in the entry subcommand.
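As an illustration of role selection, the sketch below resolves APP_ROLE to an entry subcommand and fails fast on an unknown value. It is written in Python for brevity. The role names come from the table above; the subcommand strings and the function name `resolve_entrypoint` are hypothetical, and the real image may wire this differently.

```python
import os

# Role names are taken from the roles table; the subcommand strings are
# hypothetical placeholders, not the service's real entry points.
ROLE_SUBCOMMANDS = {
    "api": "serve-api",
    "worker": "run-worker",
    "scheduler": "run-scheduler",
    "ai-suggest-worker": "run-ai-suggest",
}


def resolve_entrypoint(env=None):
    """Return the subcommand for the role named by APP_ROLE.

    Raises ValueError on a missing or unknown role so that a misconfigured
    pod crashes at startup instead of running the wrong process.
    """
    env = os.environ if env is None else env
    role = env.get("APP_ROLE", "").strip()
    if role not in ROLE_SUBCOMMANDS:
        known = ", ".join(sorted(ROLE_SUBCOMMANDS))
        raise ValueError(f"APP_ROLE must be one of: {known}; got {role!r}")
    return ROLE_SUBCOMMANDS[role]
```

Passing a plain dict (for example `resolve_entrypoint({"APP_ROLE": "worker"})`) keeps the function easy to exercise without touching the process environment.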

2. Kubernetes Layout

namespace: assignment

├─ Deployment/assignment-api (4 replicas)
├─ Deployment/assignment-worker (3 replicas)
├─ StatefulSet/assignment-scheduler (2 replicas, leader-elected via lease)
├─ Deployment/assignment-ai-suggest (1 replica)
├─ Service/assignment-api (ClusterIP, exposed via ingress gateway)
├─ HPA/assignment-api
├─ HPA/assignment-worker
├─ PodDisruptionBudget/assignment-api (minAvailable=2)
├─ PodDisruptionBudget/assignment-worker (minAvailable=1)
├─ NetworkPolicy (see §6)
├─ ServiceMonitor (prometheus-operator → SigNoz)
├─ ConfigMap/assignment-config
└─ ExternalSecret/assignment-secrets (ESO → AWS Secrets Manager)
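
One way to read the PodDisruptionBudget entries is as drain headroom: with minAvailable below the replica count, voluntary evictions (node drains, upgrades) can still proceed. The small worked check below uses the replica counts and minAvailable values listed above; the helper name is illustrative, and it assumes every replica is currently healthy.

```python
def allowed_disruptions(replicas: int, min_available: int) -> int:
    """Pods that may be voluntarily evicted at once, assuming all are healthy."""
    return max(replicas - min_available, 0)


# Values from the layout above (baseline replica counts, before HPA scaling).
api_headroom = allowed_disruptions(replicas=4, min_available=2)
worker_headroom = allowed_disruptions(replicas=3, min_available=1)
```

At baseline both roles can lose two pods to a voluntary disruption at the same time; the headroom grows as the HPA adds replicas.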

3. Topology

                    ┌───────────────────────────┐
                    │     API Gateway / WAF     │
                    └─────────────┬─────────────┘
                                  │
                    ┌─────────────▼─────────────┐
                    │   assignment-api (Pods)   │
                    └─┬────────────┬────────────┘
                      │            │
            ┌─────────▼──┐    ┌────▼───────┐
            │  Postgres  │    │   Redis    │
            │  (Aurora)  │    │ (cluster)  │
            └─────────┬──┘    └────────────┘
                      │
                      │
               ┌──────────────┐
               │     NATS     │
               │  JetStream   │
               └──────┬───────┘
                      │
  ┌───────────────────┼───────────────────┐
  │                   │                   │
┌─▼──────────┐  ┌─────▼──────┐   ┌────────▼─────────┐
│   worker   │  │ scheduler  │   │    ai-suggest    │
└────────────┘  └────────────┘   └───────┬──────────┘
                                         │
                               ┌─────────▼─────────┐
                               │  ai-gateway-svc   │
                               └───────────────────┘

4. Environments

| Env | Region | Cluster | Tenant mode | Purpose |
| --- | --- | --- | --- | --- |
| dev | local / us-west | 1x | any | loops & debug |
| ci | ephemeral | 1x | synthetic | CI verification |
| staging | us-east-1 | 1x primary | all tenants shadow + synthetic | soak, perf, pre-GA |
| prod-us | us-east-1 + us-west-2 | active/active | production | US tenants |
| prod-eu | eu-west-1 | active/passive | production | EU tenants (data residency) |
| prod-me | me-central-1 | active/passive | production | MENA tenants |

Traffic is pinned per tenant to its home region via tenant-service metadata.
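
A minimal sketch of the pinning lookup, assuming the tenant-service metadata exposes a home region per tenant. The `TenantMetadata` shape, the `home_region` field name, and the function name are hypothetical; only the environment-to-region mapping is taken from the table above.

```python
from dataclasses import dataclass

# Production environments and their regions, copied from the table above.
ENV_REGIONS = {
    "prod-us": ("us-east-1", "us-west-2"),
    "prod-eu": ("eu-west-1",),
    "prod-me": ("me-central-1",),
}


@dataclass(frozen=True)
class TenantMetadata:
    tenant_id: str
    home_region: str  # hypothetical field name


def environment_for(tenant: TenantMetadata) -> str:
    """Return the production environment that serves the tenant's home region."""
    for env, regions in ENV_REGIONS.items():
        if tenant.home_region in regions:
            return env
    raise LookupError(
        f"tenant {tenant.tenant_id} has unknown home region {tenant.home_region!r}"
    )
```

Raising on an unknown region, rather than falling back to a default environment, avoids silently routing a tenant outside its data-residency boundary.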

5. Scaling Profile

Baseline (p50 tenant):

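| Role | CPU req | Mem req | Limits (CPU / Mem) |
| --- | --- | --- | --- |
| api | 500 m | 512 Mi | 2 / 1 Gi |
| worker | 500 m | 512 Mi | 2 / 1 Gi |
| scheduler | 250 m | 256 Mi | 1 / 512 Mi |
| ai-suggest | 250 m | 256 Mi | 1 / 512 Mi |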

HPA metrics:

  • api: CPU 60%, custom http_requests_in_flight
  • worker: custom nats_consumer_pending > 1000
  • ai-suggest: custom ai_suggest_queue > 10
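
To make the scaling behaviour concrete, the sketch below applies the replica formula documented for the Kubernetes HorizontalPodAutoscaler, desired = ceil(current × metric / target), with its default 10% tolerance, and clamps the result to the replica range from §1. Treating the `> 1000` threshold as the HPA target value is an assumption; the function name is illustrative.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA formula yields, clamped to [min, max]."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: the HPA leaves the count unchanged.
        desired = current
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


# worker: 3 replicas, nats_consumer_pending averaging 2500 against a target
# of 1000, bounded by the 3-12 range from the roles table.
worker_next = desired_replicas(3, 2500, 1000, min_replicas=3, max_replicas=12)
```

In this example the worker Deployment would scale from 3 to 8 replicas; a backlog ten times the target would stop at the maximum of 12.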

6. Network Policy

  • Inbound: only from gateway namespace on 8080; from prometheus on 9464; from peer services via mesh.
  • Outbound: Postgres, Redis, NATS, ai-gateway-service, notification-service, tenant-service, catalog-service only.
  • No internet egress.

7. Secrets & Config

External Secrets Operator syncs from AWS Secrets Manager:

  • POSTGRES_URL
  • REDIS_URL
  • NATS_CREDS
  • INTERNAL_SVC_JWT_SIGNING_KEY
  • OTEL_EXPORTER_OTLP_ENDPOINT
  • AI_GATEWAY_URL, AI_GATEWAY_TOKEN

Config (ConfigMap):

  • RRULE_HORIZON_DAYS=90
  • MATERIALIZER_BATCH_SIZE=1000
  • OVERDUE_SWEEP_INTERVAL=5m
  • CLOSED_MISSED_SWEEP_INTERVAL=15m
  • REMINDER_BATCH_SIZE=500
  • FEATURE_AI_SUGGEST=on|off

Feature flags are served through a LaunchDarkly-compatible SDK (LaunchDarkly, or OpenFeature + Flagd).
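
The ConfigMap values mix integers, durations, and an on/off switch, so they need parsing before use. The sketch below shows one way to do that with strict validation. The key names and example values come from the list above; the function names, the accepted duration units (s, m, h), and the returned dictionary shape are assumptions.

```python
import re
from datetime import timedelta

_DURATION = re.compile(r"^(\d+)([smh])$")
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(value: str) -> timedelta:
    """Parse values such as '5m' or '15m' into a timedelta."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def parse_flag(value: str) -> bool:
    """Parse an 'on' / 'off' switch; anything else is rejected."""
    normalized = value.strip().lower()
    if normalized not in ("on", "off"):
        raise ValueError(f"expected 'on' or 'off', got {value!r}")
    return normalized == "on"


def load_config(env) -> dict:
    """Build a typed config from the ConfigMap keys listed above."""
    return {
        "rrule_horizon_days": int(env["RRULE_HORIZON_DAYS"]),
        "materializer_batch_size": int(env["MATERIALIZER_BATCH_SIZE"]),
        "overdue_sweep_interval": parse_duration(env["OVERDUE_SWEEP_INTERVAL"]),
        "closed_missed_sweep_interval": parse_duration(env["CLOSED_MISSED_SWEEP_INTERVAL"]),
        "reminder_batch_size": int(env["REMINDER_BATCH_SIZE"]),
        "feature_ai_suggest": parse_flag(env["FEATURE_AI_SUGGEST"]),
    }
```

Rejecting malformed values at startup surfaces a bad ConfigMap during rollout rather than at the first sweep.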

8. Release Pipeline

main branch push
↓
Build → Test → Image → Sign (cosign) → SBOM (cyclonedx)
↓
Deploy to staging (canary 10% → 100%)
↓
Soak 48h + synthetic checks
↓
Gated approval (compliance_admin of ops)
↓
Deploy to prod-us (canary 1% → 10% → 50% → 100%)
↓
(t+24h) prod-eu, prod-me same staged rollout

Rollback: kubectl rollout undo. This is safe because the DB schema is backward-compatible by policy, and the outbox and sagas tolerate a rollback because consumers are version-tolerant.
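
The staged rollout can be pictured as a loop over traffic percentages that stops at the first failed health check. The sketch below is a simplified model, not the actual pipeline: the step lists are taken from the stages above, while `run_canary` and the `is_healthy` callback are hypothetical stand-ins for the real traffic shifting and synthetic checks.

```python
# Canary steps (percent of traffic) from the pipeline above.
STAGING_STEPS = (10, 100)
PROD_STEPS = (1, 10, 50, 100)


def run_canary(steps, is_healthy):
    """Walk the canary steps in order.

    Returns (last_healthy_percent, rolled_back). A failed check at any step
    stops the rollout; in the real pipeline that is where
    `kubectl rollout undo` would run.
    """
    last_healthy = 0
    for percent in steps:
        if not is_healthy(percent):
            return last_healthy, True
        last_healthy = percent
    return last_healthy, False
```

For example, a regression that only shows up at 50% of traffic yields `(10, True)`: the rollout got as far as 10% cleanly and was then rolled back.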

9. DR / BCP

  • RPO: 5 min (PITR + NATS JetStream replication).
  • RTO: 30 min (cross-region failover runbook).
  • Quarterly DR drill: restore staging in alternate region from last backup; validate via synthetic tenant.

10. Tenant Sharding Strategy

Each region runs a single logical Postgres cluster; LIST partitioning of compliance_window by tenant gives isolation without multiple databases. If a single tenant exceeds 100M windows, we promote it to its own dedicated cluster via the standby-promotion runbook.

11. Deployment Pre-checks

Automated gate runs before every prod deploy:

  • All DB migrations reversible or forward-compat.
  • Event schema compat check against registered consumers.
  • No freeze-point violation (F25/F26 require RFC before change).
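
The three pre-checks can be combined into a single gate that reports every failure instead of stopping at the first. The sketch below assumes the inputs (migration metadata, the schema compatibility result, the freeze points touched, and the approved RFCs) are produced by earlier pipeline steps; all type and function names are hypothetical.

```python
from dataclasses import dataclass

# Freeze points named in the pre-checks above.
FREEZE_POINTS = {"F25", "F26"}


@dataclass(frozen=True)
class Migration:
    name: str
    reversible: bool
    forward_compatible: bool


def gate_failures(migrations, schema_compatible, touched_freeze_points, approved_rfcs):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for migration in migrations:
        if not (migration.reversible or migration.forward_compatible):
            failures.append(
                f"migration {migration.name} is neither reversible nor forward-compat"
            )
    if not schema_compatible:
        failures.append("event schema is incompatible with a registered consumer")
    for point in sorted(set(touched_freeze_points) & FREEZE_POINTS):
        if point not in approved_rfcs:
            failures.append(f"freeze point {point} changed without an RFC")
    return failures
```

Collecting all failures in one pass gives the release owner the full list to fix before the next attempt.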

12. Resource Budget

Projected p95 load (M5):

  • 100 tenants × 5k windows/month active
  • 500 rps API peak (combined)
  • 5k events/s peak during materializer bursts

Measured fit: 8 api pods × 2 CPU and 6 worker pods × 2 CPU (28 CPU at limits), well under cluster budget.
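
The arithmetic behind these figures, worked through as a short script. All inputs are the numbers quoted in this section; the per-pod rates assume load is spread evenly across pods and that worker pods handle the full event stream, which is a simplification.

```python
API_PODS, WORKER_PODS, CPU_LIMIT_PER_POD = 8, 6, 2
PEAK_RPS, PEAK_EVENTS_PER_S = 500, 5_000
TENANTS, WINDOWS_PER_TENANT_PER_MONTH = 100, 5_000

# CPU footprint at limits: (8 + 6) pods x 2 CPU.
total_cpu_at_limits = (API_PODS + WORKER_PODS) * CPU_LIMIT_PER_POD

# Even-split load per pod at peak.
rps_per_api_pod = PEAK_RPS / API_PODS
events_per_worker_pod = PEAK_EVENTS_PER_S / WORKER_PODS

# Active compliance windows per month across all tenants.
active_windows_per_month = TENANTS * WINDOWS_PER_TENANT_PER_MONTH

print(total_cpu_at_limits, rps_per_api_pod, active_windows_per_month)
```

This gives 28 CPU at limits, 62.5 rps per api pod, roughly 833 events/s per worker pod, and 500,000 active windows per month. Both pod counts sit inside the HPA ranges from §1 (4–24 for api, 3–12 for worker).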