Deployment Topology

:::info Source
Sourced from `services/assignment-service/DEPLOYMENT_TOPOLOGY.md` in the documentation repo.
:::

Companion: 01 Enterprise Architecture · 15 Observability

1. Deployment Unit

The service ships as a single OCI image with four roles selectable via APP_ROLE:

| Role | Process | Replicas (prod) | HPA trigger |
| --- | --- | --- | --- |
| `api` | Fastify HTTP API | 4–24 | CPU 60% / RPS |
| `worker` | Consumers + outbox publisher | 3–12 | Queue lag |
| `scheduler` | Materializer + sweepers + reminder planner | 2 (leader-elected) | none |
| `ai-suggest-worker` | AI call handler (bulkheaded) | 1–4 | AI queue length |

All roles share the same image and codebase; they differ only in the entry subcommand.
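
A minimal sketch of how a single image might dispatch on `APP_ROLE`. The role names come from the table above; `parseRole`, `ENTRYPOINTS`, `startRole`, and the returned descriptions are hypothetical stand-ins for the real entry subcommands, which this document does not show.

```typescript
// Role names are taken from the roles table; everything else is illustrative.
const ROLES = ["api", "worker", "scheduler", "ai-suggest-worker"] as const;
type Role = (typeof ROLES)[number];

// Validate the raw APP_ROLE value and fail fast on anything unknown.
function parseRole(raw: string | undefined): Role {
  const role = ROLES.find((candidate) => candidate === raw);
  if (role === undefined) {
    throw new Error(`APP_ROLE must be one of: ${ROLES.join(", ")} (got ${String(raw)})`);
  }
  return role;
}

// Hypothetical entry points: each returns a description instead of starting a process.
const ENTRYPOINTS: Record<Role, () => string> = {
  api: () => "Fastify HTTP API",
  worker: () => "consumers + outbox publisher",
  scheduler: () => "materializer + sweepers + reminder planner",
  "ai-suggest-worker": () => "AI call handler",
};

// The environment is passed in explicitly so the function is easy to test.
function startRole(env: Record<string, string | undefined>): string {
  return ENTRYPOINTS[parseRole(env["APP_ROLE"])]();
}
```

In a real process the caller would pass `process.env`. Failing at startup on an unknown role keeps a misconfigured Deployment from silently running the wrong workload.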

2. Kubernetes Layout

namespace: assignment
│
├─ Deployment/assignment-api            (4 replicas)
├─ Deployment/assignment-worker         (3 replicas)
├─ StatefulSet/assignment-scheduler     (2 replicas, leader-elected via lease)
├─ Deployment/assignment-ai-suggest     (1 replica)
├─ Service/assignment-api               (ClusterIP, exposed via ingress gateway)
├─ HPA/assignment-api
├─ HPA/assignment-worker
├─ PodDisruptionBudget/assignment-api   (minAvailable=2)
├─ PodDisruptionBudget/assignment-worker (minAvailable=1)
├─ NetworkPolicy                         (see §6)
├─ ServiceMonitor                        (prometheus-operator → SigNoz)
├─ ConfigMap/assignment-config
└─ ExternalSecret/assignment-secrets     (ESO → AWS Secrets Manager)

3. Topology

                     ┌───────────────────────────┐
                     │     API Gateway / WAF     │
                     └─────────────┬─────────────┘
                                   │
                     ┌─────────────▼─────────────┐
                     │   assignment-api (Pods)   │
                     └─┬────────────┬────────────┘
                       │            │
             ┌─────────▼──┐    ┌────▼───────┐
             │ Postgres   │    │   Redis    │
             │  (Aurora)  │    │ (cluster)  │
             └─────────┬──┘    └────────────┘
                       │
                       ▼
               ┌──────────────┐
               │    NATS      │
               │ JetStream    │
               └──────┬───────┘
                      │
  ┌───────────────────┼───────────────────┐
  │                   │                   │
┌─▼──────────┐  ┌─────▼──────┐  ┌────────▼─────────┐
│ worker     │  │ scheduler  │  │ ai-suggest       │
└────────────┘  └────────────┘  └───────┬──────────┘
                                        │
                              ┌─────────▼─────────┐
                              │ ai-gateway-svc    │
                              └───────────────────┘

4. Environments

| Env | Region | Cluster | Tenant mode | Purpose |
| --- | --- | --- | --- | --- |
| `dev` | local / us-west | 1x | any | loops & debug |
| `ci` | ephemeral | 1x | synthetic | CI verification |
| `staging` | us-east-1 | 1x primary | all tenants shadow + synthetic | soak, perf, pre-GA |
| `prod-us` | us-east-1 + us-west-2 | active/active | production | US tenants |
| `prod-eu` | eu-west-1 | active/passive | production | EU tenants (data residency) |
| `prod-me` | me-central-1 | active/passive | production | MENA tenants |

Traffic is pinned per tenant to its home region via tenant-service metadata.
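
One way to picture per-tenant pinning is the sketch below. The region-to-environment mapping comes from the table above; the `TenantMetadata` shape, the `routeRequest` function, and the serve-or-forward decision are assumptions for illustration, since this document does not describe the tenant-service contract or where the routing decision is made.

```typescript
// Region-to-environment mapping from the environments table.
const REGION_ENV = {
  "us-east-1": "prod-us",
  "us-west-2": "prod-us",
  "eu-west-1": "prod-eu",
  "me-central-1": "prod-me",
} as const;

type Region = keyof typeof REGION_ENV;
type ProdEnv = (typeof REGION_ENV)[Region];

// Hypothetical shape of the record returned by tenant-service.
interface TenantMetadata {
  tenantId: string;
  homeRegion: Region;
}

type Routing =
  | { action: "serve-locally"; env: ProdEnv }
  | { action: "forward"; env: ProdEnv; targetRegion: Region };

// Serve the request only in the tenant's home region; otherwise forward it there.
function routeRequest(tenant: TenantMetadata, servingRegion: Region): Routing {
  const env = REGION_ENV[tenant.homeRegion];
  if (tenant.homeRegion === servingRegion) {
    return { action: "serve-locally", env };
  }
  return { action: "forward", env, targetRegion: tenant.homeRegion };
}
```

Under this reading, an EU tenant's request that lands in `us-east-1` is never processed there, which is what keeps the data-residency guarantee for `prod-eu` intact.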

5. Scaling Profile

Baseline (p50 tenant):

| Role | CPU req | Mem req | Limits (CPU / mem) |
| --- | --- | --- | --- |
| api | 500m | 512Mi | 2 / 1Gi |
| worker | 500m | 512Mi | 2 / 1Gi |
| scheduler | 250m | 256Mi | 1 / 512Mi |
| ai-suggest | 250m | 256Mi | 1 / 512Mi |

HPA metrics:

api: CPU 60%, custom http_requests_in_flight
worker: custom nats_consumer_pending > 1000
ai-suggest: custom ai_suggest_queue > 10
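
To make the triggers above concrete, here is a sketch of the standard Kubernetes HPA scaling rule, `desired = ceil(current × observed / target)`, with the default 10% tolerance band and clamping to the replica ranges from section 1. It assumes the custom metrics are evaluated as per-pod averages against the listed thresholds, which this document does not state; `desiredReplicas` and `ReplicaBounds` are illustrative names.

```typescript
interface ReplicaBounds {
  min: number;
  max: number;
}

// Kubernetes HPA rule: scale proportionally to observed/target, skip changes
// inside the tolerance band, then clamp to the configured replica range.
function desiredReplicas(
  current: number,
  observed: number,
  target: number,
  bounds: ReplicaBounds,
  tolerance = 0.1,
): number {
  if (target <= 0) {
    throw new Error("target must be positive");
  }
  const ratio = observed / target;
  const withinTolerance = Math.abs(ratio - 1) <= tolerance;
  const proposed = withinTolerance ? current : Math.ceil(current * ratio);
  return Math.min(bounds.max, Math.max(bounds.min, proposed));
}

// worker: 3 pods, 2500 pending per pod against a target of 1000
const workerReplicas = desiredReplicas(3, 2500, 1000, { min: 3, max: 12 }); // 8

// ai-suggest: 1 pod, queue length 60 against a target of 10, capped at 4
const aiSuggestReplicas = desiredReplicas(1, 60, 10, { min: 1, max: 4 }); // 4
```

The clamp is why the `api` role never drops below 4 replicas even when CPU sits well under the 60% target.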

6. Network Policy

Inbound: only from the gateway namespace on port 8080, from Prometheus on port 9464, and from peer services via the mesh.
Outbound: only to Postgres, Redis, NATS, ai-gateway-service, notification-service, tenant-service, and catalog-service.
No internet egress.

7. Secrets & Config

External Secrets Operator syncs from AWS Secrets Manager:

POSTGRES_URL
REDIS_URL
NATS_CREDS
INTERNAL_SVC_JWT_SIGNING_KEY
OTEL_EXPORTER_OTLP_ENDPOINT
AI_GATEWAY_URL, AI_GATEWAY_TOKEN

Config (ConfigMap):

RRULE_HORIZON_DAYS=90
MATERIALIZER_BATCH_SIZE=1000
OVERDUE_SWEEP_INTERVAL=5m
CLOSED_MISSED_SWEEP_INTERVAL=15m
REMINDER_BATCH_SIZE=500
FEATURE_AI_SUGGEST=on|off
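
The ConfigMap keys above could be parsed and validated at startup along these lines. This is a sketch, not the service's actual loader: `loadConfig`, `AssignmentConfig`, and the helper names are hypothetical, and the duration format (an integer followed by `s`, `m`, or `h`) is inferred from the `5m` and `15m` values.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical typed view of the ConfigMap keys listed above.
interface AssignmentConfig {
  rruleHorizonDays: number;
  materializerBatchSize: number;
  overdueSweepIntervalMs: number;
  closedMissedSweepIntervalMs: number;
  reminderBatchSize: number;
  aiSuggestEnabled: boolean;
}

function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`missing config key: ${key}`);
  }
  return value;
}

function positiveInt(env: Env, key: string): number {
  const raw = required(env, key);
  if (!/^[1-9]\d*$/.test(raw)) {
    throw new Error(`${key} must be a positive integer (got "${raw}")`);
  }
  return Number(raw);
}

const UNIT_MS: Record<string, number> = { s: 1000, m: 60000, h: 3600000 };

// Parses durations such as "5m" into milliseconds (assumed format).
function durationMs(env: Env, key: string): number {
  const raw = required(env, key);
  const unitMs = UNIT_MS[raw.slice(-1)];
  if (!/^\d+[smh]$/.test(raw) || unitMs === undefined) {
    throw new Error(`${key} must look like 30s, 5m or 1h (got "${raw}")`);
  }
  return Number(raw.slice(0, -1)) * unitMs;
}

function onOff(env: Env, key: string): boolean {
  const raw = required(env, key);
  if (raw !== "on" && raw !== "off") {
    throw new Error(`${key} must be "on" or "off" (got "${raw}")`);
  }
  return raw === "on";
}

function loadConfig(env: Env): AssignmentConfig {
  return {
    rruleHorizonDays: positiveInt(env, "RRULE_HORIZON_DAYS"),
    materializerBatchSize: positiveInt(env, "MATERIALIZER_BATCH_SIZE"),
    overdueSweepIntervalMs: durationMs(env, "OVERDUE_SWEEP_INTERVAL"),
    closedMissedSweepIntervalMs: durationMs(env, "CLOSED_MISSED_SWEEP_INTERVAL"),
    reminderBatchSize: positiveInt(env, "REMINDER_BATCH_SIZE"),
    aiSuggestEnabled: onOff(env, "FEATURE_AI_SUGGEST"),
  };
}
```

Validating everything in one place means a bad value fails the pod at boot, rather than surfacing later inside a sweeper or the materializer.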

Feature flags are read through a LaunchDarkly-compatible SDK (LaunchDarkly, or OpenFeature with flagd).

8. Release Pipeline

main branch push
   ↓
Build → Test → Image → Sign (cosign) → SBOM (cyclonedx)
   ↓
Deploy to staging (canary 10% → 100%)
   ↓
Soak 48h + synthetic checks
   ↓
Gated approval (compliance_admin of ops)
   ↓
Deploy to prod-us (canary 1% → 10% → 50% → 100%)
   ↓
(t+24h) prod-eu, prod-me same staged rollout

Rollback: `kubectl rollout undo`. This is safe because the DB schema is backward-compatible by policy, and the outbox and sagas are unaffected because consumers are version-tolerant.

9. DR / BCP

RPO: 5 min (PITR + NATS JetStream replication).
RTO: 30 min (cross-region failover runbook).
Quarterly DR drill: restore staging in alternate region from last backup; validate via synthetic tenant.

10. Tenant Sharding Strategy

Each region runs a single logical Postgres cluster; LIST partitioning of compliance_window by tenant gives isolation without multiple databases. If a single tenant exceeds 100M windows, it is promoted to its own dedicated cluster via the standby-promotion runbook.
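
As a rough illustration of the strategy above, the sketch below generates the Postgres DDL for one tenant's LIST partition and checks the promotion threshold. It assumes the parent table is declared with `PARTITION BY LIST` on a tenant identifier column and that tenant IDs are short lowercase slugs; the partition naming scheme and both function names are hypothetical.

```typescript
// Promotion threshold from the text: more than 100M windows for one tenant.
const DEDICATED_CLUSTER_THRESHOLD = 100_000_000;

// Builds the DDL for one tenant's partition of compliance_window.
// Assumes the parent is partitioned BY LIST on the tenant identifier column.
function partitionDdl(tenantId: string): string {
  // Restrict to a safe character set so the ID can be embedded without quoting.
  if (!/^[a-z0-9_]{1,40}$/.test(tenantId)) {
    throw new Error(`unsupported tenant id: ${tenantId}`);
  }
  return (
    `CREATE TABLE compliance_window_${tenantId} ` +
    `PARTITION OF compliance_window FOR VALUES IN ('${tenantId}');`
  );
}

// True once a tenant has outgrown the shared regional cluster.
function needsDedicatedCluster(windowCount: number): boolean {
  return windowCount > DEDICATED_CLUSTER_THRESHOLD;
}
```

Because every tenant's rows live in their own partition, moving a large tenant to a dedicated cluster affects one partition rather than requiring a filtered copy of a shared table.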

11. Deployment Pre-checks

Automated gate runs before every prod deploy:

All DB migrations reversible or forward-compat.
Event schema compat check against registered consumers.
No freeze-point violation (F25/F26 require RFC before change).
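
The event schema check in the list above could work roughly as follows: for each registered consumer, confirm that every field it reads is still published by the new build. This is only one plausible reading; the `ConsumerContract` and `EventSchemas` shapes and the `findBreakingChanges` function are assumptions, not the actual gate.

```typescript
// Hypothetical registry entry: the fields one consumer reads from one event type.
interface ConsumerContract {
  consumer: string;
  eventType: string;
  requiredFields: string[];
}

// Event type -> fields published by the build being deployed (hypothetical shape).
type EventSchemas = Record<string, string[]>;

// Returns one message per field that a consumer needs but the new build drops.
function findBreakingChanges(
  schemas: EventSchemas,
  contracts: ConsumerContract[],
): string[] {
  const violations: string[] = [];
  for (const contract of contracts) {
    const published = new Set(schemas[contract.eventType] ?? []);
    for (const field of contract.requiredFields) {
      if (!published.has(field)) {
        violations.push(
          `${contract.consumer}: ${contract.eventType}.${field} is no longer published`,
        );
      }
    }
  }
  return violations;
}
```

The gate would pass only when the returned list is empty. Adding fields is always allowed under this rule, which matches the expectation that consumers are version-tolerant.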

12. Resource Budget

Projected p95 load (M5):

100 tenants × 5k windows/month active
500 rps API peak (combined)
5k events/s peak during materializer bursts

Measured fit: 8 api pods × 2 CPU, 6 worker pods × 2 CPU — well under cluster budget.
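
The arithmetic behind these figures can be checked with a short sketch. It assumes load is spread evenly across pods, that all events are handled by the worker role, and that "2 CPU" refers to the per-pod CPU limit from section 5; the object and function names are illustrative.

```typescript
// Projected load and measured fit, copied from the figures above.
const projectedLoad = {
  tenants: 100,
  windowsPerTenantPerMonth: 5000,
  apiPeakRps: 500,
  eventsPerSecondPeak: 5000,
};

const measuredFit = { apiPods: 8, workerPods: 6, cpuPerPod: 2 };

// Derives totals and per-pod load, assuming an even spread across pods.
function summarizeBudget(load: typeof projectedLoad, fit: typeof measuredFit) {
  return {
    windowsPerMonth: load.tenants * load.windowsPerTenantPerMonth,
    totalCpu: (fit.apiPods + fit.workerPods) * fit.cpuPerPod,
    rpsPerApiPod: load.apiPeakRps / fit.apiPods,
    eventsPerWorkerPod: load.eventsPerSecondPeak / fit.workerPods,
  };
}

const budget = summarizeBudget(projectedLoad, measuredFit);
// budget.windowsPerMonth is 500000, budget.totalCpu is 28, budget.rpsPerApiPod is 62.5
```

At peak this works out to about 62.5 requests per second per api pod and roughly 833 events per second per worker pod, with 28 CPUs in total at the limit.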
