Deployment Topology

:::info Source
Sourced from services/delivery-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::

Companion: 01 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY

1. Runtime

| Component | Choice | Rationale |
| --- | --- | --- |
| Language | TypeScript | Platform standard |
| Framework | NestJS | Platform standard (controllers, modules, DI) |
| Node runtime | Node.js 22 LTS | Platform standard |
| Container | OCI image via Buildpacks / Dockerfile | Reproducible builds |
| Orchestration | Kubernetes | Platform standard |
| Helm chart | charts/delivery-service | Per-service chart with values per environment |

2. Infrastructure Dependencies

| Dependency | Managed By | Connection |
| --- | --- | --- |
| PostgreSQL 16 | AWS RDS / Cloud SQL | Private network; TLS required |
| Redis 7 | AWS ElastiCache / Memorystore | Private network; TLS required |
| NATS JetStream | Self-hosted on K8s (operator) | Multi-AZ; 5 replicas |
| S3/R2 | AWS S3 / Cloudflare R2 | For PlayPackage bundle signed URLs (read-only access) |
| KMS | AWS KMS / Vault | JWT verification, secret decryption |

3. Kubernetes Topology

┌──────────────────────────────────────────────────────┐
│  delivery-service                                    │
│                                                      │
│  Deployment: delivery-api                            │
│    - Replicas: 6 (min) -> 30 (max via HPA)           │
│    - Resources: 1 CPU / 1.5Gi RAM per pod            │
│    - Probes: liveness, readiness, startup            │
│                                                      │
│  Deployment: delivery-outbox-relay                   │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Deployment: delivery-event-projector                │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Service: delivery-service (ClusterIP)               │
│    - Port: 8080 (HTTP)                               │
│    - Port: 9090 (metrics)                            │
│                                                      │
│  Ingress via platform gateway (Istio / Traefik)      │
│    - TLS termination at edge                         │
│    - Istio sidecar for internal mTLS                 │
└──────────────────────────────────────────────────────┘

4. Scaling

4.1 Horizontal Pod Autoscaler

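apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: delivery-api
spec:
  # Required by the HPA API; targets the delivery-api Deployment from section 3.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: delivery-api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    # Scale out when average CPU utilization exceeds 60% of the request.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale out when pods average more than 200 requests per second.
    - type: Pods
      pods:
        metric:
          name: delivery_http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"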
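
To make the interaction between the two metrics concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes HPA: each metric proposes `ceil(currentReplicas * current / target)`, the largest proposal wins, and the result is clamped to the min/max bounds. It deliberately ignores the controller's tolerance band, stabilization windows, and unready pods, and the sample readings are invented for illustration.

```typescript
// Simplified model of the HPA replica calculation for the manifest above.
// Assumption: tolerance, stabilization windows and pod readiness are ignored.

interface MetricSample {
  name: string;
  current: number; // observed per-pod average (CPU % of request, or req/s)
  target: number; // target value from the HPA spec
}

const MIN_REPLICAS = 6;
const MAX_REPLICAS = 30;

function desiredReplicas(currentReplicas: number, metrics: MetricSample[]): number {
  // Each metric proposes its own replica count; the HPA acts on the largest.
  const proposals = metrics.map((m) =>
    Math.ceil((currentReplicas * m.current) / m.target),
  );
  const wanted = Math.max(...proposals);
  return Math.min(MAX_REPLICAS, Math.max(MIN_REPLICAS, wanted));
}

// Hypothetical reading: 6 pods at 90% CPU and 150 req/s each.
// CPU proposes ceil(6 * 90 / 60) = 9, request rate proposes ceil(6 * 150 / 200) = 5.
const next = desiredReplicas(6, [
  { name: "cpu", current: 90, target: 60 },
  { name: "delivery_http_requests_per_second", current: 150, target: 200 },
]);
console.log(next); // 9
```

Because the larger proposal wins, either CPU pressure or request volume alone is enough to trigger a scale-out.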

4.2 Database Scaling

  • Primary instance: 16 vCPU / 64Gi RAM (prod)
  • Read replicas: 2 (for analytics queries and replica reads of session state)
  • Connection pooling: PgBouncer sidecar in transaction mode
  • Max connections: 500 per writer, 200 per reader
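
One way to sanity-check the connection limits against the pod counts is a simple budget calculation. The sketch below assumes that every pod of all three deployments runs its own PgBouncer sidecar against the writer, and that a hypothetical 20 connections are held back for migrations and admin access; neither assumption is stated in this document.

```typescript
// Rough per-pod PgBouncer pool budget against the writer.
// Assumptions (not from the source): every pod connects to the writer,
// and `reserved` connections are kept free for migrations and admin use.

function perPodPoolSize(
  maxServerConnections: number,
  reserved: number,
  maxPods: number,
): number {
  if (maxPods <= 0) {
    throw new Error("maxPods must be positive");
  }
  return Math.floor((maxServerConnections - reserved) / maxPods);
}

// Worst case: delivery-api at its HPA maximum plus the two fixed-size workers.
const maxPods = 30 + 3 + 3;
const writerPool = perPodPoolSize(500, 20, maxPods);
console.log(writerPool); // 13
```

Sizing each sidecar's server-side pool at or below this figure keeps the writer under its 500-connection limit even when the HPA is fully scaled out.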

4.3 NATS Scaling

  • 5-node JetStream cluster, multi-AZ
  • Stream DELIVERY: 3 replicas, 30-day retention, 1TB max size
  • Durable consumers per projector

5. Pod Spec

spec:
  containers:
    - name: delivery-api
      image: ghasi/delivery-service:<sha>
      ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
      env:
        - name: NODE_ENV
          value: production
        - name: LOG_LEVEL
          value: info
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.observability:4317
      envFrom:
        - secretRef:
            name: delivery-secrets
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        timeoutSeconds: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 2
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1000m"
          memory: "1.5Gi"
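
The probe settings translate into approximate time budgets, which the following sketch works out. It multiplies `periodSeconds` by `failureThreshold`, falling back to the Kubernetes default threshold of 3 where the spec omits it. The figures are approximations: probe timeouts and scheduling jitter are not modeled.

```typescript
// Approximate time a probe must keep failing before Kubernetes acts on it.
// Assumption: timeoutSeconds and scheduling jitter are ignored.

interface Probe {
  periodSeconds: number;
  failureThreshold?: number; // Kubernetes defaults this to 3 when omitted
}

const DEFAULT_FAILURE_THRESHOLD = 3;

function failureWindowSeconds(probe: Probe): number {
  return probe.periodSeconds * (probe.failureThreshold ?? DEFAULT_FAILURE_THRESHOLD);
}

const startup = failureWindowSeconds({ periodSeconds: 2, failureThreshold: 30 });
const liveness = failureWindowSeconds({ periodSeconds: 15 });
const readiness = failureWindowSeconds({ periodSeconds: 5 });

console.log(startup); // 60: the container has about a minute to boot
console.log(liveness); // 45: roughly how long a hung pod lives before restart
console.log(readiness); // 15: roughly how long before a pod is pulled from rotation
```

The startup probe therefore gives the NestJS process about 60 seconds to come up before liveness checks begin to apply.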

6. Environments

| Environment | Purpose | Tenant Data |
| --- | --- | --- |
| dev | Local + ephemeral cluster | Synthetic only |
| staging | Pre-prod integration | Synthetic + opt-in real tenants (UAT) |
| prod-us | US production | Real tenants, US region |
| prod-eu | EU production | Real tenants, EU region (data residency) |
| prod-me | ME production | Real tenants, Middle East region |
| prod-ap | APAC production | Real tenants, APAC region |

Each region is fully independent (no cross-region writes). Each tenant selects its home region at creation.
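
The no-cross-region-writes rule can be pictured as a guard evaluated before any write. The sketch below is illustrative only: the `Tenant` shape, the `checkWrite` helper, and the idea of returning the home region to the caller are hypothetical and not taken from the service code.

```typescript
// Illustrative guard for the "no cross-region writes" rule.
// All names here (Tenant, checkWrite, WriteDecision) are hypothetical.

type Region = "prod-us" | "prod-eu" | "prod-me" | "prod-ap";

interface Tenant {
  id: string;
  homeRegion: Region; // chosen once, at tenant creation
}

type WriteDecision =
  | { allowed: true }
  | { allowed: false; homeRegion: Region };

function checkWrite(tenant: Tenant, servingRegion: Region): WriteDecision {
  if (tenant.homeRegion === servingRegion) {
    return { allowed: true };
  }
  // Refuse the write and report where the tenant's data actually lives.
  return { allowed: false, homeRegion: tenant.homeRegion };
}

const acme: Tenant = { id: "tenant-acme", homeRegion: "prod-eu" };
const decision = checkWrite(acme, "prod-us");
```

Here a request for an EU-homed tenant that reaches the US region is refused rather than written, which is what keeps data residency intact.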

7. Deployment Process

7.1 CI/CD Pipeline

┌────────────────────────────────────────────────────────┐
│  1. PR opened -> CI runs:                              │
│     - Lint, type check, unit + integration tests       │
│     - Contract tests, SAST, dependency scan            │
│     - Two-tenant simulator                             │
│                                                        │
│  2. Merge to main -> build + push image                │
│                                                        │
│  3. Deploy to staging (automatic)                      │
│     - Run smoke tests                                  │
│     - Run E2E suite                                    │
│                                                        │
│  4. Deploy to prod-us via manual approval              │
│     - Canary: 5% traffic for 30 min                    │
│     - Monitor SLO burn rate                            │
│     - Auto-rollback on elevated error rate             │
│     - Progressive: 25%, 50%, 100%                      │
│                                                        │
│  5. Deploy to other prod regions                       │
│     - Staggered rollout, 1 region per day              │
└────────────────────────────────────────────────────────┘
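
The canary progression in step 4 can be read as a small state machine. The sketch below models it under stated assumptions: the traffic weights come from the pipeline above, but the `maxErrorRate` threshold is a hypothetical parameter, since this document does not define what counts as an elevated error rate. In practice the decision is made by the rollout controller, not by application code.

```typescript
// Toy model of the canary progression: 5% -> 25% -> 50% -> 100%.
// Assumption: "elevated error rate" is reduced to a single threshold.

const CANARY_STEPS: readonly number[] = [5, 25, 50, 100];

interface Decision {
  action: "promote" | "rollback" | "complete";
  weight: number; // traffic percentage for the new version after the decision
}

function nextCanaryStep(
  currentWeight: number,
  errorRate: number,
  maxErrorRate: number,
): Decision {
  if (errorRate > maxErrorRate) {
    // Send all traffic back to the previous version.
    return { action: "rollback", weight: 0 };
  }
  const index = CANARY_STEPS.indexOf(currentWeight);
  if (index === -1) {
    throw new Error(`unknown canary weight: ${currentWeight}`);
  }
  const next = CANARY_STEPS[index + 1];
  if (next === undefined) {
    return { action: "complete", weight: 100 };
  }
  return { action: "promote", weight: next };
}

// Healthy canary at 5% moves to 25%; an unhealthy one at 25% rolls back.
const healthy = nextCanaryStep(5, 0.001, 0.01);
const unhealthy = nextCanaryStep(25, 0.05, 0.01);
```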

7.2 Deployment Tool

  • Argo CD for GitOps deployment
  • Argo Rollouts for canary + blue/green
  • Flagger as an optional alternative to Argo Rollouts

8. Rollback Strategy

  • Argo Rollouts keeps the previous ReplicaSet available.
  • Rollback by re-pinning to the previous image tag (a single command).
  • Database migrations are forward-only; a backward-incompatible change must ship as paired forward migrations rather than rely on a down migration.
  • Event schema changes follow dual-publish pattern (04 Event-Driven §12).

9. Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: delivery-service
spec:
  podSelector:
    matchLabels:
      app: delivery-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: gateway
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              name: observability
      ports:
        - port: 9090
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: data
      ports:
        - port: 5432 # Postgres
        - port: 6379 # Redis
        - port: 4222 # NATS
    - to:
        - namespaceSelector:
            matchLabels:
              name: ghasi-services
      # Allow internal service calls
10. Disaster Recovery

| Scenario | RTO | RPO | Strategy |
| --- | --- | --- | --- |
| Pod crash | < 1 min | 0 | K8s auto-restart |
| Node failure | < 5 min | 0 | Pod rescheduled |
| AZ failure | < 15 min | 0 | Multi-AZ deployment |
| Region failure | < 4 hours | < 5 min | Standby region promotion; tenant data-residency permitting |
| Database corruption | < 1 hour | < 5 min | PITR from continuous backups |
| Full platform loss | < 24 hours | < 1 hour | Restore from cross-region backup |

11. Release Cadence

  • Release cycle: Weekly (prod deploy window every Tuesday)
  • Hotfix window: 24/7 on-call coverage for P1/P2
  • Feature flag gating: All new features behind LaunchDarkly / OpenFeature flags; rolled out per tenant
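
Per-tenant gating amounts to evaluating a flag with the tenant as the targeting context. The sketch below uses a hypothetical in-memory table to show the shape of that decision; in the real service the lookup would be delegated to the flag provider (LaunchDarkly or an OpenFeature provider), and the flag key and tenant IDs here are invented.

```typescript
// In-memory stand-in for a per-tenant feature flag lookup.
// The flag key, tenant IDs, and table layout are hypothetical.

interface TenantFlag {
  defaultValue: boolean;
  enabledTenants: string[];
}

const flags: Record<string, TenantFlag> = {
  "example-new-feature": { defaultValue: false, enabledTenants: ["tenant-a"] },
};

function isEnabled(flagKey: string, tenantId: string): boolean {
  const flag = flags[flagKey];
  if (flag === undefined) {
    return false; // unknown flags stay off
  }
  if (flag.enabledTenants.indexOf(tenantId) !== -1) {
    return true;
  }
  return flag.defaultValue;
}

const forTenantA = isEnabled("example-new-feature", "tenant-a"); // true
const forTenantB = isEnabled("example-new-feature", "tenant-b"); // false
```

Defaulting unknown flags to off means a deployment can ship code for a feature before any tenant is opted in.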

12. Resource Sizing (Reference)

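| Environment | Replicas | CPU (per pod) | RAM (per pod) | DB size | Redis size | NATS size |
| --- | --- | --- | --- | --- | --- | --- |
| dev | 2 | 250m | 512Mi | 5 GB | 1 GB | 10 GB |
| staging | 3 | 500m | 1Gi | 50 GB | 2 GB | 50 GB |
| prod (small region) | 6 | 1 CPU | 1.5Gi | 500 GB | 8 GB | 500 GB |
| prod (large region) | 12 | 1 CPU | 1.5Gi | 2 TB | 16 GB | 2 TB |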