Deployment Topology
:::info Source
Sourced from services/delivery-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::
Companion: 01 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY
1. Runtime
| Component | Choice | Rationale |
|---|---|---|
| Language | TypeScript | Platform standard |
| Framework | NestJS | Platform standard (controllers, modules, DI) |
| Node runtime | Node.js 22 LTS | Platform standard |
| Container | OCI image via Buildpacks / Dockerfile | Reproducible builds |
| Orchestration | Kubernetes | Platform standard |
| Helm chart | charts/delivery-service | Per-service chart with values per environment |
2. Infrastructure Dependencies
| Dependency | Managed By | Connection |
|---|---|---|
| PostgreSQL 16 | AWS RDS / Cloud SQL | Private network; TLS required |
| Redis 7 | AWS ElastiCache / Memorystore | Private network; TLS required |
| NATS JetStream | Self-hosted on K8s (operator) | Multi-AZ; 5-node cluster |
| S3/R2 | AWS S3 / Cloudflare R2 | For PlayPackage bundle signed URLs (read-only access) |
| KMS | AWS KMS / Vault | JWT verification, secret decryption |
3. Kubernetes Topology
┌──────────────────────────────────────────────────────┐
│  delivery-service                                    │
│                                                      │
│  Deployment: delivery-api                            │
│    - Replicas: 6 (min) -> 30 (max via HPA)           │
│    - Resources: 1 CPU / 1.5Gi RAM per pod            │
│    - Probes: liveness, readiness, startup            │
│                                                      │
│  Deployment: delivery-outbox-relay                   │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Deployment: delivery-event-projector                │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Service: delivery-service (ClusterIP)               │
│    - Port: 8080 (HTTP)                               │
│    - Port: 9090 (metrics)                            │
│                                                      │
│  Ingress via platform gateway (Istio / Traefik)      │
│    - TLS termination at edge                         │
│    - Istio sidecar for mTLS internal                 │
└──────────────────────────────────────────────────────┘
4. Scaling
4.1 Horizontal Pod Autoscaler
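apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: delivery-api
spec:
  # scaleTargetRef is required by the autoscaling/v2 API; it points at the
  # delivery-api Deployment listed in the Kubernetes Topology section.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: delivery-api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    # Scale on average CPU utilization across pods
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale on per-pod request rate (custom metric)
    - type: Pods
      pods:
        metric:
          name: delivery_http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"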
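To make the interaction of the two metrics concrete, here is a minimal sketch of the replica calculation Kubernetes documents for the HPA: each metric proposes `ceil(currentReplicas × current / target)`, the largest proposal wins, and the result is clamped to the min/max bounds. It ignores stabilization windows and pod-readiness handling, and the function and type names are hypothetical, not part of the service.

```typescript
// Sketch of the documented HPA scaling rule; not the controller's actual code.
interface MetricSample {
  current: number; // observed per-pod average (CPU utilization % or requests/s)
  target: number;  // configured target (60 for CPU, 200 for requests/s)
}

const MIN_REPLICAS = 6;
const MAX_REPLICAS = 30;
const TOLERANCE = 0.1; // Kubernetes default: no change when within 10% of target

function desiredReplicas(currentReplicas: number, metrics: MetricSample[]): number {
  // Each metric proposes a replica count independently.
  const proposals = metrics.map(({ current, target }) => {
    const ratio = current / target;
    if (Math.abs(ratio - 1) <= TOLERANCE) return currentReplicas;
    return Math.ceil(currentReplicas * ratio);
  });
  // The largest proposal wins, then the min/max bounds are applied.
  const wanted =
    proposals.length > 0 ? proposals.reduce((a, b) => Math.max(a, b)) : currentReplicas;
  return Math.min(MAX_REPLICAS, Math.max(MIN_REPLICAS, wanted));
}
```

For example, 6 pods at 90% CPU and 150 requests/s per pod yield proposals of 9 and 5, so the sketch returns 9.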
4.2 Database Scaling
- Primary instance: 16 vCPU / 64Gi RAM (prod)
- Read replicas: 2 (serving analytics queries and replica reads of session state)
- Connection pooling: PgBouncer sidecar in transaction mode
- Max connections: 500 per writer, 200 per reader
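The connection cap constrains how large each PgBouncer sidecar's server-side pool can be at peak pod count. The arithmetic can be sketched as below, assuming all three deployments connect to the writer, the budget is split evenly per pod, and 20 connections are held back for migrations and admin access. Those three assumptions and all names are illustrative, not taken from the service configuration.

```typescript
// Hypothetical sizing helper: divides the writer's connection budget across pods.
interface Workload {
  name: string;
  maxPods: number; // pod count at maximum scale
}

function perPodPoolSize(maxConnections: number, reserved: number, workloads: Workload[]): number {
  const totalPods = workloads.reduce((sum, w) => sum + w.maxPods, 0);
  if (totalPods === 0) throw new Error("no workloads defined");
  // Round down so the sum of all pools never exceeds the cap.
  return Math.floor((maxConnections - reserved) / totalPods);
}

// Writer budget: 500 connections, 20 reserved (assumed), 36 pods at peak.
const writerPoolSize = perPodPoolSize(500, 20, [
  { name: "delivery-api", maxPods: 30 },
  { name: "delivery-outbox-relay", maxPods: 3 },
  { name: "delivery-event-projector", maxPods: 3 },
]);
```

Under these assumptions each sidecar could hold at most 13 server connections to the writer.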
4.3 NATS Scaling
- 5-node JetStream cluster, multi-AZ
- Stream DELIVERY: 3 replicas, 30-day retention, 1 TB max size
- Durable consumers per projector
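The stream limits above translate into JetStream stream settings roughly as follows. This is a sketch only: the field names mirror the JetStream stream configuration (`num_replicas`, `max_age` in nanoseconds, `max_bytes`), while the subject filter, file storage, limits-based retention, and the decimal reading of "1TB" are assumptions not stated in the source.

```typescript
// Sketch of the DELIVERY stream settings as a plain object (no NATS client involved).
interface StreamSettings {
  name: string;
  subjects: string[];
  num_replicas: number;
  max_age: number;   // nanoseconds
  max_bytes: number; // bytes
  storage: "file" | "memory";
  retention: "limits" | "interest" | "workqueue";
}

const NANOS_PER_DAY = 24 * 60 * 60 * 1e9;

function deliveryStream(clusterNodes: number): StreamSettings {
  const settings: StreamSettings = {
    name: "DELIVERY",
    subjects: ["delivery.>"], // hypothetical subject filter
    num_replicas: 3,
    max_age: 30 * NANOS_PER_DAY, // 30-day retention
    max_bytes: 1e12,             // 1 TB, read as 10^12 bytes (assumption)
    storage: "file",             // assumption
    retention: "limits",         // assumption
  };
  // A stream cannot have more replicas than there are cluster nodes.
  if (settings.num_replicas > clusterNodes) {
    throw new Error("num_replicas exceeds cluster size");
  }
  return settings;
}
```

With the 5-node cluster, a 3-replica stream keeps quorum through the loss of one replica.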
5. Pod Spec
spec:
  containers:
    - name: delivery-api
      image: ghasi/delivery-service:<sha>
      ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
      env:
        - name: NODE_ENV
          value: production
        - name: LOG_LEVEL
          value: info
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.observability:4317
      envFrom:
        - secretRef:
            name: delivery-secrets
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        timeoutSeconds: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 2
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1000m"
          memory: "1.5Gi"
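The probe settings imply approximate time budgets before Kubernetes acts on a failing container. A small worked sketch, assuming the Kubernetes default `failureThreshold` of 3 where the spec omits it and ignoring per-request timeouts (the helper name is hypothetical):

```typescript
// Approximate time until a probe is considered failed: period × failure threshold.
interface Probe {
  periodSeconds: number;
  failureThreshold?: number; // Kubernetes defaults this to 3 when omitted
}

const DEFAULT_FAILURE_THRESHOLD = 3;

function failureBudgetSeconds(probe: Probe): number {
  return probe.periodSeconds * (probe.failureThreshold ?? DEFAULT_FAILURE_THRESHOLD);
}

// Values taken from the pod spec above.
const startupBudget = failureBudgetSeconds({ periodSeconds: 2, failureThreshold: 30 });
const livenessBudget = failureBudgetSeconds({ periodSeconds: 15 });
const readinessBudget = failureBudgetSeconds({ periodSeconds: 5 });
```

So the container has roughly 60 seconds to start, a hung process is restarted after about 45 seconds, and an unready pod leaves the Service endpoints after about 15 seconds.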
6. Environments
| Environment | Purpose | Tenant Data |
|---|---|---|
| dev | Local + ephemeral cluster | Synthetic only |
| staging | Pre-prod integration | Synthetic + opt-in real tenants (UAT) |
| prod-us | US production | Real tenants, US region |
| prod-eu | EU production | Real tenants, EU region (data residency) |
| prod-me | ME production | Real tenants, Middle East region |
| prod-ap | APAC production | Real tenants, APAC region |
Each region is fully independent (no cross-region writes). Tenants select their home region at creation.
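The home-region rule can be illustrated with a small guard that maps a tenant to its production environment and rejects writes served from any other region. This is a sketch under assumed names (`Tenant`, `environmentFor`, `assertLocalWrite`); the source does not describe how the service enforces the rule.

```typescript
// Hypothetical tenant-to-region mapping; region codes follow the environment names above.
type Region = "us" | "eu" | "me" | "ap";

interface Tenant {
  id: string;
  homeRegion: Region; // chosen once, at tenant creation
}

function environmentFor(tenant: Tenant): string {
  return `prod-${tenant.homeRegion}`;
}

// Rejects a write when the serving region is not the tenant's home region.
function assertLocalWrite(tenant: Tenant, servingRegion: Region): void {
  if (tenant.homeRegion !== servingRegion) {
    throw new Error(
      `tenant ${tenant.id} belongs to ${environmentFor(tenant)}; cross-region writes are not allowed`
    );
  }
}
```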
7. Deployment Process
7.1 CI/CD Pipeline
┌────────────────────────────────────────────────────────┐
│  1. PR opened -> CI runs:                              │
│     - Lint, type check, unit + integration tests       │
│     - Contract tests, SAST, dependency scan            │
│     - Two-tenant simulator                             │
│                                                        │
│  2. Merge to main -> build + push image                │
│                                                        │
│  3. Deploy to staging (automatic)                      │
│     - Run smoke tests                                  │
│     - Run E2E suite                                    │
│                                                        │
│  4. Deploy to prod-us via manual approval              │
│     - Canary: 5% traffic for 30 min                    │
│     - Monitor SLO burn rate                            │
│     - Auto-rollback on elevated error rate             │
│     - Progressive: 25%, 50%, 100%                      │
│                                                        │
│  5. Deploy to other prod regions                       │
│     - Staggered rollout, 1 region per day              │
└────────────────────────────────────────────────────────┘
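Step 4 of the pipeline amounts to a small decision loop. The sketch below models it as a pure function: roll back on an elevated error rate, hold until the soak time has passed, otherwise promote to the next weight. The 1% error threshold and the reuse of the 30-minute soak for the 25% and 50% steps are assumptions; the source only specifies the soak for the 5% canary, and in practice Argo Rollouts would drive this from its own analysis configuration.

```typescript
// Traffic weights from the pipeline description: 5% canary, then 25%, 50%, 100%.
const STEPS: number[] = [5, 25, 50, 100];

interface CanaryObservation {
  weight: number;          // current canary traffic share in percent
  minutesAtWeight: number; // time spent at the current weight
  errorRate: number;       // observed error ratio, 0..1
}

interface CanaryDecision {
  action: "hold" | "promote" | "rollback" | "done";
  weight: number; // traffic share to apply next
}

function decide(obs: CanaryObservation, maxErrorRate = 0.01, soakMinutes = 30): CanaryDecision {
  const index = STEPS.indexOf(obs.weight);
  if (index === -1) throw new Error(`unknown canary weight: ${obs.weight}`);
  // Elevated error rate always wins: send all traffic back to the stable version.
  if (obs.errorRate > maxErrorRate) return { action: "rollback", weight: 0 };
  // Stay at the current weight until the soak period has elapsed.
  if (obs.minutesAtWeight < soakMinutes) return { action: "hold", weight: obs.weight };
  const next = STEPS[index + 1];
  if (next === undefined) return { action: "done", weight: obs.weight };
  return { action: "promote", weight: next };
}
```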
7.2 Deployment Tool
- Argo CD for GitOps deployment
- Argo Rollouts for canary + blue/green
- Flagger as an optional alternative to Argo Rollouts
8. Rollback Strategy
- Argo Rollouts retains the previous ReplicaSet.
- Roll back by re-pinning to the previous image tag (a single command).
- Database migrations are forward-only; incompatible migrations require paired forward migrations.
- Event schema changes follow dual-publish pattern (04 Event-Driven §12).
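As an illustration of the dual-publish idea, the sketch below emits the new event version alongside a downgraded copy of the old version for as long as legacy consumers remain. The event shapes, subject names, and the `legacyConsumersRemain` switch are hypothetical; the authoritative rules are in the referenced Event-Driven document.

```typescript
// Hypothetical event shapes for two schema versions of the same event.
type SessionStartedV1 = { sessionId: string; tenant: string };
type SessionStartedV2 = { sessionId: string; tenantId: string; startedAt: string };

interface Envelope {
  subject: string;
  schemaVersion: 1 | 2;
  payload: SessionStartedV1 | SessionStartedV2;
}

// Maps the new shape back onto the old one for consumers that have not migrated.
function downgrade(event: SessionStartedV2): SessionStartedV1 {
  return { sessionId: event.sessionId, tenant: event.tenantId };
}

// Returns every envelope that should be published for one domain event.
function envelopesFor(event: SessionStartedV2, legacyConsumersRemain: boolean): Envelope[] {
  const out: Envelope[] = [
    { subject: "delivery.session.started.v2", schemaVersion: 2, payload: event },
  ];
  if (legacyConsumersRemain) {
    out.push({ subject: "delivery.session.started.v1", schemaVersion: 1, payload: downgrade(event) });
  }
  return out;
}
```

Because both versions are on the wire during the migration window, rolling the producer back to the previous image does not strand consumers on either schema.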
9. Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: delivery-service
spec:
  podSelector:
    matchLabels:
      app: delivery-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: gateway
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              name: observability
      ports:
        - port: 9090
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: data
      ports:
        - port: 5432 # Postgres
        - port: 6379 # Redis
        - port: 4222 # NATS
    - to:
        - namespaceSelector:
            matchLabels:
              name: ghasi-services
      # Allow internal service calls
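A quick way to reason about the egress rules is to model them as data and ask whether a given destination would match. The sketch below does that for this policy alone, identifying namespaces by the `name` label value used in the selectors. It is a simplification for illustration (no pod selectors, IP blocks, or other policies), and the names are hypothetical.

```typescript
// Simplified model of the egress section: a rule without ports allows every port.
interface EgressRule {
  namespace: string;
  ports?: number[];
}

const EGRESS_RULES: EgressRule[] = [
  { namespace: "data", ports: [5432, 6379, 4222] }, // Postgres, Redis, NATS
  { namespace: "ghasi-services" },                  // internal service calls, any port
];

function egressAllowed(namespace: string, port: number): boolean {
  return EGRESS_RULES.some(
    (rule) =>
      rule.namespace === namespace &&
      (rule.ports === undefined || rule.ports.indexOf(port) !== -1)
  );
}
```

Under this model, `egressAllowed("observability", 4317)` is false, so the OTLP endpoint configured in the pod spec, and likewise cluster DNS, would need to be permitted by another policy selecting these pods. That is worth verifying against the platform's baseline policies.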
10. Disaster Recovery
| Scenario | RTO | RPO | Strategy |
|---|---|---|---|
| Pod crash | < 1 min | 0 | K8s auto-restart |
| Node failure | < 5 min | 0 | Pod rescheduled |
| AZ failure | < 15 min | 0 | Multi-AZ deployment |
| Region failure | < 4 hours | < 5 min | Standby region promotion; tenant data-residency permitting |
| Database corruption | < 1 hour | < 5 min | PITR from continuous backups |
| Full platform loss | < 24 hours | < 1 hour | Restore from cross-region backup |
11. Release Cadence
- Release cycle: Weekly (prod deploy window every Tuesday)
- Hotfix window: 24/7 on-call coverage for P1/P2
- Feature flag gating: All new features behind LaunchDarkly / OpenFeature flags; rolled out per tenant
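Per-tenant gating can be pictured as a lookup keyed by flag and tenant. The sketch below is a self-contained stand-in, not the LaunchDarkly or OpenFeature SDK; the rule shape, flag key, and function name are hypothetical, and unknown flags default to off.

```typescript
// Hypothetical in-memory flag rules; a real deployment would query the flag provider.
interface FlagRule {
  enabledForAll: boolean;
  enabledTenants: string[]; // tenants included in the progressive rollout
}

function isEnabled(rules: Record<string, FlagRule>, flag: string, tenantId: string): boolean {
  const rule: FlagRule | undefined = rules[flag];
  if (rule === undefined) return false; // unknown flags stay off
  return rule.enabledForAll || rule.enabledTenants.indexOf(tenantId) !== -1;
}

const exampleRules: Record<string, FlagRule> = {
  "new-bundle-resolver": { enabledForAll: false, enabledTenants: ["tenant-a"] },
};
```

Here `tenant-a` sees the new code path while every other tenant stays on the existing one until the rule is widened.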
12. Resource Sizing (Reference)
| Environment | Replicas | CPU (per pod) | RAM (per pod) | DB size | Redis size | NATS size |
|---|---|---|---|---|---|---|
| dev | 2 | 250m | 512Mi | 5 GB | 1 GB | 10 GB |
| staging | 3 | 500m | 1Gi | 50 GB | 2 GB | 50 GB |
| prod (small region) | 6 | 1000m | 1.5Gi | 500 GB | 8 GB | 500 GB |
| prod (large region) | 12 | 1000m | 1.5Gi | 2 TB | 16 GB | 2 TB |
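Multiplying replicas by the per-pod figures gives the baseline capacity each environment needs. A minimal sketch, assuming the Replicas column counts delivery-api pods only and that the per-pod values are read as Kubernetes quantities (`1` CPU = 1000m, `1.5Gi` = 1536Mi); the helper names are hypothetical and only the `m`, `Mi`, and `Gi` suffixes are handled.

```typescript
// Parses a CPU quantity such as "250m" or "1" into millicores.
function parseCpuMillis(quantity: string): number {
  return quantity.slice(-1) === "m"
    ? parseInt(quantity.slice(0, -1), 10)
    : Math.round(parseFloat(quantity) * 1000);
}

// Parses a memory quantity such as "512Mi" or "1.5Gi" into mebibytes.
function parseMemoryMi(quantity: string): number {
  const unit = quantity.slice(-2);
  const value = parseFloat(quantity.slice(0, -2));
  if ((unit !== "Mi" && unit !== "Gi") || isNaN(value)) {
    throw new Error(`unsupported quantity: ${quantity}`);
  }
  return unit === "Gi" ? value * 1024 : value;
}

function totalRequests(replicas: number, cpu: string, memory: string) {
  return {
    cpuMillis: replicas * parseCpuMillis(cpu),
    memoryMi: replicas * parseMemoryMi(memory),
  };
}
```

For the large production region, `totalRequests(12, "1", "1.5Gi")` gives 12000 millicores and 18432 Mi (12 CPUs and 18 Gi) before any HPA scale-out.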