Deployment Topology

:::info Source
Sourced from `services/delivery-service/DEPLOYMENT_TOPOLOGY.md` in the documentation repo.
:::

Companion: 01 Enterprise Architecture · SECURITY_MODEL · OBSERVABILITY

1. Runtime

| Component | Choice | Rationale |
| --- | --- | --- |
| Language | TypeScript | Platform standard |
| Framework | NestJS | Platform standard (controllers, modules, DI) |
| Node runtime | Node.js 22 LTS | Platform standard |
| Container | OCI image via Buildpacks / Dockerfile | Reproducible builds |
| Orchestration | Kubernetes | Platform standard |
| Helm chart | `charts/delivery-service` | Per-service chart with values per environment |

2. Infrastructure Dependencies

| Dependency | Managed By | Connection |
| --- | --- | --- |
| PostgreSQL 16 | AWS RDS / Cloud SQL | Private network; TLS required |
| Redis 7 | AWS ElastiCache / Memorystore | Private network; TLS required |
| NATS JetStream | Self-hosted on K8s (operator) | Multi-AZ; 5 replicas |
| S3/R2 | AWS S3 / Cloudflare R2 | For PlayPackage bundle signed URLs (read-only access) |
| KMS | AWS KMS / Vault | JWT verification, secret decryption |

3. Kubernetes Topology

┌──────────────────────────────────────────────────────┐
│                  delivery-service                     │
│                                                      │
│  Deployment: delivery-api                            │
│    - Replicas: 6 (min) -> 30 (max via HPA)          │
│    - Resources: 1 CPU / 1.5Gi RAM limit per pod      │
│    - Probes: liveness, readiness, startup            │
│                                                      │
│  Deployment: delivery-outbox-relay                   │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Deployment: delivery-event-projector                │
│    - Replicas: 3                                     │
│    - Resources: 0.5 CPU / 1Gi RAM                    │
│                                                      │
│  Service: delivery-service (ClusterIP)               │
│    - Port: 8080 (HTTP)                               │
│    - Port: 9090 (metrics)                            │
│                                                      │
│  Ingress via platform gateway (Istio / Traefik)      │
│    - TLS termination at edge                         │
│    - Istio sidecar for internal mTLS                 │
└──────────────────────────────────────────────────────┘

4. Scaling

4.1 Horizontal Pod Autoscaler

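apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: delivery-api
spec:
  # The HPA must name the workload it scales; this targets the delivery-api Deployment.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: delivery-api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    # Scale out when average CPU utilization exceeds 60% of requests.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale out when the per-pod request rate exceeds 200 req/s.
    - type: Pods
      pods:
        metric:
          name: delivery_http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"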
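
The replica count this HPA settles on can be approximated with the standard Kubernetes scaling rule: each metric proposes `ceil(currentReplicas * currentValue / targetValue)`, the largest proposal wins, and the result is clamped to the configured bounds. The sketch below is illustrative only. The metric readings are hypothetical, and it leaves out the controller's tolerance band and stabilization windows.

```typescript
// Sketch of the HPA scaling rule. For each metric:
//   desired = ceil(currentReplicas * currentValue / targetValue)
// The largest proposal wins and is clamped to [minReplicas, maxReplicas].
interface MetricReading {
  name: string;
  current: number; // observed average per pod
  target: number;  // target value from the HPA spec
}

function desiredReplicas(
  currentReplicas: number,
  readings: MetricReading[],
  minReplicas = 6,
  maxReplicas = 30,
): number {
  const proposals = readings.map((m) =>
    Math.ceil(currentReplicas * (m.current / m.target)),
  );
  // With no readings Math.max() is -Infinity, so the clamp returns minReplicas.
  const wanted = Math.max(...proposals);
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}

// Hypothetical readings: CPU at 90% against the 60% target,
// 150 req/s per pod against the 200 req/s target.
const next = desiredReplicas(6, [
  { name: "cpu", current: 90, target: 60 },
  { name: "delivery_http_requests_per_second", current: 150, target: 200 },
]);
console.log(next); // 9
```

Here CPU proposes 9 replicas and the request rate proposes 5, so the CPU metric drives the decision.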

4.2 Database Scaling

- Primary instance: 16 vCPU / 64Gi RAM (prod)
- Read replicas: 2 (for analytics queries and replica reads of session state)
- Connection pooling: PgBouncer sidecar in transaction mode
- Max connections: 500 per writer, 200 per reader
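
Because every pod carries its own PgBouncer sidecar, the per-pod server pool has to fit inside the writer's connection limit at maximum scale. The arithmetic can be sketched as follows. It assumes that all three deployments connect to the writer and that the pool is capped by PgBouncer's `default_pool_size`; the 20-connection headroom for migrations and admin sessions is a hypothetical figure, not one from this document.

```typescript
// Largest per-pod pool size that keeps the total under the database limit.
function maxPoolSizePerPod(
  maxConnections: number, // limit on the database instance
  podCount: number,       // pods that each run a PgBouncer sidecar
  reserved: number,       // headroom kept free (hypothetical value below)
): number {
  if (podCount <= 0) throw new RangeError("podCount must be positive");
  return Math.floor((maxConnections - reserved) / podCount);
}

// 30 delivery-api pods (HPA max) + 3 outbox relays + 3 event projectors.
const podsAtMaxScale = 30 + 3 + 3;

const writerPoolSize = maxPoolSizePerPod(500, podsAtMaxScale, 20);
console.log(writerPoolSize); // 13
```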

4.3 NATS Scaling

- 5-node JetStream cluster, multi-AZ
- Stream `DELIVERY`: 3 replicas, 30-day retention, 1 TB max size
- Durable consumers per projector
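
The stream limits above translate into configuration values roughly as shown below. This is a sketch, not the service's actual provisioning code: the `StreamSketch` interface is local and only mirrors JetStream's field names (`max_age` is expressed in nanoseconds), and reading "1 TB" as a decimal terabyte is an assumption. The helper shows why replica counts are odd: JetStream replication is quorum based, so a group stays available only while a majority of its members is up.

```typescript
// Local type that mirrors JetStream stream-config field names (illustrative).
interface StreamSketch {
  name: string;
  num_replicas: number;
  max_age: number;   // nanoseconds
  max_bytes: number; // bytes
}

const NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000;

const deliveryStream: StreamSketch = {
  name: "DELIVERY",
  num_replicas: 3,
  max_age: 30 * NANOS_PER_DAY,  // 30-day retention
  max_bytes: 1_000_000_000_000, // 1 TB, assuming decimal units
};

// Members that can be lost while a majority (quorum) is still available.
function toleratedFailures(replicas: number): number {
  const quorum = Math.floor(replicas / 2) + 1;
  return replicas - quorum;
}

console.log(toleratedFailures(deliveryStream.num_replicas)); // 1 (stream replicas)
console.log(toleratedFailures(5));                           // 2 (cluster nodes)
```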

5. Pod Spec

spec:
  containers:
    - name: delivery-api
      image: ghasi/delivery-service:<sha>
      ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
      env:
        - name: NODE_ENV
          value: production
        - name: LOG_LEVEL
          value: info
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.observability:4317
      envFrom:
        - secretRef:
            name: delivery-secrets
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        timeoutSeconds: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 2
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1000m"
          memory: "1.5Gi"
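
The probe settings imply rough time budgets that are worth checking against real startup times. The sketch below multiplies `periodSeconds` by `failureThreshold` for each probe. The liveness and readiness probes in the spec do not set `failureThreshold`, so the Kubernetes default of 3 is assumed; the figures are approximations that ignore `timeoutSeconds` and scheduling jitter.

```typescript
interface Probe {
  periodSeconds: number;
  failureThreshold: number;
}

// Kubernetes default when a probe omits failureThreshold.
const DEFAULT_FAILURE_THRESHOLD = 3;

// Approximate time a probe may keep failing before Kubernetes acts on it.
function failureBudgetSeconds(probe: Probe): number {
  return probe.periodSeconds * probe.failureThreshold;
}

const startupProbe: Probe = { periodSeconds: 2, failureThreshold: 30 };
const livenessProbe: Probe = { periodSeconds: 15, failureThreshold: DEFAULT_FAILURE_THRESHOLD };
const readinessProbe: Probe = { periodSeconds: 5, failureThreshold: DEFAULT_FAILURE_THRESHOLD };

console.log(failureBudgetSeconds(startupProbe));   // 60: time allowed for the app to boot
console.log(failureBudgetSeconds(livenessProbe));  // 45: time before a hung pod is restarted
console.log(failureBudgetSeconds(readinessProbe)); // 15: time before a pod leaves the Service
```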

6. Environments

| Environment | Purpose | Tenant Data |
| --- | --- | --- |
| `dev` | Local + ephemeral cluster | Synthetic only |
| `staging` | Pre-prod integration | Synthetic + opt-in real tenants (UAT) |
| `prod-us` | US production | Real tenants, US region |
| `prod-eu` | EU production | Real tenants, EU region (data residency) |
| `prod-me` | ME production | Real tenants, Middle East region |
| `prod-ap` | APAC production | Real tenants, APAC region |

Each region is fully independent (no cross-region writes). Each tenant selects its home region at creation.
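
One way to picture the home-region rule is a guard that rejects writes arriving in any region other than the tenant's own. The sketch below is hypothetical: the `Tenant` shape, the function names, and the error message are illustrative and not taken from the service.

```typescript
type Region = "us" | "eu" | "me" | "ap";

interface Tenant {
  id: string;
  homeRegion: Region; // chosen once, at tenant creation
}

// Maps a tenant to the production environment that owns its data.
function environmentFor(tenant: Tenant): string {
  return `prod-${tenant.homeRegion}`;
}

// Throws when a write reaches a region that does not own the tenant.
function assertWritableHere(tenant: Tenant, servingRegion: Region): void {
  if (tenant.homeRegion !== servingRegion) {
    throw new Error(
      `tenant ${tenant.id} is homed in ${tenant.homeRegion}; ` +
        `writes are not accepted in ${servingRegion}`,
    );
  }
}

const acme: Tenant = { id: "acme", homeRegion: "eu" };
console.log(environmentFor(acme)); // prod-eu
assertWritableHere(acme, "eu");    // passes silently
```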

7. Deployment Process

7.1 CI/CD Pipeline

┌────────────────────────────────────────────────────────┐
│ 1. PR opened -> CI runs:                               │
│    - Lint, type check, unit + integration tests        │
│    - Contract tests, SAST, dependency scan             │
│    - Two-tenant simulator                              │
│                                                        │
│ 2. Merge to main -> build + push image                 │
│                                                        │
│ 3. Deploy to staging (automatic)                       │
│    - Run smoke tests                                   │
│    - Run E2E suite                                     │
│                                                        │
│ 4. Deploy to prod-us via manual approval               │
│    - Canary: 5% traffic for 30 min                     │
│    - Monitor SLO burn rate                             │
│    - Auto-rollback on elevated error rate              │
│    - Progressive: 25%, 50%, 100%                       │
│                                                        │
│ 5. Deploy to other prod regions                        │
│    - Staggered rollout, 1 region per day               │
└────────────────────────────────────────────────────────┘
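
The canary stage in step 4 can be modelled as a loop over traffic weights that stops at the first step whose error rate is too high. This is a simplified sketch of the decision logic only, not Argo Rollouts configuration. The error-rate threshold and the `errorRateAt` callback are hypothetical, and the pause between steps (30 minutes at 5%) is left out.

```typescript
type Outcome =
  | { status: "promoted" }
  | { status: "rolled-back"; atWeight: number };

// Traffic weights from the pipeline: canary at 5%, then 25%, 50%, 100%.
const CANARY_WEIGHTS = [5, 25, 50, 100];

function runRollout(
  errorRateAt: (weight: number) => number, // observed error rate at a weight
  maxErrorRate: number,                    // rollback threshold (hypothetical)
): Outcome {
  for (const weight of CANARY_WEIGHTS) {
    if (errorRateAt(weight) > maxErrorRate) {
      // Elevated error rate: abort and return traffic to the stable version.
      return { status: "rolled-back", atWeight: weight };
    }
  }
  return { status: "promoted" };
}

// A release that degrades once it receives a quarter of the traffic.
const outcome = runRollout((w) => (w >= 25 ? 0.05 : 0.001), 0.01);
console.log(outcome); // { status: 'rolled-back', atWeight: 25 }
```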

7.2 Deployment Tool

- Argo CD for GitOps deployment
- Argo Rollouts for canary and blue/green rollouts
- Flagger as an optional alternative

8. Rollback Strategy

- Argo Rollouts keeps the previous ReplicaSet available.
- Roll back by re-pinning to the previous image tag (a single command).
- Database migrations are forward-only; incompatible changes require paired forward migrations rather than down migrations.
- Event schema changes follow the dual-publish pattern (04 Event-Driven §12).
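
The dual-publish idea can be sketched as emitting both the old and the new schema version of an event until every consumer has moved to the new one, which keeps a rollback safe for consumers on either version. The authoritative definition lives in 04 Event-Driven §12; everything below (the `Envelope` shape, the `.v1`/`.v2` subject suffixes, the `downgrade` callback) is a hypothetical illustration.

```typescript
interface Envelope {
  subject: string;
  schemaVersion: number;
  payload: unknown;
}

type Publish = (event: Envelope) => void;

// Publishes the same fact in both schema versions during the migration window.
function dualPublish<V1, V2>(
  publish: Publish,
  subject: string,
  v2Payload: V2,
  downgrade: (v2: V2) => V1, // derives the legacy shape from the new one
): void {
  publish({ subject: `${subject}.v1`, schemaVersion: 1, payload: downgrade(v2Payload) });
  publish({ subject: `${subject}.v2`, schemaVersion: 2, payload: v2Payload });
}

// Usage with an in-memory publisher standing in for the event bus.
const published: Envelope[] = [];
dualPublish(
  (event) => { published.push(event); },
  "delivery.session.started",
  { sessionId: "s-1", startedAtMs: 1700000000000 },
  (v2) => ({ sessionId: v2.sessionId }),
);
console.log(published.map((e) => e.subject));
```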

9. Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: delivery-service
spec:
  podSelector:
    matchLabels:
      app: delivery-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: gateway
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              name: observability
      ports:
        - port: 9090
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: data
      ports:
        - port: 5432   # Postgres
        - port: 6379   # Redis
        - port: 4222   # NATS
    # Allow calls to other internal services (no port restriction)
    - to:
        - namespaceSelector:
            matchLabels:
              name: ghasi-services

10. Disaster Recovery

| Scenario | RTO | RPO | Strategy |
| --- | --- | --- | --- |
| Pod crash | < 1 min | 0 | K8s auto-restart |
| Node failure | < 5 min | 0 | Pod rescheduled |
| AZ failure | < 15 min | 0 | Multi-AZ deployment |
| Region failure | < 4 hours | < 5 min | Standby region promotion, where tenant data residency permits |
| Database corruption | < 1 hour | < 5 min | PITR from continuous backups |
| Full platform loss | < 24 hours | < 1 hour | Restore from cross-region backup |

11. Release Cadence

- Release cycle: weekly, with the production deploy window every Tuesday
- Hotfix window: 24/7 on-call coverage for P1/P2 incidents
- Feature flag gating: all new features ship behind LaunchDarkly / OpenFeature flags and are rolled out per tenant
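
Per-tenant rollout boils down to evaluating each flag with the tenant as the targeting key. The sketch below uses a plain in-memory rule table instead of the LaunchDarkly or OpenFeature SDKs, so the `FlagRule` shape, the flag name, and the tenant IDs are all hypothetical.

```typescript
interface FlagRule {
  enabledForAll: boolean; // true once the feature is generally available
  tenants: Set<string>;   // tenants opted in during the gradual rollout
}

function isEnabled(
  rules: Map<string, FlagRule>,
  flag: string,
  tenantId: string,
): boolean {
  const rule = rules.get(flag);
  if (!rule) return false; // unknown flags default to off
  return rule.enabledForAll || rule.tenants.has(tenantId);
}

const flagRules = new Map<string, FlagRule>([
  ["new-play-package-resolver", { enabledForAll: false, tenants: new Set(["acme"]) }],
]);

console.log(isEnabled(flagRules, "new-play-package-resolver", "acme"));   // true
console.log(isEnabled(flagRules, "new-play-package-resolver", "globex")); // false
```

Defaulting unknown flags to off keeps a new code path dark if its flag has not yet been created in an environment.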

12. Resource Sizing (Reference)

| Environment | Replicas | CPU (per pod) | RAM (per pod) | DB size | Redis size | NATS size |
| --- | --- | --- | --- | --- | --- | --- |
| dev | 2 | 250m | 512Mi | 5 GB | 1 GB | 10 GB |
| staging | 3 | 500m | 1Gi | 50 GB | 2 GB | 50 GB |
| prod (small region) | 6 | 1 CPU | 1.5Gi | 500 GB | 8 GB | 500 GB |
| prod (large region) | 12 | 1 CPU | 1.5Gi | 2 TB | 16 GB | 2 TB |
