Deployment Topology

:::info Source
Sourced from services/tenant-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::

Blueprint doc 12 of 17. Companion: 01 Enterprise Architecture | SECURITY_MODEL | OBSERVABILITY

1. Topology Overview

                                  Cloudflare Edge (TLS, WAF, DDoS)
                                            │
                              ┌─────────────┴─────────────┐
                              │    API Gateway (NestJS)    │
                              │   tenant-aware proxy init  │
                              └─────────────┬─────────────┘
                                            │ mTLS
              ┌─────────────────────────────┼─────────────────────────────┐
              │                             │                             │
     ┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
     │  tenant-service   │         │  tenant-service   │         │  tenant-service   │
     │   Region: us-*    │         │   Region: eu-*    │         │   Region: me-*    │
     │  (3+ replicas)    │         │  (3+ replicas)    │         │  (3+ replicas)    │
     └────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
              │                             │                             │
     ┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
     │ PgBouncer pool    │         │ PgBouncer pool    │         │ PgBouncer pool    │
     │ (transaction mode)│         │ (transaction mode)│         │ (transaction mode)│
     └────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
              │                             │                             │
     ┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
     │  Postgres 16      │         │  Postgres 16      │         │  Postgres 16      │
     │  (1 primary +     │         │  (1 primary +     │         │  (1 primary +     │
     │   2 replicas)     │         │   2 replicas)     │         │   2 replicas)     │
     └───────────────────┘         └───────────────────┘         └───────────────────┘

                    ┌────────────────────────────────────────────┐
                    │  NATS JetStream (multi-region cluster)     │
                    │  TENANT stream partitioned by tenantId     │
                    └────────────────────────────────────────────┘

                    ┌────────────────────────────────────────────┐
                    │  Redis cluster (per region)                │
                    │  authz cache, tenant cache, rate limits    │
                    └────────────────────────────────────────────┘

The diagram draws three regional stacks (`us`, `eu`, `me`); section 2 and the Helm values files also define an `ap` region.

2. Regions

Region code	Location	Purpose
`us`	us-east-1, us-west-2 (active/active)	North America tenants
`eu`	eu-central-1 (Frankfurt), eu-west-1 (Dublin)	EU tenants (GDPR residency)
`me`	me-central-1 (UAE), af-south-1 (South Africa)	Middle East + Africa
`ap`	ap-southeast-1 (Singapore), ap-south-1 (Mumbai)	Asia-Pacific

Tenant data is pinned to homeRegion. Cross-region access requires explicit residency migration.
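
The pinning rule can be pictured as a guard that runs before any tenant data is touched. The following is a minimal sketch, not the service's actual code: the `Tenant` shape, `routeTenantRequest`, and the decision type are hypothetical names; only `homeRegion` and the four region codes come from this document.

```typescript
// Region codes from the Regions table.
type RegionCode = "us" | "eu" | "me" | "ap";

// Hypothetical tenant record; only `homeRegion` is named in the source document.
interface Tenant {
  id: string;
  homeRegion: RegionCode;
}

type RoutingDecision =
  | { action: "serve" }
  | { action: "reject"; reason: string; homeRegion: RegionCode };

// Decide whether the replica running in `servingRegion` may handle this tenant.
// A mismatch is rejected rather than proxied, because cross-region access
// is only allowed after an explicit residency migration.
function routeTenantRequest(tenant: Tenant, servingRegion: RegionCode): RoutingDecision {
  if (tenant.homeRegion === servingRegion) {
    return { action: "serve" };
  }
  return {
    action: "reject",
    reason: `tenant ${tenant.id} is pinned to ${tenant.homeRegion}; residency migration required`,
    homeRegion: tenant.homeRegion,
  };
}
```

Returning the tenant's `homeRegion` with the rejection lets a caller such as the gateway redirect the request instead of failing it outright; whether the real gateway does so is not stated here.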

3. Kubernetes Deployment

3.1 Helm Chart Structure

charts/tenant-service/
├── Chart.yaml
├── values.yaml
├── values.us-prod.yaml
├── values.eu-prod.yaml
├── values.me-prod.yaml
├── values.ap-prod.yaml
├── values.staging.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    ├── pdb.yaml
    ├── networkpolicy.yaml
    ├── servicemonitor.yaml
    ├── prometheusrule.yaml
    ├── serviceaccount.yaml
    └── secrets-external.yaml       # ExternalSecrets refs

3.2 Baseline Deployment Spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-service
  labels:
    app: tenant-service
    tier: core
    region: {{ .Values.region }}
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tenant-service
  template:
    metadata:
      labels:
        app: tenant-service
        region: {{ .Values.region }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: tenant-service
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
      - name: tenant-service
        image: ghasi/tenant-service:{{ .Values.image.tag }}
        imagePullPolicy: IfNotPresent
        ports:
        - { name: http, containerPort: 3000 }
        - { name: metrics, containerPort: 9090 }
        resources:
          requests: { cpu: 500m, memory: 512Mi }
          limits:   { cpu: 2,    memory: 2Gi }
        env:
        - { name: NODE_ENV, value: production }
        - { name: REGION,   value: {{ .Values.region }} }
        - { name: OTEL_SERVICE_NAME, value: tenant-service }
        - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: http://otel-collector.observability:4317 }
        envFrom:
        - secretRef: { name: tenant-service-secrets }   # populated by ExternalSecrets
        livenessProbe:
          httpGet: { path: /health/live, port: http }
          initialDelaySeconds: 15
          periodSeconds: 10
        readinessProbe:
          httpGet: { path: /health/ready, port: http }
          initialDelaySeconds: 5
          periodSeconds: 5
        startupProbe:
          httpGet: { path: /health/startup, port: http }
          failureThreshold: 30
          periodSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities: { drop: [ALL] }
        volumeMounts:
        - { name: tmp, mountPath: /tmp }
      volumes:
      - { name: tmp, emptyDir: {} }
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector: { matchLabels: { app: tenant-service } }
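
To make the probe settings above easier to reason about, the sketch below computes a rough upper bound on how long each probe tolerates failure before Kubernetes acts. `failureBudgetSeconds` is a hypothetical helper; the defaults in the comments are the Kubernetes defaults, and `timeoutSeconds` is ignored for simplicity. Liveness and readiness probes do not run until the startup probe has succeeded, so the three budgets do not overlap during boot.

```typescript
// Subset of the Kubernetes probe fields used in the Deployment spec.
interface ProbeSpec {
  initialDelaySeconds?: number; // Kubernetes default: 0
  periodSeconds?: number;       // Kubernetes default: 10
  failureThreshold?: number;    // Kubernetes default: 3
}

// Approximate seconds until a probe has failed `failureThreshold` times in a row.
// This is a planning estimate, not an exact model of kubelet scheduling.
function failureBudgetSeconds(probe: ProbeSpec): number {
  const delay = probe.initialDelaySeconds ?? 0;
  const period = probe.periodSeconds ?? 10;
  const threshold = probe.failureThreshold ?? 3;
  return delay + period * threshold;
}

// Values taken from the spec above; unset fields fall back to the defaults.
const startup = failureBudgetSeconds({ failureThreshold: 30, periodSeconds: 5 });      // 150
const liveness = failureBudgetSeconds({ initialDelaySeconds: 15, periodSeconds: 10 }); // 45
const readiness = failureBudgetSeconds({ initialDelaySeconds: 5, periodSeconds: 5 });  // 20
```

Under these assumptions a container has about 150 seconds to pass `/health/startup` before it is restarted.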

3.3 HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: tenant-service }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: tenant-service }
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
  - type: Pods
    pods:
      metric: { name: tenant_http_requests_per_second }
      target: { type: AverageValue, averageValue: "500" }
  behavior:
    scaleUp:   { stabilizationWindowSeconds: 60,  policies: [{ type: Percent, value: 100, periodSeconds: 30 }] }
    scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: Percent, value: 50,  periodSeconds: 60 }] }
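
The replica count this HPA settles on follows the documented Kubernetes rule: for each metric, `ceil(currentReplicas × current / target)`, then the largest proposal wins and is clamped to `minReplicas`/`maxReplicas`. The sketch below is a simplified model of that rule. It deliberately ignores the `behavior` stabilization windows and rate policies, and the function and type names are hypothetical.

```typescript
interface MetricReading {
  current: number; // observed average across pods (CPU % or requests per second)
  target: number;  // target from the HPA spec (70 for CPU, 500 for rps)
}

const TOLERANCE = 0.1; // default HPA tolerance: ratios within 10% of 1.0 cause no change

// Simplified HPA calculation: per-metric proposal, take the max, clamp to bounds.
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  min = 3,
  max = 20,
): number {
  const proposals = metrics.map(({ current, target }) => {
    const ratio = current / target;
    if (Math.abs(ratio - 1) <= TOLERANCE) return currentReplicas;
    return Math.ceil(currentReplicas * ratio);
  });
  const desired = Math.max(...proposals);
  return Math.min(max, Math.max(min, desired));
}
```

For example, with 3 replicas at 90% CPU and 400 rps per pod, the CPU metric proposes 4 replicas and the request metric proposes 3, so the result is 4. With 20 replicas each at the 500 rps target, the ceiling corresponds to roughly 10,000 rps per region.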

3.4 PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: tenant-service }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: tenant-service } }

3.5 NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: tenant-service-np }
spec:
  podSelector: { matchLabels: { app: tenant-service } }
  policyTypes: [Ingress, Egress]
  ingress:
  - from:
    - namespaceSelector: { matchLabels: { name: api-gateway } }
    - namespaceSelector: { matchLabels: { name: sync-service } }
    - namespaceSelector: { matchLabels: { name: observability } }
    ports:
    - { port: 3000, protocol: TCP }
    - { port: 9090, protocol: TCP }
  egress:
  - to:
    - namespaceSelector: { matchLabels: { name: data-postgres } }
    ports: [{ port: 6432, protocol: TCP }]   # PgBouncer
  - to:
    - namespaceSelector: { matchLabels: { name: data-redis } }
    ports: [{ port: 6379, protocol: TCP }]
  - to:
    - namespaceSelector: { matchLabels: { name: messaging-nats } }
    ports: [{ port: 4222, protocol: TCP }]
  - to:
    - namespaceSelector: { matchLabels: { name: identity-service } }
    ports: [{ port: 443, protocol: TCP }]
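  - to:                                       # DNS
    - namespaceSelector: { matchLabels: { name: kube-system } }
    ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]
  - to:                                       # OTLP export to otel-collector (OTEL_EXPORTER_OTLP_ENDPOINT)
    - namespaceSelector: { matchLabels: { name: observability } }
    ports: [{ port: 4317, protocol: TCP }]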
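
Because `policyTypes` includes `Egress`, any destination not listed is denied. The sketch below models that allow-list behavior for the datastore, messaging, identity, and DNS rules so it can be checked in a unit test. It is an illustration only: `isEgressAllowed` and the rule table are hypothetical, and it does not replace testing against the cluster's actual CNI.

```typescript
type Protocol = "TCP" | "UDP";

interface EgressRule {
  namespace: string;
  port: number;
  protocol: Protocol;
}

// Mirrors the datastore, messaging, identity, and DNS egress rules of the policy.
const egressAllowList: EgressRule[] = [
  { namespace: "data-postgres", port: 6432, protocol: "TCP" }, // PgBouncer
  { namespace: "data-redis", port: 6379, protocol: "TCP" },
  { namespace: "messaging-nats", port: 4222, protocol: "TCP" },
  { namespace: "identity-service", port: 443, protocol: "TCP" },
  { namespace: "kube-system", port: 53, protocol: "UDP" },
  { namespace: "kube-system", port: 53, protocol: "TCP" },
];

// Allow-list semantics: traffic passes only if some rule matches exactly.
function isEgressAllowed(namespace: string, port: number, protocol: Protocol = "TCP"): boolean {
  return egressAllowList.some(
    (rule) => rule.namespace === namespace && rule.port === port && rule.protocol === protocol,
  );
}
```

One consequence worth noting: a direct connection to Postgres on 5432 is blocked, so pods can only reach the database through PgBouncer on 6432.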

4. Infrastructure

4.1 Postgres

Attribute	Config
Engine	PostgreSQL 16 (managed; RDS / Cloud SQL / Aurora-compatible)
Instance class	`db.r6g.xlarge` (primary) / `db.r6g.large` × 2 (replicas)
Storage	500 GB GP3 (auto-scale)
Extensions	`ltree`, `pg_trgm`, `pgcrypto`, `uuid-ossp`
Parameter tweaks	`max_connections=500`, `shared_buffers` = 25% of RAM, `work_mem=16MB`, `effective_cache_size` = 75% of RAM
Backups	Continuous WAL to S3; nightly snapshot; 35-day PITR
Encryption	AES-256 at rest (KMS); TLS 1.3 in transit

4.2 PgBouncer

Attribute	Config
Mode	`transaction` (RLS tenant context must be transaction-scoped via `SET LOCAL`; a session-level `SET` would leak across pooled connections)
Pool size per replica	200
Max client conns	10,000
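Init hook	`SELECT set_config('app.tenant_id', $1, true)` at the start of each transaction, issued by the app with its tenant context (equivalent to `SET LOCAL app.tenant_id`; `SET` itself does not accept bind parameters)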
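
The per-transaction tenant context can be sketched as a small wrapper around a database client. This is an assumption-laden illustration: `Queryable` is a minimal stand-in interface (a node-postgres client is one candidate), and `withTenant` is a hypothetical helper. It uses `set_config(..., true)` because PostgreSQL's `SET` command does not accept bind parameters, while `set_config` with `is_local = true` has the same transaction-scoped effect as `SET LOCAL`.

```typescript
// Minimal stand-in for a database client; only `query` is needed here.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

// Run `work` inside one transaction with app.tenant_id set for RLS policies.
// The setting disappears at COMMIT/ROLLBACK, so the pooled server connection
// is clean when PgBouncer hands it to the next client.
async function withTenant<T>(
  client: Queryable,
  tenantId: string,
  work: (c: Queryable) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every query that relies on RLS has to run through the same client inside `work`; a query issued outside the transaction would see no tenant context.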

4.3 Redis

Attribute	Config
Deployment	Redis 7 cluster mode, 3 shards × 2 replicas per region
TLS	Enabled (TLS 1.3)
Auth	Token (rotated 90 days)
Eviction	`allkeys-lru`
Persistence	AOF with `appendfsync everysec`
Namespaces	`tenants/{tid}/authz/...`, `tenants/{tid}/profile`, etc.
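
A key builder keeps every cache entry inside its tenant namespace. The sketch below assumes `{tid}` in the table is a placeholder for the tenant ID rather than a literal Redis Cluster hash tag; `tenantKey` is a hypothetical helper. Braces are rejected in segments because Redis Cluster treats `{...}` inside a key as a hash tag, which would silently change slot placement.

```typescript
// Build a key such as "tenants/t1/authz/roles" from a tenant ID and path segments.
// Segments containing "/", braces, or whitespace are rejected so that one tenant
// cannot craft a key that lands in another tenant's namespace.
function tenantKey(tid: string, ...segments: string[]): string {
  for (const part of [tid, ...segments]) {
    if (part.length === 0 || /[\/{}\s]/.test(part)) {
      throw new Error(`invalid key segment: "${part}"`);
    }
  }
  return ["tenants", tid, ...segments].join("/");
}
```

For instance, `tenantKey("t1", "authz", "roles")` yields `tenants/t1/authz/roles`, and `tenantKey("t1", "profile")` yields `tenants/t1/profile`.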

4.4 NATS JetStream

Attribute	Config
Cluster	5 nodes, cross-region replication (R3 for hot, R5 for regulated)
Streams	`TENANT` (partition=16), `TENANT.DLQ`
Storage	File-backed; 1 TB per node
Auth	NKey + per-service permissions
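
Partitioning by `tenantId` needs a stable mapping from tenant to one of the 16 partitions so that all events for a tenant stay ordered on the same partition. The sketch below shows one way to do this with an FNV-1a style hash over the string's UTF-16 code units. The hash choice and the `TENANT.<partition>.<event>` subject layout are assumptions for illustration; the document does not specify either.

```typescript
const PARTITIONS = 16; // from the TENANT stream config (partition=16)

// 32-bit FNV-1a style hash over UTF-16 code units (identical to byte-wise
// FNV-1a for ASCII input). Math.imul keeps the multiply in 32-bit range.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // convert to unsigned
}

// Same tenant always maps to the same partition in [0, PARTITIONS).
function partitionFor(tenantId: string): number {
  return fnv1a32(tenantId) % PARTITIONS;
}

// Hypothetical subject layout: TENANT.<partition>.<event>
function subjectFor(tenantId: string, event: string): string {
  return `TENANT.${partitionFor(tenantId)}.${event}`;
}
```

Changing the partition count later would remap tenants, so any real scheme would need a migration plan for in-flight events.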

5. Secret Management

The External Secrets Operator syncs secrets from HashiCorp Vault / AWS Secrets Manager / GCP Secret Manager into the `tenant-service-secrets` Secret consumed by the Deployment:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: tenant-service-secrets }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: vault-backend, kind: ClusterSecretStore }
  target:
    name: tenant-service-secrets
    creationPolicy: Owner
  data:
  - { secretKey: DATABASE_URL,     remoteRef: { key: secret/tenant-service/db,    property: url } }
  - { secretKey: REDIS_URL,        remoteRef: { key: secret/tenant-service/redis, property: url } }
  - { secretKey: NATS_CREDS,       remoteRef: { key: secret/tenant-service/nats,  property: creds } }
  - { secretKey: KMS_KEY_ID,       remoteRef: { key: secret/tenant-service/kms,   property: key_id } }
  - { secretKey: INVITE_TOKEN_SALT, remoteRef: { key: secret/tenant-service/crypto, property: invite_salt } }
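
Since these keys arrive as environment variables through `envFrom`, a startup check can fail fast when a sync has not populated one of them. The following is a minimal sketch; `missingSecrets` is a hypothetical helper, and the key list is copied from the ExternalSecret above.

```typescript
// Secret keys defined in the ExternalSecret's `data` section.
const REQUIRED_SECRETS = [
  "DATABASE_URL",
  "REDIS_URL",
  "NATS_CREDS",
  "KMS_KEY_ID",
  "INVITE_TOKEN_SALT",
] as const;

// Return the names of required secrets that are absent or blank in `env`.
function missingSecrets(env: Record<string, string | undefined>): string[] {
  return REQUIRED_SECRETS.filter((key) => (env[key] ?? "").trim() === "");
}

// Intended use at process start (shown as a comment to keep the sketch
// independent of Node typings):
//   const missing = missingSecrets(process.env);
//   if (missing.length > 0) throw new Error(`missing secrets: ${missing.join(", ")}`);
```

Failing at startup keeps a misconfigured pod from ever passing its startup probe, so the rolling update stalls instead of replacing healthy pods.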

6. CI/CD Pipeline

PR opened
  ├─ Lint + typecheck
  ├─ Unit tests + coverage gate (≥ 85%)
  ├─ Integration tests (Testcontainers)
  ├─ Two-tenant isolation suite
  ├─ Contract verification (Pact broker)
  ├─ Event schema backward-compat check
  ├─ OpenAPI diff check
  ├─ SAST (CodeQL, Semgrep)
  ├─ SBOM + provenance attestation (SLSA Level 3)
  └─ Docker build + sign (cosign)

Merge to main
  ├─ Tag semver
  ├─ Deploy to staging (all regions)
  ├─ k6 smoke test in staging
  ├─ Chaos drill (automated)
  └─ Manual approval for prod

Deploy to prod
  ├─ Canary: 1 region, 5% traffic for 30 min
  ├─ SLO burn rate monitored; auto-rollback if > 2x baseline
  ├─ Ramp to 50% → 100% in 2h
  └─ Deploy to remaining regions sequentially
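
The canary gate in the prod stage reduces to a small decision function: roll back when the SLO burn rate exceeds twice the baseline, otherwise hold until the 30-minute window has elapsed. The sketch below assumes a single burn-rate sample per evaluation; the type and function names are hypothetical, and a real gate would likely evaluate multiple windows.

```typescript
type CanaryAction = "rollback" | "hold" | "promote";

interface CanarySample {
  burnRate: number;         // current SLO burn rate of the canary
  baselineBurnRate: number; // burn rate of the stable version
  elapsedMinutes: number;   // time since the canary started taking 5% traffic
}

const ROLLBACK_MULTIPLIER = 2; // "auto-rollback if > 2x baseline"
const CANARY_MINUTES = 30;     // "5% traffic for 30 min"

// Rollback takes priority over promotion at every evaluation.
function canaryAction(sample: CanarySample): CanaryAction {
  if (sample.burnRate > ROLLBACK_MULTIPLIER * sample.baselineBurnRate) {
    return "rollback";
  }
  return sample.elapsedMinutes >= CANARY_MINUTES ? "promote" : "hold";
}
```

The comparison is strict, so a burn rate of exactly twice the baseline does not trigger a rollback under this reading of the rule.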

7. Resource Planning

Tier	Replicas/region	Requests/pod	Peak rps/pod
Steady state	3	0.5 CPU / 512Mi	~500
Peak (tenant onboarding event)	6	same	~1k
HPA ceiling	20	—	~10k

Cost-plan baseline (per region): ~$800/mo (3 pods + Postgres primary + 2 replicas + Redis cluster shard).

8. Service Dependencies (runtime)

Dependency	Purpose	Degradation behavior
Postgres	Primary datastore	Reads fall back to replica; writes fail with 503 + Retry-After
Redis	Authz cache, rate limits	Cache miss → DB fallback; rate limit fails open with alert
NATS JetStream	Event pub/sub	Outbox queues; consumer lag alert; writes succeed
identity-service (JWKS)	JWT validation	Cached 1h; stale-but-usable fallback
ai-gateway-service	AI features	Fail-soft: hide suggestions, keep core flows
billing-service	Plan entitlement lookup	Cached 15 min; proceed on lookup failure with last-known
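
The "cached, then last-known on failure" behavior listed for identity-service and billing-service can be sketched as a cache that serves fresh entries within a TTL and falls back to a stale entry when the refresh fails. This is an illustrative in-memory version with hypothetical names; the actual service may well keep these entries in Redis.

```typescript
interface Entry<T> {
  value: T;
  fetchedAt: number; // epoch milliseconds
}

class LastKnownGoodCache<T> {
  private readonly entries = new Map<string, Entry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so tests can control the clock.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Fresh hit: return it. Expired or missing: try `load`. If `load` fails and
  // an old value exists, return it flagged as stale instead of failing.
  async get(key: string, load: () => Promise<T>): Promise<{ value: T; stale: boolean }> {
    const hit = this.entries.get(key);
    if (hit && this.now() - hit.fetchedAt < this.ttlMs) {
      return { value: hit.value, stale: false };
    }
    try {
      const value = await load();
      this.entries.set(key, { value, fetchedAt: this.now() });
      return { value, stale: false };
    } catch (err) {
      if (hit) return { value: hit.value, stale: true };
      throw err; // nothing cached yet: the failure must surface
    }
  }
}
```

The `stale` flag gives callers a hook for emitting a metric or alert whenever a degraded answer is served.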

9. Disaster Recovery

Scenario	RPO	RTO
Single pod failure	0	< 30s (k8s reschedules)
AZ failure	0	< 2 min (HPA + topology spread)
Region failure	1 min (WAL shipped cross-region)	< 30 min (promote DR replica; DNS update)
Data corruption	1 min	≤ 1h (PITR replay)

DR drill: quarterly per region. Full platform-wide DR drill: annual.

10. Rollback Procedure

`kubectl rollout undo deployment/tenant-service` rolls the Deployment back to its previous revision (previous image and pod template).
Migrations are forward-only; roll back schema changes with a pre-tested compensating migration.
New behaviors ship behind feature-flag kill-switches that can be toggled without a redeploy.
Database rollback via PITR is a last resort and requires a coordinated multi-service operation.
