
Deployment Topology

:::info Source
Sourced from services/tenant-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::

Blueprint doc 12 of 17. Companion: 01 Enterprise Architecture | SECURITY_MODEL | OBSERVABILITY


1. Topology Overview

                       Cloudflare Edge (TLS, WAF, DDoS)
                                       │
                         ┌─────────────┴─────────────┐
                         │   API Gateway (NestJS)    │
                         │  tenant-aware proxy init  │
                         └─────────────┬─────────────┘
                                       │ mTLS
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  tenant-service   │         │  tenant-service   │         │  tenant-service   │
│  Region: us-*     │         │  Region: eu-*     │         │  Region: me-*     │
│  (3+ replicas)    │         │  (3+ replicas)    │         │  (3+ replicas)    │
└────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  PgBouncer pool   │         │  PgBouncer pool   │         │  PgBouncer pool   │
│ (transaction mode)│         │ (transaction mode)│         │ (transaction mode)│
└────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  Postgres 16      │         │  Postgres 16      │         │  Postgres 16      │
│  (1 primary +     │         │  (1 primary +     │         │  (1 primary +     │
│   2 replicas)     │         │   2 replicas)     │         │   2 replicas)     │
└───────────────────┘         └───────────────────┘         └───────────────────┘

                 ┌────────────────────────────────────────────┐
                 │  NATS JetStream (multi-region cluster)     │
                 │  TENANT stream partitioned by tenantId     │
                 └────────────────────────────────────────────┘

                 ┌────────────────────────────────────────────┐
                 │  Redis cluster (per region)                │
                 │  authz cache, tenant cache, rate limits    │
                 └────────────────────────────────────────────┘

2. Regions

| Region code | Location | Purpose |
| --- | --- | --- |
| us | us-east-1, us-west-2 (active/active) | North America tenants |
| eu | eu-central-1 (Frankfurt), eu-west-1 (Dublin) | EU tenants (GDPR residency) |
| me | me-central-1 (UAE), af-south-1 (South Africa) | Middle East + Africa |
| ap | ap-southeast-1 (Singapore), ap-south-1 (Mumbai) | Asia-Pacific |

Tenant data is pinned to homeRegion. Cross-region access requires explicit residency migration.
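
The pinning rule can be sketched as a small guard. This is an illustrative sketch, not the service's actual code: the `Tenant` shape, the `resolveRegion` helper, and the error text are assumptions; only `homeRegion` and the four region codes come from this document.

```typescript
// Region codes from the Regions table.
type RegionCode = "us" | "eu" | "me" | "ap";

// Hypothetical tenant shape; only `homeRegion` is named in this document.
interface Tenant {
  id: string;
  homeRegion: RegionCode;
}

// Returns the region allowed to serve the tenant, or throws when the
// serving region is not the tenant's home region.
function resolveRegion(tenant: Tenant, servingRegion: RegionCode): RegionCode {
  if (tenant.homeRegion !== servingRegion) {
    throw new Error(
      "tenant " + tenant.id + " is pinned to " + tenant.homeRegion +
      "; cross-region access requires a residency migration"
    );
  }
  return tenant.homeRegion;
}
```

A gateway-side check of this kind would reject or re-route a request before it reaches a tenant-service replica in the wrong region.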


3. Kubernetes Deployment

3.1 Helm Chart Structure

charts/tenant-service/
├── Chart.yaml
├── values.yaml
├── values.us-prod.yaml
├── values.eu-prod.yaml
├── values.me-prod.yaml
├── values.ap-prod.yaml
├── values.staging.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    ├── pdb.yaml
    ├── networkpolicy.yaml
    ├── servicemonitor.yaml
    ├── prometheusrule.yaml
    ├── serviceaccount.yaml
    └── secrets-external.yaml   # ExternalSecrets refs

3.2 Baseline Deployment Spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-service
  labels:
    app: tenant-service
    tier: core
    region: {{ .Values.region }}
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tenant-service
  template:
    metadata:
      labels:
        app: tenant-service
        region: {{ .Values.region }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: tenant-service
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: tenant-service
          image: ghasi/tenant-service:{{ .Values.image.tag }}
          imagePullPolicy: IfNotPresent
          ports:
            - { name: http, containerPort: 3000 }
            - { name: metrics, containerPort: 9090 }
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: 2, memory: 2Gi }
          env:
            - { name: NODE_ENV, value: production }
            - { name: REGION, value: {{ .Values.region }} }
            - { name: OTEL_SERVICE_NAME, value: tenant-service }
            - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: http://otel-collector.observability:4317 }
          envFrom:
            - secretRef: { name: tenant-service-secrets }  # populated by ExternalSecrets
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            httpGet: { path: /health/startup, port: http }
            failureThreshold: 30
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          volumeMounts:
            - { name: tmp, mountPath: /tmp }
      volumes:
        - { name: tmp, emptyDir: {} }
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: { matchLabels: { app: tenant-service } }

3.3 HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: tenant-service }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: tenant-service }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
    - type: Pods
      pods:
        metric: { name: tenant_http_requests_per_second }
        target: { type: AverageValue, averageValue: "500" }
  behavior:
    scaleUp: { stabilizationWindowSeconds: 60, policies: [{ type: Percent, value: 100, periodSeconds: 30 }] }
    scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: Percent, value: 50, periodSeconds: 60 }] }
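
To make the two targets concrete, the sketch below applies the replica formula documented for the Kubernetes HPA, desired = ceil(currentReplicas × currentValue / targetValue), taking the largest proposal across metrics and clamping to the configured bounds. It is a simplified model: it ignores the controller's tolerance band, the stabilization windows, and the scaling policies shown above. `MetricSample` and `desiredReplicas` are hypothetical names.

```typescript
interface MetricSample {
  current: number; // observed average per pod
  target: number;  // configured target per pod
}

// Largest per-metric proposal, clamped to [minReplicas, maxReplicas].
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricSample[],
  minReplicas: number,
  maxReplicas: number
): number {
  let desired = minReplicas;
  for (let i = 0; i < metrics.length; i++) {
    const m = metrics[i];
    const proposal = Math.ceil(currentReplicas * (m.current / m.target));
    if (proposal > desired) desired = proposal;
  }
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// 3 pods at 90% CPU (target 70) and 800 rps per pod (target 500):
// CPU proposes ceil(3 * 90 / 70) = 4, rps proposes ceil(3 * 800 / 500) = 5.
const hpaExample = desiredReplicas(
  3,
  [{ current: 90, target: 70 }, { current: 800, target: 500 }],
  3,
  20
); // 5
```

Because the larger proposal wins, the request-rate metric drives scale-up in this example even though CPU is also over target.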

3.4 PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: tenant-service }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: tenant-service } }
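
As a worked example of what `minAvailable: 2` permits, the sketch below computes how many pods may be voluntarily evicted at once (for instance during a node drain). The helper name `disruptionsAllowed` is hypothetical; the arithmetic is healthy pods minus the required minimum, floored at zero.

```typescript
// Number of pods that can be evicted right now without violating the budget.
function disruptionsAllowed(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

const atBaseline = disruptionsAllowed(3, 2); // 1: one pod at a time at 3 replicas
const degraded = disruptionsAllowed(2, 2);   // 0: drains block until a pod recovers
```

At the 3-replica baseline this means node drains proceed one tenant-service pod at a time.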

3.5 NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: tenant-service-np }
spec:
  podSelector: { matchLabels: { app: tenant-service } }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: api-gateway } }
        - namespaceSelector: { matchLabels: { name: sync-service } }
        - namespaceSelector: { matchLabels: { name: observability } }
      ports:
        - { port: 3000, protocol: TCP }
        - { port: 9090, protocol: TCP }
  egress:
    - to:
        - namespaceSelector: { matchLabels: { name: data-postgres } }
      ports: [{ port: 6432, protocol: TCP }]  # PgBouncer
    - to:
        - namespaceSelector: { matchLabels: { name: data-redis } }
      ports: [{ port: 6379, protocol: TCP }]
    - to:
        - namespaceSelector: { matchLabels: { name: messaging-nats } }
      ports: [{ port: 4222, protocol: TCP }]
    - to:
        - namespaceSelector: { matchLabels: { name: identity-service } }
      ports: [{ port: 443, protocol: TCP }]
    - to:  # OTLP export to otel-collector (OTEL_EXPORTER_OTLP_ENDPOINT in the Deployment)
        - namespaceSelector: { matchLabels: { name: observability } }
      ports: [{ port: 4317, protocol: TCP }]
    - to:  # DNS
        - namespaceSelector: { matchLabels: { name: kube-system } }
      ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]

4. Infrastructure

4.1 Postgres

| Attribute | Config |
| --- | --- |
| Engine | PostgreSQL 16 (managed; RDS / Cloud SQL / Aurora-compatible) |
| Instance class | db.r6g.xlarge (primary) / db.r6g.large × 2 (replicas) |
| Storage | 500 GB gp3 (auto-scaling) |
| Extensions | ltree, pg_trgm, pgcrypto, uuid-ossp |
| Parameter tweaks | max_connections=500, shared_buffers=25% of RAM, work_mem=16MB, effective_cache_size=75% of RAM |
| Backups | Continuous WAL archiving to S3; nightly snapshot; 35-day PITR window |
| Encryption | AES-256 at rest (KMS); TLS 1.3 in transit |

4.2 PgBouncer

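| Attribute | Config |
| --- | --- |
| Mode | transaction (the RLS tenant context must therefore be set with SET LOCAL inside each transaction, never at session level) |
| Pool size per replica | 200 |
| Max client conns | 10,000 |
| Init hook | SET LOCAL app.tenant_id per transaction, from app-supplied context. SET does not accept bind parameters, so pass the value as SELECT set_config('app.tenant_id', $1, true) |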
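
Under transaction pooling a server connection is handed to another client as soon as a transaction ends, so the tenant context has to be established inside every transaction. The sketch below builds the statement sequence for one tenant-scoped transaction. It uses PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which behaves like SET LOCAL but accepts a bind parameter (a plain SET statement does not). `Statement` and `tenantScoped` are hypothetical names, and a real implementation would execute these through the service's database client rather than return them.

```typescript
interface Statement {
  sql: string;
  params: string[];
}

// Wraps the caller's statements in a transaction that first sets the
// transaction-local tenant id read by the RLS policies.
function tenantScoped(tenantId: string, work: Statement[]): Statement[] {
  if (tenantId.length === 0) {
    throw new Error("tenantId is required");
  }
  const begin: Statement = { sql: "BEGIN", params: [] };
  // Third argument `true` limits the setting to the current transaction.
  const setTenant: Statement = {
    sql: "SELECT set_config('app.tenant_id', $1, true)",
    params: [tenantId]
  };
  const commit: Statement = { sql: "COMMIT", params: [] };
  return [begin, setTenant].concat(work, [commit]);
}
```

Because the setting is discarded at COMMIT or ROLLBACK, the next client that receives the pooled connection starts without any tenant context.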

4.3 Redis

| Attribute | Config |
| --- | --- |
| Deployment | Redis 7 cluster mode, 3 shards × 2 replicas per region |
| TLS | Enabled (TLS 1.3) |
| Auth | Token (rotated every 90 days) |
| Eviction | allkeys-lru |
| Persistence | AOF with appendfsync everysec |
| Namespaces | tenants/{tid}/authz/..., tenants/{tid}/profile, etc. |
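
A key builder keeps every cache entry under its tenant prefix. This is a minimal sketch: `tenantKey` is a hypothetical helper, and the segments after `authz` in the usage line are made up for illustration, since the document elides that part of the key.

```typescript
// Builds "tenants/{tid}/<part>/<part>..." and rejects segments that could
// escape the tenant prefix.
function tenantKey(tenantId: string, parts: string[]): string {
  if (parts.length === 0) {
    throw new Error("at least one key part is required");
  }
  const segments = [tenantId].concat(parts);
  for (let i = 0; i < segments.length; i++) {
    if (segments[i].length === 0 || segments[i].indexOf("/") !== -1) {
      throw new Error("key segments must be non-empty and must not contain '/'");
    }
  }
  return "tenants/" + segments.join("/");
}

const profileKey = tenantKey("t-42", ["profile"]);              // "tenants/t-42/profile"
const authzKey = tenantKey("t-42", ["authz", "role", "admin"]); // hypothetical authz sub-path
```

Rejecting `/` inside a segment prevents a crafted tenant id from addressing another tenant's keys.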

4.4 NATS JetStream

| Attribute | Config |
| --- | --- |
| Cluster | 5 nodes, cross-region replication (R3 for hot, R5 for regulated) |
| Streams | TENANT (partition=16), TENANT.DLQ |
| Storage | File-backed; 1 TB per node |
| Auth | NKey + per-service permissions |
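
The document states that the TENANT stream is partitioned by tenantId into 16 partitions but does not say how a tenant maps to a partition. One common approach, shown here purely as an assumption, is a stable hash modulo the partition count. The sketch uses 32-bit FNV-1a; `fnv1a32` and `partitionFor` are hypothetical names, and tenant ids are assumed to be ASCII.

```typescript
// 32-bit FNV-1a over the low byte of each character (ASCII ids assumed).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i) & 0xff;
    // Multiply by the FNV prime 16777619 (2^24 + 2^8 + 0x93) using shifts,
    // then reduce to an unsigned 32-bit value.
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return hash >>> 0;
}

// Maps a tenant to a partition index in [0, partitions).
function partitionFor(tenantId: string, partitions: number): number {
  return fnv1a32(tenantId) % partitions;
}
```

Any stable hash works as long as publishers and consumers agree on it; changing the function or the partition count remaps tenants and breaks per-tenant ordering during the transition.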

5. Secret Management

ExternalSecrets operator syncs from HashiCorp Vault / AWS Secrets Manager / GCP Secret Manager:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: tenant-service-secrets }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: vault-backend, kind: ClusterSecretStore }
  target:
    name: tenant-service-secrets
    creationPolicy: Owner
  data:
    - { secretKey: DATABASE_URL, remoteRef: { key: secret/tenant-service/db, property: url } }
    - { secretKey: REDIS_URL, remoteRef: { key: secret/tenant-service/redis, property: url } }
    - { secretKey: NATS_CREDS, remoteRef: { key: secret/tenant-service/nats, property: creds } }
    - { secretKey: KMS_KEY_ID, remoteRef: { key: secret/tenant-service/kms, property: key_id } }
    - { secretKey: INVITE_TOKEN_SALT, remoteRef: { key: secret/tenant-service/crypto, property: invite_salt } }

6. CI/CD Pipeline

PR opened
├─ Lint + typecheck
├─ Unit tests + coverage gate (≥ 85%)
├─ Integration tests (Testcontainers)
├─ Two-tenant isolation suite
├─ Contract verification (Pact broker)
├─ Event schema backward-compat check
├─ OpenAPI diff check
├─ SAST (CodeQL, Semgrep)
├─ SBOM + provenance attestation (SLSA Level 3)
└─ Docker build + sign (cosign)

Merge to main
├─ Tag semver
├─ Deploy to staging (all regions)
├─ k6 smoke test in staging
├─ Chaos drill (automated)
└─ Manual approval for prod

Deploy to prod
├─ Canary: 1 region, 5% traffic for 30 min
├─ SLO burn rate monitored; auto-rollback if > 2x baseline
├─ Ramp to 50% → 100% in 2h
└─ Deploy to remaining regions sequentially
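
The canary gate in the prod stage can be sketched as a pure decision function. The 2x threshold and the 30-minute window come from the pipeline above; the type names, the field names, and the idea of evaluating one window at a time are assumptions made for illustration.

```typescript
type CanaryDecision = "rollback" | "hold" | "promote";

interface CanaryWindow {
  burnRate: number;         // observed SLO burn rate on the canary
  baselineBurnRate: number; // burn rate of the stable fleet
  elapsedMinutes: number;   // time since the canary started taking traffic
}

// Roll back as soon as burn exceeds 2x baseline; otherwise wait out the
// 30-minute canary window before ramping further.
function decideCanary(w: CanaryWindow): CanaryDecision {
  if (w.burnRate > 2 * w.baselineBurnRate) return "rollback";
  if (w.elapsedMinutes < 30) return "hold";
  return "promote";
}
```

Checking the rollback condition first means a bad canary is reverted immediately rather than after the full 30 minutes.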

7. Resource Planning

| Tier | Replicas/region | Requests/pod | Peak rps/pod |
| --- | --- | --- | --- |
| Steady state | 3 | 0.5 CPU / 512Mi | ~500 |
| Peak (tenant onboarding event) | 6 | same | ~1k |
| HPA ceiling | 20 | | ~10k |

Cost-plan baseline (per region): ~$800/mo (3 pods + Postgres primary + 2 replicas + Redis cluster shard).
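
For rough sizing, the HPA target of 500 rps per pod can be turned into a replica estimate. This is a back-of-the-envelope sketch, not a capacity model: `replicasFor` is a hypothetical helper, and it assumes request rate is the binding metric. Under that reading, 20 pods × 500 rps gives 10,000 rps per region, which lines up with the ~10k figure at the HPA ceiling if that figure is taken as a regional aggregate (an interpretation, not something the table states).

```typescript
const TARGET_RPS_PER_POD = 500; // HPA Pods metric target
const MIN_REPLICAS = 3;         // HPA minReplicas
const MAX_REPLICAS = 20;        // HPA maxReplicas

// Replicas needed in one region for a given request rate, within HPA bounds.
function replicasFor(regionRps: number): number {
  const needed = Math.ceil(regionRps / TARGET_RPS_PER_POD);
  return Math.min(MAX_REPLICAS, Math.max(MIN_REPLICAS, needed));
}

const steady = replicasFor(1200);   // 3: ceil(2.4) = 3
const onboarding = replicasFor(2600); // 6: ceil(5.2) = 6
const saturated = replicasFor(50000); // 20: clamped at the HPA ceiling
```

Traffic beyond 10,000 rps in a region stays clamped at 20 pods, so sustained load above that level calls for raising `maxReplicas` or the per-pod target.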


8. Service Dependencies (runtime)

| Dependency | Purpose | Degradation behavior |
| --- | --- | --- |
| Postgres | Primary datastore | Reads fall back to replica; writes fail with 503 + Retry-After |
| Redis | Authz cache, rate limits | Cache miss → DB fallback; rate limit fails open with alert |
| NATS JetStream | Event pub/sub | Events queue in the outbox; consumer-lag alert fires; writes still succeed |
| identity-service (JWKS) | JWT validation | Cached 1h; stale-but-usable fallback |
| ai-gateway-service | AI features | Fail-soft: hide suggestions, keep core flows |
| billing-service | Plan entitlement lookup | Cached 15 min; on lookup failure, proceed with last-known entitlement |
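
The "cached, stale-but-usable" behavior listed for JWKS and entitlement lookups can be sketched as a cache that refreshes after its TTL but serves the last good value when the refresh fails. This is a simplified, synchronous sketch under stated assumptions: `StaleTolerantCache` is a hypothetical class, the loader is injected, and a real client would be asynchronous and would likely cap how long stale data may be served.

```typescript
interface CacheEntry<T> {
  value: T;
  fetchedAtMs: number;
}

class StaleTolerantCache<T> {
  private entry: CacheEntry<T> | null = null;
  private readonly ttlMs: number;
  private readonly load: () => T;     // may throw when the dependency is down
  private readonly now: () => number; // injectable clock, in milliseconds

  constructor(ttlMs: number, load: () => T, now: () => number) {
    this.ttlMs = ttlMs;
    this.load = load;
    this.now = now;
  }

  // Fresh value if within TTL, refreshed value if the loader succeeds,
  // last-known value (flagged stale) if the loader fails.
  get(): { value: T; stale: boolean } {
    const t = this.now();
    if (this.entry !== null && t - this.entry.fetchedAtMs < this.ttlMs) {
      return { value: this.entry.value, stale: false };
    }
    try {
      const value = this.load();
      this.entry = { value: value, fetchedAtMs: t };
      return { value: value, stale: false };
    } catch (err) {
      if (this.entry !== null) {
        return { value: this.entry.value, stale: true };
      }
      throw err; // nothing cached yet: surface the failure
    }
  }
}
```

Returning the `stale` flag lets the caller emit a metric or alert while continuing to serve requests.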

9. Disaster Recovery

| Scenario | RPO | RTO |
| --- | --- | --- |
| Single pod failure | 0 | < 30s (k8s reschedules) |
| AZ failure | 0 | < 2 min (HPA + topology spread) |
| Region failure | 1 min (WAL shipped cross-region) | < 30 min (promote DR replica; DNS update) |
| Data corruption | 1 min | ≤ 1h (PITR replay) |

DR drill: quarterly per region. Full platform-wide DR drill: annual.


10. Rollback Procedure

  1. kubectl rollout undo deployment/tenant-service (reverts to the previous Deployment revision, including its image).
  2. Migrations: forward-only; rollback via compensating migration (pre-tested).
  3. Feature-flag kill-switches for new behaviors; toggle without redeploy.
  4. Database rollback only via PITR in worst case (coordinated multi-service operation).