Deployment Topology
:::info Source
Sourced from services/tenant-service/DEPLOYMENT_TOPOLOGY.md in the documentation repo.
:::
Blueprint doc 12 of 17. Companion: 01 Enterprise Architecture | SECURITY_MODEL | OBSERVABILITY
1. Topology Overview
                       Cloudflare Edge (TLS, WAF, DDoS)
                                       │
                         ┌─────────────┴─────────────┐
                         │   API Gateway (NestJS)    │
                         │  tenant-aware proxy init  │
                         └─────────────┬─────────────┘
                                       │ mTLS
         ┌─────────────────────────────┼─────────────────────────────┐
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  tenant-service   │         │  tenant-service   │         │  tenant-service   │
│  Region: us-*     │         │  Region: eu-*     │         │  Region: me-*     │
│  (3+ replicas)    │         │  (3+ replicas)    │         │  (3+ replicas)    │
└────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  PgBouncer pool   │         │  PgBouncer pool   │         │  PgBouncer pool   │
│ (transaction mode)│         │ (transaction mode)│         │ (transaction mode)│
└────────┬──────────┘         └────────┬──────────┘         └────────┬──────────┘
         │                             │                             │
┌────────▼──────────┐         ┌────────▼──────────┐         ┌────────▼──────────┐
│  Postgres 16      │         │  Postgres 16      │         │  Postgres 16      │
│  (1 primary +     │         │  (1 primary +     │         │  (1 primary +     │
│   2 replicas)     │         │   2 replicas)     │         │   2 replicas)     │
└───────────────────┘         └───────────────────┘         └───────────────────┘

                 ┌────────────────────────────────────────────┐
                 │  NATS JetStream (multi-region cluster)     │
                 │  TENANT stream partitioned by tenantId     │
                 └────────────────────────────────────────────┘

                 ┌────────────────────────────────────────────┐
                 │  Redis cluster (per region)                │
                 │  authz cache, tenant cache, rate limits    │
                 └────────────────────────────────────────────┘
2. Regions
| Region code | Location | Purpose |
|---|---|---|
| us | us-east-1, us-west-2 (active/active) | North America tenants |
| eu | eu-central-1 (Frankfurt), eu-west-1 (Dublin) | EU tenants (GDPR residency) |
| me | me-central-1 (UAE), af-south-1 (South Africa) | Middle East + Africa |
| ap | ap-southeast-1 (Singapore), ap-south-1 (Mumbai) | Asia-Pacific |
Tenant data is pinned to homeRegion. Cross-region access requires explicit residency migration.
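A minimal sketch of how a request path might enforce this pinning. Only `homeRegion` and the four region codes come from this document; the `Tenant` shape, the `RegionCheck` result type, and `checkHomeRegion` are hypothetical names used for illustration.

```typescript
// Region codes from the Regions table.
type RegionCode = "us" | "eu" | "me" | "ap";

// Hypothetical tenant record; only `homeRegion` is named in this document.
interface Tenant {
  id: string;
  homeRegion: RegionCode;
}

// Result of the check: either allowed, or denied with the region that owns the data.
type RegionCheck =
  | { allowed: true }
  | { allowed: false; homeRegion: RegionCode; reason: string };

// Allows the request only when the serving region is the tenant's home region.
function checkHomeRegion(tenant: Tenant, servingRegion: RegionCode): RegionCheck {
  if (tenant.homeRegion === servingRegion) {
    return { allowed: true };
  }
  return {
    allowed: false,
    homeRegion: tenant.homeRegion,
    reason: `tenant ${tenant.id} is pinned to ${tenant.homeRegion}; ${servingRegion} may not serve it`,
  };
}
```

Returning the owning region lets a caller redirect or reject as it sees fit; which of the two the gateway does is not specified here.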
3. Kubernetes Deployment
3.1 Helm Chart Structure
charts/tenant-service/
├── Chart.yaml
├── values.yaml
├── values.us-prod.yaml
├── values.eu-prod.yaml
├── values.me-prod.yaml
├── values.ap-prod.yaml
├── values.staging.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    ├── pdb.yaml
    ├── networkpolicy.yaml
    ├── servicemonitor.yaml
    ├── prometheusrule.yaml
    ├── serviceaccount.yaml
    └── secrets-external.yaml   # ExternalSecrets refs
3.2 Baseline Deployment Spec
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-service
  labels:
    app: tenant-service
    tier: core
    region: {{ .Values.region }}
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tenant-service
  template:
    metadata:
      labels:
        app: tenant-service
        region: {{ .Values.region }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: tenant-service
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: tenant-service
          image: ghasi/tenant-service:{{ .Values.image.tag }}
          imagePullPolicy: IfNotPresent
          ports:
            - { name: http, containerPort: 3000 }
            - { name: metrics, containerPort: 9090 }
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: 2, memory: 2Gi }
          env:
            - { name: NODE_ENV, value: production }
            - { name: REGION, value: {{ .Values.region }} }
            - { name: OTEL_SERVICE_NAME, value: tenant-service }
            - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: http://otel-collector.observability:4317 }
          envFrom:
            - secretRef: { name: tenant-service-secrets }   # populated by ExternalSecrets
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            httpGet: { path: /health/startup, port: http }
            failureThreshold: 30
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          volumeMounts:
            - { name: tmp, mountPath: /tmp }
      volumes:
        - { name: tmp, emptyDir: {} }
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: { matchLabels: { app: tenant-service } }
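The startup probe in this spec gives the container a bounded window to come up, since Kubernetes holds back liveness and readiness checks until the startup probe succeeds. The arithmetic can be sketched as follows; `ProbeTiming` and `failureBudgetSeconds` are hypothetical helper names, and the estimate ignores `timeoutSeconds` and scheduling jitter.

```typescript
// Timing fields of a Kubernetes probe; field names match the probe spec.
interface ProbeTiming {
  initialDelaySeconds?: number; // Kubernetes defaults this to 0
  periodSeconds: number;
  failureThreshold: number;
}

// Rough upper bound, in seconds, on how long a probe may keep failing
// before the kubelet acts on it.
function failureBudgetSeconds(probe: ProbeTiming): number {
  const initialDelay = probe.initialDelaySeconds ?? 0;
  return initialDelay + probe.failureThreshold * probe.periodSeconds;
}

// Values copied from the startupProbe in the Deployment spec.
const startupProbe: ProbeTiming = { periodSeconds: 5, failureThreshold: 30 };
const startupBudgetSeconds = failureBudgetSeconds(startupProbe); // 30 * 5 = 150
```

Under these settings a pod that has not answered `/health/startup` within roughly 150 seconds is restarted.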
3.3 HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: tenant-service }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: tenant-service }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
    - type: Pods
      pods:
        metric: { name: tenant_http_requests_per_second }
        target: { type: AverageValue, averageValue: "500" }
  behavior:
    scaleUp: { stabilizationWindowSeconds: 60, policies: [{ type: Percent, value: 100, periodSeconds: 30 }] }
    scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: Percent, value: 50, periodSeconds: 60 }] }
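To make the two metrics concrete, here is a sketch of the replica calculation Kubernetes documents for the HPA: each metric proposes `ceil(currentReplicas * current / target)`, the largest proposal wins, and the result is clamped to the min/max bounds. The function and type names are illustrative, and the sketch deliberately leaves out the `behavior` rate limits and stabilization windows.

```typescript
// One metric as the HPA sees it: observed per-pod average vs. configured target.
interface MetricReading {
  current: number;
  target: number;
}

const MIN_REPLICAS = 3;  // minReplicas in the HPA spec
const MAX_REPLICAS = 20; // maxReplicas in the HPA spec
const TOLERANCE = 0.1;   // Kubernetes default: no change while the ratio is within 10% of 1.0

// Each metric proposes a replica count; the largest proposal wins, then bounds apply.
function desiredReplicas(currentReplicas: number, metrics: MetricReading[]): number {
  const proposals = metrics.map(({ current, target }) => {
    const ratio = current / target;
    return Math.abs(ratio - 1) <= TOLERANCE
      ? currentReplicas
      : Math.ceil(currentReplicas * ratio);
  });
  const largest = proposals.length > 0 ? Math.max(...proposals) : currentReplicas;
  return Math.min(MAX_REPLICAS, Math.max(MIN_REPLICAS, largest));
}

// 3 pods at 90% CPU (target 70) and 900 rps per pod (target 500):
// CPU proposes ceil(3 * 90/70) = 4, rps proposes ceil(3 * 900/500) = 6.
const nextReplicas = desiredReplicas(3, [
  { current: 90, target: 70 },
  { current: 900, target: 500 },
]); // 6
```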
3.4 PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: tenant-service }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: tenant-service } }
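With `minAvailable: 2`, the number of voluntary evictions (node drains, for example) permitted at any moment is the count of healthy pods minus two. A small sketch of that arithmetic, with `allowedDisruptions` as a hypothetical helper name:

```typescript
// Voluntary disruptions a PodDisruptionBudget permits right now:
// healthy pods above the minAvailable floor, never negative.
function allowedDisruptions(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

const atBaseline = allowedDisruptions(3, 2); // 1: one pod may be evicted at a time
const degraded = allowedDisruptions(2, 2);   // 0: evictions are blocked until a pod recovers
```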
3.5 NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: tenant-service-np }
spec:
  podSelector: { matchLabels: { app: tenant-service } }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: api-gateway } }
        - namespaceSelector: { matchLabels: { name: sync-service } }
        - namespaceSelector: { matchLabels: { name: observability } }
      ports:
        - { port: 3000, protocol: TCP }
        - { port: 9090, protocol: TCP }
  egress:
    - to:
        - namespaceSelector: { matchLabels: { name: data-postgres } }
      ports: [{ port: 6432, protocol: TCP }]   # PgBouncer
    - to:
        - namespaceSelector: { matchLabels: { name: data-redis } }
      ports: [{ port: 6379, protocol: TCP }]
    - to:
        - namespaceSelector: { matchLabels: { name: messaging-nats } }
      ports: [{ port: 4222, protocol: TCP }]
    - to:
        - namespaceSelector: { matchLabels: { name: identity-service } }
      ports: [{ port: 443, protocol: TCP }]
    - to:   # DNS
        - namespaceSelector: { matchLabels: { name: kube-system } }
      ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]
4. Infrastructure
4.1 Postgres
| Attribute | Config |
|---|---|
| Engine | PostgreSQL 16 (managed; RDS / Cloud SQL / Aurora-compatible) |
| Instance class | db.r6g.xlarge (primary) / db.r6g.large × 2 (replicas) |
| Storage | 500 GB GP3 (auto-scale) |
| Extensions | ltree, pg_trgm, pgcrypto, uuid-ossp |
| Parameter tweaks | max_connections=500, shared_buffers=25% of RAM, work_mem=16MB, effective_cache_size=75% of RAM |
| Backups | Continuous WAL to S3; nightly snapshot; 35-day PITR |
| Encryption | AES-256 at rest (KMS); TLS 1.3 in transit |
4.2 PgBouncer
| Attribute | Config |
|---|---|
| Mode | transaction (required for RLS SET LOCAL) |
| Pool size per replica | 200 |
| Max client conns | 10,000 |
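| Init hook | The application sets app.tenant_id at the start of every transaction from request context. SET LOCAL cannot take a bind parameter, so the parameterized form is SELECT set_config('app.tenant_id', $1, true) |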
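A sketch of how the per-transaction tenant context might be applied from application code. PostgreSQL's `SET` does not accept bind parameters, so the sketch uses `set_config(name, value, true)`, which has the same transaction-local scope as `SET LOCAL`. The `TxClient` interface and `withTenant` helper are hypothetical; a real service would use its database driver's client type.

```typescript
// Minimal query interface standing in for a real driver client (assumption).
interface TxClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Runs `work` inside one transaction with app.tenant_id set for that transaction only.
// The setting disappears at COMMIT/ROLLBACK, so it cannot leak to the next client
// that PgBouncer hands the same server connection to.
async function withTenant<T>(
  client: TxClient,
  tenantId: string,
  work: (c: TxClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // Third argument `true` means "local to the current transaction".
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every statement that depends on RLS has to run inside `work`; a query issued outside the transaction would see no tenant context.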
4.3 Redis
| Attribute | Config |
|---|---|
| Deployment | Redis 7 cluster mode, 3 shards × 2 replicas per region |
| TLS | Enabled (TLS 1.3) |
| Auth | Token (rotated 90 days) |
| Eviction | allkeys-lru |
| Persistence | AOF with everysec |
| Namespaces | tenants/{tid}/authz/..., tenants/{tid}/profile, etc. |
4.4 NATS JetStream
| Attribute | Config |
|---|---|
| Cluster | 5 nodes, cross-region replication (R3 for hot, R5 for regulated) |
| Streams | TENANT (partition=16), TENANT.DLQ |
| Storage | File-backed; 1 TB per node |
| Auth | NKey + per-service permissions |
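One way the 16-way partitioning by tenantId could be computed on the publisher side is a stable hash of the tenant ID modulo the partition count. This is a sketch under that assumption: the document does not say which hash is used, and FNV-1a is swapped in here only because it is small and deterministic. `fnv1a32` and `partitionFor` are hypothetical names.

```typescript
const PARTITIONS = 16; // partition count of the TENANT stream

// 32-bit FNV-1a over the string's UTF-16 code units.
// For ASCII tenant IDs this matches byte-wise FNV-1a.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wrapping multiply
  }
  return hash >>> 0; // force unsigned
}

// Maps a tenant to a partition in [0, PARTITIONS). The same tenant always lands
// on the same partition, which preserves per-tenant ordering.
function partitionFor(tenantId: string): number {
  return fnv1a32(tenantId) % PARTITIONS;
}
```

Changing `PARTITIONS` remaps most tenants, so the count is effectively fixed once events are flowing.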
5. Secret Management
ExternalSecrets operator syncs from HashiCorp Vault / AWS Secrets Manager / GCP Secret Manager:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: tenant-service-secrets }
spec:
  refreshInterval: 1h
  secretStoreRef: { name: vault-backend, kind: ClusterSecretStore }
  target:
    name: tenant-service-secrets
    creationPolicy: Owner
  data:
    - { secretKey: DATABASE_URL, remoteRef: { key: secret/tenant-service/db, property: url } }
    - { secretKey: REDIS_URL, remoteRef: { key: secret/tenant-service/redis, property: url } }
    - { secretKey: NATS_CREDS, remoteRef: { key: secret/tenant-service/nats, property: creds } }
    - { secretKey: KMS_KEY_ID, remoteRef: { key: secret/tenant-service/kms, property: key_id } }
    - { secretKey: INVITE_TOKEN_SALT, remoteRef: { key: secret/tenant-service/crypto, property: invite_salt } }
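Because these keys arrive as environment variables via `envFrom`, the service can fail fast at boot if a sync has not populated one of them. A minimal sketch of such a check: the key names are taken from the ExternalSecret, while `missingSecrets` and the idea of calling it at startup are assumptions, not something this document specifies.

```typescript
// Secret keys the ExternalSecret is expected to project into the pod environment.
const REQUIRED_SECRET_KEYS = [
  "DATABASE_URL",
  "REDIS_URL",
  "NATS_CREDS",
  "KMS_KEY_ID",
  "INVITE_TOKEN_SALT",
] as const;

// Returns the required keys that are absent or blank in the given environment map.
function missingSecrets(env: Record<string, string | undefined>): string[] {
  return REQUIRED_SECRET_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}
```

At startup the service could pass its environment map to `missingSecrets` and refuse to start when the result is non-empty, which surfaces a broken secret sync as a failed rollout rather than a runtime error.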
6. CI/CD Pipeline
PR opened
  ├─ Lint + typecheck
  ├─ Unit tests + coverage gate (≥ 85%)
  ├─ Integration tests (Testcontainers)
  ├─ Two-tenant isolation suite
  ├─ Contract verification (Pact broker)
  ├─ Event schema backward-compat check
  ├─ OpenAPI diff check
  ├─ SAST (CodeQL, Semgrep)
  ├─ SBOM + provenance attestation (SLSA Level 3)
  └─ Docker build + sign (cosign)
Merge to main
  ├─ Tag semver
  ├─ Deploy to staging (all regions)
  ├─ k6 smoke test in staging
  ├─ Chaos drill (automated)
  └─ Manual approval for prod
Deploy to prod
  ├─ Canary: 1 region, 5% traffic for 30 min
  ├─ SLO burn rate monitored; auto-rollback if > 2x baseline
  ├─ Ramp to 50% → 100% in 2h
  └─ Deploy to remaining regions sequentially
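The canary gate in the prod stage can be read as a three-way decision: roll back when the canary's SLO burn rate exceeds twice the baseline, hold until the 30-minute window has elapsed, otherwise promote. A sketch of that reading follows; the type and function names are hypothetical, and how burn rates are measured is outside its scope.

```typescript
type CanaryDecision = "rollback" | "hold" | "promote";

interface CanaryPolicy {
  maxBurnRateMultiple: number;   // roll back when canary burn rate exceeds this multiple of baseline
  minObservationMinutes: number; // soak time at canary traffic before ramping
}

// Values from the pipeline: 2x baseline, 30 minutes at 5% traffic.
const CANARY_POLICY: CanaryPolicy = { maxBurnRateMultiple: 2, minObservationMinutes: 30 };

// Rollback takes priority over the soak timer, so a bad canary is pulled immediately.
// With a baseline of 0, any positive canary burn rate triggers a rollback.
function evaluateCanary(
  canaryBurnRate: number,
  baselineBurnRate: number,
  elapsedMinutes: number,
  policy: CanaryPolicy = CANARY_POLICY,
): CanaryDecision {
  if (canaryBurnRate > policy.maxBurnRateMultiple * baselineBurnRate) return "rollback";
  if (elapsedMinutes < policy.minObservationMinutes) return "hold";
  return "promote";
}
```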
7. Resource Planning
| Tier | Replicas/region | Requests/pod | Peak rps/pod |
|---|---|---|---|
| Steady state | 3 | 0.5 CPU / 512Mi | ~500 |
| Peak (tenant onboarding event) | 6 | same | ~1k |
| HPA ceiling | 20 | — | ~10k |
Cost-plan baseline (per region): ~$800/mo (3 pods + Postgres primary + 2 replicas + Redis cluster shard).
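As a rough cross-check of the HPA ceiling row, assume each pod sustains the 500 rps per-pod target from the HorizontalPodAutoscaler in section 3.3; regional throughput then scales linearly with replica count. The helper name below is hypothetical, and linear scaling is itself an assumption that ignores database and cache limits.

```typescript
const TARGET_RPS_PER_POD = 500; // averageValue of the HPA Pods metric

// Aggregate requests per second one region can absorb at a given replica count,
// assuming every pod runs at the per-pod target.
function regionalCapacityRps(replicas: number, rpsPerPod: number = TARGET_RPS_PER_POD): number {
  return replicas * rpsPerPod;
}

const ceilingRps = regionalCapacityRps(20); // 20 * 500 = 10000
```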
8. Service Dependencies (runtime)
| Dependency | Purpose | Degradation behavior |
|---|---|---|
| Postgres | Primary datastore | Reads fall back to replica; writes fail with 503 + Retry-After |
| Redis | Authz cache, rate limits | Cache miss → DB fallback; rate limit fails open with alert |
| NATS JetStream | Event pub/sub | Outbox queues; consumer lag alert; writes succeed |
| identity-service (JWKS) | JWT validation | Cached 1h; stale-but-usable fallback |
| ai-gateway-service | AI features | Fail-soft: hide suggestions, keep core flows |
| billing-service | Plan entitlement lookup | Cached 15 min; proceed on lookup failure with last-known |
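The "cached, stale-but-usable" behavior listed for JWKS and entitlement lookups can be sketched as a cache that serves fresh entries within the TTL, refreshes after it, and falls back to the last good value when the refresh fails. `createStaleTolerantCache` is a hypothetical helper, not the service's actual implementation; the loader and clock are injected so the behavior is easy to exercise.

```typescript
// Builds a getter that caches one value for `ttlMs` and tolerates loader failures
// by returning the last successfully loaded value.
function createStaleTolerantCache<T>(
  ttlMs: number,
  load: () => Promise<T>,
  now: () => number = Date.now,
): () => Promise<T> {
  let entry: { value: T; fetchedAtMs: number } | undefined;

  return async function get(): Promise<T> {
    // Fresh hit: no call to the dependency.
    if (entry !== undefined && now() - entry.fetchedAtMs < ttlMs) {
      return entry.value;
    }
    try {
      const value = await load();
      entry = { value, fetchedAtMs: now() };
      return value;
    } catch (err) {
      // Dependency is down: serve the stale value if there is one.
      if (entry !== undefined) return entry.value;
      throw err; // nothing cached yet, so the failure must surface
    }
  };
}

// TTLs from the dependency table: 1 h for JWKS, 15 min for entitlements.
const JWKS_TTL_MS = 60 * 60 * 1000;
const ENTITLEMENT_TTL_MS = 15 * 60 * 1000;
```

A production version would likely also bound how stale a value may get and emit an alert while serving stale data; neither is modeled here.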
9. Disaster Recovery
| Scenario | RPO | RTO |
|---|---|---|
| Single pod failure | 0 | < 30s (k8s reschedules) |
| AZ failure | 0 | < 2 min (HPA + topology spread) |
| Region failure | 1 min (WAL shipped cross-region) | < 30 min (promote DR replica; DNS update) |
| Data corruption | 1 min | ≤ 1h (PITR replay) |
DR drill: quarterly per region. Full platform-wide DR drill: annual.
10. Rollback Procedure
- kubectl rollout undo deployment/tenant-service (restores previous image).
- Migrations: forward-only; rollback via compensating migration (pre-tested).
- Feature-flag kill-switches for new behaviors; toggle without redeploy.
- Database rollback only via PITR in worst case (coordinated multi-service operation).