iam-service — Deployment Topology
GCP-native deployment. Cloud Run for compute, Cloud SQL for state, Memorystore for cache, KMS for crypto, Pub/Sub for events, Secret Manager for secrets. Multi-region active-active by M2.
1. Containers
| Container | Purpose | Replicas (prod) | CPU | Memory |
|---|---|---|---|---|
| iam-api | HTTP/REST + JWKS endpoint | min 2, max 50 | 1 vCPU | 512 MiB |
| iam-worker | Outbox relay, inbox consumer, scheduled jobs (key rotation, breach audit, cert expiry) | min 2, max 10 | 1 vCPU | 512 MiB |
| iam-jwks | Static-content variant of API serving only /.well-known/jwks.json (cache-friendly, no DB) | min 2, max 5 | 0.5 vCPU | 256 MiB |
All three are built from the same source repo with different entrypoints. The image tag is the short git SHA; latest is not used in prod.
2. Scaling Rules
| Container | Trigger | Threshold |
|---|---|---|
| iam-api | Concurrent requests | target 80 / instance |
| iam-api | CPU | target 60 % |
| iam-worker | Outbox depth | scale-up if depth > 100 for 30 s |
| iam-worker | Pub/Sub subscription backlog | scale-up if num_undelivered_messages > 500 |
| iam-jwks | Concurrent requests | target 200 / instance |
Cold-start mitigation: min instances ≥ 2 in prod always. CPU is allocated even when idle for iam-api (Cloud Run "CPU always allocated").
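
The two iam-worker triggers in the table can be read as a single predicate. The following is a minimal sketch, assuming a hypothetical list of (timestamp, outbox depth) samples supplied by the caller. The function and constant names are illustrative, and in a real deployment this decision would be made by the autoscaler rather than by application code.

```python
from typing import Iterable, Optional, Tuple

OUTBOX_DEPTH_THRESHOLD = 100   # table: depth > 100
OUTBOX_SUSTAIN_SECONDS = 30    # table: sustained for 30 s
BACKLOG_THRESHOLD = 500        # table: num_undelivered_messages > 500


def outbox_sustained(samples: Iterable[Tuple[float, int]], now: float) -> bool:
    """True if depth has stayed above the threshold for at least 30 s up to `now`."""
    breach_start: Optional[float] = None
    for ts, depth in sorted(samples):
        if ts > now:
            break
        if depth > OUTBOX_DEPTH_THRESHOLD:
            if breach_start is None:
                breach_start = ts      # first sample of the current breach run
        else:
            breach_start = None        # a healthy sample resets the run
    return breach_start is not None and now - breach_start >= OUTBOX_SUSTAIN_SECONDS


def should_scale_up(samples: Iterable[Tuple[float, int]], backlog: int, now: float) -> bool:
    """Scale up when either trigger from the table fires."""
    return backlog > BACKLOG_THRESHOLD or outbox_sustained(samples, now)
```

A single dip to or below the threshold resets the 30 s window, which is one reasonable reading of "for 30 s"; the source does not say how gaps between samples should be treated.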
3. Resource Budgets
| Resource | Limit | Notes |
|---|---|---|
| Request timeout | 30 s | Most requests < 1 s; 30 s for OIDC flows w/ slow IdPs |
| Max concurrent requests / instance | 100 | tuned per release |
| Container startup probe | GET /health/startup 200, deadline 30 s | |
| Liveness probe | GET /health/live every 30 s | |
| Readiness probe | GET /health/ready every 10 s | Checks DB, Redis, KMS |
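
The readiness probe aggregates three dependency checks. As a rough sketch of that aggregation, assuming each dependency check is a hypothetical zero-argument callable that returns True when healthy (the names and the 503 status on failure are assumptions, not taken from the service code):

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency detail)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a raising check counts as not ready
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A handler for /health/ready would call this with checks for the database, Redis, and KMS and return the status code; the per-dependency detail is useful in logs but should probably not be exposed publicly.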
4. Storage Topology
| Layer | Service | Config |
|---|---|---|
| Primary DB | Cloud SQL Postgres 15 (Enterprise Plus) | HA (regional, read replicas in 2 zones); CMEK; PITR 7 d; backup daily 35 d retention |
| Sessions cache | Memorystore Redis 7 | HA (Standard tier, 5 GB), VPC-attached, AUTH enabled |
| Crypto | Cloud KMS | Region-pinned (per data-residency tenant policy); HSM keyring |
| Secrets | Secret Manager | Region-replicated; auto-rotation where supported |
| Events | Pub/Sub | Topics melmastoon.iam.*.v1, retention 7 d |
| Cold audit | BigQuery | Daily export from audit_events partitions ≥ 90 d old |
| Object storage | Cloud Storage | DSAR PDFs (short-lived, 30 d), tenant CA exports |
5. Region Topology
5.1 M0 (single region)
me-central1 (Doha)
├── iam-api (Cloud Run)
├── iam-worker (Cloud Run)
├── iam-jwks (Cloud Run + CDN)
├── Cloud SQL primary + 2 read replicas
├── Memorystore primary + replica
├── Cloud KMS keyring
├── Pub/Sub
└── Secret Manager
5.2 M2 (multi-region active-active)
         ┌─ Cloud DNS (geo-routed) ─┐
         │                          │
         ▼                          ▼
┌─────────────────┐        ┌─────────────────┐
│ me-central1     │        │ europe-west1    │
│ (primary)       │        │ (active-active) │
│                 │        │                 │
│ iam-api ▸▸▸▸▸▸▸ │◀──────▶│ iam-api         │
│ iam-worker      │        │ iam-worker      │
│ iam-jwks/CDN    │        │ iam-jwks/CDN    │
│ Cloud SQL HA    │ ▸pgsync│ Cloud SQL HA    │
│ Memorystore HA  │  TODO  │ Memorystore HA  │
│ KMS keyring     │ ── ── ─│ KMS keyring     │
│ Pub/Sub         │ X-rep  │ Pub/Sub         │
└─────────────────┘        └─────────────────┘
Cross-region DB topology TBD (logical replication w/ tenant-scoped routing OR per-tenant pinning). Decision lives in architecture/ADR-multi-region.md (M2 milestone).
6. Caching Strategy
| Layer | What | TTL | Invalidation |
|---|---|---|---|
| CDN (Cloud CDN) | /.well-known/jwks.json | 5 min | manual purge on key rotation |
| CDN | OIDC discovery doc | 1 h | rare changes |
| Memorystore | Session lookup | 24 h | on revoke (key delete) |
| Memorystore | Rate-limit counters | rolling 5 min | natural decay |
| Memorystore | Adaptive-MFA risk score (idempotent) | 60 s | natural |
| Memorystore | Magic-link tokens | 10 min | on consume |
| Memorystore | API-key denylist post-revoke | 60 min | TTL |
| In-memory (per pod) | Tenant CA chain | 1 h, jittered | rare changes |
| In-memory (per pod) | OIDC IdP metadata | 1 h, jittered | refresh on parse error |
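
The "1 h, jittered" TTL on the per-pod caches keeps instances from refreshing in lockstep. A minimal sketch of such a cache follows. The class name, the ±10 % jitter fraction, and the injectable clock are assumptions made for illustration; the source only states the 1 h base TTL and that it is jittered.

```python
import random
import time
from typing import Any, Callable, Dict, Optional, Tuple


class JitteredTTLCache:
    """Per-process cache whose entries expire after ttl * (1 +/- jitter)."""

    def __init__(self, ttl_seconds: float = 3600.0, jitter_fraction: float = 0.1,
                 clock: Callable[[], float] = time.monotonic,
                 rng: Optional[random.Random] = None) -> None:
        self._ttl = ttl_seconds
        self._jitter = jitter_fraction
        self._clock = clock
        self._rng = rng or random.Random()
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling `loader` on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = loader()
        factor = 1.0 + self._rng.uniform(-self._jitter, self._jitter)
        self._entries[key] = (now + self._ttl * factor, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, for example after a metadata parse error."""
        self._entries.pop(key, None)
```

For the OIDC IdP metadata row, "refresh on parse error" would map to calling invalidate followed by get.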
7. Edge Rules
| Layer | Rule |
|---|---|
| Cloud Armor (WAF) | Rate limit /auth/login 10/min/IP; /auth/password/reset/request 3/h/email; /auth/refresh 60/min/family |
| Cloud Armor | Block known TOR exit nodes for /auth/login unless tenant explicitly opts in |
| Cloud Armor | OWASP CRS preconfigured rules + custom credential-stuffing patterns |
| Cloud Armor | Geo block: per-tenant allowlist; OFAC denylist always |
| API Gateway | mTLS for /internal/*; reject otherwise |
| Cloud Load Balancer | TLS 1.3; HSTS preload; HTTP/2; QUIC opt-in |
| Cloud CDN | Caches only /.well-known/jwks.json and OIDC discovery |
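
To make the per-key limits in the Cloud Armor row concrete (for example 10/min/IP on /auth/login), here is a small sliding-window limiter. It only illustrates the intended semantics; it is an assumption that a sliding window is the right model, and it is not how Cloud Armor implements rate limiting internally. The class name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop hits that have aged out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The key would be the client IP for /auth/login, the email for /auth/password/reset/request, and the token family for /auth/refresh, matching the three limits in the table.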
8. Service Mesh
| Aspect | Choice |
|---|---|
| Mesh | Anthos Service Mesh (Istio-managed) |
| Identity | SPIFFE: spiffe://melmastoon/prod/iam-service |
| mTLS | STRICT (mesh-internal only) |
| Authorization | AuthorizationPolicy: only tenant-service, audit-service, gdpr-service, notification-service, ai-orchestrator-service, api-gateway may call iam |
| Egress | Restricted: *.googleapis.com, configured IdPs, *.haveibeenpwned.com (HIBP) |
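
The egress row combines wildcard patterns with exact IdP hostnames. The sketch below shows one way to evaluate such an allowlist, assuming that a leading "*." matches any subdomain but not the bare domain. The function name and the example IdP host are hypothetical; the mesh enforces the real policy.

```python
from typing import Iterable


def egress_allowed(host: str, patterns: Iterable[str]) -> bool:
    """Return True if `host` matches an exact entry or a '*.domain' wildcard."""
    host = host.lower().rstrip(".")
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # ".googleapis.com"
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


# Wildcards come from the table; the IdP host is a made-up placeholder.
ALLOWLIST = ["*.googleapis.com", "*.haveibeenpwned.com", "login.idp.example"]
```

Matching on the dotted suffix rather than a plain substring avoids accepting look-alike hosts such as evilgoogleapis.com.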
9. Release Strategy
| Phase | Mechanism |
|---|---|
| Build | Cloud Build, signed image to Artifact Registry; SBOM + provenance attestation |
| Deploy | Cloud Deploy pipeline: dev → staging → prod-canary (5 %) → prod-full |
| Canary criteria | Error rate Δ < 0.5 %, latency p99 Δ < 10 %, login success rate Δ < 0.1 %, no critical alerts in 30 min |
| Promotion | Auto if all criteria green; manual otherwise |
| Rollback | One-command revert to previous Cloud Run revision (≤ 2 min); preserves env + traffic split |
| Schema migrations | Forward-compatible only (per MIGRATION_PLAN); separate Cloud Build job; gated on ALL-PASS migration tests |
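
The canary criteria can be expressed as a gate that returns the list of failed checks, where an empty list means auto-promotion. This is a sketch under stated assumptions: error-rate and login-success deltas are read as percentage points, the p99 latency delta as a relative change against baseline, and all names are hypothetical. The source does not specify which interpretation the pipeline uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    error_rate_pct: float      # e.g. 0.2 means 0.2 %
    latency_p99_ms: float
    login_success_pct: float   # e.g. 99.5 means 99.5 %


def canary_failures(baseline: Metrics, canary: Metrics, critical_alerts_30m: int) -> List[str]:
    """Return the names of failed criteria; an empty list means promote."""
    failures: List[str] = []
    if canary.error_rate_pct - baseline.error_rate_pct >= 0.5:
        failures.append("error_rate")
    if baseline.latency_p99_ms > 0:
        rel = (canary.latency_p99_ms - baseline.latency_p99_ms) / baseline.latency_p99_ms
        if rel >= 0.10:
            failures.append("latency_p99")
    if baseline.login_success_pct - canary.login_success_pct >= 0.1:
        failures.append("login_success")
    if critical_alerts_30m > 0:
        failures.append("critical_alerts")
    return failures
```

Returning the failed criteria rather than a bare boolean gives the manual-promotion path something concrete to review.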
10. Disaster Recovery
| Event | RPO | RTO | Mechanism |
|---|---|---|---|
| Pod failure | 0 | < 1 min | Cloud Run auto-restart |
| Zone failure | 0 | < 5 min | regional Cloud SQL + Cloud Run |
| Region failure (M0) | ≤ 5 min | ≤ 4 h | Restore from cross-region backup; DNS failover |
| Region failure (M2) | 0 | < 10 min | Active-active failover |
| Data corruption | ≤ 5 min | ≤ 1 h | PITR to pre-corruption point |
| KMS regional outage | n/a | ≤ 30 min | Cross-region KMS replica + DR runbook |
DR drill cadence: quarterly. Result tracked in runbooks/iam/dr-drill-log.md.
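
A drill result can be checked mechanically against the RPO/RTO table. The sketch below encodes the table as data; the event keys and function name are hypothetical, and "<" versus "≤" on RTO is preserved through an explicit flag.

```python
from datetime import timedelta
from typing import Dict, List, NamedTuple, Optional


class Target(NamedTuple):
    rpo: Optional[timedelta]  # None where the table says n/a
    rto: timedelta
    rto_inclusive: bool       # True for "≤", False for "<"


TARGETS: Dict[str, Target] = {
    "pod_failure": Target(timedelta(0), timedelta(minutes=1), False),
    "zone_failure": Target(timedelta(0), timedelta(minutes=5), False),
    "region_failure_m0": Target(timedelta(minutes=5), timedelta(hours=4), True),
    "region_failure_m2": Target(timedelta(0), timedelta(minutes=10), False),
    "data_corruption": Target(timedelta(minutes=5), timedelta(hours=1), True),
    "kms_regional_outage": Target(None, timedelta(minutes=30), True),
}


def drill_misses(event: str, observed_rpo: timedelta, observed_rto: timedelta) -> List[str]:
    """Return which objectives ("RPO", "RTO") the drill missed for this event."""
    target = TARGETS[event]
    misses: List[str] = []
    if target.rpo is not None and observed_rpo > target.rpo:
        misses.append("RPO")
    within_rto = (observed_rto <= target.rto) if target.rto_inclusive else (observed_rto < target.rto)
    if not within_rto:
        misses.append("RTO")
    return misses
```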
11. Secret Rotation Cadence
| Secret | Cadence | Mechanism |
|---|---|---|
| JWT signing key (kid) | Monthly | KMS rotation alias; 2-day overlap |
| Tenant CA | Annual | KMS; cert overlap; gradual reissue |
| OIDC client secrets | Quarterly or per IdP cadence | Secret Manager + rolling deploy |
| SAML signing key | Annual | KMS |
| HIBP API key | Per provider | Secret Manager |
| SMTP creds | Annual | Secret Manager |
| Tenant fingerprint HMAC secret | Annual | KMS DEK |
| API-key HMAC pepper | Annual | KMS DEK |
| Cloud SQL CMEK | Annual | KMS rotation; transparent |
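
For the JWT signing key, the 2-day overlap implies that the JWKS briefly carries two keys after each rotation. The sketch below assumes the overlap means "keep publishing the previous key for 2 days after the new key activates"; that reading, the function name, and the example kid values are assumptions, not confirmed by the source.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

OVERLAP = timedelta(days=2)  # from the table: 2-day overlap


def published_kids(keys: List[Tuple[str, datetime]], now: datetime) -> List[str]:
    """Given (kid, activated_at) pairs, return the kids to publish, signing key first."""
    active = sorted((k for k in keys if k[1] <= now), key=lambda k: k[1], reverse=True)
    if not active:
        return []
    current_kid, current_activated = active[0]
    kids = [current_kid]
    # Keep the previous key verifiable until the overlap window closes.
    if len(active) > 1 and now - current_activated < OVERLAP:
        kids.append(active[1][0])
    return kids
```

Because the CDN caches /.well-known/jwks.json for 5 minutes, verifiers may see the old document slightly past the rotation instant, which is one reason an overlap window is needed at all.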
12. Observability Wiring
| Signal | Sink |
|---|---|
| Logs | Cloud Logging → Log Router → BigQuery (90 d analytical) + Coralogix mirror |
| Metrics | Cloud Monitoring + Prometheus scrape → Cloud Monitoring + SigNoz |
| Traces | OTel collector → Cloud Trace + SigNoz |
| Audit | Cloud Audit Logs (admin actions) + iam audit DB (auth events) |
| Alerts | Cloud Monitoring → PagerDuty #iam-oncall + Slack #oncall-iam |
13. Network
| Component | Subnet |
|---|---|
| iam-api | prod-services-me-central1 (private) |
| iam-worker | same |
| iam-jwks | same |
| Cloud SQL | prod-data-me-central1 (private; private IP only) |
| Memorystore | prod-data-me-central1 |
| Egress to internet | Cloud NAT (single egress IP per region for IdP allowlists) |
14. Compliance & Sovereignty
| Region | Tenant residency | Data scope |
|---|---|---|
| me-central1 | Default for new tenants in MENA | All iam data + crypto |
| europe-west1 | EU tenants opt-in (GDPR data residency) | All iam data + crypto |
| us-central1 | M3, on demand | All iam data + crypto |
Tenant residency recorded in tenant.created.v1; iam writes only to the tenant's residency region. Cross-region read for ops requires elevated approval + audit event.
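
The residency rule above amounts to a small access check: same-region access is allowed, cross-region writes never are, and cross-region reads need elevated approval plus an audit event. A minimal sketch, in which the function name, the audit payload shape, and the `emit_audit` callback are all hypothetical:

```python
from typing import Callable, Dict, Optional


def access_allowed(tenant_region: str, serving_region: str, operation: str,
                   elevated_approval: bool = False,
                   emit_audit: Optional[Callable[[Dict[str, str]], None]] = None) -> bool:
    """Decide whether `serving_region` may touch data that resides in `tenant_region`."""
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    if tenant_region == serving_region:
        return True
    if operation == "read" and elevated_approval:
        if emit_audit is None:
            return False  # no audit sink available, so refuse the cross-region read
        emit_audit({"type": "cross_region_read",
                    "serving_region": serving_region,
                    "tenant_region": tenant_region})
        return True
    return False  # cross-region writes are never allowed
```

Refusing the read when no audit sink is wired up keeps "approval + audit event" as a joint requirement rather than best effort.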
15. Cost Posture
| Lever | Optimization |
|---|---|
| iam-jwks separate | Cheap (no DB), CDN-fronted; offloads 70 %+ of read traffic |
| Min instances | 2 (warm) only; rest scales to zero (jwks) or low |
| Cloud SQL | Right-sized; auto-storage-grow off (manual) |
| KMS | Sign rate is bounded; bursts cached at API layer |
| Pub/Sub | 7-d retention only; longer-term in BigQuery |
| AI calls | Cached for 60 s by (userId, ipMasked) to reduce orchestrator load |
Monthly per-tenant cost dashboard in dashboards/cost/iam.json; outliers (P95 cost) reviewed weekly.
16. Versioning & Rollout Discipline
- API: /api/v1/*. Adding endpoints / fields is non-breaking. Removing or changing requires /api/v2/* with Sunset header on v1.
- Events: melmastoon.iam.<entity>.<verb>.vN. New version added side-by-side; consumers migrate; old version deprecated per MIGRATION_PLAN §7.
- Database: forward-compatible migrations only.
- Helm/Cloud Run config in GitOps (infra/iam/).
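
The event naming convention lends itself to a small parser, which is handy when validating topic names or computing the side-by-side successor version. The sketch below assumes entity and verb are lowercase alphanumerics with optional hyphens or underscores; that character set, the helper names, and the example topic melmastoon.iam.user.created.v1 are illustrative only.

```python
import re
from typing import NamedTuple, Optional

_TOPIC_RE = re.compile(
    r"^melmastoon\.iam\.(?P<entity>[a-z][a-z0-9_-]*)\.(?P<verb>[a-z][a-z0-9_-]*)\.v(?P<version>[1-9]\d*)$"
)


class Topic(NamedTuple):
    entity: str
    verb: str
    version: int


def parse_topic(name: str) -> Optional[Topic]:
    """Parse melmastoon.iam.<entity>.<verb>.vN, or return None if it does not match."""
    match = _TOPIC_RE.match(name)
    if match is None:
        return None
    return Topic(match["entity"], match["verb"], int(match["version"]))


def next_version(name: str) -> str:
    """Name of the topic that would be added side-by-side for the next version."""
    topic = parse_topic(name)
    if topic is None:
        raise ValueError(f"not an iam event topic: {name}")
    return f"melmastoon.iam.{topic.entity}.{topic.verb}.v{topic.version + 1}"
```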
17. ASCII Deployment Diagram (M0)
                      ┌────────────────┐
Internet ────────────▶│  Cloud Armor   │
                      │  (WAF + rate)  │
                      └───────┬────────┘
                              ▼
               ┌──────────────────────────────┐
               │  Global HTTPS Load Balancer  │
               │ (TLS term, HSTS, geo-route)  │
               └──────┬─────────────┬─────────┘
                 CDN-cached      others
                      │             │
           ┌──────────▼──┐    ┌─────▼──────────┐
           │  iam-jwks   │    │    iam-api     │
           │ (Cloud Run) │    │  (Cloud Run)   │
           └─────────────┘    └─┬──────────────┘
                                │ mTLS (mesh)
         ┌──────────┐   ┌───────▼──────┐   ┌──────────┐
         │ Cloud SQL│   │ Memorystore  │   │ Cloud KMS│
         │ (Postgres│   │ (Redis 7)    │   │ (HSM)    │
         │ HA)      │   │              │   │          │
         └──────────┘   └──────────────┘   └──────────┘
                                ▲
          ┌─────────────────────┴─────────────────────┐
          │          iam-worker (Cloud Run)           │
          │   outbox relay · inbox consumer · jobs    │
          └────────┬───────────────────────────┬──────┘
                   ▼                           ▼
           ┌──────────────┐            ┌──────────────┐
           │ Pub/Sub      │            │ External:    │
           │ topics + DLQ │            │ OIDC/SAML    │
           └──────────────┘            │ HIBP, SMTP   │
                                       └──────────────┘