
iam-service — Service Readiness


iam-service is T0 / platform-critical. The bar for "ready" is non-negotiable: any RED gate blocks merge to main and prevents promotion past staging.

1. Readiness Levels

| Level | Meaning | Required for |
|---|---|---|
| L0: Sketch | Bundle directories exist; SERVICE_OVERVIEW drafted | Architecture sign-off |
| L1: Designed | Domain model, API, events, data model, security model documented; OpenAPI/AsyncAPI lint clean | Story estimation |
| L2: Implemented | Code matches docs; unit + integration tests green; coverage thresholds met | Internal alpha |
| L3: Hardened | Security tests, load baseline, chaos scenarios, SLOs defined, runbooks complete | Closed beta |
| L4: Production-ready | All gates green, canary verified, on-call rotation active, DR drilled | Public M0 launch |
| L5: Multi-region | Active-active deployed, cross-region failover drilled | M2 launch |

Current target: L4 by M0 cutover.

2. Canonical Readiness Gates

Each gate has a state of GREEN / AMBER / RED tracked in infra/readiness/iam-service.yaml.
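
As an illustration of how the gate states feed the merge and promotion rules, here is a minimal Python sketch. It assumes the gate states have already been loaded into a flat name-to-state mapping; the actual schema of infra/readiness/iam-service.yaml is not shown in this document, so the mapping shape and the function names are hypothetical.

```python
from enum import Enum


class GateState(Enum):
    GREEN = "GREEN"
    AMBER = "AMBER"
    RED = "RED"


def blocking_gates(gates: dict[str, str]) -> list[str]:
    """Return the names of all RED gates, sorted for stable output."""
    return sorted(
        name for name, state in gates.items() if GateState(state) is GateState.RED
    )


def may_merge(gates: dict[str, str]) -> bool:
    """Any RED gate blocks merge to main (and promotion past staging)."""
    return not blocking_gates(gates)


def may_promote_to_prod(gates: dict[str, str]) -> bool:
    """Promotion to prod requires every gate to be GREEN."""
    return all(GateState(state) is GateState.GREEN for state in gates.values())


# Hypothetical snapshot of gate states, for illustration only.
example = {
    "api.openapi_present": "GREEN",
    "data.rls_enabled": "AMBER",
    "sec.secret_scan": "RED",
}
```

With this snapshot, `blocking_gates(example)` lists only `sec.secret_scan`. An AMBER gate does not block a merge in this sketch, but it still prevents promotion to prod, because the release checklist requires all gates GREEN.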

2.1 Domain & Documentation

| Gate | Criteria |
|---|---|
| docs.bundle_complete | All 17 deep-bundle docs exist + summary doc + linked from 03-microservices |
| docs.cross_links_valid | No broken markdown links (CI: markdown-link-check) |
| docs.adr_aligned | All ADRs touching iam are linked from SERVICE_OVERVIEW |
| domain.aggregates_defined | All seven aggregates have invariants + domain events documented |
| domain.ubiquitous_language | Glossary exists in DOMAIN_MODEL |

2.2 API

| Gate | Criteria |
|---|---|
| api.openapi_present | openapi/iam-service.yaml parses, lints clean (Spectral) |
| api.error_envelope | Every endpoint returns application/problem+json per ERROR_CODES |
| api.error_codes_registered | Every MELMASTOON.IAM.* code is in the registry |
| api.versioning | /api/v1/* prefix; Sunset plan for breaking changes |
| api.idempotency | All POST mutations accept Idempotency-Key |
| api.tracing | traceparent propagated end-to-end |
| api.contract_tests | Pact provider verification green |
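
To make the api.idempotency gate concrete, the following Python sketch shows one common way an Idempotency-Key can be honoured: the first request with a given key runs the handler, and a replay with the same key returns the stored response without running it again. This is an assumption about the intended behaviour, not the service's implementation. The `IdempotentExecutor` class is hypothetical, and the in-memory dict stands in for a durable store.

```python
from typing import Callable, Dict, Tuple

# (HTTP status code, JSON body)
Response = Tuple[int, dict]


class IdempotentExecutor:
    """Runs a mutation at most once per Idempotency-Key (in-memory sketch)."""

    def __init__(self) -> None:
        self._responses: Dict[str, Response] = {}

    def execute(self, idempotency_key: str, handler: Callable[[], Response]) -> Response:
        # Replay: return the response recorded for the first request.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        # First time this key is seen: run the mutation and remember the result.
        response = handler()
        self._responses[idempotency_key] = response
        return response
```

A production version would also need persistence, key expiry, and a check that a replayed request carries the same payload as the original. None of that is specified here.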

2.3 Events

| Gate | Criteria |
|---|---|
| events.asyncapi_present | asyncapi/iam-service.yaml parses, lints clean |
| events.naming | All subjects use `melmastoon.iam.<entity>.<verb>.vN` |
| events.envelope | All events use the canonical envelope (id, time, source, subject, type, datacontenttype, data) |
| events.outbox | Transactional outbox in same Postgres tx as domain row |
| events.idempotent_consumers | Inbox dedup table + eventId unique constraint |
| events.schema_registry | Schemas published to registry; CI checks compatibility |
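
The events.outbox and events.idempotent_consumers gates describe two complementary patterns. The sketch below illustrates both, using Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere. The table names, columns, and the `registered` verb in the subject are hypothetical, chosen only to match the naming rule above.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a domain table, an outbox, and an inbox."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE devices (id TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY,
                             subject  TEXT NOT NULL,
                             payload  TEXT NOT NULL);
        CREATE TABLE inbox  (event_id TEXT PRIMARY KEY);  -- unique constraint = dedup
        """
    )
    return conn


def register_device(conn: sqlite3.Connection, device_id: str, name: str) -> str:
    """Write the domain row and its outbox event in one transaction."""
    event_id = str(uuid.uuid4())
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute("INSERT INTO devices (id, name) VALUES (?, ?)", (device_id, name))
        conn.execute(
            "INSERT INTO outbox (event_id, subject, payload) VALUES (?, ?, ?)",
            (event_id, "melmastoon.iam.device.registered.v1",
             json.dumps({"deviceId": device_id})),
        )
    return event_id


def consume_once(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True if the event is new, False if it was already processed."""
    try:
        with conn:
            conn.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the primary key rejected it
    return True
```

Because the domain row and the outbox row share a transaction, a relay can publish from the outbox without risking an event for a change that never committed. On the consumer side, the primary key on `inbox.event_id` turns a redelivery into a no-op.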

2.4 Data

| Gate | Criteria |
|---|---|
| data.migrations_forward_only | All migrations are forward-compatible per MIGRATION_PLAN |
| data.rls_enabled | RLS on every tenant-scoped table; CI assertion |
| data.indexes_documented | Every index in DATA_MODEL exists in DB and vice-versa (drift check) |
| data.partitions_provisioned | audit_events monthly partitions exist for next 6 months |
| data.backups_verified | PITR test succeeded within last 30 d |
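
A check for data.partitions_provisioned could look roughly like the Python sketch below. It rests on two assumptions not stated in this document: that "next 6 months" means the six calendar months after the current one, and that partitions are named `audit_events_YYYY_MM`. Both are hypothetical and should be adjusted to the real convention.

```python
from datetime import date


def next_month_starts(today: date, months: int = 6) -> list[date]:
    """First day of each of the `months` calendar months after today's month."""
    starts = []
    year, month = today.year, today.month
    for _ in range(months):
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
        starts.append(date(year, month, 1))
    return starts


def missing_partitions(today: date, existing: set[str], months: int = 6) -> list[str]:
    """Names of required monthly partitions that are not in `existing`."""
    required = [f"audit_events_{d:%Y_%m}" for d in next_month_starts(today, months)]
    return [name for name in required if name not in existing]
```

For example, on 2025-10-15 the required partitions run from `audit_events_2025_11` to `audit_events_2026_04`. The gate would be GREEN only when `missing_partitions` returns an empty list.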

2.5 Sync

| Gate | Criteria |
|---|---|
| sync.minimal_surface | Only Device entity is in client sync per SYNC_CONTRACT |
| sync.signed_deltas | All deltas signed; client verifies Ed25519 signature |
| sync.offline_refresh_works | E2E test: 7-d offline grace works with device cert |
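
The 7-d offline grace in sync.offline_refresh_works amounts to a time-window check. The sketch below is one way to express it. It assumes the window is measured from the last successful online refresh and that the 7-day boundary itself is still inside the window. Neither detail is specified here, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_GRACE = timedelta(days=7)


def within_offline_grace(last_online_refresh: datetime, now: datetime) -> bool:
    """True while the device is still inside the 7-day offline grace window.

    Both arguments should be timezone-aware so the subtraction is unambiguous.
    A `now` earlier than the last refresh (clock skew) is treated as outside.
    """
    elapsed = now - last_online_refresh
    return timedelta(0) <= elapsed <= OFFLINE_GRACE
```

An end-to-end test would exercise the instants just before, exactly at, and just after the boundary, since off-by-one errors at the 7-day mark are the likely failure.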

2.6 AI

| Gate | Criteria |
|---|---|
| ai.through_orchestrator | iam never calls model endpoints directly; only via ai-orchestrator-service |
| ai.fallback_path | Rules-based fallback exists; covered by tests |
| ai.provenance | Every decision logged with runId and decisionId |
| ai.hitl_for_locks | AI-suggested locks > confidence threshold require admin confirmation |
| ai.bias_review | Quarterly review of false-positive rate by tenant geography |
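
The fallback, provenance, and human-in-the-loop gates can be read together as a single decision flow. The Python sketch below is one possible interpretation, not the service's logic. The 0.9 threshold, the failed-login rule, and all names are hypothetical, and a missing AI score (`None`) stands for the orchestrator being unavailable.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LockDecision:
    run_id: str          # provenance: which orchestrator run produced the input
    decision_id: str     # provenance: unique id for this decision
    source: str          # "ai" or "rules"
    lock_suggested: bool
    needs_admin_confirmation: bool


def decide_lock(
    ai_confidence: Optional[float],
    failed_logins: int,
    run_id: str,
    threshold: float = 0.9,        # hypothetical confidence threshold
    rules_max_failures: int = 10,  # hypothetical rules-based limit
) -> LockDecision:
    decision_id = str(uuid.uuid4())
    if ai_confidence is None:
        # Rules-based fallback: no AI score available for this run.
        suggested = failed_logins >= rules_max_failures
        return LockDecision(run_id, decision_id, "rules", suggested, False)
    # AI path: a lock suggested above the threshold is never applied
    # automatically; it waits for an admin to confirm it.
    suggested = ai_confidence > threshold
    return LockDecision(run_id, decision_id, "ai", suggested, suggested)
```

Every returned decision carries `run_id` and `decision_id`, so logging the dataclass as-is would satisfy the provenance requirement in this sketch.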

2.7 Observability

| Gate | Criteria |
|---|---|
| obs.logs | Structured logs; redaction enforced |
| obs.metrics | RED + auth-specific metrics exposed at /metrics |
| obs.traces | traceparent propagated; OTel exporter healthy |
| obs.dashboards | IAM SRE + IAM Security + Per-Tenant dashboards live |
| obs.slos | SLOs declared in error-budget service; burn alerts wired to PagerDuty |
| obs.runbooks | Every alert has a runbook; every failure mode has a runbook |
| obs.synthetic | Canary probes from ≥ 3 regions every 60 s |
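
Propagation tests for traceparent need to recognise a well-formed header. The sketch below parses version 00 of the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`, lowercase hex) and rejects the all-zero ids that the specification marks as invalid. The function and type names are hypothetical; in the service itself an OTel SDK propagator would normally handle this.

```python
import re
from typing import NamedTuple, Optional

# Version 00: 32 hex chars of trace-id, 16 of parent-id, 2 of trace-flags.
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid version 00."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are explicitly invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A propagation test could send a request with a known header and assert that the `trace_id` seen downstream is unchanged.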

2.8 Security

| Gate | Criteria |
|---|---|
| sec.threat_model | Threat model reviewed within last 6 months |
| sec.crypto | Argon2id, Ed25519 (KMS), TLS 1.3 in place |
| sec.audit_logging | All actions in SECURITY_MODEL §12 emit audit rows |
| sec.gdpr_participation | Erasure saga end-to-end test green |
| sec.pen_test | Annual pen test report on file; criticals closed |
| sec.dependency_scan | Trivy/Snyk + Semgrep clean (no high/critical) |
| sec.secret_scan | Gitleaks clean |
| sec.waf_rules | Cloud Armor rules deployed and tested |
| sec.access_review | KMS + Secret Manager IAM reviewed quarterly |

2.9 Performance

| Gate | Criteria |
|---|---|
| perf.load_baseline_recorded | Per TESTING_STRATEGY §9 |
| perf.no_regression | < 20 % regression vs previous release |
| perf.cold_start | < 0.5 % error rate in first 60 s post-deploy |
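
The two numeric performance thresholds reduce to simple arithmetic, sketched below in Python. The sketch assumes the compared metric is one where higher is worse (for example a latency in milliseconds). Which metrics the gate actually compares is defined in TESTING_STRATEGY, not here, and the function names are hypothetical.

```python
def regression_pct(baseline: float, current: float) -> float:
    """Percentage by which `current` exceeds `baseline` (negative = improvement)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def passes_no_regression(baseline: float, current: float, limit_pct: float = 20.0) -> bool:
    """perf.no_regression: strictly less than 20 % worse than the previous release."""
    return regression_pct(baseline, current) < limit_pct


def passes_cold_start(errors: int, requests: int, limit_pct: float = 0.5) -> bool:
    """perf.cold_start: error rate in the first 60 s strictly below 0.5 %."""
    if requests <= 0:
        raise ValueError("no requests observed in the window")
    # Cross-multiplied form of errors / requests * 100 < limit_pct,
    # which avoids floating-point noise at the boundary.
    return errors * 100 < limit_pct * requests
```

For instance, a p99 that moves from 100 ms to 119 ms passes, while a move to 120 ms fails, because the limit is a strict inequality.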

2.10 Operational

| Gate | Criteria |
|---|---|
| ops.on_call_rotation | PagerDuty rotation active; primary + secondary |
| ops.canary_verified | Last release canary passed all criteria |
| ops.dr_drill | DR drill within last 90 d; passed |
| ops.rollback_tested | Rollback path verified within last 30 d |
| ops.cost_dashboard | Per-tenant cost dashboard exists; weekly review |
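
Several gates are recency checks ("within last 90 d", "within last 30 d"). A small Python sketch of how such freshness windows might be evaluated follows. The mapping reuses gate names and windows from the tables above. Treating a check that is exactly at the limit as still fresh is an assumption, as is the shape of the `last_passed` input.

```python
from datetime import date, timedelta

# Maximum age in days for each time-boxed gate, taken from the gate tables.
MAX_AGE_DAYS = {
    "ops.dr_drill": 90,
    "ops.rollback_tested": 30,
    "data.backups_verified": 30,
}


def stale_gates(last_passed: dict[str, date], today: date) -> list[str]:
    """Gates whose last successful run is missing or older than its window."""
    stale = []
    for gate, max_age in MAX_AGE_DAYS.items():
        last = last_passed.get(gate)
        if last is None or (today - last) > timedelta(days=max_age):
            stale.append(gate)
    return stale
```

A gate with no recorded run counts as stale here, which keeps a never-executed drill from silently passing.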

3. Gate Status (M0 target)

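| Gate Group | Target | Owner |
|---|---|---|
| Documentation | GREEN | Architecture |
| API | GREEN | iam-team |
| Events | GREEN | iam-team |
| Data | GREEN | iam-team |
| Sync | GREEN | iam-team |
| AI | GREEN | iam-team + AI platform |
| Observability | GREEN | SRE |
| Security | GREEN | Security |
| Performance | GREEN | iam-team |
| Operational | GREEN | SRE |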

CI publishes gate status badges in the service README.

4. SLOs (committed)

| SLO | Target | Window |
|---|---|---|
| Auth availability | 99.99 % | 30 d |
| JWKS availability | 99.999 % | 30 d |
| Login latency p99 | < 800 ms | 30 d |
| Refresh latency p95 | < 100 ms | 30 d |
| MFA challenge success | > 99.5 % | 7 d |
| SSO callback success | > 99.5 % | 7 d |
| Outbox publish lag p95 | < 5 s | 30 d |
| Auth error rate (5xx) | < 0.1 % | 7 d |

Error budget burn alerts wired per OBSERVABILITY §6.
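
To show what the availability targets mean in practice, the Python sketch below converts an SLO into an error budget and computes a burn rate. The burn-rate definition used (observed error ratio divided by the budgeted error ratio) is the common SRE one and is an assumption here; the thresholds actually wired to alerts are defined in OBSERVABILITY §6. Function names are hypothetical.

```python
def allowed_downtime_seconds(slo_pct: float, window_days: int) -> float:
    """Total unavailability the SLO tolerates over the window, in seconds."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - slo_pct) / 100.0


def burn_rate(observed_error_ratio: float, slo_pct: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    budget_ratio = (100.0 - slo_pct) / 100.0
    return observed_error_ratio / budget_ratio
```

Under this arithmetic, 99.99 % over 30 d leaves roughly 259 seconds (about 4.3 minutes) of budget, and 99.999 % over 30 d leaves roughly 26 seconds. An observed error ratio of 0.1 % against the 99.99 % target is a burn rate of about 10, meaning the 30-day budget would be exhausted in about 3 days.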

5. Definition of Done — Per Story

An iam story merges only when all of the following hold:

  • Domain model updated if aggregates changed
  • OpenAPI updated; clients regenerated
  • AsyncAPI updated; events published to schema registry
  • DB migration is forward-compatible; RLS preserved
  • Unit + integration tests added/updated; coverage thresholds met
  • Negative-path catalog entry added for every new error code
  • Pact provider verification green
  • OWASP ZAP / Semgrep / Trivy clean
  • Logs / metrics / traces present for new code paths
  • Audit events emitted for new security-sensitive actions
  • Runbook updated if new failure mode possible
  • Documentation in this bundle updated (the PR description lists which files changed)
  • Reviewed by ≥ 1 iam team member, plus ≥ 1 security reviewer if the change is security-sensitive
  • Linear ticket linked, AC verified

6. Release Readiness Checklist

Before promoting to prod:

  • All gates GREEN
  • Last 7 d alert rate within budget
  • Last load test within 20 % of baseline
  • Last chaos run passed
  • DR drill within 90 d
  • Migration plan reviewed with DBRE
  • Canary plan documented in PR
  • Rollback plan documented in PR
  • Communication ready (status page draft, customer comms if user-facing)
  • Sign-off: iam EM + SRE on-call + Security lead

7. Owner Sign-Off Matrix

| Aspect | Sign-off owner |
|---|---|
| Domain & API | iam team lead |
| Events & data | iam team lead + DBRE |
| Security | Security lead |
| Observability | SRE on-call |
| Performance | iam team lead |
| Compliance / GDPR | Compliance lead |
| Cost posture | Finance ops |

All sign-offs recorded in the release ticket; readiness automation refuses promotion otherwise.
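
The promotion rule can be pictured as a set difference between required and recorded sign-offs. The Python sketch below encodes the matrix from this section. How sign-offs are actually stored in the release ticket is not described here, so the `recorded` mapping and the function names are hypothetical.

```python
# Required sign-off owners per aspect, copied from the matrix in this section.
REQUIRED_SIGN_OFFS = {
    "Domain & API": {"iam team lead"},
    "Events & data": {"iam team lead", "DBRE"},
    "Security": {"Security lead"},
    "Observability": {"SRE on-call"},
    "Performance": {"iam team lead"},
    "Compliance / GDPR": {"Compliance lead"},
    "Cost posture": {"Finance ops"},
}


def missing_sign_offs(recorded: dict[str, set[str]]) -> dict[str, set[str]]:
    """Aspects that still lack one or more required owners, with who is missing."""
    missing = {}
    for aspect, owners in REQUIRED_SIGN_OFFS.items():
        outstanding = owners - recorded.get(aspect, set())
        if outstanding:
            missing[aspect] = outstanding
    return missing


def may_promote(recorded: dict[str, set[str]]) -> bool:
    """Promotion is refused unless every required sign-off is recorded."""
    return not missing_sign_offs(recorded)
```

Returning the outstanding owners, rather than a bare boolean, lets the automation say exactly who still needs to sign.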

8. M0 Launch Gate

For the first production tenant onboarding:

| Gate | Status |
|---|---|
| All 17 docs complete | required |
| All canonical gates GREEN | required |
| Last DR drill passed | required |
| Pen test report on file | required |
| Tenant CA bootstrap automation tested | required |
| Offline-cert renewal tested at 7-d boundary | required |
| Tenant offboarding (tenant.deleted) end-to-end tested | required |
| GDPR erasure saga end-to-end tested | required |
| Migration plan reviewed (no legacy data scenario applies) | required |

9. Continuous Readiness

  • Weekly: gate status reviewed in iam team standup; any AMBER/RED → action item.
  • Monthly: SLO review; error-budget posture; incident review.
  • Quarterly: threat model review; access reviews; bias/fairness review for AI-driven decisions.
  • Annually: pen test; DR drill formal report; CA rotation.

Readiness is not a milestone — it's a steady state.