iam-service — Migration Plan
iam-service launches greenfield in M0 — there is no legacy data to migrate from prior systems for the first tenant. This plan instead covers (a) internal evolution (DB / API / events / crypto rolling forward without breaking running clients) and (b) inbound tenant migrations (M2+) where a hotel switches to Melmastoon from an existing PMS that has its own user accounts.
1. Backward-Compatibility Principles
| Principle | Application |
|---|---|
| Never break a v1 event consumer | Add fields = OK; remove / rename = new vN+1 side-by-side |
| Never break a /api/v1 consumer | Same rule; deprecate via Sunset header before removing |
| DB migrations are forward-compatible | Two-phase: write to old + new, backfill, switch reads, remove old |
| Crypto upgrades are user-transparent | Rehash on next login; never lock users out |
| Identity is sticky | userId and tenantId never change post-creation |
All migrations are gated by SERVICE_READINESS and reviewed by DBRE.
2. Schema Evolution
2.1 DB
| Phase | Action |
|---|---|
| 1. Add | New column / table with NULL-allowed; deploy code that writes to it |
| 2. Backfill | Background job populates historic rows (idempotent, batched) |
| 3. Read | Deploy code that reads new column; old code still tolerates absence |
| 4. Constrain | Add NOT NULL / UNIQUE constraints once backfill verified |
| 5. Remove old | Drop legacy column / table after ≥ 1 release with no reads |
Each phase is its own migration file; never combined. Migration tool: node-pg-migrate. Migrations are idempotent (IF NOT EXISTS patterns).
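The backfill phase (step 2) can be sketched as a keyset-paginated loop that only selects rows whose new column is still NULL, which is what makes a re-run harmless. This is a minimal sketch, not the service's actual job: the `Row` shape, the `emailNormalized` column, and the `RowStore` interface are all hypothetical stand-ins so the loop can run without a database.

```typescript
// Hypothetical row: `emailNormalized` stands in for the newly added nullable column.
interface Row {
  id: number;
  email: string;
  emailNormalized: string | null;
}

// Minimal storage port so the sketch runs without a database (an assumption,
// not the service's real data-access layer).
interface RowStore {
  // Up to `limit` rows with id > afterId whose new column is still NULL, ordered by id.
  fetchPending(afterId: number, limit: number): Row[];
  setNormalized(id: number, value: string): void;
}

// Walks the table in batches. Only NULL rows are selected, so running the job
// again after a crash or after completion updates nothing twice.
function backfill(store: RowStore, batchSize: number): number {
  let cursor = 0;
  let updated = 0;
  for (;;) {
    const batch = store.fetchPending(cursor, batchSize);
    if (batch.length === 0) break;
    for (const row of batch) {
      store.setNormalized(row.id, row.email.trim().toLowerCase());
      cursor = row.id;
      updated += 1;
    }
  }
  return updated;
}

// In-memory stand-in for the table, used only to exercise the loop.
function memoryStore(rows: Row[]): RowStore {
  return {
    fetchPending: (afterId, limit) =>
      rows
        .filter((r) => r.id > afterId && r.emailNormalized === null)
        .sort((a, b) => a.id - b.id)
        .slice(0, limit),
    setNormalized: (id, value) => {
      const row = rows.find((r) => r.id === id);
      if (row) row.emailNormalized = value;
    },
  };
}
```

Because the selection predicate is "new column IS NULL", the job needs no separate progress table to be resumable; the data itself records what is left to do.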
2.2 RLS
When introducing tenant-scoping to a previously platform-wide table, run a four-step migration:
- Add `tenant_id` column, nullable.
- Backfill `tenant_id` (or leave NULL for platform-only rows).
- Enable RLS with `(tenant_id = current_setting('app.tenant_id')::uuid OR tenant_id IS NULL)`.
- Tighten policy to drop the OR clause once all rows have `tenant_id`.
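As an illustration, the four steps could translate into the SQL below, generated here by a small helper so each statement can go into its own migration file. The policy name `tenant_isolation` and the helper itself are hypothetical; the backfill statement is left as a placeholder because it depends on how each table's rows map to tenants.

```typescript
// Returns one SQL string per phase of the RLS rollout.
// The policy name `tenant_isolation` is an assumption made for this sketch.
function rlsMigrationSteps(table: string): string[] {
  // Guard against interpolating anything but a plain identifier.
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  const scoped = "tenant_id = current_setting('app.tenant_id')::uuid";
  return [
    // Step 1: nullable column, safe to add while old code is running.
    `ALTER TABLE ${table} ADD COLUMN IF NOT EXISTS tenant_id uuid NULL;`,
    // Step 2: placeholder; the real backfill is a data-specific batched job.
    `-- backfill tenant_id for ${table} (batched UPDATE, table-specific)`,
    // Step 3: permissive policy that still admits platform-only (NULL) rows.
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;\n` +
      `CREATE POLICY tenant_isolation ON ${table} USING (${scoped} OR tenant_id IS NULL);`,
    // Step 4: tighten the same policy once no NULL rows remain.
    `ALTER POLICY tenant_isolation ON ${table} USING (${scoped});`,
  ];
}
```

Tightening with `ALTER POLICY` rather than drop-and-recreate avoids a window in which the table has RLS enabled but no policy at all.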
2.3 Indexes
Indexes are created with CREATE INDEX CONCURRENTLY to avoid blocking writes; verified on read replica before primary.
2.4 Partition Maintenance
audit_events is monthly-partitioned. A cron job creates partitions for the next 6 months ahead of time; old partitions are detached and archived to BigQuery at T-90d.
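A sketch of what the partition-creation cron might compute on each run. The partition naming scheme (`audit_events_YYYY_MM`) and the assumption that the table is range-partitioned on a timestamp column are illustrative, not taken from DATA_MODEL.

```typescript
interface PartitionSpec {
  name: string; // hypothetical naming scheme: audit_events_YYYY_MM
  from: string; // inclusive lower bound, YYYY-MM-DD (UTC)
  to: string;   // exclusive upper bound, YYYY-MM-DD (UTC)
  ddl: string;
}

// Lists the monthly partitions for the `monthsAhead` months after `now`.
// Date.UTC normalizes month overflow, so December + 1 rolls into January.
function upcomingPartitions(now: Date, monthsAhead: number): PartitionSpec[] {
  const specs: PartitionSpec[] = [];
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  for (let i = 1; i <= monthsAhead; i++) {
    const from = new Date(Date.UTC(year, month + i, 1)).toISOString().slice(0, 10);
    const to = new Date(Date.UTC(year, month + i + 1, 1)).toISOString().slice(0, 10);
    const name = `audit_events_${from.slice(0, 4)}_${from.slice(5, 7)}`;
    specs.push({
      name,
      from,
      to,
      // IF NOT EXISTS keeps the cron idempotent across overlapping runs.
      ddl:
        `CREATE TABLE IF NOT EXISTS ${name} PARTITION OF audit_events ` +
        `FOR VALUES FROM ('${from}') TO ('${to}');`,
    });
  }
  return specs;
}
```

Running this daily with `monthsAhead = 6` means each run normally creates at most one new partition and leaves the rest untouched.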
3. API Versioning
| Change | Treatment |
|---|---|
| Add new endpoint | Non-breaking; ship in v1 |
| Add optional field to request | Non-breaking |
| Add field to response | Non-breaking; consumers must tolerate unknown fields (per 30-api rule) |
| Make field required | Breaking → new v2 endpoint; Sunset header on v1 |
| Rename / remove field | Breaking → new v2 |
| Change validation rules to be stricter | Breaking → new v2; old v1 keeps lenient validation |
| Change error code mapping | Breaking → new v2 |
Deprecation timeline: minimum 6 months from Sunset announcement to removal; major changes ≥ 12 months.
Deprecated responses carry three headers: `Sunset: Thu, 31 Dec 2026 23:59:59 GMT`, `Deprecation: true`, and `Link: <https://docs.melmastoon.cloud/iam/migration/v2>; rel="deprecation"`.
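A minimal sketch of building those headers for a deprecated v1 route. The helper name is hypothetical; `Date.prototype.toUTCString()` produces the HTTP-date format the Sunset header expects, which avoids hand-typing the weekday.

```typescript
// Builds the deprecation header set for a response from a sunsetting endpoint.
// `docsUrl` is the migration guide that replaces the deprecated route.
function deprecationHeaders(sunsetAt: Date, docsUrl: string): Record<string, string> {
  return {
    Sunset: sunsetAt.toUTCString(),
    Deprecation: "true",
    Link: `<${docsUrl}>; rel="deprecation"`,
  };
}

// Example: a v1 endpoint that will be removed at the end of 2026.
const v1Headers = deprecationHeaders(
  new Date(Date.UTC(2026, 11, 31, 23, 59, 59)),
  "https://docs.melmastoon.cloud/iam/migration/v2",
);
```

In practice this would sit in a middleware applied to every `/api/v1` route that has a v2 replacement, so the sunset date is defined in one place.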
4. JWT Claim Evolution
| Change | Allowed in v1? | Treatment |
|---|---|---|
| Add new optional claim | ✅ | Non-breaking; consumers must ignore unknown claims |
| Add new required claim (consumers must read) | ❌ | New JWT version negotiated via iss-suffix or ver claim |
| Change semantics of an existing claim | ❌ | Same as above |
| Remove a claim | ❌ | Same |
A ver claim (currently 1) anticipates future versioning. Consumers MUST verify ver.
Rotation does not change claim shape; only kid and signature.
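On the consumer side, the `ver` check might look like the sketch below. It is meant to run after signature verification; the function name and the supported-version set are assumptions for illustration.

```typescript
type JwtClaims = Record<string, unknown>;

// Versions this consumer knows how to interpret. Only 1 exists today.
const SUPPORTED_JWT_VERSIONS: ReadonlySet<number> = new Set([1]);

// Rejects tokens whose claim shape this consumer cannot interpret.
// Unknown extra claims are deliberately ignored, per the table above.
function assertSupportedVersion(claims: JwtClaims): number {
  const ver = claims["ver"];
  if (typeof ver !== "number" || !SUPPORTED_JWT_VERSIONS.has(ver)) {
    throw new Error(`unsupported JWT version: ${String(ver)}`);
  }
  return ver;
}
```

Failing closed on a missing or unknown `ver` is what lets a future version change claim semantics without old consumers silently misreading them.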
5. Event Versioning
Per platform standard (40-events):
| Action | Treatment |
|---|---|
| Add a field to event payload | Non-breaking; new field becomes part of vN; consumer must tolerate |
| Make existing field required | Breaking → publish vN+1 alongside vN |
| Change semantics | Breaking → vN+1 |
| Remove a field | Breaking → vN+1 |
Lifecycle:
publish vN ─► publish vN+1 alongside vN ─► consumers migrate ─► stop publishing vN ─► retire topic vN (after retention)
Both versions are published for ≥ 30 days; longer if a major consumer hasn't migrated.
Schema-registry CI gates compatibility (BACKWARD).
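The side-by-side phase of the lifecycle can be sketched as a publisher that always emits vN+1 and emits vN behind a switch. The v2 topic and its `primaryEmail` field are invented here purely to illustrate a breaking rename; the `Publisher` interface is a stand-in for the real Pub/Sub client.

```typescript
// Stand-in for the real message-bus client (an assumption for this sketch).
interface Publisher {
  publish(topic: string, payload: Record<string, unknown>): void;
}

interface UserRegistered {
  userId: string;
  tenantId: string;
  email: string;
}

// Hypothetical breaking change: v2 renames `email` to `primaryEmail`.
// While `stillPublishV1` is true, both versions go out for every event.
function publishUserRegistered(
  pub: Publisher,
  e: UserRegistered,
  stillPublishV1: boolean,
): void {
  pub.publish("iam.user.registered.v2", {
    userId: e.userId,
    tenantId: e.tenantId,
    primaryEmail: e.email,
  });
  if (stillPublishV1) {
    pub.publish("iam.user.registered.v1", {
      userId: e.userId,
      tenantId: e.tenantId,
      email: e.email,
    });
  }
}
```

Turning `stillPublishV1` off corresponds to the "stop publishing vN" step; keeping it a runtime flag makes the rollback in section 15 a config change rather than a deploy.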
6. Password Hash Migration
Argon2id parameters are reviewed annually against OWASP Password Storage Cheat Sheet. When parameters bump:
| Step | Action |
|---|---|
| 1 | Bump PASSWORD_HASH_VERSION in config; new column credentials.hash_version already records it |
| 2 | New registrations use new params |
| 3 | On every successful login, if hash_version < current → recompute hash, store, emit iam.password.rehashed.v1 |
| 4 | Background pwn_audit job optionally rehashes inactive accounts on monthly cycle (lower priority) |
| 5 | After ≥ 12 months, decide whether to force a reset for residual stale-hashed accounts (rare) |
Algorithm changes (e.g. argon2id → next-gen) follow the same flow with a wider migration window.
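Step 3 (rehash on successful login) could be sketched as follows. The `Hasher` interface is a hypothetical port; a real deployment would back it with an Argon2id library, and the `toyHasher` below exists only so the flow can be exercised and must not be used as a password hash.

```typescript
interface Credential {
  hash: string;
  hashVersion: number;
}

// Hypothetical port over the real password-hashing library.
interface Hasher {
  verify(stored: Credential, password: string): boolean;
  hash(password: string, version: number): string;
}

// Verifies with the stored version's parameters, then upgrades in place if stale.
// Returns the credential to persist, or null when the password is wrong.
function loginAndMaybeRehash(
  cred: Credential,
  password: string,
  hasher: Hasher,
  currentVersion: number,
  emit: (eventName: string) => void,
): Credential | null {
  if (!hasher.verify(cred, password)) return null;
  if (cred.hashVersion >= currentVersion) return cred;
  const upgraded: Credential = {
    hash: hasher.hash(password, currentVersion),
    hashVersion: currentVersion,
  };
  emit("iam.password.rehashed.v1");
  return upgraded;
}

// Toy stand-in, NOT a real KDF: it only tags the input with a version.
const toyHasher: Hasher = {
  hash: (password, version) => `v${version}:${password}`,
  verify: (stored, password) =>
    stored.hash === toyHasher.hash(password, stored.hashVersion),
};
```

The rehash is only possible at login because that is the one moment the plaintext password is available; this is why inactive accounts need the separate paths in steps 4 and 5.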
7. MFA Factor Migration
| Change | Treatment |
|---|---|
| New factor type added | Backward-compat; existing factors unaffected |
| TOTP algorithm change (e.g. SHA-1 → SHA-256) | New TOTP enrollments use new algo; existing remain valid until user re-enrolls; UI nudges re-enroll |
| WebAuthn algorithm bump | New credentials negotiate at registration; existing remain valid until expiry/replace |
| Recovery code regeneration | Always invalidates previous bundle; user must download anew |
Factor data is never silently mutated — a user must take action to upgrade an existing factor.
8. IdP Configuration Migration
When a tenant rotates its SSO IdP:
| Step | Action |
|---|---|
| 1 | Tenant adds new IdP via admin UI; both old + new active |
| 2 | iam supports per-tenant multiple IdPs in DATA_MODEL |
| 3 | Communications from tenant tell users to log in via new IdP |
| 4 | After migration window, tenant removes old IdP; iam refuses new logins via it but ExternalIdentity linkage retained for audit |
| 5 | Optional: re-link existing ExternalIdentity rows from old provider to new (admin tool, requires user-side reconfirmation) |
9. Offline Bundle / Device-Key Migration
| Change | Treatment |
|---|---|
| Bundle format bump | New bundle_version field; client tolerates missing fields; server can serve both for 30 d |
| Device key algorithm bump | New device bindings use new algo; existing certs remain valid until expiry; renewal upgrades |
| Tenant CA rotation | Always overlapped (≥ 7 d); new certs signed by new CA; existing certs valid until natural expiry; both CA chains in client trust store during overlap |
10. Tenant Onboarding (Greenfield — M0)
Default path. No legacy data.
| Step | Action |
|---|---|
| 1 | tenant-service emits melmastoon.tenant.created.v1 |
| 2 | iam consumes; provisions super-tenant-admin user with email-verification flow |
| 3 | Tenant admin completes verification; sets password + MFA |
| 4 | Tenant admin invites staff via tenant-service (which calls iam to create user records) |
| 5 | Optional: tenant admin configures SSO; iam stores IdP config |
| 6 | Optional: tenant admin downloads + signs CA bootstrap for offline desktops |
Time-to-first-user: < 5 min.
11. Tenant Migration from Existing PMS (M2+)
When a hotel switches to Melmastoon from an existing PMS:
11.1 Inputs
| Source | Format |
|---|---|
| User export | CSV with columns: external_id, email, display_name, role, mfa_enabled, last_login_at, created_at |
| Optional: hashed passwords | only if from compatible argon2id deployment; else users must reset on first login (recommended path) |
| Optional: SSO claim mappings | JSON document |
11.2 Importer flow (scripts/iam-import.ts)
CSV ──► validation ──► dry-run report ──► import (batched)
│
▼
┌──────────────┐
│ For each row │
├──────────────┤
│ - Create User (status=invited)
│ - Map external role → Melmastoon role (passed to tenant-service event)
│ - Insert ExternalIdentity row (provider="legacy:<pms>", subject=external_id)
│ - If hashed password supplied + compatible: insert Credential w/ hash_version=2 (force rehash on first login)
│ - If not: leave Credential absent; user receives magic-link invite
│ - Emit `iam.user.registered.v1` (with import_batch_id for traceability)
└──────────────┘
11.3 First-login UX
| Path A (no password supplied) | Path B (password supplied) |
|---|---|
| User receives invitation email with magic link | User logs in with same password |
| Sets password + MFA via standard flow | iam transparently rehashes to current params |
| Linked to existing ExternalIdentity for audit | If breach-list match → forced reset before continue |
11.4 Cutover
| Step | Action |
|---|---|
| 1 | Import dry-run; review counts + errors |
| 2 | Schedule cutover window |
| 3 | Old PMS frozen (read-only) |
| 4 | Run import (resumable, batched, 100 users/batch) |
| 5 | Send invitations |
| 6 | T+7d: track adoption metric (users who completed first login); poke laggards |
| 7 | T+30d: deactivate any never-logged-in invitee (tenant admin decision) |
Idempotent: re-running import skips users with matching ExternalIdentity{provider, subject}.
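The batching and skip-on-match behaviour might be sketched like this. It is not the contents of scripts/iam-import.ts: the `existing` set stands in for a lookup against ExternalIdentity rows, and the row shape is trimmed to the two fields the idempotency check needs.

```typescript
interface ImportRow {
  externalId: string;
  email: string;
}

interface ImportResult {
  created: number;
  skipped: number;
  batches: number;
}

// Splits rows into fixed-size batches (100 per the cutover table).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// `existing` holds "provider|subject" keys already imported; in the real
// importer this would be a query against ExternalIdentity (an assumption here).
function runImport(
  rows: ImportRow[],
  pms: string,
  existing: Set<string>,
  batchSize: number = 100,
): ImportResult {
  const provider = `legacy:${pms}`;
  const result: ImportResult = { created: 0, skipped: 0, batches: 0 };
  for (const batch of chunk(rows, batchSize)) {
    result.batches += 1;
    for (const row of batch) {
      const key = `${provider}|${row.externalId}`;
      if (existing.has(key)) {
        result.skipped += 1; // already imported in an earlier run
        continue;
      }
      existing.add(key); // stands in for creating User + ExternalIdentity
      result.created += 1;
    }
  }
  return result;
}
```

Keying on `{provider, subject}` rather than email means a resumed run stays correct even if a user's email was edited in Melmastoon between attempts.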
12. Multi-Region Migration (M2)
| Step | Action |
|---|---|
| 1 | Stand up second region (Cloud SQL primary, Memorystore, KMS, Cloud Run). |
| 2 | Configure logical replication (DB layer; tenant-router routes write per residency). |
| 3 | KMS keyring replicated; same kid aliases. |
| 4 | Pub/Sub topic + subscription replicated; consumer dedup based on eventId. |
| 5 | DNS geo-routes new tenants to nearest region. |
| 6 | Existing tenants stay pinned unless residency change requested. |
Tenant residency change procedure (rare): create export → quiesce writes for tenant → import to target region → switch routing → verify → un-quiesce. Documented in runbooks/iam/tenant-region-move.md.
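Step 4 of the table above relies on consumers deduplicating by eventId, since replicated topics can deliver the same event more than once. A minimal sketch, assuming an in-memory seen-set; a production consumer would more likely need a shared store with expiry, which is outside this example.

```typescript
// Wraps a handler so each eventId is processed at most once per process.
// Returns true when the event was handled, false when it was a duplicate.
function dedupingHandler<T extends { eventId: string }>(
  handler: (event: T) => void,
): (event: T) => boolean {
  const seen = new Set<string>();
  return (event: T): boolean => {
    if (seen.has(event.eventId)) return false;
    handler(event);
    // Mark as seen only after the handler returns, so a throwing handler
    // leaves the event eligible for redelivery.
    seen.add(event.eventId);
    return true;
  };
}
```

Recording the eventId after the handler succeeds trades a possible double-process on crash for never dropping an event, which suits handlers that are themselves idempotent.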
13. Crypto Algorithm Migration (Future)
If Ed25519 is ever superseded:
| Step | Action |
|---|---|
| 1 | Generate new keypair in KMS for new algorithm |
| 2 | Publish in JWKS alongside existing |
| 3 | New JWTs signed with new algorithm; new kid |
| 4 | Consumers must support both algorithms during transition |
| 5 | After ≥ 30 d, stop signing with old algorithm |
| 6 | Old kid retained in JWKS for token-lifetime + audit |
Same pattern for tenant CA; existing device certs remain valid until natural expiry.
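During the transition in step 4, a consumer has to pick the verification key by `kid` while accepting only the algorithms it has been configured for. The sketch below shows that selection step only (no signature math); the `NEXT-ALG` identifier is a placeholder for whatever algorithm eventually replaces Ed25519.

```typescript
interface Jwk {
  kid: string;
  alg: string;
  kty: string;
}

interface JwtHeader {
  kid: string;
  alg: string;
}

// Chooses the JWKS entry for a token. The accepted-algorithm list comes from
// consumer config, never from the token, to avoid algorithm-confusion attacks.
function selectVerificationKey(
  jwks: Jwk[],
  header: JwtHeader,
  acceptedAlgs: ReadonlySet<string>,
): Jwk {
  if (!acceptedAlgs.has(header.alg)) {
    throw new Error(`algorithm not accepted: ${header.alg}`);
  }
  const key = jwks.find((k) => k.kid === header.kid);
  if (!key) throw new Error(`unknown kid: ${header.kid}`);
  if (key.alg !== header.alg) {
    throw new Error("token alg does not match the key's alg");
  }
  return key;
}
```

Once step 5 is reached, removing the old algorithm from `acceptedAlgs` on each consumer is the enforcement point; the old `kid` can stay in the JWKS for audit without being usable for new tokens.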
14. Breaking-Change Changelog
Maintained in docs/migrations/iam-changelog.md. Each entry records: date, what changed, why, migration window, removal date, runbook link.
| Date | Change | Status |
|---|---|---|
| (pending M0 launch — no entries) | — | — |
15. Rollback Strategy
Every migration step has an explicit rollback:
| Phase | Rollback |
|---|---|
| Schema (add column / table) | DROP IF EXISTS; safe because phase 1 doesn't read |
| Schema (backfill) | Stop job; rows partial-NULL is benign |
| Schema (read switch) | Revert deploy; reads still tolerate absence |
| Schema (constraint tighten) | DROP CONSTRAINT |
| Schema (drop legacy) | Restore from PITR (last resort) |
| API (new version) | Keep old version live; revert traffic |
| Event (new version) | Keep old version publishing; revert consumer config |
| Crypto (param bump) | Old hash version still verifies; lower-priority rehash continues from old |
CI runs forward + rollback migrations in test before merge; rollback failures block.
16. Observability of Migrations
Every migration emits:
- `iam_migration_step_total{name, phase, result}` counter
- `iam_migration_backfill_progress{name}` gauge (0–100 %)
- Audit event `iam.migration.executed{name, phase, sha}` for traceability
Long-running backfills publish a progress record every 60 s; on-call dashboard shows in-flight migrations.
17. Cross-References
- Aggregate / claim shapes → DOMAIN_MODEL
- Endpoint shapes → API_CONTRACTS
- Event shapes → EVENT_SCHEMAS
- DB schema → DATA_MODEL
- Crypto details → SECURITY_MODEL
- Risks introduced by changes → SERVICE_RISK_REGISTER
- Pre-promotion checks → SERVICE_READINESS