MIGRATION_PLAN — payment-gateway-service
Sibling: DATA_MODEL · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS
This plan covers two scenarios: (a) the initial production cutover — moving the service from staging to prod with first tenants — and (b) per-tenant migrations that occur continuously as new tenants onboard, schemas evolve, and existing Stripe accounts attach via OAuth.
1. Initial production cutover (T-zero)
1.1 Pre-conditions
- All checkboxes in SERVICE_READINESS complete and dated.
- Vendor production accounts created and verified (Stripe Connect platform live, PayPal merchant approved, HesabPay live key issued).
- Cloud SQL gm-payments-prod provisioned with HA, CMEK, regional backup.
- Cloud Armor + Workload Identity policies in place.
- DNS records (api.melmastoon.ghasi.io route, webhooks.payments.melmastoon.ghasi.io) ready but not yet pointed at the new service.
1.2 Cutover steps
- Deploy the production image to payment-api and payment-worker. No traffic yet (manifests at replicas: 0 for scale-out checks).
- Apply central migrations (migrations/central/*.sql).
- Smoke with internal synthetic tenant tnt_smoke_001.
- Scale up to baseline (payment-api 3 pods, payment-worker 2 pods).
- Cut DNS for the webhook subdomain first (vendors begin delivering events; nothing to apply yet, so an empty inbox is OK).
- Cut DNS for the API subdomain.
- Onboard pilot tenants (3 tenants in 1 week). For each, follow §2.
- Observe SLOs for 7 days; complete the post-launch readiness review.
1.3 Rollback
- DNS swap back to legacy provider (if any) or to the staging deployment serving as a temporary stub.
- Cloud SQL holds all transactions — no data loss.
- Vendor webhook deliveries continue to be accepted at the new endpoint (200 OK); the resulting state changes can be replayed once the service is brought back.
2. Per-tenant onboarding
For each new tenant, tenant-service invokes the payments admin CLI and tracks completion in tenant_schema_registry.
2.1 Schema provisioning
payments-admin-cli provision-tenant \
--tenant-id tnt_01H… \
--pci-profile saq_a \
--cmek-key projects/gm-payments-prod/locations/us/keyRings/gm-payments/cryptoKeys/tnt_01H…
This:
- Creates schema tnt_<tenantId> in Cloud SQL.
- Creates per-tenant role with USAGE on the schema only.
- Applies all tenant migrations from migrations/tenant/ in order.
- Inserts a row in payments_central.tenant_schema_registry with the CMEK URI and applied migration versions.
- Verifies role separation by attempting a cross-tenant SELECT and asserting permission denied.
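The schema and role steps above can be sketched as the DDL a provisioning run might issue. This is a minimal sketch, not the CLI's actual implementation: it assumes Cloud SQL for PostgreSQL, and the provisioningStatements and isPermissionDenied helpers and the "_role" naming suffix are hypothetical. The SQLSTATE 42501 is the code PostgreSQL reports for insufficient privilege, which is what the cross-tenant SELECT check would assert on.

```typescript
// SQLSTATE that PostgreSQL reports for "permission denied" (insufficient_privilege).
const INSUFFICIENT_PRIVILEGE = "42501";

// Hypothetical helper: the DDL a provisioning run might issue for one schema.
// The "_role" suffix is an assumed naming scheme, not taken from the CLI.
function provisioningStatements(schema: string): string[] {
  // DDL cannot take bind parameters, so only plain identifiers are accepted
  // before the name is interpolated into the statements.
  if (!/^tnt_[A-Za-z0-9_]+$/.test(schema)) {
    throw new Error(`invalid schema name: ${schema}`);
  }
  const role = `${schema}_role`;
  return [
    `CREATE SCHEMA "${schema}"`,
    `CREATE ROLE "${role}" NOLOGIN`,
    // USAGE on the tenant's own schema only; no grants on any other schema.
    `GRANT USAGE ON SCHEMA "${schema}" TO "${role}"`,
  ];
}

// True when a driver error carries the permission-denied SQLSTATE.
// Assumes the driver exposes the SQLSTATE as a `code` property.
function isPermissionDenied(err: { code?: string }): boolean {
  return err.code === INSUFFICIENT_PRIVILEGE;
}
```

A verification step could run a SELECT against another tenant's schema under the new role and pass only if the caught error satisfies isPermissionDenied.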
2.2 Vendor credentials
payments-admin-cli attach-vendor \
--tenant-id tnt_01H… \
--processor stripe \
--env production \
--account-ref acct_1NXz… \
--api-key-from-secret projects/gm-payments-prod/secrets/tnt_01H…/stripe/prod/api_key/versions/latest \
--webhook-secret-from-secret projects/gm-payments-prod/secrets/tnt_01H…/stripe/prod/webhook_secret/versions/latest \
--precedence 100
Repeat per processor. Existing Stripe accounts attach via OAuth (Stripe Connect Standard): the tenant admin clicks "Connect Stripe" in bff-backoffice-service, completes OAuth at Stripe, and the callback handler stores the acct_…, refresh token, and webhook secret in Secret Manager, then calls attach-vendor.
2.3 Verification
- Run a £1.00 / $1.00 test authorize+void against each enabled processor.
- Confirm webhook receipt and apply.
- Mark tenant payments_ready=true in tenant-service.
3. Schema evolution (existing tenants)
3.1 Approach
- Per-tenant migrations are versioned per-tenant in payments_central.tenant_migrations(tenant_id, version, applied_at).
- Deploy applies the latest schema to all tenants in batches of 50 in parallel, with a 30-second timeout per tenant.
- Failed applications are retried automatically up to 3 times; remaining failures are surfaced via P2 alert and resumed via payments-admin-cli migrate-tenant <id>.
- Any migration introducing a column with a non-null default uses the expand-then-migrate-then-contract pattern over multiple deploys.
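The batching, timeout, and retry behaviour described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the deploy tooling itself: the function names are hypothetical, tenants inside a batch are assumed to run concurrently while batches run one after another, and "up to 3 retries" is read as one initial attempt plus three retries. The returned list is what would feed the P2 alert and a manual migrate-tenant run.

```typescript
type MigrateFn = (tenantId: string) => Promise<void>;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Reject if `work` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying work, it only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Returns null on success, or the tenant id once all attempts are exhausted.
async function migrateWithRetry(
  tenantId: string,
  migrate: MigrateFn,
  timeoutMs: number,
  maxRetries: number,
): Promise<string | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await withTimeout(migrate(tenantId), timeoutMs);
      return null;
    } catch (err) {
      // Swallow the error and try again until the retries run out.
    }
  }
  return tenantId;
}

// Returns the ids of tenants that still failed after all retries.
async function migrateAllTenants(
  tenantIds: string[],
  migrate: MigrateFn,
  batchSize = 50,
  timeoutMs = 30000,
  maxRetries = 3,
): Promise<string[]> {
  const failed: string[] = [];
  for (const batch of chunk(tenantIds, batchSize)) {
    const results = await Promise.all(
      batch.map((id) => migrateWithRetry(id, migrate, timeoutMs, maxRetries)),
    );
    for (const result of results) {
      if (result !== null) failed.push(result);
    }
  }
  return failed;
}
```

Because the timeout only stops waiting, a real migration function would also need its own statement or lock timeout on the database side so that an abandoned attempt does not overlap with its retry.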
3.2 Backward compatibility rules
- New columns must be nullable or have defaults; never NOT NULL without default.
- Renames are forbidden in a single deploy; always add new column → backfill → swap reads → deprecate old.
- Index creation uses CONCURRENTLY to avoid table locks.
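The three rules above lend themselves to a pre-merge check on migration files. The following is a rough sketch of such a check, assuming migrations are plain SQL statements; lintMigrationStatement is a hypothetical helper, and it relies on regular expressions rather than a SQL parser, so it is a heuristic that can miss or over-report unusual statements.

```typescript
// Returns the rule violations found in one SQL statement (empty when clean).
// Heuristic only: pattern matching, not a real SQL parser.
function lintMigrationStatement(sql: string): string[] {
  const s = sql.replace(/\s+/g, " ").trim();
  const problems: string[] = [];

  // Rule: a new column must be nullable or carry a default.
  if (/\bADD\s+COLUMN\b/i.test(s) && /\bNOT\s+NULL\b/i.test(s) && !/\bDEFAULT\b/i.test(s)) {
    problems.push("NOT NULL column without default");
  }

  // Rule: no renames in a single deploy (add, backfill, swap reads, deprecate).
  if (/\bRENAME\b/i.test(s)) {
    problems.push("rename in a single deploy");
  }

  // Rule: indexes must be built with CONCURRENTLY.
  if (/\bCREATE\s+(UNIQUE\s+)?INDEX\b/i.test(s) && !/\bINDEX\s+CONCURRENTLY\b/i.test(s)) {
    problems.push("index created without CONCURRENTLY");
  }

  return problems;
}
```

For example, a statement such as ALTER TABLE transactions ADD COLUMN note text NOT NULL would be reported, while the same statement with DEFAULT '' appended would pass.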
3.3 Read traffic during migration
- API stays online; ORM models are tolerant of unknown columns and missing-but-optional columns.
- Drizzle migrations are dry-run in staging against a snapshot of production tenants.
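What "tolerant of unknown columns and missing-but-optional columns" means for a reader can be illustrated with a small decoding function. This is a sketch only: the TransactionView shape and the id and note columns are assumptions for illustration (processor is the one column the plan names), and the real models are defined by the service's ORM layer.

```typescript
// Hypothetical read model. `note` stands in for an optional column that may
// not exist yet on tenants whose schema has not been migrated.
interface TransactionView {
  id: string;
  processor: string;
  note: string | null;
}

// Decode a raw row while a migration is in flight:
// - unknown columns are ignored (only known keys are copied),
// - a missing optional column becomes null,
// - a missing required column is still an error.
function decodeTransactionRow(row: Record<string, unknown>): TransactionView {
  const id = row["id"];
  const processor = row["processor"];
  if (typeof id !== "string" || typeof processor !== "string") {
    throw new Error("row is missing a required column");
  }
  const note = row["note"];
  return { id, processor, note: typeof note === "string" ? note : null };
}
```

Copying known keys explicitly, rather than spreading the raw row, is what keeps a newly added column from leaking into code that was deployed before the column existed.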
4. Vendor adapter additions
When introducing a new processor (e.g., Adyen):
- Implement the adapter class (AdyenAdapter implements PaymentPort), with full unit + contract tests in CI.
- Add adyen to the processor enum check on relevant tables (tnt_<id>.transactions.processor etc.).
- Provision sandbox vendor credentials per tenant who opts in (precedence default 200, off until tenant flips a flag).
- Roll out to 1 pilot tenant; canary for 7 days; promote to all opt-ins.
- Add canary cron and dashboard panels.
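The first two steps can be pictured with a stub adapter and an application-side mirror of the processor check. Everything here is hypothetical and for illustration only: the real PaymentPort interface lives in the service and its methods are not shown in this plan, the processor string values other than adyen are assumed, and the adapter below does not call any vendor API.

```typescript
// Hypothetical shape of the port; the real PaymentPort is defined by the service.
interface PaymentPort {
  readonly processor: string;
  authorize(amountMinor: number, currency: string): Promise<{ reference: string }>;
  voidAuthorization(reference: string): Promise<void>;
}

// Application-side mirror of the processor enum check (values assumed).
const PROCESSORS = ["stripe", "paypal", "hesabpay", "adyen"] as const;
type Processor = (typeof PROCESSORS)[number];

function isProcessor(value: string): value is Processor {
  return (PROCESSORS as readonly string[]).indexOf(value) !== -1;
}

// Stub adapter: validates input and echoes a reference, nothing more.
class AdyenAdapter implements PaymentPort {
  readonly processor = "adyen";

  async authorize(amountMinor: number, currency: string): Promise<{ reference: string }> {
    if (Math.floor(amountMinor) !== amountMinor || amountMinor <= 0) {
      throw new Error("amount must be a positive integer in minor units");
    }
    // A real adapter would call the vendor API here.
    return { reference: `stub-${currency}-${amountMinor}` };
  }

  async voidAuthorization(_reference: string): Promise<void> {
    // Intentionally empty in this sketch.
  }
}
```

Keeping the list of processors in one constant makes it easier to spot when the database check and the application disagree about which values are valid.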
5. Tenant offboarding
- Tenant submits deletion request via tenant-service.
- tenant-service checks with payment-gateway-service for non-terminal transactions; if any, deletion is paused.
- Soft-delete window of 30 days begins. During this period:
  - No new authorize/capture allowed.
  - Refunds and webhook ingestion continue.
  - Operators can run final reconciliation.
- After 30 days:
  - payments-admin-cli offboard-tenant <id> runs.
  - Schema dropped after final snapshot to GCS Coldline (7-year retention per legal hold).
  - Secret Manager entries scheduled for secrets:destroy (24 h grace).
  - Row in tenant_schema_registry set to archived_at = now().
- Webhook-inbox rows for the tenant are anonymized (tenant_id = NULL) and retained 90 days for fraud forensics.
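The gating rules of the offboarding flow can be expressed as three small predicates. This is a sketch under assumptions: the function and type names are hypothetical, the window is measured as exactly 30 × 24 hours from the moment the soft-delete begins, and the operation names are illustrative labels rather than the service's actual identifiers.

```typescript
type Operation = "authorize" | "capture" | "refund" | "webhook_ingest";

const SOFT_DELETE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Deletion is paused while any transaction is still in a non-terminal state.
function canStartSoftDelete(nonTerminalTransactions: number): boolean {
  return nonTerminalTransactions === 0;
}

// During the soft-delete window only refunds and webhook ingestion continue.
function isAllowedDuringSoftDelete(op: Operation): boolean {
  return op === "refund" || op === "webhook_ingest";
}

// The offboard step becomes due once the full window has elapsed.
function offboardDue(softDeleteStartedAt: Date, now: Date): boolean {
  return now.getTime() - softDeleteStartedAt.getTime() >= SOFT_DELETE_DAYS * MS_PER_DAY;
}
```

With a window starting at 2024-01-01T00:00:00Z, offboardDue stays false through 30 January and turns true at 2024-01-31T00:00:00Z.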
6. Disaster-driven re-platforming
Should a future ADR replace Cloud SQL with another database (e.g., AlloyDB), the migration approach is:
- Stand up the new platform alongside.
- Use logical replication for per-tenant schemas in batches.
- Cut writes per tenant during a defined low-traffic window.
- Validate row counts and outstanding transactions match.
- Swap connection strings in ConfigMap.
- Monitor; roll back by reverting the ConfigMap (reverse replication keeps running so the old platform stays current).
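The row-count validation step could look like the comparison below. It is a minimal sketch: diffRowCounts is a hypothetical helper, the per-table counts are assumed to have been collected from each platform beforehand, and matching counts are a necessary check rather than proof that the data is identical.

```typescript
// Per-table row counts for one tenant schema, keyed by table name.
type TableCounts = Record<string, number>;

// Returns the sorted names of tables whose counts differ between the two
// platforms, including tables that exist on only one side.
function diffRowCounts(source: TableCounts, target: TableCounts): string[] {
  const mismatched: string[] = [];
  const names = Object.keys(source).concat(Object.keys(target));
  for (const name of names) {
    if (mismatched.indexOf(name) !== -1) continue; // already reported
    if (source[name] !== target[name]) mismatched.push(name);
  }
  return mismatched.sort();
}
```

A tenant would be cut over only when the returned list is empty and the outstanding-transaction check also passes.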
7. Documentation & sign-off
Every migration plan in production must be:
- Linked to a Linear/Jira ticket with run sheet and rollback plan.
- Reviewed by SecOps (PCI impact), SRE (downtime impact), and the service owner.
- Recorded in the platform change log.
The first cutover and the first three tenant onboardings will be run as paired operations (two engineers minimum, one as observer).