numbering-service — Migration Plan

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE + Commerce Ops
Last Updated: 2026-04-21
Companion: SERVICE_READINESS · DEPLOYMENT_TOPOLOGY · DATA_MODEL


1. Overview

numbering-service is greenfield — there is no legacy inventory system to migrate data from at the platform level. However, deployment is gated by significant off-platform prerequisites: signed MNO MoUs for MSISDN block leases, ATRA short-code allocations, and Legal sign-off on the lifecycle / quarantine policy.

The rollout is phased so that internal services come online before tenant self-service, and tenant self-service comes online before regulator submission. Because the service is fail-closed on ValidateLease, an outage in an early phase blocks dispatch rather than letting messages from unleased sender identifiers through.
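The fail-closed rule can be stated as a single decision function. The sketch below assumes a hypothetical, simplified `ValidateLeaseOutcome` type (not the real gRPC message) and reuses reason codes named in §4: any outcome other than an explicit allow, including a transport error such as UNAVAILABLE, is treated as a deny.

```typescript
// Hypothetical, simplified outcome of a ValidateLease call (not the real proto).
type ValidateLeaseOutcome =
  | { kind: "ok"; allowed: boolean; reason?: string }
  | { kind: "error"; grpcStatus: string }; // e.g. "UNAVAILABLE", "DEADLINE_EXCEEDED"

interface DispatchDecision {
  dispatch: boolean;
  reason: string;
}

// Fail-closed: only an explicit, successful "allowed" verdict lets a message out.
function decideDispatch(outcome: ValidateLeaseOutcome): DispatchDecision {
  if (outcome.kind === "error") {
    // numbering-service unreachable or too slow: block rather than guess.
    return { dispatch: false, reason: `FAIL_CLOSED_${outcome.grpcStatus}` };
  }
  if (!outcome.allowed) {
    // Explicit deny, e.g. WRONG_TENANT or QUARANTINE_ACTIVE.
    return { dispatch: false, reason: outcome.reason ?? "DENIED" };
  }
  return { dispatch: true, reason: "OK" };
}
```

The error branch and the deny branch converge on `dispatch: false`, so an outage can only ever reduce outbound traffic, never admit extra traffic.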


2. Migration Phases

Phase 0 — Pre-Deployment Readiness (Weeks -8 to -1)

Most items in this phase are off-platform (MNO contracts, ATRA allocations, Legal approvals); the remaining items provision platform infrastructure.

| Task | Owner | Status |
|---|---|---|
| Sign MoU with Roshan (MCC 412/MNC 40) for initial MSISDN block | Commerce Ops + Legal | ☐ [BLOCKER] |
| Sign MoU with Etisalat-AF (412/50) | Commerce Ops + Legal | ☐ [BLOCKER] |
| Sign MoU with MTN-AF (412/01 + 412/20) | Commerce Ops + Legal | ☐ [BLOCKER] |
| Sign MoU with AWCC (412/03) | Commerce Ops + Legal | ☐ [BLOCKER] |
| Sign MoU with Salaam (412/88) | Commerce Ops + Legal | ☐ [BLOCKER] |
| Obtain ATRA short-code allocation (minimum 50 initial codes) | Commerce Ops + Legal | ☐ [BLOCKER] |
| Quarantine cool-off durations approved by Legal | Legal | ☐ [BLOCKER] |
| Audit-retention policy (13 months hot + 7 years cold) approved by Legal + ATRA | Legal + Compliance | ☐ [BLOCKER] |
| Provision PostgreSQL numbering schema in shared cluster | Platform DBA | |
| Provision Redis logical DB 7 with keyspace notifications enabled | Platform SRE | |
| Create NATS streams (NUMBERING_EVENTS, NUMBERING_AUDIT, NUMBERING_LEASES, NUMBERING_OPS, NUMBERING_REGULATOR) | Platform SRE | |
| Provision S3 bucket ghasi-regulator-exports-{kbl,mzr} with object-lock WORM retention of 7 years | Platform SRE + Security | |
| Vault PKI: provision mTLS certs for numbering-service and all six allowlisted callers | Security | |
| Vault Transit: provision regulator-export signing key | Security | |
| Install MNO public signing keys into numbering.mno_signing_keys table | Commerce Ops + Security | ☐ [BLOCKER] |

Phase 1 — Internal-Only Deployment (Week 1)

Deploy numbering-service with mTLS access only for internal services — no Kong route, no tenant portal endpoint.

Goal: validate hot-path latency, multi-region replication, fail-closed behaviour, and downstream event consumers — without admitting any tenant traffic.

Steps:

  1. Deploy 3 replicas in kbl + 3 in mzr.
  2. Apply NetworkPolicy: ingress only from sms-orchestrator, routing-engine, number-intelligence-service, sender-id-registry-service, compliance-engine, billing-service.
  3. Pre-seed mobile_operators, lease_contracts, and initial signed MNO CSV imports (≥ 100 MSISDNs per operator + ATRA-allocated short codes).
  4. Pre-seed tenant_pools for 3 internal test tenants (one enterprise-sim, one SMB-sim, one government-sim).
  5. Wire sms-orchestrator to call ValidateLease for every outbound message in shadow mode: the orchestrator keeps its legacy "always-allow" behaviour, compares the numbering verdict against that allow decision, and logs any divergence.
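Shadow mode in step 5 can be sketched as follows. The `NumberingVerdict` shape, the `shadowValidate` helper and the `logDivergence` callback are hypothetical: the legacy decision (always allow) is what the orchestrator enforces, and the numbering verdict is only compared and logged.

```typescript
// Hypothetical numbering verdict, reduced to what shadow mode needs.
interface NumberingVerdict {
  allowed: boolean;
  reason: string; // e.g. "OK", "WRONG_TENANT", "NOT_REGISTERED"
}

interface Divergence {
  messageId: string;
  numberingReason: string;
}

// Legacy Phase 1 behaviour: every message is allowed.
const LEGACY_ALLOW = true;

// Returns the decision actually enforced (always the legacy one) and reports
// a divergence whenever numbering disagrees with it.
function shadowValidate(
  messageId: string,
  verdict: NumberingVerdict,
  logDivergence: (d: Divergence) => void,
): boolean {
  if (verdict.allowed !== LEGACY_ALLOW) {
    logDivergence({ messageId, numberingReason: verdict.reason });
  }
  return LEGACY_ALLOW; // enforcement is unchanged while in shadow mode
}
```

Because the enforced value never depends on the verdict, a numbering bug found during the soak shows up only as divergence log volume, not as blocked traffic.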

Observation window: 7 days of production traffic.

Exit criteria:

  • ValidateLease P95 ≤ 20 ms on cache hit and ≤ 50 ms on PostgreSQL fallback, sustained over 7 days.
  • Error rate < 0.05 %.
  • Cache hit ratio ≥ 95 %.
  • Cross-region lag P95 < 2 s.
  • Hash-chain verify cron green for 7 consecutive days.
  • Outbox lag P95 < 5 s; no consumer falling behind.
  • Zero unexpected INVALID_TRANSITION events.
  • The compliance.tenant.suspended.v1 consumer correctly bulk-recalls the suspended tenant's leases in test scenarios.
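The quantitative exit criteria above can be checked mechanically. A sketch, assuming a hypothetical `Phase1Metrics` snapshot already aggregated over the 7-day window (how it is collected is out of scope); the thresholds are the ones listed above. The consumer-lag and bulk-recall criteria are qualitative and stay a manual check.

```typescript
// Hypothetical 7-day aggregate of the Phase 1 observation window.
interface Phase1Metrics {
  validateLeaseP95CacheHitMs: number;
  validateLeaseP95PgFallbackMs: number;
  errorRatePct: number;
  cacheHitRatioPct: number;
  crossRegionLagP95Ms: number;
  hashChainGreenDays: number; // consecutive green runs of the verify cron
  outboxLagP95Ms: number;
  invalidTransitionEvents: number;
}

// Each criterion pairs a label with a predicate over the metrics.
const PHASE1_CRITERIA: Array<[string, (m: Phase1Metrics) => boolean]> = [
  ["ValidateLease P95 cache-hit <= 20 ms", (m) => m.validateLeaseP95CacheHitMs <= 20],
  ["ValidateLease P95 PG-fallback <= 50 ms", (m) => m.validateLeaseP95PgFallbackMs <= 50],
  ["Error rate < 0.05 %", (m) => m.errorRatePct < 0.05],
  ["Cache hit ratio >= 95 %", (m) => m.cacheHitRatioPct >= 95],
  ["Cross-region lag P95 < 2 s", (m) => m.crossRegionLagP95Ms < 2000],
  ["Hash-chain verify green 7 days", (m) => m.hashChainGreenDays >= 7],
  ["Outbox lag P95 < 5 s", (m) => m.outboxLagP95Ms < 5000],
  ["Zero INVALID_TRANSITION events", (m) => m.invalidTransitionEvents === 0],
];

// Labels of every failed criterion; an empty list means Phase 1 may exit.
function failedPhase1Criteria(m: Phase1Metrics): string[] {
  return PHASE1_CRITERIA.filter(([, check]) => !check(m)).map(([label]) => label);
}
```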

Phase 2 — Tenant Self-Service Pool Management (Week 2–3)

Activate the customer-portal REST surface so tenants can browse, reserve, hold, lease, and release identifiers.

Prerequisites:

  • Phase 1 exit criteria all green.
  • customer-portal-bff deployed with numbering REST integration.
  • T&Cs published with reservation/quarantine semantics.
  • Onboarding runbook for tenant pool admin role.

Steps:

  1. Open Kong route /v1/portal/numbering/* with JWT + per-tenant rate limit.
  2. Open Kong route /v1/admin/numbering/* for platform-admin only.
  3. Roll out to 5 pilot tenants (one per market segment).
  4. Monitor numbering_reserve_total, numbering_assign_total, conflict rate, quota-exceeded errors, support tickets.
  5. After a 7-day pilot, open to all tenants.
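The conflict rate monitored in step 4 comes from compare-and-swap (CAS) writes: two tenants racing for the same identifier both read its current version, and only the first conditional write succeeds. A minimal in-memory sketch, assuming a hypothetical per-number `version` column; in PostgreSQL the same guard would typically be a conditional UPDATE whose affected-row count is checked.

```typescript
// Hypothetical row shape with an optimistic-locking version counter.
interface NumberRow {
  state: "AVAILABLE" | "RESERVED";
  tenantId: string | null;
  version: number;
}

type ReserveResult = "RESERVED" | "CONFLICT" | "NOT_AVAILABLE";

// CAS reserve: succeeds only if the row still carries the version the caller
// read. Roughly: UPDATE ... SET version = version + 1
// WHERE id = $1 AND version = $expected AND state = 'AVAILABLE'.
function reserveCas(
  rows: Map<string, NumberRow>,
  id: string,
  tenantId: string,
  expectedVersion: number,
): ReserveResult {
  const row = rows.get(id);
  if (row === undefined) return "NOT_AVAILABLE";
  if (row.version !== expectedVersion) return "CONFLICT"; // another writer won
  if (row.state !== "AVAILABLE") return "NOT_AVAILABLE";
  rows.set(id, { state: "RESERVED", tenantId, version: row.version + 1 });
  return "RESERVED";
}
```

A CONFLICT result tells the portal to re-read and retry (or show the number as taken) rather than overwrite another tenant's reservation.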

Exit criteria:

  • Pilot tenants successfully complete browse → reserve → hold → assign → release flow.
  • Reservations expire within ±2 s of their TTL.
  • Quarantine cool-off correctly enforced (tested by recalling a tenant's lease and immediately re-attempting a reservation of the same identifier).
  • Support ticket rate < 1 / 1000 reservations.
  • All [BLOCKER] items in SERVICE_READINESS §1–§7 green.
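The browse → reserve → hold → assign → release flow, together with the quarantine check above, amounts to a small state machine. A sketch, assuming hypothetical state names and a single `coolOffMs` duration (the real cool-off durations await Legal approval in Phase 0); a disallowed move raises INVALID_TRANSITION, the event name used in the Phase 1 exit criteria. The assumption that both release and recall enter quarantine is labelled in the code.

```typescript
// Hypothetical lifecycle states; the real enum lives in the numbering schema.
type State = "AVAILABLE" | "RESERVED" | "HELD" | "ASSIGNED" | "QUARANTINED";
type Action = "reserve" | "hold" | "assign" | "release" | "recall";

// Allowed (state, action) -> next state. Assumption: release and recall both quarantine.
const TRANSITIONS: Record<State, Partial<Record<Action, State>>> = {
  AVAILABLE: { reserve: "RESERVED" },
  RESERVED: { hold: "HELD", assign: "ASSIGNED" },
  HELD: { assign: "ASSIGNED" },
  ASSIGNED: { release: "QUARANTINED", recall: "QUARANTINED" },
  QUARANTINED: {},
};

interface NumberState {
  state: State;
  quarantinedAtMs?: number;
}

class InvalidTransition extends Error {}

// Applies an action at nowMs. A quarantined number only becomes AVAILABLE
// again once coolOffMs has elapsed since it entered quarantine.
function applyAction(n: NumberState, action: Action, nowMs: number, coolOffMs: number): NumberState {
  let current = n;
  if (
    current.state === "QUARANTINED" &&
    current.quarantinedAtMs !== undefined &&
    nowMs - current.quarantinedAtMs >= coolOffMs
  ) {
    current = { state: "AVAILABLE" }; // cool-off elapsed
  }
  const next = TRANSITIONS[current.state][action];
  if (next === undefined) {
    throw new InvalidTransition(`INVALID_TRANSITION: ${action} from ${current.state}`);
  }
  return next === "QUARANTINED" ? { state: next, quarantinedAtMs: nowMs } : { state: next };
}
```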

Phase 3 — Regulator Export Live (Week 4–6)

Activate monthly ATRA submission via regulator-portal-service.

Prerequisites:

  • Phase 2 exit criteria all green.
  • ATRA-approved export format finalised.
  • ATRA submission SOP signed off by Legal + Compliance.
  • regulator-portal-service consuming numbering.regulator.export.generated.v1.

Steps:

  1. Run a dry-run export for the most recent complete month (e.g., 2026-03 if launching mid-April).
  2. Manual review of the dry-run output by Legal + Compliance.
  3. ATRA acceptance test on the dry-run file (out-of-band).
  4. Activate the monthly cron (01:00 UTC on the 1st of each month).
  5. Make the first live submission at the next month boundary.
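The two date rules in the steps above, "most recent complete month" for the dry run and "01:00 UTC on the 1st" for the live cron, can be pinned down as follows. The five-field cron expression is standard syntax; the constant and helper names are hypothetical.

```typescript
// 01:00 on the 1st of every month, in five-field cron syntax
// (minute hour day-of-month month day-of-week). A Kubernetes CronJob uses the
// controller's local time zone unless spec.timeZone is set (e.g. "Etc/UTC").
const REGULATOR_EXPORT_SCHEDULE = "0 1 1 * *";

// Most recent fully elapsed calendar month, in UTC, as "YYYY-MM".
function lastCompleteMonth(now: Date): string {
  // Day 0 of the current month is the last day of the previous month.
  const prev = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), 0));
  const m = prev.getUTCMonth() + 1;
  return `${prev.getUTCFullYear()}-${m < 10 ? "0" : ""}${m}`;
}
```

Usage: a run triggered at 2026-05-01T01:00Z exports 2026-04, and the January run correctly rolls back to December of the previous year.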

Exit criteria:

  • ATRA accepts the first live monthly export within their stated SLA.
  • All audit hash-chain verifications green.
  • No data quality complaints from ATRA.

Phase 4 — Steady State + Continuous Improvement (Week 7+)

| Activity | Cadence |
|---|---|
| Monthly commerce ops review (pool utilisation, MNO contract status, scarcity outlook) | Monthly |
| Quarterly Legal review (cool-off policy, T&Cs alignment with regulatory updates) | Quarterly |
| Quarterly security review (mTLS rotations, MNO signing-key rotations, hash-chain audits) | Quarterly |
| Quarterly disaster-recovery drill (region failover, PG primary loss, NATS outage) | Quarterly |
| Annual ATRA compliance review | Annual |
| Annual MNO MoU renegotiation cycle | Annual (rolling, per MNO) |
| Continuous: HPA threshold tuning based on load patterns | Ongoing |
| Continuous: anomaly-signal threshold tuning with fraud-intel | Ongoing |

Future feature roadmap:

  • Phase 4.1 — Bulk renumbering tool (R-BUS-04 mitigation): supports MNO-driven prefix reshuffles.
  • Phase 4.2 — Portability ingestion: when ATRA enables national portability registry, integrate as a Lookup enrichment.
  • Phase 4.3 — Predictive scarcity dashboard (small ML model): forecast block exhaustion 60 d in advance.
  • Phase 4.4 — Tenant pool sub-segmentation: per-accountId quotas inside a pool, for enterprise multi-account scenarios.
  • Phase 4.5 — Two-person rule for admin tier overrides (R-SEC-01 mitigation).

3. Database Migrations

All DDL ships as Prisma migrations. Migrations are forward-only, and any migration touching PII-adjacent columns is reviewed by Security.

| # | Migration | Notes |
|---|---|---|
| 1 | 20260601000000_create_numbering_schema | Schema + enum types |
| 2 | 20260601100000_create_mno_and_contracts | mobile_operators, lease_contracts, mno_signing_keys |
| 3 | 20260601200000_create_numbers_with_rls | numbers + partial unique indexes + RLS policies |
| 4 | 20260601300000_create_leases_reservations | leases (active-unique), reservations (TTL-unique) |
| 5 | 20260602000000_create_quarantine_records | Includes cool-off CHECK constraints |
| 6 | 20260602100000_create_tenant_pools | One row per tenant; quotas |
| 7 | 20260602200000_create_lease_imports | Batches + errors |
| 8 | 20260602300000_create_audit_partitioned | Hash-chained, append-only via Postgres rules + trigger |
| 9 | 20260603000000_create_regulator_exports | Includes status enum |
| 10 | 20260603100000_create_idempotency_keys | |
| 11 | 20260603200000_create_outbox | Includes per-aggregate ordering index |
| 12 | 20260603300000_create_audit_initial_partitions | Next 3 months |
| 13 | 20260604000000_seed_operators_and_vanity_eligible | Seeds |
| 14 | 20260604100000_seed_initial_lease_contracts | Per-MNO contract rows |

Each migration is tested in staging. Although migrations are forward-only, rollback DDL is prepared and captured for emergency use.
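Migration 8 makes the audit table hash-chained, and Phase 1 requires the hash-chain verify cron to stay green. A minimal sketch of what such a verifier checks, assuming each hypothetical audit row stores `prevHash` and `hash = SHA-256(prevHash + payload)` with an all-zero genesis value; the actual column names and payload encoding may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit row: payload is the canonical serialised event.
interface AuditRow {
  seq: number;
  payload: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64); // assumed prevHash of the first row

function rowHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Appends a correctly chained row (what the insert trigger would do).
function appendRow(rows: AuditRow[], payload: string): void {
  const prevHash = rows.length > 0 ? rows[rows.length - 1].hash : GENESIS;
  rows.push({ seq: rows.length + 1, payload, prevHash, hash: rowHash(prevHash, payload) });
}

// Returns the seq of the first broken link, or null if the chain is intact.
function firstBrokenLink(rows: AuditRow[]): number | null {
  let expectedPrev = GENESIS;
  for (const row of rows) {
    if (row.prevHash !== expectedPrev || row.hash !== rowHash(row.prevHash, row.payload)) {
      return row.seq;
    }
    expectedPrev = row.hash;
  }
  return null;
}
```

Editing any payload invalidates that row's stored hash, and replacing the hash as well breaks the next row's `prevHash` link, so tampering is caught unless every later row is rewritten too.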


4. Existing Service Changes

sms-orchestrator

| Change | Complexity | Risk |
|---|---|---|
| Add NumberingClient gRPC stub in NATS consumer | Low | Low |
| Call ValidateLease per dequeued message; fail-closed on UNAVAILABLE | Medium | Medium |
| Honour WRONG_TENANT, LEASE_SUSPENDED, NOT_REGISTERED, QUARANTINE_ACTIVE reason codes | Low | Low |
| Subscribe to num.cache.invalidate.v1 ephemeral subject | Low | Low |
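The num.cache.invalidate.v1 subscription lets a recall or suspension take effect in the orchestrator without waiting for a cache TTL to lapse. A minimal sketch of the local cache side, assuming a hypothetical invalidation payload `{ tenantId, identifier? }` (no identifier meaning "drop every entry for this tenant", as after a bulk recall) and a hypothetical TTL; the NATS subscription wiring is omitted.

```typescript
// Hypothetical payload of num.cache.invalidate.v1.
interface InvalidateMsg {
  tenantId: string;
  identifier?: string; // omitted: invalidate every entry for the tenant
}

interface Entry {
  allowed: boolean;
  expiresAtMs: number;
}

class LeaseVerdictCache {
  private entries = new Map<string, Entry>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  private key(tenantId: string, identifier: string): string {
    return `${tenantId}|${identifier}`;
  }

  // undefined means a miss: the caller must ask ValidateLease (fail-closed).
  get(tenantId: string, identifier: string, nowMs: number): boolean | undefined {
    const e = this.entries.get(this.key(tenantId, identifier));
    if (e === undefined || e.expiresAtMs <= nowMs) return undefined;
    return e.allowed;
  }

  put(tenantId: string, identifier: string, allowed: boolean, nowMs: number): void {
    this.entries.set(this.key(tenantId, identifier), { allowed, expiresAtMs: nowMs + this.ttlMs });
  }

  // Handler for num.cache.invalidate.v1 messages.
  onInvalidate(msg: InvalidateMsg): void {
    if (msg.identifier !== undefined) {
      this.entries.delete(this.key(msg.tenantId, msg.identifier));
      return;
    }
    const prefix = `${msg.tenantId}|`;
    for (const k of Array.from(this.entries.keys())) {
      if (k.startsWith(prefix)) this.entries.delete(k);
    }
  }
}
```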

routing-engine

| Change | Complexity | Risk |
|---|---|---|
| Add NumberingClient.Lookup for per-message metadata | Low | Low |
| Use operatorId / mcc / mnc for carrier selection | Medium | Low |
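For carrier selection, the mcc / mnc pair returned by Lookup maps onto the operators whose MoUs are listed in Phase 0. A sketch of that mapping using the MCC/MNC pairs from the Phase 0 table; the `operatorId` strings are hypothetical, and the authoritative source would be the mobile_operators table.

```typescript
// MCC/MNC pairs as listed in Phase 0; operatorId values are illustrative only.
const OPERATORS_BY_MCC_MNC: Record<string, string> = {
  "412-40": "roshan",
  "412-50": "etisalat-af",
  "412-01": "mtn-af",
  "412-20": "mtn-af", // MTN-AF holds two MNCs
  "412-03": "awcc",
  "412-88": "salaam",
};

// MNCs are compared as strings so that leading zeros ("01", "03") survive.
function operatorFor(mcc: string, mnc: string): string | undefined {
  return OPERATORS_BY_MCC_MNC[`${mcc}-${mnc}`];
}
```

Keeping MNC as a string matters because "01" and "1" are different codes; an unknown pair yields undefined, which routing-engine should treat as "no carrier route" rather than fall back to a default operator.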

sender-id-registry-service

| Change | Complexity | Risk |
|---|---|---|
| Add gRPC server endpoint IsVerified(alphaId, tenantId) (a hard dependency of numbering-service) | Medium | Low |
| Subscribe to number.assigned.v1 (alpha) → mark inventory committed | Low | Low |
| Subscribe to number.recalled.v1 (alpha) → release inventory commit | Low | Low |
| Publish senderid.revoked.v1 → numbering recalls alpha lease | Low | Low |

compliance-engine

| Change | Complexity | Risk |
|---|---|---|
| compliance.tenant.suspended.v1 already published; no change needed | None | None |

billing-service

| Change | Complexity | Risk |
|---|---|---|
| Subscribe to number.assigned.v1 → start lease billing | Medium | Low |
| Subscribe to number.released.v1, .recalled.v1 → stop / prorate billing | Medium | Low |
| Subscribe to number.renewed.v1 → bill renewal cycle | Medium | Low |
| Add gRPC endpoint PreviewCharge (numbering calls before auto-renewal) | Medium | Low |
| Publish billing.account.delinquent.v1, .paid.v1 | Low | Low |
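Stopping or prorating billing on number.released.v1 / .recalled.v1 implies a proration rule. The sketch below uses one assumed rule (daily proration of a monthly lease fee in integer minor currency units, where any started day is billable); the actual rule is a billing-service decision and is not specified in this plan.

```typescript
const MS_PER_DAY = 86400000;

// Assumed rule: fee * (started days used / days in period), in integer minor
// units (e.g. cents), rounded to the nearest unit.
function proratedCharge(
  monthlyFeeMinor: number,
  periodStartMs: number,
  periodEndMs: number,
  stoppedAtMs: number,
): number {
  const periodMs = periodEndMs - periodStartMs;
  const periodDays = Math.round(periodMs / MS_PER_DAY);
  // Clamp the stop time into the period, then count started days.
  const usedMs = Math.min(Math.max(stoppedAtMs - periodStartMs, 0), periodMs);
  const usedDays = Math.ceil(usedMs / MS_PER_DAY);
  return Math.round((monthlyFeeMinor * usedDays) / periodDays);
}
```

Usage: a 3000-unit monthly fee for April released exactly at 2026-04-11T00:00Z bills 10 of 30 days, i.e. 1000 units; one second later it bills 11 days.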

customer-portal-bff

| Change | Complexity | Risk |
|---|---|---|
| Wire /v1/portal/numbering/* REST surface | High | Low |
| Display lifecycle state, quotas, lease history in UI | High | Low |
| Quarantine timeline visualisation | Medium | Low |

admin-dashboard-bff

| Change | Complexity | Risk |
|---|---|---|
| Pool admin UI (quotas, allowlist, vanity flag) | High | Low |
| MNO contract management UI | Medium | Low |
| Lease import workflow with signature upload | Medium | Low |
| Quarantine override workflow with justification capture | Medium | Low |
| Regulator-export review screen | Medium | Low |

regulator-portal-service

| Change | Complexity | Risk |
|---|---|---|
| Subscribe to numbering.regulator.export.generated.v1 | Low | Low |
| Surface S3-ref to ATRA-facing UI | Medium | Low |
| Capture ATRA acknowledgement | Medium | Low |

5. Rollback Plan

Because numbering-service is fail-closed, rollback cannot mean simply switching the service off: with it unavailable, every ValidateLease call fails and all SMS dispatch is blocked. Rollback strategies are therefore defined per phase:

| Phase | Rollback action | Time |
|---|---|---|
| Phase 1 (internal) | Revert sms-orchestrator config to skip ValidateLease (legacy "always-allow"); keep numbering-service deployed for cleanup | < 5 min |
| Phase 2 (tenant portal) | Disable Kong route /v1/portal/numbering/* via a Kong Admin API call; existing tenant operations continue via the admin-only path | < 1 min |
| Phase 3 (regulator export) | Suspend the monthly CronJob with kubectl patch cronjob (set spec.suspend to true); manual export remains possible | < 2 min |
| Phase 4 (any feature) | Standard feature-flag rollback | < 5 min |

Schema changes are forward-only. An emergency schema rollback would require a manual restore from the most recent backup (RPO 15 min).


6. Data Migration

None at platform level. Initial inventory is loaded via the MNO CSV import flow (Phase 0 step). No legacy database to migrate from.

If a legacy spreadsheet of tenant-claimed numbers exists (e.g., in Commerce Ops Excel), it is loaded with a one-time admin script, pnpm seed:legacy-leases --csv=<path>, which performs Reserve + Assign for each row. Because every row goes through the same lifecycle path as a normal lease, the audit trail is intact from day one.
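A minimal sketch of that seed script's core loop, assuming a hypothetical two-column CSV (`tenantId,identifier` with a header row) and a hypothetical `LifecycleClient` with synchronous `reserve` and `assign` calls (a real client would be asynchronous). Failures are collected per row rather than aborting the run, so Commerce Ops can fix and re-run only the failed rows.

```typescript
// Hypothetical client for the same Reserve/Assign operations tenants use.
interface LifecycleClient {
  reserve(tenantId: string, identifier: string): { reservationId: string };
  assign(reservationId: string): void;
}

interface RowError {
  line: number;
  message: string;
}

// Parses "tenantId,identifier" rows (header on line 1) and pushes each through
// Reserve then Assign, so every seeded lease gets a normal audit trail.
function seedLegacyLeases(csv: string, client: LifecycleClient): { seeded: number; errors: RowError[] } {
  const lines = csv.trim().split(/\r?\n/);
  const errors: RowError[] = [];
  let seeded = 0;
  lines.slice(1).forEach((raw, i) => {
    const line = i + 2; // 1-based line number, accounting for the header
    const [tenantId, identifier] = raw.split(",").map((s) => s.trim());
    if (!tenantId || !identifier) {
      errors.push({ line, message: "expected tenantId,identifier" });
      return;
    }
    try {
      const { reservationId } = client.reserve(tenantId, identifier);
      client.assign(reservationId);
      seeded++;
    } catch (e) {
      errors.push({ line, message: e instanceof Error ? e.message : String(e) });
    }
  });
  return { seeded, errors };
}
```

If Reserve succeeds but Assign fails, the sketch leaves the reservation in place; presumably it would lapse through the reservation TTL like any other unconverted reservation.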


7. Timeline Summary

| Week | Phase | Milestone |
|---|---|---|
| -8 to -1 | Phase 0 | MNO MoUs signed; initial inventory ingested; Legal sign-off |
| 1 | Phase 1 | Internal callers integrated; 7-day soak |
| 2–3 | Phase 2 | Tenant portal live (pilot, then GA) |
| 4–6 | Phase 3 | Regulator export live; first ATRA submission |
| 7+ | Phase 4 | Steady state, continuous improvement |

8. Success Metrics (Per Phase)

| Phase | Metric | Target |
|---|---|---|
| Phase 1 | ValidateLease P95 (cache hit) | ≤ 20 ms over 7 days |
| Phase 1 | Cache hit ratio | ≥ 95 % |
| Phase 2 | Tenant Reserve → Assign conversion | ≥ 60 % within 24 h |
| Phase 2 | CAS conflict rate | < 1 % of Reserve attempts |
| Phase 2 | Support ticket rate | < 1 / 1000 reservations |
| Phase 3 | ATRA acceptance rate | 100 % of monthly exports |
| Phase 3 | Audit-chain integrity | 0 violations |
| Phase 4 | SLO error budget | Within budget every quarter |
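Two of the Phase 2 metrics are ratios whose denominators are easy to get wrong. A sketch, assuming hypothetical per-attempt records from the Reserve path: conversion counts successful reservations assigned within 24 h of creation, while the CAS conflict rate divides conflicts by all Reserve attempts, not only the successful ones.

```typescript
// Hypothetical per-attempt record from the Reserve path.
interface ReserveAttempt {
  outcome: "RESERVED" | "CONFLICT" | "NOT_AVAILABLE";
  reservedAtMs?: number;
  assignedAtMs?: number;
}

const DAY_MS = 86400000;

// Share of successful reservations that were assigned within 24 h.
function conversionWithin24h(attempts: ReserveAttempt[]): number {
  const reserved = attempts.filter((a) => a.outcome === "RESERVED" && a.reservedAtMs !== undefined);
  if (reserved.length === 0) return 0;
  const converted = reserved.filter(
    (a) => a.assignedAtMs !== undefined && a.assignedAtMs - a.reservedAtMs! <= DAY_MS,
  );
  return converted.length / reserved.length;
}

// CAS conflicts as a share of all Reserve attempts.
function casConflictRate(attempts: ReserveAttempt[]): number {
  if (attempts.length === 0) return 0;
  return attempts.filter((a) => a.outcome === "CONFLICT").length / attempts.length;
}
```

Against the targets above, a healthy Phase 2 window would show `conversionWithin24h(...) >= 0.6` and `casConflictRate(...) < 0.01`.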

End of MIGRATION_PLAN.md