numbering-service — Migration Plan
Version: 1.0 | Status: Draft | Owner: Commerce Engineering + Platform SRE + Commerce Ops | Last Updated: 2026-04-21 | Companion: SERVICE_READINESS · DEPLOYMENT_TOPOLOGY · DATA_MODEL
1. Overview
numbering-service is greenfield — there is no legacy inventory system to migrate data from at the platform level. However, deployment is gated by significant off-platform prerequisites: signed MNO MoUs for MSISDN block leases, ATRA short-code allocations, and Legal sign-off on the lifecycle / quarantine policy.
The rollout is phased so that internal services come online before tenant self-service, and tenant self-service comes online before regulator submission. Because the service is fail-closed on ValidateLease, an early-phase outage cannot leak unauthenticated dispatch.
2. Migration Phases
Phase 0 — Pre-Deployment Readiness (Weeks -8 to -1)
Most items in this phase are off-platform.
| Task | Owner | Status |
|---|---|---|
| Sign MoU with Roshan (MCC 412/MNC 40) for initial MSISDN block | Commerce ops + Legal | ☐ [BLOCKER] |
| Sign MoU with Etisalat-AF (412/50) | Commerce ops + Legal | ☐ [BLOCKER] |
| Sign MoU with MTN-AF (412/01 + 412/20) | Commerce ops + Legal | ☐ [BLOCKER] |
| Sign MoU with AWCC (412/03) | Commerce ops + Legal | ☐ [BLOCKER] |
| Sign MoU with Salaam (412/88) | Commerce ops + Legal | ☐ [BLOCKER] |
| Obtain ATRA short-code allocation (initial 50 codes minimum) | Commerce ops + Legal | ☐ [BLOCKER] |
| Quarantine cool-off durations approved by Legal | Legal | ☐ [BLOCKER] |
| Audit-retention policy (13 m hot + 7 y cold) approved by Legal + ATRA | Legal + Compliance | ☐ [BLOCKER] |
| Provision PostgreSQL numbering schema in shared cluster | Platform DBA | ☐ |
| Provision Redis logical DB 7 with keyspace notifications enabled | Platform SRE | ☐ |
| Create NATS streams (NUMBERING_EVENTS, NUMBERING_AUDIT, NUMBERING_LEASES, NUMBERING_OPS, NUMBERING_REGULATOR) | Platform SRE | ☐ |
| Provision S3 bucket ghasi-regulator-exports-{kbl,mzr} with object-lock WORM 7 y | Platform SRE + Security | ☐ |
| Vault PKI: provision mTLS certs for numbering-service and all six allowlisted callers | Security | ☐ |
| Vault Transit: provision regulator-export signing key | Security | ☐ |
| Install MNO public signing keys into numbering.mno_signing_keys table | Commerce ops + Security | ☐ [BLOCKER] |
Phase 1 — Internal-Only Deployment (Week 1)
Deploy numbering-service with mTLS access only for internal services — no Kong route, no tenant portal endpoint.
Goal: validate hot-path latency, multi-region replication, fail-closed behaviour, and downstream event consumers — without admitting any tenant traffic.
Steps:
- Deploy 3 replicas in kbl + 3 in mzr.
- Apply NetworkPolicy: ingress only from `sms-orchestrator`, `routing-engine`, `number-intelligence-service`, `sender-id-registry-service`, `compliance-engine`, `billing-service`.
- Pre-seed `mobile_operators`, `lease_contracts`, and initial signed MNO CSV imports (≥ 100 MSISDNs per operator + ATRA-allocated short codes).
- Pre-seed `tenant_pools` for 3 internal test tenants (one enterprise-sim, one SMB-sim, one government-sim).
- Wire `sms-orchestrator` to call `ValidateLease` per outbound message; orchestrator continues legacy "always-allow" behaviour in shadow mode (compares numbering verdict to allow, logs divergence).
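
The shadow-mode comparison in the last step can be sketched as a small pure function. This is an illustration under assumptions, not the orchestrator's actual code: the `LeaseVerdict` and `ShadowOutcome` shapes and the `shadowCompare` name are hypothetical; only `ValidateLease` and the "always-allow" legacy behaviour come from the plan.

```typescript
// Hypothetical verdict shape; the real ValidateLease response is defined by numbering-service.
type LeaseVerdict = { allowed: boolean; reason?: string };

type ShadowOutcome = {
  dispatch: boolean; // what the orchestrator actually does with the message
  divergence: string | null; // reason to log, or null when both verdicts agree
};

// Shadow mode: the legacy decision stays authoritative.
// The numbering verdict is compared and logged, never enforced.
// A null verdict models a failed or timed-out ValidateLease call.
function shadowCompare(verdict: LeaseVerdict | null): ShadowOutcome {
  const legacyAllow = true; // legacy "always-allow" behaviour
  if (verdict === null) {
    return { dispatch: legacyAllow, divergence: "UNAVAILABLE" };
  }
  if (verdict.allowed !== legacyAllow) {
    return { dispatch: legacyAllow, divergence: verdict.reason ?? "UNSPECIFIED" };
  }
  return { dispatch: legacyAllow, divergence: null };
}
```

Keeping the comparison free of I/O means the divergence log can be unit-tested without a running numbering-service, and the later switch to enforcement only has to change which verdict drives `dispatch`.
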
Observation window: 7 days production traffic.
Exit criteria:
- `ValidateLease` P95 ≤ 20 ms cache-hit, ≤ 50 ms PG-fallback, sustained 7 days.
- Error rate < 0.05 %.
- Cache hit ratio ≥ 95 %.
- Cross-region lag P95 < 2 s.
- Hash-chain verify cron green for 7 consecutive days.
- Outbox lag P95 < 5 s; no consumer falling behind.
- Zero unexpected `INVALID_TRANSITION` events.
- `compliance.tenant.suspended.v1` consumer correctly bulk-recalls in test scenarios.
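
The numeric exit criteria above lend themselves to an automated gate. The sketch below is one possible shape, assuming the observed values have already been aggregated over the 7-day window; the `Phase1Metrics` field names and the `phase1ExitFailures` function are hypothetical, while the thresholds are the ones listed above. The qualitative criteria (no consumer falling behind, bulk-recall test scenarios) are not modelled and would still need manual sign-off.

```typescript
// Observed values over the observation window (field names are hypothetical).
interface Phase1Metrics {
  validateLeaseP95CacheHitMs: number;
  validateLeaseP95PgFallbackMs: number;
  errorRatePct: number;
  cacheHitRatioPct: number;
  crossRegionLagP95Sec: number;
  hashChainGreenDays: number;
  outboxLagP95Sec: number;
  unexpectedInvalidTransitions: number;
}

// Returns the names of failed criteria; an empty list means the gate is green.
function phase1ExitFailures(m: Phase1Metrics): string[] {
  const checks: Array<[string, boolean]> = [
    ["ValidateLease P95 cache-hit <= 20 ms", m.validateLeaseP95CacheHitMs <= 20],
    ["ValidateLease P95 PG-fallback <= 50 ms", m.validateLeaseP95PgFallbackMs <= 50],
    ["Error rate < 0.05 %", m.errorRatePct < 0.05],
    ["Cache hit ratio >= 95 %", m.cacheHitRatioPct >= 95],
    ["Cross-region lag P95 < 2 s", m.crossRegionLagP95Sec < 2],
    ["Hash-chain verify green for 7 consecutive days", m.hashChainGreenDays >= 7],
    ["Outbox lag P95 < 5 s", m.outboxLagP95Sec < 5],
    ["Zero unexpected INVALID_TRANSITION events", m.unexpectedInvalidTransitions === 0],
  ];
  return checks.filter(([, ok]) => !ok).map(([name]) => name);
}
```
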
Phase 2 — Tenant Self-Service Pool Management (Week 2–3)
Activate the customer-portal REST surface so tenants can browse, reserve, hold, lease, and release identifiers.
Prerequisites:
- Phase 1 exit criteria all green.
- Customer-portal-bff deployed with numbering REST integration.
- T&Cs published with reservation/quarantine semantics.
- Onboarding runbook for tenant pool admin role.
Steps:
- Open Kong route `/v1/portal/numbering/*` with JWT + per-tenant rate limit.
- Open Kong route `/v1/admin/numbering/*` for platform-admin only.
- Roll out to 5 pilot tenants (one per market segment).
- Monitor `numbering_reserve_total`, `numbering_assign_total`, conflict rate, quota-exceeded errors, support tickets.
- After 7 days of pilot: open to all tenants.
Exit criteria:
- Pilot tenants successfully complete browse → reserve → hold → assign → release flow.
- Reservation TTL precision ±2 s.
- Quarantine cool-off correctly enforced (test by tenant-recall and immediate re-attempt).
- Support ticket rate < 1 / 1000 reservations.
- All `[BLOCKER]` items in SERVICE_READINESS §1–§7 green.
Phase 3 — Regulator Export Live (Week 4–6)
Activate monthly ATRA submission via regulator-portal-service.
Prerequisites:
- Phase 2 exit criteria all green.
- ATRA-approved export format finalised.
- ATRA submission SOP signed off by Legal + Compliance.
- `regulator-portal-service` consuming `numbering.regulator.export.generated.v1`.
Steps:
- Run dry-run export for the most recent complete month (e.g., 2026-03 if launching mid-April).
- Manual review of dry-run output by Legal + Compliance.
- ATRA acceptance test on the dry-run file (out-of-band).
- Activate monthly cron at 01:00 UTC on the 1st.
- First live submission on next month boundary.
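
Two details in the steps above are easy to get wrong: the cron schedule and the choice of "most recent complete month" for the dry run. The sketch below shows both under stated assumptions: the constant and function names are hypothetical, the cron string uses standard five-field syntax, and the scheduler is assumed to evaluate it in UTC.

```typescript
// "01:00 on the 1st of every month" in standard 5-field cron syntax
// (minute hour day-of-month month day-of-week). Assumes the scheduler runs in UTC.
const MONTHLY_EXPORT_CRON = "0 1 1 * *";

// Returns the most recent complete calendar month as "YYYY-MM", evaluated in UTC.
function mostRecentCompleteMonth(now: Date): string {
  // Day 0 of the current month resolves to the last day of the previous month,
  // which also handles the January -> December year rollover.
  const lastDayOfPrevious = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), 0));
  const year = lastDayOfPrevious.getUTCFullYear();
  const month = lastDayOfPrevious.getUTCMonth() + 1;
  return `${year}-${month < 10 ? "0" : ""}${month}`;
}
```

For a launch on 15 April 2026 this selects `2026-03`, matching the example in the first step.
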
Exit criteria:
- ATRA accepts the first live monthly export within their stated SLA.
- All audit hash-chain verifications green.
- No data quality complaints from ATRA.
Phase 4 — Steady State + Continuous Improvement (Week 7+)
| Activity | Cadence |
|---|---|
| Monthly commerce ops review (pool utilisation, MNO contract status, scarcity outlook) | Monthly |
| Quarterly Legal review (cool-off policy, T&Cs alignment with regulatory updates) | Quarterly |
| Quarterly security review (mTLS rotations, MNO signing-key rotations, hash-chain audits) | Quarterly |
| Quarterly disaster-recovery drill (region failover, PG primary loss, NATS outage) | Quarterly |
| Annual ATRA compliance review | Annual |
| Annual MNO MoU renegotiation cycle | Annual (rolling, per MNO) |
| Continuous: HPA threshold tuning based on load patterns | Ongoing |
| Continuous: anomaly-signal threshold tuning with fraud-intel | Ongoing |
Future feature roadmap:
- Phase 4.1 — Bulk renumbering tool (R-BUS-04 mitigation): supports MNO-driven prefix reshuffles.
- Phase 4.2 — Portability ingestion: when ATRA enables national portability registry, integrate as a `Lookup` enrichment.
- Phase 4.3 — Predictive scarcity dashboard (small ML model): forecast block exhaustion 60 d in advance.
- Phase 4.4 — Tenant pool sub-segmentation: per-`accountId` quotas inside a pool, for enterprise multi-account scenarios.
- Phase 4.5 — Two-person rule for admin tier overrides (R-SEC-01 mitigation).
3. Database Migrations
All DDL via Prisma migrations; forward-only; reviewed by Security for any PII-adjacent columns.
| # | Migration | Notes |
|---|---|---|
| 1 | 20260601000000_create_numbering_schema | schema + enum types |
| 2 | 20260601100000_create_mno_and_contracts | mobile_operators, lease_contracts, mno_signing_keys |
| 3 | 20260601200000_create_numbers_with_rls | numbers + partial unique indexes + RLS policies |
| 4 | 20260601300000_create_leases_reservations | leases (active-unique), reservations (TTL-unique) |
| 5 | 20260602000000_create_quarantine_records | + cool-off CHECK constraints |
| 6 | 20260602100000_create_tenant_pools | one row per tenant; quotas |
| 7 | 20260602200000_create_lease_imports | batches + errors |
| 8 | 20260602300000_create_audit_partitioned | hash-chained, append-only via Postgres rules + trigger |
| 9 | 20260603000000_create_regulator_exports | + status enum |
| 10 | 20260603100000_create_idempotency_keys | |
| 11 | 20260603200000_create_outbox | + per-aggregate ordering index |
| 12 | 20260603300000_create_audit_initial_partitions | next 3 months |
| 13 | 20260604000000_seed_operators_and_vanity_eligible | seeds |
| 14 | 20260604100000_seed_initial_lease_contracts | per-MNO contract rows |
Each migration is staging-tested with rollback SQL prepared (forward-only, but rollback DDL captured for emergency).
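
Because the migrations are forward-only, their order is carried entirely by the timestamp prefix. A CI check along the following lines could guard against a misnamed or out-of-order migration. It is a sketch only: `findOrderingViolations` is a hypothetical helper, and the list simply repeats the names from the table above.

```typescript
// Migration names copied from the table above.
const NUMBERING_MIGRATIONS: string[] = [
  "20260601000000_create_numbering_schema",
  "20260601100000_create_mno_and_contracts",
  "20260601200000_create_numbers_with_rls",
  "20260601300000_create_leases_reservations",
  "20260602000000_create_quarantine_records",
  "20260602100000_create_tenant_pools",
  "20260602200000_create_lease_imports",
  "20260602300000_create_audit_partitioned",
  "20260603000000_create_regulator_exports",
  "20260603100000_create_idempotency_keys",
  "20260603200000_create_outbox",
  "20260603300000_create_audit_initial_partitions",
  "20260604000000_seed_operators_and_vanity_eligible",
  "20260604100000_seed_initial_lease_contracts",
];

// <14-digit timestamp>_<snake_case_name>
const MIGRATION_NAME = /^\d{14}_[a-z0-9_]+$/;

// Returns one message per name that is malformed or not strictly after its predecessor.
function findOrderingViolations(names: string[]): string[] {
  const violations: string[] = [];
  let previous = "";
  for (const name of names) {
    if (!MIGRATION_NAME.test(name)) {
      violations.push(`${name}: expected <14-digit timestamp>_<snake_case_name>`);
      continue;
    }
    // Equal-length digit strings compare correctly as plain strings.
    const stamp = name.slice(0, 14);
    if (stamp <= previous) {
      violations.push(`${name}: timestamp is not strictly after ${previous}`);
    }
    previous = stamp;
  }
  return violations;
}
```
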
4. Existing Service Changes
sms-orchestrator
| Change | Complexity | Risk |
|---|---|---|
| Add NumberingClient gRPC stub in NATS consumer | Low | Low |
| Call ValidateLease per dequeued message; fail-closed on UNAVAILABLE | Medium | Medium |
| Honour WRONG_TENANT, LEASE_SUSPENDED, NOT_REGISTERED, QUARANTINE_ACTIVE reason codes | Low | Low |
| Subscribe to num.cache.invalidate.v1 ephemeral subject | Low | Low |
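
The fail-closed rule and the reason codes in the table can be expressed as a single decision function. The following is a minimal sketch, not the orchestrator's real handler: the `ValidateLeaseResult` and `DispatchDecision` types and the `retryable` flag are assumptions introduced for illustration; the four reason codes and the fail-closed treatment of `UNAVAILABLE` come from the table.

```typescript
// Reason codes named in the plan.
type ReasonCode = "WRONG_TENANT" | "LEASE_SUSPENDED" | "NOT_REGISTERED" | "QUARANTINE_ACTIVE";

// Hypothetical normalised result of a ValidateLease call.
type ValidateLeaseResult =
  | { kind: "valid" }
  | { kind: "rejected"; reason: ReasonCode }
  | { kind: "unavailable" }; // e.g. gRPC UNAVAILABLE or a timeout

type DispatchDecision =
  | { dispatch: true }
  | { dispatch: false; reason: ReasonCode | "UNAVAILABLE"; retryable: boolean };

// Fail-closed: anything other than an explicit "valid" blocks dispatch.
function decideDispatch(result: ValidateLeaseResult): DispatchDecision {
  switch (result.kind) {
    case "valid":
      return { dispatch: true };
    case "rejected":
      // A policy rejection will not change on retry (assumption).
      return { dispatch: false, reason: result.reason, retryable: false };
    case "unavailable":
      // The message is held back, not dropped; retrying is an assumed policy.
      return { dispatch: false, reason: "UNAVAILABLE", retryable: true };
  }
}
```

Modelling the outcome as a discriminated union makes the compiler flag any new result kind that is added without an explicit dispatch decision, which helps keep the fail-closed property intact as the API evolves.
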
routing-engine
| Change | Complexity | Risk |
|---|---|---|
| Add NumberingClient.Lookup for per-message metadata | Low | Low |
| Use operatorId / mcc / mnc for carrier selection | Medium | Low |
sender-id-registry-service
| Change | Complexity | Risk |
|---|---|---|
| Add gRPC server endpoint IsVerified(alphaId, tenantId) (numbering's hard dependency) | Medium | Low |
| Subscribe to number.assigned.v1 (alpha) → mark inventory committed | Low | Low |
| Subscribe to number.recalled.v1 (alpha) → release inventory commit | Low | Low |
| Publish senderid.revoked.v1 → numbering recalls alpha lease | Low | Low |
compliance-engine
| Change | Complexity | Risk |
|---|---|---|
| compliance.tenant.suspended.v1 already published — no change needed | None | — |
billing-service
| Change | Complexity | Risk |
|---|---|---|
| Subscribe to number.assigned.v1 → start lease billing | Medium | Low |
| Subscribe to number.released.v1, .recalled.v1 → stop / prorate billing | Medium | Low |
| Subscribe to number.renewed.v1 → bill renewal cycle | Medium | Low |
| Add gRPC endpoint PreviewCharge (numbering calls before auto-renewal) | Medium | Low |
| Publish billing.account.delinquent.v1, .paid.v1 | Low | Low |
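
The subscription rows above amount to a routing table from event subject to billing action. A sketch of that mapping follows; the `BillingAction` labels and `billingActionFor` are hypothetical names, and the shorthand `.recalled.v1` is read here as `number.recalled.v1`.

```typescript
// Hypothetical labels for the three billing reactions described in the table.
type BillingAction = "START_LEASE_BILLING" | "STOP_OR_PRORATE" | "BILL_RENEWAL";

const SUBJECT_TO_ACTION = new Map<string, BillingAction>([
  ["number.assigned.v1", "START_LEASE_BILLING"],
  ["number.released.v1", "STOP_OR_PRORATE"],
  ["number.recalled.v1", "STOP_OR_PRORATE"],
  ["number.renewed.v1", "BILL_RENEWAL"],
]);

// Returns the billing action for a subject, or null for subjects billing ignores.
function billingActionFor(subject: string): BillingAction | null {
  return SUBJECT_TO_ACTION.get(subject) ?? null;
}
```
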
customer-portal-bff
| Change | Complexity | Risk |
|---|---|---|
| Wire /v1/portal/numbering/* REST surface | High | Low |
| Display lifecycle state, quotas, lease history in UI | High | Low |
| Quarantine timeline visualisation | Medium | Low |
admin-dashboard-bff
| Change | Complexity | Risk |
|---|---|---|
| Pool admin UI (quotas, allowlist, vanity flag) | High | Low |
| MNO contract management UI | Medium | Low |
| Lease import workflow with signature upload | Medium | Low |
| Quarantine override workflow with justification capture | Medium | Low |
| Regulator-export review screen | Medium | Low |
regulator-portal-service
| Change | Complexity | Risk |
|---|---|---|
| Subscribe to numbering.regulator.export.generated.v1 | Low | Low |
| Surface S3-ref to ATRA-facing UI | Medium | Low |
| Capture ATRA acknowledgement | Medium | Low |
5. Rollback Plan
Because numbering is fail-closed, rollback cannot simply disable the service: that would block all SMS dispatch. Rollback strategies per phase:
| Phase | Rollback action | Time |
|---|---|---|
| Phase 1 (internal) | Revert sms-orchestrator config to skip ValidateLease (legacy "always-allow"); keep numbering-service deployed for cleanup | < 5 min |
| Phase 2 (tenant portal) | Disable Kong route /v1/portal/numbering/* (Kong API call); existing tenant operations continue via admin-only path | < 1 min |
| Phase 3 (regulator export) | Disable monthly cron via kubectl patch cronjob; manual export still possible | < 2 min |
| Phase 4 (any feature) | Standard feature-flag rollback | < 5 min |
Schema migrations are forward-only: an emergency schema rollback would require a manual restore from the most recent backup (RPO 15 min).
6. Data Migration
None at platform level. Initial inventory is loaded via the MNO CSV import flow (Phase 0 step). No legacy database to migrate from.
If a legacy spreadsheet of tenant-claimed numbers exists (e.g., in Commerce Ops Excel), it is loaded by a one-time admin script, `pnpm seed:legacy-leases --csv=path`, which performs Reserve + Assign per row. Each row goes through the same lifecycle path as a normal lease, which ensures audit trail integrity from day one.
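
A possible core for such a seed script is sketched below. Everything beyond "Reserve + Assign per row" is an assumption: the two-column `tenantId,identifier` CSV layout, the `NumberingClient` interface, and the choice to collect failures instead of aborting are all hypothetical.

```typescript
// One row of the legacy CSV; the "tenantId,identifier" layout is an assumption.
interface LegacyRow {
  tenantId: string;
  identifier: string;
}

// Hypothetical client exposing the two lifecycle operations named in the plan.
interface NumberingClient {
  reserve(tenantId: string, identifier: string): Promise<string>; // resolves to a reservation id
  assign(reservationId: string): Promise<void>;
}

interface SeedReport {
  seeded: number;
  failed: Array<{ row: LegacyRow; error: string }>;
}

// Parses "tenantId,identifier" lines, skipping the header row, blank and incomplete lines.
function parseLegacyCsv(csv: string): LegacyRow[] {
  const rows: LegacyRow[] = [];
  for (const line of csv.split(/\r?\n/)) {
    const [tenantId = "", identifier = ""] = line.split(",").map((part) => part.trim());
    if (tenantId === "" || identifier === "" || tenantId === "tenantId") continue;
    rows.push({ tenantId, identifier });
  }
  return rows;
}

// Runs Reserve then Assign for each row, sequentially, so that audit entries
// are written in file order. Failures are collected for manual follow-up.
async function seedLegacyLeases(client: NumberingClient, rows: LegacyRow[]): Promise<SeedReport> {
  const report: SeedReport = { seeded: 0, failed: [] };
  for (const row of rows) {
    try {
      const reservationId = await client.reserve(row.tenantId, row.identifier);
      await client.assign(reservationId);
      report.seeded += 1;
    } catch (err) {
      report.failed.push({ row, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return report;
}
```

Going through the same client calls as a tenant would, instead of inserting rows directly, is what keeps the seeded leases indistinguishable from normal ones in the audit trail.
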
7. Timeline Summary
| Week | Phase | Milestone |
|---|---|---|
| -8 to -1 | Phase 0 | MNO MoUs signed; initial inventory ingested; Legal sign-off |
| 1 | Phase 1 | Internal callers integrated; 7-day soak |
| 2–3 | Phase 2 | Tenant portal live (pilot then GA) |
| 4–6 | Phase 3 | Regulator export live; first ATRA submission |
| 7+ | Phase 4 | Steady state, continuous improvement |
8. Success Metrics (Per Phase)
| Phase | Metric | Target |
|---|---|---|
| Phase 1 | ValidateLease P95 cache-hit | ≤ 20 ms over 7 d |
| Phase 1 | Cache hit ratio | ≥ 95 % |
| Phase 2 | Tenant Reserve → Assign conversion | ≥ 60 % within 24 h |
| Phase 2 | CAS conflict rate | < 1 % of Reserve attempts |
| Phase 2 | Support ticket rate | < 1 / 1000 reservations |
| Phase 3 | ATRA acceptance rate | 100 % of monthly exports |
| Phase 3 | Audit-chain integrity | 0 violations |
| Phase 4 | SLO error budget | Within budget every quarter |
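
The two Phase 2 ratio metrics can be computed as shown in the sketch below. The definitions are assumptions: "within 24 h" is measured from the reservation timestamp, the conflict rate is conflicts divided by Reserve attempts, and the record shape and function names are hypothetical.

```typescript
// Hypothetical record of one reservation and its eventual assignment.
interface ReservationRecord {
  reservedAt: number; // epoch milliseconds
  assignedAt: number | null; // epoch milliseconds, or null if never assigned
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Share of reservations assigned within 24 h of being reserved (target >= 60 %).
function conversionRatePct(records: ReservationRecord[]): number {
  if (records.length === 0) return 0;
  const converted = records.filter(
    (r) => r.assignedAt !== null && r.assignedAt - r.reservedAt <= DAY_MS,
  ).length;
  return (converted / records.length) * 100;
}

// CAS conflicts as a share of Reserve attempts (target < 1 %).
function conflictRatePct(conflicts: number, reserveAttempts: number): number {
  return reserveAttempts === 0 ? 0 : (conflicts / reserveAttempts) * 100;
}
```
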
End of MIGRATION_PLAN.md