numbering-service — Service Readiness
Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: DEPLOYMENT_TOPOLOGY · TESTING_STRATEGY · OBSERVABILITY · MIGRATION_PLAN
This document tracks the readiness criteria for taking numbering-service from development to production GA. Items marked [BLOCKER] must be green before any tenant traffic is admitted.
1. Code Readiness
| Criterion | Status | Notes |
|---|---|---|
| gRPC ValidateLease handler with cache-first hot-path | ☐ | P95 ≤ 20 ms cache-hit verified |
| gRPC Lookup handler | ☐ | |
| gRPC Reserve / Assign / Release / Recall handlers with CAS | ☐ | |
| REST admin surface (pools, contracts, blocks, numbers, exports) | ☐ | All endpoints in API_CONTRACTS §2 |
| REST tenant portal surface | ☐ | |
| MNO CSV import with RSA signature verification | ☐ | [BLOCKER] |
| Reservation TTL via Redis keyspace + safety-net cron | ☐ | |
| Quarantine sweep cron (5 min) | ☐ | |
| Lease auto-renewal cron (daily) | ☐ | |
| Monthly regulator export cron + S3 object-lock | ☐ | [BLOCKER for ATRA submission] |
| Bulk-recall consumer (compliance.tenant.suspended.v1, tenant.deleted.v1) | ☐ | |
| Outbox relay with ordering by aggregate_id | ☐ | |
| Hash-chained audit table + daily verify cron | ☐ | [BLOCKER] |
| Multi-region CAS with cross-region quorum | ☐ | Per ADR-0004 §14 |
| Anomaly-signal publisher → fraud-intel | ☐ | |
| Idempotency-key cache for state-mutation RPCs | ☐ | |
2. Testing Readiness
| Criterion | Target | Status |
|---|---|---|
| Unit-test coverage | ≥ 85 % line, ≥ 80 % branch | ☐ |
| State-machine unit tests cover all valid + all invalid transitions | All paths | ☐ |
| Integration tests under testcontainers (PG + Redis + NATS) | ≥ 60 tests | ☐ |
| Contract tests (Pact) with sms-orchestrator, routing-engine, sender-id-registry, customer-portal-bff, billing-service, regulator-portal-service | 6 / 6 | ☐ |
| Load test: 5000 RPS sustained 10 m, P95 ≤ 20 ms | Passed | ☐ |
| Load test: lifecycle mix at 2000 RPS for 30 m | Passed | ☐ |
| CAS race test: 100 concurrent Reserves on same number | exactly 1 success | ☐ |
| Multi-region failover test in staging | < 30 s recovery, no double-assignment | ☐ |
| Chaos drill: PG primary failover during load | Passed | ☐ |
| Chaos drill: Redis outage; cache-fallback verified | Passed | ☐ |
| Chaos drill: NATS outage; outbox backlog drains | Passed | ☐ |
| Security: cross-tenant RLS test | All endpoints blocked | ☐ |
| Security: mTLS allowlist enforcement | Non-allowlisted CN rejected | ☐ |
| Security: hash-chain tamper detection | Verified by integrity cron | ☐ |
| Security: CSV-injection attack vector | Rejected | ☐ |
3. Observability Readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics in OBSERVABILITY §1 emitting | ☐ |
| Grafana dashboard numbering-service.json deployed | ☐ |
| All alerts in OBSERVABILITY §4 configured + tested via chaos drill | ☐ |
| OTel traces propagate end-to-end (orchestrator → numbering → PG) | ☐ |
| Structured logs validated for redaction (no full identifier in debug traces) | ☐ |
| Loki log-parsing rules validated | ☐ |
| Runbook written for each alert | ☐ |
| SLO dashboard live with error-budget burn rate | ☐ |
Alert configurations
- NumberingValidateLeaseP95High
- NumberingValidateLeaseP99High
- NumberingUnavailableRetries
- NumberingShortCodeScarcityCritical
- NumberingPoolExhaustionWarning/Critical
- NumberingLeaseImportFailed
- NumberingConflictSpike
- NumberingQuarantineBacklog
- NumberingAuditChainBroken [BLOCKER]
- NumberingOutboxLag
- NumberingRegulatorExportFailed
- NumberingRenewalFailureSpike
- NumberingCrossRegionLag
- NumberingReservationCleanupLag
4. Security Readiness
| Criterion | Status |
|---|---|
| mTLS enforced on gRPC port in production | ☐ [BLOCKER] |
| TLS certificates provisioned via Vault PKI + cert-manager | ☐ [BLOCKER] |
| NetworkPolicy restricting ingress to allowlisted callers | ☐ |
| JWT validation on all REST endpoints (Kong) | ☐ |
| RBAC roles enforced per endpoint + role-matrix integration test | ☐ |
| Database credentials dynamic from Vault | ☐ |
| Regulator-export signing key in Vault Transit, rotated | ☐ |
| RLS policies on numbers, leases, tenant_pools verified | ☐ |
| Penetration test completed (gRPC + REST surface) | ☐ |
| Security review sign-off | ☐ [BLOCKER] |
5. Operational Readiness
| Criterion | Status |
|---|---|
| Kubernetes Deployment manifests reviewed | ☐ |
| HPA configured with CPU + custom RPS metric | ☐ |
| PDB minAvailable=2 validated against rolling update | ☐ |
| Graceful shutdown (SIGTERM 30 s grace) drains in-flight gRPC | ☐ |
| All CronJobs configured + dry-run tested | ☐ |
| PG connection pool sized via load test | ☐ |
| Redis cluster sized + memory-pressure tested | ☐ |
| NATS streams provisioned with correct retention | ☐ |
| S3 bucket ghasi-regulator-exports-* created with object-lock WORM 7 y | ☐ [BLOCKER for export] |
| Multi-region replication validated under load | ☐ |
| On-call playbook written | ☐ |
| Escalation matrix documented (Commerce Eng primary, SRE secondary) | ☐ |
6. Documentation Readiness
| Document | Status |
|---|---|
| SERVICE_OVERVIEW.md | Complete |
| DOMAIN_MODEL.md | Complete |
| APPLICATION_LOGIC.md | Complete |
| API_CONTRACTS.md | Complete |
| DATA_MODEL.md | Complete |
| EVENT_SCHEMAS.md | Complete |
| SYNC_CONTRACT.md | Complete |
| SECURITY_MODEL.md | Complete |
| OBSERVABILITY.md | Complete |
| FAILURE_MODES.md | Complete |
| DEPLOYMENT_TOPOLOGY.md | Complete |
| TESTING_STRATEGY.md | Complete |
| LOCAL_DEV_SETUP.md | Complete |
| MIGRATION_PLAN.md | Complete |
| SERVICE_RISK_REGISTER.md | Complete |
| AI_INTEGRATION.md | Complete |
| Runbook for on-call | ☐ |
| MNO contract operations playbook | ☐ |
| ATRA regulator-export submission SOP | ☐ |
7. Compliance / Regulatory Readiness
| Criterion | Status |
|---|---|
| MNO MoUs in place for all five operators (Roshan, Etisalat-AF, MTN-AF, AWCC, Salaam) with prefix allocations agreed | ☐ [BLOCKER] |
| Initial MSISDN inventory loaded from each MNO's signed CSV | ☐ [BLOCKER] |
| Initial short-code allocation from ATRA loaded | ☐ [BLOCKER] |
| Quarantine cool-off durations (90/30/365/0 days) approved by Legal | ☐ [BLOCKER] |
| Regulator-export format approved by ATRA | ☐ [BLOCKER for Phase 3] |
| Audit log retention policy (13 m hot + 7 y cold) approved by Legal + ATRA | ☐ [BLOCKER] |
| Tenant T&Cs include reservation/quarantine semantics | ☐ |
| MNO signing key pre-installed for each operator | ☐ [BLOCKER] |
| Vanity short-code price tiers approved by Commerce + Legal | ☐ |
8. On-Call
| Role | Owner | Pager rotation |
|---|---|---|
| Primary | Commerce Engineering | 24×7 PagerDuty rotation |
| Secondary | Platform SRE | 24×7 backup |
| Tertiary | Database / SRE Lead | Business hours + critical pages |
Escalation: Severity-1 (audit chain broken, full outage, multi-region split-brain) pages all three immediately.
9. Go / No-Go Criteria
Production GA is GO when:
- All [BLOCKER] items in §1–§7 are green.
- Test coverage ≥ 85 % line; all integration + contract tests passing.
- Load test at 1.5× expected peak (7500 ValidateLease RPS) passes SLO.
- All alerts configured and tested via chaos drill.
- Security team sign-off obtained.
- 14-day shadow mode in staging completed (numbering deployed but sms-orchestrator routes to mocked validate).
- On-call playbook finalised and rotation staffed.
- Rollback plan validated in staging.
- At minimum: 5000 MSISDN, 50 short codes, vanity-eligible list loaded.
- At minimum: 3 tenant pools live (one enterprise, one SMB, one government).
10. Post-Launch Review
Within 30 days of GA:
- False-positive rate audit on CONFLICT errors (target < 1 % of Reserve attempts).
- Reservation TTL precision audit (target ±2 s).
- Quarantine cool-off compliance audit (no premature releases).
- Audit hash-chain integrity verified across the period.
- Regulator export submitted on schedule and accepted by ATRA.
- Performance review — any P99 latency regressions?
- HPA threshold tuning based on observed load patterns.
- Tenant complaint review and any commerce ops rule adjustments.
End of SERVICE_READINESS.md