Skip to main content

numbering-service — Service Readiness

Version: 1.0 Status: Draft Owner: Commerce Engineering + Platform SRE Last Updated: 2026-04-21 Companion: DEPLOYMENT_TOPOLOGY · TESTING_STRATEGY · OBSERVABILITY · MIGRATION_PLAN

This document tracks the readiness criteria for taking numbering-service from development to production GA. Items marked [BLOCKER] must be green before any tenant traffic is admitted.


1. Code Readiness

CriterionStatusNotes
gRPC ValidateLease handler with cache-first hot-pathP95 ≤ 20 ms cache-hit verified
gRPC Lookup handler
gRPC Reserve / Assign / Release / Recall handlers with CAS
REST admin surface (pools, contracts, blocks, numbers, exports)All endpoints in API_CONTRACTS §2
REST tenant portal surface
MNO CSV import with RSA signature verification[BLOCKER]
Reservation TTL via Redis keyspace + safety-net cron
Quarantine sweep cron (5 min)
Lease auto-renewal cron (daily)
Monthly regulator export cron + S3 object-lock[BLOCKER for ATRA submission]
Bulk-recall consumer (compliance.tenant.suspended.v1, tenant.deleted.v1)
Outbox relay with ordering by aggregate_id
Hash-chained audit table + daily verify cron[BLOCKER]
Multi-region CAS with cross-region quorumPer ADR-0004 §14
Anomaly-signal publisher → fraud-intel
Idempotency-key cache for state-mutation RPCs

2. Testing Readiness

CriterionTargetStatus
Unit-test coverage≥ 85 % line, ≥ 80 % branch
State-machine unit tests cover all valid + all invalid transitionsAll paths
Integration tests under testcontainers (PG + Redis + NATS)≥ 60 tests
Contract tests (Pact) with sms-orchestrator, routing-engine, sender-id-registry, customer-portal-bff, billing-service, regulator-portal-service6 / 6
Load test: 5000 RPS sustained 10 m, P95 ≤ 20 msPassed
Load test: lifecycle mix at 2000 RPS for 30 mPassed
CAS race test: 100 concurrent Reserves on same numberexactly 1 success
Multi-region failover test in staging< 30 s recovery, no double-assignment
Chaos drill: PG primary failover during loadPassed
Chaos drill: Redis outage; cache-fallback verifiedPassed
Chaos drill: NATS outage; outbox backlog drainsPassed
Security: cross-tenant RLS testAll endpoints blocked
Security: mTLS allowlist enforcementNon-allowlisted CN rejected
Security: hash-chain tamper detectionVerified by integrity cron
Security: CSV-injection attack vectorRejected

3. Observability Readiness

CriterionStatus
All Prometheus metrics in OBSERVABILITY §1 emitting
Grafana dashboard numbering-service.json deployed
All alerts in OBSERVABILITY §4 configured + tested via chaos drill
OTel traces propagate end-to-end (orchestrator → numbering → PG)
Structured logs validated for redaction (no full identifier in debug traces)
Loki log-parsing rules validated
Runbook written for each alert
SLO dashboard live with error-budget burn rate

Alert configurations

  • NumberingValidateLeaseP95High
  • NumberingValidateLeaseP99High
  • NumberingUnavailableRetries
  • NumberingShortCodeScarcityCritical
  • NumberingPoolExhaustionWarning / Critical
  • NumberingLeaseImportFailed
  • NumberingConflictSpike
  • NumberingQuarantineBacklog
  • NumberingAuditChainBroken [BLOCKER]
  • NumberingOutboxLag
  • NumberingRegulatorExportFailed
  • NumberingRenewalFailureSpike
  • NumberingCrossRegionLag
  • NumberingReservationCleanupLag

4. Security Readiness

CriterionStatus
mTLS enforced on gRPC port in production☐ [BLOCKER]
TLS certificates provisioned via Vault PKI + cert-manager☐ [BLOCKER]
NetworkPolicy restricting ingress to allowlisted callers
JWT validation on all REST endpoints (Kong)
RBAC roles enforced per endpoint + role-matrix integration test
Database credentials dynamic from Vault
Regulator-export signing key in Vault Transit, rotated
RLS policies on numbers, leases, tenant_pools verified
Penetration test completed (gRPC + REST surface)
Security review sign-off☐ [BLOCKER]

5. Operational Readiness

CriterionStatus
Kubernetes Deployment manifests reviewed
HPA configured with CPU + custom RPS metric
PDB minAvailable=2 validated against rolling update
Graceful shutdown (SIGTERM 30 s grace) drains in-flight gRPC
All CronJobs configured + dry-run tested
PG connection pool sized via load test
Redis cluster sized + memory-pressure tested
NATS streams provisioned with correct retention
S3 bucket ghasi-regulator-exports-* created with object-lock WORM 7 y☐ [BLOCKER for export]
Multi-region replication validated under load
On-call playbook written
Escalation matrix documented (Commerce Eng primary, SRE secondary)

6. Documentation Readiness

DocumentStatus
SERVICE_OVERVIEW.mdComplete
DOMAIN_MODEL.mdComplete
APPLICATION_LOGIC.mdComplete
API_CONTRACTS.mdComplete
DATA_MODEL.mdComplete
EVENT_SCHEMAS.mdComplete
SYNC_CONTRACT.mdComplete
SECURITY_MODEL.mdComplete
OBSERVABILITY.mdComplete
FAILURE_MODES.mdComplete
DEPLOYMENT_TOPOLOGY.mdComplete
TESTING_STRATEGY.mdComplete
LOCAL_DEV_SETUP.mdComplete
MIGRATION_PLAN.mdComplete
SERVICE_RISK_REGISTER.mdComplete
AI_INTEGRATION.mdComplete
Runbook for on-call
MNO contract operations playbook
ATRA regulator-export submission SOP

7. Compliance / Regulatory Readiness

CriterionStatus
MNO MoUs in place for all five operators (Roshan, Etisalat-AF, MTN-AF, AWCC, Salaam) with prefix allocations agreed☐ [BLOCKER]
Initial MSISDN inventory loaded from each MNO's signed CSV☐ [BLOCKER]
Initial short-code allocation from ATRA loaded☐ [BLOCKER]
Quarantine cool-off durations (90/30/365/0 days) approved by Legal☐ [BLOCKER]
Regulator-export format approved by ATRA☐ [BLOCKER for Phase 3]
Audit log retention policy (13 m hot + 7 y cold) approved by Legal + ATRA☐ [BLOCKER]
Tenant T&Cs include reservation/quarantine semantics
MNO signing key pre-installed for each operator☐ [BLOCKER]
Vanity short-code price tiers approved by Commerce + Legal

8. On-Call

RoleOwnerPager rotation
PrimaryCommerce Engineering24×7 PagerDuty rotation
SecondaryPlatform SRE24×7 backup
TertiaryDatabase / SRE LeadBusiness hours + critical pages

Escalation: Severity-1 (audit chain broken, full outage, multi-region split-brain) pages all three immediately.


9. Go / No-Go Criteria

Production GA is GO when:

  • All [BLOCKER] items in §1–§7 are green.
  • Test coverage ≥ 85 % line; all integration + contract tests passing.
  • Load test at 1.5× expected peak (7500 ValidateLease RPS) passes SLO.
  • All alerts configured and tested via chaos drill.
  • Security team sign-off obtained.
  • 14-day shadow mode in staging completed (numbering deployed but sms-orchestrator routes to mocked validate).
  • On-call playbook finalised and rotation staffed.
  • Rollback plan validated in staging.
  • At minimum: 5000 MSISDN, 50 short codes, vanity-eligible list loaded.
  • At minimum: 3 tenant pools live (one enterprise, one SMB, one government).

10. Post-Launch Review

Within 30 days of GA:

  • False-positive rate audit on CONFLICT errors (target < 1 % of Reserve attempts).
  • Reservation TTL precision audit (target ±2 s).
  • Quarantine cool-off compliance audit (no premature releases).
  • Audit hash-chain integrity verified across the period.
  • Regulator export submitted on schedule and accepted by ATRA.
  • Performance review — any P99 latency regressions?
  • HPA threshold tuning based on observed load patterns.
  • Tenant complaint review and any commerce ops rule adjustments.

End of SERVICE_READINESS.md