Number Intelligence Service — Service Readiness
- Version: 1.0
- Status: Draft
- Owner: Messaging Core
- Last Updated: 2026-04-21
- References: SERVICE_OVERVIEW.md, _report.md, docs/architecture/ADR-0004-national-backbone-resilience.md, SERVICE_RISK_REGISTER.md

This document tracks the readiness criteria for taking `number-intelligence-service` from development to production. The service is the platform's authoritative source for MSISDN attribution and is consulted synchronously by every routing decision (≥ 25 k QPS at peak from `routing-engine` alone), so the readiness bar is elevated: sub-5 ms cache-hit P95, fail-degraded confidence semantics, a hash-chained MNP audit defensible to ATRA, MNO MoUs in place for MNP exports, and a Redis cluster and Postgres replicas sized for load with headroom.
1. Code Readiness
| Criterion | Status | Notes |
|---|---|---|
| gRPC `NumberIntelligenceService.v1`: `ResolveMsisdn` / `ResolveBatch` / `ProbeHlr` / `LookupPorting` / `LookupEir` / `GetMnpHistory` / `LookupMsisdnImei` | ☐ | Hot path |
| REST `/v1/lookup/*`: Public Lookup API (single, batch, bulk-CSV) | ☐ | Tenant-billable |
| REST `/v1/admin/numint/*`: MNP, EIR, adapter, override admin | ☐ | |
| REST `/v1/regulator/numint/*`: MNP audit + chain verify | ☐ | |
| REST `/v1/lookup/audit`: tenant self-service audit read | ☐ | |
| `ni-hlr-gateway` DaemonSet (MAP + REST adapters) | ☐ | SIGTRAN M3UA + per-MNO REST |
| MNP daily reconciliation (CronJob; per-MNO) | ☐ | Pluggable adapter pattern |
| EIR daily reconciliation (ATRA + per-MNO CEIR) | ☐ | |
| MNP conflict detection + resolution flow | ☐ | Heuristic + admin UI |
| HLR probe TPS governor (Redis Lua token bucket) | ☐ | Per-MNO |
| Tenant lookup quota enforcement (RPS + monthly + fresh-lookup sub-bucket) | ☐ | |
| Multi-tier cache (LRU 100 k/pod → Redis 6-node cluster → PG) with per-class TTLs | ☐ | |
| Cache warm-on-deploy + hourly warm cron | ☐ | Readiness gate at 80 % |
| Audit hash-chain (RFC 8785 canonicalisation; daily verifier; freeze-on-break) | ☐ | |
| MSISDN normalisation (NFKC + strict E.164 regex) + per-tenant salt for audit | ☐ | |
| IMEI Luhn validation per 3GPP TS 23.003 §6.2.1 | ☐ | |
| Outbox publisher with `Nats-Msg-Id` idempotency | ☐ | |
| Idempotency-Key support on REST writes | ☐ | |
| mTLS gRPC + SPIFFE SAN allowlist | ☐ | SPIRE SVID enforcement |
| Pepper rotation tooling (envelope re-hash with `pepper_version` tracking) | ☐ | |
| Prefix-table fallback (loaded from Vault-pinned ATRA CSV at startup) | ☐ | |
| Per-tenant salt rotation tooling | ☐ | |
| AI conflict-triage adapter (advisory only; HITL) | ☐ | Optional in MVP |
| MNP-churn anomaly detector (IsolationForest weekly retrain) | ☐ | Phase 2; can defer |
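
The MSISDN normalisation and IMEI Luhn rows above can be illustrated with a minimal sketch. It assumes that "strict E.164" means a leading `+`, a non-zero first digit, and at most 15 digits in total; the function names `normaliseMsisdn` and `isValidImei` are hypothetical and are not taken from the service's code base.

```typescript
// Assumed reading of "strict E.164": "+", then 2 to 15 digits, first digit non-zero.
const E164 = /^\+[1-9]\d{1,14}$/;

/** Returns the normalised MSISDN, or null if the input is not strict E.164. */
function normaliseMsisdn(raw: string): string | null {
  // NFKC folds compatibility forms (for example fullwidth digits) to ASCII.
  const folded = raw.normalize("NFKC").trim();
  // \d in JavaScript matches only ASCII 0-9, so other digit scripts are rejected.
  return E164.test(folded) ? folded : null;
}

/** Luhn check over a 15-digit IMEI (14 payload digits + 1 check digit). */
function isValidImei(imei: string): boolean {
  if (!/^\d{15}$/.test(imei)) return false;
  let sum = 0;
  for (let i = 0; i < 15; i++) {
    let d = imei.charCodeAt(i) - 48;
    // Counting from the check digit leftwards, every second digit is doubled;
    // for a 15-digit string these are the odd 0-based indices.
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}
```

Rejecting anything that does not match after NFKC folding, rather than trying to repair it, keeps confusable inputs from mapping onto a valid subscriber number. The per-tenant salting used for audit is out of scope for this sketch.
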
2. Testing Readiness
| Criterion | Target | Status |
|---|---|---|
| Unit test coverage | ≥ 90 % line (domain), ≥ 80 % branch | ☐ |
| Unit tests for MSISDN normalisation (E.164, NFKC, confusables) | ≥ 30 cases | ☐ |
| Unit tests for IMEI Luhn validation | ≥ 15 | ☐ |
| Unit tests for hash-chain integrity (happy path, tamper, key rotation) | ≥ 15 | ☐ |
| Unit tests for MNP state-machine transitions | ≥ 12 | ☐ |
| Unit tests for MNP conflict resolver weighted score | ≥ 10 | ☐ |
| Unit tests for TPS governor (Redis Lua) | ≥ 10 | ☐ |
| Unit tests for tenant quota enforcer | ≥ 10 | ☐ |
| Unit tests for prefix-table fallback | ≥ 10 | ☐ |
| Unit tests for AI redactor (no raw MSISDN reaches LLM) | ≥ 8 | ☐ |
| Property-based tests (fast-check) — chain, scope isolation, TPS invariant | ≥ 8 properties | ☐ |
| Integration: gRPC ResolveMsisdn P95 ≤ 5 ms @ 10 000 RPS cache-hit | Passed | ☐ |
| Integration: ResolveMsisdn cascade transitions (LRU → Redis → PG → live) | Passed | ☐ |
| Integration: MNP reconciliation idempotent on re-ingest | Passed | ☐ |
| Integration: MNP conflict detection on overlapping files | Passed | ☐ |
| Integration: HLR probe MAP transport against mock SS7 | Passed | ☐ |
| Integration: HLR probe REST transport against mock REST endpoint | Passed | ☐ |
| Integration: per-MNO TPS governor admits ≤ capacity under burst | Passed | ☐ |
| Integration: tenant quota enforces RPS + monthly cap | Passed | ☐ |
| Integration: outbox drains; replay works | Passed | ☐ |
| Contract: routing-engine ↔ numint gRPC | Passed | ☐ |
| Contract: sms-firewall-service ↔ numint gRPC (incl. EIR observation) | Passed | ☐ |
| Contract: compliance-engine ↔ numint gRPC (GEO_RESTRICTION fail-degraded path) | Passed | ☐ |
| Contract: channel-router-service ↔ numint gRPC | Passed | ☐ |
| Contract: fraud-intel-service ↔ numint events | Passed | ☐ |
| Contract: billing-service ← numint.lookup.billed.v1 | Passed | ☐ |
| Contract: tenant Public Lookup REST shape | Passed | ☐ |
| Chaos: Postgres unavailable → fail-degraded with stale cache | Passed | ☐ |
| Chaos: Redis unavailable → PG direct, P95 ≤ 20 ms | Passed | ☐ |
| Chaos: NATS lag → outbox accumulates; hot path unaffected | Passed | ☐ |
| Chaos: HLR adapter pod failure → sibling pod takes over | Passed | ☐ |
| Chaos: MNP SFTP unreachable → use last-known; alert fires | Passed | ☐ |
| Chaos: simultaneous MNP file + live HLR write → cache poisoning prevented | Passed | ☐ |
| Chaos: cross-region partition → both regions continue hot reads | Passed | ☐ |
| Security: SPIFFE SAN allowlist enforced (forged caller rejected) | Passed | ☐ |
| Security: tenant A cannot read tenant B's lookup_audit (RLS) | Passed | ☐ |
| Security: hash-chain tamper detected by verifier within 24 h | Passed | ☐ |
| Security: residency test asserts no offshore endpoints | Passed | ☐ |
| Security: redactor verified — no raw MSISDN in LLM prompt | Passed | ☐ |
| Load: 10 000 RPS sustained 1 h, P99 ≤ 20 ms | Passed | ☐ |
| Load: MNP recon 1 M-row file completes ≤ 4 h | Passed | ☐ |
| E2E: full MNP scenario (pre-port → recon → post-port routing) | Passed | ☐ |
| E2E: MNP conflict resolution journey | Passed | ☐ |
| E2E: tenant Public Lookup journey + billing event flows | Passed | ☐ |
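
The burst invariant that the TPS governor tests target ("admits ≤ capacity under burst") can be sketched with an in-memory token bucket. This is a model of the logic only, not the Redis Lua script: the class name `TpsGovernor`, the injected clock, and the choice of burst capacity equal to the per-second limit are all assumptions made for illustration.

```typescript
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

class TpsGovernor {
  private readonly limits: Map<string, number>;
  private readonly buckets = new Map<string, Bucket>();

  // limits: hypothetical map of MNO id to permitted probes per second.
  constructor(limits: Map<string, number>) {
    this.limits = limits;
  }

  /** Returns true if one probe towards `mno` may proceed at time `nowMs`. */
  tryAcquire(mno: string, nowMs: number): boolean {
    const tps = this.limits.get(mno);
    if (tps === undefined) return false; // unknown MNO: deny by default (assumption)

    // A new bucket starts full; capacity equals the per-second limit (assumption).
    const b = this.buckets.get(mno) ?? { tokens: tps, lastRefillMs: nowMs };

    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSec = Math.max(0, nowMs - b.lastRefillMs) / 1000;
    b.tokens = Math.min(tps, b.tokens + elapsedSec * tps);
    b.lastRefillMs = nowMs;

    const admitted = b.tokens >= 1;
    if (admitted) b.tokens -= 1;
    this.buckets.set(mno, b);
    return admitted;
  }
}
```

Passing the clock in as a parameter makes the invariant easy to express as a unit or property-based test: for any sequence of calls within one instant, the number of admitted probes never exceeds the bucket capacity.
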
3. Observability Readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics emitting (see OBSERVABILITY §2) | ☐ |
| Grafana dashboards `numint-hot-path.json`, `numint-mnp-eir.json`, `numint-adapter.json`, `numint-public-api.json`, `numint-audit.json` deployed | ☐ |
| All alerts configured in Alertmanager with runbooks (see OBSERVABILITY §3) | ☐ |
| Structured JSON logs (Pino) with MSISDN/IMEI hash-masking | ☐ |
| OpenTelemetry trace propagation Kong → numint → ni-hlr-gateway → MNO verified | ☐ |
| Loki parsing rules for service logs validated | ☐ |
| SIEM forwarding of `numint.audit.*`, `numint.mnp.changed.v1` via regulator-portal verified | ☐ |
| Synthetic probe (every 30 s, both regions) | ☐ |
| Daily audit-chain tamper-detect drill | ☐ |

Alerts configured

- `NumIntLookupLatencyHigh` (P95 > 15 ms, 5 min)
- `NumIntCacheHitRateLow` (< 85 %, 30 min)
- `NumIntMnpReconciliationStale` (HIGH 26 h, CRITICAL 48 h)
- `NumIntReconciliationConflictSpike` (> 50 conflicts/h)
- `NumIntHlrProbeFailureHigh` (> 5 %, 15 min)
- `NumIntHlrAdapterDown`
- `NumIntAuditChainBroken` (CRITICAL)
- `NumIntOutboxStuck` (> 60 s lag, 5 min)
- `NumIntPublicLookupQuotaAbuse` (per-tenant breach > 10/15 min)
- `NumIntEventsDlqGrowing` (> 100/10 min)
- `NumIntPartitionMissing` (next-month partition not provisioned)
- `NumIntEgressOffshore` (residency violation)
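
The two-level threshold on the MNP reconciliation staleness alert (HIGH at 26 h, CRITICAL at 48 h) can be expressed as a small classifier, which is handy when unit-testing alert logic or a synthetic probe. This is a sketch only: the function name `mnpReconSeverity` is hypothetical, and treating the thresholds as inclusive is an assumption, since the list above does not say whether the boundary itself fires.

```typescript
type Severity = "OK" | "HIGH" | "CRITICAL";

const HOUR_MS = 60 * 60 * 1000;

/**
 * Classifies how stale the last successful MNP reconciliation is.
 * Thresholds come from the alert list; inclusive boundaries are an assumption.
 */
function mnpReconSeverity(lastSuccessMs: number, nowMs: number): Severity {
  const ageHours = (nowMs - lastSuccessMs) / HOUR_MS;
  if (ageHours >= 48) return "CRITICAL";
  if (ageHours >= 26) return "HIGH";
  return "OK";
}
```

The HIGH threshold sits just past one daily cycle, so a single late file raises HIGH while two consecutive missed deliveries escalate to CRITICAL.
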
4. Security Readiness
| Criterion | Status |
|---|---|
| mTLS enforced on gRPC port (SPIRE SVID per ADR-0004 §12) | ☐ |
| SPIFFE SAN allowlist verified (positive + negative cases) | ☐ |
| NetworkPolicy restricts ingress to SPIFFE-allowed callers + Kong | ☐ |
| NetworkPolicy denies all non-cluster egress (residency) | ☐ |
| Istio AuthorizationPolicy as defence in depth | ☐ |
| Kong JWT validation + tenant rate-limit on Public Lookup REST | ☐ |
| Per-tenant audit salt configured in Vault for every onboarded tenant | ☐ |
| MSISDN pepper + IMEI pepper in Vault; quarterly rotation drill executed | ☐ |
| Audit chain signing key in Vault Transit; rotation test passes | ☐ |
| Append-only DB rules verified on `portability_history`, `lookup_audit`, `audit_log` | ☐ |
| RLS policy verified on `lookup_audit` | ☐ |
| Penetration test against Public Lookup API + admin REST + ni-hlr-gateway | ☐ |
| Security team sign-off (Trust & Safety + Platform Security) | ☐ |
| DPIA reviewed for the Public Lookup API (subscriber-attribution disclosure) | ☐ |
| Regulator briefing complete for MNP-authority positioning | ☐ |
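
The tamper-evidence property behind the audit-chain criteria above can be sketched as follows. Several things here are assumptions: the entry shape, the all-zero genesis hash, and the use of SHA-256 are illustrative, and `canonicalise` is a simplified stand-in that only sorts object keys; it is not a full RFC 8785 implementation. Signing with the Vault Transit key is omitted.

```typescript
import { createHash } from "node:crypto";

type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

// Simplified stand-in for RFC 8785: recursively sorts object keys.
function canonicalise(v: Json): string {
  if (v === null || typeof v !== "object") return JSON.stringify(v);
  if (Array.isArray(v)) return "[" + v.map(canonicalise).join(",") + "]";
  const keys = Object.keys(v).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalise(v[k] as Json)).join(",") + "}";
}

interface AuditEntry {
  seq: number;
  payload: Json;
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64); // hypothetical genesis value

function entryHash(seq: number, payload: Json, prevHash: string): string {
  return createHash("sha256")
    .update(canonicalise({ seq, payload, prevHash }))
    .digest("hex");
}

/** Appends an entry whose hash covers the previous entry's hash. */
function append(chain: AuditEntry[], payload: Json): AuditEntry {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const seq = chain.length;
  const entry: AuditEntry = { seq, payload, prevHash, hash: entryHash(seq, payload, prevHash) };
  chain.push(entry);
  return entry;
}

/** Returns the index of the first broken entry, or -1 if the chain verifies. */
function verify(chain: AuditEntry[]): number {
  let prev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const e = chain[i] as AuditEntry;
    const ok = e.seq === i && e.prevHash === prev && e.hash === entryHash(e.seq, e.payload, e.prevHash);
    if (!ok) return i;
    prev = e.hash;
  }
  return -1;
}
```

Because each hash covers the previous one, editing any stored row invalidates that row and, unless every later hash is recomputed, the rest of the chain. A daily verifier only needs to walk the chain once and report the first failing index.
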
5. Operational Readiness
| Criterion | Status |
|---|---|
| K8s Deployment manifests (hot-path 6-30 replicas + batch 2 leader-elected + ni-hlr-gateway DaemonSet) reviewed | ☐ |
| HPA configured on `grpc_inflight_requests` + CPU | ☐ |
| PodDisruptionBudget `minAvailable: 4` per region | ☐ |
| Rolling update tested: zero dropped gRPC calls under steady 5 000 RPS | ☐ |
| Graceful shutdown: 15 s drain with SIGTERM handler | ☐ |
| Resource requests/limits validated under 10 000 RPS load | ☐ |
| Postgres connection pool sized; PgBouncer transaction mode | ☐ |
| Postgres read-replica pool sized for cascade fall-through | ☐ |
| Redis cluster sized 6 nodes/region; tested under cache-flush scenario | ☐ |
| Multi-region active-active verified (kbl ↔ mzr) | ☐ |
| MNP reconciliation leader-election tested (Mazar takes over after 10 min Kabul outage) | ☐ |
| Cache warm-on-deploy verified ≥ 80 % before readiness | ☐ |
| MNP recon runbook drafted | ☐ |
| MNP conflict resolution runbook drafted | ☐ |
| HLR adapter outage runbook drafted | ☐ |
| Audit chain break runbook drafted (frozen-write recovery) | ☐ |
| Pepper rotation runbook drafted | ☐ |
| Tenant onboarding playbook (per-tenant salt provisioning) | ☐ |
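
The cascade fall-through that the read-replica sizing and the cache-flush test refer to can be modelled in a few lines. This is a sketch under stated assumptions: the tiers are plain in-memory maps standing in for the per-pod LRU, Redis, and Postgres; `MapTier` and `resolve` are hypothetical names; per-class TTLs and the live HLR probe are omitted.

```typescript
class MapTier {
  readonly name: string;
  reads = 0; // counts lookups so load shifts between tiers are visible
  private readonly data = new Map<string, string>();

  constructor(name: string) {
    this.name = name;
  }
  get(key: string): string | undefined {
    this.reads++;
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
  flush(): void {
    this.data.clear();
  }
}

interface Resolution {
  mno: string;
  source: string; // name of the tier that answered
}

/** Walks the tiers fastest-first and back-fills the tiers that missed. */
function resolve(tiers: MapTier[], msisdn: string): Resolution | null {
  for (let i = 0; i < tiers.length; i++) {
    const tier = tiers[i] as MapTier;
    const mno = tier.get(msisdn);
    if (mno !== undefined) {
      for (let j = 0; j < i; j++) (tiers[j] as MapTier).set(msisdn, mno);
      return { mno, source: tier.name };
    }
  }
  return null; // a real caller would fall through to a live probe or the prefix table
}
```

After a flush of the upper tiers, every lookup lands on the slowest tier until back-fill catches up, which is why the replica pool and the Redis cluster are sized and tested against that scenario rather than against steady-state hit rates.
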
6. External Dependencies
| Dependency | Status | Notes |
|---|---|---|
| MNO MoU for daily MNP file delivery (5 MNOs: Afghan Wireless, MTN, Etisalat AF, Roshan, Salaam) | ☐ | BLOCKER — at least 3 MNOs must be MoU-signed before public launch |
| MNO HLR endpoint specifications (SIGTRAN params or REST URLs) | ☐ | Per-MNO; some MNOs may take time |
| ATRA CEIR feed access (SFTP credentials + PGP key) | ☐ | Optional — service runs without |
| ATRA numbering-plan CSV (prefix → MNO) | ☐ | Authoritative; refreshed quarterly |
| Per-MNO SS7 commercial agreement (TPS quota + per-query pricing) | ☐ | Drives MnoSnapshot.tpsLimit |
| Vault namespaces provisioned | ☐ | |
| Postgres patroni cluster provisioned per region | ☐ | |
| Redis cluster provisioned per region | ☐ | |
| NATS JetStream streams provisioned | ☐ | |
7. Documentation Readiness
| Document | Status |
|---|---|
| SERVICE_OVERVIEW.md (this service) | ☑ Done |
| DOMAIN_MODEL.md | ☑ Done |
| APPLICATION_LOGIC.md | ☑ Done |
| API_CONTRACTS.md | ☑ Done |
| EVENT_SCHEMAS.md | ☑ Done |
| DATA_MODEL.md | ☑ Done |
| SYNC_CONTRACT.md | ☑ Done |
| SECURITY_MODEL.md | ☑ Done |
| OBSERVABILITY.md | ☑ Done |
| TESTING_STRATEGY.md | ☑ Done |
| DEPLOYMENT_TOPOLOGY.md | ☑ Done |
| FAILURE_MODES.md | ☑ Done |
| AI_INTEGRATION.md | ☑ Done |
| SERVICE_RISK_REGISTER.md | ☑ Done |
| MIGRATION_PLAN.md | ☑ Done |
| LOCAL_DEV_SETUP.md | ☑ Done |
| Tenant Public Lookup API guide (DevRel) | ☐ |
| MNP regulator briefing pack | ☐ |
| Per-MNO onboarding runbook | ☐ |
8. On-call
- Primary on-call: Messaging Core squad (PagerDuty rotation `numint-primary`).
- Secondary on-call: Platform SRE (PagerDuty `sre-secondary`).
- Security on-call: Trust & Safety + Platform Security (engaged on `NumIntAuditChainBroken` and any quota-abuse pattern suggestive of insider misuse).
- Escalation to MNO ops: per MoU contact tree, for MNP file delays > 26 h.
9. Go-live gates (in order)
1. Internal beta: traffic from `routing-engine` only; cache cold-start verified; no tenant traffic.
2. Internal full: all internal callers (firewall, compliance, channel-router, fraud-intel) live.
3. MNP reconciliation live: at least 3 MNO MoUs in place; daily runs passing for 14 consecutive days.
4. Public Lookup API beta: 5 pilot tenants; quota enforcement verified.
5. Public Lookup API GA: open to all tenants; billing live; full SLA.
6. EIR/CEIR live: ATRA feed integrated; `LookupEir` returns BLACKLIST data.

Each gate requires:
- Every `Status` checkbox in this document that falls within the gate's scope is ☑ (none left at ☐).
- 14 consecutive days of green SLOs.
- Sign-off by Messaging Core lead, SRE on-call, Security, and (for gates 3/5/6) Regulator Liaison.