Number Intelligence Service — Service Readiness

Version: 1.0
Status: Draft
Owner: Messaging Core
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, docs/architecture/ADR-0004-national-backbone-resilience.md, SERVICE_RISK_REGISTER.md

This document tracks the readiness criteria for taking number-intelligence-service from development to production. Because the service is the platform's authoritative source for MSISDN attribution and is consulted synchronously on every routing decision (≥ 25 k QPS at peak from routing-engine alone), the readiness bar is elevated: cache-hit P95 under 5 ms, fail-degraded confidence semantics, a hash-chained MNP audit trail defensible to ATRA, MNO MoUs in place for MNP exports, and a Redis cluster and Postgres replicas sized for load with headroom.


1. Code Readiness

| Criterion | Status | Notes |
| --- | --- | --- |
| gRPC NumberIntelligenceService.v1: ResolveMsisdn / ResolveBatch / ProbeHlr / LookupPorting / LookupEir / GetMnpHistory / LookupMsisdnImei |  | Hot path |
| REST /v1/lookup/*: Public Lookup API (single, batch, bulk-CSV) |  | Tenant-billable |
| REST /v1/admin/numint/*: MNP, EIR, adapter, override admin |  |  |
| REST /v1/regulator/numint/*: MNP audit + chain verify |  |  |
| REST /v1/lookup/audit: tenant self-service audit read |  |  |
| ni-hlr-gateway DaemonSet (MAP + REST adapters) |  | SIGTRAN M3UA + per-MNO REST |
| MNP daily reconciliation (CronJob; per-MNO) |  | Pluggable adapter pattern |
| EIR daily reconciliation (ATRA + per-MNO CEIR) |  |  |
| MNP conflict detection + resolution flow |  | Heuristic + admin UI |
| HLR probe TPS governor (Redis Lua token bucket) |  | Per-MNO |
| Tenant lookup quota enforcement (RPS + monthly + fresh-lookup sub-bucket) |  |  |
| Multi-tier cache (LRU 100 k/pod → Redis 6-node cluster → PG) with per-class TTLs |  |  |
| Cache warm-on-deploy + hourly warm cron |  | Readiness gate at 80 % |
| Audit hash-chain (RFC 8785 canonicalisation; daily verifier; freeze-on-break) |  |  |
| MSISDN normalisation (NFKC + strict E.164 regex) + per-tenant salt for audit |  |  |
| IMEI Luhn validation per 3GPP TS 23.003 §6.2.1 |  |  |
| Outbox publisher with Nats-Msg-Id idempotency |  |  |
| Idempotency-Key support on REST writes |  |  |
| mTLS gRPC + SPIFFE SAN allowlist |  | SPIRE SVID enforcement |
| Pepper rotation tooling (envelope re-hash with pepper_version tracking) |  |  |
| Prefix-table fallback (loaded from Vault-pinned ATRA CSV at startup) |  |  |
| Per-tenant salt rotation tooling |  |  |
| AI conflict-triage adapter (advisory only; HITL) |  | Optional in MVP |
| MNP-churn anomaly detector (IsolationForest weekly retrain) |  | Phase 2; can defer |
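
To make the two validation criteria above concrete, here is a minimal sketch of NFKC-then-E.164 normalisation and an IMEI Luhn check. The function names `normaliseMsisdn` and `isValidImei` are hypothetical, and the choice to reject (rather than strip) separators and non-ASCII digits is an assumption about what "strict" means here, not a description of the service's actual code.

```typescript
// Strict E.164: "+", a non-zero leading digit, at most 15 digits in total.
const E164 = /^\+[1-9][0-9]{1,14}$/;

/**
 * Fold compatibility forms (for example full-width digits) to ASCII with
 * NFKC, then apply the strict E.164 pattern. Returns null on any mismatch
 * instead of trying to repair the input (assumed behaviour).
 */
function normaliseMsisdn(raw: string): string | null {
  const folded = raw.normalize("NFKC");
  return E164.test(folded) ? folded : null;
}

/**
 * Luhn check over a 15-digit IMEI: walking right to left, every second
 * digit is doubled (subtracting 9 when the result exceeds 9) and the
 * total must be divisible by 10.
 */
function isValidImei(imei: string): boolean {
  if (!/^[0-9]{15}$/.test(imei)) return false;
  let sum = 0;
  for (let i = 0; i < 15; i++) {
    let digit = imei.charCodeAt(14 - i) - 48; // 48 is the code of "0"
    if (i % 2 === 1) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}
```

Because NFKC does not fold digits from other scripts (such as Arabic-Indic digits) to ASCII, the ASCII-only pattern rejects them, which is one way the "confusables" unit-test cases could be satisfied.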

2. Testing Readiness

| Criterion | Target | Status |
| --- | --- | --- |
| Unit test coverage | ≥ 90 % line (domain), ≥ 80 % branch |  |
| Unit tests for MSISDN normalisation (E.164, NFKC, confusables) | ≥ 30 cases |  |
| Unit tests for IMEI Luhn validation | ≥ 15 |  |
| Unit tests for hash-chain integrity (happy path, tamper, key rotation) | ≥ 15 |  |
| Unit tests for MNP state-machine transitions | ≥ 12 |  |
| Unit tests for MNP conflict resolver weighted score | ≥ 10 |  |
| Unit tests for TPS governor (Redis Lua) | ≥ 10 |  |
| Unit tests for tenant quota enforcer | ≥ 10 |  |
| Unit tests for prefix-table fallback | ≥ 10 |  |
| Unit tests for AI redactor (no raw MSISDN reaches LLM) | ≥ 8 |  |
| Property-based tests (fast-check): chain, scope isolation, TPS invariant | ≥ 8 properties |  |
| Integration: gRPC ResolveMsisdn P95 ≤ 5 ms @ 10 000 RPS cache-hit | Passed |  |
| Integration: ResolveMsisdn cascade transitions (LRU → Redis → PG → live) | Passed |  |
| Integration: MNP reconciliation idempotent on re-ingest | Passed |  |
| Integration: MNP conflict detection on overlapping files | Passed |  |
| Integration: HLR probe MAP transport against mock SS7 | Passed |  |
| Integration: HLR probe REST transport against mock REST endpoint | Passed |  |
| Integration: per-MNO TPS governor admits ≤ capacity under burst | Passed |  |
| Integration: tenant quota enforces RPS + monthly cap | Passed |  |
| Integration: outbox drains; replay works | Passed |  |
| Contract: routing-engine ↔ numint gRPC | Passed |  |
| Contract: sms-firewall-service ↔ numint gRPC (incl. EIR observation) | Passed |  |
| Contract: compliance-engine ↔ numint gRPC (GEO_RESTRICTION fail-degraded path) | Passed |  |
| Contract: channel-router-service ↔ numint gRPC | Passed |  |
| Contract: fraud-intel-service ↔ numint events | Passed |  |
| Contract: billing-service ← numint.lookup.billed.v1 | Passed |  |
| Contract: tenant Public Lookup REST shape | Passed |  |
| Chaos: Postgres unavailable → fail-degraded with stale cache | Passed |  |
| Chaos: Redis unavailable → PG direct, P95 ≤ 20 ms | Passed |  |
| Chaos: NATS lag → outbox accumulates; hot path unaffected | Passed |  |
| Chaos: HLR adapter pod failure → sibling pod takes over | Passed |  |
| Chaos: MNP SFTP unreachable → use last-known; alert fires | Passed |  |
| Chaos: simultaneous MNP file + live HLR write → cache poisoning prevented | Passed |  |
| Chaos: cross-region partition → both regions continue hot reads | Passed |  |
| Security: SPIFFE SAN allowlist enforced (forged caller rejected) | Passed |  |
| Security: tenant A cannot read tenant B's lookup_audit (RLS) | Passed |  |
| Security: hash-chain tamper detected by verifier within 24 h | Passed |  |
| Security: residency test asserts no offshore endpoints | Passed |  |
| Security: redactor verified (no raw MSISDN in LLM prompt) | Passed |  |
| Load: 10 000 RPS sustained 1 h, P99 ≤ 20 ms | Passed |  |
| Load: MNP recon 1 M-row file completes ≤ 4 h | Passed |  |
| E2E: full MNP scenario (pre-port → recon → post-port routing) | Passed |  |
| E2E: MNP conflict resolution journey | Passed |  |
| E2E: tenant Public Lookup journey + billing event flows | Passed |  |
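
The "admits ≤ capacity under burst" criterion for the TPS governor can be illustrated with an in-memory model of a token bucket. This is only a sketch of the invariant being tested: the real governor is described as a Redis Lua script, and the class name `TokenBucket`, its parameters, and the injected clock are all assumptions made for the example.

```typescript
/** In-memory token bucket with an injected clock so tests stay deterministic. */
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  /** Refill for the elapsed time (capped at capacity), then try to take one token. */
  tryAcquire(nowMs: number): boolean {
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    const refilled = this.tokens + (elapsedMs / 1000) * this.refillPerSecond;
    this.tokens = Math.min(this.capacity, refilled);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

/** Fire `attempts` probes at the same instant and count how many are admitted. */
function countAdmitted(bucket: TokenBucket, attempts: number, nowMs: number): number {
  let admitted = 0;
  for (let i = 0; i < attempts; i++) {
    if (bucket.tryAcquire(nowMs)) admitted++;
  }
  return admitted;
}
```

With a bucket of capacity 5 refilling at 5 tokens per second, a burst of 100 attempts at one instant admits 5, and a second burst one second later admits another 5. A property-based version of the same check would randomise capacity, rate, and arrival times.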

3. Observability Readiness

| Criterion | Status |
| --- | --- |
| All Prometheus metrics emitting (see OBSERVABILITY §2) |  |
| Grafana dashboards numint-hot-path.json, numint-mnp-eir.json, numint-adapter.json, numint-public-api.json, numint-audit.json deployed |  |
| All alerts configured in Alertmanager with runbooks (see OBSERVABILITY §3) |  |
| Structured JSON logs (Pino) with MSISDN/IMEI hash-masking |  |
| OpenTelemetry trace propagation Kong → numint → ni-hlr-gateway → MNO verified |  |
| Loki parsing rules for service logs validated |  |
| SIEM forwarding of numint.audit.*, numint.mnp.changed.v1 via regulator-portal verified |  |
| Synthetic probe (every 30 s, both regions) |  |
| Daily audit-chain tamper-detect drill |  |
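
The tamper-detect drill relies on each audit entry committing to its predecessor. A minimal sketch of that idea follows. The entry shape, the all-zero genesis value, and the `prevHash + "\n" + payload` link construction are assumptions for illustration, and the payload is assumed to be already canonicalised (RFC 8785) upstream; the actual schema lives in the service's data model.

```typescript
import { createHash } from "node:crypto";

interface ChainEntry {
  canonicalJson: string; // assumed to be RFC 8785-canonicalised before it gets here
  prevHash: string;      // hash of the previous entry, or GENESIS for the first
  hash: string;          // SHA-256 over prevHash and canonicalJson
}

// Hypothetical genesis value: 64 hex zeros.
const GENESIS = "0".repeat(64);

function linkHash(prevHash: string, canonicalJson: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(canonicalJson).digest("hex");
}

/** Return a new chain with one more entry linked to the current tail. */
function appendEntry(chain: ChainEntry[], canonicalJson: string): ChainEntry[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...chain, { canonicalJson, prevHash, hash: linkHash(prevHash, canonicalJson) }];
}

/** Index of the first entry whose link or content hash does not check out, or -1. */
function firstBrokenIndex(chain: ChainEntry[]): number {
  let expectedPrev = GENESIS;
  return chain.findIndex((entry) => {
    const intact =
      entry.prevHash === expectedPrev &&
      entry.hash === linkHash(entry.prevHash, entry.canonicalJson);
    expectedPrev = entry.hash;
    return !intact;
  });
}
```

A drill could copy a chain, alter one stored payload, and assert that `firstBrokenIndex` reports that position. In production a non-negative result would presumably be what triggers the freeze-on-break behaviour and the NumIntAuditChainBroken alert.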

Alerts configured

  • NumIntLookupLatencyHigh (P95 > 15 ms for 5 min)
  • NumIntCacheHitRateLow (< 85 % for 30 min)
  • NumIntMnpReconciliationStale (HIGH at 26 h, CRITICAL at 48 h)
  • NumIntReconciliationConflictSpike (> 50 conflicts/h)
  • NumIntHlrProbeFailureHigh (> 5 % for 15 min)
  • NumIntHlrAdapterDown
  • NumIntAuditChainBroken (CRITICAL)
  • NumIntOutboxStuck (> 60 s lag for 5 min)
  • NumIntPublicLookupQuotaAbuse (> 10 quota breaches per tenant in 15 min)
  • NumIntEventsDlqGrowing (> 100 in 10 min)
  • NumIntPartitionMissing (next-month partition not provisioned)
  • NumIntEgressOffshore (residency violation)
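
As an illustration of the two-tier staleness alert in the list above, the sketch below maps the age of the last successful MNP reconciliation to a severity. The function name and the use of strict "greater than" comparisons are assumptions; the real thresholds are evaluated by the Alertmanager rules, not by application code.

```typescript
type Severity = "OK" | "HIGH" | "CRITICAL";

const MS_PER_HOUR = 3600000;

/** Classify reconciliation staleness using the 26 h and 48 h thresholds. */
function mnpStalenessSeverity(lastSuccessMs: number, nowMs: number): Severity {
  const ageHours = (nowMs - lastSuccessMs) / MS_PER_HOUR;
  if (ageHours > 48) return "CRITICAL";
  if (ageHours > 26) return "HIGH";
  return "OK";
}
```

For a daily job, a 26 h threshold leaves roughly two hours of slack after a missed run before anyone is paged.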

4. Security Readiness

| Criterion | Status |
| --- | --- |
| mTLS enforced on gRPC port (SPIRE SVID per ADR-0004 §12) |  |
| SPIFFE SAN allowlist verified (positive + negative cases) |  |
| NetworkPolicy restricts ingress to SPIFFE-allowed callers + Kong |  |
| NetworkPolicy denies all non-cluster egress (residency) |  |
| Istio AuthorizationPolicy as defence in depth |  |
| Kong JWT validation + tenant rate-limit on Public Lookup REST |  |
| Per-tenant audit salt configured in Vault for every onboarded tenant |  |
| MSISDN pepper + IMEI pepper in Vault; quarterly rotation drill executed |  |
| Audit chain signing key in Vault Transit; rotation test passes |  |
| Append-only DB rules verified on portability_history, lookup_audit, audit_log |  |
| RLS policy verified on lookup_audit |  |
| Penetration test against Public Lookup API + admin REST + ni-hlr-gateway |  |
| Security team sign-off (Trust & Safety + Platform Security) |  |
| DPIA reviewed for the Public Lookup API (subscriber-attribution disclosure) |  |
| Regulator briefing complete for MNP-authority positioning |  |
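
The salt and pepper criteria above can be pictured with a short sketch of how an audit identifier might be derived. Everything specific here is an assumption for illustration: the HMAC-SHA-256 construction, the NUL separator, the `v<version>:` prefix, and the names `Pepper` and `auditHashMsisdn`. In the real service the secrets come from Vault and the scheme is defined in the security model.

```typescript
import { createHmac } from "node:crypto";

interface Pepper {
  version: number; // recorded with the hash so rotation can re-hash selectively
  secret: string;  // would be fetched from Vault; a literal only in this sketch
}

/**
 * Keyed hash of a normalised MSISDN, scoped to one tenant.
 * The pepper keys the HMAC; the tenant salt makes hashes non-comparable across tenants.
 */
function auditHashMsisdn(msisdnE164: string, tenantSalt: string, pepper: Pepper): string {
  const digest = createHmac("sha256", pepper.secret)
    .update(tenantSalt)
    .update("\u0000") // separator so salt and number cannot run together ambiguously
    .update(msisdnE164)
    .digest("hex");
  return `v${pepper.version}:${digest}`;
}
```

Carrying the pepper version alongside the digest is one way to support the pepper_version tracking mentioned for the rotation tooling: rows still on an old version can be found and re-hashed without guessing which key produced them.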

5. Operational Readiness

| Criterion | Status |
| --- | --- |
| K8s Deployment manifests (hot-path 6-30 replicas + batch 2 leader-elected + ni-hlr-gateway DaemonSet) reviewed |  |
| HPA configured on grpc_inflight_requests + CPU |  |
| PodDisruptionBudget minAvailable: 4 per region |  |
| Rolling update tested: zero dropped gRPC calls under steady 5 000 RPS |  |
| Graceful shutdown: 15 s drain with SIGTERM handler |  |
| Resource requests/limits validated under 10 000 RPS load |  |
| Postgres connection pool sized; PgBouncer transaction mode |  |
| Postgres read-replica pool sized for cascade fall-through |  |
| Redis cluster sized 6 nodes/region; tested under cache-flush scenario |  |
| Multi-region active-active verified (kbl ↔ mzr) |  |
| MNP reconciliation leader-election tested (Mazar takes over after 10 min Kabul outage) |  |
| Cache warm-on-deploy verified ≥ 80 % before readiness |  |
| MNP recon runbook drafted |  |
| MNP conflict resolution runbook drafted |  |
| HLR adapter outage runbook drafted |  |
| Audit chain break runbook drafted (frozen-write recovery) |  |
| Pepper rotation runbook drafted |  |
| Tenant onboarding playbook (per-tenant salt provisioning) |  |

6. External Dependencies

| Dependency | Status | Notes |
| --- | --- | --- |
| MNO MoU for daily MNP file delivery (5 MNOs: Afghan Wireless, MTN, Etisalat AF, Roshan, Salaam) |  | BLOCKER: at least 3 MNOs must be MoU-signed before public launch |
| MNO HLR endpoint specifications (SIGTRAN params or REST URLs) |  | Per-MNO; some MNOs may take time |
| ATRA CEIR feed access (SFTP credentials + PGP key) |  | Optional; the service runs without it |
| ATRA numbering-plan CSV (prefix → MNO) |  | Authoritative; refreshed quarterly |
| Per-MNO SS7 commercial agreement (TPS quota + per-query pricing) |  | Drives MnoSnapshot.tpsLimit |
| Vault namespaces provisioned |  |  |
| Postgres patroni cluster provisioned per region |  |  |
| Redis cluster provisioned per region |  |  |
| NATS JetStream streams provisioned |  |  |

7. Documentation Readiness

| Document | Status |
| --- | --- |
| SERVICE_OVERVIEW.md (this service) | ☑ Done |
| DOMAIN_MODEL.md | ☑ Done |
| APPLICATION_LOGIC.md | ☑ Done |
| API_CONTRACTS.md | ☑ Done |
| EVENT_SCHEMAS.md | ☑ Done |
| DATA_MODEL.md | ☑ Done |
| SYNC_CONTRACT.md | ☑ Done |
| SECURITY_MODEL.md | ☑ Done |
| OBSERVABILITY.md | ☑ Done |
| TESTING_STRATEGY.md | ☑ Done |
| DEPLOYMENT_TOPOLOGY.md | ☑ Done |
| FAILURE_MODES.md | ☑ Done |
| AI_INTEGRATION.md | ☑ Done |
| SERVICE_RISK_REGISTER.md | ☑ Done |
| MIGRATION_PLAN.md | ☑ Done |
| LOCAL_DEV_SETUP.md | ☑ Done |
| Tenant Public Lookup API guide (DevRel) |  |
| MNP regulator briefing pack |  |
| Per-MNO onboarding runbook |  |

8. On-call

  • Primary on-call: Messaging Core squad (PagerDuty rotation numint-primary).
  • Secondary on-call: Platform SRE (PagerDuty sre-secondary).
  • Security on-call: Trust & Safety + Platform Security (engaged on NumIntAuditChainBroken and any quota-abuse pattern suggestive of insider misuse).
  • Escalation to MNO ops: via the contact tree defined in each MoU, for MNP file delays > 26 h.

9. Go-live gates (in order)

  1. Internal beta: traffic from routing-engine only; cache cold-start verified; no tenant traffic.
  2. Internal full: all internal callers (firewall, compliance, channel-router, fraud-intel) live.
  3. MNP reconciliation live: at least 3 MNO MoUs in place; daily runs passing for 14 consecutive days.
  4. Public Lookup API beta: 5 pilot tenants; quota enforcement verified.
  5. Public Lookup API GA: open to all tenants; billing live; full SLA.
  6. EIR/CEIR live: ATRA feed integrated; LookupEir returns BLACKLIST data.

Each gate requires:

  • Every Status box (☐) in this document ticked (☑) for the relevant scope.
  • 14 consecutive days of green SLOs.
  • Sign-off by Messaging Core lead, SRE on-call, Security, and (for gates 3/5/6) Regulator Liaison.
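
The per-gate requirements above amount to a small rule set, sketched below as a function that lists what is still missing for a given gate. The type and function names are hypothetical, and how "checklist complete" and "green SLO days" are actually measured is outside this sketch.

```typescript
type SignOff = "Messaging Core lead" | "SRE on-call" | "Security" | "Regulator Liaison";

interface GateEvidence {
  checklistComplete: boolean;      // every in-scope Status box is ticked
  consecutiveGreenSloDays: number; // unbroken run of days with green SLOs
  signOffs: SignOff[];             // sign-offs collected so far
}

const REQUIRED_GREEN_DAYS = 14;
const REGULATOR_GATES = [3, 5, 6]; // gates that also need Regulator Liaison

/** Return the unmet requirements for a gate; an empty list means it may open. */
function missingForGate(gate: number, evidence: GateEvidence): string[] {
  const missing: string[] = [];
  if (!evidence.checklistComplete) {
    missing.push("open Status items in scope");
  }
  if (evidence.consecutiveGreenSloDays < REQUIRED_GREEN_DAYS) {
    missing.push(`green SLO days: ${evidence.consecutiveGreenSloDays}/${REQUIRED_GREEN_DAYS}`);
  }
  const required: SignOff[] = ["Messaging Core lead", "SRE on-call", "Security"];
  if (REGULATOR_GATES.includes(gate)) {
    required.push("Regulator Liaison");
  }
  for (const role of required) {
    if (!evidence.signOffs.includes(role)) missing.push(`sign-off: ${role}`);
  }
  return missing;
}
```

With the three core sign-offs, a complete checklist, and 14 green days, gate 2 has nothing missing, while gate 3 with the same evidence still reports the Regulator Liaison sign-off.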