Number Intelligence Service — Failure Modes
Version: 1.0
Status: Draft
Owner: Messaging Core / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER
1. Operating principle: fail-degraded
number-intelligence-service operates fail-degraded for all read paths:
A typed answer with `confidence: LOW` or `UNKNOWN` is always preferred over an error.
Downstream callers (routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service) each have their own fallbacks:
- `routing-engine` falls back to its E.164 prefix table.
- `sms-firewall-service` defaults to allow-by-default for unknown origins.
- `compliance-engine` GEO_RESTRICTION treats `UNKNOWN` country as the most-restrictive class.
- `channel-router-service` rejects only on `lineType = FIXED` (deterministic from prefix); `UNKNOWN` passes the capability gate.
The single exception is LookupEir for BLACKLIST checking — sms-firewall-service may configure FAIL_CLOSED_ON_UNKNOWN for high-risk tenants (default OFF).
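
The interaction between the fail-degraded default and the per-tenant `FAIL_CLOSED_ON_UNKNOWN` switch can be sketched as follows. This is a minimal illustration, not the firewall's actual code: the `EirStatus` enum (including the `WHITELIST` member), `TenantPolicy`, and `allow_message` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class EirStatus(Enum):
    # Hypothetical result values; only BLACKLIST and UNKNOWN appear in the text.
    WHITELIST = "WHITELIST"
    BLACKLIST = "BLACKLIST"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class TenantPolicy:
    # Mirrors the FAIL_CLOSED_ON_UNKNOWN setting, which defaults to OFF.
    fail_closed_on_unknown: bool = False


def allow_message(status: EirStatus, policy: TenantPolicy) -> bool:
    """Return True if the message should pass the EIR check."""
    if status is EirStatus.BLACKLIST:
        return False
    if status is EirStatus.UNKNOWN:
        # Fail-degraded by default; fail-closed only when the tenant opts in.
        return not policy.fail_closed_on_unknown
    return True
```

With the default policy an `UNKNOWN` answer is allowed through; a high-risk tenant with the flag enabled would see the same answer rejected.
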
2. Failure-mode summary
| # | Failure | Likelihood | Impact | Rating | Mitigation summary |
|---|---|---|---|---|---|
| FM-01 | Postgres unavailable (primary + standbys) | Low | High | HIGH | Stale-cache fallback; promote standby; degraded confidence |
| FM-02 | Redis unavailable | Medium | Medium | MEDIUM | PG direct; latency degrades; no functional break |
| FM-03 | Both Redis and Postgres unavailable | Very Low | High | HIGH | Prefix-table fallback; confidence: UNKNOWN |
| FM-04 | MNO MNP SFTP unreachable | Medium | Medium | MEDIUM | Use last-known MNP state; alert at 26 h stale; escalate at 48 h |
| FM-05 | HLR probe (SS7) unavailable | Medium | Medium | MEDIUM | Fall back to MNO MNP state + last persisted attribution |
| FM-06 | MNP reconciliation conflict spike | Low | Medium | MEDIUM | Triage queue; AI-assisted ranking; admin manual resolution |
| FM-07 | Audit hash-chain break detected | Very Low | Critical | HIGH | Freeze writes; investigate; never auto-resolve |
| FM-08 | Per-MNO TPS quota exhausted | Medium | Medium | MEDIUM | Token-bucket throttle returns stale answer; alert at 5 % denial rate |
| FM-09 | Tenant enumeration / quota abuse on Public Lookup | Medium | Medium | MEDIUM | RPS quota + monthly cap; Kong bot-detection; per-tenant audit |
| FM-10 | MSISDN normalisation edge case (Unicode confusable) | Low | Medium | MEDIUM | NFKC + strict regex at VO; property tests in CI |
| FM-11 | Cross-region replication conflict | Low | Medium | MEDIUM | Source-priority + version conflict policy; cross-region verifier |
| FM-12 | EIR source unreachable | Low | Medium | MEDIUM | Last-known EIR; LookupEir returns UNKNOWN for new IMEIs |
| FM-13 | Cache poisoning during MNP transition | Low | Medium | MEDIUM | DEL-on-MNP-change + outbox-driven event; subscriber warmers |
| FM-14 | NetworkPolicy mis-config exposes egress to offshore | Very Low | Critical | HIGH | Deploy-time residency test; Istio AuthorizationPolicy |
| FM-15 | Outbox publish stuck | Low | Medium | MEDIUM | Retry; manual replay; alert |
| FM-16 | ni-hlr-gateway adapter pod down | Low | Medium | MEDIUM | Sibling pod takes over; synthetic probe recovery < 30 s |
3. Detailed failure modes
FM-01 — Postgres unavailable (primary + standbys)
Scenario. The Patroni primary and all standbys serving the numint schema are unreachable in the active region.
Detection:
- `numint_pg_query_duration_seconds` errors; `/health/ready` returns 503.
- Patroni events; `PostgresPrimaryDown` alert.
Impact:
- Hot path: serves from Redis only; cache misses fall through to the in-memory prefix table; `confidence` drops to LOW/UNKNOWN.
- Writes (live HLR write-through, MNP recon) fail with `503`.
- Audit chain pauses.
Recovery:
- Patroni auto-promotes a standby; RTO ≤ 90 s.
- If Kabul region is fully isolated, manual cutover routes RW traffic to Mazar (Mazar already serves hot reads — no visible outage on the read path).
Mitigation:
- 3-node Patroni per region; PgBouncer in transaction mode shields per-pod connection churn.
- Circuit breaker on PG client opens after 5 consecutive errors / 30 s; half-open after 60 s.
- Mazar warm-active for hot reads.
Runbook: numint-pg-down.md
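
The circuit-breaker thresholds in the mitigation list can be sketched as below. This is an illustration only: it assumes "5 consecutive errors / 30 s" means five failures with no intervening success inside a 30 s window, and the `CircuitBreaker` class and its method names are hypothetical.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Opens after N consecutive failures in a window; half-opens after a cooldown."""

    def __init__(self, max_errors: int = 5, window_s: float = 30.0,
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_errors = max_errors
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.error_times: List[float] = []
        self.opened_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"  # one trial call may go through
        return "open"

    def allow(self) -> bool:
        return self.state() != "open"

    def record_success(self) -> None:
        # Any success breaks the run of consecutive errors and closes the breaker.
        self.error_times.clear()
        self.opened_at = None

    def record_failure(self) -> None:
        now = self.clock()
        if self.state() == "half_open":
            self.opened_at = now  # trial call failed: restart the cooldown
            return
        # Keep only failures that are still inside the window.
        self.error_times = [t for t in self.error_times if now - t <= self.window_s]
        self.error_times.append(now)
        if len(self.error_times) >= self.max_errors:
            self.opened_at = now
```

Callers would check `allow()` before each PG query and, while the breaker is open, skip straight to the stale-cache or prefix-table answer.
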
FM-02 — Redis unavailable
Scenario. Redis cluster for the active region unreachable.
Detection:
- `numint_redis_op_duration_seconds` errors; cache hit ratio drops to 0 %.
- `numint_cache_misses_total` near 100 %.
Impact:
- With the cache down, every lookup falls through to Postgres; P95 latency rises from 5 ms to 15 ms.
- Distributed locks (audit verifier, MNP recon, cache warm) cannot be acquired → workers skip the cycle and emit `numint_worker_skipped_no_lock_total`.
- Per-tenant rate-limit buckets cannot be enforced → tenants are temporarily unthrottled at the service layer (mitigated by Kong's own rate-limit-advanced fallback).
Recovery:
- Redis cluster recovers; cache warms naturally on TTL-driven re-fill; a manual `numint-cache-warmer` run accelerates this.
- HPA may scale up replicas under elevated PG load.
Mitigation:
- 6-node Sentinel cluster per region; multi-AZ.
- Postgres replica pool sized for 2 000 RPS sustained without cache (load-tested with 80 % headroom).
Runbook: numint-redis-down.md
FM-03 — Both Redis and Postgres unavailable
Scenario. Catastrophic dual-dependency outage in the active region.
Detection:
- Both FM-01 and FM-02 alerts firing.
- `numint_lookup_total{tier="fallback"}` rises sharply.
Impact:
- `ResolveMsisdn` falls back to the in-memory prefix table; `source = PREFIX_FALLBACK`, `confidence = UNKNOWN`.
- `routing-engine` uses its own prefix table → routing continues with reduced MNP awareness.
- Write paths fail entirely. The outbox cannot buffer in process memory because it requires a PG write, so live HLR results are dropped for the duration (acceptable: live HLR is opportunistic).
- MNP & EIR reconciliation paused.
Recovery:
- Restore PG; restore Redis; re-warm.
Mitigation:
- Cross-region active-active means a regional outage simply reduces capacity; the other region serves.
- Prefix table loaded at startup from Vault-pinned ATRA CSV; resident in process memory.
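
The degradation path described in FM-01 through FM-03 (cache, then Postgres, then prefix table) can be sketched as a tiered lookup. This is a simplified illustration under stated assumptions: the `Answer` type, the `CACHE` and `POSTGRES` source labels, the `HIGH` confidence value, and the longest-prefix matching rule are hypothetical; only `PREFIX_FALLBACK` and `UNKNOWN` come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Answer:
    mno: Optional[str]
    source: str
    confidence: str


# A tier returns an MNO id, None on a miss, or raises if the dependency is down.
Tier = Callable[[str], Optional[str]]


def resolve_msisdn(msisdn: str, cache: Tier, db: Tier,
                   prefix_table: Dict[str, str]) -> Answer:
    for source, tier in (("CACHE", cache), ("POSTGRES", db)):
        try:
            mno = tier(msisdn)
        except Exception:
            # Dependency unavailable: degrade to the next tier instead of erroring.
            continue
        if mno is not None:
            return Answer(mno, source, "HIGH")
    # Last resort: longest-prefix match against the in-memory table.
    for length in range(len(msisdn), 0, -1):
        mno = prefix_table.get(msisdn[:length])
        if mno is not None:
            return Answer(mno, "PREFIX_FALLBACK", "UNKNOWN")
    return Answer(None, "PREFIX_FALLBACK", "UNKNOWN")
```

The function never raises on a dependency outage, which matches the principle that a typed low-confidence answer is preferred over an error.
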
FM-04 — MNO MNP SFTP unreachable
Scenario. A specific MNO's SFTP endpoint refuses connection or returns no file.
Detection:
- `numint_mnp_recon_runs_total{mno_id, outcome="failed"}` increments.
- `numint_mnp_recon_last_success_timestamp_seconds{mno_id}` stops advancing.
- `NumIntMnpReconciliationStale` HIGH at 26 h, CRITICAL at 48 h.
Impact:
- New ports for that MNO not reflected in attribution; `routing-engine` may continue to dispatch to the donor MNO until the next successful run.
- Live HLR probes still work (independently); they may catch some new ports.
Recovery:
- Retry hourly until 23:00 same day.
- After 23:00 escalate to MNO operations contact (per MNO MoU).
- Manual SFTP re-fetch via admin endpoint `POST /v1/admin/numint/mnp/runs` once the MNO restores service.
Mitigation:
- MNO SLA agreements include MNP file delivery cadence and escalation path.
- Backup MNP delivery via secure email (PGP-signed) for emergencies.
Tenant impact: Misrouting risk for ported numbers — message may be sent to donor MNO and rejected; orchestrator retries automatically. End-user may see a brief delivery delay (seconds) on ported numbers.
Runbook: numint-mnp-stale.md
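
The staleness thresholds for `NumIntMnpReconciliationStale` reduce to a simple age comparison. The sketch below assumes the thresholds are inclusive and that the last-success timestamp is available as a `datetime`; the function name is hypothetical, and the real alert is presumably evaluated in the alerting stack rather than in application code.

```python
from datetime import datetime, timedelta
from typing import Optional

HIGH_AFTER = timedelta(hours=26)
CRITICAL_AFTER = timedelta(hours=48)


def mnp_staleness_severity(last_success: datetime, now: datetime) -> Optional[str]:
    """Map the age of the last successful MNP run to an alert severity."""
    age = now - last_success
    if age >= CRITICAL_AFTER:
        return "CRITICAL"
    if age >= HIGH_AFTER:
        return "HIGH"
    return None  # still within the expected daily cadence
```
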
FM-05 — HLR probe (SS7) unavailable
Scenario. ni-hlr-gateway adapter for an MNO is down (SIGTRAN association failed; REST endpoint 5xx).
Detection:
- `numint_hlr_adapter_health{mno_id} == 0`.
- `numint_hlr_probes_total{mno_id, status!="OK"}` rises.
Impact:
- Forced-fresh requests cannot reach live HLR; they receive the stale persisted answer with `confidence = LOW`.
- `fraud-intel-service` loses real-time VLR-change signal.
- Tenant Public Lookup with `maxStaleness < 86400` returns stale; SKU adjusted.
Recovery:
- Sibling DaemonSet pod takes over (DNS round-robin via headless service).
- For SS7, M3UA association re-establishment takes 30-60 s.
- Manual restart of the affected pod via runbook.
Mitigation:
- DaemonSet ensures multiple pods; SS7 stack uses M3UA failover ASP groups.
- Per-MNO REST adapter has multiple endpoints configured (primary + secondary).
FM-06 — MNP reconciliation conflict spike
Scenario. Daily MNP runs produce > 50 conflicts per MNO (baseline ≤ 10) — likely an MNO file format change or systemic regulatory event.
Detection:
- `numint_mnp_recon_conflicts_total{mno_id, severity="HIGH"}` spike.
- `NumIntReconciliationConflictSpike` HIGH alert.
Impact:
- MSISDNs in conflict do not transition until manually resolved → routing may be wrong for those numbers.
- Trust & safety analyst backlog.
Recovery:
- AI-assisted triage ranks conflicts (see AI_INTEGRATION §2).
- Admin reviews and resolves via `POST /v1/admin/numint/mnp/conflicts/{conflictId}/resolve`.
- Coordinate with MNO if file format / process changed.
Mitigation:
- Defensible audit trail per SECURITY_MODEL §4.
- Conflict aging dashboard panels in `numint-mnp-eir.json`.
FM-07 — Audit hash-chain break detected
Scenario. Daily verifier finds record_hash mismatch in lookup_audit or portability_history.
Detection:
- `numint_audit_chain_breaks_detected_total > 0`.
- `NumIntAuditChainBroken` CRITICAL pages on-call.
Impact:
- Loss of regulator-defensibility for the affected partition.
- Possible undetected tampering — must be treated as a security incident.
Recovery:
- Freeze writes (manual flag `numint.write_freeze=true` via ConfigMap).
- Identify break boundary; preserve database state.
- Forensic review by Security; possible Postgres-replica byte-comparison.
- Restore from cold archive S3 if needed; re-link chain manually with documented gap.
Mitigation:
- Append-only DB rules.
- Two independent chain implementations (producer in TS, verifier in Python) cross-check each other; a divergence between them exposes an implementation bug before it ships.
Runbook: numint-audit-chain-broken.md
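
Locating the break boundary in a hash chain can be sketched as a linear scan. The chaining rule below (SHA-256 over the previous hash concatenated with a canonical JSON payload), the all-zero genesis value, and the `prev_hash` / `payload` row fields are assumptions made for illustration; only `record_hash` is named in the scenario above, and the real record layout is not specified in this document.

```python
import hashlib
import json
from typing import Optional, Sequence

GENESIS = "0" * 64  # assumed anchor for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record together with its predecessor's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def first_break(rows: Sequence[dict]) -> Optional[int]:
    """Return the index of the first row that does not link correctly, or None."""
    prev = GENESIS
    for i, row in enumerate(rows):
        expected = record_hash(prev, row["payload"])
        if row["prev_hash"] != prev or row["record_hash"] != expected:
            return i
        prev = row["record_hash"]
    return None
```

A non-`None` result identifies where forensic review should start; consistent with the recovery steps above, nothing in this sketch attempts to repair the chain.
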
FM-08 — Per-MNO TPS quota exhausted
Scenario. Tenant traffic + fraud-intel forced-fresh together exceed an MNO's contracted SS7 TPS.
Detection:
- `numint_hlr_tps_denied_total{mno_id}` rises.
- Per-MNO denial rate > 5 % over 5 min → `NumIntHlrThrottling` MEDIUM.
Impact:
- Live HLR results return as `STALE_THROTTLED`; callers receive the last-known persisted answer with `confidence = MEDIUM`.
- Some divergence from real-world state for low-volume MNOs.
Recovery:
- Tune `freshLookupRpsLimit` lower for the offending tenants.
- Negotiate higher TPS with MNO if sustained business need.
Mitigation:
- Token bucket per `(mno, op)` is mandatory.
- Tenant SDK strongly discourages `forceFresh` on every call (documented).
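
A per-`(mno, op)` token bucket that answers `STALE_THROTTLED` instead of blocking might look like the sketch below. It is a single-process illustration; the class names, the `LIVE` label, and the rate and burst parameters are hypothetical, and a production limiter would presumably share state across replicas.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class HlrThrottle:
    """Keeps one bucket per (mno, op) so MNOs cannot starve each other."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def decide(self, mno: str, op: str) -> str:
        key = (mno, op)
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate_per_s, self.burst, self.clock)
        # Denied probes are served from the persisted answer rather than queued.
        return "LIVE" if self.buckets[key].try_acquire() else "STALE_THROTTLED"
```
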
FM-09 — Tenant enumeration / quota abuse on Public Lookup
Scenario. Malicious tenant or stolen credentials send 10× plan RPS attempting MSISDN enumeration.
Detection:
- `numint_public_lookup_quota_breach_total{tenant_id}` sustained > 10/15 min → `NumIntPublicLookupQuotaAbuse` MEDIUM.
- Kong bot-detection plugin flags JA3 fingerprint.
Impact:
- Other tenants unaffected (per-tenant buckets isolate).
- Audit log fills with the attacker's hash sequence (tenant-salted; not cross-tenant correlatable).
Recovery:
- Tenant-side: 429 with `Retry-After` returned for the duration of breach.
- Platform-side: alert prompts manual review; suspend tenant if abuse confirmed.
- Rotate tenant API key + JWT issuer revocation if credentials suspected stolen.
Mitigation:
- Per-tenant RPS + monthly + fresh-lookup buckets.
- Anti-enumeration: response time uniform whether MSISDN is known or unknown.
- Tenant-salted audit hash limits intelligence value of any leak.
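
One way to obtain a tenant-salted audit hash is a keyed HMAC, sketched below. The construction is an assumption for illustration: this document does not specify the hash function or how salts are stored, and `audit_key` is a hypothetical name.

```python
import hashlib
import hmac


def audit_key(tenant_salt: bytes, msisdn: str) -> str:
    """Derive the audit-log identifier for an MSISDN under one tenant's salt."""
    return hmac.new(tenant_salt, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the salt differs per tenant, the same MSISDN yields unrelated identifiers in two tenants' audit logs, which is what keeps a leaked log from being correlated across tenants.
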
FM-10 — MSISDN normalisation edge case
Scenario. Tenant submits an MSISDN with Unicode confusable characters (RTL marks, Arabic-Indic digits, look-alikes) that the regex initially accepts.
Detection:
- Property-based test in CI catches first.
- Production: would surface as cache key explosion (semantically same MSISDN keyed differently).
Impact:
- Correctness regression — same MSISDN looked up twice produces different results / billing.
Recovery:
- Regex/NFKC normalisation fix; redeploy.
- Re-bill affected tenant calls (rare).
Mitigation:
- NFKC normalisation BEFORE regex match.
- Strict ASCII digit class (`[0-9]`, not `\d` which matches Unicode digits).
- Property-based test asserts `parse(canonicalise(s)) == parse(s)` for any conformant input.
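
The normalise-then-validate order from the mitigation list can be sketched with the standard library. The `parse` and `canonicalise` names follow the property stated above; the exact E.164 pattern (a leading `+`, first digit 1 to 9, 8 to 15 digits in total) is an assumption, since the production regex is not shown here.

```python
import re
import unicodedata

# ASCII-only digit class; \d would also match non-ASCII Unicode digits in str patterns.
_E164 = re.compile(r"\+[1-9][0-9]{7,14}")


def canonicalise(raw: str) -> str:
    """Fold compatibility forms (for example fullwidth digits) to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", raw)


def parse(raw: str) -> str:
    """Return the canonical MSISDN or raise ValueError."""
    candidate = canonicalise(raw)
    if _E164.fullmatch(candidate) is None:
        raise ValueError("not a canonical E.164 MSISDN")
    return candidate
```

NFKC folds fullwidth digits to ASCII, so those inputs collapse onto one cache key. Arabic-Indic digits and directional marks have no compatibility mapping, so they survive normalisation and are rejected by the strict pattern rather than being keyed separately.
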
FM-11 — Cross-region replication conflict
Scenario. Concurrent writes in Kabul and Mazar to the same NumberRecord (rare — typically only on admin override or simultaneous MNP recon).
Detection:
- `numint_pg_replication_conflict_total` increments.
- Cross-region audit verifier reports divergent `recordHash`.
Impact:
- One write may be lost (LWW on `version`); audit chain may fork.
Recovery:
- Freeze writes; reconcile manually using source-priority.
- Re-derive winning row; replay outbox.
Mitigation:
- Per-aggregate conflict policy (see SYNC_CONTRACT §4).
- Batch jobs run in Kabul only with leader-election lock.
- Admin overrides require dual-control + are rare.
FM-12 — EIR source unreachable
Scenario. ATRA SFTP or per-MNO CEIR endpoint unreachable for a daily run.
Detection:
- `numint_eir_recon_runs_total{outcome="failed"}` increments.
- `NumIntEirSyncStale` MEDIUM at 26 h.
Impact:
- New stolen-IMEI flags not reflected; `LookupEir` returns last-known.
- `sms-firewall-service` may not block traffic to newly-flagged devices.
Recovery:
- Retry hourly until 23:00.
- Manual resync via `POST /v1/admin/numint/eir/runs`.
Mitigation:
- Multi-source aggregation (ATRA + per-MNO) means single-source failure does not lose all signal.
FM-13 — Cache poisoning during MNP transition
Scenario. A live HLR probe lands a stale answer in cache milliseconds before MNP recon writes the new state, leaving cache poisoned for the cache TTL.
Detection:
- `numint.mnp.divergence.v1` events spike for newly-ported MSISDNs in the hours after recon.
Impact:
- For up to 24 h (MNP TTL) some lookups return wrong MNO.
Recovery:
- MNP recon writes both invalidate Redis (`DEL`) and emit `numint.attribution.changed.v1` to warm subscribers.
- If poisoning is detected, a manual run of the `numint-cache-flush` cron job forces a full re-warm.
Mitigation:
- Atomic UPSERT in PG with version conditional prevents the live-HLR write from clobbering a fresher MNP row.
- MNP overlay step in UC-Lookup checks `PortabilityRecord` post-cache and overrides.
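
The version-conditional UPSERT can be demonstrated with an in-memory SQLite database as a stand-in for Postgres, since both accept this `ON CONFLICT ... DO UPDATE ... WHERE` form. The `number_record` table, its columns, and the helper names are hypothetical, and how writers obtain their `version` values is outside the scope of this sketch.

```python
import sqlite3
from typing import Optional, Tuple

UPSERT = """
INSERT INTO number_record (msisdn, mno, version)
VALUES (?, ?, ?)
ON CONFLICT (msisdn) DO UPDATE
SET mno = excluded.mno, version = excluded.version
WHERE excluded.version > number_record.version
"""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE number_record ("
        "msisdn TEXT PRIMARY KEY, mno TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    return conn


def write_attribution(conn: sqlite3.Connection, msisdn: str, mno: str, version: int) -> None:
    # The WHERE clause turns a write carrying an older version into a no-op.
    conn.execute(UPSERT, (msisdn, mno, version))
    conn.commit()


def current(conn: sqlite3.Connection, msisdn: str) -> Optional[Tuple[str, int]]:
    return conn.execute(
        "SELECT mno, version FROM number_record WHERE msisdn = ?", (msisdn,)
    ).fetchone()
```

If MNP recon writes version 2 and a delayed live-HLR result then arrives carrying version 1, the second statement leaves the row untouched, so the fresher MNP attribution survives.
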
FM-14 — NetworkPolicy mis-config exposes egress to offshore
Scenario. A NetworkPolicy or AuthorizationPolicy change accidentally permits egress to a non-Afghan IP (cloud LLM, third-party telemetry).
Detection:
- Deploy-time residency test fails.
- Runtime: `numint_egress_offshore_total` (synthetic check) > 0.
Impact:
- Potential PII leak to non-Afghan jurisdiction → regulatory violation.
Recovery:
- Rollback the NetworkPolicy; re-deploy.
- Audit which calls (if any) touched the offshore endpoint; report to DPO if PII was in transit.
Mitigation:
- Deploy-time residency test is mandatory CI gate.
- Istio AuthorizationPolicy as second layer.
- Egress NetworkPolicy explicitly denies non-`10.0.0.0/8` egress on the hot-path Deployment.
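
The core assertion of a residency test, that every observed egress destination falls inside `10.0.0.0/8`, can be sketched with the standard `ipaddress` module. How the destination list is collected (policy rendering, flow logs, a synthetic probe) is not described in this document, so the input here is a hypothetical list of IP strings.

```python
import ipaddress
from typing import Iterable, List

ALLOWED_EGRESS = ipaddress.ip_network("10.0.0.0/8")


def offshore_destinations(destinations: Iterable[str]) -> List[str]:
    """Return every destination IP that falls outside the allowed egress range."""
    return [d for d in destinations if ipaddress.ip_address(d) not in ALLOWED_EGRESS]
```

A CI gate would fail the deploy whenever this list is non-empty.
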
FM-15 — Outbox publish stuck
Scenario. NATS unreachable or specific subject misconfigured; outbox rows accumulate.
Detection:
- `numint_outbox_oldest_unpublished_age_seconds > 60` → `NumIntOutboxStuck` HIGH.
Impact:
- Subscriber services (routing-engine, sms-firewall, billing) miss state-change events; their caches drift.
Recovery:
- Restore NATS connectivity; outbox relay drains.
- Manual replay via `POST /v1/admin/numint/outbox/replay`.
Mitigation:
- Outbox relay is per-replica with `SELECT … FOR UPDATE SKIP LOCKED` so multiple workers pick up backlog.
- Dead-letter subjects retain failed messages for SRE inspection.
FM-16 — ni-hlr-gateway adapter pod down
Same as FM-05; documented separately because the gateway is the single network hop that owns SIGTRAN sockets — a pod restart drops live MAP dialogs (orphaned invokes return TIMEOUT to the caller; replay logic in NI handles this).
4. Failure-mode interaction matrix
| Concurrent failure | Combined impact |
|---|---|
| FM-01 + FM-02 | FM-03 (prefix-table fallback) |
| FM-04 + FM-05 (MNP file + live HLR for same MNO) | Attribution stale for that MNO; handled by the routing-engine prefix-table fallback |
| FM-07 + FM-11 | Catastrophic — freeze all writes; declare incident; engage Security + Platform Arch |
| FM-09 + FM-08 | Tenant abuse drives SS7 quota exhaustion → throttle both at the gateway and at the per-tenant fresh-lookup bucket |