
Number Intelligence Service — Observability

Version: 1.0
Status: Draft
Owner: Messaging Core / Platform SRE
Last Updated: 2026-04-21
Companion: FAILURE_MODES · SERVICE_READINESS · APPLICATION_LOGIC

1. SLIs / SLOs

These targets are bound to the platform NFR catalog (EP-PLAT-NB-09) and SERVICE_OVERVIEW §10.

| SLI | SLO | Window | Burn-rate alert |
| --- | --- | --- | --- |
| ResolveMsisdn cache-hit P95 latency | ≤ 5 ms | 28 d | 14 × 1 h, 6 × 6 h |
| ResolveMsisdn end-to-end P99 latency (any path, excluding forced fresh) | ≤ 50 ms | 28 d | 14 × 1 h, 6 × 6 h |
| ResolveMsisdn availability (non-error, any-confidence) | ≥ 99.95 % | 28 d | 14 × 1 h, 6 × 6 h |
| Aggregate cache hit ratio (LRU+Redis / total) | ≥ 95 % sustained | 7 d | < 85 % for 30 min → MEDIUM |
| LRU hit ratio | ≥ 60 % sustained | 7 d | < 40 % for 30 min → LOW (tune) |
| LookupPorting P95 latency | ≤ 8 ms | 28 d | 14 × 1 h |
| LookupEir P95 latency | ≤ 8 ms | 28 d | 14 × 1 h |
| Public Lookup REST P95 latency | ≤ 200 ms | 28 d | 14 × 1 h |
| Public Lookup REST P99 latency | ≤ 500 ms | 28 d | 14 × 1 h |
| MNP reconciliation freshness (per MNO) | last successful run < 26 h | continuous | 26 h stale → HIGH; 48 h → CRITICAL |
| EIR reconciliation freshness | last successful run < 26 h | continuous | 26 h stale → HIGH |
| HLR probe success rate (per MNO) | ≥ 98 % | 7 d | < 95 % for 15 min → MEDIUM |
| MNP reconciliation conflict count per run | ≤ 10 per MNO per day (baseline) | 7 d | > 50 → HIGH (spike investigation) |
| Audit hash-chain integrity | Daily verifier returns OK | continuous | break → CRITICAL (page) |
| Outbox publish lag P95 | ≤ 5 s | 28 d | > 60 s for 5 min → HIGH |
| Tenant lookup quota breach rate | < 1 % of total RPS (steady-state) | 1 d | > 5 % → tenant-abuse investigation |
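
The burn-rate column can be read as "burn-rate factor × lookback window". The sketch below shows the arithmetic behind such a check for the 99.95 % availability SLO. It assumes that "14 × 1 h" means a burn rate above 14 measured over the last hour, and that either window breaching is enough to alert; the function names are illustrative and not part of the service.

```python
WINDOW_HOURS = 28 * 24  # 28-day SLO window, as in the table above


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget = 1.0 - slo  # e.g. 0.0005 for a 99.95 % SLO
    return error_ratio / budget


def budget_consumed(rate: float, lookback_hours: float) -> float:
    """Share of the 28-day error budget spent if `rate` holds for the lookback."""
    return rate * lookback_hours / WINDOW_HOURS


def should_alert(err_1h: float, err_6h: float, slo: float = 0.9995) -> bool:
    # Assumed interpretation: fire when either window exceeds its factor.
    return burn_rate(err_1h, slo) > 14 or burn_rate(err_6h, slo) > 6
```

Under these assumptions, a burn rate of 14 sustained for one hour spends 14/672, or about 2.1 %, of the 28-day budget, and a burn rate of 6 sustained for six hours spends 36/672, or about 5.4 %.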

2. Prometheus metrics

All metrics are exposed at GET /metrics on port 3073. Every series carries the service label service="number-intelligence-service".

2.1 ResolveMsisdn (hot path)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_lookup_total | Counter | tier (lru/redis/pg/live/fallback), source, confidence | Lookups by outcome |
| numint_lookup_duration_seconds | Histogram | tier, result | End-to-end latency |
| numint_lookup_concurrency_in_flight | Gauge | | In-flight gRPC calls |
| numint_lookup_invalid_e164_total | Counter | | INVALID_ARGUMENT responses |

Histogram buckets for numint_lookup_duration_seconds: [0.0005, 0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020, 0.050, 0.100, 0.250, 0.500, 1.000, 2.000]
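
To make the role of these buckets concrete, here is a minimal sketch of how a quantile is estimated from cumulative bucket counts by linear interpolation, which is the approach PromQL's histogram_quantile takes. The helper name and the sample counts are illustrative; edge cases such as non-monotonic counts are not handled.

```python
import math
from typing import Sequence

# Upper bounds in seconds, copied from the bucket list above.
BUCKETS = [0.0005, 0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020,
           0.050, 0.100, 0.250, 0.500, 1.000, 2.000]


def estimate_quantile(q: float, bounds: Sequence[float],
                      cumulative: Sequence[float]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts.

    `cumulative` has one more entry than `bounds`: the last entry is the
    implicit +Inf bucket, i.e. the total number of observations.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if len(cumulative) != len(bounds) + 1:
        raise ValueError("cumulative needs one trailing +Inf entry")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if idx == len(bounds):
        # Rank falls in +Inf: report the highest finite bound.
        return bounds[-1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0.0
    in_bucket = cumulative[idx] - below
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket
```

Because 0.005 is itself a bucket boundary, the share of lookups completing within 5 ms can be read directly from that bucket's cumulative count, without interpolation.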

2.2 Cache tiers

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_cache_hits_total | Counter | tier (lru/redis/pg) | Cache hits by tier |
| numint_cache_misses_total | Counter | tier | Cache misses |
| numint_cache_hit_ratio | Gauge | tier | Rolling 5-min ratio (derived) |
| numint_lru_size | Gauge | | Current LRU entry count per pod |
| numint_lru_evictions_total | Counter | | LRU evictions |
| numint_redis_op_duration_seconds | Histogram | op (get/set/del/eval) | |
| numint_cache_warm_duration_seconds | Histogram | kind (cold/hourly) | |
| numint_cache_warm_keys_loaded | Counter | kind | |
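
A rough sketch of how the hit ratios could be derived from increases of the hit and miss counters over a window. It assumes every lookup consults the LRU first and only reaches Redis on an LRU miss, so the number of lookups equals LRU hits plus LRU misses; that assumption, and all names here, are illustrative. The 30-minute "sustained" condition is not modelled.

```python
from typing import Optional


def tier_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Hit ratio for one tier; None when the tier saw no traffic."""
    total = hits + misses
    return hits / total if total > 0 else None


def aggregate_hit_ratio(lru_hits: float, lru_misses: float,
                        redis_hits: float) -> Optional[float]:
    """(LRU + Redis hits) / total lookups, per the SLI definition."""
    lookups = lru_hits + lru_misses  # assumed: the LRU is always checked first
    return (lru_hits + redis_hits) / lookups if lookups > 0 else None


def hit_ratio_severity(aggregate: float, lru: float) -> Optional[str]:
    """Map ratios to the severities listed in the SLO table."""
    if aggregate < 0.85:
        return "MEDIUM"
    if lru < 0.40:
        return "LOW"
    return None
```

With 1000 lookups, 600 LRU hits and 360 Redis hits, the aggregate ratio is 0.96. If hits and misses were instead pooled across tiers (960 hits against 440 misses), the result would be about 0.69, so under this counting model the two formulations are not interchangeable.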

2.3 MNP reconciliation

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_mnp_recon_runs_total | Counter | mno_id, outcome (success/failed) | Reconciliation runs |
| numint_mnp_recon_duration_seconds | Histogram | mno_id | Per-run duration |
| numint_mnp_recon_records_total | Counter | mno_id, result (accepted/rejected) | |
| numint_mnp_recon_conflicts_total | Counter | mno_id, severity | |
| numint_mnp_recon_last_success_timestamp_seconds | Gauge | mno_id | Unix epoch — drives (time() - this) > 26h alert |
| numint_mnp_registry_size | Gauge | mno_id | Current PortabilityRecord rows per recipient MNO |
| numint_mnp_registry_growth_rate | Gauge | mno_id | Rolling 7-day daily port delta |
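
The freshness check driven by numint_mnp_recon_last_success_timestamp_seconds reduces to comparing the age of the last successful run against the 26 h and 48 h thresholds. A minimal sketch, with a hypothetical function name and without modelling how long the condition must hold before an alert fires:

```python
import time
from typing import Optional

HIGH_AFTER_S = 26 * 3600      # 93600 s
CRITICAL_AFTER_S = 48 * 3600  # 172800 s


def recon_staleness_severity(last_success_epoch: float,
                             now: Optional[float] = None) -> Optional[str]:
    """Return the severity implied by the age of the last successful run."""
    current = time.time() if now is None else now
    age = current - last_success_epoch
    if age > CRITICAL_AFTER_S:
        return "CRITICAL"
    if age > HIGH_AFTER_S:
        return "HIGH"
    return None
```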

2.4 HLR probes

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_hlr_probes_total | Counter | mno_id, transport, status | Per-probe outcome |
| numint_hlr_probe_duration_seconds | Histogram | mno_id, transport | |
| numint_hlr_tps_admitted_total | Counter | mno_id | Token bucket admits |
| numint_hlr_tps_denied_total | Counter | mno_id | Token bucket denials |
| numint_hlr_adapter_health | Gauge (0/1) | mno_id, transport | DaemonSet adapter reachability |
| numint_hlr_map_dialog_active | Gauge | mno_id | Open TCAP dialogs on SIGTRAN |
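
The admitted and denied counters suggest a per-MNO token bucket in front of the HLR probes. The sketch below is one common way to implement such a bucket and to count both outcomes; the class, its parameters, and the example rate are assumptions, since the actual per-MNO TPS limits are not stated here.

```python
class TokenBucket:
    """Token bucket that tracks how many requests it admitted or denied."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self._tokens = capacity  # start full
        self._last = now
        self.admitted = 0  # would feed numint_hlr_tps_admitted_total
        self.denied = 0    # would feed numint_hlr_tps_denied_total

    def try_admit(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = max(self._last, now)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            self.admitted += 1
            return True
        self.denied += 1
        return False
```

Passing the clock value in explicitly keeps the bucket deterministic under test; a production caller would typically supply a monotonic clock reading.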

2.5 EIR

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_eir_lookups_total | Counter | outcome (blacklist/greylist/whitelist/unknown) | |
| numint_eir_recon_runs_total | Counter | source (atra/mno_id), outcome | |
| numint_eir_recon_last_success_timestamp_seconds | Gauge | source | |
| numint_eir_blacklist_size | Gauge | | Current BLACKLIST IMEI count |

2.6 Public Lookup API (tenant-facing)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_public_lookup_total | Counter | tenant_id, result_class, sku | |
| numint_public_lookup_duration_seconds | Histogram | tenant_id, tier | |
| numint_public_lookup_quota_breach_total | Counter | tenant_id, scope (rps/month/fresh_rps) | |
| numint_public_lookup_forced_fresh_total | Counter | tenant_id | maxStaleness < 86400 calls |
| numint_tenant_monthly_quota_used | Gauge | tenant_id | Current month usage |

2.7 Audit + outbox

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_lookup_audit_rows_total | Counter | tenant_id | |
| numint_audit_chain_verifier_runs_total | Counter | chain_kind, outcome (ok/broken) | |
| numint_audit_chain_verifier_last_ok_timestamp_seconds | Gauge | chain_kind | |
| numint_audit_chain_breaks_detected_total | Counter | chain_kind, partition | MUST equal 0 |
| numint_outbox_unpublished_count | Gauge | | |
| numint_outbox_oldest_unpublished_age_seconds | Gauge | | |
| numint_outbox_publish_total | Counter | subject, outcome | |
| numint_outbox_publish_duration_seconds | Histogram | subject | |
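
The chain verifier metrics imply that each audit row records the hash of its predecessor. The following sketch shows how such a chain can be walked to find the first row whose stored previous hash does not match. The row layout, the SHA-256 construction, the all-zero genesis value, and every name below are assumptions made for illustration; the real schema is not described in this document.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional

GENESIS = "0" * 64  # assumed anchor value for the first row


@dataclass(frozen=True)
class AuditRow:
    seq: int
    payload: bytes
    prev_hash: str


def row_hash(row: AuditRow) -> str:
    """Hash covering the previous hash, the sequence number and the payload."""
    h = hashlib.sha256()
    h.update(row.prev_hash.encode("ascii"))
    h.update(row.seq.to_bytes(8, "big"))
    h.update(row.payload)
    return h.hexdigest()


def build_chain(payloads: Iterable[bytes]) -> List[AuditRow]:
    """Build a valid chain with sequence numbers starting at 1."""
    rows: List[AuditRow] = []
    prev = GENESIS
    for seq, payload in enumerate(payloads, start=1):
        row = AuditRow(seq, payload, prev)
        rows.append(row)
        prev = row_hash(row)
    return rows


def first_bad_seq(rows: Iterable[AuditRow]) -> Optional[int]:
    """Return the seq of the first row whose prev_hash is wrong, else None."""
    expected = GENESIS
    for row in rows:
        if row.prev_hash != expected:
            return row.seq
        expected = row_hash(row)
    return None
```

In this construction a modified row is detected at its successor, because that is where the stored previous hash stops matching. Tampering with the final row is only detectable if the head hash is also recorded somewhere outside the chain.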

2.8 Dependencies / health

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numint_pg_pool_in_use | Gauge | pool (primary/replica) | PgBouncer pool |
| numint_pg_query_duration_seconds | Histogram | statement | |
| numint_vault_op_duration_seconds | Histogram | op | |
| numint_vault_errors_total | Counter | op | |

3. Alert catalogue

groups:
  - name: numint.slo
    rules:
      - alert: NumIntLookupLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(numint_lookup_duration_seconds_bucket[5m]))) > 0.015
        for: 5m
        labels: { severity: HIGH, service: number-intelligence-service }
        annotations:
          summary: "ResolveMsisdn P95 > 15 ms"
          runbook: "https://runbooks.ghasi.gov.af/numint/lookup-latency.md"

      - alert: NumIntCacheHitRateLow
        expr: sum(rate(numint_cache_hits_total[10m])) / (sum(rate(numint_cache_hits_total[10m])) + sum(rate(numint_cache_misses_total[10m]))) < 0.85
        for: 30m
        labels: { severity: MEDIUM }

      - alert: NumIntMnpReconciliationStale
        expr: (time() - max by (mno_id) (numint_mnp_recon_last_success_timestamp_seconds)) > 93600 # 26h
        for: 10m
        labels: { severity: HIGH }

      - alert: NumIntMnpReconciliationCritical
        expr: (time() - max by (mno_id) (numint_mnp_recon_last_success_timestamp_seconds)) > 172800 # 48h
        for: 10m
        labels: { severity: CRITICAL }

      - alert: NumIntReconciliationConflictSpike
        expr: sum by (mno_id) (increase(numint_mnp_recon_conflicts_total[1h])) > 50
        for: 10m
        labels: { severity: HIGH }

      - alert: NumIntHlrProbeFailureHigh
        expr: sum by (mno_id) (rate(numint_hlr_probes_total{status!="OK"}[10m])) / sum by (mno_id) (rate(numint_hlr_probes_total[10m])) > 0.05
        for: 15m
        labels: { severity: MEDIUM }

      - alert: NumIntHlrAdapterDown
        expr: numint_hlr_adapter_health == 0
        for: 3m
        labels: { severity: HIGH }

      - alert: NumIntAuditChainBroken
        expr: increase(numint_audit_chain_breaks_detected_total[1d]) > 0
        for: 0m
        labels: { severity: CRITICAL }
        annotations: { runbook: "https://runbooks.ghasi.gov.af/numint/audit-chain-broken.md" }

      - alert: NumIntOutboxStuck
        expr: numint_outbox_oldest_unpublished_age_seconds > 60
        for: 5m
        labels: { severity: HIGH }

      - alert: NumIntPublicLookupQuotaAbuse
        expr: sum by (tenant_id) (rate(numint_public_lookup_quota_breach_total[15m])) > 10
        for: 15m
        labels: { severity: MEDIUM }
        annotations: { summary: "Tenant {{ $labels.tenant_id }} sustained quota breach — possible enumeration" }

      - alert: NumIntEventsDlqGrowing
        expr: sum by (subject) (nats_jetstream_stream_messages{subject=~".*\\.deadletter"}) > 100
        for: 10m
        labels: { severity: HIGH }
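
Each rule's `for:` value means the expression has to stay true across consecutive evaluations for that long before the alert moves from pending to firing; a single evaluation where it is false resets the timer. The small simulation below illustrates this for a rule such as NumIntHlrAdapterDown (for: 3m). The 60-second evaluation interval and the function name are assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(samples: Iterable[Tuple[float, bool]],
                 for_seconds: float) -> List[str]:
    """Map (timestamp, condition_true) evaluations to alert states."""
    active_since: Optional[float] = None
    states: List[str] = []
    for ts, condition in samples:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Adapter down at t=0 and t=60, recovers at t=120, then stays down.
evaluations = [(0, True), (60, True), (120, False),
               (180, True), (240, True), (300, True), (360, True)]
states = alert_states(evaluations, for_seconds=180)
```

In this run the alert only fires at t=360, three minutes after the second outage began at t=180. With `for: 0m`, as on NumIntAuditChainBroken, the first true evaluation fires immediately.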

4. Structured logs

All logs are valid JSON (Pino), redacted per SECURITY_MODEL §3.3. LOG_LEVEL env var controls verbosity.

{ "level": "info", "time": "2026-04-21T11:00:00.000Z",
"event": "numint.lookup", "tenantId": null, "msisdnHash": "8d4f…", "scope": "BASIC",
"tier": "redis", "source": "redis", "confidence": "high", "latencyMs": 2,
"traceId": "00-abc-def-01" }
{ "level": "info", "event": "numint.mnp.recon.completed", "mnoId": "mtn-afghanistan",
"runId": "rcn_01HZX7…", "fileSha256": "bd…", "accepted": 47298, "rejected": 24,
"conflictsCount": 5, "durationMs": 287340 }
{ "level": "warn", "event": "numint.mnp.divergence", "msisdnMasked": "+93701***",
"mnpMno": "mtn-afghanistan", "hlrMno": "afghan-wireless",
"portDate": "2026-04-10", "severity": "HIGH", "traceId": "…" }
{ "level": "error", "event": "numint.audit.chain_broken", "verifierRunId": "cvr_…",
"chainKind": "LOOKUP_AUDIT", "partition": "lookup_audit_2026_04",
"firstBadSeq": 4837521, "auditId": "nia_…",
"expectedPrevHash": "ab12…", "actualPrevHash": "cd34…" }
{ "level": "info", "event": "numint.hlr_probe", "mnoId": "afghan-wireless",
"transport": "MAP_SRI_SM", "status": "OK", "durationMs": 284,
"msisdnHash": "8d4f…", "vlrChanged": true, "traceId": "…" }
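
The examples above carry either a hashed or a masked MSISDN rather than the raw number. The sketch below shows one way a numint.lookup line of that shape could be assembled. The redaction rules themselves live in SECURITY_MODEL §3.3 and are not reproduced here, so the keyed SHA-256 hash, the six-character mask prefix, and all function names are assumptions chosen only to match the visible examples.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Optional


def mask_msisdn(msisdn: str, keep: int = 6) -> str:
    """Keep a short prefix and replace the rest, e.g. '+93701***'."""
    return msisdn[:keep] + "***"


def hash_msisdn(msisdn: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256), an assumed scheme for msisdnHash."""
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


def lookup_log_line(msisdn: str, key: bytes, tier: str, confidence: str,
                    latency_ms: int, trace_id: str,
                    now: Optional[datetime] = None) -> str:
    """Serialise one numint.lookup entry without the raw MSISDN."""
    ts = now or datetime.now(timezone.utc)
    entry = {
        "level": "info",
        "time": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "event": "numint.lookup",
        "msisdnHash": hash_msisdn(msisdn, key),
        "tier": tier,
        "confidence": confidence,
        "latencyMs": latency_ms,
        "traceId": trace_id,
    }
    return json.dumps(entry)
```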

5. Distributed tracing

  • OpenTelemetry SDK exporting OTLP to the platform collector.
  • Parent span: grpc.server.ResolveMsisdn or http.server GET /v1/lookup/{msisdn}.
  • Child spans: cache.lru.get, cache.redis.get, db.pg.select, hlr.gateway.LiveLookup, db.pg.upsert, outbox.insert.
  • Propagation: W3C traceparent in gRPC metadata and HTTP headers; received trace IDs flow through into logs (traceId field).
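
For reference, a W3C traceparent value has four dash-separated lowercase hex fields: a 2-character version, a 32-character trace ID, a 16-character parent ID, and 2-character flags. The sketch below validates the version-00 layout and extracts the trace ID that would end up in the traceId log field. The function and type names are illustrative; in practice the OpenTelemetry SDK's propagator does this parsing.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return TraceContext(version, trace_id, parent_id, sampled)
```

Note that the abbreviated traceId values in the log examples (such as "00-abc-def-01") are placeholders and would not pass this validation.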

6. Grafana dashboards

6.1 numint-hot-path.json

  • Row 1 — Lookup throughput & latency: RPS by tier, P50/P95/P99 latency, LRU/Redis/PG hit ratios.
  • Row 2 — Hot-path health: in-flight concurrency, pod CPU/memory, replica count, HPA signal.
  • Row 3 — Error rates: INVALID_ARGUMENT, UNAVAILABLE, INTERNAL by tier.

6.2 numint-mnp-eir.json

  • Row 1 — MNP reconciliation: per-MNO last success timestamp, accepted vs rejected, conflict count.
  • Row 2 — MNP divergence: numint.mnp.divergence.v1 event rate by severity.
  • Row 3 — EIR: daily IMEI flags by reporter; BLACKLIST-size trend.

6.3 numint-adapter.json

  • Per-MNO adapter health (MAP dialog count, REST 5xx rate, token bucket admit/deny, latency).
  • PCAP sample age / size (ops visibility into MinIO archive).

6.4 numint-public-api.json

  • Tenant-facing view: per-tenant RPS, latency, quota usage, 429 rate, billing SKU distribution.

6.5 numint-audit.json

  • Audit chain verifier run history; outbox lag; chain-break counter (must stay at 0).

7. Synthetic probes

  • Every 30 s (numint-synth-probe cron in kbl and mzr): a synthetic ResolveMsisdn(+93701111111) call against a fixed test MSISDN; latency and confidence are recorded as numint_synth_probe_duration_seconds / numint_synth_probe_confidence.
  • Hourly: synthetic MNP reconciliation fixture against a mock MNO SFTP; verifies end-to-end pipeline.
  • Daily: audit-chain tamper-detect drill (deliberately inject a canary tamper row; verifier MUST flag it; operator MUST acknowledge within 1 h).
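
As an illustration of the 30-second probe, the sketch below times a single lookup and returns the two values that would be recorded. The resolver is passed in as a plain callable so the example runs without a gRPC client. The numeric encoding of confidence, the "low" and "medium" levels, and the function name are assumptions; only "high" appears in this document.

```python
import time
from typing import Callable, Dict

# Assumed numeric encoding for numint_synth_probe_confidence.
CONFIDENCE_VALUE = {"low": 0.0, "medium": 0.5, "high": 1.0}

TEST_MSISDN = "+93701111111"  # fixed test number from the probe description


def run_probe(resolve: Callable[[str], str],
              msisdn: str = TEST_MSISDN,
              clock: Callable[[], float] = time.monotonic) -> Dict[str, float]:
    """Call the resolver once and report its duration and confidence."""
    start = clock()
    confidence = resolve(msisdn)  # stand-in for the ResolveMsisdn call
    duration = clock() - start
    return {
        "numint_synth_probe_duration_seconds": duration,
        # Unknown confidence strings map to -1.0 so they stand out.
        "numint_synth_probe_confidence": CONFIDENCE_VALUE.get(confidence, -1.0),
    }
```

Injecting the clock keeps the measurement testable; a real probe would leave the default monotonic clock in place and push the two values to the metrics pipeline.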