
numbering-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: SECURITY_MODEL · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER


1. Operating Principle

Numbering-service is fail-closed on writes and fail-open via cache on hot-path reads. During a PG outage, the hot path serves cached ValidateLease results for up to 60 s; after that it returns UNAVAILABLE and sms-orchestrator does not dispatch the message.

No identifier is ever assigned, recalled, or reinstated without a confirmed durable PG write.
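
The bounded stale-cache behaviour on the read path can be sketched as follows. This is a minimal illustration and not the service's code: `validate_lease`, `CachedLease`, `Unavailable` and the `query_pg` callable are hypothetical names, and the sketch models only the outage path (in normal operation the hot path is served from the Redis cache).

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

GRACE_SECONDS = 60.0  # stale-cache window stated in the operating principle


class Unavailable(Exception):
    """Stands in for the UNAVAILABLE status returned to sms-orchestrator."""


@dataclass
class CachedLease:
    valid: bool
    cached_at: float  # time of the last result confirmed against PG


def validate_lease(
    identifier: str,
    cache: Dict[str, CachedLease],
    query_pg: Callable[[str], bool],
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Return lease validity, serving the cache only inside the grace window."""
    try:
        valid = query_pg(identifier)
    except ConnectionError:
        entry = cache.get(identifier)
        if entry is not None and now() - entry.cached_at <= GRACE_SECONDS:
            return entry.valid  # fail-open via cache
        raise Unavailable(identifier)  # fail-closed once the window has passed
    cache[identifier] = CachedLease(valid, now())
    return valid
```

The clock is injected so that the 60 s boundary can be exercised in tests without sleeping.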


2. Failure Mode Summary

| # | Failure | Probability | Impact | Mitigation summary |
|---|---|---|---|---|
| FM-01 | PostgreSQL primary unavailable (writes) | Low | Critical | Fail-closed on writes; cached reads continue 60 s |
| FM-02 | PostgreSQL read replica lag | Medium | Medium | Hot path always reads primary; replica used only for admin reports |
| FM-03 | Redis unavailable | Low | Medium | PG direct fallback; reservation cleanup falls back to safety-net cron |
| FM-04 | Redis keyspace notifications dropped | Medium | Low | 60 s safety-net cron picks up missed expirations |
| FM-05 | NATS outage (outbox publish blocked) | Low | Medium | Events buffer in numbering.outbox; relay retries on recovery |
| FM-06 | Sender-id-registry unavailable during alpha-Assign | Medium | Medium | Fail-closed on alpha-Assign; tenants retry; MSISDN/short-code paths unaffected |
| FM-07 | Concurrent Reserve race (CAS conflict) | High | Low | CAS resolves; loser receives CONFLICT and retries on a different candidate |
| FM-08 | MNO CSV parse / signature failure | Medium | Medium | Reject batch atomically; per-row errors written to lease_import_errors |
| FM-09 | Reservation cleanup cron stalled | Low | Medium | Redis TTL still expires keys; numbering_reservations_active_total rises; alert |
| FM-10 | Quarantine sweep stalled | Low | Medium | Numbers stay in QUARANTINE longer; tenant cannot re-lease until sweep recovers |
| FM-11 | MNO contract renewal not in place before lease expiry | Medium | High | Pre-expiry alerts at 60d/30d/7d; commerce ops engages MNO; manual extension allowed |
| FM-12 | Short-code scarcity (national exhaustion) | Medium | High | ATRA allocation requires long lead-time; alert at < 10 % platform-wide; capacity planning |
| FM-13 | Cross-region replication lag | Low | Medium | CAS prevents incorrect double-assignment; alert at > 5 s lag |
| FM-14 | Region failover (kbl primary lost) | Very Low | High | mzr promoted; in-flight reservations expire; leases preserved via quorum |
| FM-15 | Audit hash chain broken | Very Low | Critical | Daily verify cron; CRITICAL alert; halt regulator export until investigated |
| FM-16 | Regulator export generation fails | Low | High | Status stays PENDING; alert; admin manually re-runs |
| FM-17 | Compliance bulk-recall storm | Low | Medium | Worker is rate-limited (100 IDs/tick); recovers gracefully |
| FM-18 | Malicious tenant Reserve flood | Medium | Medium | Per-tenant rate limit; RESERVATION_BURST signal to fraud-intel |
| FM-19 | Idempotency-key replay collision | Low | Low | 409 IDEMPOTENCY_CONFLICT; client must use unique keys |
| FM-20 | Outbox relay stuck | Low | High | Backlog grows; alert at lag > 60 s; manual restart of relay pod |

3. Detailed Failure Modes

FM-01 — PostgreSQL primary unavailable (writes)

Scenario: PG primary node down or pool exhausted.

Detection: numbering_validate_lease_unavailable_total{reason="pg_down"} > 0; /health/ready returns 503; PG connection-pool metrics flatline.

Impact:

  • All Reserve, Assign, Release, Recall return UNAVAILABLE.
  • ValidateLease continues serving Redis cache for ≤ 60 s; thereafter returns UNAVAILABLE and sms-orchestrator does not dispatch.
  • MNO CSV import jobs pause.

Recovery:

  • PgBouncer + replica failover via Patroni; auto-recovery typically < 30 s.
  • During outage, all in-flight customer-portal Reserve/Assign requests get 503; portal surfaces friendly retry message.
  • After recovery, sms-orchestrator resumes; messages held in NATS during fail-closed window are dispatched.

Mitigation:

  • Multi-region quorum (kbl + mzr) means single-region PG outage degrades to mzr-served reads.
  • HPA + PgBouncer transaction mode reduce the risk of pool exhaustion.

FM-02 — PostgreSQL read replica lag

Scenario: Replication lag on read replica grows.

Detection: pg_replication_lag_seconds > 5.

Impact: Admin reports may show slightly stale data. Hot-path is unaffected (reads primary).

Recovery: Investigate and tune; reroute admin reads back to primary if persistent.


FM-03 — Redis unavailable

Scenario: Redis cluster lost or partitioned.

Detection: numbering_redis_cache_hit_ratio drops to 0; connection errors in logs.

Impact:

  • Hot path falls back to PG direct (latency rises P95 from ~5 ms to ~15 ms but still under SLA).
  • Reservation cleanup loses keyspace-notification trigger; safety-net cron (60 s) catches expirations within 60 s instead of 2 s.
  • Quota cache miss → PG aggregation per Reserve/Assign call (modest latency hit).
  • Idempotency replay cache lost — duplicate state mutations possible if client retries within the in-flight window.

Recovery: Restart Redis cluster; cache repopulates lazily.
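
The PG-direct fallback and lazy repopulation described above follow a cache-aside pattern. The sketch below is illustrative only: `read_with_fallback` and its callable parameters are hypothetical stand-ins for the Redis and PG clients, and a `ConnectionError` is assumed to be how a Redis outage surfaces.

```python
from typing import Callable, Optional


def read_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    pg_get: Callable[[str], str],
    cache_set: Callable[[str, str], None],
) -> str:
    """Try the cache first; on a miss or a cache error, read PG directly."""
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        # Redis is down: skip the cache entirely and serve from PG.
        return pg_get(key)
    value = pg_get(key)
    try:
        cache_set(key, value)  # lazy repopulation once Redis is back
    except ConnectionError:
        pass  # a failed cache write must not fail the read
    return value
```

The point of the design is that a cache failure can add latency to a read but does not cause it to fail.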


FM-04 — Redis keyspace notifications dropped

Scenario: Redis keyspace notifications misconfigured (notify-keyspace-events not set) or events lost during reconfiguration.

Detection: numbering_reservation_cleanup_lag_seconds P95 > 5 s; mismatch between Redis active keys and PG reservations rows.

Impact: Reservation cleanup is delayed by up to 60 s (the safety-net cron interval). Tenants may see identifiers shown as "still reserved" slightly longer than expected.

Recovery:

  • Verify Redis config: CONFIG GET notify-keyspace-events should include Ex.
  • Safety-net cron runs every 60 s and reconciles.
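
A reconciliation pass of the kind the safety-net cron performs might look like the sketch below. It uses an in-memory SQLite database as a stand-in for PG, and the `numbers` and `reservations` tables and their columns are assumptions made for illustration, not the service's schema.

```python
import sqlite3
from typing import List


def sweep_expired(conn: sqlite3.Connection, now: float) -> List[str]:
    """Release reservations whose expiry passed without a keyspace event."""
    with conn:  # single transaction: commits on success, rolls back on error
        expired = [
            row[0]
            for row in conn.execute(
                "SELECT identifier FROM reservations WHERE expires_at <= ?",
                (now,),
            )
        ]
        for identifier in expired:
            # Only flip identifiers that are still RESERVED.
            conn.execute(
                "UPDATE numbers SET state = 'AVAILABLE' "
                "WHERE identifier = ? AND state = 'RESERVED'",
                (identifier,),
            )
            conn.execute(
                "DELETE FROM reservations WHERE identifier = ?", (identifier,)
            )
    return expired
```

Because the pass depends only on `expires_at`, it is idempotent: running it again after a successful sweep finds nothing to release.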

FM-05 — NATS outage (outbox publish blocked)

Scenario: NATS JetStream cluster lost.

Detection: numbering_outbox_lag_seconds > 30; relay error logs.

Impact:

  • State writes succeed (PG writes are independent of NATS).
  • Events buffer in numbering.outbox; downstream consumers (billing, sender-id-registry, analytics) lag behind.
  • compliance.tenant.suspended.v1 consumer pauses — bulk recalls deferred.

Recovery:

  • NATS auto-recovers via 3-node cluster.
  • Outbox relay drains backlog on recovery; consumers catch up via NATS redelivery.
  • Worst case: a few minutes of lag with no data loss.
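
The drain-on-recovery behaviour of the outbox relay can be sketched as below. This is a simplified model under stated assumptions: the outbox is a plain list of `(sequence, payload)` tuples rather than the numbering.outbox table, and `publish` is a hypothetical callable that raises `ConnectionError` while the broker is unreachable.

```python
from typing import Callable, List, Tuple


def drain_outbox(
    outbox: List[Tuple[int, str]],
    publish: Callable[[str], None],
) -> int:
    """Publish buffered events oldest-first; stop at the first failure."""
    sent = 0
    while outbox:
        _seq, payload = outbox[0]
        try:
            publish(payload)
        except ConnectionError:
            break  # broker still down: leave the row for the next relay tick
        outbox.pop(0)  # remove only after a successful publish
        sent += 1
    return sent
```

Removing a row only after its publish succeeds gives at-least-once delivery, so an event can be published twice if the relay crashes between the two steps; consumers are assumed to tolerate duplicates.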

FM-06 — sender-id-registry unavailable during alpha-Assign

Scenario: sender-id-registry-service down; numbering cannot verify alpha-ID KYC.

Detection: gRPC errors on IsVerified; numbering returns FAILED_PRECONDITION with reason ALPHA_VERIFY_UNAVAILABLE.

Impact: Alpha-ID Assign calls fail. Tenants retry once registry is back. MSISDN and short-code paths unaffected.

Recovery: Once the registry recovers, numbering retries opportunistically. This is fail-closed by design: no alpha-ID is leased without verification.
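
The fail-closed rule for alpha-Assign can be illustrated with a short sketch. `assign_alpha`, `FailedPrecondition`, and the `is_verified` and `write_lease` callables are hypothetical; the reason code `ALPHA_NOT_VERIFIED` is also an assumption, since only `ALPHA_VERIFY_UNAVAILABLE` appears in this document.

```python
from typing import Callable


class FailedPrecondition(Exception):
    """Stands in for the gRPC FAILED_PRECONDITION status."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


def assign_alpha(
    sender_id: str,
    is_verified: Callable[[str], bool],
    write_lease: Callable[[str], None],
) -> None:
    """Lease an alpha-ID only after a positive verification answer."""
    try:
        verified = is_verified(sender_id)
    except ConnectionError as exc:
        # Registry unreachable: refuse rather than assume verification.
        raise FailedPrecondition("ALPHA_VERIFY_UNAVAILABLE") from exc
    if not verified:
        raise FailedPrecondition("ALPHA_NOT_VERIFIED")  # hypothetical reason
    write_lease(sender_id)
```

The lease write is reached only on an explicit `True` from the registry, so an outage can delay an assignment but cannot cause an unverified one.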


FM-07 — Concurrent Reserve race (CAS conflict)

Scenario: Two tenants race to Reserve the same AVAILABLE identifier.

Detection: numbering_conflict_detected_total{kind="CAS_RACE"} increments; numbering.conflict.detected.v1 event.

Impact: Loser receives 409 CONFLICT with the current state and version. Expected behaviour, not a failure.

Recovery: Client retries on a different candidate. Browse endpoint surfaces refreshed state on next call.
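
A compare-and-swap Reserve is commonly implemented as a conditional UPDATE guarded by state and version. The sketch below shows that general technique with SQLite standing in for PG; the `numbers` table, its columns and `try_reserve` are assumptions for illustration.

```python
import sqlite3


def try_reserve(
    conn: sqlite3.Connection,
    identifier: str,
    tenant_id: str,
    expected_version: int,
) -> bool:
    """Reserve an identifier only if it is still AVAILABLE at the version read."""
    with conn:
        cur = conn.execute(
            "UPDATE numbers "
            "SET state = 'RESERVED', tenant_id = ?, version = version + 1 "
            "WHERE identifier = ? AND state = 'AVAILABLE' AND version = ?",
            (tenant_id, identifier, expected_version),
        )
    # rowcount is 1 for the winner and 0 for the loser, which maps to 409 CONFLICT.
    return cur.rowcount == 1
```

When two tenants read version 1 and both call `try_reserve`, the first UPDATE bumps the version to 2 and the second matches no rows, so exactly one caller wins.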


FM-08 — MNO CSV parse / signature failure

Scenario: Operator uploads CSV with bad signature or malformed rows.

Detection: numbering_lease_import_batches_total{status="FAILED"} > 0; admin sees error report.

Impact: Batch rejected atomically (no partial ingest). No effect on existing inventory.

Recovery: Operator re-signs and re-uploads. Per-row errors are returned to operator via /v1/admin/numbering/blocks/imports/{batchId}/errors.
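
Atomic batch rejection with per-row error reporting can be sketched as follows. The signature scheme is not specified in this document, so HMAC-SHA256 over the raw file is used purely as an assumption, as are the `msisdn` column name, the digits-only check and the `import_batch` function itself.

```python
import csv
import hashlib
import hmac
import io
from typing import List, Tuple


def import_batch(
    raw: bytes, signature_hex: str, key: bytes
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Return (rows, errors); rows is empty whenever errors is non-empty."""
    expected = hmac.new(key, raw, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        return [], [(0, "bad signature")]  # line 0 marks a whole-file error

    rows: List[str] = []
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        msisdn = (row.get("msisdn") or "").strip()
        if msisdn.isdigit():
            rows.append(msisdn)
        else:
            errors.append((line_no, f"invalid msisdn: {msisdn!r}"))

    if errors:
        return [], errors  # reject the whole batch: no partial ingest
    return rows, []
```

All rows are validated before anything is returned for ingest, so the operator gets the full error list in one pass rather than one failure per upload.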


FM-09 — Reservation cleanup cron stalled

Scenario: Cleanup CronJob fails or is paused.

Detection: numbering_reservations_active_total rises monotonically; numbering_reservation_cleanup_lag_seconds > 60 s.

Impact: Tenants see reservations held longer than their TTL. Over time, pool browse shows fewer AVAILABLE identifiers than expected.

Recovery: Restart the CronJob, or run kubectl create job --from=cronjob/numbering-reservation-cleanup … manually to trigger a run immediately.


FM-10 — Quarantine sweep stalled

Scenario: Sweep CronJob fails.

Detection: numbering_quarantine_backlog_total > 100 for 10 m.

Impact: Numbers stuck in QUARANTINE past their quarantineUntil. New leases on those identifiers return QUARANTINE_ACTIVE.

Recovery: Restart the sweep CronJob, or trigger a run manually.


FM-11 — MNO contract renewal not in place before lease expiry

Scenario: Roshan/Etisalat-AF/MTN-AF/AWCC/Salaam contract for a prefix range expires; no renewal MoU in place.

Detection: Daily contract-expiry alert at 60 d / 30 d / 7 d before effective_until.
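
The daily threshold check can be expressed in a few lines. This is a sketch under the assumption that the job runs once per day and fires on an exact match of days remaining; `alert_due` and `ALERT_DAYS` are hypothetical names.

```python
from datetime import date
from typing import Optional

ALERT_DAYS = (60, 30, 7)  # pre-expiry thresholds from the detection rule


def alert_due(today: date, effective_until: date) -> Optional[int]:
    """Return the threshold hit today, or None if no alert is due."""
    days_left = (effective_until - today).days
    return days_left if days_left in ALERT_DAYS else None
```

With exact matching, a skipped daily run would miss that threshold entirely, so a production check might instead record which alerts have already fired and compare with `<=`.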

Impact: Every MSISDN in the prefix block covered by the expiring contract is affected. Existing leases are honoured until the validUntil derived from the contract; new assignments are rejected.

Recovery:

  • Commerce ops engages MNO before expiry (mandatory per readiness gate).
  • If renewal is delayed, an admin can extend the contract manually via PUT /v1/admin/numbering/contracts/{id}; the extension honours the expected new term once it is signed.

FM-12 — Short-code scarcity (national exhaustion)

Scenario: ATRA-allocated short-code pool is below 10 % AVAILABLE platform-wide.

Detection: NumberingShortCodeScarcityCritical alert (< 10 %).

Impact: New short-code Assigns rejected with NOT_AVAILABLE. Existing leases honoured.

Recovery:

  • ATRA allocation request (long lead time, weeks).
  • Capacity planning is part of monthly commerce ops review.

FM-13 — Cross-region replication lag

Scenario: kbl ↔ mzr replication lag exceeds 5 s.

Detection: numbering_cross_region_lag_seconds > 5 alert.

Impact: The CAS conflict rate may spike if writes occur in both regions. CAS still prevents double-assignment.

Recovery: Investigate WAN link / DB replication; throttle write load to a single region temporarily.


FM-14 — Region failover (kbl primary lost)

Scenario: kbl region completely lost (DC outage).

Detection: Multi-region health check; manual or automated failover trigger.

Impact:

  • mzr promoted to primary; all in-flight reservations in kbl are lost (acceptable; TTL semantics).
  • Active leases preserved via cross-region quorum on numbers/leases.
  • Monthly regulator-export cron scheduled in kbl is rescheduled in mzr.

Recovery: Per platform DR runbook; failback when kbl restored.


FM-15 — Audit hash chain broken

Scenario: A row in numbering.audit has been modified out-of-band, breaking the SHA-256 chain.

Detection: Daily audit-chain-verify cron raises NumberingAuditChainBroken CRITICAL alert.

Impact: Regulator-export generation is halted automatically until investigated. Audit evidence integrity is in question.

Recovery:

  • SECURITY incident response.
  • Trace the offending row; recover from cold backup if necessary.
  • Re-anchor chain only after sign-off from Security + Legal + Compliance.
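
Locating the offending row amounts to recomputing the chain from the start and reporting the first mismatch. The sketch below assumes each row's hash is SHA-256 over the previous hash concatenated with the row payload, with an all-zero genesis value; the actual serialisation used by numbering.audit is not specified here, and `chain_hash` and `first_broken_row` are hypothetical names.

```python
import hashlib
from typing import List, Optional, Tuple

GENESIS = "0" * 64  # assumed anchor for the first row


def chain_hash(prev_hash: str, payload: str) -> str:
    """Hash of one row: SHA-256 over the previous hash plus the payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def first_broken_row(rows: List[Tuple[str, str]]) -> Optional[int]:
    """rows are (payload, stored_hash) in insertion order.

    Returns the index of the first row whose stored hash does not match
    the recomputed one, or None if the chain is intact.
    """
    prev = GENESIS
    for index, (payload, stored) in enumerate(rows):
        if chain_hash(prev, payload) != stored:
            return index
        prev = stored
    return None
```

Because each hash covers its predecessor, a modified row is detected at its own index even when every later stored hash is left untouched.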

FM-16 — Regulator export generation fails

Scenario: Monthly export cron fails (PG read error, S3 upload failure, signing failure).

Detection: Export row stays in PENDING 24 h past scheduled time; NumberingRegulatorExportFailed alert.

Impact: ATRA submission delayed.

Recovery: Investigate root cause; manually trigger via POST /v1/admin/numbering/regulator-exports:generate.


FM-17 — Compliance bulk-recall storm

Scenario: A large tenant suspension triggers recall of thousands of leases.

Detection: numbering_recall_total{reason="ABUSE"} spike.

Impact: Worker rate-limits recall to 100 IDs / tick (10 ticks / s = 1000 IDs/s). For a 100k-lease tenant, recall takes ~100 s.

Recovery: Self-recovers; monitor backlog.
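
The throughput figure above follows directly from the batch size and tick rate. The sketch below reproduces the arithmetic and shows one way to slice a recall into per-tick batches; `recall_batches` and `estimated_seconds` are hypothetical helpers, not the worker's API.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 100        # IDs recalled per tick
TICKS_PER_SECOND = 10   # worker tick rate


def recall_batches(ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield the recall list in per-tick batches, preserving order."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def estimated_seconds(lease_count: int) -> float:
    """Time to recall lease_count leases at the configured rate."""
    ticks = -(-lease_count // BATCH_SIZE)  # ceiling division
    return ticks / TICKS_PER_SECOND
```

For a 100k-lease tenant this gives 1000 ticks at 10 ticks per second, matching the ~100 s estimate.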


FM-18 — Malicious tenant Reserve flood

Scenario: Tenant scripts thousands of Reserves to exhaust their quota or scrape inventory.

Detection: num:rate:reserve:{tenantId} exceeds 60/min; RESERVATION_BURST anomaly signal.

Impact: Tenant gets RESERVATION_QUOTA or 429 RATE_LIMITED.

Recovery: Fraud-intel scores tenant; if SUSPENDED, compliance-engine triggers bulk recall via FM-17 pathway.
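
A per-tenant limit of 60 Reserves per minute can be enforced with a sliding-window log. The sketch below is an in-memory stand-in: the key name num:rate:reserve:{tenantId} suggests a Redis counter, and the exact algorithm the service uses is not stated, so both the algorithm and the `ReserveLimiter` class are assumptions.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque

LIMIT = 60       # Reserve calls allowed per window
WINDOW = 60.0    # window length in seconds


class ReserveLimiter:
    """Sliding-window log: keeps timestamps of accepted calls per tenant."""

    def __init__(self) -> None:
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str, now: float) -> bool:
        hits = self._hits[tenant_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW:
            hits.popleft()
        if len(hits) >= LIMIT:
            return False  # caller maps this to 429 RATE_LIMITED
        hits.append(now)
        return True
```

Rejected calls are not recorded, so a tenant that keeps retrying while limited does not extend its own lockout.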


FM-19 — Idempotency-key replay collision

Scenario: Same Idempotency-Key used for two different request payloads within the 24 h window.

Detection: 409 IDEMPOTENCY_CONFLICT returned; numbering_idempotency_conflict_total increments.

Impact: Second request rejected. No state corruption.

Recovery: Client uses unique keys (UUIDv4 recommended).
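
Collision detection typically works by storing a fingerprint of the request payload next to the key. The sketch below assumes a SHA-256 fingerprint over canonical JSON and a plain dict as the store; `check_idempotency`, `IdempotencyConflict` and the return values are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Tuple

TTL_SECONDS = 24 * 3600  # idempotency window


class IdempotencyConflict(Exception):
    """Stands in for 409 IDEMPOTENCY_CONFLICT."""


def check_idempotency(
    store: Dict[str, Tuple[str, float]],
    key: str,
    payload: Any,
    now: float,
) -> str:
    """Return "NEW" or "REPLAY"; raise if the key is reused with a new payload."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    fingerprint = hashlib.sha256(canonical).hexdigest()
    entry = store.get(key)
    if entry is not None and now - entry[1] < TTL_SECONDS:
        if entry[0] != fingerprint:
            raise IdempotencyConflict(key)
        return "REPLAY"  # same key, same payload: safe to return the stored result
    store[key] = (fingerprint, now)
    return "NEW"
```

Serialising with `sort_keys=True` means two payloads that differ only in key order produce the same fingerprint and are treated as a replay rather than a conflict.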


FM-20 — Outbox relay stuck

Scenario: Relay pod stuck or NATS publish hangs.

Detection: numbering_outbox_lag_seconds > 60.

Impact: Downstream consumers lag: billing misses lease starts and sender-id-registry inventory state goes stale.

Recovery: Restart relay deployment; backlog drains in order.


4. Graceful Degradation Summary

Full operation:
Hot path: Redis cache hit (P95 5 ms)
Lifecycle: PG CAS + outbox + Redis mirror

Redis unavailable:
Hot path: PG direct (P95 15 ms)
Lifecycle: continues; reservation cleanup via 60 s safety-net cron

PG primary unavailable (writes):
Hot path: Redis cache for ≤ 60 s, then UNAVAILABLE → orchestrator fail-closes
Lifecycle: rejects with UNAVAILABLE

Sender-id-registry unavailable:
Alpha-Assign: rejected (FAILED_PRECONDITION)
MSISDN/Short-code: unaffected

NATS unavailable:
State writes: succeed
Events: buffer in outbox, drain on recovery
Compliance bulk recall: deferred until consumer recovers

Region kbl lost:
Active leases: preserved (quorum)
In-flight reservations: lost (TTL semantics)
Regulator export: rescheduled in mzr

5. Tenant-Experience Matrix

| Failure | Tenant view |
|---|---|
| FM-01 (PG down, brief) | Reserve / Assign returns 503 with retry hint |
| FM-01 (PG down, extended) | Outbound messages held in EVALUATING; eventually DEAD_LETTER with reason numbering_unavailable |
| FM-06 (sender-id-registry down) | Alpha-Assign returns 422 ALPHA_VERIFY_UNAVAILABLE with retry hint |
| FM-07 (CAS conflict) | Reserve returns 409 CONFLICT; portal auto-refreshes pool view |
| FM-09 (cleanup stalled) | Pool browse shows fewer AVAILABLE for up to 60 s past TTL |
| FM-12 (short-code scarce) | Assign returns 404/409 with "no inventory available; contact commerce ops" |
| FM-15 (audit chain broken) | No tenant impact; admin sees regulator-export blocked |

End of FAILURE_MODES.md