numbering-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: SECURITY_MODEL · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER

1. Operating Principle

Numbering-service is fail-closed on writes and fail-open-via-cache on hot-path reads. The hot path can serve cached ValidateLease results for up to 60 s on PG outage; beyond that, it returns UNAVAILABLE and sms-orchestrator does not dispatch the message.
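The read-path rule above can be sketched as a bounded-staleness cache lookup. This is a minimal illustration, not the service's implementation: it assumes the 60 s bound is enforced as a maximum cache-entry age, and `validate_lease`, `CachedLease`, `Unavailable`, and the `read_from_pg` callback are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

MAX_STALE_SECONDS = 60.0  # from the text: cached results are served for up to 60 s


class Unavailable(Exception):
    """Neither a fresh-enough cache entry nor PG can answer (fail-closed)."""


@dataclass
class CachedLease:
    valid: bool
    cached_at: float  # timestamp on the same clock as `now`


def validate_lease(
    identifier: str,
    cache: Dict[str, CachedLease],
    read_from_pg: Callable[[str], bool],
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Serve from cache while the entry is at most 60 s old, else read PG."""
    entry = cache.get(identifier)
    if entry is not None and now() - entry.cached_at <= MAX_STALE_SECONDS:
        return entry.valid  # fail-open via cache, even if PG is down
    try:
        valid = read_from_pg(identifier)
    except ConnectionError as exc:
        # Entry missing or too old and PG unreachable: caller must not dispatch.
        raise Unavailable(identifier) from exc
    cache[identifier] = CachedLease(valid, now())
    return valid
```

With this shape, a PG outage shorter than the entry's remaining lifetime is invisible to the caller; once the entry ages out, the caller sees `Unavailable` and holds the message.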

No identifier is ever assigned, recalled, or reinstated without a confirmed durable PG write.

2. Failure Mode Summary

#	Failure	Probability	Impact	Mitigation summary
FM-01	PostgreSQL primary unavailable (writes)	Low	Critical	Fail-closed on writes; cached reads continue 60 s
FM-02	PostgreSQL read replica lag	Medium	Medium	Hot path always reads primary; replica used only for admin reports
FM-03	Redis unavailable	Low	Medium	PG direct fallback; reservation cleanup falls back to safety-net cron
FM-04	Redis keyspace notifications dropped	Medium	Low	60 s safety-net cron picks up missed expirations
FM-05	NATS outage (outbox publish blocked)	Low	Medium	Events buffer in `numbering.outbox`; relay retries on recovery
FM-06	Sender-id-registry unavailable during alpha-Assign	Medium	Medium	Fail-closed on alpha-Assign; tenants retry; MSISDN/short-code paths unaffected
FM-07	Concurrent Reserve race (CAS conflict)	High	Low	CAS resolves; loser receives `CONFLICT` and retries on a different candidate
FM-08	MNO CSV parse / signature failure	Medium	Medium	Reject batch atomically; per-row errors written to `lease_import_errors`
FM-09	Reservation cleanup cron stalled	Low	Medium	Redis TTL still expires keys; `numbering_reservations_active_total` rises; alert
FM-10	Quarantine sweep stalled	Low	Medium	Numbers stay in QUARANTINE longer; tenant cannot re-lease until sweep recovers
FM-11	MNO contract renewal not in place before lease expiry	Medium	High	Pre-expiry alerts at 60d/30d/7d; commerce ops engages MNO; manual extension allowed
FM-12	Short-code scarcity (national exhaustion)	Medium	High	ATRA allocation requires long lead-time; alert at < 10 % platform-wide; capacity planning
FM-13	Cross-region replication lag	Low	Medium	CAS prevents incorrect double-assignment; alert at > 5 s lag
FM-14	Region failover (kbl primary lost)	Very Low	High	mzr promoted; in-flight reservations expire; leases preserved via quorum
FM-15	Audit hash chain broken	Very Low	Critical	Daily verify cron; CRITICAL alert; halt regulator export until investigated
FM-16	Regulator export generation fails	Low	High	Status stays PENDING; alert; admin manually re-runs
FM-17	Compliance bulk-recall storm	Low	Medium	Worker is rate-limited (100 IDs/tick); recovers gracefully
FM-18	Malicious tenant Reserve flood	Medium	Medium	Per-tenant rate limit; `RESERVATION_BURST` signal to fraud-intel
FM-19	Idempotency-key replay collision	Low	Low	409 `IDEMPOTENCY_CONFLICT`; client must use unique keys
FM-20	Outbox relay stuck	Low	High	Backlog grows; alert at lag > 60 s; manual restart of relay pod

3. Detailed Failure Modes

FM-01 — PostgreSQL primary unavailable (writes)

Scenario: PG primary node down or pool exhausted.

Detection: numbering_validate_lease_unavailable_total{reason="pg_down"} > 0; /health/ready returns 503; PG connection-pool metrics flatline.

Impact:

All Reserve, Assign, Release, Recall return UNAVAILABLE.
ValidateLease continues serving Redis cache for ≤ 60 s; thereafter returns UNAVAILABLE and sms-orchestrator does not dispatch.
MNO CSV import jobs pause.

Recovery:

PgBouncer + replica failover via Patroni; auto-recovery typically < 30 s.
During outage, all in-flight customer-portal Reserve/Assign requests get 503; portal surfaces friendly retry message.
After recovery, sms-orchestrator resumes; messages held in NATS during fail-closed window are dispatched.

Mitigation:

Multi-region quorum (kbl + mzr) means a single-region PG outage degrades to mzr-served reads.
HPA + PgBouncer transaction mode reduces the risk of pool exhaustion.

FM-02 — PostgreSQL read replica lag

Scenario: Replication lag on read replica grows.

Detection: pg_replication_lag_seconds > 5.

Impact: Admin reports may show slightly stale data. Hot-path is unaffected (reads primary).

Recovery: Investigate and tune; reroute admin reads back to primary if persistent.

FM-03 — Redis unavailable

Scenario: Redis cluster lost or partitioned.

Detection: numbering_redis_cache_hit_ratio drops to 0; connection errors in logs.

Impact:

Hot path falls back to PG direct (latency rises P95 from ~5 ms to ~15 ms but still under SLA).
Reservation cleanup loses keyspace-notification trigger; safety-net cron (60 s) catches expirations within 60 s instead of 2 s.
Quota cache miss → PG aggregation per Reserve/Assign call (modest latency hit).
Idempotency replay cache lost — duplicate state mutations possible if client retries within the in-flight window.

Recovery: Restart Redis cluster; cache repopulates lazily.

FM-04 — Redis keyspace notifications dropped

Scenario: Redis keyspace notifications misconfigured (notify-keyspace-events not set) or events lost during reconfiguration.

Detection: numbering_reservation_cleanup_lag_seconds P95 > 5 s; mismatch between Redis active keys and PG reservations rows.

Impact: Reservations cleanup delayed up to 60 s (safety-net cron interval). Tenants may see "still reserved" identifiers slightly longer than expected.

Recovery:

Verify Redis config: CONFIG GET notify-keyspace-events should include Ex.
Safety-net cron runs every 60 s and reconciles.
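The config check above can be automated with a small helper. In Redis, `E` enables the keyevent channel and `x` enables expired events; the `A` alias includes `x`, and Redis may return the flags in a different order than they were set. The function name is hypothetical, and the sketch assumes the cleanup worker subscribes to keyevent (not keyspace) notifications, as the `Ex` setting suggests.

```python
def expired_events_enabled(flags: str) -> bool:
    """True if a notify-keyspace-events value enables keyevent 'expired' events."""
    keyevent_channel = "E" in flags            # E = keyevent notifications
    expired_class = "x" in flags or "A" in flags  # x = expired; A is an alias that includes x
    return keyevent_channel and expired_class
```

For example, `expired_events_enabled("xE")` is true, while an empty value (the Redis default) or `"Kx"` (keyspace channel only) is not.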

FM-05 — NATS outage (outbox publish blocked)

Scenario: NATS JetStream cluster lost.

Detection: numbering_outbox_lag_seconds > 30; relay error logs.

Impact:

State writes succeed (PG writes are independent of NATS).
Events buffer in numbering.outbox; downstream consumers (billing, sender-id-registry, analytics) lag behind.
compliance.tenant.suspended.v1 consumer pauses — bulk recalls deferred.

Recovery:

NATS auto-recovers via 3-node cluster.
Outbox relay drains backlog on recovery; consumers catch up via NATS redelivery.
Worst case: a few minutes of lag with no data loss.
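The buffering and drain behaviour described above follows the usual transactional-outbox shape. The sketch below is an assumption about how such a relay could look, not the actual relay code: `drain_outbox` and the `publish` callback are hypothetical, and an in-memory list stands in for the `numbering.outbox` table.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, str]  # (sequence number, subject); illustrative shape only


def drain_outbox(outbox: List[Event], publish: Callable[[Event], None]) -> int:
    """Publish pending events oldest-first; stop at the first failure.

    Stopping (rather than skipping) keeps events in order, and an event is
    removed only after publish succeeds, so an outage loses nothing.
    """
    published = 0
    while outbox:
        event = outbox[0]
        try:
            publish(event)
        except ConnectionError:
            break  # broker still down; retry on the next relay tick
        outbox.pop(0)
        published += 1
    return published
```

Because removal happens after the publish, a crash between the two steps can re-send an event, so this pattern gives at-least-once delivery and consumers need to tolerate duplicates.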

FM-06 — sender-id-registry unavailable during alpha-Assign

Scenario: sender-id-registry-service down; numbering cannot verify alpha-ID KYC.

Detection: gRPC errors on IsVerified; numbering returns FAILED_PRECONDITION with reason ALPHA_VERIFY_UNAVAILABLE.

Impact: Alpha-ID Assign calls fail. Tenants retry once registry is back. MSISDN and short-code paths unaffected.

Recovery: Registry recovery; numbering retries opportunistically. Fail-closed by design: no alpha-ID is leased without verification.

FM-07 — Concurrent Reserve race (CAS conflict)

Scenario: Two tenants race to Reserve the same AVAILABLE identifier.

Detection: numbering_conflict_detected_total{kind="CAS_RACE"} increments; numbering.conflict.detected.v1 event.

Impact: Loser receives 409 CONFLICT with the current state and version. Expected behaviour, not a failure.

Recovery: Client retries on a different candidate. Browse endpoint surfaces refreshed state on next call.
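The race above can be illustrated with a version-checked update. This is a single-process sketch under stated assumptions: the `RESERVED` state name, the `Number` fields, and the `reserve` function are hypothetical, and a real implementation would make the check and the write atomic in PG (for example a conditional `UPDATE` on the version column).

```python
from dataclasses import dataclass
from typing import Dict


class Conflict(Exception):
    """Carries the current state and version back to the losing caller."""

    def __init__(self, state: str, version: int) -> None:
        super().__init__(f"CONFLICT: state={state} version={version}")
        self.state = state
        self.version = version


@dataclass
class Number:
    state: str = "AVAILABLE"
    version: int = 0
    holder: str = ""


def reserve(store: Dict[str, Number], identifier: str, tenant: str,
            expected_version: int) -> Number:
    """Compare-and-set: succeed only if nobody changed the row since it was read."""
    row = store[identifier]
    if row.state != "AVAILABLE" or row.version != expected_version:
        raise Conflict(row.state, row.version)
    row.state, row.holder, row.version = "RESERVED", tenant, row.version + 1
    return row
```

If two tenants both read version 0, the first `reserve` call bumps the row to version 1 and the second raises `Conflict` with the refreshed state, which is what lets the client pick a different candidate.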

FM-08 — MNO CSV parse / signature failure

Scenario: Operator uploads CSV with bad signature or malformed rows.

Detection: numbering_lease_import_batches_total{status="FAILED"} > 0; admin sees error report.

Impact: Batch rejected atomically (no partial ingest). No effect on existing inventory.

Recovery: Operator re-signs and re-uploads. Per-row errors are returned to operator via /v1/admin/numbering/blocks/imports/{batchId}/errors.

FM-09 — Reservation cleanup cron stalled

Scenario: Cleanup CronJob fails or is paused.

Detection: numbering_reservations_active_total rises monotonically; numbering_reservation_cleanup_lag_seconds > 60 s.

Impact: Tenants see reservations stuck longer than TTL. Eventually pool browse returns "fewer available" than expected.

Recovery: Restart CronJob; manual kubectl create job --from=cronjob/numbering-reservation-cleanup … to trigger immediately.

FM-10 — Quarantine sweep stalled

Scenario: Sweep CronJob fails.

Detection: numbering_quarantine_backlog_total > 100 for 10 m.

Impact: Numbers stuck in QUARANTINE past their quarantineUntil. New leases on those identifiers return QUARANTINE_ACTIVE.

Recovery: Restart the sweep CronJob, or trigger a run manually.

FM-11 — MNO contract renewal not in place before lease expiry

Scenario: Roshan/Etisalat-AF/MTN-AF/AWCC/Salaam contract for a prefix range expires; no renewal MoU in place.

Detection: Daily contract-expiry alert at 60 d / 30 d / 7 d before effective_until.
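The daily check above reduces to a date difference against the three thresholds. The sketch assumes each alert fires exactly on the day the remaining term equals a threshold; `alert_due` is a hypothetical name, and only `effective_until` comes from the text.

```python
from datetime import date
from typing import Optional

ALERT_DAYS = (60, 30, 7)  # thresholds from the text


def alert_due(today: date, effective_until: date) -> Optional[int]:
    """Return the threshold (in days) that fires today, or None."""
    remaining = (effective_until - today).days
    return remaining if remaining in ALERT_DAYS else None
```

A production check would probably also alert when the daily run was missed and a threshold was skipped; that is left out here for brevity.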

Impact: Every MSISDN in the prefix block covered by the expiring contract is impacted. Existing leases honour validUntil from the contract; new assignments are rejected.

Recovery:

Commerce ops engages MNO before expiry (mandatory per readiness gate).
If renewal is delayed, an admin can manually extend the contract via PUT /v1/admin/numbering/contracts/{id} to the expected new term once it is signed.

FM-12 — Short-code scarcity (national exhaustion)

Scenario: ATRA-allocated short-code pool is below 10 % AVAILABLE platform-wide.

Detection: NumberingShortCodeScarcityCritical alert (< 10 %).

Impact: New short-code Assigns rejected with NOT_AVAILABLE. Existing leases honoured.

Recovery:

ATRA allocation request (long lead time, weeks).
Capacity planning is part of monthly commerce ops review.

FM-13 — Cross-region replication lag

Scenario: kbl ↔ mzr replication lag exceeds 5 s.

Detection: numbering_cross_region_lag_seconds > 5 alert.

Impact: CAS conflict rate may spike if writes occur in both regions. No double-assignment occurs (CAS prevents it).

Recovery: Investigate WAN link / DB replication; throttle write load to a single region temporarily.

FM-14 — Region failover (kbl primary lost)

Scenario: kbl region completely lost (DC outage).

Detection: Multi-region health check; manual or automated failover trigger.

Impact:

mzr promoted to primary; all in-flight reservations in kbl are lost (acceptable; TTL semantics).
Active leases preserved via cross-region quorum on numbers/leases.
Monthly regulator-export cron scheduled in kbl is rescheduled in mzr.

Recovery: Per platform DR runbook; failback when kbl restored.

FM-15 — Audit hash chain broken

Scenario: A row in numbering.audit has been modified out-of-band, breaking the SHA-256 chain.

Detection: Daily audit-chain-verify cron raises NumberingAuditChainBroken CRITICAL alert.

Impact: Regulator-export generation is halted automatically until investigated. Audit evidence integrity is in question.

Recovery:

SECURITY incident response.
Trace the offending row; recover from cold backup if necessary.
Re-anchor chain only after sign-off from Security + Legal + Compliance.
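To make the verification step concrete, here is a minimal sketch of a SHA-256 chain check. The row layout of `numbering.audit` is not specified in this document, so the sketch assumes each row stores the hash of the previous row's hash concatenated with its own payload; `chain_hash`, `build_chain`, `first_broken_index`, and the all-zero genesis value are hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple

GENESIS = "0" * 64  # assumed anchor for the first row
Row = Tuple[str, str]  # (payload, stored_hash)


def chain_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(payloads: List[str]) -> List[Row]:
    """Append rows, each hash covering the previous hash and the payload."""
    rows: List[Row] = []
    prev = GENESIS
    for payload in payloads:
        prev = chain_hash(prev, payload)
        rows.append((payload, prev))
    return rows


def first_broken_index(rows: List[Row]) -> Optional[int]:
    """Return the index of the first row whose stored hash does not verify."""
    prev = GENESIS
    for index, (payload, stored_hash) in enumerate(rows):
        if chain_hash(prev, payload) != stored_hash:
            return index
        prev = stored_hash
    return None
```

Editing a payload out-of-band without recomputing its hash makes `first_broken_index` return that row's position, which is the "offending row" the recovery steps ask to trace.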

FM-16 — Regulator export generation fails

Scenario: Monthly export cron fails (PG read error, S3 upload failure, signing failure).

Detection: Export row stays in PENDING 24 h past scheduled time; NumberingRegulatorExportFailed alert.

Impact: ATRA submission delayed.

Recovery: Investigate root cause; manually trigger via POST /v1/admin/numbering/regulator-exports:generate.

FM-17 — Compliance bulk-recall storm

Scenario: A large tenant suspension triggers recall of thousands of leases.

Detection: numbering_recall_total{reason="ABUSE"} spike.

Impact: Worker rate-limits recall to 100 IDs / tick (10 ticks / s = 1000 IDs/s). For a 100k-lease tenant, recall takes ~100 s.
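The drain-time figure above follows directly from the two rate parameters. A small helper (hypothetical name, defaults taken from the text) makes it easy to redo the arithmetic for other tenant sizes:

```python
import math


def recall_drain_seconds(leases: int, ids_per_tick: int = 100,
                         ticks_per_second: int = 10) -> float:
    """Seconds to recall `leases` identifiers at the worker's rate limit."""
    ticks = math.ceil(leases / ids_per_tick)  # a partial batch still costs a tick
    return ticks / ticks_per_second
```

For 100,000 leases this gives 1,000 ticks at 10 ticks/s, i.e. 100 s, matching the estimate in the text.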

Recovery: Self-recovers; monitor backlog.

FM-18 — Malicious tenant Reserve flood

Scenario: Tenant scripts thousands of Reserves to exhaust their quota or scrape inventory.

Detection: num:rate:reserve:{tenantId} exceeds 60/min; RESERVATION_BURST anomaly signal.

Impact: Tenant gets RESERVATION_QUOTA or 429 RATE_LIMITED.
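A per-tenant limit of 60 Reserves per minute could be enforced with a fixed-window counter, sketched below. The document does not say which windowing scheme the service uses, so fixed windows are an assumption, as are the class and method names; an in-memory dict stands in for the `num:rate:reserve:{tenantId}` counter.

```python
from typing import Dict, Tuple


class ReserveRateLimiter:
    """Fixed-window counter: at most `limit` calls per tenant per window."""

    def __init__(self, limit: int = 60, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Old windows are never evicted here; a real store would use a TTL.
        self._counts: Dict[Tuple[str, int], int] = {}

    def allow(self, tenant_id: str, now_seconds: float) -> bool:
        window = int(now_seconds // self.window_seconds)
        key = (tenant_id, window)
        count = self._counts.get(key, 0) + 1
        self._counts[key] = count
        return count <= self.limit  # False maps to 429 RATE_LIMITED
```

Fixed windows allow a short burst of up to twice the limit across a window boundary; a sliding window or token bucket would tighten that at the cost of more state.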

Recovery: Fraud-intel scores tenant; if SUSPENDED, compliance-engine triggers bulk recall via FM-17 pathway.

FM-19 — Idempotency-key replay collision

Scenario: Same Idempotency-Key used for two different request payloads within the 24 h window.

Detection: 409 IDEMPOTENCY_CONFLICT returned; numbering_idempotency_conflict_total increments.

Impact: Second request rejected. No state corruption.

Recovery: Client uses unique keys (UUIDv4 recommended).
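The collision rule above can be sketched by storing a fingerprint of the payload next to each key. This is an illustration under assumptions, not the service's code: the fingerprint scheme (SHA-256 over canonical JSON), the `IdempotencyStore` class, and the in-memory dict are all hypothetical; only the 24 h window and the conflict outcome come from the text.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # 24 h window from the text


class IdempotencyConflict(Exception):
    """Same key, different payload, inside the window (409)."""


class IdempotencyStore:
    def __init__(self) -> None:
        # key -> (payload fingerprint, stored response, stored-at timestamp)
        self._seen: Dict[str, Tuple[str, Any, float]] = {}

    def execute(self, key: str, payload: dict,
                handler: Callable[[dict], Any], now: float) -> Any:
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(body).hexdigest()
        entry = self._seen.get(key)
        if entry is not None and now - entry[2] < WINDOW_SECONDS:
            if entry[0] != fingerprint:
                raise IdempotencyConflict(key)
            return entry[1]  # true replay: return the stored response
        response = handler(payload)
        self._seen[key] = (fingerprint, response, now)
        return response
```

A retry with the same key and the same payload gets the original response without re-running the handler; only a changed payload under a reused key is rejected.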

FM-20 — Outbox relay stuck

Scenario: Relay pod stuck or NATS publish hangs.

Detection: numbering_outbox_lag_seconds > 60.

Impact: Downstream consumers lag: billing misses lease starts and the sender-id-registry inventory state goes stale.

Recovery: Restart relay deployment; backlog drains in order.

4. Graceful Degradation Summary

Full operation:
  Hot path: Redis cache hit (P95 5 ms)
  Lifecycle: PG CAS + outbox + Redis mirror

Redis unavailable:
  Hot path: PG direct (P95 15 ms)
  Lifecycle: continues; reservation cleanup via 60 s safety-net cron

PG primary unavailable (writes):
  Hot path: Redis cache for ≤ 60 s, then UNAVAILABLE → orchestrator fail-closes
  Lifecycle: rejects with UNAVAILABLE

Sender-id-registry unavailable:
  Alpha-Assign: rejected (FAILED_PRECONDITION)
  MSISDN/Short-code: unaffected

NATS unavailable:
  State writes: succeed
  Events: buffer in outbox, drain on recovery
  Compliance bulk recall: deferred until consumer recovers

Region kbl lost:
  Active leases: preserved (quorum)
  In-flight reservations: lost (TTL semantics)
  Regulator export: rescheduled in mzr

5. Tenant-Experience Matrix

Failure	Tenant view
FM-01 (PG down, brief)	Reserve / Assign returns 503 with retry hint
FM-01 (PG down, extended)	Outbound messages held in `EVALUATING`; eventually `DEAD_LETTER` with reason `numbering_unavailable`
FM-06 (sender-id-registry down)	Alpha-Assign returns 422 `ALPHA_VERIFY_UNAVAILABLE` with retry hint
FM-07 (CAS conflict)	Reserve returns 409 `CONFLICT`; portal auto-refreshes pool view
FM-09 (cleanup stalled)	Pool browse shows fewer AVAILABLE for up to 60 s past TTL
FM-12 (short-code scarce)	Assign returns 404/409 with "no inventory available; contact commerce ops"
FM-15 (audit chain broken)	No tenant impact; admin sees regulator-export blocked

End of FAILURE_MODES.md
