numbering-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: SECURITY_MODEL · OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SERVICE_RISK_REGISTER
1. Operating Principle
Numbering-service is fail-closed on writes and fail-open-via-cache on hot-path reads. The hot path can serve cached ValidateLease results for up to 60 s on PG outage; beyond that, it returns UNAVAILABLE and sms-orchestrator does not dispatch the message.
No identifier is ever assigned, recalled, or reinstated without a confirmed durable PG write.
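The read-side rule above can be sketched as a cache-first lookup with a hard staleness bound. This is a minimal illustration, not the service's implementation: `validate_lease`, `CachedLease`, `query_primary`, and the in-memory `cache` dict are hypothetical stand-ins for the Redis cache and the PG primary read, and it assumes the 60 s window is measured from the moment an entry was cached.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

MAX_STALE_SECONDS = 60.0  # from this document: cached results are served for up to 60 s


class Unavailable(Exception):
    """Stands in for the UNAVAILABLE status returned to sms-orchestrator."""


@dataclass
class CachedLease:
    valid: bool
    cached_at: float  # clock reading when the entry was written


def validate_lease(
    identifier: str,
    cache: Dict[str, CachedLease],
    query_primary: Callable[[str], bool],  # hypothetical PG read; raises ConnectionError when PG is down
    now: Optional[float] = None,
) -> bool:
    now = time.monotonic() if now is None else now
    entry = cache.get(identifier)
    # Serve from cache while the entry is within the staleness bound.
    if entry is not None and now - entry.cached_at <= MAX_STALE_SECONDS:
        return entry.valid
    try:
        valid = query_primary(identifier)
    except ConnectionError as exc:
        # Cache is cold or too old and PG is unreachable: fail closed.
        raise Unavailable(identifier) from exc
    cache[identifier] = CachedLease(valid, now)
    return valid
```

With PG down, a lookup cached at t = 0 still succeeds at t = 59 s and raises `Unavailable` at t = 61 s, which is the point at which sms-orchestrator stops dispatching.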
2. Failure Mode Summary
| # | Failure | Probability | Impact | Mitigation summary |
|---|---|---|---|---|
| FM-01 | PostgreSQL primary unavailable (writes) | Low | Critical | Fail-closed on writes; cached reads continue 60 s |
| FM-02 | PostgreSQL read replica lag | Medium | Medium | Hot path always reads primary; replica used only for admin reports |
| FM-03 | Redis unavailable | Low | Medium | PG direct fallback; reservation cleanup falls back to safety-net cron |
| FM-04 | Redis keyspace notifications dropped | Medium | Low | 60 s safety-net cron picks up missed expirations |
| FM-05 | NATS outage (outbox publish blocked) | Low | Medium | Events buffer in numbering.outbox; relay retries on recovery |
| FM-06 | Sender-id-registry unavailable during alpha-Assign | Medium | Medium | Fail-closed on alpha-Assign; tenants retry; MSISDN/short-code paths unaffected |
| FM-07 | Concurrent Reserve race (CAS conflict) | High | Low | CAS resolves; loser receives CONFLICT and retries on a different candidate |
| FM-08 | MNO CSV parse / signature failure | Medium | Medium | Reject batch atomically; per-row errors written to lease_import_errors |
| FM-09 | Reservation cleanup cron stalled | Low | Medium | Redis TTL still expires keys; numbering_reservations_active_total rises; alert |
| FM-10 | Quarantine sweep stalled | Low | Medium | Numbers stay in QUARANTINE longer; tenant cannot re-lease until sweep recovers |
| FM-11 | MNO contract renewal not in place before lease expiry | Medium | High | Pre-expiry alerts at 60d/30d/7d; commerce ops engages MNO; manual extension allowed |
| FM-12 | Short-code scarcity (national exhaustion) | Medium | High | ATRA allocation requires long lead-time; alert at < 10 % platform-wide; capacity planning |
| FM-13 | Cross-region replication lag | Low | Medium | CAS prevents incorrect double-assignment; alert at > 5 s lag |
| FM-14 | Region failover (kbl primary lost) | Very Low | High | mzr promoted; in-flight reservations expire; leases preserved via quorum |
| FM-15 | Audit hash chain broken | Very Low | Critical | Daily verify cron; CRITICAL alert; halt regulator export until investigated |
| FM-16 | Regulator export generation fails | Low | High | Status stays PENDING; alert; admin manually re-runs |
| FM-17 | Compliance bulk-recall storm | Low | Medium | Worker is rate-limited (100 IDs/tick); recovers gracefully |
| FM-18 | Malicious tenant Reserve flood | Medium | Medium | Per-tenant rate limit; RESERVATION_BURST signal to fraud-intel |
| FM-19 | Idempotency-key replay collision | Low | Low | 409 IDEMPOTENCY_CONFLICT; client must use unique keys |
| FM-20 | Outbox relay stuck | Low | High | Backlog grows; alert at lag > 60 s; manual restart of relay pod |
3. Detailed Failure Modes
FM-01 — PostgreSQL primary unavailable (writes)
Scenario: PG primary node down or pool exhausted.
Detection: numbering_validate_lease_unavailable_total{reason="pg_down"} > 0; /health/ready returns 503; PG connection-pool metrics flatline.
Impact:
- All Reserve, Assign, Release, Recall return UNAVAILABLE.
- ValidateLease continues serving Redis cache for ≤ 60 s; thereafter returns UNAVAILABLE and sms-orchestrator does not dispatch.
- MNO CSV import jobs pause.
Recovery:
- PgBouncer + replica failover via Patroni; auto-recovery typically < 30 s.
- During outage, all in-flight customer-portal Reserve/Assign requests get 503; portal surfaces friendly retry message.
- After recovery, sms-orchestrator resumes; messages held in NATS during fail-closed window are dispatched.
Mitigation:
- Multi-region quorum (kbl + mzr) means single-region PG outage degrades to mzr-served reads.
- HPA + PgBouncer transaction mode reduce the risk of pool exhaustion.
FM-02 — PostgreSQL read replica lag
Scenario: Replication lag on read replica grows.
Detection: pg_replication_lag_seconds > 5.
Impact: Admin reports may show slightly stale data. Hot-path is unaffected (reads primary).
Recovery: Investigate the cause of the lag and tune replication; if the lag persists, reroute admin reads back to primary.
FM-03 — Redis unavailable
Scenario: Redis cluster lost or partitioned.
Detection: numbering_redis_cache_hit_ratio drops to 0; connection errors in logs.
Impact:
- Hot path falls back to PG direct (P95 latency rises from ~5 ms to ~15 ms, still within SLA).
- Reservation cleanup loses keyspace-notification trigger; safety-net cron (60 s) catches expirations within 60 s instead of 2 s.
- Quota cache miss → PG aggregation per Reserve/Assign call (modest latency hit).
- Idempotency replay cache lost — duplicate state mutations possible if client retries within the in-flight window.
Recovery: Restart Redis cluster; cache repopulates lazily.
FM-04 — Redis keyspace notifications dropped
Scenario: Redis keyspace notifications misconfigured (notify-keyspace-events not set) or events lost during reconfiguration.
Detection: numbering_reservation_cleanup_lag_seconds P95 > 5 s; mismatch between Redis active keys and PG reservations rows.
Impact: Reservations cleanup delayed up to 60 s (safety-net cron interval). Tenants may see "still reserved" identifiers slightly longer than expected.
Recovery:
- Verify Redis config: CONFIG GET notify-keyspace-events should include Ex.
- Safety-net cron runs every 60 s and reconciles.
FM-05 — NATS outage (outbox publish blocked)
Scenario: NATS JetStream cluster lost.
Detection: numbering_outbox_lag_seconds > 30; relay error logs.
Impact:
- State writes succeed (PG writes are independent of NATS).
- Events buffer in numbering.outbox; downstream consumers (billing, sender-id-registry, analytics) lag behind.
- compliance.tenant.suspended.v1 consumer pauses; bulk recalls are deferred.
Recovery:
- NATS auto-recovers via 3-node cluster.
- Outbox relay drains backlog on recovery; consumers catch up via NATS redelivery.
- Worst case: a few minutes of lag with no data loss.
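The buffering behaviour described for FM-05 follows the usual transactional-outbox shape. The sketch below is illustrative only: `drain_outbox`, the `published` flag, and the `publish` callable are hypothetical names, and a list of dicts stands in for the numbering.outbox table.

```python
from typing import Callable, Dict, List


def drain_outbox(outbox: List[Dict], publish: Callable[[Dict], None]) -> int:
    """Publish unpublished rows in insertion order; stop at the first failure.

    Stopping instead of skipping keeps event order intact, so the backlog
    simply waits in the table until the broker is reachable again.
    """
    sent = 0
    for row in outbox:
        if row["published"]:
            continue
        try:
            publish(row)  # hypothetical NATS publish; raises ConnectionError on outage
        except ConnectionError:
            break  # leave this row and everything after it for the next relay pass
        row["published"] = True
        sent += 1
    return sent
```

Because a row is marked only after a successful publish, a crash between the two steps leads to a repeat publish on the next pass. Under this assumption delivery is at-least-once, so consumers would need to tolerate duplicates.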
FM-06 — sender-id-registry unavailable during alpha-Assign
Scenario: sender-id-registry-service down; numbering cannot verify alpha-ID KYC.
Detection: gRPC errors on IsVerified; numbering returns FAILED_PRECONDITION with reason ALPHA_VERIFY_UNAVAILABLE.
Impact: Alpha-ID Assign calls fail. Tenants retry once registry is back. MSISDN and short-code paths unaffected.
Recovery: Once the registry recovers, numbering retries opportunistically. Fail-closed by design: no alpha-ID is leased without verification.
FM-07 — Concurrent Reserve race (CAS conflict)
Scenario: Two tenants race to Reserve the same AVAILABLE identifier.
Detection: numbering_conflict_detected_total{kind="CAS_RACE"} increments; numbering.conflict.detected.v1 event.
Impact: Loser receives 409 CONFLICT with the current state and version. Expected behaviour, not a failure.
Recovery: Client retries on a different candidate. Browse endpoint surfaces refreshed state on next call.
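The race in FM-07 can be illustrated with a version-checked update. This is a sketch under assumptions: `NumberRow`, `reserve`, and the RESERVED state name are hypothetical, and the in-memory dict stands in for a conditional PG update that matches on both state and version.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class Conflict(Exception):
    """Carries the current state and version, as the 409 CONFLICT response does."""

    def __init__(self, state: str, version: int):
        super().__init__(f"CONFLICT: state={state} version={version}")
        self.state = state
        self.version = version


@dataclass
class NumberRow:
    state: str
    version: int
    holder: Optional[str] = None


def reserve(rows: Dict[str, NumberRow], identifier: str, tenant_id: str, expected_version: int) -> None:
    row = rows[identifier]
    # Compare: the write applies only if nobody changed the row since it was read.
    if row.state != "AVAILABLE" or row.version != expected_version:
        raise Conflict(row.state, row.version)
    # Swap: take the identifier and bump the version so later writers lose.
    row.state = "RESERVED"  # assumed state name; the document names only AVAILABLE and QUARANTINE
    row.holder = tenant_id
    row.version += 1
```

If two tenants both read version 1, the first `reserve` call succeeds and the second raises `Conflict` with the refreshed state and version 2, after which the caller picks a different candidate.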
FM-08 — MNO CSV parse / signature failure
Scenario: Operator uploads CSV with bad signature or malformed rows.
Detection: numbering_lease_import_batches_total{status="FAILED"} > 0; admin sees error report.
Impact: Batch rejected atomically (no partial ingest). No effect on existing inventory.
Recovery: Operator re-signs and re-uploads. Per-row errors are returned to operator via /v1/admin/numbering/blocks/imports/{batchId}/errors.
FM-09 — Reservation cleanup cron stalled
Scenario: Cleanup CronJob fails or is paused.
Detection: numbering_reservations_active_total rises monotonically; numbering_reservation_cleanup_lag_seconds > 60 s.
Impact: Tenants see reservations stuck longer than TTL. Eventually pool browse returns "fewer available" than expected.
Recovery: Restart CronJob; manual kubectl create job --from=cronjob/numbering-reservation-cleanup … to trigger immediately.
FM-10 — Quarantine sweep stalled
Scenario: Sweep CronJob fails.
Detection: numbering_quarantine_backlog_total > 100 for 10 m.
Impact: Numbers stuck in QUARANTINE past their quarantineUntil. New leases on those identifiers return QUARANTINE_ACTIVE.
Recovery: Restart the sweep CronJob, or trigger a run manually.
FM-11 — MNO contract renewal not in place before lease expiry
Scenario: Roshan/Etisalat-AF/MTN-AF/AWCC/Salaam contract for a prefix range expires; no renewal MoU in place.
Detection: Daily contract-expiry alert at 60 d / 30 d / 7 d before effective_until.
Impact: Every MSISDN in the prefix block covered by the expiring contract is affected. Existing leases honour validUntil from the contract; new assignments are rejected.
Recovery:
- Commerce ops engages MNO before expiry (mandatory per readiness gate).
- If renewal is delayed, manual contract extension via PUT /v1/admin/numbering/contracts/{id} honours the expected new term once signed.
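The 60 d / 30 d / 7 d alert schedule amounts to a daily date comparison against effective_until. A minimal sketch, assuming the check runs once per day and fires only on the exact threshold days; `expiry_alert` and `ALERT_DAYS` are hypothetical names.

```python
from datetime import date
from typing import Optional

ALERT_DAYS = (60, 30, 7)  # thresholds from this document


def expiry_alert(effective_until: date, today: date) -> Optional[int]:
    """Return the matching threshold in days if an alert is due today, else None."""
    days_left = (effective_until - today).days
    return days_left if days_left in ALERT_DAYS else None
```

A check that fires only on exact days misses an alert if the daily job skips a run; a more tolerant variant would remember the last threshold it alerted on and fire whenever days_left drops to or below the next one.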
FM-12 — Short-code scarcity (national exhaustion)
Scenario: ATRA-allocated short-code pool is below 10 % AVAILABLE platform-wide.
Detection: NumberingShortCodeScarcityCritical alert (< 10 %).
Impact: New short-code Assigns rejected with NOT_AVAILABLE. Existing leases honoured.
Recovery:
- ATRA allocation request (long lead time, weeks).
- Capacity planning is part of monthly commerce ops review.
FM-13 — Cross-region replication lag
Scenario: kbl ↔ mzr replication lag exceeds 5 s.
Detection: numbering_cross_region_lag_seconds > 5 alert.
Impact: CAS conflict rate may spike if writes occur in both regions. No double-assignment (CAS prevents).
Recovery: Investigate WAN link / DB replication; throttle write load to a single region temporarily.
FM-14 — Region failover (kbl primary lost)
Scenario: kbl region completely lost (DC outage).
Detection: Multi-region health check; manual or automated failover trigger.
Impact:
- mzr promoted to primary; all in-flight reservations in kbl are lost (acceptable; TTL semantics).
- Active leases preserved via cross-region quorum on numbers/leases.
- Monthly regulator-export cron scheduled in kbl is rescheduled in mzr.
Recovery: Per platform DR runbook; failback when kbl restored.
FM-15 — Audit hash chain broken
Scenario: A row in numbering.audit has been modified out-of-band, breaking the SHA-256 chain.
Detection: Daily audit-chain-verify cron raises NumberingAuditChainBroken CRITICAL alert.
Impact: Regulator-export generation is halted automatically until investigated. Audit evidence integrity is in question.
Recovery:
- Open a security incident and follow the incident-response process.
- Trace the offending row; recover from cold backup if necessary.
- Re-anchor chain only after sign-off from Security + Legal + Compliance.
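To show why an out-of-band edit is detectable, here is a minimal hash-chain verifier. The row layout (`payload`, `prev_hash`, `hash`), the genesis value, and the way fields are concatenated are all assumptions for illustration; the actual numbering.audit schema is not described in this document.

```python
import hashlib
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed anchor value for the first row


def chain_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + "\n" + payload).encode("utf-8")).hexdigest()


def append_row(chain: List[Dict[str, str]], payload: str) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": chain_hash(prev, payload)})


def first_broken_index(chain: List[Dict[str, str]]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None if intact."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_hash"] != prev or row["hash"] != chain_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return None
```

Changing the payload of any row makes its stored hash stop matching the recomputed one, so the verifier reports that row's index. This is the kind of pointer the "trace the offending row" step would start from.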
FM-16 — Regulator export generation fails
Scenario: Monthly export cron fails (PG read error, S3 upload failure, signing failure).
Detection: Export row stays in PENDING 24 h past scheduled time; NumberingRegulatorExportFailed alert.
Impact: ATRA submission delayed.
Recovery: Investigate root cause; manually trigger via POST /v1/admin/numbering/regulator-exports:generate.
FM-17 — Compliance bulk-recall storm
Scenario: A large tenant suspension triggers recall of thousands of leases.
Detection: numbering_recall_total{reason="ABUSE"} spike.
Impact: Worker rate-limits recall to 100 IDs / tick (10 ticks / s = 1000 IDs/s). For a 100k-lease tenant, recall takes ~100 s.
Recovery: Self-recovers; monitor backlog.
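The drain-time figure above is simple arithmetic on the rate limit. A small helper makes it reusable for other tenant sizes; the function name and parameters are illustrative, with defaults taken from the numbers in this section.

```python
import math


def recall_drain_seconds(lease_count: int, ids_per_tick: int = 100, ticks_per_second: int = 10) -> float:
    """Estimate how long a bulk recall takes at the worker's rate limit."""
    ticks = math.ceil(lease_count / ids_per_tick)  # a partial batch still costs one tick
    return ticks / ticks_per_second
```

For a 100k-lease tenant this gives 1000 ticks at 10 ticks/s, i.e. 100 s, matching the estimate in the Impact line.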
FM-18 — Malicious tenant Reserve flood
Scenario: Tenant scripts thousands of Reserves to exhaust their quota or scrape inventory.
Detection: num:rate:reserve:{tenantId} exceeds 60/min; RESERVATION_BURST anomaly signal.
Impact: Tenant gets RESERVATION_QUOTA or 429 RATE_LIMITED.
Recovery: Fraud-intel scores tenant; if SUSPENDED, compliance-engine triggers bulk recall via FM-17 pathway.
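The per-tenant limit can be pictured as a counter keyed by tenant and time window. The sketch below assumes a fixed one-minute window; the document gives only the key pattern num:rate:reserve:{tenantId} and the 60/min threshold, so the windowing scheme, the `allow_reserve` name, and the dict standing in for Redis are assumptions.

```python
from typing import Dict, Tuple


def allow_reserve(
    counters: Dict[Tuple[str, int], int],
    tenant_id: str,
    now_seconds: float,
    limit: int = 60,
    window_seconds: int = 60,
) -> bool:
    """Count this Reserve call and report whether it is within the limit."""
    window = int(now_seconds // window_seconds)
    key = (f"num:rate:reserve:{tenant_id}", window)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit
```

A fixed window allows a short burst across a window boundary (up to twice the limit in quick succession); a sliding window or token bucket would tighten that at the cost of more state per tenant.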
FM-19 — Idempotency-key replay collision
Scenario: Same Idempotency-Key used for two different request payloads within the 24 h window.
Detection: 409 IDEMPOTENCY_CONFLICT returned; numbering_idempotency_conflict_total increments.
Impact: Second request rejected. No state corruption.
Recovery: Client uses unique keys (UUIDv4 recommended).
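One common way to detect this collision is to store a fingerprint of the request payload next to the idempotency key. The sketch below assumes that approach; `handle_once`, `fingerprint`, and the dict-backed store are hypothetical, and the 24 h expiry is left out for brevity.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """Stands in for the 409 IDEMPOTENCY_CONFLICT response."""


def fingerprint(payload: Dict[str, Any]) -> str:
    # Canonical JSON so that key order does not change the fingerprint.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle_once(
    store: Dict[str, Tuple[str, Any]],
    key: str,
    payload: Dict[str, Any],
    handler: Callable[[], Any],
) -> Any:
    fp = fingerprint(payload)
    seen = store.get(key)
    if seen is None:
        response = handler()  # first use of the key: execute and remember the result
        store[key] = (fp, response)
        return response
    stored_fp, stored_response = seen
    if stored_fp != fp:
        raise IdempotencyConflict(key)  # same key, different payload
    return stored_response  # true replay: return the original response unchanged
```

A replay with the same payload returns the stored response without running the handler again, while the same key with a different payload is rejected before any state changes.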
FM-20 — Outbox relay stuck
Scenario: Relay pod stuck or NATS publish hangs.
Detection: numbering_outbox_lag_seconds > 60.
Impact: Downstream consumers lag: billing misses lease starts, and sender-id-registry inventory state goes stale.
Recovery: Restart relay deployment; backlog drains in order.
4. Graceful Degradation Summary
Full operation:
Hot path: Redis cache hit (P95 5 ms)
Lifecycle: PG CAS + outbox + Redis mirror
Redis unavailable:
Hot path: PG direct (P95 15 ms)
Lifecycle: continues; reservation cleanup via 60 s safety-net cron
PG primary unavailable (writes):
Hot path: Redis cache for ≤ 60 s, then UNAVAILABLE → orchestrator fail-closes
Lifecycle: rejects with UNAVAILABLE
Sender-id-registry unavailable:
Alpha-Assign: rejected (FAILED_PRECONDITION)
MSISDN/Short-code: unaffected
NATS unavailable:
State writes: succeed
Events: buffer in outbox, drain on recovery
Compliance bulk recall: deferred until consumer recovers
Region kbl lost:
Active leases: preserved (quorum)
In-flight reservations: lost (TTL semantics)
Regulator export: rescheduled in mzr
5. Tenant-Experience Matrix
| Failure | Tenant view |
|---|---|
| FM-01 (PG down, brief) | Reserve / Assign returns 503 with retry hint |
| FM-01 (PG down, extended) | Outbound messages held in EVALUATING; eventually DEAD_LETTER with reason numbering_unavailable |
| FM-06 (sender-id-registry down) | Alpha-Assign returns 422 ALPHA_VERIFY_UNAVAILABLE with retry hint |
| FM-07 (CAS conflict) | Reserve returns 409 CONFLICT; portal auto-refreshes pool view |
| FM-09 (cleanup stalled) | Pool browse shows fewer AVAILABLE for up to 60 s past TTL |
| FM-12 (short-code scarce) | Assign returns 404/409 with "no inventory available; contact commerce ops" |
| FM-15 (audit chain broken) | No tenant impact; admin sees regulator-export blocked |
End of FAILURE_MODES.md