sender-id-registry-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust & Safety + Platform SRE
Last Updated: 2026-04-21
Companion: OBSERVABILITY · SYNC_CONTRACT · DEPLOYMENT_TOPOLOGY


1. Operating Principle: Fail-Closed at the Caller

sender-id-registry-service itself does not make pass/block decisions on outbound SMS — it answers Verify(senderId, tenantId). Fail-closed is enforced at the caller (compliance-engine, routing-engine, sms-firewall-service):

If Verify returns an error or RegistryStatus ∈ {UNKNOWN, TENANT_MISMATCH, SUSPENDED, REVOKED}, the caller MUST treat the message as non-allow and apply its lane-specific fail-closed rule (compliance HOLDs; routing rejects; firewall blocks).
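
The caller-side rule above can be sketched as a small decision function. This is an illustrative sketch, not the callers' actual code: the `VerifyOutcome` shape, the `decide` function, and the `VERIFIED` status (a stand-in for whatever status the registry returns for an allowed sender ID) are assumptions; only the four non-allow statuses and the per-lane actions come from this document.

```typescript
// VERIFIED is a hypothetical allow status; the other four are named in this document.
type RegistryStatus =
  | "VERIFIED"
  | "UNKNOWN"
  | "TENANT_MISMATCH"
  | "SUSPENDED"
  | "REVOKED";

type Lane = "compliance" | "routing" | "firewall";
type LaneAction = "ALLOW" | "HOLD" | "REJECT" | "BLOCK";

// Hypothetical shape of a Verify call result as seen by the caller.
type VerifyOutcome =
  | { ok: true; status: RegistryStatus }
  | { ok: false; error: string };

const NON_ALLOW: ReadonlySet<RegistryStatus> = new Set<RegistryStatus>([
  "UNKNOWN",
  "TENANT_MISMATCH",
  "SUSPENDED",
  "REVOKED",
]);

// Lane-specific fail-closed rule: compliance HOLDs, routing rejects, firewall blocks.
const FAIL_CLOSED_ACTION: Record<Lane, LaneAction> = {
  compliance: "HOLD",
  routing: "REJECT",
  firewall: "BLOCK",
};

function decide(lane: Lane, outcome: VerifyOutcome): LaneAction {
  // Any error, or any non-allow status, is treated as non-allow.
  if (!outcome.ok || NON_ALLOW.has(outcome.status)) {
    return FAIL_CLOSED_ACTION[lane];
  }
  return "ALLOW";
}
```

For example, `decide("compliance", { ok: false, error: "timeout" })` yields `"HOLD"`, while the same error in the routing lane yields `"REJECT"`.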

For the service itself, fail-closed manifests as:

  • Submission: rejected with 503 if KYC cannot be persisted/encrypted (we never accept what we cannot store securely).
  • State transitions: rejected with 503 if the audit row cannot be written (we never change state without an audit trail).
  • Verify hot path: prefers the cached value; falls back to UNKNOWN if both cache and DB are unavailable, so that caller fail-closed takes over.
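
The Verify hot-path fallback in the last bullet can be sketched as follows. This is a minimal, synchronous sketch for readability (real lookups would be asynchronous). The `Lookup` type and the `verify` signature are hypothetical, and treating a database miss as UNKNOWN is an assumption not stated in this document.

```typescript
type Status = string; // a RegistryStatus value; "UNKNOWN" is the one used here

// Hypothetical lookup contract: returns the status, undefined on a miss,
// and throws when the backing store is unavailable.
type Lookup = (senderId: string, tenantId: string) => Status | undefined;

function verify(
  senderId: string,
  tenantId: string,
  cache: Lookup,
  db: Lookup,
): Status {
  try {
    const cached = cache(senderId, tenantId);
    if (cached !== undefined) return cached; // cache hit: preferred path
  } catch {
    // Cache unavailable: fall through to the database.
  }
  try {
    // Assumption: a DB miss is reported as UNKNOWN.
    return db(senderId, tenantId) ?? "UNKNOWN";
  } catch {
    // Both cache and DB failed: answer UNKNOWN so the caller fails closed.
    return "UNKNOWN";
  }
}
```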

2. Failure Mode Summary

| # | Failure | Probability | Impact | Mitigation Summary |
|---|---|---|---|---|
| FM-01 | sender-id-registry-service unreachable | Low | Critical | Caller fail-closed; HPA + PDB; multi-region active-active |
| FM-02 | Postgres unavailable | Low | High | Redis cache absorbs hot path; submission returns 503 |
| FM-03 | Redis unavailable | Low | Medium | DB-direct evaluation; latency degrades; OTP submit returns 503 |
| FM-04 | DNS resolver unreachable (DoT) | Medium | Low | Background poller retries; manual fallback |
| FM-05 | S3 KYC fetch failure | Low | Medium | Reviewer retries; alert; nightly hash check |
| FM-06 | Vault Transit unavailable | Low | High | Submission 503; verify still works (no decryption needed) |
| FM-07 | HSM unavailable for regulator export | Medium | Medium | Defer signing; queue export; alert regulator-liaison |
| FM-08 | Regulator export SFTP failure | Medium | Medium | Local persistence; exponential backoff; manual delivery fallback |
| FM-09 | Reputation cron failure | Low | Low | Last snapshots persist; manual re-trigger; alert at 25 h freshness |
| FM-10 | Verification storm (DDoS) | Medium | Medium | Per-IP rate-limit; tarpit; per-MSISDN OTP cap |
| FM-11 | Public-search abuse / scraping | Medium | Low | Edge cache + per-IP rate-limit + abuse alert + soft-block |
| FM-12 | Multi-region split-brain | Low | High | HLC + LWW + UUID tiebreaker; conflict alert; manual reconciliation |
| FM-13 | NATS publish failure (outbox) | Low | Medium | Outbox retry; alert at lag; eventual delivery |
| FM-14 | KYC document forgery (insider/external) | Low | Critical | AI-assist + reviewer human-in-loop; dual-control on NOTARISED; tamper-detection cron |
| FM-15 | Notary whitelist compromise | Low | High | Daily integrity verification; quarterly notary re-attestation |

3. Detailed Failure Modes

FM-01 — Service unreachable

Scenario: All sender-id-registry-service pods crash; gRPC Verify calls fail platform-wide.

Detection: up{app="sender-id-registry-service"} == 0; alert SidServiceDown. Caller-side sid_verify_requests_total{status="error"} rises sharply.

Impact:

  • All outbound SMS is HOLDed by compliance-engine (fail-closed) within seconds.
  • Tenants see growing ON_HOLD queue in customer-portal.

Recovery:

  • HPA + PDB ensure a minimum of 2 replicas; evicted pods restart and service resumes automatically.
  • If the outage is region-wide: GeoDNS routes traffic to the healthy region (multi-master per ADR-0004 §5).
  • Reconciliation: HOLDed messages auto-released once Verify recovers (compliance-engine reviewer or auto-expiry).

Runbook: https://runbooks.ghasi.local/sid/service-down


FM-02 — Postgres unavailable

Scenario: Connection pool exhausted; primary down without sync replica; DB unreachable.

Detection: sid_verify_db_unavail_total > 0; readiness probe fails.

Impact:

  • Verify cache (Redis) absorbs hot path for ≤ 5 min (TTL).
  • Cold-miss → RegistryStatus.UNKNOWN → caller fail-closed (HOLD).
  • Submissions, state changes, and verification workflows all return 503 (the Verify hot path itself is served from cache as described above).

Recovery:

  • Patroni promotes replica (≤ 30 s); Redis cache covers gap.
  • After 5 min cache TTL: cold-miss rate rises; alert on sid_verify_cache_miss_total rate.
  • On recovery: backlog of state changes processed via HTTP retry on the caller side.

FM-03 — Redis unavailable

Scenario: Redis cluster lost; cache fully unavailable.

Detection: Redis client errors; sid_verify_cache_hit_total rate drops to zero.

Impact:

  • Every Verify hits Postgres directly. P95 latency rises from ~2 ms to ~30 ms.
  • OTP plaintext storage unavailable → OTP submit returns 503 (no plaintext to compare).
  • Distributed locks (cron coordination) unavailable → cron jobs may run on several replicas at once (mitigated by each job's own idempotency).

Recovery:

  • DB-direct path is sustainable for many hours; HPA scales pods up.
  • Relax the Verify SLO temporarily (P95 ≤ 30 ms) while Redis recovers.

FM-04 — DNS resolver unreachable

Scenario: Both DoT resolvers (1.1.1.1, 8.8.8.8) unreachable from cluster.

Detection: sid_dns_check_failures_total{reason="timeout"} rises.

Impact: DNS-TXT verifications fail. Tenants cannot complete DOMAIN_DNS verification.

Recovery:

  • Background DnsVerificationPoller keeps retrying every 30 min for 24 h before marking EXPIRED.
  • Tenant can use OTP/DOCUMENT/NOTARISED method as workaround.
  • On resolver recovery: pending verifications auto-complete on next poll.
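
The poller behaviour above (retry every 30 min, give up after 24 h) can be sketched as a pure decision function. The `afterPoll` name, the `PollDecision` shape, and the `VERIFIED` and `PENDING` state names are hypothetical; the interval, the window, and the `EXPIRED` state come from this document. A resolver timeout is treated the same as a missing record, which is an assumption.

```typescript
const POLL_INTERVAL_MS = 30 * 60 * 1000; // every 30 min
const POLL_WINDOW_MS = 24 * 60 * 60 * 1000; // for 24 h

type PollDecision =
  | { state: "VERIFIED" }
  | { state: "EXPIRED" }
  | { state: "PENDING"; nextPollAt: number };

// startedAt and now are epoch milliseconds; recordFound says whether the
// expected TXT record was seen on this poll (false also covers timeouts).
function afterPoll(
  startedAt: number,
  now: number,
  recordFound: boolean,
): PollDecision {
  if (recordFound) return { state: "VERIFIED" };
  if (now - startedAt >= POLL_WINDOW_MS) return { state: "EXPIRED" };
  return { state: "PENDING", nextPollAt: now + POLL_INTERVAL_MS };
}
```

Because a failed poll inside the window only reschedules, a pending verification completes on the first poll after the resolvers recover, with no extra recovery logic.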

FM-05 — S3 KYC fetch failure

Scenario: Object storage 5xx or transient network failure during reviewer KYC view.

Detection: sid_kyc_doc_view_failures_total rises; reviewer sees error toast.

Impact: Reviewer cannot view a specific document; review delayed.

Recovery:

  • Retry with exponential backoff up to 3 times.
  • Reviewer notified of degraded state in UI.
  • Persistent failure (> 5 min) → alert SidKycStorageDegraded.

FM-06 — Vault Transit unavailable

Scenario: Vault unsealed-but-unreachable, or transit engine errors.

Detection: sid_kyc_upload_failures_total{reason="encryption_failed"} rises.

Impact:

  • Submissions return 503 (cannot encrypt new KYC).
  • Existing KYC blob reads still work (cached DEKs).
  • Verify continues unaffected.

Recovery: Vault HA failover should be automatic. Manual unseal if needed. Submissions queue up at customer-portal retry layer.


FM-07 — HSM unavailable for regulator export

Scenario: PKCS#11 module errors; HSM appliance failover; key slot temporarily inaccessible.

Detection: sid_regulator_export_total{result="signing_failed"} > 0; alert SidExportSignerDown.

Impact: Daily 04:00 UTC export cannot be signed and transmitted.

Recovery:

  • Export file generated and persisted unsigned in s3://...staged/.
  • Cron retries hourly until signing succeeds.
  • After 12 h: page regulator-liaison; consider failover to peer-region HSM (per ADR-0004 §5 cross-region key escrow).

FM-08 — Regulator SFTP transmission failure

Scenario: ATRA SFTP endpoint down or credentials rotated without notice.

Detection: sid_regulator_export_sftp_failures_total > 0.

Recovery:

  • Local file persisted with full signature.
  • Exponential backoff retry: 5 min, 15 min, 1 h, 2 h, 4 h, 6 h.
  • After 24 h: page regulator-liaison; manual delivery via secondary channel (encrypted email + signed receipt).
  • sid_regulator_export_lag_seconds alert at 48 h.
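
To see how the retry schedule relates to the 24 h paging threshold, the cumulative retry times can be computed directly. This sketch assumes each delay is measured from the previous attempt; the `retryOffsets` helper is hypothetical.

```typescript
// Retry delays in minutes, as listed above: 5 min, 15 min, 1 h, 2 h, 4 h, 6 h.
const RETRY_DELAYS_MIN: readonly number[] = [5, 15, 60, 120, 240, 360];
const PAGE_AFTER_MIN = 24 * 60;

// Minutes after the first failure at which each retry fires,
// assuming each delay starts when the previous attempt fails.
function retryOffsets(delays: readonly number[]): number[] {
  let elapsed = 0;
  return delays.map((d) => (elapsed += d));
}

const offsets = retryOffsets(RETRY_DELAYS_MIN);
// offsets → [5, 20, 80, 200, 440, 800]

const lastRetryMin = offsets[offsets.length - 1];
const gapBeforePageMin = PAGE_AFTER_MIN - lastRetryMin; // 640
```

Under that assumption the last listed retry fires 800 min (13 h 20 min) after the first failure, leaving 640 min before the 24 h page. The document does not say whether retries continue at 6 h intervals during that gap.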

FM-09 — Reputation cron failure

Scenario: Daily 00:30 UTC cron fails (DB query timeout, locking contention, OOM).

Detection: sid_reputation_cron_failures_total > 0; sid_reputation_cron_last_success_timestamp stale.

Impact:

  • Reputation snapshots stale (up to 25 h).
  • Auto-suspension decisions deferred.
  • No impact on outbound SMS path (Verify still serves last-known reputation).

Recovery:

  • Cron retries automatically next tick.
  • Three consecutive failures → HIGH alert SidReputationCronStale.
  • Manual re-trigger via pnpm cli reputation:run after fixing root cause.

FM-10 — Verification storm (DDoS / abuse)

Scenario: Bot floods POST /v1/sender-ids/{id}/verifications (OTP request) at high RPS.

Detection: sid_otp_issuance_rate_limited_total rate spikes; sid_verifications_started_total{method=OTP} spikes.

Mitigations (layered):

  1. Kong rate-limit: 60 req/min per tenant; 100 RPS per IP.
  2. Service layer: OTP issuance ≤ 3 per registrant MSISDN per hour.
  3. SMS cost protection: channel-router rejects OTP if tenant's lane budget exhausted.
  4. Tarpit on repeated rate-limit hits (1 s sleep then 503).
  5. Suspend the tenant on a sustained pattern (requires a manual T&S decision).
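
The service-layer cap in item 2 (at most 3 OTP issuances per registrant MSISDN per hour) could look like the sliding-window limiter below. This is a sketch under assumptions: the document does not say whether the window is sliding or fixed, the `OtpIssuanceLimiter` class is hypothetical, and the in-memory map only works for a single process. A multi-replica deployment would need a shared store.

```typescript
const OTP_LIMIT = 3; // at most 3 issuances
const OTP_WINDOW_MS = 60 * 60 * 1000; // per hour

class OtpIssuanceLimiter {
  // MSISDN → timestamps (epoch ms) of issuances still inside the window.
  private readonly issued = new Map<string, number[]>();

  // Returns true and records the issuance if the MSISDN is under its cap.
  tryIssue(msisdn: string, now: number): boolean {
    const recent = (this.issued.get(msisdn) ?? []).filter(
      (t) => now - t < OTP_WINDOW_MS,
    );
    if (recent.length >= OTP_LIMIT) {
      this.issued.set(msisdn, recent);
      return false; // over the cap: the caller should rate-limit the request
    }
    recent.push(now);
    this.issued.set(msisdn, recent);
    return true;
  }
}
```

A fourth request inside the hour is refused; once the oldest issuance is an hour old, the next request is accepted again.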

FM-11 — Public-search abuse / scraping

Scenario: Scraper attempts to enumerate the registry via /v1/sender-ids/public/search.

Detection: sid_public_search_per_ip_distinct_queries > 1000 over 1 h; alert SidPublicSearchAbuse.

Mitigations:

  1. Edge cache (Cloudflare) absorbs ~95% of repeated queries.
  2. Origin per-IP rate-limit: 100 RPS hard cap; tarpit on overflow.
  3. 1000 distinct queries / hour from single IP → soft-block + JA3-fingerprint flagging at Cloudflare.
  4. Captcha challenge layer (future) for confirmed abuse.

FM-12 — Multi-region split-brain

Scenario: kbl ↔ mzr link partitions for > 5 min. Both regions accept submissions; some collide on (value_normalised, type).

Detection: Replication apply detects conflict; sid_replication_conflict_total > 0; alert SidSplitBrainConflict.

Mitigation:

  • HLC timestamps with last-writer-wins (LWW) and a UUID tiebreaker decide the "winner" deterministically, so both regions reach the same result.
  • Losing tenant is notified to resubmit; their attempt remains as a KYC_REJECTED row.
  • Manual reconciliation via Trust & Safety reviewer.
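
The deterministic winner selection can be sketched as a total ordering over version stamps. The `VersionStamp` shape and both function names are hypothetical, and the tiebreak direction (the lexicographically larger UUID wins) is an assumption. What matters is that every region applies the same rule to the same pair of rows.

```typescript
// Hypothetical version stamp carried by each replicated submission.
interface VersionStamp {
  physicalMs: number; // HLC wall-clock component
  logical: number; // HLC logical counter
  rowId: string; // row UUID (lowercase), used only to break exact HLC ties
}

// Negative if a orders before b, positive if after; 0 only for identical stamps.
function compareStamps(a: VersionStamp, b: VersionStamp): number {
  if (a.physicalMs !== b.physicalMs) return a.physicalMs - b.physicalMs;
  if (a.logical !== b.logical) return a.logical - b.logical;
  return a.rowId < b.rowId ? -1 : a.rowId > b.rowId ? 1 : 0;
}

// Last-writer-wins: the later stamp wins regardless of argument order.
function pickWinner(a: VersionStamp, b: VersionStamp): VersionStamp {
  return compareStamps(a, b) >= 0 ? a : b;
}
```

Because the comparison never depends on which region evaluates it, `pickWinner(x, y)` and `pickWinner(y, x)` return the same row.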

FM-13 — NATS publish failure (outbox)

Scenario: NATS unavailable or producer errors; outbox accumulates.

Detection: sid_outbox_lag_rows > 1000 for > 5 min; alert SidOutboxLag.

Impact: Downstream consumers (notification-service, analytics, admin-dashboard SSE) miss live events.

Recovery:

  • Outbox row count grows but DB state remains consistent.
  • Once NATS recovers, relay drains backlog at ~1000 events/sec.
  • For severe lag (> 100 000 rows): scale OutboxRelay deployment temporarily.
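
A rough drain-time estimate helps decide when scaling the relay is worthwhile. The helper below is a hypothetical back-of-the-envelope sketch: it assumes a constant drain rate (the ~1000 events/sec figure above) and a constant rate of new events arriving while the backlog drains.

```typescript
// Seconds to clear the outbox backlog at a constant net drain rate.
// Returns Infinity when new events arrive at least as fast as they drain.
function estimateDrainSeconds(
  backlogRows: number,
  drainPerSec: number,
  inflowPerSec: number = 0,
): number {
  const netPerSec = drainPerSec - inflowPerSec;
  if (netPerSec <= 0) return Infinity;
  return backlogRows / netPerSec;
}

const idle = estimateDrainSeconds(100_000, 1000); // 100 s with no new events
const busy = estimateDrainSeconds(100_000, 1000, 200); // 125 s at 200 events/sec inflow
```

At the 100 000-row threshold a single relay would need on the order of a couple of minutes under these assumptions, and longer as inflow approaches the drain rate.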

FM-14 — KYC document forgery

Scenario: Tenant uploads forged regulator letter or notarised authority.

Detection: AI forgery indicators (per AI_INTEGRATION §4); reviewer suspicion; post-fact regulator complaint.

Mitigation:

  • AI-assist surfaces font inconsistency, OCR-confidence anomalies, metadata edit traces.
  • Reviewer human-in-loop required for approval — AI does not auto-approve.
  • Dual-control mandatory on NOTARISED (US-SID-008).
  • Notary whitelist with quarterly re-attestation.
  • On post-fact discovery: revoke (US-SID-014); 12-month name reservation; regulator notification; legal escalation.

FM-15 — Notary whitelist compromise

Scenario: A whitelisted notary's signature stamp is leaked or fabricated.

Detection: Multiple suspicious notarised approvals from same notary; reviewer pattern detection; regulator complaint.

Mitigation:

  • Quarterly re-attestation of all whitelist entries.
  • Revoke notary entry → all NOTARISED-level sender-IDs that rest solely on that notary are downgraded to DOCUMENT and require re-verification.
  • Audit log captures every notarised approval with notary identity → forensic-friendly.
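
The downgrade rule in the second bullet can be sketched as below. The data model is an assumption: the `SenderIdRecord` shape, the `notaryIds` list, and the `needsReverification` flag are hypothetical names chosen to express "rests solely on that notary".

```typescript
type Level = "DOCUMENT" | "NOTARISED";

// Hypothetical record: notaryIds lists the notaries backing a NOTARISED level.
interface SenderIdRecord {
  id: string;
  level: Level;
  notaryIds: string[];
  needsReverification: boolean;
}

// Returns updated copies; records that do not reference the notary are untouched.
function revokeNotary(
  records: SenderIdRecord[],
  revokedNotaryId: string,
): SenderIdRecord[] {
  return records.map((r): SenderIdRecord => {
    if (r.level !== "NOTARISED" || !r.notaryIds.includes(revokedNotaryId)) {
      return r;
    }
    const remaining = r.notaryIds.filter((n) => n !== revokedNotaryId);
    if (remaining.length > 0) {
      // Still backed by another notary: keep NOTARISED.
      return { ...r, notaryIds: remaining };
    }
    // Rested solely on the revoked notary: downgrade and flag for re-verification.
    return { ...r, level: "DOCUMENT", notaryIds: [], needsReverification: true };
  });
}
```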

4. Graceful Degradation Summary

Full operation:
Verify (Redis hit) → response in ~2 ms
Submission → encrypt + persist + outbox
Cron → reputation, export

Redis down:
Verify (DB direct) → response in ~30 ms
OTP submit → 503 (no plaintext)
RATE_VOLUME-style protections degrade

DB down (Redis warm):
Verify (Redis stale up to 5 min) → continues
Submissions/state changes → 503
Cold misses and expired entries follow FM-02 (UNKNOWN → caller HOLD)

Vault down:
Verify continues
Submissions → 503 (cannot encrypt)

HSM down:
Verify continues
Regulator export queues unsigned files

Service down:
All callers fail-closed (HOLD)
Multi-region failover via GeoDNS
Backlog auto-released on recovery

5. Failure Mode ↔ Caller Behaviour Matrix

| Failure | compliance-engine | routing-engine | sms-firewall | customer-portal | admin-dashboard |
|---|---|---|---|---|---|
| FM-01 service down | HOLD storm; auto-expiry per existing FMs | reject + retry budget | block in | submission 503 | reviewer queue empty |
| FM-02 Postgres down | HOLD storm | reject | block in | 503 | 503 |
| FM-03 Redis down | normal but slower | normal but slower | normal but slower | normal | normal |
| FM-04 DNS down | unaffected | unaffected | unaffected | DNS verify fails; workaround: OTP | normal |
| FM-12 split-brain | unaffected | unaffected | unaffected | resubmit prompt | reconciliation queue |