sender-id-registry-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Trust & Safety + Platform SRE
Last Updated: 2026-04-21
Companion: OBSERVABILITY · SYNC_CONTRACT · DEPLOYMENT_TOPOLOGY
1. Operating Principle: Fail-Closed at the Caller
sender-id-registry-service itself does not make pass/block decisions on outbound SMS — it answers Verify(senderId, tenantId). Fail-closed is enforced at the caller (compliance-engine, routing-engine, sms-firewall-service):
If Verify returns an error or RegistryStatus ∈ {UNKNOWN, TENANT_MISMATCH, SUSPENDED, REVOKED}, the caller MUST treat the message as non-allow and apply its lane-specific fail-closed rule (compliance HOLDs; routing rejects; firewall blocks).
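A minimal sketch of the caller-side rule, assuming a synchronous stand-in for the gRPC Verify call. The ACTIVE allow status, the Lane and Action types, and the decide function are hypothetical names for illustration; only the four non-allow statuses and the three lane actions come from the rule above.

```typescript
// Statuses named in this document, plus a hypothetical ACTIVE allow status.
type RegistryStatus = "ACTIVE" | "UNKNOWN" | "TENANT_MISMATCH" | "SUSPENDED" | "REVOKED";
type Lane = "compliance" | "routing" | "firewall";
type Action = "ALLOW" | "HOLD" | "REJECT" | "BLOCK";

const NON_ALLOW: ReadonlySet<RegistryStatus> = new Set<RegistryStatus>([
  "UNKNOWN",
  "TENANT_MISMATCH",
  "SUSPENDED",
  "REVOKED",
]);

// Lane-specific fail-closed action: compliance HOLDs, routing rejects, firewall blocks.
const FAIL_CLOSED: Record<Lane, Action> = {
  compliance: "HOLD",
  routing: "REJECT",
  firewall: "BLOCK",
};

// `verify` stands in for the Verify call and may throw on transport errors.
function decide(lane: Lane, verify: () => RegistryStatus): Action {
  try {
    const status = verify();
    return NON_ALLOW.has(status) ? FAIL_CLOSED[lane] : "ALLOW";
  } catch {
    // Any error from Verify is treated as non-allow.
    return FAIL_CLOSED[lane];
  }
}
```

The error branch and the non-allow branch deliberately return the same action, so a caller cannot accidentally allow traffic when the registry is unreachable.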
For the service itself, fail-closed manifests as:
- Submission: rejected with 503 if KYC cannot be persisted/encrypted (we never accept what we cannot store securely).
- State transitions: rejected with 503 if the audit row cannot be written (we never change state without an audit trail).
- Verify hot path: prefers the cached value; falls back to UNKNOWN if both cache and DB are unavailable, allowing caller fail-closed to take over.
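The Verify hot-path fallback can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the Lookup type, the verifyHotPath name, and the ACTIVE status are hypothetical, the lookups are synchronous for brevity, and a miss in both stores is also mapped to UNKNOWN.

```typescript
type Status = "ACTIVE" | "SUSPENDED" | "REVOKED" | "TENANT_MISMATCH" | "UNKNOWN";

// A lookup returns a status, undefined on a miss, or throws if the store is unavailable.
type Lookup = (senderId: string, tenantId: string) => Status | undefined;

// Try the cache first, then the database; answer UNKNOWN if neither yields a value.
function verifyHotPath(senderId: string, tenantId: string, cache: Lookup, db: Lookup): Status {
  for (const lookup of [cache, db]) {
    try {
      const status = lookup(senderId, tenantId);
      if (status !== undefined) return status;
    } catch {
      // Store unavailable: fall through to the next one.
    }
  }
  // Neither store answered; the caller's fail-closed rule takes over.
  return "UNKNOWN";
}
```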
2. Failure Mode Summary
| # | Failure | Probability | Impact | Mitigation Summary |
|---|---|---|---|---|
| FM-01 | sender-id-registry-service unreachable | Low | Critical | Caller fail-closed; HPA + PDB; multi-region active-active |
| FM-02 | Postgres unavailable | Low | High | Redis cache absorbs hot path; submission returns 503 |
| FM-03 | Redis unavailable | Low | Medium | DB-direct evaluation; latency degrades; OTP submit returns 503 |
| FM-04 | DNS resolver unreachable (DoT) | Medium | Low | Background poller retries; manual fallback |
| FM-05 | S3 KYC fetch failure | Low | Medium | Reviewer retries; alert; nightly hash check |
| FM-06 | Vault Transit unavailable | Low | High | Submission 503; verify still works (no decryption needed) |
| FM-07 | HSM unavailable for regulator export | Medium | Medium | Defer signing; queue export; alert regulator-liaison |
| FM-08 | Regulator export SFTP failure | Medium | Medium | Local persistence; exponential backoff; manual delivery fallback |
| FM-09 | Reputation cron failure | Low | Low | Last snapshots persist; manual re-trigger; alert at 25 h freshness |
| FM-10 | Verification storm (DDoS) | Medium | Medium | Per-IP rate-limit; tarpit; per-MSISDN OTP cap |
| FM-11 | Public-search abuse / scraping | Medium | Low | Edge cache + per-IP rate-limit + abuse alert + soft-block |
| FM-12 | Multi-region split-brain | Low | High | HLC + LWW + UUID tiebreaker; conflict alert; manual reconciliation |
| FM-13 | NATS publish failure (outbox) | Low | Medium | Outbox retry; alert at lag; eventual delivery |
| FM-14 | KYC document forgery (insider/external) | Low | Critical | AI-assist + reviewer human-in-loop; dual-control on NOTARISED; tamper-detection cron |
| FM-15 | Notary whitelist compromise | Low | High | Daily integrity verification; quarterly notary re-attestation |
3. Detailed Failure Modes
FM-01 — Service unreachable
Scenario: All sender-id-registry-service pods crash; gRPC Verify calls fail platform-wide.
Detection: up{app="sender-id-registry-service"} == 0; alert SidServiceDown. Caller-side sid_verify_requests_total{status="error"} rises sharply.
Impact:
- All outbound SMS is HOLDed by compliance-engine (fail-closed) within seconds.
- Tenants see a growing ON_HOLD queue in customer-portal.
Recovery:
- HPA + PDB ensure a minimum of 2 replicas. Evicted pods restart automatically and resume service.
- If region-wide outage: GeoDNS routes traffic to healthy region (multi-master per ADR-0004 §5).
- Reconciliation: HOLDed messages auto-released once Verify recovers (compliance-engine reviewer or auto-expiry).
Runbook: https://runbooks.ghasi.local/sid/service-down
FM-02 — Postgres unavailable
Scenario: Connection pool exhausted; primary down without sync replica; DB unreachable.
Detection: sid_verify_db_unavail_total > 0; readiness probe fails.
Impact:
- Verify cache (Redis) absorbs hot path for ≤ 5 min (TTL).
- Cold-miss → RegistryStatus.UNKNOWN → caller fail-closed (HOLD).
- Submissions, state changes, verifications all return 503.
Recovery:
- Patroni promotes replica (≤ 30 s); Redis cache covers gap.
- After 5 min cache TTL: cold-miss rate rises; alert on sid_verify_cache_miss_total rate.
- On recovery: backlog of state changes processed via HTTP retry on the caller side.
FM-03 — Redis unavailable
Scenario: Redis cluster lost; cache fully unavailable.
Detection: Redis client errors; sid_verify_cache_hit_total rate drops to zero.
Impact:
- Every Verify hits Postgres directly. P95 latency rises from ~2 ms to ~30 ms.
- OTP plaintext storage unavailable → OTP submit returns 503 (no plaintext to compare).
- Distributed locks (cron coordination) unavailable → multi-replica risk for cron jobs (mitigated by cron's own idempotency).
Recovery:
- DB-direct path is sustainable for many hours; HPA scales pods up.
- Relax the Verify SLO temporarily (P95 ≤ 30 ms) while Redis recovers.
FM-04 — DNS resolver unreachable
Scenario: Both DoT resolvers (1.1.1.1, 8.8.8.8) unreachable from cluster.
Detection: sid_dns_check_failures_total{reason="timeout"} rises.
Impact: DNS-TXT verifications fail. Tenants cannot complete DOMAIN_DNS verification.
Recovery:
- Background DnsVerificationPoller keeps retrying every 30 min for 24 h before marking EXPIRED.
- Tenant can use OTP/DOCUMENT/NOTARISED method as workaround.
- On resolver recovery: pending verifications auto-complete on next poll.
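The poller timing in the recovery steps works out to 48 attempts per verification. A small sketch of the per-poll decision, assuming hypothetical VERIFIED and PENDING state names (only EXPIRED appears in this document) and a hypothetical afterPoll helper:

```typescript
const POLL_INTERVAL_MS = 30 * 60 * 1000; // retry every 30 min
const POLL_WINDOW_MS = 24 * 60 * 60 * 1000; // give up after 24 h

type PollState = "VERIFIED" | "PENDING" | "EXPIRED";

// Decide the verification state after one poll attempt.
// `txtMatches` is undefined when the resolver could not be reached.
function afterPoll(startedAtMs: number, nowMs: number, txtMatches: boolean | undefined): PollState {
  if (txtMatches === true) return "VERIFIED";
  // A failed or unreachable lookup keeps the verification pending until the window closes.
  return nowMs - startedAtMs >= POLL_WINDOW_MS ? "EXPIRED" : "PENDING";
}

// Number of polls that fit into the retry window.
const maxPolls = POLL_WINDOW_MS / POLL_INTERVAL_MS;
```

Treating an unreachable resolver the same as a non-matching record is what lets pending verifications complete on the next poll once the resolver recovers.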
FM-05 — S3 KYC fetch failure
Scenario: Object storage 5xx or transient network failure during reviewer KYC view.
Detection: sid_kyc_doc_view_failures_total rises; reviewer sees error toast.
Impact: Reviewer cannot view a specific document; review delayed.
Recovery:
- Retry with exponential backoff up to 3 times.
- Reviewer notified of degraded state in UI.
- Persistent failure (> 5 min) → alert SidKycStorageDegraded.
FM-06 — Vault Transit unavailable
Scenario: Vault unsealed-but-unreachable, or transit engine errors.
Detection: sid_kyc_upload_failures_total{reason="encryption_failed"} rises.
Impact:
- Submissions return 503 (cannot encrypt new KYC).
- Existing KYC blob reads still work (cached DEKs).
- Verify continues unaffected.
Recovery: Vault HA failover should be automatic. Manual unseal if needed. Submissions queue up at customer-portal retry layer.
FM-07 — HSM unavailable for regulator export
Scenario: PKCS#11 module errors; HSM appliance failover; key slot temporarily inaccessible.
Detection: sid_regulator_export_total{result="signing_failed"} > 0; alert SidExportSignerDown.
Impact: Daily 04:00 UTC export cannot be signed and transmitted.
Recovery:
- Export file generated and persisted unsigned in s3://...staged/.
- Cron retries hourly until signing succeeds.
- After 12 h: page regulator-liaison; consider failover to peer-region HSM (per ADR-0004 §5 cross-region key escrow).
FM-08 — Regulator SFTP transmission failure
Scenario: ATRA SFTP endpoint down or credentials rotated without notice.
Detection: sid_regulator_export_sftp_failures_total > 0.
Recovery:
- Local file persisted with full signature.
- Exponential backoff retry: 5 min, 15 min, 1 h, 2 h, 4 h, 6 h.
- After 24 h: page regulator-liaison; manual delivery via secondary channel (encrypted email + signed receipt).
- sid_regulator_export_lag_seconds alert at 48 h.
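The retry schedule and escalation threshold above can be expressed as a lookup table. This is a sketch only: retryDelayMs and shouldPageLiaison are hypothetical names, and repeating the last delay once the listed schedule is exhausted is an assumption, since the recovery steps do not say what happens after the sixth retry.

```typescript
const MIN = 60 * 1000;
const HOUR = 60 * MIN;

// Retry delays from the recovery steps: 5 min, 15 min, 1 h, 2 h, 4 h, 6 h.
const RETRY_DELAYS_MS = [5 * MIN, 15 * MIN, 1 * HOUR, 2 * HOUR, 4 * HOUR, 6 * HOUR];
const PAGE_AFTER_MS = 24 * HOUR;

// Delay before retry `n` (0-based).
// Assumption: after the listed schedule is used up, the last delay (6 h) repeats.
function retryDelayMs(n: number): number {
  const i = Math.max(0, Math.min(n, RETRY_DELAYS_MS.length - 1));
  return RETRY_DELAYS_MS[i] ?? 6 * HOUR;
}

// True once the failure has lasted long enough to page regulator-liaison.
function shouldPageLiaison(firstFailureMs: number, nowMs: number): boolean {
  return nowMs - firstFailureMs >= PAGE_AFTER_MS;
}

// Total time covered by the six listed delays.
const scheduleTotalMs = RETRY_DELAYS_MS.reduce((sum, d) => sum + d, 0);
```

The six listed delays add up to 13 h 20 min, so some retry behaviour beyond the listed schedule is needed to cover the gap up to the 24 h paging threshold.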
FM-09 — Reputation cron failure
Scenario: Daily 00:30 UTC cron fails (DB query timeout, locking contention, OOM).
Detection: sid_reputation_cron_failures_total > 0; sid_reputation_cron_last_success_timestamp stale.
Impact:
- Reputation snapshots stale (up to 25 h).
- Auto-suspension decisions deferred.
- No impact on outbound SMS path (Verify still serves last-known reputation).
Recovery:
- Cron retries automatically next tick.
- Three consecutive failures → HIGH alert SidReputationCronStale.
- Manual re-trigger via pnpm cli reputation:run after fixing root cause.
FM-10 — Verification storm (DDoS / abuse)
Scenario: Bot floods POST /v1/sender-ids/{id}/verifications (OTP request) at high RPS.
Detection: sid_otp_issuance_rate_limited_total rate spikes; sid_verifications_started_total{method=OTP} spikes.
Mitigations (layered):
- Kong rate-limit: 60 req/min per tenant; 100 RPS per IP.
- Service layer: OTP issuance ≤ 3 per registrant MSISDN per hour.
- SMS cost protection: channel-router rejects OTP if tenant's lane budget exhausted.
- Tarpit on repeated rate-limit hits (1 s sleep then 503).
- Auto-suspend tenant on sustained pattern (manual T&S decision).
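The service-layer cap (OTP issuance ≤ 3 per registrant MSISDN per hour) could look like the sliding-window limiter below. This is a single-process, in-memory sketch under stated assumptions: the OtpIssuanceLimiter class is hypothetical, the document does not say whether the window is sliding or fixed, and a multi-replica deployment would need a shared store rather than a local Map.

```typescript
const OTP_LIMIT = 3;
const WINDOW_MS = 60 * 60 * 1000; // 1 hour

class OtpIssuanceLimiter {
  // Issuance timestamps per MSISDN, pruned to the current window on each call.
  private readonly issued = new Map<string, number[]>();

  // Returns true and records the issuance if the MSISDN is under the cap.
  tryIssue(msisdn: string, nowMs: number): boolean {
    const recent = (this.issued.get(msisdn) ?? []).filter((t) => nowMs - t < WINDOW_MS);
    const allowed = recent.length < OTP_LIMIT;
    if (allowed) recent.push(nowMs);
    this.issued.set(msisdn, recent);
    return allowed;
  }
}
```

Rejected attempts are not recorded, so a flood of blocked requests does not extend the window for the legitimate registrant.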
FM-11 — Public-search abuse / scraping
Scenario: Scraper attempts to enumerate the registry via /v1/sender-ids/public/search.
Detection: sid_public_search_per_ip_distinct_queries > 1000 over 1 h; alert SidPublicSearchAbuse.
Mitigations:
- Edge cache (Cloudflare) absorbs ~95% of repeated queries.
- Origin per-IP rate-limit: 100 RPS hard cap; tarpit on overflow.
- > 1000 distinct queries / hour from a single IP → soft-block + JA3-fingerprint flagging at Cloudflare.
- Captcha challenge layer (future) for confirmed abuse.
FM-12 — Multi-region split-brain
Scenario: kbl ↔ mzr link partitions for > 5 min. Both regions accept submissions; some collide on (value_normalised, type).
Detection: Replication apply detects conflict; sid_replication_conflict_total > 0; alert SidSplitBrainConflict.
Mitigation:
- HLC + UUID tiebreaker decides "winner" deterministically.
- Losing tenant notified to resubmit; their attempt remains as a KYC_REJECTED row.
- Manual reconciliation via Trust & Safety reviewer.
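A sketch of a deterministic winner selection using a hybrid logical clock (HLC) with a UUID tiebreaker. The Hlc and Submission shapes and the pickWinner name are hypothetical. The direction of both comparisons is an assumption: later HLC wins (last-write-wins, as named in the summary table), and on equal clocks the lexicographically greater UUID wins.

```typescript
// Hybrid logical clock: wall-clock milliseconds plus a logical counter.
interface Hlc {
  physicalMs: number;
  logical: number;
}

interface Submission {
  id: string; // UUID string
  hlc: Hlc;
}

// Negative if a is earlier than b, positive if later, 0 if equal.
function compareHlc(a: Hlc, b: Hlc): number {
  return a.physicalMs - b.physicalMs || a.logical - b.logical;
}

// Last-write-wins: the later HLC wins; equal HLCs fall back to the UUID.
function pickWinner(a: Submission, b: Submission): Submission {
  const byClock = compareHlc(a.hlc, b.hlc);
  if (byClock !== 0) return byClock > 0 ? a : b;
  // Assumption: the lexicographically greater UUID wins the tie.
  return a.id > b.id ? a : b;
}
```

Because the result does not depend on argument order, both regions reach the same winner when they apply each other's replicated rows.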
FM-13 — NATS publish failure (outbox)
Scenario: NATS unavailable or producer errors; outbox accumulates.
Detection: sid_outbox_lag_rows > 1000 for > 5 min; alert SidOutboxLag.
Impact: Downstream consumers (notification-service, analytics, admin-dashboard SSE) miss live events.
Recovery:
- Outbox row count grows but DB state remains consistent.
- Once NATS recovers, relay drains backlog at ~1000 events/sec.
- For severe lag (> 100 000 rows): scale OutboxRelay deployment temporarily.
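A back-of-the-envelope estimate of drain time helps decide whether scaling the relay is worthwhile. The drainSeconds helper is hypothetical and assumes that throughput scales linearly with the number of relay instances, which the document does not state.

```typescript
const DRAIN_RATE_PER_RELAY = 1000; // events/sec, from the recovery notes

// Estimated seconds to drain `rows` outbox rows.
// Assumptions: linear scaling per relay instance; `inflowPerSec` is the rate of new rows.
function drainSeconds(rows: number, relays = 1, inflowPerSec = 0): number {
  const netRate = DRAIN_RATE_PER_RELAY * relays - inflowPerSec;
  // If new rows arrive as fast as they drain, the backlog never clears.
  return netRate > 0 ? Math.ceil(rows / netRate) : Infinity;
}
```

With no new traffic, a single relay clears 100 000 rows in about 100 seconds, so extra relay capacity mainly matters when fresh events keep arriving during the drain.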
FM-14 — KYC document forgery
Scenario: Tenant uploads forged regulator letter or notarised authority.
Detection: AI forgery indicators (per AI_INTEGRATION §4); reviewer suspicion; post-fact regulator complaint.
Mitigation:
- AI-assist surfaces font inconsistency, OCR-confidence anomalies, metadata edit traces.
- Reviewer human-in-loop required for approval — AI does not auto-approve.
- Dual-control mandatory on NOTARISED (US-SID-008).
- Notary whitelist with quarterly re-attestation.
- On post-fact discovery: revoke (US-SID-014); 12-month name reservation; regulator notification; legal escalation.
FM-15 — Notary whitelist compromise
Scenario: A whitelisted notary's signature stamp is leaked or fabricated.
Detection: Multiple suspicious notarised approvals from same notary; reviewer pattern detection; regulator complaint.
Mitigation:
- Quarterly re-attestation of all whitelist entries.
- Revoke notary entry → all NOTARISED-level sender-IDs that rest solely on that notary are downgraded to DOCUMENT and require re-verification.
- Audit log captures every notarised approval with notary identity → forensic-friendly.
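The "rest solely on that notary" condition can be sketched as a filter. The SenderId shape, its notaryIds field, and the toDowngrade function are hypothetical; the actual data model is not described in this document.

```typescript
type Level = "DOCUMENT" | "NOTARISED";

interface SenderId {
  value: string;
  level: Level;
  notaryIds: string[]; // notaries whose attestations back this sender-ID
}

// Returns the sender-IDs to downgrade to DOCUMENT: NOTARISED entries whose
// attestations all come from the revoked notary.
function toDowngrade(senders: SenderId[], revokedNotaryId: string): SenderId[] {
  return senders.filter(
    (s) =>
      s.level === "NOTARISED" &&
      s.notaryIds.length > 0 &&
      s.notaryIds.every((n) => n === revokedNotaryId),
  );
}
```

A sender-ID that is also attested by a second, still-trusted notary is left at NOTARISED under this reading.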
4. Graceful Degradation Summary
Full operation:
- Verify (Redis hit) → response in ~2 ms
- Submission → encrypt + persist + outbox
- Cron → reputation, export

Redis down:
- Verify (DB direct) → response in ~30 ms
- OTP submit → 503 (no plaintext)
- RATE_VOLUME-style protections degrade

DB down (Redis warm):
- Verify (Redis stale up to 5 min) → continues
- Submissions/state changes → 503
- After the cache TTL expires, behaviour reverts to FM-02 (cold-miss → UNKNOWN)

Vault down:
- Verify continues
- Submissions → 503 (cannot encrypt)

HSM down:
- Verify continues
- Regulator export queues unsigned files

Service down:
- All callers fail-closed (HOLD)
- Multi-region failover via GeoDNS
- Backlog auto-released on recovery
5. Failure Mode ↔ Caller Behaviour Matrix
| Failure | compliance-engine | routing-engine | sms-firewall | customer-portal | admin-dashboard |
|---|---|---|---|---|---|
| FM-01 service down | HOLD storm; auto-expiry per existing FMs | reject + retry budget | block in | submission 503 | reviewer queue empty |
| FM-02 Postgres down | HOLD storm | reject | block in | 503 | 503 |
| FM-03 Redis down | normal but slower | normal but slower | normal but slower | normal | normal |
| FM-04 DNS down | unaffected | unaffected | unaffected | DNS verify fails — workaround OTP | normal |
| FM-12 split-brain | unaffected | unaffected | unaffected | resubmit prompt | reconciliation queue |