Consent Ledger Service — Failure Modes
Version: 1.0
Status: Draft
Owner: Trust & Safety / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER
1. Operating principle: fail-closed
consent-ledger-service operates fail-closed for CheckConsent:
No allowed: true is returned when consent state cannot be determined.
If state cannot be determined for any reason, CheckConsent returns { allowed: false, reason: CONSENT_UNKNOWN }. Downstream callers (compliance-engine, routing-engine, sms-firewall-service) interpret any non-OK gRPC status as allowed: false for non-emergency lanes (defence in depth).
The only exception is lane P0_EMERGENCY (Common Alerting Protocol bridge), which may proceed with an audit row noting the bypass — this is the regulator-sanctioned trade-off for life-safety messaging.
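The caller-side rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the actual caller code: the type names, the grpcStatus string and the lane name P2_MARKETING used in the note below are hypothetical, and honouring a definite verdict on the emergency lane is an assumption the document does not confirm.

```typescript
// Sketch only. `allowed`, `reason`, CONSENT_UNKNOWN and P0_EMERGENCY come from
// this document; every other name is hypothetical.
interface ConsentVerdict {
  allowed: boolean;
  reason?: string;
}

interface CheckConsentOutcome {
  grpcStatus: string; // "OK" or any non-OK gRPC status name
  verdict?: ConsentVerdict; // present only when grpcStatus is "OK"
}

interface DispatchDecision {
  send: boolean;
  auditBypass: boolean; // true when the emergency lane proceeded without a verdict
}

function decideDispatch(lane: string, outcome: CheckConsentOutcome): DispatchDecision {
  const verdict = outcome.grpcStatus === "OK" ? outcome.verdict : undefined;

  if (verdict !== undefined && verdict.reason !== "CONSENT_UNKNOWN") {
    // Consent state is known: honour it (assumed to apply to every lane).
    return { send: verdict.allowed, auditBypass: false };
  }
  if (lane === "P0_EMERGENCY") {
    // State unknown on the life-safety lane: proceed, but record the bypass.
    return { send: true, auditBypass: true };
  }
  // State unknown on any other lane: fail closed.
  return { send: false, auditBypass: false };
}
```

For example, decideDispatch("P2_MARKETING", { grpcStatus: "UNAVAILABLE" }) yields send: false, while the same outcome on P0_EMERGENCY yields send: true with auditBypass: true.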
2. Failure-mode summary
| # | Failure | Likelihood | Impact | Rating | Mitigation summary |
|---|---|---|---|---|---|
| FM-01 | Postgres unavailable (primary + standbys) | Low | Critical | HIGH | Fail-closed; promote standby; redelivery upstream |
| FM-02 | Redis unavailable | Medium | Medium | MEDIUM | Fall back to Postgres; degraded latency |
| FM-03 | Both Redis and Postgres unavailable | Very Low | Critical | HIGH | Fail-closed; CONSENT_UNKNOWN; alert |
| FM-04 | NATS lag on sms.mo.inbound consumer | Medium | High (regulator risk) | HIGH | Scale consumers; alert at 60 s lag; runbook |
| FM-05 | ATRA DND endpoint unreachable | Medium | Medium | MEDIUM | Use last-known DND; alert after 24 h stale |
| FM-06 | Audit hash-chain break detected | Very Low | Critical | HIGH | Freeze writes (manual), investigate, never auto-resolve |
| FM-07 | Erasure processor failure (SLA breach) | Low | High (regulatory) | MEDIUM | Manual reprocess; legal escalation |
| FM-08 | STOP-keyword false-positive flood | Medium | Medium | MEDIUM | T&S triage; rapid keyword tuning; tenant feedback API |
| FM-09 | Vault Transit (KEK) unavailable | Low | Medium | MEDIUM | DEK cache (60 s); CheckConsent unaffected |
| FM-10 | Outbox publish stuck | Low | Medium | MEDIUM | Retry; manual replay; alert |
| FM-11 | Citizen-portal MSISDN-OTP abuse / takeover | Medium | High | HIGH | Rate limits; captcha; OTP entropy; short JWT lifetime |
| FM-12 | Bulk import abuse (fake opt-ins) | Medium | High | HIGH | CSV hash audit; volume anomaly detection; regulator dispute path |
| FM-13 | Cross-region replication divergence | Low | Critical | HIGH | Cross-region verifier; CRITICAL alert; freeze writes |
| FM-14 | NetworkPolicy mis-config exposes egress to offshore | Very Low | Critical | HIGH | Deploy-time residency test; Istio AuthorizationPolicy |
3. Detailed failure modes
FM-01 — Postgres unavailable
Scenario: Primary Postgres is unreachable; if standby promotion has not completed, all read/write paths fail.
Detection:
- consent_check_failclosed_total{cause="db_unavailable"} rises
- /health/ready returns 503
- Patroni / etcd events; Prometheus alert PostgresPrimaryDown
Impact:
- CheckConsent cache hits continue (Redis-only); cache misses fall through to fail-closed CONSENT_UNKNOWN.
- Writes fail with 503; tenant integrations retry (per their own backoff).
- Audit chain pauses.
Recovery:
- Patroni auto-promotes Herat standby within 90 s (RTO); clients reconnect via Service VIP.
- If promotion fails (e.g., split-brain), on-call follows runbook pg-primary-down.md: confirm consensus, manually trigger promotion in surviving region, verify chain integrity post-recovery.
- Caller-side: NATS redelivery upstream means messages are not lost; they queue.
Mitigation:
- Patroni 3-node cluster across Kabul / Herat / Mazar (per ADR-0004 §3).
- PgBouncer in transaction mode shields per-pod connection churn.
- Circuit breaker on PG client opens after 5 consecutive errors within 30 s; it moves to half-open after 60 s.
Runbook: pg-primary-down.md
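The circuit-breaker mitigation for FM-01 can be sketched as a small state machine. This is a minimal illustration, not the service's PG client: the class and constant names are hypothetical, the "5 consecutive errors / 30 s" threshold is read here as five consecutive errors inside a 30 s window, and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
type BreakerState = "closed" | "open" | "half-open";

const PG_BREAKER_MAX_ERRORS = 5; // consecutive errors before opening
const PG_BREAKER_WINDOW_MS = 30000; // errors must fall inside this window (assumed reading)
const PG_BREAKER_HALF_OPEN_MS = 60000; // time spent open before probing again

class PgCircuitBreaker {
  private failureTimes: number[] = [];
  private openedAtMs = 0;
  private current: BreakerState = "closed";

  state(nowMs: number): BreakerState {
    if (this.current === "open" && nowMs - this.openedAtMs >= PG_BREAKER_HALF_OPEN_MS) {
      this.current = "half-open";
    }
    return this.current;
  }

  // Simplification: half-open lets every call through; a production breaker
  // would usually admit a single probe.
  allowRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }

  recordSuccess(): void {
    this.failureTimes = []; // a success ends the run of consecutive errors
    this.current = "closed";
  }

  recordFailure(nowMs: number): void {
    const s = this.state(nowMs);
    if (s === "open") return;
    if (s === "half-open") {
      this.trip(nowMs); // the probe failed: open again
      return;
    }
    this.failureTimes = this.failureTimes.filter((t) => nowMs - t < PG_BREAKER_WINDOW_MS);
    this.failureTimes.push(nowMs);
    if (this.failureTimes.length >= PG_BREAKER_MAX_ERRORS) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.current = "open";
    this.openedAtMs = nowMs;
    this.failureTimes = [];
  }
}
```

While the breaker is open, callers would skip Postgres entirely and take the fail-closed path described under Impact.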
FM-02 — Redis unavailable
Scenario: Redis cluster unreachable (network partition, full restart, etc.).
Detection:
- consent_redis_op_duration_seconds errors; connection error logs
- consent_check_cache_misses_total near-100% for the duration
Impact:
- CheckConsent falls through to Postgres for every call. Latency jumps from ≤ 5 ms to ≤ 20 ms; PG load triples.
- Redis distributed locks for cron workers cannot be acquired; workers skip the cycle and emit consent_worker_skipped_no_lock_total.
Recovery:
- Redis cluster recovers; service does not need restart.
- Cache rewarms naturally (TTL-driven); explicit consent-cache-warmer cron run accelerates if needed.
Mitigation:
- HPA may scale up consent-ledger-service replicas under elevated PG load.
- Postgres replica pool sized to handle sustained 5,000 RPS without cache; load test verifies (with 80% headroom).
Runbook: redis-down.md
FM-03 — Both Redis and Postgres unavailable
Scenario: Catastrophic dual-dependency outage (data centre power, etc.).
Detection:
- consent_check_failclosed_total{cause="redis_and_db_unavailable"} rises sharply
- ConsentCheckFailclosedSurge CRITICAL alert pages
Impact:
- 100% of CheckConsent returns CONSENT_UNKNOWN. compliance-engine blocks all non-emergency outbound. Dispatch effectively halted, by design.
- Tenants see messages → DEAD_LETTER after redelivery exhausts (about 3 minutes per message).
- Lane=P0_EMERGENCY messages still flow with audit-noted bypass.
Recovery:
- Restore one or both dependencies. Service immediately starts serving allowed:true/false again.
- Retry pipeline replays queued messages.
Mitigation:
- Independent failure domains for PG (Patroni) and Redis (cluster), so that a single component fault should not take out both.
- Pre-positioned standby in Herat region for PG cross-region failover.
Runbook: checkconsent-failclosed.md — PRIORITY
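The degradation ladder across FM-01 to FM-03 (cache hit, Postgres fallback, fail-closed) can be sketched as follows. This is an illustration under assumptions: the store interface, the synchronous calls, and the reason strings FROM_CACHE, FROM_DB and NO_RECORD are hypothetical; only CONSENT_UNKNOWN and the two cause labels come from this document.

```typescript
interface StoreHit {
  found: boolean;
  allowed: boolean;
}

// Hypothetical store interface: throws when the backing store is unreachable.
type ConsentStore = (msisdnHash: string) => StoreHit;

interface CheckConsentReply {
  allowed: boolean;
  reason: string;
  failClosedCause?: string; // label value for consent_check_failclosed_total
}

function checkConsentWithFallback(
  msisdnHash: string,
  redis: ConsentStore,
  postgres: ConsentStore,
): CheckConsentReply {
  let redisDown = false;
  try {
    const cached = redis(msisdnHash);
    if (cached.found) {
      return { allowed: cached.allowed, reason: "FROM_CACHE" };
    }
  } catch {
    redisDown = true; // FM-02: fall through to Postgres
  }
  try {
    const row = postgres(msisdnHash);
    // Assumption: a missing row is a known state (no consent on record).
    return { allowed: row.found && row.allowed, reason: row.found ? "FROM_DB" : "NO_RECORD" };
  } catch {
    // State cannot be determined: fail closed, never allowed: true.
    return {
      allowed: false,
      reason: "CONSENT_UNKNOWN",
      failClosedCause: redisDown ? "redis_and_db_unavailable" : "db_unavailable",
    };
  }
}
```

The only path that returns allowed: true is one where a store actually answered, which is the property the fail-closed principle requires.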
FM-04 — STOP-MO consumer lag
Scenario: sms.mo.inbound consumer (UC-StopKeywordHandler) backs up. Subscriber STOPs are not honoured promptly — regulator risk: subscribers may complain, and the regulator may interpret latency as wilful non-compliance.
Detection:
- consent_stop_mo_consumer_lag_seconds > 60 for 5 min → ConsentStopConsumerLag HIGH
- NATS num_pending on the consent-mo consumer rises
Impact:
- Subscriber-typed STOPs honoured with delay (60 s+).
- Ack-back SMS delayed correspondingly.
- During the lag window, a tenant may legally have sent another marketing message after the subscriber's STOP.
Recovery:
- Scale consent-ledger-service consumer replicas (NATS pull consumer; concurrent pull count tunable).
- Identify and isolate poison messages (rare; typically from SMPP-MO encoding edge cases).
- After clearing lag, run a "retroactive revoke" script that re-applies revocations as of the original MO timestamp.
Mitigation:
- HPA includes a custom metric for consumer lag; auto-scales.
- Per-MO budget 1.5 s end-to-end keeps individual processing fast.
- Dead-letter subject sms.mo.deadletter catches structurally-broken messages without halting the consumer.
Runbook: stop-consumer-lag.md
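The "retroactive revoke" step in the FM-04 recovery can be illustrated with a small pure function. This is a sketch, not the actual script: the record shapes and field names are hypothetical. It dates the revocation at the original MO timestamp and lists the outbound sends that fell inside the lag window.

```typescript
// Hypothetical record shapes for illustration.
interface StopMo {
  msisdnHash: string;
  receivedAtMs: number; // original MO timestamp
  processedAtMs: number; // when the lagging consumer finally handled it
}

interface OutboundSend {
  messageId: string;
  msisdnHash: string;
  sentAtMs: number;
}

interface RetroactiveRevocation {
  msisdnHash: string;
  effectiveAtMs: number;
  sentAfterStop: OutboundSend[];
}

function retroactiveRevoke(stop: StopMo, sends: OutboundSend[]): RetroactiveRevocation {
  // Sends to this subscriber between the STOP arriving and it being processed.
  const sentAfterStop = sends.filter(
    (s) =>
      s.msisdnHash === stop.msisdnHash &&
      s.sentAtMs >= stop.receivedAtMs &&
      s.sentAtMs < stop.processedAtMs,
  );
  // The revocation takes effect at the MO timestamp, not at processing time.
  return { msisdnHash: stop.msisdnHash, effectiveAtMs: stop.receivedAtMs, sentAfterStop };
}
```

The sentAfterStop list is the evidence set for any later complaint about messages delivered during the lag window.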
FM-05 — ATRA DND endpoint unreachable
Scenario: ATRA SFTP / HTTPS endpoint is down; daily sync worker fails.
Detection:
- consent_dnd_sync_runs_total{outcome="failed"} increments
- consent_dnd_sync_last_success_timestamp_seconds stops updating
- ConsentDndStale HIGH after 24 h
Impact:
- DND registry not refreshed. Newly ATRA-listed MSISDNs are not blocked until sync resumes; MSISDNs removed from the list since the last sync stay blocked.
- Existing DND state continues to enforce — service keeps running on last-known data.
Recovery:
- Manual fetch via POST /v1/admin/consent/dnd/resync once endpoint returns.
- ATRA NOC liaison contact in runbook; escalation path includes regulator liaison if outage > 48 h.
Mitigation:
- dnd.registry.synced.v1 event freshness monitored by regulator-portal-service.
- Dual-fetch attempt: SFTP primary + HTTPS secondary if ATRA exposes both.
Runbook: dnd-stale.md
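The FM-05 staleness thresholds can be expressed as a classification over consent_dnd_sync_last_success_timestamp_seconds. A minimal sketch: the function and result names are hypothetical, and treating time since the last successful sync as a proxy for outage duration is an assumption.

```typescript
type DndFreshness = "FRESH" | "STALE_ALERT" | "ESCALATE_TO_REGULATOR_LIAISON";

const DND_STALE_AFTER_S = 24 * 3600; // ConsentDndStale HIGH fires after 24 h
const DND_ESCALATE_AFTER_S = 48 * 3600; // regulator liaison joins the escalation path

function classifyDndFreshness(lastSuccessEpochS: number, nowEpochS: number): DndFreshness {
  const ageS = nowEpochS - lastSuccessEpochS;
  if (ageS > DND_ESCALATE_AFTER_S) return "ESCALATE_TO_REGULATOR_LIAISON";
  if (ageS > DND_STALE_AFTER_S) return "STALE_ALERT";
  return "FRESH";
}
```

In every state the service keeps enforcing the last-known DND data; the classification only drives alerting and escalation.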
FM-06 — Audit hash-chain break
Scenario: Daily verifier detects a prev_hash mismatch — either tampering, a write-path bug, or an unrecovered partition replication anomaly.
Detection:
- consent_audit_chain_breaks_detected_total > 0 → ConsentAuditChainBroken CRITICAL pages T&S on-call
- consent.audit.chain_broken.v1 event published
Impact:
- Regulator-defensibility of the affected window is compromised. The platform must be able to explain what happened and why.
- Service continues operating; new rows continue to chain forward (but the broken link is recorded).
Recovery:
- Do NOT auto-resolve. Freeze writes to the affected partition (admin endpoint to set consent_audit_writes_paused=true flag).
- Snapshot the partition (pg_dump) and the corresponding S3 archive (if any).
- Cross-check with the synchronous standby region's copy (independent observers).
- Determine cause: bug, replication lag, tampering. Investigation is owned by Trust & Safety + Security.
- Forensic report to regulator within 48 h regardless of root cause.
- Resume writes only after sign-off.
Mitigation:
- Cross-region verifier runs independently in Kabul and Herat; divergent results trigger immediate freeze.
- Hash key rotation supported via signing_key_id so a key compromise can be contained without invalidating older rows.
- Strict DB-level append-only rules (Postgres rule) make tampering hard to begin with.
Runbook: audit-chain-broken.md — TS-CRITICAL
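The prev_hash mismatch that the daily verifier looks for can be illustrated with a linkage check. This sketch only compares each row's prev_hash with the preceding row's record_hash; it does not recompute hashes or check signatures, and the row shape beyond prev_hash and record_hash (including seq) is hypothetical.

```typescript
interface AuditRow {
  seq: number; // hypothetical ordering column within a partition
  prev_hash: string;
  record_hash: string;
}

// Returns the seq of every row whose prev_hash does not match the
// record_hash of the row before it. Rows must be ordered by seq.
function findChainBreaks(rows: AuditRow[]): number[] {
  const breaks: number[] = [];
  for (let i = 1; i < rows.length; i++) {
    if (rows[i].prev_hash !== rows[i - 1].record_hash) {
      breaks.push(rows[i].seq);
    }
  }
  return breaks;
}
```

A non-empty result corresponds to consent_audit_chain_breaks_detected_total > 0 and should lead into the manual freeze-and-investigate steps, never an automatic repair.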
FM-07 — Erasure processor failure (SLA breach)
Scenario: ErasureProcessor worker fails repeatedly; some erasure requests pass their 30-day SLA.
Detection:
- consent_erasure_sla_breach_total > 0 → ConsentErasureSLABreach MEDIUM (also escalates Legal)
- Worker error log spike
Impact:
- GDPR-equivalent regulatory breach. Citizen has not received the deletion they were promised.
Recovery:
- Manual run: POST /v1/admin/consent/erasure-requests/{erasureId}/process.
- If repeated failures point to a code bug, hotfix and rerun.
- For the affected citizen: written confirmation + apology + completion confirmation within 24 h of detection.
- Legal team notified; if regulator notification is required (Afghan authority interpretation), follow that path.
Mitigation:
- Worker has retry-with-backoff for transient PG errors.
- Daily Slack/Email digest to Legal of pending erasures and SLA proximity.
Runbook: erasure-sla-breach.md
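The daily digest of pending erasures and SLA proximity could be computed along these lines. A sketch under assumptions: the record shape and function name are hypothetical, and the 30-day SLA is measured from the request timestamp.

```typescript
const ERASURE_SLA_DAYS = 30;
const ERASURE_DAY_MS = 24 * 60 * 60 * 1000;

interface PendingErasure {
  erasureId: string;
  requestedAtMs: number;
}

interface ErasureDigestLine {
  erasureId: string;
  daysRemaining: number; // negative once the SLA has been breached
  breached: boolean;
}

// Builds digest lines sorted with the most urgent request first.
function buildErasureDigest(pending: PendingErasure[], nowMs: number): ErasureDigestLine[] {
  return pending
    .map((r) => {
      const deadlineMs = r.requestedAtMs + ERASURE_SLA_DAYS * ERASURE_DAY_MS;
      const remainingMs = deadlineMs - nowMs;
      return {
        erasureId: r.erasureId,
        daysRemaining: Math.floor(remainingMs / ERASURE_DAY_MS),
        breached: remainingMs < 0,
      };
    })
    .sort((a, b) => a.daysRemaining - b.daysRemaining);
}
```

Sorting by days remaining puts breached and near-breach requests at the top of what Legal sees each day.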
FM-08 — STOP-keyword false-positive flood
Scenario: A pattern of legitimate non-STOP messages is being matched as STOP, revoking consent erroneously and triggering tenant complaints.
Detection:
- Tenant complaints via POST /v1/consent/feedback/false-positive
- Spike in consent_stop_mo_received_total{match_outcome="matched"} per language vs baseline
- compliance.tenant.tier.changed events away from CLEAR for affected tenants
Impact:
- Subscribers wrongly opted out of legitimate marketing/transactional flows.
- Tenant support escalations.
Recovery:
- T&S triage: identify the responsible keyword.
- Soft-delete the offending tenant-added keyword (platform defaults are sealed; if a default is at fault, T&S adds an exception list rather than removing the default).
- Re-grant consent for affected MSISDNs via tenant RecordConsent with verificationMethod = TENANT_API and a feedback reference. Audit row indicates the reason.
Mitigation:
- The keyword catalog admin requires audit log + dual review for sensitive changes.
- AI keyword-suggester (see AI_INTEGRATION §2) is HITL only — no auto-add.
- Conformance test (200 messages × 4 languages) blocks merges that lower precision.
Runbook: stop-keyword-fp-flood.md
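The precision gate in the conformance test can be sketched as follows. This is an illustration, not the real test suite: the matcher signature and corpus shape are hypothetical, and it computes one overall precision figure rather than a per-language one.

```typescript
interface LabelledMessage {
  text: string;
  isStop: boolean; // ground-truth label
}

type StopMatcher = (text: string) => boolean;

// Precision = true positives / all messages the matcher flagged as STOP.
// A matcher that flags nothing is treated as precision 1 (no false positives).
function stopPrecision(matcher: StopMatcher, corpus: LabelledMessage[]): number {
  let truePositives = 0;
  let falsePositives = 0;
  for (const msg of corpus) {
    if (!matcher(msg.text)) continue;
    if (msg.isStop) {
      truePositives++;
    } else {
      falsePositives++;
    }
  }
  const flagged = truePositives + falsePositives;
  return flagged === 0 ? 1 : truePositives / flagged;
}

// A candidate keyword change is rejected if it lowers precision.
function blocksMerge(baseline: StopMatcher, candidate: StopMatcher, corpus: LabelledMessage[]): boolean {
  return stopPrecision(candidate, corpus) < stopPrecision(baseline, corpus);
}
```

For instance, a candidate that matches "stop" anywhere in the text would flag "Meet me at the bus stop" and lose precision against an exact-match baseline, so the merge would be blocked.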
FM-09 — Vault Transit (KEK) unavailable
Scenario: Vault Transit endpoint cannot unwrap DEKs; encryption-at-rest functions blocked.
Detection:
- consent_vault_unwrap_errors_total rate up
- ConsentVaultUnwrapErrors HIGH alert
Impact:
- New RecordConsent writes fail (cannot encrypt MSISDN field).
- Reads of msisdn_encrypted (admin/regulator handlers) fail.
- CheckConsent is unaffected: it never decrypts; it uses msisdn_hash.
Recovery:
- Restore Vault.
- Cached DEKs cover ≤ 60 s of outage transparently.
- After 60 s, queued write requests retry by client backoff.
Mitigation:
- Vault HA (active + standby) in Kabul + Herat.
- DEK cache in-process with bounded TTL.
Runbook: vault-unwrap-errors.md
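The bounded-TTL DEK cache explains why a Vault outage of up to 60 s is transparent. A minimal sketch, assuming a hypothetical unwrap callback that throws when Vault Transit is unreachable; the DEK is a plain string here purely for illustration, and the class name is not from the codebase.

```typescript
const DEK_CACHE_TTL_MS = 60000;

// Hypothetical callback standing in for the Vault Transit unwrap call.
type UnwrapDek = (keyId: string) => string;

class DekCache {
  private entries: Record<string, { dek: string; expiresAtMs: number }> = {};

  get(keyId: string, nowMs: number, unwrap: UnwrapDek): string {
    const hit = this.entries[keyId];
    if (hit !== undefined && nowMs < hit.expiresAtMs) {
      return hit.dek; // served from memory without contacting Vault
    }
    // Cache empty or expired: Vault must answer, otherwise the error propagates
    // and the write fails (clients then retry with their own backoff).
    const dek = unwrap(keyId);
    this.entries[keyId] = { dek, expiresAtMs: nowMs + DEK_CACHE_TTL_MS };
    return dek;
  }
}
```

Keeping the TTL short bounds how long an unwrapped key lives in process memory, at the cost of tolerating only brief Vault outages.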
FM-10 — Outbox publish stuck
Scenario: NATS publish failures or poison messages cause outbox rows to accumulate.
Detection:
- consent_outbox_unpublished_count > 1000 or consent_outbox_oldest_unpublished_age_seconds > 60 → ConsentOutboxBacklog HIGH
Impact:
- Downstream consumers (notification-service, regulator-portal-service SIEM) miss events.
- Database state remains correct (outbox is post-commit).
Recovery:
- Investigate NATS health.
- Manual outbox-replay script with offset cursor.
- Quarantine poison rows (move to consent.outbox_quarantine) for forensic review.
Mitigation:
- 3 outbox-relay replicas with FOR UPDATE SKIP LOCKED.
- NATS streams replicated 3-way.
Runbook: outbox-backlog.md
FM-11 — Citizen-portal MSISDN-OTP abuse / takeover
Scenario: Attacker tries to brute-force OTP, request OTP for many MSISDNs, or use SIM-swap to take over a MSISDN.
Detection:
- consent_citizen_otp_requested_total rate per IP / per MSISDN exceeds threshold
- ConsentCitizenOtpAbuse MEDIUM at 100 OTP requests/hour total
Impact:
- Mass OTP delivery cost; degraded user trust; potential account takeover.
Recovery:
- Block source IPs at Kong (geographic / ASN heuristics).
- Temporarily increase OTP captcha difficulty.
- For SIM-swap claims: route to support; require KYC.
Mitigation:
- OTP rate limits: 5 per MSISDN per hour and 5 per IP per hour.
- Captcha on OTP request.
- Citizen JWT lifetime 15 min; revoke immediately on logout.
- OTP entropy ≥ 6 digits; lifetime 5 min.
- Audit row on every citizen view (CITIZEN_INSPECTION_VIEW) gives a forensic trail.
Runbook: citizen-otp-abuse.md
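The two OTP rate limits (5 per MSISDN per hour, 5 per IP per hour) can be sketched as a sliding-window limiter. This is an in-memory illustration with hypothetical names; a real deployment would need shared state across replicas, and the choice not to count denied requests toward the limits is an assumption.

```typescript
const OTP_LIMIT_PER_HOUR = 5;
const OTP_WINDOW_MS = 60 * 60 * 1000;

class OtpRateLimiter {
  private hits: Record<string, number[]> = {};

  // Returns true and records the request only if BOTH the MSISDN and the
  // source IP are under their hourly limit.
  tryRequest(msisdn: string, ip: string, nowMs: number): boolean {
    const keys = ["msisdn:" + msisdn, "ip:" + ip];
    for (const key of keys) {
      const recent = (this.hits[key] || []).filter((t) => nowMs - t < OTP_WINDOW_MS);
      this.hits[key] = recent; // drop timestamps older than one hour
      if (recent.length >= OTP_LIMIT_PER_HOUR) return false;
    }
    for (const key of keys) {
      this.hits[key].push(nowMs);
    }
    return true;
  }
}
```

Checking both keys before recording either keeps a denied request from consuming quota on the key that was still within its limit.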
FM-12 — Bulk-import abuse (fake opt-ins)
Scenario: A tenant uploads bulk consent records for MSISDNs they do not legitimately have permission for.
Detection:
- Volume anomaly: tenant's bulk-import volume vs prior baseline
- Subscriber complaints reaching the regulator → cross-checked against bulk-import audit (CSV hash, captured_at)
Impact:
- Subscribers receive unwanted marketing; regulator complaint.
Recovery:
- Suspend the offending tenant's consent:write scope.
- Mass revoke for the affected MSISDNs (admin tool).
- Forensic audit of CSV provenance (the original CSV hash is on every audit row).
- Regulator escalation per platform terms.
Mitigation:
- Bulk-import audit-tagged with CSV hash.
- Daily volume anomaly check (consent_records_written_total{verification_method="BULK_IMPORT_ATTESTATION"} per tenant vs 30-day moving average).
- Tenant terms-of-service require attestation of lawful basis.
Runbook: bulk-import-abuse.md
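The daily volume anomaly check against a 30-day moving average could look like the sketch below. The function name and the multiplier are hypothetical; the document does not state what deviation counts as anomalous, so the threshold is left as a parameter.

```typescript
const BULK_BASELINE_DAYS = 30;

interface BulkImportCheck {
  baseline: number; // mean daily volume over the last 30 recorded days
  anomalous: boolean;
}

// dailyHistory holds one tenant's prior daily bulk-import counts, oldest first.
// With no history the baseline is 0, so any positive volume is flagged.
function checkBulkImportVolume(dailyHistory: number[], today: number, multiplier: number): BulkImportCheck {
  const recent = dailyHistory.slice(-BULK_BASELINE_DAYS);
  const total = recent.reduce((sum, v) => sum + v, 0);
  const baseline = recent.length === 0 ? 0 : total / recent.length;
  return { baseline, anomalous: today > multiplier * baseline };
}
```

A flagged tenant is a prompt for review against the CSV hash audit, not proof of abuse on its own.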
FM-13 — Cross-region replication divergence
Scenario: Kabul and Herat (or Mazar) Postgres replicas show different states for the same row, or audit chain hashes differ between regions.
Detection:
- Cross-region AuditChainVerifier cron compares per-partition record_hash of last row across regions
- Mismatch raises ConsentAuditChainBroken CRITICAL with region tags
Impact:
- Loss of confidence in regulator-defensibility.
- Possible failover to a wrong-state region.
Recovery:
- Same as FM-06: freeze, snapshot, investigate.
- Re-bootstrap the divergent replica from primary backup; replay WAL.
Mitigation:
- Synchronous replication to Herat.
- Per-region verifier independent of the primary's verifier.
- Patroni quorum requires 2 of 3 to agree on leader.
Runbook: replication-divergence.md
FM-14 — NetworkPolicy mis-config exposes egress to offshore
Scenario: A bad merge to NetworkPolicy or Istio AuthorizationPolicy adds an egress rule that allows traffic to non-Afghan IPs — a residency-violation incident.
Detection:
- Deploy-time residency test (tests/residency/consent_residency.spec.ts) fails
- Runtime consent_residency_violation_total (Istio access-log-derived metric) > 0
Impact:
- Critical regulatory incident: data residency invariant broken.
Recovery:
- Roll back the offending change immediately.
- Snapshot Istio access logs to determine if any actual traffic flowed offshore.
- If yes: regulator notification; per-tenant data-export disclosure.
Mitigation:
- Deploy-time residency test mandatory in pipeline.
- Istio AuthorizationPolicy with default-deny egress.
- NetworkPolicy in deny-by-default cluster posture (per kubernetes-cluster-posture.md).
- Vault PKI doesn't issue certs for non-platform identities.
Runbook: residency-violation.md — REGULATOR-PRIORITY
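One way a deploy-time residency check could work is to verify that every egress CIDR in a rendered policy falls inside an allow-list of in-country ranges. The sketch below is an assumption about the approach, not the contents of consent_residency.spec.ts: the function names are hypothetical, it handles IPv4 only, and the ranges in the usage note are placeholder documentation and private ranges rather than real Afghan allocations.

```typescript
// Converts dotted-quad IPv4 to an unsigned integer using plain arithmetic
// (avoids 32-bit signed overflow from bitwise operators).
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  return ((parts[0] * 256 + parts[1]) * 256 + parts[2]) * 256 + parts[3];
}

function cidrRange(cidr: string): { start: number; end: number } {
  const [ip, bits] = cidr.split("/");
  const size = Math.pow(2, 32 - Number(bits));
  const start = Math.floor(ipv4ToInt(ip) / size) * size;
  return { start, end: start + size - 1 };
}

// Returns every egress CIDR that is not fully contained in an allowed range.
function findOffshoreEgress(egressCidrs: string[], allowedCidrs: string[]): string[] {
  const allowed = allowedCidrs.map(cidrRange);
  return egressCidrs.filter((cidr) => {
    const r = cidrRange(cidr);
    return !allowed.some((a) => r.start >= a.start && r.end <= a.end);
  });
}
```

With an allow-list of 10.0.0.0/8 and 192.0.2.0/24, an egress rule for 203.0.113.0/24 or 0.0.0.0/0 would be reported, and a pipeline step could fail the deploy on any non-empty result.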
4. Tenant impact matrix
| Failure | Tenant-portal view |
|---|---|
| Brief CheckConsent stall (< 1 min) | None visible |
| Extended CheckConsent fail-closed | Outbound messages delayed; eventually DEAD_LETTER with reason "Consent system temporarily unavailable. Retry available." |
| STOP-MO lag | None (subscribers see delayed STOP honouring; subscriber complaints flow back via support) |
| ATRA DND stale | None visible (verdicts continue) |
| Audit chain break | None visible to tenant; admin alert; regulator notification |
| Erasure SLA breach | Citizen-portal escalation; legal notification |
| Bulk-import abuse | Tenant suspension visible in tenant portal; appeal flow |
5. Graceful degradation summary
Full operation:
Cache hit → ALLOWED/BLOCKED in ≤ 5 ms
Redis unavailable:
PG-direct → ALLOWED/BLOCKED in ≤ 20 ms; PG load up
PG primary down (standby promotes):
Brief gap (≤ 90 s) of CONSENT_UNKNOWN on cache misses; then resumed
Both Redis + PG unavailable:
100% CONSENT_UNKNOWN; compliance-engine BLOCKs all non-emergency
ATRA DND fetch failed:
Run on last-known DND; alert after 24 h
STOP MO consumer lagging:
Subscriber STOPs honoured with delay; alert on lag > 60 s
Audit chain break:
Service continues; CRITICAL alert; freeze + investigate (manual)
Vault Transit unavailable:
CheckConsent unaffected; new writes 503 after 60 s DEK cache
Outbox publish stuck:
DB state correct; downstream events delayed
Citizen OTP abuse:
Rate limit + captcha throttle; alert at 100/hr
Cross-region divergence:
CRITICAL freeze; investigate; re-bootstrap divergent replica