Consent Ledger Service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust & Safety / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Operating principle: fail-closed

consent-ledger-service operates fail-closed for CheckConsent:

No allowed: true is returned when consent state cannot be determined.

If state cannot be determined for any reason, CheckConsent returns { allowed: false, reason: CONSENT_UNKNOWN }. Downstream callers (compliance-engine, routing-engine, sms-firewall-service) interpret any non-OK gRPC status as allowed: false for non-emergency lanes (defence in depth).

The only exception is lane P0_EMERGENCY (Common Alerting Protocol bridge), which may proceed with an audit row noting the bypass — this is the regulator-sanctioned trade-off for life-safety messaging.
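
The caller-side rule can be illustrated with a minimal sketch. It assumes the CheckConsent response can be modelled as a dict with allowed and reason keys, that any raised exception stands in for a non-OK gRPC status, and that the emergency bypass applies only when state is unknown. The Decision type, the dispatch_decision function, the MARKETING lane name, and the EMERGENCY_BYPASS audit note are hypothetical names for illustration, not part of the service contract.

```python
from dataclasses import dataclass
from typing import Callable, Optional

EMERGENCY_LANE = "P0_EMERGENCY"


@dataclass(frozen=True)
class Decision:
    dispatch: bool
    reason: str
    audit_note: Optional[str] = None  # set only when the emergency bypass is used


def dispatch_decision(check_consent: Callable[[], dict], lane: str) -> Decision:
    """Decide whether a caller may dispatch, failing closed on unknown state."""
    try:
        response = check_consent()
    except Exception:
        # Stand-in for any non-OK gRPC status: treated as "state unknown".
        response = {"allowed": False, "reason": "CONSENT_UNKNOWN"}

    if response.get("allowed") is True:
        return Decision(True, response.get("reason", "ALLOWED"))

    reason = response.get("reason", "CONSENT_UNKNOWN")
    if reason == "CONSENT_UNKNOWN" and lane == EMERGENCY_LANE:
        # Life-safety lane proceeds, and the bypass is noted for the audit row.
        return Decision(True, reason, audit_note="EMERGENCY_BYPASS")
    return Decision(False, reason)
```

How an explicit, determined block interacts with the emergency lane is not specified in this section, so the sketch leaves such verdicts untouched.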

2. Failure-mode summary

#	Failure	Likelihood	Impact	Rating	Mitigation summary
FM-01	Postgres unavailable (primary + standbys)	Low	Critical	HIGH	Fail-closed; promote standby; redelivery upstream
FM-02	Redis unavailable	Medium	Medium	MEDIUM	Fall back to Postgres; degraded latency
FM-03	Both Redis and Postgres unavailable	Very Low	Critical	HIGH	Fail-closed; CONSENT_UNKNOWN; alert
FM-04	NATS lag on `sms.mo.inbound` consumer	Medium	High (regulator risk)	HIGH	Scale consumers; alert at 60 s lag; runbook
FM-05	ATRA DND endpoint unreachable	Medium	Medium	MEDIUM	Use last-known DND; alert after 24 h stale
FM-06	Audit hash-chain break detected	Very Low	Critical	HIGH	Freeze writes (manual), investigate, never auto-resolve
FM-07	Erasure processor failure (SLA breach)	Low	High (regulatory)	MEDIUM	Manual reprocess; legal escalation
FM-08	STOP-keyword false-positive flood	Medium	Medium	MEDIUM	T&S triage; rapid keyword tuning; tenant feedback API
FM-09	Vault Transit (KEK) unavailable	Low	Medium	MEDIUM	DEK cache (60 s); CheckConsent unaffected
FM-10	Outbox publish stuck	Low	Medium	MEDIUM	Retry; manual replay; alert
FM-11	Citizen-portal MSISDN-OTP abuse / takeover	Medium	High	HIGH	Rate limits; captcha; OTP entropy; short JWT lifetime
FM-12	Bulk import abuse (fake opt-ins)	Medium	High	HIGH	CSV hash audit; volume anomaly detection; regulator dispute path
FM-13	Cross-region replication divergence	Low	Critical	HIGH	Cross-region verifier; CRITICAL alert; freeze writes
FM-14	NetworkPolicy mis-config exposes egress to offshore	Very Low	Critical	HIGH	Deploy-time residency test; Istio AuthorizationPolicy

3. Detailed failure modes

FM-01 — Postgres unavailable

Scenario: The Postgres primary is unreachable; until standby promotion completes, every read and write path that reaches Postgres fails.

Detection:

consent_check_failclosed_total{cause="db_unavailable"} rises
/health/ready returns 503
Patroni / etcd events; Prometheus alert PostgresPrimaryDown

Impact:

CheckConsent cache hits continue (Redis-only); cache misses fall through to fail-closed CONSENT_UNKNOWN.
Writes fail with 503; tenant integrations retry (per their own backoff).
Audit chain pauses.

Recovery:

Patroni auto-promotes Herat standby within 90 s (RTO); clients reconnect via Service VIP.
If promotion fails (e.g., split-brain), on-call follows runbook pg-primary-down.md: confirm consensus, manually trigger promotion in surviving region, verify chain integrity post-recovery.
Caller-side: NATS redelivery upstream means messages are not lost; they queue.

Mitigation:

Patroni 3-node cluster across Kabul / Herat / Mazar (per ADR-0004 §3).
PgBouncer in transaction mode shields per-pod connection churn.
Circuit breaker on the PG client opens after 5 consecutive errors within 30 s and moves to half-open after 60 s.
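
The breaker behaviour can be sketched as follows. This is a minimal illustration, not the service's implementation: it reads the thresholds as 5 consecutive errors inside a 30 s window (one interpretation of the figures above), and the CircuitBreaker class, its method names, and the injectable clock are hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=5, window_s=30.0, cooldown_s=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.error_times = []  # timestamps of the current run of consecutive errors
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"  # one trial request may be attempted
        return "open"

    def allow_request(self):
        return self.state() != "open"

    def record_success(self):
        # Any success ends the run of consecutive errors and closes the breaker.
        self.error_times.clear()
        self.opened_at = None

    def record_failure(self):
        now = self.clock()
        if self.state() == "half_open":
            self.opened_at = now  # trial request failed: re-open for another cooldown
            return
        # Keep only errors that still fall inside the window, then add this one.
        self.error_times = [t for t in self.error_times if now - t <= self.window_s]
        self.error_times.append(now)
        if len(self.error_times) >= self.threshold:
            self.opened_at = now
```

Injecting the clock keeps the timing logic testable without sleeping.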

Runbook: pg-primary-down.md

FM-02 — Redis unavailable

Scenario: Redis cluster unreachable (network partition, full restart, etc.).

Detection:

consent_redis_op_duration_seconds errors; connection error logs
consent_check_cache_misses_total near-100% for the duration

Impact:

CheckConsent falls through to Postgres for every call. Latency jumps from ≤ 5 ms to ≤ 20 ms; PG load triples.
Redis distributed locks for cron workers cannot be acquired — workers skip the cycle and emit consent_worker_skipped_no_lock_total.

Recovery:

Redis cluster recovers; service does not need restart.
Cache rewarms naturally (TTL-driven); explicit consent-cache-warmer cron run accelerates if needed.

Mitigation:

HPA may scale up consent-ledger-service replicas under elevated PG load.
Postgres replica pool is sized to sustain 5,000 RPS with no cache; a load test verifies this with 80% headroom.
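
The degradation order described above (cache, then Postgres, then fail-closed) might look like the sketch below. The resolve_consent function and its cache_get / db_get callables are hypothetical stand-ins for the Redis and Postgres clients, and ConnectionError stands in for whatever errors those clients actually raise.

```python
from typing import Callable, Optional, Tuple


def resolve_consent(
    msisdn_hash: str,
    cache_get: Callable[[str], Optional[bool]],
    db_get: Callable[[str], bool],
) -> Tuple[dict, str]:
    """Return (verdict, source); source is 'cache', 'postgres' or 'fail_closed'."""
    try:
        cached = cache_get(msisdn_hash)  # None means a cache miss
        if cached is not None:
            return {"allowed": cached}, "cache"
    except ConnectionError:
        pass  # Redis unreachable: fall through to Postgres (FM-02)

    try:
        stored = db_get(msisdn_hash)
    except ConnectionError:
        # Postgres unreachable too (FM-01 / FM-03): state cannot be determined.
        return {"allowed": False, "reason": "CONSENT_UNKNOWN"}, "fail_closed"
    return {"allowed": stored}, "postgres"
```

Note that a cache hit is still served while Postgres is down, which matches the FM-01 impact description.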

Runbook: redis-down.md

FM-03 — Both Redis and Postgres unavailable

Scenario: Catastrophic dual-dependency outage (data centre power, etc.).

Detection:

consent_check_failclosed_total{cause="redis_and_db_unavailable"} rises sharply
ConsentCheckFailclosedSurge CRITICAL alert pages

Impact:

100% of CheckConsent returns CONSENT_UNKNOWN. compliance-engine blocks all non-emergency outbound. Dispatch effectively halted — by design.
Tenants see messages move to DEAD_LETTER once redelivery is exhausted (about 3 minutes per message).
Lane=P0_EMERGENCY messages still flow with audit-noted bypass.

Recovery:

Restore one or both dependencies. Service immediately starts serving allowed:true/false again.
Retry pipeline replays queued messages.

Mitigation:

PG (Patroni) and Redis (cluster) run in independent failure domains, so that a single component failure cannot take down both.
Pre-positioned standby in Herat region for PG cross-region failover.

Runbook: checkconsent-failclosed.md — PRIORITY

FM-04 — STOP-MO consumer lag

Scenario: sms.mo.inbound consumer (UC-StopKeywordHandler) backs up. Subscriber STOPs are not honoured promptly — regulator risk: subscribers may complain, and the regulator may interpret latency as wilful non-compliance.

Detection:

consent_stop_mo_consumer_lag_seconds > 60 for 5 min → ConsentStopConsumerLag HIGH
NATS num_pending on the consent-mo consumer rises

Impact:

Subscriber-typed STOPs honoured with delay (60 s+).
Ack-back SMS delayed correspondingly.
During the lag window, a tenant may legally have sent another marketing message after the subscriber's STOP.

Recovery:

Scale consent-ledger-service consumer replicas (NATS pull consumer; concurrent pull count tunable).
Identify and isolate poison messages (rare; typically from SMPP-MO encoding edge cases).
After clearing lag, run a "retroactive revoke" script that re-applies revocations as of the original MO timestamp.
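
What a "retroactive revoke" pass could compute is sketched below, under stated assumptions: the real script is not described here, so the StopEvent and Revocation types, both function names, and the (msisdn_hash, sent_at) send-log shape are all hypothetical. The idea shown is only that the effective revocation time is the earliest MO timestamp, not the time the lagging consumer processed it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class StopEvent:
    msisdn_hash: str
    mo_received_at: datetime  # when the subscriber's STOP arrived


@dataclass(frozen=True)
class Revocation:
    msisdn_hash: str
    revoked_at: datetime


def retroactive_revocations(stops: Iterable[StopEvent]) -> List[Revocation]:
    """Backdate each revocation to the earliest MO timestamp per subscriber."""
    earliest = {}
    for stop in stops:
        known = earliest.get(stop.msisdn_hash)
        if known is None or stop.mo_received_at < known:
            earliest[stop.msisdn_hash] = stop.mo_received_at
    return [Revocation(h, ts) for h, ts in sorted(earliest.items())]


def sends_after_stop(
    revocations: Iterable[Revocation], sends: Iterable[Tuple[str, datetime]]
) -> List[Tuple[str, datetime]]:
    """List sends that went out at or after the subscriber's effective STOP time."""
    revoked_at = {r.msisdn_hash: r.revoked_at for r in revocations}
    return [
        (h, sent_at)
        for h, sent_at in sends
        if h in revoked_at and sent_at >= revoked_at[h]
    ]
```

The second helper gives the list of messages sent during the lag window, which is the evidence a regulator query would most likely ask for.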

Mitigation:

HPA includes a custom metric for consumer lag; auto-scales.
Per-MO budget 1.5 s end-to-end keeps individual processing fast.
Dead-letter subject sms.mo.deadletter catches structurally-broken messages without halting the consumer.

Runbook: stop-consumer-lag.md

FM-05 — ATRA DND endpoint unreachable

Scenario: ATRA SFTP / HTTPS endpoint is down; daily sync worker fails.

Detection:

consent_dnd_sync_runs_total{outcome="failed"} increments
consent_dnd_sync_last_success_timestamp_seconds stops updating
ConsentDndStale HIGH after 24 h
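
The staleness thresholds in this failure mode (alert after 24 h, regulator-liaison escalation once the outage exceeds 48 h) reduce to a simple age check on the last successful sync. The sketch below is illustrative only; dnd_sync_status and its return labels are hypothetical, and the real alert is presumably a Prometheus rule over consent_dnd_sync_last_success_timestamp_seconds.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)     # ConsentDndStale threshold
ESCALATE_AFTER = timedelta(hours=48)  # regulator-liaison escalation threshold


def dnd_sync_status(last_success: datetime, now: datetime) -> str:
    """Classify DND freshness from the age of the last successful sync."""
    age = now - last_success
    if age > ESCALATE_AFTER:
        return "ESCALATE"
    if age > STALE_AFTER:
        return "STALE"
    return "FRESH"
```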

Impact:

DND registry is not refreshed: MSISDNs newly listed by ATRA are not blocked until sync resumes, and MSISDNs removed from the list since the last sync stay blocked.
Existing DND state continues to enforce — service keeps running on last-known data.

Recovery:

Manual fetch via POST /v1/admin/consent/dnd/resync once endpoint returns.
ATRA NOC liaison contact in runbook; escalation path includes regulator liaison if outage > 48 h.

Mitigation:

dnd.registry.synced.v1 event freshness monitored by regulator-portal-service.
Dual-fetch attempt: SFTP primary + HTTPS secondary if ATRA exposes both.

Runbook: dnd-stale.md

FM-06 — Audit hash-chain break

Scenario: The daily verifier detects a prev_hash mismatch caused by tampering, a write-path bug, or an unrecovered partition replication anomaly.

Detection:

consent_audit_chain_breaks_detected_total > 0 → ConsentAuditChainBroken CRITICAL pages T&S on-call
consent.audit.chain_broken.v1 event published

Impact:

Regulator-defensibility of the affected window is compromised. The platform must be able to explain what broke and why.
Service continues operating; new rows continue to chain forward (but the broken link is recorded).

Recovery:

Do NOT auto-resolve. Freeze writes to the affected partition (admin endpoint to set consent_audit_writes_paused=true flag).
Snapshot the partition (pg_dump) and the corresponding S3 archive (if any).
Cross-check with the synchronous standby region's copy (independent observers).
Determine cause: bug, replication lag, tampering. Investigation is owned by Trust & Safety + Security.
Forensic report to regulator within 48 h regardless of root cause.
Resume writes only after sign-off.

Mitigation:

Cross-region verifier runs independently in Kabul and Herat; divergent results trigger immediate freeze.
Hash key rotation supported via signing_key_id so a key compromise can be contained without invalidating older rows.
Strict DB-level append-only rules (Postgres rule) make tampering hard to begin with.
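
A chain verifier of the kind described can be sketched as below. The prev_hash and record_hash field names come from this document; everything else is an assumption made for illustration: the hash is plain SHA-256 over prev_hash plus a payload string, the genesis value is 64 zeros, and keyed signing via signing_key_id is omitted. The real row format and hashing scheme may differ.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # assumed prev_hash of the first row


def record_hash(prev_hash: str, payload: str) -> str:
    """Hash a row together with its predecessor's hash (assumed scheme)."""
    return hashlib.sha256((prev_hash + "\n" + payload).encode("utf-8")).hexdigest()


def build_chain(payloads: List[str]) -> List[Dict[str, str]]:
    rows, prev = [], GENESIS
    for payload in payloads:
        digest = record_hash(prev, payload)
        rows.append({"prev_hash": prev, "payload": payload, "record_hash": digest})
        prev = digest
    return rows


def find_chain_breaks(rows: List[Dict[str, str]]) -> List[int]:
    """Return indexes of rows whose link or own hash does not verify."""
    breaks = []
    for i, row in enumerate(rows):
        if i > 0 and row["prev_hash"] != rows[i - 1]["record_hash"]:
            breaks.append(i)  # link to the previous row is broken
        elif row["record_hash"] != record_hash(row["prev_hash"], row["payload"]):
            breaks.append(i)  # row content no longer matches its stored hash
    return breaks
```

Editing a payload breaks that row's own hash; recomputing the hash to hide the edit moves the break to the next row's prev_hash link, which is why altering one row without rewriting every later row is detectable.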

Runbook: audit-chain-broken.md — TS-CRITICAL

FM-07 — Erasure processor failure (SLA breach)

Scenario: ErasureProcessor worker fails repeatedly; some erasure requests pass their 30-day SLA.

Detection:

consent_erasure_sla_breach_total > 0 → ConsentErasureSLABreach MEDIUM (also escalates to Legal)
Worker error log spike

Impact:

GDPR-equivalent regulatory breach. Citizen has not received the deletion they were promised.

Recovery:

Manual run: POST /v1/admin/consent/erasure-requests/{erasureId}/process.
If repeated failures point to a code bug, hotfix and rerun.
For the affected citizen: written confirmation + apology + completion confirmation within 24 h of detection.
Legal team notified; if regulator notification is required (Afghan authority interpretation), follow that path.

Mitigation:

Worker has retry-with-backoff for transient PG errors.
Daily Slack/Email digest to Legal of pending erasures and SLA proximity.
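
The SLA-proximity part of the digest could be computed as in the sketch below. Only the 30-day SLA comes from this document; the erasure_digest function, the 5-day warning margin, and the status labels are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

SLA_DAYS = 30


def days_to_sla(requested_on: date, today: date) -> int:
    """Days left before the request passes its SLA (negative once breached)."""
    return (requested_on + timedelta(days=SLA_DAYS) - today).days


def erasure_digest(
    pending: Dict[str, date], today: date, warn_within: int = 5
) -> List[Tuple[str, int, str]]:
    """Return (erasure_id, days_left, status) rows, most urgent first."""
    rows = []
    for erasure_id, requested_on in pending.items():
        left = days_to_sla(requested_on, today)
        if left < 0:
            status = "BREACHED"
        elif left <= warn_within:
            status = "AT_RISK"
        else:
            status = "OK"
        rows.append((erasure_id, left, status))
    return sorted(rows, key=lambda row: row[1])
```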

Runbook: erasure-sla-breach.md

FM-08 — STOP-keyword false-positive flood

Scenario: A pattern of legitimate non-STOP messages is being matched as STOP, revoking consent erroneously and triggering tenant complaints.

Detection:

Tenant complaints via POST /v1/consent/feedback/false-positive
Spike in consent_stop_mo_received_total{match_outcome="matched"} per language vs baseline
compliance.tenant.tier.changed events away from CLEAR for affected tenants

Impact:

Subscribers wrongly opted out of legitimate marketing/transactional flows.
Tenant support escalations.

Recovery:

T&S triage: identify the responsible keyword.
Soft-delete the offending tenant-added keyword (platform defaults are sealed; if a default is at fault, T&S adds an exception list rather than removing the default).
Re-grant consent for affected MSISDNs via tenant RecordConsent with verificationMethod = TENANT_API and a feedback reference. Audit row indicates the reason.

Mitigation:

The keyword catalog admin requires audit log + dual review for sensitive changes.
AI keyword-suggester (see AI_INTEGRATION §2) is HITL only — no auto-add.
Conformance test (200 messages × 4 languages) blocks merges that lower precision.
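
The merge gate implied by the conformance test can be sketched as a precision comparison between the current keyword set and a candidate, both run over the same labelled corpus (200 messages × 4 languages in the real suite). The function names and the (predicted_stop, actually_stop) pair format are hypothetical.

```python
from typing import Sequence, Tuple

Labelled = Sequence[Tuple[bool, bool]]  # (predicted_stop, actually_stop) per message


def precision(results: Labelled) -> float:
    """Share of messages matched as STOP that really were STOP requests."""
    true_pos = sum(1 for predicted, actual in results if predicted and actual)
    false_pos = sum(1 for predicted, actual in results if predicted and not actual)
    matched = true_pos + false_pos
    return true_pos / matched if matched else 1.0  # nothing matched: no false positives


def merge_allowed(baseline: Labelled, candidate: Labelled) -> bool:
    """Block a keyword change if it lowers precision on the same corpus."""
    return precision(candidate) >= precision(baseline)
```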

Runbook: stop-keyword-fp-flood.md

FM-09 — Vault Transit (KEK) unavailable

Scenario: The Vault Transit endpoint cannot unwrap DEKs; encryption-at-rest functions are blocked.

Detection:

consent_vault_unwrap_errors_total rate up
ConsentVaultUnwrapErrors HIGH alert

Impact:

New RecordConsent writes fail (cannot encrypt MSISDN field).
Reads of msisdn_encrypted (admin/regulator handlers) fail.
CheckConsent is unaffected — it never decrypts; it uses msisdn_hash.

Recovery:

Restore Vault.
Cached DEKs cover ≤ 60 s of outage transparently.
After 60 s, queued write requests retry by client backoff.

Mitigation:

Vault HA (active + standby) in Kabul + Herat.
DEK cache in-process with bounded TTL.
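
A bounded-TTL in-process cache of the kind mentioned can be sketched as follows. The 60 s TTL is from this document; the DekCache class, its unwrap callable (standing in for the Vault Transit call), and the injectable clock are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class DekCache:
    def __init__(self, unwrap: Callable[[str], bytes], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.unwrap = unwrap  # stand-in for the Vault Transit unwrap call
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def get(self, key_id: str) -> bytes:
        now = self.clock()
        entry = self._entries.get(key_id)
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]  # still fresh: no Vault round trip needed
        # Expired or missing: this call raises if Vault is unavailable.
        dek = self.unwrap(key_id)
        self._entries[key_id] = (dek, now)
        return dek
```

Because an expired entry is never served, an outage longer than the TTL surfaces as unwrap errors, which is the behaviour the recovery notes above rely on.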

Runbook: vault-unwrap-errors.md

FM-10 — Outbox publish stuck

Scenario: NATS publish failures or poison messages cause outbox rows to accumulate.

Detection:

consent_outbox_unpublished_count > 1000 or consent_outbox_oldest_unpublished_age_seconds > 60 → ConsentOutboxBacklog HIGH

Impact:

Downstream consumers (notification-service, regulator-portal-service SIEM) miss events.
Database state remains correct (outbox is post-commit).

Recovery:

Investigate NATS health.
Manual outbox-replay script with offset cursor.
Quarantine poison rows (move to consent.outbox_quarantine) for forensic review.

Mitigation:

3 outbox-relay replicas with FOR UPDATE SKIP LOCKED.
NATS streams replicated 3-way.

Runbook: outbox-backlog.md

FM-11 — Citizen-portal MSISDN-OTP abuse / takeover

Scenario: An attacker tries to brute-force the OTP, request OTPs for many MSISDNs, or use SIM-swap to take over an MSISDN.

Detection:

consent_citizen_otp_requested_total rate per IP / per MSISDN exceeds threshold
ConsentCitizenOtpAbuse MEDIUM at 100 OTP requests/hour total

Impact:

Mass OTP delivery cost; degraded user trust; potential account takeover.

Recovery:

Block source IPs at Kong (geographic / ASN heuristics).
Temporarily increase OTP captcha difficulty.
For SIM-swap claims: route to support; require KYC.

Mitigation:

OTP rate limits: 5 per MSISDN per hour and 5 per IP per hour.
Captcha on OTP request.
Citizen JWT lifetime 15 min; revoke immediately on logout.
OTP length ≥ 6 digits; lifetime 5 min.
Audit row on every citizen view (CITIZEN_INSPECTION_VIEW) gives a forensic trail.
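
The two OTP rate limits can be sketched as a sliding-window limiter keyed by both MSISDN and source IP. This is an in-memory illustration only; a production limiter would presumably live at the gateway or in a shared store. The OtpRateLimiter class and its method names are hypothetical.

```python
from collections import defaultdict, deque


class OtpRateLimiter:
    def __init__(self, limit: int = 5, window_s: float = 3600.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def _count(self, key, now: float) -> int:
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop requests that have left the window
        return len(hits)

    def allow(self, msisdn_hash: str, ip: str, now: float) -> bool:
        """Accept the OTP request only if both the MSISDN and the IP are under limit."""
        keys = (("msisdn", msisdn_hash), ("ip", ip))
        if any(self._count(key, now) >= self.limit for key in keys):
            return False
        for key in keys:
            self._hits[key].append(now)
        return True
```

Rejected requests are not counted, so a blocked client is not pushed further past the limit while it waits.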

Runbook: citizen-otp-abuse.md

FM-12 — Bulk-import abuse (fake opt-ins)

Scenario: A tenant uploads bulk consent records for MSISDNs for which it holds no legitimate consent.

Detection:

Volume anomaly: tenant's bulk-import volume vs prior baseline
Subscriber complaints reaching the regulator → cross-checked against bulk-import audit (CSV hash, captured_at)

Impact:

Subscribers receive unwanted marketing; regulator complaint.

Recovery:

Suspend the offending tenant's consent:write scope.
Mass revoke for the affected MSISDNs (admin tool).
Forensic audit of CSV provenance (the original CSV hash is on every audit row).
Regulator escalation per platform terms.

Mitigation:

Every bulk import is audit-tagged with the CSV hash.
Daily volume anomaly check (consent_records_written_total{verification_method="BULK_IMPORT_ATTESTATION"} per tenant vs 30-day moving average).
Tenant terms-of-service require attestation of lawful basis.
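
The daily volume check against a 30-day moving average can be sketched as below. The multiplier and the absolute floor are not given in this document, so factor=3.0 and floor=1000 are placeholder assumptions, as is the is_volume_anomaly name.

```python
from statistics import mean
from typing import Sequence


def is_volume_anomaly(
    daily_counts: Sequence[int], today: int, factor: float = 3.0, floor: int = 1000
) -> bool:
    """Flag today's bulk-import volume if it far exceeds the 30-day average.

    daily_counts holds the tenant's prior daily volumes, oldest first.
    The floor stops tiny tenants from alerting on trivially small spikes.
    """
    window = daily_counts[-30:]
    baseline = mean(window) if window else 0.0
    return today > max(baseline * factor, floor)
```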

Runbook: bulk-import-abuse.md

FM-13 — Cross-region replication divergence

Scenario: Kabul and Herat (or Mazar) Postgres replicas show different states for the same row, or audit chain hashes differ between regions.

Detection:

Cross-region AuditChainVerifier cron compares per-partition record_hash of last row across regions
Mismatch raises ConsentAuditChainBroken CRITICAL with region tags

Impact:

Loss of confidence in regulator-defensibility.
Possible failover to a wrong-state region.

Recovery:

Same as FM-06: freeze, snapshot, investigate.
Re-bootstrap the divergent replica from primary backup; replay WAL.

Mitigation:

Synchronous replication to Herat.
Per-region verifier independent of the primary's verifier.
Patroni quorum requires 2 of 3 to agree on leader.
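
The cross-region comparison described under Detection amounts to checking that every region reports the same last-row record_hash for each partition. The sketch below assumes the hashes have already been collected into a nested dict; the divergent_partitions function and the partition and region labels are hypothetical. It also assumes the regions are compared at the same chain position, since a replica that is merely lagging would otherwise look divergent.

```python
from typing import Dict, Optional


def divergent_partitions(
    region_hashes: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each divergent partition to the hash every region reported for it.

    region_hashes has the shape {region: {partition: last_row_record_hash}}.
    A partition missing from a region is reported as None and counts as divergent.
    """
    partitions = set().union(*(hashes.keys() for hashes in region_hashes.values()))
    divergent = {}
    for partition in sorted(partitions):
        seen = {region: hashes.get(partition) for region, hashes in region_hashes.items()}
        if len(set(seen.values())) > 1:
            divergent[partition] = seen
    return divergent
```

Returning the per-region values gives the alert the region tags it needs.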

Runbook: replication-divergence.md

FM-14 — NetworkPolicy mis-config exposes egress to offshore

Scenario: A bad merge to NetworkPolicy or Istio AuthorizationPolicy adds an egress rule that allows traffic to non-Afghan IPs — a residency-violation incident.

Detection:

Deploy-time residency test (tests/residency/consent_residency.spec.ts) fails
Runtime consent_residency_violation_total (Istio access-log-derived metric) > 0

Impact:

Critical regulatory incident: data residency invariant broken.

Recovery:

Roll back the offending change immediately.
Snapshot Istio access logs to determine if any actual traffic flowed offshore.
If yes: regulator notification; per-tenant data-export disclosure.

Mitigation:

Deploy-time residency test mandatory in pipeline.
Istio AuthorizationPolicy with default-deny egress.
NetworkPolicy in deny-by-default cluster posture (per kubernetes-cluster-posture.md).
Vault PKI doesn't issue certs for non-platform identities.
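
The core assertion of a residency check, that every egress destination falls inside an approved address range, can be sketched with the standard ipaddress module. The actual test lives in tests/residency/consent_residency.spec.ts and is not reproduced here; the residency_violations function is hypothetical, and the CIDR ranges below are placeholders (a private range and a documentation range), not real in-country allocations.

```python
import ipaddress
from typing import Iterable, List

# Placeholder ranges for illustration only; the real allow-list is an assumption.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def residency_violations(destinations: Iterable[str], allowed=ALLOWED_NETWORKS) -> List[str]:
    """Return every destination IP that is outside all approved networks."""
    violations = []
    for dest in destinations:
        addr = ipaddress.ip_address(dest)
        if not any(addr in net for net in allowed):
            violations.append(dest)
    return violations
```

Fed with destination addresses parsed from Istio access logs, the same check could back the runtime consent_residency_violation_total metric.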

Runbook: residency-violation.md — REGULATOR-PRIORITY

4. Tenant impact matrix

Failure	Tenant-portal view
Brief CheckConsent stall (< 1 min)	None visible
Extended CheckConsent fail-closed	Outbound messages delayed; eventually `DEAD_LETTER` with reason "Consent system temporarily unavailable. Retry available."
STOP-MO lag	None (subscribers see delayed STOP honouring; subscriber complaints flow back via support)
ATRA DND stale	None visible (verdicts continue)
Audit chain break	None visible to tenant; admin alert; regulator notification
Erasure SLA breach	Citizen-portal escalation; legal notification
Bulk-import abuse	Tenant suspension visible in tenant portal; appeal flow

5. Graceful degradation summary

Full operation:
  Cache hit → ALLOWED/BLOCKED in ≤ 5 ms

Redis unavailable:
  PG-direct → ALLOWED/BLOCKED in ≤ 20 ms; PG load up

PG primary down (standby promotes):
  Brief gap (≤ 90 s) of CONSENT_UNKNOWN; then resumed

Both Redis + PG unavailable:
  100% CONSENT_UNKNOWN; compliance-engine BLOCKs all non-emergency

ATRA DND fetch failed:
  Run on last-known DND; alert after 24 h

STOP MO consumer lagging:
  Subscriber STOPs honoured with delay; alert on lag > 60 s

Audit chain break:
  Service continues; CRITICAL alert; freeze + investigate (manual)

Vault Transit unavailable:
  CheckConsent unaffected; new writes return 503 once the 60 s DEK cache expires

Outbox publish stuck:
  DB state correct; downstream events delayed

Citizen OTP abuse:
  Rate limit + captcha throttle; alert at 100/hr

Cross-region divergence:
  CRITICAL freeze; investigate; re-bootstrap divergent replica
