Consent Ledger Service — Failure Modes

Version: 1.0
Status: Draft
Owner: Trust & Safety / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Operating principle: fail-closed

consent-ledger-service operates fail-closed for CheckConsent:

No allowed: true is returned when consent state cannot be determined.

If state cannot be determined for any reason, CheckConsent returns { allowed: false, reason: CONSENT_UNKNOWN }. Downstream callers (compliance-engine, routing-engine, sms-firewall-service) interpret any non-OK gRPC status as allowed: false for non-emergency lanes (defence in depth).

The only exception is lane P0_EMERGENCY (Common Alerting Protocol bridge), which may proceed with an audit row noting the bypass — this is the regulator-sanctioned trade-off for life-safety messaging.
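The caller-side rule above can be sketched as a small decision function. This is a minimal illustration, not the callers' actual code: the `Decision` type, the `interpret_check_consent` name, the `EMERGENCY_BYPASS` reason string, and the `MARKETING` lane used in the usage note are all hypothetical. It also assumes the emergency bypass applies only when the verdict is unknown, so a known `allowed: false` still blocks.

```python
from dataclasses import dataclass
from typing import Optional

P0_EMERGENCY = "P0_EMERGENCY"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    audit_bypass: bool = False  # True only when the emergency lane proceeds without a verdict


def interpret_check_consent(grpc_ok: bool, allowed: Optional[bool],
                            reason: str, lane: str) -> Decision:
    """Map a CheckConsent outcome to a send/block decision, failing closed."""
    verdict_known = grpc_ok and allowed is not None and reason != "CONSENT_UNKNOWN"
    if verdict_known:
        return Decision(allowed=bool(allowed), reason=reason)
    if lane == P0_EMERGENCY:
        # Sanctioned exception: proceed, but flag the bypass for the audit row.
        return Decision(allowed=True, reason="EMERGENCY_BYPASS", audit_bypass=True)
    # Any non-OK status or undetermined state blocks non-emergency lanes.
    return Decision(allowed=False, reason="CONSENT_UNKNOWN")
```

For example, `interpret_check_consent(False, None, "", "MARKETING")` blocks the message, while the same failed call on lane `P0_EMERGENCY` proceeds with `audit_bypass=True`.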

2. Failure-mode summary

| # | Failure | Likelihood | Impact | Rating | Mitigation summary |
|---|---|---|---|---|---|
| FM-01 | Postgres unavailable (primary + standbys) | Low | Critical | HIGH | Fail-closed; promote standby; redelivery upstream |
| FM-02 | Redis unavailable | Medium | Medium | MEDIUM | Fall back to Postgres; degraded latency |
| FM-03 | Both Redis and Postgres unavailable | Very Low | Critical | HIGH | Fail-closed; CONSENT_UNKNOWN; alert |
| FM-04 | NATS lag on sms.mo.inbound consumer | Medium | High (regulator risk) | HIGH | Scale consumers; alert at 60 s lag; runbook |
| FM-05 | ATRA DND endpoint unreachable | Medium | Medium | MEDIUM | Use last-known DND; alert after 24 h stale |
| FM-06 | Audit hash-chain break detected | Very Low | Critical | HIGH | Freeze writes (manual), investigate, never auto-resolve |
| FM-07 | Erasure processor failure (SLA breach) | Low | High (regulatory) | MEDIUM | Manual reprocess; legal escalation |
| FM-08 | STOP-keyword false-positive flood | Medium | Medium | MEDIUM | T&S triage; rapid keyword tuning; tenant feedback API |
| FM-09 | Vault Transit (KEK) unavailable | Low | Medium | MEDIUM | DEK cache (60 s); CheckConsent unaffected |
| FM-10 | Outbox publish stuck | Low | Medium | MEDIUM | Retry; manual replay; alert |
| FM-11 | Citizen-portal MSISDN-OTP abuse / takeover | Medium | High | HIGH | Rate limits; captcha; OTP entropy; short JWT lifetime |
| FM-12 | Bulk import abuse (fake opt-ins) | Medium | High | HIGH | CSV hash audit; volume anomaly detection; regulator dispute path |
| FM-13 | Cross-region replication divergence | Low | Critical | HIGH | Cross-region verifier; CRITICAL alert; freeze writes |
| FM-14 | NetworkPolicy mis-config exposes egress to offshore | Very Low | Critical | HIGH | Deploy-time residency test; Istio AuthorizationPolicy |

3. Detailed failure modes

FM-01 — Postgres unavailable

Scenario: Primary Postgres is unreachable; if standby promotion has not completed, all read/write paths fail.

Detection:

  • consent_check_failclosed_total{cause="db_unavailable"} rises
  • /health/ready returns 503
  • Patroni / etcd events; Prometheus alert PostgresPrimaryDown

Impact:

  • CheckConsent cache hits continue (Redis-only); cache misses fall through to fail-closed CONSENT_UNKNOWN.
  • Writes fail with 503; tenant integrations retry (per their own backoff).
  • Audit chain pauses.

Recovery:

  • Patroni auto-promotes Herat standby within 90 s (RTO); clients reconnect via Service VIP.
  • If promotion fails (e.g., split-brain), on-call follows runbook pg-primary-down.md: confirm consensus, manually trigger promotion in surviving region, verify chain integrity post-recovery.
  • Caller-side: NATS redelivery upstream means messages are not lost; they queue.

Mitigation:

  • Patroni 3-node cluster across Kabul / Herat / Mazar (per ADR-0004 §3).
  • PgBouncer in transaction mode shields per-pod connection churn.
  • Circuit breaker on PG client opens after 5 consecutive errors / 30 s; half-open after 60 s.
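The circuit-breaker thresholds can be illustrated with a minimal sketch. It assumes the rule means five consecutive errors inside a 30 s window, and that a failed trial call in the half-open state re-opens the breaker. The `CircuitBreaker` class and its method names are hypothetical, not the service's actual client wrapper.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    def __init__(self, threshold: int = 5, window_s: float = 30.0,
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures: List[float] = []        # timestamps of the current error run
        self.opened_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"                 # cooldown elapsed: allow a trial call
        return "open"

    def allow_request(self) -> bool:
        return self.state() != "open"

    def record_success(self) -> None:
        self.failures.clear()                  # any success ends the consecutive run
        self.opened_at = None

    def record_failure(self) -> None:
        now = self.clock()
        current = self.state()
        if current == "open":
            return
        if current == "half-open":
            self.opened_at = now               # trial failed: restart the cooldown
            return
        # Keep only errors inside the window, then count the new one.
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

The clock is injected so the timing behaviour can be exercised in tests without sleeping.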

Runbook: pg-primary-down.md


FM-02 — Redis unavailable

Scenario: Redis cluster unreachable (network partition, full restart, etc.).

Detection:

  • consent_redis_op_duration_seconds errors; connection error logs
  • consent_check_cache_misses_total near-100% for the duration

Impact:

  • CheckConsent falls through to Postgres for every call. Latency jumps from ≤ 5 ms to ≤ 20 ms; PG load triples.
  • Redis distributed locks for cron workers cannot be acquired — workers skip the cycle and emit consent_worker_skipped_no_lock_total.
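The fall-through behaviour can be sketched as a lookup chain: a Redis outage is treated like a cache miss, and only a Postgres failure produces CONSENT_UNKNOWN. The function signature, the `DependencyDown` exception, and the `source` field are hypothetical names for illustration; the real handler is not shown in this document.

```python
from typing import Callable, Dict, Optional


class DependencyDown(Exception):
    """Raised by a lookup when its backing store is unreachable."""


def check_consent(key: str,
                  cache_get: Callable[[str], Optional[bool]],
                  db_get: Callable[[str], bool]) -> Dict[str, object]:
    # 1. Try the cache; an unreachable Redis behaves like a miss.
    try:
        cached = cache_get(key)
    except DependencyDown:
        cached = None
    if cached is not None:
        return {"allowed": cached, "source": "cache"}
    # 2. Fall through to Postgres; if that is also down, fail closed.
    try:
        stored = db_get(key)
    except DependencyDown:
        return {"allowed": False, "reason": "CONSENT_UNKNOWN"}
    return {"allowed": stored, "source": "postgres"}
```

With Redis down and Postgres healthy, every call takes the second branch, which is where the added latency and Postgres load come from.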

Recovery:

  • Redis cluster recovers; service does not need restart.
  • Cache rewarms naturally (TTL-driven); explicit consent-cache-warmer cron run accelerates if needed.

Mitigation:

  • HPA may scale up consent-ledger-service replicas under elevated PG load.
  • Postgres replica pool sized to handle sustained 5,000 RPS without cache; load test verifies (with 80% headroom).

Runbook: redis-down.md


FM-03 — Both Redis and Postgres unavailable

Scenario: Catastrophic dual-dependency outage (data centre power, etc.).

Detection:

  • consent_check_failclosed_total{cause="redis_and_db_unavailable"} rises sharply
  • ConsentCheckFailclosedSurge CRITICAL alert pages

Impact:

  • 100% of CheckConsent returns CONSENT_UNKNOWN. compliance-engine blocks all non-emergency outbound. Dispatch effectively halted — by design.
  • Tenants see messages move to DEAD_LETTER once redelivery is exhausted (about 3 minutes per message).
  • Lane=P0_EMERGENCY messages still flow with audit-noted bypass.

Recovery:

  • Restore one or both dependencies. Service immediately starts serving allowed:true/false again.
  • Retry pipeline replays queued messages.

Mitigation:

  • Independent failure domains for PG (Patroni) and Redis (cluster), so that no single component failure takes out both.
  • Pre-positioned standby in Herat region for PG cross-region failover.

Runbook: checkconsent-failclosed.md — PRIORITY


FM-04 — STOP-MO consumer lag

Scenario: sms.mo.inbound consumer (UC-StopKeywordHandler) backs up. Subscriber STOPs are not honoured promptly — regulator risk: subscribers may complain, and the regulator may interpret latency as wilful non-compliance.

Detection:

  • consent_stop_mo_consumer_lag_seconds > 60 for 5 min → ConsentStopConsumerLag HIGH
  • NATS num_pending on the consent-mo consumer rises

Impact:

  • Subscriber-typed STOPs honoured with delay (60 s+).
  • Ack-back SMS delayed correspondingly.
  • During the lag window, a tenant may legally have sent another marketing message after the subscriber's STOP.

Recovery:

  • Scale consent-ledger-service consumer replicas (NATS pull consumer; concurrent pull count tunable).
  • Identify and isolate poison messages (rare; typically from SMPP-MO encoding edge cases).
  • After clearing lag, run a "retroactive revoke" script that re-applies revocations as of the original MO timestamp.
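A "retroactive revoke" pass could look like the sketch below: each delayed STOP is applied with the original MO timestamp as its effective time, rather than the time it was finally processed. Listing the sends that fell inside the lag window is an added assumption for review purposes, and the `StopMo` type and function name are hypothetical; the actual script is described only in the runbook.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class StopMo:
    msisdn_hash: str
    received_at: datetime  # original MO timestamp, not the processing time


def retroactive_revoke(
    stops: Iterable[StopMo],
    sends: Iterable[Tuple[str, datetime]],
) -> Tuple[Dict[str, datetime], List[Tuple[str, datetime]]]:
    """Return effective revocation times and the sends that followed a STOP."""
    revoked_at: Dict[str, datetime] = {}
    for stop in stops:
        current = revoked_at.get(stop.msisdn_hash)
        # Keep the earliest STOP per subscriber as the effective revocation time.
        if current is None or stop.received_at < current:
            revoked_at[stop.msisdn_hash] = stop.received_at
    # Sends after the effective revocation happened during the lag window.
    late_sends = [(h, t) for h, t in sends if h in revoked_at and t > revoked_at[h]]
    return revoked_at, late_sends
```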

Mitigation:

  • HPA includes a custom metric for consumer lag; auto-scales.
  • Per-MO budget 1.5 s end-to-end keeps individual processing fast.
  • Dead-letter subject sms.mo.deadletter catches structurally-broken messages without halting the consumer.

Runbook: stop-consumer-lag.md


FM-05 — ATRA DND endpoint unreachable

Scenario: ATRA SFTP / HTTPS endpoint is down; daily sync worker fails.

Detection:

  • consent_dnd_sync_runs_total{outcome="failed"} increments
  • consent_dnd_sync_last_success_timestamp_seconds stops updating
  • ConsentDndStale HIGH after 24 h

Impact:

  • DND registry is not refreshed. MSISDNs newly listed by ATRA are not blocked until sync resumes, and MSISDNs removed from the list since the last sync are not unblocked.
  • Existing DND state continues to enforce — service keeps running on last-known data.

Recovery:

  • Manual fetch via POST /v1/admin/consent/dnd/resync once endpoint returns.
  • ATRA NOC liaison contact in runbook; escalation path includes regulator liaison if outage > 48 h.

Mitigation:

  • dnd.registry.synced.v1 event freshness monitored by regulator-portal-service.
  • Dual-fetch attempt: SFTP primary + HTTPS secondary if ATRA exposes both.
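The dual-fetch and staleness behaviour can be sketched together. This is an illustration under assumptions: the fetchers are injected callables standing in for the SFTP and HTTPS transports, transport failures surface as `OSError`, and the result dictionary shape is hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

STALE_AFTER_S = 24 * 3600  # the 24 h alert threshold


def sync_dnd(fetchers: Sequence[Tuple[str, Callable[[], List[str]]]],
             last_success_ts: float, now_ts: float) -> Dict[str, object]:
    """Try each transport in order; on total failure keep the last-known data."""
    for name, fetch in fetchers:
        try:
            entries = fetch()
        except OSError:
            continue  # this transport is down: try the next one
        return {"outcome": "success", "source": name, "entries": entries,
                "last_success_ts": now_ts, "stale": False}
    # Nothing reachable: enforcement continues on the previous registry.
    return {"outcome": "failed", "source": None, "entries": None,
            "last_success_ts": last_success_ts,
            "stale": now_ts - last_success_ts > STALE_AFTER_S}
```

A failed run leaves `last_success_ts` untouched, which is what lets the staleness alert fire once the gap exceeds 24 h.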

Runbook: dnd-stale.md


FM-06 — Audit hash-chain break

Scenario: Daily verifier detects a prev_hash mismatch — either tampering, a write-path bug, or an unrecovered partition replication anomaly.

Detection:

  • consent_audit_chain_breaks_detected_total > 0 → ConsentAuditChainBroken CRITICAL pages T&S on-call
  • consent.audit.chain_broken.v1 event published
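A prev_hash check of the kind the verifier performs can be sketched as follows. The row layout, the genesis value, and the use of plain SHA-256 over `prev_hash` plus payload are assumptions for illustration; the real chain may use a keyed hash tied to `signing_key_id`.

```python
import hashlib
from typing import Dict, List, Optional

GENESIS = "0" * 64  # hypothetical prev_hash for the first row of a partition


def row_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + "\n" + payload).encode("utf-8")).hexdigest()


def append_row(chain: List[Dict[str, str]], payload: str) -> None:
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "record_hash": row_hash(prev, payload)})


def first_break(chain: List[Dict[str, str]]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    expected_prev = GENESIS
    for i, row in enumerate(chain):
        links_ok = row["prev_hash"] == expected_prev
        hash_ok = row["record_hash"] == row_hash(row["prev_hash"], row["payload"])
        if not (links_ok and hash_ok):
            return i
        expected_prev = row["record_hash"]
    return None
```

Editing a stored payload changes that row's recomputed hash, so the verifier reports the edited row; deleting a row instead breaks the link at the row that followed it.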

Impact:

  • Regulator-defensibility of the affected window is compromised. The platform must be able to explain why and what.
  • Service continues operating; new rows continue to chain forward (but the broken link is recorded).

Recovery:

  • Do NOT auto-resolve. Freeze writes to the affected partition (admin endpoint to set consent_audit_writes_paused=true flag).
  • Snapshot the partition (pg_dump) and the corresponding S3 archive (if any).
  • Cross-check with the synchronous standby region's copy (independent observers).
  • Determine cause: bug, replication lag, tampering. Investigation is owned by Trust & Safety + Security.
  • Forensic report to regulator within 48 h regardless of root cause.
  • Resume writes only after sign-off.

Mitigation:

  • Cross-region verifier runs independently in Kabul and Herat; divergent results trigger immediate freeze.
  • Hash key rotation supported via signing_key_id so a key compromise can be contained without invalidating older rows.
  • Strict DB-level append-only rules (Postgres rule) make tampering hard to begin with.

Runbook: audit-chain-broken.md — TS-CRITICAL


FM-07 — Erasure processor failure (SLA breach)

Scenario: ErasureProcessor worker fails repeatedly; some erasure requests pass their 30-day SLA.

Detection:

  • consent_erasure_sla_breach_total > 0 → ConsentErasureSLABreach MEDIUM (also escalates Legal)
  • Worker error log spike

Impact:

  • GDPR-equivalent regulatory breach. Citizen has not received the deletion they were promised.

Recovery:

  • Manual run: POST /v1/admin/consent/erasure-requests/{erasureId}/process.
  • If repeated failures point to a code bug, hotfix and rerun.
  • For the affected citizen: written confirmation + apology + completion confirmation within 24 h of detection.
  • Legal team notified; if regulator notification is required (Afghan authority interpretation), follow that path.

Mitigation:

  • Worker has retry-with-backoff for transient PG errors.
  • Daily Slack/Email digest to Legal of pending erasures and SLA proximity.
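The SLA-proximity part of the digest can be sketched as a small calculation over pending requests. The 30-day SLA comes from the scenario above; the five-day "at risk" margin, the status labels, and the function name are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

SLA_DAYS = 30


def erasure_digest(pending: Dict[str, date], today: date,
                   warn_within_days: int = 5) -> List[Tuple[str, int, str]]:
    """List pending erasures as (id, days_remaining, status), most urgent first."""
    rows: List[Tuple[str, int, str]] = []
    for erasure_id, requested_on in pending.items():
        due = requested_on + timedelta(days=SLA_DAYS)
        remaining = (due - today).days
        if remaining < 0:
            status = "BREACHED"      # the due date has passed
        elif remaining <= warn_within_days:
            status = "AT_RISK"
        else:
            status = "OK"
        rows.append((erasure_id, remaining, status))
    return sorted(rows, key=lambda r: r[1])
```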

Runbook: erasure-sla-breach.md


FM-08 — STOP-keyword false-positive flood

Scenario: A pattern of legitimate non-STOP messages is being matched as STOP, revoking consent erroneously and triggering tenant complaints.

Detection:

  • Tenant complaints via POST /v1/consent/feedback/false-positive
  • Spike in consent_stop_mo_received_total{match_outcome="matched"} per language vs baseline
  • compliance.tenant.tier.changed events away from CLEAR for affected tenants

Impact:

  • Subscribers wrongly opted out of legitimate marketing/transactional flows.
  • Tenant support escalations.

Recovery:

  • T&S triage: identify the responsible keyword.
  • Soft-delete the offending tenant-added keyword (platform defaults are sealed; if a default is at fault, T&S adds an exception list rather than removing the default).
  • Re-grant consent for affected MSISDNs via tenant RecordConsent with verificationMethod = TENANT_API and a feedback reference. Audit row indicates the reason.

Mitigation:

  • The keyword catalog admin requires audit log + dual review for sensitive changes.
  • AI keyword-suggester (see AI_INTEGRATION §2) is HITL only — no auto-add.
  • Conformance test (200 messages × 4 languages) blocks merges that lower precision.
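A precision gate of this kind can be sketched as below. The whole-word matcher, the tiny labelled corpus in the usage note, and the rule "candidate precision must not fall below baseline" are assumptions standing in for the real conformance suite.

```python
import re
from typing import Callable, Iterable, Set, Tuple


def make_matcher(keywords: Set[str]) -> Callable[[str], bool]:
    """Match when any whole word of the message is a STOP keyword."""
    def matches(text: str) -> bool:
        return any(word in keywords for word in re.findall(r"\w+", text.lower()))
    return matches


def precision(matcher: Callable[[str], bool],
              labelled: Iterable[Tuple[str, bool]]) -> float:
    true_pos = false_pos = 0
    for text, is_stop in labelled:
        if matcher(text):
            if is_stop:
                true_pos += 1
            else:
                false_pos += 1
    matched = true_pos + false_pos
    return true_pos / matched if matched else 1.0


def merge_allowed(baseline: float, candidate: float) -> bool:
    return candidate >= baseline  # block any change that lowers precision
```

For instance, adding a short keyword such as "no" would start matching ordinary replies like "no problem, see you at 5", which lowers precision and fails the gate.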

Runbook: stop-keyword-fp-flood.md


FM-09 — Vault Transit (KEK) unavailable

Scenario: Vault Transit endpoint cannot unwrap DEKs; encryption-at-rest functions blocked.

Detection:

  • consent_vault_unwrap_errors_total rate up
  • ConsentVaultUnwrapErrors HIGH alert

Impact:

  • New RecordConsent writes fail (cannot encrypt MSISDN field).
  • Reads of msisdn_encrypted (admin/regulator handlers) fail.
  • CheckConsent is unaffected — it never decrypts; it uses msisdn_hash.

Recovery:

  • Restore Vault.
  • Cached DEKs cover ≤ 60 s of outage transparently.
  • After 60 s, queued write requests retry by client backoff.

Mitigation:

  • Vault HA (active + standby) in Kabul + Herat.
  • DEK cache in-process with bounded TTL.
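The bounded-TTL cache can be sketched as follows. The `unwrap` callable stands in for the Vault Transit call and is injected, so no real Vault API is assumed; the class name and structure are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class DekCache:
    def __init__(self, unwrap: Callable[[str], bytes], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._unwrap = unwrap
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def get(self, key_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key_id)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]             # still fresh: no call to the KEK service
        dek = self._unwrap(key_id)      # raises if the KEK service is unreachable
        self._entries[key_id] = (dek, now)
        return dek
```

An expired entry is never served, so once the TTL has passed an outage surfaces as an error on the next `get`, consistent with writes failing after the cache window.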

Runbook: vault-unwrap-errors.md


FM-10 — Outbox publish stuck

Scenario: NATS publish failures or poison messages cause outbox rows to accumulate.

Detection:

  • consent_outbox_unpublished_count > 1000 or consent_outbox_oldest_unpublished_age_seconds > 60 → ConsentOutboxBacklog HIGH

Impact:

  • Downstream consumers (notification-service, regulator-portal-service SIEM) miss events.
  • Database state remains correct (outbox is post-commit).

Recovery:

  • Investigate NATS health.
  • Manual outbox-replay script with offset cursor.
  • Quarantine poison rows (move to consent.outbox_quarantine) for forensic review.
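A replay pass with an offset cursor and quarantine might look like the sketch below. It assumes poison rows surface as `ValueError` from the publisher and broker problems as `ConnectionError`, and that the replay stops at the first broker error so the cursor never skips an unpublished row. The row shape and function name are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def replay_outbox(rows: Iterable[Dict[str, object]], cursor: int,
                  publish: Callable[[Dict[str, object]], None]) -> Tuple[int, List[int]]:
    """Publish rows with id > cursor in id order; return (new_cursor, quarantined_ids)."""
    quarantined: List[int] = []
    todo = sorted((r for r in rows if int(r["id"]) > cursor), key=lambda r: int(r["id"]))
    for row in todo:
        row_id = int(row["id"])
        try:
            publish(row)
        except ValueError:
            quarantined.append(row_id)  # poison row: set aside for forensic review
        except ConnectionError:
            break                       # broker unhealthy: stop and keep the cursor
        cursor = row_id                 # published or quarantined: move past it
    return cursor, quarantined
```

Re-running with the returned cursor resumes from the first row that was not handled.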

Mitigation:

  • 3 outbox-relay replicas with FOR UPDATE SKIP LOCKED.
  • NATS streams replicated 3-way.

Runbook: outbox-backlog.md


FM-11 — Citizen-portal MSISDN-OTP abuse / takeover

Scenario: Attacker tries to brute-force OTP, request OTP for many MSISDNs, or use SIM-swap to take over a MSISDN.

Detection:

  • consent_citizen_otp_requested_total rate per IP / per MSISDN exceeds threshold
  • ConsentCitizenOtpAbuse MEDIUM at 100 OTP requests/hour total

Impact:

  • Mass OTP delivery cost; degraded user trust; potential account takeover.

Recovery:

  • Block source IPs at Kong (geographic / ASN heuristics).
  • Temporarily increase OTP captcha difficulty.
  • For SIM-swap claims: route to support; require KYC.

Mitigation:

  • OTP rate limits: 5 requests per MSISDN per hour and 5 requests per IP per hour.
  • Captcha on OTP request.
  • Citizen JWT lifetime 15 min; revoke immediately on logout.
  • OTP entropy ≥ 6 digits; lifetime 5 min.
  • Audit row on every citizen view (CITIZEN_INSPECTION_VIEW) gives a forensic trail.
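The two OTP rate limits can be illustrated with a sliding-window limiter. This is an in-memory sketch; a shared store would be needed across replicas, and the class name and key scheme are hypothetical. A request is counted only when both the MSISDN and the IP are under their limit.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class OtpRateLimiter:
    def __init__(self, limit: int = 5, window_s: float = 3600.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self.events: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, msisdn_hash: str, ip: str, now: float) -> bool:
        keys = ("msisdn:" + msisdn_hash, "ip:" + ip)
        for key in keys:
            window = self.events[key]
            # Drop requests that have aged out of the one-hour window.
            while window and now - window[0] >= self.window_s:
                window.popleft()
        if any(len(self.events[key]) >= self.limit for key in keys):
            return False
        for key in keys:
            self.events[key].append(now)
        return True
```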

Runbook: citizen-otp-abuse.md


FM-12 — Bulk-import abuse (fake opt-ins)

Scenario: A tenant uploads bulk consent records for MSISDNs they do not legitimately have permission for.

Detection:

  • Volume anomaly: tenant's bulk-import volume vs prior baseline
  • Subscriber complaints reaching the regulator → cross-checked against bulk-import audit (CSV hash, captured_at)

Impact:

  • Subscribers receive unwanted marketing; regulator complaint.

Recovery:

  • Suspend the offending tenant's consent:write scope.
  • Mass revoke for the affected MSISDNs (admin tool).
  • Forensic audit of CSV provenance (the original CSV hash is on every audit row).
  • Regulator escalation per platform terms.

Mitigation:

  • Bulk-import audit-tagged with CSV hash.
  • Daily volume anomaly check (consent_records_written_total{verification_method="BULK_IMPORT_ATTESTATION"} per tenant vs 30-day moving average).
  • Tenant terms-of-service require attestation of lawful basis.
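The daily volume check can be sketched as a comparison against the 30-day moving average. The 3x multiplier and the minimum baseline of 100 records (so that new tenants with little history are not flagged on tiny volumes) are assumptions, not documented thresholds.

```python
from statistics import mean
from typing import Sequence


def is_volume_anomaly(daily_counts_30d: Sequence[int], today_count: int,
                      multiplier: float = 3.0, min_baseline: float = 100.0) -> bool:
    """Flag today's bulk-import volume when it far exceeds the moving average."""
    average = mean(daily_counts_30d) if daily_counts_30d else 0.0
    baseline = max(average, min_baseline)
    return today_count > multiplier * baseline
```

A flagged tenant would then be reviewed against the CSV hash recorded on the audit rows rather than suspended automatically.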

Runbook: bulk-import-abuse.md


FM-13 — Cross-region replication divergence

Scenario: Kabul and Herat (or Mazar) Postgres replicas show different states for the same row, or audit chain hashes differ between regions.

Detection:

  • Cross-region AuditChainVerifier cron compares per-partition record_hash of last row across regions
  • Mismatch raises ConsentAuditChainBroken CRITICAL with region tags
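The per-partition comparison can be sketched as below. The input shape (region name mapped to partition name mapped to the last row's record_hash) is hypothetical, and the sketch ignores replication lag: a real verifier would need to compare regions at a common row or WAL position to avoid false alarms.

```python
from typing import Dict


def divergent_partitions(
    last_hash_by_region: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, str]]:
    """Return partitions whose last-row record_hash differs between regions."""
    partitions = set()
    for hashes in last_hash_by_region.values():
        partitions.update(hashes)
    result: Dict[str, Dict[str, str]] = {}
    for partition in sorted(partitions):
        seen = {region: hashes.get(partition, "<missing>")
                for region, hashes in last_hash_by_region.items()}
        if len(set(seen.values())) > 1:
            result[partition] = seen  # keep per-region values for the alert's region tags
    return result
```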

Impact:

  • Loss of confidence in regulator-defensibility.
  • Possible failover to a wrong-state region.

Recovery:

  • Same as FM-06: freeze, snapshot, investigate.
  • Re-bootstrap the divergent replica from primary backup; replay WAL.

Mitigation:

  • Synchronous replication to Herat.
  • Per-region verifier independent of the primary's verifier.
  • Patroni quorum requires 2 of 3 to agree on leader.

Runbook: replication-divergence.md


FM-14 — NetworkPolicy mis-config exposes egress to offshore

Scenario: A bad merge to NetworkPolicy or Istio AuthorizationPolicy adds an egress rule that allows traffic to non-Afghan IPs — a residency-violation incident.

Detection:

  • Deploy-time residency test (tests/residency/consent_residency.spec.ts) fails
  • Runtime consent_residency_violation_total (Istio access-log-derived metric) > 0

Impact:

  • Critical regulatory incident: data residency invariant broken.

Recovery:

  • Roll back the offending change immediately.
  • Snapshot Istio access logs to determine if any actual traffic flowed offshore.
  • If yes: regulator notification; per-tenant data-export disclosure.

Mitigation:

  • Deploy-time residency test mandatory in pipeline.
  • Istio AuthorizationPolicy with default-deny egress.
  • NetworkPolicy in deny-by-default cluster posture (per kubernetes-cluster-posture.md).
  • Vault PKI doesn't issue certs for non-platform identities.
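The core of a residency check can be illustrated as a CIDR containment test: every egress range must sit inside an approved in-country range. This sketch is not the contents of the residency spec file; the function name is hypothetical, and the ranges in the usage note are RFC 5737 documentation addresses used as placeholders, not real allocations.

```python
import ipaddress
from typing import List, Sequence


def offshore_egress(egress_cidrs: Sequence[str],
                    allowed_cidrs: Sequence[str]) -> List[str]:
    """Return egress CIDRs that are not fully contained in an allowed range."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    violations: List[str] = []
    for cidr in egress_cidrs:
        network = ipaddress.ip_network(cidr)
        contained = any(
            network.version == permitted.version and network.subnet_of(permitted)
            for permitted in allowed
        )
        if not contained:
            violations.append(cidr)
    return violations
```

With allowed ranges `["192.0.2.0/24", "10.0.0.0/8"]`, an egress rule for `0.0.0.0/0` or `198.51.100.0/24` is reported as a violation, while `192.0.2.16/28` passes.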

Runbook: residency-violation.md — REGULATOR-PRIORITY


4. Tenant impact matrix

| Failure | Tenant-portal view |
|---|---|
| Brief CheckConsent stall (< 1 min) | None visible |
| Extended CheckConsent fail-closed | Outbound messages delayed; eventually DEAD_LETTER with reason "Consent system temporarily unavailable. Retry available." |
| STOP-MO lag | None (subscribers see delayed STOP honoring; subscriber complaints flow back via support) |
| ATRA DND stale | None visible (verdicts continue) |
| Audit chain break | None visible to tenant; admin alert; regulator notification |
| Erasure SLA breach | Citizen-portal escalation; legal notification |
| Bulk-import abuse | Tenant suspension visible in tenant portal; appeal flow |

5. Graceful degradation summary

Full operation:
Cache hit → ALLOWED/BLOCKED in ≤ 5 ms

Redis unavailable:
PG-direct → ALLOWED/BLOCKED in ≤ 20 ms; PG load up

PG primary down (standby promotes):
Brief gap (≤ 90 s) of CONSENT_UNKNOWN; then resumed

Both Redis + PG unavailable:
100% CONSENT_UNKNOWN; compliance-engine BLOCKs all non-emergency

ATRA DND fetch failed:
Run on last-known DND; alert after 24 h

STOP MO consumer lagging:
Subscriber STOPs honoured with delay; alert on lag > 60 s

Audit chain break:
Service continues; CRITICAL alert; freeze + investigate (manual)

Vault Transit unavailable:
CheckConsent unaffected; new writes 503 after 60 s DEK cache

Outbox publish stuck:
DB state correct; downstream events delayed

Citizen OTP abuse:
Rate limit + captcha throttle; alert at 100/hr

Cross-region divergence:
CRITICAL freeze; investigate; re-bootstrap divergent replica