consent-ledger-service — Service Risk Register

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: FAILURE_MODES.md, SECURITY_MODEL.md, ADR-0004

This register lists known service-level risks with owners, mitigations, and residual-risk classification. Risks are drawn from SERVICE_OVERVIEW.md §10 Open Points, architectural assumptions, regulatory exposure, and externally-facing attack surface. Each risk is scored as Likelihood × Impact, with both factors on a 1–5 scale; residual risk after mitigation must be Medium or lower for GA.

1. Risk Summary

ID	Risk	Category	Likelihood	Impact	Pre-mitigation	Residual	Owner
CONS-RISK-01	ATRA National DND registry does not exist or does not expose an API	Dependency	4	4	High	Medium	Regulator Liaison
CONS-RISK-02	Hash-chain verifier bug invalidates audit integrity claim	Correctness	2	5	High	Low	Trust & Safety
CONS-RISK-03	GDPR right-to-erasure vs. 7-year regulator retention conflict	Legal	3	4	High	Medium	Legal
CONS-RISK-04	MSISDN enumeration attack via citizen-portal	Security	3	3	Medium	Low	Security
CONS-RISK-05	`CheckConsent` hot-path latency regression breaks OTP SLA (sub-3 s)	Performance	3	4	High	Low	SRE
CONS-RISK-06	STOP-keyword false positive rate too high in Pashto / Dari	Correctness	4	3	High	Medium	Trust & Safety
CONS-RISK-07	Cross-region replication conflict on concurrent tenant consent writes	Correctness	2	3	Medium	Low	Platform Arch
CONS-RISK-08	Bulk-import CSV injection attack	Security	2	4	Medium	Low	Security
CONS-RISK-09	Unauthorised tenant exporting another tenant's consent records (RLS bypass)	Security	2	5	High	Low	Security
CONS-RISK-10	HSM unavailable → erasure-tokenisation stalls; erasure SLA breach	Dependency	2	3	Medium	Low	SRE + Security
CONS-RISK-11	S3 cold-tier archive bucket misconfiguration → lost records	Operations	1	5	Medium	Low	SRE
CONS-RISK-12	STOP-keyword auto-suspension used adversarially by a tenant against competitors	Trust & Safety	2	2	Low	Low	Trust & Safety
CONS-RISK-13	Fail-closed behaviour triggered by a single-AZ Postgres incident blocks national SMS	Availability	2	5	High	Medium	SRE
CONS-RISK-14	Regulator changes DND schema without notice	Dependency	3	3	Medium	Medium	Regulator Liaison
CONS-RISK-15	Consent SDK bug in tenant integration causes spurious mass-revocation	Integration	2	4	Medium	Low	DevRel
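
The banding behind the Pre-mitigation column can be sketched as follows. The register does not state the cut-offs; the values below (Low up to 4, Medium 5 to 9, High 10 and above) are an assumption inferred from the fifteen rows above, and `risk_band` is a hypothetical helper name.

```python
def risk_band(likelihood: int, impact: int) -> str:
    """Map a 1-5 likelihood and a 1-5 impact to a Low/Medium/High band."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    score = likelihood * impact
    # Cut-offs are inferred from the summary table, not a stated policy.
    if score >= 10:
        return "High"
    if score >= 5:
        return "Medium"
    return "Low"
```

Under these assumed cut-offs, CONS-RISK-01 (4 × 4) lands in High and CONS-RISK-12 (2 × 2) in Low, matching the table.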

2. Risk Details

CONS-RISK-01 — ATRA National DND registry does not exist or does not expose an API

Scenario. ATRA has not yet committed to a National DND registry interface. If the registry (a) does not exist at launch, or (b) is delivered as a manual PDF, or (c) uses an undocumented SFTP convention, EP-CONS-01 / US-CONS-001 cannot be automated at the target cadence.

Impact. We cannot enforce a national DND — only per-tenant opt-out. This is a regulator-exposure risk if ATRA later publishes a list and expects us to honour it retroactively.

Mitigation.

Design the sync worker against a pluggable adapter (AtraDndAdapter port with HttpsJsonAdapter, SftpCsvAdapter, ManualUploadAdapter implementations).
Ship ManualUploadAdapter from day 1 — a platform admin can upload a CSV via admin-dashboard and it is parsed into consent.dnd_registry.
Open an ATRA engagement workstream (Regulator Liaison, OKR) to converge on a formal API.
Fall-back: Ghasi publishes a de-facto National DND registry itself, seeded from tenant STOP events, and offers it to ATRA.
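
A minimal sketch of the adapter port described above, assuming a single-method interface. The method name `fetch_msisdns`, the `msisdn` column name, and the `sync_dnd` helper are illustrative assumptions; only the adapter class names come from this register.

```python
import csv
import io
from typing import Iterable, Protocol


class AtraDndAdapter(Protocol):
    """Port: any source that can yield the MSISDNs on the national DND list."""

    def fetch_msisdns(self) -> Iterable[str]: ...


class ManualUploadAdapter:
    """Reads a CSV uploaded by a platform admin (column name is an assumption)."""

    def __init__(self, csv_text: str, column: str = "msisdn") -> None:
        self._csv_text = csv_text
        self._column = column

    def fetch_msisdns(self) -> Iterable[str]:
        reader = csv.DictReader(io.StringIO(self._csv_text))
        for row in reader:
            value = (row.get(self._column) or "").strip()
            if value:
                yield value


def sync_dnd(adapter: AtraDndAdapter) -> set[str]:
    """Collect one registry snapshot; a real worker would upsert it into consent.dnd_registry."""
    return set(adapter.fetch_msisdns())
```

Because the sync worker depends only on the port, an HttpsJsonAdapter or SftpCsvAdapter could later replace the manual upload without changing `sync_dnd`.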

Residual risk. Medium — the adapter gives us a graceful fall-back; the regulator relationship is the long pole.

CONS-RISK-02 — Hash-chain verifier bug invalidates audit integrity claim

Scenario. A bug in the record_hash = sha256(payload || prev_hash) computation (e.g., canonicalisation drift, non-deterministic JSON serialisation) causes the daily verifier to flag false-positive breaks, which erodes regulator trust. Conversely, a bug could mask a real tamper.

Impact. Loss of regulator-defensibility on the entire audit log.

Mitigation.

Canonicalise payload using RFC 8785 JSON Canonicalization Scheme (JCS) — deterministic serialisation.
Property-based tests verify chain invariants on 10 000+ random inputs before release.
Two independent implementations: the producer (NestJS) and the verifier (Python cron). Any divergence between them raises a test alert.
Weekly chaos drill injects a deliberate tamper to confirm verifier detects it.
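
The chain rule and the tamper check can be sketched as below. This is an illustration under stated assumptions, not the service's verifier: `json.dumps` with sorted keys stands in for a full RFC 8785 implementation (JCS also fixes number formatting and key-ordering details), and the genesis value, field names, and function names are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first record


def canonical_bytes(payload: dict) -> bytes:
    # Stand-in for RFC 8785 JCS: sorted keys, no whitespace, UTF-8 output.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def record_hash(payload: dict, prev_hash: str) -> str:
    # record_hash = sha256(payload || prev_hash)
    data = canonical_bytes(payload) + prev_hash.encode("ascii")
    return hashlib.sha256(data).hexdigest()


def build_chain(payloads: list[dict]) -> list[dict]:
    chain, prev = [], GENESIS_HASH
    for payload in payloads:
        digest = record_hash(payload, prev)
        chain.append({"payload": payload, "prev_hash": prev, "record_hash": digest})
        prev = digest
    return chain


def first_break(chain: list[dict]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS_HASH
    for index, record in enumerate(chain):
        if record["prev_hash"] != prev:
            return index
        if record_hash(record["payload"], prev) != record["record_hash"]:
            return index
        prev = record["record_hash"]
    return None
```

A tamper drill in this model amounts to editing one stored payload and confirming that `first_break` reports its index.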

Residual risk. Low.

CONS-RISK-03 — GDPR right-to-erasure vs. 7-year regulator retention conflict

Scenario. A citizen invokes right-to-erasure. The consent-ledger must erase MSISDN from consent.records. But consent.audit must retain the record for 7 years. The two obligations conflict if audit retains the MSISDN verbatim.

Impact. GDPR violation (fines of up to €20M or 4% of global annual turnover, whichever is higher; reputational damage).

Mitigation.

Erasure tokenises MSISDN in consent.audit via deterministic tokenisation (FF1 format-preserving encryption with an HSM-bound key). Audit retains the proof-of-event without the PII.
National-DND row is retained (regulator override per SERVICE_OVERVIEW §6.3).
Legal briefing documents the trade-off; citizen-portal erasure UI discloses that "proof of your prior consent is retained in hashed form for regulator audit".
DPIA signed off by Legal.
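
The effect of erasure on an audit row can be illustrated as below. The register specifies FF1 with an HSM-bound key; this sketch deliberately swaps in HMAC-SHA256 from the standard library, because FF1 needs an HSM and a dedicated library. Unlike FF1, HMAC is one-way and does not preserve the MSISDN format. The field names `msisdn` and `msisdn_token` and both function names are assumptions.

```python
import hashlib
import hmac


def tokenise_msisdn(msisdn: str, key: bytes) -> str:
    """Deterministic keyed token for an MSISDN (HMAC-SHA256 stand-in for FF1)."""
    normalised = msisdn.strip()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_in_audit(audit_rows: list[dict], msisdn: str, key: bytes) -> list[dict]:
    """Return audit rows with the MSISDN replaced by its token; other fields are kept."""
    token = tokenise_msisdn(msisdn, key)
    erased = []
    for row in audit_rows:
        if row.get("msisdn") == msisdn:
            row = {k: v for k, v in row.items() if k != "msisdn"}
            row["msisdn_token"] = token
        erased.append(row)
    return erased
```

Because the token is deterministic for a given key, all audit events for the same subscriber still correlate after erasure, which is what keeps the proof-of-event usable.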

Residual risk. Medium — depends on whether a regulator or court accepts deterministic tokenisation as adequate pseudonymisation.

CONS-RISK-04 — MSISDN enumeration attack via citizen-portal

Scenario. Attacker probes /v1/consent/records?msisdn= for millions of MSISDNs to learn which are registered on Ghasi (and with which tenants).

Impact. Large-scale PII disclosure; tenant intelligence leak.

Mitigation.

Citizen-portal /consent/records?msisdn= requires MSISDN-OTP verification first.
OTP rate-limited per MSISDN (3/hour) and per source IP (10/hour).
Anti-enumeration: uniform response time whether MSISDN is known or unknown.
Kong adaptive rate-limit + JA3 fingerprint-based blocking at the edge.
No public API for MSISDN existence check.
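
The two OTP limits above (3/hour per MSISDN, 10/hour per source IP) could be enforced with a sliding-window counter along these lines. This is an in-memory sketch for illustration; a production limiter would typically live at the edge or in a shared store, and the class and function names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_s:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


per_msisdn = SlidingWindowLimiter(limit=3, window_s=3600)
per_ip = SlidingWindowLimiter(limit=10, window_s=3600)


def may_send_otp(msisdn: str, source_ip: str, now: float) -> bool:
    # The IP check runs first, so an attempt refused for the MSISDN
    # still counts against the source IP.
    return per_ip.allow(source_ip, now) and per_msisdn.allow(msisdn, now)
```

Counting refused attempts against the source IP is a design choice in this sketch: it makes a scanner exhaust its IP budget faster when it cycles through many MSISDNs.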

Residual risk. Low.

CONS-RISK-05 — `CheckConsent` hot-path latency regression breaks OTP SLA

Scenario. An innocuous code change (e.g., added JSON-validation middleware) pushes CheckConsent P95 from 5 ms to 30 ms. Because compliance-engine and routing-engine both call it synchronously, the added latency flows directly into OTP submit-to-DLR latency, breaking the 3 s national SLA for the OTP lane.

Impact. Bank / healthcare OTP traffic SLA breach; tenant escalations.

Mitigation.

Continuous benchmark in CI: CheckConsent benchmark must pass P95 ≤ 5 ms gate before merge.
Canary deploy with automatic rollback on P95 > 15 ms for 5 min.
Load test at 1.5× expected RPS is a GA gate (§8).
Separate hot-path (CheckConsent) and admin-path (REST CRUD) pods so admin load cannot starve hot-path.
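
The merge gate and the canary rollback rule can be expressed as two small checks. The sketch assumes a nearest-rank P95 and one P95 reading per minute from the canary; the percentile method, the sampling interval, and the function names are assumptions, while the 5 ms and 15 ms / 5 min thresholds come from the list above.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def ci_gate_passes(samples_ms: list[float], budget_ms: float = 5.0) -> bool:
    """Merge gate: the benchmark P95 must not exceed the budget."""
    return p95(samples_ms) <= budget_ms


def should_roll_back(p95_per_minute_ms: list[float],
                     threshold_ms: float = 15.0, minutes: int = 5) -> bool:
    """True when the last `minutes` one-minute P95 readings all exceed the threshold."""
    recent = p95_per_minute_ms[-minutes:]
    return len(recent) == minutes and all(v > threshold_ms for v in recent)
```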

Residual risk. Low.

CONS-RISK-06 — STOP-keyword false positive rate too high in Pashto / Dari

Scenario. The default Pashto / Dari STOP-keyword catalog matches legitimate words in those languages (e.g., a common name colliding with لغو). Result: unintended revocation at scale; regulator complaints from tenants whose customers were suddenly opted out.

Impact. Tenant churn; revenue impact; regulator exposure.

Mitigation.

Keywords hand-curated by native Pashto / Dari speakers (Trust & Safety + Legal review).
False-positive feedback loop (US-CONS-011) — tenants can report false positives via portal.
Multi-day shadow mode in staging (§10 P1) with real MO data (anonymised) to estimate FP rate.
Per-tenant keyword overrides allowed (additive only).
Context-aware matching: if body includes STOP keyword mid-sentence and is > 4 words total, require conservative-mode flag (opt-in per tenant).

Residual risk. Medium — natural-language variation is inherently imperfect.

CONS-RISK-07 — Cross-region replication conflict

Scenario. Per ADR-0004, consent.records is control-plane data replicated multi-master (kbl ↔ mzr). A concurrent write to the same (tenantId, msisdn, scope) in both regions can create a replication conflict.

Impact. Inconsistent consent state between regions; erratic CheckConsent results.

Mitigation.

Last-Write-Wins (LWW) using HLC timestamp; conflict resolver picks newest.
Consent is fundamentally append-only (opt-in and revoke are events; state is derived) — conflicts resolve monotonically because revocation is absorbing.
Consent audit (append-only) never conflicts — only insertions.
An hourly reconciliation cron checks the region-divergence count and alerts if more than 0.01% of records differ.
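
A minimal sketch of the Last-Write-Wins resolver and the divergence threshold, assuming an HLC stamp is compared as (wall-clock milliseconds, logical counter, node id). The node-id tie-breaker, the state labels, and all names here are illustrative assumptions rather than the replication layer's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Hlc:
    """Hybrid logical clock stamp; field order defines the comparison order."""
    wall_ms: int
    counter: int
    node: str  # tie-breaker so stamps from two regions never compare equal


@dataclass(frozen=True)
class ConsentWrite:
    state: str  # e.g. "OPTED_IN" or "REVOKED" (hypothetical labels)
    hlc: Hlc


def resolve(local: ConsentWrite, remote: ConsentWrite) -> ConsentWrite:
    """Last-Write-Wins: keep the write with the greater HLC stamp."""
    return local if local.hlc >= remote.hlc else remote


def divergence_alert(differing: int, total: int) -> bool:
    """True when more than 0.01% of records differ between regions."""
    return differing * 10_000 > total
```

Both regions reach the same answer regardless of which write they treat as local, because the comparison depends only on the stamps.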

Residual risk. Low.

CONS-RISK-08 — Bulk-import CSV injection attack

Scenario. A malicious tenant uploads a CSV with embedded formula cells (=cmd|...!A1) intending to run code on a reviewer's spreadsheet when exported.

Impact. Reviewer workstation compromise.

Mitigation.

CSV parser strictly typed — no formula evaluation.
Fields sanitised on output: values beginning with =, +, -, @ are prefixed with '.
Bulk-import reports rendered as HTML / PDF only (not raw CSV) to reviewers.
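
The output sanitisation rule can be sketched as below, using the standard-library csv writer. The function names are hypothetical; the prefix set and the apostrophe escape are the ones listed above.

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@")


def sanitise_cell(value: str) -> str:
    """Prefix values that a spreadsheet would treat as a formula with an apostrophe."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def write_report(rows: list[list[str]]) -> str:
    """Serialise rows to CSV text, sanitising every cell on the way out."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([sanitise_cell(cell) for cell in row])
    return buffer.getvalue()
```

One side effect worth noting: MSISDNs written in international format start with + and are therefore prefixed too, so downstream consumers of exported files need to expect the leading apostrophe.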

Residual risk. Low.

CONS-RISK-09 — Tenant exporting another tenant's consent records (RLS bypass)

Scenario. A bug in a query path or a service-role escalation lets tenant A read tenant B's consent.records.

Impact. Major PII breach; regulator reportable.

Mitigation.

RLS enforced at Postgres level (not application) on every tenant-scoped table.
Integration test test/integration/tenant-isolation.spec.ts is mandatory (per DEFINITION_OF_DONE) and runs in CI.
Contract test verifies every REST and gRPC path filters by tenantId from JWT / SVID.
Quarterly security review.

Residual risk. Low.

CONS-RISK-10 — HSM unavailable → erasure tokenisation stalls

Scenario. An HSM outage longer than 1 h blocks the erasure processor, because tokenisation requires the HSM-bound deterministic key.

Impact. The erasure SLA (30 d) is breached only if the HSM outage exceeds 30 d, but a prolonged incident still erodes trust.

Mitigation.

HSM HA with quorum (per ADR-0004 §11): 2 regional nodes + escrow.
Erasure processor queues requests and retries on HSM recovery.
Escalation: if HSM outage > 4 h, alert (PagerDuty Critical).

Residual risk. Low.

CONS-RISK-11 — S3 cold-tier archive bucket misconfiguration → lost records

Scenario. A wrong lifecycle policy or bucket-versioning setting causes partitions older than 13 months to be deleted rather than archived.

Impact. Regulator-defensibility lost for deleted range; irrecoverable.

Mitigation.

S3 bucket has immutable object-lock in governance mode (default 7 years).
Archive job is blue-green: write to new bucket before deleting from hot.
Weekly cron verifies audit-row-count(hot + cold) ≥ previous-week-count.
Bucket policy change requires dual-control approval via admin-dashboard.

Residual risk. Low.

CONS-RISK-12 — STOP auto-suspension weaponised against competitors

Scenario. A tenant's competitor sends spoofed MO messages containing STOP keywords that appear to come from the tenant's subscriber base, causing mass opt-out.

Impact. Commercial damage to victim tenant.

Mitigation.

MO origin authenticated against MSISDN (the STOP MUST come from the actual recipient, not a spoofed MO).
Per-sender-MO rate-limit on STOP processing (if the same MSISDN sends STOP to many tenants in a minute, queue for review).
Tenant can appeal via false-positive reporting.

Residual risk. Low.

CONS-RISK-13 — Fail-closed blocks national SMS during Postgres incident

Scenario. A single-AZ Postgres incident takes down consent-ledger. Fail-closed behaviour means every SMS submission is now blocked.

Impact. National SMS outage.

Mitigation.

Postgres HA with synchronous replica within region; automatic fail-over ≤ 30 s.
Redis hot-cache absorbs 97%+ of hot-path traffic, masking DB incidents up to cache TTL (300 s).
Per ADR-0004 §5, region-level fail-over to Mazar is manual-gated but tested quarterly.
Emergency override: P0 emergency lane (e.g., CBC broadcasts) is not gated by consent-ledger.

Residual risk. Medium — national-scale incident still possible in worst case.

CONS-RISK-14 — Regulator changes DND schema without notice

Scenario. ATRA changes their published DND file format mid-year.

Impact. Sync worker fails silently or parses wrong fields.

Mitigation.

Schema version captured per sync; mismatch triggers alert + manual review.
Adapter abstraction allows new adapter deployment without full-service redeploy.
Regulator Liaison maintains change-management contact.
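
One way to capture a schema version per sync is to fingerprint the file's column names and compare against the last accepted value. How the version is actually derived is not specified in this register, so the fingerprinting approach, the column names in the usage example, and the function names are all assumptions.

```python
import hashlib


def schema_fingerprint(header: list[str]) -> str:
    """Stable fingerprint of a file's column names, in order."""
    normalised = "\x1f".join(name.strip().lower() for name in header)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def check_schema(header: list[str], expected_fingerprint: str) -> bool:
    """Return True if the sync may proceed; False means alert and hold for manual review."""
    return schema_fingerprint(header) == expected_fingerprint
```

A renamed, added, or reordered column changes the fingerprint, so the sync stops before any rows are parsed with the wrong field mapping.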

Residual risk. Medium.

CONS-RISK-15 — Consent SDK bug causes spurious mass-revocation

Scenario. Tenant integration accidentally loops through customers calling RevokeConsent instead of RecordConsent.

Impact. Mass opt-out; tenant complaint; recovery requires investigation.

Mitigation.

SDK has rate-limit defaults (≤ 100 req/s per API key) with big-red-button warning if exceeded.
Server-side circuit breaker: if a single API key issues > 1 000 revocations in 1 h, throttle + alert tenant.
Audit-log-driven recovery: if tenant requests a restore, the audit log can be replayed to a prior point-in-time with tenant sign-off.
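
The server-side circuit breaker can be sketched as a per-key counter over a rolling hour. This in-memory version is for illustration only; the class name and the `record` method are hypothetical, and the 1 000 per hour default is the threshold stated above.

```python
from collections import defaultdict, deque


class RevocationBreaker:
    """Trips for an API key once it exceeds `limit` revocations inside `window_s` seconds."""

    def __init__(self, limit: int = 1000, window_s: float = 3600.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._seen: dict[str, deque] = defaultdict(deque)

    def record(self, api_key: str, now: float) -> bool:
        """Record one revocation; return True if the key should now be throttled."""
        seen = self._seen[api_key]
        # Forget revocations that fell out of the rolling window.
        while seen and now - seen[0] >= self.window_s:
            seen.popleft()
        seen.append(now)
        return len(seen) > self.limit
```

Unlike a plain rate limiter, the breaker keeps counting after it trips, so the alert to the tenant can report how far past the threshold the integration went.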

Residual risk. Low.

3. Residual-Risk Summary

Residual	Count	Acceptance
Low	10	Accepted for GA
Medium	5	Accepted with mitigation commitments and named owners
High	0	—

GA requires zero High residual risks and explicit sign-off on every Medium risk from the named owner.

4. Risk Review Cadence

Weekly during development (Platform Architecture).
Monthly post-GA (Trust & Safety + SRE + Security).
Quarterly regulator-risk review (Regulator Liaison + Legal).
