consent-ledger-service — Service Risk Register
Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: FAILURE_MODES.md, SECURITY_MODEL.md, ADR-0004
Known service-level risks with owners, mitigations, and residual-risk classification. Risks are drawn from SERVICE_OVERVIEW.md §10 Open Points, architectural assumptions, regulatory exposure, and externally-facing attack surface. Scored using a simple 1–5 Likelihood × Impact scheme; residual risk after mitigation must be ≤ Medium for GA.
1. Risk Summary
| ID | Risk | Category | Likelihood | Impact | Pre-mitigation | Residual | Owner |
|---|---|---|---|---|---|---|---|
| CONS-RISK-01 | ATRA National DND registry does not exist or does not expose an API | Dependency | 4 | 4 | High | Medium | Regulator Liaison |
| CONS-RISK-02 | Hash-chain verifier bug invalidates audit integrity claim | Correctness | 2 | 5 | High | Low | Trust & Safety |
| CONS-RISK-03 | GDPR right-to-erasure vs. 7-year regulator retention conflict | Legal | 3 | 4 | High | Medium | Legal |
| CONS-RISK-04 | MSISDN enumeration attack via citizen-portal | Security | 3 | 3 | Medium | Low | Security |
| CONS-RISK-05 | CheckConsent hot-path latency regression breaks OTP SLA (sub-3 s) | Performance | 3 | 4 | High | Low | SRE |
| CONS-RISK-06 | STOP-keyword false positive rate too high in Pashto / Dari | Correctness | 4 | 3 | High | Medium | Trust & Safety |
| CONS-RISK-07 | Cross-region replication conflict on concurrent tenant consent writes | Correctness | 2 | 3 | Medium | Low | Platform Arch |
| CONS-RISK-08 | Bulk-import CSV injection attack | Security | 2 | 4 | Medium | Low | Security |
| CONS-RISK-09 | Unauthorised tenant exporting another tenant's consent records (RLS bypass) | Security | 2 | 5 | High | Low | Security |
| CONS-RISK-10 | HSM unavailable → erasure-tokenisation stalls; erasure SLA breach | Dependency | 2 | 3 | Medium | Low | SRE + Security |
| CONS-RISK-11 | S3 cold-tier archive bucket misconfiguration → lost records | Operations | 1 | 5 | Medium | Low | SRE |
| CONS-RISK-12 | STOP-keyword auto-suspension used adversarially by a tenant against competitors | Trust & Safety | 2 | 2 | Low | Low | Trust & Safety |
| CONS-RISK-13 | Fail-closed behaviour triggered by a single-AZ Postgres incident blocks national SMS | Availability | 2 | 5 | High | Medium | SRE |
| CONS-RISK-14 | Regulator changes DND schema without notice | Dependency | 3 | 3 | Medium | Medium | Regulator Liaison |
| CONS-RISK-15 | Consent SDK bug in tenant integration causes spurious mass-revocation | Integration | 2 | 4 | Medium | Low | DevRel |
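The register does not state how a Likelihood × Impact product maps to a band. The sketch below infers cut-offs from the table (product ≤ 4 is Low, 5 to 9 is Medium, ≥ 10 is High). Those thresholds and the name risk_band are assumptions rather than a documented rule, although all fifteen pre-mitigation values above are consistent with them.

```python
def risk_band(likelihood: int, impact: int) -> str:
    """Map 1-5 likelihood and 1-5 impact to a band.

    Cut-offs are inferred from the summary table (assumption):
    product <= 4 is Low, 5-9 is Medium, >= 10 is High.
    """
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be in 1..5")
    score = likelihood * impact
    if score >= 10:
        return "High"
    if score >= 5:
        return "Medium"
    return "Low"


# (likelihood, impact, pre-mitigation band) for CONS-RISK-01..15, copied from the table
TABLE = [
    (4, 4, "High"), (2, 5, "High"), (3, 4, "High"), (3, 3, "Medium"), (3, 4, "High"),
    (4, 3, "High"), (2, 3, "Medium"), (2, 4, "Medium"), (2, 5, "High"), (2, 3, "Medium"),
    (1, 5, "Medium"), (2, 2, "Low"), (2, 5, "High"), (3, 3, "Medium"), (2, 4, "Medium"),
]

# Indices of rows whose computed band disagrees with the table (expected: none)
mismatches = [i + 1 for i, (l, i_, band) in enumerate(TABLE) if risk_band(l, i_) != band]
```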
2. Risk Details
CONS-RISK-01 — ATRA National DND registry does not exist or does not expose an API
Scenario. ATRA has not yet committed to a National DND registry interface. If the registry (a) does not exist at launch, or (b) is delivered as a manual PDF, or (c) uses an undocumented SFTP convention, EP-CONS-01 / US-CONS-001 cannot be automated at the target cadence.
Impact. We cannot enforce a national DND — only per-tenant opt-out. This is a regulator-exposure risk if ATRA later publishes a list and expects us to honour it retroactively.
Mitigation.
- Design the sync worker against a pluggable adapter (AtraDndAdapter port with HttpsJsonAdapter, SftpCsvAdapter, ManualUploadAdapter implementations).
- Ship ManualUploadAdapter from day 1: a platform admin can upload a CSV via admin-dashboard, and it is parsed into consent.dnd_registry.
- Open an ATRA engagement workstream (Regulator Liaison, OKR) to converge on a formal API.
- Fall-back: Ghasi publishes a de-facto National DND registry itself, seeded from tenant STOP events, and offers it to ATRA.
Residual risk. Medium — the adapter gives us a graceful fall-back; the regulator relationship is the long pole.
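The adapter mitigation above can be sketched as a port with one concrete implementation. The names AtraDndAdapter and ManualUploadAdapter come from this register; the fetch method, the sync_dnd helper, and the msisdn CSV column are hypothetical, chosen only for illustration.

```python
import csv
import io
from typing import Iterable, Protocol


class AtraDndAdapter(Protocol):
    """Port: any source that can yield the current DND MSISDNs."""

    def fetch(self) -> Iterable[str]: ...


class ManualUploadAdapter:
    """Parses a CSV uploaded by a platform admin.

    Assumes a header row containing an `msisdn` column (hypothetical layout).
    """

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def fetch(self) -> Iterable[str]:
        reader = csv.DictReader(io.StringIO(self._csv_text))
        for row in reader:
            msisdn = (row.get("msisdn") or "").strip()
            if msisdn:  # skip rows with an empty MSISDN cell
                yield msisdn


def sync_dnd(adapter: AtraDndAdapter) -> set[str]:
    """Stand-in for the sync worker: returns the set that would be
    written to consent.dnd_registry. It depends only on the port."""
    return set(adapter.fetch())
```

Because sync_dnd sees only the port, an HTTPS or SFTP adapter could later replace the manual upload without changing the worker.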
CONS-RISK-02 — Hash-chain verifier bug invalidates audit integrity claim
Scenario. A bug in record_hash = sha256(payload || prev_hash) computation (e.g., canonicalisation drift, non-deterministic JSON serialisation) causes the daily verifier to flag false-positive breaks, which erodes regulator trust; or conversely, a bug masks a real tamper.
Impact. Loss of regulator-defensibility on the entire audit log.
Mitigation.
- Canonicalise payload using RFC 8785 JSON Canonicalization Scheme (JCS) — deterministic serialisation.
- Property-based tests verify chain invariants on 10 000+ random inputs before release.
- Two independent implementations: the producer (NestJS) and the verifier (Python cron) — divergence ⇒ test alert.
- Weekly chaos drill injects a deliberate tamper to confirm verifier detects it.
Residual risk. Low.
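A minimal sketch of the chain and its verifier, assuming an all-zero genesis hash and hex-encoded hashes (neither is specified in this register). The canonical function only approximates RFC 8785: it sorts keys and strips whitespace, but does not apply the JCS number-formatting rules or its UTF-16 key ordering.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed prev_hash for the first record


def canonical(payload: dict) -> bytes:
    # Approximation of RFC 8785 JCS: sorted keys, no insignificant whitespace.
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def record_hash(payload: dict, prev_hash: str) -> str:
    # sha256(payload || prev_hash), as in the scenario above
    return hashlib.sha256(canonical(payload) + prev_hash.encode("ascii")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "record_hash": record_hash(payload, prev)}
    )


def first_break(chain: list) -> Optional[int]:
    """Return the index of the first inconsistent record, or None if intact."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or rec["record_hash"] != record_hash(rec["payload"], prev):
            return i
        prev = rec["record_hash"]
    return None
```

The tamper drill then amounts to mutating one stored payload and checking that first_break returns its index.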
CONS-RISK-03 — GDPR right-to-erasure vs. 7-year regulator retention conflict
Scenario. A citizen invokes right-to-erasure. The consent-ledger must erase MSISDN from consent.records. But consent.audit must retain the record for 7 years. The two obligations conflict if audit retains the MSISDN verbatim.
Impact. GDPR violation (~€20M fine potential; reputation).
Mitigation.
- Erasure tokenises MSISDN in consent.audit via deterministic tokenisation (FF1 format-preserving encryption with an HSM-bound key). Audit retains the proof-of-event without the PII.
- National-DND row is retained (regulator override per SERVICE_OVERVIEW §6.3).
- Legal briefing documents the trade-off; citizen-portal erasure UI discloses that "proof of your prior consent is retained in hashed form for regulator audit".
- DPIA signed off by Legal.
Residual risk. Medium: depends on whether a regulator or court accepts deterministic tokenisation as adequate pseudonymisation.
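To illustrate how the audit trail can keep proof-of-event while dropping the MSISDN, the sketch below replaces the number with a deterministic keyed token. HMAC-SHA-256 is used purely as a stand-in, because FF1 is not available in the Python standard library. The in-memory key, the row shape, and both function names are assumptions; in the design above the key would be HSM-bound.

```python
import hashlib
import hmac


def tokenise_msisdn(msisdn: str, key: bytes) -> str:
    """Deterministic keyed token (HMAC-SHA-256 stand-in for FF1).

    The same MSISDN and key always yield the same token, so audit rows
    for one subscriber remain linkable after erasure.
    """
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_from_audit(audit_rows: list, msisdn: str, key: bytes) -> list:
    """Return audit rows with the MSISDN replaced by its token.

    All other fields, and rows for other MSISDNs, are left untouched.
    """
    token = tokenise_msisdn(msisdn, key)
    return [
        {**row, "msisdn": token} if row["msisdn"] == msisdn else row
        for row in audit_rows
    ]
```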
CONS-RISK-04 — MSISDN enumeration attack via citizen-portal
Scenario. Attacker probes /v1/consent/records?msisdn= for millions of MSISDNs to learn which are registered on Ghasi (and with which tenants).
Impact. Large-scale PII disclosure; tenant intelligence leak.
Mitigation.
- Citizen-portal /consent/records?msisdn= requires MSISDN-OTP verification first.
- OTP rate-limited per MSISDN (3/hour) and per source IP (10/hour).
- Anti-enumeration: uniform response time whether MSISDN is known or unknown.
- Kong adaptive rate-limit + JA3 fingerprint-based blocking at the edge.
- No public API for MSISDN existence check.
Residual risk. Low.
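The two OTP limits above (3 per hour per MSISDN, 10 per hour per source IP) can be sketched with an in-memory sliding window. The class and function names are hypothetical, and a real deployment would keep counters in shared storage rather than process memory.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` hits per key within any `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop hits that have aged out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


per_msisdn = SlidingWindowLimiter(limit=3, window_s=3600)
per_ip = SlidingWindowLimiter(limit=10, window_s=3600)


def may_send_otp(msisdn: str, source_ip: str, now: float) -> bool:
    # Evaluate both limiters (no short-circuit) so each attempt is
    # checked against, and counted by, both dimensions.
    ok_msisdn = per_msisdn.allow(msisdn, now)
    ok_ip = per_ip.allow(source_ip, now)
    return ok_msisdn and ok_ip
```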
CONS-RISK-05 — CheckConsent hot-path latency regression breaks OTP SLA
Scenario. An innocuous code change (e.g., added JSON-validation middleware) pushes CheckConsent P95 from 5 ms to 30 ms. Because compliance-engine and routing-engine both call it synchronously, the 30 ms flows directly into OTP submit-to-DLR latency, breaking the 3 s national SLA for the OTP lane.
Impact. Bank / healthcare OTP traffic SLA breach; tenant escalations.
Mitigation.
- Continuous benchmark in CI: CheckConsent benchmark must pass P95 ≤ 5 ms gate before merge.
- Canary deploy with automatic rollback on P95 > 15 ms for 5 min.
- Load test at 1.5× expected RPS is a GA gate (§8).
- Separate hot-path (CheckConsent) and admin-path (REST CRUD) pods so admin load cannot starve hot-path.
Residual risk. Low.
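The CI merge gate above reduces to a percentile check over benchmark samples. The sketch uses the nearest-rank definition of P95, which is an assumption, since the register does not say which percentile method the benchmark uses. Both function names are hypothetical.

```python
def p95(samples_ms: list) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def merge_gate_passes(samples_ms: list, budget_ms: float = 5.0) -> bool:
    """True when P95 is within the budget (5 ms in the mitigation above)."""
    return p95(samples_ms) <= budget_ms
```

The same p95 helper could drive the canary rule by comparing against 15 ms over a 5-minute window instead.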
CONS-RISK-06 — STOP-keyword false positive rate too high in Pashto / Dari
Scenario. The default Pashto / Dari STOP-keyword catalog matches legitimate words in those languages (e.g., a common name colliding with لغو). Result: unintended revocation at scale, and complaints to the regulator from tenants whose customers were suddenly opted out.
Impact. Tenant churn; revenue impact; regulator exposure.
Mitigation.
- Keywords hand-curated by native Pashto / Dari speakers (Trust & Safety + Legal review).
- False-positive feedback loop (US-CONS-011) — tenants can report false positives via portal.
- Multi-day shadow mode in staging (§10 P1) with real MO data (anonymised) to estimate FP rate.
- Per-tenant keyword overrides allowed (additive only).
- Context-aware matching: if body includes STOP keyword mid-sentence and is > 4 words total, require conservative-mode flag (opt-in per tenant).
Residual risk. Medium — natural-language variation is inherently imperfect.
CONS-RISK-07 — Cross-region replication conflict
Scenario. Per ADR-0004, consent.records is control-plane data replicated multi-master (kbl ↔ mzr). A concurrent write to the same (tenantId, msisdn, scope) in both regions can create a replication conflict.
Impact. Inconsistent consent state between regions; erratic CheckConsent results.
Mitigation.
- Last-Write-Wins (LWW) using HLC timestamp; conflict resolver picks newest.
- Consent is fundamentally append-only (opt-in and revoke are events; state is derived) — conflicts resolve monotonically because revocation is absorbing.
- Consent audit (append-only) never conflicts — only insertions.
- Hourly reconciliation cron checks the region-divergence count and alerts if > 0.01% of records differ.
Residual risk. Low.
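A minimal sketch of the Last-Write-Wins rule, assuming an HLC is represented as (physical milliseconds, logical counter, node id) and compared in that order. The field names, the node-id tie-breaker, and the state strings are assumptions; ADR-0004 may define the timestamp differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Hlc:
    physical_ms: int
    logical: int
    node_id: str  # tie-breaker so the ordering is total across regions


@dataclass(frozen=True)
class ConsentWrite:
    tenant_id: str
    msisdn: str
    scope: str
    state: str  # e.g. "opted_in" or "revoked" (illustrative values)
    hlc: Hlc


def resolve(local: ConsentWrite, remote: ConsentWrite) -> ConsentWrite:
    """LWW: keep the write with the greater HLC.

    Because the HLC ordering is total, both regions pick the same
    winner regardless of which side they treat as local.
    """
    return local if local.hlc >= remote.hlc else remote
```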
CONS-RISK-08 — Bulk-import CSV injection attack
Scenario. A malicious tenant uploads a CSV with embedded formula cells (=cmd|...!A1) intending to run code on a reviewer's spreadsheet when exported.
Impact. Reviewer workstation compromise.
Mitigation.
- CSV parser strictly typed — no formula evaluation.
- Fields sanitised on output: values beginning with =, +, -, @ are prefixed with '.
- Bulk-import reports rendered as HTML / PDF only (not raw CSV) to reviewers.
Residual risk. Low.
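The output-sanitisation rule above is small enough to show directly. The function names are hypothetical; the prefix set and the leading apostrophe are taken from the mitigation list.

```python
FORMULA_PREFIXES = ("=", "+", "-", "@")


def sanitise_cell(value: str) -> str:
    """Prefix cells that a spreadsheet might evaluate as a formula."""
    return "'" + value if value.startswith(FORMULA_PREFIXES) else value


def sanitise_row(row: list) -> list:
    """Apply sanitise_cell to every field of an export row."""
    return [sanitise_cell(cell) for cell in row]
```

Note that an MSISDN written in international format starts with + and is therefore prefixed as well, so consumers of the export need to expect the leading apostrophe.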
CONS-RISK-09 — Tenant exporting another tenant's consent records (RLS bypass)
Scenario. A bug in a query path or a service-role escalation lets tenant A read tenant B's consent.records.
Impact. Major PII breach; regulator reportable.
Mitigation.
- RLS enforced at Postgres level (not application) on every tenant-scoped table.
- Integration test test/integration/tenant-isolation.spec.ts is mandatory (per DEFINITION_OF_DONE) and runs in CI.
- Contract test verifies every REST and gRPC path filters by tenantId from JWT / SVID.
- Quarterly security review.
Residual risk. Low.
CONS-RISK-10 — HSM unavailable → erasure tokenisation stalls
Scenario. HSM unavailable for > 1 h blocks erasure processor (tokenisation requires HSM-bound deterministic key). Erasure SLA (30 d) breached only if HSM outage > 30 d — but a prolonged incident still erodes trust.
Mitigation.
- HSM HA with quorum (per ADR-0004 §11): 2 regional nodes + escrow.
- Erasure processor queues requests and retries on HSM recovery.
- Escalation: if HSM outage > 4 h, alert (PagerDuty Critical).
Residual risk. Low.
CONS-RISK-11 — S3 cold-tier archive bucket misconfiguration → lost records
Scenario. Wrong lifecycle policy or wrong bucket-versioning setting causes partitions older than 13 months to be deleted rather than archived.
Impact. Regulator-defensibility lost for deleted range; irrecoverable.
Mitigation.
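- S3 bucket has Object Lock enabled in governance mode (default retention 7 years). Note that governance mode, unlike compliance mode, can be bypassed by principals holding the s3:BypassGovernanceRetention permission.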
- Archive job is blue-green: write to new bucket before deleting from hot.
- Weekly cron verifies audit-row-count(hot + cold) ≥ previous-week-count.
- Bucket policy change requires dual-control approval via admin-dashboard.
Residual risk. Low.
CONS-RISK-12 — STOP auto-suspension weaponised against competitors
Scenario. A tenant's competitor sends spoofed MO messages containing STOP keywords that appear to originate from the victim tenant's subscribers, causing mass opt-out.
Impact. Commercial damage to victim tenant.
Mitigation.
- MO origin authenticated against MSISDN (the STOP MUST come from the actual recipient, not a spoofed MO).
- Per-sender-MO rate-limit on STOP processing (if the same MSISDN sends STOP to many tenants in a minute, queue for review).
- Tenant can appeal via false-positive reporting.
Residual risk. Low.
CONS-RISK-13 — Fail-closed blocks national SMS during Postgres incident
Scenario. A single-AZ Postgres incident takes down consent-ledger. Fail-closed behaviour means every SMS submission is now blocked.
Impact. National SMS outage.
Mitigation.
- Postgres HA with synchronous replica within region; automatic fail-over ≤ 30 s.
- Redis hot-cache absorbs 97%+ of hot-path traffic, masking DB incidents up to cache TTL (300 s).
- Per ADR-0004 §5, region-level fail-over to Mazar is manual-gated but tested quarterly.
- Emergency override: P0 emergency lane (e.g., CBC broadcasts) is not gated by consent-ledger.
Residual risk. Medium — national-scale incident still possible in worst case.
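The interaction between the hot cache and fail-closed behaviour can be sketched as follows. The 300 s TTL comes from the mitigation list; the dict-based cache, the db_lookup callable, and the ConsentUnavailable exception are hypothetical stand-ins for Redis, Postgres, and whatever error the service actually raises.

```python
from typing import Callable

CACHE_TTL_S = 300


class ConsentUnavailable(Exception):
    """No fresh answer is available; the caller must block the send."""


def check_consent(
    key: str, now: float, cache: dict, db_lookup: Callable[[str], bool]
) -> bool:
    entry = cache.get(key)
    if entry is not None:
        allowed, stored_at = entry
        if now - stored_at < CACHE_TTL_S:
            return allowed  # cache hit: the database is not touched
    try:
        allowed = db_lookup(key)
    except Exception as exc:
        # Fail closed: without a fresh answer the send is not permitted
        raise ConsentUnavailable(key) from exc
    cache[key] = (allowed, now)
    return allowed
```

Keys cached before an incident keep answering until their entry expires, which is why the cache masks a database outage only up to the TTL.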
CONS-RISK-14 — Regulator changes DND schema without notice
Scenario. ATRA changes their published DND file format mid-year.
Impact. Sync worker fails silently or parses wrong fields.
Mitigation.
- Schema version captured per sync; mismatch triggers alert + manual review.
- Adapter abstraction allows new adapter deployment without full-service redeploy.
- Regulator Liaison maintains change-management contact.
Residual risk. Medium.
CONS-RISK-15 — Consent SDK bug causes spurious mass-revocation
Scenario. Tenant integration accidentally loops through customers calling RevokeConsent instead of RecordConsent.
Impact. Mass opt-out; tenant complaint; recovery requires investigation.
Mitigation.
- SDK has rate-limit defaults (≤ 100 req/s per API key) with big-red-button warning if exceeded.
- Server-side circuit breaker: if a single API key issues > 1 000 revocations in 1 h, throttle + alert tenant.
- Audit-log-driven recovery: if tenant requests a restore, the audit log can be replayed to a prior point-in-time with tenant sign-off.
Residual risk. Low.
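The server-side circuit breaker above can be sketched with per-key hourly counters. Fixed one-hour buckets are a simplification of a true rolling window, and the class name, method name, and return values are hypothetical.

```python
from collections import Counter


class RevocationBreaker:
    """Throttles an API key once it exceeds `limit` revocations in one bucket."""

    def __init__(self, limit: int = 1000, window_s: int = 3600) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts = Counter()

    def on_revocation(self, api_key: str, now: float) -> str:
        """Return "accept" or "throttle" for this revocation request."""
        bucket = (api_key, int(now // self.window_s))
        self._counts[bucket] += 1
        # "> limit" mirrors the rule above: more than 1 000 in 1 h trips the breaker
        return "throttle" if self._counts[bucket] > self.limit else "accept"
```

A "throttle" result is also the point at which the tenant alert described above would be raised.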
3. Residual-Risk Summary
| Residual | Count | Acceptance |
|---|---|---|
| Low | 10 | Accepted for GA |
| Medium | 5 | Accepted with mitigation commitments and named owners |
| High | 0 | — |
GA requires zero High residual risks and explicit sign-off on every Medium risk from the named owner.
4. Risk Review Cadence
- Weekly during development (Platform Architecture).
- Monthly post-GA (Trust & Safety + SRE + Security).
- Quarterly regulator-risk review (Regulator Liaison + Legal).