consent-ledger-service — Service Risk Register

Version: 1.0
Status: Draft
Owner: Trust and Safety + SRE
Last Updated: 2026-04-21
References: FAILURE_MODES.md, SECURITY_MODEL.md, ADR-0004

Known service-level risks with owners, mitigations, and residual-risk classification. Risks are drawn from SERVICE_OVERVIEW.md §10 Open Points, architectural assumptions, regulatory exposure, and externally-facing attack surface. Scored using a simple 1–5 Likelihood × Impact scheme; residual risk after mitigation must be ≤ Medium for GA.
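
The scoring scheme can be sketched as below. The register does not state where the Low / Medium / High bands begin, so the thresholds here are an assumption inferred from the Risk Summary scores (products of 4 and below are rated Low, 5 to 9 Medium, 10 and above High). The function names are hypothetical.

```python
def score(likelihood: int, impact: int) -> int:
    """Likelihood x Impact on the register's 1-5 scales."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return likelihood * impact


def band(product: int) -> str:
    # ASSUMPTION: band thresholds are inferred from the Risk Summary rows;
    # the register itself does not publish them.
    if product >= 10:
        return "High"
    if product >= 5:
        return "Medium"
    return "Low"


def ga_ready(residual_bands: list[str]) -> bool:
    """GA rule from the register: no residual risk above Medium."""
    return all(b in ("Low", "Medium") for b in residual_bands)
```

For example, a risk scored Likelihood 2 and Impact 5 has a product of 10, which this sketch rates High before mitigation.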


1. Risk Summary

| ID | Risk | Category | Likelihood | Impact | Pre-mitigation | Residual | Owner |
|---|---|---|---|---|---|---|---|
| CONS-RISK-01 | ATRA National DND registry does not exist or does not expose an API | Dependency | 4 | 4 | High | Medium | Regulator Liaison |
| CONS-RISK-02 | Hash-chain verifier bug invalidates audit integrity claim | Correctness | 2 | 5 | High | Low | Trust & Safety |
| CONS-RISK-03 | GDPR right-to-erasure vs. 7-year regulator retention conflict | Legal | 3 | 4 | High | Medium | Legal |
| CONS-RISK-04 | MSISDN enumeration attack via citizen-portal | Security | 3 | 3 | Medium | Low | Security |
| CONS-RISK-05 | CheckConsent hot-path latency regression breaks OTP SLA (sub-3 s) | Performance | 3 | 4 | High | Low | SRE |
| CONS-RISK-06 | STOP-keyword false positive rate too high in Pashto / Dari | Correctness | 4 | 3 | High | Medium | Trust & Safety |
| CONS-RISK-07 | Cross-region replication conflict on concurrent tenant consent writes | Correctness | 2 | 3 | Medium | Low | Platform Arch |
| CONS-RISK-08 | Bulk-import CSV injection attack | Security | 2 | 4 | Medium | Low | Security |
| CONS-RISK-09 | Unauthorised tenant exporting another tenant's consent records (RLS bypass) | Security | 2 | 5 | High | Low | Security |
| CONS-RISK-10 | HSM unavailable → erasure-tokenisation stalls; erasure SLA breach | Dependency | 2 | 3 | Medium | Low | SRE + Security |
| CONS-RISK-11 | S3 cold-tier archive bucket misconfiguration → lost records | Operations | 1 | 5 | Medium | Low | SRE |
| CONS-RISK-12 | STOP-keyword auto-suspension used adversarially by a tenant against competitors | Trust & Safety | 2 | 2 | Low | Low | Trust & Safety |
| CONS-RISK-13 | Fail-closed behaviour triggered by a single-AZ Postgres incident blocks national SMS | Availability | 2 | 5 | High | Medium | SRE |
| CONS-RISK-14 | Regulator changes DND schema without notice | Dependency | 3 | 3 | Medium | Medium | Regulator Liaison |
| CONS-RISK-15 | Consent SDK bug in tenant integration causes spurious mass-revocation | Integration | 2 | 4 | Medium | Low | DevRel |

2. Risk Details

CONS-RISK-01 — ATRA National DND registry does not exist or does not expose an API

Scenario. ATRA has not yet committed to a National DND registry interface. If the registry (a) does not exist at launch, or (b) is delivered as a manual PDF, or (c) uses an undocumented SFTP convention, EP-CONS-01 / US-CONS-001 cannot be automated at the target cadence.

Impact. We cannot enforce a national DND — only per-tenant opt-out. This is a regulator-exposure risk if ATRA later publishes a list and expects us to honour it retroactively.

Mitigation.

  1. Design the sync worker against a pluggable adapter (AtraDndAdapter port with HttpsJsonAdapter, SftpCsvAdapter, ManualUploadAdapter implementations).
  2. Ship ManualUploadAdapter from day 1 — a platform admin can upload a CSV via admin-dashboard and it is parsed into consent.dnd_registry.
  3. Open an ATRA engagement workstream (Regulator Liaison, OKR) to converge on a formal API.
  4. Fall-back: Ghasi publishes a de-facto National DND registry itself, seeded from tenant STOP events, and offers it to ATRA.
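
The adapter port in mitigation 1 might look roughly like the sketch below. Only the names AtraDndAdapter and ManualUploadAdapter come from this register. The method name `fetch_msisdns`, the single `msisdn` CSV column, and the in-memory set standing in for consent.dnd_registry are all assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Protocol


class AtraDndAdapter(Protocol):
    """Port: any source that can yield the current DND MSISDNs."""

    def fetch_msisdns(self) -> Iterable[str]: ...


class ManualUploadAdapter:
    """Adapter for a CSV uploaded by a platform admin.

    ASSUMPTION: the file has a header row with a column named 'msisdn'.
    """

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def fetch_msisdns(self) -> Iterable[str]:
        reader = csv.DictReader(io.StringIO(self._csv_text))
        for row in reader:
            msisdn = (row.get("msisdn") or "").strip()
            if msisdn:
                yield msisdn


def sync_dnd(adapter: AtraDndAdapter, registry: set[str]) -> int:
    """Merge adapter output into the registry; return the number of new entries."""
    before = len(registry)
    registry.update(adapter.fetch_msisdns())
    return len(registry) - before
```

Because the sync worker depends only on the port, an HttpsJsonAdapter or SftpCsvAdapter could later replace the manual upload without changing `sync_dnd`.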

Residual risk. Medium — the adapter gives us a graceful fall-back; the regulator relationship is the long pole.


CONS-RISK-02 — Hash-chain verifier bug invalidates audit integrity claim

Scenario. A bug in record_hash = sha256(payload || prev_hash) computation (e.g., canonicalisation drift, non-deterministic JSON serialisation) causes the daily verifier to flag false-positive breaks, which erodes regulator trust; or conversely, a bug masks a real tamper.

Impact. Loss of regulator-defensibility on the entire audit log.

Mitigation.

  1. Canonicalise payload using RFC 8785 JSON Canonicalization Scheme (JCS) — deterministic serialisation.
  2. Property-based tests verify chain invariants on 10 000+ random inputs before release.
  3. Two independent implementations: the producer (NestJS) and the verifier (Python cron) — divergence ⇒ test alert.
  4. Weekly chaos drill injects a deliberate tamper to confirm verifier detects it.
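
A minimal sketch of the chain and its verifier follows. It makes two assumptions the register does not confirm: `prev_hash` is concatenated as its 64-character hex string, and the first record links to an all-zero genesis value. The `canonical` helper only approximates RFC 8785 (sorted keys, compact separators); it is not a conforming JCS implementation, since JCS also fixes number formatting and orders keys by UTF-16 code units.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # ASSUMPTION: prev_hash used for the first record


def canonical(payload: dict) -> bytes:
    # Approximation of RFC 8785 JCS; adequate only for simple string/integer payloads.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def record_hash(payload: dict, prev_hash: str) -> str:
    """sha256(payload || prev_hash), with payload canonicalised first."""
    return hashlib.sha256(canonical(payload) + prev_hash.encode("ascii")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "record_hash": record_hash(payload, prev)})


def first_break(chain: list[dict]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or rec["record_hash"] != record_hash(rec["payload"], prev):
            return i
        prev = rec["record_hash"]
    return None
```

A tamper drill in the spirit of mitigation 4 would alter one stored payload and confirm that `first_break` reports that record's index.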

Residual risk. Low.


CONS-RISK-03 — GDPR right-to-erasure vs. 7-year regulator retention conflict

Scenario. A citizen invokes right-to-erasure. The consent-ledger must erase MSISDN from consent.records. But consent.audit must retain the record for 7 years. The two obligations conflict if audit retains the MSISDN verbatim.

Impact. GDPR violation (fines of up to €20M or 4% of global annual turnover, whichever is higher; reputational damage).

Mitigation.

  1. Erasure replaces the MSISDN in consent.audit with a deterministic token (FF1 format-preserving encryption with an HSM-bound key; strictly a keyed cipher rather than a hash). Audit retains the proof-of-event without the PII.
  2. National-DND row is retained (regulator override per SERVICE_OVERVIEW §6.3).
  3. Legal briefing documents the trade-off; citizen-portal erasure UI discloses that "proof of your prior consent is retained in hashed form for regulator audit".
  4. DPIA signed off by Legal.
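
The effect of deterministic tokenisation on audit rows can be illustrated as follows. FF1 (NIST SP 800-38G) is not available in the Python standard library, so this sketch substitutes HMAC-SHA256 as a stand-in. Unlike FF1, the stand-in is neither format-preserving nor reversible, and the in-memory key merely represents the HSM-bound key. The row layout and function names are hypothetical.

```python
import hashlib
import hmac


def tokenise_msisdn(msisdn: str, key: bytes) -> str:
    # STAND-IN: HMAC-SHA256 instead of FF1; the same input and key always give the same token.
    return "tok_" + hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


def erase(audit_rows: list[dict], msisdn: str, key: bytes) -> int:
    """Replace the MSISDN in every matching audit row; return the number of rows changed."""
    token = tokenise_msisdn(msisdn, key)
    changed = 0
    for row in audit_rows:
        if row.get("msisdn") == msisdn:
            row["msisdn"] = token
            changed += 1
    return changed
```

Because the token is deterministic, the erased citizen's events remain linked to one another in the audit trail even though the MSISDN itself is no longer stored.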

Residual risk. Medium — depends on whether a regulator or court accepts deterministic tokenisation as adequate pseudonymisation.


CONS-RISK-04 — MSISDN enumeration attack via citizen-portal

Scenario. Attacker probes /v1/consent/records?msisdn= for millions of MSISDNs to learn which are registered on Ghasi (and with which tenants).

Impact. Large-scale PII disclosure; tenant intelligence leak.

Mitigation.

  1. Citizen-portal /consent/records?msisdn= requires MSISDN-OTP verification first.
  2. OTP rate-limited per MSISDN (3/hour) and per source IP (10/hour).
  3. Anti-enumeration: uniform response time whether MSISDN is known or unknown.
  4. Kong adaptive rate-limit + JA3 fingerprint-based blocking at the edge.
  5. No public API for MSISDN existence check.
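
The two OTP limits in mitigation 2 could be combined as in the sketch below. It uses an in-memory sliding window purely for illustration; a real deployment would presumably keep the counters in a shared store or enforce them at the gateway. The class and function names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` events per key within any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def has_capacity(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop events that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        return len(hits) < self.limit

    def record(self, key: str, now: float) -> None:
        self._hits[key].append(now)


def allow_otp(msisdn: str, ip: str, now: float,
              per_msisdn: SlidingWindowLimiter, per_ip: SlidingWindowLimiter) -> bool:
    # Check both limits before recording, so a request denied by one
    # limiter does not consume budget in the other.
    if not (per_msisdn.has_capacity(msisdn, now) and per_ip.has_capacity(ip, now)):
        return False
    per_msisdn.record(msisdn, now)
    per_ip.record(ip, now)
    return True
```

With the register's values, the limiters would be constructed as `SlidingWindowLimiter(3, 3600)` per MSISDN and `SlidingWindowLimiter(10, 3600)` per source IP.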

Residual risk. Low.


CONS-RISK-05 — CheckConsent hot-path latency regression breaks OTP SLA

Scenario. An innocuous code change (e.g., added JSON-validation middleware) pushes CheckConsent P95 from 5 ms to 30 ms. Because compliance-engine and routing-engine both call it synchronously, the added latency is paid on each call and flows directly into OTP submit-to-DLR latency, breaking the 3 s national SLA for the OTP lane.

Impact. Bank / healthcare OTP traffic SLA breach; tenant escalations.

Mitigation.

  1. Continuous benchmark in CI: CheckConsent benchmark must pass P95 ≤ 5 ms gate before merge.
  2. Canary deploy with automatic rollback on P95 > 15 ms for 5 min.
  3. Load test at 1.5× expected RPS is a GA gate (§8).
  4. Separate hot-path (CheckConsent) and admin-path (REST CRUD) pods so admin load cannot starve hot-path.
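
The CI gate and the canary rollback rule reduce to two small checks, sketched below. The nearest-rank percentile definition and the use of one P95 reading per minute are assumptions; the register specifies only the thresholds (5 ms, and 15 ms sustained for 5 min).

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def ci_gate_passes(samples_ms: list[float], limit_ms: float = 5.0) -> bool:
    """Merge gate: benchmark P95 must not exceed the limit."""
    return p95(samples_ms) <= limit_ms


def should_roll_back(p95_per_minute_ms: list[float],
                     limit_ms: float = 15.0, minutes: int = 5) -> bool:
    """True when the latest `minutes` one-minute P95 readings all exceed the limit."""
    recent = p95_per_minute_ms[-minutes:]
    return len(recent) == minutes and all(v > limit_ms for v in recent)
```

Requiring every reading in the window to breach the limit keeps a single noisy minute from triggering a rollback.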

Residual risk. Low.


CONS-RISK-06 — STOP-keyword false positive rate too high in Pashto / Dari

Scenario. The default Pashto / Dari STOP-keyword catalog matches legitimate words in those languages (e.g., a common name colliding with لغو). Result: unintended revocation at scale; regulator complaints from tenants whose customers were suddenly opted out.

Impact. Tenant churn; revenue impact; regulator exposure.

Mitigation.

  1. Keywords hand-curated by native Pashto / Dari speakers (Trust & Safety + Legal review).
  2. False-positive feedback loop (US-CONS-011) — tenants can report false positives via portal.
  3. Multi-day shadow mode in staging (§10 P1) with real MO data (anonymised) to estimate FP rate.
  4. Per-tenant keyword overrides allowed (additive only).
  5. Context-aware matching: if body includes STOP keyword mid-sentence and is > 4 words total, require conservative-mode flag (opt-in per tenant).

Residual risk. Medium — natural-language variation is inherently imperfect.


CONS-RISK-07 — Cross-region replication conflict

Scenario. Per ADR-0004, consent.records is control-plane data replicated multi-master (kbl ↔ mzr). A concurrent write to the same (tenantId, msisdn, scope) in both regions can create a replication conflict.

Impact. Inconsistent consent state between regions; erratic CheckConsent results.

Mitigation.

  1. Last-Write-Wins (LWW) using HLC timestamp; conflict resolver picks newest.
  2. Consent is fundamentally append-only (opt-in and revoke are events; state is derived) — conflicts resolve monotonically because revocation is absorbing.
  3. Consent audit (append-only) never conflicts — only insertions.
  4. Reconciliation cron hourly checks region-divergence count and alerts if > 0.01% of records differ.
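
Mitigations 1 and 4 can be sketched as below. The HLC layout (wall-clock milliseconds, logical counter, region name as final tie-breaker) is an assumption, as are the type and function names; the register specifies only that the newest HLC timestamp wins and that divergence above 0.01% raises an alert.

```python
from typing import NamedTuple


class Hlc(NamedTuple):
    wall_ms: int   # physical component
    counter: int   # logical component, orders events within the same millisecond
    region: str    # ASSUMPTION: region name breaks exact ties deterministically


class ConsentWrite(NamedTuple):
    hlc: Hlc
    state: str     # e.g. "opted_in" or "revoked"


def resolve(a: ConsentWrite, b: ConsentWrite) -> ConsentWrite:
    """Last-Write-Wins: keep the write with the greater HLC (tuple comparison)."""
    return a if a.hlc >= b.hlc else b


def divergence_alert(differing: int, total: int) -> bool:
    """Alert when more than 0.01% (1 in 10 000) of records differ between regions."""
    return total > 0 and differing * 10_000 > total
```

Both regions reach the same result regardless of the order in which they see the two writes, because the comparison is a total order over the HLC tuple.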

Residual risk. Low.


CONS-RISK-08 — Bulk-import CSV injection attack

Scenario. A malicious tenant uploads a CSV with embedded formula cells (=cmd|...!A1) intending to run code on a reviewer's spreadsheet when exported.

Impact. Reviewer workstation compromise.

Mitigation.

  1. CSV parser strictly typed — no formula evaluation.
  2. Fields sanitised on output: values beginning with =, +, -, @ are prefixed with '.
  3. Bulk-import reports rendered as HTML / PDF only (not raw CSV) to reviewers.
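
Mitigation 2 amounts to a small output filter, sketched here with hypothetical function names. It covers exactly the four prefixes listed above; whether further characters (such as a leading tab) should also be neutralised is left open.

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@")


def neutralise(value: str) -> str:
    """Prefix a single quote so spreadsheet software treats the cell as text."""
    return "'" + value if value.startswith(FORMULA_PREFIXES) else value


def write_report(rows: list[list[object]]) -> str:
    """Serialise rows to CSV with every cell quoted and neutralised."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([neutralise(str(cell)) for cell in row])
    return buf.getvalue()
```

One side effect worth noting: an MSISDN written in international format begins with `+`, so it is also prefixed on output. A consumer that parses these exports mechanically would need to strip the leading quote.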

Residual risk. Low.


CONS-RISK-09 — Unauthorised tenant exporting another tenant's consent records (RLS bypass)

Scenario. A bug in a query path or a service-role escalation lets tenant A read tenant B's consent.records.

Impact. Major PII breach; regulator reportable.

Mitigation.

  1. RLS enforced at Postgres level (not application) on every tenant-scoped table.
  2. Integration test test/integration/tenant-isolation.spec.ts is mandatory (per DEFINITION_OF_DONE) and runs in CI.
  3. Contract test verifies every REST and gRPC path filters by tenantId from JWT / SVID.
  4. Quarterly security review.

Residual risk. Low.


CONS-RISK-10 — HSM unavailable → erasure tokenisation stalls

Scenario. HSM unavailable for > 1 h blocks erasure processor (tokenisation requires HSM-bound deterministic key). Erasure SLA (30 d) breached only if HSM outage > 30 d — but a prolonged incident still erodes trust.

Mitigation.

  1. HSM HA with quorum (per ADR-0004 §11): 2 regional nodes + escrow.
  2. Erasure processor queues requests and retries on HSM recovery.
  3. Escalation: if HSM outage > 4 h, alert (PagerDuty Critical).

Residual risk. Low.


CONS-RISK-11 — S3 cold-tier archive bucket misconfiguration → lost records

Scenario. A wrong lifecycle policy or bucket-versioning setting causes partitions older than 13 months to be deleted rather than archived.

Impact. Regulator-defensibility lost for deleted range; irrecoverable.

Mitigation.

  1. S3 bucket has immutable object-lock in governance mode (default 7 years).
  2. Archive job is blue-green: write to new bucket before deleting from hot.
  3. Weekly cron verifies audit-row-count(hot + cold) ≥ previous-week-count.
  4. Bucket policy change requires dual-control approval via admin-dashboard.

Residual risk. Low.


CONS-RISK-12 — STOP auto-suspension weaponised against competitors

Scenario. A tenant's competitor sends spoofed MO messages that appear to come from the tenant's subscribers and contain STOP keywords, causing mass opt-out.

Impact. Commercial damage to victim tenant.

Mitigation.

  1. MO origin authenticated against MSISDN (the STOP MUST come from the actual recipient, not a spoofed MO).
  2. Per-sender-MO rate-limit on STOP processing (if the same MSISDN sends STOP to many tenants in a minute, queue for review).
  3. Tenant can appeal via false-positive reporting.

Residual risk. Low.


CONS-RISK-13 — Fail-closed blocks national SMS during Postgres incident

Scenario. A single-AZ Postgres incident takes down consent-ledger. Fail-closed behaviour means every SMS submission is now blocked.

Impact. National SMS outage.

Mitigation.

  1. Postgres HA with synchronous replica within region; automatic fail-over ≤ 30 s.
  2. Redis hot-cache absorbs 97%+ of hot-path traffic, masking DB incidents up to cache TTL (300 s).
  3. Per ADR-0004 §5, region-level fail-over to Mazar is manual-gated but tested quarterly.
  4. Emergency override: P0 emergency lane (e.g., CBC broadcasts) is not gated by consent-ledger.
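
How the hot-cache in mitigation 2 interacts with fail-closed behaviour can be sketched as follows. A plain dictionary stands in for Redis, and the `lookup` callable stands in for the Postgres query; both, along with the class name, are assumptions for illustration.

```python
from typing import Callable


class CachedConsentChecker:
    """Serves consent decisions from a TTL cache and fails closed on a miss during an outage."""

    def __init__(self, lookup: Callable[[str, str], bool], ttl_s: float = 300.0) -> None:
        self._lookup = lookup  # may raise when the database is unavailable
        self._ttl_s = ttl_s
        self._cache: dict[tuple[str, str], tuple[float, bool]] = {}

    def check(self, tenant_id: str, msisdn: str, now: float) -> bool:
        key = (tenant_id, msisdn)
        hit = self._cache.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]  # fresh cache entry: the database is not consulted
        try:
            allowed = self._lookup(tenant_id, msisdn)
        except Exception:
            return False  # fail closed: no fresh entry and no database means block
        self._cache[key] = (now + self._ttl_s, allowed)
        return allowed
```

In this sketch, a recipient whose decision was cached just before the incident keeps being served for up to 300 s, while an uncached recipient is blocked immediately. That is why the cache masks only short incidents.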

Residual risk. Medium — national-scale incident still possible in worst case.


CONS-RISK-14 — Regulator changes DND schema without notice

Scenario. ATRA changes their published DND file format mid-year.

Impact. Sync worker fails silently or parses wrong fields.

Mitigation.

  1. Schema version captured per sync; mismatch triggers alert + manual review.
  2. Adapter abstraction allows new adapter deployment without full-service redeploy.
  3. Regulator Liaison maintains change-management contact.
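
One way to capture a schema version per sync is to fingerprint the file's header row, as sketched below. This is an assumption: the register does not say how the version is derived, and ATRA may eventually publish an explicit version field. The function names and return values are hypothetical.

```python
import csv
import hashlib
import io


def schema_fingerprint(csv_text: str) -> str:
    """Hash of the normalised header row (trimmed, lower-cased column names in order)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    normalised = ",".join(col.strip().lower() for col in header)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def check_sync(csv_text: str, expected_fingerprint: str) -> str:
    """Return 'proceed' when the header matches the last accepted schema, else 'hold_for_review'."""
    if schema_fingerprint(csv_text) == expected_fingerprint:
        return "proceed"
    return "hold_for_review"
```

A renamed, added, or reordered column changes the fingerprint, so the sync is held before any row is parsed into the wrong field.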

Residual risk. Medium.


CONS-RISK-15 — Consent SDK bug in tenant integration causes spurious mass-revocation

Scenario. A tenant integration accidentally loops through customers calling RevokeConsent instead of RecordConsent.

Impact. Mass opt-out; tenant complaint; recovery requires investigation.

Mitigation.

  1. The SDK ships with rate-limit defaults (≤ 100 req/s per API key) and raises a prominent warning when they are exceeded.
  2. Server-side circuit breaker: if a single API key issues > 1 000 revocations in 1 h, throttle + alert tenant.
  3. Audit-log-driven recovery: if tenant requests a restore, the audit log can be replayed to a prior point-in-time with tenant sign-off.
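
The server-side circuit breaker in mitigation 2 could be approximated as below. The class name and the in-memory storage are assumptions; the thresholds (more than 1 000 revocations within 1 h per API key) come from the register.

```python
from collections import defaultdict, deque


class RevocationBreaker:
    """Trips for an API key once it exceeds `max_revocations` within `window_s` seconds."""

    def __init__(self, max_revocations: int = 1000, window_s: float = 3600.0) -> None:
        self.max_revocations = max_revocations
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def record(self, api_key: str, now: float) -> bool:
        """Record one revocation; return True if the key should now be throttled."""
        hits = self._hits[api_key]
        # Forget revocations older than the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_revocations
```

When `record` returns True, the caller would throttle the key and alert the tenant; the revocations already accepted remain recoverable through the audit-log replay described in mitigation 3.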

Residual risk. Low.


3. Residual-Risk Summary

| Residual | Count | Acceptance |
|---|---|---|
| Low | 10 | Accepted for GA |
| Medium | 5 | Accepted with mitigation commitments and named owners |
| High | 0 | |

GA requires zero High residual risks and explicit sign-off on every Medium risk from the named owner.


4. Risk Review Cadence

  • Weekly during development (Platform Architecture).
  • Monthly post-GA (Trust & Safety + SRE + Security).
  • Quarterly regulator-risk review (Regulator Liaison + Legal).