compliance-engine — Service Risk Register

Status: populated | Last updated: 2026-04-18

Risk Scoring

Likelihood × Impact	Low Impact	Medium Impact	High Impact	Critical Impact
Low likelihood	LOW	LOW	MEDIUM	HIGH
Medium likelihood	LOW	MEDIUM	HIGH	HIGH
High likelihood	MEDIUM	HIGH	HIGH	CRITICAL
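
The matrix above can be read as a two-level lookup. The following is a minimal sketch that transcribes the table; the names `RATING` and `rate` are illustrative and not part of compliance-engine.

```python
# Rating matrix transcribed from the table above:
# outer key is likelihood, inner key is impact.
RATING = {
    "Low":    {"Low": "LOW",    "Medium": "LOW",    "High": "MEDIUM", "Critical": "HIGH"},
    "Medium": {"Low": "LOW",    "Medium": "MEDIUM", "High": "HIGH",   "Critical": "HIGH"},
    "High":   {"Low": "MEDIUM", "Medium": "HIGH",   "High": "HIGH",   "Critical": "CRITICAL"},
}


def rate(likelihood: str, impact: str) -> str:
    """Return the rating for a likelihood/impact pair (KeyError on unknown labels)."""
    return RATING[likelihood][impact]
```

A lookup like this makes it easy to check that each risk's stated Rating agrees with its Likelihood and Impact rows.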

1. Operational Risks

R-OPS-01 — Compliance-engine becomes single point of failure in hot path

Attribute	Value
Likelihood	Low
Impact	Critical
Rating	HIGH
Description	Every outbound SMS depends on a compliance verdict. The layer is always fail-closed: if compliance-engine is unavailable, messages are held in the NATS queue via redelivery and, once retries are exhausted, moved to the dead-letter queue. Tenants see delayed processing in the web portal, but no unverified message is ever dispatched.
Mitigation	(1) Minimum 3 replicas with PDB = 2; (2) HPA auto-scales under load; (3) Fail-closed NATS redelivery (3 attempts × 30 s ack wait); (4) Dead-letter queue with tenant notification; (5) Phase 1 observation-mode rollout validates operation without affecting traffic
Owner	Platform SRE
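
The fail-closed behaviour described above can be sketched as a pure decision function. This is an illustration under assumptions, not the service's consumer code: the `Action` names, the verdict strings, and the `decide` signature are hypothetical, and only the 3 attempts and 30 s ack wait come from the mitigation list.

```python
from enum import Enum

MAX_DELIVERIES = 3      # from the mitigation list: 3 attempts
ACK_WAIT_SECONDS = 30   # from the mitigation list: 30 s ack wait


class Action(Enum):
    DISPATCH = "dispatch"        # verdict allows the message
    WITHHOLD = "withhold"        # verdict is anything other than ALLOW
    REDELIVER = "redeliver"      # no verdict yet, let the queue redeliver
    DEAD_LETTER = "dead_letter"  # no verdict after the final attempt


def decide(verdict, delivery_count):
    """Map a verdict (None when compliance-engine gave no answer) to an action.

    delivery_count is 1 for the first delivery of a message.
    """
    if verdict == "ALLOW":
        return Action.DISPATCH
    if verdict is not None:
        return Action.WITHHOLD
    # Fail-closed: without a verdict the message is never dispatched.
    if delivery_count < MAX_DELIVERIES:
        return Action.REDELIVER
    return Action.DEAD_LETTER


def worst_case_wait_seconds():
    """Rough upper bound before dead-lettering, assuming each attempt waits the full ack wait."""
    return MAX_DELIVERIES * ACK_WAIT_SECONDS
```

Under these assumptions a message waits roughly 3 × 30 s = 90 s before it reaches the dead-letter queue, and there is no path from a missing verdict to `DISPATCH`.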

R-OPS-02 — LLM API outage disables AI rules

Attribute	Value
Likelihood	Medium
Impact	Medium
Rating	MEDIUM
Description	AI_CLASSIFICATION rules rely on Claude/OpenAI availability. Outages at the provider reduce detection coverage.
Mitigation	(1) Per-rule `fallbackAction`; (2) Multi-provider support (Claude primary, OpenAI secondary); (3) 24 h AI result cache absorbs partial outages; (4) Circuit breaker skips AI evaluation when provider is degraded
Owner	Platform Engineering
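
As a rough illustration of how a circuit breaker and a per-rule `fallbackAction` could interact, consider the sketch below. Only the `fallbackAction` field name comes from the register; the `CircuitBreaker` class, its thresholds, and `evaluate_ai_rule` are assumptions, and the breaker is simplified (it closes fully after a cool-down instead of using a half-open probe).

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and closes again after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker and allow a fresh attempt.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0


def evaluate_ai_rule(rule, body, classify, breaker):
    """Return the provider's verdict, or rule['fallbackAction'] when it cannot be obtained."""
    if breaker.is_open():
        return rule["fallbackAction"]  # provider considered degraded: skip the call
    try:
        verdict = classify(body)
    except Exception:
        breaker.record_failure()
        return rule["fallbackAction"]
    breaker.record_success()
    return verdict
```

Here `classify` stands in for whichever provider client is configured. While the breaker is open the provider is not called at all, which is what keeps a degraded provider from adding latency to every evaluation.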

R-OPS-03 — Hold queue overflow during mass-spam events

Attribute	Value
Likelihood	Medium
Impact	Medium
Rating	MEDIUM
Description	A coordinated spam attack or misconfigured tenant can flood the hold queue, overwhelming reviewers.
Mitigation	(1) 24 h auto-expiry of unreviewed holds; (2) Bulk-reject API endpoint for admins; (3) Tier escalation to RESTRICTED/SUSPENDED reduces inflow; (4) Alert at 500 pending, page at 2000
Owner	Trust & Safety
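
The alerting thresholds and the 24 h auto-expiry can be expressed as two small checks. This is a sketch only: the function names are hypothetical, and treating the thresholds as inclusive is an assumption the register does not state.

```python
from datetime import datetime, timedelta, timezone

ALERT_THRESHOLD = 500            # from the mitigation list: alert at 500 pending
PAGE_THRESHOLD = 2000            # from the mitigation list: page at 2000
HOLD_TTL = timedelta(hours=24)   # from the mitigation list: 24 h auto-expiry


def queue_severity(pending: int) -> str:
    """Classify hold-queue depth (thresholds assumed inclusive)."""
    if pending >= PAGE_THRESHOLD:
        return "page"
    if pending >= ALERT_THRESHOLD:
        return "alert"
    return "ok"


def is_expired(held_at: datetime, now: datetime) -> bool:
    """True once an unreviewed hold has been pending for the full TTL."""
    return now - held_at >= HOLD_TTL
```

For example, `queue_severity(1200)` falls in the alert band, and a hold created at midnight UTC is expired at midnight the following day.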

R-OPS-04 — PostgreSQL connection pool exhaustion

Attribute	Value
Likelihood	Low
Impact	High
Rating	MEDIUM
Description	Under sustained high load, the DB connection pool may be exhausted, causing evaluation timeouts.
Mitigation	(1) Redis caching absorbs 95%+ of read load; (2) PgBouncer in transaction pooling mode; (3) Pool size tuned via load tests; (4) Monitor `pg_pool_waiting_clients` metric
Owner	Platform DBA

2. Security Risks

R-SEC-01 — Message body leakage via LLM provider

Attribute	Value
Likelihood	Low
Impact	High
Rating	MEDIUM
Description	Sending customer message bodies to a third-party LLM creates a data leakage risk if the provider stores or misuses the content.
Mitigation	(1) Local LLM is the primary provider — message bodies stay within our trust boundary; (2) `ANONYMIZE_BODY_BEFORE_AI=true` redacts PII as defence-in-depth even for local LLM; (3) External LLM failover is opt-in per deployment, not default; (4) If external LLM is enabled, DPA with provider prohibits training on data and requires zero-retention mode; (5) Regulated deployments can disable external failover entirely
Owner	Security
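
To make the redaction step concrete, here is a deliberately simple sketch of body anonymization before an AI call. The register does not say how `ANONYMIZE_BODY_BEFORE_AI` detects PII; the two patterns and the placeholder tokens below are assumptions chosen for illustration and would miss many real-world PII formats.

```python
import re

# Hypothetical patterns: e-mail addresses and long digit runs (phone-like numbers).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\+?\d[\d\s-]{5,}\d")


def anonymize_body(body: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    body = EMAIL.sub("[EMAIL]", body)
    return LONG_NUMBER.sub("[NUMBER]", body)
```

With these patterns, `anonymize_body("Call +1 555-123-4567 or mail a.b@example.com")` yields `"Call [NUMBER] or mail [EMAIL]"`, while a 6-digit OTP code is left untouched because the number pattern needs at least seven characters.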

R-SEC-02 — Rule misconfiguration blocks legitimate traffic

Attribute	Value
Likelihood	Medium
Impact	High
Rating	HIGH
Description	An overly broad regex or keyword rule can block legitimate tenant traffic, causing business disruption and compensation liability.
Mitigation	(1) All new rules go through shadow-mode testing before enforcement; (2) Rule versioning allows instant rollback; (3) Required peer review before activating BLOCK rules; (4) Per-rule match count tracked in Prometheus — unexpectedly high match rate triggers alert; (5) Rule changes logged to audit trail
Owner	Trust & Safety

R-SEC-03 — Compromised admin account modifies rules to bypass compliance

Attribute	Value
Likelihood	Low
Impact	Critical
Rating	HIGH
Description	An attacker with `platform.compliance.admin` credentials could disable rules, tier-override malicious tenants, or exfiltrate hold queue contents.
Mitigation	(1) MFA enforced for compliance admin accounts; (2) All admin actions logged to immutable audit trail; (3) Audit events replicated to SIEM; (4) Anomaly detection on rule change frequency; (5) Tier override requires secondary approval (future: two-person rule)
Owner	Security

R-SEC-04 — ReDoS attack via malicious regex rule

Attribute	Value
Likelihood	Low
Impact	High
Rating	MEDIUM
Description	A malicious or naive admin could create a REGEX rule with catastrophic backtracking, freezing the evaluator.
Mitigation	(1) Regex engine: `re2` (linear-time, no backtracking); (2) Regex validation at save time rejects overly complex patterns; (3) Hard 10 ms timeout per regex evaluation; (4) Budget enforcement caps total evaluation time
Owner	Platform Engineering
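
Budget enforcement across a rule set can be sketched as a deadline check between rules. This is an assumption-laden illustration: `evaluate_with_budget`, the `(name, predicate)` rule shape, and the decision to skip over-budget rules are hypothetical, and the check only runs between rules, so it does not replace the per-regex timeout or the linear-time guarantee of `re2`.

```python
import time


def evaluate_with_budget(rules, body, budget_ms, clock=time.monotonic):
    """Run (name, predicate) rules in order until the total time budget is spent.

    Returns (matched, skipped): names of rules that matched, and names of
    rules that were not evaluated because the budget had run out.
    """
    deadline = clock() + budget_ms / 1000.0
    matched, skipped = [], []
    for name, predicate in rules:
        if clock() >= deadline:
            skipped.append(name)  # budget exhausted: do not start this rule
            continue
        if predicate(body):
            matched.append(name)
    return matched, skipped
```

Returning the skipped rule names lets the caller decide how to treat an incomplete evaluation, for example by holding the message when a mandatory rule was not reached.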

R-SEC-05 — Audit log tampering

Attribute	Value
Likelihood	Low
Impact	Critical
Rating	HIGH
Description	An attacker with DB access could attempt to modify the audit log to hide malicious actions.
Mitigation	(1) PostgreSQL rules reject UPDATE/DELETE on audit_log; (2) NATS event replication creates independent audit copy; (3) Off-site backup of audit log to immutable S3 bucket (object lock enabled); (4) Integrity verification run regularly via cron
Owner	Security

3. Compliance / Regulatory Risks

R-REG-01 — Regulator requires stricter enforcement than configured

Attribute	Value
Likelihood	Medium
Impact	High
Rating	HIGH
Description	A national telecom authority mandates blocking of specific content categories that are not currently in our rule set, creating compliance exposure.
Mitigation	(1) Regulatory rule updates tracked in Trust & Safety backlog; (2) Rule sets versioned for audit; (3) Default rule set includes platform-level mandatory rules that cannot be disabled by tenant admins; (4) Quarterly regulatory review
Owner	Legal + Trust & Safety

R-REG-02 — Cross-border data flow restrictions

Attribute	Value
Likelihood	Medium
Impact	Medium
Rating	MEDIUM
Description	Sending message content to a US-hosted LLM may violate data residency requirements in certain jurisdictions.
Mitigation	(1) Regional deployment option — AI provider endpoint selection per region; (2) Per-tenant AI enablement flag; (3) Fallback to keyword/regex only for restricted tenants; (4) Future: evaluate on-premise LLM deployment for high-regulation markets
Owner	Legal

R-REG-03 — Erasure requests for message content in the hold queue

Attribute	Value
Likelihood	Low
Impact	Medium
Rating	LOW
Description	When a user exercises erasure rights, their message content in the hold queue must be redacted without compromising audit integrity.
Mitigation	(1) On `tenant_erased.v1` event, redact `payload.body`, `payload.to`, `payload.from_id` in hold_queue while preserving metadata; (2) Evaluation log already stores only hashes (GDPR-minimal); (3) Audit log entries redacted consistently
Owner	Legal + Platform Engineering
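
The redaction described in the mitigation above can be sketched as follows. The field names `payload.body`, `payload.to`, and `payload.from_id` come from the register; the dictionary shape of a hold-queue entry, the `[REDACTED]` marker, and the function name are assumptions.

```python
import copy

REDACTED = "[REDACTED]"                           # hypothetical marker value
ERASABLE_PAYLOAD_FIELDS = ("body", "to", "from_id")


def redact_hold_entry(entry: dict) -> dict:
    """Return a copy of a hold-queue entry with erasable payload fields redacted.

    Metadata outside the listed payload fields is preserved unchanged.
    """
    redacted = copy.deepcopy(entry)
    payload = redacted.get("payload", {})
    for field in ERASABLE_PAYLOAD_FIELDS:
        if field in payload:
            payload[field] = REDACTED
    return redacted
```

Working on a copy keeps the function free of side effects, so a handler for `tenant_erased.v1` could compute the redacted entry first and write it back in a single update.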

4. Product / Business Risks

R-BUS-01 — False-positive BLOCK damages tenant trust

Attribute	Value
Likelihood	Medium
Impact	High
Rating	HIGH
Description	A legitimate high-value tenant has their traffic blocked by a false-positive rule match, leading to SLA breach and reputation damage.
Mitigation	(1) ALLOW rules for verified trusted senders override BLOCK rules; (2) Template approval workflow for high-volume legitimate use cases (OTP, alerts); (3) Rapid rule rollback via version history; (4) Dedicated support channel for compliance-blocked messages
Owner	Trust & Safety + Product
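
One way to picture the ALLOW-over-BLOCK precedence for verified senders is the resolution function below. It is a sketch under assumptions: the register does not define how multiple matches are combined, so the BLOCK-before-HOLD ordering, the ALLOW default when nothing matches, and the `sender_is_verified` flag are all hypothetical.

```python
def resolve_verdict(matched_actions, sender_is_verified):
    """Combine the actions of all matched rules into a single verdict."""
    actions = set(matched_actions)
    # An ALLOW rule only overrides other matches for verified trusted senders.
    if "ALLOW" in actions and sender_is_verified:
        return "ALLOW"
    for action in ("BLOCK", "HOLD"):  # assumed precedence: BLOCK before HOLD
        if action in actions:
            return action
    return "ALLOW"  # assumed default when no restrictive rule matched
```

For an unverified sender, an ALLOW match has no overriding effect, so `resolve_verdict(["ALLOW", "BLOCK"], False)` still resolves to `"BLOCK"`.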

R-BUS-02 — AI API cost overrun

Attribute	Value
Likelihood	Medium
Impact	Medium
Rating	MEDIUM
Description	LLM calls at scale can become expensive, especially for repeated unique messages that bypass the cache.
Mitigation	(1) 24 h AI result cache keyed by body hash (typical OTP/alert templates have a >95% cache hit rate); (2) AI rules gated to specific categories, not all messages; (3) Use smallest capable model (Claude Haiku vs Sonnet); (4) Daily cost monitoring with budget alerts; (5) `ANONYMIZE_BODY_BEFORE_AI` reduces token count
Owner	Platform Engineering
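
A cache keyed by body hash with a 24 h lifetime might look like the in-memory sketch below. The class and method names are hypothetical, SHA-256 is an assumed hash choice, and a real deployment would more likely use a shared store than a process-local dictionary.

```python
import hashlib
import time

TTL_SECONDS = 24 * 3600  # 24 h, from the mitigation list


class AiResultCache:
    """In-memory stand-in for an AI result cache keyed by a hash of the message body."""

    def __init__(self, ttl_seconds=TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # body hash -> (stored_at, verdict)

    @staticmethod
    def key(body: str) -> str:
        # Assumed hash function; the register only says "body hash".
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def get(self, body: str):
        """Return the cached verdict, or None if absent or older than the TTL."""
        k = self.key(body)
        entry = self._entries.get(k)
        if entry is None:
            return None
        stored_at, verdict = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[k]  # expired: drop it so the next call re-evaluates
            return None
        return verdict

    def put(self, body: str, verdict: str) -> None:
        self._entries[self.key(body)] = (self._clock(), verdict)
```

Because the key is derived from the body alone, two messages with identical text share one LLM call within the TTL, which is where the cost saving comes from.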

R-BUS-03 — Reviewer team capacity insufficient for hold queue

Attribute	Value
Likelihood	Medium
Impact	Medium
Rating	MEDIUM
Description	Growth in held messages outpaces reviewer team capacity, leading to long review delays and auto-expiry of legitimate messages.
Mitigation	(1) 24 h auto-expiry prevents indefinite delay; (2) Review priority algorithm surfaces highest-risk items first; (3) Bulk-action endpoints for common patterns; (4) Capacity planning model based on message volume × BLOCK/HOLD rate; (5) Tier-based auto-rejection for repeat offenders
Owner	Trust & Safety
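
The capacity planning model (message volume × BLOCK/HOLD rate) can be turned into a headcount estimate with one extra input. In this sketch the reviewer throughput parameter and the function name are assumptions; the register only names the volume and rate terms.

```python
import math


def reviewers_needed(daily_messages: int, hold_rate: float,
                     reviews_per_reviewer_per_day: int) -> int:
    """Estimate reviewers required to clear one day's held messages within a day.

    hold_rate is the fraction of messages that end up in the hold queue.
    """
    held_per_day = daily_messages * hold_rate
    return math.ceil(held_per_day / reviews_per_reviewer_per_day)
```

Rounding up reflects that a fractional shortfall still leaves items waiting, and with 24 h auto-expiry any backlog older than a day is lost to review entirely.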

R-BUS-04 — Tenant scoring algorithm unfair to new accounts

Attribute	Value
Likelihood	Medium
Impact	Low
Rating	LOW
Description	New tenants without message history have no score signal. Defaulting them to CLEAR may be too lenient, while defaulting them to MONITOR may be too strict.
Mitigation	(1) Tenure bonus is small (10 pts) so new accounts start near average; (2) First 1,000 messages evaluated with heightened scrutiny (future: new-account rule set); (3) Human review of first BLOCK/HOLD patterns for new accounts
Owner	Trust & Safety

5. Risk Summary Matrix

Risk ID	Rating	Owner	Review Cadence
R-OPS-01	HIGH	Platform SRE	Quarterly
R-OPS-02	MEDIUM	Platform Engineering	Quarterly
R-OPS-03	MEDIUM	Trust & Safety	Monthly
R-OPS-04	MEDIUM	Platform DBA	Quarterly
R-SEC-01	MEDIUM	Security	Quarterly
R-SEC-02	HIGH	Trust & Safety	Monthly
R-SEC-03	HIGH	Security	Quarterly
R-SEC-04	MEDIUM	Platform Engineering	Quarterly
R-SEC-05	HIGH	Security	Quarterly
R-REG-01	HIGH	Legal + Trust & Safety	Quarterly
R-REG-02	MEDIUM	Legal	Bi-annual
R-REG-03	LOW	Legal + Platform Engineering	Annual
R-BUS-01	HIGH	Trust & Safety + Product	Monthly
R-BUS-02	MEDIUM	Platform Engineering	Monthly
R-BUS-03	MEDIUM	Trust & Safety	Monthly
R-BUS-04	LOW	Trust & Safety	Quarterly
