compliance-engine — Service Risk Register
Status: populated | Last updated: 2026-04-18
Risk Scoring
| Likelihood × Impact | Low Impact | Medium Impact | High Impact | Critical Impact |
|---|---|---|---|---|
| Low likelihood | LOW | LOW | MEDIUM | HIGH |
| Medium likelihood | LOW | MEDIUM | HIGH | HIGH |
| High likelihood | MEDIUM | HIGH | HIGH | CRITICAL |
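
The scoring table above can be read as a two-key lookup. The following is a minimal sketch of that lookup; the function name `risk_rating` and the lowercase string keys are illustrative assumptions, not part of compliance-engine.

```python
# Rows are likelihood, columns are impact, copied from the Risk Scoring table.
_MATRIX = {
    "low":    {"low": "LOW",    "medium": "LOW",    "high": "MEDIUM", "critical": "HIGH"},
    "medium": {"low": "LOW",    "medium": "MEDIUM", "high": "HIGH",   "critical": "HIGH"},
    "high":   {"low": "MEDIUM", "medium": "HIGH",   "high": "HIGH",   "critical": "CRITICAL"},
}


def risk_rating(likelihood: str, impact: str) -> str:
    """Return the rating for a likelihood/impact pair (case-insensitive).

    Raises KeyError for values that are not in the table.
    """
    return _MATRIX[likelihood.strip().lower()][impact.strip().lower()]
```

For example, `risk_rating("Low", "Critical")` yields `HIGH`, which matches the rating recorded for R-OPS-01.
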
1. Operational Risks
R-OPS-01 — Compliance-engine becomes single point of failure in hot path
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | Every outbound SMS depends on a compliance verdict. The layer is always fail-closed: if compliance-engine is unavailable, messages are held in the NATS queue via redelivery and, once retries are exhausted, moved to the dead-letter queue. Tenants see delayed processing in the web portal, but no unverified message is ever dispatched. |
| Mitigation | (1) Minimum 3 replicas with PDB = 2; (2) HPA auto-scales under load; (3) Fail-closed NATS redelivery (3 attempts × 30 s ack wait); (4) Dead-letter queue with tenant notification; (5) Phase 1 observation-mode rollout validates operation without affecting traffic |
| Owner | Platform SRE |
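
The fail-closed behaviour described for R-OPS-01 can be sketched as a small decision function. This is an illustration only: the names `next_step` and `worst_case_hold_seconds` and the returned action strings are hypothetical, and the real consumer is configured in NATS rather than in application code. Only the two constants come from the register.

```python
from typing import Optional

MAX_DELIVER = 3   # register: 3 delivery attempts
ACK_WAIT_S = 30   # register: 30 s ack wait per attempt


def next_step(delivery_count: int, verdict: Optional[str]) -> str:
    """Return the next action for a message on its Nth delivery (1-based).

    `verdict` is None when compliance-engine could not be reached.
    """
    if verdict is not None:
        return "apply_verdict"   # engine answered; act on its verdict
    if delivery_count < MAX_DELIVER:
        return "redeliver"       # no ack, so the message is redelivered later
    return "dead_letter"         # retries exhausted; never dispatch unverified


def worst_case_hold_seconds() -> int:
    """Upper bound on queue time before dead-lettering,
    assuming every attempt waits for the full ack wait."""
    return MAX_DELIVER * ACK_WAIT_S
```

Under these assumptions a message spends at most 3 × 30 s = 90 s in redelivery before it reaches the dead-letter queue, and no branch of `next_step` dispatches a message without a verdict.
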
R-OPS-02 — LLM API outage disables AI rules
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | AI_CLASSIFICATION rules rely on Claude/OpenAI availability. Outages at the provider reduce detection coverage. |
| Mitigation | (1) Per-rule fallbackAction; (2) Multi-provider support (Claude primary, OpenAI secondary); (3) 24 h AI result cache absorbs partial outages; (4) Circuit breaker skips AI evaluation when provider is degraded |
| Owner | Platform Engineering |
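
Mitigations (1) and (4) for R-OPS-02 interact: when the circuit breaker is open, the rule's fallbackAction is returned instead of calling the provider. The sketch below shows one common way to wire this up. The class, its thresholds, and the `classify` callable are assumptions for illustration and do not describe the actual compliance-engine implementation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures and
    lets calls through again once `reset_after_s` has elapsed."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: after the cool-down, calls are allowed again;
        # a further failure re-opens the breaker.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def evaluate_ai_rule(body: str, classify: Callable[[str], str],
                     breaker: CircuitBreaker, fallback_action: str) -> str:
    """Call the provider unless the breaker is open; otherwise use the fallback."""
    if not breaker.allow():
        return fallback_action
    try:
        action = classify(body)
    except Exception:
        breaker.record_failure()
        return fallback_action
    breaker.record_success()
    return action
```

Passing the clock in as a parameter keeps the breaker testable without real waiting; a production version would also need to be safe under concurrent use.
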
R-OPS-03 — Hold queue overflow during mass-spam events
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | A coordinated spam attack or misconfigured tenant can flood the hold queue, overwhelming reviewers. |
| Mitigation | (1) 24 h auto-expiry of unreviewed holds; (2) Bulk-reject API endpoint for admins; (3) Tier escalation to RESTRICTED/SUSPENDED reduces inflow; (4) Alert at 500 pending, page at 2000 |
| Owner | Trust & Safety |
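
The thresholds and expiry window in the R-OPS-03 mitigation translate into two simple checks. The sketch below assumes the thresholds are inclusive (an alert fires at exactly 500 pending) and that expiry is measured from the time a message was held; both are assumptions, as are the function names.

```python
from datetime import datetime, timedelta, timezone

ALERT_AT = 500                   # register: alert at 500 pending
PAGE_AT = 2000                   # register: page at 2000 pending
HOLD_TTL = timedelta(hours=24)   # register: 24 h auto-expiry


def hold_queue_severity(pending: int) -> str:
    """Map the number of pending holds to 'ok', 'alert', or 'page'."""
    if pending >= PAGE_AT:
        return "page"
    if pending >= ALERT_AT:
        return "alert"
    return "ok"


def is_expired(held_at: datetime, now: datetime) -> bool:
    """True once an unreviewed hold has been pending for 24 h or more."""
    return now - held_at >= HOLD_TTL
```
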
R-OPS-04 — PostgreSQL connection pool exhaustion
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | Under sustained high load, the DB connection pool may be exhausted, causing evaluation timeouts. |
| Mitigation | (1) Redis caching absorbs 95%+ of read load; (2) PgBouncer in transaction pooling mode; (3) Pool size tuned via load tests; (4) Monitor pg_pool_waiting_clients metric |
| Owner | Platform DBA |
2. Security Risks
R-SEC-01 — Message body leakage via LLM provider
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | Sending customer message bodies to a third-party LLM creates a data leakage risk if the provider stores or misuses the content. |
| Mitigation | (1) Local LLM is the primary provider — message bodies stay within our trust boundary; (2) ANONYMIZE_BODY_BEFORE_AI=true redacts PII as defence-in-depth even for local LLM; (3) External LLM failover is opt-in per deployment, not default; (4) If external LLM is enabled, DPA with provider prohibits training on data and requires zero-retention mode; (5) Regulated deployments can disable external failover entirely |
| Owner | Security |
R-SEC-02 — Rule misconfiguration blocks legitimate traffic
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | An overly broad regex or keyword rule can block legitimate tenant traffic, causing business disruption and compensation liability. |
| Mitigation | (1) All new rules go through shadow-mode testing before enforcement; (2) Rule versioning allows instant rollback; (3) Required peer review before activating BLOCK rules; (4) Per-rule match count tracked in Prometheus — unexpectedly high match rate triggers alert; (5) Rule changes logged to audit trail |
| Owner | Trust & Safety |
R-SEC-03 — Compromised admin account modifies rules to bypass compliance
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | An attacker with platform.compliance.admin credentials could disable rules, tier-override malicious tenants, or exfiltrate hold queue contents. |
| Mitigation | (1) MFA enforced for compliance admin accounts; (2) All admin actions logged to immutable audit trail; (3) Audit events replicated to SIEM; (4) Anomaly detection on rule change frequency; (5) Tier override requires secondary approval (future: two-person rule) |
| Owner | Security |
R-SEC-04 — ReDoS attack via malicious regex rule
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | A malicious or naive admin could create a REGEX rule with catastrophic backtracking, freezing the evaluator. |
| Mitigation | (1) Regex engine: re2 (linear-time, no backtracking); (2) Regex validation at save time rejects overly complex patterns; (3) Hard 10 ms timeout per regex evaluation; (4) Budget enforcement caps total evaluation time |
| Owner | Platform Engineering |
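
Mitigation (4) for R-SEC-04, a cap on total evaluation time, can be illustrated with a cooperative deadline check between rules. This is a sketch under stated assumptions: the function name, the rule representation as `(name, predicate)` pairs, and the budget value passed by the caller are all hypothetical, and the check cannot interrupt a rule that is already running (that is what the per-regex timeout and the linear-time engine are for).

```python
import time
from typing import Callable, Iterable, List, Tuple

Rule = Tuple[str, Callable[[str], bool]]


def evaluate_with_budget(body: str, rules: Iterable[Rule], budget_s: float,
                         clock: Callable[[], float] = time.monotonic
                         ) -> Tuple[List[str], bool]:
    """Run rules in order until the time budget is spent.

    Returns (names of matched rules, budget_exhausted). The caller decides
    how to treat an exhausted budget.
    """
    deadline = clock() + budget_s
    matched: List[str] = []
    for name, matches in rules:
        if clock() >= deadline:
            return matched, True   # stop before starting another rule
        if matches(body):
            matched.append(name)
    return matched, False
```
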
R-SEC-05 — Audit log tampering
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | An attacker with DB access could attempt to modify the audit log to hide malicious actions. |
| Mitigation | (1) PostgreSQL rules reject UPDATE/DELETE on audit_log; (2) NATS event replication creates independent audit copy; (3) Off-site backup of audit log to immutable S3 bucket (object lock enabled); (4) Regular integrity verification cron |
| Owner | Security |
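
The register does not say how the integrity verification cron in mitigation (4) works. One common approach, shown here purely as an assumption, is a hash chain: each entry's hash covers the previous hash, so altering or removing any earlier entry invalidates every hash after it. All names below are hypothetical.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # fixed starting value for the chain


def entry_hash(prev_hash: str, entry: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the entry."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(entries: List[dict]) -> List[str]:
    """Compute the chained hash for every entry, in order."""
    hashes, prev = [], GENESIS
    for entry in entries:
        prev = entry_hash(prev, entry)
        hashes.append(prev)
    return hashes


def first_tampered_index(entries: List[dict], stored: List[str]) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, (entry, expected) in enumerate(zip(entries, stored)):
        if entry_hash(prev, entry) != expected:
            return i
        prev = expected
    if len(entries) != len(stored):
        return min(len(entries), len(stored))  # rows added or removed at the end
    return None
```

A verification job along these lines would compare the live audit_log against hashes kept in the independent copy (for example the object-locked S3 backup), so that an attacker with DB access cannot rewrite both.
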
3. Compliance / Regulatory Risks
R-REG-01 — Regulator requires stricter enforcement than configured
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | A national telecom authority mandates blocking of specific content categories that are not currently in our rule set, creating compliance exposure. |
| Mitigation | (1) Regulatory rule updates tracked in Trust & Safety backlog; (2) Rule sets versioned for audit; (3) Default rule set includes platform-level mandatory rules that cannot be disabled by tenant admins; (4) Quarterly regulatory review |
| Owner | Legal + Trust & Safety |
R-REG-02 — Cross-border data flow restrictions
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | Sending message content to a US-hosted LLM may violate data residency requirements in certain jurisdictions. |
| Mitigation | (1) Regional deployment option — AI provider endpoint selection per region; (2) Per-tenant AI enablement flag; (3) Fallback to keyword/regex only for restricted tenants; (4) Future: evaluate on-premise LLM deployment for high-regulation markets |
| Owner | Legal |
R-REG-03 — GDPR right-to-erasure request includes hold queue entries
| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Medium |
| Rating | LOW |
| Description | When a user exercises erasure rights, their message content in the hold queue must be redacted without compromising audit integrity. |
| Mitigation | (1) On tenant_erased.v1 event, redact payload.body, payload.to, payload.from_id in hold_queue while preserving metadata; (2) Evaluation log already stores only hashes (GDPR-minimal); (3) Audit log entries redacted consistently |
| Owner | Legal + Platform Engineering |
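
Mitigation (1) for R-REG-03 amounts to overwriting three payload fields while leaving everything else in the row intact. A minimal sketch follows, assuming a hold_queue row is represented as a dict with a nested `payload` dict; the placeholder string and function name are illustrative.

```python
import copy

REDACTED = "[REDACTED]"                      # hypothetical placeholder value
REDACT_FIELDS = ("body", "to", "from_id")    # register: payload.body, payload.to, payload.from_id


def redact_hold_entry(entry: dict) -> dict:
    """Return a copy of a hold_queue entry with personal payload fields redacted.

    Metadata outside those fields (ids, timestamps, rule matches) is preserved.
    """
    redacted = copy.deepcopy(entry)
    payload = redacted.get("payload", {})
    for field in REDACT_FIELDS:
        if field in payload:
            payload[field] = REDACTED
    return redacted
```

Returning a copy rather than mutating in place makes it easy to log a before/after comparison of field names (never values) in the audit trail.
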
4. Product / Business Risks
R-BUS-01 — False-positive BLOCK damages tenant trust
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | A legitimate high-value tenant has their traffic blocked by a false-positive rule match, leading to SLA breach and reputation damage. |
| Mitigation | (1) ALLOW rules for verified trusted senders override BLOCK rules; (2) Template approval workflow for high-volume legitimate use cases (OTP, alerts); (3) Rapid rule rollback via version history; (4) Dedicated support channel for compliance-blocked messages |
| Owner | Trust & Safety + Product |
R-BUS-02 — AI API cost overrun
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | LLM calls at scale can become expensive, especially for high volumes of unique messages that miss the cache. |
| Mitigation | (1) 24 h AI result cache keyed by body hash (typical OTP/alert templates see a >95% cache hit rate); (2) AI rules gated to specific categories, not all messages; (3) Use the smallest capable model (e.g. Claude Haiku rather than Sonnet); (4) Daily cost monitoring with budget alerts; (5) ANONYMIZE_BODY_BEFORE_AI reduces token count |
| Owner | Platform Engineering |
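
The cache in mitigation (1) for R-BUS-02 can be sketched as a TTL map keyed by a hash of the message body. The in-memory dict below stands in for whatever store the service actually uses, and SHA-256, the class name, and the `classify` callable are assumptions made for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

TTL_S = 24 * 60 * 60  # register: 24 h AI result cache


class AIResultCache:
    """Maps a hash of the message body to a cached verdict with an expiry time."""

    def __init__(self, ttl_s: float = TTL_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def key(body: str) -> str:
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def get(self, body: str) -> Optional[str]:
        k = self.key(body)
        item = self._store.get(k)
        if item is None:
            return None
        expires_at, verdict = item
        if self._clock() >= expires_at:
            del self._store[k]   # entry is older than the TTL
            return None
        return verdict

    def put(self, body: str, verdict: str) -> None:
        self._store[self.key(body)] = (self._clock() + self._ttl_s, verdict)


def classify_cached(body: str, cache: AIResultCache,
                    classify: Callable[[str], str]) -> str:
    """Call the LLM only on a cache miss."""
    cached = cache.get(body)
    if cached is not None:
        return cached
    verdict = classify(body)
    cache.put(body, verdict)
    return verdict
```

Because the key is a hash of the exact body, messages that differ only in a variable such as an OTP code hash differently; the high hit rate quoted above would depend on how templated messages are normalised before hashing, which the register does not specify.
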
R-BUS-03 — Reviewer team capacity insufficient for hold queue
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | Growth in held messages outpaces reviewer team capacity, leading to long review delays and auto-expiry of legitimate messages. |
| Mitigation | (1) 24 h auto-expiry prevents indefinite delay; (2) Review priority algorithm surfaces highest-risk items first; (3) Bulk-action endpoints for common patterns; (4) Capacity planning model based on message volume × BLOCK/HOLD rate; (5) Tier-based auto-rejection for repeat offenders |
| Owner | Trust & Safety |
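
The capacity planning model in mitigation (4) for R-BUS-03 is described only as message volume × BLOCK/HOLD rate. One way to turn that product into a staffing figure is to divide by reviewer throughput, as sketched below. The throughput parameter and all numbers in the usage note are hypothetical.

```python
import math


def reviewers_needed(daily_messages: int, hold_rate: float,
                     reviews_per_reviewer_per_day: int) -> int:
    """Estimate reviewers required to clear one day's held messages.

    held per day = daily_messages * hold_rate
    reviewers    = ceil(held per day / per-reviewer throughput)
    """
    if reviews_per_reviewer_per_day <= 0:
        raise ValueError("reviews_per_reviewer_per_day must be positive")
    held_per_day = daily_messages * hold_rate
    return math.ceil(held_per_day / reviews_per_reviewer_per_day)
```

With illustrative inputs of 1,000,000 messages per day, a 0.15% hold rate, and 400 reviews per reviewer per day, about 1,500 messages are held daily and 1,500 / 400 = 3.75 rounds up to 4 reviewers.
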
R-BUS-04 — Tenant scoring algorithm unfair to new accounts
| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Low |
| Rating | LOW |
| Description | New tenants without message history have no score signal; defaulting to CLEAR may be too lenient, while defaulting to MONITOR may be too strict. |
| Mitigation | (1) Tenure bonus is small (10 pts) so new accounts start near average; (2) First 1,000 messages evaluated with heightened scrutiny (future: new-account rule set); (3) Human review of first BLOCK/HOLD patterns for new accounts |
| Owner | Trust & Safety |
5. Risk Summary Matrix
| Risk ID | Rating | Owner | Review Cadence |
|---|---|---|---|
| R-OPS-01 | HIGH | Platform SRE | Quarterly |
| R-OPS-02 | MEDIUM | Platform Engineering | Quarterly |
| R-OPS-03 | MEDIUM | Trust & Safety | Monthly |
| R-OPS-04 | MEDIUM | Platform DBA | Quarterly |
| R-SEC-01 | MEDIUM | Security | Quarterly |
| R-SEC-02 | HIGH | Trust & Safety | Monthly |
| R-SEC-03 | HIGH | Security | Quarterly |
| R-SEC-04 | MEDIUM | Platform Engineering | Quarterly |
| R-SEC-05 | HIGH | Security | Quarterly |
| R-REG-01 | HIGH | Legal + Trust & Safety | Quarterly |
| R-REG-02 | MEDIUM | Legal | Bi-annual |
| R-REG-03 | LOW | Legal + Platform Engineering | Annual |
| R-BUS-01 | HIGH | Trust & Safety + Product | Monthly |
| R-BUS-02 | MEDIUM | Platform Engineering | Monthly |
| R-BUS-03 | MEDIUM | Trust & Safety | Monthly |
| R-BUS-04 | LOW | Trust & Safety | Quarterly |