compliance-engine — Service Risk Register

Status: populated | Last updated: 2026-04-18

Risk Scoring

| Likelihood × Impact | Low Impact | Medium Impact | High Impact | Critical Impact |
|---|---|---|---|---|
| Low likelihood | LOW | LOW | MEDIUM | HIGH |
| Medium likelihood | LOW | MEDIUM | HIGH | HIGH |
| High likelihood | MEDIUM | HIGH | HIGH | CRITICAL |
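
The scoring matrix can be treated as a plain lookup. The sketch below transcribes the matrix into a dictionary so that a rating can be checked mechanically against a risk's likelihood and impact; the names `RATING` and `rate` are illustrative and not part of the service.

```python
# Rating matrix transcribed from the Risk Scoring table: RATING[likelihood][impact].
RATING = {
    "low":    {"low": "LOW",    "medium": "LOW",    "high": "MEDIUM", "critical": "HIGH"},
    "medium": {"low": "LOW",    "medium": "MEDIUM", "high": "HIGH",   "critical": "HIGH"},
    "high":   {"low": "MEDIUM", "medium": "HIGH",   "high": "HIGH",   "critical": "CRITICAL"},
}


def rate(likelihood: str, impact: str) -> str:
    """Return the rating for a likelihood/impact pair (case-insensitive)."""
    try:
        return RATING[likelihood.lower()][impact.lower()]
    except KeyError:
        raise ValueError(
            f"unknown likelihood/impact: {likelihood!r}, {impact!r}"
        ) from None
```

For example, `rate("Low", "Critical")` gives the HIGH rating recorded for R-OPS-01.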

1. Operational Risks

R-OPS-01 — Compliance-engine becomes single point of failure in hot path

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | Every outbound SMS depends on a compliance verdict. The layer is always fail-closed: if compliance-engine is unavailable, messages are held in the NATS queue via redelivery and, once retries are exhausted, moved to the dead-letter queue. Tenants see delayed processing in the web portal, but no unverified message is ever dispatched. |
| Mitigation | (1) Minimum 3 replicas with PDB = 2; (2) HPA auto-scales under load; (3) Fail-closed NATS redelivery (3 attempts × 30 s ack wait); (4) Dead-letter queue with tenant notification; (5) Phase 1 observation-mode rollout validates operation without affecting traffic |
| Owner | Platform SRE |
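
The fail-closed behaviour described for R-OPS-01 can be sketched as a small decision function. This is a minimal illustration only: it does not use a NATS client, and `next_action`, `get_verdict`, and the returned action names are hypothetical. The only values taken from the register are the 3 delivery attempts and the 30 s ack wait.

```python
from typing import Callable, Optional

MAX_DELIVER = 3        # from the mitigation: 3 delivery attempts
ACK_WAIT_SECONDS = 30  # from the mitigation: 30 s ack wait between redeliveries


def next_action(delivery_attempt: int,
                get_verdict: Callable[[], Optional[str]]) -> str:
    """Decide what to do with one delivery of a queued message.

    get_verdict returns a verdict string, or returns None / raises when
    compliance-engine is unavailable (hypothetical contract).
    """
    try:
        verdict = get_verdict()
    except Exception:
        verdict = None

    if verdict is not None:
        return "apply_verdict"   # a verdict exists, act on it

    # No verdict: fail closed. The message is never dispatched unverified.
    if delivery_attempt >= MAX_DELIVER:
        return "dead_letter"     # retries exhausted
    return "redeliver"           # leave unacked; redelivered after the ack wait
```

The point of the sketch is that no branch returns a dispatch action without a verdict, which is what keeps the path fail-closed.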

R-OPS-02 — LLM API outage disables AI rules

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | AI_CLASSIFICATION rules rely on Claude/OpenAI availability. Outages at the provider reduce detection coverage. |
| Mitigation | (1) Per-rule fallbackAction; (2) Multi-provider support (Claude primary, OpenAI secondary); (3) 24 h AI result cache absorbs partial outages; (4) Circuit breaker skips AI evaluation when provider is degraded |
| Owner | Platform Engineering |
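
How a circuit breaker and a per-rule fallbackAction could work together is sketched below. The failure threshold, the cool-down period, and all class and function names are assumptions made for illustration; the register does not specify them.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, closes again after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close and let the next call probe the provider.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def evaluate_ai_rule(body: str, classify: Callable[[str], str],
                     breaker: CircuitBreaker, fallback_action: str) -> str:
    """Run the AI classifier, or return the rule's fallback when it cannot run."""
    if breaker.is_open():
        return fallback_action      # provider degraded: skip the call entirely
    try:
        result = classify(body)
    except Exception:
        breaker.record(False)
        return fallback_action
    breaker.record(True)
    return result
```

While the breaker is open the provider is not called at all, so a degraded provider adds no latency to evaluation.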

R-OPS-03 — Hold queue overflow during mass-spam events

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | A coordinated spam attack or misconfigured tenant can flood the hold queue, overwhelming reviewers. |
| Mitigation | (1) 24 h auto-expiry of unreviewed holds; (2) Bulk-reject API endpoint for admins; (3) Tier escalation to RESTRICTED/SUSPENDED reduces inflow; (4) Alert at 500 pending, page at 2000 |
| Owner | Trust & Safety |
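
The alert thresholds and the 24 h auto-expiry for R-OPS-03 reduce to two small checks, sketched here. The function names are hypothetical, and treating each threshold as inclusive (alert at exactly 500, expiry at exactly 24 h) is an assumption.

```python
from datetime import datetime, timedelta, timezone

ALERT_THRESHOLD = 500            # from the mitigation: alert at 500 pending
PAGE_THRESHOLD = 2000            # from the mitigation: page at 2000 pending
HOLD_TTL = timedelta(hours=24)   # from the mitigation: 24 h auto-expiry


def hold_queue_severity(pending: int) -> str:
    """Map the number of pending holds to an escalation level."""
    if pending >= PAGE_THRESHOLD:
        return "page"
    if pending >= ALERT_THRESHOLD:
        return "alert"
    return "ok"


def is_expired(held_at: datetime, now: datetime) -> bool:
    """True once an unreviewed hold has been pending for the full TTL."""
    return now - held_at >= HOLD_TTL
```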

R-OPS-04 — PostgreSQL connection pool exhaustion

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | Under sustained high load, the database connection pool may be exhausted, causing evaluation timeouts. |
| Mitigation | (1) Redis caching absorbs 95%+ of read load; (2) PgBouncer in transaction pooling mode; (3) Pool size tuned via load tests; (4) Monitor pg_pool_waiting_clients metric |
| Owner | Platform DBA |

2. Security Risks

R-SEC-01 — Message body leakage via LLM provider

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | Sending customer message bodies to a third-party LLM creates a data leakage risk if the provider stores or misuses the content. |
| Mitigation | (1) Local LLM is the primary provider, so message bodies stay within our trust boundary; (2) ANONYMIZE_BODY_BEFORE_AI=true redacts PII as defence-in-depth even for local LLM; (3) External LLM failover is opt-in per deployment, not default; (4) If external LLM is enabled, DPA with provider prohibits training on data and requires zero-retention mode; (5) Regulated deployments can disable external failover entirely |
| Owner | Security |

R-SEC-02 — Rule misconfiguration blocks legitimate traffic

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | An overly broad regex or keyword rule can block legitimate tenant traffic, causing business disruption and compensation liability. |
| Mitigation | (1) All new rules go through shadow-mode testing before enforcement; (2) Rule versioning allows instant rollback; (3) Required peer review before activating BLOCK rules; (4) Per-rule match count tracked in Prometheus, with an alert on an unexpectedly high match rate; (5) Rule changes logged to audit trail |
| Owner | Trust & Safety |

R-SEC-03 — Compromised admin account modifies rules to bypass compliance

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | An attacker with platform.compliance.admin credentials could disable rules, tier-override malicious tenants, or exfiltrate hold queue contents. |
| Mitigation | (1) MFA enforced for compliance admin accounts; (2) All admin actions logged to immutable audit trail; (3) Audit events replicated to SIEM; (4) Anomaly detection on rule change frequency; (5) Tier override requires secondary approval (future: two-person rule) |
| Owner | Security |

R-SEC-04 — ReDoS attack via malicious regex rule

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | High |
| Rating | MEDIUM |
| Description | A malicious or naive admin could create a REGEX rule with catastrophic backtracking, freezing the evaluator. |
| Mitigation | (1) Regex engine: re2 (linear-time, no backtracking); (2) Regex validation at save time rejects overly complex patterns; (3) Hard 10 ms timeout per regex evaluation; (4) Budget enforcement caps total evaluation time |
| Owner | Platform Engineering |

R-SEC-05 — Audit log tampering

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Critical |
| Rating | HIGH |
| Description | An attacker with DB access could attempt to modify the audit log to hide malicious actions. |
| Mitigation | (1) PostgreSQL rules reject UPDATE/DELETE on audit_log; (2) NATS event replication creates independent audit copy; (3) Off-site backup of audit log to immutable S3 bucket (object lock enabled); (4) Regular integrity verification cron |
| Owner | Security |

3. Compliance / Regulatory Risks

R-REG-01 — Regulator requires stricter enforcement than configured

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | A national telecom authority mandates blocking of specific content categories that are not currently in our rule set, creating compliance exposure. |
| Mitigation | (1) Regulatory rule updates tracked in Trust & Safety backlog; (2) Rule sets versioned for audit; (3) Default rule set includes platform-level mandatory rules that cannot be disabled by tenant admins; (4) Quarterly regulatory review |
| Owner | Legal + Trust & Safety |

R-REG-02 — Cross-border data flow restrictions

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | Sending message content to a US-hosted LLM may violate data residency requirements in certain jurisdictions. |
| Mitigation | (1) Regional deployment option: AI provider endpoint selection per region; (2) Per-tenant AI enablement flag; (3) Fallback to keyword/regex only for restricted tenants; (4) Future: evaluate on-premise LLM deployment for high-regulation markets |
| Owner | Legal |

R-REG-03 — GDPR right-to-erasure request includes hold queue entries

| Attribute | Value |
|---|---|
| Likelihood | Low |
| Impact | Medium |
| Rating | LOW |
| Description | When a user exercises erasure rights, their message content in the hold queue must be redacted without compromising audit integrity. |
| Mitigation | (1) On a tenant_erased.v1 event, redact payload.body, payload.to, and payload.from_id in hold_queue while preserving metadata; (2) Evaluation log already stores only hashes (GDPR-minimal); (3) Audit log entries redacted consistently |
| Owner | Legal + Platform Engineering |
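
The redaction step in mitigation (1) can be sketched as a pure function over one hold_queue row. The row shape, the `[REDACTED]` placeholder, and the function name are assumptions; only the three payload field names come from the register.

```python
import copy

REDACTED = "[REDACTED]"                    # placeholder value is an assumption
REDACT_FIELDS = ("body", "to", "from_id")  # payload fields named in the mitigation


def redact_hold_entry(entry: dict) -> dict:
    """Return a copy of a hold_queue row with payload PII replaced.

    Everything outside the listed payload fields (ids, timestamps, rule
    references, other payload keys) is preserved unchanged.
    """
    redacted = copy.deepcopy(entry)
    payload = redacted.get("payload", {})
    for field in REDACT_FIELDS:
        if field in payload:
            payload[field] = REDACTED
    return redacted
```

Working on a deep copy keeps the function free of side effects, so a caller can compare the row before and after when writing the corresponding audit entry.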

4. Product / Business Risks

R-BUS-01 — False-positive BLOCK damages tenant trust

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | High |
| Rating | HIGH |
| Description | A legitimate high-value tenant has their traffic blocked by a false-positive rule match, leading to SLA breach and reputation damage. |
| Mitigation | (1) ALLOW rules for verified trusted senders override BLOCK rules; (2) Template approval workflow for high-volume legitimate use cases (OTP, alerts); (3) Rapid rule rollback via version history; (4) Dedicated support channel for compliance-blocked messages |
| Owner | Trust & Safety + Product |

R-BUS-02 — AI API cost overrun

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | LLM calls at scale can become expensive, especially for high volumes of unique messages that miss the cache. |
| Mitigation | (1) 24 h AI result cache keyed by body hash (typical OTP/alert templates have a cache hit rate above 95%); (2) AI rules gated to specific categories, not all messages; (3) Use smallest capable model (Claude Haiku vs Sonnet); (4) Daily cost monitoring with budget alerts; (5) ANONYMIZE_BODY reduces token count |
| Owner | Platform Engineering |
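
A result cache keyed by body hash, as in mitigation (1), might look like the sketch below. SHA-256 as the hash, the in-memory dictionary standing in for the real cache store, and the class and method names are all assumptions; the register only states the 24 h lifetime and the body-hash key.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_S = 24 * 60 * 60  # from the mitigation: 24 h AI result cache


class AIResultCache:
    """In-memory stand-in for the AI result cache (store is an assumption)."""

    def __init__(self, ttl_s: float = CACHE_TTL_S,
                 clock: Callable[[], float] = time.time):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, result)

    @staticmethod
    def key(body: str) -> str:
        # Identical bodies map to the same key; SHA-256 is an assumed choice.
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def get_or_classify(self, body: str, classify: Callable[[str], str]) -> str:
        """Return a cached result, calling the classifier only on a miss."""
        k = self.key(body)
        now = self._clock()
        hit = self._entries.get(k)
        if hit is not None and hit[0] > now:
            return hit[1]                      # cache hit: no LLM call, no cost
        result = classify(body)                # cache miss: one paid LLM call
        self._entries[k] = (now + self._ttl_s, result)
        return result
```

Because the key is the hash of the full body, any per-message variation in the text produces a different key and therefore a miss.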

R-BUS-03 — Reviewer team capacity insufficient for hold queue

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Medium |
| Rating | MEDIUM |
| Description | Growth in held messages outpaces reviewer team capacity, leading to long review delays and auto-expiry of legitimate messages. |
| Mitigation | (1) 24 h auto-expiry prevents indefinite delay; (2) Review priority algorithm surfaces highest-risk items first; (3) Bulk-action endpoints for common patterns; (4) Capacity planning model based on message volume × BLOCK/HOLD rate; (5) Tier-based auto-rejection for repeat offenders |
| Owner | Trust & Safety |
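
The capacity planning model in mitigation (4) is described only as message volume × BLOCK/HOLD rate. One way to turn that into a headcount estimate is sketched below; the per-reviewer throughput parameter and the function name are assumptions added for illustration, and the figures in the usage note are made up.

```python
def reviewers_needed(daily_messages: int, hold_rate: float,
                     reviews_per_reviewer_per_day: int) -> int:
    """Estimate reviewers required to clear one day's held messages.

    held per day = daily_messages × hold_rate (the product named in the
    register); dividing by an assumed per-reviewer throughput and rounding
    up gives a headcount.
    """
    if reviews_per_reviewer_per_day <= 0:
        raise ValueError("reviews_per_reviewer_per_day must be positive")
    held_per_day = round(daily_messages * hold_rate)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-held_per_day // reviews_per_reviewer_per_day)
```

With hypothetical inputs of 1,000,000 messages per day, a 0.2% hold rate, and 400 reviews per reviewer per day, the estimate is 2,000 held messages and 5 reviewers.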

R-BUS-04 — Tenant scoring algorithm unfair to new accounts

| Attribute | Value |
|---|---|
| Likelihood | Medium |
| Impact | Low |
| Rating | LOW |
| Description | New tenants without message history have no score signal; defaulting to CLEAR may be too lenient, while defaulting to MONITOR may be too strict. |
| Mitigation | (1) Tenure bonus is small (10 pts) so new accounts start near average; (2) First 1,000 messages evaluated with heightened scrutiny (future: new-account rule set); (3) Human review of first BLOCK/HOLD patterns for new accounts |
| Owner | Trust & Safety |

5. Risk Summary Matrix

| Risk ID | Rating | Owner | Review Cadence |
|---|---|---|---|
| R-OPS-01 | HIGH | Platform SRE | Quarterly |
| R-OPS-02 | MEDIUM | Platform Engineering | Quarterly |
| R-OPS-03 | MEDIUM | Trust & Safety | Monthly |
| R-OPS-04 | MEDIUM | Platform DBA | Quarterly |
| R-SEC-01 | MEDIUM | Security | Quarterly |
| R-SEC-02 | HIGH | Trust & Safety | Monthly |
| R-SEC-03 | HIGH | Security | Quarterly |
| R-SEC-04 | MEDIUM | Platform Engineering | Quarterly |
| R-SEC-05 | HIGH | Security | Quarterly |
| R-REG-01 | HIGH | Legal + Trust & Safety | Quarterly |
| R-REG-02 | MEDIUM | Legal | Bi-annual |
| R-REG-03 | LOW | Legal + Platform Engineering | Annual |
| R-BUS-01 | HIGH | Trust & Safety + Product | Monthly |
| R-BUS-02 | MEDIUM | Platform Engineering | Monthly |
| R-BUS-03 | MEDIUM | Trust & Safety | Monthly |
| R-BUS-04 | LOW | Trust & Safety | Quarterly |