compliance-engine — Service Readiness

Status: populated | Last updated: 2026-04-18

This document tracks the readiness criteria for taking compliance-engine from development to production.


1. Code Readiness

| Criterion | Status | Notes |
|---|---|---|
| Core gRPC handler implemented | | |
| All 10 rule type evaluators implemented | | KEYWORD, REGEX, SENDER_ID, RECIPIENT, RATE_VOLUME, GEO_RESTRICTION, TEMPORAL, DLR_ABUSE, AI_CLASSIFICATION, COMPOSITE |
| Hold queue CRUD complete | | |
| Tenant scoring worker implemented | | |
| Auto-expiry job for hold queue | | |
| AI classification integration (Claude + OpenAI) | | |
| AI result caching (24 h by body hash) | | |
| Evaluation result dedup cache (5 min fingerprint) | | |
| REST API for rule/hold/tenant management | | |
| Audit log with immutability enforcement | | |
| NATS publishers for all 7 event types | | |
| DLR consumer for sms.dlr.inbound | | |
| Budget enforcement (70 ms evaluation cap) | | |
| Circuit breaker on AI client | | |
| Fail-closed redelivery behaviour verified (no ACK on compliance error) | | |
| Distributed lock for cron jobs (multi-replica safe) | | |
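
The two caches listed above (AI results kept for 24 h by body hash, evaluation results kept for 5 min by fingerprint) share the same expire-on-read shape. The following is a minimal sketch of that shape, assuming an in-memory map and an injectable clock. The `TtlCache` class, the constant names, and the choice of in-process storage are all hypothetical; the service may well back these caches with Redis instead.

```typescript
// Hypothetical sketch: names and storage choice are assumptions, not the service's code.
type Clock = () => number; // returns epoch milliseconds

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: Clock;

  constructor(ttlMs: number, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns the cached value, or undefined if it is absent or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.value;
  }
}

const AI_RESULT_TTL_MS = 24 * 60 * 60 * 1000; // 24 h, keyed by body hash
const EVAL_DEDUP_TTL_MS = 5 * 60 * 1000; // 5 min, keyed by fingerprint
```

Passing the clock as a parameter keeps expiry testable without waiting for real time to pass.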

2. Testing Readiness

| Criterion | Target | Status |
|---|---|---|
| Unit test coverage | ≥ 85% line, ≥ 80% branch | |
| Unit tests for each rule type | 10–15 per type | |
| Property-based tests (fast-check) | Evaluation determinism, ALLOW override invariants | |
| Integration tests against real PG + Redis | ≥ 50 tests | |
| Contract tests with sms-orchestrator | Pact or gRPC reflection | |
| Load test: 500 RPS sustained, P95 < 80 ms | Passed | |
| Load test: 1000 RPS burst, P99 < 200 ms | Passed | |
| Chaos test: DB unavailable, Redis unavailable, AI unavailable | Graceful degradation verified | |
| Security test: ReDoS patterns rejected at save | Passed | |
| Security test: role escalation attempts blocked | Passed | |
| Security test: cross-tenant hold access blocked | Passed | |
| Audit log immutability test | UPDATE/DELETE rejected | |
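
To make the "ReDoS patterns rejected at save" criterion concrete, here is a minimal sketch of a save-time check. It assumes a simple nested-quantifier heuristic, which is a swapped-in technique and not necessarily what the service uses. `validateRulePattern`, `MAX_PATTERN_LENGTH`, and the 512-character cap are hypothetical. The heuristic can both over-flag (for example, quantifier characters inside a character class) and miss other risky shapes such as overlapping alternations, so a production check would likely rely on a dedicated analyser or a linear-time regex engine.

```typescript
// Hypothetical save-time validator; the limits and names are illustrative assumptions.
interface PatternCheck {
  ok: boolean;
  reason?: string;
}

const MAX_PATTERN_LENGTH = 512; // assumed cap, not taken from the service

// Matches a group that contains + or * and is itself followed by +, * or a {n,m} range.
// Examples: (a+)+ and (\d*)* , both of which can backtrack exponentially.
const NESTED_QUANTIFIER = /\((?:[^()\\]|\\.)*[+*](?:[^()\\]|\\.)*\)(?:[+*]|\{\d+,\d*\})/;

function validateRulePattern(pattern: string): PatternCheck {
  if (pattern.length > MAX_PATTERN_LENGTH) {
    return { ok: false, reason: "pattern too long" };
  }
  try {
    new RegExp(pattern); // reject patterns that do not compile at all
  } catch {
    return { ok: false, reason: "invalid regular expression" };
  }
  if (NESTED_QUANTIFIER.test(pattern)) {
    return { ok: false, reason: "nested quantifier (possible ReDoS)" };
  }
  return { ok: true };
}
```

Running the check when a rule is saved, not when it is evaluated, keeps a bad pattern from ever reaching the hot path.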

3. Observability Readiness

| Criterion | Status |
|---|---|
| All Prometheus metrics emitting correctly | |
| Grafana dashboard compliance-engine.json deployed | |
| All alerts configured and tested | |
| Structured JSON logs with PII masking verified | |
| OpenTelemetry tracing propagation verified end-to-end | |
| Log aggregation (Loki) parsing rules validated | |
| Runbook written for each alert | |

Alerts Configured

  • ComplianceHoldQueueHigh (> 500 for 5 min)
  • ComplianceHoldQueueCritical (> 2000 for 2 min)
  • TenantSuspended
  • ComplianceEvalP95High (> 100 ms)
  • ComplianceEvalErrorHigh (> 0.1% error rate)
  • ComplianceUnavailableRetries
  • AIServiceUnavailable
  • HoldQueueAutoExpiring (> 50/hour)
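
The hold queue thresholds above pair a level with a duration. The sketch below reads them as "continuously above the level for at least that long", which is an assumption about how the alerts are evaluated. `sustainedAbove`, `Sample`, and the two helper functions are hypothetical names, and a real deployment would express these conditions in the alerting system's own rule format, not in application code.

```typescript
// Hypothetical sketch of "value above threshold for a sustained window" semantics.
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns true when the trailing run of samples is entirely above the threshold
// and that run spans at least `forMs`. Samples must be sorted oldest first.
function sustainedAbove(samples: Sample[], threshold: number, forMs: number): boolean {
  let latestMs: number | undefined;
  let runStartMs: number | undefined;
  for (const sample of [...samples].reverse()) {
    if (latestMs === undefined) latestMs = sample.timestampMs;
    if (sample.value <= threshold) break; // the run of high samples ends here
    runStartMs = sample.timestampMs;
  }
  return latestMs !== undefined && runStartMs !== undefined && latestMs - runStartMs >= forMs;
}

const MINUTE_MS = 60_000;

// Thresholds taken from the alert list: > 500 for 5 min and > 2000 for 2 min.
const holdQueueHigh = (s: Sample[]): boolean => sustainedAbove(s, 500, 5 * MINUTE_MS);
const holdQueueCritical = (s: Sample[]): boolean => sustainedAbove(s, 2000, 2 * MINUTE_MS);
```

A single dip to or below the threshold resets the window, which is why a brief spike in queue depth should not page anyone under this reading.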

4. Security Readiness

| Criterion | Status |
|---|---|
| mTLS enforced on gRPC port in production | |
| TLS certificates provisioned via cert-manager | |
| NetworkPolicy restricting ingress to sms-orchestrator + admin-dashboard | |
| JWT validation on all REST endpoints | |
| RBAC roles enforced per endpoint | |
| LLM API keys stored in Vault, injected at pod start | |
| Database credentials rotated and in Vault | |
| ANONYMIZE_BODY_BEFORE_AI=true in production | |
| RLS policies on hold_queue and evaluation_log verified | |
| Penetration test completed | |
| Security review signed off by Security team | |

5. Operational Readiness

| Criterion | Status |
|---|---|
| Kubernetes Deployment manifest reviewed | |
| HPA configured with CPU + custom latency metric | |
| PodDisruptionBudget ensures minAvailable = 2 | |
| Rolling update strategy tested (no dropped gRPC calls) | |
| Resource requests/limits validated under load | |
| Database connection pool sized appropriately | |
| Redis connection pool sized appropriately | |
| Graceful shutdown drains in-flight requests (SIGTERM handler) | |
| On-call playbook written | |
| Training delivered to compliance reviewer team | |
| Escalation path defined for SUSPENDED tenants | |
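
The graceful shutdown criterion above amounts to two rules: stop accepting new requests once SIGTERM arrives, and exit only after in-flight requests have finished. The following is a minimal sketch of that bookkeeping. `RequestDrainer` and its method names are hypothetical, and the signal wiring is shown only as a comment so the block stays free of runtime-specific dependencies.

```typescript
// Hypothetical in-flight request tracker; not the service's actual shutdown code.
class RequestDrainer {
  private inFlight = 0;
  private shuttingDown = false;
  private onDrained: (() => void) | undefined;

  // Call at the start of each request. Returns false once shutdown has begun,
  // so the caller can reject the request instead of starting new work.
  tryBegin(): boolean {
    if (this.shuttingDown) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a request finishes, whether it succeeded or failed.
  end(): void {
    this.inFlight -= 1;
    this.notifyIfDrained();
  }

  // Stop accepting new work and invoke `onDrained` once in-flight work reaches zero.
  beginShutdown(onDrained: () => void): void {
    this.shuttingDown = true;
    this.onDrained = onDrained;
    this.notifyIfDrained();
  }

  private notifyIfDrained(): void {
    if (this.shuttingDown && this.inFlight === 0 && this.onDrained) {
      const callback = this.onDrained;
      this.onDrained = undefined; // fire at most once
      callback();
    }
  }
}

// In a Node.js process this could be wired up roughly as:
//   process.on("SIGTERM", () => drainer.beginShutdown(() => process.exit(0)));
```

In practice the drain would also need a hard deadline shorter than the pod's termination grace period, so that a stuck request cannot block the rollout indefinitely.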

6. Documentation Readiness

| Document | Status |
|---|---|
| SERVICE_OVERVIEW.md | Complete |
| DOMAIN_MODEL.md | Complete |
| APPLICATION_LOGIC.md | Complete |
| API_CONTRACTS.md | Complete |
| DATA_MODEL.md | Complete |
| EVENT_SCHEMAS.md | Complete |
| SYNC_CONTRACT.md | Complete |
| SECURITY_MODEL.md | Complete |
| OBSERVABILITY.md | Complete |
| FAILURE_MODES.md | Complete |
| DEPLOYMENT_TOPOLOGY.md | Complete |
| TESTING_STRATEGY.md | Complete |
| LOCAL_DEV_SETUP.md | Complete |
| MIGRATION_PLAN.md | Complete |
| SERVICE_RISK_REGISTER.md | Complete |
| AI_INTEGRATION.md | Complete |
| Runbook for on-call | |
| Compliance reviewer training guide | |
| Rule authoring handbook | |

7. Compliance / Regulatory Readiness

| Criterion | Status |
|---|---|
| Legal review of data processing with LLM provider | |
| Data Processing Agreement (DPA) signed with Anthropic / OpenAI | |
| Regulatory consultation with national telecom authority | |
| Initial keyword lists reviewed by Trust & Safety lead | |
| Initial rule sets reviewed by legal counsel | |
| Audit log retention policy agreed (7 years minimum for regulated deployments) | |
| Compliance report format approved by regulatory stakeholders | |
| Tenant opt-out/opt-in flow documented for end users | |

8. Go/No-Go Criteria Summary

Production deployment is GO when all of the following are met:

  • All items in §1 Code Readiness complete
  • Test coverage ≥ 85% line, all integration tests passing
  • Load test at 1.5× expected peak RPS passes SLO
  • All alerts configured and tested via chaos drill
  • Security team sign-off obtained
  • 14-day shadow mode completed successfully
  • Compliance team trained and staffed for review queue
  • On-call playbook finalised
  • Rollback plan validated in staging
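
The criteria above are a conjunction: a single unmet item makes the decision NO-GO. The following is a minimal sketch of encoding them so that blockers can be listed explicitly. The `ReadinessInputs` fields and the `goNoGo` function are hypothetical; only the thresholds (85% line coverage, 1.5× expected peak RPS, 14 days of shadow mode) come from the list.

```typescript
// Hypothetical encoding of the go/no-go checklist; field names are illustrative.
interface ReadinessInputs {
  codeReadinessComplete: boolean;
  lineCoverage: number; // fraction in [0, 1]
  integrationTestsPassing: boolean;
  expectedPeakRps: number;
  loadTestedRps: number; // highest RPS at which the SLO held
  alertsChaosTested: boolean;
  securitySignOff: boolean;
  shadowModeDays: number;
  reviewTeamStaffed: boolean;
  onCallPlaybookFinal: boolean;
  rollbackValidatedInStaging: boolean;
}

function goNoGo(r: ReadinessInputs): { go: boolean; blockers: string[] } {
  const checks: Array<[string, boolean]> = [
    ["§1 Code Readiness complete", r.codeReadinessComplete],
    ["line coverage >= 85%", r.lineCoverage >= 0.85],
    ["integration tests passing", r.integrationTestsPassing],
    ["load test at 1.5x expected peak", r.loadTestedRps >= 1.5 * r.expectedPeakRps],
    ["alerts tested via chaos drill", r.alertsChaosTested],
    ["security sign-off", r.securitySignOff],
    ["14-day shadow mode", r.shadowModeDays >= 14],
    ["review queue trained and staffed", r.reviewTeamStaffed],
    ["on-call playbook finalised", r.onCallPlaybookFinal],
    ["rollback validated in staging", r.rollbackValidatedInStaging],
  ];
  // Every failed check is a blocker; GO requires an empty list.
  const blockers = checks.filter(([, passed]) => !passed).map(([name]) => name);
  return { go: blockers.length === 0, blockers };
}
```

Returning the list of blockers, and not just a boolean, makes the outcome of a readiness review easier to act on.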

9. Post-Launch Review

Within 30 days of full enforcement (Phase 3):

  • False-positive rate audit (target: < 1% of BLOCK verdicts are false positives)
  • False-negative rate audit via red-team sampling
  • Hold queue turnaround audit (SLA: 95% of holds reviewed within 4 hours)
  • Tenant complaint review: adjust rules if legitimate messages are being blocked in recognisable patterns
  • Cost analysis: AI API spend per 1,000 evaluations
  • Performance review: check for P99 latency regressions
  • HPA threshold adjustment based on observed load patterns
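
Three of the review items above reduce to simple ratios. The following is a minimal sketch of the arithmetic, with hypothetical function names and no assumptions about where the input counts come from.

```typescript
// Hypothetical helpers for the post-launch audit figures; names are illustrative.

// Share of BLOCK verdicts that a manual audit judged to be false positives.
function falsePositiveRate(blockVerdicts: number, falsePositives: number): number {
  return blockVerdicts === 0 ? 0 : falsePositives / blockVerdicts;
}

// Fraction of holds reviewed within the SLA window (4 hours by default).
function reviewedWithinSla(turnaroundHours: number[], slaHours = 4): number {
  if (turnaroundHours.length === 0) return 1;
  const onTime = turnaroundHours.filter((h) => h <= slaHours).length;
  return onTime / turnaroundHours.length;
}

// AI API spend normalised to 1,000 evaluations.
function costPerThousand(totalSpend: number, evaluations: number): number {
  return evaluations === 0 ? 0 : (totalSpend / evaluations) * 1000;
}
```

As an illustration with made-up inputs, 12 audited false positives out of 2,000 BLOCK verdicts gives a rate of 0.6%, which would meet the < 1% target, while 3 of 4 holds reviewed within 4 hours gives 75%, which would miss the 95% SLA.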