compliance-engine — Service Readiness
Status: populated | Last updated: 2026-04-18
This document tracks the readiness criteria for taking compliance-engine from development to production.
1. Code Readiness
| Criterion | Status | Notes |
|---|---|---|
| Core gRPC handler implemented | ☐ | |
| All 10 rule type evaluators implemented | ☐ | KEYWORD, REGEX, SENDER_ID, RECIPIENT, RATE_VOLUME, GEO_RESTRICTION, TEMPORAL, DLR_ABUSE, AI_CLASSIFICATION, COMPOSITE |
| Hold queue CRUD complete | ☐ | |
| Tenant scoring worker implemented | ☐ | |
| Auto-expiry job for hold queue | ☐ | |
| AI classification integration (Claude + OpenAI) | ☐ | |
| AI result caching (24 h by body hash) | ☐ | |
| Evaluation result dedup cache (5 min fingerprint) | ☐ | |
| REST API for rule/hold/tenant management | ☐ | |
| Audit log with immutability enforcement | ☐ | |
| NATS publishers for all 7 event types | ☐ | |
| DLR consumer for sms.dlr.inbound | ☐ | |
| Budget enforcement (70 ms evaluation cap) | ☐ | |
| Circuit breaker on AI client | ☐ | |
| Fail-closed redelivery behaviour verified (no ACK on compliance error) | ☐ | |
| Distributed lock for cron jobs (multi-replica safe) | ☐ | |
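The two cache criteria above (AI results cached for 24 h by body hash, evaluation results deduplicated for 5 min by fingerprint) can be illustrated with a minimal sketch. The key prefixes, the choice of SHA-256, and the fields that make up the fingerprint are assumptions made for illustration; the checklist does not specify them.

```typescript
import { createHash } from "node:crypto";

// TTLs taken from the checklist rows; constant names are hypothetical.
const AI_RESULT_TTL_SECONDS = 24 * 60 * 60; // 24 h
const EVAL_DEDUP_TTL_SECONDS = 5 * 60; // 5 min

// Hex-encoded SHA-256 digest of a UTF-8 string (hash choice is an assumption).
function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// Cache key for an AI classification result, derived only from the message body.
// The "ai:result:" prefix is a hypothetical naming scheme.
function aiResultCacheKey(body: string): string {
  return `ai:result:${sha256Hex(body)}`;
}

// Hypothetical shape of the inputs that identify one evaluation.
interface EvaluationInput {
  tenantId: string;
  senderId: string;
  recipient: string;
  body: string;
}

// Fingerprint for the short-lived dedup cache. Encoding the fields as a JSON
// array keeps field boundaries unambiguous, so ("a", "bc") and ("ab", "c")
// never produce the same fingerprint.
function evaluationFingerprint(input: EvaluationInput): string {
  const canonical = JSON.stringify([
    input.tenantId,
    input.senderId,
    input.recipient,
    input.body,
  ]);
  return `eval:dedup:${sha256Hex(canonical)}`;
}
```

With a Redis client, these keys would typically be written with an expiry equal to the matching TTL constant, so that stale entries age out without a cleanup job.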
2. Testing Readiness
| Criterion | Target | Status |
|---|---|---|
| Unit test coverage | ≥ 85% line, ≥ 80% branch | ☐ |
| Unit tests for each rule type | 10–15 per type | ☐ |
| Property-based tests (fast-check) | Evaluation determinism, ALLOW override invariants | ☐ |
| Integration tests against real PG + Redis | ≥ 50 tests | ☐ |
| Contract tests with sms-orchestrator | Pact or gRPC reflection | ☐ |
| Load test: 500 RPS sustained, P95 < 80 ms | Passed | ☐ |
| Load test: 1000 RPS burst, P99 < 200 ms | Passed | ☐ |
| Chaos test: DB unavailable, Redis unavailable, AI unavailable | Graceful degradation verified | ☐ |
| Security test: ReDoS patterns rejected at save | Passed | ☐ |
| Security test: role escalation attempts blocked | Passed | ☐ |
| Security test: cross-tenant hold access blocked | Passed | ☐ |
| Audit log immutability test | UPDATE/DELETE rejected | ☐ |
3. Observability Readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics emitting correctly | ☐ |
| Grafana dashboard compliance-engine.json deployed | ☐ |
| All alerts configured and tested | ☐ |
| Structured JSON logs with PII masking verified | ☐ |
| OpenTelemetry tracing propagation verified end-to-end | ☐ |
| Log aggregation (Loki) parsing rules validated | ☐ |
| Runbook written for each alert | ☐ |
Alerts Configured
- ComplianceHoldQueueHigh (> 500 for 5 min)
- ComplianceHoldQueueCritical (> 2000 for 2 min)
- TenantSuspended
- ComplianceEvalP95High (> 100 ms)
- ComplianceEvalErrorHigh (> 0.1% error rate)
- ComplianceUnavailableRetries
- AIServiceUnavailable
- HoldQueueAutoExpiring (> 50/hour)
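Thresholds such as "> 500 for 5 min" are read here as "the condition must hold continuously for the whole duration before the alert fires", which is how a Prometheus `for` clause behaves. The sketch below illustrates that reading only; the class name and its methods are hypothetical and are not part of the service or of Prometheus.

```typescript
// Tracks whether a value has stayed above a threshold for a minimum duration.
class SustainedThreshold {
  private readonly threshold: number;
  private readonly forMs: number;
  private breachStartMs: number | null = null;

  constructor(threshold: number, forMs: number) {
    this.threshold = threshold;
    this.forMs = forMs;
  }

  // Records one observation and returns true when the alert should be firing.
  observe(value: number, nowMs: number): boolean {
    if (value > this.threshold) {
      // Remember when the current breach started.
      if (this.breachStartMs === null) {
        this.breachStartMs = nowMs;
      }
      return nowMs - this.breachStartMs >= this.forMs;
    }
    // Any observation at or below the threshold resets the timer.
    this.breachStartMs = null;
    return false;
  }
}

// Example: the ComplianceHoldQueueHigh condition (> 500 for 5 min).
const holdQueueHigh = new SustainedThreshold(500, 5 * 60 * 1000);
```

A short spike above 500 that recovers within five minutes never fires under this reading, which is worth confirming during the chaos drill for each alert.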
4. Security Readiness
| Criterion | Status |
|---|---|
| mTLS enforced on gRPC port in production | ☐ |
| TLS certificates provisioned via cert-manager | ☐ |
| NetworkPolicy restricting ingress to sms-orchestrator + admin-dashboard | ☐ |
| JWT validation on all REST endpoints | ☐ |
| RBAC roles enforced per endpoint | ☐ |
| LLM API keys stored in Vault, injected at pod start | ☐ |
| Database credentials rotated and in Vault | ☐ |
| ANONYMIZE_BODY_BEFORE_AI=true in production | ☐ |
| RLS policies on hold_queue and evaluation_log verified | ☐ |
| Penetration test completed | ☐ |
| Security review signed off by Security team | ☐ |
5. Operational Readiness
| Criterion | Status |
|---|---|
| Kubernetes Deployment manifest reviewed | ☐ |
| HPA configured with CPU + custom latency metric | ☐ |
| PodDisruptionBudget ensures minAvailable = 2 | ☐ |
| Rolling update strategy tested (no dropped gRPC calls) | ☐ |
| Resource requests/limits validated under load | ☐ |
| Database connection pool sized appropriately | ☐ |
| Redis connection pool sized appropriately | ☐ |
| Graceful shutdown drains in-flight requests (SIGTERM handler) | ☐ |
| On-call playbook written | ☐ |
| Training delivered to compliance reviewer team | ☐ |
| Escalation path defined for SUSPENDED tenants | ☐ |
6. Documentation Readiness
| Document | Status |
|---|---|
| SERVICE_OVERVIEW.md | Complete |
| DOMAIN_MODEL.md | Complete |
| APPLICATION_LOGIC.md | Complete |
| API_CONTRACTS.md | Complete |
| DATA_MODEL.md | Complete |
| EVENT_SCHEMAS.md | Complete |
| SYNC_CONTRACT.md | Complete |
| SECURITY_MODEL.md | Complete |
| OBSERVABILITY.md | Complete |
| FAILURE_MODES.md | Complete |
| DEPLOYMENT_TOPOLOGY.md | Complete |
| TESTING_STRATEGY.md | Complete |
| LOCAL_DEV_SETUP.md | Complete |
| MIGRATION_PLAN.md | Complete |
| SERVICE_RISK_REGISTER.md | Complete |
| AI_INTEGRATION.md | Complete |
| Runbook for on-call | ☐ |
| Compliance reviewer training guide | ☐ |
| Rule authoring handbook | ☐ |
7. Compliance / Regulatory Readiness
| Criterion | Status |
|---|---|
| Legal review of data processing with LLM provider | ☐ |
| Data Processing Agreement (DPA) signed with Anthropic / OpenAI | ☐ |
| Regulatory consultation with national telecom authority | ☐ |
| Initial keyword lists reviewed by Trust & Safety lead | ☐ |
| Initial rule sets reviewed by legal counsel | ☐ |
| Audit log retention policy agreed (7 years minimum for regulated deployments) | ☐ |
| Compliance report format approved by regulatory stakeholders | ☐ |
| Tenant opt-out/opt-in flow documented for end users | ☐ |
8. Go/No-Go Criteria Summary
Production deployment is GO when all of the following are met:
- All items in §1 Code Readiness complete
- Test coverage ≥ 85% line, all integration tests passing
- Load test at 1.5× expected peak RPS passes SLO
- All alerts configured and tested via chaos drill
- Security team sign-off obtained
- 14-day shadow mode completed successfully
- Compliance team trained and staffed for review queue
- On-call playbook finalised
- Rollback plan validated in staging
9. Post-Launch Review
Within 30 days of full enforcement (Phase 3):
- False-positive rate audit (target: < 1% of BLOCK verdicts are false positives)
- False-negative rate audit via red-team sampling
- Hold queue review turnaround times (SLA: 95% reviewed within 4 hours)
- Tenant complaint review — adjust rules if patterns of legitimate messages are being blocked
- Cost analysis — AI API spend per 1,000 evaluations
- Performance review — any p99 latency regressions?
- Adjust HPA thresholds based on observed load patterns
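The numeric targets in this list (false-positive rate below 1% of BLOCK verdicts, 95% of holds reviewed within 4 hours, AI spend per 1,000 evaluations) reduce to simple ratios. The following is a minimal sketch of that arithmetic; the function names and the `HoldReview` shape are hypothetical, and counting unreviewed holds as SLA misses is an assumption.

```typescript
// Hypothetical record of one hold queue entry.
interface HoldReview {
  queuedAt: Date;
  reviewedAt: Date | null; // null means the hold was never reviewed
}

// Share of BLOCK verdicts that were later judged to be false positives.
function falsePositiveRate(blockVerdicts: number, falsePositives: number): number {
  if (blockVerdicts === 0) return 0;
  return falsePositives / blockVerdicts;
}

// Share of holds reviewed within the given number of hours.
// Assumption: unreviewed holds count against the SLA.
function shareReviewedWithin(reviews: HoldReview[], hours: number): number {
  if (reviews.length === 0) return 1;
  const limitMs = hours * 60 * 60 * 1000;
  const onTime = reviews.filter(
    (r) =>
      r.reviewedAt !== null &&
      r.reviewedAt.getTime() - r.queuedAt.getTime() <= limitMs,
  ).length;
  return onTime / reviews.length;
}

// AI API spend normalised to 1,000 evaluations, in the currency of totalSpend.
function costPerThousandEvaluations(totalSpend: number, evaluations: number): number {
  if (evaluations === 0) return 0;
  return (totalSpend * 1000) / evaluations;
}
```

For example, the false-positive target is met when `falsePositiveRate(blocks, falsePositives) < 0.01`, and the review SLA is met when `shareReviewedWithin(reviews, 4) >= 0.95`.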