consent-ledger-service — Service Readiness

Version: 1.0
Status: Draft
Owner: Trust and Safety
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document tracks the readiness criteria for taking consent-ledger-service from development to production. Because the service is the platform's authoritative consent ledger and is consulted synchronously on every outbound SMS, the readiness bar is elevated: fail-closed behaviour, sub-5 ms P95 latency on CheckConsent, hash-chain audit integrity, and 7-year regulator-defensible retention.

1. Code Readiness

Criterion	Status	Notes
gRPC `ConsentLedgerService.v1` — CheckConsent / RecordConsent / RevokeConsent / RecordConsentBatch	☐	Core hot-path.
REST `/v1/consent/*` — tenant records, double-opt-in, erasure, admin DND	☐
Citizen-portal REST `/v1/consent/records?msisdn=` with MSISDN-OTP verification	☐
STOP-keyword NATS consumer on `sms.mo.inbound`	☐	Durable, queue group `consent-ledger-stop`.
ATRA National DND sync worker (cron daily 03:00 Asia/Kabul)	☐	Graceful on ATRA unreachable.
Audit hash-chain implementation (`prev_hash`, `record_hash = sha256(payload || prev_hash)`)	☐
Audit chain daily verifier job (last 24 h)	☐
Erasure processor (MSISDN → deterministic-hash tokenisation)	☐	GDPR 30-day SLA.
Monthly partition creator + cold-tier archive job (> 13 m → S3)	☐
Redis hot-cache fill + invalidation on state change	☐	TTL 300 s; invalidation on revoke.
Fail-closed on `CheckConsent` when Redis cache miss + Postgres unavailable	☐	Return `allowed=false, reason=CONSENT_UNKNOWN`.
Localised STOP-ack dispatcher (en/fa/ps/ar) via channel-router	☐	Lane=P2 transactional.
Bulk-import CSV processor (US-CONS-018)	☐
Consent SDK published (US-CONS-019)	☐	Node, Python, Java initial set.
Idempotency-Key support on REST writes	☐
mTLS gRPC client-cert verification	☐	Mesh SVID enforcement.
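
The hash-chain criterion above can be illustrated with a minimal sketch. It assumes the payload is already a canonical string, that hashes are hex-encoded, and that the first entry links to an all-zero sentinel; `AuditEntry`, `GENESIS_HASH`, `appendEntry` and `findChainBreak` are hypothetical names, not taken from the service code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; only prev_hash and record_hash are named in the checklist.
interface AuditEntry {
  payload: string;     // canonical serialisation of the audit event (assumed)
  prev_hash: string;   // record_hash of the preceding entry, or GENESIS_HASH
  record_hash: string; // sha256(payload || prev_hash), hex-encoded
}

// Assumed sentinel that the first entry of a chain links to.
const GENESIS_HASH = "0".repeat(64);

// record_hash = sha256(payload || prev_hash), where || is byte concatenation.
function computeRecordHash(payload: string, prevHash: string): string {
  return createHash("sha256").update(payload).update(prevHash).digest("hex");
}

// Returns a new chain with one entry appended and linked to the previous one.
function appendEntry(chain: readonly AuditEntry[], payload: string): AuditEntry[] {
  const last = chain[chain.length - 1];
  const prev_hash = last ? last.record_hash : GENESIS_HASH;
  return [...chain, { payload, prev_hash, record_hash: computeRecordHash(payload, prev_hash) }];
}

// Returns the index of the first entry that fails verification, or -1 if the chain is intact.
function findChainBreak(chain: readonly AuditEntry[]): number {
  let expectedPrev = GENESIS_HASH;
  return chain.findIndex((entry) => {
    const linked = entry.prev_hash === expectedPrev;
    const hashed = entry.record_hash === computeRecordHash(entry.payload, entry.prev_hash);
    expectedPrev = entry.record_hash;
    return !(linked && hashed);
  });
}
```

A daily verifier along these lines would walk the entries written in the last 24 h, seeded with the `record_hash` of the last entry before that window rather than the sentinel.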

2. Testing Readiness

Criterion	Target	Status
Unit test coverage	≥ 90% line (domain) / ≥ 80% branch	☐
Unit tests for consent state machine transitions	≥ 20 tests per scope	☐
Unit tests for STOP-keyword matcher per language	≥ 50 per language (en/fa/ps/ar)	☐
Unit tests for MSISDN normalisation and hash-tokenisation	≥ 30	☐
Unit tests for hash-chain integrity (happy path, tamper, break)	≥ 15	☐
Property-based tests (fast-check) — chain monotonicity, scope isolation	≥ 10 properties	☐
Integration test: gRPC `CheckConsent` P95 ≤ 5 ms @ 5000 RPS	Passed	☐
Integration test: STOP MO → consent.revoked.v1 end-to-end < 1 s	Passed	☐
Integration test: ATRA DND sync with mock endpoint	Passed	☐
Integration test: multi-region replication of control-plane data	Passed	☐
Contract test with compliance-engine (CONSENT rule integration)	Passed	☐
Contract test with routing-engine (last-mile veto)	Passed	☐
Contract test with channel-router (MO STOP detection)	Passed	☐
Chaos test: Postgres unavailable → fail-closed verified	Passed	☐
Chaos test: Redis unavailable → PG fallback, P95 degrades gracefully	Passed	☐
Chaos test: NATS lag → STOP-keyword processing queues, no message loss	Passed	☐
Security test: RLS cross-tenant read/write blocked	Passed	☐
Security test: audit log UPDATE/DELETE rejected at Postgres trigger	Passed	☐
Security test: hash-chain tamper detected by verifier within 24 h	Passed	☐
Security test: MSISDN erasure actually purges from records + audit (tokenised)	Passed	☐
Load test: 10 000 RPS sustained for 1 h, P99 ≤ 20 ms	Passed	☐
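
The two chaos scenarios for Redis and Postgres can be read as one lookup order: cache first, Postgres as fallback, and a fail-closed verdict when neither can answer. The sketch below is a simplified, synchronous illustration under assumed names. `ConsentStore` is a hypothetical interface, and every reason code other than `CONSENT_UNKNOWN` is invented for the example. A real implementation would be asynchronous and would also fill the cache on a Postgres hit.

```typescript
type Verdict = { allowed: boolean; reason: string };

// Hypothetical store interface: returns undefined on a miss, throws when the backend is unavailable.
interface ConsentStore {
  get(key: string): boolean | undefined;
}

function checkConsent(key: string, cache: ConsentStore, db: ConsentStore): Verdict {
  // 1. Hot cache. A cache outage is treated like a miss so the call can fall back to Postgres.
  let cached: boolean | undefined;
  try {
    cached = cache.get(key);
  } catch {
    cached = undefined;
  }
  if (cached !== undefined) {
    return { allowed: cached, reason: cached ? "GRANTED" : "REVOKED" }; // illustrative reason codes
  }

  // 2. Postgres fallback.
  try {
    const stored = db.get(key);
    if (stored === undefined) {
      // Assumption: a missing record is treated as "no consent"; the checklist does not specify this case.
      return { allowed: false, reason: "NO_RECORD" };
    }
    return { allowed: stored, reason: stored ? "GRANTED" : "REVOKED" };
  } catch {
    // 3. Cache miss and Postgres unavailable: fail closed, as the Code Readiness table requires.
    return { allowed: false, reason: "CONSENT_UNKNOWN" };
  }
}
```

The point of the ordering is that a Redis outage only degrades latency, while a Postgres outage on a cache miss blocks the send instead of allowing it.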

3. Observability Readiness

Criterion	Status
All Prometheus metrics emitting (see OBSERVABILITY.md §1)	☐
Grafana dashboard `consent-ledger-service.json` deployed	☐
All alerts configured in Alertmanager with runbooks	☐
Structured JSON logs (Pino) with MSISDN hash-masking	☐
OpenTelemetry trace propagation from Kong → compliance-engine → consent-ledger verified	☐
Loki parsing rules for service logs validated	☐
SIEM forwarding of `consent.*` events via `regulator-portal-service` verified	☐
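
As an illustration of the MSISDN hash-masking criterion, the sketch below replaces a raw MSISDN with a truncated keyed digest before the fields reach the logger. The use of HMAC-SHA-256, the 16-character truncation, the `msisdn:` prefix and the field name `msisdn` are all assumptions made for the example. The document does not specify the masking scheme, and the sketch deliberately does not use any Pino API.

```typescript
import { createHmac } from "node:crypto";

// Keyed digest so that masked values cannot be reversed by hashing the (small) MSISDN number space
// without the key. The key would come from a secret store; here it is a plain parameter.
function maskMsisdn(msisdn: string, key: string): string {
  const digest = createHmac("sha256", key).update(msisdn).digest("hex");
  return `msisdn:${digest.slice(0, 16)}`;
}

// Returns a copy of the log fields with the hypothetical `msisdn` field masked; other fields are untouched.
function maskLogFields(fields: Record<string, unknown>, key: string): Record<string, unknown> {
  const out: Record<string, unknown> = { ...fields };
  const value = out["msisdn"];
  if (typeof value === "string") {
    out["msisdn"] = maskMsisdn(value, key);
  }
  return out;
}
```

Because the digest is deterministic for a given key, log lines for the same subscriber can still be correlated without exposing the number itself.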

Alerts Configured

ConsentCheckLatencyHigh (gRPC CheckConsent P95 > 15 ms for 5 min)
ConsentCheckErrorRateHigh (> 0.1% 5xx)
ConsentDndSyncStale (ATRA DND last sync > 24 h)
ConsentStopKeywordLag (NATS consumer lag > 60 s)
ConsentAuditChainBroken (verifier detected break — Critical)
ConsentErasureSlaBreach (erasure request > 25 d old)
ConsentCachePostgresFallback (hot-cache fail-over rate > 5% of traffic for 10 min)
ConsentBulkImportFailureRate (bulk-import reject rate > 5%)

4. Security Readiness

Criterion	Status
mTLS enforced on gRPC port (SPIRE SVID, per ADR-0004 §12)	☐
NetworkPolicy restricting ingress to compliance-engine, routing-engine, sms-firewall, channel-router	☐
Kong JWT validation on all REST endpoints	☐
Citizen-portal MSISDN-OTP verification flow hardened (rate-limit, anti-enumeration)	☐
RBAC: tenant scope, citizen self-only, admin	☐
MSISDN encryption at rest (per-tenant DEK wrapped by HSM KEK per ADR-0004 §11)	☐
Erasure tokenisation uses HSM-bound deterministic key (FF1)	☐
Audit log trigger rejects UPDATE/DELETE (Postgres rule)	☐
RLS policies verified on `consent.records` and `consent.audit`	☐
Penetration test against citizen-portal + gRPC completed	☐
Security team sign-off	☐
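
The rate-limit part of the citizen-portal OTP hardening could take many forms. The sketch below shows one of the simplest, a fixed-window counter per key, with the clock passed in so it can be tested. `OtpRateLimiter`, the window size and the attempt limit are hypothetical; the key is assumed to be a masked MSISDN or client identifier, not the raw number.

```typescript
// Fixed-window limiter: at most `maxAttempts` OTP requests per key within each `windowMs` window.
class OtpRateLimiter {
  private readonly maxAttempts: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(maxAttempts: number, windowMs: number) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
  }

  // Returns true if the request is within the limit. `nowMs` is injected instead of reading the clock.
  allow(key: string, nowMs: number): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      // No window yet, or the previous window has expired: start a new one.
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.maxAttempts) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

For anti-enumeration, the endpoint would additionally need to return the same response shape whether or not the MSISDN has records, which this sketch does not cover.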

5. Operational Readiness

Criterion	Status
K8s Deployment manifest (3–15 replicas, HPA on gRPC RPS) reviewed	☐
PodDisruptionBudget `minAvailable: 2` (per region)	☐
Rolling update tested: zero dropped gRPC calls under steady 2 000 RPS	☐
Graceful shutdown: 15 s drain with SIGTERM handler	☐
Resource requests/limits validated under 5 000 RPS load	☐
Postgres connection pool sized (`pgbouncer` in transaction mode recommended)	☐
Redis connection pool sized (min 50, max 200 per pod)	☐
Multi-region replication verified (kbl ↔ mzr logical repl for control-plane)	☐
ATRA DND-sync runbook drafted	☐
Erasure-request handling runbook drafted (Legal + Trust & Safety joint)	☐
Hash-chain-break incident runbook drafted	☐
On-call rotation assigned (Trust & Safety + SRE shared primary)	☐
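
For the graceful-shutdown criterion, the bookkeeping behind a drain can be sketched as a small gate that stops admitting new calls and reports when in-flight calls have finished. `DrainGate` and its method names are hypothetical, and the sketch covers only the state tracking, not the signal handling or the gRPC server itself.

```typescript
// Tracks in-flight calls and refuses new ones once draining has started.
class DrainGate {
  private inFlight = 0;
  private draining = false;

  // Called at the start of each request; returns false if the pod is draining.
  tryEnter(): boolean {
    if (this.draining) {
      return false;
    }
    this.inFlight += 1;
    return true;
  }

  // Called when a request completes (successfully or not).
  leave(): void {
    if (this.inFlight > 0) {
      this.inFlight -= 1;
    }
  }

  // Called once shutdown has been requested.
  beginDrain(): void {
    this.draining = true;
  }

  // True once draining has started and no calls remain in flight.
  isDrained(): boolean {
    return this.draining && this.inFlight === 0;
  }
}
```

In a Node process, `beginDrain()` would typically be called from a `process.on("SIGTERM", ...)` handler, with a timer enforcing the 15 s upper bound before the process exits regardless of `isDrained()`.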

6. Documentation Readiness

Document	Status
SERVICE_OVERVIEW.md	Complete
DOMAIN_MODEL.md	Complete
APPLICATION_LOGIC.md	Complete
API_CONTRACTS.md	Complete
EVENT_SCHEMAS.md	Complete
DATA_MODEL.md	Complete
SYNC_CONTRACT.md	Complete
SECURITY_MODEL.md	Complete
OBSERVABILITY.md	Complete
TESTING_STRATEGY.md	Complete
DEPLOYMENT_TOPOLOGY.md	Complete
FAILURE_MODES.md	Complete
LOCAL_DEV_SETUP.md	Complete
AI_INTEGRATION.md	Complete
MIGRATION_PLAN.md	Complete
SERVICE_RISK_REGISTER.md	Complete
Runbook: DND sync staleness	☐
Runbook: hash-chain verifier break	☐
Runbook: STOP-keyword false positive triage	☐
Runbook: citizen erasure end-to-end	☐
Legal briefing: 7-year retention + GDPR erasure interaction	☐
Operator handbook for Trust & Safety reviewers	☐

7. Compliance / Regulatory Readiness

Criterion	Status
DPIA authored for MSISDN processing and hashing	☐
Legal review of 7-year audit retention vs. GDPR right-to-erasure interaction	☐
ATRA Memorandum of Understanding on National DND registry integration	☐
Citizen-portal Terms of Use and privacy notice approved	☐
STOP-keyword default catalog reviewed by Legal and Trust & Safety lead	☐
Initial scope catalog (TRANSACTIONAL/MARKETING/OTP/EMERGENCY) approved	☐
Cross-tenant STOP-propagation policy signed off (per-tenant default)	☐
Tenant-portal consent-inspection flow reviewed	☐
7-year retention policy with S3 immutable bucket configured	☐
SIEM-forwarding of consent events to regulator-portal approved	☐

8. Go/No-Go Criteria Summary

Production deployment is GO when all of the following are met:

9. Post-Launch Review

Within 30 days of full enforcement:

10. Phased Rollout

The service follows a 3-phase rollout to de-risk national-backbone deployment:

Phase	Duration	Behaviour	Exit criteria
P1 — Shadow	14 d	`CheckConsent` returns `allowed=true` for all requests; record hypothetical verdicts in `audit` table. No enforcement.	Metrics parity; false-positive rate projection < 1%.
P2 — Enforcement (single scope)	7 d	Enforcement is on for the `MARKETING` scope only; all other scopes remain in shadow mode.	No tenant escalation beyond expected; SLA met.
P3 — Full Enforcement	Ongoing	All scopes enforced; citizen-portal live; STOP-keyword processing live.	N/A (steady state).

Rollback at any phase: set the feature flag `CONSENT_ENFORCEMENT_ENABLED=false`; consumers then fall back to their previous behaviour (which, per the critique, was implicit).
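
The phase table and the rollback flag can be combined into a single gating function, sketched below. It is an illustration only: the phase identifiers, `RolloutConfig` and `applyRollout` are hypothetical. Modelling the flag as "no enforcement, always allow" inside the service is an assumption, since the document states only that consumers fall back to their previous behaviour.

```typescript
type Scope = "TRANSACTIONAL" | "MARKETING" | "OTP" | "EMERGENCY";
type Phase = "P1_SHADOW" | "P2_SINGLE_SCOPE" | "P3_FULL";

interface RolloutConfig {
  enforcementEnabled: boolean; // mirrors CONSENT_ENFORCEMENT_ENABLED
  phase: Phase;
}

interface GateResult {
  allowed: boolean;             // what the caller sees
  enforced: boolean;            // whether the ledger verdict was actually applied
  hypotheticalAllowed: boolean; // the ledger verdict, recorded for shadow comparison
}

// Decides whether the ledger verdict is enforced for this scope in the current phase.
function isEnforced(scope: Scope, cfg: RolloutConfig): boolean {
  if (!cfg.enforcementEnabled) {
    return false; // rollback flag overrides every phase (assumed behaviour)
  }
  if (cfg.phase === "P1_SHADOW") {
    return false;
  }
  if (cfg.phase === "P2_SINGLE_SCOPE") {
    return scope === "MARKETING";
  }
  return true; // P3_FULL: all scopes enforced
}

// Applies the rollout gate to a verdict already computed by the ledger.
function applyRollout(scope: Scope, ledgerAllowed: boolean, cfg: RolloutConfig): GateResult {
  const enforced = isEnforced(scope, cfg);
  return {
    allowed: enforced ? ledgerAllowed : true,
    enforced,
    hypotheticalAllowed: ledgerAllowed,
  };
}
```

Keeping `hypotheticalAllowed` separate from `allowed` is what lets the shadow phase project a false-positive rate before any traffic is blocked.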
