consent-ledger-service — Service Readiness

Version: 1.0
Status: Draft
Owner: Trust and Safety
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, _report.md, docs/architecture/ADR-0004-national-backbone-resilience.md

This document tracks the readiness criteria for taking consent-ledger-service from development to production. Because the service is the platform's authoritative consent ledger and is consulted synchronously on every outbound SMS, the readiness bar is elevated: fail-closed behaviour, sub-5 ms P95 CheckConsent latency, hash-chain audit integrity, and 7-year regulator-defensible retention.


1. Code Readiness

| Criterion | Status | Notes |
| --- | --- | --- |
| gRPC ConsentLedgerService.v1: CheckConsent / RecordConsent / RevokeConsent / RecordConsentBatch | | Core hot-path. |
| REST /v1/consent/*: tenant records, double-opt-in, erasure, admin DND | | |
| Citizen-portal REST /v1/consent/records?msisdn= with MSISDN-OTP verification | | |
| STOP-keyword NATS consumer on sms.mo.inbound | | Durable, queue group consent-ledger-stop. |
| ATRA National DND sync worker (cron daily 03:00 Asia/Kabul) | | Graceful on ATRA unreachable. |
| Audit hash-chain implementation (prev_hash, `record_hash = sha256(payload \|\| prev_hash)`) | | |
| Audit chain daily verifier job (last 24 h) | | |
| Erasure processor (MSISDN → deterministic-hash tokenisation) | | GDPR 30-day SLA. |
| Monthly partition creator + cold-tier archive job (> 13 months → S3) | | |
| Redis hot-cache fill + invalidation on state change | | TTL 300 s; invalidation on revoke. |
| Fail-closed on CheckConsent when Redis cache miss + Postgres unavailable | | Return allowed=false, reason=CONSENT_UNKNOWN. |
| Localised STOP-ack dispatcher (en/fa/ps/ar) via channel-router | | Lane=P2 transactional. |
| Bulk-import CSV processor (US-CONS-018) | | |
| Consent SDK published (US-CONS-019) | | Node, Python, Java initial set. |
| Idempotency-Key support on REST writes | | |
| mTLS gRPC client-cert verification | | Mesh SVID enforcement. |
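
The fail-closed criterion above can be illustrated with a minimal sketch. Only the `allowed=false` / `CONSENT_UNKNOWN` response and the 300 s TTL come from this document; the `ConsentCache` and `ConsentStore` interfaces, the key format, and the synchronous call style are assumptions made to keep the example short (a real implementation would be asynchronous).

```typescript
// Hypothetical shapes: these are not the service's real types.
type Verdict = { allowed: boolean; reason: string };

interface ConsentCache {
  get(key: string): Verdict | undefined; // undefined means cache miss
  set(key: string, verdict: Verdict, ttlSeconds: number): void;
}

interface ConsentStore {
  lookup(key: string): Verdict; // assumed to throw when Postgres is unavailable
}

const CACHE_TTL_SECONDS = 300; // TTL stated in the readiness table

function checkConsent(key: string, cache: ConsentCache, store: ConsentStore): Verdict {
  // A cache error is treated like a miss so that Postgres can still answer.
  let cached: Verdict | undefined;
  try {
    cached = cache.get(key);
  } catch {
    cached = undefined;
  }
  if (cached !== undefined) return cached;

  let verdict: Verdict;
  try {
    verdict = store.lookup(key);
  } catch {
    // Fail closed: with no authoritative answer, the send must not proceed.
    return { allowed: false, reason: "CONSENT_UNKNOWN" };
  }

  // Cache fill is best effort; a failed write must not change the verdict.
  try {
    cache.set(key, verdict, CACHE_TTL_SECONDS);
  } catch {
    /* ignore cache write failures */
  }
  return verdict;
}
```

The design point is that only the loss of the authoritative store produces `CONSENT_UNKNOWN`; a cache outage alone degrades latency, not correctness.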

2. Testing Readiness

| Criterion | Target | Status |
| --- | --- | --- |
| Unit test coverage | ≥ 90% line (domain) / ≥ 80% branch | |
| Unit tests for consent state machine transitions | ≥ 20 tests per scope | |
| Unit tests for STOP-keyword matcher per language | ≥ 50 per language (en/fa/ps/ar) | |
| Unit tests for MSISDN normalisation and hash-tokenisation | ≥ 30 | |
| Unit tests for hash-chain integrity (happy path, tamper, break) | ≥ 15 | |
| Property-based tests (fast-check): chain monotonicity, scope isolation | ≥ 10 properties | |
| Integration test: gRPC CheckConsent P95 ≤ 5 ms @ 5 000 RPS | Passed | |
| Integration test: STOP MO → consent.revoked.v1 end-to-end < 1 s | Passed | |
| Integration test: ATRA DND sync with mock endpoint | Passed | |
| Integration test: multi-region replication of control-plane data | Passed | |
| Contract test with compliance-engine (CONSENT rule integration) | Passed | |
| Contract test with routing-engine (last-mile veto) | Passed | |
| Contract test with channel-router (MO STOP detection) | Passed | |
| Chaos test: Postgres unavailable → fail-closed verified | Passed | |
| Chaos test: Redis unavailable → PG fallback, P95 degrades gracefully | Passed | |
| Chaos test: NATS lag → STOP-keyword processing queues, no message loss | Passed | |
| Security test: RLS cross-tenant read/write blocked | Passed | |
| Security test: audit log UPDATE/DELETE rejected at Postgres trigger | Passed | |
| Security test: hash-chain tamper detected by verifier within 24 h | Passed | |
| Security test: MSISDN erasure actually purges from records + audit (tokenised) | Passed | |
| Load test: 10 000 RPS sustained for 1 h, P99 ≤ 20 ms | Passed | |
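
The hash-chain integrity tests (happy path, tamper, break) exercise logic along the lines of the sketch below. It assumes that each record hash is SHA-256 over the payload followed by the previous record's hash, that hashes are hex-encoded, and that the first record links to an all-zero genesis value; the `AuditRecord` shape and function names are hypothetical and not taken from the service.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape for illustration only.
interface AuditRecord {
  payload: string;
  prevHash: string;
  recordHash: string;
}

// Assumed genesis value for the first record in a chain.
const GENESIS_HASH = "0".repeat(64);

// sha256(payload || prev_hash), hex-encoded.
function hashRecord(payload: string, prevHash: string): string {
  return createHash("sha256").update(payload).update(prevHash).digest("hex");
}

// Returns a new chain with one record appended and linked to the previous one.
function appendRecord(chain: AuditRecord[], payload: string): AuditRecord[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.recordHash : GENESIS_HASH;
  return [...chain, { payload, prevHash, recordHash: hashRecord(payload, prevHash) }];
}

// Returns the index of the first record that fails verification, or -1 if the chain is intact.
function findChainBreak(chain: AuditRecord[]): number {
  let expectedPrev = GENESIS_HASH;
  return chain.findIndex((record) => {
    const linked = record.prevHash === expectedPrev;
    const selfConsistent = record.recordHash === hashRecord(record.payload, record.prevHash);
    expectedPrev = record.recordHash;
    return !(linked && selfConsistent);
  });
}
```

A tamper test edits a payload in place and expects that record's index back; a break test removes a record and expects the index of the first record after the gap.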

3. Observability Readiness

| Criterion | Status |
| --- | --- |
| All Prometheus metrics emitting (see OBSERVABILITY.md §1) | |
| Grafana dashboard consent-ledger-service.json deployed | |
| All alerts configured in Alertmanager with runbooks | |
| Structured JSON logs (Pino) with MSISDN hash-masking | |
| OpenTelemetry trace propagation from Kong → compliance-engine → consent-ledger verified | |
| Loki parsing rules for service logs validated | |
| SIEM forwarding of consent.* events via regulator-portal-service verified | |
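
As an illustration of the MSISDN hash-masking criterion, the sketch below replaces a raw MSISDN with a keyed, truncated digest before a log entry is serialised. The use of HMAC-SHA256, the 16-character truncation, the `msisdn` field name, and the sample number are all assumptions for the example; the document does not specify the masking scheme or how it is wired into Pino.

```typescript
import { createHmac } from "node:crypto";

// Keyed digest so that masked values cannot be reversed by hashing candidate numbers
// without the key. Truncation length is an arbitrary choice for this sketch.
function maskMsisdn(msisdn: string, key: string): string {
  const digest = createHmac("sha256", key).update(msisdn).digest("hex");
  return `msisdn:${digest.slice(0, 16)}`;
}

// Returns a copy of the log entry with the (hypothetical) "msisdn" field masked.
function maskLogFields(entry: Record<string, unknown>, key: string): Record<string, unknown> {
  const out: Record<string, unknown> = { ...entry };
  const value = entry["msisdn"];
  if (typeof value === "string") {
    out["msisdn"] = maskMsisdn(value, key);
  }
  return out;
}

// Usage: the same input always maps to the same token, so log lines stay correlatable.
const masked = maskLogFields({ event: "consent.checked", msisdn: "+93700000000" }, "demo-key");
console.log(JSON.stringify(masked));
```

Because the mapping is deterministic for a given key, operators can still correlate log lines for one subscriber without the raw number ever reaching Loki.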

Alerts Configured

  • ConsentCheckLatencyHigh (gRPC CheckConsent P95 > 15 ms for 5 min)
  • ConsentCheckErrorRateHigh (> 0.1% 5xx)
  • ConsentDndSyncStale (ATRA DND last sync > 24 h)
  • ConsentStopKeywordLag (NATS consumer lag > 60 s)
  • ConsentAuditChainBroken (verifier detected break — Critical)
  • ConsentErasureSlaBreach (erasure request > 25 d old)
  • ConsentCachePostgresFallback (hot-cache fail-over rate > 5% of traffic for 10 min)
  • ConsentBulkImportFailureRate (bulk-import reject rate > 5%)

4. Security Readiness

| Criterion | Status |
| --- | --- |
| mTLS enforced on gRPC port (SPIRE SVID, per ADR-0004 §12) | |
| NetworkPolicy restricting ingress to compliance-engine, routing-engine, sms-firewall, channel-router | |
| Kong JWT validation on all REST endpoints | |
| Citizen-portal MSISDN-OTP verification flow hardened (rate-limit, anti-enumeration) | |
| RBAC: tenant scope, citizen self-only, admin | |
| MSISDN encryption at rest (per-tenant DEK wrapped by HSM KEK per ADR-0004 §11) | |
| Erasure tokenisation uses HSM-bound deterministic key (FF1) | |
| Audit log trigger rejects UPDATE/DELETE (Postgres rule) | |
| RLS policies verified on consent.records and consent.audit | |
| Penetration test against citizen-portal + gRPC completed | |
| Security team sign-off | |

5. Operational Readiness

| Criterion | Status |
| --- | --- |
| K8s Deployment manifest (3–15 replicas, HPA on gRPC RPS) reviewed | |
| PodDisruptionBudget minAvailable: 2 (per region) | |
| Rolling update tested: zero dropped gRPC calls under steady 2 000 RPS | |
| Graceful shutdown: 15 s drain with SIGTERM handler | |
| Resource requests/limits validated under 5 000 RPS load | |
| Postgres connection pool sized (pgbouncer in transaction mode recommended) | |
| Redis connection pool sized (min 50, max 200 per pod) | |
| Multi-region replication verified (kbl ↔ mzr logical replication for control-plane) | |
| ATRA DND-sync runbook drafted | |
| Erasure-request handling runbook drafted (Legal + Trust & Safety joint) | |
| Hash-chain-break incident runbook drafted | |
| On-call rotation assigned (Trust & Safety + SRE shared primary) | |

6. Documentation Readiness

| Document | Status |
| --- | --- |
| SERVICE_OVERVIEW.md | Complete |
| DOMAIN_MODEL.md | Complete |
| APPLICATION_LOGIC.md | Complete |
| API_CONTRACTS.md | Complete |
| EVENT_SCHEMAS.md | Complete |
| DATA_MODEL.md | Complete |
| SYNC_CONTRACT.md | Complete |
| SECURITY_MODEL.md | Complete |
| OBSERVABILITY.md | Complete |
| TESTING_STRATEGY.md | Complete |
| DEPLOYMENT_TOPOLOGY.md | Complete |
| FAILURE_MODES.md | Complete |
| LOCAL_DEV_SETUP.md | Complete |
| AI_INTEGRATION.md | Complete |
| MIGRATION_PLAN.md | Complete |
| SERVICE_RISK_REGISTER.md | Complete |
| Runbook: DND sync staleness | |
| Runbook: hash-chain verifier break | |
| Runbook: STOP-keyword false positive triage | |
| Runbook: citizen erasure end-to-end | |
| Legal briefing: 7-year retention + GDPR erasure interaction | |
| Operator handbook for Trust & Safety reviewers | |

7. Compliance / Regulatory Readiness

| Criterion | Status |
| --- | --- |
| DPIA authored for MSISDN processing and hashing | |
| Legal review of 7-year audit retention vs. GDPR right-to-erasure interaction | |
| ATRA Memorandum of Understanding on National DND registry integration | |
| Citizen-portal Terms of Use and privacy notice approved | |
| STOP-keyword default catalog reviewed by Legal and Trust & Safety lead | |
| Initial scope catalog (TRANSACTIONAL/MARKETING/OTP/EMERGENCY) approved | |
| Cross-tenant STOP-propagation policy signed off (per-tenant default) | |
| Tenant-portal consent-inspection flow reviewed | |
| 7-year retention policy with S3 immutable bucket configured | |
| SIEM-forwarding of consent events to regulator-portal approved | |

8. Go/No-Go Criteria Summary

Production deployment is GO when all of the following are met:

  • All items in §1 Code Readiness complete.
  • Unit coverage ≥ 90% line (domain) and ≥ 80% branch; all integration tests green.
  • Load test at 1.5× expected peak RPS (target 7 500 RPS) sustains P99 ≤ 25 ms.
  • Hash-chain verifier has run 14 consecutive days in staging with no breaks.
  • Chaos drill: 4 separate failure injections (Postgres, Redis, NATS, ATRA-DND) all degrade as designed.
  • Legal + Security + Regulator Liaison sign-offs obtained.
  • 14-day shadow mode completed with metrics parity.
  • Tenant migration path (bulk-import) validated with at least 3 design partner tenants.
  • Rollback plan validated in staging.
  • SIEM forwarding of consent events verified in staging with downstream regulator-portal.

9. Post-Launch Review

Within 30 days of full enforcement:

  • False-positive rate audit (target: < 0.5% of STOP-keyword detections turn out to be legitimate messages that were not opt-out requests).
  • Tenant opt-in recording volume vs. forecast; adjust quotas if needed.
  • National-DND sync reliability audit (target: > 99% daily syncs successful).
  • Erasure SLA audit (target: 100% completed within 30 d; 95% within 14 d).
  • Hash-chain integrity audit (0 breaks).
  • CheckConsent cache hit ratio (target: ≥ 97%).
  • Citizen-portal MSISDN-verification abuse detection review.
  • Cost analysis: per-1000 CheckConsent calls; NATS bandwidth; S3 archive growth.
  • Scope-taxonomy usage analysis: are all four scopes being used? If one is dead, reconsider.
  • Post-launch tuning of the STOP-keyword catalog based on false-positive reports.
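
The erasure SLA audit in the list above reduces to simple arithmetic over completion times. The sketch below assumes the audit input is a list of completion durations in days, one per erasure request; the function and field names are hypothetical.

```typescript
// Hypothetical audit result shape.
interface ErasureSlaAudit {
  total: number;
  within14Days: number;
  within30Days: number;
  meetsTargets: boolean;
}

// Targets from the review list: 100% completed within 30 d, 95% within 14 d.
function auditErasureSla(completionDays: number[]): ErasureSlaAudit {
  const total = completionDays.length;
  const within14Days = completionDays.filter((d) => d <= 14).length;
  const within30Days = completionDays.filter((d) => d <= 30).length;
  // Integer comparison avoids floating-point rounding at the 95% boundary.
  const meetsTargets = within30Days === total && within14Days * 100 >= 95 * total;
  return { total, within14Days, within30Days, meetsTargets };
}
```

For example, 20 requests with 19 completed in 10 days and one in 20 days meet both targets, while a single request completed on day 31 fails the audit regardless of the rest.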

10. Phased Rollout

The service follows a 3-phase rollout to de-risk national-backbone deployment:

| Phase | Duration | Behaviour | Exit criteria |
| --- | --- | --- | --- |
| P1: Shadow | 14 d | CheckConsent returns allowed=true for all requests and records hypothetical verdicts in the audit table. No enforcement. | Metrics parity; false-positive rate projection < 1%. |
| P2: Enforcement (single scope) | 7 d | Enforcement is on for the MARKETING scope only; other scopes remain in shadow mode. | No tenant escalation beyond expected; SLA met. |
| P3: Full Enforcement | Ongoing | All scopes enforced; citizen-portal live; STOP-keyword processing live. | N/A (steady state). |

Rollback at any phase: set the feature flag CONSENT_ENFORCEMENT_ENABLED=false; consumers then fall back to their previous behaviour (which, per the critique, was implicit).
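
The phase behaviour and rollback flag can be summarised as a small decision function. This is a sketch under assumptions: the phase identifiers, the `Decision` shape, and the function names are hypothetical, and a disabled flag is modelled here as "never block", which is only one possible reading of consumers falling back to their previous behaviour.

```typescript
type Phase = "P1_SHADOW" | "P2_SINGLE_SCOPE" | "P3_FULL"; // hypothetical identifiers
type Scope = "TRANSACTIONAL" | "MARKETING" | "OTP" | "EMERGENCY";

interface Decision {
  allowed: boolean;             // what the caller acts on
  enforced: boolean;            // whether the ledger verdict was applied
  hypotheticalAllowed: boolean; // the ledger verdict, recorded for audit in shadow mode
}

function isEnforced(phase: Phase, scope: Scope, enforcementEnabled: boolean): boolean {
  if (!enforcementEnabled) return false; // CONSENT_ENFORCEMENT_ENABLED=false (rollback)
  if (phase === "P3_FULL") return true;
  if (phase === "P2_SINGLE_SCOPE") return scope === "MARKETING";
  return false; // P1: shadow only
}

function decide(phase: Phase, scope: Scope, enforcementEnabled: boolean, ledgerAllows: boolean): Decision {
  const enforced = isEnforced(phase, scope, enforcementEnabled);
  return {
    allowed: enforced ? ledgerAllows : true, // shadow mode never blocks a send
    enforced,
    hypotheticalAllowed: ledgerAllows,
  };
}
```

Keeping the ledger verdict in `hypotheticalAllowed` even when it is not enforced is what makes the shadow-phase false-positive projection possible.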