Consent Ledger Service — Observability
Version: 1.0 | Status: Draft | Owner: Trust & Safety / Platform SRE | Last Updated: 2026-04-21 | Companion: FAILURE_MODES · SERVICE_READINESS · APPLICATION_LOGIC
1. SLIs / SLOs
Bound to the platform NFR catalog (EP-PLAT-NB-09).
| SLI | SLO | Window | Burn-rate alert |
|---|---|---|---|
| CheckConsent cache-hit P95 latency | ≤ 5 ms | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent end-to-end P99 latency (any path) | ≤ 20 ms | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent availability (non-error rate) | ≥ 99.95% | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent cache hit ratio | ≥ 90% sustained | 7 d | < 70% for 30 min → MEDIUM |
| RecordConsent write P95 latency | ≤ 80 ms | 28 d | 14 × 1 h |
| RecordConsent availability | ≥ 99.9% | 28 d | 14 × 1 h |
| MO-STOP end-to-end (MO consumed → ack-back enqueued) P95 | ≤ 2 s | 28 d | 14 × 1 h |
| ATRA DND sync freshness | last successful run < 24 h | continuous | 24 h stale → HIGH |
| Audit hash-chain integrity | Daily verifier returns OK | continuous | break → CRITICAL (page) |
| Outbox publish lag P95 | ≤ 5 s | 28 d | > 60 s for 5 min → HIGH |
| Citizen erasure SLA | 100% completed within 30 d | rolling | any breach → HIGH |
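The burn-rate column can be turned into concrete error-ratio thresholds with a little arithmetic. The sketch below assumes that "14 × 1 h" means "alert when the error budget burns at 14 times the sustainable rate over a 1 h window"; that reading, and all function names, are assumptions made for illustration rather than part of the NFR catalog.

```typescript
// Hypothetical helpers. Reading "14 × 1 h" as "burn rate 14 over a 1 h window" is an assumption.

/** Fraction of requests allowed to fail under an SLO target such as 0.9995. */
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

/** Burn rate: observed error ratio divided by the budget. 1.0 spends the budget exactly over the SLO window. */
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / errorBudget(sloTarget);
}

/** Share of the whole-window budget consumed if `rate` is sustained for `alertWindowHours`. */
function budgetConsumed(rate: number, alertWindowHours: number, sloWindowDays: number): number {
  return (rate * alertWindowHours) / (sloWindowDays * 24);
}

const availabilitySlo = 0.9995; // CheckConsent availability target from the table

// Error ratio that would trip the 1 h window at burn rate 14.
console.log((14 * errorBudget(availabilitySlo)).toFixed(4)); // 0.0070

// Share of the 28 d budget spent by each alert window at its threshold.
console.log(budgetConsumed(14, 1, 28).toFixed(4)); // 0.0208
console.log(budgetConsumed(6, 6, 28).toFixed(4)); // 0.0536
```

Under this reading, the fast window fires once about 2% of the 28 d budget has gone in an hour, and the slow window once about 5% has gone in six hours.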
2. Prometheus metrics
All metrics are exposed at GET /metrics on port 3071 and carry the label service="consent-ledger-service".
2.1 CheckConsent
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_check_total | Counter | verdict (allowed/blocked), reason, tenant_id, scope | Total CheckConsent calls by outcome |
| consent_check_duration_seconds | Histogram | path (cache_hit/db/failclosed) | End-to-end CheckConsent latency |
| consent_check_cache_hits_total | Counter | tenant_id | Redis hot-cache hits |
| consent_check_cache_misses_total | Counter | tenant_id | Redis hot-cache misses |
| consent_check_dnd_block_total | Counter | scope | National-DND short-circuit blocks |
| consent_check_failclosed_total | Counter | cause (db_unavailable, redis_and_db_unavailable, unknown) | Fail-closed activations |
| consent_check_concurrency_in_flight | Gauge | — | In-flight CheckConsent calls |
Histogram buckets for consent_check_duration_seconds: [0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020, 0.050, 0.100, 0.500].
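The bucket boundaries matter because quantiles are estimated from bucket counts, not from raw samples. The following sketch mirrors the linear interpolation that PromQL's histogram_quantile applies to classic histograms; the function name and the sample counts are invented for illustration.

```typescript
// Bucket upper bounds from the line above, in seconds.
const BUCKETS = [0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.02, 0.05, 0.1, 0.5];

/**
 * Estimate quantile `q` from cumulative bucket counts.
 * cumulative[i] is the number of observations <= BUCKETS[i]; `total` also counts the +Inf bucket.
 */
function estimateQuantile(q: number, cumulative: number[], total: number): number {
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < BUCKETS.length; i++) {
    if (cumulative[i] >= rank) {
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const below = i === 0 ? 0 : cumulative[i - 1];
      const inBucket = cumulative[i] - below;
      if (inBucket === 0) return lower;
      // Assume observations are spread evenly inside the bucket.
      return lower + (BUCKETS[i] - lower) * ((rank - below) / inBucket);
    }
  }
  // Rank falls in the +Inf bucket: report the highest finite bound.
  return BUCKETS[BUCKETS.length - 1];
}

// Invented sample: 1000 cache-hit calls.
const cumulative = [200, 500, 800, 940, 980, 990, 996, 999, 1000, 1000];
console.log((estimateQuantile(0.95, cumulative, 1000) * 1000).toFixed(2)); // 5.75 (ms)
```

In this invented sample only 94% of calls finish within 5 ms, so the estimated P95 lands in the 5 to 8 ms bucket and the 5 ms target is missed. Having a bucket edge exactly at each SLO threshold (0.005 and 0.020) keeps that comparison free of interpolation error.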
2.2 Records / writes
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_records_written_total | Counter | tenant_id, scope, verification_method, outcome (created/replaced/noop) | Record writes |
| consent_records_revoked_total | Counter | tenant_id, scope, revoked_reason | Revocations |
| consent_record_write_duration_seconds | Histogram | op (record/revoke/bulk_row) | Write latency |
| consent_idempotency_hits_total | Counter | endpoint | Cached idempotent responses returned |
2.3 STOP MO consumer
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_stop_mo_received_total | Counter | language, match_outcome (matched/no_match/orphan_sender) | MO events processed |
| consent_stop_mo_duration_seconds | Histogram | match_outcome | End-to-end MO → revoke + ack-back enqueue latency |
| consent_stop_mo_consumer_lag_seconds | Gauge | — | NATS num_pending translated to seconds via observed throughput |
| consent_ack_back_sent_total | Counter | language | Ack-back SMS enqueued |
| consent_ack_back_dedup_skipped_total | Counter | — | One-shot dedup hits (subscriber STOP-flooded the same MSISDN) |
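The consent_stop_mo_consumer_lag_seconds gauge is derived rather than measured directly. A minimal sketch of that derivation follows, assuming throughput is taken from two samples of a processed-message counter; the function names and the sampling approach are hypothetical.

```typescript
/** Messages processed per second between two samples of a monotonically increasing counter. */
function observedThroughput(prevCount: number, currCount: number, intervalSeconds: number): number {
  // A non-positive interval or a counter reset yields no usable throughput.
  if (intervalSeconds <= 0 || currCount < prevCount) return 0;
  return (currCount - prevCount) / intervalSeconds;
}

/** Translate a pending-message count into an estimated lag in seconds. */
function lagSeconds(numPending: number, processedPerSecond: number): number {
  if (numPending === 0) return 0;
  // Pending work with no throughput means the consumer is stalled.
  if (processedPerSecond <= 0) return Number.POSITIVE_INFINITY;
  return numPending / processedPerSecond;
}

// 750 messages handled in the last 30 s gives 25 msg/s; 1800 pending is 72 s of lag.
const throughput = observedThroughput(10_000, 10_750, 30);
console.log(lagSeconds(1800, throughput)); // 72
```

A value of 72 s would cross the 60 s threshold used by the ConsentStopConsumerLag alert in section 5.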
2.4 DND sync
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_dnd_sync_runs_total | Counter | outcome (success/failed) | DND sync attempts |
| consent_dnd_sync_duration_seconds | Histogram | — | Per-run duration |
| consent_dnd_sync_added_total | Counter | — | Cumulative rows added across sync runs |
| consent_dnd_sync_removed_total | Counter | — | Cumulative rows removed across sync runs |
| consent_dnd_sync_last_success_timestamp_seconds | Gauge | — | UNIX epoch of last successful sync (for the (time() - this) > 86400 alert) |
| consent_dnd_total_entries | Gauge | — | Current size of consent.dnd_registry |
2.5 Audit chain
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_audit_rows_written_total | Counter | event_type | Audit row writes |
| consent_audit_chain_verifier_runs_total | Counter | outcome (ok/broken) | Verifier executions |
| consent_audit_chain_verifier_duration_seconds | Histogram | — | Verifier run duration |
| consent_audit_chain_verifier_last_ok_timestamp_seconds | Gauge | — | Last successful verification UNIX epoch |
| consent_audit_chain_breaks_detected_total | Counter | partition | CRITICAL: must equal 0 |
| consent_audit_partition_count | Gauge | — | Active partitions in consent.audit |
| consent_audit_oldest_unarchived_partition_age_days | Gauge | — | Detects partition-archive lag |
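To make the verifier metrics concrete, here is a minimal sketch of a hash-chain check. The row layout, the SHA-256 construction, and the all-zero genesis hash are assumptions for illustration; only the reported field names (firstBadSeq, expectedPrevHash, actualPrevHash) come from the consent.audit.chain_broken log event in section 3.

```typescript
import { createHash } from "node:crypto";

// Assumed row shape; the real consent.audit schema may differ.
interface AuditRow {
  seq: number;
  payload: string; // canonical serialisation of the audit event
  prevHash: string; // hash of the previous row, as stored on this row
  hash: string; // hash of this row
}

const GENESIS_HASH = "0".repeat(64); // assumed prevHash of the first row

function rowHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

function appendRow(rows: AuditRow[], payload: string): AuditRow[] {
  const prevHash = rows.length === 0 ? GENESIS_HASH : rows[rows.length - 1].hash;
  return [...rows, { seq: rows.length + 1, payload, prevHash, hash: rowHash(prevHash, payload) }];
}

interface ChainBreak {
  firstBadSeq: number;
  expectedPrevHash: string;
  actualPrevHash: string;
}

/** Returns null when the chain is intact, otherwise details of the first bad row. */
function verifyChain(rows: AuditRow[]): ChainBreak | null {
  let expectedPrev = GENESIS_HASH;
  for (const row of rows) {
    const linked = row.prevHash === expectedPrev;
    const selfConsistent = row.hash === rowHash(row.prevHash, row.payload);
    if (!linked || !selfConsistent) {
      return { firstBadSeq: row.seq, expectedPrevHash: expectedPrev, actualPrevHash: row.prevHash };
    }
    expectedPrev = row.hash;
  }
  return null;
}

let chain: AuditRow[] = [];
for (const payload of ["event-1", "event-2", "event-3"]) chain = appendRow(chain, payload);
console.log(verifyChain(chain)); // null

// Editing a payload without recomputing hashes breaks the chain at that row.
const tampered = chain.map((r) => (r.seq === 2 ? { ...r, payload: "edited" } : r));
console.log(verifyChain(tampered)?.firstBadSeq); // 2
```

A verifier along these lines would increment consent_audit_chain_breaks_detected_total whenever the result is non-null and update the last-OK timestamp gauge otherwise.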
2.6 Erasure
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_erasure_requests_total | Counter | requested_via, status | Erasure requests |
| consent_erasure_processing_duration_seconds | Histogram | — | Per-request processing time |
| consent_erasure_records_redacted_total | Counter | — | Total record rows redacted |
| consent_erasure_audit_rows_redacted_total | Counter | — | Total audit rows redacted (for chain-preservation accounting) |
| consent_erasure_sla_breach_total | Counter | — | Erasures past the 30 d SLA; must equal 0 |
2.7 Outbox
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_outbox_unpublished_count | Gauge | — | Unpublished rows currently waiting |
| consent_outbox_oldest_unpublished_age_seconds | Gauge | — | Age of the oldest waiting row |
| consent_outbox_publish_total | Counter | subject, outcome | Publish attempts |
| consent_outbox_publish_duration_seconds | Histogram | subject | NATS publish latency |
2.8 Citizen flows
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_citizen_otp_requested_total | Counter | — | OTP requests |
| consent_citizen_otp_verified_total | Counter | outcome | OTP verifications |
| consent_citizen_revoke_total | Counter | scope_or_all | Citizen-portal revokes |
| consent_citizen_inspection_total | Counter | — | Self-service inspections (audit-relevant) |
2.9 Dependencies / health
| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_pg_pool_in_use | Gauge | pool (primary/replica) | PgBouncer pool usage |
| consent_redis_op_duration_seconds | Histogram | op (get/set/del) | Redis op latency |
| consent_vault_unwrap_duration_seconds | Histogram | — | KEK unwrap latency |
| consent_vault_unwrap_errors_total | Counter | — | KEK unwrap failures |
3. Structured log events
All logs are valid JSON (Pino), redacted per SECURITY_MODEL §3.3. LOG_LEVEL env var controls verbosity.
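As a rough illustration of the redaction step, the sketch below builds a consent.check event in which the raw MSISDN never reaches the log line. The six-character visible prefix, the peppered SHA-256 hash, and every function name are assumptions; the authoritative rules are those in SECURITY_MODEL §3.3. Plain JSON.stringify stands in for the Pino logger so the example stays self-contained.

```typescript
import { createHash } from "node:crypto";

/** Keep a short prefix and mask the rest, e.g. "+93701***". Prefix length is an assumption. */
function maskMsisdn(msisdn: string, visiblePrefix = 6): string {
  return msisdn.slice(0, visiblePrefix) + "***";
}

/** One-way hash for correlating events without exposing the number. Pepper handling is an assumption. */
function hashMsisdn(msisdn: string, pepper: string): string {
  return createHash("sha256").update(pepper).update(msisdn).digest("hex");
}

interface CheckOutcome {
  tenantId: string;
  msisdn: string; // raw value, must not be logged
  scope: string;
  verdict: "allowed" | "blocked";
  reason: string;
  path: "cache_hit" | "db" | "failclosed";
  latencyMs: number;
}

/** Serialise a consent.check event with the MSISDN replaced by its hash. */
function buildCheckEvent(outcome: CheckOutcome, pepper: string): string {
  const { msisdn, ...safeFields } = outcome;
  return JSON.stringify({
    level: "info",
    time: new Date().toISOString(),
    event: "consent.check",
    ...safeFields,
    msisdnHash: hashMsisdn(msisdn, pepper),
  });
}

console.log(maskMsisdn("+93701234567")); // +93701***
console.log(
  buildCheckEvent(
    { tenantId: "t-1", msisdn: "+93701234567", scope: "MARKETING", verdict: "allowed",
      reason: "ALLOWED_TENANT_RECORD", path: "cache_hit", latencyMs: 2 },
    "example-pepper",
  ),
);
```

Events of the following shape are emitted: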
{ "level": "info", "time": "2026-04-21T11:00:00.000Z",
"event": "consent.check", "tenantId": "1111…", "msisdnHash": "8d4f…", "scope": "MARKETING",
"verdict": "allowed", "reason": "ALLOWED_TENANT_RECORD", "path": "cache_hit", "latencyMs": 2,
"traceId": "00-abc-def-01" }
{ "level": "info", "event": "consent.recorded", "tenantId": "1111…", "msisdnMasked": "+93701***",
"scope": "MARKETING", "verificationMethod": "DOUBLE_OPT_IN", "recordId": "cn_…",
"previousRecordId": null, "auditId": "cna_…" }
{ "level": "warn", "event": "consent.stop_mo.matched", "moId": "mo_…", "msisdnMasked": "+93701***",
"matchedKeyword": "stop", "matchedLanguage": "EN", "policyApplied": "PER_TENANT",
"tenantsRevoked": ["1111…"], "ackBackEnqueued": true, "latencyMs": 850 }
{ "level": "error", "event": "consent.audit.chain_broken", "verifierRunId": "cvr_…",
"partition": "consent_audit_2026_04", "firstBadSeq": 4837521, "auditId": "cna_…",
"expectedPrevHash": "ab12…", "actualPrevHash": "cd34…" }
{ "level": "error", "event": "consent.dnd.sync.failed", "errorClass": "atra_unreachable",
"lastSuccessAt": "2026-04-20T03:00:00Z", "ageHours": 24.5 }
4. OpenTelemetry tracing
Parent span: consent-ledger-service.<rpc-or-handler>.
| Span | Op | Attributes |
|---|---|---|
| consent.check | gRPC CheckConsent | tenantId, scope, path, verdict, reason, latencyMs |
| consent.cache.lookup | Redis MGET | cache.hit, keyCount |
| consent.db.lookup | PG SELECT | db.system=postgresql, db.statement (parameterised) |
| consent.record.write | UC-RecordConsent | recordId, verificationMethod, replacedPrevious |
| consent.audit.write | UC-WriteAudit | auditId, partition, seq |
| consent.outbox.publish | NATS publish | subject, eventId |
| consent.stop_mo.handle | NATS consumer | moId, matchedKeyword, tenantsRevoked |
| consent.dnd.sync.run | Worker | runId, addedCount, removedCount |
| consent.audit.verifier.run | Worker | partition, outcome |
Trace context propagates as a W3C Trace Context traceparent value, carried in gRPC metadata and in a NATS message header.
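For reference, a traceparent value has four dash-separated lowercase-hex fields: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags. The sketch below parses and formats the version 00 layout only; the function names are hypothetical and a production service would normally rely on the OpenTelemetry propagator rather than hand-rolled parsing.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/** Parse a traceparent header value; returns null when the value is malformed. */
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return null; // version ff is invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

/** Build a version 00 traceparent value for an outgoing gRPC call or NATS message. */
function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const ctx = parseTraceparent(incoming);
console.log(ctx?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx?.sampled); // true
```

When publishing to NATS, the child span keeps the same trace ID and substitutes its own span ID as the parent ID, so the consumer's spans join the originating trace.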
5. Alert rules
groups:
  - name: consent-ledger-service
    rules:
      # ---- CRITICAL ----
      - alert: ConsentAuditChainBroken
        expr: increase(consent_audit_chain_breaks_detected_total[10m]) > 0
        labels: { severity: critical, page: trust_and_safety_oncall }
        annotations:
          summary: "Consent audit hash-chain break detected"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/audit-chain-broken
      - alert: ConsentCheckFailclosedSurge
        expr: rate(consent_check_failclosed_total[5m]) > 1
        labels: { severity: critical, page: trust_and_safety_oncall }
        annotations:
          summary: "CheckConsent fail-closed >1/s — DB+Redis dual outage suspected"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/checkconsent-failclosed
      - alert: ConsentServiceDown
        expr: up{job="consent-ledger-service"} == 0
        for: 2m
        labels: { severity: critical, page: platform_sre_oncall }
      # ---- HIGH ----
      - alert: ConsentDndStale
        expr: (time() - consent_dnd_sync_last_success_timestamp_seconds) > 86400
        labels: { severity: high }
        annotations:
          summary: "ATRA DND sync stale > 24 h — running on last-known data"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/dnd-stale
      - alert: ConsentCheckP95High
        expr: histogram_quantile(0.95, rate(consent_check_duration_seconds_bucket{path="cache_hit"}[5m])) > 0.005
        for: 10m
        labels: { severity: high }
      - alert: ConsentStopConsumerLag
        expr: consent_stop_mo_consumer_lag_seconds > 60
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "STOP MO consumer >60s lag — subscriber STOPs not honoured promptly (regulator risk)"
      - alert: ConsentOutboxBacklog
        expr: consent_outbox_unpublished_count > 1000 or consent_outbox_oldest_unpublished_age_seconds > 60
        for: 5m
        labels: { severity: high }
      - alert: ConsentVaultUnwrapErrors
        expr: rate(consent_vault_unwrap_errors_total[5m]) > 0.1
        for: 5m
        labels: { severity: high }
      # Severity follows the section 1 SLO table ("any breach → HIGH").
      - alert: ConsentErasureSLABreach
        expr: increase(consent_erasure_sla_breach_total[1h]) > 0
        labels: { severity: high }
      # ---- MEDIUM ----
      # Severity follows the section 1 SLO table ("< 70% for 30 min → MEDIUM").
      - alert: ConsentCheckCacheHitLow
        expr: |
          (rate(consent_check_cache_hits_total[10m]) /
          (rate(consent_check_cache_hits_total[10m]) + rate(consent_check_cache_misses_total[10m]))) < 0.7
        for: 30m
        labels: { severity: medium }
      - alert: ConsentRecordWriteSlow
        expr: histogram_quantile(0.95, rate(consent_record_write_duration_seconds_bucket[5m])) > 0.080
        for: 15m
        labels: { severity: medium }
      - alert: ConsentAuditPartitionMissing
        expr: consent_audit_oldest_unarchived_partition_age_days > 14
        labels: { severity: medium }
      - alert: ConsentCitizenOtpAbuse
        expr: rate(consent_citizen_otp_requested_total[1h]) > 100
        for: 30m
        labels: { severity: medium }
Each alert links to a runbook under https://runbooks.ghasi.gov.af/consent-ledger/<alert-slug>.
6. Grafana dashboard panels
Dashboard: dashboards/consent-ledger-service.json.
| Panel | Query | Visualization |
|---|---|---|
| CheckConsent rate | sum by(verdict)(rate(consent_check_total[5m])) | Stacked area |
| CheckConsent P50/P95/P99 | histogram_quantile(.50/.95/.99, rate(consent_check_duration_seconds_bucket[5m])) | Time series |
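| CheckConsent cache hit ratio | sum(rate(consent_check_cache_hits_total[5m])) / (sum(rate(consent_check_cache_hits_total[5m])) + sum(rate(consent_check_cache_misses_total[5m]))) | Gauge + time series |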
| Reasons distribution (24 h) | sum by(reason)(increase(consent_check_total[24h])) | Pie |
| RecordConsent rate | sum by(verification_method, outcome)(rate(consent_records_written_total[5m])) | Stacked area |
| Top tenants by writes (24 h) | topk(10, sum by(tenant_id)(increase(consent_records_written_total[24h]))) | Bar |
| STOP MO end-to-end P95 | histogram_quantile(.95, rate(consent_stop_mo_duration_seconds_bucket[5m])) | Time series |
| STOP MO consumer lag | consent_stop_mo_consumer_lag_seconds | Time series |
| STOP keyword hit rate by language | sum by(language)(rate(consent_stop_mo_received_total{match_outcome="matched"}[5m])) | Stacked area |
| ATRA DND last sync | time() - consent_dnd_sync_last_success_timestamp_seconds | Stat (red if > 24 h) |
| DND total entries | consent_dnd_total_entries | Stat |
| Audit chain status | consent_audit_chain_verifier_runs_total (and breaks counter) | Stat (CRITICAL if breaks > 0) |
| Outbox backlog | consent_outbox_unpublished_count and consent_outbox_oldest_unpublished_age_seconds | Time series |
| PG pool usage | consent_pg_pool_in_use | Time series |
| Erasure pipeline | requested vs completed counters; SLA breaches | Mixed |
| Citizen flows | OTP, verify, revoke, inspection counters | Stacked area |
Dashboard variables: tenant_id, scope, language, region (Kabul/Herat/Mazar), time_range.
7. Runbooks (linked from alerts)
Each runbook contains:
- What this alert means in plain words
- Likely causes
- Diagnostic queries (Prom + log search + SQL)
- Mitigation steps (with safety notes)
- Escalation path
- Post-incident review template
Runbook inventory:
- audit-chain-broken.md: CRITICAL response; freeze writes, snapshot affected partition, investigate, NEVER auto-resolve
- checkconsent-failclosed.md: investigate Postgres + Redis health concurrently
- dnd-stale.md: manual ATRA fetch, escalate to ATRA NOC if remote
- stop-consumer-lag.md: scale consumer; investigate downstream NATS health
- outbox-backlog.md: investigate NATS health; check for poison messages
- vault-unwrap-errors.md: fall back to cached DEKs; investigate Vault Transit
- erasure-sla-breach.md: manual processor run; explain to legal
- partition-missing.md: manual partition creation; investigate cron worker