
Consent Ledger Service — Observability

Version: 1.0
Status: Draft
Owner: Trust & Safety / Platform SRE
Last Updated: 2026-04-21
Companion: FAILURE_MODES · SERVICE_READINESS · APPLICATION_LOGIC

1. SLIs / SLOs

These SLOs are bound to the platform NFR catalog (EP-PLAT-NB-09).

| SLI | SLO | Window | Burn-rate alert |
|---|---|---|---|
| CheckConsent cache-hit P95 latency | ≤ 5 ms | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent end-to-end P99 latency (any path) | ≤ 20 ms | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent availability (non-error rate) | ≥ 99.95% | 28 d | 14 × 1 h, 6 × 6 h |
| CheckConsent cache hit ratio | ≥ 90% sustained | 7 d | < 70% for 30 min → MEDIUM |
| RecordConsent write P95 latency | ≤ 80 ms | 28 d | 14 × 1 h |
| RecordConsent availability | ≥ 99.9% | 28 d | 14 × 1 h |
| MO-STOP end-to-end (MO consumed → ack-back enqueued) P95 | ≤ 2 s | 28 d | 14 × 1 h |
| ATRA DND sync freshness | last successful run < 24 h | continuous | 24 h stale → HIGH |
| Audit hash-chain integrity | Daily verifier returns OK | continuous | break → CRITICAL (page) |
| Outbox publish lag P95 | ≤ 5 s | 28 d | > 60 s for 5 min → HIGH |
| Citizen erasure SLA | 100% completed within 30 d | rolling | any breach → HIGH |
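
The burn-rate column can be turned into concrete error-ratio thresholds. The sketch below assumes that "14 × 1 h" means "budget burning at 14 times the sustainable rate, measured over a 1 h window" (the usual multi-window reading; this document does not define the notation). The function names are illustrative and not part of the service.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.9995 -> 0.0005)."""
    return 1.0 - slo


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / error_budget(slo)


def budget_consumed(burn: float, window_hours: float, slo_window_days: float = 28.0) -> float:
    """Share of the whole SLO-window budget spent if `burn` holds for `window_hours`."""
    return burn * window_hours / (slo_window_days * 24.0)


# CheckConsent availability: SLO 99.95% over 28 d (from the table above).
availability_slo = 0.9995
fast_threshold = 14 * error_budget(availability_slo)  # error ratio that trips the 1 h window
slow_threshold = 6 * error_budget(availability_slo)   # error ratio that trips the 6 h window

print(f"1 h window fires above {fast_threshold:.2%} errors")
print(f"6 h window fires above {slow_threshold:.2%} errors")
print(f"budget spent by a 14x burn held for 1 h: {budget_consumed(14, 1):.1%}")
print(f"budget spent by a 6x burn held for 6 h: {budget_consumed(6, 6):.1%}")
```

Under that reading, the 1 h window fires at roughly 0.7% errors and the 6 h window at roughly 0.3%. A sustained fast burn spends about 2.1% of the 28 d budget before paging; a sustained slow burn spends about 5.4%.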

2. Prometheus metrics

All metrics are exposed at GET /metrics on port 3071 and carry the label service="consent-ledger-service".

2.1 CheckConsent

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_check_total | Counter | verdict (allowed/blocked), reason, tenant_id, scope | Total CheckConsent calls by outcome |
| consent_check_duration_seconds | Histogram | path (cache_hit/db/failclosed) | End-to-end CheckConsent latency |
| consent_check_cache_hits_total | Counter | tenant_id | Redis hot-cache hits |
| consent_check_cache_misses_total | Counter | tenant_id | Redis hot-cache misses |
| consent_check_dnd_block_total | Counter | scope | National-DND short-circuit blocks |
| consent_check_failclosed_total | Counter | cause (db_unavailable, redis_and_db_unavailable, unknown) | Fail-closed activations |
| consent_check_concurrency_in_flight | Gauge | | In-flight CheckConsent calls |

Histogram buckets for consent_check_duration_seconds: [0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020, 0.050, 0.100, 0.500].
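
The bucket layout matters because quantiles are estimated from bucket counts, not from raw samples. The sketch below approximates the linear interpolation that Prometheus `histogram_quantile` applies inside a bucket. The function name and the sample counts are made up for illustration; only the bucket bounds come from this document.

```python
from typing import Sequence

# Finite upper bounds from this section; Prometheus adds the +Inf bucket itself.
BUCKETS = [0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020, 0.050, 0.100, 0.500]


def estimate_quantile(q: float, bounds: Sequence[float], cumulative: Sequence[float]) -> float:
    """Estimate quantile q from cumulative bucket counts.

    `cumulative` holds one count per finite bound plus a final +Inf count.
    """
    if len(cumulative) != len(bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, upper in enumerate(bounds):
        if cumulative[i] >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0.0
            in_bucket = cumulative[i] - below
            if in_bucket == 0:
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    # The quantile falls in the +Inf bucket: report the highest finite bound.
    return bounds[-1]


# Hypothetical counts: 940 of 1000 calls finished within 5 ms.
sample_counts = [100, 600, 850, 940, 970, 985, 995, 999, 1000, 1000, 1000]
p95 = estimate_quantile(0.95, BUCKETS, sample_counts)
p99 = estimate_quantile(0.99, BUCKETS, sample_counts)
print(f"P95 ~ {p95 * 1000:.1f} ms, P99 ~ {p99 * 1000:.1f} ms")
```

Because 0.005 and 0.020 are bucket boundaries, whether the 5 ms and 20 ms objectives hold is decided by the exact counts at those bounds; interpolation only affects the reported value between boundaries. With these sample counts only 94% of calls land at or below 5 ms, so the estimated P95 comes out above the objective.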

2.2 Records / writes

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_records_written_total | Counter | tenant_id, scope, verification_method, outcome (created/replaced/noop) | Record writes |
| consent_records_revoked_total | Counter | tenant_id, scope, revoked_reason | Revocations |
| consent_record_write_duration_seconds | Histogram | op (record/revoke/bulk_row) | Write latency |
| consent_idempotency_hits_total | Counter | endpoint | Cached idempotent responses returned |

2.3 STOP MO consumer

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_stop_mo_received_total | Counter | language, match_outcome (matched/no_match/orphan_sender) | MO events processed |
| consent_stop_mo_duration_seconds | Histogram | match_outcome | End-to-end MO → revoke + ack-back enqueue latency |
| consent_stop_mo_consumer_lag_seconds | Gauge | | NATS num_pending translated to seconds via observed throughput |
| consent_ack_back_sent_total | Counter | language | Ack-back SMS enqueued |
| consent_ack_back_dedup_skipped_total | Counter | | One-shot dedup hits (subscriber STOP-flooded the same MSISDN) |
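
A minimal sketch of how consent_stop_mo_consumer_lag_seconds could be derived, assuming the translation is simply pending messages divided by the recently observed processing rate. The function name and the zero-throughput handling are assumptions; the source only states that num_pending is converted via observed throughput.

```python
import math


def consumer_lag_seconds(num_pending: int, processed_per_second: float) -> float:
    """Estimate how long the current backlog takes to drain at the observed rate."""
    if num_pending <= 0:
        return 0.0  # nothing waiting, no lag regardless of throughput
    if processed_per_second <= 0:
        # Backlog with a stalled consumer: treated as unbounded lag in this sketch.
        return math.inf
    return num_pending / processed_per_second


print(consumer_lag_seconds(1200, 20.0))
print(consumer_lag_seconds(0, 0.0))
```

With 1200 pending messages and 20 messages/s observed, the estimate is exactly 60 s, which sits on the ConsentStopConsumerLag threshold without exceeding it. A real exporter might cap the stalled-consumer case at a large finite value instead of reporting infinity.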

2.4 DND sync

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_dnd_sync_runs_total | Counter | outcome (success/failed) | DND sync attempts |
| consent_dnd_sync_duration_seconds | Histogram | | Per-run duration |
| consent_dnd_sync_added_total | Counter | | Rows added in last run |
| consent_dnd_sync_removed_total | Counter | | Rows removed in last run |
| consent_dnd_sync_last_success_timestamp_seconds | Gauge | | UNIX epoch of last successful sync (for (time() - this) > 86400 alert) |
| consent_dnd_total_entries | Gauge | | Current size of consent.dnd_registry |

2.5 Audit chain

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_audit_rows_written_total | Counter | event_type | Audit row writes |
| consent_audit_chain_verifier_runs_total | Counter | outcome (ok/broken) | Verifier executions |
| consent_audit_chain_verifier_duration_seconds | Histogram | | Verifier run duration |
| consent_audit_chain_verifier_last_ok_timestamp_seconds | Gauge | | Last successful verification UNIX epoch |
| consent_audit_chain_breaks_detected_total | Counter | partition | CRITICAL: must equal 0 |
| consent_audit_partition_count | Gauge | | Active partitions in consent.audit |
| consent_audit_oldest_unarchived_partition_age_days | Gauge | | Detects partition-archive lag |
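
To make the verifier metrics concrete, here is a sketch of a hash-chain check. It is built on assumptions: each audit row stores its predecessor's hash, a row hash is SHA-256 over the previous hash plus the payload, and the first row chains from an all-zero value. This document does not specify the real construction, so the row layout, field names and hashing scheme below are illustrative only.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional

GENESIS_HASH = "0" * 64  # assumed anchor for the first row


@dataclass(frozen=True)
class AuditRow:
    seq: int
    payload: bytes
    prev_hash: str
    row_hash: str


@dataclass(frozen=True)
class ChainBreak:
    first_bad_seq: int
    field: str  # "prev_hash" (bad link) or "row_hash" (row content changed)
    expected: str
    actual: str


def compute_hash(prev_hash: str, payload: bytes) -> str:
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_row(chain: List[AuditRow], payload: bytes) -> None:
    prev_hash = chain[-1].row_hash if chain else GENESIS_HASH
    chain.append(AuditRow(len(chain) + 1, payload, prev_hash, compute_hash(prev_hash, payload)))


def verify_chain(rows: Iterable[AuditRow]) -> Optional[ChainBreak]:
    """Walk rows in seq order and return the first inconsistency, or None if intact."""
    expected_prev = GENESIS_HASH
    for row in rows:
        if row.prev_hash != expected_prev:
            return ChainBreak(row.seq, "prev_hash", expected_prev, row.prev_hash)
        recomputed = compute_hash(row.prev_hash, row.payload)
        if row.row_hash != recomputed:
            return ChainBreak(row.seq, "row_hash", recomputed, row.row_hash)
        expected_prev = row.row_hash
    return None


chain: List[AuditRow] = []
for body in (b"row-1", b"row-2", b"row-3"):
    append_row(chain, body)

tampered = list(chain)
tampered[1] = replace(tampered[1], payload=b"row-2-edited")  # edit a row in place

print(verify_chain(chain))
print(verify_chain(tampered))
```

A verifier of this shape would increment consent_audit_chain_verifier_runs_total with outcome ok or broken, and consent_audit_chain_breaks_detected_total whenever a break is returned. Reporting the first bad sequence number gives the on-call engineer a starting point for the investigation.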

2.6 Erasure

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_erasure_requests_total | Counter | requested_via, status | Erasure requests |
| consent_erasure_processing_duration_seconds | Histogram | | Per-request processing time |
| consent_erasure_records_redacted_total | Counter | | Total record rows redacted |
| consent_erasure_audit_rows_redacted_total | Counter | | Total audit rows redacted (for chain-preservation accounting) |
| consent_erasure_sla_breach_total | Counter | | Erasures past 30 d SLA; must equal 0 |

2.7 Outbox

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_outbox_unpublished_count | Gauge | | Unpublished rows currently waiting |
| consent_outbox_oldest_unpublished_age_seconds | Gauge | | Age of the oldest waiting row |
| consent_outbox_publish_total | Counter | subject, outcome | Publish attempts |
| consent_outbox_publish_duration_seconds | Histogram | subject | NATS publish latency |

2.8 Citizen flows

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_citizen_otp_requested_total | Counter | | OTP requests |
| consent_citizen_otp_verified_total | Counter | outcome | OTP verifications |
| consent_citizen_revoke_total | Counter | scope_or_all | Citizen-portal revokes |
| consent_citizen_inspection_total | Counter | | Self-service inspections (audit-relevant) |

2.9 Dependencies / health

| Metric | Type | Labels | Description |
|---|---|---|---|
| consent_pg_pool_in_use | Gauge | pool (primary/replica) | PgBouncer pool usage |
| consent_redis_op_duration_seconds | Histogram | op (get/set/del) | Redis op latency |
| consent_vault_unwrap_duration_seconds | Histogram | | KEK unwrap latency |
| consent_vault_unwrap_errors_total | Counter | | KEK unwrap failures |

3. Structured log events

All logs are emitted as valid JSON via Pino and redacted per SECURITY_MODEL §3.3. The LOG_LEVEL environment variable controls verbosity.

{ "level": "info", "time": "2026-04-21T11:00:00.000Z",
"event": "consent.check", "tenantId": "1111…", "msisdnHash": "8d4f…", "scope": "MARKETING",
"verdict": "allowed", "reason": "ALLOWED_TENANT_RECORD", "path": "cache_hit", "latencyMs": 2,
"traceId": "00-abc-def-01" }
{ "level": "info", "event": "consent.recorded", "tenantId": "1111…", "msisdnMasked": "+93701***",
"scope": "MARKETING", "verificationMethod": "DOUBLE_OPT_IN", "recordId": "cn_…",
"previousRecordId": null, "auditId": "cna_…" }
{ "level": "warn", "event": "consent.stop_mo.matched", "moId": "mo_…", "msisdnMasked": "+93701***",
"matchedKeyword": "stop", "matchedLanguage": "EN", "policyApplied": "PER_TENANT",
"tenantsRevoked": ["1111…"], "ackBackEnqueued": true, "latencyMs": 850 }
{ "level": "error", "event": "consent.audit.chain_broken", "verifierRunId": "cvr_…",
"partition": "consent_audit_2026_04", "firstBadSeq": 4837521, "auditId": "cna_…",
"expectedPrevHash": "ab12…", "actualPrevHash": "cd34…" }
{ "level": "error", "event": "consent.dnd.sync.failed", "errorClass": "atra_unreachable",
"lastSuccessAt": "2026-04-20T03:00:00Z", "ageHours": 24.5 }

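The redaction requirement can be checked mechanically, for example in CI against captured log output. The sketch below parses one log line and reports any string field that still contains something shaped like a full MSISDN. The pattern (a plus sign followed by 8 to 15 digits) and the function name are assumptions for illustration; the authoritative rules are in SECURITY_MODEL §3.3.

```python
import json
import re
from typing import Any, List

# Assumed shape of an unredacted number; a masked value such as "+93701***" does not match.
RAW_MSISDN = re.compile(r"\+\d{8,15}")


def find_unredacted(line: str) -> List[str]:
    """Return the JSON paths of string values that look like a full MSISDN.

    Raises json.JSONDecodeError if the line is not valid JSON.
    """
    leaks: List[str] = []

    def walk(path: str, value: Any) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{path}.{key}" if path else key, child)
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(f"{path}[{index}]", child)
        elif isinstance(value, str) and RAW_MSISDN.search(value):
            leaks.append(path)

    walk("", json.loads(line))
    return leaks


masked_line = json.dumps(
    {"level": "info", "event": "consent.recorded", "msisdnMasked": "+93701***", "scope": "MARKETING"}
)
# "msisdn" is a hypothetical field name used only to show a leak being caught.
leaky_line = json.dumps({"level": "info", "event": "consent.recorded", "msisdn": "+93701234567"})

print(find_unredacted(masked_line))
print(find_unredacted(leaky_line))
```

The masked sample passes with no findings, while the second line is reported at path msisdn.
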
4. OpenTelemetry tracing

Parent span: consent-ledger-service.<rpc-or-handler>.

| Span | Op | Attributes |
|---|---|---|
| consent.check | gRPC CheckConsent | tenantId, scope, path, verdict, reason, latencyMs |
| consent.cache.lookup | Redis MGET | cache.hit, keyCount |
| consent.db.lookup | PG SELECT | db.system=postgresql, db.statement (parameterised) |
| consent.record.write | UC-RecordConsent | recordId, verificationMethod, replacedPrevious |
| consent.audit.write | UC-WriteAudit | auditId, partition, seq |
| consent.outbox.publish | NATS publish | subject, eventId |
| consent.stop_mo.handle | NATS consumer | moId, matchedKeyword, tenantsRevoked |
| consent.dnd.sync.run | Worker | runId, addedCount, removedCount |
| consent.audit.verifier.run | Worker | partition, outcome |

Trace context propagates as a W3C Trace Context traceparent value: in gRPC metadata for RPCs and in a traceparent header on NATS messages.
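
For reference, a version-00 traceparent value has four lowercase-hex fields: version, 32-character trace ID, 16-character parent span ID and 2-character flags. The sketch below parses such a value and derives the header a child span would forward. It handles version 00 only, and the helper names are illustrative rather than taken from the service or from any OpenTelemetry SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # All-zero IDs are defined as invalid by the W3C specification.
    if parsed.trace_id == "0" * 32 or parsed.parent_id == "0" * 16:
        return None
    return parsed


def child_traceparent(parent: TraceParent) -> str:
    """Keep the trace ID and flags, mint a fresh 8-byte span ID for the outgoing hop."""
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{parent.flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
if incoming is not None:
    outgoing = child_traceparent(incoming)
    print(outgoing)
```

Note that the traceId shown in the sample log event ("00-abc-def-01") is abbreviated for readability; a value of that length would be rejected by a parser like this one.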

5. Alert rules

groups:
  - name: consent-ledger-service
    rules:

      # ---- CRITICAL ----

      - alert: ConsentAuditChainBroken
        expr: increase(consent_audit_chain_breaks_detected_total[10m]) > 0
        labels: { severity: critical, page: trust_and_safety_oncall }
        annotations:
          summary: "Consent audit hash-chain break detected"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/audit-chain-broken

      - alert: ConsentCheckFailclosedSurge
        expr: rate(consent_check_failclosed_total[5m]) > 1
        labels: { severity: critical, page: trust_and_safety_oncall }
        annotations:
          summary: "CheckConsent fail-closed >1/s; DB+Redis dual outage suspected"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/checkconsent-failclosed

      - alert: ConsentServiceDown
        expr: up{job="consent-ledger-service"} == 0
        for: 2m
        labels: { severity: critical, page: platform_sre_oncall }

      # ---- HIGH ----

      - alert: ConsentDndStale
        expr: (time() - consent_dnd_sync_last_success_timestamp_seconds) > 86400
        labels: { severity: high }
        annotations:
          summary: "ATRA DND sync stale > 24 h; running on last-known data"
          runbook: https://runbooks.ghasi.gov.af/consent-ledger/dnd-stale

      - alert: ConsentCheckP95High
        expr: histogram_quantile(0.95, rate(consent_check_duration_seconds_bucket{path="cache_hit"}[5m])) > 0.005
        for: 10m
        labels: { severity: high }

      - alert: ConsentCheckCacheHitLow
        expr: |
          (rate(consent_check_cache_hits_total[10m]) /
            (rate(consent_check_cache_hits_total[10m]) + rate(consent_check_cache_misses_total[10m]))) < 0.7
        for: 30m
        labels: { severity: high }

      - alert: ConsentStopConsumerLag
        expr: consent_stop_mo_consumer_lag_seconds > 60
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "STOP MO consumer >60s lag; subscriber STOPs not honoured promptly (regulator risk)"

      - alert: ConsentOutboxBacklog
        expr: consent_outbox_unpublished_count > 1000 or consent_outbox_oldest_unpublished_age_seconds > 60
        for: 5m
        labels: { severity: high }

      - alert: ConsentVaultUnwrapErrors
        expr: rate(consent_vault_unwrap_errors_total[5m]) > 0.1
        for: 5m
        labels: { severity: high }

      # ---- MEDIUM ----

      - alert: ConsentRecordWriteSlow
        expr: histogram_quantile(0.95, rate(consent_record_write_duration_seconds_bucket[5m])) > 0.080
        for: 15m
        labels: { severity: medium }

      - alert: ConsentErasureSLABreach
        expr: increase(consent_erasure_sla_breach_total[1h]) > 0
        labels: { severity: medium }

      - alert: ConsentAuditPartitionMissing
        expr: consent_audit_oldest_unarchived_partition_age_days > 14
        labels: { severity: medium }

      - alert: ConsentCitizenOtpAbuse
        expr: rate(consent_citizen_otp_requested_total[1h]) > 100
        for: 30m
        labels: { severity: medium }

Each alert links to a runbook under https://runbooks.ghasi.gov.af/consent-ledger/<alert-slug>.
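
As a worked example of one rule, the ConsentCheckCacheHitLow expression divides the hit rate by the combined hit and miss rate and compares the result with 0.7. The sketch below reproduces that arithmetic from two counter readings taken 10 minutes apart. The helper names and sample values are illustrative, and counter resets are deliberately ignored to keep the sketch short.

```python
import math


def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase between two counter readings (no reset handling)."""
    return (last - first) / window_seconds


def cache_hit_ratio(hits_rate: float, misses_rate: float) -> float:
    total = hits_rate + misses_rate
    # With no traffic the ratio is undefined; PromQL yields NaN for 0/0 as well.
    return hits_rate / total if total > 0 else math.nan


def hit_ratio_low(ratio: float, threshold: float = 0.7) -> bool:
    # Any comparison with NaN is False, so an idle tenant never matches the condition.
    return ratio < threshold


window = 600.0  # the 10m range used in the rule
hits = per_second_rate(10_000, 13_900, window)    # 6.5 hits/s
misses = per_second_rate(4_000, 6_100, window)    # 3.5 misses/s
ratio = cache_hit_ratio(hits, misses)

print(f"hit ratio {ratio:.2f}, condition true: {hit_ratio_low(ratio)}")
print(f"idle tenant, condition true: {hit_ratio_low(cache_hit_ratio(0.0, 0.0))}")
```

A ratio of 0.65 satisfies the condition, but the alert only fires once the condition has held for the rule's full 30m `for:` duration.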

6. Grafana dashboard panels

Dashboard: dashboards/consent-ledger-service.json.

| Panel | Query | Visualization |
|---|---|---|
| CheckConsent rate | sum by(verdict)(rate(consent_check_total[5m])) | Stacked area |
| CheckConsent P50/P95/P99 | histogram_quantile(.50/.95/.99, rate(consent_check_duration_seconds_bucket[5m])) | Time series |
| CheckConsent cache hit ratio | computed | Gauge + time series |
| Reasons distribution (24 h) | sum by(reason)(increase(consent_check_total[24h])) | Pie |
| RecordConsent rate | sum by(verification_method, outcome)(rate(consent_records_written_total[5m])) | Stacked area |
| Top tenants by writes (24 h) | topk(10, sum by(tenant_id)(increase(consent_records_written_total[24h]))) | Bar |
| STOP MO end-to-end P95 | histogram_quantile(.95, rate(consent_stop_mo_duration_seconds_bucket[5m])) | Time series |
| STOP MO consumer lag | consent_stop_mo_consumer_lag_seconds | Time series |
| STOP keyword hit rate by language | sum by(language)(rate(consent_stop_mo_received_total{match_outcome="matched"}[5m])) | Stacked area |
| ATRA DND last sync | time() - consent_dnd_sync_last_success_timestamp_seconds | Stat (red if > 24 h) |
| DND total entries | consent_dnd_total_entries | Stat |
| Audit chain status | consent_audit_chain_verifier_runs_total (and breaks counter) | Stat (CRITICAL if breaks > 0) |
| Outbox backlog | consent_outbox_unpublished_count and consent_outbox_oldest_unpublished_age_seconds | Time series |
| PG pool usage | consent_pg_pool_in_use | Time series |
| Erasure pipeline | requested vs completed counters; SLA breaches | Mixed |
| Citizen flows | OTP, verify, revoke, inspection counters | Stacked area |

Dashboard variables: tenant_id, scope, language, region (Kabul/Herat/Mazar), time_range.

7. Runbooks (linked from alerts)

Each runbook contains:

  1. What this alert means in plain words
  2. Likely causes
  3. Diagnostic queries (Prom + log search + SQL)
  4. Mitigation steps (with safety notes)
  5. Escalation path
  6. Post-incident review template

Runbook inventory:

  • audit-chain-broken.md — CRITICAL response: freeze writes, snapshot affected partition, investigate, NEVER auto-resolve
  • checkconsent-failclosed.md — investigate Postgres + Redis health concurrently
  • dnd-stale.md — manual ATRA fetch, escalate to ATRA NOC if remote
  • stop-consumer-lag.md — scale consumer; investigate downstream NATS health
  • outbox-backlog.md — investigate NATS health; check for poison messages
  • vault-unwrap-errors.md — fall back to cached DEKs; investigate Vault Transit
  • erasure-sla-breach.md — manual processor run; explain to legal
  • partition-missing.md — manual partition creation; investigate cron worker