regulator-portal-service — Observability
Version: 1.0 | Status: Draft | Owner: Regulator-facing + Legal + SRE | Last Updated: 2026-04-21 | References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, services/compliance-engine/OBSERVABILITY.md
This document covers observability for regulator-portal-service. The service sits at the regulator-platform interface, and its observability serves three audiences: (1) SRE, for availability; (2) Legal / Regulator Liaison, for LI-SLA tracking; (3) Security, for mTLS / certificate behaviour.
1. Prometheus Metrics
Standard labels on every metric: service="regulator-portal-service", region, namespace, pod.
1.1 Login + cert metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_login_total | Counter | role (regulator/auditor/admin), result | Logins |
| regulator_login_seconds | Histogram | role | Login duration (including CRL/OCSP check); buckets [0.05, 0.1, 0.25, 0.5, 1, 2, 5] |
| regulator_crl_check_total | Counter | result (HIT/MISS/EXPIRED/REVOKED) | CRL/OCSP outcomes |
| regulator_cert_revocation_check_seconds | Histogram | (none) | CRL+OCSP latency |
| regulator_cert_expiring_days | Gauge | role, cert_subject | Days until user cert expires |
| regulator_session_active_count | Gauge | role | Active sessions |
| regulator_auditor_access_grants_total | Counter | status (GRANTED/EXPIRED/REVOKED) | Time-boxed auditor grants |
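The bucket boundaries listed for regulator_login_seconds limit how precisely a latency percentile can be resolved. The sketch below shows the linear interpolation that Prometheus-style quantile estimation applies to cumulative bucket counts. The function name `estimateQuantile` and the sample counts are made up for illustration; only the bucket boundaries come from the table above.

```typescript
// Upper bounds of the finite buckets, taken from the regulator_login_seconds row.
const LOGIN_BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2, 5];

// Estimates a quantile from cumulative bucket counts (hypothetical helper).
// cumulativeCounts[i] is the number of observations <= upperBounds[i];
// total is the count of the implicit +Inf bucket.
function estimateQuantile(
  q: number,
  upperBounds: number[],
  cumulativeCounts: number[],
  total: number,
): number {
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < upperBounds.length; i++) {
    if (cumulativeCounts[i] >= rank) {
      const lower = i === 0 ? 0 : upperBounds[i - 1];
      const prev = i === 0 ? 0 : cumulativeCounts[i - 1];
      const inBucket = cumulativeCounts[i] - prev;
      if (inBucket === 0) return lower;
      // Assume observations are spread evenly inside the bucket.
      return lower + ((upperBounds[i] - lower) * (rank - prev)) / inBucket;
    }
  }
  // The rank falls in the +Inf bucket: report the highest finite bound.
  return upperBounds[upperBounds.length - 1];
}

// Made-up sample: 100 logins, one of them slower than 5 s.
const sampleCounts = [10, 30, 60, 80, 90, 98, 99];
const p95 = estimateQuantile(0.95, LOGIN_BUCKETS, sampleCounts, 100);
console.log(p95);
```

With this sample the P95 lands inside the 1 s to 2 s bucket, at about 1.625 s. Because the top finite bucket is 5 s, no estimate can exceed 5 s, however slow the slowest logins are.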
1.2 LI workflow
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_li_requests_total | Counter | scope (IRI/CC/FULL), state | LI workflow states |
| regulator_li_state_duration_seconds | Histogram | from_state, to_state | Time spent in each state |
| regulator_li_sla_breach_total | Counter | state | SLA-breach events |
| regulator_li_pending_oldest_hours | Gauge | state | Oldest pending request age |
| regulator_li_dual_control_approvals_total | Counter | result (GRANTED/EXPIRED_WINDOW/REJECTED) | Dual-control outcomes |
| regulator_li_delivery_total | Counter | destination, result (DELIVERED/RETRY/FAILED) | Package delivery |
1.3 Complaint workflow
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_complaints_ingested_total | Counter | source, category | Complaints received |
| regulator_complaint_triage_duration_seconds | Histogram | (none) | Time from ingest to first triage |
| regulator_complaint_resolution_duration_hours | Histogram | outcome | Time from ingest to resolution |
| regulator_complaint_sla_breach_total | Counter | (none) | Breach count |
1.4 Reports
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_report_jobs_total | Counter | type, status | Report-generation jobs |
| regulator_report_duration_seconds | Histogram | type, window (hot/cold) | Report generation time; buckets [1, 10, 30, 60, 300, 600, 1800] |
| regulator_report_size_bytes | Histogram | type | Generated report size |
| regulator_report_hsm_sign_seconds | Histogram | (none) | HSM signing time for PDF |
| regulator_report_download_total | Counter | role, report_type | Downloads (audit) |
1.5 SIEM forwarding
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_siem_events_total | Counter | destination, format (CEF/LEEF/JSON), status (SENT/RETRIED/DISKED/FAILED) | Events processed |
| regulator_siem_lag_seconds | Histogram | destination | Consumer-to-destination lag; buckets [1, 5, 10, 30, 60, 300] |
| regulator_siem_wal_depth_bytes | Gauge | destination | Disk-WAL size per destination |
| regulator_siem_destination_available | Gauge | destination | 0 or 1 |
| regulator_siem_last_ack_seconds | Gauge | destination | Age since last ACK from destination |
1.6 Attestations
| Metric | Type | Labels | Description |
|---|---|---|---|
| regulator_attestation_evidence_total | Counter | framework, status (CURRENT/STALE/MISSING) | Evidence catalog status |
| regulator_attestation_bundle_generations_total | Counter | framework, year, status | Bundle generation runs |
2. Structured Log Events
Pino JSON; standard fields + service-specific:
- regulator.login.success / regulator.login.failure: { role, certSubject, certFingerprint, sourceIp, reason? }
- regulator.li.state_change: { liId, fromState, toState, initiator, approver?, reason? }
- regulator.li.delivery: { liId, destination, latencyMs, result }
- regulator.complaint.ingested: { complaintId, source, category, regulatorRef }
- regulator.complaint.resolved: { complaintId, outcome, resolutionHours }
- regulator.report.generated: { reportId, type, rows, durationSec, signed: bool }
- regulator.siem.delivered: { destination, eventCount, latencyMs }
- regulator.siem.fallback_to_disk: { destination, walBytes }
- regulator.auditor.access_granted: { auditorId, scope, expiresAt }
- regulator.auditor.access_expired: { auditorId, lastActiveAt }
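As an illustration of how the event shapes above could be typed on the producer side, here is a minimal sketch covering two of the events. The `event` key, the interface names and `toLogLine` are hypothetical; the real service logs through Pino, and its standard fields are not reproduced here.

```typescript
// Payload fields mirror the event list; type and function names are assumptions.
interface LiStateChange {
  event: "regulator.li.state_change";
  liId: string;
  fromState: string;
  toState: string;
  initiator: string;
  approver?: string; // optional, marked "?" in the event list
  reason?: string; // optional, marked "?" in the event list
}

interface SiemFallbackToDisk {
  event: "regulator.siem.fallback_to_disk";
  destination: string;
  walBytes: number;
}

type RegulatorLogEvent = LiStateChange | SiemFallbackToDisk;

// Serialises one event as a single JSON line (a stand-in for a Pino call).
function toLogLine(e: RegulatorLogEvent): string {
  return JSON.stringify(e);
}

const exampleLine = toLogLine({
  event: "regulator.li.state_change",
  liId: "LI-0001", // made-up identifier
  fromState: "RECEIVED",
  toState: "ACK",
  initiator: "user-a", // made-up initiator
});
console.log(exampleLine);
```

Modelling the events as a discriminated union lets the compiler reject a payload that omits a required field or misspells one, while the optional fields stay absent from the output when they are not set.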
PII policy
Citizen complaint contents are logged with the MSISDN hashed. Regulator-user identity is logged (it serves regulator accountability and is not treated as PII). Auditor identity is logged with audit privilege.
3. OpenTelemetry Tracing
Manual spans:
- regulator.login.mtls: cert chain + CRL/OCSP check
- regulator.li.submit: LI request intake
- regulator.li.deliver: SFTP / HTTPS delivery
- regulator.report.build: upstream read-through + assembly
- regulator.report.sign: HSM
- regulator.siem.dispatch: per-destination publish
- regulator.siem.wal_drain: WAL recovery after outage
W3C TraceContext is propagated on all upstream reads; the ATRA endpoint does not participate in tracing.
Sampling: 100% for all errors + all LI flows + all attestation-bundle generations; 10% for routine reads; 1% for admin background crons.
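The sampling policy above can be sketched as a small decision function. The category names, `SpanSummary` and `shouldSample` are hypothetical; only the percentages come from the policy. Keeping 100% of errors generally implies a tail-based decision, since an error is not known when the trace starts.

```typescript
// Hypothetical trace categories; only the rates are taken from the policy.
type TraceKind = "li" | "attestation_bundle" | "routine_read" | "admin_cron";

interface SpanSummary {
  kind: TraceKind;
  isError: boolean;
}

const SAMPLE_RATES: Record<TraceKind, number> = {
  li: 1.0, // all LI flows
  attestation_bundle: 1.0, // all attestation-bundle generations
  routine_read: 0.1, // 10% of routine reads
  admin_cron: 0.01, // 1% of admin background crons
};

// Errors are always kept, whatever the category.
function sampleRate(span: SpanSummary): number {
  return span.isError ? 1.0 : SAMPLE_RATES[span.kind];
}

// The random source is injectable so the decision can be tested deterministically.
function shouldSample(
  span: SpanSummary,
  random: () => number = Math.random,
): boolean {
  return random() < sampleRate(span);
}

console.log(shouldSample({ kind: "li", isError: false }));
```

Because Math.random returns a value strictly below 1, a rate of 1.0 always samples, so LI flows, attestation bundles and errors are never dropped by this function.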
4. Alerting Rules
groups:
  - name: regulator-portal.rules
    rules:
      - alert: RegulatorLoginLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(regulator_login_seconds_bucket[5m])) by (le, role)) > 2
        for: 10m
        labels: { severity: medium, team: sre }
      - alert: RegulatorCertRevocationCheckFailing
        expr: rate(regulator_crl_check_total{result=~"EXPIRED|ERROR"}[5m]) > 0.1
        for: 5m
        labels: { severity: high, team: security }
      - alert: RegulatorLiSlaBreach
        expr: regulator_li_pending_oldest_hours > 18
        for: 0m
        labels: { severity: critical, team: legal, page: ciso }
      - alert: RegulatorComplaintSlaBreach
        expr: increase(regulator_complaint_sla_breach_total[1h]) > 0
        for: 0m
        labels: { severity: high, team: legal }
      - alert: RegulatorSiemLagHigh
        expr: histogram_quantile(0.95, sum(rate(regulator_siem_lag_seconds_bucket[5m])) by (le, destination)) > 60
        for: 10m
        labels: { severity: high, team: security }
      - alert: RegulatorSiemDestinationUnreachable
        expr: regulator_siem_destination_available == 0
        for: 5m
        labels: { severity: high, team: sre }
      - alert: RegulatorSiemWalDiskGrowing
        expr: regulator_siem_wal_depth_bytes > 5e9 # > 5 GB
        for: 30m
        labels: { severity: high, team: sre }
      - alert: RegulatorEvidenceStale
        expr: sum(regulator_attestation_evidence_total{status="STALE"}) > 10
        for: 1h
        labels: { severity: medium, team: compliance }
      - alert: RegulatorAttestationBundleFail
        expr: increase(regulator_attestation_bundle_generations_total{status="FAILED"}[24h]) > 0
        for: 0m
        labels: { severity: high, team: compliance }
      - alert: RegulatorAuditorAccessExpiredButActive
        expr: sum(regulator_auditor_access_grants_total{status="EXPIRED"}) - sum(regulator_auditor_access_grants_total{status="REVOKED"}) > 0
        for: 10m
        labels: { severity: high, team: security }
      - alert: RegulatorHsmUnavailable
        expr: sum(rate(regulator_report_hsm_sign_seconds_count[5m])) == 0 and sum(increase(regulator_report_jobs_total{status="AWAITING_SIGN"}[10m])) > 0
        for: 5m
        labels: { severity: critical, team: sre }
5. Grafana Dashboard Panels
regulator-portal-service.json:
5.1 Regulator / Legal row
- LI workflow: state breakdown + SLA countdown per active request
- Complaint queue depth + SLA status
- Report generation: today's jobs + ACK status
- Evidence freshness heatmap per framework × control
- Auditor active sessions
5.2 Security row
- Login success/fail rate per role
- Cert revocation-check latency
- Cert-expiry heatmap (upcoming renewals)
- SIEM events per destination
- SIEM lag per destination + WAL depth
5.3 SRE row
- Login latency P50/P95/P99
- Postgres/Redis conn pool utilisation
- HSM sign latency + availability
- S3 report-bucket operations
6. Runbook Index
| Alert | Runbook |
|---|---|
| RegulatorLoginLatencyHigh | runbooks/regulator/login-latency.md |
| RegulatorCertRevocationCheckFailing | runbooks/regulator/crl-ocsp-issue.md |
| RegulatorLiSlaBreach | runbooks/regulator/li-sla.md |
| RegulatorComplaintSlaBreach | runbooks/regulator/complaint-sla.md |
| RegulatorSiemLagHigh | runbooks/regulator/siem-lag.md |
| RegulatorSiemDestinationUnreachable | runbooks/regulator/siem-dest-out.md |
| RegulatorSiemWalDiskGrowing | runbooks/regulator/siem-wal-fill.md |
| RegulatorEvidenceStale | runbooks/regulator/evidence-stale.md |
| RegulatorAttestationBundleFail | runbooks/regulator/attestation-bundle-fail.md |
| RegulatorAuditorAccessExpiredButActive | runbooks/regulator/auditor-access-anomaly.md |
| RegulatorHsmUnavailable | runbooks/regulator/hsm-out.md |
7. SLIs / SLOs
| SLI | SLO target | Window |
|---|---|---|
| Regulator login success rate | ≥ 99.9% for valid certs | 30 d |
| Login latency (P99) | ≤ 3 s (includes CRL/OCSP) | 30 d |
| LI workflow SLA (RECEIVED → ACK) | 100% within 1 h | 30 d |
| LI workflow SLA (RECEIVED → DELIVERED) | 100% within 24 h | 30 d |
| Complaint response SLA | 95% within 5 business days | 30 d |
| Scheduled report delivery | 100% within SLA | 30 d |
| SIEM forwarding lag (P95) | ≤ 60 s | 30 d |
| SIEM delivery success (any destination up) | ≥ 99.95% | 30 d |
| Attestation evidence freshness | ≥ 95% CURRENT | Continuous |
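To make the SLO targets above concrete, the following sketch turns a target into an error budget and reports how much of it has been consumed. The names `BudgetStatus` and `errorBudget` and the request counts are made up for illustration; only the targets come from the table.

```typescript
interface BudgetStatus {
  allowedFailureRatio: number; // 1 - target
  observedFailureRatio: number; // failed / total over the window
  budgetConsumed: number; // 1.0 means the budget is fully spent
}

// Hypothetical helper: target is a ratio such as 0.999 for 99.9%.
function errorBudget(target: number, total: number, failed: number): BudgetStatus {
  const allowedFailureRatio = 1 - target;
  const observedFailureRatio = total === 0 ? 0 : failed / total;
  let budgetConsumed: number;
  if (allowedFailureRatio > 0) {
    budgetConsumed = observedFailureRatio / allowedFailureRatio;
  } else {
    // A 100% target leaves no budget: any failure is an immediate breach.
    budgetConsumed = failed > 0 ? Infinity : 0;
  }
  return { allowedFailureRatio, observedFailureRatio, budgetConsumed };
}

// Made-up window: 100,000 valid-cert logins with 50 failures against 99.9%.
console.log(errorBudget(0.999, 100_000, 50).budgetConsumed);

// Made-up window: 40 LI requests with one late delivery against 100%.
console.log(errorBudget(1, 40, 1).budgetConsumed);
```

In the first case roughly half the 30-day budget is spent. In the second the result is Infinity, which reflects that the 100% LI and report-delivery targets carry no error budget at all: a single miss breaches the SLO.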
8. Log Retention
| Stream | Hot (Loki) | Cold (S3) |
|---|---|---|
| Regulator activity + LI + complaint | 14 d | 7 y |
| Report download audit | 14 d | 7 y |
| SIEM forwarding events | 7 d | 90 d |
| Auditor session audit | 14 d | 3 y |