
regulator-portal-service — Observability

Version: 1.0
Status: Draft
Owner: Regulator-facing + Legal + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, FAILURE_MODES.md, services/compliance-engine/OBSERVABILITY.md

Observability for regulator-portal-service. The service sits at the regulator-platform interface; its observability serves three audiences: (1) SRE for availability, (2) Legal / Regulator Liaison for LI-SLA tracking, (3) Security for mTLS / cert behaviour.


1. Prometheus Metrics

Standard labels on every metric: service="regulator-portal-service", region, namespace, pod.

1.1 Login + cert metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_login_total | Counter | role (regulator/auditor/admin), result | Logins |
| regulator_login_seconds | Histogram | role | Login duration (including CRL/OCSP check); buckets [0.05, 0.1, 0.25, 0.5, 1, 2, 5] |
| regulator_crl_check_total | Counter | result (HIT/MISS/EXPIRED/REVOKED) | CRL/OCSP outcomes |
| regulator_cert_revocation_check_seconds | Histogram | none | CRL+OCSP latency |
| regulator_cert_expiring_days | Gauge | role, cert_subject | Days until user cert expires |
| regulator_session_active_count | Gauge | role | Active sessions |
| regulator_auditor_access_grants_total | Counter | status (GRANTED/EXPIRED/REVOKED) | Time-boxed auditor grants |
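
Prometheus histograms are cumulative: an observation increments every bucket whose upper bound is at least the observed value, plus the implicit +Inf bucket. The sketch below illustrates that behaviour for the regulator_login_seconds buckets without using any metrics library. The names `newHistogram`, `observe` and `BucketCounts` are illustrative assumptions, not the service's code.

```typescript
// Upper bounds (seconds) for regulator_login_seconds, as listed in section 1.1.
const LOGIN_BUCKETS: readonly number[] = [0.05, 0.1, 0.25, 0.5, 1, 2, 5];

interface BucketCounts {
  bounds: readonly number[]; // finite upper bounds, ascending
  counts: number[];          // cumulative counts; the last slot is the +Inf bucket
  sum: number;               // sum of all observed values
}

function newHistogram(bounds: readonly number[]): BucketCounts {
  // One counter per finite bound, plus one for +Inf.
  return { bounds, counts: bounds.map(() => 0).concat([0]), sum: 0 };
}

// Record one observation: every bucket with bound >= value is incremented.
function observe(hist: BucketCounts, value: number): void {
  hist.bounds.forEach((bound, i) => {
    if (value <= bound) hist.counts[i] += 1;
  });
  hist.counts[hist.bounds.length] += 1; // +Inf always counts
  hist.sum += value;
}

const loginHist = newHistogram(LOGIN_BUCKETS);
[0.04, 0.3, 0.3, 1.5, 7].forEach((v) => observe(loginHist, v));
console.log(loginHist.counts); // [1, 1, 1, 3, 3, 4, 4, 5]
```

The 7 s login only lands in +Inf, which is why a slow CRL/OCSP responder shows up as a gap between the le="5" bucket and the total count.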

1.2 LI workflow

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_li_requests_total | Counter | scope (IRI/CC/FULL), state | LI workflow states |
| regulator_li_state_duration_seconds | Histogram | from_state, to_state | Time spent in each state |
| regulator_li_sla_breach_total | Counter | state | SLA-breach events |
| regulator_li_pending_oldest_hours | Gauge | state | Oldest pending request age |
| regulator_li_dual_control_approvals_total | Counter | result (GRANTED/EXPIRED_WINDOW/REJECTED) | Dual-control outcomes |
| regulator_li_delivery_total | Counter | destination, result (DELIVERED/RETRY/FAILED) | Package delivery |
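
A minimal sketch of how the regulator_li_pending_oldest_hours gauge could be derived from the set of pending requests. It assumes age is measured from the time the request was received (the source does not say whether it is measured from receipt or from entry into the current state). The `PendingLi` shape and the `oldestPendingHours` helper are hypothetical.

```typescript
const MS_PER_HOUR = 3600000;

interface PendingLi {
  liId: string;
  state: string;    // workflow state label, e.g. "RECEIVED" or "ACK"
  receivedAt: Date; // assumption: age is counted from receipt of the request
}

// Returns, per state, the age in hours of the oldest pending request.
function oldestPendingHours(pending: PendingLi[], now: Date): Record<string, number> {
  const oldest: Record<string, number> = {};
  for (const req of pending) {
    const ageHours = (now.getTime() - req.receivedAt.getTime()) / MS_PER_HOUR;
    const current = oldest[req.state];
    if (current === undefined || ageHours > current) oldest[req.state] = ageHours;
  }
  return oldest;
}

const sampleNow = new Date("2026-04-21T12:00:00Z");
const gaugeValues = oldestPendingHours(
  [
    { liId: "li-1", state: "RECEIVED", receivedAt: new Date("2026-04-20T16:00:00Z") },
    { liId: "li-2", state: "RECEIVED", receivedAt: new Date("2026-04-21T10:00:00Z") },
    { liId: "li-3", state: "ACK", receivedAt: new Date("2026-04-21T06:00:00Z") },
  ],
  sampleNow
);
console.log(gaugeValues); // { RECEIVED: 20, ACK: 6 }
```

Exporting the maximum rather than an average keeps a single stuck request visible even when the rest of the queue is healthy.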

1.3 Complaint workflow

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_complaints_ingested_total | Counter | source, category | Complaints received |
| regulator_complaint_triage_duration_seconds | Histogram | none | Time from ingest to first triage |
| regulator_complaint_resolution_duration_hours | Histogram | outcome | Time from ingest to resolution |
| regulator_complaint_sla_breach_total | Counter | none | Breach count |

1.4 Reports

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_report_jobs_total | Counter | type, status | Report-generation jobs |
| regulator_report_duration_seconds | Histogram | type, window (hot/cold) | Report generation time; buckets [1, 10, 30, 60, 300, 600, 1800] |
| regulator_report_size_bytes | Histogram | type | Generated report size |
| regulator_report_hsm_sign_seconds | Histogram | none | HSM signing time for PDF |
| regulator_report_download_total | Counter | role, report_type | Downloads (audit) |

1.5 SIEM forwarding

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_siem_events_total | Counter | destination, format (CEF/LEEF/JSON), status (SENT/RETRIED/DISKED/FAILED) | Events processed |
| regulator_siem_lag_seconds | Histogram | destination | Consumer-to-destination lag; buckets [1, 5, 10, 30, 60, 300] |
| regulator_siem_wal_depth_bytes | Gauge | destination | Disk-WAL size per destination |
| regulator_siem_destination_available | Gauge | destination | 0 or 1 |
| regulator_siem_last_ack_seconds | Gauge | destination | Age since last ACK from destination |
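
The status label on regulator_siem_events_total suggests a deliver, retry, then spill-to-disk flow. The sketch below shows one way such a flow could map onto the four status values. The meaning assigned to each status, the in-memory `MemoryWal` stand-in for the disk WAL, and the `dispatchSiemEvent` helper are all assumptions made for illustration; the real forwarder may behave differently.

```typescript
type SiemStatus = "SENT" | "RETRIED" | "DISKED" | "FAILED";

// In-memory stand-in for the per-destination disk WAL (hypothetical).
class MemoryWal {
  private entries: string[] = [];
  private bytes = 0;
  private limitBytes: number;

  constructor(limitBytes: number) {
    this.limitBytes = limitBytes;
  }

  // Returns false when the event would push the WAL past its size limit.
  append(payload: string): boolean {
    const size = payload.length; // assumes single-byte characters for simplicity
    if (this.bytes + size > this.limitBytes) return false;
    this.entries.push(payload);
    this.bytes += size;
    return true;
  }

  // Value that would feed regulator_siem_wal_depth_bytes.
  depthBytes(): number {
    return this.bytes;
  }
}

// Assumed semantics: SENT = first attempt succeeded, RETRIED = a later attempt
// succeeded, DISKED = all attempts failed and the event was written to the WAL,
// FAILED = all attempts failed and the WAL could not accept the event.
function dispatchSiemEvent(
  payload: string,
  send: (payload: string) => boolean,
  wal: MemoryWal,
  maxAttempts: number
): SiemStatus {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (send(payload)) return attempt === 1 ? "SENT" : "RETRIED";
  }
  return wal.append(payload) ? "DISKED" : "FAILED";
}
```

Under these assumptions a rising DISKED rate together with growing regulator_siem_wal_depth_bytes indicates a destination outage, while FAILED means events are at risk of being lost.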

1.6 Attestations

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| regulator_attestation_evidence_total | Counter | framework, status (CURRENT/STALE/MISSING) | Evidence catalog status |
| regulator_attestation_bundle_generations_total | Counter | framework, year, status | Bundle generation runs |

2. Structured Log Events

Logs are emitted as Pino JSON. Each event carries the standard fields plus the service-specific fields listed below:

  • regulator.login.success / regulator.login.failure: { role, certSubject, certFingerprint, sourceIp, reason? }
  • regulator.li.state_change: { liId, fromState, toState, initiator, approver?, reason? }
  • regulator.li.delivery: { liId, destination, latencyMs, result }
  • regulator.complaint.ingested: { complaintId, source, category, regulatorRef }
  • regulator.complaint.resolved: { complaintId, outcome, resolutionHours }
  • regulator.report.generated: { reportId, type, rows, durationSec, signed: bool }
  • regulator.siem.delivered: { destination, eventCount, latencyMs }
  • regulator.siem.fallback_to_disk: { destination, walBytes }
  • regulator.auditor.access_granted: { auditorId, scope, expiresAt }
  • regulator.auditor.access_expired: { auditorId, lastActiveAt }
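
The event shapes above translate directly into types. As an illustration, the sketch below models regulator.li.state_change and serialises it to one JSON line. Putting the event name under an `event` key is an assumption, the `liStateChange` helper is hypothetical, and the Pino standard fields (level, time and so on) are left out because the logger would add them.

```typescript
interface LiStateChangeEvent {
  event: "regulator.li.state_change"; // key name is an assumption
  liId: string;
  fromState: string;
  toState: string;
  initiator: string;
  approver?: string; // optional, as marked with "?" in the list above
  reason?: string;   // optional
}

// Builds the service-specific part of the log record as a JSON line.
function liStateChange(fields: Omit<LiStateChangeEvent, "event">): string {
  const record: LiStateChangeEvent = { event: "regulator.li.state_change", ...fields };
  return JSON.stringify(record);
}

const logLine = liStateChange({
  liId: "li-42",
  fromState: "RECEIVED",
  toState: "ACK",
  initiator: "regulator-user-7",
});
console.log(logLine);
```

Typing the optional fields means a state change without an approver simply omits the key, instead of logging a null that downstream queries would have to filter out.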

PII policy

Citizen complaint contents are logged with the MSISDN hashed. Regulator-user identity is logged in clear (it is needed for regulator accountability and is not treated as PII). Auditor identity is logged with audit privilege.
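
A minimal sketch of hashing MSISDNs before complaint text reaches the logger. The source does not name the hash algorithm, so the hash function is passed in as a parameter. The `MSISDN_PATTERN` regular expression (an optional "+" followed by 8 to 15 digits), the `msisdn:` prefix and the `hashMsisdns` helper are assumptions made for illustration.

```typescript
// Caller-supplied hash; the algorithm is not specified in the source.
type Hasher = (value: string) => string;

// Assumption: an MSISDN appears as an optional "+" followed by 8 to 15 digits.
const MSISDN_PATTERN = /\+?\d{8,15}/g;

// Replaces every MSISDN-looking token in the text with its hash.
function hashMsisdns(text: string, hash: Hasher): string {
  return text.replace(MSISDN_PATTERN, (match) => "msisdn:" + hash(match));
}

// Stand-in hasher for demonstration only; it is NOT a real hash.
const demoHasher: Hasher = (value) => "h" + value.length;

console.log(hashMsisdns("Call me on +4915112345678 please", demoHasher));
// "Call me on msisdn:h14 please"
```

Because the space of valid MSISDNs is small enough to enumerate, a keyed hash (for example an HMAC with a secret held outside the log pipeline) is generally a safer choice than a plain digest.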


3. OpenTelemetry Tracing

Manual spans:

  • regulator.login.mtls — cert chain + CRL/OCSP check
  • regulator.li.submit — LI request intake
  • regulator.li.deliver — SFTP / HTTPS delivery
  • regulator.report.build — upstream read-through + assembly
  • regulator.report.sign — HSM
  • regulator.siem.dispatch — per-destination publish
  • regulator.siem.wal_drain — WAL recovery after outage

W3C TraceContext is propagated on all upstream reads; the ATRA endpoint does not participate in tracing.
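
W3C Trace Context carries the trace in a `traceparent` header of the form version, trace id, parent id, flags. The sketch below parses and validates that header by hand to show what is being propagated; in practice the OpenTelemetry SDK handles this. Only version 00 is handled, and the `parseTraceparent` helper and `TraceParent` type are illustrative.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const parentId = match[2];
  const flags = parseInt(match[3], 16);
  // All-zero ids are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (flags & 0x01) === 0x01 };
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(incoming);
```

A `null` result means the caller should start a new trace rather than continue a malformed one.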

Sampling: 100% for all errors + all LI flows + all attestation-bundle generations; 10% for routine reads; 1% for admin background crons.
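
The sampling policy can be expressed as a small rate table. The sketch below is one possible encoding, not the service's sampler: the `TraceKind` names and the `sampleRate` and `shouldSample` helpers are hypothetical, and the random source is injectable so the decision can be tested deterministically.

```typescript
// Hypothetical classification of a trace; the names are assumptions.
type TraceKind = "li" | "attestation_bundle" | "routine_read" | "admin_cron";

// Rates taken from the policy: errors, LI flows and attestation bundles are
// always kept; routine reads at 10%; admin background crons at 1%.
function sampleRate(kind: TraceKind, isError: boolean): number {
  if (isError || kind === "li" || kind === "attestation_bundle") return 1.0;
  if (kind === "routine_read") return 0.1;
  return 0.01;
}

function shouldSample(
  kind: TraceKind,
  isError: boolean,
  random: () => number = Math.random
): boolean {
  return random() < sampleRate(kind, isError);
}

console.log(sampleRate("routine_read", false)); // 0.1
console.log(shouldSample("li", false));         // true
```

Keeping every error trace generally requires deciding after the span has finished (tail-based sampling or an equivalent keep-on-error rule), because the error is not known when the trace starts.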


4. Alerting Rules

groups:
  - name: regulator-portal.rules
    rules:
      - alert: RegulatorLoginLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(regulator_login_seconds_bucket[5m])) by (le, role)) > 2
        for: 10m
        labels: { severity: medium, team: sre }

      - alert: RegulatorCertRevocationCheckFailing
        expr: rate(regulator_crl_check_total{result=~"EXPIRED|ERROR"}[5m]) > 0.1
        for: 5m
        labels: { severity: high, team: security }

      - alert: RegulatorLiSlaBreach
        expr: regulator_li_pending_oldest_hours > 18
        for: 0m
        labels: { severity: critical, team: legal, page: ciso }

      - alert: RegulatorComplaintSlaBreach
        expr: increase(regulator_complaint_sla_breach_total[1h]) > 0
        for: 0m
        labels: { severity: high, team: legal }

      - alert: RegulatorSiemLagHigh
        expr: histogram_quantile(0.95, sum(rate(regulator_siem_lag_seconds_bucket[5m])) by (le, destination)) > 60
        for: 10m
        labels: { severity: high, team: security }

      - alert: RegulatorSiemDestinationUnreachable
        expr: regulator_siem_destination_available == 0
        for: 5m
        labels: { severity: high, team: sre }

      - alert: RegulatorSiemWalDiskGrowing
        expr: regulator_siem_wal_depth_bytes > 5e9 # > 5 GB
        for: 30m
        labels: { severity: high, team: sre }

      - alert: RegulatorEvidenceStale
        expr: sum(regulator_attestation_evidence_total{status="STALE"}) > 10
        for: 1h
        labels: { severity: medium, team: compliance }

      - alert: RegulatorAttestationBundleFail
        expr: increase(regulator_attestation_bundle_generations_total{status="FAILED"}[24h]) > 0
        for: 0m
        labels: { severity: high, team: compliance }

      - alert: RegulatorAuditorAccessExpiredButActive
        expr: sum(regulator_auditor_access_grants_total{status="EXPIRED"}) - sum(regulator_auditor_access_grants_total{status="REVOKED"}) > 0
        for: 10m
        labels: { severity: high, team: security }

      - alert: RegulatorHsmUnavailable
        # Both sides are aggregated with sum() so that `and` can match: the two
        # metrics carry different label sets and would otherwise never pair up.
        expr: sum(rate(regulator_report_hsm_sign_seconds_count[5m])) == 0 and sum(increase(regulator_report_jobs_total{status="AWAITING_SIGN"}[10m])) > 0
        for: 5m
        labels: { severity: critical, team: sre }
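
The two latency alerts rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration only, with made-up bucket counts; it assumes 0 < q < 1, at least two buckets, and a +Inf bucket. `histogramQuantile` and `Bucket` are illustrative names.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // Find the first bucket whose cumulative count reaches the rank.
  let idx = 0;
  while (idx < sorted.length - 1 && sorted[idx].count < rank) idx++;
  // Rank falls in +Inf: report the highest finite bound.
  if (idx === sorted.length - 1) return sorted[idx - 1].le;
  // The lowest bucket is assumed to start at 0.
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - below;
  return lower + (sorted[idx].le - lower) * ((rank - below) / inBucket);
}

// Made-up 5-minute counts on the regulator_login_seconds buckets.
const sampleBuckets: Bucket[] = [
  { le: 0.05, count: 10 }, { le: 0.1, count: 30 }, { le: 0.25, count: 60 },
  { le: 0.5, count: 80 }, { le: 1, count: 90 }, { le: 2, count: 96 },
  { le: 5, count: 99 }, { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, sampleBuckets);
console.log(p95 > 2); // false: P95 is about 1.83 s, so RegulatorLoginLatencyHigh would not fire
```

Because the estimate is interpolated, its resolution is limited by the bucket layout: with bounds at 2 and 5 and a threshold of 2, a P95 that crosses the threshold is estimated somewhere between 2 s and 5 s rather than measured exactly.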

5. Grafana Dashboard Panels

regulator-portal-service.json:

5.1 Legal / Compliance row

  • LI workflow: state breakdown + SLA countdown per active request
  • Complaint queue depth + SLA status
  • Report generation: today's jobs + ACK status
  • Evidence freshness heatmap per framework × control
  • Auditor active sessions

5.2 Security row

  • Login success/fail rate per role
  • Cert revocation-check latency
  • Cert-expiry heatmap (upcoming renewals)
  • SIEM events per destination
  • SIEM lag per destination + WAL depth

5.3 SRE row

  • Login latency P50/P95/P99
  • Postgres/Redis conn pool utilisation
  • HSM sign latency + availability
  • S3 report-bucket operations

6. Runbook Index

| Alert | Runbook |
| --- | --- |
| RegulatorLoginLatencyHigh | runbooks/regulator/login-latency.md |
| RegulatorCertRevocationCheckFailing | runbooks/regulator/crl-ocsp-issue.md |
| RegulatorLiSlaBreach | runbooks/regulator/li-sla.md |
| RegulatorComplaintSlaBreach | runbooks/regulator/complaint-sla.md |
| RegulatorSiemLagHigh | runbooks/regulator/siem-lag.md |
| RegulatorSiemDestinationUnreachable | runbooks/regulator/siem-dest-out.md |
| RegulatorSiemWalDiskGrowing | runbooks/regulator/siem-wal-fill.md |
| RegulatorEvidenceStale | runbooks/regulator/evidence-stale.md |
| RegulatorAttestationBundleFail | runbooks/regulator/attestation-bundle-fail.md |
| RegulatorAuditorAccessExpiredButActive | runbooks/regulator/auditor-access-anomaly.md |
| RegulatorHsmUnavailable | runbooks/regulator/hsm-out.md |

7. SLIs / SLOs

| SLI | SLO target | Window |
| --- | --- | --- |
| Regulator login success rate | ≥ 99.9% for valid certs | 30 d |
| Login latency (P99) | ≤ 3 s (includes CRL/OCSP) | 30 d |
| LI workflow SLA (RECEIVED → ACK) | 100% within 1 h | 30 d |
| LI workflow SLA (RECEIVED → DELIVERED) | 100% within 24 h | 30 d |
| Complaint response SLA | 95% within 5 business days | 30 d |
| Scheduled report delivery | 100% within SLA | 30 d |
| SIEM forwarding lag (P95) | ≤ 60 s | 30 d |
| SIEM delivery success (any destination up) | ≥ 99.95% | 30 d |
| Attestation evidence freshness | ≥ 95% CURRENT | Continuous |
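
Each ratio-based SLO implies an error budget: the number of bad events the target tolerates over the window. The sketch below works that out for a request-based SLI such as login success rate. The `errorBudget` helper and the traffic figures are illustrative assumptions.

```typescript
interface BudgetStatus {
  allowed: number;   // bad events the target tolerates for this volume
  remaining: number; // allowed minus observed bad events (negative = overspent)
  met: boolean;      // true while the SLO still holds
}

// targetPercent: e.g. 99.9; total: events in the window; failed: bad events.
function errorBudget(targetPercent: number, total: number, failed: number): BudgetStatus {
  const allowed = (total * (100 - targetPercent)) / 100;
  return { allowed, remaining: allowed - failed, met: failed <= allowed };
}

// Made-up volume: 200,000 valid-cert logins in 30 d, 150 of them failed.
const loginBudget = errorBudget(99.9, 200000, 150);
console.log(loginBudget.met); // true: roughly 200 failures are allowed

// A 100% target (the LI SLAs) leaves no budget: one miss breaches the SLO.
console.log(errorBudget(100, 40, 1).met); // false
```

This is why the LI rows are alerted on directly with `for: 0m` instead of being burn-rate monitored: with a zero budget there is nothing to burn.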

8. Log Retention

| Stream | Hot (Loki) | Cold (S3) |
| --- | --- | --- |
| Regulator activity + LI + complaint | 14 d | 7 y |
| Report download audit | 14 d | 7 y |
| SIEM forwarding events | 7 d | 90 d |
| Auditor session audit | 14 d | 3 y |
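
The retention table can be read as a two-tier policy per stream. The sketch below encodes it and answers where a log entry of a given age should live. The stream keys, the 365-day year, and the choice to count cold retention from the event time (not from the moment the entry leaves Loki) are assumptions; `tierFor` is a hypothetical helper.

```typescript
type Tier = "hot" | "cold" | "expired";

interface Retention {
  hotDays: number;  // days kept in Loki
  coldDays: number; // days kept in S3, counted from the event time (assumption)
}

// Stream keys are hypothetical identifiers for the rows of the table above.
const RETENTION: Record<string, Retention> = {
  regulator_activity: { hotDays: 14, coldDays: 7 * 365 },
  report_download_audit: { hotDays: 14, coldDays: 7 * 365 },
  siem_forwarding: { hotDays: 7, coldDays: 90 },
  auditor_session_audit: { hotDays: 14, coldDays: 3 * 365 },
};

function tierFor(stream: string, ageDays: number): Tier {
  const policy = RETENTION[stream];
  if (policy === undefined) throw new Error("unknown stream: " + stream);
  if (ageDays <= policy.hotDays) return "hot";
  if (ageDays <= policy.coldDays) return "cold";
  return "expired";
}

console.log(tierFor("siem_forwarding", 5));  // "hot"
console.log(tierFor("siem_forwarding", 30)); // "cold"
console.log(tierFor("siem_forwarding", 91)); // "expired"
```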