Audit Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 observability-telemetry

1. SLIs and SLOs

| SLI | Measurement | SLO target | Window |
|---|---|---|---|
| Ingestion availability | 1 - (failed ingestion rate) | 99.9 % | 30-day rolling |
| Ingestion latency (P95) | audit_ingestion_duration_ms | < 200 ms | 5-min window |
| Query availability | 1 - (5xx rate on GET /api/v1/audit/*) | 99.5 % | 30-day rolling |
| Export completion (P95) | audit_export_duration_ms | < 10 min | Per-export |
| Chain integrity | Verification job pass rate | 100 % | Daily |
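To make the two availability targets concrete, the sketch below converts each SLO into an error budget over the 30-day window. It is only an illustration: it expresses the budget as an equivalent amount of full-outage time, whereas the SLIs above are rate-based, and the function name `error_budget` is hypothetical.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the share of the window that may be 'bad' without breaching the SLO."""
    return window * (1 - slo_percent / 100)


WINDOW = timedelta(days=30)

# 99.9 % over 30 days leaves about 43.2 minutes of budget.
ingestion_budget = error_budget(99.9, WINDOW)

# 99.5 % over 30 days leaves about 3.6 hours of budget.
query_budget = error_budget(99.5, WINDOW)

print("ingestion:", ingestion_budget)
print("query:", query_budget)
```

The difference between the two budgets (minutes versus hours) is why the ingestion path is usually the one that needs the faster alerting.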

2. Key metrics (Prometheus)

| Metric | Type | Labels | Description |
|---|---|---|---|
| audit_events_ingested_total | Counter | event_type, source_service, tenant_id | Events successfully ingested |
| audit_ingestion_duration_ms | Histogram | event_type | Time from NATS message receipt to Postgres INSERT |
| audit_events_deduplicated_total | Counter | source_service | Events skipped due to duplicate source_event_id |
| audit_dlq_pending_messages | Gauge | | Messages in dead-letter queue |
| audit_chain_integrity_failures_total | Counter | | Chain-hash mismatches detected |
| audit_export_duration_ms | Histogram | | Export job duration |
| audit_export_errors_total | Counter | error_type | Failed export jobs |
| audit_query_duration_ms | Histogram | operation | Compliance query latency |
| audit_db_connection_errors_total | Counter | | DB connection failures |
| audit_entries_total | Gauge | tenant_id | Total rows per tenant (sampled, not live count) |
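The P95 figures used elsewhere on this page are estimated from histogram buckets rather than measured directly. The sketch below shows the usual estimation approach (linear interpolation inside the bucket that contains the target rank, which is how Prometheus's `histogram_quantile` treats classic histograms). The bucket boundaries and counts are made-up values for illustration; the real bucket layout of `audit_ingestion_duration_ms` is not stated in this document.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative buckets.

    buckets: (upper_bound_ms, cumulative_count) pairs sorted by upper bound,
    with the last upper bound equal to +inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for audit_ingestion_duration_ms.
example_buckets = [(50.0, 600), (100.0, 850), (200.0, 960), (500.0, 995), (math.inf, 1000)]
p95 = histogram_quantile(0.95, example_buckets)
print(f"estimated P95: {p95:.1f} ms")
```

Because the estimate can never be more precise than the bucket width, it helps to place a bucket boundary at or near each latency target (for example 200 ms for ingestion).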

3. Distributed tracing (OpenTelemetry)

Key spans per ingestion:

NATS message received
audit.dedup_check (SELECT source_event_id)
audit.normalise_event
audit.compute_chain_hash
audit.insert_entry (INSERT audit_entries)
NATS ACK

Span attributes: tenant_id, event_type, source_service, source_event_id. No PHI is recorded in span attributes.
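One way to keep PHI out of spans is to build the attribute set from an explicit allow-list instead of copying fields from the event. The sketch below assumes the event envelope is a plain dictionary keyed by the same names as the span attributes; the function name and the example payload are hypothetical, and wiring the result into the OpenTelemetry SDK is not shown.

```python
from typing import Any, Dict

# The only keys that may be attached to a span, per the list above.
ALLOWED_SPAN_ATTRIBUTES = ("tenant_id", "event_type", "source_service", "source_event_id")


def span_attributes(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the allow-listed keys that are present on the event."""
    return {key: event[key] for key in ALLOWED_SPAN_ATTRIBUTES if key in event}


# Hypothetical event: the 'metadata' field could carry PHI and must not reach the span.
example_event = {
    "tenant_id": "tenant-1",
    "event_type": "record.viewed",
    "source_service": "records-api",
    "source_event_id": "evt-42",
    "metadata": {"patient_name": "redacted in this example"},
}
attrs = span_attributes(example_event)
print(attrs)
```

An allow-list fails safe: a new field added to the event payload stays out of traces until someone deliberately adds it.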

OTEL exporter: OTLP → Grafana Tempo.

4. Structured logging

| Field | Description |
|---|---|
| level | info / warn / error |
| service | audit-service |
| traceId | OTEL trace id |
| tenantId | Tenant from event envelope |
| sourceEventId | Source event ID |
| operation | Handler name |
| msg | Human-readable message |

PHI-safe: raw metadata fields from the event payload are never logged; only IDs and the event type are.
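A minimal sketch of a formatter that emits exactly the fields in the table and silently drops anything else, using only the Python standard library. The class name, the `build_logger` helper and the example values are hypothetical; the service's actual logging library is not named in this document.

```python
import io
import json
import logging

# Context fields copied from the log record when present.
CONTEXT_FIELDS = ("traceId", "tenantId", "sourceEventId", "operation")
LEVEL_NAMES = {logging.INFO: "info", logging.WARNING: "warn", logging.ERROR: "error"}


class AuditJsonFormatter(logging.Formatter):
    """Render one JSON object per line containing only allow-listed fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "service": "audit-service",
            "msg": record.getMessage(),
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("audit-service")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(AuditJsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
# 'metadata' is passed on purpose to show that it never reaches the output.
log.info(
    "event ingested",
    extra={
        "traceId": "abc123",
        "tenantId": "tenant-1",
        "sourceEventId": "evt-42",
        "operation": "ingest",
        "metadata": {"note": "free text"},
    },
)
line = json.loads(buffer.getvalue())
print(line)
```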

5. Dashboards

| Dashboard | Key panels |
|---|---|
| Audit Service — Ingestion | Events/s by type, ingestion P95 latency, DLQ depth, dedup rate |
| Audit Service — Chain Integrity | Daily job status, chain-hash failure count, last verified timestamp |
| Audit Service — Query & Export | Query latency, export job queue depth, export errors |
| Audit Service — SLO Burn | Multi-window burn rate for ingestion and query SLOs |
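The burn-rate panels can be read with the following arithmetic in mind: burn rate is the observed error ratio divided by the error ratio the SLO allows, and a multi-window check requires both a long and a short window to exceed the threshold. The sketch below is an illustration only. The threshold of 14.4 (2 % of a 30-day budget consumed in one hour) and the sample error ratios are assumptions, not values taken from this service's configuration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def is_burning(long_ratio: float, short_ratio: float, slo: float, threshold: float) -> bool:
    """True only when both windows exceed the threshold, which filters out short blips."""
    return burn_rate(long_ratio, slo) >= threshold and burn_rate(short_ratio, slo) >= threshold


INGESTION_SLO = 0.999
QUERY_SLO = 0.995
FAST_BURN_THRESHOLD = 14.4  # assumed value for a 1 h / 5 min window pair

# Hypothetical error ratios: 2 % over the last hour, 3 % over the last 5 minutes.
ingestion_alert = is_burning(0.02, 0.03, INGESTION_SLO, FAST_BURN_THRESHOLD)
query_alert = is_burning(0.02, 0.03, QUERY_SLO, FAST_BURN_THRESHOLD)
print(ingestion_alert, query_alert)
```

The same 2 % error ratio burns the 99.9 % ingestion budget roughly 20 times too fast but the 99.5 % query budget only about 4 times too fast, so the two SLOs need separate panels.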

6. Alerts

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| AuditIngestionStopped | No events ingested for > 5 min | Critical | /runbooks/audit/ingestion-stopped |
| AuditDLQGrowing | audit_dlq_pending_messages > 0 for > 2 min | Warning | /runbooks/audit/dlq-handler |
| AuditChainIntegrityFailed | audit_chain_integrity_failures_total > 0 | Critical | /runbooks/audit/chain-integrity |
| AuditDBConnectionError | audit_db_connection_errors_total > 0 for > 1 min | Critical | /runbooks/audit/db-failure |
| AuditExportFailed | audit_export_errors_total > 2 in 10 min | Warning | /runbooks/audit/export-failure |
| AuditHighIngestionLatency | P95 ingestion latency > 500 ms for > 5 min | Warning | /runbooks/audit/high-latency |
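Several conditions above have the form "value above a threshold for some duration". The sketch below illustrates that pattern with the AuditDLQGrowing thresholds: the alert fires only if every sample stays above the threshold for the whole hold period, and any sample at or below it resets the clock. It is a simplified model, not the alerting engine's implementation; the function name and sample data are hypothetical, and it treats the hold as "at least N seconds".

```python
from typing import Iterable, Optional, Tuple


def fires(samples: Iterable[Tuple[int, float]], threshold: float, hold_seconds: int) -> bool:
    """samples: (unix_seconds, value) pairs in time order."""
    breach_start: Optional[int] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared, start over
    return False


# Hypothetical DLQ depth scraped every 30 s.
brief_spike = [(0, 0), (30, 3), (60, 3), (90, 0)]
sustained = [(0, 0), (30, 3), (60, 3), (90, 0), (120, 1), (150, 2), (180, 2), (210, 2), (240, 4)]

print(fires(brief_spike, threshold=0, hold_seconds=120))
print(fires(sustained, threshold=0, hold_seconds=120))
```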

7. Chain integrity verification job

The chain-hash verification job runs on a configurable cron schedule (default: 0 2 * * *, i.e. 2 AM daily). It:

  1. Reads audit_entries in recorded_at order per tenant.
  2. Recomputes SHA-256(prev_id:sourceEventId:tenantId:occurredAt:resourceId) for each row.
  3. Compares against stored chain_hash.
  4. On any mismatch: increments audit_chain_integrity_failures_total; fires AuditChainIntegrityFailed alert; logs full context.

Expected job runtime is under 5 min for 1 M rows. By default the scan is partition-pruned to the last 7 days; a full scan is configurable.
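The four verification steps can be sketched as follows. This is an illustration under several assumptions that the document does not confirm: `prev_id` is taken to be the id of the preceding row for the same tenant (empty string for the first row), the hash input is UTF-8 encoded, the digest is stored as lowercase hex, and `occurredAt` is stored as a string. The `AuditEntry` class and the `seal` helper are hypothetical, and the metric increment, alert and logging of step 4 are reduced to returning the mismatching ids.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AuditEntry:
    id: str
    source_event_id: str
    tenant_id: str
    occurred_at: str
    resource_id: str
    chain_hash: str


def compute_chain_hash(prev_id: str, entry: AuditEntry) -> str:
    """SHA-256 over prev_id:sourceEventId:tenantId:occurredAt:resourceId."""
    payload = ":".join(
        [prev_id, entry.source_event_id, entry.tenant_id, entry.occurred_at, entry.resource_id]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(entries: Iterable[AuditEntry]) -> List[str]:
    """Return ids of rows whose stored hash differs from the recomputed one.

    entries must belong to one tenant and already be in recorded_at order.
    """
    mismatches = []
    prev_id = ""  # assumption: the first row of a tenant chains from an empty id
    for entry in entries:
        if compute_chain_hash(prev_id, entry) != entry.chain_hash:
            mismatches.append(entry.id)
        prev_id = entry.id
    return mismatches


def seal(prev_id: str, entry_id: str, event_id: str, resource_id: str) -> AuditEntry:
    """Test helper: build a row for tenant-1 with a correct chain hash."""
    row = AuditEntry(entry_id, event_id, "tenant-1", "2026-04-18T00:00:00Z", resource_id, "")
    row.chain_hash = compute_chain_hash(prev_id, row)
    return row


rows = [seal("", "1", "evt-1", "res-a"), seal("1", "2", "evt-2", "res-b"), seal("2", "3", "evt-3", "res-c")]
clean_result = verify_chain(rows)

rows[1].resource_id = "res-tampered"  # simulate an out-of-band UPDATE
tampered_result = verify_chain(rows)
print(clean_result, tampered_result)
```

Because the formula chains on the previous row's id rather than on its hash, editing a row invalidates only that row's hash, so the job reports exactly the rows that changed.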