Audit Service — Observability
Status: populated; Owner: TBD; Last updated: 2026-04-18; Companion: Service Template · 12 observability-telemetry
1. SLIs and SLOs
| SLI | Measurement | SLO target | Window |
|---|---|---|---|
| Ingestion availability | 1 - (failed ingestion rate) | 99.9 % | 30-day rolling |
| Ingestion latency (P95) | audit_ingestion_duration_ms | < 200 ms | 5-min window |
| Query availability | 1 - (5xx rate on GET /api/v1/audit/*) | 99.5 % | 30-day rolling |
| Export completion (P95) | audit_export_duration_ms | < 10 min | Per-export |
| Chain integrity | Verification job pass rate | 100 % | Daily |
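To make the availability targets concrete, the sketch below converts an SLO target and window into an error budget expressed in minutes of full unavailability. The function name `error_budget_minutes` is hypothetical; the arithmetic uses only the targets and windows from the table above.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability allowed by an availability SLO over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


# Ingestion availability: 99.9 % over a 30-day rolling window.
print(f"{error_budget_minutes(0.999, 30):.1f}")  # 43.2
# Query availability: 99.5 % over a 30-day rolling window.
print(f"{error_budget_minutes(0.995, 30):.1f}")  # 216.0
```

A 100 % target, as used for chain integrity, leaves a budget of zero: any single failure breaches it.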
2. Key metrics (Prometheus)
| Metric | Type | Labels | Description |
|---|---|---|---|
| audit_events_ingested_total | Counter | event_type, source_service, tenant_id | Events successfully ingested |
| audit_ingestion_duration_ms | Histogram | event_type | Time from NATS message receipt to Postgres INSERT |
| audit_events_deduplicated_total | Counter | source_service | Events skipped due to duplicate source_event_id |
| audit_dlq_pending_messages | Gauge | — | Messages in dead-letter queue |
| audit_chain_integrity_failures_total | Counter | — | Chain-hash mismatches detected |
| audit_export_duration_ms | Histogram | — | Export job duration |
| audit_export_errors_total | Counter | error_type | Failed export jobs |
| audit_query_duration_ms | Histogram | operation | Compliance query latency |
| audit_db_connection_errors_total | Counter | — | DB connection failures |
| audit_entries_total | Gauge | tenant_id | Total rows per tenant (sampled, not live count) |
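The P95 figures in this document are derived from histogram metrics such as audit_ingestion_duration_ms. As a rough illustration of how a quantile is estimated from cumulative bucket counts (linear interpolation inside the bucket that contains the target rank, the same general idea as Prometheus's `histogram_quantile`), here is a minimal sketch. The bucket boundaries and counts are invented for the example; the service's real bucket layout is not specified here.

```python
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_ms, count) buckets.

    Buckets must be sorted by upper bound, and the last bucket is expected
    to be the +Inf bucket holding the total observation count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("no observations")
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if upper == float("inf") or in_bucket == 0:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for one scrape window.
example_buckets = [(50.0, 600), (100.0, 850), (200.0, 960), (500.0, 995), (float("inf"), 1000)]
p95 = estimate_quantile(0.95, example_buckets)
print(f"{p95:.1f} ms")  # 190.9 ms
```

With these made-up counts the estimate falls in the 100 to 200 ms bucket, which would sit inside the 200 ms ingestion latency target.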
3. Distributed tracing (OpenTelemetry)
Key spans per ingestion:
- NATS message received
- audit.dedup_check (SELECT source_event_id)
- audit.normalise_event
- audit.compute_chain_hash
- audit.insert_entry (INSERT audit_entries)
- NATS ACK
Span attributes: tenant_id, event_type, source_service, source_event_id — no PHI in span attributes.
OTEL exporter: OTLP → Grafana Tempo.
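The span sequence above can be sketched without any tracing library. The example below uses an in-memory `SpanRecorder` as a stand-in for a real OpenTelemetry tracer (it is not the OpenTelemetry API), wraps each ingestion step in a named span, and attaches only the four documented attributes. The `ingest` function, the event field names, and the placeholder step bodies are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Set


class SpanRecorder:
    """In-memory stand-in for a tracer: stores finished spans in completion order."""

    def __init__(self) -> None:
        self.finished: List[Dict[str, object]] = []

    @contextmanager
    def span(self, name: str, attributes: Dict[str, str]) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.finished.append({
                "name": name,
                "attributes": dict(attributes),
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })


def span_attributes(event: Dict[str, str]) -> Dict[str, str]:
    """Pick only the four documented attributes; payload metadata is never copied."""
    keys = ("tenant_id", "event_type", "source_service", "source_event_id")
    return {key: event[key] for key in keys}


def ingest(event: Dict[str, str], seen_ids: Set[str], recorder: SpanRecorder) -> bool:
    """Run the ingestion steps in order; return False when the event is a duplicate."""
    attrs = span_attributes(event)
    with recorder.span("audit.dedup_check", attrs):
        duplicate = event["source_event_id"] in seen_ids
    if duplicate:
        return False
    with recorder.span("audit.normalise_event", attrs):
        pass  # placeholder: normalisation logic is not described in this document
    with recorder.span("audit.compute_chain_hash", attrs):
        pass  # placeholder: hashing is covered in the chain integrity section
    with recorder.span("audit.insert_entry", attrs):
        seen_ids.add(event["source_event_id"])  # stands in for the Postgres INSERT
    return True


recorder = SpanRecorder()
seen: Set[str] = set()
event = {"tenant_id": "t-1", "event_type": "record.viewed", "source_service": "ehr",
         "source_event_id": "evt-123", "metadata": "raw payload, never copied to spans"}
print(ingest(event, seen, recorder))  # True
print(ingest(event, seen, recorder))  # False
print([s["name"] for s in recorder.finished])
```

The second call stops after the dedup span, so a duplicate event produces a single short span rather than the full sequence.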
4. Structured logging
| Field | Description |
|---|---|
| level | info / warn / error |
| service | audit-service |
| traceId | OTEL trace id |
| tenantId | Tenant from event envelope |
| sourceEventId | Source event ID |
| operation | Handler name |
| msg | Human-readable message |
PHI-safe: raw metadata fields from event payload are not logged; only IDs and event type are logged.
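One way to keep log lines PHI-safe is to build each record from an explicit set of fields rather than serialising the incoming event. The sketch below emits a JSON line with exactly the fields from the table above. The function name `build_log_line`, the envelope key names, and the sample values are assumptions for illustration.

```python
import json
from typing import Any, Mapping


def build_log_line(level: str, msg: str, operation: str, trace_id: str,
                   envelope: Mapping[str, Any]) -> str:
    """Build a JSON log line from named fields only; the payload is never serialised."""
    if level not in ("info", "warn", "error"):
        raise ValueError(f"unsupported level: {level}")
    record = {
        "level": level,
        "service": "audit-service",
        "traceId": trace_id,
        "tenantId": envelope.get("tenantId"),
        "sourceEventId": envelope.get("sourceEventId"),
        "operation": operation,
        "msg": msg,
    }
    return json.dumps(record)


# Hypothetical envelope: the metadata block must not reach the log output.
envelope = {
    "tenantId": "t-1",
    "sourceEventId": "evt-123",
    "metadata": {"patientName": "example only"},
}
print(build_log_line("info", "event ingested", "ingestEvent", "abc123", envelope))
```

Because the record is assembled field by field, a new key added to the event payload cannot leak into logs by accident.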
5. Dashboards
| Dashboard | Key panels |
|---|---|
| Audit Service — Ingestion | Events/s by type, ingestion P95 latency, DLQ depth, dedup rate |
| Audit Service — Chain Integrity | Daily job status, chain-hash failure count, last verified timestamp |
| Audit Service — Query & Export | Query latency, export job queue depth, export errors |
| Audit Service — SLO Burn | Multi-window burn rate for ingestion and query SLOs |
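For readers unfamiliar with the SLO Burn dashboard, burn rate is the observed error ratio divided by the error ratio the SLO allows, and a multi-window check requires both a long and a short window to exceed a threshold. The sketch below shows the arithmetic. The 14.4 threshold is a commonly used value for a 1-hour window against a 30-day SLO (2 % of the budget spent in one hour); it and the sample error ratios are assumptions, not values taken from this service's configuration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0 to have an error budget")
    return error_ratio / budget


def is_burning(long_window_ratio: float, short_window_ratio: float,
               slo_target: float, threshold: float) -> bool:
    """True only when both windows exceed the threshold, which filters out brief spikes
    that have already recovered."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


INGESTION_SLO = 0.999  # from the SLO table
THRESHOLD = 14.4       # assumed threshold, see the paragraph above

# 2 % errors over the long window and 3 % over the short window: still burning.
print(is_burning(0.02, 0.03, INGESTION_SLO, THRESHOLD))   # True
# Long window still elevated, short window recovered: no longer burning.
print(is_burning(0.02, 0.001, INGESTION_SLO, THRESHOLD))  # False
```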
6. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| AuditIngestionStopped | No events ingested for > 5 min | Critical | /runbooks/audit/ingestion-stopped |
| AuditDLQGrowing | audit_dlq_pending_messages > 0 for > 2 min | Warning | /runbooks/audit/dlq-handler |
| AuditChainIntegrityFailed | audit_chain_integrity_failures_total > 0 | Critical | /runbooks/audit/chain-integrity |
| AuditDBConnectionError | audit_db_connection_errors_total > 0 for > 1 min | Critical | /runbooks/audit/db-failure |
| AuditExportFailed | audit_export_errors_total > 2 in 10 min | Warning | /runbooks/audit/export-failure |
| AuditHighIngestionLatency | P95 ingestion latency > 500 ms for > 5 min | Warning | /runbooks/audit/high-latency |
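Several conditions above compare a cumulative counter against a threshold. A raw counter never decreases while the process runs, so one plausible reading (the table does not state this explicitly) is that each condition applies to the counter's increase over the evaluation window. The sketch below computes such an increase from chronologically ordered samples and tolerates a counter reset after a restart. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across chronologically ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the process restarted and the counter began again from zero,
            # so everything counted since the restart is new.
            increase += curr
    return increase


# Hypothetical audit_export_errors_total samples inside one 10-minute window,
# including a restart between the third and fourth sample.
window_samples = [7, 7, 8, 0, 2]
errors_in_window = counter_increase(window_samples)
print(errors_in_window)      # 3.0
print(errors_in_window > 2)  # True, so AuditExportFailed would fire under this reading
```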
7. Chain integrity verification job
The chain-hash verification job runs on a configurable cron (default: 0 2 * * * — 2 AM daily). It:
- Reads `audit_entries` in `recorded_at` order per tenant.
- Recomputes `SHA-256(prev_id:sourceEventId:tenantId:occurredAt:resourceId)` for each row.
- Compares against stored `chain_hash`.
- On any mismatch: increments `audit_chain_integrity_failures_total`; fires `AuditChainIntegrityFailed` alert; logs full context.
Job runtime is under 5 min for 1 M rows. By default the scan is partition-pruned to the last 7 days; a full scan is configurable.
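The verification steps can be sketched for a single tenant as follows. This is a minimal illustration, not the job's actual code: it assumes `prev_id` is the `id` of the preceding row, that the first row in a chain uses an empty `prev_id`, that the hash is stored as lowercase hex, and that rows are dictionaries keyed by the names used above. None of these details are confirmed by this document.

```python
import hashlib
from typing import Dict, Iterable, List


def compute_chain_hash(prev_id: str, row: Dict[str, str]) -> str:
    """SHA-256 over prev_id:sourceEventId:tenantId:occurredAt:resourceId, hex-encoded."""
    payload = ":".join([
        prev_id,
        row["sourceEventId"],
        row["tenantId"],
        row["occurredAt"],
        row["resourceId"],
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return ids of rows whose stored chain_hash differs from the recomputed value.

    rows must be one tenant's entries, already sorted by recorded_at.
    """
    mismatched: List[str] = []
    prev_id = ""  # assumption: the first row in a chain has no predecessor
    for row in rows:
        if compute_chain_hash(prev_id, row) != row["chain_hash"]:
            mismatched.append(row["id"])
        prev_id = row["id"]
    return mismatched


def make_row(row_id: str, prev_id: str, resource_id: str) -> Dict[str, str]:
    """Build a sample row with a correct chain_hash (test data only)."""
    row = {"id": row_id, "sourceEventId": f"evt-{row_id}", "tenantId": "t-1",
           "occurredAt": "2026-04-18T02:00:00Z", "resourceId": resource_id}
    row["chain_hash"] = compute_chain_hash(prev_id, row)
    return row


chain = [make_row("1", "", "res-a"), make_row("2", "1", "res-b"), make_row("3", "2", "res-c")]
print(verify_chain(chain))  # []

chain[1]["resourceId"] = "res-tampered"  # simulate an edit made after the hash was stored
print(verify_chain(chain))  # ['2']
```

In the real job, a non-empty result would correspond to incrementing `audit_chain_integrity_failures_total` once per mismatched row and logging the row context.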