Audit Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · OBSERVABILITY

1. Failure catalog

| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-AUDIT-01 | PostgreSQL primary unavailable | Audit ingestion stalls; NATS messages not ACK'd; compliance queries fail | Health probe fails; audit_db_connection_errors_total metric fires | NATS consumer pauses; DB reconnect retried with backoff; Kubernetes readiness probe fails; SRE alerted; NATS messages remain in the stream (at-least-once) and are delivered once the DB recovers |
| FM-AUDIT-02 | NATS JetStream unavailable | No new events consumed; no new audit entries created; existing entries intact | NATS health check fails; consumer lag metric absent | Service retries NATS connection; existing Postgres data unaffected; no data loss on recovery |
| FM-AUDIT-03 | Object storage (S3) unavailable | Export jobs fail; signed download URLs not generated | AuditExport.status=failed; audit_export_errors_total metric | Export job retried up to 3x; export marked failed after exhaustion; existing audit_entries intact; alert fires |
| FM-AUDIT-04 | Chain-hash integrity check failure | Tamper-detection alarm; possible data corruption | Scheduled verification job exits non-zero; audit_chain_integrity_failures_total metric | Alert fires immediately (CRITICAL); SRE investigates; no auto-remediation (tampering requires human review); all exports halted pending investigation |
| FM-AUDIT-05 | NATS dead-letter queue growth | Some source events never ingested; audit trail has gaps | audit_dlq_pending_messages metric > 0; audit.dlq.alert.v1 emitted | DLQ handler retries 3x with delay; on exhaustion emits audit.dlq.alert.v1 to platform-admin-service; SRE investigates source event format |
| FM-AUDIT-06 | Dedup collision (duplicate source_event_id) | Duplicate event silently skipped (expected behavior) | None (expected) | UNIQUE index on source_event_id; INSERT ON CONFLICT DO NOTHING; correct behavior |
| FM-AUDIT-07 | Query dateTo - dateFrom > 90 days | Live query blocked | HTTP 400 AUD_DATE_RANGE_TOO_WIDE returned | User directed to async export; no performance impact |
| FM-AUDIT-08 | Pod OOM / restart under high ingestion load | Brief gap in event processing during restart | Kubernetes OOM event; pod restart metric | NATS JetStream retains unACK'd messages and re-delivers them after the pod restarts; no data loss; HPA scales out an additional replica |
| FM-AUDIT-09 | Compliance query returns stale data (no cache) | Query reflects state at query time | Not applicable (correct behavior: no cache) | No Redis cache by design; reads always served from primary Postgres |
| FM-AUDIT-10 | Export signed URL expired before download | User cannot download completed export file | HTTP 403 from object storage | fileUrl TTL is 1 hour; re-querying export status does not issue a new URL (not supported), so the user must re-run the export |
| FM-AUDIT-11 | Malformed event payload from source service | Event normalisation fails; message sent to DLQ | DLQ message count increases; parsing error logged | DLQ handler attempts schema coercion; on failure, raw payload stored with normalisation_error=true flag; source service alerted via audit.dlq.alert.v1 |
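
The scheduled verification job behind FM-AUDIT-04 is not specified in this document. As a rough illustration of how a chain-hash check can detect tampering, the sketch below assumes each entry stores the SHA-256 of its predecessor's hash concatenated with its own canonical JSON payload. The field names (`prev_hash`, `hash`, `payload`), the all-zero genesis value, and the function names are hypothetical, not the service's actual schema.

```python
from __future__ import annotations

import hashlib
import json

# Assumed seed for the first entry; the real service may use something else.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry's canonical JSON payload together with its predecessor's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads: list[dict]) -> list[dict]:
    """Attach prev_hash and hash to each payload, in ingestion order."""
    chain, prev = [], GENESIS_HASH
    for payload in payloads:
        digest = entry_hash(prev, payload)
        chain.append({"payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def first_broken_index(chain: list[dict]) -> int | None:
    """Return the index of the first entry that fails verification, or None if intact."""
    prev = GENESIS_HASH
    for i, entry in enumerate(chain):
        recomputed = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None
```

A verification job built along these lines would exit non-zero whenever `first_broken_index` returns an index, which matches the detection signal listed for FM-AUDIT-04. Because each hash depends on the previous one, a single edited entry is detected at its own position.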

2. Dependency failure matrix

| Dependency | Failure | Audit impact | Behavior |
|---|---|---|---|
| PostgreSQL | Unavailable | Event ingestion stalls | Pod unhealthy; NATS holds messages; recovers on reconnect |
| NATS | Unavailable | No new events | Service reconnects; no data loss |
| Object storage | Unavailable | Export files not written | Export job fails; retry; no query impact |
| identity-service | Unavailable | Query auth fails | Queries return 401/503; ingestion unaffected |
| Kong | Unavailable | Compliance UI and exports blocked | Ingestion continues; only query/export APIs affected |
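
Several rows above rely on the service reconnecting to a dependency. The document does not state the backoff parameters, so the following is only a sketch of capped exponential backoff with jitter. The base delay, factor, cap, attempt count, and the `connect` callable are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> List[float]:
    """Exponentially growing delays in seconds, capped at `cap` (values are assumed)."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def connect_with_backoff(connect: Callable[[], T],
                         delays: Iterable[float],
                         sleep: Callable[[float], None] = time.sleep,
                         jitter: Callable[[], float] = random.random) -> T:
    """Call `connect` until it succeeds, sleeping between failed attempts.

    Each sleep is scaled into [0.5, 1.0] of the nominal delay so that
    replicas restarting together do not reconnect in lockstep.
    """
    for delay in delays:
        try:
            return connect()
        except ConnectionError:
            sleep(delay * (0.5 + 0.5 * jitter()))
    # Final attempt: let the error propagate so the readiness probe keeps failing.
    return connect()
```

With the defaults, `backoff_delays()` yields 0.5, 1, 2, 4, 8 and 16 seconds. `sleep` and `jitter` are injectable so the retry loop can be exercised in tests without waiting.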

3. Compliance implications of failures

| Failure | Compliance risk | Mitigation |
|---|---|---|
| FM-AUDIT-01 (DB down) | Audit trail gap during outage | MoPH acknowledges force-majeure outage; NATS holds messages; entries created on recovery; gap duration logged |
| FM-AUDIT-04 (chain-hash fail) | Tamper evidence compromised | CRITICAL alert; exports halted; human investigation required; regulator notification procedure in runbook |
| FM-AUDIT-05 (DLQ growth) | Some events never audited | Each DLQ message investigated; root cause in source service fixed; backfill attempted if event content recoverable |
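
Redelivery after an outage and manual backfill can both present the same event more than once. FM-AUDIT-06 handles this with a UNIQUE index on source_event_id and INSERT ON CONFLICT DO NOTHING. The sketch below shows that pattern using Python's built-in sqlite3 as a stand-in for PostgreSQL (SQLite 3.24+ accepts the same clause). The `payload` column and the function names are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the audit database (the real service uses PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_entries ("
        " id INTEGER PRIMARY KEY,"
        " source_event_id TEXT NOT NULL UNIQUE,"  # UNIQUE index enforces dedup
        " payload TEXT NOT NULL)"                 # hypothetical column
    )
    return conn


def ingest(conn: sqlite3.Connection, source_event_id: str, payload: str) -> bool:
    """Insert an entry; return False if the event was already recorded."""
    cur = conn.execute(
        "INSERT INTO audit_entries (source_event_id, payload) VALUES (?, ?) "
        "ON CONFLICT (source_event_id) DO NOTHING",
        (source_event_id, payload),
    )
    conn.commit()
    # rowcount is 0 when the conflict clause skipped the insert.
    return cur.rowcount == 1
```

Under this scheme a redelivered or backfilled event can be ACK'd whether `ingest` returns True or False, since in both cases the audit entry exists exactly once.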

4. SLO impact

| SLO | Target | Breach |
|---|---|---|
| Ingestion availability | 99.9% | DB unavailability > 43 min/month |
| Query availability | 99.5% | Less strict (read-only; infrequent) |
| Export completion | 95% within 10 min | P95 export job > 10 min |
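
The 43 min/month figure follows from the availability target. As a quick check, the helper below converts an availability percentage into allowed downtime, assuming a 30-day month (the window length is an assumption; the function name is illustrative).

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for name, target in [("ingestion", 99.9), ("query", 99.5)]:
        print(f"{name}: {allowed_downtime_minutes(target):.1f} min per 30 days")
```

For 99.9% this gives about 43.2 minutes, consistent with the ingestion breach threshold. By the same arithmetic, the 99.5% query target corresponds to 216 minutes per 30 days.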