| FM-AUDIT-01 | PostgreSQL primary unavailable | Audit ingestion stalls; NATS messages not ACK'd; compliance queries fail | Health probe fails; audit_db_connection_errors_total metric increases | NATS consumer pauses; DB reconnect retried with backoff; Kubernetes readiness probe fails; SRE is alerted; NATS messages remain in the stream (at-least-once) and are redelivered once the DB recovers |
| FM-AUDIT-02 | NATS JetStream unavailable | No new events consumed; no new audit entries created; existing entries intact | NATS health check fails; consumer lag metric absent | Service retries NATS connection; existing Postgres data unaffected; no data loss on recovery |
| FM-AUDIT-03 | Object storage (S3) unavailable | Export jobs fail; signed download URLs not generated | AuditExport.status=failed; audit_export_errors_total metric | Export job retried up to 3x; export marked failed after exhaustion; existing audit_entries intact; alert fires |
| FM-AUDIT-04 | Chain-hash integrity check failure | Tamper-detection alarm; possible data corruption | Scheduled verification job exits non-zero; audit_chain_integrity_failures_total metric | Alert fires immediately (CRITICAL); SRE investigates; no auto-remediation (tampering requires human review); all exports halted pending investigation |
| FM-AUDIT-05 | NATS dead-letter queue growth | Some source events never ingested; audit trail has gaps | audit_dlq_pending_messages metric > 0; audit.dlq.alert.v1 emitted | DLQ handler retries 3x with delay; on exhaustion emits audit.dlq.alert.v1 to platform-admin-service; SRE investigates source event format |
| FM-AUDIT-06 | Dedup collision (duplicate source_event_id) | Duplicate event silently skipped (expected behavior) | None (expected behavior) | UNIQUE index on source_event_id with INSERT ... ON CONFLICT DO NOTHING; the duplicate is dropped, which is the correct behavior |
| FM-AUDIT-07 | Query dateTo - dateFrom > 90 days | Live query blocked | HTTP 400 AUD_DATE_RANGE_TOO_WIDE returned | User directed to async export; no performance impact |
| FM-AUDIT-08 | Pod OOM / restart under high ingestion load | Brief gap in event processing during restart | Kubernetes OOM event; pod restart metric | NATS JetStream retains unACK'd messages; re-delivered after pod restarts; no data loss; HPA scales additional replica |
| FM-AUDIT-09 | Compliance query returns stale data | Cannot occur: there is no cache, so every query reflects state at query time | Not applicable (correct behavior) | No Redis cache by design; reads always go to primary Postgres |
| FM-AUDIT-10 | Export signed URL expired before download | User cannot download completed export file | HTTP 403 from object storage | fileUrl TTL is 1 hour; re-querying export status does not issue a new URL, so the user must re-run the export |
| FM-AUDIT-11 | Malformed event payload from source service | Event normalisation fails; message sent to DLQ | DLQ message count increases; parsing error logged | DLQ handler attempts schema coercion; on failure, raw payload stored with normalisation_error=true flag; source service alerted via audit.dlq.alert.v1 |
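
FM-AUDIT-01, FM-AUDIT-06, and FM-AUDIT-08 together rely on one property: redelivered messages must be safe to ingest twice. The sketch below illustrates that idea under stated assumptions. It uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept `ON CONFLICT ... DO NOTHING`), and the helper names `open_store`, `ingest`, and `handle_delivery`, the `payload` column, and the `"ack"`/`"nak"` return values are hypothetical, not the service's actual API.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the audit_entries table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_entries ("
        " id INTEGER PRIMARY KEY,"
        " source_event_id TEXT NOT NULL UNIQUE,"  # dedup key (FM-AUDIT-06)
        " payload TEXT NOT NULL)"                 # hypothetical column
    )
    return conn


def ingest(conn: sqlite3.Connection, source_event_id: str, payload: str) -> bool:
    """Insert one entry; return True if a row was written, False if deduplicated."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO audit_entries (source_event_id, payload) VALUES (?, ?) "
            "ON CONFLICT (source_event_id) DO NOTHING",
            (source_event_id, payload),
        )
    return conn.total_changes > before


def handle_delivery(conn: sqlite3.Connection, message: dict) -> str:
    """Decide whether to acknowledge a delivered message.

    A duplicate is still acknowledged, because the entry already exists.
    A database error is not acknowledged, so the broker keeps the message
    and redelivers it later (the at-least-once path in FM-AUDIT-01).
    """
    try:
        ingest(conn, message["source_event_id"], message["payload"])
    except sqlite3.Error:
        return "nak"
    return "ack"


if __name__ == "__main__":
    store = open_store()
    event = {"source_event_id": "evt-1", "payload": "{}"}
    print(handle_delivery(store, event))  # first delivery
    print(handle_delivery(store, event))  # redelivery after a restart
    print(store.execute("SELECT COUNT(*) FROM audit_entries").fetchone()[0])
```

Acknowledging only after the insert commits is what lets a pod restart or database outage be absorbed by redelivery, while the unique index keeps the redelivered copies from producing duplicate audit entries.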