regulator-portal-service — Failure Modes
Version: 1.0
Status: Draft
Owner: Regulator-facing + Legal + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md
Catalog of how regulator-portal-service fails. The service is regulator-facing: downtime or data delivery issues have legal-notification implications.
1. Operating Principle
- Regulator login: fail-closed on cert-revocation check failures.
- LI workflow: SLA breaches trigger immediate escalation to Legal + CISO.
- SIEM forwarding: at-least-once; disk-WAL prevents event loss during outages.
- Report generation: partial-data tolerated; downstream-unavailability surfaced in report metadata.
- Upstream reads: service is read-only against upstream — outages degrade reports but don't corrupt state.
2. Failure Mode Summary
| # | Name | Class | Detection | User-visible impact | Runbook |
|---|---|---|---|---|---|
| FM-01 | mTLS handshake storm (adversarial or cert chaos) | Security | < 5 min | Legitimate login latency up; brief outage for probe sources | runbooks/regulator/mtls-storm.md |
| FM-02 | SIEM destination unreachable | Dependency | < 2 min | Events buffer to disk-WAL; alert fires | runbooks/regulator/siem-dest-out.md |
| FM-03 | Postgres unavailable | Infra | < 30 s | Regulator portal returns 503 | runbooks/regulator/pg-out.md |
| FM-04 | HSM unavailable | Dependency | < 30 s | Reports + bundles can't be signed | runbooks/regulator/hsm-out.md |
| FM-05 | Upstream service unreachable | Integration | Per-request | Partial report + warning banner | runbooks/regulator/upstream-out.md |
| FM-06 | LI delivery SFTP fails | Dependency | Per-delivery | Delivery retried with backoff; alert after 3 failed retries | runbooks/regulator/li-deliver-fail.md |
| FM-07 | Complaint ingest flood (DDoS or real-world event) | Infra | 5 min | Rate-limit + tarpit engage | runbooks/regulator/complaint-flood.md |
| FM-08 | Evidence collection cron fails | Ops | 24 h | Evidence status stays at last-good; stale status becomes visible to the regulator only if the failure is prolonged | runbooks/regulator/evidence-cron.md |
| FM-09 | Attestation bundle generation fails | Ops | Per-generation | Manual intervention | runbooks/regulator/bundle-fail.md |
| FM-10 | Auditor cert revoked mid-session | Security | Next API call | Session ends; re-attestation needed | runbooks/regulator/auditor-mid-session.md |
| FM-11 | Disk-WAL fills (catastrophic SIEM backlog) | Infra | 30 min | Event loss risk if unaddressed | runbooks/regulator/wal-fill.md |
| FM-12 | Regulator cert revoked during active LI workflow | Security | Next action | Pending LI paused; new approver required | runbooks/regulator/li-mid-flow-revocation.md |
| FM-13 | Dual-control approver unavailable for urgent LI | Process | Per-workflow | Emergency-approver runbook; heavily audited | runbooks/regulator/li-no-approver.md |
3. Detailed Failure Modes
FM-01 — mTLS handshake storm
Probing source attempts rapid mTLS handshakes with revoked / unknown certs.
Mitigation. Kong at the edge tarpits a source IP after 5 failed handshakes within 60 s; the CRL cache prevents repeated upstream CRL fetches. Cloudflare applies a Layer 7 rate limit on the regulator domain.
Recovery. Tarpit times out; legitimate traffic unaffected.
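The 5-failures-in-60-s rule can be illustrated with a sliding-window counter. This is a minimal sketch of the counting logic only, not the Kong configuration; the class and method names are hypothetical, and it assumes the window is a sliding one rather than fixed buckets.

```python
from collections import defaultdict, deque

FAILURE_LIMIT = 5       # failed handshakes per source IP (from the mitigation above)
WINDOW_SECONDS = 60.0   # window length (from the mitigation above)


class HandshakeFailureTracker:
    """Hypothetical helper: counts failed mTLS handshakes per IP in a sliding window."""

    def __init__(self, limit=FAILURE_LIMIT, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._failures = defaultdict(deque)  # ip -> timestamps of recent failures

    def _expire(self, ip, now):
        # Drop failures that have aged out of the window.
        q = self._failures[ip]
        while q and now - q[0] >= self.window:
            q.popleft()

    def record_failure(self, ip, now):
        """Record one failed handshake; return True if the IP should now be tarpitted."""
        self._expire(ip, now)
        self._failures[ip].append(now)
        return len(self._failures[ip]) >= self.limit

    def is_tarpitted(self, ip, now):
        """True while the IP still has at least `limit` failures inside the window."""
        self._expire(ip, now)
        return len(self._failures[ip]) >= self.limit
```

Because old failures age out of the window, the tarpit lifts without operator action once the source stops probing, which matches the recovery behaviour described above.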
FM-02 — SIEM destination unreachable
One destination (Splunk, QRadar, or Logstash) unreachable.
Mitigation. Disk-WAL engages for that destination; other destinations continue normally. Alert fires.
Recovery. Destination recovers → forwarder replays WAL in order.
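The buffer-then-replay behaviour can be sketched as a per-destination append-only file. This is an illustration under stated assumptions: the JSON-lines layout, the `DestinationWal` class, and the `send` callback returning True on success are all hypothetical and not taken from the service.

```python
import json
import os
import tempfile


class DestinationWal:
    """Hypothetical append-only JSON-lines buffer for one SIEM destination."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # fsync before returning so a buffered event survives a process crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def replay(self, send):
        """Send buffered events oldest-first and stop at the first failure.

        An event leaves the WAL only after send() returned True, so a failure
        can cause a resend but not a loss (at-least-once delivery).
        Returns the number of events delivered in this pass.
        """
        events = self.pending()
        delivered = 0
        for event in events:
            if not send(event):
                break
            delivered += 1
        # Atomically rewrite the file with whatever is still undelivered.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for event in events[delivered:]:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return delivered
```

Keeping one file per destination is what lets the other destinations continue normally: a stalled Splunk replay, for example, never blocks the QRadar or Logstash forwarders.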
FM-03 — Postgres unavailable
Primary PG outage.
Mitigation. PG HA auto-failover (≤ 30 s). Regulator Liaison notified if outage > 15 min.
Recovery. PG recovery → service resumes.
FM-04 — HSM unavailable
HSM outage blocks PDF + bundle signing.
Mitigation. HSM HA. Pending reports queue with status AWAITING_SIGN. Backup manual-signing via security-held HSM backup key (dual-control).
Recovery. HSM recovery → queue drains.
FM-05 — Upstream service unreachable
A read-through target (e.g., compliance-engine) is unreachable.
Mitigation. Partial-report mode: report generated with placeholder for affected section + prominent "data unavailable from {service}" warning. Regulator Liaison notified.
Recovery. Upstream recovers → full report available on next generation.
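Partial-report mode can be sketched as a per-section fetch that swaps in a placeholder on failure and records the gap in report metadata. The `Report` shape, the `build_report` function, the placeholder text, and the choice of exceptions are assumptions made for illustration; only the warning wording comes from the mitigation above.

```python
from dataclasses import dataclass, field

PLACEHOLDER = "[section unavailable]"  # hypothetical placeholder text


@dataclass
class Report:
    sections: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)
    unavailable_services: list = field(default_factory=list)

    @property
    def partial(self):
        # A report is partial as soon as one upstream section is missing.
        return bool(self.unavailable_services)


def build_report(fetchers):
    """Build a report from {service_name: zero-argument fetch callable}.

    An unreachable upstream yields a placeholder section plus a warning
    instead of failing the whole report.
    """
    report = Report()
    for service, fetch in fetchers.items():
        try:
            report.sections[service] = fetch()
        except (ConnectionError, TimeoutError):
            report.sections[service] = PLACEHOLDER
            report.warnings.append(f"data unavailable from {service}")
            report.unavailable_services.append(service)
    return report
```

Since the service is read-only against upstream, regenerating after recovery is safe: the next `build_report` call simply produces the full report.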
FM-06 — LI delivery SFTP fails
Package delivery to ATRA SFTP endpoint fails.
Mitigation. 3 retry attempts with exponential backoff. Failure after retries → alert + LI state stays DELIVERED_PENDING_ACK. Manual SFTP delivery runbook.
Recovery. Retry succeeds or manual intervention.
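A minimal sketch of the retry policy follows. It assumes "3 retry attempts" means one initial attempt followed by up to three retries, that delays double from a 1 s base, and that the hypothetical `deliver` callable raises `OSError` on failure; none of these details are confirmed by the source.

```python
import time

MAX_RETRIES = 3  # from the mitigation above


def deliver_with_retries(deliver, base_delay=1.0, sleep=time.sleep):
    """Call deliver(); on OSError retry up to MAX_RETRIES times with doubling delays.

    Returns True on success. Returns False once retries are exhausted, at which
    point the caller is expected to raise the alert and leave the LI state as is.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            deliver()
            return True
        except OSError:
            if attempt == MAX_RETRIES:
                return False
            # Delays: base_delay, then 2x, then 4x.
            sleep(base_delay * (2 ** attempt))
    return False
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting.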
FM-07 — Complaint flood
Unusual volume of complaint POSTs (DDoS or real-world national event).
Mitigation. Per-source rate-limit at Kong; excess 429'd. Sustained flood → tarpit. Real-world-event case (e.g., nationwide outage): complaints queue in NATS for post-event processing.
FM-08 — Evidence collection cron fails
Scheduled job fails.
Mitigation. backoffLimit: 3; alert on failure; manual re-trigger. Evidence status stays as last-good.
FM-09 — Bundle generation fails
Annual attestation bundle generation fails mid-way.
Mitigation. Idempotent; re-run generates from scratch. Intermediate files retained for recovery investigation.
FM-10 — Auditor cert mid-session revoked
Auditor cert revoked while they're actively using the portal.
Mitigation. Next API call detects revocation via OCSP staple + CRL cache → session terminated. Audit entry with revocation cause. Auditor notified via email.
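The per-call check can be sketched as a small decision function. The `lookup` callable stands in for the OCSP staple and CRL cache lookup and is hypothetical, as are the type names. The sketch also assumes that a failed check denies the call without being recorded as a revocation, in line with the fail-closed principle in section 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class RevocationStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"


@dataclass
class Decision:
    allow: bool
    terminate_session: bool
    audit_reason: Optional[str] = None


def check_call(cert_serial: str, lookup: Callable[[str], RevocationStatus]) -> Decision:
    """Decide whether one API call may proceed for the given client certificate."""
    try:
        status = lookup(cert_serial)
    except Exception as exc:
        # Fail-closed: if revocation status cannot be determined, deny the call.
        return Decision(False, False, f"revocation check failed: {exc}")
    if status is RevocationStatus.REVOKED:
        # Revoked mid-session: end the session and record the cause for audit.
        return Decision(False, True, "certificate revoked")
    return Decision(True, False)
```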
FM-11 — Disk-WAL fills
All SIEM destinations unreachable for > 24 h → WAL grows to 5 GB limit.
Mitigation. Monitoring alerts at 50%, 75%, and 90% of WAL capacity. Runbook escalates to the Security team. If the backlog keeps growing: expand the WAL volume. Last resort: drop the oldest events, only with Legal approval and regulator notification.
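The alert thresholds can be expressed as a simple utilisation check. This is a sketch only: the function names are hypothetical, and reading the 5 GB limit as 5 GiB is an assumption.

```python
WAL_LIMIT_BYTES = 5 * 1024**3          # 5 GB limit; GiB interpretation is an assumption
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # from the mitigation above


def crossed_thresholds(used_bytes, limit_bytes=WAL_LIMIT_BYTES):
    """Return every alert threshold the current WAL usage has reached, ascending."""
    ratio = used_bytes / limit_bytes
    return [t for t in ALERT_THRESHOLDS if ratio >= t]


def highest_alert(used_bytes, limit_bytes=WAL_LIMIT_BYTES):
    """Return the highest threshold reached, or None when usage is below 50%."""
    crossed = crossed_thresholds(used_bytes, limit_bytes)
    return crossed[-1] if crossed else None
```

Taking `limit_bytes` as a parameter means the same check still applies if the sizing question in FM-OPEN-02 is resolved in favour of a larger volume.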
FM-12 — Regulator cert mid-LI revoked
LI initiator's cert revoked mid-workflow.
Mitigation. LI paused; new approver required to continue. Audit records original initiator's revocation. ATRA notified via Regulator Liaison.
FM-13 — Urgent LI with no approver
LI requires dual-control but no approver available (holiday / off-hours).
Mitigation. Emergency-approver runbook (CISO + CTO both required). Action heavily audit-logged; LI proceeds with reduced control; post-hoc review mandatory.
4. Graceful Degradation Summary
| Failure domain | Behaviour |
|---|---|
| Login / cert | Fail-closed |
| Postgres | Fail-closed (503) |
| HSM | Reports queue; no silent fallback |
| Upstream reads | Partial report with warning |
| SIEM destination | Disk-WAL buffer; other destinations unaffected |
| LI SFTP | Retry + manual fallback |
| Complaint flood | Rate-limit + tarpit |
5. Failure ↔ Experience Matrix
| FM | Regulator | Auditor | Legal / Internal | NOC |
|---|---|---|---|---|
| FM-01 | Brief login issue for some IPs | — | — | Alert |
| FM-02 | — | — | SIEM data delayed | Alert |
| FM-03 | 503 portal | 503 portal | Service down | Critical |
| FM-04 | Reports unavailable | Bundles unavailable | — | Critical |
| FM-05 | Partial report | Partial evidence | — | Warning |
| FM-06 | LI delivery delayed | — | Aware | Alert |
| FM-07 | Complaints delayed | — | Aware | Alert |
| FM-08 | Stale evidence visible to regulator if prolonged | Stale evidence | Aware | Alert |
| FM-09 | — | Bundle unavailable until manual recovery | Aware | Alert |
| FM-10 | — | Mid-session termination | — | Audit entry |
| FM-11 | — | — | Event loss risk | Critical |
| FM-12 | LI paused | — | Requires new approver | Alert |
| FM-13 | Urgent LI requires emergency approver | — | Heavy audit | Alert |
6. Open Points
| ID | Question | Owner |
|---|---|---|
| FM-OPEN-01 | Emergency-approver escalation bridge (phone + Slack) for FM-13 | Legal + Leadership |
| FM-OPEN-02 | Disk-WAL sizing policy (5 GB vs. 20 GB) | SRE |
| FM-OPEN-03 | Partial-report wording approved by ATRA | Regulator Liaison + Legal |