
regulator-portal-service — Failure Modes

Version: 1.0
Status: Draft
Owner: Regulator-facing + Legal + SRE
Last Updated: 2026-04-21
References: SERVICE_OVERVIEW.md, OBSERVABILITY.md

Catalog of how regulator-portal-service fails. The service is regulator-facing: downtime or data delivery issues have legal-notification implications.


1. Operating Principle

  • Regulator login: fail-closed on cert-revocation check failures.
  • LI workflow: SLA breaches trigger immediate escalation to Legal + CISO.
  • SIEM forwarding: at-least-once; disk-WAL prevents event loss during outages.
  • Report generation: partial-data tolerated; downstream-unavailability surfaced in report metadata.
  • Upstream reads: service is read-only against upstream — outages degrade reports but don't corrupt state.

2. Failure Mode Summary

| # | Name | Class | Detection | User-visible impact | Runbook |
| --- | --- | --- | --- | --- | --- |
| FM-01 | mTLS handshake storm (adversarial or cert chaos) | Security | < 5 min | Legitimate login latency up; brief outage for probe sources | runbooks/regulator/mtls-storm.md |
| FM-02 | SIEM destination unreachable | Dependency | < 2 min | Events buffer to disk-WAL; alert fires | runbooks/regulator/siem-dest-out.md |
| FM-03 | Postgres unavailable | Infra | < 30 s | Regulator portal returns 503 | runbooks/regulator/pg-out.md |
| FM-04 | HSM unavailable | Dependency | < 30 s | Reports + bundles can't be signed | runbooks/regulator/hsm-out.md |
| FM-05 | Upstream service unreachable | Integration | Per-request | Partial report + warning banner | runbooks/regulator/upstream-out.md |
| FM-06 | LI delivery SFTP fails | Dependency | Per-delivery | Retry + alert if > 3 attempts | runbooks/regulator/li-deliver-fail.md |
| FM-07 | Complaint ingest flood (DDoS or real-world event) | Infra | 5 min | Rate-limit + tarpit engage | runbooks/regulator/complaint-flood.md |
| FM-08 | Evidence collection cron fails | Ops | 24 h | Stale evidence status before regulator visibility | runbooks/regulator/evidence-cron.md |
| FM-09 | Attestation bundle generation fails | Ops | Per-generation | Manual intervention | runbooks/regulator/bundle-fail.md |
| FM-10 | Auditor cert revoked mid-session | Security | Next API call | Session ends; re-attestation needed | runbooks/regulator/auditor-mid-session.md |
| FM-11 | Disk-WAL fills (catastrophic SIEM backlog) | Infra | 30 min | Event loss risk if unaddressed | runbooks/regulator/wal-fill.md |
| FM-12 | Regulator cert revoked during active LI workflow | Security | Next action | Pending LI paused; new approver required | runbooks/regulator/li-mid-flow-revocation.md |
| FM-13 | Dual-control approver unavailable for urgent LI | Process | Per-workflow | Emergency-approver runbook; audit heavy | runbooks/regulator/li-no-approver.md |

3. Detailed Failure Modes

FM-01 — mTLS handshake storm

A probing source attempts rapid mTLS handshakes using revoked or unknown certs.

Mitigation. Edge Kong tarpits after 5 failures per IP in 60 s; CRL cache prevents repeated upstream CRL fetches. Cloudflare Layer 7 rate-limit on regulator domain.

Recovery. Tarpit times out; legitimate traffic unaffected.
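
The "5 failures per IP in 60 s" rule can be pictured as a sliding-window counter. The sketch below only illustrates that counting logic; it is not Kong configuration, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque

MAX_FAILURES = 5      # failures per window, taken from the mitigation text
WINDOW_SECONDS = 60   # window length, taken from the mitigation text


class HandshakeFailureTracker:
    """Hypothetical helper: counts failed mTLS handshakes per source IP."""

    def __init__(self, max_failures=MAX_FAILURES, window=WINDOW_SECONDS):
        self.max_failures = max_failures
        self.window = window
        self._failures = defaultdict(deque)  # ip -> timestamps of recent failures

    def record_failure(self, ip, now):
        """Record one failed handshake; return True if the IP should be tarpitted."""
        recent = self._failures[ip]
        recent.append(now)
        # Discard failures that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        return len(recent) >= self.max_failures
```

Because counts are kept per IP, a probing source reaching the threshold does not affect the counters of other sources.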


FM-02 — SIEM destination unreachable

One destination (Splunk, QRadar, or Logstash) unreachable.

Mitigation. Disk-WAL engages for that destination; other destinations continue normally. Alert fires.

Recovery. Destination recovers → forwarder replays WAL in order.
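
A minimal sketch of a per-destination write-ahead log with in-order replay follows, assuming one JSON-lines file per destination. The `DestinationWal` class, its file layout, and the `send` callback are illustrative assumptions, not the forwarder's actual implementation.

```python
import json
import os


class DestinationWal:
    """Hypothetical per-destination WAL stored as one JSON-lines file."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # fsync so a buffered event survives a process or node crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def replay(self, send):
        """Send buffered events in order; stop at the first failure.

        Events are removed only after send() returns, so a crash between
        send and rewrite causes a duplicate rather than a loss
        (at-least-once delivery).
        """
        events = self.pending()
        sent = 0
        for event in events:
            try:
                send(event)
            except Exception:
                break
            sent += 1
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for event in events[sent:]:
                f.write(json.dumps(event) + "\n")
        os.replace(tmp, self.path)  # atomic swap of the trimmed log
        return sent
```

Keeping one log per destination is what lets a Splunk outage, for example, leave the other destinations unaffected.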


FM-03 — Postgres unavailable

Primary PG outage.

Mitigation. PG HA auto-failover (≤ 30 s). Regulator Liaison notified if outage > 15 min.

Recovery. PG recovery → service resumes.


FM-04 — HSM unavailable

HSM outage blocks PDF + bundle signing.

Mitigation. HSM HA. Pending reports queue with status AWAITING_SIGN. Backup manual-signing via security-held HSM backup key (dual-control).

Recovery. HSM recovery → queue drains.
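
The queue-and-drain behaviour could look roughly like the sketch below. Only the `AWAITING_SIGN` status comes from this document; the `SIGNED` status, the `SigningQueue` class, and the `HsmUnavailable` exception are hypothetical names used for illustration.

```python
from collections import deque


class HsmUnavailable(Exception):
    """Hypothetical error raised by the signing callable when the HSM is down."""


class SigningQueue:
    def __init__(self, sign):
        self._sign = sign        # callable(report_id) -> signature
        self._pending = deque()  # report ids waiting for a signature, FIFO
        self.status = {}
        self.signatures = {}

    def submit(self, report_id):
        self.status[report_id] = "AWAITING_SIGN"
        self._pending.append(report_id)
        self.drain()

    def drain(self):
        """Sign queued reports in order; stop as soon as the HSM is unavailable."""
        while self._pending:
            report_id = self._pending[0]
            try:
                signature = self._sign(report_id)
            except HsmUnavailable:
                return  # report stays queued; nothing is released unsigned
            self._pending.popleft()
            self.signatures[report_id] = signature
            self.status[report_id] = "SIGNED"
```

Calling `drain()` again after the HSM recovers empties the queue in submission order.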


FM-05 — Upstream service unreachable

A read-through target (e.g., compliance-engine) is unreachable.

Mitigation. Partial-report mode: report generated with placeholder for affected section + prominent "data unavailable from {service}" warning. Regulator Liaison notified.

Recovery. Upstream recovers → full report available on next generation.
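
Partial-report mode can be sketched as below, assuming each report section is fetched from one upstream service. The `build_report` function and the report dictionary layout are hypothetical; only the warning wording is taken from the mitigation text.

```python
def build_report(sources):
    """Assemble a report, tolerating unreachable upstreams.

    `sources` maps a section name to a (service_name, fetch) pair, where
    fetch is a zero-argument callable that raises ConnectionError when the
    upstream is unreachable.
    """
    report = {
        "sections": {},
        "metadata": {"partial": False, "unavailable_services": []},
        "warnings": [],
    }
    for section, (service, fetch) in sources.items():
        try:
            report["sections"][section] = fetch()
        except ConnectionError:
            report["sections"][section] = None  # placeholder for the missing section
            report["metadata"]["partial"] = True
            report["metadata"]["unavailable_services"].append(service)
            report["warnings"].append(f"data unavailable from {service}")
    return report
```

Recording the unavailable service in the metadata as well as in the warning keeps the gap machine-readable for downstream consumers.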


FM-06 — LI delivery SFTP fails

Package delivery to ATRA SFTP endpoint fails.

Mitigation. 3 retry attempts with exponential backoff. Failure after retries → alert + LI state stays DELIVERED_PENDING_ACK. Manual SFTP delivery runbook.

Recovery. Retry succeeds or manual intervention.
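
The retry policy might be sketched as follows. This assumes "3 retry attempts" means one initial attempt plus up to three retries, and the 1-second base delay is a made-up value; neither is confirmed by this document. `deliver_with_retry` is a hypothetical name.

```python
import time


def deliver_with_retry(deliver, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call deliver(); on OSError, retry with exponentially growing delays.

    Returns True on success and False once retries are exhausted, at which
    point the caller raises the alert and leaves the LI state unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            deliver()
            return True
        except OSError:  # covers ConnectionError and socket-level failures
            if attempt == max_retries:
                break
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s with the defaults
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.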


FM-07 — Complaint flood

Unusual volume of complaint POSTs (DDoS or real-world national event).

Mitigation. Per-source rate-limit at Kong; excess 429'd. Sustained flood → tarpit. Real-world-event case (e.g., nationwide outage): complaints queue in NATS for post-event processing.


FM-08 — Evidence collection cron fails

Scheduled job fails.

Mitigation. backoffLimit: 3; alert on failure; manual re-trigger. Evidence status stays as last-good.


FM-09 — Bundle generation fails

Annual attestation bundle generation fails mid-way.

Mitigation. Idempotent; re-run generates from scratch. Intermediate files retained for recovery investigation.


FM-10 — Auditor cert mid-session revoked

Auditor cert revoked while they're actively using the portal.

Mitigation. Next API call detects revocation via OCSP staple + CRL cache → session terminated. Audit entry with revocation cause. Auditor notified via email.
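
The per-call decision, combined with the fail-closed principle from section 1, reduces to a three-way outcome. The sketch below is illustrative only: `authorize_call`, the outcome strings, and the `check_revocation` callback (standing in for the OCSP staple + CRL cache lookup) are hypothetical.

```python
def authorize_call(cert_serial, check_revocation):
    """Decide what to do with one API call from a certificate holder.

    check_revocation(cert_serial) returns True if the cert is revoked,
    False if it is good, and raises if the status cannot be determined.
    """
    try:
        revoked = check_revocation(cert_serial)
    except Exception:
        # Fail-closed: an undeterminable status is treated as a denial.
        return "DENY_CHECK_FAILED"
    if revoked:
        return "TERMINATE_SESSION"
    return "ALLOW"
```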


FM-11 — Disk-WAL fills

All SIEM destinations unreachable for > 24 h → WAL grows to 5 GB limit.

Mitigation. Monitoring alerts at 50%, 75%, and 90% of WAL capacity. Runbook escalates to the Security team. Worst case: expand the WAL volume. Last resort: drop oldest events, with Legal approval and regulator notification.
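
The threshold check itself is simple arithmetic against the WAL limit. The sketch below assumes the 5 GB limit is interpreted as 5 GiB and uses a hypothetical function name; real alerting would live in the monitoring stack rather than in application code.

```python
WAL_LIMIT_BYTES = 5 * 1024**3           # assumption: "5 GB" read as 5 GiB
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)   # from the mitigation text


def crossed_thresholds(used_bytes, limit=WAL_LIMIT_BYTES):
    """Return every alert threshold that current WAL usage has reached."""
    ratio = used_bytes / limit
    return [t for t in ALERT_THRESHOLDS if ratio >= t]
```

For example, a WAL holding 4 GiB of a 5 GiB limit is at 80% and has crossed the 50% and 75% thresholds but not the 90% one.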


FM-12 — Regulator cert mid-LI revoked

LI initiator's cert revoked mid-workflow.

Mitigation. LI paused; new approver required to continue. Audit records original initiator's revocation. ATRA notified via Regulator Liaison.


FM-13 — Urgent LI with no approver

LI requires dual-control but no approver available (holiday / off-hours).

Mitigation. Emergency-approver runbook (CISO + CTO both required). Action heavily audit-logged; LI proceeds with reduced control; post-hoc review mandatory.


4. Graceful Degradation Summary

| Failure domain | Behaviour |
| --- | --- |
| Login / cert | Fail-closed |
| Postgres | Fail-closed (503) |
| HSM | Reports queue; no silent fallback |
| Upstream reads | Partial report with warning |
| SIEM destination | Disk-WAL buffer; other destinations unaffected |
| LI SFTP | Retry + manual fallback |
| Complaint flood | Rate-limit + tarpit |

5. Failure ↔ Experience Matrix

FMRegulatorAuditorLegal / InternalNOC
FM-01Brief login issue for some IPsAlert
FM-02SIEM data delayedAlert
FM-03503 portal503 portalService downCritical
FM-04Reports unavailableBundles unavailableCritical
FM-05Partial reportPartial evidenceWarning
FM-06LI delivery delayedAwareAlert
FM-07Complaints delayedAwareAlert
FM-08Stale evidence visible to regulator if prolongedStale evidenceAwareAlert
FM-09Bundle unavailable until manual recoveryAwareAlert
FM-10Mid-session terminationAudit entry
FM-11Event loss riskCritical
FM-12LI pausedRequires new approverAlert
FM-13Urgent LI requires emergency approverHeavy auditAlert

6. Open Points

| ID | Question | Owner |
| --- | --- | --- |
| FM-OPEN-01 | Emergency-approver escalation bridge (phone + Slack) for FM-13 | Legal + Leadership |
| FM-OPEN-02 | Disk-WAL sizing policy (5 GB vs. 20 GB) | SRE |
| FM-OPEN-03 | Partial-report wording approved by ATRA | Regulator Liaison + Legal |