file-storage-service — FAILURE_MODES
Companion: OBSERVABILITY · APPLICATION_LOGIC §8 · SERVICE_RISK_REGISTER · Error Codes
This catalog enumerates what can break, who notices, how we detect, and how we recover. Each row maps to a runbook URL under runbooks.melmastoon.ghasi.io/file/.... Severities follow platform convention: P1 = customer-impacting, page within 5 min; P2 = degradation, alert within 15 min; P3 = minor / informational, ticket.
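
The severity convention above could be kept as a small lookup table so that tooling and dashboards agree on it. This is only a sketch: `SeverityPolicy`, `SEVERITY_POLICY`, and `describeSeverity` are hypothetical names, not part of the platform.

```typescript
// Hypothetical encoding of the severity convention stated in this catalog.
type Severity = "P1" | "P2" | "P3";

interface SeverityPolicy {
  description: string;
  action: "page" | "alert" | "ticket";
  // Deadline in minutes for the page or alert; null where the catalog gives none.
  withinMinutes: number | null;
}

const SEVERITY_POLICY: Record<Severity, SeverityPolicy> = {
  P1: { description: "customer-impacting", action: "page", withinMinutes: 5 },
  P2: { description: "degradation", action: "alert", withinMinutes: 15 },
  P3: { description: "minor / informational", action: "ticket", withinMinutes: null },
};

// Renders one policy as a human-readable line.
function describeSeverity(s: Severity): string {
  const p = SEVERITY_POLICY[s];
  const deadline = p.withinMinutes === null ? "" : ` within ${p.withinMinutes} min`;
  return `${s}: ${p.description}, ${p.action}${deadline}`;
}
```
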
User impact lanes:
- C = consumer-facing booking site (guests viewing photos / receiving PDFs).
- T = tenant-booking BFF and tenant booking site.
- B = backoffice / staff (uploading photos, ID scans, viewing invoices).
- P = platform-internal (other services that depend on file-storage).
- I = internal only (no direct user-facing impact); the parenthetical names the affected function, e.g. security, compliance, legal, analytics.
1. Upload flow failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| U1 | GCS rejects signed PUT (5xx) during direct upload | C, T, B | client retries; metric file_storage_uploads_failed_total{reason='gcs_5xx'} | Client retry with jitter; if persistent across tenants → check GCS status, fall back to BFF-proxy upload mode | P2 | file/gcs-upload-5xx |
| U2 | Upload session expires before confirm | C, T, B | sweeper marks session expired; emits file.upload.failed.v1 reason=session_expired | Client re-initiates; reserved quota released by sweeper | P3 | file/session-expired |
| U3 | Hash mismatch on confirm | C, T, B | file_storage_uploads_failed_total{reason='hash_mismatch'} | Session aborted, partial GCS object soft-deleted; client must re-upload; if rate spikes investigate client SDK bug | P2 | file/hash-mismatch |
| U4 | Magic-byte mismatch (declared MIME ≠ actual) | C, T, B | quarantine; file_storage_quarantine_total{reason='magic_byte'} | Notify uploader; usually a client misconfiguration (renamed .exe to .jpg) — return clear message; persistent abuse → tenant-level rate limit | P3 | file/magic-byte |
| U5 | Polyglot file detected | C, T, B | quarantine | Same as U4; treat as security event if from authenticated user with no prior history | P2 | file/polyglot |
| U6 | Quota exceeded mid-upload (race) | T, B | MELMASTOON.FILE.QUOTA_EXCEEDED returned at confirm | Client surfaces quota banner; tenant admin upgrades plan or deletes content | P3 | file/quota-runaway |
| U7 | Resumable upload abandoned | C, T, B | session open past expires_at; sweeper aborts | Sweeper calls abortResumable, clears partial GCS, releases quota | P3 | file/session-expired |
| U8 | Idempotency-key reuse with different body | T, B | 409 MELMASTOON.SYNC.IDEMPOTENCY_KEY_REUSED | Caller bug; surfaces in client logs | P3 | general/idempotency |
| U9 | Initiate succeeds but DB outbox write fails (rare) | T, B | uncaught exception → 500 → telemetry alert | Tx rolls back the FileObject and outbox both; client retries with same idempotency key | P2 | file/db-tx-failure |
| U10 | Pub/Sub publish failure on upload.completed.v1 | P | outbox lag rises | Outbox relay retries; consumers eventually receive event; no data loss | P2 | file/outbox-lag |
| U11 | Resumable session URI leak in client logs | C, T, B | post-incident audit | URI is short-lived (10 min); revoke session via abort; rotate signing identity if abuse confirmed | P3 | file/session-leak |
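
The "client retry with jitter" mitigation in U1 is commonly implemented as exponential backoff with full jitter. The sketch below shows one way to compute the delay. The function name, the 200 ms base, the 10 s cap, and the injectable random source are illustrative assumptions, not documented client SDK settings.

```typescript
// Exponential backoff with "full jitter": the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)). All defaults here are assumptions.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 200,
  capMs: number = 10000,
  rng: () => number = Math.random, // injectable so the delay can be tested
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rng() * ceiling);
}
```

Spreading retries across the whole interval keeps many clients from re-hitting GCS at the same instant after a shared 5xx burst.
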
2. Scan / quarantine failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| S1 | ClamAV cluster unreachable | P | scan callbacks stop; file_storage_scan_results_total flatlines; alert FileStorage_ScanLatencyHigh | Scale ClamAV pods, check Pub/Sub backlog; reads remain blocked while files in scanning; consider feature flag to relax to "DLP-only" path for low-risk scopes | P2 | file/quarantine-storm |
| S2 | Cloud DLP API quota exceeded | P | DLP scan failures; scope pii_id_scan reads blocked | Increase DLP quota; temporarily promote files to inconclusive (manual review) for forensic mode | P2 | file/dlp-quota |
| S3 | Scan callback delivered for unknown file | P | inbox dedup logs unknown file; metric file_storage_inbox_handler_failures_total | Likely race or replay; safe to ignore; alert if rate > 1/s | P3 | file/scan-callback-orphan |
| S4 | Scan returns inconclusive 3 attempts in a row | P | metric file_storage_scan_results_total{verdict='inconclusive'} | Promote to quarantine; security reviewer manually triages | P2 | file/scan-inconclusive |
| S5 | False positive quarantine (legitimate file flagged) | T, B | tenant complaint | OverrideQuarantineUseCase releases to archive (not ready); investigate scanner version | P3 | file/quarantine-override |
| S6 | Scan SLO breach (p95 > 60 s) | C, T, B | alert FileStorage_ScanLatencyHigh | Scale ClamAV; investigate poison object loop | P2 | file/scan-latency |
| S7 | Scan worker delivers callback to expired endpoint after deploy | P | mTLS handshake fails | Cloud Run revisions accept old SAN for 30 min during canary; investigate if longer | P3 | file/canary-handover |
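
Row S4 promotes a file to quarantine after three inconclusive verdicts in a row. A minimal sketch of that rule follows. The verdict names `clean` and `infected`, the status names, and `nextStatus` itself are assumptions for illustration; only the three-in-a-row threshold comes from the table.

```typescript
type Verdict = "clean" | "infected" | "inconclusive"; // assumed verdict names
type FileStatus = "scanning" | "ready" | "quarantined"; // assumed status names

const MAX_INCONCLUSIVE = 3; // from S4: three inconclusive attempts in a row

// Derives the file status from the verdict history (oldest first).
function nextStatus(verdicts: Verdict[]): FileStatus {
  const last = verdicts[verdicts.length - 1];
  if (last === undefined) return "scanning"; // no verdict yet
  if (last === "clean") return "ready";
  if (last === "infected") return "quarantined";
  // Count the trailing run of inconclusive verdicts.
  let run = 0;
  for (let i = verdicts.length - 1; i >= 0 && verdicts[i] === "inconclusive"; i--) {
    run++;
  }
  return run >= MAX_INCONCLUSIVE ? "quarantined" : "scanning";
}
```
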
3. Optimization pipeline failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| O1 | Optimizer worker crash on poison image (image bomb / ZIP bomb / corrupted JPEG) | C, T, B | per-object retries 5× then DLQ; file_storage_optimization_failed_total | DLQ alert; manual triage; mark variant failed; original remains usable; sandbox catches resource exhaustion | P3 | file/optimizer-dlq |
| O2 | Optimizer worker OOM | P | container restart + Pub/Sub re-delivery | RSS cap 512 MB enforced; raise memory if persistent for legitimate large images | P2 | file/optimizer-oom |
| O3 | Variant upload to GCS fails 5× | P | DLQ + alert | Retry from optimizer; eventually variant.status=failed; original still usable; CDN serves original | P3 | file/optimizer-dlq |
| O4 | Optimization SLO breach (p95 > 60 s for ≤ 5 MB) | C, T | alert FileStorage_OptimizerSLOBreach | Scale workers; check sharp / ffmpeg perf regression after deploy | P3 | file/optimization-latency |
| O5 | Variant rendered with wrong dimensions | T | manual report; optional QA sample audit | Re-enqueue optimization with bug fix; sweeper picks up new variant | P3 | file/variant-dimension |
| O6 | Optimization callback never arrives | P | sweeper marks variant failed after 60 min | Retry on sweeper run; investigate Eventarc subscription health | P2 | file/callback-missing |
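
Rows O1 and O3 both retry five times before dead-lettering, and the original stays usable throughout. The decision can be sketched as below, assuming "5×" means at most five delivery attempts in total. `Disposition` and `dispositionAfterFailure` are hypothetical names.

```typescript
// Assumption: "retries 5×" means at most five delivery attempts in total.
const MAX_DELIVERY_ATTEMPTS = 5;

type Disposition =
  | { action: "retry" }
  | { action: "dead_letter"; variantStatus: "failed"; serveOriginal: true };

// deliveryAttempt is 1-based: 1 for the first delivery of the message.
function dispositionAfterFailure(deliveryAttempt: number): Disposition {
  if (deliveryAttempt < MAX_DELIVERY_ATTEMPTS) {
    return { action: "retry" };
  }
  // Out of attempts: mark the variant failed and keep serving the original.
  return { action: "dead_letter", variantStatus: "failed", serveOriginal: true };
}
```
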
4. Download / signed URL failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| D1 | Signed URL expired before client uses it | C, T, B | client gets 403 from GCS; metric file_storage_downloads_failed_total{reason='expired'} | Client requests a new URL; tighten TTL doc; consider longer default | P3 | file/url-expired |
| D2 | Cross-tenant URL leaked (someone shared a URL externally) | I (security) | SIEM detects access from unexpected IP / UA | Revoke access_grant; URL becomes inert at TTL or via ZSET blacklist (private bucket); rotate signing identity if URL was issued to a compromised actor | P1 | file/cross-tenant-leak-suspected |
| D3 | CDN serves stale public asset after delete | T | tenant report; cdn_invalidate_backlog metric | CDN invalidation worker drains backlog; if persistent, manually invalidate via gcloud | P2 | file/cdn-invalidation-backlog |
| D4 | Signed URL refused at issuance because file is scanning | C, T, B | 409 MELMASTOON.FILE.SCAN_PENDING | Expected; client polls or waits for scan.passed.v1 event | P3 | file/scan-pending |
| D5 | Quarantined file requested for download | C, T, B | 409 MELMASTOON.FILE.QUARANTINED | Expected; backoffice can use override flow; consumer / tenant cannot | P3 | n/a |
| D6 | Private CDN sidecar down | C, T, B | LB health check fails | Sidecar autoscales to zero traffic; LB removes from rotation; private downloads fail until restored | P2 | file/private-cdn-sidecar |
| D7 | Redis flush (signed URL blacklist lost) | I (security) | metric signed_url_blacklist_size drops to 0 | Re-seed from Postgres backstop on startup; alert if mismatch persists > 60 s | P2 | file/blacklist-flush |
| D8 | Per-tenant download rate limit exceeded | C, T | 429 MELMASTOON.GENERAL.RATE_LIMITED | Caller backs off; consider raising bucket if tenant on enterprise plan | P3 | general/rate-limit |
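
Rows D4 and D5 describe expected refusals at signed-URL issuance. A sketch of that gate follows, using the two error codes from the table. The `FileStatus` values and the `checkIssuable` function are illustrative assumptions about how the status might be modeled.

```typescript
type FileStatus = "scanning" | "ready" | "quarantined"; // assumed status names

type IssueResult =
  | { ok: true }
  | {
      ok: false;
      httpStatus: 409;
      code: "MELMASTOON.FILE.SCAN_PENDING" | "MELMASTOON.FILE.QUARANTINED";
    };

// Decides whether a signed URL may be issued for a file in the given status.
function checkIssuable(status: FileStatus): IssueResult {
  switch (status) {
    case "scanning":
      // D4: client polls or waits for the scan.passed.v1 event.
      return { ok: false, httpStatus: 409, code: "MELMASTOON.FILE.SCAN_PENDING" };
    case "quarantined":
      // D5: only the backoffice override flow can release the file.
      return { ok: false, httpStatus: 409, code: "MELMASTOON.FILE.QUARANTINED" };
    case "ready":
      return { ok: true };
  }
}
```
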
5. Retention & erasure failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| R1 | Retention sweeper falls behind (lag > 30 min) | I (compliance) | alert FileStorage_RetentionSweepLag | Investigate query plan; ensure index file_objects_hard_delete_after_idx is hot; scale sweeper job concurrency | P2 | file/retention-sweep-lag |
| R2 | GCS delete fails during sweep | I (compliance) | retry → DLQ → alert | Manual delete via gcloud + DB row update; investigate IAM | P2 | file/gcs-delete-fail |
| R3 | CDN invalidation fails on erasure | C, T (privacy) | cdn_invalidated=false on certificate; alert | Retry queue; if persistent, fall back to manual invalidation; certificate flagged | P1 | file/cdn-invalidation-erasure |
| R4 | Erasure batch partial (some rows deferred legitimately) | I (compliance) | erasure_requests.status='partial' | Expected; certificate lists deferred IDs and releasedAt; sweeper runs deferred at horizon | P3 | file/erasure-partial |
| R5 | Erasure batch fails (e.g., DB connection drops) | I (compliance) | status='failed' | Operator re-runs erasure with same idempotency key; resumes from last purged_at per item | P1 | file/erasure-cert-failure |
| R6 | Erasure certificate signing fails (KMS) | I (compliance) | metric file_storage_erasure_runs_total{outcome='cert_fail'} | Retry signing; check KMS quota / key version | P2 | file/erasure-cert-fail |
| R7 | OCR redaction worker fails repeatedly | I (privacy) | metric file_storage_ai_calls_total{purpose='ocr_redact', outcome='fail'} | Backoff; original retained until success; alert on 24 h backlog | P2 | file/ocr-redact-fail |
| R8 | Legal hold not honoured (sweeper deleted held file) | I (legal) | post-incident audit; would be P1 if real | Triple-check via test retention-sweep.spec.ts; trigger replay of legal-hold metadata if any drift detected | P1 | file/legal-hold-violation |
| R9 | Quarantine purge regret (need bytes for forensics) | I | post-incident; bytes are gone after 30 d | Within window, copy from quarantine bucket; outside window, restore from GCS versioning if still in 30 d versioning window | P2 | file/quarantine-purge |
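
Row R8 depends on the sweeper never selecting a file under legal hold. The predicate below is a sketch of that guard. The `FileRow` shape, the `legalHold` flag, and `isSweepable` are hypothetical; only the hard-delete horizon (suggested by the `file_objects_hard_delete_after_idx` index name) and the legal-hold requirement come from the table.

```typescript
interface FileRow {
  id: string;
  hardDeleteAfter: Date | null; // assumed column behind file_objects_hard_delete_after_idx
  legalHold: boolean; // hypothetical flag; the real schema may use a separate table
}

// A row is sweepable only if it is past its horizon AND not under legal hold.
function isSweepable(row: FileRow, now: Date): boolean {
  if (row.legalHold) return false; // legal hold always wins
  if (row.hardDeleteAfter === null) return false; // no horizon set
  return row.hardDeleteAfter.getTime() <= now.getTime();
}
```

Checking the hold first means a wrong or missing horizon can never cause a held file to be deleted.
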
6. Storage layer failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| ST1 | GCS regional outage (europe-west4) | C, T, B, P | LB error rate up; file_storage_gcs_op_failures_total spike | Dual-region buckets self-heal for media/archive; private bucket unavailable until region restored — feature flag for graceful degradation | P1 | file/gcs-region-out |
| ST2 | Cloud SQL primary failover | minimal | < 60 s blip | Auto-failover; connections re-pool; outbox catches up | P2 | file/cloudsql-failover |
| ST3 | Cloud SQL replica drift | I (analytics) | replica lag metric | Alert; reads from replica suspended until lag < 10 s | P3 | file/replica-lag |
| ST4 | Memorystore Redis evicts hot signed-URL cache | C, T, B | latency up; signed_url_cache_misses_total spike | Increase Memorystore tier; check key TTLs | P2 | file/redis-eviction |
| ST5 | KMS key version disabled accidentally | C, T, B (private only) | decrypt errors on private bucket | Re-enable previous version; rollback IAM change | P1 | file/kms-unavailable |
| ST6 | Bucket Lock prevents intended deletion (archive) | I (admin) | gcloud delete returns 403 | Expected for tax_compliance Bucket-Locked objects; wait until retention horizon | P3 | file/bucket-lock |
| ST7 | Datastream pipeline stalled (BigQuery sync) | I (analytics) | BQ row count diverges | Restart Datastream; backfill from outbox archive | P3 | file/datastream-stall |
7. Event-driven failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| E1 | Outbox relay falls behind (lag > 30 s) | P | alert FileStorage_OutboxLagHigh | Scale relay; check Pub/Sub publish errors | P2 | file/outbox-lag |
| E2 | Outbox relay crash | P | health check fails; restart | At-least-once guarantees no event loss; consumers dedupe | P3 | file/relay-crash |
| E3 | Inbox handler crash | P | metric inbox_handler_failures_total | Pub/Sub redelivers; idempotent handler succeeds eventually; DLQ at 5 attempts | P2 | file/dlq-growth |
| E4 | Pub/Sub topic deletion (operator error) | P | publish failures + alert | Recreate topic from terraform; replay outbox unpublished window | P1 | file/topic-deletion |
| E5 | Schema-incompatible event published | P | consumer schema validation fails | Block at CI via compatibility-check; if escapes, ship .v2 and adapter shim | P2 | file/schema-incompat |
| E6 | Consumed event from another service has unknown new field | P | tolerated (forward-compatible parsing) | None; just monitor for excess unknown_field_count | P3 | n/a |
| E7 | tenant.guest.erasure_requested.v1 consumed but tenant_id mismatch | I | consumer asserts; metric inbox_dedupe_skips_total{reason='tenant_mismatch'} | Drop event; security alert if rate > 1/min | P2 | file/cross-tenant-event |
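
Rows E3 and E7 rely on an idempotent inbox handler that drops tenant-mismatched events. The sketch below shows the shape of such a handler with an in-memory dedupe set. The `InboxEvent` fields, the envelope-versus-payload tenant comparison, and `createInbox` are assumptions; a real inbox would persist seen IDs in the database.

```typescript
interface InboxEvent {
  eventId: string;
  tenantId: string; // tenant on the envelope (assumed field)
  payload: { tenantId: string }; // tenant inside the payload (assumed field)
}

type Outcome = "processed" | "duplicate" | "tenant_mismatch";

// Wraps a handler with tenant assertion and at-least-once deduplication.
function createInbox(handle: (e: InboxEvent) => void): (e: InboxEvent) => Outcome {
  const seen = new Set<string>(); // in-memory stand-in for an inbox table
  return (e: InboxEvent): Outcome => {
    if (e.tenantId !== e.payload.tenantId) return "tenant_mismatch"; // E7: drop
    if (seen.has(e.eventId)) return "duplicate"; // redelivery of a handled event
    handle(e); // if this throws, the ID is not recorded, so redelivery retries
    seen.add(e.eventId);
    return "processed";
  };
}
```
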
8. AI subsystem failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| A1 | Orchestrator unreachable | T (no alt text) | ai_calls_failed_total | File still goes to ready; alt text empty; backfill job re-runs nightly | P3 | file/ai-orchestrator-down |
| A2 | Per-tenant AI budget exhausted | T (no alt text) | MELMASTOON.AI.REFUSED_BUDGET | Tenant upgrade or wait next period; degraded UX banner | P3 | file/ai-budget |
| A3 | HITL backlog > 24 h | T (borderline images held) | hitl_oldest_age_seconds | Auto-quarantine on 24 h; staffing alert | P2 | file/hitl-backlog |
| A4 | Model rolls and quality regresses | T | spot QA + drift report | Pin previous modelRef in orchestrator policy; re-run | P3 | file/ai-model-drift |
| A5 | OCR redact returns wrong boxes (PII bleeds through) | I (privacy) | spot audit; consumer reports | Quarantine the redacted file; manual redaction via backoffice; investigate model | P1 | file/ocr-leak |
9. Desktop / sync failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| DK1 | Desktop offline outbox grows unbounded | B | Electron telemetry; local disk usage alert | UI prompts user; sync engine prioritizes oldest; large files chunked to avoid blocking | P3 | desktop/offline-outbox |
| DK2 | Resumable upload from Electron stalls on flaky link | B | client retries; integration with low-bandwidth e2e test | Resume protocol picks up; user can pause/cancel | P3 | desktop/resumable-stall |
| DK3 | Renderer caches expired signed URL | B | DOM image fails to load | Cache TTL = grant.expires - 30s; renderer auto-refresh on 403 | P3 | desktop/url-cache |
| DK4 | Desktop captures wrong tenant context (multi-tenant user) | B | inbox handler tenant_mismatch | Renderer sets X-Tenant-Id from active tenant in app bar; guard on switch | P2 | desktop/tenant-switch |
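
The DK3 mitigation sets the renderer cache TTL to the grant expiry minus 30 s. As a sketch (the function name and the clamp to zero are assumptions added for illustration):

```typescript
const SAFETY_MARGIN_MS = 30000; // the 30 s margin from DK3

// Cache lifetime for a signed URL: time until grant expiry minus the margin,
// clamped at zero so an almost-expired grant is never cached.
function cacheTtlMs(grantExpiresAt: Date, now: Date): number {
  const remaining = grantExpiresAt.getTime() - now.getTime();
  return Math.max(0, remaining - SAFETY_MARGIN_MS);
}
```

A TTL of zero tells the renderer to request a fresh URL instead of caching one that would fail with a 403 shortly afterwards.
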
10. Observability failures
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| OB1 | OTLP exporter unreachable | I (visibility) | telemetry collector up but service spans missing | Drop spans, never block hot path; alert on collector health | P3 | obs/otlp-down |
| OB2 | Logs sink throttled by Cloud Logging | I | ingestion 429 | Reduce log level on selected loggers; investigate noise (scan callback success at info?) | P3 | obs/log-throttle |
| OB3 | Datastream → BQ broken | I (slo) | SLO panels stale | Restart Datastream; backfill from outbox archive | P3 | obs/datastream |
| OB4 | Alert noisy / duplicate | I (on-call) | PagerDuty incident review | Tune thresholds; consolidate via routing rules | P3 | obs/alert-tuning |
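
The OB1 rule "drop spans, never block hot path" can be illustrated with a bounded buffer that rejects new items when full instead of waiting. This is a generic sketch, not the behavior of any particular OTLP exporter; `createSpanBuffer` and its methods are hypothetical names.

```typescript
// Bounded, non-blocking buffer: when full, new spans are dropped and counted.
function createSpanBuffer<T>(capacity: number) {
  const items: T[] = [];
  let dropped = 0;
  return {
    // Never blocks or throws; returns false if the span was dropped.
    offer(span: T): boolean {
      if (items.length >= capacity) {
        dropped++;
        return false;
      }
      items.push(span);
      return true;
    },
    // Removes and returns everything buffered so far (called by the exporter).
    drain(): T[] {
      return items.splice(0, items.length);
    },
    // Number of spans dropped so far; useful as a visibility metric.
    droppedCount(): number {
      return dropped;
    },
  };
}
```
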
11. Catastrophic / business-continuity
| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| BC1 | Total europe-west4 region outage | All | LB | Promote europe-west1 replica; rebuild Redis cold; run in degraded read-mostly mode while writes drain (RTO 30 min, RPO ≤ 5 min) | P1 | bc/region-failover |
| BC2 | Catastrophic Postgres data loss (primary + replica corrupt) | All | DB query failures | Restore from automated backup; replay outbox for last 24 h | P1 | bc/db-restore |
| BC3 | Suspected mass exfiltration of signed URLs | I (security) | SIEM burst | Freeze tenant; rotate signing identity; revoke all active grants; security incident review | P1 | file/cross-tenant-leak-suspected |
| BC4 | Compromised service account | I (security) | IAM audit / unusual access pattern | Revoke key version; rotate WIF binding; rebuild service from clean image | P1 | bc/sa-compromise |
| BC5 | Persistent CMEK rotation failure | All (private only) | KMS errors persistent | Read-only mode for private bucket; restore previous key version; security review | P1 | file/kms-rotation |
| BC6 | Regulator-issued takedown order | I (legal) | manual | OverrideQuarantineUseCase + legal-hold table; certificate of compliance | P1 | bc/regulator-takedown |
12. Incident response (general)
- PagerDuty pages on-call rotation (Tier 2 file-storage rotation).
- On-call acknowledges within 5 min, opens incident channel #inc-file-….
- Apply mitigation per runbook above.
- Communicate status via status page if customer-impacting > 10 min.
- Postmortem within 5 business days; action items tracked in SERVICE_RISK_REGISTER.
All P1 incidents trigger a security-reviewer review even if the cause is non-security; cross-tenant leakage or PII exposure escalates immediately to the security on-call.