
file-storage-service — FAILURE_MODES

Companion: OBSERVABILITY · APPLICATION_LOGIC §8 · SERVICE_RISK_REGISTER · Error Codes

This catalog enumerates what can break, who notices, how we detect, and how we recover. Each row maps to a runbook URL under runbooks.melmastoon.ghasi.io/file/.... Severities follow platform convention: P1 = customer-impacting, page within 5 min; P2 = degradation, alert within 15 min; P3 = minor / informational, ticket.
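The severity convention above can be written down as a small lookup table. This is only a sketch: the type and field names (`SeverityPolicy`, `withinMinutes`, `describeSeverity`) are hypothetical and do not come from the platform's code.

```typescript
// Hypothetical encoding of the severity convention stated in the text.
type Severity = "P1" | "P2" | "P3";

interface SeverityPolicy {
  meaning: string;
  action: "page" | "alert" | "ticket";
  // Response deadline in minutes; null where the text states none (P3).
  withinMinutes: number | null;
}

const SEVERITY_POLICY: Record<Severity, SeverityPolicy> = {
  P1: { meaning: "customer-impacting", action: "page", withinMinutes: 5 },
  P2: { meaning: "degradation", action: "alert", withinMinutes: 15 },
  P3: { meaning: "minor / informational", action: "ticket", withinMinutes: null },
};

// Renders one policy as a short human-readable line.
function describeSeverity(severity: Severity): string {
  const policy = SEVERITY_POLICY[severity];
  const deadline =
    policy.withinMinutes === null ? "" : ` within ${policy.withinMinutes} min`;
  return `${severity}: ${policy.meaning}, ${policy.action}${deadline}`;
}
```

For example, `describeSeverity("P2")` yields `"P2: degradation, alert within 15 min"`.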

User impact lanes:

  • C = consumer-facing booking site (guests viewing photos / receiving PDFs).
  • T = tenant-booking BFF and tenant booking site.
  • B = backoffice / staff (uploading photos, ID scans, viewing invoices).
  • P = platform-internal (other services that depend on file-storage).

1. Upload flow failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| U1 | GCS rejects signed PUT (5xx) during direct upload | C, T, B | client retries; metric file_storage_uploads_failed_total{reason='gcs_5xx'} | Client retry with jitter; if persistent across tenants → check GCS status, fall back to BFF-proxy upload mode | P2 | file/gcs-upload-5xx |
| U2 | Upload session expires before confirm | C, T, B | sweeper marks session expired; emits file.upload.failed.v1 reason=session_expired | Client re-initiates; reserved quota released by sweeper | P3 | file/session-expired |
| U3 | Hash mismatch on confirm | C, T, B | file_storage_uploads_failed_total{reason='hash_mismatch'} | Session aborted, partial GCS object soft-deleted; client must re-upload; if rate spikes investigate client SDK bug | P2 | file/hash-mismatch |
| U4 | Magic-byte mismatch (declared MIME ≠ actual) | C, T, B | quarantine; file_storage_quarantine_total{reason='magic_byte'} | Notify uploader; usually a client misconfiguration (renamed .exe to .jpg), so return a clear message; persistent abuse → tenant-level rate limit | P3 | file/magic-byte |
| U5 | Polyglot file detected | C, T, B | quarantine | Same as U4; treat as security event if from authenticated user with no prior history | P2 | file/polyglot |
| U6 | Quota exceeded mid-upload (race) | T, B | MELMASTOON.FILE.QUOTA_EXCEEDED returned at confirm | Client surfaces quota banner; tenant admin upgrades plan or deletes content | P3 | file/quota-runaway |
| U7 | Resumable upload abandoned | C, T, B | session open past expires_at; sweeper aborts | Sweeper calls abortResumable, clears partial GCS, releases quota | P3 | file/session-expired |
| U8 | Idempotency-key reuse with different body | T, B | 409 MELMASTOON.SYNC.IDEMPOTENCY_KEY_REUSED | Caller bug; surfaces in client logs | P3 | general/idempotency |
| U9 | Initiate succeeds but DB outbox write fails (rare) | T, B | uncaught exception → 500 → telemetry alert | Tx rolls back the FileObject and outbox both; client retries with same idempotency key | P2 | file/db-tx-failure |
| U10 | Pub/Sub publish failure on upload.completed.v1 | P | outbox lag rises | Outbox relay retries; consumers eventually receive event; no data loss | P2 | file/outbox-lag |
| U11 | Resumable session URI leak in client logs | C, T, B | post-incident audit | URI is short-lived (10 min); revoke session via abort; rotate signing identity if abuse confirmed | P3 | file/session-leak |
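The "client retry with jitter" mitigation for U1 could look roughly like the sketch below, which uses the common full-jitter scheme. The base delay, cap, attempt limit, and function names are illustrative assumptions; the catalog does not specify them.

```typescript
// Full-jitter backoff: the delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
// baseMs, capMs and maxAttempts are illustrative values, not taken from the catalog.
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Treats only 5xx responses as retryable, matching reason='gcs_5xx'.
function isServerError(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Runs op until it succeeds, a non-retryable error occurs, or maxAttempts is used up.
async function retryWithJitter<T>(
  op: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const attemptsUsed = attempt + 1;
      if (attemptsUsed >= maxAttempts || !isRetryable(err)) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Randomizing the whole delay, rather than adding a small jitter to a fixed schedule, spreads retries from many clients so that a recovering GCS endpoint is not hit by synchronized bursts.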

2. Scan / quarantine failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| S1 | ClamAV cluster unreachable | P | scan callbacks stop; file_storage_scan_results_total flatlines; alert FileStorage_ScanLatencyHigh | Scale ClamAV pods, check Pub/Sub backlog; reads remain blocked while files in scanning; consider feature flag to relax to "DLP-only" path for low-risk scopes | P2 | file/quarantine-storm |
| S2 | Cloud DLP API quota exceeded | P | DLP scan failures; scope pii_id_scan reads blocked | Increase DLP quota; temporarily promote files to inconclusive (manual review) for forensic mode | P2 | file/dlp-quota |
| S3 | Scan callback delivered for unknown file | P | inbox dedup logs unknown file; metric file_storage_inbox_handler_failures_total | Likely race or replay; safe to ignore; alert if rate > 1/s | P3 | file/scan-callback-orphan |
| S4 | Scan returns inconclusive 3 attempts in a row | P | metric file_storage_scan_results_total{verdict='inconclusive'} | Promote to quarantine; security reviewer manually triages | P2 | file/scan-inconclusive |
| S5 | False positive quarantine (legitimate file flagged) | T, B | tenant complaint | OverrideQuarantineUseCase releases to archive (not ready); investigate scanner version | P3 | file/quarantine-override |
| S6 | Scan SLO breach (p95 > 60 s) | C, T, B | alert FileStorage_ScanLatencyHigh | Scale ClamAV; investigate poison object loop | P2 | file/scan-latency |
| S7 | Scan worker delivers callback to expired endpoint after deploy | P | mTLS handshake fails | Cloud Run revisions accept old SAN for 30 min during canary; investigate if longer | P3 | file/canary-handover |
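Row S4 (three inconclusive scans in a row promote the file to quarantine) can be illustrated as a pure function over the verdict history. The verdict names `clean` and `infected`, the state names, and `nextScanState` are assumptions for this sketch; only `inconclusive`, the scanning/ready states, and the threshold of 3 come from the tables.

```typescript
// Verdicts other than "inconclusive" are hypothetical labels for this sketch.
type Verdict = "clean" | "infected" | "inconclusive";
type ScanState = "scanning" | "ready" | "quarantined";

// Threshold from row S4.
const MAX_INCONCLUSIVE_IN_A_ROW = 3;

// Derives the file state from the verdicts received so far, oldest first.
function nextScanState(verdicts: Verdict[]): ScanState {
  const last = verdicts[verdicts.length - 1];
  if (last === undefined) return "scanning"; // no callback yet
  if (last === "clean") return "ready";
  if (last === "infected") return "quarantined";

  // Count the trailing run of inconclusive verdicts.
  let streak = 0;
  for (let i = verdicts.length - 1; i >= 0 && verdicts[i] === "inconclusive"; i--) {
    streak++;
  }
  return streak >= MAX_INCONCLUSIVE_IN_A_ROW ? "quarantined" : "scanning";
}
```

Keeping the file in `scanning` until the threshold is reached matches the table's note that reads stay blocked while a scan is pending.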

3. Optimization pipeline failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| O1 | Optimizer worker crash on poison image (image bomb / ZIP bomb / corrupted JPEG) | C, T, B | per-object retries 5× then DLQ; file_storage_optimization_failed_total | DLQ alert; manual triage; mark variant failed; original remains usable; sandbox catches resource exhaustion | P3 | file/optimizer-dlq |
| O2 | Optimizer worker OOM | P | container restart + Pub/Sub re-delivery | RSS cap 512 MB enforced; raise memory if persistent for legitimate large images | P2 | file/optimizer-oom |
| O3 | Variant upload to GCS fails 5× | P | DLQ + alert | Retry from optimizer; eventually variant.status=failed; original still usable; CDN serves original | P3 | file/optimizer-dlq |
| O4 | Optimization SLO breach (p95 > 60 s for ≤ 5 MB) | C, T | alert FileStorage_OptimizerSLOBreach | Scale workers; check sharp / ffmpeg perf regression after deploy | P3 | file/optimization-latency |
| O5 | Variant rendered with wrong dimensions | T | manual report; optional QA sample audit | Re-enqueue optimization with bug fix; sweeper picks up new variant | P3 | file/variant-dimension |
| O6 | Optimization callback never arrives | P | sweeper marks variant failed after 60 min | Retry on sweeper run; investigate Eventarc subscription health | P2 | file/callback-missing |
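Rows O1 and O3 share one pattern: retry a bounded number of times, dead-letter the message, mark the variant failed, and keep serving the original. A minimal sketch follows, assuming "5×" means five delivery attempts in total; the names `dispose` and `urlToServe` and the `pending` status are hypothetical.

```typescript
// Assumption: "retries 5×" is read as five delivery attempts in total.
const MAX_DELIVERY_ATTEMPTS = 5;

type VariantStatus = "pending" | "ready" | "failed";

interface Disposition {
  action: "ack" | "retry" | "dead_letter";
  variantStatus: VariantStatus;
}

// Decides what to do with one optimization message after a processing attempt.
// deliveryAttempt is 1-based: 1 for the first delivery.
function dispose(deliveryAttempt: number, succeeded: boolean): Disposition {
  if (succeeded) return { action: "ack", variantStatus: "ready" };
  if (deliveryAttempt < MAX_DELIVERY_ATTEMPTS) {
    return { action: "retry", variantStatus: "pending" };
  }
  return { action: "dead_letter", variantStatus: "failed" };
}

// Falls back to the original object whenever the variant is not ready.
function urlToServe(
  variant: { status: VariantStatus; url: string },
  originalUrl: string,
): string {
  return variant.status === "ready" ? variant.url : originalUrl;
}
```

Because the fallback is decided at read time, a failed variant never blocks delivery of the original file.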

4. Download / signed URL failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| D1 | Signed URL expired before client uses it | C, T, B | client gets 403 from GCS; metric file_storage_downloads_failed_total{reason='expired'} | Client requests a new URL; tighten TTL doc; consider longer default | P3 | file/url-expired |
| D2 | Cross-tenant URL leaked (someone shared a URL externally) | I (security) | SIEM detects access from unexpected IP / UA | Revoke access_grant; URL becomes inert at TTL or via ZSET blacklist (private bucket); rotate signing identity if URL was issued to a compromised actor | P1 | file/cross-tenant-leak-suspected |
| D3 | CDN serves stale public asset after delete | T | tenant report; cdn_invalidate_backlog metric | CDN invalidation worker drains backlog; if persistent, manually invalidate via gcloud | P2 | file/cdn-invalidation-backlog |
| D4 | Signed URL refused at issuance because file is scanning | C, T, B | 409 MELMASTOON.FILE.SCAN_PENDING | Expected; client polls or waits for scan.passed.v1 event | P3 | file/scan-pending |
| D5 | Quarantined file requested for download | C, T, B | 409 MELMASTOON.FILE.QUARANTINED | Expected; backoffice can use override flow; consumer / tenant cannot | P3 | n/a |
| D6 | Private CDN sidecar down | C, T, B | LB health check fails | Sidecar autoscales to zero traffic; LB removes from rotation; private downloads fail until restored | P2 | file/private-cdn-sidecar |
| D7 | Redis flush (signed URL blacklist lost) | I (security) | metric signed_url_blacklist_size drops to 0 | Re-seed from Postgres backstop on startup; alert if mismatch persists > 60 s | P2 | file/blacklist-flush |
| D8 | Per-tenant download rate limit exceeded | C, T | 429 MELMASTOON.GENERAL.RATE_LIMITED | Caller backs off; consider raising bucket if tenant on enterprise plan | P3 | general/rate-limit |
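Rows D2 and D7 rely on a blacklist of revoked signed URLs. The in-memory sketch below shows one plausible shape for it, assuming each entry is keyed by a grant id and scored by the URL's own expiry time so that entries can be pruned once the URL is inert anyway. The class and method names are hypothetical, and how the real ZSET is keyed is not stated in the catalog.

```typescript
// In-memory stand-in for a sorted-set blacklist: grantId -> URL expiry (epoch ms).
// Assumption: the score is the signed URL's expiry time.
class SignedUrlDenyList {
  private readonly entries = new Map<string, number>();

  // Records a revocation that only needs to live until the URL expires.
  revoke(grantId: string, urlExpiresAtMs: number): void {
    this.entries.set(grantId, urlExpiresAtMs);
  }

  // A grant counts as revoked only while its URL could still be used.
  isRevoked(grantId: string, nowMs: number): boolean {
    const expiresAtMs = this.entries.get(grantId);
    return expiresAtMs !== undefined && nowMs < expiresAtMs;
  }

  // Drops entries whose URL has expired; returns how many were removed.
  prune(nowMs: number): number {
    let removed = 0;
    this.entries.forEach((expiresAtMs, grantId) => {
      if (expiresAtMs <= nowMs) {
        this.entries.delete(grantId);
        removed++;
      }
    });
    return removed;
  }

  // Analogue of the signed_url_blacklist_size metric from row D7.
  get size(): number {
    return this.entries.size;
  }
}
```

With a Redis sorted set, pruning by score would typically be a single `ZREMRANGEBYSCORE` call; a sudden drop of `size` to 0 without a matching prune is the signal row D7 alerts on.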

5. Retention & erasure failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| R1 | Retention sweeper falls behind (lag > 30 min) | I (compliance) | alert FileStorage_RetentionSweepLag | Investigate query plan; ensure index file_objects_hard_delete_after_idx is hot; scale sweeper job concurrency | P2 | file/retention-sweep-lag |
| R2 | GCS delete fails during sweep | I (compliance) | retry → DLQ → alert | Manual delete via gcloud + DB row update; investigate IAM | P2 | file/gcs-delete-fail |
| R3 | CDN invalidation fails on erasure | C, T (privacy) | cdn_invalidated=false on certificate; alert | Retry queue; if persistent, fall back to manual invalidation; certificate flagged | P1 | file/cdn-invalidation-erasure |
| R4 | Erasure batch partial (some rows deferred legitimately) | I (compliance) | erasure_requests.status='partial' | Expected; certificate lists deferred IDs and releasedAt; sweeper runs deferred at horizon | P3 | file/erasure-partial |
| R5 | Erasure batch fails (e.g., DB connection drops) | I (compliance) | status='failed' | Operator re-runs erasure with same idempotency key; resumes from last purged_at per item | P1 | file/erasure-cert-failure |
| R6 | Erasure certificate signing fails (KMS) | I (compliance) | metric file_storage_erasure_runs_total{outcome='cert_fail'} | Retry signing; check KMS quota / key version | P2 | file/erasure-cert-fail |
| R7 | OCR redaction worker fails repeatedly | I (privacy) | metric file_storage_ai_calls_total{purpose='ocr_redact', outcome='fail'} | Backoff; original retained until success; alert on 24 h backlog | P2 | file/ocr-redact-fail |
| R8 | Legal hold not honoured (sweeper deleted held file) | I (legal) | post-incident audit; would be P1 if real | Triple-check via test retention-sweep.spec.ts; trigger replay of legal-hold metadata if any drift detected | P1 | file/legal-hold-violation |
| R9 | Quarantine purge regret (need bytes for forensics) | I | post-incident; bytes are gone after 30 d | Within window, copy from quarantine bucket; outside window, restore from GCS versioning if still in 30 d versioning window | P2 | file/quarantine-purge |
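Rows R4 and R5 describe an erasure batch that can be re-run safely: items already purged are skipped, items still inside a retention horizon are deferred, and a mid-batch failure leaves a resumable state. A rough sketch of that loop is below; the item shape, field names, the `completed` status, and `runErasure` are assumptions for illustration.

```typescript
// Hypothetical per-item record; purgedAt mirrors the purged_at column named in R5.
interface ErasureItem {
  fileId: string;
  purgedAt: Date | null;
  deferredUntil: Date | null; // retention horizon, if the item must be kept for now
}

// "partial" and "failed" come from the tables; "completed" is an assumed label.
type ErasureStatus = "completed" | "partial" | "failed";

// Purges every item that is due and not yet purged. Safe to call repeatedly:
// a re-run skips items whose purgedAt is already set.
function runErasure(
  items: ErasureItem[],
  purge: (fileId: string) => void,
  now: Date,
): ErasureStatus {
  let deferred = 0;
  for (const item of items) {
    if (item.purgedAt !== null) continue; // done in an earlier run
    if (item.deferredUntil !== null && item.deferredUntil.getTime() > now.getTime()) {
      deferred++;
      continue;
    }
    try {
      purge(item.fileId);
      item.purgedAt = now; // progress marker that makes the run resumable
    } catch {
      return "failed"; // operator re-runs; completed items are not repeated
    }
  }
  return deferred > 0 ? "partial" : "completed";
}
```

Recording progress per item, rather than per batch, is what lets the operator re-run with the same idempotency key without purging anything twice.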

6. Storage layer failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| ST1 | GCS regional outage (europe-west4) | C, T, B, P | LB error rate up; file_storage_gcs_op_failures_total spike | Dual-region buckets self-heal for media/archive; private bucket unavailable until region restored; feature flag for graceful degradation | P1 | file/gcs-region-out |
| ST2 | Cloud SQL primary failover | minimal | < 60 s blip | Auto-failover; connections re-pool; outbox catches up | P2 | file/cloudsql-failover |
| ST3 | Cloud SQL replica drift | I (analytics) | replica lag metric | Alert; reads from replica suspended until lag < 10 s | P3 | file/replica-lag |
| ST4 | Memorystore Redis evicts hot signed-URL cache | C, T, B | latency up; signed_url_cache_misses_total spike | Increase Memorystore tier; check key TTLs | P2 | file/redis-eviction |
| ST5 | KMS key version disabled accidentally | C, T, B (private only) | decrypt errors on private bucket | Re-enable previous version; rollback IAM change | P1 | file/kms-unavailable |
| ST6 | Bucket Lock prevents intended deletion (archive) | I (admin) | gcloud delete returns 403 | Expected for tax_compliance Bucket-Locked objects; wait until retention horizon | P3 | file/bucket-lock |
| ST7 | Datastream pipeline stalled (BigQuery sync) | I (analytics) | BQ row count diverges | Restart Datastream; backfill from outbox archive | P3 | file/datastream-stall |

7. Event-driven failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| E1 | Outbox relay falls behind (lag > 30 s) | P | alert FileStorage_OutboxLagHigh | Scale relay; check Pub/Sub publish errors | P2 | file/outbox-lag |
| E2 | Outbox relay crash | P | health check fails; restart | At-least-once guarantees no event loss; consumers dedupe | P3 | file/relay-crash |
| E3 | Inbox handler crash | P | metric inbox_handler_failures_total | Pub/Sub redelivers; idempotent handler succeeds eventually; DLQ at 5 attempts | P2 | file/dlq-growth |
| E4 | Pub/Sub topic deletion (operator error) | P | publish failures + alert | Recreate topic from terraform; replay outbox unpublished window | P1 | file/topic-deletion |
| E5 | Schema-incompatible event published | P | consumer schema validation fails | Block at CI via compatibility-check; if escapes, ship .v2 and adapter shim | P2 | file/schema-incompat |
| E6 | Consumed event from another service has unknown new field | P | tolerated (forward-compatible parsing) | None; just monitor for excess unknown_field_count | P3 | n/a |
| E7 | tenant.guest.erasure_requested.v1 consumed but tenant_id mismatch | I | consumer asserts; metric inbox_dedupe_skips_total{reason='tenant_mismatch'} | Drop event; security alert if rate > 1/min | P2 | file/cross-tenant-event |
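Rows E2, E3 and E7 depend on an idempotent inbox: duplicates from at-least-once delivery are skipped, and events whose tenant does not match are dropped and counted. The sketch below assumes that "tenant_id mismatch" means the event's tenant differs from the tenant that owns the referenced guest; that reading, the event shape, and all names are hypothetical.

```typescript
// Hypothetical shape of a consumed erasure request.
interface ErasureRequested {
  eventId: string;
  tenantId: string;
  guestId: string;
}

type InboxOutcome = "processed" | "duplicate" | "tenant_mismatch";

class ErasureInbox {
  private readonly processed = new Set<string>();
  private readonly ownerTenantOf: (guestId: string) => string | undefined;
  private readonly apply: (event: ErasureRequested) => void;
  // Analogue of inbox_dedupe_skips_total{reason='tenant_mismatch'}.
  tenantMismatchSkips = 0;

  constructor(
    ownerTenantOf: (guestId: string) => string | undefined,
    apply: (event: ErasureRequested) => void,
  ) {
    this.ownerTenantOf = ownerTenantOf;
    this.apply = apply;
  }

  handle(event: ErasureRequested): InboxOutcome {
    // An unknown guest has no owner, so it is also treated as a mismatch.
    if (this.ownerTenantOf(event.guestId) !== event.tenantId) {
      this.tenantMismatchSkips++;
      return "tenant_mismatch";
    }
    if (this.processed.has(event.eventId)) return "duplicate";
    this.apply(event);
    // Mark as processed only after apply succeeds, so a crash leads to redelivery.
    this.processed.add(event.eventId);
    return "processed";
  }
}
```

In a real service the processed-id set would live in the database, written in the same transaction as the handler's side effects; the in-memory set here only illustrates the ordering.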

8. AI subsystem failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| A1 | Orchestrator unreachable | T (no alt text) | ai_calls_failed_total | File still goes to ready; alt text empty; backfill job re-runs nightly | P3 | file/ai-orchestrator-down |
| A2 | Per-tenant AI budget exhausted | T (no alt text) | MELMASTOON.AI.REFUSED_BUDGET | Tenant upgrade or wait next period; degraded UX banner | P3 | file/ai-budget |
| A3 | HITL backlog > 24 h | T (borderline images held) | hitl_oldest_age_seconds | Auto-quarantine on 24 h; staffing alert | P2 | file/hitl-backlog |
| A4 | Model rolls and quality regresses | T | spot QA + drift report | Pin previous modelRef in orchestrator policy; re-run | P3 | file/ai-model-drift |
| A5 | OCR redact returns wrong boxes (PII bleeds through) | I (privacy) | spot audit; consumer reports | Quarantine the redacted file; manual redaction via backoffice; investigate model | P1 | file/ocr-leak |

9. Desktop / sync failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| DK1 | Desktop offline outbox grows unbounded | B | Electron telemetry; local disk usage alert | UI prompts user; sync engine prioritizes oldest; large files chunked to avoid blocking | P3 | desktop/offline-outbox |
| DK2 | Resumable upload from Electron stalls on flaky link | B | client retries; integration with low-bandwidth e2e test | Resume protocol picks up; user can pause/cancel | P3 | desktop/resumable-stall |
| DK3 | Renderer caches expired signed URL | B | DOM image fails to load | Cache TTL = grant.expires - 30s; renderer auto-refresh on 403 | P3 | desktop/url-cache |
| DK4 | Desktop captures wrong tenant context (multi-tenant user) | B | inbox handler tenant_mismatch | Renderer sets X-Tenant-Id from active tenant in app bar; guard on switch | P2 | desktop/tenant-switch |
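The DK3 rule (cache TTL = grant.expires - 30s) can be sketched as a small cache that refreshes a signed URL shortly before it expires. The `Grant` shape, the `SignedUrlCache` class, and the injected `issueGrant` callback are hypothetical; only the 30 s margin comes from the table.

```typescript
// Safety margin from row DK3: stop using a cached URL 30 s before it expires.
const SAFETY_MARGIN_MS = 30_000;

// Hypothetical grant shape for this sketch.
interface Grant {
  url: string;
  expiresAtMs: number; // epoch milliseconds
}

// A cached grant is usable only while now is earlier than (expiry - margin).
function isCacheFresh(grant: Grant, nowMs: number): boolean {
  return nowMs < grant.expiresAtMs - SAFETY_MARGIN_MS;
}

class SignedUrlCache {
  private readonly grants = new Map<string, Grant>();
  private readonly issueGrant: (fileId: string) => Grant;

  constructor(issueGrant: (fileId: string) => Grant) {
    this.issueGrant = issueGrant;
  }

  // Returns a cached URL while fresh, otherwise requests a new grant.
  urlFor(fileId: string, nowMs: number): string {
    const cached = this.grants.get(fileId);
    if (cached !== undefined && isCacheFresh(cached, nowMs)) return cached.url;
    const fresh = this.issueGrant(fileId);
    this.grants.set(fileId, fresh);
    return fresh.url;
  }

  // Called when the renderer sees a 403: forget the grant so the next read refreshes.
  invalidate(fileId: string): void {
    this.grants.delete(fileId);
  }
}
```

The margin covers clock skew and in-flight requests, while `invalidate` covers the remaining case where a URL is rejected earlier than expected.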

10. Observability failures

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| OB1 | OTLP exporter unreachable | I (visibility) | telemetry collector up but service spans missing | Drop spans, never block hot path; alert on collector health | P3 | obs/otlp-down |
| OB2 | Logs sink throttled by Cloud Logging | I | ingestion 429 | Reduce log level on selected loggers; investigate noise (scan callback success at info?) | P3 | obs/log-throttle |
| OB3 | Datastream → BQ broken | I (slo) | SLO panels stale | Restart Datastream; backfill from outbox archive | P3 | obs/datastream |
| OB4 | Alert noisy / duplicate | I (on-call) | PagerDuty incident review | Tune thresholds; consolidate via routing rules | P3 | obs/alert-tuning |

11. Catastrophic / business-continuity

| # | Failure | User impact | Detection | Mitigation | Severity | Runbook |
|---|---|---|---|---|---|---|
| BC1 | Total europe-west4 region outage | All | LB | Promote europe-west1 replica; rebuild Redis cold; run in degraded read-mostly mode while writes drain (RTO 30 min, RPO ≤ 5 min) | P1 | bc/region-failover |
| BC2 | Catastrophic Postgres data loss (primary + replica corrupt) | All | DB query failures | Restore from automated backup; replay outbox for last 24 h | P1 | bc/db-restore |
| BC3 | Suspected mass exfiltration of signed URLs | I (security) | SIEM burst | Freeze tenant; rotate signing identity; revoke all active grants; security incident review | P1 | file/cross-tenant-leak-suspected |
| BC4 | Compromised service account | I (security) | IAM audit / unusual access pattern | Revoke key version; rotate WIF binding; rebuild service from clean image | P1 | bc/sa-compromise |
| BC5 | Persistent CMEK rotation failure | All (private only) | KMS errors persistent | Read-only mode for private bucket; restore previous key version; security review | P1 | file/kms-rotation |
| BC6 | Regulator-issued takedown order | I (legal) | manual | OverrideQuarantineUseCase + legal-hold table; certificate of compliance | P1 | bc/regulator-takedown |

12. Incident response (general)

  1. PagerDuty pages on-call rotation (Tier 2 file-storage rotation).
  2. On-call acknowledges within 5 min, opens incident channel #inc-file-….
  3. Apply mitigation per runbook above.
  4. Communicate status via status page if customer-impacting > 10 min.
  5. Postmortem within 5 business days; action items tracked in SERVICE_RISK_REGISTER.

All P1 incidents trigger a security-reviewer review even if the cause is non-security; cross-tenant leakage or PII exposure escalates immediately to the security on-call.
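The escalation rules in this section can be summarized as a function from incident facts to follow-up actions. This is a sketch only: the `IncidentFacts` shape and the action labels are invented for illustration, while the thresholds (10 min, 5 business days) are the ones stated above.

```typescript
// Hypothetical summary of an incident's relevant facts.
interface IncidentFacts {
  severity: "P1" | "P2" | "P3";
  crossTenantLeak: boolean;
  piiExposure: boolean;
  customerImpactMinutes: number; // 0 if not customer-impacting
}

// Returns the follow-ups implied by the rules in this section, in priority order.
function followUps(incident: IncidentFacts): string[] {
  const actions: string[] = [];
  // Leakage or PII exposure escalates immediately, regardless of severity.
  if (incident.crossTenantLeak || incident.piiExposure) {
    actions.push("escalate-to-security-on-call");
  }
  // Every P1 gets a security-reviewer review, even for non-security causes.
  if (incident.severity === "P1") {
    actions.push("security-reviewer-review");
  }
  // Status page is used once customer impact exceeds 10 minutes.
  if (incident.customerImpactMinutes > 10) {
    actions.push("update-status-page");
  }
  actions.push("postmortem-within-5-business-days");
  return actions;
}
```

For a P2 with 25 minutes of customer impact and no data exposure, this returns the status-page update followed by the postmortem.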