file-storage-service — SERVICE_RISK_REGISTER

Companion: SERVICE_OVERVIEW · FAILURE_MODES · SECURITY_MODEL · SERVICE_READINESS

This is the live register of risks the service knowingly carries. Each entry is scored, owned, and either mitigated, monitored, or accepted with an expiry date. The register is reviewed monthly by the service tech lead and quarterly by the platform architecture council. Resolved risks (and risks that were fully realised and closed) move to the History section so that institutional memory is preserved.

1. Scoring rubric

| Dimension | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Likelihood (annual) | < 1% | 1–10% | 10–30% | 30–60% | > 60% |
| Impact | minor degradation, < 5% tenants | localized outage, < 25% tenants | platform-wide degradation | data loss / SLA breach | safety / compliance / brand catastrophic |

Risk = Likelihood × Impact. Treatment thresholds:

| Score | Treatment |
| --- | --- |
| 1–4 | accept; monitor |
| 5–9 | mitigate over 1–2 quarters |
| 10–14 | mitigate this quarter |
| 15–25 | block GA; mitigate immediately |
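
The rubric and thresholds above can be expressed as a small helper. This is a minimal sketch: the names `Level`, `riskScore`, and `treatmentFor` are illustrative assumptions, not identifiers from the service codebase.

```typescript
// Illustrative sketch only: all names here are hypothetical.
type Level = 1 | 2 | 3 | 4 | 5;

type Treatment =
  | "accept; monitor"
  | "mitigate over 1–2 quarters"
  | "mitigate this quarter"
  | "block GA; mitigate immediately";

// Risk = Likelihood × Impact, both on the 1–5 scale from the rubric.
function riskScore(likelihood: Level, impact: Level): number {
  return likelihood * impact;
}

// Map a score (1–25) onto the treatment thresholds listed in the register.
function treatmentFor(score: number): Treatment {
  if (!Number.isInteger(score) || score < 1 || score > 25) {
    throw new RangeError(`score out of range: ${score}`);
  }
  if (score <= 4) return "accept; monitor";
  if (score <= 9) return "mitigate over 1–2 quarters";
  if (score <= 14) return "mitigate this quarter";
  return "block GA; mitigate immediately";
}

// Example: a risk scored Likelihood 4, Impact 3.
const exampleScore = riskScore(4, 3);
console.log(exampleScore, treatmentFor(exampleScore)); // prints: 12 mitigate this quarter
```

A helper like this could back a CI check that each entry's Score field equals Likelihood × Impact.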

2. Active risks

R-FILE-001 — Cross-tenant signed-URL theft

| Field | Value |
| --- | --- |
| Category | Security · Multi-tenancy |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Security reviewer |
| Detection | SIEM access pattern, signed_url_blacklist hits, tenant audit |
| Description | A signed URL legitimately issued for tenant A is replayed by tenant B (or an attacker). Mitigations are layered: the object key carries t/<tenant>/; the private-bucket signing identity is per-tenant; URL TTLs are short; the revocation list is global. The residual risk is theft within the TTL window before revocation. |
| Mitigations | Short default TTL (5 min), audit on every signed-URL issuance, revocation list (Redis ZSET + Postgres backstop), per-tenant signing identity, optional IP scope on issuance |
| Monitoring | metric signed_url_blacklist_size, alert on spike of 403s from CDN/private path |
| Acceptance expiry | n/a (permanent monitor) |
| Linked | SECURITY_MODEL §6, FAILURE_MODES D2 |
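
Three of the layers described for R-FILE-001 (tenant prefix on the key, short TTL, global revocation list) can be sketched as below. This is an illustration under assumptions: `IssuedUrl`, `issueSignedUrl`, `isUsable`, and the `urlId` field are hypothetical, and an in-memory set stands in for the Redis ZSET and Postgres backstop that the register names. Real GCS signing and the per-tenant signing identity are not modelled.

```typescript
// Sketch only: hypothetical names; an in-memory Set replaces the real revocation store.
interface IssuedUrl {
  urlId: string;      // hypothetical identifier recorded at issuance (for audit/revocation)
  tenantId: string;   // tenant the URL was issued for
  objectKey: string;  // must start with t/<tenant>/
  expiresAtMs: number;
}

const DEFAULT_TTL_MS = 5 * 60 * 1000; // short default TTL (5 min), per the register

// Refuse to issue a URL for a key outside the requesting tenant's prefix.
function issueSignedUrl(
  urlId: string,
  tenantId: string,
  objectKey: string,
  nowMs: number,
  ttlMs: number = DEFAULT_TTL_MS,
): IssuedUrl {
  if (!objectKey.startsWith(`t/${tenantId}/`)) {
    throw new Error("object key is outside the tenant prefix");
  }
  return { urlId, tenantId, objectKey, expiresAtMs: nowMs + ttlMs };
}

// A URL is usable only while unexpired and not on the revocation list.
function isUsable(url: IssuedUrl, revoked: ReadonlySet<string>, nowMs: number): boolean {
  return nowMs < url.expiresAtMs && !revoked.has(url.urlId);
}
```

The residual risk named in the entry is visible here: between issuance and either expiry or revocation, `isUsable` returns true for whoever holds the URL.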

R-FILE-002 — ClamAV definitions stale

| Field | Value |
| --- | --- |
| Category | Security · Operational |
| Likelihood | 3 |
| Impact | 3 |
| Score | 9 |
| Owner | SRE on-call |
| Description | ClamAV freshclam updates can fail silently (network, mirror outage). New malware variants then slip past the scan. |
| Mitigations | freshclam runs hourly with metric exposed; alert if last update > 6 h; daily Cloud DLP fallback for image scopes; quarantine bucket retention covers later catch |
| Monitoring | clamav_definitions_age_seconds, alert > 6 h |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES S1 |

R-FILE-003 — Optimizer image bomb

| Field | Value |
| --- | --- |
| Category | Security · DoS |
| Likelihood | 3 |
| Impact | 3 |
| Score | 9 |
| Owner | Service tech lead |
| Description | Maliciously crafted image expands to gigabytes when decoded (decompression bomb). Could OOM the optimizer pool. |
| Mitigations | sharp configured with limitInputPixels: 80M; container memory cap 512 MB; ulimit on virtual memory; per-object retries cap 5; DLQ |
| Monitoring | OOM kills; optimization_failed_total{reason='oom'} |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES O1, O2 |

R-FILE-004 — Optimizer poison loop

| Field | Value |
| --- | --- |
| Category | Reliability |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | A single corrupted image is retried repeatedly, exhausting worker capacity. |
| Mitigations | retry budget = 5 then DLQ; per-object backoff with jitter; circuit breaker per-bucket |
| Monitoring | DLQ depth alert |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES O1 |
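
The retry budget and jittered backoff named in the mitigations could look roughly like this. It is a sketch under assumptions: `planRetry` is a hypothetical name, the 500 ms base and 60 s cap are invented values, "full jitter" exponential backoff is one common choice rather than the service's confirmed strategy, and the budget is read as five retries after the first failure. The per-bucket circuit breaker is not shown.

```typescript
// Sketch only: names, base delay, and cap are assumptions, not service configuration.
const RETRY_BUDGET = 5; // from the register: retry budget = 5, then DLQ

type NextStep =
  | { action: "retry"; delayMs: number }
  | { action: "dlq" };

// retriesUsed: how many retries this object has already consumed (0 after the first failure).
// random is injectable so the jitter can be made deterministic in tests.
function planRetry(
  retriesUsed: number,
  random: () => number = Math.random,
  baseMs: number = 500,
  capMs: number = 60000,
): NextStep {
  if (retriesUsed >= RETRY_BUDGET) return { action: "dlq" };
  // Exponential ceiling, capped; the actual delay is drawn uniformly below it.
  const ceilingMs = Math.min(capMs, baseMs * 2 ** retriesUsed);
  return { action: "retry", delayMs: Math.floor(random() * ceilingMs) };
}
```

Because the delay is randomised per object, a batch of failures caused by one bad deploy does not retry in lockstep, and a poison image stops consuming workers once its budget is spent.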

R-FILE-005 — CDN cache invalidation lag

| Field | Value |
| --- | --- |
| Category | Privacy · Performance |
| Likelihood | 4 |
| Impact | 3 |
| Score | 12 |
| Owner | SRE on-call |
| Description | After a public asset is deleted (or rotated), Cloud CDN may serve stale bytes for the cache horizon (≤ 1 h with our config). For non-erasure cases this is acceptable; for erasure it is not. |
| Mitigations | invalidation API call on every delete; for GDPR erasure we wait for invalidation success before marking the certificate purged; for non-erasure we accept eventual invalidation; private bucket bypasses CDN entirely |
| Monitoring | metric cdn_invalidate_backlog, alert > 100; certificate row enforces cdn_invalidated=true |
| Acceptance expiry | mitigated in Phase 1 |
| Linked | SECURITY_MODEL §7, FAILURE_MODES D3 / R3 |

R-FILE-006 — KMS key version disabled

| Field | Value |
| --- | --- |
| Category | Security · Operational |
| Likelihood | 1 |
| Impact | 5 |
| Score | 5 |
| Owner | Platform security |
| Description | Operator disables a CMEK version that is still referenced by objects, breaking decrypt path. |
| Mitigations | CMEK changes are PR-gated; rotation policy keeps versions enabled ≥ 1 y; audit alarm on KMS disable events |
| Monitoring | KMS audit log alarm |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES ST5, BC5 |

R-FILE-007 — Quota race at high concurrency

| Field | Value |
| --- | --- |
| Category | Reliability · Tenant fairness |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | Two parallel uploads from the same tenant pass the quota check, then both reserve, exceeding the cap. |
| Mitigations | reservation done in the same Tx as the quotas row update with FOR UPDATE; rejected uploads receive MELMASTOON.FILE.QUOTA_EXCEEDED; periodic reconcile sweeper reclaims orphaned reservations |
| Monitoring | quota_reservation_overshoot_total |
| Acceptance expiry | n/a |
| Linked | DATA_MODEL §3.10, FAILURE_MODES U6 |

R-FILE-008 — Resumable uploads abandoned

| Field | Value |
| --- | --- |
| Category | Reliability · Cost |
| Likelihood | 5 |
| Impact | 1 |
| Score | 5 |
| Owner | Service tech lead |
| Description | Mobile / desktop client crashes mid-upload, leaves orphan partial blobs in GCS. |
| Mitigations | sweeper aborts sessions past expires_at; releases quota; logs abort reason |
| Monitoring | upload_session_aborts_total{reason='expired'} baseline |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES U2, U7 |
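
A sweeper pass of the kind described above might be shaped like this. It is a minimal in-memory sketch: `UploadSession`, `sweepExpired`, and the field names are assumptions, a `Map` stands in for the quotas table, and deleting the partial blob in GCS is left out.

```typescript
// Sketch only: hypothetical types; a Map replaces the real quota store.
interface UploadSession {
  id: string;
  tenantId: string;
  reservedBytes: number;
  expiresAtMs: number; // corresponds to expires_at in the register
  status: "active" | "aborted" | "completed";
}

interface AbortRecord {
  sessionId: string;
  reason: "expired";
}

// Abort every active session past its expiry and release its quota reservation.
// Returns the abort records so the caller can log them and bump a counter.
function sweepExpired(
  sessions: UploadSession[],
  quotaUsedBytes: Map<string, number>,
  nowMs: number,
): AbortRecord[] {
  const aborted: AbortRecord[] = [];
  for (const session of sessions) {
    if (session.status !== "active" || session.expiresAtMs > nowMs) continue;
    session.status = "aborted";
    const used = quotaUsedBytes.get(session.tenantId) ?? 0;
    // Clamp at zero so a double release can never drive usage negative.
    quotaUsedBytes.set(session.tenantId, Math.max(0, used - session.reservedBytes));
    aborted.push({ sessionId: session.id, reason: "expired" });
  }
  return aborted;
}
```

Running the pass twice is harmless: already-aborted sessions are skipped, so quota is released at most once per session.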

R-FILE-009 — DSR cascade incomplete

| Field | Value |
| --- | --- |
| Category | Compliance · GDPR |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Compliance + Service tech lead |
| Description | Erasure event consumed but some files missed (e.g., classification mistake), leaving residual PII. |
| Mitigations | erasure pulls by ownership.guestId AND tenantId; certificate enumerates files; second-pass sweeper re-runs after 24 h; legal hold respected; spot-audit script in QA |
| Monitoring | erasure_runs_total{outcome='partial'} rate |
| Acceptance expiry | n/a |
| Linked | SECURITY_MODEL §8, APPLICATION_LOGIC §3.6 |

R-FILE-010 — Bandwidth-constrained tenant cannot upload

| Field | Value |
| --- | --- |
| Category | UX · Hotel context |
| Likelihood | 4 |
| Impact | 2 |
| Score | 8 |
| Owner | Product + Service tech lead |
| Description | Pilot regions have unreliable connectivity; large image upload from front desk fails repeatedly. |
| Mitigations | resumable uploads end-to-end; client compresses on capture; image optimization happens server-side so client doesn't need to upload heavyweight derivatives; offline outbox in Electron |
| Monitoring | per-tenant upload success rate; e2e test e2e/low-bandwidth.spec.ts |
| Acceptance expiry | n/a |
| Linked | SYNC_CONTRACT §3-§4 |

R-FILE-011 — Schema-evolution breaks event consumers

| Field | Value |
| --- | --- |
| Category | Platform · Coupling |
| Likelihood | 2 |
| Impact | 4 |
| Score | 8 |
| Owner | Service tech lead |
| Description | Producer ships breaking change in melmastoon.file.upload.completed.v1 without bumping to v2; consumers explode. |
| Mitigations | events:compatibility-check CI gate; PRs touching event schema require docs update; consumer Pact contracts in CI |
| Monitoring | consumer DLQ growth |
| Acceptance expiry | n/a |
| Linked | MIGRATION_PLAN §7 |

R-FILE-012 — AI orchestrator unavailable degrades UX

| Field | Value |
| --- | --- |
| Category | UX · External dependency |
| Likelihood | 3 |
| Impact | 1 |
| Score | 3 |
| Owner | Service tech lead |
| Description | Image safety / alt-text / OCR redact unavailable; uploads still complete; degraded UX (no alt text, manual review of ID scans). |
| Mitigations | AI calls async, never block confirm; per-tenant budget caps; HITL fallback; nightly backfill job |
| Monitoring | ai_calls_failed_total{purpose} |
| Acceptance expiry | n/a |
| Linked | AI_INTEGRATION §6, FAILURE_MODES A1 |

R-FILE-013 — Pub/Sub at-least-once causes duplicate workers

| Field | Value |
| --- | --- |
| Category | Reliability · Correctness |
| Likelihood | 4 |
| Impact | 1 |
| Score | 4 |
| Owner | Service tech lead |
| Description | Same optimization.requested.v1 delivered twice; two workers may compute variants in parallel. |
| Mitigations | upsert on (file_id, variant) unique index; idempotent inbox handler; cheap recompute |
| Monitoring | inbox_dedupe_skips_total{reason='duplicate_event'} |
| Acceptance expiry | n/a |
| Linked | APPLICATION_LOGIC §6, DATA_MODEL §3.4 |
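
The two idempotency layers named above (inbox dedupe and an upsert keyed on (file_id, variant)) can be illustrated in memory. This is a sketch, not the service's implementation: `Inbox`, `VariantStore`, and their members are hypothetical, and the real layers would be a database inbox table and a unique index.

```typescript
// Sketch only: in-memory stand-ins for the inbox table and the unique index.
interface VariantRow {
  fileId: string;
  variant: string;
  bytes: number;
}

class VariantStore {
  private rows = new Map<string, VariantRow>();

  // Emulates an upsert against a unique (file_id, variant) index:
  // a second write for the same pair overwrites instead of duplicating.
  upsert(row: VariantRow): "inserted" | "updated" {
    const key = JSON.stringify([row.fileId, row.variant]);
    const existed = this.rows.has(key);
    this.rows.set(key, row);
    return existed ? "updated" : "inserted";
  }

  get size(): number {
    return this.rows.size;
  }
}

class Inbox {
  private seen = new Set<string>();
  dedupeSkips = 0; // would feed a counter such as inbox_dedupe_skips_total

  // Runs the handler once per event id; returns false when the event is a duplicate.
  handle(eventId: string, handler: () => void): boolean {
    if (this.seen.has(eventId)) {
      this.dedupeSkips += 1;
      return false;
    }
    handler();
    this.seen.add(eventId); // recorded only after the handler succeeds
    return true;
  }
}
```

The inbox catches redelivery that arrives after the first handler finished; the upsert covers the remaining case where two workers race on the same event, so the worst outcome is a cheap recompute rather than a duplicate row.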

R-FILE-014 — Tenant prefix invariant bypassed by future code

| Field | Value |
| --- | --- |
| Category | Security · Multi-tenancy |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Security reviewer |
| Description | A new code path could construct an ObjectKey without the t/<tenant>/ prefix and pass the DB CHECK, only to leak across tenants. |
| Mitigations | three layers (domain VO factory, DB CHECK, GCS Conditions); arch-fitness test forbids bypass of factory; security review on every PR touching gcs/ adapter |
| Monitoring | unit tests across the three layers; e2e cross-tenant test |
| Acceptance expiry | n/a |
| Linked | DOMAIN_MODEL §6, SECURITY_MODEL §3 |
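
The first of the three layers, a value-object factory that is the only way to build an ObjectKey, could be sketched as follows. The class shape, the method names `forTenant` and `belongsTo`, and the allowed tenant-id character set are assumptions for illustration; DOMAIN_MODEL §6 remains the authority on the real rules.

```typescript
// Sketch only: validation rules here are assumed, not taken from DOMAIN_MODEL.
class ObjectKey {
  // Private constructor: callers cannot build a key without going through the factory.
  private constructor(readonly value: string) {}

  static forTenant(tenantId: string, path: string): ObjectKey {
    // Assumed tenant-id alphabet; rejects "/" so a tenant id cannot smuggle a prefix.
    if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
      throw new Error("invalid tenant id");
    }
    const segments = path.split("/");
    if (path.length === 0 || path.startsWith("/") || segments.some((s) => s === "..")) {
      throw new Error("invalid object path");
    }
    return new ObjectKey(`t/${tenantId}/${path}`);
  }

  // True only if the key sits under exactly this tenant's prefix.
  belongsTo(tenantId: string): boolean {
    return this.value.startsWith(`t/${tenantId}/`);
  }
}

const key = ObjectKey.forTenant("hotel-42", "img/lobby.jpg");
console.log(key.value); // prints: t/hotel-42/img/lobby.jpg
```

With the constructor private, an arch-fitness test only has to forbid string-typed keys at the adapter boundary; any code that holds an `ObjectKey` got it from the factory.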

R-FILE-015 — Cloud SQL exhausts connection pool under burst

| Field | Value |
| --- | --- |
| Category | Reliability |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | SRE on-call |
| Description | Sudden upload burst (e.g., property onboarding) saturates connection pool. |
| Mitigations | PgBouncer transaction pooling; per-instance pool 20; max instances tuned; admission control returns 429 instead of timeout |
| Monitoring | pgpool_active_connections near max |
| Acceptance expiry | n/a |
| Linked | DEPLOYMENT_TOPOLOGY §7 |

R-FILE-016 — BFF proxy upload swamps service bandwidth

| Field | Value |
| --- | --- |
| Category | Cost · Reliability |
| Likelihood | 2 |
| Impact | 3 |
| Score | 6 |
| Owner | Service tech lead |
| Description | When tenants opt for proxy upload (constrained networks), large bytes flow through Cloud Run, not direct to GCS. |
| Mitigations | streaming proxy that does not buffer; tenant-level cap on proxy mode; Cloud Run egress monitored; default remains direct upload |
| Monitoring | proxy_upload_bytes_total per tenant |
| Acceptance expiry | n/a |
| Linked | SYNC_CONTRACT §3.2 |

R-FILE-017 — Retention sweeper races erasure

| Field | Value |
| --- | --- |
| Category | Compliance · Race condition |
| Likelihood | 1 |
| Impact | 2 |
| Score | 2 |
| Owner | Service tech lead |
| Description | Sweeper hard-deletes a file at the same moment an erasure batch tries to enumerate it. |
| Mitigations | SELECT … FOR UPDATE SKIP LOCKED; idempotent purge; certificate accepts both purged and already_purged |
| Monitoring | erasure_runs_total{outcome='partial'} |
| Acceptance expiry | n/a |
| Linked | APPLICATION_LOGIC §3.6 |

R-FILE-018 — Fake-gcs ≠ real GCS in dev

| Field | Value |
| --- | --- |
| Category | Operational · Test fidelity |
| Likelihood | 4 |
| Impact | 1 |
| Score | 4 |
| Owner | Service tech lead |
| Description | Resumable upload semantics, CORS, and conditions differ between fake-gcs and real GCS. Dev passes; staging fails. |
| Mitigations | nightly e2e against staging GCS bucket; smoke tests in PR pipeline against ephemeral GCS bucket |
| Monitoring | nightly e2e dashboard |
| Acceptance expiry | n/a |
| Linked | LOCAL_DEV_SETUP §11 |

R-FILE-019 — Stale OpenAPI vs. implementation drift

| Field | Value |
| --- | --- |
| Category | Documentation · Consumer trust |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | Endpoint changes ship without OpenAPI regeneration. Pact contracts then fail or, worse, silently pass on stale assumptions. |
| Mitigations | OpenAPI generated from controller decorators (NestJS Swagger); CI fails if openapi/v1.yaml drift detected |
| Monitoring | CI gate |
| Acceptance expiry | n/a |
| Linked | API_CONTRACTS, MIGRATION_PLAN §8 |

R-FILE-020 — Long-tail variant zoo (Phase 3 video)

| Field | Value |
| --- | --- |
| Category | Scope · Cost |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Product |
| Description | Adding video variants explodes storage and compute cost. |
| Mitigations | Phase 3 only; per-tenant opt-in; profile-driven (mobile-portrait only by default); cost guardrails in product plan |
| Monitoring | gcs_object_bytes_total by mediaKind |
| Acceptance expiry | scoped to Phase 3 design |
| Linked | MIGRATION_PLAN §12 |

3. Risk heatmap (current)

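| Likelihood ↓ / Impact → | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | | R-017 | | | R-006 |
| 2 | | | R-016 | R-011 | R-001, R-009, R-014 |
| 3 | R-012 | R-004, R-007, R-015, R-019, R-020 | R-002, R-003 | | |
| 4 | R-013, R-018 | R-010 | R-005 | | |
| 5 | R-008 | | | | |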
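
Since each cell of the heatmap is determined by an entry's Likelihood and Impact fields, the grid can be derived from the entries rather than maintained by hand. The sketch below assumes a hypothetical `RiskEntry` shape and `buildHeatmap` helper; the three sample rows use the values recorded in section 2.

```typescript
// Sketch only: hypothetical helper for deriving the heatmap from register entries.
interface RiskEntry {
  id: string;
  likelihood: number; // 1–5
  impact: number;     // 1–5
}

// Returns grid[likelihood - 1][impact - 1] = list of risk ids in that cell.
function buildHeatmap(risks: RiskEntry[]): string[][][] {
  const grid: string[][][] = [];
  for (let l = 0; l < 5; l++) {
    const row: string[][] = [];
    for (let i = 0; i < 5; i++) row.push([]);
    grid.push(row);
  }
  const inRange = (n: number) => Number.isInteger(n) && n >= 1 && n <= 5;
  for (const risk of risks) {
    if (!inRange(risk.likelihood) || !inRange(risk.impact)) {
      throw new RangeError(`${risk.id}: likelihood and impact must be integers 1–5`);
    }
    grid[risk.likelihood - 1][risk.impact - 1].push(risk.id);
  }
  return grid;
}

// Sample entries taken from section 2 of this register.
const sample: RiskEntry[] = [
  { id: "R-001", likelihood: 2, impact: 5 },
  { id: "R-005", likelihood: 4, impact: 3 },
  { id: "R-017", likelihood: 1, impact: 2 },
];
const grid = buildHeatmap(sample);
console.log(grid[3][2]); // prints: [ 'R-005' ]
```

Generating the grid at review time avoids the heatmap drifting from the per-risk scores when a rating is updated.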

4. Risk owners

  • Service tech lead: end-to-end accountable
  • Security reviewer: signs off on R-001, R-014, R-006
  • SRE on-call: owns operational risks (R-002, R-005, R-015)
  • Compliance / DPO: owns R-009
  • Product: owns scope risks (R-010, R-020)

5. Review cadence

  • Monthly: tech lead + SRE walk-through, update scores, surface new risks
  • Quarterly: platform architecture council review; write off accepted risks; promote unmitigated risks to executive review
  • Annually: re-baseline of likelihood ratings against actual incident data from observability and postmortems

6. History (closed risks)

| ID | Title | Closed date | Reason |
| --- | --- | --- | --- |
| (none yet; initial register) | | | |