file-storage-service — SERVICE_RISK_REGISTER
Companion: SERVICE_OVERVIEW · FAILURE_MODES · SECURITY_MODEL · SERVICE_READINESS
This is the live register of risks the service knowingly carries. Each entry is scored, owned, and either mitigated, monitored, or accepted with an expiry date. Reviewed monthly by the service tech lead and SRE, and quarterly by the platform architecture council. Risks resolved (or fully realised and closed) move to the History section so we never lose institutional memory.
1. Scoring rubric
| Dimension | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Likelihood (annual) | < 1% | 1–10% | 10–30% | 30–60% | > 60% |
| Impact | minor degradation, < 5% tenants | localized outage, < 25% tenants | platform-wide degradation | data loss / SLA breach | safety / compliance / brand catastrophic |
Risk = Likelihood × Impact. Treatment thresholds:
| Score | Treatment |
|---|---|
| 1–4 | accept; monitor |
| 5–9 | mitigate over 1–2 quarters |
| 10–14 | mitigate this quarter |
| 15–25 | block GA; mitigate immediately |
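
The rubric and thresholds above can be expressed as a small lookup, which is handy for checking that a register entry's Score and treatment agree. This is a minimal sketch: the names `riskScore` and `treatmentFor` are hypothetical, and only the bands come from the table.

```typescript
// Hypothetical helpers; only the score bands are taken from the rubric.
type Rating = 1 | 2 | 3 | 4 | 5;

type Treatment =
  | "accept; monitor"
  | "mitigate over 1-2 quarters"
  | "mitigate this quarter"
  | "block GA; mitigate immediately";

// Risk = Likelihood x Impact
function riskScore(likelihood: Rating, impact: Rating): number {
  return likelihood * impact;
}

function treatmentFor(score: number): Treatment {
  if (!Number.isInteger(score) || score < 1 || score > 25) {
    throw new RangeError(`score out of range: ${score}`);
  }
  if (score <= 4) return "accept; monitor";
  if (score <= 9) return "mitigate over 1-2 quarters";
  if (score <= 14) return "mitigate this quarter";
  return "block GA; mitigate immediately";
}
```

For example, an entry with Likelihood 4 and Impact 3 scores 12 and lands in the "mitigate this quarter" band.
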
2. Active risks
R-FILE-001 — Cross-tenant signed-URL theft
| Field | Value |
|---|---|
| Category | Security · Multi-tenancy |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Security reviewer |
| Detection | SIEM access pattern, signed_url_blacklist hits, tenant audit |
| Description | A signed URL legitimately issued for tenant A is replayed by tenant B (or an attacker). Mitigations are layered: the object key carries t/<tenant>/; the private bucket signing identity is per-tenant; URL TTLs are short; the revocation list is global. The residual risk is theft within the TTL window, before revocation takes effect. |
| Mitigations | Short default TTL (5 min), audit on every signed-URL issuance, revocation list (Redis ZSET + Postgres backstop), per-tenant signing identity, optional IP scope on issuance |
| Monitoring | metric signed_url_blacklist_size, alert on spike of 403s from CDN/private path |
| Acceptance expiry | n/a — permanent monitor |
| Linked | SECURITY_MODEL §6, FAILURE_MODES D2 |
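
To make the layering concrete, here is a minimal in-memory sketch of the checks a signed-URL request could pass through. Everything here is an assumption for illustration: the claim shape, the `authorize` function, and the idea of storing each revoked entry with the URL's own expiry so it can be pruned later (a common way to use a sorted set, not confirmed as the service's design).

```typescript
// Hypothetical stand-in for the Redis ZSET: key = signature id,
// value = the signed URL's own expiry in ms (assumed pruning rule).
class RevocationList {
  private readonly entries = new Map<string, number>();

  revoke(signatureId: string, urlExpiresAtMs: number): void {
    this.entries.set(signatureId, urlExpiresAtMs);
  }

  isRevoked(signatureId: string): boolean {
    return this.entries.has(signatureId);
  }

  // Drop entries whose URL has expired anyway; returns how many were removed.
  prune(nowMs: number): number {
    let removed = 0;
    this.entries.forEach((expiresAt, id) => {
      if (expiresAt <= nowMs) {
        this.entries.delete(id);
        removed++;
      }
    });
    return removed;
  }

  get size(): number {
    return this.entries.size;
  }
}

// Assumed claim shape; the real signed-URL payload is not described here.
interface SignedUrlClaim {
  signatureId: string;
  tenantId: string;
  objectKey: string;
  expiresAtMs: number;
}

type Decision = "allow" | "expired" | "revoked" | "tenant_mismatch";

function authorize(claim: SignedUrlClaim, revocations: RevocationList, nowMs: number): Decision {
  // Layer 1: the object key must sit under the tenant's own prefix.
  if (!claim.objectKey.startsWith(`t/${claim.tenantId}/`)) return "tenant_mismatch";
  // Layer 2: short TTL.
  if (claim.expiresAtMs <= nowMs) return "expired";
  // Layer 3: global revocation list.
  if (revocations.isRevoked(claim.signatureId)) return "revoked";
  return "allow";
}
```

The residual risk in the Description corresponds to the window in which `authorize` still returns "allow": the URL has been stolen but has neither expired nor been revoked.
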
R-FILE-002 — ClamAV definitions stale
| Field | Value |
|---|---|
| Category | Security · Operational |
| Likelihood | 3 |
| Impact | 3 |
| Score | 9 |
| Owner | SRE on-call |
| Description | ClamAV freshclam updates can fail silently (network or mirror outage). New malware variants can then slip past the scan. |
| Mitigations | freshclam runs hourly with metric exposed; alert if last update > 6 h; daily Cloud DLP fallback for image scopes; quarantine bucket retention covers later catch |
| Monitoring | clamav_definitions_age_seconds, alert > 6 h |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES S1 |
R-FILE-003 — Optimizer image bomb
| Field | Value |
|---|---|
| Category | Security · DoS |
| Likelihood | 3 |
| Impact | 3 |
| Score | 9 |
| Owner | Service tech lead |
| Description | Maliciously crafted image expands to gigabytes when decoded (decompression bomb). Could OOM the optimizer pool. |
| Mitigations | sharp configured with limitInputPixels: 80M; container memory cap 512 MB; ulimit on virtual memory; per-object retries cap 5; DLQ |
| Monitoring | OOM kills; optimization_failed_total{reason='oom'} |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES O1, O2 |
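
As a rough sanity check on how the pixel limit relates to the memory cap, the arithmetic can be sketched as below. This does not call sharp; it is a hypothetical pre-check on header dimensions, and the 4-channel, 8-bit raster is an assumption used only to get an upper-bound estimate.

```typescript
// Values from the Mitigations row; helper names are hypothetical.
const LIMIT_INPUT_PIXELS = 80_000_000; // limitInputPixels: 80M
const CONTAINER_CAP_BYTES = 512 * 1024 * 1024; // 512 MB container memory cap

// True when the declared dimensions stay within the pixel budget.
function withinPixelLimit(width: number, height: number, limit: number = LIMIT_INPUT_PIXELS): boolean {
  return (
    Number.isInteger(width) &&
    Number.isInteger(height) &&
    width > 0 &&
    height > 0 &&
    width * height <= limit
  );
}

// Rough size of a fully decoded 8-bit raster (assumed RGBA by default).
// Real decoder memory use differs, so treat this as an estimate only.
function decodedBytes(width: number, height: number, channels: number = 4): number {
  return width * height * channels;
}
```

At exactly 80M pixels, an RGBA raster is about 320,000,000 bytes, which fits under the 512 MB cap for a single image. Decoder overhead and concurrent jobs in the same container are not modeled here.
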
R-FILE-004 — Optimizer poison loop
| Field | Value |
|---|---|
| Category | Reliability |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | A single corrupted image is retried repeatedly, exhausting worker capacity. |
| Mitigations | retry budget = 5 then DLQ; per-object backoff with jitter; circuit breaker per-bucket |
| Monitoring | DLQ depth alert |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES O1 |
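
The retry budget and jittered backoff can be sketched as follows. `MAX_ATTEMPTS` mirrors the budget of 5 from the Mitigations row; the base delay, the delay cap, the full-jitter formula, and the `handler` / `deadLetter` callbacks are assumptions for illustration, not the service's actual worker API. The per-bucket circuit breaker is not shown.

```typescript
const MAX_ATTEMPTS = 5; // "retry budget = 5 then DLQ"
const BASE_DELAY_MS = 200; // assumed
const MAX_DELAY_MS = 30_000; // assumed

// Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)).
function backoffMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

type Outcome = { status: "done"; attempts: number } | { status: "dlq"; attempts: number };

async function processWithBudget(
  objectKey: string,
  handler: (key: string) => Promise<void>,
  deadLetter: (key: string, lastError: unknown) => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Outcome> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      await handler(objectKey);
      return { status: "done", attempts: attempt + 1 };
    } catch (err) {
      lastError = err;
      // No need to wait after the final failed attempt.
      if (attempt < MAX_ATTEMPTS - 1) await sleep(backoffMs(attempt));
    }
  }
  // Budget exhausted: park the object so it stops consuming worker capacity.
  await deadLetter(objectKey, lastError);
  return { status: "dlq", attempts: MAX_ATTEMPTS };
}
```

A corrupted image that always fails is attempted five times and then parked, so the worst-case cost per poison object is bounded.
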
R-FILE-005 — CDN cache invalidation lag
| Field | Value |
|---|---|
| Category | Privacy · Performance |
| Likelihood | 4 |
| Impact | 3 |
| Score | 12 |
| Owner | SRE on-call |
| Description | After a public asset is deleted (or rotated), Cloud CDN may serve stale bytes for the cache horizon (≤ 1 h with our config). For non-erasure cases this is acceptable; for erasure it is not. |
| Mitigations | invalidation API call on every delete; for GDPR erasure we wait for invalidation success before marking certificate purged; for non-erasure we accept eventual; private bucket bypasses CDN entirely |
| Monitoring | metric cdn_invalidate_backlog, alert > 100; certificate row enforces cdn_invalidated=true |
| Acceptance expiry | mitigated in Phase 1 |
| Linked | SECURITY_MODEL §7, FAILURE_MODES D3 / R3 |
R-FILE-006 — KMS key version disabled
| Field | Value |
|---|---|
| Category | Security · Operational |
| Likelihood | 1 |
| Impact | 5 |
| Score | 5 |
| Owner | Platform security |
| Description | Operator disables a CMEK version that is still referenced by objects, breaking decrypt path. |
| Mitigations | CMEK changes are PR-gated; rotation policy keeps versions enabled ≥ 1 y; audit alarm on KMS disable events |
| Monitoring | KMS audit log alarm |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES ST5, BC5 |
R-FILE-007 — Quota race at high concurrency
| Field | Value |
|---|---|
| Category | Reliability · Tenant fairness |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | Two parallel uploads from the same tenant pass the quota check then both reserve, exceeding cap. |
| Mitigations | reservation done in same Tx as quotas row update with FOR UPDATE; rejected ones see MELMASTOON.FILE.QUOTA_EXCEEDED; periodic reconcile sweeper reclaims orphaned reservations |
| Monitoring | quota_reservation_overshoot_total |
| Acceptance expiry | n/a |
| Linked | DATA_MODEL §3.10, FAILURE_MODES U6 |
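
The essence of the mitigation is that the quota check and the reservation happen as one atomic step. The sketch below models that with an in-memory ledger whose `reserve` method is a single synchronous operation, standing in for one transaction that holds the quotas row lock. The class, its methods, and the expiry-based reclaim rule are hypothetical; only the error code comes from the register.

```typescript
type ReserveResult =
  | { ok: true }
  | { ok: false; code: "MELMASTOON.FILE.QUOTA_EXCEEDED" };

class QuotaLedger {
  private readonly capBytes: number;
  private committedBytes = 0;
  private readonly reservations = new Map<string, { bytes: number; expiresAtMs: number }>();

  constructor(capBytes: number) {
    this.capBytes = capBytes;
  }

  private reservedBytes(): number {
    let total = 0;
    this.reservations.forEach((r) => { total += r.bytes; });
    return total;
  }

  // Check and reserve together, so two parallel uploads cannot both pass the check.
  reserve(uploadId: string, bytes: number, expiresAtMs: number): ReserveResult {
    if (this.reservations.has(uploadId)) return { ok: true }; // idempotent retry
    if (this.committedBytes + this.reservedBytes() + bytes > this.capBytes) {
      return { ok: false, code: "MELMASTOON.FILE.QUOTA_EXCEEDED" };
    }
    this.reservations.set(uploadId, { bytes, expiresAtMs });
    return { ok: true };
  }

  // Upload finished: the reservation becomes committed usage.
  commit(uploadId: string): void {
    const r = this.reservations.get(uploadId);
    if (!r) return;
    this.committedBytes += r.bytes;
    this.reservations.delete(uploadId);
  }

  // Reconcile sweep: release reservations that expired without a commit.
  reclaimExpired(nowMs: number): number {
    let reclaimed = 0;
    this.reservations.forEach((r, id) => {
      if (r.expiresAtMs <= nowMs) {
        reclaimed += r.bytes;
        this.reservations.delete(id);
      }
    });
    return reclaimed;
  }
}
```

With a 100-byte cap, two 60-byte reservations cannot both succeed; the second succeeds only after the first is reclaimed or the cap is raised.
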
R-FILE-008 — Resumable uploads abandoned
| Field | Value |
|---|---|
| Category | Reliability · Cost |
| Likelihood | 5 |
| Impact | 1 |
| Score | 5 |
| Owner | Service tech lead |
| Description | Mobile / desktop client crashes mid-upload, leaves orphan partial blobs in GCS. |
| Mitigations | sweeper aborts sessions past expires_at; releases quota; logs abort reason |
| Monitoring | upload_session_aborts_total{reason='expired'} baseline |
| Acceptance expiry | n/a |
| Linked | FAILURE_MODES U2, U7 |
R-FILE-009 — DSR cascade incomplete
| Field | Value |
|---|---|
| Category | Compliance · GDPR |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Compliance + Service tech lead |
| Description | Erasure event consumed but some files missed (e.g., classification mistake), leaving residual PII. |
| Mitigations | erasure pulls by ownership.guestId AND tenantId; certificate enumerates files; second-pass sweeper re-runs after 24 h; legal hold respected; spot-audit script in QA |
| Monitoring | erasure_runs_total{outcome='partial'} rate |
| Acceptance expiry | n/a |
| Linked | SECURITY_MODEL §8, APPLICATION_LOGIC §3.6 |
R-FILE-010 — Bandwidth-constrained tenant cannot upload
| Field | Value |
|---|---|
| Category | UX · Hotel context |
| Likelihood | 4 |
| Impact | 2 |
| Score | 8 |
| Owner | Product + Service tech lead |
| Description | Pilot regions have unreliable connectivity; large image upload from front desk fails repeatedly. |
| Mitigations | resumable uploads end-to-end; client compresses on capture; image optimization happens server-side so client doesn't need to upload heavyweight derivatives; offline outbox in Electron |
| Monitoring | per-tenant upload success rate; e2e test e2e/low-bandwidth.spec.ts |
| Acceptance expiry | n/a |
| Linked | SYNC_CONTRACT §3-§4 |
R-FILE-011 — Schema-evolution breaks event consumers
| Field | Value |
|---|---|
| Category | Platform · Coupling |
| Likelihood | 2 |
| Impact | 4 |
| Score | 8 |
| Owner | Service tech lead |
| Description | Producer ships breaking change in melmastoon.file.upload.completed.v1 without bumping to v2; consumers explode. |
| Mitigations | events:compatibility-check CI gate; PRs touching event schema require docs update; consumer Pact contracts in CI |
| Monitoring | consumer DLQ growth |
| Acceptance expiry | n/a |
| Linked | MIGRATION_PLAN §7 |
R-FILE-012 — AI orchestrator unavailable degrades UX
| Field | Value |
|---|---|
| Category | UX · External dependency |
| Likelihood | 3 |
| Impact | 1 |
| Score | 3 |
| Owner | Service tech lead |
| Description | Image safety / alt-text / OCR redact unavailable; uploads still complete; degraded UX (no alt text, manual review of ID scans). |
| Mitigations | AI calls async, never block confirm; per-tenant budget caps; HITL fallback; nightly backfill job |
| Monitoring | ai_calls_failed_total{purpose} |
| Acceptance expiry | n/a |
| Linked | AI_INTEGRATION §6, FAILURE_MODES A1 |
R-FILE-013 — Pub/Sub at-least-once causes duplicate workers
| Field | Value |
|---|---|
| Category | Reliability · Correctness |
| Likelihood | 4 |
| Impact | 1 |
| Score | 4 |
| Owner | Service tech lead |
| Description | Same optimization.requested.v1 delivered twice; two workers may compute variants in parallel. |
| Mitigations | upsert on (file_id, variant) unique index; idempotent inbox handler; cheap recompute |
| Monitoring | inbox_dedupe_skips_total{reason='duplicate_event'} |
| Acceptance expiry | n/a |
| Linked | APPLICATION_LOGIC §6, DATA_MODEL §3.4 |
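
A minimal in-memory sketch of why duplicate delivery is harmless under these mitigations: the inbox handler skips event ids it has already seen, and the write itself is an upsert keyed on (file_id, variant), so even a duplicate that gets past the inbox overwrites rather than duplicates. The class and field names are hypothetical stand-ins for the real table and handler.

```typescript
interface VariantRow {
  fileId: string;
  variant: string;
  bytes: number;
}

class VariantStore {
  // Stand-in for a table with a unique index on (file_id, variant).
  private readonly rows = new Map<string, VariantRow>();
  // Stand-in for the inbox table of processed event ids.
  private readonly seenEvents = new Set<string>();
  dedupeSkips = 0;

  // Returns false when the event was already processed and is skipped.
  handle(eventId: string, row: VariantRow): boolean {
    if (this.seenEvents.has(eventId)) {
      this.dedupeSkips++;
      return false;
    }
    this.seenEvents.add(eventId);
    this.upsert(row);
    return true;
  }

  // Same (fileId, variant) always lands on the same row.
  upsert(row: VariantRow): void {
    this.rows.set(JSON.stringify([row.fileId, row.variant]), { ...row });
  }

  count(): number {
    return this.rows.size;
  }
}
```

The `dedupeSkips` counter plays the role of the inbox_dedupe_skips_total metric in this sketch.
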
R-FILE-014 — Tenant prefix invariant bypassed by future code
| Field | Value |
|---|---|
| Category | Security · Multi-tenancy |
| Likelihood | 2 |
| Impact | 5 |
| Score | 10 |
| Owner | Security reviewer |
| Description | A new code path could construct an ObjectKey without the t/<tenant>/ prefix, still pass the DB CHECK, and then leak data across tenants. |
| Mitigations | three layers (domain VO factory, DB CHECK, GCS Conditions); arch-fitness test forbids bypass of factory; security review on every PR touching gcs/ adapter |
| Monitoring | unit tests across the three layers; e2e cross-tenant test |
| Acceptance expiry | n/a |
| Linked | DOMAIN_MODEL §6, SECURITY_MODEL §3 |
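
The first of the three layers, the domain value-object factory, might look roughly like this. It is a sketch under assumptions: the class shape, the allowed tenant-id characters, and the path rules are illustrative and not taken from DOMAIN_MODEL. The point is that a private constructor leaves the factory as the only way to obtain an ObjectKey.

```typescript
class ObjectKey {
  readonly tenantId: string;
  readonly path: string;

  // Private: callers must go through create() or parse().
  private constructor(tenantId: string, path: string) {
    this.tenantId = tenantId;
    this.path = path;
  }

  static create(tenantId: string, relativePath: string): ObjectKey {
    // Assumed tenant-id alphabet; adjust to the real identifier format.
    if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
      throw new Error("invalid tenant id");
    }
    const segments = relativePath.split("/");
    if (segments.some((s) => s === "" || s === "." || s === "..")) {
      throw new Error("invalid object path");
    }
    return new ObjectKey(tenantId, relativePath);
  }

  // Rebuild a key from its stored string form, rejecting keys without the prefix.
  static parse(raw: string): ObjectKey {
    const match = /^t\/([^/]+)\/(.+)$/.exec(raw);
    if (!match) throw new Error("key lacks t/<tenant>/ prefix");
    return ObjectKey.create(match[1] ?? "", match[2] ?? "");
  }

  belongsTo(tenantId: string): boolean {
    return this.tenantId === tenantId;
  }

  toString(): string {
    return `t/${this.tenantId}/${this.path}`;
  }
}
```

An arch-fitness test can then assert that no module outside the domain layer builds key strings by hand.
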
R-FILE-015 — Cloud SQL exhausts connection pool under burst
| Field | Value |
|---|---|
| Category | Reliability |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | SRE on-call |
| Description | Sudden upload burst (e.g., property onboarding) saturates connection pool. |
| Mitigations | PgBouncer transaction pooling; per-instance pool 20; max instances tuned; admission control returns 429 instead of timeout |
| Monitoring | alert when pgpool_active_connections approaches the configured maximum |
| Acceptance expiry | n/a |
| Linked | DEPLOYMENT_TOPOLOGY §7 |
R-FILE-016 — BFF proxy upload swamps service bandwidth
| Field | Value |
|---|---|
| Category | Cost · Reliability |
| Likelihood | 2 |
| Impact | 3 |
| Score | 6 |
| Owner | Service tech lead |
| Description | When tenants opt for proxy upload (constrained networks), large bytes flow through Cloud Run, not direct to GCS. |
| Mitigations | streaming proxy that does not buffer; tenant-level cap on proxy mode; Cloud Run egress monitored; default remains direct upload |
| Monitoring | proxy_upload_bytes_total per tenant |
| Acceptance expiry | n/a |
| Linked | SYNC_CONTRACT §3.2 |
R-FILE-017 — Retention sweeper races erasure
| Field | Value |
|---|---|
| Category | Compliance · Race condition |
| Likelihood | 1 |
| Impact | 2 |
| Score | 2 |
| Owner | Service tech lead |
| Description | Sweeper hard-deletes a file at the same moment an erasure batch tries to enumerate it. |
| Mitigations | SELECT … FOR UPDATE SKIP LOCKED; idempotent purge; certificate accepts both purged and already_purged |
| Monitoring | erasure_runs_total{outcome='partial'} |
| Acceptance expiry | n/a |
| Linked | APPLICATION_LOGIC §3.6 |
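
The idempotent purge and the certificate rule can be sketched as below. The function names and the `failed` status are hypothetical; the register only states that the certificate accepts both purged and already_purged. Row locking with SKIP LOCKED is not modeled.

```typescript
type PurgeStatus = "purged" | "already_purged" | "failed";

// Idempotent purge: deleting something that is already gone is not an error.
function purgeFile(blobs: Map<string, Uint8Array>, fileId: string): PurgeStatus {
  return blobs.delete(fileId) ? "purged" : "already_purged";
}

// The erasure run is complete when every file ended up gone,
// regardless of whether this run or the retention sweeper removed it.
function certificateOutcome(statuses: PurgeStatus[]): "complete" | "partial" {
  return statuses.every((s) => s === "purged" || s === "already_purged")
    ? "complete"
    : "partial";
}
```

If the retention sweeper wins the race, the erasure batch records `already_purged` for that file and the run still counts as complete.
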
R-FILE-018 — Fake-gcs ≠ real GCS in dev
| Field | Value |
|---|---|
| Category | Operational · Test fidelity |
| Likelihood | 4 |
| Impact | 1 |
| Score | 4 |
| Owner | Service tech lead |
| Description | Resumable upload semantics, CORS, and conditions differ between fake-gcs and real GCS. Dev passes; staging fails. |
| Mitigations | nightly e2e against staging GCS bucket; smoke tests in PR pipeline against ephemeral GCS bucket |
| Monitoring | nightly e2e dashboard |
| Acceptance expiry | n/a |
| Linked | LOCAL_DEV_SETUP §11 |
R-FILE-019 — Stale OpenAPI vs. implementation drift
| Field | Value |
|---|---|
| Category | Documentation · Consumer trust |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Service tech lead |
| Description | Endpoint changes ship without OpenAPI regeneration. Pact contracts then fail or, worse, silently pass on stale assumptions. |
| Mitigations | OpenAPI generated from controller decorators (NestJS Swagger); CI fails if openapi/v1.yaml drift detected |
| Monitoring | CI gate |
| Acceptance expiry | n/a |
| Linked | API_CONTRACTS, MIGRATION_PLAN §8 |
R-FILE-020 — Long-tail variant zoo (Phase 3 video)
| Field | Value |
|---|---|
| Category | Scope · Cost |
| Likelihood | 3 |
| Impact | 2 |
| Score | 6 |
| Owner | Product |
| Description | Adding video variants explodes storage and compute cost. |
| Mitigations | Phase 3 only; per-tenant opt-in; profile-driven (mobile-portrait only by default); cost guardrails in product plan |
| Monitoring | gcs_object_bytes_total by mediaKind |
| Acceptance expiry | scoped to Phase 3 design |
| Linked | MIGRATION_PLAN §12 |
3. Risk heatmap (current)
| Likelihood ↓ / Impact → | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | | R-017 | | | R-006 |
| 2 | | | R-016 | R-011 | R-001, R-009, R-014 |
| 3 | R-012 | R-004, R-007, R-015, R-019, R-020 | R-002, R-003 | | |
| 4 | R-013, R-018 | R-010 | R-005 | | |
| 5 | R-008 | | | | |
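
Because cell placement follows mechanically from each entry's Likelihood and Impact, the grid can be derived from the register instead of maintained by hand. The sketch below copies the Likelihood and Impact pairs from section 2; the helper names are hypothetical.

```typescript
type Entry = readonly [id: string, likelihood: number, impact: number];

// Likelihood and Impact values as listed in the Active risks tables.
const register: ReadonlyArray<Entry> = [
  ["R-001", 2, 5], ["R-002", 3, 3], ["R-003", 3, 3], ["R-004", 3, 2],
  ["R-005", 4, 3], ["R-006", 1, 5], ["R-007", 3, 2], ["R-008", 5, 1],
  ["R-009", 2, 5], ["R-010", 4, 2], ["R-011", 2, 4], ["R-012", 3, 1],
  ["R-013", 4, 1], ["R-014", 2, 5], ["R-015", 3, 2], ["R-016", 2, 3],
  ["R-017", 1, 2], ["R-018", 4, 1], ["R-019", 3, 2], ["R-020", 3, 2],
];

// grid[likelihood - 1][impact - 1] holds the ids that fall in that cell.
function buildHeatmap(entries: ReadonlyArray<Entry>): string[][][] {
  const grid: string[][][] = Array.from({ length: 5 }, () =>
    Array.from({ length: 5 }, () => [] as string[]),
  );
  for (const [id, likelihood, impact] of entries) {
    const cell = grid[likelihood - 1]?.[impact - 1];
    if (!cell) throw new RangeError(`${id}: rating outside 1..5`);
    cell.push(id);
  }
  return grid;
}

// Render the body rows in the same pipe-table layout as the heatmap.
function renderRows(grid: string[][][]): string[] {
  return grid.map(
    (row, i) => `| ${i + 1} | ${row.map((cell) => cell.join(", ")).join(" | ")} |`,
  );
}

const heatmap = buildHeatmap(register);
console.log(renderRows(heatmap).join("\n"));
```

Running this as part of the monthly review gives a quick cross-check that every active risk appears exactly once and in the cell its ratings imply.
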
4. Risk owners
- Service tech lead: end-to-end accountable
- Security reviewer: signs off on R-001, R-014, R-006
- SRE on-call: owns operational risks (R-002, R-005, R-015)
- Compliance / DPO: owns R-009
- Product: owns scope risks (R-010, R-020)
5. Review cadence
- Monthly: tech lead + SRE walk-through, update scores, surface new risks
- Quarterly: platform architecture council review; write off accepted risks; promote unmitigated risks to executive review
- Annually: re-baseline of likelihood ratings against actual incident data from observability and postmortems
6. History (closed risks)
| ID | Title | Closed date | Reason |
|---|---|---|---|
| (none yet — initial register) | | | |