FAILURE_MODES — theme-config-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER
This document enumerates how theme-config-service is designed to fail safely. Each failure mode lists detection, blast radius, automatic mitigation, manual mitigation, and the operator runbook anchor.
1. Categorisation
| Category | Examples |
|---|---|
| Authoring path failure | DB unavailable, OCC conflict, validation rejection |
| Publish path failure | GCS upload error, transaction rollback mid-flip, CDN invalidation timeout |
| Read path degradation | Memorystore down, GCS slow, CDN edge miss storm |
| Eventing failure | Outbox backlog, inbox stuck, DLQ growth |
| Cross-tenant safety failure | RLS bypass, asset URL injection, preview-link leak |
| AI surface failure | Orchestrator down, output unsafe, budget exhausted |
| Data integrity failure | Bundle SHA mismatch, broken asset reference, corrupted JSONB |
| Capacity failure | Cloud Run scale ceiling, DB connection storm, Pub/Sub publish throttle |
2. Failure-mode catalogue
2.1 Cloud SQL primary failure
- Detection: Cloud SQL HA monitor; app `db_pool` saturation; `readyz` fails.
- Blast radius: all write paths blocked; reads served stale from Memorystore until TTL.
- Automatic mitigation: Cloud SQL regional HA failover (≤ 60 s); Cloud Run drops failing instances; clients see 503 for ~1 min.
- Manual mitigation: none in the standard case; `runbooks/db-failover.md` for confirmation steps.
- Customer experience: booking-flow continues to render via CDN-cached bundles; backoffice editing unavailable for ~1 min.
2.2 OCC conflict (MELMASTOON.PLATFORM.PRECONDITION_FAILED)
- Detection: repository `save` returns version mismatch.
- Blast radius: the single requesting actor.
- Automatic mitigation: none required; client should refetch and retry.
- Customer experience: UI shows "Someone else updated this — please reload"; no data lost.
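The version check behind this failure mode can be sketched as follows. This is a minimal in-memory illustration, not the service's repository: `ThemeDraft`, `SaveResult`, and `InMemoryDraftRepository` are hypothetical names, and the real check would run against Postgres.

```typescript
// Hypothetical shapes for illustration; the real repository is backed by Postgres.
interface ThemeDraft {
  id: string;
  version: number; // version the caller last read; 0 for a brand-new draft
  tokens: Record<string, string>;
}

type SaveResult =
  | { ok: true; draft: ThemeDraft }
  | { ok: false; code: "MELMASTOON.PLATFORM.PRECONDITION_FAILED"; currentVersion: number };

class InMemoryDraftRepository {
  private readonly rows = new Map<string, ThemeDraft>();

  // Returns a copy so callers cannot mutate stored state directly.
  get(id: string): ThemeDraft | undefined {
    const row = this.rows.get(id);
    return row ? { ...row, tokens: { ...row.tokens } } : undefined;
  }

  // The write succeeds only when the caller's version matches the stored one.
  save(draft: ThemeDraft): SaveResult {
    const stored = this.rows.get(draft.id);
    const currentVersion = stored ? stored.version : 0;
    if (draft.version !== currentVersion) {
      return { ok: false, code: "MELMASTOON.PLATFORM.PRECONDITION_FAILED", currentVersion };
    }
    const next: ThemeDraft = { ...draft, version: currentVersion + 1 };
    this.rows.set(draft.id, next);
    return { ok: true, draft: next };
  }
}
```

When two editors read the same version, the first save wins and the second receives the precondition failure with no merge attempted, which is why the client only needs to refetch and retry.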
2.3 Publish-time validation rejection
- Detection: `PublishThemeVersionUseCase` raises one or more validation errors before TX open.
- Blast radius: the publish actor; no state change.
- Automatic mitigation: publish-rejected event emitted; UI surfaces field-level violations; the version stays in `preview_ready`.
- Manual mitigation: author fixes the issue and republishes.
2.4 GCS upload failure during publish
- Detection: GCS client error or timeout (> 10 s).
- Blast radius: the publishing tenant.
- Automatic mitigation: publish use case aborts pre-TX; no DB write occurs; idempotency-key allows safe retry.
- Manual mitigation: if the upload eventually succeeded but the response was lost, retry uploads to the same object generation; lifecycle rule cleans orphans.
2.5 Publication-flip transaction failure
- Detection: Postgres unique-violation on `theme_publications_active_uq` or constraint violation.
- Blast radius: the publishing tenant; concurrent publish race.
- Automatic mitigation: TX rolls back; orphan GCS object reaped; the loser sees `409 PUBLISH_CONFLICT`.
- Manual mitigation: `runbooks/publish-conflict.md`.
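One way the losing transaction's error could be translated into the 409 is sketched below. `DbErrorLike`, `HttpProblem`, and `mapPublicationFlipError` are hypothetical names; the error shape is an assumption loosely modelled on drivers that expose the SQLSTATE code and the constraint name. Only the constraint name and the `PUBLISH_CONFLICT` code come from this document.

```typescript
// Assumed minimal error shape: SQLSTATE in `code`, violated constraint in `constraint`.
interface DbErrorLike {
  code?: string;
  constraint?: string;
}

interface HttpProblem {
  status: number;
  code: string;
}

const UNIQUE_VIOLATION = "23505"; // Postgres SQLSTATE for unique_violation
const ACTIVE_PUBLICATION_CONSTRAINT = "theme_publications_active_uq";

// Maps the concurrent-publish race to 409; anything else is left to a generic handler.
function mapPublicationFlipError(err: DbErrorLike): HttpProblem | undefined {
  if (err.code === UNIQUE_VIOLATION && err.constraint === ACTIVE_PUBLICATION_CONSTRAINT) {
    return { status: 409, code: "PUBLISH_CONFLICT" };
  }
  return undefined;
}
```

Matching on the constraint name as well as the SQLSTATE keeps unrelated unique violations from being reported as publish races.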
2.6 CDN invalidation timeout
- Detection: `invalidateByTag` exceeds 10 s or returns retryable error.
- Blast radius: publish completes; new bundle is on origin but CDN edges may still serve stale for up to TTL (5 min by `max-age`, 24 h by `s-maxage`, mitigated by SWR).
- Automatic mitigation: `cdn-invalidation-retrier` worker retries with backoff up to 6 attempts over 5 min; alerts after 3 failed attempts.
- Manual mitigation: force a manual invalidation via gcloud; runbook `runbooks/cdn-invalidation.md`.
- Customer experience: existing visitors keep seeing the old brand for up to 5 min; new sessions revalidate. Acceptable for a non-critical path.
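The retry policy above (6 attempts within 5 min, alert after the third failure) could be realised with a doubling schedule like the one sketched here. The doubling factor, the function names, and the injected `sleep` and `onAlert` callbacks are assumptions for illustration; the worker's actual backoff curve is not specified in this document.

```typescript
// Delays between attempts in milliseconds. Each delay doubles, and the series is
// scaled so that the last attempt starts `windowMs` after the first one.
function backoffDelays(attempts: number, windowMs: number): number[] {
  if (attempts < 2) return [];
  const gaps = attempts - 1;
  // Geometric series: base * (2^gaps - 1) === windowMs
  const base = windowMs / (Math.pow(2, gaps) - 1);
  return Array.from({ length: gaps }, (_, i) => base * Math.pow(2, i));
}

const ALERT_AFTER_FAILURES = 3;

// Returns true once an invalidation succeeds, false when all attempts are exhausted.
async function retryInvalidation(
  invalidate: () => Promise<void>,
  onAlert: (failures: number) => void,
  sleep: (ms: number) => Promise<void>,
  attempts: number = 6,
  windowMs: number = 5 * 60 * 1000,
): Promise<boolean> {
  const delays = backoffDelays(attempts, windowMs);
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await invalidate();
      return true;
    } catch {
      if (attempt === ALERT_AFTER_FAILURES) onAlert(attempt);
      if (attempt < attempts) await sleep(delays[attempt - 1]);
    }
  }
  return false;
}
```

With the defaults, the five gaps are roughly 9.7 s, 19.4 s, 38.7 s, 77.4 s, and 154.8 s, so the alert fires about 29 s after the first failure, well before the window closes.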
2.7 Memorystore unreachable
- Detection: Redis client errors > 1 % over 5 min.
- Blast radius: elevated origin-read load; bundle reads still served from origin GCS.
- Automatic mitigation: circuit breaker in `MemorystoreBundleCacheAdapter` opens; reads bypass cache; writes are logged but not retried.
- Manual mitigation: Memorystore HA usually recovers automatically; runbook `runbooks/memorystore-degradation.md`.
- Customer experience: read latency may double (still ≤ 250 ms p95); no errors.
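The cache-bypass behaviour can be illustrated with a small consecutive-failure circuit breaker. This is a sketch under assumptions: the threshold, the cool-down, and the names `CircuitBreaker` and `readBundle` are hypothetical, and the real adapter trips on an error rate rather than a failure count.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly coolDownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, coolDownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
    this.now = now;
  }

  // True when the cache may be tried; false means the caller should bypass it.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.coolDownMs) {
      this.state = "half-open"; // cool-down elapsed: allow a probe
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// Reads go to the cache while the breaker allows it and always fall back to origin.
function readBundle(
  key: string,
  breaker: CircuitBreaker,
  cacheGet: (key: string) => string | undefined, // may throw when the cache is down
  originGet: (key: string) => string,
): string {
  if (breaker.allowRequest()) {
    try {
      const hit = cacheGet(key);
      breaker.recordSuccess();
      if (hit !== undefined) return hit;
    } catch {
      breaker.recordFailure();
    }
  }
  return originGet(key);
}
```

While the breaker is open the cache client is never called, so a degraded Memorystore adds no latency to reads beyond the origin fetch itself.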
2.8 Outbox backlog
- Detection: `theme_outbox_unpublished_rows` > 1 000 sustained 10 min; `theme_outbox_lag_seconds` p95 > 30 s.
- Blast radius: delayed downstream effects (BFF cache invalidation, audit, analytics); CDN propagation may lag publish events.
- Automatic mitigation: outbox-publisher scales up; Pub/Sub backpressure handled by retry-with-backoff; alert paged.
- Manual mitigation: `runbooks/outbox-backlog.md`, which diagnoses Pub/Sub publish errors, consumer-side throttles, or DB IOPS.
2.9 Inbox stuck (consumed event handler failing)
- Detection: Pub/Sub subscription unacked count grows; DLQ inflow detected.
- Blast radius: delayed reactions to `tenant.created.v1` (new tenants don't get default theme), `tenant.config_updated.v1` (formatting drift), `media.deleted.v1` (broken-asset detection delayed).
- Automatic mitigation: Pub/Sub retries with exponential backoff; after 5 failed delivery attempts the message goes to the DLQ.
- Manual mitigation: inspect DLQ via Cloud Console; replay after fix; runbook `runbooks/inbox-dlq.md`.
2.10 RLS bypass / cross-tenant read
- Detection: integration test `cross_tenant_isolation.spec.ts` failing; runtime metric `theme_rls_violations_total > 0`; sentry-style anomaly detection.
- Blast radius: potentially severe (data leak across tenants).
- Automatic mitigation: none; the design must prevent this, and dynamic enforcement provides defence in depth.
- Manual mitigation: sev1 incident; quarantine the offending revision; runbook `runbooks/sev1-cross-tenant.md`.
2.11 Stored XSS via ContentBlock.body
- Detection: allow-list violation at write rejected by validator; runtime detection by BFF CSP report-uri.
- Blast radius: rendered HTML on the booking flow could execute attacker-controlled script.
- Automatic mitigation: dompurify allow-list at write and at render in BFFs; CSP `script-src 'self'` blocks in-browser execution.
- Manual mitigation: sev1 incident; revoke the offending content block; rotate any leaked tokens; runbook `runbooks/sev1-stored-xss.md`.
2.12 Bundle SHA mismatch detected at edge
- Detection: BFF first-read SHA verification fails; `theme.bundle.integrity.violation` emitted.
- Blast radius: the affected theme; the BFF refuses to serve until verified.
- Automatic mitigation: purge cache; re-fetch from origin; if still mismatch, fall back to the previous published bundle (last-known-good cached at the BFF).
- Manual mitigation: sev2 incident; investigate GCS object generation history; runbook `runbooks/sev2-bundle-integrity.md`.
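The purge, re-fetch, and fall-back sequence can be sketched as a single resolution function. The `BundleSources` interface, the `Resolved` type, and the injected `sha256Hex` function are hypothetical; the hash is passed in so the sketch stays independent of any particular crypto library.

```typescript
interface BundleSources {
  cacheGet(): string | undefined; // cached bundle body, if any
  cachePurge(): void; // drop the cached copy
  originGet(): string; // re-fetch from origin
  lastKnownGood(): string | undefined; // previous published bundle held by the BFF
}

type Resolved =
  | { source: "cache" | "origin" | "last-known-good"; body: string }
  | { source: "none" };

function resolveBundle(
  expectedSha: string,
  sources: BundleSources,
  sha256Hex: (body: string) => string,
): Resolved {
  const cached = sources.cacheGet();
  if (cached !== undefined) {
    if (sha256Hex(cached) === expectedSha) return { source: "cache", body: cached };
    sources.cachePurge(); // mismatch: never serve the suspect copy again
  }
  const fresh = sources.originGet();
  if (sha256Hex(fresh) === expectedSha) return { source: "origin", body: fresh };
  // Origin is also suspect: serve the previous published bundle if one is held.
  const fallback = sources.lastKnownGood();
  return fallback !== undefined
    ? { source: "last-known-good", body: fallback }
    : { source: "none" };
}
```

The last-known-good bundle is deliberately not checked against `expectedSha`, since it belongs to the previous publication and carries a different digest; a `none` result corresponds to the BFF refusing to serve.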
2.13 Preview-link brute force
- Detection: sustained 401/404 spikes per `tokenHash` or per IP.
- Blast radius: none if rate-limited correctly; otherwise potential exposure of a specific draft.
- Automatic mitigation: per-`tokenHash` rate limit 60 rpm + per-tenant 600 rpm; auto IP block after 100 failures in 1 min via Cloud Armor rule.
- Manual mitigation: revoke all preview tokens for the tenant; runbook `runbooks/sev3-preview-leak.md`.
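The two limits can be layered as shown in this sketch. It assumes a fixed-window counter held in process memory; `FixedWindowLimiter` and `allowPreviewRequest` are hypothetical names, and a multi-instance deployment would need a shared store for the counters.

```typescript
class FixedWindowLimiter {
  private readonly counts = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Counts the request and returns true if it fits in the key's current window.
  tryAcquire(key: string, nowMs: number): boolean {
    const entry = this.counts.get(key);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

const ONE_MINUTE_MS = 60 * 1000;
const perTokenLimiter = new FixedWindowLimiter(60, ONE_MINUTE_MS); // 60 rpm per tokenHash
const perTenantLimiter = new FixedWindowLimiter(600, ONE_MINUTE_MS); // 600 rpm per tenant

// The token limit is checked first, so a throttled token does not consume tenant budget.
function allowPreviewRequest(tokenHash: string, tenantId: string, nowMs: number): boolean {
  return (
    perTokenLimiter.tryAcquire(tokenHash, nowMs) &&
    perTenantLimiter.tryAcquire(tenantId, nowMs)
  );
}
```

A fixed window permits short bursts across a window boundary; a sliding window or token bucket would smooth that at the cost of more state per key.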
2.14 Broken asset URL after file-storage-service deletion
- Detection: a consumed `melmastoon.media.deleted.v1` event names an asset referenced by an active version; daily scanner detects 404 from origin.
- Blast radius: booking flow may render a missing image; never breaks page load (alt text + skeleton).
- Automatic mitigation: `theme.broken_asset_detected.v1` emitted; backoffice surfaces banner "Broken asset detected" with the offending block.
- Manual mitigation: author replaces the asset and republishes (rollback also restores).
2.15 AI orchestrator unavailable / unsafe output
- Detection: orchestrator returns 5xx, safety-blocked, or schema-invalid.
- Blast radius: AI surfaces unavailable; authoring continues without AI.
- Automatic mitigation: API returns `503 MELMASTOON.AI.UNAVAILABLE` with `Retry-After`; UI gracefully degrades to manual edit.
- Manual mitigation: runbook `runbooks/ai-degradation.md`.
2.16 AI prompt injection through tenant input
- Detection: orchestrator output deviates from schema (anomaly), or output contains banned phrases.
- Blast radius: drafted content potentially adversarial — but never auto-applied.
- Automatic mitigation: orchestrator-side instruction-hierarchy; output schema enforced; HITL gate.
- Manual mitigation: review safety logs; adjust input sanitisation if a new injection pattern emerges; runbook `runbooks/sev3-ai-prompt-injection.md`.
2.17 Concurrent draft edits with stale UI
- Detection: OCC `412` returned to client.
- Blast radius: the editing user only.
- Automatic mitigation: UI prompts to reload; no data loss because no merge occurred.
- Manual mitigation: none.
2.18 Locale removal that's still referenced
- Detection: `RemoveLocaleUseCase` validation finds copy in the locale referenced by an active CTA.
- Blast radius: the requesting actor.
- Automatic mitigation: rejected with `409 LOCALE_IN_USE` listing the references.
- Manual mitigation: author migrates content to fallback locale, then retries.
2.19 Layout preset deactivated by Frontend Platform
- Detection: publish validation finds `layoutSelections.<surface>.presetKey` no longer `is_active`.
- Blast radius: the publishing tenant.
- Automatic mitigation: publish blocked with `MELMASTOON.THEME.LAYOUT_PRESET_INACTIVE`; UI prompts a substitution.
- Manual mitigation: author selects a different preset.
2.20 Cloud Run instance scale ceiling
- Detection: request 503s at the LB; `instance_count_at_max` = true.
- Blast radius: authoring path; CDN-cached reads unaffected.
- Automatic mitigation: Cloud Run scales out to the configured maximum; requests beyond that capacity receive 503s.
- Manual mitigation: raise max instance count via runbook `runbooks/scale-ceiling.md`; review whether a publish storm is the cause.
2.21 Tenant onboarding theme provisioning failure
- Detection: `ProvisionDefaultThemeUseCase` raises and the consumed event lands in DLQ.
- Blast radius: the new tenant has no theme; tenant-onboarding saga marks brand step `degraded`.
- Automatic mitigation: retried by Pub/Sub up to 5 times.
- Manual mitigation: tenant admin can complete branding manually via wizard; runbook `runbooks/onboarding-degraded.md`.
2.22 Pub/Sub publisher quota exceeded
- Detection: publish errors with `RESOURCE_EXHAUSTED`.
- Blast radius: outbox backlog; eventual consistency lag; no data loss.
- Automatic mitigation: outbox retries with backoff; alert paged.
- Manual mitigation: request quota increase.
2.23 Bundle larger than 40 KB budget
- Detection: publish validation logs warning; alert if persistent (> 5 publishes/day breach).
- Blast radius: worse first-paint on the booking flow.
- Automatic mitigation: warning only — publish proceeds (the budget is a soft target, not a hard constraint).
- Manual mitigation: author trims content blocks or compresses copy; bigger fix is to lazy-load some content blocks (BFF responsibility).
2.24 GCS bucket misconfiguration (e.g. public-write)
- Detection: SCC finding; configuration scanner.
- Blast radius: could enable bundle tampering (sev1).
- Automatic mitigation: Terraform drift detection runs nightly; reverts policy.
- Manual mitigation: runbook `runbooks/sev1-gcs-misconfig.md`; SHA verification protects readers in the meantime.
3. Cross-cutting principles
- Fail closed on writes, fail open on reads. Authoring failures must never produce inconsistent state; read failures should degrade gracefully (cached → origin → last-known-good) before erroring.
- No silent failures. Every failure path emits a structured log + a metric increment + (where applicable) an event. Operators can answer "what just happened?" from observability alone.
- Idempotency everywhere. Retries must be safe; we use `Idempotency-Key`, OCC, outbox/inbox, and pure-function bundle build.
- Tenant blast radius isolation. A failure for tenant A must not affect tenants B/C/D. We monitor `theme_usecase_errors_total` per `tenantTier` to detect noisy-neighbour patterns early.
- Preserve last-known-good. Rollback is always available; the BFF caches the last-known-good bundle for 24 h to survive an integrity violation without an immediate code deploy.
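To make the idempotency principle concrete, the sketch below shows how an `Idempotency-Key` can turn a retried request into a replay of the first result. `IdempotentExecutor` is a hypothetical name and the in-memory map is an assumption; a real deployment would persist keys with an expiry.

```typescript
class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>();

  // Runs `operation` at most once per key; later calls with the same key
  // return the stored result without re-executing side effects.
  run(idempotencyKey: string, operation: () => T): T {
    if (this.results.has(idempotencyKey)) {
      return this.results.get(idempotencyKey) as T;
    }
    const result = operation(); // if this throws, nothing is stored and a retry re-executes
    this.results.set(idempotencyKey, result);
    return result;
  }
}
```

Because a failed operation stores nothing, a retry after an error runs the operation again, while a retry after a lost response replays the recorded outcome.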
4. Runbook index
`services/theme-config-service/runbooks/`:
- `db-failover.md`
- `publish-conflict.md`
- `cdn-invalidation.md`
- `memorystore-degradation.md`
- `outbox-backlog.md`
- `inbox-dlq.md`
- `sev1-cross-tenant.md`
- `sev1-stored-xss.md`
- `sev2-bundle-integrity.md`
- `sev3-preview-leak.md`
- `ai-degradation.md`
- `sev3-ai-prompt-injection.md`
- `scale-ceiling.md`
- `onboarding-degraded.md`
- `sev1-gcs-misconfig.md`
Each runbook follows the platform format: detection signals → first 5 minutes → diagnosis → mitigation → prevention.
5. References
- Observability + alerting: `OBSERVABILITY`
- Security threat model: `SECURITY_MODEL §1`
- Risk register entries: `SERVICE_RISK_REGISTER`