FAILURE_MODES — theme-config-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

This document enumerates how theme-config-service is designed to fail safely. Each failure mode lists detection, blast radius, automatic mitigation, manual mitigation, and the operator runbook anchor.


1. Categorisation

Category | Examples
Authoring path failure | DB unavailable, OCC conflict, validation rejection
Publish path failure | GCS upload error, transaction rollback mid-flip, CDN invalidation timeout
Read path degradation | Memorystore down, GCS slow, CDN edge miss storm
Eventing failure | Outbox backlog, inbox stuck, DLQ growth
Cross-tenant safety failure | RLS bypass, asset URL injection, preview-link leak
AI surface failure | Orchestrator down, output unsafe, budget exhausted
Data integrity failure | Bundle SHA mismatch, broken asset reference, corrupted JSONB
Capacity failure | Cloud Run scale ceiling, DB connection storm, Pub/Sub publish throttle

2. Failure-mode catalogue

2.1 Cloud SQL primary failure

  • Detection: Cloud SQL HA monitor; app db_pool saturation; readyz fails.
  • Blast radius: all write paths blocked; reads served stale from Memorystore until TTL.
  • Automatic mitigation: Cloud SQL regional HA failover (≤ 60 s); Cloud Run drops failing instances; clients see 503 for ~1 min.
  • Manual mitigation: none in standard case; runbooks/db-failover.md for confirmation steps.
  • Customer experience: booking-flow continues to render via CDN-cached bundles; backoffice editing unavailable for ~1 min.

2.2 OCC conflict (MELMASTOON.PLATFORM.PRECONDITION_FAILED)

  • Detection: repository save returns version mismatch.
  • Blast radius: the single requesting actor.
  • Automatic mitigation: none required; client should refetch and retry.
  • Customer experience: UI shows "Someone else updated this — please reload"; no data lost.
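
The version check behind this failure mode can be sketched as follows. This is a minimal in-memory illustration, not the service's repository: `InMemoryDraftRepository`, `ThemeDraft`, and `SaveResult` are hypothetical names, and only the error code comes from the heading above.

```typescript
// Hypothetical stand-in for the real repository; names are illustrative only.
interface ThemeDraft {
  id: string;
  version: number;
  payload: string;
}

type SaveResult =
  | { ok: true; draft: ThemeDraft }
  | { ok: false; code: "MELMASTOON.PLATFORM.PRECONDITION_FAILED"; currentVersion: number };

class InMemoryDraftRepository {
  private readonly rows = new Map<string, ThemeDraft>();

  seed(draft: ThemeDraft): void {
    this.rows.set(draft.id, { ...draft });
  }

  // Accept the write only if the caller read the latest version.
  save(id: string, expectedVersion: number, payload: string): SaveResult {
    const current = this.rows.get(id);
    if (current === undefined) {
      throw new Error(`unknown draft ${id}`);
    }
    if (current.version !== expectedVersion) {
      // Version mismatch: nothing is written, the caller must refetch.
      return {
        ok: false,
        code: "MELMASTOON.PLATFORM.PRECONDITION_FAILED",
        currentVersion: current.version,
      };
    }
    const next: ThemeDraft = { id, version: current.version + 1, payload };
    this.rows.set(id, next);
    return { ok: true, draft: next };
  }
}
```

A client that receives the failure result would refetch the draft, reapply its change on top of `currentVersion`, and save again.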

2.3 Publish-time validation rejection

  • Detection: PublishThemeVersionUseCase raises one or more validation errors before TX open.
  • Blast radius: the publish actor; no state change.
  • Automatic mitigation: publish-rejected event emitted; UI surfaces field-level violations; the version stays in preview_ready.
  • Manual mitigation: author fixes the issue and republishes.

2.4 GCS upload failure during publish

  • Detection: GCS client error or timeout (> 10 s).
  • Blast radius: the publishing tenant.
  • Automatic mitigation: publish use case aborts pre-TX; no DB write occurs; idempotency-key allows safe retry.
  • Manual mitigation: if the upload eventually succeeded but the response was lost, the retry uploads to the same object generation; a lifecycle rule cleans up orphaned objects.
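
One way to picture why the idempotency key makes the retry safe: an outcome is recorded only after the publish succeeds, so a failed attempt leaves nothing behind and a repeated key never publishes twice. The sketch below is an assumption-laden illustration; `IdempotentPublisher` and `PublishOutcome` are hypothetical, and a real store would be durable rather than in memory.

```typescript
// Hypothetical outcome shape; the real publish result is not specified here.
interface PublishOutcome {
  publicationId: string;
}

class IdempotentPublisher {
  private readonly completed = new Map<string, PublishOutcome>();
  private readonly publish: (themeId: string) => PublishOutcome;

  constructor(publish: (themeId: string) => PublishOutcome) {
    this.publish = publish;
  }

  handle(idempotencyKey: string, themeId: string): PublishOutcome {
    const prior = this.completed.get(idempotencyKey);
    if (prior !== undefined) {
      return prior; // replay: return the recorded outcome, do not publish again
    }
    // If publish throws (for example on a GCS timeout), nothing is recorded,
    // so a retry with the same key simply runs the publish again.
    const outcome = this.publish(themeId);
    this.completed.set(idempotencyKey, outcome);
    return outcome;
  }
}
```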

2.5 Publication-flip transaction failure

  • Detection: Postgres unique-violation on theme_publications_active_uq or constraint violation.
  • Blast radius: the publishing tenant; concurrent publish race.
  • Automatic mitigation: TX rolls back; orphan GCS object reaped; the loser sees 409 PUBLISH_CONFLICT.
  • Manual mitigation: runbooks/publish-conflict.md.
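
The mapping from the database error to the 409 can be sketched as below. It assumes the driver exposes the SQLSTATE as `code` and the constraint name as `constraint` (node-postgres does this); `DbErrorLike`, `HttpProblem`, and the 500 fallback are illustrative assumptions, not the service's actual error model.

```typescript
// Assumed error shape: SQLSTATE in `code`, constraint name in `constraint`.
interface DbErrorLike {
  code?: string;
  constraint?: string;
}

interface HttpProblem {
  status: number;
  code: string;
}

const UNIQUE_VIOLATION = "23505"; // PostgreSQL SQLSTATE for unique_violation
const ACTIVE_PUBLICATION_CONSTRAINT = "theme_publications_active_uq";

function mapPublishFlipError(err: DbErrorLike): HttpProblem {
  if (err.code === UNIQUE_VIOLATION && err.constraint === ACTIVE_PUBLICATION_CONSTRAINT) {
    // Two publishes raced for the active slot; this caller lost.
    return { status: 409, code: "PUBLISH_CONFLICT" };
  }
  // Assumption: anything else is surfaced as an unexpected server error.
  return { status: 500, code: "INTERNAL" };
}
```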

2.6 CDN invalidation timeout

  • Detection: invalidateByTag exceeds 10 s or returns retryable error.
  • Blast radius: publish completes and the new bundle is on origin, but CDN edges may keep serving the stale bundle until its TTL expires (max-age 5 min, s-maxage 24 h, mitigated by SWR).
  • Automatic mitigation: cdn-invalidation-retrier worker retries with backoff up to 6 attempts over 5 min; alerts after 3 failed attempts.
  • Manual mitigation: force a manual invalidation via gcloud; runbook runbooks/cdn-invalidation.md.
  • Customer experience: existing visitors keep seeing the old brand for up to 5 min; new sessions revalidate. Acceptable for a non-critical path.
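
A retry schedule matching "up to 6 attempts over 5 min" could look like the sketch below. Only the attempt count, the 5-minute budget, and the alert threshold come from the text above; the 10 s base delay and the doubling factor are assumptions chosen for illustration.

```typescript
// Delays (ms) to wait before attempts 2..maxAttempts, clipped to a total budget.
function backoffDelaysMs(
  maxAttempts: number,
  baseMs: number,
  factor: number,
  budgetMs: number,
): number[] {
  const delays: number[] = [];
  let elapsed = 0;
  for (let i = 0; i < maxAttempts - 1; i++) {
    const remaining = budgetMs - elapsed;
    if (remaining <= 0) break;
    // Exponential growth, but never wait past the overall budget.
    const delay = Math.min(baseMs * Math.pow(factor, i), remaining);
    delays.push(delay);
    elapsed += delay;
  }
  return delays;
}

// The alert fires once three attempts have failed.
function shouldAlert(failedAttempts: number): boolean {
  return failedAttempts >= 3;
}

// Assumed parameters: 10 s base, doubling, 5 min budget.
const schedule = backoffDelaysMs(6, 10_000, 2, 300_000);
// schedule is [10000, 20000, 40000, 80000, 150000]: the last delay is clipped
// from 160 s to 150 s so that the sixth attempt still lands inside 5 minutes.
```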

2.7 Memorystore unreachable

  • Detection: Redis client errors > 1 % over 5 min.
  • Blast radius: elevated origin-read load; bundle reads still served from origin GCS.
  • Automatic mitigation: circuit breaker in MemorystoreBundleCacheAdapter opens; reads bypass cache; writes are logged but not retried.
  • Manual mitigation: Memorystore HA usually recovers automatically; runbook runbooks/memorystore-degradation.md.
  • Customer experience: read latency may double (still ≤ 250 ms p95); no errors.
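
The bypass behaviour can be illustrated with a small circuit breaker. This is a generic sketch of the pattern, not the code of MemorystoreBundleCacheAdapter; the failure threshold, the cooldown, and the `readBundle` helper are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown, let a trial request through.
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // open, or re-open after a failed trial
    }
  }
}

// Read through the cache when the breaker allows it, otherwise go to origin.
function readBundle(
  breaker: CircuitBreaker,
  readCache: () => string | null,
  readOrigin: () => string,
): string {
  if (breaker.allowRequest()) {
    try {
      const hit = readCache();
      breaker.recordSuccess();
      if (hit !== null) return hit;
    } catch {
      breaker.recordFailure();
    }
  }
  return readOrigin();
}
```

While the breaker is open, reads skip the cache entirely, which is why callers see higher latency rather than errors.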

2.8 Outbox backlog

  • Detection: theme_outbox_unpublished_rows > 1 000 sustained 10 min; theme_outbox_lag_seconds p95 > 30 s.
  • Blast radius: delayed downstream effects (BFF cache invalidation, audit, analytics); CDN propagation may lag publish events.
  • Automatic mitigation: outbox-publisher scales up; Pub/Sub backpressure handled by retry-with-backoff; alert paged.
  • Manual mitigation: runbooks/outbox-backlog.md — diagnoses Pub/Sub publish errors, consumer-side throttles, or DB IOPS.

2.9 Inbox stuck (consumed event handler failing)

  • Detection: Pub/Sub subscription unacked count grows; DLQ inflow detected.
  • Blast radius: delayed reactions to tenant.created.v1 (new tenants don't get default theme), tenant.config_updated.v1 (formatting drift), media.deleted.v1 (broken-asset detection delayed).
  • Automatic mitigation: Pub/Sub retries with exponential backoff; after at most 5 delivery attempts the message is routed to the DLQ.
  • Manual mitigation: inspect DLQ via Cloud Console; replay after fix; runbook runbooks/inbox-dlq.md.

2.10 RLS bypass / cross-tenant read

  • Detection: integration test cross_tenant_isolation.spec.ts failing; runtime metric theme_rls_violations_total > 0; sentry-style anomaly detection.
  • Blast radius: potentially severe (data leak across tenants).
  • Automatic mitigation: none — design must prevent this; dynamic enforcement defends in depth.
  • Manual mitigation: sev1 incident; quarantine the offending revision; runbook runbooks/sev1-cross-tenant.md.

2.11 Stored XSS via ContentBlock.body

  • Detection: allow-list violation at write rejected by validator; runtime detection by BFF CSP report-uri.
  • Blast radius: rendered HTML on the booking flow could execute attacker-controlled script.
  • Automatic mitigation: dompurify allow-list at write and at render in BFFs; CSP script-src 'self' blocks in-browser execution.
  • Manual mitigation: sev1 incident; revoke the offending content block; rotate any leaked tokens; runbook runbooks/sev1-stored-xss.md.

2.12 Bundle SHA mismatch detected at edge

  • Detection: BFF first-read SHA verification fails; theme.bundle.integrity.violation emitted.
  • Blast radius: the affected theme; the BFF refuses to serve until verified.
  • Automatic mitigation: purge the cache and re-fetch from origin; if the mismatch persists, fall back to the previous published bundle (last-known-good, cached at the BFF).
  • Manual mitigation: sev2 incident; investigate GCS object generation history; runbook runbooks/sev2-bundle-integrity.md.
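
The automatic mitigation for a SHA mismatch (purge, re-fetch, then last-known-good) can be sketched as a single resolution function. All names here (`Bundle`, `BundleSources`, `resolveBundle`) are hypothetical, and `digest` stands in for whatever SHA helper the BFF actually uses.

```typescript
interface Bundle {
  sha256: string; // digest recorded at publish time
  body: string;
}

// Hypothetical ports; the real BFF adapters are not described in this document.
interface BundleSources {
  readCache(): Bundle | null;
  purgeCache(): void;
  readOrigin(): Bundle;
  readLastKnownGood(): Bundle | null;
  digest(body: string): string;
}

type Resolved = { bundle: Bundle; servedFrom: "cache" | "origin" | "last-known-good" };

function resolveBundle(src: BundleSources): Resolved {
  const intact = (b: Bundle): boolean => src.digest(b.body) === b.sha256;

  const cached = src.readCache();
  if (cached !== null) {
    if (intact(cached)) return { bundle: cached, servedFrom: "cache" };
    src.purgeCache(); // mismatch: drop the suspect copy before refetching
  }

  const fresh = src.readOrigin();
  if (intact(fresh)) return { bundle: fresh, servedFrom: "origin" };

  const lastKnownGood = src.readLastKnownGood();
  if (lastKnownGood !== null && intact(lastKnownGood)) {
    return { bundle: lastKnownGood, servedFrom: "last-known-good" };
  }

  // Refuse to serve an unverified bundle.
  throw new Error("theme.bundle.integrity.violation: no verified bundle available");
}
```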

2.13 Preview-link leak (token guessing)

  • Detection: sustained 401/404 spikes per tokenHash or per IP.
  • Blast radius: none if rate-limited correctly; otherwise potential exposure of a specific draft.
  • Automatic mitigation: per-tokenHash rate limit 60 rpm + per-tenant 600 rpm; auto IP block after 100 failures in 1 min via Cloud Armor rule.
  • Manual mitigation: revoke all preview tokens for the tenant; runbook runbooks/sev3-preview-leak.md.
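
The two limits above (60 rpm per tokenHash, 600 rpm per tenant) can be illustrated with a fixed-window counter. This is a sketch only: the document does not say which algorithm the service uses, and `FixedWindowLimiter` and `allowPreviewRequest` are hypothetical names.

```typescript
class FixedWindowLimiter {
  private readonly counts = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and counts the request if the key is still under its limit.
  tryAcquire(key: string, nowMs: number): boolean {
    const entry = this.counts.get(key);
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: nowMs, count: 1 }); // start a new window
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

const perTokenHash = new FixedWindowLimiter(60, 60_000);
const perTenant = new FixedWindowLimiter(600, 60_000);

function allowPreviewRequest(tokenHash: string, tenantId: string, nowMs: number): boolean {
  // Token limit first: a single abusive token is rejected without
  // consuming the tenant-wide budget.
  return perTokenHash.tryAcquire(tokenHash, nowMs) && perTenant.tryAcquire(tenantId, nowMs);
}
```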

2.14 Broken asset URL after file-storage-service deletion

  • Detection: a consumed melmastoon.media.deleted.v1 event names an asset referenced by an active version; a daily scanner also detects 404s from origin.
  • Blast radius: booking flow may render a missing image; never breaks page load (alt text + skeleton).
  • Automatic mitigation: theme.broken_asset_detected.v1 emitted; backoffice surfaces banner "Broken asset detected" with the offending block.
  • Manual mitigation: author replaces the asset and republishes (rollback also restores).

2.15 AI orchestrator unavailable / unsafe output

  • Detection: orchestrator returns 5xx, safety-blocked, or schema-invalid.
  • Blast radius: AI surfaces unavailable; authoring continues without AI.
  • Automatic mitigation: API returns 503 MELMASTOON.AI.UNAVAILABLE with Retry-After; UI gracefully degrades to manual edit.
  • Manual mitigation: runbook runbooks/ai-degradation.md.

2.16 AI prompt injection through tenant input

  • Detection: orchestrator output deviates from schema (anomaly), or output contains banned phrases.
  • Blast radius: drafted content potentially adversarial — but never auto-applied.
  • Automatic mitigation: orchestrator-side instruction-hierarchy; output schema enforced; HITL gate.
  • Manual mitigation: review safety logs; adjust input sanitisation if a new injection pattern emerges; runbook runbooks/sev3-ai-prompt-injection.md.

2.17 Concurrent draft edits with stale UI

  • Detection: OCC 412 returned to client.
  • Blast radius: the editing user only.
  • Automatic mitigation: UI prompts to reload; no data loss because no merge occurred.
  • Manual mitigation: none.

2.18 Removal of a locale that is still referenced

  • Detection: RemoveLocaleUseCase validation finds copy in the locale referenced by an active CTA.
  • Blast radius: the requesting actor.
  • Automatic mitigation: rejected with 409 LOCALE_IN_USE listing the references.
  • Manual mitigation: author migrates content to fallback locale, then retries.
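
The rejection can be pictured as a pure validation step that collects the blocking references. The data shapes below (`CtaRef`, `RemoveLocaleResult`) are assumptions for illustration; only the 409 status and the LOCALE_IN_USE code come from the text above.

```typescript
// Hypothetical view of a CTA and the locale its copy depends on.
interface CtaRef {
  ctaId: string;
  locale: string;
  active: boolean;
}

type RemoveLocaleResult =
  | { ok: true }
  | { ok: false; status: 409; code: "LOCALE_IN_USE"; references: string[] };

function validateLocaleRemoval(locale: string, ctas: readonly CtaRef[]): RemoveLocaleResult {
  // Only active CTAs block removal; inactive ones are ignored.
  const references = ctas
    .filter((cta) => cta.active && cta.locale === locale)
    .map((cta) => cta.ctaId);
  if (references.length === 0) {
    return { ok: true };
  }
  return { ok: false, status: 409, code: "LOCALE_IN_USE", references };
}
```

Returning the list of references lets the UI point the author at exactly the content that has to be migrated before the retry.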

2.19 Layout preset deactivated by Frontend Platform

  • Detection: publish validation finds layoutSelections.<surface>.presetKey no longer is_active.
  • Blast radius: the publishing tenant.
  • Automatic mitigation: publish blocked with MELMASTOON.THEME.LAYOUT_PRESET_INACTIVE; UI prompts a substitution.
  • Manual mitigation: author selects a different preset.

2.20 Cloud Run instance scale ceiling

  • Detection: request 503s at the LB; instance_count_at_max = true.
  • Blast radius: authoring path; CDN-cached reads unaffected.
  • Automatic mitigation: Cloud Run scales out to the configured maximum; requests beyond that capacity receive 503s.
  • Manual mitigation: raise max instance count via runbook runbooks/scale-ceiling.md; review whether a publish storm is the cause.

2.21 Tenant onboarding theme provisioning failure

  • Detection: ProvisionDefaultThemeUseCase raises and the consumed event lands in DLQ.
  • Blast radius: the new tenant has no theme; tenant-onboarding saga marks brand step degraded.
  • Automatic mitigation: retried by Pub/Sub up to 5 times.
  • Manual mitigation: tenant admin can complete branding manually via wizard; runbook runbooks/onboarding-degraded.md.

2.22 Pub/Sub publisher quota exceeded

  • Detection: publish errors with RESOURCE_EXHAUSTED.
  • Blast radius: outbox backlog; eventual consistency lag; no data loss.
  • Automatic mitigation: outbox retries with backoff; alert paged.
  • Manual mitigation: request quota increase.

2.23 Bundle larger than 40 KB budget

  • Detection: publish validation logs a warning; an alert fires if the breach is persistent (more than 5 publishes per day exceed the budget).
  • Blast radius: worse first-paint on the booking flow.
  • Automatic mitigation: warning only — publish proceeds (the budget is a soft target, not a hard constraint).
  • Manual mitigation: author trims content blocks or compresses copy; bigger fix is to lazy-load some content blocks (BFF responsibility).
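
The difference between a soft target and a hard constraint is easiest to see in code: the check always lets the publish proceed and only attaches a warning. The sketch below assumes "40 KB" means 40 × 1024 bytes; the function names are hypothetical.

```typescript
const BUNDLE_BUDGET_BYTES = 40 * 1024; // assumption: "40 KB" read as 40 KiB

interface BudgetCheck {
  proceed: true; // soft target: a breach never blocks the publish
  warning: string | null;
}

function checkBundleBudget(sizeBytes: number): BudgetCheck {
  if (sizeBytes <= BUNDLE_BUDGET_BYTES) {
    return { proceed: true, warning: null };
  }
  const overBy = sizeBytes - BUNDLE_BUDGET_BYTES;
  return { proceed: true, warning: `bundle exceeds the 40 KB budget by ${overBy} bytes` };
}

// Alert only when breaches are persistent: more than 5 publishes in a day.
function shouldAlertOnBreaches(breachesToday: number): boolean {
  return breachesToday > 5;
}
```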

2.24 GCS bucket misconfiguration (e.g. public-write)

  • Detection: SCC finding; configuration scanner.
  • Blast radius: could enable bundle tampering (sev1).
  • Automatic mitigation: Terraform drift detection runs nightly; reverts policy.
  • Manual mitigation: runbook runbooks/sev1-gcs-misconfig.md; SHA verification protects readers in the meantime.

3. Cross-cutting principles

  1. Fail closed on writes, fail open on reads. Authoring failures must never produce inconsistent state; read failures should degrade gracefully (cached → origin → last-known-good) before erroring.
  2. No silent failures. Every failure path emits a structured log + a metric increment + (where applicable) an event. Operators can answer "what just happened?" from observability alone.
  3. Idempotency everywhere. Retries must be safe; we use Idempotency-Key, OCC, outbox/inbox, and pure-function bundle build.
  4. Tenant blast radius isolation. A failure for tenant A must not affect tenants B/C/D. We monitor theme_usecase_errors_total per tenantTier to detect noisy-neighbour patterns early.
  5. Preserve last-known-good. Rollback is always available; the BFF caches the last-known-good bundle for 24 h to survive an integrity violation without an immediate code deploy.

4. Runbook index

services/theme-config-service/runbooks/:

  • db-failover.md
  • publish-conflict.md
  • cdn-invalidation.md
  • memorystore-degradation.md
  • outbox-backlog.md
  • inbox-dlq.md
  • sev1-cross-tenant.md
  • sev1-stored-xss.md
  • sev2-bundle-integrity.md
  • sev3-preview-leak.md
  • ai-degradation.md
  • sev3-ai-prompt-injection.md
  • scale-ceiling.md
  • onboarding-degraded.md
  • sev1-gcs-misconfig.md

Each runbook follows the platform format: detection signals → first 5 minutes → diagnosis → mitigation → prevention.


5. References