FAILURE_MODES — pricing-service

Sibling: APPLICATION_LOGIC · DATA_MODEL · OBSERVABILITY · SERVICE_RISK_REGISTER

The pricing-service sits on the booking critical path. Every failure mode below has a defined detection signal, blast radius, mitigation, and recovery action. Where applicable, an on-call runbook lives in services/pricing-service/runbooks/<id>.md.


1. Failure modes (registry)

F-01 — FX provider outage

| Property | Value |
| --- | --- |
| Trigger | Upstream FX provider returns 5xx or times out |
| Detection | pricing_fx_refresh_failures_total increments; alert fx_provider_down |
| Blast radius | Multi-currency display drifts as the cached snapshot ages; quotes that do not request a display currency are unaffected |
| Mitigation | RefreshFxSnapshotUseCase does NOT throw on failure; cache continues to be served; quotes carry fxSnapshot.stale=true after staleAfter |
| Hard limit | At hardExpireAt (72 h), /quotes requests for cross-currency display return MELMASTOON.PRICING.FX_SNAPSHOT_STALE (409); customers can still book in the plan currency |
| Recovery | Provider returns → next cron tick refreshes → events fire → consumer caches invalidate |
| Runbook | runbooks/F-01-fx-provider-outage.md |
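
The soft and hard limits for F-01 can be sketched as a single decision function. This is a minimal illustration, not the service's actual code: the `FxSnapshot` shape, the `decideCrossCurrencyDisplay` name, and the exact boundary handling (strictly after `staleAfter`, at or after `hardExpireAt`) are assumptions; only `staleAfter`, `hardExpireAt`, the 409 status, and the error code come from the registry.

```typescript
// Hypothetical snapshot shape; only staleAfter and hardExpireAt are named in the registry.
interface FxSnapshot {
  fetchedAt: Date;
  staleAfter: Date;   // soft limit: still served, but flagged stale
  hardExpireAt: Date; // hard limit: cross-currency display is refused
}

type FxDecision =
  | { kind: "serve"; stale: boolean }
  | { kind: "reject"; httpStatus: 409; errorCode: "MELMASTOON.PRICING.FX_SNAPSHOT_STALE" };

// Decide whether a quote that requests cross-currency display may use the cached snapshot.
function decideCrossCurrencyDisplay(snapshot: FxSnapshot, now: Date): FxDecision {
  const t = now.getTime();
  if (t >= snapshot.hardExpireAt.getTime()) {
    // Fail closed: the snapshot is too old to show converted prices.
    return { kind: "reject", httpStatus: 409, errorCode: "MELMASTOON.PRICING.FX_SNAPSHOT_STALE" };
  }
  // Fail open on freshness: serve the cached rates and flag them if past the soft limit.
  return { kind: "serve", stale: t > snapshot.staleAfter.getTime() };
}
```

Quotes in the plan currency would bypass this check entirely, which is why booking remains possible after the hard limit.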

F-02 — Cloud SQL primary failover

| Property | Value |
| --- | --- |
| Trigger | GCP-initiated or manual failover of primary instance |
| Detection | Connection errors spike; readiness probe fails; alert quote_5xx_spike (P1) |
| Blast radius | Quote API returns 5xx for 30–90 s |
| Mitigation | Connection pooling (PgBouncer-style via pg-pool) configured with connect_timeout=2s, retries with jittered backoff; outbox publisher pauses |
| Recovery | Automatic; Cloud Run revisions reattach within 90 s; outbox publisher resumes catch-up |
| Runbook | runbooks/F-02-cloud-sql-failover.md |
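
The registry does not say which jitter strategy the retries use. As one common option, the sketch below computes "full jitter" delays: a uniformly random wait between zero and an exponentially growing, capped ceiling. The function name, the 100 ms base, and the 2 000 ms cap are illustrative assumptions, not values taken from the service.

```typescript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// baseMs, capMs and the function name are hypothetical; the source only says
// "retries with jittered backoff".
function backoffDelayMs(
  attempt: number,                      // 0 for the first retry, 1 for the second, ...
  baseMs: number = 100,
  capMs: number = 2000,
  random: () => number = Math.random,   // injectable so the delay is testable
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Randomising the whole interval, rather than adding a small jitter to a fixed delay, spreads reconnect attempts from many instances so they do not hit the new primary at the same moment.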

F-03 — Pub/Sub backlog (publisher side)

| Property | Value |
| --- | --- |
| Trigger | Pub/Sub publish latency spikes or topic ACL misconfiguration |
| Detection | pricing_outbox_unpublished rises; event_publish_lag_p95 SLO burns |
| Blast radius | Downstream services (reservation, search-indexer, analytics) lag |
| Mitigation | Outbox table absorbs the backlog; publisher retries with jitter; if the backlog is sustained for more than 10 min, switch the publisher to a secondary topic with mirrored subscriptions |
| Recovery | When publish recovers, replay is automatic; ordering preserved per ordering key |
| Runbook | runbooks/F-03-pubsub-publish-backlog.md |

F-04 — Pub/Sub consumer poison message

| Property | Value |
| --- | --- |
| Trigger | Inbound event payload that fails schema validation or handler logic |
| Detection | DLQ depth alert; handler error counter; structured log error |
| Blast radius | Affected ordering key's downstream effect delayed (e.g. tax-rule update not picked up) |
| Mitigation | After 5 redeliveries, message routes to <topic>.dlq; handler emits melmastoon.audit.event_dlq.v1 |
| Recovery | On-call inspects DLQ, hot-fixes handler or republishes after schema repair via the platform DLQ replay tool |
| Runbook | runbooks/F-04-pubsub-dlq.md |

F-05 — Promo over-redemption race

| Property | Value |
| --- | --- |
| Trigger | Concurrent quote calculations attempting to redeem a near-cap promo |
| Detection | pricing_promo_overcap_total increments; redemptions reject with MELMASTOON.PRICING.PROMO_OVEROBLIGATION |
| Blast radius | Up to N concurrent guests see "promo not available"; never an over-redemption |
| Mitigation | Atomic SQL (see DATA_MODEL §6); per-quote unique constraint on (promotion_id, quote_id) |
| Recovery | None required; storm alert fires only if > 50/min sustained, prompting Revenue Ops to consider raising the cap |
| Runbook | runbooks/F-05-promo-storm.md |
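
The actual statement lives in DATA_MODEL §6 and is not reproduced here. The in-memory sketch below only models the two invariants the mitigation describes: the check and the increment happen as one step, and a given (promotion_id, quote_id) pair counts at most once. The `PromoLedger` class, its fields, and the choice to treat a repeated quote as already redeemed are assumptions for illustration.

```typescript
type RedeemResult =
  | { kind: "redeemed" }
  | { kind: "rejected"; errorCode: "MELMASTOON.PRICING.PROMO_OVEROBLIGATION" };

// Hypothetical in-memory stand-in for the promo counter and its unique constraint.
// In SQL the same effect is typically a single conditional UPDATE, e.g.
//   UPDATE ... SET redeemed = redeemed + 1 WHERE id = $1 AND redeemed < cap
// (table and column names here are illustrative, not the real schema).
class PromoLedger {
  private readonly promotionId: string;
  private readonly cap: number;
  private redeemed = 0;
  private readonly redeemedKeys = new Set<string>();

  constructor(promotionId: string, cap: number) {
    this.promotionId = promotionId;
    this.cap = cap;
  }

  redeem(quoteId: string): RedeemResult {
    const key = this.promotionId + ":" + quoteId;
    // Models the unique constraint: the same quote never consumes two redemptions.
    if (this.redeemedKeys.has(key)) return { kind: "redeemed" };
    // Check and increment form one step, so the cap can never be exceeded.
    if (this.redeemed >= this.cap) {
      return { kind: "rejected", errorCode: "MELMASTOON.PRICING.PROMO_OVEROBLIGATION" };
    }
    this.redeemed += 1;
    this.redeemedKeys.add(key);
    return { kind: "redeemed" };
  }

  redeemedCount(): number {
    return this.redeemed;
  }
}
```

With a cap of 2, a third distinct quote is rejected while a retry of the first quote succeeds without consuming another redemption.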

F-06 — Sharia guard rejection in production

| Property | Value |
| --- | --- |
| Trigger | A fee with shariaTag='riba_forbidden' reaches a sharia-compliant plan |
| Detection | pricing_sharia_guard_failures_total increments; quote returns MELMASTOON.PRICING.SHARIA_GUARD_FAILED |
| Blast radius | Affected guests cannot quote on the offending plan |
| Mitigation | Domain guard prevents persisted bad state; rejected at calc time |
| Recovery | Revenue Ops fixes the offending fee rule (mark disclosed or unlink from the sharia plan) via admin API; quotes flow again |
| Runbook | runbooks/F-06-sharia-guard.md |

F-07 — Negative or zero grand total (math bug)

| Property | Value |
| --- | --- |
| Trigger | A bug in derivation produces grandTotalMicro <= 0 (would normally only be possible via stacked discount > 100% bypassing the floor) |
| Detection | MELMASTOON.PRICING.NEGATIVE_TOTAL error; alert at any occurrence (P1) |
| Blast radius | Quote refused; no booking proceeds at the bad price |
| Mitigation | Defensive guard at pinQuote step refuses to persist; emits audit event; rolls UoW back |
| Recovery | Hot-fix release; integration test added to property suite |
| Runbook | runbooks/F-07-negative-total.md |

F-08 — Rate-plan archive with active future bookings

| Property | Value |
| --- | --- |
| Trigger | Operator archives a plan that backs N future reservations |
| Detection | archived.v1 event carries futureBookingsAtArchive |
| Blast radius | New quotes against the plan rejected (RATE_PLAN_INACTIVE); existing locked quotes remain honoured until check-out |
| Mitigation | Archive use case requires step-up authentication; UI surfaces the futureBookingsAtArchive count and demands explicit confirmation |
| Recovery | If archived in error, restore via admin API (un-archive → re-publish); no data loss |
| Runbook | runbooks/F-08-rate-plan-archive.md |

F-09 — Concurrent rate-rule edits (OCC mismatch)

| Property | Value |
| --- | --- |
| Trigger | Two operators submit conflicting edits on the same rule |
| Detection | MELMASTOON.PRICING.STALE_VERSION returned to second writer |
| Blast radius | One operator's edit rejected with a clear message; first edit wins |
| Mitigation | OCC via If-Match: <version> header on PATCH endpoints |
| Recovery | Loser refreshes UI, re-applies edit |
| Runbook | n/a (handled by client UX) |
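
The version check behind F-09 can be sketched as a pure function. The `RateRule` shape, the `applyRulePatch` name, and the integer version counter are assumptions; the source only specifies the If-Match header and the STALE_VERSION error. In a database-backed implementation the comparison and the write would normally be one conditional update rather than two steps.

```typescript
// Hypothetical rule shape; only the existence of a version is implied by the source.
interface RateRule {
  id: string;
  version: number;
  amountMicro: number;
}

type PatchResult =
  | { kind: "updated"; rule: RateRule }
  | { kind: "stale"; errorCode: "MELMASTOON.PRICING.STALE_VERSION" };

// Apply a PATCH only if the caller's If-Match version equals the stored version.
function applyRulePatch(
  current: RateRule,
  ifMatchVersion: number,
  patch: { amountMicro?: number },
): PatchResult {
  if (ifMatchVersion !== current.version) {
    // Someone else wrote first; the caller must refresh and re-apply.
    return { kind: "stale", errorCode: "MELMASTOON.PRICING.STALE_VERSION" };
  }
  const updated: RateRule = {
    id: current.id,
    version: current.version + 1, // every accepted write bumps the version
    amountMicro: patch.amountMicro !== undefined ? patch.amountMicro : current.amountMicro,
  };
  return { kind: "updated", rule: updated };
}
```

Two operators who both loaded version 3 illustrate the outcome: the first PATCH succeeds and produces version 4, and the second, still sending version 3, receives STALE_VERSION.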

F-10 — Tax rule mid-stay change

| Property | Value |
| --- | --- |
| Trigger | Government changes tax rate during a stay window |
| Detection | melmastoon.pricing.tax_rule.updated.v1 event |
| Blast radius | Quotes generated before the new rule's validFrom keep the OLD rate (we apply rate at booking time per legal advice); new quotes use the NEW rate |
| Mitigation | Tax composition pins the snapshot's rate_value in the quote derivation; rule updates do NOT retroactively change locked quotes |
| Recovery | None required; auditors can recompute using derivation.steps[step="ComposeTaxes"] |
| Runbook | runbooks/F-10-tax-mid-stay.md |

F-11 — Memorystore Redis cache outage

| Property | Value |
| --- | --- |
| Trigger | Memorystore unavailable |
| Detection | Cache wrapper logs error; quote_latency_p99 rises (DB-only path) |
| Blast radius | Latency degradation 2–4×; no incorrect data |
| Mitigation | Cache wrapper degrades to direct DB read on error; circuit breaker prevents thrash |
| Recovery | Memorystore returns; cache repopulates on demand |
| Runbook | runbooks/F-11-redis-outage.md |
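
The degrade-to-DB behaviour with a circuit breaker might look like the sketch below. It is deliberately synchronous and in-memory so it stays self-contained; real Redis and database clients are asynchronous. The class and function names, the failure threshold, and the cooldown are all hypothetical.

```typescript
// Minimal circuit breaker: opens after N consecutive failures, then lets
// requests through again once the cooldown has elapsed (half-open).
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: "closed" | "open" | "half_open" = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // cooldown over: probe the cache again
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// Read through the cache when the breaker allows it; fall back to the DB on
// a miss, on a cache error, or while the breaker is open.
function readThrough<T>(breaker: CircuitBreaker, readCache: () => T | undefined, readDb: () => T): T {
  if (breaker.allowRequest()) {
    try {
      const hit = readCache();
      breaker.recordSuccess();
      if (hit !== undefined) return hit;
    } catch (_err) {
      breaker.recordFailure();
    }
  }
  return readDb();
}
```

While the breaker is open the cache is not contacted at all, which is what keeps a Redis outage from adding a timeout to every quote on top of the DB read.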

F-12 — Inventory allocation failure invalidates quotes

| Property | Value |
| --- | --- |
| Trigger | melmastoon.inventory.allocation.failed.v1 arrives for a property/date |
| Detection | MarkQuotesStaleHandler runs; pricing_quotes_expired_total{reason="inventory_failed"} increments |
| Blast radius | Affected open quotes immediately marked expired; guests must re-quote |
| Mitigation | Handler is idempotent on messageId; only matching quotes are touched |
| Recovery | None required; expected behaviour |
| Runbook | n/a |
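
Idempotency on messageId can be illustrated with a small in-memory handler. This is a sketch, not the real MarkQuotesStaleHandler: the event and quote shapes are hypothetical, and a production handler would keep the set of processed message IDs in durable storage rather than in process memory.

```typescript
// Hypothetical event and quote shapes for illustration only.
interface AllocationFailedEvent {
  messageId: string;
  propertyId: string;
  date: string; // e.g. "2024-05-01"
}

interface QuoteRecord {
  quoteId: string;
  propertyId: string;
  date: string;
  status: "open" | "expired";
}

class MarkQuotesStaleSketch {
  private readonly processed = new Set<string>();

  // Returns how many quotes this delivery expired; a redelivery returns 0.
  handle(event: AllocationFailedEvent, quotes: QuoteRecord[]): number {
    if (this.processed.has(event.messageId)) return 0; // duplicate delivery: no-op
    let expired = 0;
    for (const quote of quotes) {
      const matches = quote.propertyId === event.propertyId && quote.date === event.date;
      if (matches && quote.status === "open") {
        quote.status = "expired"; // only matching open quotes are touched
        expired += 1;
      }
    }
    this.processed.add(event.messageId);
    return expired;
  }
}
```

Because Pub/Sub delivers at least once by default, the same event can arrive more than once; deduplicating on messageId keeps counters such as pricing_quotes_expired_total from being inflated by redeliveries.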

F-13 — AI orchestrator unavailable

| Property | Value |
| --- | --- |
| Trigger | ai-orchestrator-service returns 5xx or times out |
| Detection | dynamic_suggestion_latency_breach (P3) |
| Blast radius | Operators cannot generate AI suggestions; nightly batch defers; live pricing UNAFFECTED |
| Mitigation | GenerateDynamicPricingSuggestionUseCase returns MELMASTOON.AI.UNAVAILABLE; nothing persisted |
| Recovery | Orchestrator returns; manual retry from UI; nightly batch reruns next day |
| Runbook | runbooks/F-13-ai-orchestrator-down.md |

F-14 — Cross-tenant query regression (RLS bypass)

| Property | Value |
| --- | --- |
| Trigger | A code change inadvertently bypasses the SET LOCAL app.tenant_id step |
| Detection | Continuous integration test asserts cross-tenant queries return zero rows; runtime MELMASTOON.SECURITY.TENANT_VIOLATION log alert (P1) |
| Blast radius | Potentially severe; pricing data is sensitive |
| Mitigation | Multi-layer defence: RLS + ALS + cache key prefix + audit; CI test must pass |
| Recovery | Immediate rollback; postmortem; affected tenants notified per the platform's data incident policy |
| Runbook | runbooks/F-14-cross-tenant.md |

F-15 — Outbox table runaway growth

| Property | Value |
| --- | --- |
| Trigger | Publisher unable to drain (sustained F-03) |
| Detection | pricing_outbox_unpublished > 10 000 |
| Blast radius | DB storage pressure; eventually inserts slow |
| Mitigation | Publisher leader-election; dual-write to a secondary Pub/Sub topic; manual replay tool |
| Recovery | Drain via secondary or upgrade publisher concurrency; partition outbox if recurrent |
| Runbook | runbooks/F-15-outbox-runaway.md |

F-16 — Desktop offline quote rejected on push

| Property | Value |
| --- | --- |
| Trigger | Desktop pushed an offline-derived quote that the server recomputes differently (e.g. server-side promotion ran out, plan archived) |
| Detection | Desktop receives 409 from /internal/v1/sync/price-quotes:push; server logs MELMASTOON.PRICING.PROMO_OVEROBLIGATION or RATE_PLAN_INACTIVE |
| Blast radius | Front-desk operator must re-quote; guest may pay a different price |
| Mitigation | Server is authoritative; desktop UI surfaces the rejection inline; reservation fallback path is documented |
| Recovery | Re-quote and re-print folio confirmation |
| Runbook | runbooks/F-16-desktop-quote-rejected.md |

2. Failure-mode interaction matrix

| If | And | Then |
| --- | --- | --- |
| F-02 (DB failover) | F-03 (Pub/Sub backlog) | Outbox grows; publisher resumes after both clear; no data loss |
| F-01 (FX outage) | F-13 (AI down) | Live pricing OK; advisory features disabled; revenue manager UI flags both |
| F-11 (Redis down) | Quote burst | p99 still ≤ 600 ms (degraded); pages on quote_latency_slo_burn_fast |
| F-14 (RLS regression) | Any | Full incident response; deploy freeze |

3. Resilience principles

  1. Fail closed on ambiguity. A quote is never produced if any input is uncertain (FX hard-expire, sharia mismatch, derivation error).
  2. Fail open on freshness when safe. Cached FX snapshots remain usable past the soft staleAfter limit, with quotes clearly flagged fxSnapshot.stale=true, until hardExpireAt.
  3. Idempotency everywhere. Every public mutation accepts an Idempotency-Key; every event handler dedupes on messageId.
  4. Authoritative server. Desktop quotes are validated and may be rejected on push; server price always wins.
  5. Defence in depth on tenancy. Multiple independent layers must all fail to leak cross-tenant data.