FAILURE_MODES — pricing-service
Sibling: APPLICATION_LOGIC · DATA_MODEL · OBSERVABILITY · SERVICE_RISK_REGISTER
The pricing-service sits on the booking critical path. Every failure mode below has a defined detection signal, blast radius, mitigation, and recovery action. Where applicable, an on-call runbook lives in services/pricing-service/runbooks/<id>.md.
1. Failure modes (registry)
F-01 — FX provider outage
| Property | Value |
|---|---|
| Trigger | Upstream FX provider returns 5xx or times out |
| Detection | pricing_fx_refresh_failures_total increments; alert fx_provider_down |
| Blast radius | Multi-currency display drifts as cached snapshot ages; non-display-currency quotes unaffected |
| Mitigation | RefreshFxSnapshotUseCase does NOT throw on failure; cache continues to be served; quotes carry fxSnapshot.stale=true after staleAfter |
| Hard limit | At hardExpireAt (72 h), /quotes requests for cross-currency display return MELMASTOON.PRICING.FX_SNAPSHOT_STALE (409); customers can still book in the plan currency |
| Recovery | Provider returns → next cron tick refreshes → events fire → consumer caches invalidate |
| Runbook | runbooks/F-01-fx-provider-outage.md |
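
The soft and hard limits in F-01 can be read as a three-state classification of the cached snapshot. The sketch below is illustrative only: the `FxSnapshot` shape, the function names, and the exact boundary comparisons are assumptions; only `staleAfter`, `hardExpireAt`, the stale flag, and the 409 error code come from the table above.

```typescript
// Hypothetical snapshot shape; field names other than staleAfter and
// hardExpireAt are assumptions made for this sketch.
interface FxSnapshot {
  fetchedAt: Date;
  staleAfter: Date;   // soft limit: keep serving, flag as stale
  hardExpireAt: Date; // hard limit: refuse cross-currency display
}

type FxFreshness = "fresh" | "stale" | "expired";

type FxDecision =
  | { ok: true; stale: boolean }
  | { ok: false; status: 409; code: "MELMASTOON.PRICING.FX_SNAPSHOT_STALE" };

// Classify the snapshot relative to `now`.
// Assumption: "stale" starts strictly after staleAfter, "expired" at hardExpireAt.
function classifyFxSnapshot(snapshot: FxSnapshot, now: Date): FxFreshness {
  if (now.getTime() >= snapshot.hardExpireAt.getTime()) return "expired";
  if (now.getTime() > snapshot.staleAfter.getTime()) return "stale";
  return "fresh";
}

// Decide how a quote request is answered. Quotes that need no cross-currency
// display are unaffected by the snapshot's age.
function resolveFxForQuote(
  snapshot: FxSnapshot,
  now: Date,
  needsCrossCurrencyDisplay: boolean,
): FxDecision {
  if (!needsCrossCurrencyDisplay) return { ok: true, stale: false };
  const freshness = classifyFxSnapshot(snapshot, now);
  if (freshness === "expired") {
    return { ok: false, status: 409, code: "MELMASTOON.PRICING.FX_SNAPSHOT_STALE" };
  }
  return { ok: true, stale: freshness === "stale" };
}
```

Passing `now` explicitly keeps the classification deterministic, so both boundaries can be tested without a running clock.
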
F-02 — Cloud SQL primary failover
| Property | Value |
|---|---|
| Trigger | GCP-initiated or manual failover of primary instance |
| Detection | Connection errors spike; readiness probe fails; alert quote_5xx_spike (P1) |
| Blast radius | Quote API returns 5xx for 30–90 s |
| Mitigation | Connection pooling (PgBouncer-style via pg-pool) configured with connect_timeout=2s, retries with jittered backoff; outbox publisher pauses |
| Recovery | Automatic; Cloud Run revisions reattach within 90 s; outbox publisher resumes catch-up |
| Runbook | runbooks/F-02-cloud-sql-failover.md |
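
The "retries with jittered backoff" mitigation in F-02 could look roughly like the following. This is a minimal sketch of the common full-jitter scheme, not the service's actual retry code: `backoffDelayMs`, `withRetry`, and the base, cap, and attempt values are all hypothetical.

```typescript
// Full-jitter backoff: pick a random delay in [0, min(cap, base * 2^attempt)).
// baseMs and capMs are illustrative defaults, not values from the service config.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 2000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retry an async operation (e.g. acquiring a pooled connection) up to
// maxAttempts times, sleeping a jittered delay between failures.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Randomising the whole delay, rather than adding a small jitter to a fixed schedule, spreads reconnect attempts from many Cloud Run instances so they do not hit the new primary in lockstep.
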
F-03 — Pub/Sub backlog (publisher side)
| Property | Value |
|---|---|
| Trigger | Pub/Sub publish latency spikes or topic ACL misconfig |
| Detection | pricing_outbox_unpublished rises; event_publish_lag_p95 SLO burns |
| Blast radius | Downstream services (reservation, search-indexer, analytics) lag |
| Mitigation | Outbox table absorbs the backlog; publisher retries with jitter; on > 10 min sustained, switch publisher to a secondary topic with mirrored subscriptions |
| Recovery | When publish recovers, replay is automatic; ordering preserved per ordering key |
| Runbook | runbooks/F-03-pubsub-publish-backlog.md |
F-04 — Pub/Sub consumer poison message
| Property | Value |
|---|---|
| Trigger | Inbound event payload that fails schema validation or handler logic |
| Detection | DLQ depth alert; handler error counter; structured log error |
| Blast radius | Affected ordering key's downstream effect delayed (e.g. tax-rule update not picked up) |
| Mitigation | After 5 redeliveries, message routes to <topic>.dlq; handler emits melmastoon.audit.event_dlq.v1 |
| Recovery | On-call inspects DLQ, hot-fixes handler or republishes after schema repair via the platform DLQ replay tool |
| Runbook | runbooks/F-04-pubsub-dlq.md |
F-05 — Promo redemption storm
| Property | Value |
|---|---|
| Trigger | Concurrent quote calculations attempting to redeem a near-cap promo |
| Detection | pricing_promo_overcap_total increments; redemptions reject with MELMASTOON.PRICING.PROMO_OVEROBLIGATION |
| Blast radius | Up to N concurrent guests see "promo not available"; never an over-redemption |
| Mitigation | Atomic SQL (see DATA_MODEL §6); per-quote unique constraint on (promotion_id, quote_id) |
| Recovery | None required; storm alert fires only if > 50/min sustained, prompting Revenue Ops to consider raising the cap |
| Runbook | runbooks/F-05-promo-storm.md |
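
The two guarantees in the Mitigation row (never exceed the cap, never count the same quote twice) can be modelled in memory as follows. This is an assumption-laden sketch: the real mechanism is the atomic SQL in DATA_MODEL §6, which is not reproduced here, and `PromoLedger`, `tryRedeem`, and the result names are hypothetical.

```typescript
type RedeemResult = "redeemed" | "already_redeemed" | "overcap";

// In-memory stand-in for the promotion row plus the unique
// (promotion_id, quote_id) constraint. In the database the cap check and
// the increment would have to happen in one atomic statement.
class PromoLedger {
  private redeemedCount = 0;
  private readonly seen = new Set<string>();

  constructor(
    private readonly promotionId: string,
    private readonly cap: number,
  ) {}

  tryRedeem(quoteId: string): RedeemResult {
    const key = `${this.promotionId}:${quoteId}`;
    // Mirrors the unique constraint: one redemption per quote.
    if (this.seen.has(key)) return "already_redeemed";
    // Mirrors the cap check; callers would map this to
    // MELMASTOON.PRICING.PROMO_OVEROBLIGATION.
    if (this.redeemedCount >= this.cap) return "overcap";
    this.redeemedCount += 1;
    this.seen.add(key);
    return "redeemed";
  }

  get redeemed(): number {
    return this.redeemedCount;
  }
}
```

With a cap of 2, three distinct quotes yield two redemptions and one rejection; a retry of an already-redeemed quote does not consume a further slot.
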
F-06 — Sharia guard rejection in production
| Property | Value |
|---|---|
| Trigger | A fee with shariaTag='riba_forbidden' reaches a sharia-compliant plan |
| Detection | pricing_sharia_guard_failures_total increments; quote returns MELMASTOON.PRICING.SHARIA_GUARD_FAILED |
| Blast radius | Affected guests cannot quote on the offending plan |
| Mitigation | Domain guard prevents persisted bad state; rejected at calc time |
| Recovery | Revenue Ops fixes the offending fee rule (mark disclosed or unlink from the sharia plan) via admin API; quotes flow again |
| Runbook | runbooks/F-06-sharia-guard.md |
F-07 — Negative or zero grand total (math bug)
| Property | Value |
|---|---|
| Trigger | A bug in derivation produces grandTotalMicro <= 0 (would normally only be possible via stacked discount > 100% bypassing the floor) |
| Detection | MELMASTOON.PRICING.NEGATIVE_TOTAL error; alert at any occurrence (P1) |
| Blast radius | Quote refused; no booking proceeds at the bad price |
| Mitigation | Defensive guard at pinQuote step refuses to persist; emits audit event; rolls UoW back |
| Recovery | Hot-fix release; integration test added to property suite |
| Runbook | runbooks/F-07-negative-total.md |
F-08 — Rate-plan archive with active future bookings
| Property | Value |
|---|---|
| Trigger | Operator archives a plan that backs N future reservations |
| Detection | archived.v1 event carries futureBookingsAtArchive |
| Blast radius | New quotes against the plan rejected (RATE_PLAN_INACTIVE); existing locked quotes remain honoured until check-out |
| Mitigation | Archive use case requires step-up authentication; UI surfaces the futureBookingsAtArchive count and demands explicit confirmation |
| Recovery | If archived in error, restore via admin API (un-archive → re-publish); no data loss |
| Runbook | runbooks/F-08-rate-plan-archive.md |
F-09 — Concurrent rate-rule edits (OCC mismatch)
| Property | Value |
|---|---|
| Trigger | Two operators submit conflicting edits on the same rule |
| Detection | MELMASTOON.PRICING.STALE_VERSION returned to second writer |
| Blast radius | One operator's edit rejected with a clear message; first edit wins |
| Mitigation | OCC via If-Match: <version> header on PATCH endpoints |
| Recovery | Loser refreshes UI, re-applies edit |
| Runbook | n/a — handled by client UX |
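
The version check behind `If-Match` can be sketched as a pure function. The `RateRule` shape, `applyPatch`, and the result type are hypothetical; only the STALE_VERSION code and the first-writer-wins behaviour come from the table above.

```typescript
// Hypothetical rule shape for illustration.
interface RateRule {
  id: string;
  version: number;
  amountMicro: number;
}

type PatchResult =
  | { ok: true; rule: RateRule }
  | { ok: false; code: "MELMASTOON.PRICING.STALE_VERSION"; currentVersion: number };

// Apply a PATCH only if the caller's If-Match version equals the stored one.
function applyPatch(
  stored: RateRule,
  ifMatchVersion: number,
  changes: { amountMicro?: number },
): PatchResult {
  if (ifMatchVersion !== stored.version) {
    return {
      ok: false,
      code: "MELMASTOON.PRICING.STALE_VERSION",
      currentVersion: stored.version,
    };
  }
  // A successful write bumps the version, which invalidates every other
  // copy read at the old version.
  return { ok: true, rule: { ...stored, ...changes, version: stored.version + 1 } };
}
```

Against a real database the comparison and the write would need to be a single atomic step (for example an update conditioned on the version column); otherwise two writers could both pass the check.
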
F-10 — Tax rule mid-stay change
| Property | Value |
|---|---|
| Trigger | Government changes tax rate during a stay window |
| Detection | melmastoon.pricing.tax_rule.updated.v1 event |
| Blast radius | Quotes generated before the new rule's validFrom keep the OLD rate (the rate is applied as of booking time, per legal advice); new quotes use the NEW rate |
| Mitigation | Tax composition pins the snapshot's rate_value in the quote derivation; rule updates do NOT retroactively change locked quotes |
| Recovery | None required; auditors can recompute using derivation.steps[step="ComposeTaxes"] |
| Runbook | runbooks/F-10-tax-mid-stay.md |
F-11 — Memorystore Redis cache outage
| Property | Value |
|---|---|
| Trigger | Memorystore unavailable |
| Detection | Cache wrapper logs error; quote_latency_p99 rises (DB-only path) |
| Blast radius | Latency degradation 2–4×; no incorrect data |
| Mitigation | Cache wrapper degrades to direct DB read on error; circuit breaker prevents thrash |
| Recovery | Memorystore returns; cache repopulates on demand |
| Runbook | runbooks/F-11-redis-outage.md |
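
The degrade-to-DB behaviour with a circuit breaker might be structured as below. This is a sketch under stated assumptions: `CircuitBreaker`, `Cache`, `readThrough`, the threshold, and the cooldown are all hypothetical, and write-back on a cache miss is left out for brevity.

```typescript
// Minimal consecutive-failure circuit breaker. While open, cache calls are
// skipped entirely; after the cooldown a probe request is let through.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  constructor(
    private readonly failureThreshold: number = 3,
    private readonly cooldownMs: number = 5000,
  ) {}

  allowRequest(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAtMs = nowMs; // (re)open and restart the cooldown
    }
  }
}

// Hypothetical cache client interface.
interface Cache {
  get(key: string): Promise<string | null>;
}

// Try the cache unless the breaker is open; on a miss, an error, or an open
// breaker, fall back to the direct DB read.
async function readThrough(
  key: string,
  cache: Cache,
  loadFromDb: (key: string) => Promise<string>,
  breaker: CircuitBreaker,
  now: () => number = Date.now,
): Promise<string> {
  const nowMs = now();
  if (breaker.allowRequest(nowMs)) {
    try {
      const hit = await cache.get(key);
      breaker.recordSuccess();
      if (hit !== null) return hit;
    } catch (err) {
      breaker.recordFailure(nowMs);
    }
  }
  return loadFromDb(key);
}
```

The breaker is what prevents thrash: once open, requests stop paying the cache timeout on every call and go straight to the DB until the cooldown elapses.
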
F-12 — Inventory allocation failure invalidates quotes
| Property | Value |
|---|---|
| Trigger | melmastoon.inventory.allocation.failed.v1 arrives for a property/date |
| Detection | MarkQuotesStaleHandler runs; pricing_quotes_expired_total{reason="inventory_failed"} increments |
| Blast radius | Affected open quotes immediately marked expired; guests must re-quote |
| Mitigation | Handler is idempotent on messageId; only matching quotes are touched |
| Recovery | None required; expected behaviour |
| Runbook | n/a |
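
Idempotency on messageId, combined with touching only matching quotes, can be illustrated with an in-memory model. The event and quote shapes and the `MarkQuotesStaleSketch` class are hypothetical; the real MarkQuotesStaleHandler presumably persists its dedupe state rather than holding it in process memory.

```typescript
// Hypothetical payload and quote shapes for illustration.
interface AllocationFailedEvent {
  messageId: string;
  propertyId: string;
  date: string; // ISO date, e.g. "2024-06-01"
}

interface OpenQuote {
  id: string;
  propertyId: string;
  date: string;
  status: "open" | "expired";
}

class MarkQuotesStaleSketch {
  private readonly processed = new Set<string>();

  constructor(private readonly quotes: OpenQuote[]) {}

  // Returns how many quotes this delivery expired; a redelivered message
  // is a no-op and returns 0.
  handle(event: AllocationFailedEvent): number {
    if (this.processed.has(event.messageId)) return 0;
    let expired = 0;
    for (const quote of this.quotes) {
      const matches =
        quote.status === "open" &&
        quote.propertyId === event.propertyId &&
        quote.date === event.date;
      if (matches) {
        quote.status = "expired";
        expired += 1;
      }
    }
    this.processed.add(event.messageId);
    return expired;
  }
}
```

Because the messageId is recorded only after the loop completes, a crash mid-handling leads to a retry rather than a silently skipped event; the status check makes that retry harmless.
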
F-13 — AI orchestrator unavailable
| Property | Value |
|---|---|
| Trigger | ai-orchestrator-service returns 5xx or times out |
| Detection | dynamic_suggestion_latency_breach (P3) |
| Blast radius | Operators cannot generate AI suggestions; nightly batch defers; live pricing UNAFFECTED |
| Mitigation | GenerateDynamicPricingSuggestionUseCase returns MELMASTOON.AI.UNAVAILABLE; nothing persisted |
| Recovery | Orchestrator returns; manual retry from UI; nightly batch reruns next day |
| Runbook | runbooks/F-13-ai-orchestrator-down.md |
F-14 — Cross-tenant query regression (RLS bypass)
| Property | Value |
|---|---|
| Trigger | A code change inadvertently bypasses the SET LOCAL app.tenant_id step |
| Detection | Continuous integration test asserts cross-tenant queries return zero rows; runtime MELMASTOON.SECURITY.TENANT_VIOLATION log alert (P1) |
| Blast radius | Potentially severe; pricing data is sensitive |
| Mitigation | Multi-layer defence: RLS + ALS + cache key prefix + audit; CI test must pass |
| Recovery | Immediate rollback; postmortem; affected tenants notified per the platform's data incident policy |
| Runbook | runbooks/F-14-cross-tenant.md |
F-15 — Outbox table runaway growth
| Property | Value |
|---|---|
| Trigger | Publisher unable to drain (sustained F-03) |
| Detection | pricing_outbox_unpublished > 10 000 |
| Blast radius | DB storage pressure; eventually inserts slow |
| Mitigation | Publisher leader-election; dual-write to a secondary Pub/Sub topic; manual replay tool |
| Recovery | Drain via the secondary topic or raise publisher concurrency; partition the outbox table if the problem recurs |
| Runbook | runbooks/F-15-outbox-runaway.md |
F-16 — Desktop offline quote rejected on push
| Property | Value |
|---|---|
| Trigger | Desktop pushed an offline-derived quote that the server recomputes differently (e.g. server-side promotion ran out, plan archived) |
| Detection | Desktop receives 409 from /internal/v1/sync/price-quotes:push; server logs MELMASTOON.PRICING.PROMO_OVEROBLIGATION or RATE_PLAN_INACTIVE |
| Blast radius | Front-desk operator must re-quote; guest may pay a different price |
| Mitigation | Server is authoritative; desktop UI surfaces the rejection inline; reservation fallback path is documented |
| Recovery | Re-quote and re-print folio confirmation |
| Runbook | runbooks/F-16-desktop-quote-rejected.md |
2. Failure-mode interaction matrix
| If | And | Then |
|---|---|---|
| F-02 (DB failover) | F-03 (Pub/Sub backlog) | Outbox grows; publisher resumes after both clear; no data loss |
| F-01 (FX outage) | F-13 (AI down) | Live pricing OK; advisory features disabled; revenue manager UI flags both |
| F-11 (Redis down) | quote burst | p99 still ≤ 600 ms (degraded); pages on quote_latency_slo_burn_fast |
| F-14 (RLS regression) | any | full incident response; deploy freeze |
3. Resilience principles
- Fail closed on ambiguity. A quote is never produced if any input is uncertain (FX hard-expire, sharia mismatch, derivation error).
- Fail open on freshness when safe. Cached FX snapshots remain usable past the soft staleAfter limit, with quotes clearly flagged fxSnapshot.stale=true.
- Idempotency everywhere. Every public mutation accepts an Idempotency-Key; every event handler dedupes on messageId.
- Authoritative server. Desktop quotes are validated and may be rejected on push; server price always wins.
- Defence in depth on tenancy. Multiple independent layers must all fail to leak cross-tenant data.