maintenance-service · FAILURE_MODES
Catalog of how this service can fail, what we detect it with, the containment / compensation, the user impact, and the runbook + tests that prove the recovery.
1. Choreography failures (room block / relocation)
1.1 work_order.room_blocked.v1 not acknowledged by property-service
- Detection: alert
mnt.room_block_no_response(noproperty.room.taken_out_of_order.v1for ourcorrelationIdwithin 60 s). - Compensation: worker re-publishes the request once after 60 s; if still no ack after 5 min, the WO is flagged
room_block_unconfirmed=true(UI badge), GM is notified, and the room is not automatically re-OOO from our side. - User impact: room may show as in-service while a high-severity issue is open; GM can manually OOO via property-service.
- Runbook:
runbook://maintenance/room-block— verify property-service health, replay event, manual OOO fallback. - Test:
integration/room_block_no_ack.spec.ts.
1.2 property-service rejects the block (room.block_rejected.v1)
- Detection: inbox handler.
- Compensation: set
causedRoomBlock=false; emitWorkOrderEscalatedwith reasonroom_block_rejected; notify GM. WO continues normally. - Test:
integration/room_block_rejected.spec.ts.
1.3 Relocation required but no replacement room available
- Detection:
reservation-servicepublishesreservation.modification.failed.v1withkind=room_change, reason=no_inventory. - Compensation: auto-emit
WorkOrderEscalatedto GM; UI surfaces "guest waiting on relocation"; we do not auto-cancel the WO. - Test:
integration/relocation_no_inventory.spec.ts.
2. Vendor & assignment failures
2.1 Vendor no-show (no acknowledgement after N reminders)
- Detection:
VendorReminderWorkerafter 3 cycles pastvendorReminderMinutes. - Compensation: auto-escalate (
escalated.v1withreason=vendor_no_show); revert WO toassignedwithassigneecleared after 4th cycle so a new vendor can be picked. - User impact: GM is paged; staff sees a "vendor unresponsive" badge.
- Runbook:
runbook://maintenance/vendor-no-show. - Test:
integration/vendor_no_show.spec.ts.
2.2 Vendor assigned with channelPreference = call_only and staff didn't manually ack
- Detection: the assign use case rejects with
VENDOR_CHANNEL_MISMATCH. - Compensation: UI shows the message: "this vendor must be reached by phone; record the manual acknowledgement." Staff records
RecordVendorAcknowledgementUseCase. - Test:
application/__tests__/assign_call_only.spec.ts.
2.3 Notification dispatch fails
- Detection: notification-service returns 5xx; we log, do not block.
- Compensation: notification-service has its own retry policy and DLQ; vendor.assigned.v1 event still fires, so analytics and audit are intact.
- User impact: none for the WO itself; vendor may not be informed digitally — UI shows a "notification not delivered" badge after notification-service publishes its own DLQ event we consume.
- Test:
integration/notification_5xx.spec.ts.
3. Parts & cost failures
3.1 Part out of stock at resolve time
- Detection:
PartRepository.decrementOnHandraisesPART_OUT_OF_STOCK. - Compensation:
ResolveWorkOrderUseCaseaborts, returns 409 to caller. UI prompts to update parts (purchase or override quantity). Optionally the WO is auto-blocked withreason=awaiting_part. - User impact: technician must reconcile parts before resolution.
- Test:
integration/parts_out_of_stock.spec.ts.
3.2 Cost-currency mismatch
- Detection: domain validation at resolve.
- Compensation: 422; UI uses the tenant base currency dropdown.
- Test:
domain/__tests__/cost_currency.spec.ts.
4. Concurrency & ordering failures
4.1 OCC stale version
- Detection:
UPDATE … WHERE version = ?returns 0 rows; repository raisesOccConflict. - Compensation: controller returns
MELMASTOON.SYS.OCC_CONFLICTwith current version + status; BFF retries up to 2× with re-fetch. - Test: every state transition has an OCC conflict spec.
4.2 Out-of-order Pub/Sub delivery
- Detection: none (we don't try to detect; we make every consumer idempotent and order-tolerant).
- Compensation: consumer handlers are idempotent (
messageIdin inbox) and pure functions of payload + current state; reordering of upstream events does not change the final state. - Test:
integration/out_of_order_inbox.spec.tsshuffles 10 events for the same WO and asserts identical final state to the in-order run.
4.3 Two staff resolve the same WO simultaneously
- Detection: OCC; first wins.
- Compensation: second gets 409; UI shows the resolved state.
- Test:
integration/concurrent_resolve.spec.ts.
4.4 Preventive scheduler tick overlap (two pods process same row)
- Detection: unique constraint on
preventive_fires(scheduleId, due_at_bucket_hour). - Compensation: second insert raises unique violation → handler logs
PREVENTIVE_DUPLICATE_FIREdebug and returns 200. - Test:
integration/preventive_duplicate_fire.spec.ts.
5. Storage failures
5.1 Cloud SQL unavailability (planned failover)
- Detection: connection errors; readiness probe fails; pods drained.
- Compensation: Cloud Run waits with retries + circuit breaker; new request returns
MELMASTOON.SYS.UPSTREAM_UNAVAILABLE(503). Client retries with exponential backoff. - User impact: ~30 s blip during regional failover.
- Test: chaos:
pnpm chaos:db-failover.
5.2 Outbox relay stalled
- Detection:
mnt.outbox.lag_seconds > 60for 5 min. - Compensation: alert P2; runbook restarts relay or scales it; events catch up.
- User impact: downstream services see delayed events but no data loss.
- Test:
integration/outbox_relay_stalled.spec.ts.
5.3 Inbox dedupe corruption
- Detection: assertion in handler: same
messageIdprocessed but inbox row missing. - Compensation: handler still treats as already-processed for current request (the natural dedupe via state checks holds), and emits an alert. Operator runs
tools/rebuild-inbox.tsfrom event log to repair. - Test:
integration/inbox_corruption.spec.ts.
5.4 Disk full / Cloud SQL storage exhausted
- Detection: alert
db_storage_highat 80%,db_storage_criticalat 90%. - Compensation: auto-grow enabled to 200 GB; archiver runs to evict closed WOs > 24 mo.
- Test: simulated in staging with synthetic load.
6. Worker failures
6.1 Preventive scheduler stuck
- Detection:
mnt.preventive.due_pending_countrising without scheduler progress; workerlast_tick_atlag > 5 min. - Compensation: restart Cloud Run revision; backlog drains in N ticks since dedupe is robust.
- Test: chaos: kill the worker pod mid-tick and verify recovery.
6.2 SLA scanner double-counts breaches
- Detection:
breach_countjumping > 1 per minute. - Cause: missing dedupe within minute; not possible per current code, but tested.
- Test:
integration/sla_double_count.spec.ts.
6.3 Asset health forecaster pollutes scores
- Detection: alert if
mnt.assets.health_indexdrops > 30 points across a property in one tick. - Compensation: auto-revert via
RecordAssetHealthUpdateUseCasewith override; suspend forecaster for the tenant; investigate model output. - Test: golden tests on forecaster bound it.
7. Auto-create path failures (inbound choreography)
7.1 Housekeeping flag arrives but no propertyId resolvable
- Detection: handler validation; emit metric
mnt.inbox.invalid_payload_total. - Compensation: push to DLQ for human review; do not retry forever.
- Test:
integration/inbox_invalid_payload.spec.ts.
7.2 Lock health alert references unknown deviceId
- Detection: asset upsert proceeds (we register a new asset), but
displayNameis generic. - Compensation: UI surfaces a "needs naming" badge on the new asset; technician renames.
- Test:
integration/lock_unknown_device.spec.ts.
7.3 Auto-created WO collides with existing open one (invariant #4)
- Detection: repository find-open returns a row.
- Compensation: comment append on existing WO with the new source/note; original WO id returned in
originRef. - Test:
integration/auto_create_collision.spec.ts.
8. Sync failures
8.1 Push command with stale OCC
- Detection: OCC conflict.
- Compensation: server returns
OCC_CONFLICT; client surfaces conflict view; technician re-fetches. - Test:
integration/sync_push_occ.spec.ts.
8.2 Push commands batch partial failure
- Detection: per-command result.
- Compensation: other WOs proceed; failed WO's subsequent commands in the batch are skipped (server breaks at first failure per-WO).
- Test:
integration/sync_push_partial.spec.ts.
8.3 Device clock far skewed
- Detection: server compares
deviceClockvsserverNow. - Compensation: if skew > 15 min, push is rejected with
MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE; technician syncs OS clock. - Test:
integration/sync_clock_skew.spec.ts.
9. AI failures
9.1 Orchestrator timeout / 5xx
- Compensation: fail-soft; WO created without AI assist; provenance logs failure.
- Test:
integration/ai_orchestrator_timeout.spec.ts.
9.2 Model returns invalid enum
- Compensation: discard and log; do not persist; fall back to staff input.
- Test:
integration/ai_invalid_enum.spec.ts.
9.3 Tenant AI budget exhausted
- Compensation: fail-soft per capability matrix in
AI_INTEGRATION.md§5; WO flow continues. - Test:
integration/ai_budget_exhausted.spec.ts.
10. Edge-case operational failures
10.1 High-severity WO opened on a room with an active reservation, no replacement available, guest already at desk
- Detection:
reservation.modification.failed.v1withkind=room_change. - Compensation: WO escalated to GM; UI displays "guest waiting" banner; check-in saga (reservation-service) handles guest reaccommodation manually.
- User impact: guest waits; GM intervenes.
- Test: scripted E2E in staging.
10.2 Vendor invoice file upload fails
- Detection: signed-URL upload returns 4xx; BFF surfaces; we never persist the WO change because the file ref was never confirmed.
- Compensation: retry; or skip file upload, just record amount.
- Test: BFF E2E.
10.3 Multiple lock health alerts for the same device in 1 second
- Detection: inbox dedupe +
(deviceId, alertCode, dayBucket)natural dedupe. - Compensation: at most one WO per dedupe bucket; subsequent alerts append a comment.
- Test:
integration/lock_alert_dedupe.spec.ts.
10.4 Generator run-hours regress (sensor or human input goes backward)
- Detection:
Asset.runHoursismax-ofper sync contract; but cloud-side direct REST update to a lower value is rejected withMELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION. - Compensation: UI prompts: "current value is X, you entered Y. Confirm reset?" — if confirmed,
RecordAssetHealthUpdateUseCasewithforce=true(audited). - Test:
integration/asset_run_hours_regression.spec.ts.
10.5 Schedule nextDueAt regression after manual override
- Detection: invariant #9.
- Compensation: request rejected; UI explains that next-due cannot move backward without manually completing the in-flight schedule.
- Test:
domain/__tests__/schedule_next_due_regression.spec.ts.
11. Generic failure principles
- Fail-soft on advisory dependencies (AI, notification dispatch, asset health forecaster) — primary flow continues, badge surfaced.
- Fail-fast on authoritative dependencies (DB, auth, RLS) — return 5xx, let client retry.
- Audit everything — every failure produces a structured log + a metric increment. Silent failures are bugs.
- No partial state — every command writes WO + outbox in one transaction; either both happen or neither.