FAILURE_MODES — lock-integration-service
Bundle: SERVICE_OVERVIEW · APPLICATION_LOGIC · SECURITY_MODEL · OBSERVABILITY · SYNC_CONTRACT · SERVICE_RISK_REGISTER
Cross-cutting: docs/standards/ERROR_CODES, docs/02 §13 Resilience.
This service is never allowed to fail silently: a missing key at the door is a real-world incident. Every failure mode below has a deterministic detection signal, a deterministic recovery, and a documented runbook hook.
1. Failure-mode catalog
| # | Mode | Likelihood | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|---|
| F1 | Vendor cloud unreachable (TTLock/Salto/Vostio API down) | Medium | High — new credentials cannot be issued for affected vendor | Adapter call latency p95 > 5×SLA, then circuit opens; melmastoon_lock_vendor_circuit_state == 1 | Per-vendor circuit breaker; emit lock.vendor_adapter.health_changed.disabled; saga retries with exp backoff (max 6); for issuance, attempt fallback adapter from KeyKindPolicy.fallbackOrder (if any cloud-resilient kind available); for properties with paired Electron desktop and Generic Wiegand, issue locally as provisional | Auto-recover when health prober succeeds 3× consecutively → emit health_changed.healthy, drain saga retry queue |
| F2 | Vendor cloud rate-limited / quota exhausted | Medium | Medium | Adapter returns documented quota code; vendor_call_total{outcome="rate_limited"} spikes | Token-bucket per vendor (cached headers); on 429, requeue with Retry-After; warn at 80% quota use via vendor_health_score degradation | Quota window roll; emergency tier upgrade if sustained |
| F3 | Vendor webhook delayed / missing | Medium | Medium — credential stuck in pending | Reconciler job: any pending credential older than vendorMaxAckSeconds flagged | Reconciliation Cloud Run Job every 60s scans pending rows, calls vendor getCredentialStatus() to pull state directly | Either materialize via pull or mark failed after 30 min and trigger compensation |
| F4 | Vendor webhook spoofed (a forged webhook that passed signature verification would let an attacker mark a credential revoked) | Low | Critical | Signature mismatch counter spikes; Cloud Armor logs anomalous IPs | Per-vendor signature verification (HMAC/RSA/mTLS), see SECURITY_MODEL §5; rate-limit; auto-denylist | Forensic via inbox + audit; rotate signing secret with vendor |
| F5 | Duplicate webhook delivery | High | Low (idempotent) | webhook_inbox unique key collision metric | Inbox dedup by (vendor, vendorEventId) | Drop duplicate, return 200 |
| F6 | Out-of-order webhook (e.g., revoked arrives before active) | Medium | Medium | Saga step receives unexpected transition; emits STATE_INVARIANT_VIOLATED warning | Buffer event by vendorRef + vendorEventTs; if newer state already applied, drop older | Compensation rare; typically benign because the later state is correct |
| F7 | Concurrent issuance for same room | Medium | Medium — risk of two active credentials for same room | Advisory-lock contention metric | Postgres advisory lock lock:{propertyId}:{roomId} acquired before issuance; one waiter, max 5 retries with 250 ms jitter | Loser idempotently observes the winner's credential and either defers or returns it |
| F8 | PIN code collision (vendor returns same PIN for two active credentials) | Low | Medium: door opens for wrong guest | Adapter post-issuance check: query active PINs for the device | On collision, regenerate up to 3×; if all 3 regenerations collide, fail with MELMASTOON.LOCK.PIN_COLLISION and fall back to rfid_card or mobile_app per policy | Notify GM; HITL Decision created |
| F9 | Mobile-app token push failure (notification-service down) | Medium | Medium | notification.push.delivered.v1 not received within 60s | Re-emit lock.credential.delivery_failed.v1; auto-fallback to PIN delivery via SMS if KeyKindPolicy.fallbackOrder includes pin_code | Front desk dashboard surfaces undelivered keys; manual re-send |
| F10 | Time skew on lock device > tolerance (≥ 60s) | Medium | High — credential rejected at door despite valid window | Vendor health checks include time-sync; device.health_alert.v1 with kind=clock_drift | Adapter command issued to re-sync (where supported); device added to maintenance ticket | Field engineer dispatched; meanwhile widen credential window by 5 min via update saga |
| F11 | USB encoder disconnected mid-session (Electron desktop) | Medium | High — offline issuance halted | Main-process serial event disconnect; EncoderSession.lastActivityAt stale | Close session row, emit lock.encoder_session.closed.local.v1 (reason='disconnected'); UI prompts operator to reconnect | Operator reconnects; new session opens; queued issuances retry |
| F12 | Cloud SQL Postgres failover / brief unavailability | Low | High — sagas stall | sql_admin.uptime drop; db_pool.connection_errors_total spikes | HA failover (regional); saga step retries on transient; outbox guarantees at-least-once delivery | Resume on DB return; backlog drains within minutes |
| F13 | Outbox publisher backlog | Low | Medium | melmastoon_lock_outbox_pending > 10 000 for 10 min | Auto-scale saga-runner; manual unblock if Pub/Sub topic-level outage | Drain confirmed by gauge return to baseline |
| F14 | Inbox handler crash loop (poison message) | Low | Medium | inbox_processed_total{outcome="error"} rate spike on a single event | Max 5 attempts → DLQ subscription; alert; manual replay tool | Engineer inspects DLQ, fixes handler bug or marks event dropped with reason |
| F15 | Stale KeyCredentialAggregate.version (optimistic concurrency conflict) | Medium | Low | UseCase returns MELMASTOON.LOCK.STATE_VERSION_MISMATCH | Use case re-loads aggregate and retries up to 3× | Final retry failure surfaces as 409 to caller; CLI replay tool for stuck rows |
| F16 | Provisional credential references reservation that was cancelled | Medium | Medium | Reconciler decision matrix branch | Sync reconciler revokes at vendor (best-effort) and persists revoked with metadata.wasProvisional=true | Desktop receives outcome:'revoked'; locally invalidates the card |
| F17 | Offline issuance certificate expired before reconcile | Low | Medium | Reconciler signature check fails; CRL match | Reject push with MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED; desktop must re-mint cert and re-issue | Operator re-mints cert via cloud REST when online |
| F18 | Stolen Electron desktop | Low | Critical | iam.device.unbound.v1 event | On unbind, add device's offline_issuance cert serial to CRL; flag all credentials issued in last 14 days for review (HITL) | GM reviews flagged credentials; revokes any suspicious ones |
| F19 | Vendor credential leak | Low | Critical | Detected by routine secret-scanner CI or external bug bounty | Revoke at vendor + rotate; emit vendor_adapter.health_changed.disabled | Re-issue all pending credentials post-rotation |
| F20 | Pub/Sub subscriber misconfiguration (subscription deleted) | Very low | Medium | inbox_lag_seconds climbs without bound | Terraform drift alarm; recreate subscription via IaC | Replay topic backlog into recreated subscription |
| F21 | Salto on-prem connector unreachable (VPN tunnel down) | Low | Medium — affected property only | Synthetic check fails; vendor_circuit_state opens for that adapter only | Falls back to vendor cloud-only operations; on-prem-only operations queue with retry | Network team restores tunnel |
| F22 | BLE pairing fail (TTLock mobile-app token cannot bind) | Medium | Medium — guest can't open door | Adapter returns ttlock.ble_pair_failed; lock.credential.failed.v1 emitted | Auto-retry once; on persistent fail, fallback to PIN per policy | Front desk hands out PIN at counter |
| F23 | Idempotency key reuse with different payload | Low | Low | MELMASTOON.LOCK.IDEMPOTENCY_KEY_REUSED in logs | Caller bug; surface 409 with prior decision | Caller fixes; replays with fresh key |
| F24 | Audit Merkle anchor mismatch | Very low | Critical | Daily verification job emits mismatch=true | P1 incident; halt writes to affected partition; security forensic | Possible silent tamper or replication corruption — escalate per SECURITY_MODEL §11 |
| F25 | Master key off-shift use | Medium | High | Anomaly score ≥ 0.85 → HITL Decision | Optionally auto-suspend at score ≥ 0.95 (HITL pre-approved policy per tenant) | Tenant admin reviews; suspends or unsuspends |
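
The F5 and F6 mitigations (dedup by (vendor, vendorEventId), then drop any event older than the state already applied for the same vendorRef) can be sketched as follows. This is a minimal in-memory illustration, not the service's implementation: `WebhookEvent`, `WebhookInbox`, and the return labels are hypothetical names, and a real inbox would presumably enforce the dedup key with a unique constraint on the webhook_inbox table rather than a Python set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WebhookEvent:
    vendor: str             # e.g. "ttlock"
    vendor_event_id: str    # vendor-assigned delivery ID (dedup key, F5)
    vendor_ref: str         # vendor-side credential reference (ordering key, F6)
    vendor_event_ts: float  # vendor timestamp, seconds since epoch
    state: str              # e.g. "active", "revoked"


@dataclass
class WebhookInbox:
    """In-memory stand-in for the webhook inbox (hypothetical shape)."""
    seen: set = field(default_factory=set)
    latest: dict = field(default_factory=dict)  # vendor_ref -> (ts, state)

    def handle(self, event: WebhookEvent) -> str:
        key = (event.vendor, event.vendor_event_id)
        if key in self.seen:
            return "duplicate"  # F5: drop; the HTTP layer still returns 200
        self.seen.add(key)

        current = self.latest.get(event.vendor_ref)
        if current is not None and current[0] >= event.vendor_event_ts:
            return "stale"      # F6: a newer state was already applied
        self.latest[event.vendor_ref] = (event.vendor_event_ts, event.state)
        return "applied"


inbox = WebhookInbox()
revoked = WebhookEvent("ttlock", "evt-2", "cred-1", 200.0, "revoked")
active = WebhookEvent("ttlock", "evt-1", "cred-1", 100.0, "active")
print(inbox.handle(revoked))  # applied
print(inbox.handle(active))   # stale (older than the revoked state)
print(inbox.handle(revoked))  # duplicate
```

The late `active` event is discarded because the later `revoked` state is already the correct one, which matches the "typically benign" recovery note for F6.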
2. Failure → error code mapping
Mapping to canonical ERROR_CODES:
| Failure | Error code | HTTP |
|---|---|---|
| F1 | MELMASTOON.LOCK.VENDOR_UNAVAILABLE | 503 |
| F2 | MELMASTOON.LOCK.VENDOR_RATE_LIMITED | 429 |
| F3 | MELMASTOON.LOCK.VENDOR_ACK_TIMEOUT | 504 (admin) / surfaces as pending to caller |
| F4 | MELMASTOON.LOCK.WEBHOOK_SIGNATURE_INVALID | 401 |
| F5 | MELMASTOON.LOCK.WEBHOOK_DUPLICATE | 200 (silent) |
| F6 | None (logged as STATE_INVARIANT_VIOLATED warning; no error to caller) | n/a |
| F7 | MELMASTOON.LOCK.ROOM_LOCK_CONTENTION (after retries) | 409 |
| F8 | MELMASTOON.LOCK.PIN_COLLISION | 409 |
| F9 | MELMASTOON.LOCK.DELIVERY_FAILED | 502 (admin) / handled by retry |
| F10 | MELMASTOON.LOCK.DEVICE_TIME_SKEW (operational warning) | n/a |
| F11 | MELMASTOON.LOCK.ENCODER_DISCONNECTED (desktop) | n/a |
| F15 | MELMASTOON.LOCK.STATE_VERSION_MISMATCH | 409 |
| F17 | MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED | 403 |
| F18 | MELMASTOON.LOCK.OFFLINE_CERT_REVOKED | 403 |
| F22 | MELMASTOON.LOCK.BLE_PAIRING_FAILED | 502 |
| F23 | MELMASTOON.SYNC.IDEMPOTENCY_KEY_REUSED | 409 |
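
One way to keep this mapping in a single place is a lookup table keyed by failure-mode ID. The sketch below covers only a subset of the rows above; `ErrorMapping`, `FAILURE_ERRORS`, `to_http_error`, and the response shape are hypothetical and not taken from ERROR_CODES.

```python
from typing import NamedTuple, Optional


class ErrorMapping(NamedTuple):
    code: str
    http_status: Optional[int]  # None where the table lists "n/a"


# Subset of the mapping table above, keyed by failure-mode ID.
FAILURE_ERRORS = {
    "F1": ErrorMapping("MELMASTOON.LOCK.VENDOR_UNAVAILABLE", 503),
    "F2": ErrorMapping("MELMASTOON.LOCK.VENDOR_RATE_LIMITED", 429),
    "F7": ErrorMapping("MELMASTOON.LOCK.ROOM_LOCK_CONTENTION", 409),
    "F10": ErrorMapping("MELMASTOON.LOCK.DEVICE_TIME_SKEW", None),
    "F15": ErrorMapping("MELMASTOON.LOCK.STATE_VERSION_MISMATCH", 409),
    "F17": ErrorMapping("MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED", 403),
}


def to_http_error(failure_id: str) -> Optional[dict]:
    """Return a response description, or None for warning-only failures."""
    # A KeyError for an unmapped ID is deliberate: it surfaces a missing row.
    mapping = FAILURE_ERRORS[failure_id]
    if mapping.http_status is None:
        return None
    return {"status": mapping.http_status, "body": {"error": mapping.code}}
```

Under these assumptions, `to_http_error("F1")` yields a 503 carrying the canonical code, while `to_http_error("F10")` yields `None` because a time-skew warning never reaches an HTTP caller.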
3. Compensation matrix
| Saga | Failure point | Compensation |
|---|---|---|
| IssueSaga | Vendor returns success but webhook never confirms (F3) | Reconciler pulls vendor state; if absent, mark failed, emit lock.credential.failed.v1; downstream notification-service notifies front desk |
| IssueSaga | Vendor success, then DB save fails (F12) | Outbox guarantees event publish on retry; on retry, idempotency key dedupes vendor side |
| IssueSaga | Vendor permanent failure (F1 sustained) | Emit failed; if KeyKindPolicy.fallbackOrder permits, retry with next-preferred kind via separate saga step |
| RevokeSaga | Vendor returns not_found | Treat as success (idempotent); update local state |
| RevokeSaga | Vendor success, partial (e.g., revoked at cloud but device still has the credential cached) | Local state revoked; emit lock.credential.revoked.v1; device sync request enqueued; follow-up device.health_alert.v1 if device fails to sync within 5 min |
| MasterKeyShiftSaga | Issue at shift start fails | No partial state; ticket created for engineer; manager notified |
| WebhookSaga | Inbox handler crashes (F14) | DLQ + replay |
| OfflineReconcileSaga | Cert revoked / expired | Reject; desktop receives outcome and revokes locally |
| OfflineReconcileSaga | Reservation no longer matches | Per SYNC_CONTRACT §5.1 decision matrix |
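
The RevokeSaga row for a vendor `not_found` response is worth illustrating, because it is what makes revocation safe to replay. The sketch below is a simplified model under stated assumptions: `VendorNotFound`, `Credential`, and the `vendor_revoke` callable are hypothetical stand-ins for the adapter interface, and the emitted list stands in for the outbox.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class VendorNotFound(Exception):
    """Hypothetical adapter error for a vendor `not_found` response."""


@dataclass
class Credential:
    vendor_ref: str
    state: str = "active"


@dataclass
class RevokeSaga:
    vendor_revoke: Callable[[str], None]  # hypothetical adapter call
    emitted: List[str] = field(default_factory=list)

    def run(self, credential: Credential) -> Credential:
        if credential.state == "revoked":
            return credential  # replay of a finished saga: no-op
        try:
            self.vendor_revoke(credential.vendor_ref)
        except VendorNotFound:
            pass  # already gone at the vendor: treat as success
        credential.state = "revoked"
        self.emitted.append("lock.credential.revoked.v1")
        return credential
```

Running the saga twice against the same credential changes local state once and emits a single `lock.credential.revoked.v1`, regardless of whether the vendor still knew the credential.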
4. Degraded modes
| Mode | Trigger | Behavior |
|---|---|---|
| Vendor-down degraded (per vendor) | F1 sustained for 30s | New issuances for affected vendor return 503 with Retry-After; existing credentials continue to work; sagas defer; UI shows banner |
| All-vendors-down catastrophic | Multiple F1 simultaneously | Site-level circuit breaker; new issuance forced to provisional via Electron desktop where possible; cloud REST returns 503 with explicit message; on-call paged |
| Webhook revision down | Cloud Run revision unhealthy | Vendor cloud queues webhooks (Salto: 24h, Vostio: 7d, TTLock: 1h); reconciler covers gaps via pull |
| Saga-runner backlog | F13 | Auto-scale; per-tenant priority queue ensures critical events (reservation.confirmed.v1) processed before bulk events |
| Audit anchor unavailable | F24 partial | Halting new audit writes is not acceptable: audit rows are still written, Merkle root submissions are queued, and an alert is raised |
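
The per-vendor degraded mode could be gated as sketched below: a vendor counts as degraded only once its circuit has been open for the 30 s stated in the table, and issuance requests for that vendor then receive a 503 with Retry-After. The tracker class, function names, and the 30-second Retry-After default are assumptions for illustration; the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

SUSTAINED_SECONDS = 30.0  # from the table: "F1 sustained for 30s"


@dataclass
class VendorDegradedTracker:
    # vendor -> timestamp at which its circuit opened
    open_since: dict = field(default_factory=dict)

    def circuit_opened(self, vendor: str, now: float) -> None:
        self.open_since.setdefault(vendor, now)  # keep the first open time

    def circuit_closed(self, vendor: str) -> None:
        self.open_since.pop(vendor, None)

    def is_degraded(self, vendor: str, now: float) -> bool:
        opened = self.open_since.get(vendor)
        return opened is not None and now - opened >= SUSTAINED_SECONDS


def issuance_gate(
    tracker: VendorDegradedTracker,
    vendor: str,
    now: float,
    retry_after_seconds: int = 30,  # assumed value, not from the source
) -> Optional[Tuple[int, dict]]:
    """Return (status, headers) to reject issuance, or None to proceed."""
    if tracker.is_degraded(vendor, now):
        return 503, {"Retry-After": str(retry_after_seconds)}
    return None
```

Because the tracker is keyed by vendor, an outage at one vendor does not block issuance for the others, which matches the "per vendor" scope of this mode.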
5. Test coverage
The failure modes listed below each have at least one matching test in TESTING_STRATEGY:
- F1, F2, F22: vendor simulator failure injection.
- F3: integration test with delayed-ack simulator.
- F4, F5: webhook ingestion suite.
- F7: integration test with two concurrent issuance attempts.
- F8: PIN-collision simulator scenario.
- F11: HIL rig manual test, plus simulator unit tests.
- F12: integration test with Postgres killed mid-saga.
- F14: poison-message inbox test.
- F15: optimistic-locking concurrency test.
- F16, F17: offline reconcile decision matrix coverage.
- F23: API contract test for idempotency.
- F24: dedicated Merkle-anchor verification test.
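
As an illustration of what the F15 optimistic-locking test might look like, the sketch below pairs a retry loop with a fake repository that rejects a configurable number of saves. Everything here is hypothetical (`save_with_retry`, `FakeRepo`, `VersionMismatch`), and "retries up to 3×" is read as one initial attempt plus three retries, which is an assumption about the catalog wording.

```python
class VersionMismatch(Exception):
    """Stands in for MELMASTOON.LOCK.STATE_VERSION_MISMATCH."""


def save_with_retry(repo, mutate, max_retries=3):
    """Reload and retry on version conflicts; 409 once retries run out."""
    for _ in range(max_retries + 1):  # initial attempt + retries
        aggregate = repo.load()
        mutate(aggregate)
        try:
            repo.save(aggregate)
            return 200
        except VersionMismatch:
            continue
    return 409


class FakeRepo:
    def __init__(self, conflicts):
        self.version = 1
        self.conflicts = conflicts  # number of saves to reject
        self.loads = 0

    def load(self):
        self.loads += 1
        return {"version": self.version, "state": "pending"}

    def save(self, aggregate):
        if self.conflicts > 0:
            self.conflicts -= 1
            self.version += 1  # a concurrent writer got there first
            raise VersionMismatch()
        self.version = aggregate["version"] + 1


def activate(aggregate):
    aggregate["state"] = "active"


def test_recovers_after_two_conflicts():
    repo = FakeRepo(conflicts=2)
    assert save_with_retry(repo, activate) == 200
    assert repo.loads == 3  # each retry reloads the aggregate


def test_surfaces_409_when_retries_exhausted():
    repo = FakeRepo(conflicts=4)
    assert save_with_retry(repo, activate) == 409
    assert repo.loads == 4
```

The test functions follow the pytest naming convention, so a runner such as pytest would discover them without further wiring.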