FAILURE_MODES — lock-integration-service

Bundle: SERVICE_OVERVIEW · APPLICATION_LOGIC · SECURITY_MODEL · OBSERVABILITY · SYNC_CONTRACT · SERVICE_RISK_REGISTER

Cross-cutting: docs/standards/ERROR_CODES, docs/02 §13 Resilience.

This service is never allowed to fail silently: a missing key at the door is a real-world incident. Every failure mode below has a deterministic detection signal, a deterministic recovery, and a documented runbook hook.

1. Failure-mode catalog

| # | Mode | Likelihood | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|---|
| F1 | Vendor cloud unreachable (TTLock/Salto/Vostio API down) | Medium | High: new credentials cannot be issued for affected vendor | Adapter call latency p95 > 5×SLA, then circuit opens; melmastoon_lock_vendor_circuit_state == 1 | Per-vendor circuit breaker; emit lock.vendor_adapter.health_changed.disabled; saga retries with exponential backoff (max 6); for issuance, attempt fallback adapter from KeyKindPolicy.fallbackOrder (if any cloud-resilient kind is available); for properties with paired Electron desktop and Generic Wiegand, issue locally as provisional | Auto-recover when health prober succeeds 3× consecutively → emit health_changed.healthy, drain saga retry queue |
| F2 | Vendor cloud rate-limited / quota exhausted | Medium | Medium | Adapter returns documented quota code; vendor_call_total{outcome="rate_limited"} spikes | Token-bucket per vendor (cached headers); on 429, requeue with Retry-After; warn at 80% quota use via vendor_health_score degradation | Quota window roll; emergency tier upgrade if sustained |
| F3 | Vendor webhook delayed / missing | Medium | Medium: credential stuck in pending | Reconciler job: any pending credential older than vendorMaxAckSeconds is flagged | Reconciliation Cloud Run Job every 60s scans pending rows, calls vendor getCredentialStatus() to pull state directly | Either materialize via pull or mark failed after 30 min and trigger compensation |
| F4 | Vendor webhook spoofed (a forged webhook that passed signature verification would let an attacker mark a credential revoked) | Low | Critical | Signature mismatch counter spikes; Cloud Armor logs anomalous IPs | Per-vendor signature verification (HMAC/RSA/mTLS; see SECURITY_MODEL §5); rate-limit; auto-denylist | Forensics via inbox + audit; rotate signing secret with vendor |
| F5 | Duplicate webhook delivery | High | Low (idempotent) | webhook_inbox unique key collision metric | Inbox dedup by (vendor, vendorEventId) | Drop duplicate, return 200 |
| F6 | Out-of-order webhook (e.g., revoked arrives before active) | Medium | Medium | Saga step receives unexpected transition; emits STATE_INVARIANT_VIOLATED warning | Buffer event by vendorRef + vendorEventTs; if newer state already applied, drop older | Compensation rare; typically benign because the later state is correct |
| F7 | Concurrent issuance for same room | Medium | Medium: risk of two active credentials for same room | Advisory-lock contention metric | Postgres advisory lock lock:{propertyId}:{roomId} acquired before issuance; one waiter, max 5 retries with 250 ms jitter | Loser idempotently observes the winner's credential and either defers or returns it |
| F8 | PIN code collision (vendor returns same PIN for two active credentials) | Low | Medium: door opens for wrong guest | Adapter post-issuance check: query active PINs for the device; if collision, regenerate up to 3× | If 3 regenerations all collide → fail with MELMASTOON.LOCK.PIN_COLLISION, fall back to rfid_card or mobile_app per policy | Notify GM; HITL Decision created |
| F9 | Mobile-app token push failure (notification-service down) | Medium | Medium | notification.push.delivered.v1 not received within 60s | Re-emit lock.credential.delivery_failed.v1; auto-fallback to PIN delivery via SMS if KeyKindPolicy.fallbackOrder includes pin_code | Front desk dashboard surfaces undelivered keys; manual re-send |
| F10 | Time skew on lock device > tolerance (≥ 60s) | Medium | High: credential rejected at door despite valid window | Vendor health checks include time-sync; device.health_alert.v1 with kind=clock_drift | Adapter command issued to re-sync (where supported); device added to maintenance ticket | Field engineer dispatched; meanwhile widen credential window by 5 min via update saga |
| F11 | USB encoder disconnected mid-session (Electron desktop) | Medium | High: offline issuance halted | Main-process serial event disconnect; EncoderSession.lastActivityAt stale | Close session row, emit lock.encoder_session.closed.local.v1 (reason='disconnected'); UI prompts operator to reconnect | Operator reconnects; new session opens; queued issuances retry |
| F12 | Cloud SQL Postgres failover / brief unavailability | Low | High: sagas stall | sql_admin.uptime drop; db_pool.connection_errors_total spikes | HA failover (regional); saga step retries on transient errors; outbox guarantees at-least-once delivery | Resume on DB return; backlog drains within minutes |
| F13 | Outbox publisher backlog | Low | Medium | melmastoon_lock_outbox_pending > 10 000 for 10 min | Auto-scale saga-runner; manual unblock if Pub/Sub topic-level outage | Drain confirmed by gauge return to baseline |
| F14 | Inbox handler crash loop (poison message) | Low | Medium | inbox_processed_total{outcome="error"} rate spike on a single event | Max 5 attempts → DLQ subscription; alert; manual replay tool | Engineer inspects DLQ, fixes handler bug or marks event dropped with reason |
| F15 | Stale KeyCredentialAggregate.version (optimistic concurrency conflict) | Medium | Low | UseCase returns MELMASTOON.LOCK.STATE_VERSION_MISMATCH | Use case re-loads aggregate and retries up to 3× | Final retry failure surfaces as 409 to caller; CLI replay tool for stuck rows |
| F16 | Provisional credential references reservation that was cancelled | Medium | Medium | Reconciler decision matrix branch | Sync reconciler revokes at vendor (best-effort) and persists revoked with metadata.wasProvisional=true | Desktop receives outcome:'revoked'; locally invalidates the card |
| F17 | Offline issuance certificate expired before reconcile | Low | Medium | Reconciler signature check fails; CRL match | Reject push with MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED; desktop must re-mint cert and re-issue | Operator re-mints cert via cloud REST when online |
| F18 | Stolen Electron desktop | Low | Critical | iam.device.unbound.v1 event | On unbind, add device's offline_issuance cert serial to CRL; flag all credentials issued in last 14 days for review (HITL) | GM reviews flagged credentials; revokes any suspicious ones |
| F19 | Vendor credential leak | Low | Critical | Detected by routine secret-scanner CI or external bug bounty | Revoke at vendor + rotate; emit vendor_adapter.health_changed.disabled | Re-issue all pending credentials post-rotation |
| F20 | Pub/Sub subscriber misconfiguration (subscription deleted) | Very low | Medium | inbox_lag_seconds climbs without bound | Terraform drift alarm; recreate subscription via IaC | Replay topic backlog into recreated subscription |
| F21 | Salto on-prem connector unreachable (VPN tunnel down) | Low | Medium: affected property only | Synthetic check fails; vendor_circuit_state opens for that adapter only | Falls back to vendor cloud-only operations; on-prem-only operations queue with retry | Network team restores tunnel |
| F22 | BLE pairing failure (TTLock mobile-app token cannot bind) | Medium | Medium: guest can't open door | Adapter returns ttlock.ble_pair_failed; lock.credential.failed.v1 emitted | Auto-retry once; on persistent failure, fall back to PIN per policy | Front desk hands out PIN at counter |
| F23 | Idempotency key reuse with different payload | Low | Low | MELMASTOON.LOCK.IDEMPOTENCY_KEY_REUSED in logs | Caller bug; surface 409 with prior decision | Caller fixes; replays with fresh key |
| F24 | Audit Merkle anchor mismatch | Very low | Critical | Daily verification job emits mismatch=true | P1 incident; halt writes to affected partition; security forensics | Possible silent tamper or replication corruption; escalate per SECURITY_MODEL §11 |
| F25 | Master key off-shift use | Medium | High | Anomaly score ≥ 0.85 → HITL Decision | Optionally auto-suspend at score ≥ 0.95 (HITL pre-approved policy per tenant) | Tenant admin reviews; suspends or unsuspends |
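
The F1 row describes a breaker that opens on a latency breach and closes again only after three consecutive successful health probes. A minimal sketch of that state machine follows. The class, method, and field names are hypothetical, the in-memory `emitted` array stands in for a real event bus, and only the thresholds (5× SLA, 3 consecutive probes, gauge value 1 when open) are taken from the catalog.

```typescript
// Sketch only: class, method, and field names are hypothetical.
// Thresholds (5x SLA, 3 consecutive probes, gauge 1 when open) come from row F1.
type CircuitState = "closed" | "open";

class VendorCircuit {
  private state: CircuitState = "closed";
  private consecutiveProbeSuccesses = 0;
  private readonly slaMs: number;
  /** Event names recorded in order, standing in for a real event bus. */
  readonly emitted: string[] = [];

  constructor(slaMs: number) {
    this.slaMs = slaMs;
  }

  /** 1 while the circuit is open, 0 otherwise. */
  gauge(): number {
    return this.state === "open" ? 1 : 0;
  }

  /** Open the circuit when the observed p95 latency exceeds 5x the SLA. */
  observeP95(p95Ms: number): void {
    if (this.state === "closed" && p95Ms > 5 * this.slaMs) {
      this.state = "open";
      this.consecutiveProbeSuccesses = 0;
      this.emitted.push("lock.vendor_adapter.health_changed.disabled");
    }
  }

  /** Feed one health-prober result; close after 3 consecutive successes. */
  observeProbe(ok: boolean): void {
    if (this.state !== "open") return;
    this.consecutiveProbeSuccesses = ok ? this.consecutiveProbeSuccesses + 1 : 0;
    if (this.consecutiveProbeSuccesses >= 3) {
      this.state = "closed";
      this.emitted.push("health_changed.healthy");
    }
  }
}
```

A failed probe resets the counter, so two successes followed by a failure leave the circuit open. Draining the saga retry queue would hang off the `health_changed.healthy` event and is left out of the sketch.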

2. Failure → error code mapping

Mapping to canonical ERROR_CODES:

| Failure | Error code | HTTP |
|---|---|---|
| F1 | MELMASTOON.LOCK.VENDOR_UNAVAILABLE | 503 |
| F2 | MELMASTOON.LOCK.VENDOR_RATE_LIMITED | 429 |
| F3 | MELMASTOON.LOCK.VENDOR_ACK_TIMEOUT | 504 (admin) / surfaces as pending to caller |
| F4 | MELMASTOON.LOCK.WEBHOOK_SIGNATURE_INVALID | 401 |
| F5 | MELMASTOON.LOCK.WEBHOOK_DUPLICATE | 200 (silent) |
| F6 | logged WARNING; no error to caller | |
| F7 | MELMASTOON.LOCK.ROOM_LOCK_CONTENTION (after retries) | 409 |
| F8 | MELMASTOON.LOCK.PIN_COLLISION | 409 |
| F9 | MELMASTOON.LOCK.DELIVERY_FAILED | 502 (admin) / handled by retry |
| F10 | MELMASTOON.LOCK.DEVICE_TIME_SKEW (operational warning) | n/a |
| F11 | MELMASTOON.LOCK.ENCODER_DISCONNECTED (desktop) | n/a |
| F15 | MELMASTOON.LOCK.STATE_VERSION_MISMATCH | 409 |
| F17 | MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED | 403 |
| F18 | MELMASTOON.LOCK.OFFLINE_CERT_REVOKED | 403 |
| F22 | MELMASTOON.LOCK.BLE_PAIRING_FAILED | 502 |
| F23 | MELMASTOON.SYNC.IDEMPOTENCY_KEY_REUSED | 409 |
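
For callers that need this mapping in code, a lookup table is one straightforward encoding. The sketch below is illustrative only: the constant and helper names are hypothetical, and the status values are copied from the table above. F3 and F9 are left out because their status is admin-only, and F6, F10, and F11 because they have no HTTP status.

```typescript
// Status codes copied from the mapping table; constant and helper names are hypothetical.
// F3 and F9 (admin-only status) and F6, F10, F11 (no HTTP status) are intentionally omitted.
const HTTP_STATUS_BY_ERROR_CODE: { [code: string]: number } = {
  "MELMASTOON.LOCK.VENDOR_UNAVAILABLE": 503,        // F1
  "MELMASTOON.LOCK.VENDOR_RATE_LIMITED": 429,       // F2
  "MELMASTOON.LOCK.WEBHOOK_SIGNATURE_INVALID": 401, // F4
  "MELMASTOON.LOCK.WEBHOOK_DUPLICATE": 200,         // F5 (silent)
  "MELMASTOON.LOCK.ROOM_LOCK_CONTENTION": 409,      // F7
  "MELMASTOON.LOCK.PIN_COLLISION": 409,             // F8
  "MELMASTOON.LOCK.STATE_VERSION_MISMATCH": 409,    // F15
  "MELMASTOON.LOCK.OFFLINE_CERT_EXPIRED": 403,      // F17
  "MELMASTOON.LOCK.OFFLINE_CERT_REVOKED": 403,      // F18
  "MELMASTOON.LOCK.BLE_PAIRING_FAILED": 502,        // F22
  "MELMASTOON.SYNC.IDEMPOTENCY_KEY_REUSED": 409,    // F23
};

/** Returns the documented HTTP status, or undefined for codes not in the table. */
function httpStatusFor(code: string): number | undefined {
  return Object.prototype.hasOwnProperty.call(HTTP_STATUS_BY_ERROR_CODE, code)
    ? HTTP_STATUS_BY_ERROR_CODE[code]
    : undefined;
}
```

Returning `undefined` for unknown codes, rather than guessing a default such as 500, leaves the fallback decision to the caller, since the document does not define one.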

3. Compensation matrix

| Saga | Failure point | Compensation |
|---|---|---|
| IssueSaga | Vendor returns success but webhook never confirms (F3) | Reconciler pulls vendor state; if absent, mark failed, emit lock.credential.failed.v1; downstream notification-service notifies front desk |
| IssueSaga | Vendor success, then DB save fails (F12) | Outbox guarantees event publish on retry; on retry, idempotency key dedupes vendor side |
| IssueSaga | Vendor permanent failure (F1 sustained) | Emit failed; if KeyKindPolicy.fallbackOrder permits, retry with next-preferred kind via separate saga step |
| RevokeSaga | Vendor returns not_found | Treat as success (idempotent); update local state |
| RevokeSaga | Vendor success, partial (e.g., revoked at cloud but device still has the credential cached) | Local state revoked; emit lock.credential.revoked.v1; device sync request enqueued; followup device.health_alert.v1 if device fails to sync within 5 min |
| MasterKeyShiftSaga | Issue at shift start fails | No partial state; ticket created for engineer; manager notified |
| WebhookSaga | Inbox handler crashes (F14) | DLQ + replay |
| OfflineReconcileSaga | Cert revoked / expired | Reject; desktop receives outcome and revokes locally |
| OfflineReconcileSaga | Reservation no longer matches | Per SYNC_CONTRACT §5.1 decision matrix |
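
The first IssueSaga row, together with F3 in the catalog, implies a small decision procedure for each pending credential the reconciler scans. A sketch under stated assumptions: the type and function names are hypothetical, `null` is assumed to mean the vendor has no record of the credential, and a missing record before the 30-minute mark is assumed to mean "pull again on the next scan", which the document does not spell out.

```typescript
// Hypothetical types and names. Only vendorMaxAckSeconds, the 30-minute cutoff,
// and the "pull, then materialize or fail" flow come from the document.
type PulledState = "active" | "revoked" | null; // null: vendor has no record (assumption)
type ReconcileAction = "wait" | "materialize" | "retry_pull" | "mark_failed";

const FAIL_AFTER_SECONDS = 30 * 60;

function decidePending(
  ageSeconds: number,
  vendorMaxAckSeconds: number,
  pulled: PulledState,
): ReconcileAction {
  if (ageSeconds <= vendorMaxAckSeconds) return "wait";       // not overdue yet, no pull needed
  if (pulled !== null) return "materialize";                  // vendor reports a definite state
  if (ageSeconds >= FAIL_AFTER_SECONDS) return "mark_failed"; // caller then triggers compensation
  return "retry_pull";                                        // assumption: try again next scan
}
```

Keeping the decision pure (no I/O) makes each branch easy to cover in the F3 integration test: the caller performs the `getCredentialStatus()` pull and passes the result in.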

4. Degraded modes

| Mode | Trigger | Behavior |
|---|---|---|
| Vendor-down degraded (per vendor) | F1 sustained for 30s | New issuances for affected vendor return 503 with Retry-After; existing credentials continue to work; sagas defer; UI shows banner |
| All-vendors-down catastrophic | Multiple F1 simultaneously | Site-level circuit breaker; new issuance forced to provisional via Electron desktop where possible; cloud REST returns 503 with explicit message; on-call paged |
| Webhook revision down | Cloud Run revision unhealthy | Vendor cloud queues webhooks (Salto: 24h, Vostio: 7d, TTLock: 1h); reconciler covers gaps via pull |
| Saga-runner backlog | F13 | Auto-scale; per-tenant priority queue ensures critical events (reservation.confirmed.v1) are processed before bulk events |
| Audit anchor unavailable | F24 partial | Halting new audit writes is not acceptable; queue Merkle root submissions and alert; the rows themselves are still written |
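
The saga-runner backlog row relies on critical events overtaking bulk ones within a tenant's queue. One way to sketch that ordering is shown below. The event shape, the helper names, and the FIFO tie-break are assumptions. Only `reservation.confirmed.v1` as an example of a critical event comes from the table.

```typescript
// Hypothetical shapes and names; only reservation.confirmed.v1 as a critical
// event type comes from the document.
interface QueuedEvent {
  tenantId: string;
  type: string;
  enqueuedAtMs: number;
}

const CRITICAL_EVENT_TYPES = ["reservation.confirmed.v1"];

function isCritical(e: QueuedEvent): boolean {
  return CRITICAL_EVENT_TYPES.indexOf(e.type) !== -1;
}

/** Order one tenant's backlog: critical events first, then oldest first. */
function drainOrder(tenantBacklog: QueuedEvent[]): QueuedEvent[] {
  // slice() copies the array so the caller's backlog is not reordered in place.
  return tenantBacklog.slice().sort((a, b) => {
    const byPriority = Number(!isCritical(a)) - Number(!isCritical(b));
    return byPriority !== 0 ? byPriority : a.enqueuedAtMs - b.enqueuedAtMs;
  });
}
```

A production queue would more likely keep separate critical and bulk lanes than re-sort on every drain; the sort is used here only because it makes the intended ordering explicit.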

5. Test coverage

Each failure mode has at least one matching test in TESTING_STRATEGY:

  • F1, F2, F22: vendor simulator failure injection.
  • F3: integration test with delayed-ack simulator.
  • F4, F5: webhook ingestion suite.
  • F7: integration test with two concurrent issuance attempts.
  • F8: PIN-collision simulator scenario.
  • F11: HIL rig manual test, plus simulator unit tests.
  • F12: integration test with Postgres killed mid-saga.
  • F14: poison-message inbox test.
  • F15: optimistic-locking concurrency test.
  • F16, F17: offline reconcile decision matrix coverage.
  • F23: API contract test for idempotency.
  • F24: dedicated Merkle-anchor verification test.
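
As an illustration of the F15 entry, an optimistic-locking concurrency test could take roughly the following shape. Everything here is a sketch: the in-memory repository, the `beforeSave` hook that simulates a rival writer, and all names are hypothetical. "Retries up to 3×" is read as one initial attempt plus up to three retries, which is an assumption.

```typescript
// Hypothetical in-memory stand-ins; not the service's real repository or use case.
interface Aggregate { id: string; version: number; state: string; }
interface UpdateResult { ok: boolean; attempts: number; code?: string; }

class InMemoryRepo {
  private rows: { [id: string]: Aggregate } = {};

  put(a: Aggregate): void {
    this.rows[a.id] = { id: a.id, version: a.version, state: a.state };
  }

  load(id: string): Aggregate {
    const r = this.rows[id];
    return { id: r.id, version: r.version, state: r.state };
  }

  /** Compare-and-swap on version; returns false on a version conflict. */
  save(next: Aggregate, expectedVersion: number): boolean {
    if (this.rows[next.id].version !== expectedVersion) return false;
    this.rows[next.id] = { id: next.id, version: expectedVersion + 1, state: next.state };
    return true;
  }
}

function updateWithRetry(
  repo: InMemoryRepo,
  id: string,
  nextState: string,
  beforeSave: (attempt: number) => void, // test hook: lets a rival write slip in
  maxRetries: number = 3,
): UpdateResult {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const current = repo.load(id); // re-load the aggregate on every attempt
    beforeSave(attempt);
    const saved = repo.save({ id: id, version: current.version, state: nextState }, current.version);
    if (saved) return { ok: true, attempts: attempt + 1 };
  }
  return { ok: false, attempts: maxRetries + 1, code: "MELMASTOON.LOCK.STATE_VERSION_MISMATCH" };
}
```

A test would drive two scenarios through `beforeSave`: a rival write on the first attempt only, where the update should succeed on the second attempt, and a rival write on every attempt, where the result should carry `MELMASTOON.LOCK.STATE_VERSION_MISMATCH`.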

6. Cross-references