
OBSERVABILITY — lock-integration-service

Bundle: SERVICE_OVERVIEW · APPLICATION_LOGIC · SECURITY_MODEL · FAILURE_MODES

Cross-cutting: docs/02 §11 Observability, Cloud Operations Suite (Cloud Logging, Cloud Monitoring, Cloud Trace), Prometheus-compatible metrics scraped by Managed Service for Prometheus.

Lock failures are physically observable: a guest stands in front of a door that won't open. The observability target is therefore aggressive: detect issues before the guest does, not after the complaint email arrives.

1. Telemetry surfaces

| Surface | Backend | Retention |
|---|---|---|
| Structured logs | Cloud Logging → BigQuery sink | 30d hot, 400d cold |
| Metrics | Managed Service for Prometheus | 13mo |
| Traces | Cloud Trace (OpenTelemetry SDK) | 30d, 100% sampled for saga traces, 5% for read APIs |
| Events for analytics | Pub/Sub → BigQuery via analytics-service | per retention class (operational 90d, regulated 7y) |
| Audit Merkle anchors | audit-service → BigQuery + external timestamping | 7y |

2. Logs

2.1 Structured fields (mandatory)

Every log line emits JSON with:

timestamp, severity, message,
service: "lock-integration",
revision: "<cloud-run-revision>",
traceId, spanId,
tenantId, propertyId?,
keyCredentialId?, lockDeviceId?, masterKeyId?,
vendor?, vendorAdapterId?,
sagaName?, sagaStep?, eventId?, idempotencyKey?,
operation?, outcome: "ok" | "error" | "skipped" | "deferred",
errorCode?, errorMessage?,
durationMs?

Logger middleware enforces presence of tenantId for all request-scoped logs (or explicit tenantId: "system" for infra-only paths).
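
The tenantId rule above can be illustrated with a minimal sketch. It is not the service's logger: `LockLogRecord` covers only a subset of the fields listed in 2.1, and `LogInput` and `buildLogLine` are hypothetical names introduced for this example.

```typescript
type Outcome = "ok" | "error" | "skipped" | "deferred";

// Subset of the mandatory fields from 2.1; optional fields mirror the "?" markers.
interface LockLogRecord {
  timestamp: string;
  severity: "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";
  message: string;
  service: "lock-integration";
  revision: string;
  traceId: string;
  spanId: string;
  tenantId: string;
  propertyId?: string;
  vendor?: string;
  operation?: string;
  outcome: Outcome;
  errorCode?: string;
  durationMs?: number;
}

// Hypothetical caller input: the logger fills in timestamp and service itself,
// and tenantId is optional here only so that its absence can be rejected.
type LogInput = Omit<LockLogRecord, "timestamp" | "service" | "tenantId"> & {
  tenantId?: string;
};

// Serializes one log line, refusing records without a tenantId.
function buildLogLine(input: LogInput, now: Date = new Date()): string {
  const tenantId = input.tenantId;
  if (tenantId === undefined || tenantId.trim() === "") {
    throw new Error('tenantId is required; infra-only paths pass tenantId: "system"');
  }
  const record: LockLogRecord = {
    ...input,
    tenantId,
    timestamp: now.toISOString(),
    service: "lock-integration",
  };
  return JSON.stringify(record);
}
```

Throwing is one possible enforcement policy; a production logger might instead emit the line with a marker field and raise a metric, which this document does not specify.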

2.2 Severity policy

| Level | When |
|---|---|
| DEBUG | Saga step entry/exit; vendor adapter request/response shapes (no payload) |
| INFO | Successful state transitions; vendor reachability changes; sync push results |
| WARNING | Retried saga step; circuit-breaker half-open; webhook signature mismatch first occurrence |
| ERROR | Saga compensation triggered; vendor adapter failure after retries; idempotency key reuse with mutated payload |
| CRITICAL | Audit Merkle mismatch; secret leak heuristic match; vendor cred rotation failure |

2.3 Redaction

Per SECURITY_MODEL §7. The redaction filter is a Pub/Sub-publisher-side concern as well: outbox-event payloads are validated against schema before publish; payloads containing forbidden field names are dropped and an outbox.publish_dropped_total{reason="redaction"} metric is incremented.
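
A rough sketch of the publisher-side drop described above. The deny-list in `FORBIDDEN_FIELDS` is invented for illustration (the real list belongs to SECURITY_MODEL §7), and `publishDroppedRedaction` is a plain counter standing in for the `outbox.publish_dropped_total{reason="redaction"}` metric.

```typescript
// Hypothetical deny-list; the authoritative list lives in SECURITY_MODEL §7.
const FORBIDDEN_FIELDS = ["pin", "cardSecret", "vendorApiKey"];

// Returns the first forbidden field name found anywhere in the payload, or null.
function findForbiddenField(
  value: unknown,
  forbidden: string[] = FORBIDDEN_FIELDS
): string | null {
  if (Array.isArray(value)) {
    for (const item of value) {
      const hit = findForbiddenField(item, forbidden);
      if (hit !== null) return hit;
    }
    return null;
  }
  if (typeof value === "object" && value !== null) {
    const obj = value as { [key: string]: unknown };
    for (const key of Object.keys(obj)) {
      if (forbidden.indexOf(key) !== -1) return key;
      const hit = findForbiddenField(obj[key], forbidden);
      if (hit !== null) return hit;
    }
  }
  return null;
}

// Stand-in for the outbox.publish_dropped_total{reason="redaction"} counter.
let publishDroppedRedaction = 0;

// Publishes only clean payloads; returns true when the event was published.
function publishIfClean(payload: unknown, publish: (p: unknown) => void): boolean {
  if (findForbiddenField(payload) !== null) {
    publishDroppedRedaction += 1;
    return false;
  }
  publish(payload);
  return true;
}
```

The check walks nested objects and arrays, so a forbidden name buried inside a sub-document still causes the whole event to be dropped rather than partially redacted.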

3. Metrics

All metrics are prefixed melmastoon_lock_ and labeled with tenant_id (low cardinality — at most thousands per cloud); property_id is added only on metrics where per-property breakdown matters (the catalog below marks them).

3.1 Saga / use case

| Metric | Type | Labels | SLO |
|---|---|---|---|
| melmastoon_lock_saga_runs_total | counter | saga, outcome | |
| melmastoon_lock_saga_duration_seconds | histogram | saga | p95 < 5s, p99 < 15s |
| melmastoon_lock_saga_retries_total | counter | saga, step, error_code | |
| melmastoon_lock_saga_compensations_total | counter | saga, compensation | |
| melmastoon_lock_use_case_runs_total | counter | use_case, outcome | |
| melmastoon_lock_use_case_duration_seconds | histogram | use_case | p95 < 1.5s for non-vendor paths |

3.2 Vendor adapter

| Metric | Type | Labels | SLO |
|---|---|---|---|
| melmastoon_lock_vendor_call_total | counter | vendor, operation, outcome | |
| melmastoon_lock_vendor_call_duration_seconds | histogram | vendor, operation | p95 ≤ vendor SLA + 200ms |
| melmastoon_lock_vendor_circuit_state | gauge (0=closed, 0.5=half_open, 1=open) | vendor, vendor_adapter_id | |
| melmastoon_lock_vendor_health_score | gauge (0–1) | vendor, vendor_adapter_id | ≥ 0.99 sustained |
| melmastoon_lock_vendor_credential_age_days | gauge | vendor, vendor_adapter_id, role | < per-vendor rotation policy |
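
The numeric encoding of `melmastoon_lock_vendor_circuit_state` can be sketched as below, together with a check for a circuit that has stayed open longer than a given duration. `CircuitState`, `GaugeSample`, and both function names are assumptions made for this example; in practice the sustained-open condition would more likely be evaluated by the alerting backend than in application code.

```typescript
type CircuitState = "closed" | "half_open" | "open";

// Gauge encoding from the catalog: 0=closed, 0.5=half_open, 1=open.
function circuitStateToGauge(state: CircuitState): number {
  switch (state) {
    case "closed":
      return 0;
    case "half_open":
      return 0.5;
    default:
      return 1; // "open"
  }
}

interface GaugeSample {
  timestampMs: number;
  value: number;
}

// True when the trailing run of "open" samples has lasted longer than minOpenMs.
// Assumes samples are sorted by timestamp, oldest first.
function openSustained(samples: GaugeSample[], nowMs: number, minOpenMs: number): boolean {
  let openSince: number | null = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value !== 1) break;
    openSince = samples[i].timestampMs;
  }
  return openSince !== null && nowMs - openSince > minOpenMs;
}
```

A half-open sample (0.5) resets the run, so a breaker that is probing the vendor again does not count as continuously open.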

3.3 Webhooks

| Metric | Type | Labels |
|---|---|---|
| melmastoon_lock_webhook_received_total | counter | vendor, kind |
| melmastoon_lock_webhook_signature_failed_total | counter | vendor |
| melmastoon_lock_webhook_processed_total | counter | vendor, kind, outcome |
| melmastoon_lock_webhook_processing_lag_seconds | histogram | vendor, kind |

3.4 Inbox / outbox / sync

| Metric | Type |
|---|---|
| melmastoon_lock_inbox_processed_total{topic, outcome} | counter |
| melmastoon_lock_inbox_lag_seconds{topic} (since publish_at) | gauge |
| melmastoon_lock_outbox_pending (rows where published_at IS NULL) | gauge |
| melmastoon_lock_outbox_publish_lag_seconds | histogram |
| melmastoon_lock_sync_push_events_total{topic, outcome} | counter |
| melmastoon_lock_sync_pull_horizon_size{aggregate} | histogram |

3.5 Domain (per-property where flagged)

| Metric | Type | Labels |
|---|---|---|
| melmastoon_lock_key_credentials_active | gauge | tenant_id, property_id, vendor, kind |
| melmastoon_lock_key_credentials_provisional | gauge | tenant_id, property_id |
| melmastoon_lock_key_credential_failed_total | counter | tenant_id, vendor, failure_reason |
| melmastoon_lock_master_keys_active | gauge | tenant_id, property_id |
| melmastoon_lock_attempts_total | counter | vendor, result, kind |
| melmastoon_lock_attempts_denied_recent | gauge (5-min window) | tenant_id, property_id |
| melmastoon_lock_devices_offline | gauge | tenant_id, property_id |
| melmastoon_lock_devices_battery_low | gauge (battery_pct < 15) | tenant_id, property_id |
| melmastoon_lock_offline_issuance_active_certs | gauge | tenant_id, property_id |
| melmastoon_lock_offline_issuance_pending_reconcile | gauge | tenant_id, property_id |

4. Traces

OpenTelemetry SDK with auto-instrumentation for HTTP server, Pub/Sub publisher, Postgres (pg), and outbound HTTP (vendor calls). Custom spans:

  • IssueSaga.handle (root for saga-triggered work)
  • LockPort.{operation} (per operation)
  • Adapter.{vendor}.{operation} (child of LockPort span)
  • WebhookHandler.{vendor}.process
  • Outbox.publish, Sync.reconcilePush

Trace context propagated through Pub/Sub message attributes (traceparent, tracestate); inbox handler resumes the trace from the upstream service.
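
For orientation, the `traceparent` attribute follows the W3C Trace Context layout `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`. The sketch below formats and parses that value by hand purely to show what travels in the Pub/Sub attributes; the OpenTelemetry propagator normally does this, and `TraceContext`, `toTraceparent`, and `fromAttributes` are hypothetical names.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string; // 16 lowercase hex chars
  sampled: boolean;
}

// Version 00 of the W3C traceparent format.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Builds the attribute value the publisher attaches to the message.
function toTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
}

// Reads the context back on the inbox side; returns null when absent or malformed.
function fromAttributes(attrs: { [key: string]: string }): TraceContext | null {
  const header = attrs["traceparent"];
  if (header === undefined) return null;
  const m = TRACEPARENT_RE.exec(header);
  if (m === null) return null;
  // All-zero trace or span IDs are invalid in the W3C format.
  if (/^0+$/.test(m[1]) || /^0+$/.test(m[2])) return null;
  return {
    traceId: m[1],
    spanId: m[2],
    sampled: (parseInt(m[3], 16) & 1) === 1,
  };
}
```

When `fromAttributes` returns null the inbox handler would start a fresh root span instead of resuming the upstream trace; that fallback is an assumption, not something this document states.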

100% sampling for sagas; 5% for read APIs.
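
One way to picture the two sampling rates is a deterministic, trace-ID-based decision, sketched below. This is an illustrative simplification rather than the OpenTelemetry sampler the service configures; `TraceKind`, `SAMPLE_RATIO`, and `shouldSample` are names invented for the example.

```typescript
type TraceKind = "saga" | "read_api";

// Ratios from this section: sagas are always sampled, read APIs at 5%.
const SAMPLE_RATIO: { [K in TraceKind]: number } = {
  saga: 1.0,
  read_api: 0.05,
};

// Decides from the trace ID alone, so every service seeing the same
// trace ID reaches the same decision.
function shouldSample(kind: TraceKind, traceId: string): boolean {
  const ratio = SAMPLE_RATIO[kind];
  if (ratio >= 1) return true;
  // Map the last 8 hex chars of the trace ID onto [0, 1).
  const position = parseInt(traceId.slice(-8), 16) / 0x100000000;
  // A non-hex trace ID yields NaN, which compares false and is not sampled.
  return position < ratio;
}
```

Deriving the decision from the trace ID keeps a trace either fully sampled or fully dropped across services, which a per-span random draw would not guarantee.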

5. SLOs

| SLO | Target | Window |
|---|---|---|
| Saga IssueSaga end-to-end (event received → lock.credential.issued.v1 published) | p95 < 5s, p99 < 15s | 30d rolling |
| Saga RevokeSaga end-to-end | p95 < 4s, p99 < 12s | 30d |
| LockPort.issueCredential per-vendor success ratio | ≥ 99.0% | 7d |
| Vendor health score sustained | ≥ 0.99 | 7d |
| Webhook processing lag (received → processed) | p95 < 10s, p99 < 60s | 7d |
| API availability (5xx rate excluding 503 from explicit circuit-break) | ≥ 99.95% | 30d |
| Audit Merkle anchor freshness | ≤ 26h | 30d |

Each SLO is computed by a Prometheus recording rule and exposed in the Lock SRE dashboard.
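
To make the latency targets concrete, the sketch below estimates a percentile from cumulative histogram buckets by linear interpolation, the same general approach Prometheus uses for histogram quantiles, and checks it against the IssueSaga thresholds. It is an illustration only: the real SLOs come from recording rules, and `Bucket`, `histogramQuantile`, and `meetsIssueSagaSlo` are hypothetical names.

```typescript
// One cumulative bucket: count of observations with value <= le (seconds).
interface Bucket {
  le: number;
  count: number;
}

// Estimates quantile q (0..1) by linear interpolation inside the matching bucket.
// Assumes buckets are sorted by le, cumulative, and end with le = Infinity.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      if (b.le === Infinity) return prevLe; // cannot interpolate into +Inf
      if (b.count === prevCount) return b.le; // guards division by zero
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// IssueSaga targets from the SLO table: p95 < 5s and p99 < 15s.
function meetsIssueSagaSlo(buckets: Bucket[]): boolean {
  return histogramQuantile(0.95, buckets) < 5 && histogramQuantile(0.99, buckets) < 15;
}
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries sitting near the SLO thresholds (here 5s and 15s).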

6. Alerts

6.1 Page-the-on-call

| Alert | Trigger |
|---|---|
| LockSagaErrorBudgetBurn | IssueSaga error budget burn rate > 14× over 1h (per 02 §11 burn-rate alert pattern) |
| LockVendorAdapterDown | melmastoon_lock_vendor_circuit_state == 1 for > 5min for any tenant |
| LockWebhookFloodAnomaly | webhook_signature_failed_total > 50/min on any vendor for 5 min |
| LockAuditMerkleMismatch | Daily Merkle re-verification job emits mismatch=true |
| LockOutboxBacklog | melmastoon_lock_outbox_pending > 10 000 for 10 min |
| LockMasterKeyAnomalyHigh | HITL Decision created with score >= 0.95 and not acknowledged in 15 min |
| LockOfflineCertNoReconcileLong | offline_issuance_pending_reconcile > 50 per device for > 24h (suggests a desktop is offline forever) |
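
The burn-rate trigger for LockSagaErrorBudgetBurn can be read as: observed error ratio in the window, divided by the error budget (1 minus the SLO target). The sketch below works through that arithmetic. The 99.0% success target used in the usage note is an assumption for illustration, since this document does not state a success-ratio target for IssueSaga; `burnRate` and `shouldPage` are hypothetical names.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
function burnRate(errors: number, total: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) {
    throw new Error("sloTarget must be below 1 to leave an error budget");
  }
  if (total === 0) return 0; // no traffic in the window, no budget consumed
  return errors / total / budget;
}

// Pages when the window's burn rate exceeds the threshold (14x per the alert table).
function shouldPage(
  errors: number,
  total: number,
  sloTarget: number,
  threshold: number = 14
): boolean {
  return burnRate(errors, total, sloTarget) > threshold;
}
```

With an assumed 99.0% target, 150 failed runs out of 1000 in the last hour is a 15% error ratio against a 1% budget, so the burn rate is about 15 and the alert would page; 100 failures out of 1000 gives about 10 and would not.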

6.2 Ticket-only

| Alert | Trigger |
|---|---|
| LockBatteryLowTrend | devices_battery_low increased > 20% week-over-week per property |
| LockVendorCredentialAging | vendor_credential_age_days > 0.8 × rotation policy |
| LockProvisionalReconcileLag | A provisional credential not reconciled within 1h after a known-online desktop |
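
Two of the ticket-only triggers reduce to simple threshold arithmetic, sketched here as plain predicates. The function names are hypothetical, and the handling of a zero baseline in the battery trend is an assumption the document does not address.

```typescript
// LockVendorCredentialAging: age has passed 80% of the vendor's rotation policy.
function credentialAging(ageDays: number, rotationPolicyDays: number): boolean {
  return ageDays > 0.8 * rotationPolicyDays;
}

// LockBatteryLowTrend: low-battery device count grew more than 20% week over week.
// Assumption: a zero baseline never fires, since relative growth is undefined there.
function batteryLowTrend(lastWeekCount: number, thisWeekCount: number): boolean {
  if (lastWeekCount <= 0) return false;
  return (thisWeekCount - lastWeekCount) / lastWeekCount > 0.2;
}
```

For a 90-day rotation policy the aging ticket opens once a credential is older than 72 days; a property going from 10 to 13 low-battery devices is a 30% increase and would open the trend ticket.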

7. Dashboards

Three Cloud Monitoring dashboards (also exported to Grafana for SREs):

  1. Lock — Service Health. Saga rates and latencies, vendor adapter health by vendor, outbox/inbox depth, error rates.
  2. Lock — Per Tenant. Active credentials, provisional count, offline desktops, recent denied attempts, anomaly Decision queue depth. Top 20 tenants tile.
  3. Lock — Per Property. Drill-down: per-room credential status, device health (battery, offline), recent attempts timeline, encoder session activity.

The GM-facing operational view (not an SRE dashboard) lives in the backoffice frontend and reads via analytics-service materialized views, not directly from Prometheus.

8. Runbook hooks

Each alert links to a runbook in runbooks/lock-integration/:

  • LOCK-VENDOR-DOWN.md — when an adapter circuit opens
  • LOCK-OUTBOX-BACKLOG.md — when publisher lags
  • LOCK-AUDIT-MERKLE-MISMATCH.md — P1
  • LOCK-WEBHOOK-FLOOD.md — Cloud Armor + tenant rotation
  • LOCK-MASTER-KEY-INCIDENT.md — invocation flow with security
  • LOCK-OFFLINE-RECONCILE-STUCK.md — desktop forensic + manual reconcile

9. Synthetic checks

  • Vendor sandbox health: per-vendor synthetic monitor mints a sandbox credential, verifies it is active, revokes it. Runs every 5 min from each region. Failure ⇒ adapter health gauge degrades + LockVendorAdapterDown if sustained.
  • Webhook receiver: a self-call signed with the test secret hits the webhook endpoint every 60s; verifies signature acceptance + inbox insertion.
  • Sync round-trip: a fake desktop client pushes a no-op lock.encoder_session.opened.local.v1 and verifies materialization.
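
The vendor sandbox check (mint, verify active, revoke) could be structured roughly as follows. `SandboxLockPort` is a hypothetical interface invented for this sketch; the actual LockPort surface and the way results feed the health gauge are not shown in this document.

```typescript
// Hypothetical sandbox-facing port; not the service's real LockPort.
interface SandboxLockPort {
  issueCredential(): Promise<string>; // resolves to the new credential id
  getStatus(credentialId: string): Promise<string>;
  revokeCredential(credentialId: string): Promise<void>;
}

type CheckStep = "issue" | "verify" | "revoke";

interface CheckResult {
  ok: boolean;
  failedStep?: CheckStep;
  durationMs: number;
}

// Runs one mint -> verify -> revoke cycle and reports the first failing step.
async function runSandboxCheck(
  port: SandboxLockPort,
  nowMs: () => number = () => Date.now()
): Promise<CheckResult> {
  const start = nowMs();
  const done = (failedStep?: CheckStep): CheckResult => ({
    ok: failedStep === undefined,
    failedStep,
    durationMs: nowMs() - start,
  });

  const credentialId = await port.issueCredential().catch(() => null);
  if (credentialId === null) return done("issue");

  const verified = await port
    .getStatus(credentialId)
    .then((s) => s === "active", () => false);

  // Always attempt the revoke so a failed verify does not leave a sandbox credential behind.
  const revoked = await port
    .revokeCredential(credentialId)
    .then(() => true, () => false);

  if (!verified) return done("verify");
  if (!revoked) return done("revoke");
  return done();
}
```

A scheduler would call `runSandboxCheck` once per vendor and region every 5 minutes and translate a result with `ok: false` into a degraded health gauge; that wiring is left out here.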

10. Cross-references