OBSERVABILITY — lock-integration-service
Bundle: SERVICE_OVERVIEW · APPLICATION_LOGIC · SECURITY_MODEL · FAILURE_MODES
Cross-cutting: docs/02 §11 Observability, Cloud Operations Suite (Cloud Logging, Cloud Monitoring, Cloud Trace), Prometheus-compatible metrics scraped by Managed Service for Prometheus.
Lock failures are physically observable: a guest stands in front of a door that won't open. The observability target is therefore aggressive: detect issues before the guest does, not after the complaint email arrives.
1. Telemetry surfaces
| Surface | Backend | Retention |
|---|---|---|
| Structured logs | Cloud Logging → BigQuery sink | 30d hot, 400d cold |
| Metrics | Managed Service for Prometheus | 13mo |
| Traces | Cloud Trace (OpenTelemetry SDK) | 30d, 100% sampled for saga traces, 5% for read APIs |
| Events for analytics | Pub/Sub → BigQuery via analytics-service | per retention class (operational 90d, regulated 7y) |
| Audit Merkle anchors | audit-service → BigQuery + external timestamping | 7y |
2. Logs
2.1 Structured fields (mandatory)
Every log line emits JSON with:
timestamp, severity, message,
service: "lock-integration",
revision: "<cloud-run-revision>",
traceId, spanId,
tenantId, propertyId?,
keyCredentialId?, lockDeviceId?, masterKeyId?,
vendor?, vendorAdapterId?,
sagaName?, sagaStep?, eventId?, idempotencyKey?,
operation?, outcome: "ok" | "error" | "skipped" | "deferred",
errorCode?, errorMessage?,
durationMs?
Logger middleware enforces presence of tenantId for all request-scoped logs (or explicit tenantId: "system" for infra-only paths).
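
The enforcement rule above can be sketched as a small log-line builder. This is an illustration only, assuming a hypothetical `build_log_line` helper; the real middleware and its signature are not described in this document. Field names follow the list in §2.1.

```python
import json
from datetime import datetime, timezone

SERVICE_NAME = "lock-integration"
ALLOWED_OUTCOMES = {"ok", "error", "skipped", "deferred"}


def build_log_line(severity, message, *, tenant_id, trace_id, span_id,
                   revision="unknown", **optional_fields):
    """Return one JSON log line; refuse to build it without a tenantId."""
    if not tenant_id:
        # Infra-only paths must pass tenant_id="system" explicitly.
        raise ValueError("tenantId is mandatory; use 'system' for infra-only paths")
    outcome = optional_fields.get("outcome")
    if outcome is not None and outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    mandatory = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "message": message,
        "service": SERVICE_NAME,
        "revision": revision,
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
    }
    # Optional fields (propertyId, sagaName, durationMs, ...) are emitted only
    # when set, and can never overwrite a mandatory field.
    optional = {k: v for k, v in optional_fields.items() if v is not None}
    return json.dumps({**optional, **mandatory})
```

Raising on a missing tenantId, rather than silently defaulting to "system", keeps the infra-only case an explicit choice by the caller.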
2.2 Severity policy
| Level | When |
|---|---|
| DEBUG | Saga step entry/exit; vendor adapter request/response shapes (no payload) |
| INFO | Successful state transitions; vendor reachability changes; sync push results |
| WARNING | Retried saga step; circuit-breaker half-open; webhook signature mismatch first occurrence |
| ERROR | Saga compensation triggered; vendor adapter failure after retries; idempotency key reuse with mutated payload |
| CRITICAL | Audit Merkle mismatch; secret leak heuristic match; vendor cred rotation failure |
2.3 Redaction
Log redaction follows SECURITY_MODEL §7. Redaction is also enforced on the Pub/Sub publisher side: outbox-event payloads are validated against their schema before publish, and any payload containing a forbidden field name is dropped and counted in outbox.publish_dropped_total{reason="redaction"}.
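
A minimal sketch of the publisher-side drop rule, assuming a made-up forbidden-field list (the authoritative list is in SECURITY_MODEL §7) and a plain `Counter` standing in for the Prometheus counter. `publish_or_drop` and `contains_forbidden_field` are hypothetical names.

```python
from collections import Counter

# Hypothetical forbidden names, for illustration only.
FORBIDDEN_FIELDS = frozenset({"pin", "cardSecret", "vendorApiKey"})

# Stand-in for outbox.publish_dropped_total, keyed by the "reason" label.
publish_dropped_total = Counter()


def contains_forbidden_field(payload):
    """Recursively check dict keys, descending into nested dicts and lists."""
    if isinstance(payload, dict):
        return any(
            key in FORBIDDEN_FIELDS or contains_forbidden_field(value)
            for key, value in payload.items()
        )
    if isinstance(payload, list):
        return any(contains_forbidden_field(item) for item in payload)
    return False


def publish_or_drop(payload, publish):
    """Publish unless a forbidden field is present; return True if published."""
    if contains_forbidden_field(payload):
        publish_dropped_total["redaction"] += 1
        return False
    publish(payload)
    return True
```

The check walks nested structures because a forbidden name buried inside a vendor sub-object is as much a leak as one at the top level.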
3. Metrics
All metrics are prefixed melmastoon_lock_ and labeled with tenant_id (low cardinality — at most thousands per cloud); property_id is added only on metrics where per-property breakdown matters (the catalog below marks them).
3.1 Saga / use case
| Metric | Type | Labels | SLO |
|---|---|---|---|
| melmastoon_lock_saga_runs_total | counter | saga, outcome | n/a |
| melmastoon_lock_saga_duration_seconds | histogram | saga | p95 < 5s, p99 < 15s |
| melmastoon_lock_saga_retries_total | counter | saga, step, error_code | n/a |
| melmastoon_lock_saga_compensations_total | counter | saga, compensation | n/a |
| melmastoon_lock_use_case_runs_total | counter | use_case, outcome | n/a |
| melmastoon_lock_use_case_duration_seconds | histogram | use_case | p95 < 1.5s for non-vendor paths |
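
The latency SLOs in this table are quantiles estimated from histogram buckets. The sketch below shows one way such an estimate works, using linear interpolation inside the bucket that contains the target rank (the same idea PromQL's histogram_quantile applies to classic histograms). The bucket bounds and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # No upper edge to interpolate towards: report the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for melmastoon_lock_saga_duration_seconds.
saga_buckets = [(0.5, 50), (1.0, 80), (2.5, 90), (5.0, 96), (10.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, saga_buckets)
p99 = histogram_quantile(0.99, saga_buckets)
meets_slo = p95 < 5.0 and p99 < 15.0
```

Because the estimate is interpolated, its precision depends on bucket layout: bounds placed at or near the SLO thresholds (5s and 15s here) make the comparison more trustworthy.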
3.2 Vendor adapter
| Metric | Type | Labels | SLO |
|---|---|---|---|
| melmastoon_lock_vendor_call_total | counter | vendor, operation, outcome | n/a |
| melmastoon_lock_vendor_call_duration_seconds | histogram | vendor, operation | p95 ≤ vendor SLA + 200ms |
| melmastoon_lock_vendor_circuit_state | gauge (0=closed, 0.5=half_open, 1=open) | vendor, vendor_adapter_id | n/a |
| melmastoon_lock_vendor_health_score | gauge (0–1) | vendor, vendor_adapter_id | ≥ 0.99 sustained |
| melmastoon_lock_vendor_credential_age_days | gauge | vendor, vendor_adapter_id, role | < per-vendor rotation policy |
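
The circuit-state gauge encodes three states as numbers, and the LockVendorAdapterDown alert in §6.1 fires only when the value stays at 1 for more than 5 minutes. A minimal sketch of both ideas, assuming samples arrive as (unix_timestamp, value) pairs in time order; `open_for_longer_than` is a hypothetical helper, not the alerting engine's logic.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = 0.0
    HALF_OPEN = 0.5
    OPEN = 1.0


def open_for_longer_than(samples, seconds):
    """True if the most recent unbroken run of OPEN samples spans more than `seconds`."""
    run_start = None
    for timestamp, value in samples:
        if value == CircuitState.OPEN.value:
            if run_start is None:
                run_start = timestamp
        else:
            # Any closed or half-open sample resets the run.
            run_start = None
    if run_start is None:
        return False
    return samples[-1][0] - run_start > seconds
```

A half-open sample resets the run, so an adapter that keeps probing and failing is treated differently from one that has been hard-open throughout.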
3.3 Webhooks
| Metric | Type | Labels |
|---|---|---|
| melmastoon_lock_webhook_received_total | counter | vendor, kind |
| melmastoon_lock_webhook_signature_failed_total | counter | vendor |
| melmastoon_lock_webhook_processed_total | counter | vendor, kind, outcome |
| melmastoon_lock_webhook_processing_lag_seconds | histogram | vendor, kind |
3.4 Inbox / outbox / sync
| Metric | Type |
|---|---|
| melmastoon_lock_inbox_processed_total{topic, outcome} | counter |
| melmastoon_lock_inbox_lag_seconds{topic} (since publish_at) | gauge |
| melmastoon_lock_outbox_pending (rows where published_at IS NULL) | gauge |
| melmastoon_lock_outbox_publish_lag_seconds | histogram |
| melmastoon_lock_sync_push_events_total{topic, outcome} | counter |
| melmastoon_lock_sync_pull_horizon_size{aggregate} | histogram |
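
The two outbox metrics in this table can be derived directly from the outbox table. The sketch below uses an in-memory SQLite database so that it runs standalone; the service itself uses Postgres, and the table layout (`created_at`, `published_at` as epoch seconds) is an assumption. Only the `published_at IS NULL` condition comes from the catalog.

```python
import sqlite3

# Hypothetical outbox schema with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY, created_at REAL NOT NULL, published_at REAL)"
)
conn.executemany(
    "INSERT INTO outbox (created_at, published_at) VALUES (?, ?)",
    [(100.0, 100.5), (101.0, 103.0), (102.0, None), (103.0, None)],
)


def outbox_pending(db):
    """Gauge value: rows not yet published."""
    (count,) = db.execute(
        "SELECT COUNT(*) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    return count


def publish_lags(db):
    """Per-row publish lag in seconds: the observations a lag histogram would record."""
    rows = db.execute(
        "SELECT published_at - created_at FROM outbox "
        "WHERE published_at IS NOT NULL ORDER BY id"
    ).fetchall()
    return [lag for (lag,) in rows]
```

The pending gauge is a point-in-time count, while the lag histogram only sees rows that were eventually published, so a stalled publisher shows up in the gauge first.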
3.5 Domain (per-property where flagged)
| Metric | Type | Labels |
|---|---|---|
| melmastoon_lock_key_credentials_active | gauge | tenant_id, property_id, vendor, kind |
| melmastoon_lock_key_credentials_provisional | gauge | tenant_id, property_id |
| melmastoon_lock_key_credential_failed_total | counter | tenant_id, vendor, failure_reason |
| melmastoon_lock_master_keys_active | gauge | tenant_id, property_id |
| melmastoon_lock_attempts_total | counter | vendor, result, kind |
| melmastoon_lock_attempts_denied_recent | gauge (5-min window) | tenant_id, property_id |
| melmastoon_lock_devices_offline | gauge | tenant_id, property_id |
| melmastoon_lock_devices_battery_low | gauge (battery_pct < 15) | tenant_id, property_id |
| melmastoon_lock_offline_issuance_active_certs | gauge | tenant_id, property_id |
| melmastoon_lock_offline_issuance_pending_reconcile | gauge | tenant_id, property_id |
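
Two gauges in this table are defined by a rule rather than a raw count: attempts_denied_recent (a 5-minute window) and devices_battery_low (battery_pct < 15). A minimal sketch of how each value could be computed for one property, assuming denied attempts are recorded in time order; the class and function names are hypothetical.

```python
from collections import deque


class DeniedAttemptsWindow:
    """Sliding-window count of denied attempts for one (tenant, property)."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_denied(self, timestamp):
        # Assumes timestamps are appended in non-decreasing order.
        self._timestamps.append(timestamp)

    def value(self, now):
        """Gauge value: denied attempts newer than now - window_seconds."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def battery_low_count(devices, threshold_pct=15):
    """Gauge value: devices whose battery_pct is strictly below the threshold."""
    return sum(1 for device in devices if device["battery_pct"] < threshold_pct)
```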
4. Traces
OpenTelemetry SDK with auto-instrumentation for HTTP server, Pub/Sub publisher, Postgres (pg), and outbound HTTP (vendor calls). Custom spans:
- IssueSaga.handle (root for saga-triggered work)
- LockPort.{operation} (per operation)
- Adapter.{vendor}.{operation} (child of LockPort span)
- WebhookHandler.{vendor}.process
- Outbox.publish, Sync.reconcilePush
Trace context propagated through Pub/Sub message attributes (traceparent, tracestate); inbox handler resumes the trace from the upstream service.
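
The traceparent attribute follows the W3C Trace Context format `version-traceid-parentid-flags`. In practice the OpenTelemetry propagator handles this; the sketch below only illustrates what travels in the Pub/Sub message attributes, and handles the version-00 shape only. `inject_traceparent` and `extract_traceparent` are hypothetical helper names.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def inject_traceparent(attributes, trace_id, span_id, sampled):
    """Publisher side: return a copy of the message attributes with traceparent set."""
    flags = "01" if sampled else "00"
    out = dict(attributes)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract_traceparent(attributes):
    """Inbox side: parse traceparent, or return None if it is absent or invalid."""
    match = TRACEPARENT_RE.match(attributes.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

When extraction returns None, the inbox handler would start a new root trace rather than fail the message.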
100% sampling for sagas; 5% for read APIs.
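
A sketch of how a deterministic ratio sampler can produce these two rates. It compares the low 64 bits of the trace id against a threshold, which is the general approach of trace-id-ratio samplers; this is not the OpenTelemetry SDK's code, and the `kind` keys are illustrative.

```python
SAMPLE_RATIOS = {"saga": 1.0, "read_api": 0.05}


def should_sample(trace_id_hex, kind):
    """Decide from the trace id alone, so every service makes the same decision."""
    ratio = SAMPLE_RATIOS[kind]
    # Threshold over the 64-bit space; ratio 1.0 admits every possible id.
    bound = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex[16:], 16)
    return low_64_bits < bound
```

Deriving the decision from the trace id, rather than from a random draw per service, keeps a trace either fully recorded or fully dropped across hops.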
5. SLOs
| SLO | Target | Window |
|---|---|---|
| Saga IssueSaga end-to-end (event received → lock.credential.issued.v1 published) | p95 < 5s, p99 < 15s | 30d rolling |
| Saga RevokeSaga end-to-end | p95 < 4s, p99 < 12s | 30d |
| LockPort.issueCredential per-vendor success ratio | ≥ 99.0% | 7d |
| Vendor health score sustained | ≥ 0.99 | 7d |
| Webhook processing lag (received → processed) | p95 < 10s, p99 < 60s | 7d |
| API availability (share of non-5xx responses; 503s from explicit circuit-break excluded) | ≥ 99.95% | 30d |
| Audit Merkle anchor freshness | ≤ 26h | 30d |
Each SLO is computed by a Prometheus recording rule and exposed in the Lock SRE dashboard.
6. Alerts
6.1 Page-the-on-call
| Alert | Trigger |
|---|---|
| LockSagaErrorBudgetBurn | IssueSaga error budget burn rate > 14× over 1h (per 02 §11 burn-rate alert pattern) |
| LockVendorAdapterDown | melmastoon_lock_vendor_circuit_state == 1 for > 5min for any tenant |
| LockWebhookFloodAnomaly | webhook_signature_failed_total > 50/min on any vendor for 5 min |
| LockAuditMerkleMismatch | Daily Merkle re-verification job emits mismatch=true |
| LockOutboxBacklog | melmastoon_lock_outbox_pending > 10 000 for 10 min |
| LockMasterKeyAnomalyHigh | HITL Decision created with score >= 0.95 and not acknowledged in 15 min |
| LockOfflineCertNoReconcileLong | offline_issuance_pending_reconcile > 50 per device for > 24h (suggests a desktop is offline forever) |
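
For the burn-rate trigger in LockSagaErrorBudgetBurn, the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below works through the arithmetic; the 99% target and the run counts are invented for the example, since the budget definition lives in 02 §11.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    if total_events == 0:
        return 0.0
    error_ratio = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def budget_consumed(rate, window_hours, slo_period_hours):
    """Fraction of the whole period's budget spent if `rate` holds for the window."""
    return rate * window_hours / slo_period_hours


# Hypothetical 1h window: 10 000 IssueSaga runs, 1 500 outside the objective,
# against an assumed 99% target.
rate = burn_rate(bad_events=1500, total_events=10000, slo_target=0.99)
should_page = rate > 14
spent_at_threshold = budget_consumed(14, window_hours=1, slo_period_hours=30 * 24)
```

At the 14× threshold, one hour of sustained burn spends roughly 1.9% of a 30-day budget, which is why a short window with a high multiplier is suitable for paging.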
6.2 Ticket-only
| Alert | Trigger |
|---|---|
| LockBatteryLowTrend | devices_battery_low increased > 20% week-over-week per property |
| LockVendorCredentialAging | vendor_credential_age_days > 0.8 × rotation policy |
| LockProvisionalReconcileLag | A provisional credential is still unreconciled 1h after its desktop is known to be online |
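
The first two ticket-only triggers are simple threshold predicates. A sketch under stated assumptions: the function names are hypothetical, and treating "zero low-battery devices last week, some this week" as a trend is a choice made here, not something the table specifies.

```python
def credential_aging(age_days, rotation_policy_days, fraction=0.8):
    """LockVendorCredentialAging: age exceeds 80% of the vendor's rotation policy."""
    return age_days > fraction * rotation_policy_days


def battery_low_trend(this_week, last_week, threshold=0.20):
    """LockBatteryLowTrend: low-battery device count grew by more than 20%."""
    if last_week == 0:
        # Assumption: any appearance from a zero baseline counts as a trend.
        return this_week > 0
    return (this_week - last_week) / last_week > threshold
```

For a 90-day rotation policy, the aging ticket opens once a credential passes 72 days, leaving about 18 days to rotate before the policy is breached.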
7. Dashboards
Three Cloud Monitoring dashboards (also exported to Grafana for SREs):
- Lock — Service Health. Saga rates and latencies, vendor adapter health by vendor, outbox/inbox depth, error rates.
- Lock — Per Tenant. Active credentials, provisional count, offline desktops, recent denied attempts, anomaly Decision queue depth. Top 20 tenants tile.
- Lock — Per Property. Drill-down: per-room credential status, device health (battery, offline), recent attempts timeline, encoder session activity.
The GM-facing operational view (not an SRE dashboard) lives in the backoffice frontend and reads via analytics-service materialized views, not directly from Prometheus.
8. Runbook hooks
Each alert links to a runbook in runbooks/lock-integration/:
- LOCK-VENDOR-DOWN.md: when an adapter circuit opens
- LOCK-OUTBOX-BACKLOG.md: when publisher lags
- LOCK-AUDIT-MERKLE-MISMATCH.md: P1
- LOCK-WEBHOOK-FLOOD.md: Cloud Armor + tenant rotation
- LOCK-MASTER-KEY-INCIDENT.md: invocation flow with security
- LOCK-OFFLINE-RECONCILE-STUCK.md: desktop forensic + manual reconcile
9. Synthetic checks
- Vendor sandbox health: per-vendor synthetic monitor mints a sandbox credential, verifies it is active, revokes it. Runs every 5 min from each region. Failure ⇒ adapter health gauge degrades + LockVendorAdapterDown if sustained.
- Webhook receiver: a self-call signed with the test secret hits the webhook endpoint every 60s; verifies signature acceptance + inbox insertion.
- Sync round-trip: a fake desktop client pushes a no-op lock.encoder_session.opened.local.v1 and verifies materialization.
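
The webhook receiver check can be sketched as a signed self-call against an in-process stand-in for the endpoint. This assumes an HMAC-SHA256 signature over the raw body, which is a common webhook scheme but not confirmed here for any vendor; the status codes, payload, and function names are likewise illustrative.

```python
import hashlib
import hmac


def sign(secret, body):
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def make_receiver(secret, inbox):
    """Stand-in for the webhook endpoint: verify the signature, then insert into the inbox."""
    def receive(body, signature_hex):
        expected = sign(secret, body)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected, signature_hex):
            return 401
        inbox.append(body)
        return 202
    return receive


def synthetic_webhook_check(receive, inbox, test_secret):
    """Pass only if the signed call is accepted and exactly one inbox row appears."""
    body = b'{"kind":"synthetic.ping"}'
    before = len(inbox)
    status = receive(body, sign(test_secret, body))
    return status == 202 and len(inbox) == before + 1
```

Checking the inbox row as well as the status code catches the case where the endpoint accepts a request but the insert silently fails.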
10. Cross-references
- 02 §11 Observability
- FAILURE_MODES — error → metric mapping
- SECURITY_MODEL §6 Audit immutability