housekeeping-service — OBSERVABILITY
Logs (Cloud Logging), metrics (Cloud Monitoring + Datadog), traces (Cloud Trace + OpenTelemetry), errors (Sentry). SLOs declared in
slo.yaml; alert routes in alerts.yaml. Aligned with the platform standard.
1. SLOs
| SLO | Target | Window | Indicator |
|---|---|---|---|
| Turnover task auto-create latency | p95 < 2 s | 7-day rolling | event_received_at → outbox_appended_at for task.created.v1 from reservation.checked_out.v1 |
| Board read latency | p99 < 250 ms | 7-day rolling | GET /board server time |
| Board write latency | p99 < 400 ms | 7-day rolling | POST /tasks/*/assign server time including outbox commit |
| API availability | 99.9% | 30-day rolling | (1 - 5xx_rate) on /api/v1/housekeeping/* |
| Event consumer lag | p99 < 30 s | 1-day rolling | Pub/Sub oldest unacked age per consumed subscription |
| Sync push success | ≥ 99% | 7-day rolling | (applied + conflict_recoverable) / total |
Error budgets and burn-rate alerts (1h/6h/3d) are defined in slo.yaml.
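To make the burn-rate figures concrete, here is a minimal sketch of how a burn rate could be computed and compared against the thresholds used by the ErrorBudgetBurnFast and ErrorBudgetBurnSlow alerts. The function names are hypothetical, and the real evaluation lives in slo.yaml and the monitoring backend, not in application code.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 spends the budget exactly over the SLO window.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents === 0) return 0; // no traffic, nothing burned
  const errorBudget = 1 - sloTarget;
  return badEvents / totalEvents / errorBudget;
}

type BurnAlert = "page" | "ticket" | "none";

// Thresholds mirror the alert table: 1h burn > 14x pages, 6h burn > 6x opens a ticket.
function classifyBurn(oneHourRate: number, sixHourRate: number): BurnAlert {
  if (oneHourRate > 14) return "page";
  if (sixHourRate > 6) return "ticket";
  return "none";
}
```

For the 99.9% availability SLO, 15 failed requests out of 1000 in the last hour give a burn rate of roughly 15, which would classify as a page.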
2. Logs
Structured JSON, 1 line per request/event. Mandatory fields:
{
"ts":"2026-04-22T12:00:00.123Z",
"level":"INFO",
"service":"housekeeping-service",
"version":"1.4.2",
"tenantId":"tnt_…", // never the raw value in lower envs; hashed in production "tnt_h_<sha256-prefix>"
"requestId":"req_…",
"traceId":"…",
"spanId":"…",
"actor":{"type":"user","id":"stf_…"},
"route":"POST /tasks/:id/complete",
"useCase":"CompleteTaskUseCase",
"taskId":"hkt_…",
"roomId":"rom_…",
"outcome":"ok",
"latencyMs":143,
"msg":"task completed"
}
Levels: DEBUG (off in prod), INFO (default), WARN (recoverable), ERROR (use-case failure), FATAL (process-killing). Sensitive fields redacted by redactor.middleware.ts per SECURITY_MODEL.md §8. Retention: 30 days hot, 365 days archived in GCS.
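As an illustration of the one-line-per-event rule, the sketch below types the mandatory fields and serializes an entry to a single JSON line, dropping DEBUG in production. The `LogEntry` type, the `formatLogLine` helper, and the `env` parameter are assumptions made for this example. Redaction is not modelled here because it is handled by redactor.middleware.ts.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

// Field names follow the sample entry above; the type itself is hypothetical.
interface LogEntry {
  ts: string;
  level: Level;
  service: string;
  version: string;
  tenantId: string;
  requestId: string;
  traceId: string;
  spanId: string;
  actor: { type: string; id: string };
  route: string;
  useCase: string;
  taskId: string;
  roomId: string;
  outcome: string;
  latencyMs: number;
  msg: string;
}

// Returns one JSON line, or null when the entry must not be emitted.
function formatLogLine(entry: LogEntry, env: "prod" | "non-prod"): string | null {
  if (env === "prod" && entry.level === "DEBUG") return null; // DEBUG is off in prod
  // JSON.stringify without an indent argument never emits raw newlines,
  // so the result stays on a single line even if msg contains "\n".
  return JSON.stringify(entry);
}

const example: LogEntry = {
  ts: "2026-04-22T12:00:00.123Z",
  level: "INFO",
  service: "housekeeping-service",
  version: "1.4.2",
  tenantId: "tnt_…",
  requestId: "req_…",
  traceId: "…",
  spanId: "…",
  actor: { type: "user", id: "stf_…" },
  route: "POST /tasks/:id/complete",
  useCase: "CompleteTaskUseCase",
  taskId: "hkt_…",
  roomId: "rom_…",
  outcome: "ok",
  latencyMs: 143,
  msg: "task completed",
};
```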
3. Metrics
Naming: melmastoon.housekeeping.<area>.<measure> with labels (tenant_hash, property_id, outcome, route|use_case|topic). Histograms preferred over averages.
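A small helper can keep metric names consistent with this scheme. The sketch below is hypothetical: `metricName` is not an existing function, and the lowercase snake_case segment rule is an assumption inferred from the metric names listed in this section.

```typescript
const METRIC_PREFIX = "melmastoon.housekeeping";

// Assumed rule: dot-separated segments, each lowercase snake_case.
const SEGMENTS = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$/;

// Builds melmastoon.housekeeping.<area>.<measure>; area may itself be dotted (e.g. "ai.routing").
function metricName(area: string, measure: string): string {
  if (!SEGMENTS.test(area) || !SEGMENTS.test(measure)) {
    throw new Error(`invalid metric segment: ${area}.${measure}`);
  }
  return `${METRIC_PREFIX}.${area}.${measure}`;
}
```

For example, `metricName("api", "latency_ms")` yields `melmastoon.housekeeping.api.latency_ms`.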
3.1 Hot-path
- melmastoon.housekeeping.api.requests (counter; labels: route, status_code)
- melmastoon.housekeeping.api.latency_ms (histogram; labels: route)
- melmastoon.housekeeping.use_case.duration_ms (histogram; labels: use_case, outcome)
- melmastoon.housekeeping.outbox.append_to_publish_lag_ms (histogram)
- melmastoon.housekeeping.outbox.unpublished_rows (gauge; alert > 1k for 5m)
3.2 Domain
- melmastoon.housekeeping.tasks.created (counter; labels: kind, source)
- melmastoon.housekeeping.tasks.completed (counter; labels: kind, with_inspection)
- melmastoon.housekeeping.tasks.duration_minutes (histogram; labels: kind)
- melmastoon.housekeeping.room.time_to_ready_minutes (histogram; labels: property_hash)
- melmastoon.housekeeping.tasks.escalated (counter; labels: reason)
- melmastoon.housekeeping.tasks.failed (counter; labels: reason)
- melmastoon.housekeeping.linen.on_hand (gauge; labels: property_hash, line)
- melmastoon.housekeeping.linen.low_stock_alerts_emitted (counter)
- melmastoon.housekeeping.shift.staffing_gap_detected (counter)
- melmastoon.housekeeping.inspection.outcome (counter; labels: outcome)
3.3 Sync
- melmastoon.housekeeping.sync.pull.row_count (histogram; labels: aggregate)
- melmastoon.housekeeping.sync.push.ops (counter; labels: outcome)
- melmastoon.housekeeping.sync.conflicts (counter; labels: aggregate, field)
3.4 AI port
- melmastoon.housekeeping.ai.routing.requests (counter; labels: outcome, hitl_mode)
- melmastoon.housekeeping.ai.routing.latency_ms (histogram)
- melmastoon.housekeeping.ai.routing.applied_rows (histogram)
- melmastoon.housekeeping.ai.routing.fallback (counter)
3.5 System
- Cloud Run native metrics (CPU, mem, instance count, request concurrency).
- Cloud SQL native metrics (connections, slow queries, replication lag, CPU).
- Pub/Sub: subscription/oldest_unacked_message_age, delivery_attempts.
4. Traces
OpenTelemetry. Every request and event handler is a root span. Standard span names:
- http.request (attributes: route, status, request_id)
- housekeeping.use_case.<name> (attributes: tenant_hash, aggregate_id)
- housekeeping.repo.<table>.<op> (attributes: rows, version)
- housekeeping.outbox.append
- housekeeping.pubsub.consume.<subject>
- housekeeping.ai.routing.suggest
traceparent propagated via the event envelope; consumers continue the trace.
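The value carried in the envelope follows the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The sketch below shows how a consumer might parse and re-emit it. The `TraceContext` type and both function names are hypothetical, and in practice the OpenTelemetry propagator does this work, so treat it as an illustration of the wire format only.

```typescript
interface TraceContext {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // bit 0 of the trace-flags byte
}

// Version 00 of the W3C traceparent header.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(value: string): TraceContext | null {
  const m = TRACEPARENT.exec(value);
  if (m === null) return null;
  const traceId = m[1];
  const parentSpanId = m[2];
  const flags = m[3];
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentSpanId}-${ctx.sampled ? "01" : "00"}`;
}
```

A consumer that receives an unparseable value would start a new root trace rather than fail the message; that fallback is an assumption, not something this document specifies.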
Sample rate: 100% for ERROR, 10% for INFO under load, 100% on canary.
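The sampling rule can be read as a small decision function. The sketch below makes several assumptions that the text does not state: FATAL is treated like ERROR, every other level follows the INFO rule, and traffic that is not under load is sampled at 100%. The `roll` parameter is a caller-supplied random number in [0, 1), passed in so the decision is testable.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

interface SampleInput {
  level: Level;
  canary: boolean;    // request served by a canary revision
  underLoad: boolean; // how "under load" is detected is out of scope here
  roll: number;       // uniform random value in [0, 1)
}

function shouldSample(input: SampleInput): boolean {
  if (input.canary) return true;                                       // 100% on canary
  if (input.level === "ERROR" || input.level === "FATAL") return true; // 100% for errors
  if (input.underLoad) return input.roll < 0.1;                        // 10% under load
  return true; // assumption: full sampling when not under load
}
```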
5. Errors (Sentry)
- All ERROR-level logs are auto-shipped to Sentry with the trace context.
- Event handlers tag errors with topic, message_id, delivery_attempt.
- Domain errors are not Sentry events; they are expected business outcomes (4xx). Only 5xx, panics, and concurrency conflict storms (10+ in 60 s on the same aggregate) page on-call.
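The "10+ in 60 s on the same aggregate" rule is a sliding-window count. A minimal sketch of such a detector follows; the class name and its in-memory storage are assumptions for illustration, and a real deployment with several instances would need the count to come from shared metrics instead.

```typescript
class ConflictStormDetector {
  // Timestamps (ms) of recent conflicts, keyed by aggregate ID.
  private readonly hits = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 10, windowMs = 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one conflict and reports whether the aggregate is now in a storm.
  record(aggregateId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only conflicts that are still inside the window, then add this one.
    const recent = (this.hits.get(aggregateId) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(aggregateId, recent);
    return recent.length >= this.threshold;
  }
}
```

Entries for idle aggregates are never evicted in this sketch, so a long-running process would also need periodic cleanup.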
6. Dashboards
Grafana folder housekeeping:
- Overview — request rate, latency p50/p95/p99, 4xx/5xx, error budget burn.
- Turnover Saga — tasks created/started/completed/min, time-to-ready histogram, current open count by status.
- Board — GET /board p99, board cache hit ratio, sync push outcomes.
- Linen — on-hand by line per property, low-stock alerts, runway minutes.
- Shifts & Routing — staffing-gap events, routing requests/applied/fallback, hitl mode by tenant.
- Database — Cloud SQL CPU, slow query top 10, partition sizes, RLS denies.
- Pub/Sub — consumer lag per subscription, DLQ size, retry counts.
7. Alerts
| Alert | Condition | Route |
|---|---|---|
| OutboxBacklog | unpublished rows > 1000 for 5 min | on-call (housekeeping) |
| BoardLatencyHigh | board p99 > 400 ms for 10 min | on-call |
| TurnoverCreateLagHigh | p95 > 5 s for 10 min | on-call |
| ConsumerLagHigh | oldest unacked > 60 s for 5 min on any subscription | on-call |
| DLQGrowth | DLQ size > 10 in 15 min | on-call |
| ErrorBudgetBurnFast | 1h burn rate > 14× | on-call (page) |
| ErrorBudgetBurnSlow | 6h burn rate > 6× | on-call (ticket) |
| RLSDenyStorm | RLS denies > 0 for 5 min | security on-call (this should never happen) |
| RoutingFallbackPersistent | fallback counter > 0 for 30 min | engineering Slack (warn) |
| LinenLowStockSpike | low-stock alerts > 5/property in 1 h | analytics Slack (warn) |
Pages go to PagerDuty housekeeping; warnings to Slack #hk-ops.
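Most conditions in the table have the shape "value above threshold for N minutes". The sketch below shows one way to read that semantics: the alert fires only if the value has stayed above the threshold continuously for the whole duration. The `Sample` type and `breachedFor` function are hypothetical; the actual evaluation is done by the monitoring backend from alerts.yaml.

```typescript
interface Sample {
  atMs: number;  // sample timestamp in milliseconds
  value: number; // observed value, e.g. unpublished outbox rows
}

// True when every sample since some point at least durationMs ago exceeds the threshold.
function breachedFor(samples: Sample[], threshold: number, durationMs: number, nowMs: number): boolean {
  let breachStart: number | null = null;
  const ordered = [...samples].sort((a, b) => a.atMs - b.atMs);
  for (const s of ordered) {
    if (s.atMs > nowMs) break; // ignore samples from the future
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs; // a breach run begins
    } else {
      breachStart = null; // any recovery resets the run
    }
  }
  return breachStart !== null && nowMs - breachStart >= durationMs;
}
```

For OutboxBacklog this would be called with a threshold of 1000 and a duration of 5 minutes; a single sample at or below 1000 inside the window resets the timer.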
8. Runbooks
Each alert has a runbook in runbooks/:
- outbox-backlog.md
- board-latency.md
- turnover-create-lag.md
- consumer-lag.md
- dlq-growth.md
- rls-deny.md
- routing-fallback.md
Runbooks include: how to confirm, immediate mitigation, escalation, post-mortem template.
9. Audit visibility
audit_events is queryable via the internal admin tool (/internal/admin/audit?aggregate_id=…) — restricted to tenant_admin and platform support roles. Mirrored to audit-service over Pub/Sub.
10. Cross-link
- Platform observability defaults: docs/observability.md (project-wide).
- Failure modes & runbooks: FAILURE_MODES.md.
- Capacity planning inputs: DATA_MODEL.md §6.