maintenance-service · OBSERVABILITY

Stack: OpenTelemetry SDK (Node) → OTLP/gRPC → Cloud Operations + SigNoz. Logs are written as structured JSON to stdout and collected by Cloud Logging. Metrics are exported via OTLP (pushed, not scraped). Traces are sampled at a 10% baseline, and at 100% for errors and for properties on the propertyId debug allow-list.
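
The sampling policy above can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not the service's actual sampler: `TraceSummary`, `traceIdFraction`, and `shouldKeepTrace` are hypothetical names, and the 100%-on-errors rule assumes the decision is made once the trace outcome is known (tail-based), since a head sampler cannot see errors in advance.

```typescript
// Hypothetical summary of a finished trace; field names are illustrative.
interface TraceSummary {
  traceId: string;      // hex-encoded trace id
  hasError: boolean;    // true if any span ended with an error status
  propertyId?: string;  // value of the property.id attribute, when present
}

const BASELINE_RATIO = 0.1; // 10% baseline from the policy above

// Map a hex trace id to a stable fraction in [0, 1) using its last 8 hex digits,
// so the same trace always gets the same decision on every node.
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

function shouldKeepTrace(
  trace: TraceSummary,
  debugAllowList: ReadonlySet<string>,
): boolean {
  if (trace.hasError) return true; // 100% on errors
  if (trace.propertyId !== undefined && debugAllowList.has(trace.propertyId)) {
    return true; // 100% for allow-listed properties
  }
  // A malformed trace id yields NaN, which compares false and is dropped.
  return traceIdFraction(trace.traceId) < BASELINE_RATIO;
}
```

Hashing the trace id rather than calling a random number generator keeps the baseline decision consistent across services that see the same trace.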

1. Required span attributes

Every server span (HTTP, Pub/Sub handler, worker tick) must carry:

| Attribute | Source | Notes |
| --- | --- | --- |
| service.name | resource | maintenance-service |
| service.namespace | resource | melmastoon |
| service.version | resource | from build env BUILD_VERSION |
| tenant.id | request middleware | from JWT / event envelope |
| property.id | request middleware | when applicable |
| correlation.id | header / envelope | propagated end-to-end |
| user.id | JWT | hashed for analytics export |
| device.id | JWT | desktop sync only |
| work_order.id | use case | when applicable |
| asset.id | use case | |
| vendor.id | use case | |
| wo.severity, wo.category, wo.source, wo.status_from, wo.status_to | transitions | |
| saga.step | reserved | choreography hop label, e.g., mnt.room_block_request |
| outbox.event.type, outbox.event.id | outbox publish span | |
| inbox.event.type, inbox.message_id | inbox handler span | |
| ai.capability, ai.model, ai.score, ai.human_accepted | AI calls | |
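
A check for the attribute contract above might look like the following. It is a sketch only: `missingRequiredAttributes` is a hypothetical helper, and treating just the five keys below as unconditionally required is an assumption, since the remaining attributes apply only in specific contexts ("when applicable", "desktop sync only", and so on).

```typescript
type SpanAttributes = Record<string, string | number | boolean>;

// Assumed always-required keys. Conditional keys (property.id, work_order.id, ...)
// are deliberately left out because they only apply to some spans.
const ALWAYS_REQUIRED = [
  "service.name",
  "service.namespace",
  "service.version",
  "tenant.id",
  "correlation.id",
] as const;

// Returns the required keys that are missing or empty on a span's attributes.
function missingRequiredAttributes(attrs: SpanAttributes): string[] {
  return ALWAYS_REQUIRED.filter((key) => {
    const value = attrs[key];
    return value === undefined || value === "";
  });
}
```

A check like this could run in a test-only span processor, so that a handler missing its middleware fails in CI rather than in production dashboards.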

2. Structured log fields

All logs must be JSON with at minimum:

{
  "timestamp": "2026-04-22T14:03:21.812Z",
  "level": "INFO",
  "msg": "work order resolved",
  "service": "maintenance-service",
  "tenantId": "tnt_...",
  "propertyId": "prop_...",
  "correlationId": "01HXY...",
  "spanId": "abc123",
  "traceId": "def456",
  "workOrderId": "mnt_...",
  "fields": { "severity": "high", "category": "hvac", "costAmountMicro": "150000000", "currency": "AFN" }
}

PII must never appear in logs. Vendor phone numbers and emails are never logged; vendors are referenced by vendorId only.
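
One way to produce log lines in this shape while keeping PII out is sketched below. `buildLogLine`, `LogContext`, and the `PII_KEYS` deny-list are hypothetical; a real service would more likely configure its logging library's redaction feature, and an allow-list of permitted fields is stricter than the deny-list shown here.

```typescript
interface LogContext {
  tenantId: string;
  propertyId?: string;
  correlationId: string;
  spanId: string;
  traceId: string;
  workOrderId?: string;
}

type FieldValue = string | number | boolean;

// Hypothetical deny-list of field names that could carry vendor or guest PII.
const PII_KEYS = new Set<string>(["phone", "email", "vendorPhone", "vendorEmail"]);

// Builds one JSON log line with the minimum fields listed above.
// Returns the string instead of writing to stdout so it is easy to test.
function buildLogLine(
  level: "DEBUG" | "INFO" | "WARN" | "ERROR",
  msg: string,
  ctx: LogContext,
  fields: Record<string, FieldValue> = {},
  now: Date = new Date(),
): string {
  const safeFields: Record<string, FieldValue> = {};
  for (const key of Object.keys(fields)) {
    if (!PII_KEYS.has(key)) safeFields[key] = fields[key]; // drop PII-bearing keys
  }
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    msg,
    service: "maintenance-service",
    ...ctx, // undefined optional fields are omitted by JSON.stringify
    fields: safeFields,
  });
}
```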

3. SLIs / SLOs

| SLI | SLO target | Measurement window |
| --- | --- | --- |
| API availability (1 − 5xx / total) | ≥ 99.9% | 30-day rolling |
| API p95 latency for GET /work-orders (filtered, ≤ 5k rows) | ≤ 400 ms | 30-day rolling |
| API p95 latency for POST /work-orders (no AI) | ≤ 600 ms | 30-day rolling |
| API p99 latency for state transitions | ≤ 1 s | 30-day rolling |
| Outbox publish lag (enqueue → published) p99 | ≤ 5 s | 7-day rolling |
| Preventive scheduler lag (due → draft WO created) p99 | ≤ 60 s | 7-day rolling |
| SLA breach detection lag p99 | ≤ 60 s | 7-day rolling |
| Inbox handler success rate | ≥ 99.5% | 7-day rolling |
| Pub/Sub DLQ rate | ≤ 0.05% | 7-day rolling |

Alerts fire when 50% and 80% of an SLO's error budget has been consumed within its measurement window.
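
As a worked illustration of the budget arithmetic: a 99.9% availability target leaves 0.1% of requests in a 30-day window as error budget (on a time basis, 0.001 × 43,200 min = 43.2 min). The sketch below computes the consumed fraction on a request basis and maps it to the two alert thresholds. The function names and the request-based interpretation are assumptions, not the service's actual alerting rules.

```typescript
// Fraction of the window's error budget already consumed.
// total and failed are request counts inside the SLO window; sloTarget is e.g. 0.999.
function errorBudgetConsumed(total: number, failed: number, sloTarget: number): number {
  const allowedFailures = total * (1 - sloTarget);
  if (allowedFailures <= 0) return failed > 0 ? Infinity : 0;
  return failed / allowedFailures;
}

type BudgetAlert = "none" | "budget_50" | "budget_80";

// Maps the consumed fraction to the 50% / 80% thresholds described above.
function budgetAlertLevel(consumed: number): BudgetAlert {
  if (consumed >= 0.8) return "budget_80";
  if (consumed >= 0.5) return "budget_50";
  return "none";
}
```

For example, with 1,000,000 requests in the window and a 99.9% target, roughly 1,000 failures are allowed, so 600 failures would cross the 50% threshold and 850 the 80% threshold. Values that sit exactly on a threshold are subject to floating-point rounding.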

4. RED + USE metrics

RED (per route + per Pub/Sub subscription + per worker)

  • mnt.requests_total{route, code}
  • mnt.request_duration_seconds_bucket{route} (histogram)
  • mnt.errors_total{route, code}
  • mnt.inbox_messages_total{subscription, result} where result ∈ {processed, dedup, failed}
  • mnt.worker_ticks_total{worker, result}

USE (resource)

  • mnt.db.connections{pool=primary} gauge
  • mnt.db.query_duration_seconds{op} histogram
  • mnt.cache.hit_total{key} / mnt.cache.miss_total{key}
  • mnt.outbox.lag_seconds gauge (now − oldest unpublished enqueued_at)
  • mnt.outbox.unpublished_count gauge
  • mnt.preventive.due_pending_count gauge
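
The two outbox gauges follow directly from the definition given above (now − oldest unpublished enqueued_at). The sketch below computes them from in-memory rows. `OutboxRow` and its field names are hypothetical, and in practice the values would more likely come from a single aggregate query against the outbox table.

```typescript
// Hypothetical shape of an outbox row; publishedAt is null until the relay publishes it.
interface OutboxRow {
  id: string;
  enqueuedAt: Date;
  publishedAt: Date | null;
}

interface OutboxGauges {
  lagSeconds: number;       // value for mnt.outbox.lag_seconds
  unpublishedCount: number; // value for mnt.outbox.unpublished_count
}

function outboxGauges(rows: readonly OutboxRow[], now: Date): OutboxGauges {
  let oldestMs: number | null = null;
  let unpublishedCount = 0;
  for (const row of rows) {
    if (row.publishedAt !== null) continue; // already published, not part of the lag
    unpublishedCount += 1;
    const enqueuedMs = row.enqueuedAt.getTime();
    if (oldestMs === null || enqueuedMs < oldestMs) oldestMs = enqueuedMs;
  }
  // An empty backlog reports zero lag; clock skew is clamped so the gauge never goes negative.
  const lagSeconds =
    oldestMs === null ? 0 : Math.max(0, (now.getTime() - oldestMs) / 1000);
  return { lagSeconds, unpublishedCount };
}
```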

Domain-specific

  • mnt.work_orders.created_total{source, severity, category} counter
  • mnt.work_orders.resolved_total{category, severity} counter
  • mnt.work_orders.sla_breached_total{category, severity} counter
  • mnt.work_orders.room_blocked_total{} counter
  • mnt.work_orders.relocations_required_total{} counter
  • mnt.work_orders.escalations_total{hop} counter
  • mnt.preventive.fired_total{cadence_kind} counter
  • mnt.parts.out_of_stock_total{partId} counter
  • mnt.assets.health_index gauge labelled (propertyId, class) (sampled, not per-asset)
  • mnt.ai.calls_total{capability, accepted} counter
  • mnt.ai.cost_micro_usd_total{capability} counter
  • mnt.ai.latency_seconds_bucket{capability} histogram

5. Dashboards

5.1 Service health (default)

  • Request rate, error rate, p50/p95/p99 latency per route.
  • Outbox lag and unpublished count.
  • DB connection saturation, query p95 by op.
  • Cache hit ratios for mnt:open:* and mnt:vendors:*.
  • Pod CPU/mem, request count per pod.

5.2 Operations

  • Open WOs by status × severity × category (heatmap).
  • Median time-in-status per category.
  • SLA breach count + open breach count.
  • Top 10 assets by WO count last 30 days.
  • Vendor responsiveness: avg time assigned → acknowledged per vendor.
  • Preventive due / overdue counts.
  • Cost per resolved WO (rolling 30 day) by category.

5.3 AI

  • Calls per capability, accept rate, avg score, avg latency, cost per call, daily total cost vs budget.
  • HITL rejection rate (suggestion vs final value diff).

5.4 Sync

  • Pull / push rates, push failure rate by error code, conflicts detected by aggregate, average device clock skew.

6. Alerts (selected; full catalog in runbook repo)

| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| mnt.api_5xx_burn_fast | error budget burn rate ≥ 14.4× over the last 1 h | P1 | runbook://maintenance/api-5xx |
| mnt.outbox_lag_high | mnt.outbox.lag_seconds > 60 for 5 min | P2 | runbook://maintenance/outbox-lag |
| mnt.outbox_unpublished_high | mnt.outbox.unpublished_count > 1000 for 10 min | P2 | runbook://maintenance/outbox-lag |
| mnt.preventive_lag_high | mnt.preventive.due_pending_count > 50 for 5 min | P2 | runbook://maintenance/preventive-lag |
| mnt.sla_breach_storm | mnt.work_orders.sla_breached_total rate ≥ 3× the 7-day baseline | P3 | runbook://maintenance/sla-storm |
| mnt.dlq_rate_high | DLQ rate > 0.5% over 30 min | P2 | runbook://maintenance/dlq |
| mnt.ai_budget_exhausted | tenant budget used 100% | P3 (per tenant) | runbook://maintenance/ai-budget |
| mnt.sync_push_failure_high | push failure rate > 5% over 5 min | P2 | runbook://maintenance/sync-push |
| mnt.db_connection_saturation | connection pool ≥ 90% used for 5 min | P2 | runbook://maintenance/db-connections |
| mnt.room_block_no_response | room_blocked request not acknowledged by property-service within 60 s | P3 | runbook://maintenance/room-block |
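
The 14.4 figure in mnt.api_5xx_burn_fast is the commonly used fast-burn threshold: a burn rate of 14.4 sustained for 1 hour consumes 14.4 × (1 / 720) = 2% of a 30-day error budget. The sketch below shows that arithmetic. The function names are hypothetical and the actual alert is presumably expressed as a monitoring query rather than application code.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are arriving.
// errorRatio is failed / total over the alert window; sloTarget is e.g. 0.999.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the whole SLO window's budget consumed if `rate` holds for the alert window.
function budgetFractionConsumed(
  rate: number,
  alertWindowHours: number,
  sloWindowDays: number,
): number {
  return (rate * alertWindowHours) / (sloWindowDays * 24);
}

const FAST_BURN_THRESHOLD = 14.4; // from the alert table above

function isFastBurn(errorRatio: number, sloTarget: number): boolean {
  return burnRate(errorRatio, sloTarget) >= FAST_BURN_THRESHOLD;
}
```

With a 99.9% target, a 2% 5xx ratio over the last hour gives a burn rate of about 20 and would fire the alert, while a 0.5% ratio (burn rate about 5) would not.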

7. Tracing patterns

7.1 Auto-create WO from housekeeping flag

[housekeeping-service] hk.flag.publish ──► causationId
[Pub/Sub] transport
[maintenance-service] inbox.handler ──► correlationId (= housekeeping causationId)
  └── mnt.create_work_order
      ├── ai.severity-suggestion (optional)
      ├── ai.category-classify (optional)
      ├── repo.work_order.save (tx)
      ├── outbox.append (work_order.created.v1)
      ├── outbox.append (work_order.room_blocked.v1) // if applicable
      └── outbox.append (work_order.relocation_required.v1) // if applicable
[outbox-relay] publishes → property-service inbox → ...

The same correlationId flows through every hop. Span links carry the causationId of the upstream event, so the full causal graph can be reconstructed.

7.2 Saga: room block → relocation

Reservation-service's room_change.saga.step1.evaluate span includes a span link back to our work_order.relocation_required.v1 outbox publish span. The full causal graph is reconstructable in SigNoz.

8. Replay & backfill observability

When the outbox-relay replays events from a checkpoint, every replayed publish span carries replay=true. Dashboards filter these out by default to avoid noise.
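
The default dashboard filter amounts to excluding spans tagged replay=true. A minimal sketch, assuming the attribute is stored as a boolean (some backends store attribute values as strings) and using a hypothetical `SpanRecord` shape:

```typescript
// Hypothetical, simplified view of an exported span.
interface SpanRecord {
  name: string;
  attributes: Record<string, string | number | boolean>;
}

// Drops replayed publish spans unless the caller explicitly asks to include them.
function excludeReplayed(
  spans: readonly SpanRecord[],
  includeReplay: boolean = false,
): SpanRecord[] {
  if (includeReplay) return spans.slice();
  return spans.filter((span) => span.attributes["replay"] !== true);
}
```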

When migrating data (see MIGRATION_PLAN.md), the importer worker emits its own metric series prefixed mnt.migrate.* with labels (tenantId, source) so we can watch progress and pause if errors spike.

9. Synthetic checks

Cloud Monitoring uptime checks (every 60 s, from 3 regions):

  • GET /healthz (liveness)
  • GET /readyz (readiness; checks DB + Pub/Sub auth + outbox-relay heartbeat)
  • GET /api/v1/maintenance/work-orders?limit=1 with synthetic test tenant
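
The /readyz contract above (ready only when the DB, Pub/Sub auth, and outbox-relay heartbeat checks all pass) can be sketched as a pure aggregation over check results. The check names, the `readiness` function, and the 200/503 status mapping are assumptions for illustration; how each check is actually performed is out of scope here.

```typescript
// Check names mirror the three dependencies listed for /readyz.
const REQUIRED_CHECKS = ["db", "pubsub_auth", "outbox_relay_heartbeat"] as const;
type CheckName = (typeof REQUIRED_CHECKS)[number];

interface CheckResult {
  name: CheckName;
  ok: boolean;
}

interface ReadinessResponse {
  status: 200 | 503;
  failed: CheckName[];
}

// Ready only if every required check reported ok; a missing result counts as a failure.
function readiness(results: readonly CheckResult[]): ReadinessResponse {
  const failed = REQUIRED_CHECKS.filter(
    (name) => !results.some((r) => r.name === name && r.ok),
  );
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

Treating a missing result as a failure means a check that times out or is never scheduled cannot accidentally report the pod as ready.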

A per-tenant synthetic ("smoke") flow runs every 15 minutes in staging:

  1. Create WO → assert created
  2. Assign to test staff → assert assigned
  3. Start → resolve → verify → assert verified and OOO released

Failures page on-call.

10. Data export to BigQuery

  • All events are archived once daily to melmastoon_events_v1.maintenance_*, partitioned by producedAt date.
  • Operational metrics exported via Cloud Operations sink → BigQuery for long-term trend analysis.
  • Sync conflict logs exported nightly for cross-tenant pattern detection.