property-service — OBSERVABILITY

Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

This document defines the SLIs, SLOs, dashboards, alerts, runbook hooks, structured-log shape, and trace/metric attributes that property-service is required to emit. It is a binding contract between the service team and platform SRE.

Stack: OpenTelemetry SDK for traces + metrics, exported via OTLP to Cloud Trace and Cloud Monitoring; structured JSON logs to Cloud Logging via stdout. No vendor-specific instrumentation.


1. SLIs and SLOs

1.1 User-facing SLIs

| SLI | Definition | SLO (rolling 28d) | Burn alert |
| --- | --- | --- | --- |
| Read availability | 1 - (5xx on GET /properties* / total GET) | ≥ 99.95 % | 14.4× burn over 1 h or 6× over 6 h |
| Read latency p99 | p99 of GET /properties/:id end-to-end on the service span | ≤ 250 ms | breach 30 min |
| Write availability | 1 - (5xx on POST/PATCH/DELETE / total mutations) | ≥ 99.9 % | 14.4× burn over 1 h |
| Write latency p99 | p99 of POST /properties and PATCH /properties/:id | ≤ 800 ms | breach 30 min |
| Publish success rate | accepted_publish / requested_publish (excludes 4xx by domain validation) | ≥ 99.5 % | breach 1 h |
| Photo readiness latency p95 | time from POST /properties/:id/photos to photo.added.v1 with status=ready | ≤ 30 s | breach 1 h |
| Geo-search latency p99 | p99 of GET /properties/geo/nearby | ≤ 400 ms | breach 30 min |
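
The burn multipliers in the table can be read as follows. This is a minimal sketch, assuming the conventional definition of burn rate (observed error ratio divided by the error budget, 1 - SLO); the function names are hypothetical and not part of the service.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo)."""
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was burned
    budget = 1.0 - slo
    return (errors / total) / budget


def read_availability_alert(rate_1h: float, rate_6h: float) -> bool:
    """Thresholds from the Read availability row: 14.4x over 1 h or 6x over 6 h."""
    return rate_1h >= 14.4 or rate_6h >= 6.0


# 72 failed GETs out of 10,000 against the 99.95 % SLO is roughly a 14.4x burn.
example = burn_rate(errors=72, total=10_000, slo=0.9995)
```

Under this definition, a sustained 14.4× burn would exhaust a 28-day budget in about 46.7 hours (672 h / 14.4), which is why it is treated as page-worthy.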

1.2 Internal SLIs

| SLI | Definition | SLO |
| --- | --- | --- |
| Outbox publish lag p95 | published_at - created_at for outbox rows | ≤ 5 s |
| Outbox backlog | rows where published_at IS NULL | ≤ 100 sustained |
| Inbox processing lag p95 | processed_at - received_at for inbox rows | ≤ 10 s |
| Sync apply latency p95 | /internal/sync/apply per operation | ≤ 200 ms |
| AI orchestrator round-trip p95 | per capability | ≤ 8 s for prop.describe.draft, ≤ 12 s for prop.photo.tag |
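
To make the outbox definitions concrete, the sketch below computes publish lag and backlog from in-memory rows. It assumes rows are dicts with `created_at` and `published_at` datetimes and uses a nearest-rank percentile; in practice these values would more likely be derived in the monitoring backend, so treat the helper names as illustrative only.

```python
from datetime import datetime, timedelta


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def publish_lags(rows):
    """published_at - created_at, in seconds, for rows that have been published."""
    return [
        (r["published_at"] - r["created_at"]).total_seconds()
        for r in rows
        if r["published_at"] is not None
    ]


def backlog(rows):
    """Number of rows where published_at IS NULL."""
    return sum(1 for r in rows if r["published_at"] is None)
```

With these helpers, the two outbox SLOs reduce to `p95(publish_lags(rows)) <= 5` and `backlog(rows) <= 100`.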

2. Required Span Attributes

Every span (HTTP, DB, Pub/Sub, Redis, internal RPC) must carry:

melmastoon.tenant_id
melmastoon.request_id
melmastoon.actor_user_id (when authenticated)
melmastoon.actor_kind (user|system|integration|ai)
melmastoon.route (e.g., GET /properties/:id)
melmastoon.aggregate (property|room|room_type|photo|policy|amenity)
melmastoon.aggregate_id
melmastoon.idempotency_key (when present)
melmastoon.cache.hit (true|false on Redis-touching spans)
melmastoon.error.code (MELMASTOON.* on errors only)

Sync spans add: melmastoon.device_id, melmastoon.operation_id, melmastoon.conflict_resolution.

AI spans add: melmastoon.ai.run_id, melmastoon.ai.capability, melmastoon.ai.model_route, melmastoon.ai.tokens_in, melmastoon.ai.tokens_out, melmastoon.ai.moderation_flagged.
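
One way to enforce the attribute contract is a check in tests or in a span processor. The sketch below works on a plain dict of attributes rather than a real OpenTelemetry span; the function name and its flags (`authenticated`, `touches_redis`, `is_error`) are assumptions introduced for illustration.

```python
ALWAYS_REQUIRED = (
    "melmastoon.tenant_id",
    "melmastoon.request_id",
    "melmastoon.actor_kind",
    "melmastoon.route",
    "melmastoon.aggregate",
    "melmastoon.aggregate_id",
)
ACTOR_KINDS = {"user", "system", "integration", "ai"}
AGGREGATES = {"property", "room", "room_type", "photo", "policy", "amenity"}


def attribute_problems(attrs, *, authenticated=False, touches_redis=False, is_error=False):
    """Return a list of human-readable problems; an empty list means the span conforms."""
    problems = [f"missing {key}" for key in ALWAYS_REQUIRED if key not in attrs]
    # Conditional attributes from the list above.
    conditional = (
        (authenticated, "melmastoon.actor_user_id"),
        (touches_redis, "melmastoon.cache.hit"),
        (is_error, "melmastoon.error.code"),
    )
    problems += [f"missing {key}" for applies, key in conditional if applies and key not in attrs]
    # Enumerated values.
    kind = attrs.get("melmastoon.actor_kind")
    if kind is not None and kind not in ACTOR_KINDS:
        problems.append(f"invalid melmastoon.actor_kind: {kind}")
    aggregate = attrs.get("melmastoon.aggregate")
    if aggregate is not None and aggregate not in AGGREGATES:
        problems.append(f"invalid melmastoon.aggregate: {aggregate}")
    return problems
```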

Trace propagation: W3C traceparent only; gateway injects, every internal hop preserves.
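
For reference, "preserves" here means that each hop keeps the trace id and flags and only swaps in its own span id. The sketch below parses and re-emits a `traceparent` header following the W3C Trace Context field layout (version, trace-id, parent-id, flags); it handles the four-field form only, and the helper names are hypothetical. Real services would normally rely on the OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming, span_id):
    """Header for the next hop: same trace id and flags, this hop's span id as parent."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        raise ValueError("invalid traceparent")
    return f"{parsed['version']}-{parsed['trace_id']}-{span_id}-{parsed['flags']}"
```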


3. Metrics Catalog

All counters/histograms are tagged with tenant_id (low cardinality after aggregation by tenant tier) and route or aggregate as appropriate. Histograms use OTel exponential buckets.

| Name | Type | Tags | Description |
| --- | --- | --- | --- |
| property_http_requests_total | counter | route, method, status, tenant_id | All HTTP requests |
| property_http_request_duration_seconds | histogram | route, method, status | End-to-end latency |
| property_db_query_duration_seconds | histogram | operation, table | Repository spans |
| property_outbox_lag_seconds | gauge | topic | Oldest unpublished row age |
| property_outbox_publish_total | counter | topic, result | success / retry / dlq |
| property_inbox_processed_total | counter | topic, result | success / retry / dlq |
| property_sync_pull_rows_total | counter | aggregate | Pull throughput |
| property_sync_apply_total | counter | aggregate, result | applied / rejected / conflict_resolved |
| property_sync_conflicts_total | counter | aggregate, resolution | Conflict counter |
| property_publish_total | counter | result, reason | accepted / rejected_invariant / rejected_authz |
| property_room_status_changes_total | counter | from_status, to_status, source | source ∈ {api, sync, event(housekeeping), event(maintenance), system} |
| property_photo_pipeline_seconds | histogram | stage | upload, scan_wait, ready |
| property_ai_runs_total | counter | capability, result | success / blocked / quota / error |
| property_ai_run_seconds | histogram | capability, model_route | |
| property_authz_denied_total | counter | route, role | Repeated denials → security alert |
| property_geo_search_seconds | histogram | shape | bbox / nearby |
| property_cache_totalcounter | counter | key_pattern, result | hit / miss / fill / invalidate |
| property_tenant_isolation_audit_failures_total | counter | table | MUST be 0 |
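
Because unplanned tags are the usual source of cardinality growth, the catalog can double as a lint rule. The sketch below encodes three rows of the table and reports missing or unexpected tags; the `CATALOG` structure and `check_tags` helper are hypothetical and shown only to illustrate the idea.

```python
# Subset of the catalog above: metric name -> (type, allowed tag names).
CATALOG = {
    "property_http_requests_total": ("counter", {"route", "method", "status", "tenant_id"}),
    "property_outbox_publish_total": ("counter", {"topic", "result"}),
    "property_publish_total": ("counter", {"result", "reason"}),
}


def check_tags(name, tags):
    """Return (missing, unexpected) tag names for one emitted data point."""
    if name not in CATALOG:
        raise KeyError(f"metric not in catalog: {name}")
    _kind, allowed = CATALOG[name]
    present = set(tags)
    return sorted(allowed - present), sorted(present - allowed)
```

A unit test or CI step could call `check_tags` for each instrument the service registers and fail on any non-empty result.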

4. Logs

4.1 Format

JSON to stdout, one event per line. Fields:

{
  "ts": "2026-04-22T10:00:00.123Z",
  "level": "INFO",
  "service": "property-service",
  "version": "git-sha-abc1234",
  "msg": "property.published",
  "tenant_id": "tnt_…",
  "request_id": "req_…",
  "trace_id": "00-…-…-01",
  "span_id": "…",
  "actor_user_id": "usr_…",
  "actor_kind": "user",
  "route": "POST /properties/:id/publish",
  "aggregate": "property",
  "aggregate_id": "ppt_…",
  "duration_ms": 312,
  "outcome": "ok",
  "extras": { "version": 7 }
}
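
A minimal sketch of an emitter for this shape, using only the standard library. The function name and the injectable `now` parameter are assumptions made for testability; the point is that each event serializes to exactly one line with a millisecond UTC timestamp.

```python
import json
from datetime import datetime, timezone


def format_log_event(msg, *, level="INFO", now=None, **fields):
    """Serialize one log event as a single JSON line (no trailing newline)."""
    now = now or datetime.now(timezone.utc)
    ts = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    event = {
        "ts": ts.replace("+00:00", "Z"),
        "level": level,
        "service": "property-service",
        "msg": msg,
    }
    event.update(fields)  # tenant_id, request_id, route, outcome, extras, ...
    # json.dumps escapes embedded newlines, so the result stays on one line.
    return json.dumps(event, ensure_ascii=False, separators=(",", ":"))
```

Usage would look like `print(format_log_event("property.published", tenant_id=..., outcome="ok", duration_ms=312))`, with stdout collected by Cloud Logging.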

4.2 Required log events

  • Every state-changing API call: route.completed with outcome and aggregate version.
  • Every emitted domain event: outbox.appended with topic + outboxId.
  • Every consumed event: inbox.processed with topic + result.
  • Every authorization denial: authz.denied with role, subject_user_id, resource_type, resource_id, policy_decision_id.
  • Every AI run: ai.run.completed with the orchestrator response metadata.
  • Every conflict resolution: sync.conflict.resolved with the field-level merge summary.

4.3 Redactions

The LoggingInterceptor redacts contact.phone, contact.email, OAuth tokens, signed URLs, and any field tagged @Sensitive in the DTO layer.
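
The redaction step can be pictured as a recursive walk that masks values by field path. This is a sketch under stated assumptions, not the actual LoggingInterceptor: it covers only the two contact fields named above, matches a path by suffix so nested DTOs are also caught, and uses a hypothetical `[REDACTED]` marker. Tokens, signed URLs, and @Sensitive fields would need their own rules.

```python
REDACTED = "[REDACTED]"
SENSITIVE_PATHS = frozenset({"contact.phone", "contact.email"})


def _is_sensitive(path, sensitive_paths):
    # Match the full path or any nested path ending in a sensitive one.
    return any(path == p or path.endswith("." + p) for p in sensitive_paths)


def redact(payload, sensitive_paths=SENSITIVE_PATHS, _prefix=""):
    """Return a redacted copy of payload; the input is not modified."""
    if isinstance(payload, dict):
        out = {}
        for key, value in payload.items():
            path = f"{_prefix}.{key}" if _prefix else key
            if _is_sensitive(path, sensitive_paths):
                out[key] = REDACTED
            else:
                out[key] = redact(value, sensitive_paths, path)
        return out
    if isinstance(payload, list):
        # List items inherit the path of the list itself.
        return [redact(item, sensitive_paths, _prefix) for item in payload]
    return payload
```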


5. Dashboards

Authoritative Grafana / Cloud Monitoring dashboards (URLs maintained in infra/observability/dashboards/property-service/):

  1. Property — Service Health. RED metrics per route, p50/p95/p99 latency, error rate, outbox lag, inbox lag.
  2. Property — Domain Activity. Properties created/published/unpublished per hour, room create/archive, OOO transitions, photo pipeline funnel.
  3. Property — AI Activity. Runs per capability, acceptance rate, moderation block rate, quota usage per tenant.
  4. Property — Sync. Pull rows/min, push apply success, conflict rate, per-device error rates.
  5. Property — Tenant Drill-down. Same panels filtered by tenant_id (templated variable).
  6. Property — Search Projection Health. Time from event to projection visibility (computed via search-aggregation-service echo metric).

6. Alerts (with runbook anchors)

| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| PropertyReadAvailabilityBurn | error budget burn 14.4× over 1 h | page (P1) | runbooks/property/read-availability.md |
| PropertyWriteAvailabilityBurn | same on writes | page (P1) | runbooks/property/write-availability.md |
| PropertyOutboxBacklog | property_outbox_lag_seconds > 30 for 5 min | page (P2) | runbooks/property/outbox-backlog.md |
| PropertyInboxStuck | property_inbox_processed_total{result="error"} rate > 1/s for 10 min | page (P2) | runbooks/property/inbox-stuck.md |
| PropertyTenantIsolationFailure | property_tenant_isolation_audit_failures_total > 0 | page (P0, security) | runbooks/security/tenant-isolation-breach.md |
| PropertyPublishRejectSpike | property_publish_total{result="rejected_invariant"} rate > baseline ×5 | ticket | runbooks/property/publish-rejects.md |
| PropertyPhotoPipelineSlow | property_photo_pipeline_seconds{stage="ready"} p95 > 60s for 30 min | ticket | runbooks/property/photo-pipeline.md |
| PropertyAIQuotaExhaustionSpike | property_ai_runs_total{result="quota"} rate > 1/s | ticket | runbooks/ai/quota-spikes.md |
| PropertySyncConflictStorm | property_sync_conflicts_total rate per device > 5/min for 30 min | ticket | runbooks/property/sync-conflict-storm.md |
| PropertyAuthzDenialSpike | property_authz_denied_total rate per actor > 30/min | ticket (security) | runbooks/security/authz-denial.md |
| PropertyGeoSearchSlow | property_geo_search_seconds p99 > 1 s for 30 min | ticket | runbooks/property/geo-search.md |

All page-grade alerts include the affected tenant_id (where unambiguous) and a templated link to the relevant dashboard with the time range pre-selected.
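
Several conditions above have the form "threshold exceeded for N minutes". The sketch below shows one plausible reading of that duration clause, assuming samples arrive as `(unix_seconds, value)` pairs: the alert fires only if every sample since the breach began is above the threshold and the breach has lasted at least the stated duration. The alerting backend's exact evaluation semantics may differ.

```python
def breached_for(samples, threshold, duration_s, now):
    """True if value > threshold continuously for at least duration_s up to now."""
    breach_start = None
    for ts, value in sorted(samples):
        if ts > now:
            break
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any recovery resets the clock
    return breach_start is not None and now - breach_start >= duration_s


# PropertyOutboxBacklog: property_outbox_lag_seconds > 30 for 5 min,
# here with one sample per minute over six minutes.
samples = [(t, 35.0) for t in range(0, 361, 60)]
fires = breached_for(samples, threshold=30, duration_s=300, now=360)
```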


7. Synthetic Probes

  • Read probe. Every 60 s from each region: GET /properties/{seedPropertyId} for the platform synthetic tenant.
  • Publish probe. Hourly: full create→publish→unpublish→archive cycle on a synthetic property.
  • Sync probe. Every 5 min: simulated pull + push of a status flip on a synthetic room.
  • Photo probe. Every 30 min: signed-URL upload + scan completion check.

Probe results feed a separate "external SLI" panel, and synthetic load is explicitly excluded from billing analytics.


8. Trace Sampling

  • 100 % sampling on errors and on route ∈ {publish, archive}.
  • 5 % head-based sampling on reads.
  • 100 % sampling on AI capability spans.
  • 1 % sampling on synthetic probe traffic.
  • Sampling decisions are made at the gateway and propagated downstream.
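
The rules above can be sketched as a rate lookup followed by a deterministic decision on the trace id. Several things here are assumptions rather than documented behavior: the precedence order (errors, then synthetic traffic, then AI and publish/archive, then reads), the `default_rate` for routes the list does not mention, and the use of the low 64 bits of the trace id as the sampling key. Note also that a gateway deciding at the head of a request cannot yet know whether it will fail, so the error rule implies some later-stage mechanism that this sketch does not model.

```python
def sample_rate(route_kind, *, is_error=False, is_ai=False, is_synthetic=False, default_rate=1.0):
    """Map a request to a sampling probability (precedence order is an assumption)."""
    if is_error:
        return 1.0
    if is_synthetic:
        return 0.01
    if is_ai or route_kind in {"publish", "archive"}:
        return 1.0
    if route_kind == "read":
        return 0.05
    return default_rate  # not specified in the list above


def should_sample(trace_id_hex, rate):
    """Deterministic decision: the same trace id always yields the same answer."""
    key = int(trace_id_hex[-16:], 16)  # low 64 bits of the 128-bit trace id
    return key < rate * 2**64
```

Keying the decision on the trace id means every hop that re-evaluates it reaches the same conclusion, which is consistent with deciding once at the gateway and propagating the result.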

9. Capacity & Cost

Capacity SLIs reported weekly:

  • Active properties per tenant (p50, p95, max).
  • Rooms per property (p50, p95).
  • Photos per property (p50, p95).
  • Outbox backlog mean / max.
  • Cloud SQL CPU + IOPS utilization.

The cost panel combines (a) Cloud Run billable seconds, (b) Cloud SQL units, (c) Pub/Sub throughput, and (d) AI orchestrator-attributed runs into a per-tenant cost projection used by billing-service.


10. On-call & Hand-off

  • Primary: property-service-oncall rotation in PagerDuty (platform-domain schedule).
  • Escalation: platform SRE → engineering lead.
  • Hand-off doc updated weekly in runbooks/property/oncall-handoff.md with the previous week's incidents, top false-positive alerts, and mitigation status.

Cross-references: failure catalog and runbook bodies in FAILURE_MODES; deployment topology and resource limits in DEPLOYMENT_TOPOLOGY; event topics in EVENT_SCHEMAS.