property-service — OBSERVABILITY

Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

This document defines the SLIs, SLOs, dashboards, alerts, and runbook hooks for property-service, together with the structured-log shape and the trace/metric attributes the service is required to emit. It is a binding contract between the service team and platform SRE.

Stack: OpenTelemetry SDK for traces + metrics, exported via OTLP to Cloud Trace and Cloud Monitoring; structured JSON logs to Cloud Logging via stdout. No vendor-specific instrumentation.

1. SLIs and SLOs

1.1 User-facing SLIs

SLI	Definition	SLO (rolling 28d)	Burn alert
Read availability	`1 - (5xx on GET /properties* / total GET)`	≥ 99.95 %	14.4× burn over 1 h or 6× over 6 h
Read latency p99	p99 of `GET /properties/:id` end-to-end on the service span	≤ 250 ms	breach 30 min
Write availability	`1 - (5xx on POST/PATCH/DELETE / total mutations)`	≥ 99.9 %	14.4× burn over 1 h
Write latency p99	p99 of `POST /properties` and `PATCH /properties/:id`	≤ 800 ms	breach 30 min
Publish success rate	`accepted_publish / requested_publish` (excludes 4xx by domain validation)	≥ 99.5 %	breach 1 h
Photo readiness latency p95	time from `POST /properties/:id/photos` to `photo.added.v1` with `status=ready`	≤ 30 s	breach 1 h
Geo-search latency p99	p99 of `GET /properties/geo/nearby`	≤ 400 ms	breach 30 min
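
To make the burn-alert column concrete, the sketch below treats burn rate as the observed error ratio divided by the error budget (1 - SLO), and estimates how much of the rolling 28-day budget a sustained burn would consume. The function names and the sample request counts are illustrative assumptions, not part of the service or its alerting configuration.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float = 28 * 24) -> float:
    """Fraction of the period's error budget used if `rate` holds for the window."""
    return rate * window_hours / slo_period_hours


# Hypothetical 1 h sample for read availability (SLO 99.95 %):
# 72 responses with 5xx out of 10,000 GET requests.
rate = burn_rate(errors=72, total=10_000, slo=0.9995)
print(round(rate, 1))                           # 14.4
print(round(budget_consumed(14.4, 1), 4))       # 0.0214
print(round(budget_consumed(6.0, 6), 4))        # 0.0536
```

Under these assumptions, a 14.4× burn held for 1 h uses roughly 2.1 % of the 28-day budget, and a 6× burn held for 6 h uses roughly 5.4 %.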

1.2 Internal SLIs

SLI	Definition	SLO
Outbox publish lag p95	`published_at - created_at` for outbox rows	≤ 5 s
Outbox backlog	rows where `published_at IS NULL`	≤ 100 sustained
Inbox processing lag p95	`processed_at - received_at` for inbox rows	≤ 10 s
Sync apply latency p95	`/internal/sync/apply` per operation	≤ 200 ms
AI orchestrator round-trip p95	per capability	≤ 8 s for `prop.describe.draft`, ≤ 12 s for `prop.photo.tag`
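
As an illustration of the outbox SLIs above, the following sketch computes a nearest-rank p95 of `published_at - created_at` and the backlog count from in-memory rows. It is an assumption-laden example: the row shape (a `(created_at, published_at)` tuple) and the function names are hypothetical, and a production dashboard would more likely derive the percentile from the metric backend than from raw rows.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

Row = tuple[datetime, Optional[datetime]]  # (created_at, published_at)


def p95_lag_seconds(rows: list[Row]) -> Optional[float]:
    """Nearest-rank p95 of published_at - created_at over published rows."""
    lags = sorted((pub - created).total_seconds()
                  for created, pub in rows if pub is not None)
    if not lags:
        return None
    rank = (95 * len(lags) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return lags[rank - 1]


def backlog_size(rows: list[Row]) -> int:
    """Rows where published_at IS NULL."""
    return sum(1 for _, pub in rows if pub is None)


# Hypothetical data: 20 published rows with lags of 1..20 s, plus 2 unpublished.
base = datetime(2026, 4, 22, 10, 0, 0, tzinfo=timezone.utc)
rows: list[Row] = [(base, base + timedelta(seconds=s)) for s in range(1, 21)]
rows += [(base, None), (base, None)]
print(p95_lag_seconds(rows), backlog_size(rows))  # 19.0 2
```

In this sample the p95 lag of 19 s would breach the 5 s SLO, while the backlog of 2 rows is well inside the limit of 100.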

2. Required Span Attributes

Every span (HTTP, DB, Pub/Sub, Redis, internal RPC) must carry:

melmastoon.tenant_id
melmastoon.request_id
melmastoon.actor_user_id           (when authenticated)
melmastoon.actor_kind              (user|system|integration|ai)
melmastoon.route                   (e.g., GET /properties/:id)
melmastoon.aggregate               (property|room|room_type|photo|policy|amenity)
melmastoon.aggregate_id
melmastoon.idempotency_key         (when present)
melmastoon.cache.hit               (true|false on Redis-touching spans)
melmastoon.error.code              (MELMASTOON.* on errors only)

Sync spans add: melmastoon.device_id, melmastoon.operation_id, melmastoon.conflict_resolution.

AI spans add: melmastoon.ai.run_id, melmastoon.ai.capability, melmastoon.ai.model_route, melmastoon.ai.tokens_in, melmastoon.ai.tokens_out, melmastoon.ai.moderation_flagged.

Trace propagation: W3C traceparent only; gateway injects, every internal hop preserves.
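
The attribute contract above lends itself to an automated check, for example in a test that inspects exported spans. The sketch below is one possible shape for such a check, operating on a plain attribute dictionary rather than a real OpenTelemetry span; the function name, the `span_kind` values, and the flag parameters are assumptions made for illustration.

```python
PREFIX = "melmastoon."
ACTOR_KINDS = {"user", "system", "integration", "ai"}
AGGREGATES = {"property", "room", "room_type", "photo", "policy", "amenity"}

BASE = ("tenant_id", "request_id", "actor_kind", "route", "aggregate", "aggregate_id")
EXTRA = {
    "sync": ("device_id", "operation_id", "conflict_resolution"),
    "ai": ("ai.run_id", "ai.capability", "ai.model_route",
           "ai.tokens_in", "ai.tokens_out", "ai.moderation_flagged"),
}


def span_attribute_problems(attrs: dict, *, span_kind: str = "base",
                            authenticated: bool = False, is_error: bool = False,
                            touches_redis: bool = False) -> list[str]:
    """Return human-readable violations of the span attribute contract."""
    required = list(BASE) + list(EXTRA.get(span_kind, ()))
    if authenticated:
        required.append("actor_user_id")
    if touches_redis:
        required.append("cache.hit")
    if is_error:
        required.append("error.code")

    problems = [f"missing {PREFIX}{name}" for name in required
                if PREFIX + name not in attrs]

    # Enumerated values listed in the attribute table.
    kind = attrs.get(PREFIX + "actor_kind")
    if kind is not None and kind not in ACTOR_KINDS:
        problems.append(f"invalid {PREFIX}actor_kind: {kind}")
    aggregate = attrs.get(PREFIX + "aggregate")
    if aggregate is not None and aggregate not in AGGREGATES:
        problems.append(f"invalid {PREFIX}aggregate: {aggregate}")

    # error.code is documented as "on errors only".
    if not is_error and PREFIX + "error.code" in attrs:
        problems.append(f"{PREFIX}error.code set on a non-error span")
    return problems


# Hypothetical attribute values, for illustration only.
attrs = {
    "melmastoon.tenant_id": "tnt_demo",
    "melmastoon.request_id": "req_demo",
    "melmastoon.actor_kind": "system",
    "melmastoon.route": "GET /properties/:id",
    "melmastoon.aggregate": "property",
    "melmastoon.aggregate_id": "ppt_demo",
}
print(span_attribute_problems(attrs))                      # []
print(span_attribute_problems(attrs, touches_redis=True))  # ['missing melmastoon.cache.hit']
```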

3. Metrics Catalog

All counters/histograms are tagged with tenant_id (low cardinality after aggregation by tenant tier) and route or aggregate as appropriate. Histograms use OTel exponential buckets.

Name	Type	Tags	Description
`property_http_requests_total`	counter	route, method, status, tenant_id	All HTTP requests
`property_http_request_duration_seconds`	histogram	route, method, status	End-to-end latency
`property_db_query_duration_seconds`	histogram	operation, table	Repository spans
`property_outbox_lag_seconds`	gauge	topic	Oldest unpublished row age
`property_outbox_publish_total`	counter	topic, result	success / retry / dlq
`property_inbox_processed_total`	counter	topic, result	success / retry / dlq
`property_sync_pull_rows_total`	counter	aggregate	Pull throughput
`property_sync_apply_total`	counter	aggregate, result	applied / rejected / conflict_resolved
`property_sync_conflicts_total`	counter	aggregate, resolution	Conflict counter
`property_publish_total`	counter	result, reason	accepted / rejected_invariant / rejected_authz
`property_room_status_changes_total`	counter	from_status, to_status, source	source ∈ {api, sync, event(housekeeping), event(maintenance), system}
`property_photo_pipeline_seconds`	histogram	stage	upload, scan_wait, ready
`property_ai_runs_total`	counter	capability, result	success / blocked / quota / error
`property_ai_run_seconds`	histogram	capability, model_route	AI run duration
`property_authz_denied_total`	counter	route, role	Repeated denials → security alert
`property_geo_search_seconds`	histogram	shape	bbox / nearby
`property_cache_total`	counter	key_pattern, result	hit / miss / fill / invalidate
`property_tenant_isolation_audit_failures_total`	counter	table	MUST be 0

4. Logs

4.1 Format

JSON to stdout, one event per line. Fields:

{
  "ts": "2026-04-22T10:00:00.123Z",
  "level": "INFO",
  "service": "property-service",
  "version": "git-sha-abc1234",
  "msg": "property.published",
  "tenant_id": "tnt_…",
  "request_id": "req_…",
  "trace_id": "00-…-…-01",
  "span_id": "…",
  "actor_user_id": "usr_…",
  "actor_kind": "user",
  "route": "POST /properties/:id/publish",
  "aggregate": "property",
  "aggregate_id": "ppt_…",
  "duration_ms": 312,
  "outcome": "ok",
  "extras": { "version": 7 }
}
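
A minimal sketch of an emitter for this shape, using only the standard library, might look as follows. The function name `format_log_event` and the sample identifiers are hypothetical; the sketch only demonstrates the one-event-per-line JSON convention and the millisecond UTC timestamp, not the service's actual logger.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def format_log_event(msg: str, *, level: str = "INFO",
                     now: Optional[datetime] = None, **fields) -> str:
    """Serialize one log event as a single JSON line (no trailing newline)."""
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = {
        # Millisecond precision with a "Z" suffix, as in the example above.
        "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "service": "property-service",
        "msg": msg,
    }
    # Skip fields that are not set instead of emitting nulls (an assumption).
    event.update({k: v for k, v in fields.items() if v is not None})
    # json.dumps escapes embedded newlines, so the event stays on one line.
    return json.dumps(event, ensure_ascii=False, separators=(",", ":"))


line = format_log_event(
    "property.published",
    now=datetime(2026, 4, 22, 10, 0, 0, 123000, tzinfo=timezone.utc),
    tenant_id="tnt_demo",          # hypothetical identifier
    route="POST /properties/:id/publish",
    duration_ms=312,
    outcome="ok",
    extras={"version": 7},
)
print(line)
```

Writing each returned line to stdout followed by a newline yields the one-event-per-line stream described in this section.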

4.2 Required log events

Every state-changing API call: route.completed with outcome and aggregate version.
Every emitted domain event: outbox.appended with topic + outboxId.
Every consumed event: inbox.processed with topic + result.
Every authorization denial: authz.denied with role, subject_user_id, resource_type, resource_id, policy_decision_id.
Every AI run: ai.run.completed with the orchestrator response metadata.
Every conflict resolution: sync.conflict.resolved with the field-level merge summary.

4.3 Redactions

The LoggingInterceptor redacts contact.phone, contact.email, OAuth tokens, signed URLs, and any field tagged @Sensitive in the DTO layer.
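
The path-based part of this redaction can be sketched as below. This is not the LoggingInterceptor itself: it covers only the two named contact fields, and the `[REDACTED]` placeholder, the function name, and the sample payload are assumptions. Token, signed-URL, and @Sensitive handling are out of scope for the sketch.

```python
import copy

REDACTED = "[REDACTED]"  # placeholder value is an assumption
SENSITIVE_PATHS = ("contact.phone", "contact.email")


def redact(payload: dict, paths: tuple[str, ...] = SENSITIVE_PATHS) -> dict:
    """Return a deep copy of payload with the given dotted paths masked."""
    out = copy.deepcopy(payload)  # never mutate the caller's object
    for path in paths:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            # Stop walking as soon as the path does not exist in this payload.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and leaf in node:
            node[leaf] = REDACTED
    return out


# Hypothetical payload, for illustration only.
body = {"name": "Demo Inn", "contact": {"phone": "+00 000 0000", "email": "a@example.com"}}
print(redact(body))
# {'name': 'Demo Inn', 'contact': {'phone': '[REDACTED]', 'email': '[REDACTED]'}}
```

Working on a deep copy keeps the request body intact for the handler while only the logged representation is masked.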

5. Dashboards

Authoritative Grafana / Cloud Monitoring dashboards (URLs maintained in infra/observability/dashboards/property-service/):

Property — Service Health. RED metrics per route, p50/p95/p99 latency, error rate, outbox lag, inbox lag.
Property — Domain Activity. Properties created/published/unpublished per hour, room create/archive, OOO transitions, photo pipeline funnel.
Property — AI Activity. Runs per capability, acceptance rate, moderation block rate, quota usage per tenant.
Property — Sync. Pull rows/min, push apply success, conflict rate, per-device error rates.
Property — Tenant Drill-down. Same panels filtered by tenant_id (templated variable).
Property — Search Projection Health. Time from event to projection visibility (computed via search-aggregation-service echo metric).

6. Alerts (with runbook anchors)

Alert	Condition	Severity	Runbook
`PropertyReadAvailabilityBurn`	error budget burn 14.4× over 1 h	page (P1)	runbooks/property/read-availability.md
`PropertyWriteAvailabilityBurn`	same on writes	page (P1)	runbooks/property/write-availability.md
`PropertyOutboxBacklog`	`property_outbox_lag_seconds > 30` for 5 min	page (P2)	runbooks/property/outbox-backlog.md
`PropertyInboxStuck`	`property_inbox_processed_total{result="error"}` rate > 1/s for 10 min	page (P2)	runbooks/property/inbox-stuck.md
`PropertyTenantIsolationFailure`	`property_tenant_isolation_audit_failures_total > 0`	page (P0, security)	runbooks/security/tenant-isolation-breach.md
`PropertyPublishRejectSpike`	`property_publish_total{result="rejected_invariant"}` rate > baseline ×5	ticket	runbooks/property/publish-rejects.md
`PropertyPhotoPipelineSlow`	`property_photo_pipeline_seconds{stage="ready"}` p95 > 60s for 30 min	ticket	runbooks/property/photo-pipeline.md
`PropertyAIQuotaExhaustionSpike`	`property_ai_runs_total{result="quota"}` rate > 1/s	ticket	runbooks/ai/quota-spikes.md
`PropertySyncConflictStorm`	`property_sync_conflicts_total` rate per device > 5/min for 30 min	ticket	runbooks/property/sync-conflict-storm.md
`PropertyAuthzDenialSpike`	`property_authz_denied_total` rate per actor > 30/min	ticket (security)	runbooks/security/authz-denial.md
`PropertyGeoSearchSlow`	`property_geo_search_seconds` p99 > 1 s for 30 min	ticket	runbooks/property/geo-search.md

All page-grade alerts include the affected tenant_id (where unambiguous) and a templated link to the relevant dashboard with the time range pre-selected.
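
Several conditions in the table have the form "value above a threshold for N minutes". The sketch below shows one way to read that semantics over a series of samples: the condition fires only when the trailing run of breaching samples spans at least the full duration. In practice the alerting backend performs this evaluation; the function name and sample values here are illustrative assumptions.

```python
def sustained_breach(samples: list[tuple[float, float]],
                     threshold: float, for_seconds: float) -> bool:
    """samples: (unix_seconds, value) pairs in ascending time order.

    True when every sample in the trailing run exceeds the threshold and
    that run covers at least `for_seconds`.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break  # the run of breaching samples ends here
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= for_seconds


# Hypothetical property_outbox_lag_seconds samples, one per minute.
lag = [(0, 10), (60, 35), (120, 40), (180, 45), (240, 50), (300, 55), (360, 60)]
print(sustained_breach(lag, threshold=30, for_seconds=300))  # True

recovered = [(0, 10), (60, 35), (120, 40), (180, 20), (240, 50), (300, 55), (360, 60)]
print(sustained_breach(recovered, threshold=30, for_seconds=300))  # False
```

The second series dips below the threshold at t=180, which resets the run, so a 5-minute condition does not fire even though most samples are above 30.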

7. Synthetic Probes

Read probe. Every 60 s from each region: GET /properties/{seedPropertyId} for the platform synthetic tenant.
Publish probe. Hourly: full create→publish→unpublish→archive cycle on a synthetic property.
Sync probe. Every 5 min: simulated pull + push of a status flip on a synthetic room.
Photo probe. Every 30 min: signed-URL upload + scan completion check.

Probe results feed a separate "external SLI" panel, and synthetic load is explicitly excluded from billing analytics.

8. Trace Sampling

100 % sampling on errors and on route ∈ {publish, archive}.
5 % head-based sampling on reads.
100 % sampling on AI capability spans.
1 % sampling on synthetic probe traffic.
Sampling decisions made at the gateway and propagated.
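
The rules above can be expressed as a rate lookup plus a deterministic per-trace decision, sketched below. Several points are assumptions rather than statements from this document: the precedence between rules (100 % rules win over the synthetic-probe rate), the `DEFAULT_RATE` for routes the list does not mention, the `route_kind` labels, and how an error is known at decision time, since a purely head-based decision at the gateway cannot see the eventual outcome. The ratio check on the trace ID is one common way to make every hop reach the same decision; it is not claimed to match any particular SDK's sampler.

```python
DEFAULT_RATE = 0.05  # assumption: the document gives no rate for other routes


def sampling_rate(*, route_kind: str, is_error: bool = False,
                  is_ai: bool = False, is_synthetic: bool = False) -> float:
    """Map a request to a sampling probability per the rules in this section."""
    if is_error or is_ai or route_kind in {"publish", "archive"}:
        return 1.0
    if is_synthetic:
        return 0.01
    if route_kind == "read":
        return 0.05
    return DEFAULT_RATE


def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic decision derived from the low 64 bits of the trace ID."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < rate * 2**64


# Hypothetical 32-hex-character trace ID.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(sampling_rate(route_kind="read"))                     # 0.05
print(sampling_rate(route_kind="read", is_synthetic=True))  # 0.01
print(should_sample(trace_id, sampling_rate(route_kind="publish")))  # True
```

Because the decision depends only on the trace ID and the rate, a downstream service that recomputes it would agree with the gateway, although the section itself requires the gateway's decision to be propagated rather than recomputed.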

9. Capacity & Cost

Capacity SLIs reported weekly:

Active properties per tenant (p50, p95, max).
Rooms per property (p50, p95).
Photos per property (p50, p95).
Outbox backlog mean / max.
Cloud SQL CPU + IOPS utilization.

The cost panel combines (a) Cloud Run billable seconds, (b) Cloud SQL units, (c) Pub/Sub throughput, and (d) AI orchestrator-attributed runs into a per-tenant cost projection used by billing-service.

10. On-call & Hand-off

Primary: property-service-oncall rotation in PagerDuty (platform-domain schedule).
Escalation: platform SRE → engineering lead.
Hand-off doc updated weekly in runbooks/property/oncall-handoff.md with the previous week's incidents, top false-positive alerts, and mitigation status.

Cross-references: failure catalog and runbook bodies in FAILURE_MODES; deployment topology and resource limits in DEPLOYMENT_TOPOLOGY; event topics in EVENT_SCHEMAS.
