property-service — OBSERVABILITY
Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY
This document defines the SLIs, SLOs, dashboards, alerts, runbook hooks, structured-log shape, and trace/metric attributes that property-service is required to emit. It is a binding contract between the service team and platform SRE.
Stack: OpenTelemetry SDK for traces + metrics, exported via OTLP to Cloud Trace and Cloud Monitoring; structured JSON logs to Cloud Logging via stdout. No vendor-specific instrumentation.
1. SLIs and SLOs
1.1 User-facing SLIs
| SLI | Definition | SLO (rolling 28d) | Burn alert |
|---|---|---|---|
| Read availability | 1 - (5xx on GET /properties* / total GET) | ≥ 99.95 % | 14.4× burn over 1 h or 6× over 6 h |
| Read latency p99 | p99 of GET /properties/:id end-to-end on the service span | ≤ 250 ms | breach 30 min |
| Write availability | 1 - (5xx on POST/PATCH/DELETE / total mutations) | ≥ 99.9 % | 14.4× burn over 1 h |
| Write latency p99 | p99 of POST /properties and PATCH /properties/:id | ≤ 800 ms | breach 30 min |
| Publish success rate | accepted_publish / requested_publish (excludes 4xx by domain validation) | ≥ 99.5 % | breach 1 h |
| Photo readiness latency p95 | time from POST /properties/:id/photos to photo.added.v1 with status=ready | ≤ 30 s | breach 1 h |
| Geo-search latency p99 | p99 of GET /properties/geo/nearby | ≤ 400 ms | breach 30 min |
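The burn-rate thresholds in the table follow the usual error-budget arithmetic: burn rate is the observed error ratio divided by the error ratio the SLO allows. A minimal sketch of that calculation for the read-availability row, assuming per-window request counts are already available (the function names are illustrative and not part of the service):

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    return (bad / total) / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_days: int = 28) -> float:
    """Fraction of the rolling-period error budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def read_availability_should_page(bad_1h: int, total_1h: int,
                                  bad_6h: int, total_6h: int,
                                  slo: float = 0.9995) -> bool:
    """Multi-window check from the table: 14.4x over 1 h or 6x over 6 h."""
    return (burn_rate(bad_1h, total_1h, slo) >= 14.4
            or burn_rate(bad_6h, total_6h, slo) >= 6.0)
```

Under these definitions, a 14.4× burn sustained for 1 h spends about 2.1 % of the 28-day budget, and a 6× burn sustained for 6 h spends about 5.4 %.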
1.2 Internal SLIs
| SLI | Definition | SLO |
|---|---|---|
| Outbox publish lag p95 | published_at - created_at for outbox rows | ≤ 5 s |
| Outbox backlog | rows where published_at IS NULL | ≤ 100 sustained |
| Inbox processing lag p95 | processed_at - received_at for inbox rows | ≤ 10 s |
| Sync apply latency p95 | /internal/sync/apply per operation | ≤ 200 ms |
| AI orchestrator round-trip p95 | per capability | ≤ 8 s for prop.describe.draft, ≤ 12 s for prop.photo.tag |
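As an illustration of the outbox publish lag SLI, the sketch below computes a nearest-rank p95 over `published_at - created_at`. It assumes rows arrive as `(created_at, published_at)` pairs and that unpublished rows carry `None`. The helper names are hypothetical, and the production figure would normally come from the metrics backend rather than ad hoc code.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple


def p95_nearest_rank(values: List[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("p95 is undefined for an empty sample")
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def outbox_lag_p95_seconds(
    rows: Iterable[Tuple[datetime, Optional[datetime]]]
) -> float:
    """p95 of published_at - created_at, ignoring rows not yet published."""
    lags = [
        (published_at - created_at).total_seconds()
        for created_at, published_at in rows
        if published_at is not None
    ]
    return p95_nearest_rank(lags)
```

Rows that are still unpublished are excluded here because they belong to the separate backlog SLI.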
2. Required Span Attributes
Every span (HTTP, DB, Pub/Sub, Redis, internal RPC) must carry:
melmastoon.tenant_id
melmastoon.request_id
melmastoon.actor_user_id (when authenticated)
melmastoon.actor_kind (user|system|integration|ai)
melmastoon.route (e.g., GET /properties/:id)
melmastoon.aggregate (property|room|room_type|photo|policy|amenity)
melmastoon.aggregate_id
melmastoon.idempotency_key (when present)
melmastoon.cache.hit (true|false on Redis-touching spans)
melmastoon.error.code (MELMASTOON.* on errors only)
Sync spans add: melmastoon.device_id, melmastoon.operation_id, melmastoon.conflict_resolution.
AI spans add: melmastoon.ai.run_id, melmastoon.ai.capability, melmastoon.ai.model_route, melmastoon.ai.tokens_in, melmastoon.ai.tokens_out, melmastoon.ai.moderation_flagged.
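The common attribute rules above could be enforced in a test helper or lint step. The following is a minimal sketch under the assumption that span attributes are available as a plain dictionary. The function name and its flags are hypothetical, and it covers only the common list, not the sync or AI additions.

```python
from typing import Any, Dict, List

ACTOR_KINDS = {"user", "system", "integration", "ai"}
AGGREGATES = {"property", "room", "room_type", "photo", "policy", "amenity"}
ALWAYS_REQUIRED = (
    "melmastoon.tenant_id",
    "melmastoon.request_id",
    "melmastoon.actor_kind",
    "melmastoon.route",
    "melmastoon.aggregate",
    "melmastoon.aggregate_id",
)


def span_attribute_problems(attrs: Dict[str, Any], *, authenticated: bool = False,
                            touches_redis: bool = False,
                            is_error: bool = False) -> List[str]:
    """Return human-readable violations of the span attribute contract."""
    problems = [f"missing {key}" for key in ALWAYS_REQUIRED if key not in attrs]
    if authenticated and "melmastoon.actor_user_id" not in attrs:
        problems.append("missing melmastoon.actor_user_id")
    if touches_redis and not isinstance(attrs.get("melmastoon.cache.hit"), bool):
        problems.append("melmastoon.cache.hit must be a boolean")
    code = attrs.get("melmastoon.error.code")
    if is_error and not str(code or "").startswith("MELMASTOON."):
        problems.append("melmastoon.error.code must start with MELMASTOON.")
    if not is_error and code is not None:
        problems.append("melmastoon.error.code is only allowed on errors")
    kind = attrs.get("melmastoon.actor_kind")
    if kind is not None and kind not in ACTOR_KINDS:
        problems.append(f"unknown actor_kind {kind!r}")
    aggregate = attrs.get("melmastoon.aggregate")
    if aggregate is not None and aggregate not in AGGREGATES:
        problems.append(f"unknown aggregate {aggregate!r}")
    return problems
```

An empty list means the span satisfies the common contract for the flags supplied.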
Trace propagation: W3C traceparent only; gateway injects, every internal hop preserves.
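For reference, a W3C `traceparent` header has four dash-separated lowercase hex fields: version, trace ID (32 digits), parent span ID (16 digits), and flags (2 digits). The sketch below parses only the version `00` layout and rejects the all-zero IDs that the specification marks invalid. The type and function names are illustrative.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None when it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # assumption: this sketch does not handle future versions
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the specification
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A hop that preserves the header should forward the trace ID unchanged and replace only the parent span ID with its own span ID.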
3. Metrics Catalog
All counters/histograms are tagged with tenant_id (low cardinality after aggregation by tenant tier) and route or aggregate as appropriate. Histograms use OTel exponential buckets.
| Name | Type | Tags | Description |
|---|---|---|---|
| property_http_requests_total | counter | route, method, status, tenant_id | All HTTP requests |
| property_http_request_duration_seconds | histogram | route, method, status | End-to-end latency |
| property_db_query_duration_seconds | histogram | operation, table | Repository spans |
| property_outbox_lag_seconds | gauge | topic | Oldest unpublished row age |
| property_outbox_publish_total | counter | topic, result | success / retry / dlq |
| property_inbox_processed_total | counter | topic, result | success / retry / dlq |
| property_sync_pull_rows_total | counter | aggregate | Pull throughput |
| property_sync_apply_total | counter | aggregate, result | applied / rejected / conflict_resolved |
| property_sync_conflicts_total | counter | aggregate, resolution | Conflict counter |
| property_publish_total | counter | result, reason | accepted / rejected_invariant / rejected_authz |
| property_room_status_changes_total | counter | from_status, to_status, source | source ∈ {api, sync, event(housekeeping), event(maintenance), system} |
| property_photo_pipeline_seconds | histogram | stage | upload, scan_wait, ready |
| property_ai_runs_total | counter | capability, result | success / blocked / quota / error |
| property_ai_run_seconds | histogram | capability, model_route | |
| property_authz_denied_total | counter | route, role | Repeated denials → security alert |
| property_geo_search_seconds | histogram | shape | bbox / nearby |
| property_cache_total | counter | key_pattern, result | hit / miss / fill / invalidate |
| property_tenant_isolation_audit_failures_total | counter | table | MUST be 0 |
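To make "OTel exponential buckets" concrete: at scale `s` the bucket base is `2 ** (2 ** -s)`, and bucket `i` covers the half-open range `(base ** i, base ** (i + 1)]`. The sketch below maps a positive measurement to its bucket index using the logarithm method. It is for illustration only, since the SDK performs this mapping internally, and the logarithm method can be off by one for values that sit extremely close to a bucket boundary.

```python
import math
from typing import Tuple


def bucket_index(value: float, scale: int) -> int:
    """Index of the exponential-histogram bucket that contains `value`."""
    if value <= 0:
        raise ValueError("exponential buckets only cover positive values")
    # log_base(value) == log2(value) * 2**scale, because base == 2 ** (2 ** -scale).
    return math.ceil(math.log2(value) * 2 ** scale) - 1


def bucket_bounds(index: int, scale: int) -> Tuple[float, float]:
    """(exclusive lower bound, inclusive upper bound) of bucket `index`."""
    base = 2 ** (2 ** -scale)
    return base ** index, base ** (index + 1)
```

Higher scales give finer buckets: at scale 0 each bucket doubles in width, while at scale 3 each bucket is about 9 % wider than the previous one.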
4. Logs
4.1 Format
JSON to stdout, one event per line. Fields:
{
  "ts": "2026-04-22T10:00:00.123Z",
  "level": "INFO",
  "service": "property-service",
  "version": "git-sha-abc1234",
  "msg": "property.published",
  "tenant_id": "tnt_…",
  "request_id": "req_…",
  "trace_id": "00-…-…-01",
  "span_id": "…",
  "actor_user_id": "usr_…",
  "actor_kind": "user",
  "route": "POST /properties/:id/publish",
  "aggregate": "property",
  "aggregate_id": "ppt_…",
  "duration_ms": 312,
  "outcome": "ok",
  "extras": { "version": 7 }
}
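A minimal sketch of an emitter that produces this shape, one JSON object per line on stdout. It assumes the request-scoped fields (tenant, request, trace, actor, route, aggregate) are collected in a `context` dictionary by some middleware. The function names and the `context` convention are hypothetical and do not describe the service's actual logger.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Any, Dict, Optional

SERVICE = "property-service"


def format_log_event(msg: str, context: Dict[str, Any], *, version: str,
                     level: str = "INFO", duration_ms: Optional[int] = None,
                     outcome: Optional[str] = None,
                     extras: Optional[Dict[str, Any]] = None,
                     now: Optional[datetime] = None) -> str:
    """Build one single-line JSON log event with the documented field names."""
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event: Dict[str, Any] = {
        "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "service": SERVICE,
        "version": version,
        "msg": msg,
    }
    # Request-scoped fields must not overwrite the core fields above.
    event.update({k: v for k, v in context.items() if k not in event})
    if duration_ms is not None:
        event["duration_ms"] = duration_ms
    if outcome is not None:
        event["outcome"] = outcome
    if extras:
        event["extras"] = extras
    return json.dumps(event, ensure_ascii=False, separators=(",", ":"))


def emit(line: str) -> None:
    """Write one event per line to stdout, where Cloud Logging collects it."""
    sys.stdout.write(line + "\n")
    sys.stdout.flush()
```

Compact separators and a single trailing newline keep each event on exactly one line, which is what line-oriented log collection expects.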
4.2 Required log events
- Every state-changing API call: route.completed with outcome and aggregate version.
- Every emitted domain event: outbox.appended with topic + outboxId.
- Every consumed event: inbox.processed with topic + result.
- Every authorization denial: authz.denied with role, subject_user_id, resource_type, resource_id, policy_decision_id.
- Every AI run: ai.run.completed with the orchestrator response metadata.
- Every conflict resolution: sync.conflict.resolved with the field-level merge summary.
4.3 Redactions
The LoggingInterceptor redacts contact.phone, contact.email, OAuth tokens, signed URLs, and any field tagged @Sensitive in the DTO layer.
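The redaction step can be pictured as a recursive walk that masks values at known dotted paths. The sketch below is an assumption-laden illustration, not the LoggingInterceptor itself. The `sensitive_paths` set stands in for the `@Sensitive` DTO tagging, and the `[REDACTED]` placeholder is a hypothetical choice.

```python
from typing import Any, Set

REDACTED = "[REDACTED]"  # hypothetical placeholder value


def redact(payload: Any, sensitive_paths: Set[str], _prefix: str = "") -> Any:
    """Return a copy of `payload` with values at sensitive dotted paths masked."""
    if isinstance(payload, dict):
        result = {}
        for key, value in payload.items():
            path = f"{_prefix}.{key}" if _prefix else key
            if path in sensitive_paths:
                result[key] = REDACTED
            else:
                result[key] = redact(value, sensitive_paths, path)
        return result
    if isinstance(payload, list):
        # List items share the parent's path, so "rooms.contact.phone"
        # applies to every element of "rooms".
        return [redact(item, sensitive_paths, _prefix) for item in payload]
    return payload
```

Returning a copy rather than mutating in place keeps the original request object usable by the handler after logging.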
5. Dashboards
Authoritative Grafana / Cloud Monitoring dashboards (URLs maintained in infra/observability/dashboards/property-service/):
- Property — Service Health. RED metrics per route, p50/p95/p99 latency, error rate, outbox lag, inbox lag.
- Property — Domain Activity. Properties created/published/unpublished per hour, room create/archive, OOO transitions, photo pipeline funnel.
- Property — AI Activity. Runs per capability, acceptance rate, moderation block rate, quota usage per tenant.
- Property — Sync. Pull rows/min, push apply success, conflict rate, per-device error rates.
- Property — Tenant Drill-down. Same panels filtered by tenant_id (templated variable).
- Property — Search Projection Health. Time from event to projection visibility (computed via search-aggregation-service echo metric).
6. Alerts (with runbook anchors)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| PropertyReadAvailabilityBurn | error budget burn 14.4× over 1 h | page (P1) | runbooks/property/read-availability.md |
| PropertyWriteAvailabilityBurn | error budget burn 14.4× over 1 h on writes | page (P1) | runbooks/property/write-availability.md |
| PropertyOutboxBacklog | property_outbox_lag_seconds > 30 for 5 min | page (P2) | runbooks/property/outbox-backlog.md |
| PropertyInboxStuck | property_inbox_processed_total{result="error"} rate > 1/s for 10 min | page (P2) | runbooks/property/inbox-stuck.md |
| PropertyTenantIsolationFailure | property_tenant_isolation_audit_failures_total > 0 | page (P0, security) | runbooks/security/tenant-isolation-breach.md |
| PropertyPublishRejectSpike | property_publish_total{result="rejected_invariant"} rate > baseline ×5 | ticket | runbooks/property/publish-rejects.md |
| PropertyPhotoPipelineSlow | property_photo_pipeline_seconds{stage="ready"} p95 > 60 s for 30 min | ticket | runbooks/property/photo-pipeline.md |
| PropertyAIQuotaExhaustionSpike | property_ai_runs_total{result="quota"} rate > 1/s | ticket | runbooks/ai/quota-spikes.md |
| PropertySyncConflictStorm | property_sync_conflicts_total rate per device > 5/min for 30 min | ticket | runbooks/property/sync-conflict-storm.md |
| PropertyAuthzDenialSpike | property_authz_denied_total rate per actor > 30/min | ticket (security) | runbooks/security/authz-denial.md |
| PropertyGeoSearchSlow | property_geo_search_seconds p99 > 1 s for 30 min | ticket | runbooks/property/geo-search.md |
All page-grade alerts include the affected tenant_id (where unambiguous) and a templated link to the relevant dashboard with the time range pre-selected.
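Several conditions in the table have the form "metric above a threshold for N minutes". The sketch below illustrates one reading of that form: the alert fires only when the most recent unbroken run of breaching samples spans at least the stated duration. It assumes samples are `(unix_seconds, value)` pairs sorted by time. Real alert policies live in the monitoring backend, so this is explanatory only.

```python
from typing import List, Optional, Tuple


def condition_held_for(samples: List[Tuple[float, float]],
                       threshold: float, for_seconds: float) -> bool:
    """True when the latest unbroken run of value > threshold lasts >= for_seconds."""
    streak_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if streak_start is None:
                streak_start = timestamp  # a new breach run begins here
        else:
            streak_start = None  # any healthy sample resets the run
    if streak_start is None:
        return False
    return samples[-1][0] - streak_start >= for_seconds
```

For PropertyOutboxBacklog this would be called with `threshold=30` and `for_seconds=300` over recent property_outbox_lag_seconds samples.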
7. Synthetic Probes
- Read probe. Every 60 s from each region: GET /properties/{seedPropertyId} for the platform synthetic tenant.
- Publish probe. Hourly: full create→publish→unpublish→archive cycle on a synthetic property.
- Sync probe. Every 5 min: simulated pull + push of a status flip on a synthetic room.
- Photo probe. Every 30 min: signed-URL upload + scan completion check.
Probe results feed a separate "external SLI" panel, and synthetic load is explicitly excluded from billing analytics.
8. Trace Sampling
- 100 % sampling on errors and on route ∈ {publish, archive}.
- 5 % head-based sampling on reads.
- 100 % sampling on AI capability spans.
- 1 % sampling on synthetic probe traffic.
- Sampling decisions made at the gateway and propagated.
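The sampling rules above can be sketched as a rate lookup followed by a deterministic decision keyed on the trace ID, similar in spirit to trace-ID ratio sampling. Several points are assumptions here: the order in which overlapping rules apply, the `default_rate` for traffic the list does not mention, and the treatment of `is_error` as an input that is already known when the decision is made.

```python
ALWAYS_SAMPLED_ROUTES = {"publish", "archive"}


def sampling_rate(*, route: str, is_error: bool, is_ai_span: bool,
                  is_synthetic: bool, is_read: bool,
                  default_rate: float) -> float:
    """Pick the sampling rate for a request; rule precedence is an assumption."""
    if is_error or is_ai_span or route in ALWAYS_SAMPLED_ROUTES:
        return 1.0
    if is_synthetic:
        return 0.01
    if is_read:
        return 0.05
    return default_rate  # not specified by the rules above


def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic decision: compare the low 64 bits of the trace ID to the rate."""
    bound = int(rate * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound
```

Because the decision depends only on the trace ID and the rate, every hop that recomputes it reaches the same answer, although the rules above already have the gateway decide once and propagate the result.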
9. Capacity & Cost
Capacity SLIs reported weekly:
- Active properties per tenant (p50, p95, max).
- Rooms per property (p50, p95).
- Photos per property (p50, p95).
- Outbox backlog mean / max.
- Cloud SQL CPU + IOPS utilization.
The cost panel combines (a) Cloud Run billable seconds, (b) Cloud SQL units, (c) Pub/Sub throughput, and (d) AI orchestrator-attributed runs into a per-tenant cost projection used by billing-service.
10. On-call & Hand-off
- Primary: property-service-oncall rotation in PagerDuty (platform-domain schedule).
- Escalation: platform SRE → engineering lead.
- Hand-off doc updated weekly in runbooks/property/oncall-handoff.md with the previous week's incidents, top false-positive alerts, and mitigation status.
Cross-references: failure catalog and runbook bodies in FAILURE_MODES; deployment topology and resource limits in DEPLOYMENT_TOPOLOGY; event topics in EVENT_SCHEMAS.