EP-MEL-20 — Observability, Reliability, Cost Controls

Companion: Backlog README · EPICS.md · canonical: 07-epics-and-user-stories.md §22

Summary

Wave: R1 (+ multi-region in R3)
Priority: P0
Primary owner: platform-wide (Platform & SRE squad)
Participating services: every service
Journeys realised: All journeys (telemetry coverage)
Workflows: All
Frontend surfaces: Control Plane
Story count: 18

Outcome

Every service emits structured logs (with trace_id, tenant_id, request_id), metrics (with a tenant_id label), and spans propagated via W3C traceparent. SLOs are defined and alerting is wired. Costs are visible per tenant and per service. The epic also delivers autoscaling, error-budget enforcement, feature flags, canary rollouts, backups, DLQ handling, runbooks, post-mortems, RUM, sync telemetry, AI cost telemetry, and a customer-facing status page.
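As an illustration of how a service might obtain the trace_id it logs, here is a minimal sketch that extracts it from an incoming W3C `traceparent` header. The function name `trace_id_from_traceparent` is hypothetical, and the parser handles only the four-field layout used by version 00 of the header; it is not taken from the epic's acceptance criteria.

```python
import re

# W3C Trace Context, version 00 layout: version-traceid-parentid-flags,
# all lowercase hex (2, 32, 16 and 2 characters respectively).
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def trace_id_from_traceparent(header):
    """Return the 32-character trace id, or None if the header is unusable.

    Hypothetical helper for illustration only.
    """
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec treats version "ff" and all-zero ids as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(trace_id_from_traceparent(header))
```

Returning None rather than raising lets the caller decide whether to start a new trace or to record why the trace_id is absent.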

Cross-cutting AC for this epic

  • Every log line carries trace_id, tenant_id, and request_id (or an explicit reason for their absence).
  • Every Cloud Run service has a defined SLO and alert routes, with error-budget burn alerts wired.
  • Every Cloud SQL instance has automated backups and PITR (point-in-time recovery), and a restore drill has been rehearsed.
  • Every Pub/Sub subscription has a DLQ (dead-letter queue), an alert, and a replay workflow.
  • Cost dashboards refresh at least daily; the per-tenant cost band is documented.
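The first criterion in the list above could be met with a log formatter along the following lines. This is a sketch using only the Python standard library. The class name `ContextJsonFormatter`, the helper `build_logger`, and the `context_absent` / `context_absent_reason` field names are assumptions made for illustration, not names defined by this epic.

```python
import io
import json
import logging

REQUIRED_FIELDS = ("trace_id", "tenant_id", "request_id")


class ContextJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required context."""

    def format(self, record):
        entry = {"severity": record.levelname, "message": record.getMessage()}
        missing = []
        for name in REQUIRED_FIELDS:
            value = getattr(record, name, None)
            if value is None:
                missing.append(name)
            else:
                entry[name] = value
        if missing:
            # Assumed convention: state which fields are absent and why.
            entry["context_absent"] = {
                "fields": missing,
                "reason": getattr(record, "context_absent_reason", "unspecified"),
            }
        return json.dumps(entry, sort_keys=True)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(ContextJsonFormatter())
    logger = logging.getLogger("example-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info(
        "booking created",
        extra={"trace_id": "a" * 32, "tenant_id": "t-1", "request_id": "r-1"},
    )
    log.info("service starting", extra={"context_absent_reason": "no request in scope"})
    print(buffer.getvalue(), end="")
```

Context is passed through the standard `extra` argument, so call sites stay plain `logging` calls and a line without context still states which fields are missing and why.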

Stories

ID           Title
US-MEL-0144  Structured logs with trace_id, tenant_id, request_id
US-MEL-0145  Metrics with tenant_id label
US-MEL-0146  Distributed traces with W3C traceparent
US-MEL-0147  SLOs and alerting
US-MEL-0148  Cost dashboards per tenant + per service
US-MEL-0149  Autoscaling policies for Cloud Run
US-MEL-0150  Synthetic monitoring of P0 journeys
US-MEL-0151  Error budget enforcement
US-MEL-0152  Feature flags with tenant scope
US-MEL-0153  Canary rollouts with auto-rollback
US-MEL-0154  Backup & restore for Cloud SQL
US-MEL-0155  Pub/Sub DLQ handling
US-MEL-0156  Runbook discoverability
US-MEL-0157  Incident timeline and post-mortems
US-MEL-0158  RUM (Real User Monitoring) for booking surfaces
US-MEL-0159  Sync telemetry (queue depth, conflict rate)
US-MEL-0160  AI cost & latency telemetry
US-MEL-0161  Status page & customer-facing communication

Full AC in ../07-epics-and-user-stories.md §22.
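For US-MEL-0147 and US-MEL-0151, the arithmetic behind an error-budget burn alert can be sketched as follows. The function names are hypothetical. The 14.4 threshold is a commonly cited multi-window example (it corresponds to spending 2% of a 30-day budget in one hour) and is an assumption here, not a value taken from this epic's acceptance criteria.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the pace
    that would exhaust it at the end of the SLO period.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    allowed_error_ratio = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_ratio


def should_page(long_window_rate, short_window_rate, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Requiring the short window as well stops the alert from firing long
    after the problem has already recovered. The threshold is assumed.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts for a service with a 99% availability SLO.
    one_hour = burn_rate(bad_events=200, total_events=1000, slo_target=0.99)
    five_minutes = burn_rate(bad_events=30, total_events=100, slo_target=0.99)
    print(should_page(one_hour, five_minutes))
```

Expressing the alert as a ratio against the allowed error rate keeps the same threshold usable across services with different SLO targets.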

Cross-references