EP-MEL-20 — Observability, Reliability, Cost Controls
Companion: Backlog README · EPICS.md · canonical: 07-epics-and-user-stories.md §22
Summary
| Field | Value |
|---|---|
| Wave | R1 (+ multi-region in R3) |
| Priority | P0 |
| Primary owner | platform-wide (Platform & SRE squad) |
| Participating services | every service |
| Journeys realised | All journeys (telemetry coverage) |
| Workflows | All |
| Frontend surfaces | Control Plane |
| Story count | 18 |
Outcome
Every service emits structured logs (with trace_id, tenant_id, request_id), metrics (with a tenant_id label), and distributed traces propagated via W3C traceparent. SLOs are defined and alerting is wired, and costs are visible per tenant and per service. The epic also delivers autoscaling, error-budget enforcement, feature flags, canary rollouts, backups, DLQ handling, runbooks, post-mortems, RUM, sync telemetry, AI cost telemetry, and a customer-facing status page.
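To make the logging part of the outcome concrete, here is a minimal sketch of a structured log line that carries the three correlation fields. Only the field names trace_id, tenant_id, and request_id come from this epic; the `JsonFormatter` class, the `build_logger` helper, the logger name, and the `missing_fields` / `missing_reason` keys are assumptions made for illustration and are not a statement of how any service implements this.

```python
import io
import json
import logging

# Field names taken from the epic; everything else here is illustrative.
REQUIRED_FIELDS = ("trace_id", "tenant_id", "request_id")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        missing = []
        for field in REQUIRED_FIELDS:
            value = getattr(record, field, None)
            if value is None:
                missing.append(field)
            else:
                payload[field] = value
        if missing:
            # Hypothetical convention for the "explicit reason for absence" rule.
            payload["missing_fields"] = missing
            payload["missing_reason"] = getattr(record, "missing_reason", "unspecified")
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger("mel.example")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Emit one record and return it parsed back from JSON."""
    buffer = io.StringIO()
    logger = build_logger(buffer)
    logger.info(
        "booking created",
        extra={
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "tenant_id": "tenant-42",
            "request_id": "req-001",
        },
    )
    return json.loads(buffer.getvalue())


if __name__ == "__main__":
    print(demo())
```

Passing the fields through `extra` keeps call sites short. In a real service they would more likely be injected once per request (for example by middleware) than repeated at every log call.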
Cross-cutting AC for this epic
- Every log line carries trace_id, tenant_id, request_id (or explicit reason for absence).
- Every Cloud Run service has SLO defined + alert routes; error budget burn alerts wired.
- Every Cloud SQL instance has automated backups + PITR; restore drill rehearsed.
- Every Pub/Sub subscription has DLQ + alert + replay workflow.
- Cost dashboards refresh ≥ daily; per-tenant cost band documented.
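
The error budget burn alerts in the list above rest on simple arithmetic: the error budget is one minus the SLO target, and the burn rate is the observed error ratio divided by that budget. The sketch below shows the calculation under stated assumptions. The function names, the 99.9% target, the request counts, and the alert threshold are all hypothetical; the real values belong to the SLO definitions in the stories.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO window; higher values exhaust it proportionally sooner.
    """
    if total <= 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo_target)


def should_alert(short_window: float, long_window: float, threshold: float) -> bool:
    """Multi-window check: fire only if both windows exceed the threshold."""
    return short_window > threshold and long_window > threshold


if __name__ == "__main__":
    # Hypothetical numbers: 50 failures in 10,000 requests against a 99.9% SLO.
    rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
    print(f"burn rate ~ {rate:.2f}")
    print(should_alert(short_window=rate, long_window=rate, threshold=2.0))
```

Requiring both a short and a long window to exceed the threshold is one common way to avoid alerting on brief spikes while still reacting to sustained burn.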
Stories
| ID | Title |
|---|---|
| US-MEL-0144 | Structured logs with trace_id, tenant_id, request_id |
| US-MEL-0145 | Metrics with tenant_id label |
| US-MEL-0146 | Distributed traces with W3C traceparent |
| US-MEL-0147 | SLOs and alerting |
| US-MEL-0148 | Cost dashboards per tenant + per service |
| US-MEL-0149 | Autoscaling policies for Cloud Run |
| US-MEL-0150 | Synthetic monitoring of P0 journeys |
| US-MEL-0151 | Error budget enforcement |
| US-MEL-0152 | Feature flags with tenant scope |
| US-MEL-0153 | Canary rollouts with auto-rollback |
| US-MEL-0154 | Backup & restore for Cloud SQL |
| US-MEL-0155 | Pub/Sub DLQ handling |
| US-MEL-0156 | Runbook discoverability |
| US-MEL-0157 | Incident timeline and post-mortems |
| US-MEL-0158 | RUM (Real User Monitoring) for booking surfaces |
| US-MEL-0159 | Sync telemetry (queue depth, conflict rate) |
| US-MEL-0160 | AI cost & latency telemetry |
| US-MEL-0161 | Status page & customer-facing communication |
Full AC in ../07-epics-and-user-stories.md §22.
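
As an illustration for US-MEL-0146, the sketch below parses a W3C `traceparent` header in its version 00 layout: four lowercase-hex fields (version, trace-id, parent-id, trace-flags) separated by hyphens. The `TraceContext` type and `parse_traceparent` function are hypothetical names, and the sketch accepts only the four-field layout, so it does not handle future versions that may append fields. In practice a tracing library would normally do this parsing.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":
        return None  # version ff is invalid per the specification
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero identifiers are invalid
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return TraceContext(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    ctx = parse_traceparent(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    )
    print(ctx)
```

The `trace_id` extracted here is the same value that the structured logs are expected to carry, which is what lets a log line be joined to its trace.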
Cross-references
- Service readiness: ../roadmap/service-readiness-gates.md
- Definition of Done (Observability section): ../standards/DEFINITION_OF_DONE.md