OBSERVABILITY — theme-config-service
Sibling: APPLICATION_LOGIC · FAILURE_MODES · TESTING_STRATEGY
Platform anchors:
docs/02-enterprise-architecture.md§Observability
This document defines logs, metrics, traces, dashboards, alerts, and the SLOs that govern operational excellence for theme-config-service.
1. SLOs
| Surface | Indicator | Objective | Window |
|---|---|---|---|
| Authoring API availability | 2xx + 4xx (excluding 408/499) / total | ≥ 99.9 % | 30 days |
| Authoring API latency | p95 of PATCH/POST ≤ 350 ms | ≥ 99 % of windows | 30 days |
| Publish use case end-to-end | bundle uploaded + outbox written within 4 s | ≥ 99 % | 30 days |
| CDN propagation | theme.published.v1 → bundle visible at edge worldwide | p95 ≤ 60 s | 30 days |
| Public bundle read p95 | edge-hit latency | ≤ 80 ms | 30 days |
| Public bundle availability | 2xx / total | ≥ 99.99 % | 30 days |
| Internal email-theme read p95 | mTLS endpoint | ≤ 50 ms | 30 days |
| Outbox lag | time from row insert to Pub/Sub publish | p95 ≤ 2 s | 30 days |
| Inbox processing lag | time from message receive to handler complete | p95 ≤ 5 s | 30 days |
| HITL request → first decision | wall-clock (informational, not SLO-binding) | report only | weekly |
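The authoring API availability indicator counts 2xx and 4xx responses as good, except 408 and 499. A minimal sketch of that classification follows; the function names are illustrative and not taken from the service code, and treating an empty window as fully available is an assumption.

```typescript
// Hypothetical helper names; only the classification rule comes from the SLO table.

/** True when a response counts as "good" for the authoring API availability SLI. */
function isGoodResponse(status: number): boolean {
  if (status >= 200 && status < 300) return true;
  // 4xx are caller errors and count as good, except 408 (timeout) and 499 (client closed).
  if (status >= 400 && status < 500) return status !== 408 && status !== 499;
  // Under a literal reading of the table, 1xx, 3xx and 5xx are not counted as good.
  return false;
}

/** Availability = good responses / total responses. Assumption: no traffic means 1. */
function availability(statuses: number[]): number {
  if (statuses.length === 0) return 1;
  const good = statuses.filter(isGoodResponse).length;
  return good / statuses.length;
}
```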
Error budget = (1 − objective) × window_seconds. Burn-rate alerts follow the multi-window, multi-burn-rate policy from the Google SRE workbook: 14.4× over 1 h (2 % of a 30-day budget) and 6× over 6 h (5 % of a 30-day budget).
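The budget and burn-rate arithmetic can be checked with a few lines. This is a sketch with hypothetical function names; it assumes the objective is given as a ratio (0.999 for 99.9 %) and the window in days.

```typescript
const SECONDS_PER_DAY = 86400;

/** Error budget in seconds: (1 - objective) x window length. */
function errorBudgetSeconds(objective: number, windowDays: number): number {
  return (1 - objective) * windowDays * SECONDS_PER_DAY;
}

/** Burn rate: observed error ratio divided by the error ratio the SLO allows. */
function burnRate(observedErrorRatio: number, objective: number): number {
  return observedErrorRatio / (1 - objective);
}

/** Fraction of the whole budget spent if `rate` is sustained for `alertWindowHours`. */
function budgetConsumed(rate: number, alertWindowHours: number, sloWindowDays: number): number {
  return (rate * alertWindowHours) / (sloWindowDays * 24);
}
```

For a 99.9 % objective over 30 days, the budget is about 2592 s (43.2 min). A 14.4× burn sustained for 1 h spends 2 % of it, and a 6× burn sustained for 6 h spends 5 %.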
2. Logging
2.1 Format & sink
- Structured JSON to stdout; collected by the platform Cloud Logging agent.
- Schema fields:
{
  "ts": "2026-04-23T10:14:55.211Z",
  "level": "info",
  "service": "theme-config-service",
  "version": "1.42.0",
  "env": "prod",
  "tenantId": "tnt_01J...",
  "actorId": "usr_01J...",
  "requestId": "req_01J...",
  "correlationId": "req_01J...",
  "traceId": "00-...-...-01",
  "spanId": "0fa2c...",
  "useCase": "PublishThemeVersionUseCase",
  "themeId": "thm_01J...",
  "themeVersionId": "thv_01J...",
  "msg": "publish.committed",
  "durationMs": 1842,
  "outcome": "ok"
}
- All logs are redacted by the platform middleware (no full request bodies, no full prompts, no token secrets, no JWTs, no email PII). The redaction allow-list is in src/infrastructure/logging/redaction.ts.
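An allow-list redactor keeps only fields that are explicitly named and drops everything else. The sketch below shows that idea. The field list simply mirrors the schema example above, and the function names are hypothetical; the real list and implementation live in src/infrastructure/logging/redaction.ts and may differ.

```typescript
// Hypothetical allow-list mirroring the schema example; not the service's actual list.
const ALLOWED_FIELDS: readonly string[] = [
  "ts", "level", "service", "version", "env", "tenantId", "actorId",
  "requestId", "correlationId", "traceId", "spanId", "useCase",
  "themeId", "themeVersionId", "msg", "durationMs", "outcome",
];

type LogFields = Record<string, unknown>;

/** Keeps only allow-listed top-level fields; anything else is dropped, not masked. */
function redact(fields: LogFields): LogFields {
  const safe: LogFields = {};
  for (const key of Object.keys(fields)) {
    if (ALLOWED_FIELDS.indexOf(key) !== -1) safe[key] = fields[key];
  }
  return safe;
}

/** Serialises a redacted event as one JSON line, ready to be written to stdout. */
function toLogLine(fields: LogFields): string {
  return JSON.stringify(redact(fields));
}
```

Dropping unknown fields by default means a newly added field stays out of the logs until someone adds it to the list on purpose.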
2.2 Sampling
| Level | Sampling |
|---|---|
| error / warn | 100 % |
| info for state transitions (publish, rollback, draft create) | 100 % |
| info for hot-path reads | 1 % |
| debug | off in prod; on in non-prod |
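The sampling table can be read as a single decision function. The sketch below is one possible encoding; the type names and the injected `random` parameter are assumptions made so the decision can be tested deterministically.

```typescript
type Level = "error" | "warn" | "info" | "debug";
// Hypothetical classification of info logs, following the two "info" rows of the table.
type InfoKind = "state-transition" | "hot-path-read";

/** Returns true when a log line should be emitted under the sampling table. */
function shouldLog(
  level: Level,
  env: string,
  kind: InfoKind = "state-transition",
  random: () => number = Math.random,
): boolean {
  if (level === "error" || level === "warn") return true; // 100 %
  if (level === "debug") return env !== "prod";           // off in prod, on elsewhere
  // level === "info"
  if (kind === "state-transition") return true;           // 100 %
  return random() < 0.01;                                 // hot-path reads: 1 %
}
```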
2.3 Required log events
| Event key | When |
|---|---|
| theme.draft.created | After CreateThemeVersionUseCase commit |
| theme.draft.updated | After PatchThemeVersionUseCase commit |
| theme.preview.minted | After MintPreviewTokenUseCase commit |
| theme.publish.attempted | At the start of PublishThemeVersionUseCase |
| theme.publish.committed | After commit (carries bundleSha256, bundleSizeGzippedBytes, latencyMs) |
| theme.publish.rejected | On rejection with reasons |
| theme.rollback.committed | After rollback commit |
| theme.cdn.invalidation.queued / …succeeded / …failed | Per attempt |
| theme.outbox.drained.batch | Per drain (carries batchSize, lagMs) |
| theme.inbox.consumed | Per consumed event with eventId, eventType |
| theme.bundle.integrity.violation | When CDN bundle fails SHA verification |
| theme.ai.suggestion.created / …applied / …rejected | Per HITL transition |
3. Metrics
All metrics are emitted through OpenTelemetry to Cloud Monitoring with the labels tenantId (kept low-cardinality by bucketing into tenant tiers, which appears as tenantTier in the tables below), useCase, outcome, and env.
3.1 RED metrics (per use case)
- theme_usecase_requests_total{useCase, outcome}: counter.
- theme_usecase_duration_seconds{useCase, outcome}: histogram (default OTel buckets).
- theme_usecase_errors_total{useCase, errorCode}: counter; errorCode comes from the canonical catalogue.
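One way to keep the three RED metrics consistent is to record them in a single wrapper around each use case. The sketch below uses minimal interfaces shaped like the OpenTelemetry `Counter.add(value, attributes)` and `Histogram.record(value, attributes)` calls so that it runs without the SDK. The wrapper name, the `code` property on errors, and the `"UNKNOWN"` fallback are assumptions, and it is synchronous for brevity where a real use case would be awaited.

```typescript
type Attrs = Record<string, string>;
interface CounterLike { add(value: number, attributes?: Attrs): void; }
interface HistogramLike { record(value: number, attributes?: Attrs): void; }

interface RedInstruments {
  requests: CounterLike;   // theme_usecase_requests_total
  duration: HistogramLike; // theme_usecase_duration_seconds
  errors: CounterLike;     // theme_usecase_errors_total
}

// Hypothetical convention: domain errors expose a string `code` from the error catalogue.
function errorCodeOf(err: unknown): string {
  if (typeof err === "object" && err !== null) {
    const code = (err as { code?: unknown }).code;
    if (typeof code === "string") return code;
  }
  return "UNKNOWN";
}

/** Runs `fn`, then records the request count, the duration, and on failure the error code. */
function withRedMetrics<T>(
  m: RedInstruments,
  useCase: string,
  fn: () => T,
  now: () => number = Date.now,
): T {
  const startedAt = now();
  let outcome = "ok";
  try {
    return fn();
  } catch (err) {
    outcome = "error";
    m.errors.add(1, { useCase, errorCode: errorCodeOf(err) });
    throw err; // metrics must never swallow the original failure
  } finally {
    m.requests.add(1, { useCase, outcome });
    m.duration.record((now() - startedAt) / 1000, { useCase, outcome });
  }
}
```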
3.2 Domain metrics
| Metric | Type | Labels | Notes |
|---|---|---|---|
| theme_publish_total | counter | outcome, tenantTier | success/failure |
| theme_publish_bundle_size_bytes | histogram | tenantTier | gzipped |
| theme_publish_validation_warnings_total | counter | code | locale incompleteness etc. |
| theme_publish_e2e_seconds | histogram | outcome | from request → CDN invalidation queued |
| theme_rollback_total | counter | outcome | |
| theme_preview_tokens_active | gauge | tenantTier | by sweep |
| theme_locale_packs_completeness_pct | gauge | locale | by sweep |
| theme_content_blocks_per_version | histogram | | |
| theme_active_versions_total | gauge | | |
| theme_active_publications_total | gauge | | |
| theme_cdn_invalidation_seconds | histogram | outcome | |
| theme_outbox_lag_seconds | gauge | | sampled at drain |
| theme_outbox_unpublished_rows | gauge | | |
| theme_inbox_consumed_total | counter | eventType, outcome | |
| theme_ai_requests_total | counter | surface, outcome | |
| theme_ai_request_duration_seconds | histogram | surface, outcome | |
| theme_ai_cost_usd_total | counter | surface | aggregated from provenance |
| theme_bundle_integrity_violations_total | counter | | |
| theme_db_pool_connections_in_use | gauge | | from PgBouncer / Drizzle |
| theme_rls_violations_total | counter | | should be 0; alert on > 0 |
3.3 Resource metrics
Standard Cloud Run resource metrics (CPU, memory, instance count, request count, request latencies) are emitted automatically.
4. Tracing
- OpenTelemetry SDK with W3C Trace Context propagation.
- Every inbound HTTP request starts the root span; downstream calls (Postgres, Pub/Sub, GCS, CDN, AI orchestrator, file-storage, tenant-service) become child spans with attributes:
  - db.system=postgresql, db.statement=<param-stripped>, db.row.count=<n>.
  - messaging.system=pubsub, messaging.destination=<topic>, messaging.message_id=<eventId>.
  - gcp.gcs.bucket, gcp.gcs.object, gcp.gcs.operation.
  - http.url=<callee>, http.status_code, peer.service.
  - ghasi.useCase=<UseCaseName>, ghasi.tenantId=<...>, ghasi.themeId=<...>.
- Sampling: 10 % default; 100 % for any request that carries X-Debug-Trace: 1 (the gateway accepts this header only from internal callers).
- Trace exemplars are attached to histogram metrics (Cloud Monitoring exemplar support).
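The sampling rule above reduces to a small head-sampling decision. In the sketch below, `fromInternalCaller` is a hypothetical flag assumed to be set by the gateway, and `random` is injected for testability. A real deployment would more likely express this as an OpenTelemetry sampler and also honour the parent's W3C sampling decision.

```typescript
const DEFAULT_SAMPLE_RATIO = 0.1; // 10 % default from the sampling rule

/**
 * Decides whether to record a trace for an inbound request.
 * Header names are expected lower-cased, as Node's HTTP server presents them.
 */
function shouldSampleTrace(
  headers: Record<string, string | undefined>,
  fromInternalCaller: boolean,
  random: () => number = Math.random,
): boolean {
  const debugRequested = headers["x-debug-trace"] === "1";
  // The debug header is only honoured for internal callers.
  if (debugRequested && fromInternalCaller) return true;
  return random() < DEFAULT_SAMPLE_RATIO;
}
```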
The publish flow's expected span tree:
http.POST /v1/theme-versions/:id/publish
├── usecase.PublishThemeVersionUseCase
│ ├── repo.themeVersion.findById
│ ├── repo.theme.findById
│ ├── repo.contentBlocks.listByVersion
│ ├── repo.navConfigs.listByVersion
│ ├── repo.bookingFlow.findByVersion
│ ├── repo.emailTheme.findByVersion
│ ├── repo.localePacks.listByVersion
│ ├── client.fileStorage.headMany (asset integrity)
│ ├── pure.buildBundle
│ ├── client.gcs.uploadBundle
│ ├── tx.publishFlip
│ │ ├── repo.themeVersion.save
│ │ ├── repo.themePublication.flipActive
│ │ ├── repo.theme.save
│ │ └── outbox.publishMany
│ ├── cache.memorystore.set
│ └── client.cloudCdn.invalidate
└── 202 Accepted
5. Dashboards
Cloud Monitoring dashboards in services/theme-config-service/observability/dashboards/:
| Dashboard | Audience | Key panels |
|---|---|---|
| theme-config-overview | on-call | RED per use case, error budget burn, top error codes, p50/p95/p99 latencies, deployment markers |
| theme-config-publish | feature owners | Publish success rate, e2e latency, bundle size distribution, CDN propagation, rollbacks/day |
| theme-config-data | DBA + on-call | Pool utilisation, slow queries, RLS violations, outbox lag, inbox lag, replica replication lag |
| theme-config-ai | AI ops | AI calls per surface, success rate, cost/day, HITL approval lag |
| theme-config-tenant-tier | product | Active themes, publishes/tenant, content blocks/tenant, bundle size by tier |
6. Alerts
| Alert | Trigger | Severity | Channel |
|---|---|---|---|
| Burn - fast | error budget burn 14.4× over 1h | sev2 | PagerDuty primary |
| Burn - slow | error budget burn 6× over 6h | sev3 | PagerDuty secondary |
| Publish failure spike | publish failure rate > 2 % over 10 min | sev2 | PagerDuty primary |
| Rollback rate | > 3 rollbacks in 30 min globally | sev3 | Slack #theme-ops |
| Bundle integrity violation | any in 5 min | sev1 | PagerDuty primary + #security-incidents |
| RLS violation counter > 0 | any | sev1 | PagerDuty primary + #security-incidents |
| Outbox lag | p95 outbox lag > 30 s for 5 min | sev2 | PagerDuty primary |
| Outbox backlog | unpublished rows > 1000 for 10 min | sev2 | PagerDuty primary |
| CDN invalidation failure | failure rate > 5 % over 15 min | sev2 | PagerDuty primary |
| Preview brute-force | sustained 401/404 rate on GET /public/preview/... from one IP > 60 rpm | sev3 | Slack #security-ops + auto-rate-limit boost |
| AI surface failure | failure rate > 10 % over 15 min on any surface | sev3 | Slack #theme-ops |
| AI budget exceeded for tenant | first occurrence in a billing period | informational | tenant email + product Slack |
| Schema migration drift | startup health check fails | sev2 | PagerDuty primary |
| Health check failure | /healthz 5xx for 1 min | sev2 | PagerDuty primary |
| Memorystore unreachable | error rate > 1 % on cache ops for 5 min | sev3 | Slack (we degrade to origin) |
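Several triggers above share the form "failure rate > X % over N min". The alerts themselves would be evaluated by the monitoring backend; the sketch below only pins down the intended semantics (trailing window, strict inequality), for example for use in test fixtures. All names are hypothetical, and treating an empty window as a 0 ratio is an assumption.

```typescript
interface Sample {
  atMs: number;    // event timestamp in epoch milliseconds
  failed: boolean; // true when the attempt failed
}

/** Failure ratio over the trailing window ending at `nowMs`; 0 when the window is empty. */
function failureRatio(samples: Sample[], windowMs: number, nowMs: number): number {
  const recent = samples.filter((s) => s.atMs > nowMs - windowMs && s.atMs <= nowMs);
  if (recent.length === 0) return 0;
  return recent.filter((s) => s.failed).length / recent.length;
}

/** True when the ratio strictly exceeds the threshold, matching the "> 2 %" wording. */
function shouldFire(samples: Sample[], windowMs: number, nowMs: number, threshold: number): boolean {
  return failureRatio(samples, windowMs, nowMs) > threshold;
}
```

For the publish failure spike alert, this would be called with a 10-minute window and a threshold of 0.02.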
7. Health endpoints
| Endpoint | Purpose | Auth |
|---|---|---|
| GET /healthz | liveness: process up, DB connection ok | none |
| GET /readyz | readiness: migrations applied, Pub/Sub publisher ok, Memorystore reachable | none |
| GET /metrics | Prometheus scrape (sidecar OTel collector exposes this) | mTLS internal only |
When every check passes, GET /readyz returns:
{
"ok": true,
"checks": {
"db": "ok",
"migrations": "applied:142",
"pubsub": "ok",
"memorystore": "ok",
"gcs": "ok",
"ai_orchestrator": "ok",
"file_storage": "ok"
}
}
Failure of any check returns 503 with the failing keys.
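The aggregation rule can be sketched as a pure function over the individual check results. The `failing` array is a hypothetical way to carry "the failing keys"; the exact failure body is not specified here. Treating `applied:<n>` as a passing value is inferred from the example response.

```typescript
type CheckResult = string; // e.g. "ok", "applied:142", or an error description

interface ReadyzBody {
  ok: boolean;
  checks: Record<string, CheckResult>;
  failing: string[]; // hypothetical field listing the keys of failed checks
}

/** A check passes when it reports "ok" or, for migrations, "applied:<n>". */
function isPassing(value: CheckResult | undefined): boolean {
  return value !== undefined && (value === "ok" || /^applied:\d+$/.test(value));
}

/** Maps check results to an HTTP status and body: 200 if all pass, otherwise 503. */
function buildReadyz(checks: Record<string, CheckResult>): { status: number; body: ReadyzBody } {
  const failing = Object.keys(checks).filter((key) => !isPassing(checks[key]));
  const ok = failing.length === 0;
  return { status: ok ? 200 : 503, body: { ok, checks, failing } };
}
```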
8. Audit & compliance trails
- Every mutating use case emits a theme.audit.* event consumed by audit-service (immutable, 7-year retention).
- Every HITL approval emits theme.ai.suggestion_approved.v1 carrying approverUserId, approverNote, appliedToVersionId.
- A DB trigger on theme_publications writes a separate audit_log_publication row (defense in depth in case the application bypasses the use case).
9. Cost observability
- Per-tenant cost attribution via the platform costs/<tenantId> Cloud Logging label, summed monthly:
  - DB CPU / IOPS share by app.tenant_id from query logs.
  - Pub/Sub publish bytes per tenantId partition.
  - GCS object storage by tenantId prefix.
  - CDN egress by host (mapped back to tenant via host_to_theme_view).
  - AI cost from theme_ai_cost_usd_total{surface,tenantTier} × per-tenant breakdown read from audit_log (rare query).
- Aggregated daily into the platform costs.daily_per_service_per_tenant BigQuery table.
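Attributing a shared total to tenants, as with the AI cost line, amounts to a proportional split. The sketch below assumes the per-tenant breakdown is available as a map of usage units per tenant (for example, request counts derived from audit_log). That input shape and the function name are hypothetical.

```typescript
/** Splits `totalUsd` across tenants in proportion to their usage units. */
function allocateCost(
  totalUsd: number,
  usageByTenant: Record<string, number>,
): Record<string, number> {
  const tenants = Object.keys(usageByTenant);
  let totalUsage = 0;
  for (const tenant of tenants) totalUsage += usageByTenant[tenant] || 0;

  const allocated: Record<string, number> = {};
  for (const tenant of tenants) {
    const usage = usageByTenant[tenant] || 0;
    // With no recorded usage, nothing can be attributed; every tenant gets 0.
    allocated[tenant] = totalUsage === 0 ? 0 : (totalUsd * usage) / totalUsage;
  }
  return allocated;
}
```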
10. References
- Platform observability standards: docs/02-enterprise-architecture.md
- Use-case orchestration: APPLICATION_LOGIC
- Failure scenarios cross-referenced from this doc: FAILURE_MODES
- Test fixtures asserting metric names: services/theme-config-service/test/observability/