OBSERVABILITY — theme-config-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · TESTING_STRATEGY

Platform anchors: docs/02-enterprise-architecture.md §Observability

This document defines the logs, metrics, traces, dashboards, alerts, and SLOs used to operate theme-config-service.

1. SLOs

Surface	Indicator	Objective	Window
Authoring API availability	`2xx + 4xx (excluding 408/499) / total`	≥ 99.9 %	30 days
Authoring API latency	p95 of `PATCH/POST` ≤ 350 ms	≥ 99 % of windows	30 days
Publish use case end-to-end	bundle uploaded + outbox written within 4 s	≥ 99 %	30 days
CDN propagation	`theme.published.v1 → bundle visible at edge worldwide`	p95 ≤ 60 s	30 days
Public bundle read p95	edge-hit latency	≤ 80 ms	30 days
Public bundle availability	`2xx / total`	≥ 99.99 %	30 days
Internal email-theme read p95	mTLS endpoint	≤ 50 ms	30 days
Outbox lag	time from row insert to Pub/Sub publish	p95 ≤ 2 s	30 days
Inbox processing lag	time from message receive to handler complete	p95 ≤ 5 s	30 days
HITL request → first decision	wall-clock (informational, not SLO-binding)	report only	weekly
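
As a rough illustration of the Authoring API availability indicator in the table above, the sketch below counts 2xx and 4xx responses as good, except 408 and 499, and divides by the total. The function names and the input shape (a plain array of status codes) are assumptions made for illustration, not part of the service.

```typescript
// Hypothetical helper: classifies one HTTP status for the availability SLI.
// 2xx and 4xx count as good, except 408 (request timeout) and 499
// (non-standard "client closed request"). Anything else, including 3xx and
// 5xx, only contributes to the total, following the indicator literally.
function isGoodStatus(status: number): boolean {
  if (status === 408 || status === 499) return false;
  return (status >= 200 && status < 300) || (status >= 400 && status < 500);
}

// Ratio of good responses to all responses. Returns 1 for an empty sample
// (an assumption: no traffic means no budget consumed).
function availability(statuses: number[]): number {
  if (statuses.length === 0) return 1;
  const good = statuses.filter(isGoodStatus).length;
  return good / statuses.length;
}
```

For example, `availability([200, 404, 408, 500])` counts two good responses out of four.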

Error budget = (1 − objective) × window_seconds. Burn-rate alerts follow the multi-window, multi-burn-rate approach from the Google SRE workbook (1h@14.4× and 6h@6×).
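
The arithmetic behind the budget and the two burn-rate thresholds can be sketched as follows, reading the budget as the allowed failure fraction (1 − objective) multiplied by the window length. The function names are hypothetical and exist only for this worked example.

```typescript
// Allowed "bad" seconds in the SLO window: (1 - objective) * window length.
function errorBudgetSeconds(objective: number, windowDays: number): number {
  return (1 - objective) * windowDays * 24 * 60 * 60;
}

// Burn rate: observed error ratio divided by the allowed error ratio.
// A burn rate of 1 spends exactly the whole budget over the SLO window.
function burnRate(observedErrorRatio: number, objective: number): number {
  return observedErrorRatio / (1 - objective);
}

// Fraction of the whole window's budget consumed when a given burn rate is
// sustained for alertHours.
function budgetConsumed(rate: number, alertHours: number, windowDays: number): number {
  return (rate * alertHours) / (windowDays * 24);
}
```

For the 99.9 % objective over 30 days this gives a budget of about 2592 s (43.2 min). A 14.4× burn sustained for 1 h consumes about 2 % of that budget, and a 6× burn sustained for 6 h consumes about 5 %.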

2. Logging

2.1 Format & sink

Structured JSON to stdout; collected by the platform Cloud Logging agent.

Schema fields:

{
  "ts": "2026-04-23T10:14:55.211Z",
  "level": "info",
  "service": "theme-config-service",
  "version": "1.42.0",
  "env": "prod",
  "tenantId": "tnt_01J...",
  "actorId": "usr_01J...",
  "requestId": "req_01J...",
  "correlationId": "req_01J...",
  "traceId": "00-...-...-01",
  "spanId": "0fa2c...",
  "useCase": "PublishThemeVersionUseCase",
  "themeId": "thm_01J...",
  "themeVersionId": "thv_01J...",
  "msg": "theme.publish.committed",
  "durationMs": 1842,
  "outcome": "ok"
}

All logs are redacted by the platform middleware (no full request bodies, no full prompts, no token secrets, no JWTs, no email PII). The redaction allow-list is in src/infrastructure/logging/redaction.ts.
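
A minimal sketch of allow-list redaction is shown below, assuming the allow-list is simply the set of top-level schema fields listed above. The contents of the real redaction.ts are not given in this document, so `ALLOWED_FIELDS`, `redact`, and `formatLogLine` are hypothetical names.

```typescript
// Hypothetical allow-list, taken from the schema fields shown above. The real
// list lives in src/infrastructure/logging/redaction.ts and may differ.
const ALLOWED_FIELDS: string[] = [
  "ts", "level", "service", "version", "env", "tenantId", "actorId",
  "requestId", "correlationId", "traceId", "spanId", "useCase",
  "themeId", "themeVersionId", "msg", "durationMs", "outcome",
];

type LogFields = Record<string, unknown>;

// Keeps only allow-listed top-level fields; everything else is dropped, so a
// field that was never reviewed cannot leak by accident.
function redact(fields: LogFields): LogFields {
  const out: LogFields = {};
  for (const key of Object.keys(fields)) {
    if (ALLOWED_FIELDS.indexOf(key) !== -1) out[key] = fields[key];
  }
  return out;
}

// Serialises one log entry as a single JSON line, the form written to stdout.
function formatLogLine(fields: LogFields): string {
  return JSON.stringify(redact(fields));
}
```

An allow-list is used here rather than a deny-list because new, unreviewed fields are excluded by default.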

2.2 Sampling

Level	Sampling
`error` / `warn`	100 %
`info` for state transitions (publish, rollback, draft create)	100 %
`info` for hot-path reads	1 %
`debug`	off in prod; on in non-prod
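
The sampling table can be read as a keep-probability per log line. The sketch below encodes it under two assumptions: `info` lines that are neither state transitions nor hot-path reads are kept at 100 %, and the `InfoKind` classification is a hypothetical label the caller supplies.

```typescript
type Level = "error" | "warn" | "info" | "debug";
// Hypothetical classification of info logs used only by this sketch.
type InfoKind = "state-transition" | "hot-path-read";

// Returns the probability with which a log line is kept.
function sampleRate(level: Level, env: string, kind?: InfoKind): number {
  switch (level) {
    case "error":
    case "warn":
      return 1;
    case "info":
      // Assumption: unclassified info lines are treated like state transitions.
      return kind === "hot-path-read" ? 0.01 : 1;
    case "debug":
      return env === "prod" ? 0 : 1;
  }
}

// The random source is injectable so the decision can be tested deterministically.
function shouldLog(
  level: Level,
  env: string,
  kind?: InfoKind,
  random: () => number = Math.random,
): boolean {
  return random() < sampleRate(level, env, kind);
}
```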

2.3 Required log events

Event key	When
`theme.draft.created`	After `CreateThemeVersionUseCase` commit
`theme.draft.updated`	After `PatchThemeVersionUseCase` commit
`theme.preview.minted`	After `MintPreviewTokenUseCase` commit
`theme.publish.attempted`	At the start of `PublishThemeVersionUseCase`
`theme.publish.committed`	After commit (carries `bundleSha256`, `bundleSizeGzippedBytes`, `latencyMs`)
`theme.publish.rejected`	On rejection with reasons
`theme.rollback.committed`	After rollback commit
`theme.cdn.invalidation.queued` / `…succeeded` / `…failed`	per attempt
`theme.outbox.drained.batch`	per drain (carries `batchSize`, `lagMs`)
`theme.inbox.consumed`	per consumed event with `eventId`, `eventType`
`theme.bundle.integrity.violation`	when CDN bundle fails SHA verification
`theme.ai.suggestion.created` / `…applied` / `…rejected`	per HITL transition

3. Metrics

All metrics are emitted through OpenTelemetry to Cloud Monitoring with the labels tenantId (exposed only as a low-cardinality tenant-tier bucket, never as the raw ID), useCase, outcome, and env.
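
One way to keep the tenant dimension low-cardinality is to map each tenant to a tier before building metric labels. The sketch below assumes a simple in-memory lookup; the tier names and the `buildLabels` helper are hypothetical.

```typescript
// Hypothetical tiers; the real tier names are not given in this document.
type TenantTier = "free" | "standard" | "enterprise" | "unknown";

interface MetricLabels {
  tenantTier: TenantTier;
  useCase: string;
  outcome: string;
  env: string;
}

// Builds metric labels without ever exposing the raw tenantId, so label
// cardinality is bounded by the number of tiers rather than tenants.
function buildLabels(
  tenantId: string,
  tierByTenant: Record<string, TenantTier>,
  useCase: string,
  outcome: string,
  env: string,
): MetricLabels {
  const known = Object.prototype.hasOwnProperty.call(tierByTenant, tenantId);
  const tenantTier: TenantTier = known ? tierByTenant[tenantId] : "unknown";
  return { tenantTier, useCase, outcome, env };
}
```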

3.1 RED metrics (per use case)

theme_usecase_requests_total{useCase, outcome} — counter.
theme_usecase_duration_seconds{useCase, outcome} — histogram (default OTel buckets).
theme_usecase_errors_total{useCase, errorCode} — counter, errorCode from the canonical catalogue.
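
To show how the three RED series relate, here is a sketch that wraps a use case and records one request, one duration sample, and (on failure) one error. It uses an in-memory recorder as a stand-in for the metrics backend; `RedRecorder` and `instrument` are hypothetical, the wrapper is synchronous for brevity, and `err.name` stands in for the canonical error code.

```typescript
// In-memory stand-in for a metrics backend (hypothetical). Keys are
// "<useCase>|<outcome>" or "<useCase>|<errorCode>".
class RedRecorder {
  readonly requests: Record<string, number> = {};
  readonly errors: Record<string, number> = {};
  readonly durations: Record<string, number[]> = {};

  // Records one finished use-case invocation.
  record(useCase: string, outcome: string, durationSeconds: number, errorCode?: string): void {
    const key = `${useCase}|${outcome}`;
    this.requests[key] = (this.requests[key] || 0) + 1;
    if (!this.durations[key]) this.durations[key] = [];
    this.durations[key].push(durationSeconds);
    if (errorCode !== undefined) {
      const errKey = `${useCase}|${errorCode}`;
      this.errors[errKey] = (this.errors[errKey] || 0) + 1;
    }
  }
}

// Wraps a function and records rate, errors, and duration around it.
// The clock is injectable (milliseconds) so the sketch can be tested.
function instrument<T>(
  rec: RedRecorder,
  useCase: string,
  fn: () => T,
  now: () => number = Date.now,
): T {
  const start = now();
  try {
    const result = fn();
    rec.record(useCase, "ok", (now() - start) / 1000);
    return result;
  } catch (err) {
    // Assumption: the error name is used where the canonical errorCode would go.
    const code = err instanceof Error ? err.name : "UNKNOWN";
    rec.record(useCase, "error", (now() - start) / 1000, code);
    throw err;
  }
}
```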

3.2 Domain metrics

Metric	Type	Labels	Notes
`theme_publish_total`	counter	`outcome`, `tenantTier`	success/failure
`theme_publish_bundle_size_bytes`	histogram	`tenantTier`	gzipped
`theme_publish_validation_warnings_total`	counter	`code`	locale incompleteness etc.
`theme_publish_e2e_seconds`	histogram	`outcome`	from request → CDN invalidation queued
`theme_rollback_total`	counter	`outcome`
`theme_preview_tokens_active`	gauge	`tenantTier`	by sweep
`theme_locale_packs_completeness_pct`	gauge	`locale`	by sweep
`theme_content_blocks_per_version`	histogram
`theme_active_versions_total`	gauge
`theme_active_publications_total`	gauge
`theme_cdn_invalidation_seconds`	histogram	`outcome`
`theme_outbox_lag_seconds`	gauge		sampled at drain
`theme_outbox_unpublished_rows`	gauge
`theme_inbox_consumed_total`	counter	`eventType`, `outcome`
`theme_ai_requests_total`	counter	`surface`, `outcome`
`theme_ai_request_duration_seconds`	histogram	`surface`, `outcome`
`theme_ai_cost_usd_total`	counter	`surface`	aggregated from provenance
`theme_bundle_integrity_violations_total`	counter
`theme_db_pool_connections_in_use`	gauge		from PgBouncer / Drizzle
`theme_rls_violations_total`	counter		should be 0; alert on > 0

3.3 Resource metrics

Standard Cloud Run resource metrics (CPU, memory, instance count, request count, request latencies) are emitted automatically.

4. Tracing

OpenTelemetry SDK with W3C Trace Context propagation.
Every inbound HTTP request is the root span; downstream calls (Postgres, Pub/Sub, GCS, CDN, AI orchestrator, file-storage, tenant-service) become child spans with attributes:
- db.system=postgresql, db.statement=<param-stripped>, db.row.count=<n>.
- messaging.system=pubsub, messaging.destination=<topic>, messaging.message_id=<eventId>.
- gcp.gcs.bucket, gcp.gcs.object, gcp.gcs.operation.
- http.url=<callee>, http.status_code, peer.service.
- ghasi.useCase=<UseCaseName>, ghasi.tenantId=<...>, ghasi.themeId=<...>.
Sampling: 10 % by default; 100 % for any request that carries X-Debug-Trace: 1 (the gateway accepts this header only from internal callers).
Trace exemplars are attached to histogram metrics (Cloud Monitoring exemplar support).
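
As an illustration of the propagation and sampling rules above, the sketch below parses a version-00 W3C `traceparent` header and makes the head-sampling decision. In the service itself the OpenTelemetry SDK would normally handle both; the function names, the lowercase header map, and the injectable random source are assumptions of this sketch.

```typescript
interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Parses a traceparent header of the version-00 shape:
// version(2 hex)-traceid(32 hex)-parentid(16 hex)-flags(2 hex), lowercase.
// Returns null for anything malformed.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version ff and all-zero trace or parent ids are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Head-sampling decision: always sample when X-Debug-Trace: 1 is present,
// otherwise keep the default ratio (10 %). Header names are assumed to be
// lowercased already; the gateway's internal-caller check is out of scope.
function shouldSampleTrace(
  headers: Record<string, string | undefined>,
  random: () => number = Math.random,
  defaultRatio: number = 0.1,
): boolean {
  if (headers["x-debug-trace"] === "1") return true;
  return random() < defaultRatio;
}
```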

The publish flow's expected span tree:

http.POST /v1/theme-versions/:id/publish
├── usecase.PublishThemeVersionUseCase
│   ├── repo.themeVersion.findById
│   ├── repo.theme.findById
│   ├── repo.contentBlocks.listByVersion
│   ├── repo.navConfigs.listByVersion
│   ├── repo.bookingFlow.findByVersion
│   ├── repo.emailTheme.findByVersion
│   ├── repo.localePacks.listByVersion
│   ├── client.fileStorage.headMany    (asset integrity)
│   ├── pure.buildBundle
│   ├── client.gcs.uploadBundle
│   ├── tx.publishFlip
│   │   ├── repo.themeVersion.save
│   │   ├── repo.themePublication.flipActive
│   │   ├── repo.theme.save
│   │   └── outbox.publishMany
│   ├── cache.memorystore.set
│   └── client.cloudCdn.invalidate
└── 202 Accepted

5. Dashboards

Cloud Monitoring dashboards in services/theme-config-service/observability/dashboards/:

Dashboard	Audience	Key panels
`theme-config-overview`	on-call	RED per use case, error budget burn, top error codes, p50/p95/p99 latencies, deployment markers
`theme-config-publish`	feature owners	Publish success rate, e2e latency, bundle size distribution, CDN propagation, rollbacks/day
`theme-config-data`	DBA + on-call	Pool utilisation, slow queries, RLS violations, outbox lag, inbox lag, replica replication lag
`theme-config-ai`	AI ops	AI calls per surface, success rate, cost/day, HITL approval lag
`theme-config-tenant-tier`	product	Active themes, publishes/tenant, content blocks/tenant, bundle size by tier

6. Alerts

Alert	Trigger	Severity	Channel
`Burn — fast`	error budget burn 14.4× over 1h	sev2	PagerDuty primary
`Burn — slow`	error budget burn 6× over 6h	sev3	PagerDuty secondary
`Publish failure spike`	publish failure rate > 2 % over 10 min	sev2	PagerDuty primary
`Rollback rate`	> 3 rollbacks in 30 min globally	sev3	Slack #theme-ops
`Bundle integrity violation`	any in 5 min	sev1	PagerDuty primary + #security-incidents
`RLS violation counter > 0`	any	sev1	PagerDuty primary + #security-incidents
`Outbox lag`	p95 outbox lag > 30 s for 5 min	sev2	PagerDuty primary
`Outbox backlog`	unpublished rows > 1000 for 10 min	sev2	PagerDuty primary
`CDN invalidation failure`	failure rate > 5 % over 15 min	sev2	PagerDuty primary
`Preview brute-force`	sustained 401/404 rate on `GET /public/preview/...` from one IP > 60 rpm	sev3	Slack #security-ops + auto-rate-limit boost
`AI surface failure`	failure rate > 10 % over 15 min on any surface	sev3	Slack #theme-ops
`AI budget exceeded for tenant`	first occurrence in a billing period	informational	tenant email + product Slack
`Schema migration drift`	startup health check fails	sev2	PagerDuty primary
`Health check failure`	`/healthz` 5xx for 1 min	sev2	PagerDuty primary
`Memorystore unreachable`	error rate > 1 % on cache ops for 5 min	sev3	Slack (we degrade to origin)

7. Health endpoints

Endpoint	Purpose	Auth
`GET /healthz`	liveness — process up, DB connection ok	none
`GET /readyz`	readiness — migrations applied, Pub/Sub publisher ok, Memorystore reachable	none
`GET /metrics`	Prometheus scrape (sidecar OTel collector exposes this)	mTLS internal only

GET /readyz returns the following when every check passes:

{
  "ok": true,
  "checks": {
    "db": "ok",
    "migrations": "applied:142",
    "pubsub": "ok",
    "memorystore": "ok",
    "gcs": "ok",
    "ai_orchestrator": "ok",
    "file_storage": "ok"
  }
}

Failure of any check returns 503 with the failing keys.
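
A minimal sketch of how the readiness response could be assembled from individual check results follows. The document does not specify the exact 503 body, so returning only the failing keys under `checks` is an assumption, as are the names `isPassing` and `buildReadyz`.

```typescript
// A check result is a short string: "ok", "applied:<n>" for migrations,
// or an error description (the error wording is hypothetical).
type CheckResult = string;

interface ReadyzResponse {
  status: 200 | 503;
  body: { ok: boolean; checks: Record<string, CheckResult> };
}

// A check passes when it is "ok" or reports applied migrations.
function isPassing(result: CheckResult): boolean {
  return result === "ok" || /^applied:\d+$/.test(result);
}

// Returns 200 with all checks when everything passes, otherwise 503 with
// only the failing keys (assumed body shape).
function buildReadyz(checks: Record<string, CheckResult>): ReadyzResponse {
  const failing: Record<string, CheckResult> = {};
  let failed = false;
  for (const key of Object.keys(checks)) {
    if (!isPassing(checks[key])) {
      failing[key] = checks[key];
      failed = true;
    }
  }
  return failed
    ? { status: 503, body: { ok: false, checks: failing } }
    : { status: 200, body: { ok: true, checks } };
}
```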

8. Audit & compliance trails

Every mutating use case emits a theme.audit.* event consumed by audit-service (immutable, 7-year retention).
Every HITL approval emits theme.ai.suggestion_approved.v1 carrying approverUserId, approverNote, appliedToVersionId.
A DB trigger on theme_publications writes a separate audit_log_publication row (defense in depth in case a write bypasses the use case).

9. Cost observability

Per-tenant cost attribution via the platform costs/<tenantId> Cloud Logging label, summed monthly:
- DB CPU / IOPS share by app.tenant_id from query logs.
- Pub/Sub publish bytes per tenantId partition.
- GCS object storage by tenantId prefix.
- CDN egress by host (mapped back to tenant via host_to_theme_view).
- AI cost from theme_ai_cost_usd_total{surface,tenantTier} × per-tenant breakdown read from audit_log (rare query).
Aggregated daily into the platform costs.daily_per_service_per_tenant BigQuery table.
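
The monthly roll-up described above amounts to grouping daily cost rows by tenant and summing them. The sketch below shows that step in isolation; the row shape is hypothetical, since the schema of costs.daily_per_service_per_tenant is not given here.

```typescript
// Hypothetical row shape for one day's attributed cost from one source
// (for example "db", "pubsub", "gcs", "cdn", or "ai").
interface DailyCostRow {
  date: string;      // ISO date, "YYYY-MM-DD"
  tenantId: string;
  source: string;
  costUsd: number;
}

// Sums daily rows into per-tenant totals for one month given as "YYYY-MM".
function monthlyCostByTenant(rows: DailyCostRow[], month: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    // Skip rows that fall outside the requested month.
    if (row.date.slice(0, 7) !== month) continue;
    totals[row.tenantId] = (totals[row.tenantId] || 0) + row.costUsd;
  }
  return totals;
}
```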

10. References

Platform observability standards: docs/02-enterprise-architecture.md
Use-case orchestration: APPLICATION_LOGIC
Failure scenarios cross-referenced from this doc: FAILURE_MODES
Test fixtures asserting metric names: services/theme-config-service/test/observability/
