
OBSERVABILITY — theme-config-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · TESTING_STRATEGY

Platform anchors: docs/02-enterprise-architecture.md §Observability

This document defines the logs, metrics, traces, dashboards, alerts, and SLOs used to operate theme-config-service.


1. SLOs

| Surface | Indicator | Objective | Window |
|---|---|---|---|
| Authoring API availability | 2xx + 4xx (excluding 408/499) / total | ≥ 99.9 % | 30 days |
| Authoring API latency | p95 of PATCH/POST ≤ 350 ms | ≥ 99 % of windows | 30 days |
| Publish use case end-to-end | bundle uploaded + outbox written within 4 s | ≥ 99 % | 30 days |
| CDN propagation | theme.published.v1 → bundle visible at edge worldwide | p95 ≤ 60 s | 30 days |
| Public bundle read p95 | edge-hit latency | ≤ 80 ms | 30 days |
| Public bundle availability | 2xx / total | ≥ 99.99 % | 30 days |
| Internal email-theme read p95 | mTLS endpoint | ≤ 50 ms | 30 days |
| Outbox lag | time from row insert to Pub/Sub publish | p95 ≤ 2 s | 30 days |
| Inbox processing lag | time from message receive to handler complete | p95 ≤ 5 s | 30 days |
| HITL request → first decision | wall-clock (informational, not SLO-binding) | report only | weekly |

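Error budget = (1 − objective) × window_seconds; for example, a 99.9 % objective over 30 days leaves about 43 minutes of budget. Burn-rate alerts use the multi-window, multi-burn-rate approach from the Google SRE Workbook (14.4× over 1 h and 6× over 6 h).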
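The budget arithmetic and the two burn-rate thresholds can be checked with a short sketch. It reads the formula as (1 − objective) × window_seconds; the function names are hypothetical and not part of the service.

```typescript
// Hypothetical helpers for illustration only.

/** Error budget in seconds: (1 - objective) * window length. */
function errorBudgetSeconds(objective: number, windowSeconds: number): number {
  return (1 - objective) * windowSeconds;
}

/** Burn rate: observed error ratio divided by the error ratio the SLO allows. */
function burnRate(observedErrorRatio: number, objective: number): number {
  return observedErrorRatio / (1 - objective);
}

/** Fraction of the whole window's budget consumed by burning at `rate` for `alertHours`. */
function budgetConsumed(rate: number, alertHours: number, windowHours: number): number {
  return (rate * alertHours) / windowHours;
}

const THIRTY_DAYS_HOURS = 30 * 24;
const budget = errorBudgetSeconds(0.999, THIRTY_DAYS_HOURS * 3600);

console.log(`budget: ${(budget / 60).toFixed(1)} min`);
console.log(`fast burn (14.4x, 1 h): ${(budgetConsumed(14.4, 1, THIRTY_DAYS_HOURS) * 100).toFixed(1)} % of budget`);
console.log(`slow burn (6x, 6 h): ${(budgetConsumed(6, 6, THIRTY_DAYS_HOURS) * 100).toFixed(1)} % of budget`);
```

With a 30-day window, the fast threshold corresponds to spending 2 % of the budget in one hour and the slow threshold to spending 5 % in six hours.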


2. Logging

2.1 Format & sink

  • Structured JSON to stdout; collected by the platform Cloud Logging agent.
  • Schema fields:
    {
      "ts": "2026-04-23T10:14:55.211Z",
      "level": "info",
      "service": "theme-config-service",
      "version": "1.42.0",
      "env": "prod",
      "tenantId": "tnt_01J...",
      "actorId": "usr_01J...",
      "requestId": "req_01J...",
      "correlationId": "req_01J...",
      "traceId": "00-...-...-01",
      "spanId": "0fa2c...",
      "useCase": "PublishThemeVersionUseCase",
      "themeId": "thm_01J...",
      "themeVersionId": "thv_01J...",
      "msg": "publish.committed",
      "durationMs": 1842,
      "outcome": "ok"
    }
  • All logs are redacted by the platform middleware: no full request bodies, no full prompts, no token secrets, no JWTs, and no email PII. The redaction allow-list is in src/infrastructure/logging/redaction.ts.
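As an illustration of allow-list redaction combined with the JSON schema above, here is a minimal sketch. The field set, `redact`, and `formatLogLine` are assumptions made for this example; the real allow-list lives in src/infrastructure/logging/redaction.ts and may differ.

```typescript
// Hypothetical allow-list mirroring the schema fields listed above.
const ALLOWED_FIELDS: ReadonlySet<string> = new Set([
  "ts", "level", "service", "version", "env", "tenantId", "actorId",
  "requestId", "correlationId", "traceId", "spanId", "useCase",
  "themeId", "themeVersionId", "msg", "durationMs", "outcome",
]);

type LogLevel = "debug" | "info" | "warn" | "error";

/** Keep only allow-listed keys so bodies, tokens, and PII never reach the sink. */
function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (ALLOWED_FIELDS.has(key)) safe[key] = value;
  }
  return safe;
}

/** Build one JSON log line; the caller writes it to stdout. */
function formatLogLine(
  level: LogLevel,
  msg: string,
  fields: Record<string, unknown>,
  now: Date = new Date(),
): string {
  // Fixed fields come after the spread so callers cannot override them.
  return JSON.stringify(
    redact({ ...fields, ts: now.toISOString(), level, service: "theme-config-service", msg }),
  );
}

console.log(
  formatLogLine("info", "publish.committed", {
    tenantId: "tnt_example",
    durationMs: 1842,
    outcome: "ok",
    authorization: "Bearer secret", // dropped: not on the allow-list
  }),
);
```

An allow-list fails closed: a field added by a new code path stays out of the logs until someone adds it deliberately.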

2.2 Sampling

| Level | Sampling |
|---|---|
| error / warn | 100 % |
| info for state transitions (publish, rollback, draft create) | 100 % |
| info for hot-path reads | 1 % |
| debug | off in prod; on in non-prod |
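The sampling rules above reduce to a small decision function. This is a sketch under assumptions: the `EventKind` values and the `shouldLog` name are hypothetical, and the random source is injectable only to make the behaviour testable.

```typescript
type Level = "debug" | "info" | "warn" | "error";
// Hypothetical classification of info-level events.
type EventKind = "state-transition" | "hot-path-read";

/** Decide whether a log record is emitted, following the sampling rules above. */
function shouldLog(
  level: Level,
  kind: EventKind,
  env: string,
  random: () => number = Math.random,
): boolean {
  switch (level) {
    case "error":
    case "warn":
      return true; // always kept
    case "info":
      // State transitions are always kept; hot-path reads are sampled at 1 %.
      return kind === "state-transition" ? true : random() < 0.01;
    case "debug":
      return env !== "prod"; // off in prod, on elsewhere
  }
}

console.log(shouldLog("info", "state-transition", "prod"));
console.log(shouldLog("debug", "hot-path-read", "prod"));
```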

2.3 Required log events

| Event key | When |
|---|---|
| theme.draft.created | After CreateThemeVersionUseCase commit |
| theme.draft.updated | After PatchThemeVersionUseCase commit |
| theme.preview.minted | After MintPreviewTokenUseCase commit |
| theme.publish.attempted | At the start of PublishThemeVersionUseCase |
| theme.publish.committed | After commit (carries bundleSha256, bundleSizeGzippedBytes, latencyMs) |
| theme.publish.rejected | On rejection with reasons |
| theme.rollback.committed | After rollback commit |
| theme.cdn.invalidation.queued / …succeeded / …failed | per attempt |
| theme.outbox.drained.batch | per drain (carries batchSize, lagMs) |
| theme.inbox.consumed | per consumed event with eventId, eventType |
| theme.bundle.integrity.violation | when CDN bundle fails SHA verification |
| theme.ai.suggestion.created / …applied / …rejected | per HITL transition |

3. Metrics

All metrics are emitted through OpenTelemetry to Cloud Monitoring with the labels tenantId (kept low-cardinality by bucketing into tenant tiers), useCase, outcome, and env.

3.1 RED metrics (per use case)

  • theme_usecase_requests_total{useCase, outcome} — counter.
  • theme_usecase_duration_seconds{useCase, outcome} — histogram (default OTel buckets).
  • theme_usecase_errors_total{useCase, errorCode} — counter, errorCode from the canonical catalogue.
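One way to record the three RED metrics around a use case is a small wrapper. The sketch below declares its own `Counter` and `Histogram` interfaces, shaped like the OpenTelemetry `add` / `record` methods, so it runs without dependencies. `withRedMetrics`, `errorCodeOf`, the `"error"` outcome value, and the `"UNKNOWN"` fallback code are assumptions for illustration.

```typescript
type Attributes = Record<string, string>;
interface Counter { add(value: number, attributes?: Attributes): void; }
interface Histogram { record(value: number, attributes?: Attributes): void; }

interface RedInstruments {
  requests: Counter;   // theme_usecase_requests_total
  duration: Histogram; // theme_usecase_duration_seconds
  errors: Counter;     // theme_usecase_errors_total
}

/** Read an `errorCode` property if the thrown value has one (hypothetical convention). */
function errorCodeOf(err: unknown): string {
  if (typeof err === "object" && err !== null && "errorCode" in err) {
    return String((err as { errorCode: unknown }).errorCode);
  }
  return "UNKNOWN";
}

/** Run a use case and record request count, duration, and errors around it. */
async function withRedMetrics<T>(
  m: RedInstruments,
  useCase: string,
  run: () => Promise<T>,
): Promise<T> {
  const started = Date.now();
  let outcome = "ok";
  try {
    return await run();
  } catch (err) {
    outcome = "error";
    m.errors.add(1, { useCase, errorCode: errorCodeOf(err) });
    throw err; // metrics must not swallow the failure
  } finally {
    m.requests.add(1, { useCase, outcome });
    m.duration.record((Date.now() - started) / 1000, { useCase, outcome });
  }
}
```

Recording in `finally` means the request counter and the duration histogram always carry the same `outcome` label, so rates and latencies can be compared per outcome.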

3.2 Domain metrics

| Metric | Type | Labels | Notes |
|---|---|---|---|
| theme_publish_total | counter | outcome, tenantTier | success/failure |
| theme_publish_bundle_size_bytes | histogram | tenantTier | gzipped |
| theme_publish_validation_warnings_total | counter | code | locale incompleteness etc. |
| theme_publish_e2e_seconds | histogram | outcome | from request → CDN invalidation queued |
| theme_rollback_total | counter | outcome | |
| theme_preview_tokens_active | gauge | tenantTier | by sweep |
| theme_locale_packs_completeness_pct | gauge | locale | by sweep |
| theme_content_blocks_per_version | histogram | | |
| theme_active_versions_total | gauge | | |
| theme_active_publications_total | gauge | | |
| theme_cdn_invalidation_seconds | histogram | outcome | |
| theme_outbox_lag_seconds | gauge | | sampled at drain |
| theme_outbox_unpublished_rows | gauge | | |
| theme_inbox_consumed_total | counter | eventType, outcome | |
| theme_ai_requests_total | counter | surface, outcome | |
| theme_ai_request_duration_seconds | histogram | surface, outcome | |
| theme_ai_cost_usd_total | counter | surface | aggregated from provenance |
| theme_bundle_integrity_violations_total | counter | | |
| theme_db_pool_connections_in_use | gauge | | from PgBouncer / Drizzle |
| theme_rls_violations_total | counter | | should be 0; alert on > 0 |

3.3 Resource metrics

Standard Cloud Run resource metrics (CPU, memory, instance count, request count, request latencies) are emitted automatically.


4. Tracing

  • OpenTelemetry SDK with W3C Trace Context propagation.
  • Every inbound HTTP request is the root span; downstream calls (Postgres, Pub/Sub, GCS, CDN, AI orchestrator, file-storage, tenant-service) become child spans with attributes:
    • db.system=postgresql, db.statement=<param-stripped>, db.row.count=<n>.
    • messaging.system=pubsub, messaging.destination=<topic>, messaging.message_id=<eventId>.
    • gcp.gcs.bucket, gcp.gcs.object, gcp.gcs.operation.
    • http.url=<callee>, http.status_code, peer.service.
    • ghasi.useCase=<UseCaseName>, ghasi.tenantId=<...>, ghasi.themeId=<...>.
  • Sampling: 10 % default; 100 % for any request that carries X-Debug-Trace: 1 (gateway accepts only from internal callers).
  • Trace exemplars attached to histogram metrics (Cloud Monitoring exemplar support).
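The propagation and sampling rules above can be sketched as two small functions. `parseTraceParent` follows the W3C `traceparent` layout for version 00 (version, trace ID, parent ID, flags); `shouldSample` and the lower-cased header map are assumptions for illustration, not the service's actual sampler.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

/** Parse a W3C `traceparent` header: version-traceid-parentid-flags, lowercase hex. */
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the Trace Context spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

/**
 * Sampling decision: always trace when X-Debug-Trace is "1", else sample 10 %.
 * Assumes header names are already lower-cased and that the gateway has
 * stripped X-Debug-Trace from external callers.
 */
function shouldSample(
  headers: Record<string, string | undefined>,
  random: () => number = Math.random,
): boolean {
  if (headers["x-debug-trace"] === "1") return true;
  return random() < 0.1;
}

const parent = parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parent?.traceId, shouldSample({ "x-debug-trace": "1" }));
```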

The publish flow's expected span tree:

http.POST /v1/theme-versions/:id/publish
├── usecase.PublishThemeVersionUseCase
│   ├── repo.themeVersion.findById
│   ├── repo.theme.findById
│   ├── repo.contentBlocks.listByVersion
│   ├── repo.navConfigs.listByVersion
│   ├── repo.bookingFlow.findByVersion
│   ├── repo.emailTheme.findByVersion
│   ├── repo.localePacks.listByVersion
│   ├── client.fileStorage.headMany (asset integrity)
│   ├── pure.buildBundle
│   ├── client.gcs.uploadBundle
│   ├── tx.publishFlip
│   │   ├── repo.themeVersion.save
│   │   ├── repo.themePublication.flipActive
│   │   ├── repo.theme.save
│   │   └── outbox.publishMany
│   ├── cache.memorystore.set
│   └── client.cloudCdn.invalidate
└── 202 Accepted

5. Dashboards

Cloud Monitoring dashboards in services/theme-config-service/observability/dashboards/:

| Dashboard | Audience | Key panels |
|---|---|---|
| theme-config-overview | on-call | RED per use case, error budget burn, top error codes, p50/p95/p99 latencies, deployment markers |
| theme-config-publish | feature owners | Publish success rate, e2e latency, bundle size distribution, CDN propagation, rollbacks/day |
| theme-config-data | DBA + on-call | Pool utilisation, slow queries, RLS violations, outbox lag, inbox lag, replica replication lag |
| theme-config-ai | AI ops | AI calls per surface, success rate, cost/day, HITL approval lag |
| theme-config-tenant-tier | product | Active themes, publishes/tenant, content blocks/tenant, bundle size by tier |

6. Alerts

| Alert | Trigger | Severity | Channel |
|---|---|---|---|
| Burn (fast) | error budget burn 14.4× over 1h | sev2 | PagerDuty primary |
| Burn (slow) | error budget burn 6× over 6h | sev3 | PagerDuty secondary |
| Publish failure spike | publish failure rate > 2 % over 10 min | sev2 | PagerDuty primary |
| Rollback rate | > 3 rollbacks in 30 min globally | sev3 | Slack #theme-ops |
| Bundle integrity violation | any in 5 min | sev1 | PagerDuty primary + #security-incidents |
| RLS violation counter > 0 | any | sev1 | PagerDuty primary + #security-incidents |
| Outbox lag | p95 outbox lag > 30 s for 5 min | sev2 | PagerDuty primary |
| Outbox backlog | unpublished rows > 1000 for 10 min | sev2 | PagerDuty primary |
| CDN invalidation failure | failure rate > 5 % over 15 min | sev2 | PagerDuty primary |
| Preview brute-force | sustained 401/404 rate on GET /public/preview/... from one IP > 60 rpm | sev3 | Slack #security-ops + auto-rate-limit boost |
| AI surface failure | failure rate > 10 % over 15 min on any surface | sev3 | Slack #theme-ops |
| AI budget exceeded for tenant | first occurrence in a billing period | informational | tenant email + product Slack |
| Schema migration drift | startup health check fails | sev2 | PagerDuty primary |
| Health check failure | /healthz 5xx for 1 min | sev2 | PagerDuty primary |
| Memorystore unreachable | error rate > 1 % on cache ops for 5 min | sev3 | Slack (we degrade to origin) |

7. Health endpoints

| Endpoint | Purpose | Auth |
|---|---|---|
| GET /healthz | liveness: process up, DB connection ok | none |
| GET /readyz | readiness: migrations applied, Pub/Sub publisher ok, Memorystore reachable | none |
| GET /metrics | Prometheus scrape (sidecar OTel collector exposes this) | mTLS internal only |

readyz returns:

{
  "ok": true,
  "checks": {
    "db": "ok",
    "migrations": "applied:142",
    "pubsub": "ok",
    "memorystore": "ok",
    "gcs": "ok",
    "ai_orchestrator": "ok",
    "file_storage": "ok"
  }
}

Failure of any check returns 503 with the failing keys.
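A readiness handler of this shape can be sketched as an aggregator over named checks. The `failing` field name, the `error: …` result text, and the function names are assumptions; the source only states that a failure returns 503 with the failing keys.

```typescript
type Check = () => Promise<string>; // resolves to "ok", "applied:<n>", or a failure description

interface ReadyzBody {
  ok: boolean;
  checks: Record<string, string>;
  failing?: string[]; // hypothetical field listing the failing check keys
}

/** A check passes when it reports "ok" or an "applied:<n>" migration marker. */
function isPassing(result: string): boolean {
  return result === "ok" || result.startsWith("applied:");
}

/** Run all checks concurrently and derive the HTTP status and response body. */
async function evaluateReadiness(
  checks: Record<string, Check>,
): Promise<{ status: number; body: ReadyzBody }> {
  const names = Object.keys(checks);
  const results = await Promise.all(
    // A rejected check is reported as a failure instead of crashing the handler.
    names.map((name) => checks[name]().catch((err: unknown) => `error: ${String(err)}`)),
  );
  const report: Record<string, string> = {};
  names.forEach((name, i) => { report[name] = results[i]; });

  const failing = names.filter((name) => !isPassing(report[name]));
  return failing.length === 0
    ? { status: 200, body: { ok: true, checks: report } }
    : { status: 503, body: { ok: false, checks: report, failing } };
}

evaluateReadiness({
  db: async () => "ok",
  migrations: async () => "applied:142",
}).then((result) => console.log(result.status, JSON.stringify(result.body)));
```

Running the checks concurrently keeps the probe latency close to that of the slowest dependency.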


8. Audit & compliance trails

  • Every mutating use case emits a theme.audit.* event consumed by audit-service (immutable, 7-year retention).
  • Every HITL approval emits theme.ai.suggestion_approved.v1 carrying approverUserId, approverNote, appliedToVersionId.
  • DB trigger on theme_publications writes a separate audit_log_publication row (defense in depth in case the application bypasses the use case).

9. Cost observability

  • Per-tenant cost attribution via the platform costs/<tenantId> Cloud Logging label, summed monthly:
    • DB CPU / IOPS share by app.tenant_id from query logs.
    • Pub/Sub publish bytes per tenantId partition.
    • GCS object storage by tenantId prefix.
    • CDN egress by host (mapped back to tenant via host_to_theme_view).
    • AI cost from theme_ai_cost_usd_total{surface,tenantTier} × per-tenant breakdown read from audit_log (rare query).
  • Aggregated daily into the platform costs.daily_per_service_per_tenant BigQuery table.

10. References