OBSERVABILITY — pricing-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

Strategic anchors: 02 §13 Observability

We instrument the service with OpenTelemetry (traces, metrics, logs). The collector ships data to:

Cloud Trace for distributed traces.
Cloud Monitoring (with Managed Prometheus) for metrics.
Cloud Logging for structured logs.
BigQuery (via log-router sink) for long-term audit + analytics.
Sentry for application-level error tracking (frontend BFFs only consume the IDs we surface in error envelopes).

1. Service-level objectives

SLI	Definition	Target	Window	Error budget
`quote_latency_p99`	p99 of `POST /v1/pricing/quotes` server time	< 250 ms	rolling 28 d	1.0% over target
`quote_latency_p50`	p50 of same	< 80 ms	rolling 28 d	informational
`quote_availability`	1 − (5xx / total) for `/v1/pricing/quotes`	≥ 99.95%	rolling 28 d	21.6 min per 30 d
`admin_availability`	1 − (5xx / total) for `/v1/admin/*`	≥ 99.9%	rolling 28 d	43.2 min per 30 d
`fx_freshness`	fraction of `/quotes` calls that used a non-stale FX snapshot	≥ 99.5%	rolling 7 d	—
`event_publish_lag_p95`	p95 lag from outbox insert to Pub/Sub publish ack	< 2 s	rolling 7 d	—
`event_consume_lag_p95`	p95 lag from Pub/Sub publish to inbox-handler completion	< 5 s	rolling 7 d	—
`derivation_correctness`	% of quotes whose recomputation by reservation-service matches	100.000%	rolling 28 d	zero tolerance
`dynamic_suggestion_latency_p95`	p95 of `:generate` end-to-end	< 8 s	rolling 7 d	—
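
The availability budgets in the Error budget column follow directly from the target and the length of the period. The sketch below reproduces that arithmetic. The function name is illustrative, and it assumes the per-month figures refer to a 30-day month, which is what the 21.6 and 43.2 minute values imply.

```python
def error_budget_minutes(target: float, period_days: int) -> float:
    """Minutes of full unavailability allowed by an availability target."""
    period_minutes = period_days * 24 * 60
    return round((1.0 - target) * period_minutes, 2)


# 30-day period, matching the per-month figures in the table.
print(error_budget_minutes(0.9995, 30))  # quote_availability
print(error_budget_minutes(0.999, 30))   # admin_availability

# The same targets over the 28-day rolling window.
print(error_budget_minutes(0.9995, 28))
print(error_budget_minutes(0.999, 28))
```

Over the 28-day rolling window itself the budgets are slightly smaller: 20.16 minutes for `quote_availability` and 40.32 minutes for `admin_availability`.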

SLOs are codified in services/pricing-service/observability/slo.yaml and rendered to Cloud Monitoring SLOs via Terraform. Burn-rate alerts fire when 5% of the error budget is consumed within 1 h (P1) or 10% within 6 h (P2).
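
Reading the two thresholds as 5% of the error budget consumed within 1 h and 10% within 6 h (an interpretation; slo.yaml is not reproduced here), each alert corresponds to a burn-rate multiplier: the factor by which the observed bad-event ratio exceeds the ratio that would spend the budget exactly over the full window. A minimal sketch with illustrative function names:

```python
def burn_rate(budget_fraction: float, alert_hours: float, window_days: int = 28) -> float:
    """Burn-rate multiplier that consumes `budget_fraction` of the budget in `alert_hours`."""
    window_hours = window_days * 24
    return budget_fraction * window_hours / alert_hours


def error_ratio_threshold(rate: float, slo_target: float) -> float:
    """Observed bad-event ratio at which the given burn rate is reached."""
    return rate * (1.0 - slo_target)


fast = burn_rate(0.05, 1)   # P1: 5% of the budget in 1 h
slow = burn_rate(0.10, 6)   # P2: 10% of the budget in 6 h
print(round(fast, 2), round(slow, 2))
print(round(error_ratio_threshold(fast, 0.9995), 4))
```

Under these assumptions the P1 alert corresponds to a burn rate of 33.6 and the P2 alert to 11.2 on a 28-day window. For `quote_availability` (99.95%), a burn rate of 33.6 means a 5xx ratio of about 1.68% sustained over the hour.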

2. RED metrics

Every HTTP endpoint emits the standard RED set via the @melmastoon/otel-http middleware:

Metric	Type	Labels
`http_server_requests_total`	counter	`method`, `route`, `status_code`, `tenant_id` (cardinality-capped), `service`
`http_server_request_duration_seconds`	histogram	same
`http_server_in_flight_requests`	gauge	`service`, `route`
`http_server_errors_total`	counter	`route`, `error_code`, `service`

tenant_id cardinality is capped: high-volume tenants are emitted as their tenant id; long-tail tenants are rolled up to _other.
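
The roll-up rule can be sketched as a small label-mapping function. How the high-volume set is chosen is not specified here, so the sketch assumes it is a precomputed set handed to the middleware; `tenant_label` and the tenant ids are hypothetical.

```python
OTHER_BUCKET = "_other"


def tenant_label(tenant_id: str, high_volume_tenants: frozenset) -> str:
    """Return the `tenant_id` label value to emit for a request.

    Tenants in the high-volume set keep their own id; all others are
    rolled up, so the label has at most len(high_volume_tenants) + 1 values.
    """
    return tenant_id if tenant_id in high_volume_tenants else OTHER_BUCKET


# Hypothetical tenant ids, for illustration only.
high_volume = frozenset({"tnt_alpha", "tnt_beta"})
print(tenant_label("tnt_alpha", high_volume))
print(tenant_label("tnt_small", high_volume))
```

The first call keeps the tenant id and the second returns `_other`.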

Use-case histograms

Metric	Description
`pricing_use_case_duration_seconds{use_case=…,outcome=…}`	wall-clock per use case
`pricing_use_case_failures_total{use_case=…,error_code=…}`	failure counter
`pricing_quote_derivation_steps_seconds{step=…}`	per-step latency in the derivation pipeline

Domain metrics

Metric	Description
`pricing_quotes_created_total{property_id, currency, channel}`	live quotes minted
`pricing_quotes_expired_total{reason}`	expired quotes by reason
`pricing_quote_grand_total_micro_sum / _count`	sum and count of quote grand totals in micro-units (average = sum / count)
`pricing_promo_redemptions_total{promo_id, outcome}`	redemption attempts
`pricing_promo_overcap_total{promo_id}`	rejected over-cap attempts
`pricing_fx_snapshot_age_seconds{base, quote}`	age of latest snapshot
`pricing_fx_refresh_failures_total{provider, reason}`	provider failures
`pricing_dynamic_suggestion_total{outcome=generated|accepted|rejected|expired}`	AI HITL throughput
`pricing_dynamic_suggestion_cost_micro_usd_sum`	running AI cost
`pricing_sharia_guard_failures_total{property_id}`	sharia rejections
`pricing_outbox_unpublished`	gauge of unpublished outbox rows
`pricing_inbox_lag_seconds{subject}`	delay between event publish and handler completion
`pricing_quote_locks_active`	gauge of active locks (per property)
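
For the `_sum` / `_count` pair on the grand-total metric, an average over a time window comes from the difference between two scrapes rather than from the raw running totals. The sketch below shows that calculation; the function name is illustrative, and it assumes "micro" means millionths of one currency unit.

```python
from typing import Optional


def windowed_average(sum_start: float, count_start: int,
                     sum_end: float, count_end: int) -> Optional[float]:
    """Average value of the observations recorded between two scrapes."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no new observations, or the counters were reset
    return (sum_end - sum_start) / delta_count


MICRO = 1_000_000  # assumption: micro-units are millionths of a currency unit

avg_micro = windowed_average(4_000_000_000, 20, 7_000_000_000, 30)
print(avg_micro / MICRO)
```

Here ten quotes added 3,000,000,000 micro-units between the scrapes, so the windowed average is 300 currency units per quote.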

3. Tracing

OpenTelemetry tracing is enabled service-wide. Sampling is 100% for /v1/pricing/quotes (cheap, high-value), 10% for all other routes, and 100% for any trace that contains an error.
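
The sampling policy can be expressed as a single keep/drop decision. Because an error is only known once a request has finished, the sketch assumes the decision is made after the trace completes (for example in the collector); where it actually runs is not stated here. The function names, the exact-match route check, and the trace-id hashing are illustrative and are not the OpenTelemetry SDK's API.

```python
ALWAYS_SAMPLED_ROUTES = {"/v1/pricing/quotes"}
DEFAULT_RATIO = 0.10


def trace_id_fraction(trace_id_hex: str) -> float:
    """Map a hex trace id to a stable value between 0 and 1."""
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits / 2**64


def keep_trace(route: str, had_error: bool, trace_id_hex: str) -> bool:
    """Apply the policy: errors and quote requests always, 10% otherwise."""
    if had_error:
        return True
    if route in ALWAYS_SAMPLED_ROUTES:
        return True
    # Deterministic per trace id, so every service keeps or drops the same traces.
    return trace_id_fraction(trace_id_hex) < DEFAULT_RATIO


print(keep_trace("/v1/pricing/quotes", False, "f" * 32))
print(keep_trace("/v1/admin/rate-plans", False, "f" * 32))
print(keep_trace("/v1/admin/rate-plans", True, "f" * 32))
```

The `/v1/admin/rate-plans` route is a made-up example of a non-quote endpoint.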

Required span attributes

Every span MUST carry:

service.name = "pricing-service"
service.version
service.instance.id
tenant.id (when known)
actor.type, actor.id (when known)
correlation.id
idempotency.key (when present)
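
A test helper along the following lines could enforce the unconditional part of this list. It is a sketch under assumptions: span attributes are taken to be available as a plain dict, and `span_attribute_problems` is a hypothetical name. The conditional attributes (tenant, actor, idempotency key) are deliberately not checked, since they are only required when known or present.

```python
EXPECTED_SERVICE_NAME = "pricing-service"

ALWAYS_REQUIRED = (
    "service.name",
    "service.version",
    "service.instance.id",
    "correlation.id",
)


def span_attribute_problems(attributes: dict) -> list:
    """Return human-readable problems with a span's required attributes."""
    problems = [f"missing {key}" for key in ALWAYS_REQUIRED if key not in attributes]
    name = attributes.get("service.name")
    if name is not None and name != EXPECTED_SERVICE_NAME:
        problems.append("unexpected service.name")
    return problems


# Example span attributes; the values are placeholders.
print(span_attribute_problems({
    "service.name": "pricing-service",
    "service.version": "1.42.0",
    "service.instance.id": "instance-1",
}))
```

The example span lacks `correlation.id`, so the helper reports exactly that one problem.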

Spans of interest

POST /v1/pricing/quotes                                       [server]
  └── pricing.use_case.calculate_quote                        [internal]
       ├── pricing.derive.resolve_rate_plan
       ├── pricing.derive.derive_nightly_base
       ├── pricing.derive.apply_discounts
       ├── pricing.derive.compose_fees
       ├── pricing.derive.compose_taxes
       ├── pricing.derive.apply_fx
       ├── pricing.derive.sharia_guard
       ├── pricing.derive.pin_quote
       ├── pricing.repo.rate_plan.find_candidates
       ├── pricing.repo.rate_rule.find_by_plan
       ├── pricing.repo.tax_rule.find_applicable
       ├── pricing.repo.fee_rule.find_applicable
       ├── pricing.repo.fx_snapshot.latest
       └── pricing.outbox.publish (subject=melmastoon.pricing.quote.created.v1)

pricing.derive.* spans carry derivation.step.outcome and derivation.step.duration_ms attributes, so the derivation log embedded in the quote can be cross-checked against trace spans for any future audit.
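
An audit cross-check of this kind might look like the sketch below. The shape of the derivation log is not defined in this section, so the entry format (`step`, `outcome`), the span dict layout, and the outcome values are all assumptions made for illustration.

```python
DERIVE_PREFIX = "pricing.derive."


def derivation_mismatches(derivation_log: list, spans: list) -> list:
    """Return the steps whose logged outcome differs from the span attribute."""
    span_outcomes = {
        span["name"][len(DERIVE_PREFIX):]: span["attributes"].get("derivation.step.outcome")
        for span in spans
        if span["name"].startswith(DERIVE_PREFIX)
    }
    return [
        entry["step"]
        for entry in derivation_log
        # A step with no matching span also counts as a mismatch.
        if span_outcomes.get(entry["step"]) != entry["outcome"]
    ]


# Hypothetical data: the outcome vocabulary is invented for this example.
log = [
    {"step": "apply_fx", "outcome": "ok"},
    {"step": "sharia_guard", "outcome": "ok"},
]
spans = [
    {"name": "pricing.derive.apply_fx", "attributes": {"derivation.step.outcome": "ok"}},
    {"name": "pricing.derive.sharia_guard", "attributes": {"derivation.step.outcome": "rejected"}},
]
print(derivation_mismatches(log, spans))
```

Only `sharia_guard` is reported, because its logged outcome disagrees with the span.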

4. Structured logs

Logs are JSON, written to stdout (Cloud Run/GKE picks them up), with the following baseline fields:

{
  "ts": "2026-04-22T10:14:09.123Z",
  "lvl": "info",
  "msg": "quote.created",
  "service": "pricing-service",
  "version": "1.42.0",
  "instance": "pricing-service-7d8c-abcde",
  "tenant_id": "tnt_…",
  "actor": { "type": "user", "id": "usr_…" },
  "correlation_id": "01H8Z…",
  "trace_id": "1234abcd…",
  "span_id": "ef56…",
  "use_case": "CalculateQuote",
  "duration_ms": 142,
  "outcome": "success",
  "ctx": { "property_id": "pty_…", "rate_plan_id": "rate_…", "quote_id": "qte_…" }
}
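
A record of this shape can be assembled and written as one JSON line per event. The sketch below is illustrative only: `build_log_record` and `emit` are hypothetical names, not the service's logging API, and the field values are placeholders.

```python
import json
import sys
from datetime import datetime, timezone


def build_log_record(level, msg, *, service, version, instance, **fields):
    """Assemble one log record with the baseline fields first."""
    ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "ts": ts.replace("+00:00", "Z"),  # UTC with a trailing Z, as in the example
        "lvl": level,
        "msg": msg,
        "service": service,
        "version": version,
        "instance": instance,
    }
    record.update(fields)
    return record


def emit(record, stream=sys.stdout):
    """Write the record as a single JSON line."""
    stream.write(json.dumps(record, separators=(",", ":")) + "\n")


record = build_log_record(
    "info", "quote.created",
    service="pricing-service", version="1.42.0", instance="instance-1",
    use_case="CalculateQuote", duration_ms=142, outcome="success",
)
emit(record)
```

Writing one compact JSON object per line keeps each event parseable as a single structured entry by the log agent.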

Log levels:

error — every unexpected exception, every domain error mapped to 5xx, every outbox publish failure.
warn — domain rejections (4xx), FX snapshot stale fallback, promo cap reached.
info — successful use cases (always), authn results.
debug — enabled per-tenant via runtime config (off by default; never includes raw payloads).
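
For the HTTP-outcome part of these rules, the level choice reduces to a small mapping. The sketch below covers only that part; conditions such as the stale FX fallback or a reached promo cap would be logged at warn by the code paths that detect them. The function name is illustrative.

```python
def log_level(status_code: int, unexpected_exception: bool = False) -> str:
    """Pick the log level for a finished request."""
    if unexpected_exception or status_code >= 500:
        return "error"
    if status_code >= 400:
        return "warn"  # domain rejection surfaced as 4xx
    return "info"      # successful use case


print(log_level(201))
print(log_level(422))
print(log_level(503))
```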

PII is never logged. A lint rule rejects log calls whose payload contains `email`, `phone`, or `name` keys unless the key is on an explicit allow-list.
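
The same rule can be illustrated as a runtime check over a log payload. This is a sketch, not the linter itself: it assumes exact, case-insensitive key matching and an allow-list of dotted key paths, neither of which is specified here.

```python
PII_KEYS = {"email", "phone", "name"}


def pii_key_paths(payload: dict, allow_list=frozenset(), _prefix: str = "") -> list:
    """Return dotted paths of PII-looking keys that are not allow-listed."""
    found = []
    for key, value in payload.items():
        path = f"{_prefix}{key}"
        if key.lower() in PII_KEYS and path not in allow_list:
            found.append(path)
        if isinstance(value, dict):
            # Recurse so nested context objects are checked as well.
            found.extend(pii_key_paths(value, allow_list, path + "."))
    return found


payload = {
    "msg": "quote.created",
    "ctx": {"email": "redacted", "property_id": "pty_example"},
    "name": "seasonal-plan",
}
print(pii_key_paths(payload, allow_list=frozenset({"name"})))
```

With `name` allow-listed at the top level, only `ctx.email` is flagged.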

5. Dashboards

Three Grafana dashboards (in Cloud Monitoring) are maintained per environment. JSON definitions live in services/pricing-service/observability/dashboards/.

Dashboard	Panels
`pricing-quotes-overview`	quote latency p50/p95/p99, throughput, error rate, quotes-by-currency, sharia-guard failures, fx freshness gauge, outbox depth
`pricing-admin-and-rules`	admin endpoint RED, rate-plan publish rate, rule edits per minute, promo redemption funnel
`pricing-ai-hitl`	dynamic-suggestion generation rate, accept/reject ratio, model latency, model cost, refusal rate

Each panel is annotated with the SLO threshold and links to the corresponding alert policy.

6. Alerts

Alerts are defined in observability/alerts.yaml and delivered to:

PagerDuty pricing-oncall (P1, P2)
Slack #alerts-pricing (P3, P4)
Email digest to revenue-eng@melmastoon.tech (P4 daily)

Alert	Severity	Trigger
`quote_latency_slo_burn_fast`	P1	5%/1h burn on `quote_latency_p99`
`quote_5xx_spike`	P1	5xx ratio `> 1%` for 5 m on `/v1/pricing/quotes`
`outbox_backlog`	P1	`pricing_outbox_unpublished > 1000` for 5 m
`inbox_lag_high`	P2	p95 of `pricing_inbox_lag_seconds` `> 60` for 10 m
`fx_provider_down`	P2	3 consecutive `pricing_fx_refresh_failures_total` increments
`fx_snapshot_hard_expire_imminent`	P2	`pricing_fx_snapshot_age_seconds > 216000` (60 h)
`promo_overcap_storm`	P3	`pricing_promo_overcap_total > 50/min` for one promo
`sharia_guard_storm`	P3	`pricing_sharia_guard_failures_total > 20/min`
`dynamic_suggestion_latency_breach`	P3	p95 > 8 s for 15 m
`dynamic_suggestion_cost_burn`	P3	per-tenant daily cost projected > 80% of cap
`rls_violation`	P1	any log entry with `code=MELMASTOON.SECURITY.TENANT_VIOLATION`
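
Several triggers have the form "above a threshold for N minutes". Cloud Monitoring evaluates these conditions itself; the sketch below only illustrates the intended semantics, using `outbox_backlog` as the example. The function name and the sample format are assumptions.

```python
def sustained_above(samples: list, threshold: float, duration_s: int) -> bool:
    """True if every sample in the trailing `duration_s` exceeds `threshold`.

    `samples` is a list of (unix_seconds, value) pairs sorted by time.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration_s
    # Require data reaching back to the start of the window, so a short
    # spike with little history does not count as sustained.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)


# One sample per minute for six minutes, all above 1000 unpublished rows.
backlog = [(t, 1200) for t in range(0, 361, 60)]
print(sustained_above(backlog, threshold=1000, duration_s=300))
```

A single dip to or below the threshold inside the window resets the condition, which is what keeps brief spikes from paging.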

7. Synthetic checks

Cloud Monitoring uptime checks run every 60 s from 4 regions against:

GET /v1/healthz (liveness)
GET /v1/readyz (readiness — verifies DB, Redis, Pub/Sub, FX provider)
POST /v1/pricing/quotes against a sandbox tenant with fixed seed data (asserts the expected grand total, which is deterministic for that seed)

Two consecutive failed checks page the on-call engineer.
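
The paging rule can be sketched as a check over the most recent results. How results from the 4 regions are combined is not specified here, so the sketch assumes a single chronological stream of pass/fail results for one check; `should_page` is a hypothetical name.

```python
def should_page(results: list, consecutive_failures: int = 2) -> bool:
    """True if the most recent `consecutive_failures` checks all failed.

    `results` is a chronological list of booleans (True = check passed).
    """
    if len(results) < consecutive_failures:
        return False
    return not any(results[-consecutive_failures:])


print(should_page([True, False, False]))
print(should_page([False, True, False]))
```

The first sequence ends in two failures and pages; the second has a pass between the failures and does not.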
