OBSERVABILITY — pricing-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

Strategic anchors: 02 §13 Observability

We instrument the service with OpenTelemetry (traces, metrics, logs). The collector ships data to:

  • Cloud Trace for distributed traces.
  • Cloud Monitoring (with Managed Prometheus) for metrics.
  • Cloud Logging for structured logs.
  • BigQuery (via log-router sink) for long-term audit + analytics.
  • Sentry for application-level error tracking (frontend BFFs only consume the IDs we surface in error envelopes).

1. Service-level objectives

| SLI | Definition | Target | Window | Error budget |
| --- | --- | --- | --- | --- |
| quote_latency_p99 | p99 of POST /v1/pricing/quotes server time | < 250 ms | rolling 28 d | 1.0% over target |
| quote_latency_p50 | p50 of same | < 80 ms | rolling 28 d | informational |
| quote_availability | 1 − (5xx / total) for /v1/pricing/quotes | ≥ 99.95% | rolling 28 d | 21.6 m/month |
| admin_availability | 1 − (5xx / total) for /v1/admin/* | ≥ 99.9% | rolling 28 d | 43.2 m/month |
| fx_freshness | fraction of /quotes calls that used a non-stale FX snapshot | ≥ 99.5% | rolling 7 d | |
| event_publish_lag_p95 | p95 lag from outbox insert to Pub/Sub publish ack | < 2 s | rolling 7 d | |
| event_consume_lag_p95 | p95 lag from Pub/Sub publish to inbox-handler completion | < 5 s | rolling 7 d | |
| derivation_correctness | % of quotes whose recomputation by reservation-service matches | 100.000% | rolling 28 d | zero tolerance |
| dynamic_suggestion_latency_p95 | p95 of :generate end-to-end | < 8 s | rolling 7 d | |

SLOs are codified in services/pricing-service/observability/slo.yaml and rendered to Cloud Monitoring SLOs via Terraform. Burn-rate alerts fire on 5%/1h (P1) and 10%/6h (P2).
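
To make the thresholds concrete, the sketch below converts them into burn-rate multipliers. It assumes that "5%/1h" means 5% of the 28-day error budget is consumed within one hour; that reading is an interpretation, not something quoted from slo.yaml. The function and variable names are illustrative only.

```typescript
// Hypothetical helpers; names are not taken from slo.yaml or the Terraform module.

/** Burn-rate multiplier at which `budgetFraction` of the error budget is
 *  consumed within `alertWindowHours`, for an SLO window of `sloWindowDays`. */
function burnRateThreshold(
  budgetFraction: number,
  alertWindowHours: number,
  sloWindowDays: number,
): number {
  const sloWindowHours = sloWindowDays * 24;
  return (budgetFraction * sloWindowHours) / alertWindowHours;
}

/** Allowed "bad" minutes for an availability target over a period of `days`. */
function errorBudgetMinutes(target: number, days: number): number {
  return (1 - target) * days * 24 * 60;
}

const p1BurnRate = burnRateThreshold(0.05, 1, 28); // about 33.6x the sustainable rate
const p2BurnRate = burnRateThreshold(0.1, 6, 28); // about 11.2x the sustainable rate
const quoteBudgetMin = errorBudgetMinutes(0.9995, 30); // about 21.6 minutes
const adminBudgetMin = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
```

The per-month budgets in the SLO table match a 30-day month. Over the 28-day SLO window itself, the same targets allow about 20.2 and 40.3 minutes.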


2. RED metrics

Every HTTP endpoint emits the standard RED set via the @melmastoon/otel-http middleware:

| Metric | Type | Labels |
| --- | --- | --- |
| http_server_requests_total | counter | method, route, status_code, tenant_id (sampled), service |
| http_server_request_duration_seconds | histogram | same |
| http_server_in_flight_requests | gauge | service, route |
| http_server_errors_total | counter | route, error_code, service |

tenant_id cardinality is capped: high-volume tenants are emitted as their tenant id; long-tail tenants are rolled up to _other.
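
The roll-up rule can be sketched as a small label-mapping function. How the set of high-volume tenants is chosen and refreshed is not described here, so the set below and the tenant ids in it are made up for illustration.

```typescript
// Hypothetical sketch: the real source of the high-volume tenant set is not specified.
const OTHER_TENANT_LABEL = "_other";

/** Returns the tenant_id label value to emit for a request. */
function tenantLabel(
  tenantId: string,
  highVolumeTenants: ReadonlySet<string>,
): string {
  return highVolumeTenants.has(tenantId) ? tenantId : OTHER_TENANT_LABEL;
}

// Made-up tenant ids, for illustration only.
const highVolumeTenantIds = new Set<string>(["tnt_alpha", "tnt_beta"]);
const emittedTenantLabels = ["tnt_alpha", "tnt_zzz", "tnt_beta", "tnt_yyy"].map(
  (id) => tenantLabel(id, highVolumeTenantIds),
);
// emittedTenantLabels is ["tnt_alpha", "_other", "tnt_beta", "_other"]
```

With this shape, the number of distinct tenant_id label values is bounded by the size of the high-volume set plus one.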

Use-case histograms

| Metric | Description |
| --- | --- |
| pricing_use_case_duration_seconds{use_case=…,outcome=…} | wall-clock per use case |
| pricing_use_case_failures_total{use_case=…,error_code=…} | failure counter |
| pricing_quote_derivation_steps_seconds{step=…} | per-step latency in the derivation pipeline |

Domain metrics

| Metric | Description |
| --- | --- |
| pricing_quotes_created_total{property_id, currency, channel} | live quotes minted |
| pricing_quotes_expired_total{reason} | expired quotes by reason |
| pricing_quote_grand_total_micro_sum / _count | running average grand-total micro |
| pricing_promo_redemptions_total{promo_id, outcome} | redemption attempts |
| pricing_promo_overcap_total{promo_id} | rejected over-cap attempts |
| pricing_fx_snapshot_age_seconds{base, quote} | age of latest snapshot |
| pricing_fx_refresh_failures_total{provider, reason} | provider failures |
| pricing_dynamic_suggestion_total{outcome} | AI HITL throughput; outcome is one of generated, accepted, rejected, expired |
| pricing_dynamic_suggestion_cost_micro_usd_sum | running AI cost |
| pricing_sharia_guard_failures_total{property_id} | sharia rejections |
| pricing_outbox_unpublished | gauge of unpublished outbox rows |
| pricing_inbox_lag_seconds{subject} | delay between event publish and handler completion |
| pricing_quote_locks_active | gauge of active locks (per property) |

3. Tracing

OpenTelemetry tracing is enabled service-wide. Sampling: 100% for /v1/pricing/quotes (cheap, high-value) and 10% otherwise; 100% on errors.
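
The sampling policy amounts to a three-branch decision, sketched below as a pure function. Keeping 100% of error traces generally requires deciding after the outcome is known (tail-based sampling), because a head sampler runs before the request completes; how the service arranges this is not stated here. The type and function names are hypothetical.

```typescript
// Hypothetical decision function; it does not use or mirror the OpenTelemetry Sampler API.
interface SamplingInput {
  route: string;
  isError: boolean;
  /** Uniform random number in [0, 1), injected so the decision is testable. */
  random: number;
}

const QUOTES_ROUTE = "/v1/pricing/quotes";
const DEFAULT_SAMPLING_RATIO = 0.1;

function shouldKeepTrace(input: SamplingInput): boolean {
  if (input.isError) return true; // 100% on errors
  if (input.route === QUOTES_ROUTE) return true; // 100% for quotes
  return input.random < DEFAULT_SAMPLING_RATIO; // 10% otherwise
}
```

For example, `shouldKeepTrace({ route: "/v1/admin/rules", isError: false, random: Math.random() })` keeps roughly one trace in ten; the admin route shown is an invented example under /v1/admin/*.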

Required span attributes

Every span MUST carry:

  • service.name = "pricing-service"
  • service.version
  • service.instance.id
  • tenant.id (when known)
  • actor.type, actor.id (when known)
  • correlation.id
  • idempotency.key (when present)
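
A test helper along the following lines could check the unconditional part of this list. It is a sketch only: the conditional attributes (tenant.id, actor.*, idempotency.key) are skipped because they are required only when known or present, and the helper name and attribute map type are assumptions rather than part of any OpenTelemetry API.

```typescript
// Hypothetical validation helper for tests; not an OpenTelemetry API.
type SpanAttributeMap = Record<string, string | number | boolean | undefined>;

const ALWAYS_REQUIRED_ATTRIBUTES = [
  "service.name",
  "service.version",
  "service.instance.id",
  "correlation.id",
] as const;

const EXPECTED_SERVICE_NAME = "pricing-service";

/** Returns the attribute problems found on a span; an empty array means it passes. */
function spanAttributeProblems(attrs: SpanAttributeMap): string[] {
  const problems: string[] = [];
  for (const key of ALWAYS_REQUIRED_ATTRIBUTES) {
    const value = attrs[key];
    if (value === undefined || value === "") problems.push(key);
  }
  const serviceName = attrs["service.name"];
  if (
    serviceName !== undefined &&
    serviceName !== "" &&
    serviceName !== EXPECTED_SERVICE_NAME
  ) {
    problems.push("service.name (unexpected value)");
  }
  return problems;
}
```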

Spans of interest

POST /v1/pricing/quotes [server]
└── pricing.use_case.calculate_quote [internal]
    ├── pricing.derive.resolve_rate_plan
    ├── pricing.derive.derive_nightly_base
    ├── pricing.derive.apply_discounts
    ├── pricing.derive.compose_fees
    ├── pricing.derive.compose_taxes
    ├── pricing.derive.apply_fx
    ├── pricing.derive.sharia_guard
    ├── pricing.derive.pin_quote
    ├── pricing.repo.rate_plan.find_candidates
    ├── pricing.repo.rate_rule.find_by_plan
    ├── pricing.repo.tax_rule.find_applicable
    ├── pricing.repo.fee_rule.find_applicable
    ├── pricing.repo.fx_snapshot.latest
    └── pricing.outbox.publish (subject=melmastoon.pricing.quote.created.v1)

pricing.derive.* spans carry derivation.step.outcome and derivation.step.duration_ms attributes, so the derivation log embedded in the quote can be cross-checked against trace spans for any future audit.
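
As an illustration of such a cross-check, the sketch below compares a derivation log with the pricing.derive.* spans of the same trace. The derivation-log schema is defined elsewhere, so the entry shape, the outcome values, and the assumption that a step name equals the span name minus the pricing.derive. prefix are all hypothetical.

```typescript
// Hypothetical shapes; the real derivation-log schema may differ.
interface DerivationLogEntry {
  step: string;
  outcome: string;
}

interface DeriveSpan {
  name: string;
  attributes: {
    "derivation.step.outcome": string;
    "derivation.step.duration_ms": number;
  };
}

const DERIVE_SPAN_PREFIX = "pricing.derive.";

/** Lists log steps that have no matching span or whose outcome disagrees with the span. */
function findDerivationMismatches(
  log: readonly DerivationLogEntry[],
  spans: readonly DeriveSpan[],
): string[] {
  const spanOutcomes = new Map<string, string>();
  for (const span of spans) {
    if (span.name.startsWith(DERIVE_SPAN_PREFIX)) {
      const step = span.name.slice(DERIVE_SPAN_PREFIX.length);
      spanOutcomes.set(step, span.attributes["derivation.step.outcome"]);
    }
  }
  const mismatches: string[] = [];
  for (const entry of log) {
    const spanOutcome = spanOutcomes.get(entry.step);
    if (spanOutcome === undefined) {
      mismatches.push(`${entry.step}: no span`);
    } else if (spanOutcome !== entry.outcome) {
      mismatches.push(`${entry.step}: log=${entry.outcome} span=${spanOutcome}`);
    }
  }
  return mismatches;
}
```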


4. Structured logs

Logs are JSON, written to stdout (Cloud Run/GKE picks them up), with the following baseline fields:

{
  "ts": "2026-04-22T10:14:09.123Z",
  "lvl": "info",
  "msg": "quote.created",
  "service": "pricing-service",
  "version": "1.42.0",
  "instance": "pricing-service-7d8c-abcde",
  "tenant_id": "tnt_…",
  "actor": { "type": "user", "id": "usr_…" },
  "correlation_id": "01H8Z…",
  "trace_id": "1234abcd…",
  "span_id": "ef56…",
  "use_case": "CalculateQuote",
  "duration_ms": 142,
  "outcome": "success",
  "ctx": { "property_id": "pty_…", "rate_plan_id": "rate_…", "quote_id": "qte_…" }
}

Log levels:

  • error — every unexpected exception, every domain error mapped to 5xx, every outbox publish failure.
  • warn — domain rejections (4xx), FX snapshot stale fallback, promo cap reached.
  • info — successful use cases (always), authn results.
  • debug — enabled per-tenant via runtime config (off by default; never includes raw payloads).
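
For use-case results, the level rules above reduce to a small mapping, sketched here. The `UseCaseResult` type is a hypothetical simplification, and the sketch leaves out the non-use-case events (outbox publish failures, stale FX fallback, promo cap reached, authn results) and the per-tenant debug switch.

```typescript
// Hypothetical result type; the service's real error model is not shown in this section.
type UseCaseLogLevel = "error" | "warn" | "info";

type UseCaseResult =
  | { kind: "success" }
  | { kind: "domain_error"; httpStatus: number }
  | { kind: "unexpected_exception" };

function logLevelFor(result: UseCaseResult): UseCaseLogLevel {
  switch (result.kind) {
    case "success":
      return "info"; // successful use cases are always logged at info
    case "unexpected_exception":
      return "error";
    case "domain_error":
      // Domain errors mapped to 5xx are errors; 4xx rejections are warnings.
      return result.httpStatus >= 500 ? "error" : "warn";
  }
}
```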

PII is never logged; the linter rejects logs that include email, phone, or name keys outside an explicit allow-list.
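
The rule itself can be illustrated with a recursive key scan over a log object. The actual linter is presumably a static check on source code; this runtime sketch only shows the rule, and the dotted-path form of the allow-list is an assumption.

```typescript
// Hypothetical runtime illustration; the real linter and its allow-list format may differ.
const PII_KEY_NAMES = new Set<string>(["email", "phone", "name"]);

/** Returns dotted paths of PII-named keys that are not on the allow-list. */
function findPiiKeys(
  value: unknown,
  allowList: ReadonlySet<string>,
  path: string = "",
): string[] {
  if (value === null || typeof value !== "object") return [];
  const record = value as Record<string, unknown>;
  const hits: string[] = [];
  for (const key of Object.keys(record)) {
    const childPath = path === "" ? key : `${path}.${key}`;
    if (PII_KEY_NAMES.has(key.toLowerCase()) && !allowList.has(childPath)) {
      hits.push(childPath);
    }
    hits.push(...findPiiKeys(record[key], allowList, childPath));
  }
  return hits;
}
```

Called on `{ ctx: { email: "…" }, actor: { name: "…" } }` with an allow-list containing only `actor.name`, the scan reports `ctx.email` and nothing else.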


5. Dashboards

Three Grafana dashboards (in Cloud Monitoring) are maintained per environment. JSON definitions live in services/pricing-service/observability/dashboards/.

| Dashboard | Panels |
| --- | --- |
| pricing-quotes-overview | quote latency p50/p95/p99, throughput, error rate, quotes-by-currency, sharia-guard failures, fx freshness gauge, outbox depth |
| pricing-admin-and-rules | admin endpoint RED, rate-plan publish rate, rule edits per minute, promo redemption funnel |
| pricing-ai-hitl | dynamic-suggestion generation rate, accept/reject ratio, model latency, model cost, refusal rate |

Each panel is annotated with the SLO threshold and links to the corresponding alert policy.


6. Alerts

Alerts are defined in observability/alerts.yaml and delivered to:

  • PagerDuty pricing-oncall (P1, P2)
  • Slack #alerts-pricing (P3, P4)
  • Email digest to revenue-eng@melmastoon.tech (P4 daily)

| Alert | Severity | Trigger |
| --- | --- | --- |
| quote_latency_slo_burn_fast | P1 | 5%/1h burn on quote_latency_p99 |
| quote_5xx_spike | P1 | > 1% 5xx for 5 m on /v1/pricing/quotes |
| outbox_backlog | P1 | pricing_outbox_unpublished > 1000 for 5 m |
| inbox_lag_high | P2 | p95 of pricing_inbox_lag_seconds > 60 s for 10 m |
| fx_provider_down | P2 | 3 consecutive pricing_fx_refresh_failures_total increments |
| fx_snapshot_hard_expire_imminent | P2 | pricing_fx_snapshot_age_seconds > 60 h |
| promo_overcap_storm | P3 | pricing_promo_overcap_total > 50/min for one promo |
| sharia_guard_storm | P3 | pricing_sharia_guard_failures_total > 20/min |
| dynamic_suggestion_latency_breach | P3 | p95 > 8 s for 15 m |
| dynamic_suggestion_cost_burn | P3 | per-tenant daily cost projected > 80% of cap |
| rls_violation | P1 | any log entry with code=MELMASTOON.SECURITY.TENANT_VIOLATION |
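
Several triggers have the form "value above a threshold for a duration" (for example outbox_backlog: more than 1000 unpublished rows for 5 m). The sketch below shows one way to evaluate that condition over time-ordered samples. It is an illustration of the semantics only; Cloud Monitoring evaluates alert policies itself, and the sample shape here is hypothetical.

```typescript
// Hypothetical sample shape; not the Cloud Monitoring data model.
interface MetricSample {
  atSeconds: number;
  value: number;
}

/** True when the most recent unbroken run of samples above `threshold`
 *  spans at least `forSeconds`. Samples must be ordered by time. */
function sustainedAbove(
  samples: readonly MetricSample[],
  threshold: number,
  forSeconds: number,
): boolean {
  let latest: number | undefined;
  let breachStart: number | undefined;
  // Walk backwards from the newest sample until the breach is interrupted.
  for (const sample of [...samples].reverse()) {
    if (latest === undefined) latest = sample.atSeconds;
    if (sample.value <= threshold) break;
    breachStart = sample.atSeconds;
  }
  return (
    latest !== undefined &&
    breachStart !== undefined &&
    latest - breachStart >= forSeconds
  );
}
```

With one sample per minute, `sustainedAbove(samples, 1000, 300)` returns true only if the last six samples (covering 300 s) are all above 1000.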

7. Synthetic checks

Cloud Monitoring uptime checks run every 60 s from 4 regions and hit:

  • GET /v1/healthz (liveness)
  • GET /v1/readyz (readiness — verifies DB, Redis, Pub/Sub, FX provider)
  • POST /v1/pricing/quotes against a sandbox tenant with fixed seed data — verifies grand-total deterministically

A synthetic check that fails on two consecutive runs pages on-call.
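
The paging rule can be sketched as a check on the tail of a result history. Whether results are tracked per region or per check target is not stated, so the sketch assumes one history per check; the function name is hypothetical.

```typescript
// Hypothetical helper; assumes one ordered pass/fail history per synthetic check.

/** `results` holds check outcomes, oldest first; true means the check passed. */
function shouldPageOnCall(
  results: readonly boolean[],
  consecutiveFailures: number = 2,
): boolean {
  if (results.length < consecutiveFailures) return false;
  return results.slice(-consecutiveFailures).every((passed) => !passed);
}
```

A single failed run, such as `[true, true, false]`, does not page; `[true, false, false]` does.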