OBSERVABILITY — pricing-service
Sibling: APPLICATION_LOGIC · FAILURE_MODES · DEPLOYMENT_TOPOLOGY
Strategic anchors: 02 §13 Observability
We instrument the service with OpenTelemetry (traces, metrics, logs). The collector ships data to:
- Cloud Trace for distributed traces.
- Cloud Monitoring (with Managed Prometheus) for metrics.
- Cloud Logging for structured logs.
- BigQuery (via log-router sink) for long-term audit + analytics.
- Sentry for application-level error tracking (frontend BFFs only consume the IDs we surface in error envelopes).
1. Service-level objectives
| SLI | Definition | Target | Window | Error budget |
|---|---|---|---|---|
| quote_latency_p99 | p99 of POST /v1/pricing/quotes server time | < 250 ms | rolling 28 d | 1.0% over target |
| quote_latency_p50 | p50 of same | < 80 ms | rolling 28 d | informational |
| quote_availability | 1 − (5xx / total) for /v1/pricing/quotes | ≥ 99.95% | rolling 28 d | 21.6 m/month |
| admin_availability | 1 − (5xx / total) for /v1/admin/* | ≥ 99.9% | rolling 28 d | 43.2 m/month |
| fx_freshness | fraction of /quotes calls that used a non-stale FX snapshot | ≥ 99.5% | rolling 7 d | n/a |
| event_publish_lag_p95 | p95 lag from outbox insert to Pub/Sub publish ack | < 2 s | rolling 7 d | n/a |
| event_consume_lag_p95 | p95 lag from Pub/Sub publish to inbox-handler completion | < 5 s | rolling 7 d | n/a |
| derivation_correctness | % of quotes whose recomputation by reservation-service matches | 100.000% | rolling 28 d | zero tolerance |
| dynamic_suggestion_latency_p95 | p95 of :generate end-to-end | < 8 s | rolling 7 d | n/a |
SLOs are codified in services/pricing-service/observability/slo.yaml and rendered to Cloud Monitoring SLOs via Terraform. Burn-rate alerts fire on 5%/1h (P1) and 10%/6h (P2).
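The availability budgets in the table line up with a 30-day month rather than the 28-day window, which the arithmetic below makes explicit. This is a minimal sketch, not the contents of slo.yaml: the helper names are hypothetical, and the burn-rate conversion assumes that "5%/1h" means 5% of the window's error budget consumed within one hour, which the source does not spell out.

```typescript
// Allowed "bad" minutes for an availability target over a window of days.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const minutesInWindow = windowDays * 24 * 60;
  return (1 - target) * minutesInWindow;
}

// Burn rate implied by consuming `budgetFraction` of the budget within
// `alertHours`, for an SLO window of `windowDays`. A burn rate of 1 spends
// exactly the whole budget over the full window.
// ASSUMPTION: this reading of the "5%/1h" notation is not confirmed by the source.
function burnRate(budgetFraction: number, alertHours: number, windowDays: number): number {
  const windowHours = windowDays * 24;
  return (budgetFraction * windowHours) / alertHours;
}

const quoteBudget30d = errorBudgetMinutes(0.9995, 30); // about 21.6 minutes
const adminBudget30d = errorBudgetMinutes(0.999, 30);  // about 43.2 minutes
const quoteBudget28d = errorBudgetMinutes(0.9995, 28); // about 20.16 minutes
const fastBurn = burnRate(0.05, 1, 28); // about 33.6
const slowBurn = burnRate(0.1, 6, 28);  // about 11.2
```

Under these assumptions, the same 99.95% target allows roughly 20.16 minutes over a strict 28-day window, so the "per month" figures are slightly more generous than the rolling window itself.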
2. RED metrics
Every HTTP endpoint emits the standard RED set via the @melmastoon/otel-http middleware:
| Metric | Type | Labels |
|---|---|---|
| http_server_requests_total | counter | method, route, status_code, tenant_id (sampled), service |
| http_server_request_duration_seconds | histogram | same |
| http_server_in_flight_requests | gauge | service, route |
| http_server_errors_total | counter | route, error_code, service |
tenant_id cardinality is capped: high-volume tenants are emitted as their tenant id; long-tail tenants are rolled up to _other.
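The roll-up rule can be sketched as a small label-mapping function. This is an illustration only: `tenantLabel` and the way the high-volume set is supplied are assumptions, since the source does not describe how the middleware decides which tenants count as high-volume.

```typescript
const OTHER_TENANT_LABEL = "_other";

// Hypothetical helper: the set of high-volume tenants is assumed to come
// from configuration; the source does not say how it is derived.
function tenantLabel(tenantId: string, highVolumeTenants: ReadonlySet<string>): string {
  return highVolumeTenants.has(tenantId) ? tenantId : OTHER_TENANT_LABEL;
}

// Example ids are made up for illustration.
const highVolume = new Set(["tnt_a", "tnt_b"]);
const labelForBig = tenantLabel("tnt_a", highVolume);     // "tnt_a"
const labelForSmall = tenantLabel("tnt_zzz", highVolume); // "_other"
```

With this shape, the number of distinct tenant_id label values is bounded by the size of the high-volume set plus one.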
Use-case histograms
| Metric | Description |
|---|---|
| pricing_use_case_duration_seconds{use_case=…,outcome=…} | wall-clock per use case |
| pricing_use_case_failures_total{use_case=…,error_code=…} | failure counter |
| pricing_quote_derivation_steps_seconds{step=…} | per-step latency in the derivation pipeline |
Domain metrics
| Metric | Description |
|---|---|
| pricing_quotes_created_total{property_id, currency, channel} | live quotes minted |
| pricing_quotes_expired_total{reason} | expired quotes by reason |
| pricing_quote_grand_total_micro_sum / _count | running average grand total, in micro-units |
| pricing_promo_redemptions_total{promo_id, outcome} | redemption attempts |
| pricing_promo_overcap_total{promo_id} | rejected over-cap attempts |
| pricing_fx_snapshot_age_seconds{base, quote} | age of latest snapshot |
| pricing_fx_refresh_failures_total{provider, reason} | provider failures |
| pricing_dynamic_suggestion_total{outcome=generated\|accepted\|rejected\|expired} | AI HITL throughput |
| pricing_dynamic_suggestion_cost_micro_usd_sum | running AI cost |
| pricing_sharia_guard_failures_total{property_id} | sharia rejections |
| pricing_outbox_unpublished | gauge of unpublished outbox rows |
| pricing_inbox_lag_seconds{subject} | delay between event publish and handler completion |
| pricing_quote_locks_active | gauge of active locks (per property) |
3. Tracing
OpenTelemetry tracing is enabled service-wide. Sampling: 100% for /v1/pricing/quotes (cheap, high-value) and 10% otherwise; 100% on errors.
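Because an error is only known once a trace has finished, "100% on errors" suggests the keep/drop decision is made after completion (tail-based). The following is a sketch of that decision under this assumption; `CompletedTrace`, `shouldKeepTrace`, and the exact-match route comparison are hypothetical and do not describe the actual collector configuration.

```typescript
interface CompletedTrace {
  traceId: string;   // 32 hex characters
  route: string;     // route of the root server span
  hasError: boolean; // true if any span ended with an error status
}

const QUOTES_ROUTE = "/v1/pricing/quotes";
const DEFAULT_RATIO = 0.1; // 10% for everything else

// Map the low 32 bits of the trace id onto [0, 1) so the decision is
// deterministic per trace rather than random per evaluation.
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

function shouldKeepTrace(trace: CompletedTrace): boolean {
  if (trace.hasError) return true;               // 100% on errors
  if (trace.route === QUOTES_ROUTE) return true; // 100% for quotes
  return traceIdFraction(trace.traceId) < DEFAULT_RATIO;
}
```

Deriving the ratio from the trace id keeps the outcome consistent if the same trace is evaluated more than once.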
Required span attributes
Every span MUST carry:
- service.name = "pricing-service"
- service.version
- service.instance.id
- tenant.id (when known)
- actor.type, actor.id (when known)
- correlation.id
- idempotency.key (when present)
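A check for the unconditional part of this requirement could look like the sketch below. It is an assumption that such a check exists at all; `missingSpanAttributes` is a hypothetical helper, and the conditional attributes (tenant.id, actor.*, idempotency.key) are deliberately left out because their presence depends on request context.

```typescript
// Attributes required on every span regardless of context.
const ALWAYS_REQUIRED_SPAN_ATTRIBUTES = [
  "service.name",
  "service.version",
  "service.instance.id",
  "correlation.id",
] as const;

type SpanAttributes = Record<string, string | number | boolean | undefined>;

// Returns the names of unconditionally required attributes that are
// absent or empty on the given attribute bag.
function missingSpanAttributes(attrs: SpanAttributes): string[] {
  return ALWAYS_REQUIRED_SPAN_ATTRIBUTES.filter((key) => {
    const value = attrs[key];
    return value === undefined || value === "";
  });
}
```

A test helper or span processor could call this and fail fast when the returned list is non-empty.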
Spans of interest
POST /v1/pricing/quotes [server]
└── pricing.use_case.calculate_quote [internal]
    ├── pricing.derive.resolve_rate_plan
    ├── pricing.derive.derive_nightly_base
    ├── pricing.derive.apply_discounts
    ├── pricing.derive.compose_fees
    ├── pricing.derive.compose_taxes
    ├── pricing.derive.apply_fx
    ├── pricing.derive.sharia_guard
    ├── pricing.derive.pin_quote
    ├── pricing.repo.rate_plan.find_candidates
    ├── pricing.repo.rate_rule.find_by_plan
    ├── pricing.repo.tax_rule.find_applicable
    ├── pricing.repo.fee_rule.find_applicable
    ├── pricing.repo.fx_snapshot.latest
    └── pricing.outbox.publish (subject=melmastoon.pricing.quote.created.v1)
pricing.derive.* spans carry derivation.step.outcome and derivation.step.duration_ms attributes, so the derivation log embedded in the quote can be cross-checked against trace spans for any future audit.
4. Structured logs
Logs are JSON, written to stdout (Cloud Run/GKE picks them up), with the following baseline fields:
{
  "ts": "2026-04-22T10:14:09.123Z",
  "lvl": "info",
  "msg": "quote.created",
  "service": "pricing-service",
  "version": "1.42.0",
  "instance": "pricing-service-7d8c-abcde",
  "tenant_id": "tnt_…",
  "actor": { "type": "user", "id": "usr_…" },
  "correlation_id": "01H8Z…",
  "trace_id": "1234abcd…",
  "span_id": "ef56…",
  "use_case": "CalculateQuote",
  "duration_ms": 142,
  "outcome": "success",
  "ctx": { "property_id": "pty_…", "rate_plan_id": "rate_…", "quote_id": "qte_…" }
}
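The baseline fields above can be captured as a type so that call sites cannot omit them. This is a sketch only: which fields are optional is an assumption inferred from the "when known" wording used for span attributes, and `formatLogLine` is a hypothetical helper, not the service's actual logger.

```typescript
type LogLevel = "error" | "warn" | "info" | "debug";

// ASSUMPTION: fields marked optional here are guesses; the source lists
// them all as baseline fields without stating which may be absent.
interface LogEntry {
  ts: string; // ISO-8601 timestamp
  lvl: LogLevel;
  msg: string;
  service: string;
  version: string;
  instance: string;
  tenant_id?: string;
  actor?: { type: string; id: string };
  correlation_id?: string;
  trace_id?: string;
  span_id?: string;
  use_case?: string;
  duration_ms?: number;
  outcome?: string;
  ctx?: Record<string, string>;
}

// Serialises one entry as a single JSON line; the caller writes it to stdout.
function formatLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}
```

Emitting exactly one JSON object per line keeps each entry parseable as a structured record by the log agent.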
Log levels:
- error: every unexpected exception, every domain error mapped to 5xx, every outbox publish failure.
- warn: domain rejections (4xx), FX snapshot stale fallback, promo cap reached.
- info: successful use cases (always), authn results.
- debug: enabled per-tenant via runtime config (off by default; never includes raw payloads).
PII is never logged; the linter rejects logs that include email, phone, name keys outside an explicit allow-list.
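The source describes a linter, which presumably works on source code; the sketch below shows the same key check applied to a log payload at runtime, as an illustration of the rule rather than the linter itself. `findPiiKeys` and the dotted-path form of the allow-list are assumptions.

```typescript
const PII_KEYS = new Set(["email", "phone", "name"]);

// Walks a log payload and returns the dotted paths of PII-looking keys
// that are not explicitly allow-listed.
// ASSUMPTION: the allow-list is keyed by dotted path (e.g. "actor.name").
function findPiiKeys(
  value: unknown,
  allowList: ReadonlySet<string>,
  path = "",
): string[] {
  if (value === null || typeof value !== "object") return [];
  const record = value as Record<string, unknown>;
  const found: string[] = [];
  for (const key of Object.keys(record)) {
    const childPath = path === "" ? key : `${path}.${key}`;
    if (PII_KEYS.has(key) && !allowList.has(childPath)) found.push(childPath);
    // Recurse so nested objects such as ctx are covered too.
    found.push(...findPiiKeys(record[key], allowList, childPath));
  }
  return found;
}
```

A log call would be rejected whenever the returned list is non-empty.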
5. Dashboards
Three Grafana dashboards (in Cloud Monitoring) are maintained per environment. JSON definitions live in services/pricing-service/observability/dashboards/.
| Dashboard | Panels |
|---|---|
| pricing-quotes-overview | quote latency p50/p95/p99, throughput, error rate, quotes-by-currency, sharia-guard failures, fx freshness gauge, outbox depth |
| pricing-admin-and-rules | admin endpoint RED, rate-plan publish rate, rule edits per minute, promo redemption funnel |
| pricing-ai-hitl | dynamic-suggestion generation rate, accept/reject ratio, model latency, model cost, refusal rate |
Each panel is annotated with the SLO threshold and links to the corresponding alert policy.
6. Alerts
Alerts are defined in observability/alerts.yaml and delivered to:
- PagerDuty pricing-oncall (P1, P2)
- Slack #alerts-pricing (P3, P4)
- Email digest to revenue-eng@melmastoon.tech (P4 daily)
| Alert | Severity | Trigger |
|---|---|---|
| quote_latency_slo_burn_fast | P1 | 5%/1h burn on quote_latency_p99 |
| quote_5xx_spike | P1 | > 1% 5xx for 5 m on /v1/pricing/quotes |
| outbox_backlog | P1 | pricing_outbox_unpublished > 1000 for 5 m |
| inbox_lag_high | P2 | p95 of pricing_inbox_lag_seconds > 60 s for 10 m |
| fx_provider_down | P2 | 3 consecutive pricing_fx_refresh_failures_total increments |
| fx_snapshot_hard_expire_imminent | P2 | pricing_fx_snapshot_age_seconds > 60 h |
| promo_overcap_storm | P3 | pricing_promo_overcap_total > 50/min for one promo |
| sharia_guard_storm | P3 | pricing_sharia_guard_failures_total > 20/min |
| dynamic_suggestion_latency_breach | P3 | p95 > 8 s for 15 m |
| dynamic_suggestion_cost_burn | P3 | per-tenant daily cost projected > 80% of cap |
| rls_violation | P1 | any log entry with code=MELMASTOON.SECURITY.TENANT_VIOLATION |
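Several triggers have the form "metric above threshold for N minutes" (for example outbox_backlog). The sketch below shows one way to evaluate that condition over a series of gauge samples. It is not the alerting backend's implementation: `sustainedAbove` is hypothetical, and it assumes samples arrive sorted by ascending timestamp.

```typescript
interface GaugeSample {
  atMs: number;  // sample timestamp, epoch milliseconds
  value: number; // gauge reading, e.g. pricing_outbox_unpublished
}

// True when the current unbroken run of samples above `threshold` began
// at least `windowMs` before `nowMs`.
// ASSUMPTION: `samples` is sorted by ascending timestamp.
function sustainedAbove(
  samples: GaugeSample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  let breachStartMs: number | undefined;
  // Walk backwards from the newest sample to find where the breach began.
  for (const sample of [...samples].reverse()) {
    if (sample.value <= threshold) break;
    breachStartMs = sample.atMs;
  }
  return breachStartMs !== undefined && nowMs - breachStartMs >= windowMs;
}
```

A single dip below the threshold resets the run, which is what keeps short spikes from paging.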
7. Synthetic checks
Cloud Monitoring Uptime checks every 60 s from 4 regions hit:
- GET /v1/healthz (liveness)
- GET /v1/readyz (readiness; verifies DB, Redis, Pub/Sub, FX provider)
- POST /v1/pricing/quotes against a sandbox tenant with fixed seed data; verifies grand-total deterministically
Synthetic failures over two consecutive checks page on-call.
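The paging rule can be sketched as a check on the most recent results. This is illustrative only: `shouldPageOnCall` is a hypothetical helper, and it assumes results are ordered oldest to newest for a single check; how results from the 4 regions are combined is not stated in the source.

```typescript
// `checkPassed` holds the outcome of each synthetic run, oldest first.
// Pages when the latest `consecutiveFailures` runs all failed.
function shouldPageOnCall(checkPassed: boolean[], consecutiveFailures = 2): boolean {
  if (checkPassed.length < consecutiveFailures) return false;
  return checkPassed.slice(-consecutiveFailures).every((passed) => !passed);
}
```

With 60 s checks, this shape means a page is raised roughly one to two minutes after a check starts failing, and a single transient failure is ignored.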