OBSERVABILITY — payment-gateway-service

Sibling: DEPLOYMENT_TOPOLOGY · FAILURE_MODES · APPLICATION_LOGIC

This service emits metrics, traces, logs, and audit events through the platform's standard observability stack (OpenTelemetry → Cloud Operations + Cloud Monitoring; logs → Cloud Logging; audit → audit-log-service). The contract below is the source of truth for SLOs, dashboards, alerts, and runbooks.

1. Service Level Objectives (SLOs)

| SLI | Target | Measurement window | Burn rate alerts |
|---|---|---|---|
| authorize_p99_latency_ms | < 1500 ms | 30-day rolling | 2× over 1 h, 5× over 5 m |
| capture_p99_latency_ms | < 1500 ms | 30-day rolling | 2× over 1 h |
| refund_p99_latency_ms | < 2000 ms | 30-day rolling | 2× over 1 h |
| webhook_processing_p95_latency_ms | < 5000 ms | 30-day rolling | 2× over 1 h |
| authorize_success_rate | ≥ 99.0% (excluding genuine declines) | 30-day rolling | < 98.5% over 15 m |
| capture_success_rate | ≥ 99.5% | 30-day rolling | < 99.0% over 15 m |
| webhook_dispatch_success_rate | ≥ 99.9% | 30-day rolling | < 99.5% over 15 m |
| reconciliation_success_rate | ≥ 99.9% (job completes without unresolved discrepancies) | 30-day rolling | < 99.5% over 24 h |
| availability (HTTP 5xx rate) | ≤ 0.1% | 30-day rolling | > 1% over 5 m |
| idempotency_collision_rate | ≤ 0.5% (same-body collapses excluded) | 7-day rolling | > 1% over 1 h |

Error budget consumption is published to the platform SLO dashboard (Cloud Monitoring › SLO).
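
To make the burn-rate multipliers concrete: a burn rate is usually defined as the bad-event fraction observed in a window divided by the fraction the SLO allows. This document does not spell that definition out, so the Python sketch below assumes it; the function names are hypothetical and not part of the service.

```python
def burn_rate(bad_events: int, total_events: int, allowed_bad_fraction: float) -> float:
    """Return how many times faster than budgeted the error budget is being spent."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so no budget was consumed
    observed_bad_fraction = bad_events / total_events
    return observed_bad_fraction / allowed_bad_fraction


def should_alert(bad_events: int, total_events: int,
                 allowed_bad_fraction: float, multiplier: float) -> bool:
    """True when the window's burn rate reaches the alert multiplier (e.g. 2.0)."""
    return burn_rate(bad_events, total_events, allowed_bad_fraction) >= multiplier


# Availability SLO: at most 0.1% of responses may be 5xx.
# 30 bad responses out of 10,000 burns the budget roughly three times too fast.
rate = burn_rate(bad_events=30, total_events=10_000, allowed_bad_fraction=0.001)
```

Under this reading, a "2× over 1 h" alert on a 1% latency budget would fire once more than about 2% of requests in the hour exceed the latency target.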

2. Metrics

All metrics are RED + USE oriented; names follow Prometheus convention.

2.1 Request metrics

| Name | Type | Labels |
|---|---|---|
| payments_request_duration_seconds | histogram | route, method, status |
| payments_request_total | counter | route, method, status, tenant_id (low cardinality bucket) |
| payments_idempotency_collision_total | counter | scope, outcome (same_body, different_body) |
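
The tenant_id label carries a low-cardinality bucket rather than the raw tenant identifier. How the bucket is derived is not stated here; one possible approach, sketched in Python under the assumption of a stable hash into a fixed number of buckets (the bucket count and label format are hypothetical):

```python
import hashlib

BUCKET_COUNT = 32  # assumed value; the real bucket count is not specified


def tenant_bucket(tenant_id: str, buckets: int = BUCKET_COUNT) -> str:
    """Map a raw tenant ID to a stable, low-cardinality label value."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # The first 8 bytes of the digest give a uniformly distributed integer.
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"bucket-{index:02d}"
```

Hashing keeps the mapping deterministic across processes, so the same tenant always lands in the same time series while the label's cardinality stays bounded by the bucket count.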

2.2 Domain metrics

| Name | Type | Labels |
|---|---|---|
| payments_authorize_total | counter | processor, outcome (authorized, declined, requires_action, failed) |
| payments_capture_total | counter | processor, outcome |
| payments_refund_total | counter | processor, outcome, reason |
| payments_void_total | counter | processor |
| payments_cash_receipt_total | counter | property_id, currency |
| payments_amount_micro_total | counter | processor, currency, outcome |
| payments_fx_lookup_total | counter | provider, outcome |

2.3 Adapter metrics

| Name | Type | Labels |
|---|---|---|
| payments_adapter_call_duration_seconds | histogram | processor, operation (authorize/capture/…) |
| payments_adapter_error_total | counter | processor, operation, error_class (network, 5xx, 4xx, decline) |
| payments_adapter_circuit_state | gauge | processor (value: 0=closed, 1=half_open, 2=open) |
| payments_adapter_circuit_transitions_total | counter | processor, from_state, to_state |
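
The circuit gauge and the transitions counter are meant to move together: each state change sets the gauge to the new numeric value and increments one counter series. A minimal Python sketch of that bookkeeping follows. The `CircuitMetrics` class is a hypothetical in-memory stand-in for a metrics client, and it assumes an adapter starts in the closed state.

```python
from collections import Counter
from enum import IntEnum
from typing import Dict


class CircuitState(IntEnum):
    # Numeric values match the gauge encoding in the table above.
    CLOSED = 0
    HALF_OPEN = 1
    OPEN = 2


class CircuitMetrics:
    """In-memory stand-in for the gauge and the transitions counter."""

    def __init__(self) -> None:
        self.state_gauge: Dict[str, int] = {}
        self.transitions: Counter = Counter()

    def record(self, processor: str, new_state: CircuitState) -> None:
        # Assumption: an adapter that has never reported is treated as closed.
        old_state = CircuitState(self.state_gauge.get(processor, CircuitState.CLOSED))
        if old_state != new_state:
            key = (processor, old_state.name.lower(), new_state.name.lower())
            self.transitions[key] += 1
        self.state_gauge[processor] = int(new_state)
```

Repeated reports of the same state leave the counter untouched, so the counter reflects flapping while the gauge reflects the current state.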

2.4 Webhook metrics

| Name | Type | Labels |
|---|---|---|
| payments_webhook_received_total | counter | processor, event_type, signature_valid |
| payments_webhook_dispatch_duration_seconds | histogram | processor, event_type |
| payments_webhook_dispatch_total | counter | processor, event_type, outcome (applied, duplicate, failed) |
| payments_webhook_inbox_lag_seconds | gauge | processor (oldest pending row age) |
| payments_webhook_dlq_size | gauge | (no labels) |
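
payments_webhook_inbox_lag_seconds reports the age of the oldest pending inbox row. A small Python sketch of that calculation, assuming each pending row exposes a timezone-aware receive timestamp (the function and parameter names are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def inbox_lag_seconds(pending_received_at: Iterable[datetime], now: datetime) -> float:
    """Age in seconds of the oldest pending row; 0.0 when nothing is pending."""
    oldest = min(pending_received_at, default=None)
    if oldest is None:
        return 0.0
    # Clamp at zero so minor clock skew never produces a negative gauge value.
    return max(0.0, (now - oldest).total_seconds())


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pending = [now - timedelta(seconds=30), now - timedelta(seconds=420)]
lag = inbox_lag_seconds(pending, now)  # driven by the 420-second-old row
```

Any other "oldest pending row age" gauge, such as an outbox age, would take the same shape.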

2.5 Reconciliation metrics

| Name | Type | Labels |
|---|---|---|
| payments_reconciliation_run_duration_seconds | histogram | processor |
| payments_reconciliation_unmatched_total | counter | processor, side (platform_only, vendor_only) |
| payments_reconciliation_unmatched_amount_micro | counter | processor, currency, side |

2.6 Sync metrics

| Name | Type | Labels |
|---|---|---|
| payments_desktop_cash_pushed_total | counter | property_id, outcome |
| payments_desktop_cash_outbox_age_seconds | gauge | property_id (oldest pending row age) |

3. Traces

OpenTelemetry tracing is mandatory; sampling is head-based 10% on read paths and always-on for mutating paths (authorize, capture, refund, void, cash receipt, webhook dispatch). Span names use the <service>.<use_case>.<step> convention.
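
The sampling rule can be sketched as a head-based decision function. The Python below is an illustration only: it mimics the idea behind trace-ID ratio sampling (compare part of the trace ID against a bound) without using the OpenTelemetry SDK, and the use-case identifiers are hypothetical spellings of the mutating paths listed above.

```python
MUTATING_USE_CASES = {
    "authorize", "capture", "refund", "void", "cash_receipt", "webhook_dispatch",
}
READ_SAMPLE_RATIO = 0.10


def should_sample(trace_id_hex: str, use_case: str) -> bool:
    """Head-based decision: always sample mutating paths, about 10% of the rest."""
    if use_case in MUTATING_USE_CASES:
        return True
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value,
    # so every service that sees the same trace ID reaches the same decision.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(READ_SAMPLE_RATIO * (1 << 64))
```

Deriving the decision from the trace ID, rather than from a fresh random number, keeps a trace either fully sampled or fully dropped across service boundaries.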

3.1 Standard spans for AuthorizePaymentUseCase

payments.authorize_payment.handle
├── payments.authorize_payment.idempotency.lookup
├── payments.authorize_payment.adapter.select
├── payments.authorize_payment.fx.snapshot
├── payments.authorize_payment.persist.intent
├── payments.adapter.stripe.authorize (← outbound; duration measured against vendor SLO)
├── payments.authorize_payment.persist.outcome
└── payments.authorize_payment.outbox.publish

3.2 Required attributes on every span

  • tenant.id (low-cardinality bucket — never raw)
  • processor
  • payment.id if available
  • idempotency.key.hash (sha256 prefix, 8 chars)
  • feature.flags (resolved set)
  • error.code on failure spans
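
The idempotency.key.hash attribute keeps the raw key out of trace data while still letting operators correlate spans. A minimal Python sketch, assuming "8 chars" means the first 8 hexadecimal characters of the SHA-256 digest of the UTF-8 encoded key (the function name is hypothetical):

```python
import hashlib


def idempotency_key_hash(idempotency_key: str) -> str:
    """Return the first 8 hex characters of the key's SHA-256 digest."""
    return hashlib.sha256(idempotency_key.encode("utf-8")).hexdigest()[:8]
```

An 8-character prefix is 32 bits, which is enough to group spans for one key during debugging but should not be treated as a unique identifier.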

3.3 Trace context propagation

  • Inbound REST/Pub/Sub: traceparent extracted; new span starts as child.
  • Outbound adapter calls: traceparent injected into vendor SDK request headers where supported.
  • Webhook dispatch: the trace continues from the originating intent if the vendor includes a tracing-friendly metadata field; otherwise a new trace is started with a span link to the intent's trace.
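
For reference, extracting an inbound traceparent amounts to parsing the W3C Trace Context header format `version-traceid-parentid-flags`. In practice the OpenTelemetry propagator does this; the Python sketch below only illustrates the version 00 format, with hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex, 32 hex, 16 hex, 2 hex, all lowercase, dash-separated.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is absent or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A `None` result corresponds to the case where no usable context arrives and a new trace has to be started.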

4. Structured logs

All logs are JSON via Pino, shipped to Cloud Logging through the GKE log agent. Required fields:

  • timestamp (ISO-8601)
  • severity (DEBUG/INFO/WARN/ERROR/CRITICAL)
  • service = payment-gateway-service
  • version (git sha + semver)
  • traceId, spanId
  • tenantId, requestId, useCase
  • event (short snake-case verb), outcome
  • Domain-specific fields (paymentId, processor, amountMicro, currency)

Forbidden in logs: PAN, CVV, full processor token, webhook signature, secret URI body, raw card-related fields. The platform log filter strips them defensively.
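
A rough Python sketch of the two rules above, required fields and forbidden fields, applied before a line is written. The service itself logs through Pino, so this only illustrates the contract; the forbidden key names are hypothetical spellings, and key-based filtering alone would not catch a PAN embedded in free text.

```python
import json
from datetime import datetime, timezone

# Hypothetical key names for the forbidden categories listed above.
FORBIDDEN_KEYS = {"pan", "cvv", "processorToken", "webhookSignature"}
REQUIRED_KEYS = (
    "timestamp", "severity", "service", "version", "traceId", "spanId",
    "tenantId", "requestId", "useCase", "event", "outcome",
)


def build_log_line(fields: dict) -> str:
    """Drop forbidden keys, fill defaults, and reject entries missing required fields."""
    entry = {k: v for k, v in fields.items() if k not in FORBIDDEN_KEYS}
    entry.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    entry.setdefault("service", "payment-gateway-service")
    missing = [k for k in REQUIRED_KEYS if k not in entry]
    if missing:
        raise ValueError(f"missing required log fields: {missing}")
    return json.dumps(entry, sort_keys=True)
```

Failing loudly on a missing field is a design choice for the sketch; a production logger might instead emit the line and count the violation.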

4.1 Notable log events

| Event | Severity | Notes |
|---|---|---|
| idempotency.replayed | INFO | safe replay served from cache |
| idempotency.collision | WARN | different body for same key; operator alert if rate climbs |
| adapter.declined | INFO | normal business outcome |
| adapter.error | ERROR | tagged with error.class |
| adapter.circuit.opened | WARN | also emits melmastoon.payment.adapter.health_changed.v1 |
| webhook.signature.invalid | ERROR | raises a security alert if the rate exceeds 3× baseline |
| webhook.dispatch.failed | ERROR | includes the attempt number; moved to the DLQ at attempt 7 |
| reconciliation.discrepancy.found | WARN | one row per discrepancy |
| cash.receipt.recorded | INFO | offline-vs-online flag included |
| pci.pan_exposure.blocked | CRITICAL | pages SecOps |
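
The webhook.dispatch.failed note implies a bounded retry loop that ends in the DLQ. A minimal Python sketch of that routing decision, assuming "DLQ at attempt 7" means the seventh failed attempt is dead-lettered rather than retried (the function name and return labels are hypothetical):

```python
MAX_ATTEMPTS = 7  # from the table: a delivery goes to the DLQ at attempt 7


def next_step(attempt: int, succeeded: bool) -> str:
    """Decide what happens after a dispatch attempt (attempts are 1-based)."""
    if succeeded:
        return "applied"
    if attempt >= MAX_ATTEMPTS:
        return "dlq"
    return "retry"
```

Logging the attempt number with each failure makes it possible to tell from a single log line how close an event is to the DLQ.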

5. Alerts (routed via notification-service + PagerDuty)

| Alert | Condition | Severity | Routing |
|---|---|---|---|
| payments_authorize_5xx_burn_2h | 5xx rate > 1% for 2 h | P1 | on-call payments engineer |
| payments_capture_success_rate_low | < 99.0% over 15 m | P1 | on-call payments engineer |
| payments_webhook_inbox_lag_high | payments_webhook_inbox_lag_seconds > 300 for 10 m | P2 | on-call payments engineer |
| payments_webhook_dlq_growing | payments_webhook_dlq_size increases by ≥ 10 in 1 h | P2 | on-call payments engineer |
| payments_adapter_circuit_open | any adapter open for > 5 m in production | P2 | on-call + vendor-management bot |
| payments_pci_pan_exposure_blocked | any occurrence | P0 | SecOps + payments lead, paged immediately |
| payments_reconciliation_failed | reconciliation job did not complete by 04:00 UTC | P2 | on-call payments engineer |
| payments_reconciliation_unmatched_amount_high | payments_reconciliation_unmatched_amount_micro > tenant threshold | P2 | accountant on call (per tenant) |
| payments_idempotency_collision_high | collision rate > 1% over 1 h | P3 | payments engineer (working hours) |
| payments_desktop_cash_outbox_age_high | any property has age > 4 h | P3 | property manager via notification-service |

Each alert links to a runbook URL at https://runbooks.melmastoon.ghasi.io/payments/<slug>.
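
Two condition shapes recur in the alert table: a value staying above a threshold "for" a duration, and a gauge growing by some amount within a window. The real rules live in Cloud Monitoring; the Python sketch below only illustrates the semantics, with hypothetical helper names, time-ordered `(timestamp, value)` samples, and an assumed 60-second scrape gap.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def sustained_above(samples: Sequence[Sample], threshold: float,
                    duration_s: float, now: float, max_gap_s: float = 60.0) -> bool:
    """True if the samples cover the last duration_s seconds and all exceed threshold."""
    window = [(t, v) for t, v in samples if now - duration_s <= t <= now]
    if not window:
        return False
    # Require data near the start of the window, so a brief spike cannot qualify.
    starts_early_enough = min(t for t, _ in window) <= now - duration_s + max_gap_s
    return starts_early_enough and all(v > threshold for _, v in window)


def grew_by(samples: Sequence[Sample], min_increase: float,
            window_s: float, now: float) -> bool:
    """True if the gauge rose by at least min_increase across the window."""
    window = [v for t, v in samples if now - window_s <= t <= now]
    return len(window) >= 2 and window[-1] - window[0] >= min_increase
```

For example, the inbox-lag condition corresponds to `sustained_above(samples, 300, 600, now)` and the DLQ condition to `grew_by(samples, 10, 3600, now)`.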

6. Dashboards

The Cloud Monitoring workspace payment-gateway-service includes:

  1. Service overview: RED metrics for top 10 endpoints, error-budget burn, SLO compliance.
  2. Adapter health: per-adapter latency, error rate, circuit state, fallback rate.
  3. Webhook pipeline: receive rate, dispatch latency, inbox lag, DLQ size.
  4. Reconciliation: per-tenant per-processor matched/unmatched counts and totals.
  5. Cash flows: per-property receipts, refunds, dual-sign-off rate, drift events.
  6. PCI hygiene: PAN-exposure-blocked counter (must remain at 0), pci-scan results.

7. Audit trail

Every domain mutation (authorize, capture, refund, void, cash receipt, webhook applied, reconciliation completed, chargeback evidence submitted) emits an audit record to audit-log-service with:

  • actor (user or service identity)
  • tenant id, payment id, amount, currency
  • before/after state
  • ai-provenance id (where applicable)
  • correlation id

Audit records are retained for 7 years per the financial-records policy.
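
The audit record fields and the retention rule can be sketched as a small data structure. This Python example is illustrative only: the field names, the use of micro-units for the amount, and the state strings are assumptions, and the actual schema is owned by audit-log-service.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    actor: str                       # user or service identity
    tenant_id: str
    payment_id: str
    amount_micro: int                # assumed unit, mirroring amountMicro in logs
    currency: str
    before_state: Optional[str]      # None when the mutation creates the payment
    after_state: str
    correlation_id: str
    ai_provenance_id: Optional[str] = None  # only where applicable

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def retention_expiry(recorded_on: date, years: int = 7) -> date:
    """Earliest date a record may be purged under the 7-year retention policy."""
    try:
        return recorded_on.replace(year=recorded_on.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return recorded_on.replace(year=recorded_on.year + years, day=28)
```

Making the record immutable (`frozen=True`) reflects the append-only nature of an audit trail.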