OBSERVABILITY — payment-gateway-service
Sibling: DEPLOYMENT_TOPOLOGY · FAILURE_MODES · APPLICATION_LOGIC
This service emits metrics, traces, logs, and audit events through the platform's standard observability stack (OpenTelemetry → Cloud Operations + Cloud Monitoring; logs → Cloud Logging; audit → audit-log-service). The contract below is the source of truth for SLOs, dashboards, alerts, and runbooks.
1. Service Level Objectives (SLOs)
| SLI | Target | Measurement window | Burn rate alerts |
|---|---|---|---|
| authorize_p99_latency_ms | < 1500 ms | 30-day rolling | 2× over 1 h, 5× over 5 m |
| capture_p99_latency_ms | < 1500 ms | 30-day rolling | 2× over 1 h |
| refund_p99_latency_ms | < 2000 ms | 30-day rolling | 2× over 1 h |
| webhook_processing_p95_latency_ms | < 5000 ms | 30-day rolling | 2× over 1 h |
| authorize_success_rate | ≥ 99.0% (excluding genuine declines) | 30-day rolling | < 98.5% over 15 m |
| capture_success_rate | ≥ 99.5% | 30-day rolling | < 99.0% over 15 m |
| webhook_dispatch_success_rate | ≥ 99.9% | 30-day rolling | < 99.5% over 15 m |
| reconciliation_success_rate | ≥ 99.9% (job completes without unresolved discrepancies) | 30-day rolling | < 99.5% over 24 h |
| availability (HTTP 5xx rate) | ≤ 0.1% | 30-day rolling | > 1% over 5 m |
| idempotency_collision_rate | ≤ 0.5% (same-body collapses excluded) | 7-day rolling | > 1% over 1 h |
Error budget consumption is published to the platform SLO dashboard (Cloud Monitoring › SLO).
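To make the burn-rate column concrete: a burn rate is the observed bad-event ratio divided by the error budget (the bad-event ratio the SLO allows). The sketch below is illustrative only; `burnRate` and `daysToExhaustion` are hypothetical helper names, not part of the service or of Cloud Monitoring.

```typescript
// Hypothetical helpers; names and signatures are assumptions for illustration.

/** Burn rate = observed bad-event ratio / allowed bad-event ratio (the error budget). */
function burnRate(observedErrorRatio: number, errorBudgetRatio: number): number {
  if (errorBudgetRatio <= 0) throw new RangeError("error budget must be > 0");
  return observedErrorRatio / errorBudgetRatio;
}

/** Days until a rolling-window budget is fully spent at a constant burn rate. */
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate <= 0 ? Infinity : windowDays / rate;
}

// availability SLO: 5xx rate ≤ 0.1%, i.e. an error budget of 0.001.
// An observed 5xx rate of 0.2% burns the budget at twice the sustainable pace.
const observedRate = burnRate(0.002, 0.001);
console.log(observedRate, daysToExhaustion(observedRate, 30));
```

At a sustained 2× burn, a 30-day budget would be spent in half the window, which is why the table pairs a slower 1 h check with a faster 5 m check on the authorize path.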
2. Metrics
All metrics are RED + USE oriented; names follow Prometheus convention.
2.1 Request metrics
| Name | Type | Labels |
|---|---|---|
| payments_request_duration_seconds | histogram | route, method, status |
| payments_request_total | counter | route, method, status, tenant_id (low cardinality bucket) |
| payments_idempotency_collision_total | counter | scope, outcome (same_body, different_body) |
2.2 Domain metrics
| Name | Type | Labels |
|---|---|---|
| payments_authorize_total | counter | processor, outcome (authorized, declined, requires_action, failed) |
| payments_capture_total | counter | processor, outcome |
| payments_refund_total | counter | processor, outcome, reason |
| payments_void_total | counter | processor |
| payments_cash_receipt_total | counter | property_id, currency |
| payments_amount_micro_total | counter | processor, currency, outcome |
| payments_fx_lookup_total | counter | provider, outcome |
2.3 Adapter metrics
| Name | Type | Labels |
|---|---|---|
| payments_adapter_call_duration_seconds | histogram | processor, operation (authorize/capture/…) |
| payments_adapter_error_total | counter | processor, operation, error_class (network, 5xx, 4xx, decline) |
| payments_adapter_circuit_state | gauge | processor (value: 0=closed, 1=half_open, 2=open) |
| payments_adapter_circuit_transitions_total | counter | processor, from_state, to_state |
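The circuit-state gauge and the transitions counter are meant to move together. A minimal sketch of that pairing follows, using plain maps as a stand-in for a real metrics registry; `CircuitMetrics` and `recordTransition` are hypothetical names, and only the 0/1/2 encoding comes from the table above.

```typescript
// Gauge encoding from the table: 0=closed, 1=half_open, 2=open.
type CircuitState = "closed" | "half_open" | "open";

const CIRCUIT_STATE_VALUE: Record<CircuitState, number> = {
  closed: 0,
  half_open: 1,
  open: 2,
};

// Hypothetical in-memory stand-in for a metrics registry (not a real client API).
class CircuitMetrics {
  readonly gauge = new Map<string, number>();       // processor -> current state value
  readonly transitions = new Map<string, number>(); // "processor|from|to" -> count

  recordTransition(processor: string, from: CircuitState, to: CircuitState): void {
    if (from === to) return; // same state: nothing changed, nothing to count
    this.gauge.set(processor, CIRCUIT_STATE_VALUE[to]);
    const key = `${processor}|${from}|${to}`;
    this.transitions.set(key, (this.transitions.get(key) ?? 0) + 1);
  }
}

const metrics = new CircuitMetrics();
metrics.recordTransition("stripe", "closed", "open");
metrics.recordTransition("stripe", "open", "half_open");
console.log(metrics.gauge.get("stripe"), metrics.transitions.size);
```

Updating both in one method keeps the gauge (current state) and the counter (history of flaps) from drifting apart.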
2.4 Webhook metrics
| Name | Type | Labels |
|---|---|---|
| payments_webhook_received_total | counter | processor, event_type, signature_valid |
| payments_webhook_dispatch_duration_seconds | histogram | processor, event_type |
| payments_webhook_dispatch_total | counter | processor, event_type, outcome (applied, duplicate, failed) |
| payments_webhook_inbox_lag_seconds | gauge | processor (oldest pending row age) |
| payments_webhook_dlq_size | gauge | (no labels) |
2.5 Reconciliation metrics
| Name | Type | Labels |
|---|---|---|
| payments_reconciliation_run_duration_seconds | histogram | processor |
| payments_reconciliation_unmatched_total | counter | processor, side (platform_only, vendor_only) |
| payments_reconciliation_unmatched_amount_micro | counter | processor, currency, side |
2.6 Sync metrics
| Name | Type | Labels |
|---|---|---|
| payments_desktop_cash_pushed_total | counter | property_id, outcome |
| payments_desktop_cash_outbox_age_seconds | gauge | property_id (oldest pending row age) |
3. Traces
OpenTelemetry tracing is mandatory; sampling is head-based 10% on read paths and always-on for mutating paths (authorize, capture, refund, void, cash receipt, webhook dispatch). Span names use the <service>.<use_case>.<step> convention.
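The sampling rule can be read as a single decision made at the root span. The sketch below shows one way to express it, assuming a deterministic ratio check on the trace ID; it is not the OpenTelemetry SDK's sampler API, and the use-case identifiers (`cash_receipt`, `webhook_dispatch`, `get_payment`) are hypothetical spellings of the paths named above.

```typescript
// Use cases treated as mutating paths (always sampled). Identifier spellings are assumptions.
const MUTATING_USE_CASES = new Set([
  "authorize", "capture", "refund", "void", "cash_receipt", "webhook_dispatch",
]);

const READ_SAMPLE_RATIO = 0.1; // head-based 10% on read paths

/**
 * Head-based decision, taken once when the trace starts.
 * The low 32 bits of the trace ID are compared against the ratio, so the
 * result is deterministic for a given trace ID.
 */
function shouldSample(useCase: string, traceIdHex: string): boolean {
  if (MUTATING_USE_CASES.has(useCase)) return true;
  const low32 = parseInt(traceIdHex.slice(-8), 16);
  if (Number.isNaN(low32)) return false; // malformed ID: do not sample
  return low32 < READ_SAMPLE_RATIO * 2 ** 32;
}

console.log(shouldSample("authorize", "f".repeat(32)));   // mutating path
console.log(shouldSample("get_payment", "f".repeat(32))); // read path, high ID
```

Deriving the decision from the trace ID rather than a random draw means a retried evaluation of the same trace gives the same answer.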
3.1 Standard spans for AuthorizePaymentUseCase
payments.authorize_payment.handle
├── payments.authorize_payment.idempotency.lookup
├── payments.authorize_payment.adapter.select
├── payments.authorize_payment.fx.snapshot
├── payments.authorize_payment.persist.intent
├── payments.adapter.stripe.authorize (← outbound; duration measured against vendor SLO)
├── payments.authorize_payment.persist.outcome
└── payments.authorize_payment.outbox.publish
3.2 Required attributes on every span
- tenant.id (low-cardinality bucket, never raw)
- processor
- payment.id if available
- idempotency.key.hash (sha256 prefix, 8 chars)
- feature.flags (resolved set)
- error.code on failure spans
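A sketch of how the required attribute set might be assembled, including the 8-character sha256 prefix for the idempotency key. The `SpanInput` shape and both function names are hypothetical; only the attribute keys come from the list above. It assumes a Node.js runtime for `node:crypto`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; field names are assumptions for illustration.
interface SpanInput {
  tenantBucket: string;   // already bucketed, never the raw tenant ID
  processor: string;
  paymentId?: string;     // only set once a payment exists
  idempotencyKey: string;
  featureFlags: string[]; // resolved flag set
  errorCode?: string;     // only on failure spans
}

/** First 8 hex characters of sha256(key); the raw key never leaves the process. */
function idempotencyKeyHash(key: string): string {
  return createHash("sha256").update(key, "utf8").digest("hex").slice(0, 8);
}

function requiredSpanAttributes(input: SpanInput): Record<string, string | string[]> {
  const attrs: Record<string, string | string[]> = {
    "tenant.id": input.tenantBucket,
    processor: input.processor,
    "idempotency.key.hash": idempotencyKeyHash(input.idempotencyKey),
    "feature.flags": input.featureFlags,
  };
  // Optional attributes are omitted rather than set to empty strings.
  if (input.paymentId !== undefined) attrs["payment.id"] = input.paymentId;
  if (input.errorCode !== undefined) attrs["error.code"] = input.errorCode;
  return attrs;
}
```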
3.3 Trace context propagation
- Inbound REST/Pub/Sub: traceparent extracted; new span starts as child.
- Outbound adapter calls: traceparent injected into vendor SDK request headers where supported.
- Webhook dispatch: trace continues from the originating intent if the vendor includes a tracing-friendly metadata field; otherwise a new trace is started with a link to the intent's trace.
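For reference, the traceparent header follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below parses the version 00 form and builds the header for a child span; in practice the OpenTelemetry propagator does this, and `parseTraceparent` and `childTraceparent` are hypothetical names used only for illustration.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars
  sampled: boolean; // bit 0 of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/** Returns null for anything that is not a well-formed four-field traceparent. */
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const version = m[1];
  const traceId = m[2];
  const parentId = m[3];
  const flags = m[4];
  // Version ff and all-zero IDs are invalid under W3C Trace Context.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

/** Child span: same trace ID and sampled flag, new span ID in the parent-id field. */
function childTraceparent(parent: TraceParent, newSpanId: string): string {
  return `00-${parent.traceId}-${newSpanId}-${parent.sampled ? "01" : "00"}`;
}

const inbound = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (inbound) console.log(childTraceparent(inbound, "b7ad6b7169203331"));
```

A null result corresponds to the fallback case: start a new trace instead of continuing a malformed one.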
4. Structured logs
All logs are JSON via Pino, shipped to Cloud Logging through the GKE log agent. Required fields:
- timestamp (ISO-8601)
- severity (DEBUG/INFO/WARN/ERROR)
- service = payment-gateway-service
- version (git sha + semver)
- traceId, spanId
- tenantId, requestId, useCase
- event (short snake-case verb), outcome
- Domain-specific fields (paymentId, processor, amountMicro, currency)
Forbidden in logs: PAN, CVV, full processor token, webhook signature, secret URI body, raw card-related fields. The platform log filter strips them defensively.
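As an illustration of stripping forbidden fields before a record reaches the logger, the sketch below removes a deny-list of keys at any nesting depth. The key names in `FORBIDDEN_KEYS` are hypothetical stand-ins for the categories above; the actual filter list and the platform filter's implementation are not given in this document.

```typescript
// Hypothetical key names for the forbidden categories; the real list is an assumption.
const FORBIDDEN_KEYS = new Set(["pan", "cvv", "processorToken", "webhookSignature"]);

type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

/** Returns a copy of the payload with forbidden keys removed at every level. */
function stripForbidden(value: Json): Json {
  if (Array.isArray(value)) return value.map(stripForbidden);
  if (value !== null && typeof value === "object") {
    const out: { [k: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) {
      if (!FORBIDDEN_KEYS.has(k)) out[k] = stripForbidden(v);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const safe = stripForbidden({
  event: "adapter.declined",
  paymentId: "p1",
  details: { cvv: "123", currency: "USD" },
});
console.log(JSON.stringify(safe));
```

Dropping keys by name is a defensive backstop rather than a guarantee: it does not catch a card number embedded in a free-text message, which is why the primary control remains not logging such data at all.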
4.1 Notable log events
| Event | Severity | Notes |
|---|---|---|
| idempotency.replayed | INFO | safe replay served from cache |
| idempotency.collision | WARN | different body for same key; operator alert if rate climbs |
| adapter.declined | INFO | normal business outcome |
| adapter.error | ERROR | tagged with error.class |
| adapter.circuit.opened | WARN | also emits melmastoon.payment.adapter.health_changed.v1 |
| webhook.signature.invalid | ERROR | security alert raised if rate exceeds 3× baseline |
| webhook.dispatch.failed | ERROR | with attempt number; DLQ at attempt 7 |
| reconciliation.discrepancy.found | WARN | one row per discrepancy |
| cash.receipt.recorded | INFO | offline-vs-online flag included |
| pci.pan_exposure.blocked | CRITICAL | pages SecOps |
5. Alerts (routed via notification-service + PagerDuty)
| Alert | Condition | Severity | Routing |
|---|---|---|---|
| payments_authorize_5xx_burn_2h | 5xx rate > 1% for 2 h | P1 | on-call payments engineer |
| payments_capture_success_rate_low | < 99.0% over 15 m | P1 | on-call payments engineer |
| payments_webhook_inbox_lag_high | payments_webhook_inbox_lag_seconds > 300 for 10 m | P2 | on-call payments engineer |
| payments_webhook_dlq_growing | payments_webhook_dlq_size increases by ≥ 10 in 1 h | P2 | on-call payments engineer |
| payments_adapter_circuit_open | any adapter open for > 5 m in production | P2 | on-call + vendor-management bot |
| payments_pci_pan_exposure_blocked | any occurrence | P0 | SecOps + payments lead (paged immediately) |
| payments_reconciliation_failed | reconciliation job did not complete by 04:00 UTC | P2 | on-call payments engineer |
| payments_reconciliation_unmatched_amount_high | unmatched_amount_micro > tenant threshold | P2 | accountant on call (per tenant) |
| payments_idempotency_collision_high | rate > 1%/h | P3 | payments engineer (working hours) |
| payments_desktop_cash_outbox_age_high | any property has age > 4 h | P3 | property manager via notification-service |
Each alert links to a runbook URL at https://runbooks.melmastoon.ghasi.io/payments/<slug>.
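Several conditions above take the form "metric > threshold for a duration" (for example, inbox lag > 300 s for 10 m). The sketch below shows the intended semantics: the condition must hold without interruption for the whole duration. `Sample` and `firesFor` are hypothetical names; the real evaluation is performed by the alerting backend, not by service code.

```typescript
interface Sample {
  atMs: number;  // sample timestamp in milliseconds
  value: number; // gauge reading, e.g. payments_webhook_inbox_lag_seconds
}

/**
 * True when the most recent unbroken run of samples above `threshold`
 * has lasted at least `forMs`. Samples must be sorted by time, oldest first.
 */
function firesFor(samples: Sample[], threshold: number, forMs: number): boolean {
  let breachStart: number | null = null; // start of the current unbroken breach run
  let lastAt: number | null = null;      // timestamp of the newest sample seen
  for (const s of samples) {
    lastAt = s.atMs;
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs;
    } else {
      breachStart = null; // any sample at or below threshold resets the run
    }
  }
  return breachStart !== null && lastAt !== null && lastAt - breachStart >= forMs;
}

// One sample per minute for ten minutes, all above 300 s of lag.
const minute = 60_000;
const lag: Sample[] = Array.from({ length: 11 }, (_, i) => ({ atMs: i * minute, value: 301 }));
console.log(firesFor(lag, 300, 10 * minute));
```

A single recovered sample resets the timer, which keeps brief spikes from paging while still catching sustained lag.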
6. Dashboards
The Cloud Monitoring workspace payment-gateway-service includes:
- Service overview: RED metrics for top 10 endpoints, error-budget burn, SLO compliance.
- Adapter health: per-adapter latency, error rate, circuit state, fallback rate.
- Webhook pipeline: receive rate, dispatch latency, inbox lag, DLQ size.
- Reconciliation: per-tenant per-processor matched/unmatched counts and totals.
- Cash flows: per-property receipts, refunds, dual-sign-off rate, drift events.
- PCI hygiene: PAN-exposure-blocked counter (must remain at 0), pci-scan results.
7. Audit trail
Every domain mutation (authorize, capture, refund, void, cash receipt, webhook applied, reconciliation completed, chargeback evidence submitted) emits an audit record to audit-log-service with:
- actor (user or service identity)
- tenant id, payment id, amount, currency
- before/after state
- ai-provenance id (where applicable)
- correlation id
Audit records are retained for 7 years per financial-records policy.
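One possible shape for the audit record, derived from the field list above, with a helper that computes the 7-year retention boundary. This is a sketch under assumptions: the property names, the `occurredAt` field, the use of micro-units for the amount, and the action spellings are all hypothetical, and the actual audit-log-service contract may differ.

```typescript
// Action spellings are assumptions based on the mutations listed above.
type AuditAction =
  | "authorize" | "capture" | "refund" | "void" | "cash_receipt"
  | "webhook_applied" | "reconciliation_completed" | "chargeback_evidence_submitted";

// Hypothetical record shape; not the audit-log-service schema.
interface AuditRecord {
  action: AuditAction;
  occurredAt: string; // ISO-8601 UTC timestamp (assumed field)
  actor: { kind: "user" | "service"; id: string };
  tenantId: string;
  paymentId: string;
  amountMicro: number; // assumes the same micro-unit convention as the logs
  currency: string;
  before: unknown;     // state prior to the mutation
  after: unknown;      // state after the mutation
  aiProvenanceId?: string; // only where applicable
  correlationId: string;
}

const RETENTION_YEARS = 7;

/** Earliest instant at which a record falls outside the retention period. */
function retentionExpiry(occurredAt: string): string {
  const d = new Date(occurredAt);
  d.setUTCFullYear(d.getUTCFullYear() + RETENTION_YEARS);
  return d.toISOString();
}

const example: AuditRecord = {
  action: "capture",
  occurredAt: "2024-01-15T00:00:00.000Z",
  actor: { kind: "service", id: "payment-gateway-service" },
  tenantId: "t-1",
  paymentId: "p-1",
  amountMicro: 12_500_000,
  currency: "USD",
  before: { status: "authorized" },
  after: { status: "captured" },
  correlationId: "c-1",
};
console.log(retentionExpiry(example.occurredAt));
```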