FAILURE_MODES — payment-gateway-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER
This document enumerates the failure modes payments engineers must understand to operate this service. Each entry follows: Symptom → Detection → Cause → Mitigation → Long-term fix. Items map to runbook URLs at https://runbooks.melmastoon.ghasi.io/payments/<slug>.
1. Vendor adapter unavailable
Symptom: Authorize/capture/refund returns 502 MELMASTOON.PAYMENT.GATEWAY_UNAVAILABLE; circuit breaker opens.
Detection: payments_adapter_circuit_state{processor="…"} = 2 for > 5 m → P2 alert.
Cause: Vendor incident, network partition, certificate expiry, exhausted rate limit, expired API key.
Mitigation:
- Circuit breaker stops new calls to the failing adapter.
- Per-tenant adapter precedence routes new intents to the next-preference processor (e.g., Stripe → PayPal). Cash-on-arrival is always available.
- In-flight intents that succeeded but missed the response are reconciled when the daily batch runs (we hold the
(intent_id, processor_idempotency_key)mapping). - UI surfaces a property-aware notice via
notification-service; staff can record cash receipts as an alternative.
Long-term fix: Add additional adapters per region (e.g., Adyen for EMEA, Razorpay for South Asia) and bake more aggressive fallback into the saga via reservation-service policy.
2. Webhook lost or delayed
Symptom: Transaction stuck in authorized state past expected capture, or pending_settlement for hours.
Detection: Daily reconciliation surfaces a vendor_only line; payments_webhook_inbox_lag_seconds rising; alert P2.
Cause: Vendor delivery failure, our endpoint returned non-2xx, signature secret rotated incorrectly, network drop.
Mitigation:
- Reconciliation matches platform vs vendor settlement and back-fills missing state changes (e.g., the
capturedstate is applied based on the settlement report when the webhook never arrived). This is the canonical repair path. - Operator can replay via
payments-admin-cli webhooks replay --vendor stripe --since=…which re-fetches events from the vendor API and re-runs the dispatcher. webhook.duplicate_dropped.v1ensures idempotency if the original eventually arrives.
Long-term fix: Verify each vendor's "list events since X" endpoint is wired into a 6-hourly catch-up job (currently only daily).
3. Double-charge attempt
Symptom: Same guest, same reservation, two transactions both captured.
Detection: A pre-write check in AuthorizePayment compares (reservationId, paymentMethodId, amount, currency) against the last 5 minutes of transactions; collisions emit payments_idempotency_collision_total{outcome="different_body"}. The 409 response indicates UI should retry as same-body.
Cause: Frontend retried without preserving the original Idempotency-Key; saga emitted held.v1 twice without dedupe at the inbox; race between two operators.
Mitigation:
Idempotency-Keyrequired on every authorize; same key + same body collapses to one charge.- Saga inbox dedupe by
(reservationId, sagaStepId). - If two captures land despite controls, an immediate void+refund is issued and a P1 incident is opened.
Long-term fix: Move from advisory pre-check to a unique constraint at the application layer keyed on (tenantId, reservationId, paymentMethodId, amountMicro, currency) for a 5-minute sliding window using a TTL'd attempt_lock table.
4. Refund queue overflow
Symptom: Refunds remain pending for hours; refund SLO breaches.
Detection: payments_refund_total{outcome="pending"} increases without matching outcome="refunded"; payments_webhook_inbox_lag_seconds correlated.
Cause: Vendor processing slower than our submission rate, mass cancellation event (e.g., property closure), webhook backlog.
Mitigation:
- Mass-cancellation events trigger a per-tenant rate limit on refund submission (5 RPS) to avoid vendor throttling.
- Refund worker runs on its own queue with HPA on backlog; can scale to 30 pods.
- Operator dashboard shows pending count with ETA.
Long-term fix: Group refunds where a vendor supports batch refunds; pre-validate refund amounts against captures to fail fast.
5. FX provider down or stale
Symptom: Authorize fails with MELMASTOON.PAYMENT.FX_UNAVAILABLE; or quoted rate is older than tolerance.
Detection: payments_fx_lookup_total{outcome!="ok"} rising; alert at 1% over 5 m.
Cause: ECB/OANDA outage, network partition.
Mitigation:
- Cached rates (24-h TTL) are accepted with
source="cached_24h"and emit a WARN log; the FX context records the fallback so finance can audit later. - If cache is also stale beyond 24 h, the use case fails fast; the UI shows "Foreign exchange unavailable; try again or use local currency."
- Cash-on-arrival is unaffected.
Long-term fix: Add a second FX provider (e.g., OANDA + Open Exchange Rates) and pick the median quote.
6. PAN exposure attempt
Symptom: MELMASTOON.PAYMENT.PAN_EXPOSURE_BLOCKED raised.
Detection: any occurrence → P0; SecOps paged.
Cause: A bug or misuse caused PAN-shaped data to enter our request body, log line, or event payload.
Mitigation:
- The redaction proxy and serializer both block before the data leaves the request handler.
- The transaction is rejected with a 400; no card data persists.
- Incident: snapshot the offending request (with PAN redacted), trace it to the caller, file a P0.
Long-term fix: PCI scanner test in CI plus runtime detector — both must remain green. New fields touching guest input require a security review checkbox in the PR template.
7. Tenant offboarding with pending refunds
Symptom: Tenant requested deletion; refunds still in pending.
Detection: Tenant lifecycle event tenant.deletion.requested.v1 triggers a check; if any refunds are not in terminal state, the deletion is paused with a P2 alert to the accountant.
Cause: Vendor delay during sunset window.
Mitigation:
- Soft-delete for 30 days defers schema drop; refunds continue to process as webhooks arrive.
- After 30 days, any remaining pending refunds are escalated via
notification-service; legal review may extend retention. - Card tokens are deleted at processor level immediately on tenant request; we keep only the financial record.
Long-term fix: Pre-deletion checklist enforced by tenant-service requiring zero pending refunds before scheduling a deletion.
8. MFS confirmation timeout (HesabPay / async vendors)
Symptom: Transaction stuck in pending_settlement past the configured window (default 24 h).
Detection: Cron payments-mfs-timeouts-sweep (hourly) finds rows with created_at < now() - 24h AND status='pending_settlement' AND processor IN (mfs_*).
Cause: Guest abandoned the MFS confirmation flow, or vendor never delivered the confirmation.
Mitigation:
- Sweep marks the transaction as
failedwithMELMASTOON.PAYMENT.MFS_CONFIRMATION_TIMEOUT. - Saga reacts:
transaction.failed.v1triggersreservation-serviceto release the hold or retry with a different method. - Guest receives a notification with a re-payment link.
Long-term fix: Configurable per-tenant timeout; expose to admin UI in bff-backoffice-service.
9. Chargeback for cash payment
Symptom: Vendor sends a chargeback webhook that maps to a cash_on_arrival transaction.
Detection: RecordChargebackUseCase rejects the operation as logically impossible and raises MELMASTOON.PAYMENT.CHARGEBACK_ON_CASH_IMPOSSIBLE. chargeback.received.v1 is emitted with fraudSignal: true, impossible: true.
Cause: Most likely vendor mis-association, or a fraudulent dispute. In our model cash never traversed a vendor processor, so a chargeback against it is structurally impossible — it's a strong fraud signal worth investigating.
Mitigation:
- Open an investigation case in
bff-backoffice-service; do not auto-issue any refund. - Notify accountant + property manager.
- Capture details and submit "no charge ever existed" evidence to vendor.
Long-term fix: Push vendors for stricter merchant-record matching; consider adding a watermark on cash receipt artifacts.
10. Outbox flush failure
Symptom: Outbound events delayed; downstream services don't react.
Detection: payments_central.outbox row count where published_at IS NULL rising; alert at > 1000 rows or > 5 m oldest.
Cause: Pub/Sub outage, IAM permission revoked, schema drift between event payload and registered schema (validator rejects).
Mitigation:
- Outbox worker retries with backoff; events eventually flush when Pub/Sub returns.
- Schema rejects emit ERROR logs; payload is requeued and the bad envelope investigated by the team.
Long-term fix: Add a circuit breaker on the publisher and a backpressure signal so use cases can fail fast rather than fill the outbox if Pub/Sub is down for long.
11. Optimistic concurrency conflict storm
Symptom: 409 MELMASTOON.PAYMENT.OPTIMISTIC_CONFLICT rate climbs; UI sees retries.
Detection: payments_request_total{status="409"} correlated with payments_capture_total or payments_refund_total.
Cause: A buggy caller polling at high frequency; a saga retry storm; concurrent operators.
Mitigation: Rate limiting kicks in at 30 RPS per tenant; UI shows debounced state; logs include payment.id for triage.
Long-term fix: Add command-merge logic where appropriate (e.g., capture+capture for the same payment within 1 s collapses).
12. Adapter circuit half-open misbehavior
Symptom: Adapter recovers, closed → half-open → open repeatedly (flapping).
Detection: payments_adapter_circuit_transitions_total{from="half_open",to="open"} / time > 1/min for 10 m.
Cause: Vendor recovered partially; or our health-probe traffic is too aggressive after open.
Mitigation: Increase half-open probe interval to 60 s; require 5 consecutive successes before close.
Long-term fix: Adaptive half-open thresholds based on historical recovery patterns.
13. Local-cash desktop outbox stuck
Symptom: Property's desktop reports cash receipts pending for hours.
Detection: payments_desktop_cash_outbox_age_seconds{property_id} > 14400 → P3 alert.
Cause: Property network outage; bad operator session; expired JWT not refreshing.
Mitigation: Notification to property manager with troubleshooting steps; manual escalation can drain the outbox by re-issuing JWT and triggering sync.
Long-term fix: Build sync health into the desktop status bar with a one-click "force sync" affordance.
14. Database schema-migration partial failure
Symptom: A new tenant migration applies on 99 of 100 tenant schemas; one tenant remains on previous version.
Detection: payments_central.tenant_migrations query returns version mismatch; CI gate post-deploy.
Cause: Lock contention, ad-hoc DDL on the failing tenant, transient connection error.
Mitigation: Per-tenant migrations are idempotent retriable; payments-admin-cli migrate-tenant <id> resumes; alerts gate the deploy promotion.
Long-term fix: Pre-deploy dry-run on a sampled set of tenants; per-tenant lock timeout tuned.