DEPLOYMENT_TOPOLOGY — payment-gateway-service

Sibling: OBSERVABILITY · SECURITY_MODEL · LOCAL_DEV_SETUP · SERVICE_READINESS

This service runs on Google Cloud Platform (per platform standard; Electron is the only desktop technology, never Tauri). Production workloads run on GKE Autopilot; data lives in Cloud SQL Postgres 16; events flow through Pub/Sub; secrets in Secret Manager; key material in Cloud KMS.

1. Environments

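Env     | GCP project         | Purpose                             | Vendor mode
--------+---------------------+-------------------------------------+----------------------------------------------------------
dev     | gm-payments-dev     | Local-merge integration             | All processors in sandbox; hesabpay-mock
staging | gm-payments-staging | Pre-prod validation, contract tests | Stripe/PayPal sandbox; HesabPay sandbox
prod    | gm-payments-prod    | Live                                | Stripe/PayPal live; HesabPay live (per-region per-tenant)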

Per-tenant Cloud SQL schemas exist in each environment; tenants in dev/staging are synthetic.

2. Cluster topology

  • GKE Autopilot clusters per env, regional (us-central1 primary, europe-west1 warm-standby for EU tenants).
  • Two deployments:
    1. payment-api — REST + webhook receivers, HPA on RPS + p95 latency, min 3 pods, max 60.
    2. payment-worker — webhook dispatcher, reconciliation jobs, outbox flush, HPA on Pub/Sub backlog, min 2 pods, max 30.
  • A scheduled CronJob payments-reconciliation-daily runs at 02:00 UTC per tenant per processor.
  • A scheduled CronJob payments-idempotency-gc runs hourly, dropping expired idempotency keys.
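
To make the HPA bounds above concrete, here is a minimal sketch of the core replica calculation that the Kubernetes Horizontal Pod Autoscaler applies to a single metric: scale the current replica count by the ratio of observed to target metric, round up, and clamp to the configured bounds. The backlog target of 100 messages per pod is a made-up value for illustration; the real targets are not stated in this document. The sketch ignores the HPA tolerance band and stabilization windows, and when several metrics are configured (as for payment-api) the HPA takes the largest proposal.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return ceil(current_replicas * current_metric / target_metric), clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# payment-worker bounds from this section (min 2, max 30).
# Assumed numbers: 4 pods, 450 backlog messages per pod, target of 100 per pod.
print(desired_replicas(4, 450.0, 100.0, min_replicas=2, max_replicas=30))  # 18
```

With payment-api, the same function would be evaluated once for RPS and once for p95 latency, and the larger result would win.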

3. Network

  • All inbound API traffic enters via the platform gateway (api.melmastoon.ghasi.io) → Cloud Armor → Internal HTTPS LB → payment-api.
  • Webhook traffic enters via a separate subdomain webhooks.payments.melmastoon.ghasi.io → Cloud Armor (with vendor IP allowlists per SECURITY_MODEL §4.3) → payment-api webhook receivers.
  • Egress to vendors (Stripe, PayPal, HesabPay) goes through a Cloud NAT with a fixed public IP, allowlisted on each vendor's side where they support it.
  • Egress to the FX provider goes through the same NAT.
  • Service-to-service (Pub/Sub, Cloud SQL, Secret Manager, KMS) over Private Service Connect — no public IPs.

                               ┌─────────────────────────────────┐
                               │ api.melmastoon.ghasi.io         │
guests / staff / s2s ─────────▶│ Cloud Armor + LB                │──▶ payment-api ──┬─▶ Cloud SQL (Postgres 16, schema-per-tenant)
                               └─────────────────────────────────┘                  │
                                                                                    ├─▶ Pub/Sub (events out + saga in)
                               ┌─────────────────────────────────┐                  │
stripe / paypal / hesabpay ───▶│ webhooks.payments.melmastoon.…  │──▶ payment-api ──┤
                               │ Cloud Armor + WAF + IP allow    │                  │
                               └─────────────────────────────────┘                  │
                                                                                    ├─▶ Secret Manager (vendor creds)
payment-worker ◀── Pub/Sub backlog (HPA) ── outbox / webhook inbox flush            │
                                                                                    └─▶ Cloud KMS (CMEK envelope decrypt)

Cloud NAT (fixed egress IP) ──▶ Stripe / PayPal / HesabPay / FX
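
The vendor IP allowlisting on the webhook path is enforced by Cloud Armor, not by application code, but the matching logic can be sketched in a few lines. The example below is illustrative only: `VENDOR_ALLOWLIST` and `is_allowed` are hypothetical names, and the CIDRs are RFC 5737 documentation ranges, not real vendor ranges (the real ones live in SECURITY_MODEL §4.3).

```python
import ipaddress

# Hypothetical allowlist. These CIDRs are documentation-only ranges (RFC 5737),
# standing in for the vendor ranges defined elsewhere.
VENDOR_ALLOWLIST = {
    "stripe": ["192.0.2.0/24"],
    "paypal": ["198.51.100.0/24"],
    "hesabpay": ["203.0.113.0/28"],
}


def is_allowed(vendor: str, source_ip: str) -> bool:
    """Return True if source_ip falls inside any CIDR allowlisted for vendor."""
    addr = ipaddress.ip_address(source_ip)  # raises ValueError on malformed input
    networks = (ipaddress.ip_network(cidr) for cidr in VENDOR_ALLOWLIST.get(vendor, []))
    return any(addr in net for net in networks)


print(is_allowed("stripe", "192.0.2.10"))     # True
print(is_allowed("stripe", "198.51.100.10"))  # False
```

An unknown vendor has an empty allowlist, so every address is rejected by default.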

4. Data plane

  • Cloud SQL Postgres 16, regional HA (primary + standby), private IP only.
  • Read replica for analytics queries (no app traffic).
  • Backups: continuous PITR (35-day retention), daily snapshots (90-day retention), monthly archive snapshot to Coldline GCS bucket (7-year retention).
  • Maintenance window: Sundays 03:00–05:00 UTC.
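
The three backup tiers above overlap, so for a given point in the past more than one restore source may be available. A small sketch, assuming the retention figures listed in this section and approximating 7 years as 7 × 365 days, shows which tiers still cover data of a given age. The tier names and the `restore_sources` helper are hypothetical.

```python
from datetime import timedelta

# Retention per tier, taken from the list above. "7 years" is approximated
# as 7 * 365 days for this illustration.
RETENTION = [
    ("pitr", timedelta(days=35)),
    ("daily_snapshot", timedelta(days=90)),
    ("monthly_archive", timedelta(days=7 * 365)),
]


def restore_sources(age: timedelta) -> list[str]:
    """Return the backup tiers whose retention still covers data of this age."""
    return [name for name, keep in RETENTION if age <= keep]


print(restore_sources(timedelta(days=10)))   # ['pitr', 'daily_snapshot', 'monthly_archive']
print(restore_sources(timedelta(days=60)))   # ['daily_snapshot', 'monthly_archive']
print(restore_sources(timedelta(days=400)))  # ['monthly_archive']
```

Coverage is not the same as granularity: only PITR can restore to an arbitrary moment, while the snapshot tiers restore to the nearest daily or monthly snapshot.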

5. Configuration & secrets

  • App config via Kubernetes ConfigMaps for non-secret env (FX provider id, default circuit-breaker thresholds, feature flags).
  • Secrets via External Secrets Operator pulling from Secret Manager. Pods receive secrets as env vars or mounted files (vendor private keys are always mounted as files, never exposed as env vars).
  • Workload Identity binds the payment-api and payment-worker GSAs to namespace-scoped permissions (secretmanager.secretAccessor, cloudkms.cryptoKeyDecrypter, cloudsql.client, pubsub.publisher, pubsub.subscriber).
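
On the application side, reading a file-mounted secret could look like the sketch below. The function name `read_mounted_secret`, the file name in the usage note, and the mount directory are assumptions for illustration; the actual mount paths are not specified in this document.

```python
from pathlib import Path


def read_mounted_secret(name: str, base_dir: Path) -> bytes:
    """Read a secret that has been mounted as a file under base_dir.

    The name must be a plain file name; anything that could escape the mount
    directory is rejected.
    """
    if name in ("", ".", "..") or "/" in name or "\\" in name:
        raise ValueError(f"invalid secret name: {name!r}")
    return (base_dir / name).read_bytes()
```

A caller might use `read_mounted_secret("stripe_private_key.pem", Path("/var/run/secrets/payments"))`, where both the file name and the directory are hypothetical. Keeping key material in files means it does not appear in process environment dumps.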

6. Release pipeline

  • CI on PR: lint + unit + integration + contract + PCI scan + OpenAPI diff + event-schema diff.
  • Merge to main: build container image (Wolfi-based distroless), sign with Cosign, push to Artifact Registry, deploy to dev.
  • Promotion to staging: manual approval; runs e2e:sandbox suite in staging post-deploy.
  • Promotion to prod: manual approval (2-engineer); progressive rollout with canary at 5% for 30 min, then 25% for 30 min, then 100%; auto-rollback on SLO burn or error-rate spike.
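
The progressive rollout to prod can be sketched as a loop over stages with a health gate at each one. This is a simplified model, not the pipeline's actual implementation: `run_rollout` and `is_healthy` are hypothetical names, and the real health check (SLO burn or error-rate spike) is replaced by a callback. The sketch does not actually wait for the hold period.

```python
from typing import Callable, List, Sequence, Tuple

# (traffic percent, hold minutes) per stage, as described in this section.
STAGES: Sequence[Tuple[int, int]] = ((5, 30), (25, 30), (100, 0))


def run_rollout(
    is_healthy: Callable[[int], bool],
    stages: Sequence[Tuple[int, int]] = STAGES,
) -> Tuple[str, List[int]]:
    """Walk through the canary stages; roll back at the first unhealthy stage.

    is_healthy(percent) stands in for the SLO-burn and error-rate checks that
    a real pipeline would evaluate throughout each hold period.
    """
    reached: List[int] = []
    for percent, _hold_minutes in stages:
        reached.append(percent)
        if not is_healthy(percent):
            return "rolled_back", reached
    return "promoted", reached


print(run_rollout(lambda percent: True))          # ('promoted', [5, 25, 100])
print(run_rollout(lambda percent: percent < 25))  # ('rolled_back', [5, 25])
```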

7. Observability stack

  • Metrics: OpenTelemetry → Cloud Operations.
  • Traces: OpenTelemetry → Cloud Trace.
  • Logs: stdout JSON → Cloud Logging.
  • Alerts: routed via notification-service to PagerDuty.
  • Dashboards: Cloud Monitoring workspace payment-gateway-service (see OBSERVABILITY).
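
For the "stdout JSON" log path, a minimal sketch using only the Python standard library is shown below. It emits one JSON object per line with `severity` and `message` keys, which Cloud Logging treats as special fields in structured logs. The logger name and the `tenant_id` field are assumptions for illustration; the service's real log schema is described in OBSERVABILITY.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Hypothetical extra field, set via logger.info(..., extra={"tenant_id": ...}).
        tenant_id = getattr(record, "tenant_id", None)
        if tenant_id is not None:
            payload["tenant_id"] = tenant_id
        return json.dumps(payload)


def configure_logging(name: str = "payment-api") -> logging.Logger:
    """Attach a JSON stdout handler to the named logger and return it."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would be along the lines of `configure_logging().info("refund accepted", extra={"tenant_id": "t-123"})`.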

8. Disaster recovery

Scenario              | RPO                           | RTO               | Strategy
----------------------+-------------------------------+-------------------+---------------------------------------------------------------
Cluster failure       | 0                             | 15 m              | Failover to warm standby region (DNS swap)
DB primary failure    | 0 (HA standby in same region) | 5 m               | Cloud SQL automatic failover
Region-wide outage    | ≤ 5 m (PITR)                  | 60 m              | Restore PITR to standby region; webhook ingestion paused via 503 (vendors retry); re-issued events on recovery
Pub/Sub outage        | n/a                           | as long as outage | Outbox holds; resumes when Pub/Sub returns
Secret Manager outage | 5 m read cache covers         | as long as outage | Cached secrets serve in-flight requests; new tenants paused

Quarterly DR drills include a region failover and a full restore test.
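
The "outbox holds; resumes when Pub/Sub returns" behavior can be illustrated with a small in-memory model. This is a sketch under stated assumptions: in the real service the outbox would be a database table written in the same transaction as the business change, and the `Outbox` class, its methods, and the use of `ConnectionError` to signal an outage are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class Outbox:
    """In-memory stand-in for an outbox: events wait until publishing succeeds."""

    def __init__(self, publish: Callable[[dict], None]) -> None:
        self._publish = publish
        self._pending: Deque[dict] = deque()

    def enqueue(self, event: dict) -> None:
        self._pending.append(event)

    def flush(self) -> int:
        """Publish pending events in order; stop at the first failure.

        Returns the number of events published. An event is removed only
        after publish() returns, so a failure leaves it queued for retry.
        """
        sent = 0
        while self._pending:
            try:
                self._publish(self._pending[0])
            except ConnectionError:
                break  # broker unavailable: keep the event and retry later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

Because an event is dropped from the queue only after a successful publish, delivery is at-least-once and consumers need to tolerate duplicates.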

9. Tenant onboarding & offboarding flow (deploy-side)

  • Onboarding: tenant-service calls payments-admin-cli provision-tenant <id> which creates the schema, seeds vendor pointers, and runs initial migrations.
  • Offboarding: 30-day soft-delete window; after that, schema dropped, secrets destroyed, archived snapshot retained per legal hold rules (SECURITY_MODEL §11).
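
The offboarding rule reduces to a date comparison. The sketch below assumes the 30-day window is measured from the moment of soft deletion and that the purge becomes due once a full 30 days have elapsed; `purge_due` is a hypothetical helper, not part of payments-admin-cli.

```python
from datetime import datetime, timedelta, timezone

SOFT_DELETE_WINDOW = timedelta(days=30)


def purge_due(soft_deleted_at: datetime, now: datetime) -> bool:
    """Return True once the soft-delete window has fully elapsed.

    Both arguments are expected to be timezone-aware datetimes.
    """
    return now - soft_deleted_at >= SOFT_DELETE_WINDOW


deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(purge_due(deleted, datetime(2024, 1, 20, tzinfo=timezone.utc)))  # False
print(purge_due(deleted, datetime(2024, 1, 31, tzinfo=timezone.utc)))  # True
```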