
04 — Event-Driven Architecture

Companion: 02 Enterprise Architecture · 03 Microservices Catalog · 05 API Design · 06 Data Models · 07 Security & Tenancy · Standards · NAMING · Standards · ERROR_CODES · ADR-0001 Core Stack · ADR-0003 Electron Offline-First

This is the authoritative source for the event-driven backbone of Ghasi Melmastoon: GCP Pub/Sub topology, the canonical event envelope, subject taxonomy, the transactional outbox + inbox patterns, the saga catalog, dead-lettering, replay, ordering, retention, observability, and explicit anti-patterns. Every per-service EVENT_SCHEMAS.md defers to this document for cross-cutting concerns; service docs only specify their own event payloads.

The cloud is GCP; the messaging substrate is GCP Pub/Sub (never NATS, never Kafka). The desktop is Electron; replay semantics on the offline desktop go through the local SQLite outbox + the /sync/v1 protocol. All naming follows MELMASTOON.* / melmastoon.* per NAMING.md.


1. Why Event-Driven

A guest pressing Pay on a tenant booking site triggers a chain that crosses at least seven bounded contexts: pricing (final amount + currency snapshot) → reservation (hold → confirm) → inventory (allocation commit, oversell guard) → payment (capture or pending-cash) → folio (open + post charges) → lock (issue key credential against the room and stay window) → notification (multilingual confirmation across email + SMS + WhatsApp) → analytics (booking indexed for the dashboard the GM opens 30 seconds later). Half of these have vendor side-effects (PayPal, Visa rails, TTLock cloud, Salto Connect, Twilio, MFS, BigQuery sink) with their own latency and failure profiles.

A synchronous REST chain across that path is not an option. Three structural reasons drive event-driven as a non-negotiable:

  1. Operational latency budget. The booking funnel commits to a sub-2s perceived latency on a 200ms RTT for the guest; the lock vendor alone may take 3–8 seconds to issue a credential. The lock issuance must run after the user-visible confirmation, on its own clock, with its own retry policy.
  2. Real-time backoffice updates. Front-desk staff need the check-in board to update the moment housekeeping marks a room ready, the moment a payment captures, the moment the AI orchestrator detects a duplicate-booking anomaly. Polling N services from the desktop on a flaky link is hostile to the user; event projections fan into a single push channel (SSE in 05 API Design §12) and into the local SQLite via /sync/v1.
  3. Offline desktop replay. The Electron backoffice is offline-first (ADR-0003). When connectivity returns after a five-hour blackout, the desktop must catch up by replaying missed events in order, deduplicated, with provenance — not by re-hitting REST endpoints in a thundering herd. This requires events to be the canonical record of state changes, with a stable subject grammar and a stable envelope.

The architectural commitment from 02 §15: no synchronous cross-service chain longer than 2 hops; everything deeper goes async via Pub/Sub + saga.

1.1 Principles

  1. Events are facts. Past-tense names; immutable payloads; one event = one fact that already happened.
  2. At-least-once delivery; exactly-once application. Producers use the transactional outbox. Consumers use the inbox table.
  3. Idempotent consumers always. Dedupe key = (eventId, consumerName).
  4. Versioned subjects + schemas. Breaking changes ship a .v<n+1> topic, dual-publish for one full release, then retire the old.
  5. Tenant on every envelope. No untenanted events except a small allow-list (platform.*).
  6. Audit + replay over the same log. Analytics, compliance, sync, and disaster recovery all consume the canonical Pub/Sub log.
  7. Money and inventory are append-only or server-authoritative. Never last-write-wins. (02 §8.2)
  8. Provenance for AI events. Every melmastoon.ai_orchestrator.* event carries aiProvenance per 02 §9.3.

2. Topology — GCP Pub/Sub

2.1 Topics

One Pub/Sub topic per event subject (<service>.<aggregate>.<verb>). The topic name is the full subject string, including the melmastoon. prefix and the .v<n> suffix:

projects/melmastoon-prod/topics/melmastoon.reservation.booking.confirmed.v1
projects/melmastoon-prod/topics/melmastoon.inventory.allocation.committed.v1
projects/melmastoon-prod/topics/melmastoon.payment.intent.captured.v1
projects/melmastoon-prod/topics/melmastoon.lock_integration.key_credential.issued.v1

This 1:1 mapping (subject = topic) keeps subscription filters trivial, lets IAM apply per-topic, and makes routing explicit. We deliberately do not use a single firehose topic with attribute filters: per-topic IAM is the strongest tenant-isolation lever Pub/Sub offers and we do not give it up for fan-out convenience.

2.2 Subscriptions

Subscription naming: <consumer-service>.<topic-short>.sub. The <topic-short> is the topic name minus the melmastoon. prefix and .v<n> suffix where unambiguous; full topic where not.

inventory-service.reservation.booking.confirmed.v1.sub
billing-service.reservation.booking.confirmed.v1.sub
lock-integration-service.reservation.booking.confirmed.v1.sub
notification-service.reservation.booking.confirmed.v1.sub
analytics-service.reservation.booking.confirmed.v1.sub
search-aggregation-service.reservation.booking.confirmed.v1.sub
audit-service.reservation.booking.confirmed.v1.sub

Each consumer owns exactly one subscription per topic it cares about. Multiple consumers never share a subscription (Pub/Sub fan-out semantics: each subscription gets its own copy).
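The two naming rules can be sketched as a pair of helpers. This is a minimal illustration, not code from the platform: the function names are hypothetical, and the subscription helper follows the listed examples, which keep the .v<n> suffix in <topic-short>.

```typescript
// Hypothetical helpers; names and the default project id are illustrative.
const SUBJECT_PREFIX = 'melmastoon.';

/** Full Pub/Sub topic resource path for an event subject (subject = topic). */
function topicPath(subject: string, project: string = 'melmastoon-prod'): string {
  return `projects/${project}/topics/${subject}`;
}

/** Subscription name: <consumer-service>.<topic-short>.sub (version kept, as in the examples). */
function subscriptionName(consumerService: string, subject: string): string {
  if (subject.indexOf(SUBJECT_PREFIX) !== 0) {
    throw new Error(`subject must start with "${SUBJECT_PREFIX}": ${subject}`);
  }
  const topicShort = subject.slice(SUBJECT_PREFIX.length);
  return `${consumerService}.${topicShort}.sub`;
}
```

For example, subscriptionName('inventory-service', 'melmastoon.reservation.booking.confirmed.v1') yields inventory-service.reservation.booking.confirmed.v1.sub, matching the first entry in the list above.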

2.3 Push vs Pull

Workload | Mode | Why
Cloud Run consumers (most services) | Push to authenticated HTTPS endpoint /internal/events/<topic-short> | Scale-to-zero compatibility; OIDC token verification at the receiver; aligns with Cloud Run autoscaling
Long-running projections (analytics-service BigQuery sink, search-aggregation-service reindex) | Pull from a worker pool | Better throughput control; explicit batch acks; cheaper per-message cost at high volume
Desktop sync replay | Indirect (sync-service projects events into the per-device sync log; the Electron desktop pulls via /sync/v1/pull) | Pub/Sub is not exposed to clients; never to anonymous; never across the WAN

2.4 Topic Inventory (canonical)

The canonical list of topics is generated from each service's EVENT_SCHEMAS.md and committed to infrastructure/pubsub/topics.json. CI fails if a service emits a topic not in that registry. The full taxonomy is enumerated in §3 below.
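The registry check reduces to a set difference. The sketch below assumes, purely for illustration, that topics.json is a flat array of topic names; the actual file format and the CI script are not specified in this document.

```typescript
/**
 * Returns the topics a service emits that are missing from the registry.
 * Assumption: the registry is a flat list of full topic names.
 */
function findUnregisteredTopics(registry: string[], emitted: string[]): string[] {
  return emitted.filter((topic) => registry.indexOf(topic) === -1);
}
```

A CI step built on this would fail the build whenever the returned list is non-empty.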

2.5 IAM

Principal | Permission
<service>-publisher@melmastoon-prod.iam | roles/pubsub.publisher on only the topics owned by that service
<service>-subscriber@melmastoon-prod.iam | roles/pubsub.subscriber on only the subscriptions owned by that service
audit-service-subscriber@… | Subscriber on every topic (allowed; audit is the universal append-only consumer)
dataflow-bigquery@… | Subscriber on the melmastoon.analytics.* topics for warm/cold sink

No service has cross-tenant publish permission on another service's topics. No human IAM principal has publisher rights in production except the on-call replay role (audit-logged, time-boxed).


3. Subject Taxonomy

3.1 Naming Grammar

Per NAMING.md:

subject := "melmastoon" "." service "." aggregate "." verb_past_tense "." version
service := short name of the owning service (service dir minus "-service")
aggregate := snake_case noun
verb_past_tense := snake_case past tense (one event = one fact)
version := "v" digit+ (starts at v1)

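Service short names map deterministically. Most drop the "-service" suffix and convert hyphens to underscores; a few (payment, theme, file) are shortened further: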

Service dir | service token
iam-service | iam
tenant-service | tenant
property-service | property
reservation-service | reservation
inventory-service | inventory
pricing-service | pricing
housekeeping-service | housekeeping
maintenance-service | maintenance
billing-service | billing
payment-gateway-service | payment
notification-service | notification
theme-config-service | theme
lock-integration-service | lock_integration
search-aggregation-service | search_aggregation
ai-orchestrator-service | ai_orchestrator
file-storage-service | file
reporting-service | reporting
analytics-service | analytics
audit-service | audit
sync-service | sync
staff-service | staff
bff-tenant-booking-service | bff_tenant_booking
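The grammar in §3.1 translates fairly directly into a regular expression. The parser below is a sketch under two assumptions that are mine rather than NAMING.md's: snake_case tokens start with a lowercase letter, and version numbers carry no leading zero. The names ParsedSubject and parseSubject are hypothetical.

```typescript
interface ParsedSubject {
  service: string;
  aggregate: string;
  verb: string;
  version: number;
}

// One snake_case token: a lowercase letter, then lowercase letters, digits, or underscores.
const TOKEN = '[a-z][a-z0-9_]*';
const SUBJECT_RE = new RegExp(
  `^melmastoon\\.(${TOKEN})\\.(${TOKEN})\\.(${TOKEN})\\.v([1-9][0-9]*)$`,
);

/** Parses a subject per the grammar above; returns null when it does not match. */
function parseSubject(subject: string): ParsedSubject | null {
  const m = SUBJECT_RE.exec(subject);
  if (m === null) return null;
  return { service: m[1], aggregate: m[2], verb: m[3], version: Number(m[4]) };
}
```

A strict reading of the grammar requires exactly three tokens between the prefix and the version, so a parser like this one rejects subjects that omit the aggregate.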

3.2 Canonical Event Catalog (≥ 60 events)

The subject column is the full Pub/Sub topic. Owning service publishes; others may consume.

Identity & Tenancy

Subject | Owner | Trigger
melmastoon.iam.user.created.v1 | iam-service | New user account provisioned
melmastoon.iam.session.issued.v1 | iam-service | Access + refresh JWT pair issued
melmastoon.iam.session.revoked.v1 | iam-service | Refresh token revoked or session killed
melmastoon.iam.device.bound.v1 | iam-service | Desktop device pairing completed
melmastoon.iam.device.unbound.v1 | iam-service | Device pairing revoked
melmastoon.iam.mfa.enrolled.v1 | iam-service | TOTP / WebAuthn factor added
melmastoon.tenant.tenant.created.v1 | tenant-service | New tenant onboarded
melmastoon.tenant.tenant.suspended.v1 | tenant-service | Plan limit / billing failure / compliance suspension
melmastoon.tenant.role.granted.v1 | tenant-service | RBAC role assigned to a membership
melmastoon.tenant.settings.changed.v1 | tenant-service | Tenant configuration mutated (currency, locales, tax, residency)

Property & Theme

Subject | Owner | Trigger
melmastoon.property.property.created.v1 | property-service | New property registered under a tenant
melmastoon.property.room.added.v1 | property-service | Sellable room added to a property
melmastoon.property.room.archived.v1 | property-service | Room retired from inventory
melmastoon.property.room_type.published.v1 | property-service | Room type spec finalized
melmastoon.property.amenity.changed.v1 | property-service | Amenity list updated (drives meta-search facets)
melmastoon.theme.preset.published.v1 | theme-config-service | New theme preset version published
melmastoon.theme.tokens.changed.v1 | theme-config-service | Tenant brand tokens updated
melmastoon.theme.content_block.changed.v1 | theme-config-service | Theme content block mutated

Reservation & Inventory

Subject | Owner | Trigger
melmastoon.reservation.booking.held.v1 | reservation-service | Inventory hold placed (TTL 10 min)
melmastoon.reservation.booking.confirmed.v1 | reservation-service | Booking confirmed (post-payment or cash-pending)
melmastoon.reservation.booking.cancelled.v1 | reservation-service | Cancellation accepted (policy-checked)
melmastoon.reservation.booking.modified.v1 | reservation-service | Non-date modification (guest, special requests)
melmastoon.reservation.booking.dates_changed.v1 | reservation-service | Stay window mutated (drives re-allocation + key update)
melmastoon.reservation.booking.check_in_started.v1 | reservation-service | Check-in flow initiated by front desk
melmastoon.reservation.booking.checked_in.v1 | reservation-service | Check-in complete (folio opened, key issued)
melmastoon.reservation.booking.check_out_started.v1 | reservation-service | Check-out flow initiated
melmastoon.reservation.booking.checked_out.v1 | reservation-service | Check-out complete (folio closed, key revoked)
melmastoon.reservation.booking.no_show.v1 | reservation-service | No-show declared after window expiry
melmastoon.inventory.allocation.committed.v1 | inventory-service | Allocation moved from hold to commit
melmastoon.inventory.allocation.released.v1 | inventory-service | Allocation released (cancel / no-show / hold expired)
melmastoon.inventory.allocation.reallocated.v1 | inventory-service | Reallocation due to dates_changed
melmastoon.inventory.stop_sell.activated.v1 | inventory-service | Stop-sell turned on for a (property, roomType, date)
melmastoon.inventory.stop_sell.cleared.v1 | inventory-service | Stop-sell cleared
melmastoon.inventory.oversell.blocked.v1 | inventory-service | Oversell guard prevented a confirm

Pricing

Subject | Owner | Trigger
melmastoon.pricing.rate_plan.published.v1 | pricing-service | Rate plan published (active for sale)
melmastoon.pricing.rate_plan.archived.v1 | pricing-service | Rate plan archived
melmastoon.pricing.calendar.updated.v1 | pricing-service | Per-date BAR / restrictions updated
melmastoon.pricing.delta.applied.v1 | pricing-service | Pricing delta computed for a date_change saga
melmastoon.pricing.quote.expired.v1 | pricing-service | Booking-funnel quote expired

Billing & Payment

Subject | Owner | Trigger
melmastoon.billing.folio.opened.v1 | billing-service | Folio opened on check-in
melmastoon.billing.folio.charge_posted.v1 | billing-service | Charge posted to folio
melmastoon.billing.folio.refund_posted.v1 | billing-service | Refund posted
melmastoon.billing.folio.closed.v1 | billing-service | Folio closed at check-out
melmastoon.billing.invoice.issued.v1 | billing-service | Invoice document issued
melmastoon.billing.invoice.voided.v1 | billing-service | Invoice voided (with reason)
melmastoon.payment.intent.created.v1 | payment-gateway-service | PaymentIntent created against a folio
melmastoon.payment.intent.captured.v1 | payment-gateway-service | Payment captured (PayPal / card / MFS)
melmastoon.payment.intent.failed.v1 | payment-gateway-service | Capture failed (gateway / decline / insufficient funds)
melmastoon.payment.intent.refunded.v1 | payment-gateway-service | Refund processed end-to-end
melmastoon.payment.cash.reconciled.v1 | payment-gateway-service | Cash-on-arrival reconciled at front desk

Lock & Key

Subject | Owner | Trigger
melmastoon.lock_integration.key_credential.issued.v1 | lock-integration-service | Key issued via vendor adapter (TTLock / Salto / Assa Abloy / Wiegand)
melmastoon.lock_integration.key_credential.updated.v1 | lock-integration-service | Key updated (room change / dates_changed)
melmastoon.lock_integration.key_credential.revoked.v1 | lock-integration-service | Key revoked (checkout / cancel / security)
melmastoon.lock_integration.key_credential.suspended.v1 | lock-integration-service | Key temporarily suspended (police hold / dispute)
melmastoon.lock_integration.vendor.error.v1 | lock-integration-service | Vendor-API failure recorded for runbook + retry
melmastoon.lock_integration.device.paired.v1 | lock-integration-service | Lock device paired to a property

Housekeeping & Maintenance

Subject | Owner | Trigger
melmastoon.housekeeping.task.created.v1 | housekeeping-service | Task generated (post-checkout / scheduled)
melmastoon.housekeeping.task.assigned.v1 | housekeeping-service | Task assigned to staff member
melmastoon.housekeeping.task.completed.v1 | housekeeping-service | Room turnover complete
melmastoon.housekeeping.room.status_changed.v1 | housekeeping-service | Room state changed (clean ↔ dirty ↔ OOO ↔ OOS)
melmastoon.maintenance.ticket.opened.v1 | maintenance-service | Work order opened
melmastoon.maintenance.ticket.closed.v1 | maintenance-service | Work order closed (with parts/labor)

Notification & AI

Subject | Owner | Trigger
melmastoon.notification.message.dispatched.v1 | notification-service | Outbound message handed to provider
melmastoon.notification.message.delivered.v1 | notification-service | Provider confirmed delivery
melmastoon.notification.message.bounced.v1 | notification-service | Provider reported bounce / failure
melmastoon.ai_orchestrator.completion.completed.v1 | ai-orchestrator-service | LLM completion finished (with provenance)
melmastoon.ai_orchestrator.completion.refused.v1 | ai-orchestrator-service | Pre/post moderation refused
melmastoon.ai_orchestrator.inference.local_completed.v1 | ai-orchestrator-service | Edge ONNX inference recorded (offline-replay-safe)
melmastoon.ai_orchestrator.anomaly.detected.v1 | ai-orchestrator-service | Anomaly raised (booking / payment / lock)
melmastoon.ai_orchestrator.hitl.decided.v1 | ai-orchestrator-service | Human-in-the-loop decision recorded

Search, File, Reporting, Analytics, Audit, Sync, Staff

Subject | Owner | Trigger
melmastoon.search_aggregation.listing.indexed.v1 | search-aggregation-service | Property+inventory snapshot reindexed
melmastoon.search_aggregation.listing.evicted.v1 | search-aggregation-service | Listing removed from index
melmastoon.file.uploaded.v1 | file-storage-service | Object uploaded to Cloud Storage via signed URL
melmastoon.file.scanned.v1 | file-storage-service | Virus scan completed
melmastoon.reporting.report.generated.v1 | reporting-service | Report rendered (PDF / CSV)
melmastoon.analytics.snapshot.computed.v1 | analytics-service | Aggregation snapshot computed
melmastoon.audit.entry.appended.v1 | audit-service | Append-only audit entry recorded
melmastoon.audit.merkle.anchored.v1 | audit-service | Daily Merkle root anchored externally
melmastoon.sync.cursor.advanced.v1 | sync-service | Per-device sync cursor moved forward
melmastoon.sync.conflict.recorded.v1 | sync-service | Conflict surfaced for operator decision
melmastoon.staff.shift.opened.v1 | staff-service | Shift clocked in
melmastoon.staff.shift.closed.v1 | staff-service | Shift clocked out

This catalog lists 82 events today. New events require a PR that adds a row here and a JSON schema under event-schemas/<service>/<aggregate>/<verb>/v<n>.json (the layout given in §5.1).


4. Event Envelope

Every Pub/Sub message body is an EventEnvelope<T> serialized as JSON. The envelope is shared across services through @ghasi/event-envelope and validated at both produce time (in the outbox publisher) and consume time (in the inbox writer).

import type { Branded } from '@ghasi/domain-primitives';

type UUIDv7 = Branded<string, 'UUIDv7'>;
type ULID = Branded<string, 'ULID'>;
type TenantId = Branded<string, 'TenantId'>;
type ISODate = Branded<string, 'ISODate'>;
type SemVer = Branded<string, 'SemVer'>;

export type RetentionClass = 'hot' | 'warm' | 'cold';
export type DataResidency = 'me-central1' | 'asia-south1' | 'europe-west1' | 'us-central1';

export interface EventEnvelope<T = unknown> {
  /** Globally unique event id; UUIDv7 (time-ordered). */
  eventId: UUIDv7;

  /** Subject minus the trailing version, e.g. 'melmastoon.reservation.booking.confirmed'. */
  eventType: string;

  /** Integer; matches the `.v<n>` suffix on the topic. */
  eventVersion: number;

  /**
   * Schema URI with content hash, e.g.
   * 'schemas://reservation/booking/confirmed/v1#sha256-9f...'
   */
  schemaUri: string;

  /** Tenancy: present on every domain event. Platform events use 'platform'. */
  tenantId: TenantId | 'platform';

  /** W3C trace correlation. Propagated end-to-end. */
  correlationId: ULID;

  /** The eventId that caused this event (saga lineage). */
  causationId?: UUIDv7;

  /** Who/what initiated the action. */
  actorId: { type: 'user' | 'system' | 'service' | 'api_key'; id: string };

  /** Wall-clock time the fact occurred (producer's clock). */
  occurredAt: ISODate;

  /** Producer identity for forensic + lineage. */
  producedBy: {
    service: string;  // e.g. 'reservation-service'
    instance: string; // Cloud Run revision id
    commit: string;   // git sha
    region: string;   // e.g. 'me-central1'
  };

  /** Idempotency: deterministic key for at-least-once safety on consumers. */
  idempotencyKey: string;

  /** The domain payload — schema-validated against the subject's JSON Schema. */
  payload: T;

  /** Free-form non-routing metadata. Bounded to 4 KiB. */
  metadata: {
    retentionClass: RetentionClass;
    dataResidency: DataResidency;
    /** True when emitted from the desktop main process and replayed on sync. */
    fromOfflineReplay?: boolean;
    /** Required on `melmastoon.ai_orchestrator.*` events. */
    aiProvenance?: AIProvenance;
    /** Pub/Sub ordering key (also set as a Pub/Sub message attribute). */
    orderingKey: string; // 'tenantId:aggregateId'
    /** Outbox row id for replay forensics. */
    outboxId: ULID;
  };
}

Envelope rules:

  1. eventId is UUIDv7 (time-ordered) — not v4. Time-ordering helps DLQ debugging, BigQuery clustering, and outbox draining.
  2. correlationId is set by the originating BFF or external webhook handler and propagated unmodified across the entire saga.
  3. causationId ties an event to its parent in a saga; the orchestrator (reservation-service for booking sagas) stamps it.
  4. idempotencyKey is computed deterministically from the business intent (e.g. for payment.intent.captured.v1, key = pay_<id>:capture); identical re-publishes by the outbox relay must collapse on the consumer side.
  5. metadata.orderingKey is always <tenantId>:<aggregateId>; this is also set as the Pub/Sub message ordering key (§11).
  6. metadata.aiProvenance is mandatory on every melmastoon.ai_orchestrator.* event; emitting one without it fails CI contract tests and is rejected by the inbox.

The schema is registered at schemas://platform/event_envelope/v1.json and is the only schema versioned independently from any single service.
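Several of the envelope rules are mechanical enough to check in code. The sketch below is not the @ghasi/event-envelope validator; it is a hypothetical checkEnvelopeRules function over a reduced envelope shape (EnvelopeLike, also hypothetical) that covers rules 5 and 6 plus the eventType and eventVersion field contracts.

```typescript
// Reduced, illustrative view of the envelope; only the fields the checks need.
interface EnvelopeLike {
  eventType: string;
  eventVersion: number;
  tenantId: string;
  metadata: { orderingKey: string; aiProvenance?: unknown };
}

const AI_PREFIX = 'melmastoon.ai_orchestrator.';

/** Returns a list of rule violations; an empty list means the checks passed. */
function checkEnvelopeRules(envelope: EnvelopeLike, aggregateId: string): string[] {
  const violations: string[] = [];

  // eventType is the subject minus the trailing version.
  if (/\.v[0-9]+$/.test(envelope.eventType)) {
    violations.push('eventType must not include the .v<n> suffix');
  }
  // eventVersion is a positive integer.
  if (envelope.eventVersion < 1 || Math.floor(envelope.eventVersion) !== envelope.eventVersion) {
    violations.push('eventVersion must be a positive integer');
  }
  // Rule 5: orderingKey is always <tenantId>:<aggregateId>.
  const expectedKey = `${envelope.tenantId}:${aggregateId}`;
  if (envelope.metadata.orderingKey !== expectedKey) {
    violations.push(`orderingKey must be ${expectedKey}`);
  }
  // Rule 6: aiProvenance is mandatory on ai_orchestrator events.
  const isAiEvent = envelope.eventType.indexOf(AI_PREFIX) === 0;
  if (isAiEvent && envelope.metadata.aiProvenance === undefined) {
    violations.push('aiProvenance is required on ai_orchestrator events');
  }
  return violations;
}
```

Returning a list rather than throwing on the first failure lets a contract test report every violation in one run.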


5. Schema Registry

5.1 Storage

JSON Schemas live in the monorepo at event-schemas/<service>/<aggregate>/<verb>/v<n>.json and are also published to Cloud Storage (gs://melmastoon-event-schemas-prod/) on release. The published bundle is the runtime source of truth; the monorepo copy is the source of versioned changes.

Each schema's content hash is embedded in the schemaUri field of the envelope. Consumers verify the hash on first sight per process and cache the schema; mismatches bypass the cache and re-fetch.

5.2 Evolution Rules

Change | Allowed within v<n> | Required action
Add an optional field with a default | Yes | Bump schemaUri content hash; no consumer changes required
Add an enum value that consumers ignore safely | Yes | Bump content hash; document in CHANGELOG.events.md
Tighten a numeric range / regex | Yes if no producer ever produced an out-of-range value | Run a backfill audit against last 30 days of events first
Make an optional field required | No | Ship .v<n+1>; dual-publish
Remove a field | No | Ship .v<n+1>; dual-publish
Change a field's type | No | Ship .v<n+1>; dual-publish
Rename a field | No | Ship .v<n+1>; dual-publish
Change eventType (subject) | No | New topic + new subscriptions

5.3 Dual-Publish Window

When a service ships .v2, it publishes both .v1 and .v2 for one full release (≥ 2 weeks). Consumers cut over at their own pace; once all subscriptions on .v1 show zero deliveries for 7 consecutive days, .v1 is retired by deleting the topic and removing the schema from the registry. The cutover is orchestrated by audit-service which records melmastoon.audit.entry.appended.v1 entries marking each consumer cutover.
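The retirement condition can be expressed as a small predicate. This sketch assumes delivery counts are available as one number per day per subscription, oldest first; the input shape and the name canRetireTopic are hypothetical, and where the counts come from (for example Cloud Monitoring) is outside its scope.

```typescript
/**
 * dailyDeliveries maps each subscription on the old topic to its per-day
 * delivery counts, oldest first. The topic may be retired only when every
 * subscription shows zero deliveries for the last `quietDays` days.
 */
function canRetireTopic(dailyDeliveries: Record<string, number[]>, quietDays: number = 7): boolean {
  return Object.keys(dailyDeliveries).every((name) => {
    const recent = dailyDeliveries[name].slice(-quietDays);
    // Fewer than quietDays of history is not enough evidence to retire.
    return recent.length === quietDays && recent.every((count) => count === 0);
  });
}
```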

5.4 CI Gate

A pre-merge GitHub Action (event-schema-evolution.yml) runs:

  1. Download the previous release's schema bundle from Cloud Storage.
  2. For every changed schema in the PR, run schema-diff (built on json-schema-diff-validator).
  3. If the diff is non-additive and the version did not bump, fail the build with a link to this section.
  4. If the version bumped, verify the dual-publish runbook entry exists in the PR description.

This gate is documented in the SERVICE_TEMPLATE and cannot be overridden by service teams; only the platform on-call lead can bypass it, and only with an explicit incident ticket.
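To make the additive versus non-additive distinction concrete, here is a deliberately simplified classifier. It is not the schema-diff tool or the json-schema-diff-validator API; SimpleSchema, breakingChanges, and schemaGate are hypothetical names, and the schema model covers only flat properties and a required list.

```typescript
// Simplified stand-in for a JSON Schema: flat properties plus a required list.
interface SimpleSchema {
  properties: Record<string, { type: string }>;
  required: string[];
}

/** Lists changes that §5.2 does not allow within the same version. */
function breakingChanges(prev: SimpleSchema, next: SimpleSchema): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(prev.properties)) {
    const after = next.properties[name];
    if (after === undefined) {
      problems.push(`removed field: ${name}`); // a rename shows up as a removal
    } else if (after.type !== prev.properties[name].type) {
      problems.push(`type changed: ${name}`);
    }
  }
  for (const name of next.required) {
    if (prev.required.indexOf(name) === -1) {
      problems.push(`newly required: ${name}`);
    }
  }
  return problems;
}

/** Step 3 of the gate: non-additive diff without a version bump fails. */
function schemaGate(prev: SimpleSchema, next: SimpleSchema, versionBumped: boolean): 'pass' | 'fail' {
  return (breakingChanges(prev, next).length > 0 && !versionBumped) ? 'fail' : 'pass';
}
```

Adding an optional field produces no entries, so the gate passes without a bump; removing a field, changing a type, or tightening the required list fails unless the version moved to v<n+1>.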


6. Transactional Outbox

Every service that publishes events writes the event into its own Postgres outbox table in the same transaction as the aggregate state change. A separate relay process drains the outbox to Pub/Sub.

6.1 Outbox Table

CREATE TABLE outbox (
  id               uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  occurred_at      timestamptz NOT NULL DEFAULT now(),
  tenant_id        uuid NOT NULL,
  aggregate_type   text NOT NULL,
  aggregate_id     text NOT NULL,
  topic            text NOT NULL,
  ordering_key     text NOT NULL,
  envelope         jsonb NOT NULL, -- full EventEnvelope serialized
  idempotency_key  text NOT NULL,
  attempts         int NOT NULL DEFAULT 0,
  last_error       text,
  published_at     timestamptz,
  CONSTRAINT outbox_idempotency_unique UNIQUE (topic, idempotency_key)
);
CREATE INDEX ix_outbox_unpublished ON outbox (occurred_at) WHERE published_at IS NULL;
CREATE INDEX ix_outbox_tenant ON outbox (tenant_id);

ALTER TABLE outbox ENABLE ROW LEVEL SECURITY;
CREATE POLICY outbox_tenant_isolation ON outbox
  USING (tenant_id::text = current_setting('app.tenant_id', true));

6.2 Producer Pattern

// services/reservation-service/src/application/use-cases/confirm-reservation.use-case.ts
async execute(cmd: ConfirmReservationCommand): Promise<Reservation> {
  return this.uow.run(async (tx) => {
    const reservation = await this.repo.find(cmd.reservationId, tx);
    reservation.confirm(cmd.actorId, this.clock.now());

    await this.repo.save(reservation, tx);

    const event = this.envelope.build({
      eventType: 'melmastoon.reservation.booking.confirmed',
      eventVersion: 1,
      tenantId: cmd.tenantId,
      payload: ReservationMapper.toConfirmedEvent(reservation),
      idempotencyKey: `${reservation.id}:confirmed:${reservation.version}`,
      orderingKey: `${cmd.tenantId}:${reservation.id}`,
      causationId: cmd.causationEventId,
      correlationId: cmd.correlationId,
      actorId: cmd.actorId,
    });

    await this.outbox.append(event, tx);
    return reservation;
  });
}

The aggregate save and the outbox append are atomic. If the aggregate write commits, the event will be published at least once. If the publish fails repeatedly, the relay retries until it succeeds or the row is moved to outbox_quarantine (§6.3).

6.3 Outbox Relay

A per-service outbox-relay runs as a Cloud Run job (or a sidecar in long-running services). The relay:

  1. SELECT … FROM outbox WHERE published_at IS NULL ORDER BY occurred_at LIMIT 256 FOR UPDATE SKIP LOCKED
  2. For each row, pubsub.publishMessage({ data: row.envelope, orderingKey: row.ordering_key, attributes: { eventId, eventType, eventVersion, tenantId, idempotencyKey } }).
  3. On Pub/Sub publish ack: UPDATE outbox SET published_at = now() WHERE id = $1.
  4. On Pub/Sub error: UPDATE outbox SET attempts = attempts + 1, last_error = $err WHERE id = $1.
  5. After 50 attempts spanning ≥ 6 hours, the row is moved to outbox_quarantine and an alert pages the on-call.

Ack-on-publish is the contract: we do not wait for any consumer to process. Consumers handle their own at-least-once semantics via the inbox.
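The relay steps above can be sketched against an in-memory outbox. This is a simplified, synchronous simulation rather than the production relay: the OutboxRow shape, the Publish callback, and drainOnce are hypothetical, a real relay would await the Pub/Sub client and run inside a database transaction, and the "spanning ≥ 6 hours" part of the quarantine condition is left out.

```typescript
interface OutboxRow {
  id: string;
  occurredAt: number;          // epoch ms, stands in for occurred_at
  orderingKey: string;
  envelope: string;            // serialized EventEnvelope
  attempts: number;
  lastError: string | null;
  publishedAt: number | null;
}

// Hypothetical publisher: throws on failure instead of returning a rejected promise.
type Publish = (row: OutboxRow) => void;

const MAX_ATTEMPTS = 50;
const BATCH_SIZE = 256;

/** One drain pass: oldest unpublished rows first, up to BATCH_SIZE. */
function drainOnce(outbox: OutboxRow[], quarantine: OutboxRow[], publish: Publish, now: number): void {
  const batch = outbox
    .filter((row) => row.publishedAt === null)      // step 1: unpublished only
    .sort((a, b) => a.occurredAt - b.occurredAt)
    .slice(0, BATCH_SIZE);

  for (const row of batch) {
    try {
      publish(row);                                 // step 2
      row.publishedAt = now;                        // step 3: mark on publish ack
    } catch (err) {
      row.attempts += 1;                            // step 4: record the failure
      row.lastError = err instanceof Error ? err.message : String(err);
      if (row.attempts >= MAX_ATTEMPTS) {           // step 5: quarantine
        quarantine.push(row);
        outbox.splice(outbox.indexOf(row), 1);
      }
    }
  }
}
```

One design point the simulation makes visible: a failed row is skipped while later rows in the same pass are still published, so a relay that must preserve per-key order needs extra handling for rows sharing an ordering key.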

6.4 Why Outbox

Without the outbox, a process crash between COMMIT and Pub/Sub publish silently loses events. With the outbox, every committed state change is durable and eventually published. Combined with the inbox (next section), we achieve exactly-once application semantics on top of at-least-once transport.


7. Inbox & Dedup

Every consumer keeps an inbox table keyed by (eventId, consumerName). The inbox is checked before the handler runs; if the row already exists, the handler is skipped and the message is acked.

CREATE TABLE inbox (
  consumer_name  text NOT NULL,
  event_id       uuid NOT NULL,
  topic          text NOT NULL,
  tenant_id      uuid NOT NULL,
  received_at    timestamptz NOT NULL DEFAULT now(),
  processed_at   timestamptz,
  result         text, -- 'ok' | 'noop' | 'error'
  PRIMARY KEY (consumer_name, event_id)
);
CREATE INDEX ix_inbox_tenant ON inbox (tenant_id);

ALTER TABLE inbox ENABLE ROW LEVEL SECURITY;
CREATE POLICY inbox_tenant_isolation ON inbox
  USING (tenant_id::text = current_setting('app.tenant_id', true));

7.1 Consumer Pattern

async handle(envelope: EventEnvelope<BookingConfirmedV1>): Promise<HandlerResult> {
  return this.uow.run(async (tx) => {
    // Claim (consumerName, eventId); a false result means this delivery is a duplicate.
    const claimed = await this.inbox.claim({
      consumerName: 'inventory-service.booking-confirmed',
      eventId: envelope.eventId,
      topic: envelope.eventType + '.v' + envelope.eventVersion,
      tenantId: envelope.tenantId,
    }, tx);

    if (!claimed) return { status: 'noop', reason: 'already_processed' };

    const allocation = await this.allocator.commit(envelope.payload, tx);

    // Emit inventory.allocation.committed.v1 in the same transaction.
    // buildAllocationCommitted is a hypothetical helper wrapping this.envelope.build(...).
    const committed = this.buildAllocationCommitted(allocation, envelope);
    await this.outbox.append(committed, tx);

    await this.inbox.markProcessed(envelope.eventId, 'ok', tx);
    return { status: 'ok' };
  });
}

The handler is wrapped in the same transaction as the inbox claim, the aggregate write, and the downstream outbox append. Replay-safe by construction.
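The dedupe behaviour is easy to see with an in-memory stand-in for the inbox table. The class and function below are hypothetical and synchronous; in Postgres the claim would typically be an INSERT ... ON CONFLICT DO NOTHING whose affected-row count tells the caller whether it won the claim, and the release on error stands in for the transaction rollback.

```typescript
type InboxResult = 'ok' | 'noop' | 'error';

// In-memory stand-in for the inbox table, keyed by (consumerName, eventId).
class InMemoryInbox {
  private rows: Record<string, InboxResult | null> = {};

  /** Returns true only for the first claim of a given (consumerName, eventId). */
  claim(consumerName: string, eventId: string): boolean {
    const key = `${consumerName}:${eventId}`;
    if (key in this.rows) return false;
    this.rows[key] = null;
    return true;
  }

  markProcessed(consumerName: string, eventId: string, result: InboxResult): void {
    this.rows[`${consumerName}:${eventId}`] = result;
  }

  release(consumerName: string, eventId: string): void {
    delete this.rows[`${consumerName}:${eventId}`];
  }
}

/** Runs `apply` at most once per (consumerName, eventId). */
function handleOnce(inbox: InMemoryInbox, consumerName: string, eventId: string, apply: () => void): 'ok' | 'noop' {
  if (!inbox.claim(consumerName, eventId)) return 'noop';
  try {
    apply();
  } catch (err) {
    inbox.release(consumerName, eventId); // mimics the transaction rollback
    throw err;
  }
  inbox.markProcessed(consumerName, eventId, 'ok');
  return 'ok';
}
```

Delivering the same eventId twice to the same consumer runs the side effect once; a different consumerName processes the event independently, which is why the key includes both parts.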

7.2 Retention

Inbox rows are retained for 30 days (the operational hot retention class). After 30 days, no Pub/Sub redelivery is possible (max retry window is 7 days), so dedup is no longer needed; rows older than 30 days are archived to BigQuery and deleted from Postgres by a daily job.


8. Sagas

Sagas are explicit, named, and orchestrated by the owning bounded context. We do not use a generic saga framework; every saga is a use-case-driven coordinator in its owning service that listens to upstream events and emits downstream commands (or events that trigger downstream services). Compensations are first-class.

8.1 Booking Saga

The most-trafficked multi-service flow. Owner: reservation-service.

[Guest hits Pay on bff-tenant-booking-service]
        │
        ▼
bff-tenant-booking-service ─POST→ /api/v1/reservations/{id}/confirm
        │
        ▼
reservation-service emits melmastoon.reservation.booking.held.v1
        │
        ├──► inventory-service
        │      ─ commits allocation
        │      ─ emits inventory.allocation.committed.v1
        │
        ├──► pricing-service (subscribes for telemetry)
        │      ─ records final quote snapshot
        │
        ▼
payment-gateway-service (subscribed to booking.held + allocation.committed)
  ─ creates payment intent (PayPal / card / cash_pending)
  ─ on success → emits payment.intent.captured.v1
  ─ on cash → emits payment.intent.captured.v1 with rail='cash_pending'
        │
        ▼
reservation-service (subscribed to payment.intent.captured.v1)
  ─ transitions held → confirmed
  ─ emits reservation.booking.confirmed.v1
        │
        ├──► billing-svc → opens folio
        ├──► lock-integration-svc → if eligible (auto-issue), emits key_credential.issued.v1
        ├──► notification-svc → confirmation email + SMS + WhatsApp
        ├──► analytics-svc → booking_indexed snapshot
        └──► search-aggregation-svc → refresh availability for the property

Compensations (forward path failures):

Failure | Compensation | Emits
inventory.allocation.committed.v1 never arrives within 30s | Cancel hold; emit reservation.booking.cancelled.v1 (reason=hold_expired) | inventory.allocation.released.v1
payment.intent.failed.v1 arrives | Release allocation; mark reservation cancelled (reason=payment_failed) | inventory.allocation.released.v1, reservation.booking.cancelled.v1
lock_integration.vendor.error.v1 on key issuance | Reservation stays confirmed; key issuance moves to manual-front-desk queue; surface in backoffice dashboard | lock_integration.vendor.error.v1
Cash-on-arrival selected | Reservation enters confirmed_cash_pending; key not auto-issued; reconciled at front desk on arrival | payment.intent.captured.v1 with rail='cash_pending'

8.2 Cancellation Saga

Owner: reservation-service.

reservation.booking.cancelled.v1

├──► inventory-service → inventory.allocation.released.v1

├──► payment-gateway-service → if refundable per policy:
│ payment.intent.refunded.v1
│ else: no-op + audit entry

├──► lock-integration-service → if key was issued:
│ lock_integration.key_credential.revoked.v1

├──► billing-service → folio.refund_posted.v1 (if applicable)

└──► notification-service → cancellation email + SMS

Compensations are not symmetrical: refund failures do not un-cancel the reservation; they raise an alert for the finance on-call instead. Money flows are append-only.

8.3 Check-In Saga

Owner: reservation-service.

reservation.booking.check_in_started.v1

├──► billing-service → folio.opened.v1

├──► lock-integration-service → key_credential.issued.v1
│ (or .updated.v1 if pre-issued for date-change)

└──► notification-service → welcome message in guest's preferred locale


reservation-service emits booking.checked_in.v1 (only after all three above succeed
within deadline; otherwise raises
partial-checkin alert + manual hold)

The check-in saga has a strict deadline (30s wall-clock). Past the deadline, the reservation enters check_in_partial; the desktop UI surfaces the missing step (key, folio, or notification) with a one-click retry. We never auto-mark checked_in on partial success because the room state machine in housekeeping-service listens for it.
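The deadline rule can be captured in a small pure function that the coordinator would call whenever a step completes or a timer fires. Everything here is a sketch: the type names, the 'in_progress' state, and the decision to report checked_in as soon as all three steps are complete (including after a manual retry) are assumptions, not behaviour confirmed by this document.

```typescript
type CheckInStep = 'folio' | 'key' | 'notification';

interface CheckInProgress {
  startedAt: number;          // epoch ms when check_in_started was emitted
  completed: CheckInStep[];   // steps confirmed so far
}

type CheckInOutcome =
  | { state: 'checked_in' }
  | { state: 'in_progress'; missing: CheckInStep[] }
  | { state: 'check_in_partial'; missing: CheckInStep[] };

const CHECK_IN_DEADLINE_MS = 30000;
const ALL_CHECK_IN_STEPS: CheckInStep[] = ['folio', 'key', 'notification'];

/** Decides the reservation state from completed steps and elapsed time. */
function evaluateCheckIn(progress: CheckInProgress, now: number): CheckInOutcome {
  const missing = ALL_CHECK_IN_STEPS.filter((step) => progress.completed.indexOf(step) === -1);
  if (missing.length === 0) return { state: 'checked_in' };
  if (now - progress.startedAt > CHECK_IN_DEADLINE_MS) {
    // Past the deadline: surface the missing steps for a one-click retry.
    return { state: 'check_in_partial', missing };
  }
  return { state: 'in_progress', missing };
}
```

Keeping the decision pure makes it straightforward to test the deadline boundary without timers.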

8.4 Check-Out Saga

Owner: reservation-service.

reservation.booking.check_out_started.v1

├──► billing-service → folio.closed.v1 (computes final balance)
│ │
│ └──► payment-gateway-service → if balance > 0:
│ payment.intent.captured.v1 (remainder)

├──► lock-integration-service → key_credential.revoked.v1

├──► housekeeping-service → housekeeping.task.created.v1
│ (room enters dirty state)

└──► notification-service → checkout receipt + post-stay survey


reservation-service emits booking.checked_out.v1

If the final balance cannot be charged (card declined, cash short), the folio remains open and the reservation moves to checked_out_balance_pending. Key revocation still happens; the guest's access is terminated regardless of folio state. This is a deliberate guest-experience trade-off: we never block a checkout on payment, but we never let a key linger past checkout either.

8.5 Date-Change Saga

Owner: reservation-service. The compensation surface is largest here because dates touch four downstream services.

reservation.booking.dates_changed.v1

├──► inventory-service → release old window; commit new window
│ → inventory.allocation.reallocated.v1
│ (atomic per (room, day) — overbooking guard runs)

├──► pricing-service → compute delta vs original quote
│ → pricing.delta.applied.v1

├──► billing-service → folio.charge_posted.v1 for the delta
│ │
│ └──► payment-gateway → payment.intent.created.v1 (delta)

├──► lock-integration-service → key_credential.updated.v1 (new validity window)

└──► notification-service → modification confirmation

If the new window cannot be allocated (overbooking on the new dates), the saga fails and the reservation reverts to its original window automatically; the inventory release is conditional on the new commit succeeding. The guest sees a "dates unavailable, original kept" message, not a cancelled reservation.
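The conditional swap can be illustrated with an in-memory stand-in for inventory-service. Everything here (class, method names, the room-day string format) is hypothetical; the point is only the ordering: commit the new window first, and release old days only after that commit succeeds.

```typescript
// In-memory stand-in for allocation state; all names are hypothetical.
type RoomDay = string; // e.g. "room-12:2026-05-01"

class Inventory {
  private readonly held = new Map<RoomDay, string>(); // roomDay -> reservationId

  // All-or-nothing: the overbooking guard runs before any day is written.
  commit(reservationId: string, days: RoomDay[]): boolean {
    const conflict = days.some((d) => this.held.has(d) && this.held.get(d) !== reservationId);
    if (conflict) return false;
    days.forEach((d) => this.held.set(d, reservationId));
    return true;
  }

  release(reservationId: string, days: RoomDay[]): void {
    for (const d of days) {
      if (this.held.get(d) === reservationId) this.held.delete(d);
    }
  }

  holder(day: RoomDay): string | undefined {
    return this.held.get(day);
  }
}

function changeDates(
  inv: Inventory,
  reservationId: string,
  oldDays: RoomDay[],
  newDays: RoomDay[],
): "reallocated" | "original_kept" {
  // Commit first; on failure nothing was released, so the original window stands.
  if (!inv.commit(reservationId, newDays)) return "original_kept";
  // Release only the days that are not part of the new window.
  const keep = new Set(newDays);
  inv.release(reservationId, oldDays.filter((d) => !keep.has(d)));
  return "reallocated";
}
```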

8.6 Saga Choice: Orchestration vs Choreography

| Saga | Style | Why |
|---|---|---|
| Booking | Orchestration by reservation-service | Many compensations; deterministic order; needs a single source of truth for partial-failure state |
| Cancellation | Choreography | Read-only fan-out; compensations are independent |
| Check-in | Orchestration | Strict 30s deadline; needs a coordinator to declare partial-state |
| Check-out | Hybrid (orchestration of folio+lock; choreography of housekeeping+notification) | Folio close + key revoke must be ordered; housekeeping fan-out is independent |
| Date-change | Orchestration | Largest compensation surface; conditional inventory swap |

The orchestrator is always the aggregate's owning service. We do not introduce a generic "saga-service".


9. Dead-Letter & Retry

9.1 Pub/Sub Retry Policy

Each subscription is configured with:

| Setting | Value |
|---|---|
| minimumBackoff | 10 s |
| maximumBackoff | 600 s (10 min) |
| messageRetentionDuration | 7d (Pub/Sub maximum redelivery window) |
| ackDeadline | 60 s (push) / 600 s (pull workers) |
| enableExactlyOnceDelivery | true on pull subscriptions (Pub/Sub supports this feature for pull only; we still rely on inbox) |

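The retry policy is exponential with jitter, bounded by minimumBackoff and maximumBackoff. Pub/Sub keeps redelivering until the message is acknowledged, dead-lettered (§9.2), or older than the message retention duration.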
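To get a feel for the bounds, here is a rough sketch of a capped exponential schedule between the two configured values. The doubling factor and the function name are assumptions made for illustration, and jitter is left out so the numbers are reproducible; Pub/Sub's actual per-attempt delays may differ.

```typescript
const MINIMUM_BACKOFF_S = 10;  // minimumBackoff from the table
const MAXIMUM_BACKOFF_S = 600; // maximumBackoff from the table

// Assumed schedule: double per attempt, capped at the maximum. No jitter.
function backoffSeconds(attempt: number): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(MAXIMUM_BACKOFF_S, MINIMUM_BACKOFF_S * Math.pow(2, attempt - 1));
}
```

Under this assumed schedule the delay reaches the 600 s cap on the seventh attempt.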

9.2 DLQ Topics

Every subscription has a DLQ topic: <original-topic>.dlq. Example:

melmastoon.reservation.booking.confirmed.v1
melmastoon.reservation.booking.confirmed.v1.dlq ← DLQ

Subscriptions are configured with deadLetterPolicy:

deadLetterPolicy:
deadLetterTopic: projects/melmastoon-prod/topics/<original-topic>.dlq
maxDeliveryAttempts: 7

After 7 failed delivery attempts (nacks and ack-deadline expiries both count), the message moves to DLQ with metadata googclient_deliveryattempt.

9.3 DLQ Operations

| Action | Owner | Surface |
|---|---|---|
| Monitor depth | Platform on-call | Cloud Monitoring dashboard pubsub-dlq-depth; alert at depth > 10 sustained 5 min |
| Triage | Service on-call | DLQ inspector at tools/pubsub-dlq-inspector (reads DLQ, prints envelope + last error) |
| Replay | Service on-call | tools/pubsub-replay --topic=<original> --since=<ts> --idempotency-safe |
| Quarantine | Compliance on-call | Move to long-term cold storage (gs://melmastoon-pubsub-quarantine-prod/) for forensic review |

The replay tool re-publishes from DLQ to the original topic with the original envelope (preserving eventId and idempotencyKey); because consumers use the inbox, replay is safe by construction.

9.4 Runbook (pubsub/dlq-non-empty)

  1. Identify the topic: run gcloud pubsub topics list --filter="name~dlq", then narrow the result by the alert label.
  2. Inspect the top message: tools/pubsub-dlq-inspector --topic=<dlq>.
  3. Classify the failure:
    • Schema drift → fix producer or consumer; do not replay until fixed.
    • Vendor outage (lock, payment) → wait for vendor recovery; replay safe.
    • Poisoned payload (impossible state) → quarantine; open a defect ticket; do not replay.
  4. Replay or quarantine.
  5. Post incident note in audit-service.

10. Replay & Disaster Recovery

10.1 Pub/Sub Seek-To-Timestamp

Every subscription has retainAckedMessages: true with a 7-day retention window. This lets us rewind a subscription:

gcloud pubsub subscriptions seek <subscription> --time=2026-04-22T10:00:00Z

This is the primary recovery tool when a consumer ships a bug that mis-projects events: fix the bug, redeploy, seek the subscription back to the last known good timestamp, and let it re-process. Inbox dedup makes the rewind safe, because events the consumer already claimed are skipped as noops. That same dedup also blocks re-projection, so for events whose handling mutated state incorrectly the consumer must explicitly clear the inbox rows it wants to reprocess (the fix migration ships a dedup-clear SQL).

10.2 Per-Service Replay Safety

Replay is safe for a service iff:

  1. Every consumer handler is deterministic given the event payload.
  2. The handler uses the inbox claim before any mutation.
  3. The handler's downstream effects are themselves idempotent (uses the outbox idempotency key correctly).

We test this with a contract test per consumer (*.replay.contract.spec.ts) that double-delivers a sample envelope and asserts the second delivery is a noop.
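The shape of such a check can be sketched with an in-memory inbox. This is an illustration under assumptions: the Inbox class, the handler factory, and the envelope fields other than eventId are hypothetical, and a real inbox would be a table with a unique constraint claimed in the same transaction as the mutation.

```typescript
// Hypothetical in-memory inbox keyed on (consumer, eventId).
interface Envelope {
  eventId: string;
  payload: { folioId: string; amountMinor: number };
}

class Inbox {
  private readonly seen = new Set<string>();

  // Returns true only for the first claim of an event by a given consumer.
  claim(consumer: string, eventId: string): boolean {
    const key = `${consumer}:${eventId}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

type HandleResult = "ok" | "noop";

function makeChargeHandler(inbox: Inbox, balances: Map<string, number>) {
  return (event: Envelope): HandleResult => {
    // Claim before any mutation; a failed claim means this is a redelivery.
    if (!inbox.claim("billing-service", event.eventId)) return "noop";
    const current = balances.get(event.payload.folioId) ?? 0;
    balances.set(event.payload.folioId, current + event.payload.amountMinor);
    return "ok";
  };
}

// Double-deliver the same envelope: the second delivery must change nothing.
const balances = new Map<string, number>();
const handle = makeChargeHandler(new Inbox(), balances);
const sample: Envelope = { eventId: "evt-1", payload: { folioId: "folio-9", amountMinor: 5000 } };
const first = handle(sample);
const second = handle(sample);
```

The "ok" and "noop" results mirror the result label on melmastoon_events_consumed_total in §14.2.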

10.3 Offline Desktop Sync Replay

The Electron desktop is the secondary replay surface. When a desktop reconnects after an offline period:

  1. The desktop calls /sync/v1/pull?since=<lastCursor> (05 §10).
  2. sync-service reads its per-device sync log (a projection of the canonical Pub/Sub events for the tenant), filters by the device's scope subscriptions, and returns deltas.
  3. The desktop applies deltas in order to local SQLite, advancing its cursor.
  4. Local mutations queued offline are pushed via /sync/v1/push with Idempotency-Key headers; the server replays them as if they came in live, hitting the same outbox + inbox path.

This means the desktop's replay does not bypass any saga: a check-in mutation pushed from offline goes through the same reservation.booking.check_in_started.v1 → check-in-saga path as a live one, with the same outbox + inbox + idempotency guarantees.
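Step 3 of the pull flow (apply deltas in order, advance the cursor) might look like the following on the desktop side. The Delta and LocalStore shapes and the numeric cursor are assumptions for illustration; the document does not specify the cursor format, and the real store is SQLite rather than a Map.

```typescript
// Hypothetical shapes; a numeric, monotonically increasing cursor is assumed.
interface Delta {
  cursor: number;
  entity: string;
  id: string;
  data: Record<string, unknown>;
}

interface LocalStore {
  cursor: number;
  rows: Map<string, Record<string, unknown>>;
}

// Applies deltas in cursor order and returns how many were actually applied.
function applyDeltas(store: LocalStore, deltas: Delta[]): number {
  const ordered = [...deltas].sort((a, b) => a.cursor - b.cursor);
  let applied = 0;
  for (const delta of ordered) {
    // Skipping anything at or below the cursor makes a repeated pull harmless.
    if (delta.cursor <= store.cursor) continue;
    store.rows.set(`${delta.entity}:${delta.id}`, delta.data);
    store.cursor = delta.cursor;
    applied++;
  }
  return applied;
}
```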

10.4 Disaster Recovery Tier

| Failure | Recovery |
|---|---|
| Consumer bug, last 7 days | Pub/Sub seek-to-timestamp + selective inbox clear |
| Consumer bug, > 7 days back | BigQuery cold sink (§13) replay via the event-replay-from-bq tool, which republishes selected events to the original topic |
| Pub/Sub regional outage | Failover to disaster region; Cloud SQL outboxes hold; relays drain to the recovered region (RPO ≤ 5 min, RTO ≤ 30 min) |
| Postgres data loss | Cloud SQL PITR (point-in-time recovery, 7 days); outbox tables included; events that were already published to Pub/Sub are not republished (they are durably in Pub/Sub + BigQuery) |
| Total tenant restore | Restore tenant's per-service schemas from Cloud SQL backups + replay events from BigQuery cold sink filtered by tenantId |

11. Ordering & Partitioning

11.1 Ordering Keys

Every published message carries orderingKey = <tenantId>:<aggregateId>. Pub/Sub guarantees ordered delivery per ordering key per subscription as long as enableMessageOrdering: true is set on the subscription.

This means events for the same reservation are delivered in publish order; events for different reservations may interleave. This matches our domain reality: a single reservation's lifecycle is causally ordered (held → confirmed → checked_in → checked_out), but different reservations are independent.
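A small sketch of the key format and what it implies. The helper names and the PublishedEvent shape are hypothetical; the grouping function only models the guarantee (publish order is kept within a key, nothing is promised across keys) and is not how Pub/Sub is implemented.

```typescript
// Hypothetical event shape for illustration.
interface PublishedEvent {
  tenantId: string;
  aggregateId: string;
  type: string;
}

// Builds the <tenantId>:<aggregateId> ordering key described above.
function orderingKey(tenantId: string, aggregateId: string): string {
  if (tenantId === "" || aggregateId === "") {
    throw new Error("tenantId and aggregateId are required");
  }
  return `${tenantId}:${aggregateId}`;
}

// Groups a publish-ordered stream by ordering key. Order inside each group is
// the publish order; no order is implied between groups.
function groupByOrderingKey(events: PublishedEvent[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const event of events) {
    const key = orderingKey(event.tenantId, event.aggregateId);
    const types = groups.get(key) ?? [];
    types.push(event.type);
    groups.set(key, types);
  }
  return groups;
}
```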

11.2 Tenant + Aggregate Granularity

We chose <tenantId>:<aggregateId> (not just <aggregateId>) for two reasons:

  1. Tenant isolation in throughput: a tenant with 10x the volume of others does not block its peers' ordering keys.
  2. Cross-tenant ID collision safety: aggregate IDs are unique per service but we never want a far-fetched ID collision to silently couple two tenants.

11.3 Consumer Concurrency

| Service | Concurrency model | Per-key ordering | Notes |
|---|---|---|---|
| inventory-service | Cloud Run min=1, max=10, concurrency=4 | Required (per tenantId:roomId) | Allocation must be serialized per room |
| billing-service | Cloud Run min=1, max=20, concurrency=8 | Required (per tenantId:folioId) | Folio writes are serialized per folio |
| lock-integration-service | Cloud Run min=1, max=10, concurrency=2 | Required (per tenantId:reservationId) | Vendor APIs are slow; low concurrency by design |
| notification-service | Cloud Run min=0, max=50, concurrency=20 | Not required | Messages are independent |
| analytics-service | Pull worker pool, 4 workers × 16 messages | Not required | Aggregations are commutative |
| search-aggregation-service | Pull worker pool, 2 workers × 8 messages | Required (per tenantId:propertyId) | Index updates are last-write-wins per property |
| audit-service | Pull worker pool, 8 workers × 32 messages | Not required | Append-only |

11.4 Backpressure

When a consumer cannot keep up, Pub/Sub buffers; the relay producer never blocks. The platform alerts on subscription.num_undelivered_messages > 10k sustained 5 min. The runbook either scales the consumer (Cloud Run max-instances bump), accepts the lag (analytics), or fails over to a slower path.


12. Cross-Tenant Analytics Events

A small set of events are intentionally cross-tenant: they live on melmastoon.analytics.* and melmastoon.search_aggregation.* topics and feed projections that legitimately span tenants (the consumer meta-search read model and the platform-wide analytics warehouse).

| Subject | Owner | Cross-tenant consumers |
|---|---|---|
| melmastoon.analytics.snapshot.computed.v1 | analytics-service | BigQuery (warehouse), platform reporting |
| melmastoon.search_aggregation.listing.indexed.v1 | search-aggregation-service | bff-consumer-service (read-through cache) |

These events still carry tenantId in the envelope (the projection is of a tenant's data, just consumed across tenants). The projection process strips PII and payment details before writing to the search index — the privacy boundary is the projector, not the topic.

No service may publish or subscribe to a cross-tenant topic without an entry in the Cross-Tenant Read Paths table in 02 §6.3. A CI policy gate (cross-tenant-topic.policy.ts) enforces this.
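The rule such a gate enforces can be sketched as a single check. This is not the content of cross-tenant-topic.policy.ts, which is not shown in this document; the function, the verdict names, and the ReadPathEntry shape (a stand-in for rows of the table in 02 §6.3) are assumptions.

```typescript
// The two cross-tenant topic families named in this section.
const CROSS_TENANT_PREFIXES = ["melmastoon.analytics.", "melmastoon.search_aggregation."];

// Hypothetical stand-in for one row of the Cross-Tenant Read Paths table.
interface ReadPathEntry {
  service: string;
  topic: string;
}

type PolicyVerdict = "not_cross_tenant" | "allowed" | "blocked";

function checkTopicAccess(
  service: string,
  topic: string,
  registry: ReadonlyArray<ReadPathEntry>,
): PolicyVerdict {
  const crossTenant = CROSS_TENANT_PREFIXES.some((prefix) => topic.startsWith(prefix));
  if (!crossTenant) return "not_cross_tenant";
  // Cross-tenant topics need an explicit (service, topic) registration.
  const registered = registry.some((e) => e.service === service && e.topic === topic);
  return registered ? "allowed" : "blocked";
}
```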


13. Per-Event Retention

Three retention classes apply at the BigQuery sink level. Pub/Sub itself retains 7 days (max); the long-term archive is BigQuery (warm) and Cloud Storage Coldline (cold).

| Class | Hot (Pub/Sub) | Warm (BigQuery) | Cold (Coldline) | Examples |
|---|---|---|---|---|
| Hot (operational) | 7 days | 30 days |  | melmastoon.notification.message.dispatched.v1, melmastoon.search_aggregation.listing.indexed.v1, melmastoon.sync.cursor.advanced.v1 |
| Warm (analytics + product) | 7 days | 365 days |  | melmastoon.reservation.*, melmastoon.inventory.*, melmastoon.pricing.*, melmastoon.housekeeping.*, melmastoon.theme.* |
| Cold (compliance + finance + lock + AI) | 7 days | 365 days | 7 years | melmastoon.billing.*, melmastoon.payment.*, melmastoon.lock_integration.*, melmastoon.audit.*, melmastoon.ai_orchestrator.* |

Each event's class is declared by its metadata.retentionClass field and verified against the registry table at produce time. A Cloud Dataflow job streams every Pub/Sub message into BigQuery (melmastoon_events.<topic_short> tables) regardless of class; the lifecycle policy on BigQuery + the periodic export to Coldline implements the per-class TTL.
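A produce-time check of the declared class against a registry could look roughly like this. The registry contents are an excerpt seeded from the examples in the table above, and the registry shape, function names, and longest-prefix matching rule are all assumptions; the real registry table is not shown in this document.

```typescript
type RetentionClass = "hot" | "warm" | "cold";

interface RegistryEntry {
  prefix: string;
  retentionClass: RetentionClass;
}

// Hypothetical excerpt; the authoritative registry lives elsewhere.
const RETENTION_REGISTRY: ReadonlyArray<RegistryEntry> = [
  { prefix: "melmastoon.notification.message.dispatched.", retentionClass: "hot" },
  { prefix: "melmastoon.reservation.", retentionClass: "warm" },
  { prefix: "melmastoon.inventory.", retentionClass: "warm" },
  { prefix: "melmastoon.billing.", retentionClass: "cold" },
  { prefix: "melmastoon.payment.", retentionClass: "cold" },
  { prefix: "melmastoon.lock_integration.", retentionClass: "cold" },
];

function registeredClass(topic: string): RetentionClass | undefined {
  // Longest matching prefix wins, so a specific entry can override a family.
  let best: RegistryEntry | undefined;
  for (const entry of RETENTION_REGISTRY) {
    if (topic.startsWith(entry.prefix) && (!best || entry.prefix.length > best.prefix.length)) {
      best = entry;
    }
  }
  return best ? best.retentionClass : undefined;
}

// Throws when the envelope's declared class disagrees with the registry.
function assertRetentionClass(topic: string, declared: RetentionClass): void {
  const expected = registeredClass(topic);
  if (expected === undefined) {
    throw new Error(`no retention class registered for ${topic}`);
  }
  if (expected !== declared) {
    throw new Error(`${topic} declares ${declared}, registry says ${expected}`);
  }
}
```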

The compliance regime requires 7-year retention for financial, payment, key-credential, audit, and AI-decision events; we err on the side of longer retention for lock_integration events because lock-credential disputes have surfaced years after the stay.


14. Observability

14.1 Trace Propagation

The envelope's correlationId is a W3C traceparent-compatible identifier (ULID for storage, parsed to a 16-byte trace id for OTel context). Every consumer extracts correlationId and starts a span as a child of the producer's span. The full booking saga shows up as a single distributed trace in Cloud Trace, spanning bff-tenant-booking-service → reservation-service → inventory-service → payment-gateway-service → reservation-service → lock-integration-service → notification-service, with each Pub/Sub publish + consume tracked as link-style spans.
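One way the ULID-to-trace-id parsing could work: a ULID is 26 Crockford base32 characters encoding 128 bits, which is exactly the width of a W3C trace id (32 hex characters). The function names and the fixed "sampled" flag below are assumptions for illustration, not the platform's actual helper.

```typescript
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Decodes a 26-character ULID into the 32-hex-character form of a trace id.
function ulidToTraceId(ulid: string): string {
  if (ulid.length !== 26) throw new Error("ULID must be 26 characters");
  const upper = ulid.toUpperCase();
  let bits = "";
  for (let i = 0; i < upper.length; i++) {
    const digit = CROCKFORD.indexOf(upper.charAt(i));
    if (digit < 0) throw new Error(`invalid ULID character: ${upper.charAt(i)}`);
    bits += ("00000" + digit.toString(2)).slice(-5); // 5 bits per character
  }
  // 26 x 5 = 130 bits; a valid ULID keeps the top two bits at zero.
  if (bits.slice(0, 2) !== "00") throw new Error("ULID overflows 128 bits");
  let hex = "";
  for (let i = 2; i < bits.length; i += 4) {
    hex += parseInt(bits.slice(i, i + 4), 2).toString(16);
  }
  return hex;
}

// Builds a traceparent header value: version 00, then trace id, span id, flags.
// The "01" (sampled) flag is hard-coded here purely for the example.
function toTraceparent(ulid: string, spanIdHex: string): string {
  return `00-${ulidToTraceId(ulid)}-${spanIdHex}-01`;
}
```

The W3C format treats an all-zero trace id as invalid, so a real helper would also reject a ULID that decodes to zero.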

14.2 Metrics

Every service exports the following metrics (OTel → Cloud Monitoring):

| Metric | Type | Labels |
|---|---|---|
| melmastoon_events_published_total | counter | service, topic, tenant_id_class |
| melmastoon_events_consumed_total | counter | service, topic, subscription, result (ok\|noop\|error) |
| melmastoon_events_dlq_total | counter | service, topic, subscription |
| melmastoon_event_processing_duration_seconds | histogram | service, topic, subscription |
| melmastoon_outbox_unpublished_count | gauge | service |
| melmastoon_outbox_oldest_unpublished_seconds | gauge | service |
| melmastoon_inbox_dedup_skips_total | counter | service, topic |
| melmastoon_saga_inflight_count | gauge | service, saga_name |
| melmastoon_saga_completed_total | counter | service, saga_name, outcome (success\|compensated\|partial) |

tenant_id_class is bucketed (small/medium/large/platform) — we never label metrics with raw tenantId (cardinality explosion).

14.3 Standard Alerts

| Alert | Threshold | Owner |
|---|---|---|
| outbox_lag_high | outbox_oldest_unpublished_seconds > 60 sustained 5 min | service on-call |
| dlq_depth_growing | events_dlq_total rate > 1/min sustained 5 min | service on-call |
| consumer_error_rate | events_consumed_total{result=error} / total > 5% sustained 5 min | service on-call |
| saga_partial_rate | saga_completed_total{outcome=partial} / total > 1% sustained 15 min | service on-call + product on-call |
| cross_tenant_publish_blocked | Any publish blocked by the cross-tenant policy gate | platform on-call (page) |

Full observability stack (logs, metrics, traces, RUM, AI eval logging) is documented in docs/observability/01-observability.md.


15. Anti-Patterns

| Anti-pattern | Why it's banned | What we do instead |
|---|---|---|
| Synchronous chains > 2 hops | Couples availability; hides cascading failure; defeats the offline-first guarantee | Async via Pub/Sub + saga; orchestrator in the owning context (02 §15) |
| Frontend fan-out to multiple events | Couples client to event topology; breaks the BFF abstraction; defeats rate-limit and auth | Clients call one BFF endpoint; BFF composes downstream calls; events are server-side |
| Events as RPC (request/response over Pub/Sub) | Loses fire-and-forget semantics; introduces blocking on the publisher; usually masks a missing API | Use REST for request/response (gateway, BFF, internal); Pub/Sub strictly for facts |
| Missing idempotencyKey | Replay duplicates downstream effects (double-charge, double-key-issue); blocks DLQ replay | Every envelope must carry a deterministic idempotencyKey; CI test fails the build otherwise |
| Last-write-wins for money or inventory | Silent data loss; impossible to audit | Append-only for money; server-authoritative for inventory; max-of for room state (02 §8.2) |
| Untenanted events outside the platform allow-list | Breaks tenant isolation in projections, in BigQuery, in the desktop sync log | tenantId mandatory; the allow-list is small and reviewed quarterly |
| AI events without provenance | Breaks the HITL audit chain; blocks compliance review | metadata.aiProvenance required on every melmastoon.ai_orchestrator.* event; CI gate + inbox rejection |
| Cross-tenant topics outside the registered list | One bug = one breach; defeats RLS | Cross-tenant topics live only on melmastoon.search_aggregation.* and melmastoon.analytics.*; CI policy gate enforces |
| Generic saga framework | Hides flow; hard to reason about partial failure; couples services to the framework | Each saga is a use-case in its owning service; explicit compensations in code |
| Schema-breaking change without .v<n+1> | Silently drops events at consumers that updated; silently mis-projects at consumers that didn't | CI gate (§5.4) + dual-publish runbook |
| Ad-hoc DLQ replay without inbox awareness | Re-fires downstream effects | Replay via the pubsub-replay tool; consumers' inbox dedup is the safety net |
| Vendor SDK imports outside the owning adapter | Defeats hexagonal architecture; couples domain to vendor | CI dependency-graph gate; ports & adapters per ADR-0001 |

Cross-references: per-service event payloads live in services/<service-name>/EVENT_SCHEMAS.md. The API surface that produces events synchronously (REST writes that trigger outbox appends) is in 05 API Design. The desktop sync protocol that consumes events as deltas is in 05 §10. The next document in the strategic set is 06 Data Models.