Failure Modes

:::info Source
Sourced from services/sync-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Delta Projector Lag

  • Domain events accumulate; pulls return stale data.
  • Mitigation: autoscale projector; alert on lag > 60s; clients tolerate eventual consistency.

1.2 Push Route to Owning Service Fails

  • Mutation stays queued; retried with exponential backoff.
  • After 5 retries: alert; mutation stays queued until manual intervention or service recovery.

1.3 Device Clock Skew

  • occurredAt from the device may be wrong; the server uses receivedAt for ordering.
  • VectorClock is a device-scoped counter (not wall-clock), so it is immune to clock skew.
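
A minimal sketch of the ordering rule above. The `Mutation` record, its field names, and the tie-break on `(device_id, counter)` are assumptions made for illustration; only the use of receivedAt for server ordering and a device-scoped counter come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    device_id: str
    counter: int        # device-scoped counter (hypothetical VectorClock entry)
    occurred_at: float  # device wall clock; may be skewed
    received_at: float  # server wall clock at receipt


def server_order(mutations):
    """Order by server receipt time, ignoring the device clock entirely.

    The (device_id, counter) tie-break is an assumption to keep the sort total.
    """
    return sorted(mutations, key=lambda m: (m.received_at, m.device_id, m.counter))


def per_device_order(mutations, device_id):
    """Order one device's mutations by its counter, not by any wall clock."""
    own = (m for m in mutations if m.device_id == device_id)
    return sorted(own, key=lambda m: m.counter)


# Device "a" reports a wildly wrong occurred_at on its first mutation.
batch = [
    Mutation("a", 2, occurred_at=1400.0, received_at=101.0),
    Mutation("a", 1, occurred_at=5000.0, received_at=100.0),
    Mutation("b", 1, occurred_at=1399.0, received_at=100.5),
]
ordered = [(m.device_id, m.counter) for m in server_order(batch)]
```

Sorting by `occurred_at` would have placed device "a"'s first mutation last; neither function reads that field, so the skew has no effect.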

1.4 Cursor Out of Range (Client Too Old)

  • Server responds 410 (Gone) → client must perform a full resync.
  • Rare in practice (< 0.1% of devices); typically users returning from long absence.
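
The client side of this scenario might look like the sketch below. The `fetch` callable, the response shape (`cursor`, `items`), and the convention that a `None` cursor requests a full resync are all hypothetical; the document only specifies the 410 response and the full-resync requirement.

```python
from typing import Callable, Optional, Tuple

HTTP_OK = 200
HTTP_GONE = 410


def pull(
    fetch: Callable[[Optional[str]], Tuple[int, dict]],
    cursor: Optional[str],
) -> Tuple[str, list, bool]:
    """Pull deltas; on 410, drop the cursor and pull everything again.

    Returns (new_cursor, items, did_full_resync).
    """
    status, body = fetch(cursor)
    did_full_resync = False
    if status == HTTP_GONE:
        # Cursor is out of range: discard it and start over (assumed convention).
        status, body = fetch(None)
        did_full_resync = True
    if status != HTTP_OK:
        raise RuntimeError(f"pull failed with status {status}")
    return body["cursor"], body["items"], did_full_resync


def fake_fetch(cursor):
    """Stand-in server used only to exercise the sketch."""
    if cursor == "expired":
        return HTTP_GONE, {}
    items = ["snapshot"] if cursor is None else ["delta"]
    return HTTP_OK, {"cursor": "c2", "items": items}
```

Calling `pull(fake_fetch, "expired")` takes the full-resync path, while `pull(fake_fetch, "c1")` returns an ordinary delta.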

1.5 Yjs CRDT Merge Failure (M5)

  • Very rare; only on schema divergence.
  • Create ConflictRecord; authors see side-by-side diff; AI suggests merge.
  • Pre-merge backup retained 30 days.

1.6 Client Mutation Overflow (7+ Days Offline)

  • Client enters read-only mode; warns user to sync.
  • After 14 days: force prompt; after 30 days: data purge (warn first).
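
The thresholds above can be read as a small state function. This is a sketch under assumptions: the state names are invented, days offline is treated as the only trigger, and each boundary is treated as inclusive (the document does not say whether "after 14 days" includes day 14).

```python
from enum import Enum


class OfflineState(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"        # 7+ days: read-only, warn user to sync
    FORCE_PROMPT = "force_prompt"  # 14+ days: force a sync prompt
    PURGE = "purge"                # 30+ days: purge local data (after warning)


def offline_state(days_offline: float) -> OfflineState:
    """Map time since last successful sync to the client's degraded state."""
    if days_offline >= 30:
        return OfflineState.PURGE
    if days_offline >= 14:
        return OfflineState.FORCE_PROMPT
    if days_offline >= 7:
        return OfflineState.READ_ONLY
    return OfflineState.NORMAL
```

Checking the largest threshold first keeps the function correct if more tiers are added later.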

1.7 Network Partition (Client ↔ Server)

  • Client continues locally; mutations queue.
  • On reconnect: push all; pull all; reconcile.

1.8 Duplicate Push (Client Retries)

  • clientMutationId PK deduplicates; second push returns original result.
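
A sketch of primary-key deduplication, using an in-memory SQLite database as a stand-in for the real store. The table layout, the `apply_mutation` helper, and the JSON result format are hypothetical; only the idea that clientMutationId is the primary key and that a repeat push returns the original result comes from this document.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mutations ("
    " client_mutation_id TEXT PRIMARY KEY,"
    " result TEXT NOT NULL)"
)


def apply_mutation(payload: dict) -> dict:
    """Hypothetical stand-in for routing the mutation to its owning service."""
    return {"status": "applied", "value": payload["value"]}


def push(client_mutation_id: str, payload: dict) -> dict:
    """Apply a mutation once; repeat pushes return the stored result."""
    row = conn.execute(
        "SELECT result FROM mutations WHERE client_mutation_id = ?",
        (client_mutation_id,),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # duplicate: replay the original result
    result = apply_mutation(payload)
    # The PRIMARY KEY rejects a second insert for the same id. A production
    # version would wrap the check and insert in one transaction.
    conn.execute(
        "INSERT INTO mutations (client_mutation_id, result) VALUES (?, ?)",
        (client_mutation_id, json.dumps(result)),
    )
    conn.commit()
    return result


first = push("m-1", {"value": 1})
second = push("m-1", {"value": 999})  # retry with a different body is ignored
```

Here `second` equals `first`, and the table holds a single row for `m-1`.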

1.9 HMAC Verification Failure

  • Mutation rejected; device flagged; admin alerted.
  • If repeated (> 3 in 1 hour): device trust investigation.
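
The rejection and escalation rule can be sketched with the standard library's `hmac` module. The choice of SHA-256, the in-memory failure log, and the return values are assumptions; the document specifies only rejection on failure and an investigation when failures exceed 3 in 1 hour.

```python
import hashlib
import hmac
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # 1 hour
MAX_FAILURES = 3       # more than this within the window triggers investigation

_failures = defaultdict(deque)  # device_id -> timestamps of recent failures


def sign(key: bytes, body: bytes) -> str:
    """Hex HMAC of the body (SHA-256 is an assumption)."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(device_id: str, key: bytes, body: bytes, signature: str, now: float) -> str:
    """Return "accepted", "rejected", or "investigate"."""
    # compare_digest runs in constant time to avoid leaking match length.
    if hmac.compare_digest(sign(key, body), signature):
        return "accepted"
    window = _failures[device_id]
    window.append(now)
    # Drop failures that have aged out of the one-hour window.
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    return "investigate" if len(window) > MAX_FAILURES else "rejected"
```

With this rule the fourth bad signature inside an hour is the first to return "investigate"; the mutation is rejected in both failure cases.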

1.10 Full-Resync Flood

  • Rare, but if caused by a bug (e.g., cursor corruption): ship an emergency patch.
  • Rate-limit full-resync per device (1 per 15 min).
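
One way to enforce the per-device limit is to remember the last allowed full resync for each device. The class name and in-memory dictionary are hypothetical; a real deployment would presumably keep this state in a shared store.

```python
FULL_RESYNC_INTERVAL_SECONDS = 15 * 60  # 1 full resync per 15 min per device


class FullResyncLimiter:
    """Allow at most one full resync per device per interval."""

    def __init__(self, interval: float = FULL_RESYNC_INTERVAL_SECONDS):
        self._interval = interval
        self._last_allowed = {}  # device_id -> time of last allowed resync

    def allow(self, device_id: str, now: float) -> bool:
        last = self._last_allowed.get(device_id)
        if last is not None and now - last < self._interval:
            return False  # too soon; the caller should reject or delay
        self._last_allowed[device_id] = now
        return True
```

Denied requests do not reset the timer, so a device that keeps retrying is still allowed through once the interval has elapsed.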

2. Retry / Backoff

| Op | Max retries | Backoff |
|---|---|---|
| Mutation route (to owning service) | 5 | 500ms, 2s, 10s, 30s, 60s |
| Postgres write | 3 | 10ms–200ms |
| Outbox | infinite | exp cap 5m |
| Delta projector (event processing) | infinite | exp cap 30s |
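
The two backoff styles in this table can be sketched as follows. The fixed schedule for the mutation route is taken from the table; the doubling factor and the 1-second base used for the capped exponential are assumptions, since the table gives only the caps.

```python
from typing import Optional

# Mutation route: fixed schedule from the table, in seconds.
MUTATION_ROUTE_DELAYS = [0.5, 2.0, 10.0, 30.0, 60.0]


def mutation_route_delay(attempt: int) -> Optional[float]:
    """Delay before retry number `attempt` (1-based).

    Returns None once the 5 retries are exhausted; at that point the
    mutation stays queued and an alert is raised.
    """
    if 1 <= attempt <= len(MUTATION_ROUTE_DELAYS):
        return MUTATION_ROUTE_DELAYS[attempt - 1]
    return None


def capped_exponential(attempt: int, base: float, cap: float) -> float:
    """base * 2^(attempt - 1), never exceeding cap (doubling is assumed)."""
    exponent = min(attempt - 1, 32)  # bound the exponent to avoid float overflow
    return min(cap, base * 2 ** exponent)


OUTBOX_CAP_SECONDS = 5 * 60  # "exp cap 5m"
PROJECTOR_CAP_SECONDS = 30   # "exp cap 30s"
```

For example, `capped_exponential(4, 1.0, OUTBOX_CAP_SECONDS)` yields 8 seconds under the assumed base, and any sufficiently late attempt settles at the cap.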

3. Circuit Breakers

| Target | Trip | Reset |
|---|---|---|
| Each owning service | 10 fail/30s | 60s |
| Postgres | 10 fail/30s | 60s |
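
A minimal sliding-window breaker matching the trip and reset values above. It assumes "Reset 60s" means the breaker closes fully 60 seconds after tripping, with no half-open probe; the class and method names are hypothetical.

```python
from collections import deque


class CircuitBreaker:
    """Trip after max_failures within window seconds; close after reset_after."""

    def __init__(self, max_failures: int = 10, window: float = 30.0,
                 reset_after: float = 60.0):
        self.max_failures = max_failures
        self.window = window
        self.reset_after = reset_after
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None    # time the breaker tripped, or None if closed

    def allow(self, now: float) -> bool:
        """Return True if a call may proceed at time `now`."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.reset_after:
            # Reset period elapsed: close and forget old failures (assumed).
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Keep only failures inside the sliding window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

Failures spread more thinly than 10 per 30 seconds never trip the breaker, because older timestamps fall out of the window before the count is reached.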

4. Fallbacks

| Primary | Fallback |
|---|---|
| Real-time delta | Stale (eventual consistency) |
| Push to owning service | Queue + retry |
| AI merge suggestion | Manual three-way diff |

5. Chaos

  • Drop 10% NATS messages → projector catches up.
  • Kill sync pod mid-push → client retries; dedup.
  • Introduce artificial clock skew → VectorClock ordering is unaffected.
  • Force all cursors to expire → full-resync wave → verify the system absorbs it.
  • Two devices push conflicting LWW updates → deterministic resolution.
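
For the last experiment, "deterministic" means every replica picks the same winner regardless of the order in which it sees the two updates. The sketch below assumes last-writer-wins compares server receipt time and breaks ties on device id; the document does not state the actual tie-break, and the `Update` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    device_id: str
    received_at: float  # server receipt time
    value: str


def resolve_lww(a: Update, b: Update) -> Update:
    """Pick the later update; tie-break on device_id (assumed rule)."""
    return max(a, b, key=lambda u: (u.received_at, u.device_id))


x = Update("device-a", 100.0, "red")
y = Update("device-b", 100.0, "blue")  # same receipt time: tie-break decides
z = Update("device-a", 101.0, "green")
```

A chaos test can assert that swapping the arguments never changes the result, which is the property that makes the resolution safe to run on any node.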