:::info Source
Sourced from services/sync-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Delta Projector Lag
- Domain events accumulate; pulls return stale data.
- Mitigation: autoscale projector; alert on lag > 60s; clients tolerate eventual consistency.
1.2 Push Route to Owning Service Fails
- Mutation stays queued; retried with exponential backoff.
- After 5 retries: alert; mutation stays queued until manual intervention or service recovery.
1.3 Device Clock Skew
- occurredAt from the device may be wrong; the server uses receivedAt for ordering.
- VectorClock is device-scoped counter (not wall-clock); immune to clock skew.
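
The ordering rule above can be sketched as follows. This is a minimal illustration, not the service's implementation: the `Mutation` fields and the tie-break on `(device_id, counter)` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    device_id: str
    counter: int        # device-scoped counter (hypothetical VectorClock component)
    occurred_at: float  # device wall clock; may be skewed
    received_at: float  # server wall clock at ingestion


def server_order(mutations):
    """Order by server receipt time; occurred_at is never consulted.

    Ties are broken by (device_id, counter), which is an assumption here.
    """
    return sorted(mutations, key=lambda m: (m.received_at, m.device_id, m.counter))


# A device whose clock is far in the future does not jump the queue.
skewed = Mutation("dev-a", 1, occurred_at=9_999_999_999.0, received_at=100.0)
normal = Mutation("dev-b", 1, occurred_at=50.0, received_at=101.0)
ordered = server_order([normal, skewed])
```
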
1.4 Cursor Out of Range (Client Too Old)
- 410 → client must full-resync.
- Rare in practice (< 0.1% of devices); typically users returning from long absence.
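
A client-side sketch of the 410 path, assuming a hypothetical `PullResponse` shape and a `fetch_snapshot` callback that returns `(rows, cursor)`; none of these names come from the sync service itself.

```python
from dataclasses import dataclass, field
from typing import Optional

HTTP_GONE = 410


@dataclass
class PullResponse:
    status: int
    deltas: list = field(default_factory=list)
    next_cursor: Optional[str] = None


@dataclass
class ClientState:
    cursor: Optional[str]
    rows: list = field(default_factory=list)


def handle_pull(state, response, fetch_snapshot):
    """Apply a pull response; on 410, discard local rows and full-resync."""
    if response.status == HTTP_GONE:
        rows, cursor = fetch_snapshot()
        state.rows = list(rows)
        state.cursor = cursor
        return "full-resync"
    state.rows.extend(response.deltas)
    state.cursor = response.next_cursor
    return "incremental"
```
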
1.5 Yjs CRDT Merge Failure (M5)
- Very rare; only on schema divergence.
- Create ConflictRecord; authors see side-by-side diff; AI suggests merge.
- Pre-merge backup retained 30 days.
1.6 Client Mutation Overflow (7+ Days Offline)
- Client enters read-only mode; warns user to sync.
- After 14 days: force prompt; after 30 days: data purge (warn first).
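
The thresholds above map onto a small policy function. The state names are hypothetical, and treating each boundary as inclusive (`>=`) is an assumption, since the source does not say whether "after 14 days" includes day 14.

```python
from datetime import timedelta


def offline_policy(offline_for: timedelta) -> str:
    """Return the client state for a given time spent offline."""
    days = offline_for / timedelta(days=1)
    if days >= 30:
        return "purge_after_warning"  # data purge, preceded by a warning
    if days >= 14:
        return "force_prompt"
    if days >= 7:
        return "read_only"
    return "normal"
```
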
1.7 Network Partition (Client ↔ Server)
- Client continues locally; mutations queue.
- On reconnect: push all; pull all; reconcile.
1.8 Duplicate Push (Client Retries)
- clientMutationId PK deduplicates; second push returns original result.
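
The deduplication behavior can be illustrated with an in-memory stand-in for the primary-key constraint. `MutationStore` and its `push` signature are hypothetical; the real service relies on the database key rather than a dictionary.

```python
class MutationStore:
    """In-memory stand-in for a table keyed by clientMutationId."""

    def __init__(self):
        self._results = {}

    def push(self, client_mutation_id, apply):
        # A retried push finds the stored result and does not re-apply.
        if client_mutation_id in self._results:
            return self._results[client_mutation_id]
        result = apply()
        self._results[client_mutation_id] = result
        return result
```

Because the stored result is returned verbatim, a client that retries after a timeout cannot tell a replay from the first successful push.
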
1.9 HMAC Verification Failure
- Mutation rejected; device flagged; admin alerted.
- If repeated (> 3 in 1 hour): device trust investigation.
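
A sketch of the verification and escalation rule above, using Python's standard `hmac` module. SHA-256, the hex encoding of the signature, and the `FailureTracker` class are assumptions; the source does not specify the digest or how failures are counted.

```python
import hashlib
import hmac
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # "1 hour"
FAILURE_THRESHOLD = 3   # escalate when failures exceed this within the window


def verify(key: bytes, payload: bytes, signature_hex: str) -> bool:
    """Constant-time comparison against the expected HMAC (SHA-256 assumed)."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


class FailureTracker:
    """Counts verification failures per device in a sliding window."""

    def __init__(self):
        self._failures = defaultdict(deque)

    def record_failure(self, device_id: str, now: float) -> bool:
        """Record a failure; return True if a trust investigation is due."""
        window = self._failures[device_id]
        window.append(now)
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) > FAILURE_THRESHOLD
```
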
1.10 Full-Resync Flood
- Rare; but if caused by bug (e.g., cursor corruption): emergency patch.
- Rate-limit full-resync per device (1 per 15 min).
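
The per-device limit above could look like the following sketch. `FullResyncLimiter` is a hypothetical name, and keeping state in process memory is an assumption; a multi-pod deployment would need shared storage.

```python
class FullResyncLimiter:
    """Allow at most one full-resync per device per interval."""

    def __init__(self, min_interval_seconds: float = 15 * 60):
        self.min_interval = min_interval_seconds
        self._last_allowed = {}

    def allow(self, device_id: str, now: float) -> bool:
        last = self._last_allowed.get(device_id)
        if last is not None and now - last < self.min_interval:
            return False  # denied requests do not extend the window
        self._last_allowed[device_id] = now
        return True
```
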
2. Retry / Backoff
| Op | Max | Backoff |
|---|---|---|
| Mutation route (to owning service) | 5 | 500ms, 2s, 10s, 30s, 60s |
| Postgres write | 3 | 10ms–200ms |
| Outbox | infinite | exp cap 5m |
| Delta projector (event processing) | infinite | exp cap 30s |
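
The schedules in the table can be sketched as below. Two things are assumptions: that "Max 5" means one initial attempt plus five retries, and the 1-second base for the capped exponential, which the table does not state.

```python
import time

# Fixed schedule for the mutation route, in seconds (from the table above).
MUTATION_ROUTE_DELAYS = [0.5, 2, 10, 30, 60]


def capped_exponential(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Delay before retry number `attempt` (0-based), doubling up to `cap`.

    cap=300.0 matches the outbox row; use cap=30.0 for the delta projector.
    """
    return min(cap, base * 2 ** attempt)


def route_with_retry(send, delays=MUTATION_ROUTE_DELAYS, sleep=time.sleep) -> bool:
    """Call `send` once, then once per delay until it succeeds.

    Returns False when retries are exhausted; the caller is expected to
    alert and leave the mutation queued.
    """
    for delay in [None, *delays]:
        if delay is not None:
            sleep(delay)
        try:
            send()
            return True
        except Exception:
            continue
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.
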
3. Circuit Breakers
| Target | Trip | Reset |
|---|---|---|
| Each owning service | 10 fail/30s | 60s |
| Postgres | 10 fail/30s | 60s |
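
A simplified breaker matching the trip and reset values in the table. This sketch is an assumption about the mechanics: it closes fully after the reset period and omits the half-open probing state that production breakers usually have.

```python
from collections import deque


class CircuitBreaker:
    """Trips after `max_failures` within `window` seconds; resets after `reset_after`."""

    def __init__(self, max_failures: int = 10, window: float = 30.0, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.window = window
        self.reset_after = reset_after
        self._failures = deque()
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
            self._failures.clear()

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.reset_after:
            self._opened_at = None  # reset period elapsed; close the breaker
            return True
        return False
```
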
4. Fallbacks
| Primary | Fallback |
|---|---|
| Real-time delta | Stale (eventual consistency) |
| Push to owning service | Queue + retry |
| AI merge suggestion | Manual three-way diff |
5. Chaos
- Drop 10% NATS messages → projector catches up.
- Kill sync pod mid-push → client retries; dedup.
- Introduce artificial clock skew → VectorClock handles.
- Force all cursors to expire → full-resync wave → verify system handles.
- Two devices push conflicting LWW updates → deterministic resolution.
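
For the last scenario, "deterministic" means the winner must not depend on arrival order. A minimal sketch of such a check, assuming last-write-wins compares `receivedAt` and breaks ties by device ID; the tie-break and the `Write` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    device_id: str
    received_at: float
    value: str


def lww_winner(a: Write, b: Write) -> Write:
    """Pick the later write; equal timestamps fall back to device_id."""
    return max(a, b, key=lambda w: (w.received_at, w.device_id))


# Same timestamp from two devices: the result is identical in either order.
x = Write("dev-a", 100.0, "title: Draft")
y = Write("dev-b", 100.0, "title: Final")
```
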