Observability

:::info Source
Sourced from services/sync-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: sync.push.received|completed, sync.pull.served, sync.ack.received, sync.mutation.applied|conflicted|rejected, sync.delta.projected, sync.conflict.resolved, sync.cursor.advanced|out_of_range, sync.device.stale|disconnected, sync.full_resync.initiated.

Attrs: device_id_hash, scope, mutation_count, delta_count, lamport, conflict_policy.
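As an illustration of how one of these events might be emitted with the attributes above, here is a minimal sketch. The JSON-lines layout, the salted SHA-256 truncation behind `device_id_hash`, and the helper names `hash_device_id` and `build_log_event` are assumptions made for this example, not the service's actual logging code.

```python
import hashlib
import json
import time


def hash_device_id(device_id: str, salt: str = "example-salt") -> str:
    """Return a truncated salted SHA-256 digest (assumed scheme) so raw IDs stay out of logs."""
    digest = hashlib.sha256((salt + device_id).encode("utf-8")).hexdigest()
    return digest[:16]


def build_log_event(event: str, device_id: str, **attrs) -> str:
    """Serialize one sync event as a single JSON line."""
    record = {
        "ts": time.time(),
        "event": event,
        "device_id_hash": hash_device_id(device_id),
        **attrs,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values for a completed push.
line = build_log_event(
    "sync.push.completed",
    device_id="device-123",
    scope="workspace",
    mutation_count=42,
    lamport=1057,
)
print(line)
```

Keeping the hash inside the helper means callers cannot accidentally log the raw device identifier.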

2. Metrics

RED

  • sync_api_requests_total{endpoint,status} counter
  • sync_api_duration_seconds{endpoint} histogram

Domain

  • sync_push_mutations_total{status=applied|conflicted|rejected} counter
  • sync_push_batch_size histogram
  • sync_pull_deltas_total{scope} counter
  • sync_pull_duration_seconds{scope} histogram
  • sync_conflicts_total{entity_type,policy} counter
  • sync_conflicts_pending gauge
  • sync_conflicts_resolution_duration_seconds histogram
  • sync_cursor_out_of_range_total counter (full-resync triggers)
  • sync_delta_projection_lag_seconds gauge (event → delta materialized)
  • sync_devices_by_status{status=healthy|stale|disconnected} gauge
  • sync_device_mutation_backlog histogram

USE

  • sync_outbox_lag_seconds gauge
  • sync_deltas_table_rows{scope} gauge
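To make the label schema concrete, the sketch below models a labelled counter in memory and renders it in the Prometheus text exposition style (`name{label="value"} value`). The `LabelledCounter` class is a hypothetical stand-in written for this example; a real service would more likely use a metrics client library, and the counts shown are made up.

```python
from collections import Counter


class LabelledCounter:
    """Tiny in-memory counter keyed by a fixed set of label names."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        # Reject label sets that do not match the declared schema.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return one exposition-style line per label combination, sorted by labels."""
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


mutations = LabelledCounter("sync_push_mutations_total", ["status"])
mutations.inc(97, status="applied")
mutations.inc(2, status="conflicted")
mutations.inc(1, status="rejected")

lines = mutations.render()
for entry in lines:
    print(entry)
```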

3. Traces

Spans: sync.push.batch, sync.push.mutation.route{service}, sync.pull.query, sync.delta.project, sync.conflict.detect, sync.conflict.resolve.

4. Dashboards

  • Push: throughput, batch size, rejection rate, conflict rate.
  • Pull: throughput, delta volume, cursor age distribution.
  • Conflicts: pending count, resolution rate, policy distribution.
  • Device health: healthy/stale/disconnected distribution, offline duration histogram.
  • Delta projection lag.
  • Full-resync frequency (should be rare).

5. Alerts

| Alert | Threshold | Severity |
| --- | --- | --- |
| conflict-rate-spike | > 5% of pushes | P2 |
| delta-projection-lag | > 60s p99 | P2 |
| full-resync-spike | > 100/day | P2 |
| mutation-rejection-spike | > 3% | P2 |
| stale-devices | > 10% of active devices | P3 |
| outbox-lag | > 30s p99 | P2 |
| push-failure | > 1% | P2 |
| conflict-backlog | > 1000 pending | P3 |
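The thresholds above can be read as simple "fires when the observed value exceeds the limit" rules. The sketch below encodes them that way; the `AlertRule` type, the `firing` helper, the choice to express percentages as fractions, and the observed values are all assumptions for illustration, not the deployed alert configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # fractions for percentages, seconds for lag, counts otherwise
    severity: str


RULES = [
    AlertRule("conflict-rate-spike", 0.05, "P2"),       # > 5% of pushes
    AlertRule("delta-projection-lag", 60.0, "P2"),      # > 60s p99
    AlertRule("full-resync-spike", 100, "P2"),          # > 100/day
    AlertRule("mutation-rejection-spike", 0.03, "P2"),  # > 3%
    AlertRule("stale-devices", 0.10, "P3"),             # > 10% of active devices
    AlertRule("outbox-lag", 30.0, "P2"),                # > 30s p99
    AlertRule("push-failure", 0.01, "P2"),              # > 1%
    AlertRule("conflict-backlog", 1000, "P3"),          # > 1000 pending
]


def firing(rules, observed):
    """Return rules whose observed value strictly exceeds the threshold.

    Rules with no observed value are skipped rather than treated as firing.
    """
    return [r for r in rules if r.name in observed and observed[r.name] > r.threshold]


# Hypothetical snapshot of current values.
observed = {
    "conflict-rate-spike": 0.07,
    "delta-projection-lag": 12.0,
    "outbox-lag": 30.0,  # equal to the limit, so it does not fire
    "conflict-backlog": 1500,
}

for rule in firing(RULES, observed):
    print(f"{rule.severity} {rule.name}: {observed[rule.name]} > {rule.threshold}")
```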

6. SLOs

| SLI | Target |
| --- | --- |
| Push p95 | < 500ms (100-mutation batch) |
| Pull p95 | < 300ms (500-entity delta) |
| Delta projection lag p99 | < 30s |
| Conflict auto-resolution rate | ≥ 95% |
| Full-resync rate | < 0.1% of active devices/week |
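As a worked example of checking a latency SLI against its target, the sketch below computes a nearest-rank percentile over a small sample of push durations and compares it with the 500ms push target. The nearest-rank method, the `percentile` helper, and the sample data are assumptions for illustration; production dashboards typically estimate percentiles from histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical push durations in milliseconds for 100-mutation batches.
push_ms = [
    120, 180, 210, 250, 260, 270, 300, 310, 320, 330,
    340, 350, 360, 380, 400, 420, 450, 470, 490, 900,
]

PUSH_P95_TARGET_MS = 500

push_p95 = percentile(push_ms, 95)
meets_slo = push_p95 < PUSH_P95_TARGET_MS
print(f"push p95 = {push_p95}ms, meets SLO: {meets_slo}")
```

With 20 samples the 95th percentile is the 19th value in sorted order, so the single 900ms outlier does not break the target here.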

6a. EP-20 multi-device NFRs (product ↔ operations)

Epic EP-20 (docs/07-epics-and-user-stories.md) states end-to-end propagation windows (e.g. < 30 s for cursor/conflict visibility while online, ≤ 60 s for wipe/revocation after a device is next online). These are customer-facing SLOs: they are validated in staging and production rather than asserted as fixed wall-clock limits in every local dev run.

How we evidence them

| Story need | Primary signals / dashboards | Notes |
| --- | --- | --- |
| US-100 / US-101: < 30 s | sync_outbox_lag_seconds (p99), sync_pull_duration_seconds, sync_conflicts_resolution_duration_seconds, delta-projection-lag alert (fires at > 60s p99, §5) | End-to-end includes client pull + UI; the server side is bounded by outbox + pull + projection. |
| US-102 / US-103: ≤ 60 s after next online | Same + sync_bundle_revoked_total, device reconnect sync | “After next online” is client-dependent; the server guarantees that the change materializes in pull after projection. |

Release criteria: Grafana (or equivalent) panels show p99 within targets for the windows above; alert policies in §5 are green during soak. Local integration tests assert correctness of pull/wipe/revoke paths; latency SLOs are operational.

7. Device Telemetry

  • Offline devices buffer metrics.
  • On reconnect, sync pushes device metrics (offline duration, local mutation count, app crashes).
  • Correlate with sync health dashboard.
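The buffer-then-push behaviour described above could look roughly like the following sketch. The `TelemetryBuffer` class, its size cap, the drop-oldest policy when the cap is reached, and the `send` callback are hypothetical choices made for this example; the source does not specify how the client buffers or transmits telemetry.

```python
from collections import deque


class TelemetryBuffer:
    """Holds device metrics while offline and hands them over in one batch on reconnect."""

    def __init__(self, max_entries=1000):
        # A bounded deque silently drops the oldest entry once the cap is reached.
        self._entries = deque(maxlen=max_entries)

    def record(self, name, value):
        self._entries.append({"name": name, "value": value})

    def flush(self, send):
        """Pass all buffered entries to `send`; clear the buffer only if `send` does not raise."""
        batch = list(self._entries)
        if batch:
            send(batch)
            self._entries.clear()
        return len(batch)


# Usage: record while offline, then flush when the device reconnects.
buffer = TelemetryBuffer(max_entries=3)
buffer.record("offline_duration_seconds", 5400)
buffer.record("local_mutation_count", 17)
buffer.record("app_crashes", 1)

sent_batches = []
flushed = buffer.flush(sent_batches.append)  # stand-in for the real push call
print(flushed, sent_batches)
```

Clearing only after a successful send means a failed push leaves the metrics in place for the next reconnect attempt.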