Observability
:::info Source
Sourced from services/sync-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: sync.push.received|completed, sync.pull.served, sync.ack.received, sync.mutation.applied|conflicted|rejected, sync.delta.projected, sync.conflict.resolved, sync.cursor.advanced|out_of_range, sync.device.stale|disconnected, sync.full_resync.initiated.
Attrs: device_id_hash, scope, mutation_count, delta_count, lamport, conflict_policy.
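As an illustration of how one of these events might be emitted, the sketch below builds a single JSON log line carrying the attributes above. The salt, the 16-character hash truncation, and the example values for `scope` and `conflict_policy` are assumptions made for the example, not values taken from the service.

```python
import hashlib
import json
import time


def hash_device_id(device_id, salt="example-salt"):
    """Return a truncated SHA-256 hex digest so raw device IDs never reach logs.

    The salt and the 16-character truncation are illustrative assumptions.
    """
    digest = hashlib.sha256((salt + device_id).encode("utf-8")).hexdigest()
    return digest[:16]


def build_log_event(event, device_id, **attrs):
    """Serialize one sync event as a single JSON line."""
    record = {
        "ts": time.time(),
        "event": event,
        "device_id_hash": hash_device_id(device_id),
    }
    record.update(attrs)
    return json.dumps(record, sort_keys=True)


# Hypothetical attribute values, for illustration only.
line = build_log_event(
    "sync.push.completed",
    "device-123",
    scope="workspace",
    mutation_count=100,
    lamport=42,
    conflict_policy="last_write_wins",
)
print(line)
```

Hashing before serialization keeps the raw identifier out of every downstream log sink, at the cost of needing the same salt to correlate events across services.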
2. Metrics
RED
| Metric | Type |
|---|---|
| sync_api_requests_total{endpoint,status} | counter |
| sync_api_duration_seconds{endpoint} | histogram |
Domain
| Metric | Type | Notes |
|---|---|---|
| sync_push_mutations_total{status=applied\|conflicted\|rejected} | counter | |
| sync_push_batch_size | histogram | |
| sync_pull_deltas_total{scope} | counter | |
| sync_pull_duration_seconds{scope} | histogram | |
| sync_conflicts_total{entity_type,policy} | counter | |
| sync_conflicts_pending | gauge | |
| sync_conflicts_resolution_duration_seconds | histogram | |
| sync_cursor_out_of_range_total | counter | full-resync triggers |
| sync_delta_projection_lag_seconds | gauge | event → delta materialized |
| sync_devices_by_status{status=healthy\|stale\|disconnected} | gauge | |
| sync_device_mutation_backlog | histogram | |
USE
| Metric | Type |
|---|---|
| sync_outbox_lag_seconds | gauge |
| sync_deltas_table_rows{scope} | gauge |
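The metric names above follow Prometheus naming conventions (`_total` counters, `{label}` selectors). Assuming a Prometheus-style scrape endpoint, the sketch below shows how a labelled counter such as `sync_push_mutations_total` could look in the text exposition format. `LabelledCounter` is a hypothetical stand-in written for this example; a real service would normally use a metrics client library instead, and label-value escaping is not handled here.

```python
from collections import defaultdict


class LabelledCounter:
    """Minimal in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        """Increase the series identified by the given label values."""
        if set(labels) != set(self.label_names):
            raise ValueError("expected labels: %s" % ", ".join(self.label_names))
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render all series in Prometheus text exposition format."""
        lines = ["# TYPE %s counter" % self.name]
        for key in sorted(self._values):
            label_str = ",".join(
                '%s="%s"' % (name, value)
                for name, value in zip(self.label_names, key)
            )
            lines.append("%s{%s} %g" % (self.name, label_str, self._values[key]))
        return "\n".join(lines)


pushes = LabelledCounter("sync_push_mutations_total", ["status"])
pushes.inc(status="applied")
pushes.inc(status="applied")
pushes.inc(status="conflicted")
print(pushes.render())
# # TYPE sync_push_mutations_total counter
# sync_push_mutations_total{status="applied"} 2
# sync_push_mutations_total{status="conflicted"} 1
```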
3. Traces
Spans: sync.push.batch, sync.push.mutation.route{service}, sync.pull.query, sync.delta.project, sync.conflict.detect, sync.conflict.resolve.
4. Dashboards
- Push: throughput, batch size, rejection rate, conflict rate.
- Pull: throughput, delta volume, cursor age distribution.
- Conflicts: pending count, resolution rate, policy distribution.
- Device health: healthy/stale/disconnected distribution, offline duration histogram.
- Delta projection lag.
- Full-resync frequency (should be rare).
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| conflict-rate-spike | > 5% of pushes | P2 |
| delta-projection-lag | > 60s p99 | P2 |
| full-resync-spike | > 100/day | P2 |
| mutation-rejection-spike | > 3% | P2 |
| stale-devices | > 10% of active devices | P3 |
| outbox-lag | > 30s p99 | P2 |
| push-failure | > 1% | P2 |
| conflict-backlog | > 1000 pending | P3 |
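To make the ratio-based thresholds concrete, the sketch below evaluates four of the alerts in the table against observed counts. The choice of denominators (pushes for conflicts, rejections, and failures; active devices for staleness) and the sample numbers are assumptions for illustration. The table does not state an evaluation window, so none is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float  # fires when the observed ratio is strictly greater
    severity: str


# Thresholds copied from the alert table; only ratio-based alerts are covered.
RULES = [
    AlertRule("conflict-rate-spike", 0.05, "P2"),
    AlertRule("mutation-rejection-spike", 0.03, "P2"),
    AlertRule("push-failure", 0.01, "P2"),
    AlertRule("stale-devices", 0.10, "P3"),
]


def ratio(numerator, denominator):
    """Return numerator / denominator, treating an empty denominator as 0."""
    return numerator / denominator if denominator else 0.0


def firing(observed):
    """Return 'name (severity)' for every rule whose threshold is exceeded."""
    return [
        "%s (%s)" % (rule.name, rule.severity)
        for rule in RULES
        if observed.get(rule.name, 0.0) > rule.threshold
    ]


# Hypothetical counts for one evaluation window.
observed = {
    "conflict-rate-spike": ratio(120, 2000),       # 6% of pushes
    "mutation-rejection-spike": ratio(40, 2000),   # 2%
    "push-failure": ratio(10, 2000),               # 0.5%
    "stale-devices": ratio(150, 1000),             # 15% of active devices
}
print(firing(observed))
# ['conflict-rate-spike (P2)', 'stale-devices (P3)']
```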
6. SLOs
| SLI | Target |
|---|---|
| Push p95 | < 500ms (100-mutation batch) |
| Pull p95 | < 300ms (500-entity delta) |
| Delta projection lag p99 | < 30s |
| Conflict auto-resolution rate | ≥ 95% |
| Full-resync rate | < 0.1% of active devices/week |
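As a worked example of checking two of these targets, the sketch below computes a nearest-rank p95 over push latencies and a weekly full-resync rate. The sample latencies and device counts are made up for illustration, and nearest-rank is one of several percentile definitions; a histogram-backed metrics system will typically interpolate between buckets instead.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of samples are <= it.
    Integer arithmetic avoids floating-point surprises in the rank.
    """
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical push latencies in milliseconds: 19 fast requests, 1 slow outlier.
push_latencies_ms = [100 + 10 * i for i in range(19)] + [900]

push_p95 = percentile(push_latencies_ms, 95)
print(push_p95, push_p95 < 500)  # 280 True

# Full-resync rate: devices that resynced this week / active devices.
resynced_devices = 4
active_devices = 5000
resync_rate = resynced_devices / active_devices
print(resync_rate < 0.001)  # True (0.08% is under the 0.1% target)
```

Note that the single 900 ms outlier does not affect the p95 here, which is why a p95 target is usually paired with alerts on failure rates or higher percentiles.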
6a. EP-20 multi-device NFRs (product ↔ operations)
Epic EP-20 (docs/07-epics-and-user-stories.md) states end-to-end propagation windows (e.g. < 30 s for cursor/conflict visibility while online, ≤ 60 s for wipe/revocation once a device is next online). These are customer-facing SLOs, validated in staging and production; they are not enforced as fixed wall-clock assertions in every local dev run.
How we evidence them
| Story need | Primary signals / dashboards | Notes |
|---|---|---|
| US-100 / US-101: < 30 s | sync_outbox_lag_seconds (p99), sync_pull_duration_seconds, sync_conflicts_resolution_duration_seconds, delta-projection-lag alert (fires at > 60s p99, §5) | End-to-end time includes client pull and UI rendering; the server side is bounded by outbox lag + pull duration + projection lag. |
| US-102 / US-103: ≤ 60 s after next online | Same as above, plus sync_bundle_revoked_total and device reconnect sync | “After next online” is client-dependent; the server-side guarantee is that the change is served in pull once it has been projected. |
Release criteria: Grafana (or equivalent) panels show p99 within targets for the windows above, and the alert policies in §5 stay green during soak. Local integration tests assert the correctness of the pull/wipe/revoke paths; latency SLOs are verified operationally, not in tests.
7. Device Telemetry
- Offline devices buffer metrics.
- On reconnect, sync pushes device metrics (offline duration, local mutation count, app crashes).
- Correlate with sync health dashboard.
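The buffering behaviour described above can be sketched as follows. `DeviceTelemetryBuffer`, its method names, the payload field names, and the 1000-sample cap are all hypothetical; the actual client and wire format may differ.

```python
from collections import deque


class DeviceTelemetryBuffer:
    """Buffers device metrics while offline and drains them on reconnect."""

    def __init__(self, max_samples=1000):
        # Bounded buffer: when full, the oldest sample is dropped first.
        self._samples = deque(maxlen=max_samples)
        self._offline_since = None
        self.local_mutation_count = 0
        self.app_crashes = 0

    def go_offline(self, now):
        """Mark the start of an offline period (no-op if already offline)."""
        if self._offline_since is None:
            self._offline_since = now

    def record_sample(self, name, value):
        self._samples.append((name, value))

    def record_mutation(self):
        self.local_mutation_count += 1

    def record_crash(self):
        self.app_crashes += 1

    def flush_on_reconnect(self, now):
        """Build the payload to push with the next sync, then reset state."""
        if self._offline_since is None:
            offline_duration = 0.0
        else:
            offline_duration = now - self._offline_since
        payload = {
            "offline_duration_seconds": offline_duration,
            "local_mutation_count": self.local_mutation_count,
            "app_crashes": self.app_crashes,
            "samples": list(self._samples),
        }
        self._samples.clear()
        self._offline_since = None
        self.local_mutation_count = 0
        self.app_crashes = 0
        return payload


buf = DeviceTelemetryBuffer(max_samples=2)
buf.go_offline(now=100.0)
buf.record_mutation()
buf.record_mutation()
buf.record_crash()
for value in (1, 2, 3):
    buf.record_sample("queue_depth", value)  # the first sample is evicted
payload = buf.flush_on_reconnect(now=160.0)
print(payload["offline_duration_seconds"], payload["samples"])
# 60.0 [('queue_depth', 2), ('queue_depth', 3)]
```

Capping the buffer bounds memory on devices that stay offline for long periods, at the cost of losing the oldest samples.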