
Observability

:::info Source
Sourced from services/media-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: media.upload.initiated|completed|failed, media.scan.*, media.transcode.*, media.caption.*, media.transcript.*, media.ai.*, media.quarantine.*, media.delete.*.
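As a rough illustration of how these event names might appear in structured logs, the sketch below writes one JSON object per line using only the Python standard library. The event names come from the list above. The logger name, the `log_event` helper, and the `media_id`, `tenant_id`, and `bytes` fields are hypothetical and are not taken from the service.

```python
import io
import json
import logging


def make_logger(stream):
    """Build a logger that writes bare messages to `stream` (hypothetical setup)."""
    logger = logging.getLogger("media-service-example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Emit one JSON log line: the event name plus arbitrary context fields."""
    logger.info(json.dumps({"event": event, **fields}, sort_keys=True))


buf = io.StringIO()
log = make_logger(buf)

# Field names below are illustrative assumptions, not the service's schema.
log_event(log, "media.upload.initiated", media_id="m-123", tenant_id="t-1")
log_event(log, "media.upload.completed", media_id="m-123", tenant_id="t-1", bytes=2048)

lines = buf.getvalue().splitlines()
```

Keeping the event name in a dedicated `event` key makes it straightforward to filter by prefix (for example everything under `media.upload.`) in most log backends.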

2. Metrics

RED

  • media_api_requests_total{endpoint,status} counter
  • media_api_duration_seconds{endpoint} histogram
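To show how an error ratio could be derived from a counter labelled by `endpoint` and `status`, here is a minimal in-memory sketch. It does not use a metrics library. The `observe` and `error_ratio` helpers, the `/uploads` endpoint, and the choice to count only 5xx responses as errors are assumptions made for illustration.

```python
from collections import Counter

# In-memory stand-in for media_api_requests_total{endpoint,status}.
requests_total = Counter()


def observe(endpoint, status):
    """Increment the counter for one (endpoint, status) label pair."""
    requests_total[(endpoint, str(status))] += 1


def error_ratio(endpoint):
    """Share of requests to `endpoint` that returned a 5xx status (assumed error definition)."""
    total = sum(n for (ep, _), n in requests_total.items() if ep == endpoint)
    errors = sum(
        n
        for (ep, status), n in requests_total.items()
        if ep == endpoint and status.startswith("5")
    )
    return errors / total if total else 0.0


# Hypothetical traffic: 98 successful requests and 2 server errors.
for _ in range(98):
    observe("/uploads", 200)
for _ in range(2):
    observe("/uploads", 500)

ratio = error_ratio("/uploads")
```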

Domain

  • media_uploads_total{kind} counter
  • media_uploads_bytes_total counter
  • media_scan_duration_seconds histogram
  • media_scan_result_total{result} counter (clean/quarantined)
  • media_transcode_duration_seconds{profile} histogram
  • media_transcode_success_rate gauge
  • media_ai_image_total{model} counter
  • media_ai_audio_total{model} counter
  • media_caption_duration_seconds histogram
  • media_storage_bytes{tenant_id,class=hot|cold} gauge
  • media_cdn_cache_hit_ratio gauge

Cost

  • media_ai_cost_micro_usd_total{tenant_id} counter
  • media_storage_cost_estimate{tenant_id} gauge
  • media_egress_bytes{tenant_id} counter
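The `micro_usd` suffix suggests that AI cost is accumulated in integer millionths of a US dollar, which avoids floating-point drift in a monotonically increasing counter. The sketch below assumes that reading. The `CostCounter` class and the example amounts are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

MICRO_USD_PER_USD = 1_000_000  # assumes 1 micro-USD = one millionth of a dollar


class CostCounter:
    """In-memory stand-in for media_ai_cost_micro_usd_total{tenant_id}."""

    def __init__(self):
        self._totals = defaultdict(int)

    def add_usd(self, tenant_id, usd):
        """Add a Decimal dollar amount, stored as whole micro-USD."""
        micro = int((usd * MICRO_USD_PER_USD).to_integral_value())
        if micro < 0:
            raise ValueError("counters only increase")
        self._totals[tenant_id] += micro

    def value(self, tenant_id):
        """Current counter value in micro-USD."""
        return self._totals[tenant_id]

    def value_usd(self, tenant_id):
        """Current counter value converted back to dollars."""
        return Decimal(self._totals[tenant_id]) / MICRO_USD_PER_USD


costs = CostCounter()
costs.add_usd("t-1", Decimal("0.0025"))  # hypothetical per-call cost
costs.add_usd("t-1", Decimal("0.0100"))
```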

3. Traces

Spans: media.upload.finalize, media.scan, media.transcode, media.caption.generate, media.ai.image, media.delete.
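The following sketch illustrates how nested spans with these names could be recorded. It is a standard-library stand-in for a tracing SDK, not the service's instrumentation. The `span` helper is hypothetical, and nesting `media.scan` under `media.upload.finalize` is an assumption made for the example.

```python
import time
from contextlib import contextmanager

finished = []  # completed spans, in the order they ended
_stack = []    # names of the spans that are currently open


@contextmanager
def span(name):
    """Record a span's name, its parent (if any), and its wall-clock duration."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        finished.append(
            {
                "name": name,
                "parent": parent,
                "duration_s": time.perf_counter() - start,
            }
        )


# Assumed parent/child relationship, for illustration only.
with span("media.upload.finalize"):
    with span("media.scan"):
        pass
```

Inner spans finish first, so `finished` lists `media.scan` before `media.upload.finalize`.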

4. Dashboards

  • Upload volume + success rate.
  • Transcode queue + latency.
  • Scan outcomes.
  • AI usage + cost.
  • Storage cost per tenant.
  • CDN cache hit.

5. Alerts

| Alert | Threshold | Severity |
|---|---|---|
| upload-failure-rate | > 2% | P2 |
| scan-queue-backlog | > 5000 | P2 |
| transcode-failure-rate | > 3% | P2 |
| quarantine-spike | > 10/min | P1 (possible attack) |
| ai-budget-exhausted | tenant budget > 100% | P3 |
| storage-cost-spike | > 50% WoW | P3 |
| cdn-cache-hit-low | < 85% | P3 |
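The alert thresholds above can be expressed as data and evaluated against observed values. The sketch below is one possible encoding, not the service's alerting configuration. Percentages are written as fractions, and the `AlertRule` type and `firing` helper are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    compare: Callable[[float, float], bool]  # applied as compare(observed, threshold)
    threshold: float
    severity: str


# Thresholds copied from the alert list; units are noted per rule.
RULES = [
    AlertRule("upload-failure-rate", operator.gt, 0.02, "P2"),     # fraction of uploads
    AlertRule("scan-queue-backlog", operator.gt, 5000, "P2"),      # queued items
    AlertRule("transcode-failure-rate", operator.gt, 0.03, "P2"),  # fraction of jobs
    AlertRule("quarantine-spike", operator.gt, 10, "P1"),          # quarantines per minute
    AlertRule("ai-budget-exhausted", operator.gt, 1.00, "P3"),     # fraction of tenant budget
    AlertRule("storage-cost-spike", operator.gt, 0.50, "P3"),      # week-over-week growth
    AlertRule("cdn-cache-hit-low", operator.lt, 0.85, "P3"),       # cache hit ratio
]


def firing(observed):
    """Return (name, severity) for every rule whose observed value breaches its threshold."""
    return [
        (rule.name, rule.severity)
        for rule in RULES
        if rule.name in observed and rule.compare(observed[rule.name], rule.threshold)
    ]


# Hypothetical observations.
alerts = firing(
    {
        "upload-failure-rate": 0.05,
        "cdn-cache-hit-low": 0.90,
        "quarantine-spike": 12,
    }
)
```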

6. SLOs

| SLI | Target |
|---|---|
| Upload URL p95 | < 200 ms |
| Scan p95 | < 30 s |
| Transcode (1-min video 1080p) p95 | < 3 min |
| Stream URL p95 | < 100 ms |
| AI image gen p95 | < 15 s |
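To make the p95 targets concrete, the sketch below computes a nearest-rank 95th percentile over a batch of latency samples and compares it with the targets above, converted to seconds. Nearest-rank is one of several percentile definitions and is chosen here for simplicity. The dictionary keys and the sample data are hypothetical.

```python
# SLO targets from the list above, expressed in seconds. Key names are illustrative.
SLO_P95_SECONDS = {
    "upload_url": 0.200,
    "scan": 30.0,
    "transcode_1min_1080p": 180.0,
    "stream_url": 0.100,
    "ai_image_gen": 15.0,
}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(sli, samples):
    """True when the p95 of `samples` is strictly below the target for `sli`."""
    return p95(samples) < SLO_P95_SECONDS[sli]


# Hypothetical batches: 5% slow requests pass, 10% slow requests do not.
mostly_fast = [0.05] * 95 + [0.5] * 5
too_many_slow = [0.05] * 90 + [0.5] * 10
```

With 100 samples the nearest-rank p95 is the 95th smallest value, so `mostly_fast` meets the upload URL target while `too_many_slow` does not.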