
file-storage-service — OBSERVABILITY

Companion: SERVICE_OVERVIEW §7 SLOs · APPLICATION_LOGIC · FAILURE_MODES · platform: @ghasi/telemetry

This service emits structured logs, OpenTelemetry traces, and Prometheus-format metrics through the platform @ghasi/telemetry package. Telemetry is initialized before NestFactory.create in src/main.ts so that the very first DB connection and Pub/Sub publish are traced. All telemetry tags carry tenant_id, service, version, region, and (where applicable) scope, data_class, caller_surface.

1. SLIs and SLOs

SLI | Definition | Target | Window
initiate_upload_p95_latency_seconds | Server-side latency of POST /api/v1/files/uploads | ≤ 0.150 s | rolling 30 d
confirm_upload_p95_latency_seconds | Server-side latency of POST /uploads/{ups}/confirm | ≤ 0.250 s | rolling 30 d
download_url_p95_latency_seconds | Server-side latency of POST /files/{med}/download-url | ≤ 0.120 s | rolling 30 d
scan_completion_p95_seconds | scan.passed.v1.occurredAt − upload.completed.v1.occurredAt | ≤ 15 s | rolling 30 d
optimization_p95_seconds | optimization.completed.v1 − scan.passed.v1 for bytes ≤ 5 MiB | ≤ 30 s | rolling 30 d
cdn_get_p95_seconds_cache_hit | Cloud CDN log latency, hits only | ≤ 0.080 s | rolling 30 d
availability_read_pct | 1 − (5xx / total) for read endpoints | ≥ 99.95 % | rolling 30 d
availability_write_pct | 1 − (5xx / total) for write endpoints | ≥ 99.9 % | rolling 30 d
cross_tenant_access_count | events file.access.denied.v1 with reason='cross_tenant' | 0 | rolling 24 h
retention_sweep_lag_seconds | now() − min(hard_delete_after WHERE not yet purged) | ≤ 3600 s | continuous
outbox_publish_lag_p95_seconds | now() − created_at for unpublished rows | ≤ 2 s | continuous
quarantine_rate_pct_24h | quarantined / total uploads | ≤ 0.5 % | rolling 24 h (anomaly bound)
signed_url_revoke_to_block_seconds_p95 | time from DELETE access-grant to next request blocked | ≤ 30 s | rolling 24 h

Error budgets are calculated over the rolling 30 d window. Burn-rate alerts fire at 2 % / hr (slow) and 14.4 % / hr (fast).
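The burn-rate arithmetic can be sketched as follows. This is a minimal illustration, not the service's alerting code: the function names are hypothetical, and it assumes the common definition of burn rate as the observed error rate divided by the allowed error fraction (1 − SLO target). Under that reading, a 14.4× burn sustained for one hour consumes about 2 % of a 30 d budget, which matches the 14.4× condition on FileStorage_ErrorBudgetFastBurn in section 7.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are arriving.
// sloTarget is a fraction, e.g. 0.999 for the 99.9 % write-availability SLO.
function burnRate(errorCount: number, totalCount: number, sloTarget: number): number {
  if (totalCount === 0) return 0; // no traffic, nothing burned
  const errorRate = errorCount / totalCount;
  const budget = 1 - sloTarget; // allowed error fraction
  return errorRate / budget;
}

// Fraction of the whole window's budget consumed if `rate` is sustained for `hours`.
function budgetConsumed(rate: number, hours: number, windowDays = 30): number {
  return (rate * hours) / (windowDays * 24);
}

// Hypothetical numbers: 144 failed writes out of 10 000 against a 99.9 % SLO.
const exampleRate = burnRate(144, 10_000, 0.999);
const exampleConsumed = budgetConsumed(exampleRate, 1);
console.log(`${exampleRate.toFixed(1)}x burn, ${(exampleConsumed * 100).toFixed(1)} % of budget per hour`);
```

With these inputs the burn rate is roughly 14.4× and one hour at that rate uses roughly 2 % of the 30 d budget.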

2. Metrics (Prometheus exposition on port 9090)

2.1 Request metrics

file_storage_http_requests_total{method,route,status,tenant_id,caller_surface}
file_storage_http_request_duration_seconds_bucket{method,route,status,le,...}
file_storage_http_request_size_bytes{method,route}
file_storage_http_response_size_bytes{method,route}
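
The p95 latency SLIs in section 1 can be estimated from the cumulative `_bucket{le}` series. The sketch below shows one way to do that by linear interpolation inside the bucket that contains the target rank, similar in spirit to PromQL's histogram_quantile. The `Bucket` type, the function name, and the bucket boundaries in the test data are assumptions for illustration only.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

// Estimate quantile q (0..1) from cumulative buckets by linear interpolation
// within the bucket that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Target falls in the +Inf bucket: the best estimate is the highest finite bound.
      if (!Number.isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Hypothetical bucket counts for one route.
const initiateBuckets: Bucket[] = [
  { le: 0.05, count: 50 },
  { le: 0.1, count: 90 },
  { le: 0.25, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, initiateBuckets);
console.log(p95 <= 0.15 ? "within initiate_upload target" : "above initiate_upload target");
```

The estimate is only as precise as the bucket boundaries, so bounds near each SLO target (0.120 s, 0.150 s, 0.250 s) keep the interpolation error small where it matters.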

2.2 Domain metrics

file_storage_uploads_initiated_total{scope,data_class,tenant_id}
file_storage_uploads_confirmed_total{scope,alias,tenant_id}
file_storage_uploads_failed_total{scope,reason,tenant_id}
file_storage_downloads_issued_total{scope,variant,tenant_id,purpose}
file_storage_downloads_denied_total{scope,reason,tenant_id} # reason=cross_tenant|expired|revoked|quarantined|scan_pending|missing_role
file_storage_scan_results_total{scanner,verdict}
file_storage_scan_duration_seconds_bucket{scanner,scope,le}
file_storage_optimization_duration_seconds_bucket{scope,preset,le}
file_storage_optimization_failed_total{scope,preset,reason}
file_storage_dedupe_alias_total{scope,tenant_id}
file_storage_quarantine_total{scope,reason} # reason=virus|ai_safety|magic_byte|polyglot
file_storage_retention_purges_total{scope,policy_name}
file_storage_erasure_runs_total{scope_kind,outcome}
file_storage_erasure_purged_objects_total{scope_kind}
file_storage_erasure_deferred_objects_total{scope_kind}
file_storage_quota_bytes_used{tenant_id,scope}
file_storage_quota_bytes_cap{tenant_id}
file_storage_quota_objects_used{tenant_id,scope}
file_storage_quota_objects_cap{tenant_id}

2.3 Infrastructure metrics

file_storage_outbox_unpublished_total
file_storage_outbox_publish_lag_seconds_bucket{le}
file_storage_outbox_publish_failures_total{reason}
file_storage_inbox_dedupe_skips_total{event_type}
file_storage_inbox_handler_failures_total{event_type,handler}

file_storage_db_pool_in_use{pool}
file_storage_db_pool_idle{pool}
file_storage_db_query_duration_seconds_bucket{op,le}
file_storage_db_constraint_violations_total{constraint}

file_storage_redis_command_duration_seconds_bucket{cmd,le}
file_storage_signed_url_cache_hits_total
file_storage_signed_url_cache_misses_total
file_storage_signed_url_blacklist_size

file_storage_pubsub_publish_total{topic,result}
file_storage_pubsub_consume_total{subscription,result}

file_storage_gcs_op_duration_seconds_bucket{op,le} # op=sign,head,delete,copy
file_storage_gcs_op_failures_total{op,error}

file_storage_cdn_invalidate_attempts_total{result}
file_storage_cdn_invalidate_duration_seconds_bucket{le}
file_storage_cdn_invalidate_backlog

2.4 AI metrics

file_storage_ai_calls_total{purpose,outcome,tier}
file_storage_ai_latency_seconds_bucket{purpose,tier,le}
file_storage_ai_cost_micro_usd_total{purpose,tier,tenant_id}
file_storage_ai_hitl_pending{purpose}
file_storage_ai_hitl_oldest_age_seconds{purpose}

3. Tracing (OpenTelemetry → Cloud Trace)

Span attributes required on every span:

Attribute | Source
tenant_id | request context
request_id | X-Request-Id header
caller_surface | derived from JWT
actor.kind / actor.userId | JWT subject
route | controller path
idempotency_key | request header (writes only)

Spans of interest:

  • http.server.request (root)
    • usecase.<command> (e.g., usecase.initiateUpload)
      • db.<operation> (e.g., db.file_objects.insert)
      • db.outbox.append
      • gcs.signed_url.upload
      • pubsub.publish (carries topic, messageId, orderingKey)
    • For confirm:
      • gcs.head
      • gcs.stream_sha256_compare (only on missing object metadata)
      • usecase.dedupe_lookup
  • outbox.relay.batch — root for the relay worker; one span per batch, child spans per message published.
  • inbox.consumer.<subscription> — root for each consumed Pub/Sub message.
  • sweeper.retention.scan and sweeper.session.cleanup — long-running spans with periodic events.
  • ai.<purpose> — wraps every orchestrator call; child of the originating use-case span.

Trace sampling: head-based 5 %; tail-based always-keep if any span has error=true, ai.tier=cloud, gcs.op_failures ≥ 1, or cross_tenant=true.
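
The sampling rule above can be sketched as a pure decision function. This is an illustration under assumptions, not the configuration of the actual collector: the function names are hypothetical, span attributes are modeled as a flat record, and the head-based 5 % decision is derived deterministically from the first 32 bits of the trace ID (one common approach; the real sampler may differ).

```typescript
type SpanAttributes = Record<string, string | number | boolean | undefined>;

// True if any span carries one of the always-keep markers from the sampling rule.
function hasKeepMarker(spans: SpanAttributes[]): boolean {
  return spans.some((attrs) => {
    const failures = attrs["gcs.op_failures"];
    return (
      attrs["error"] === true ||
      attrs["ai.tier"] === "cloud" ||
      (typeof failures === "number" && failures >= 1) ||
      attrs["cross_tenant"] === true
    );
  });
}

// Head-based decision (assumption): map the first 8 hex chars of the trace ID
// onto [0, 1) and keep the trace if it falls below the sampling ratio.
function headSampled(traceId: string, ratio = 0.05): boolean {
  const prefix = parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(prefix)) return false;
  return prefix / 0x100000000 < ratio;
}

// Tail-based keep wins over the head-based decision.
function keepTrace(traceId: string, spans: SpanAttributes[]): boolean {
  return hasKeepMarker(spans) || headSampled(traceId);
}

console.log(keepTrace("ffffffff00000000ffffffff00000000", [{ route: "/files/{med}", cross_tenant: true }]));
```

Evaluating the keep markers first means a trace that lost the 5 % head draw is still retained when it contains an error, a cloud-tier AI call, a GCS failure, or a cross-tenant attempt.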

4. Dashboards (Grafana / Cloud Monitoring)

4.1 file-storage / overview

  • RED panels per route: rate, errors, p50/p95/p99 latency.
  • Upload funnel: initiated → confirmed → ready → quarantined.
  • Outbox lag, inbox lag, Pub/Sub backlog per topic.
  • 30 d error budget burn (multi-window).

4.2 file-storage / quotas

  • Top 20 tenants by bytes_used %, objects_used %.
  • Tenants in 80–95 % band (warn), ≥ 95 % (critical).
  • Reserved bytes pending confirm (leaks?).

4.3 file-storage / scan-and-quarantine

  • Scan throughput (per scanner).
  • Quarantine count + reason distribution (last 24 h, 7 d).
  • Time-to-scan p50/p95.
  • ClamAV definitions version freshness.

4.4 file-storage / optimization

  • Variant build success / fail (per preset, per scope).
  • Optimization latency buckets per byte-size band.
  • Optimizer DLQ depth.

4.5 file-storage / cdn

  • CDN hit ratio per host.
  • p95 edge latency.
  • Invalidation backlog and time-to-invalidate.

4.6 file-storage / erasure-and-retention

  • Erasure runs in flight, completed, partial, failed (last 30 d).
  • Deferred objects by policy.
  • Retention sweep lag.
  • Holds about to release in next 7 d.

4.7 file-storage / ai

  • Per purpose: call rate, latency, refusal rate, cost / 1k calls.
  • HITL queue depth and oldest age per purpose.

5. Logs (pino JSON → Cloud Logging)

Required fields on every log line:

{ "ts", "level", "service":"file-storage-service", "version", "tenant_id", "trace_id", "span_id",
"request_id", "caller_surface", "actor": { "userId" | "service" }, "route", "msg" }

Sensitive fields are redacted by pino's redactor: the Authorization and Cookie headers, any header whose name contains "signature" (for example Signature and X-Goog-Signature), request bodies on PII-scope endpoints (guest_id_scan upload), and PUT URL query strings.
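
The header and URL rules can be illustrated with plain functions. This sketch is not the service's pino configuration; it only shows the matching logic under the assumption that header names are compared case-insensitively. The function names, the `[REDACTED]` placeholder, and the example URL are hypothetical.

```typescript
const REDACTED = "[REDACTED]";

// Redact Authorization, Cookie, and any header whose name contains "signature".
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    const lower = key.toLowerCase();
    const sensitive =
      lower === "authorization" || lower === "cookie" || lower.includes("signature");
    out[key] = sensitive ? REDACTED : value;
  }
  return out;
}

// Replace the query string of a signed PUT URL so signature parameters never reach the logs.
function stripQuery(url: string): string {
  const i = url.indexOf("?");
  return i === -1 ? url : `${url.slice(0, i)}?${REDACTED}`;
}

console.log(
  redactHeaders({ Authorization: "Bearer abc", "X-Goog-Signature": "deadbeef", "X-Request-Id": "req-1" }),
  stripQuery("https://storage.example.com/bucket/object?X-Goog-Signature=deadbeef"),
);
```

Keeping the path while dropping the query string preserves enough context to debug a failed upload without logging a usable signed URL.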

Notable log lines:

  • upload.initiate — at info level on success; warn on quota near 95 %.
  • upload.confirm.dedupe — info; carries aliasOf.
  • upload.confirm.hash_mismatch — warn; alert if rate spikes (potential client bug or attack).
  • download.issue — info for private/archive, sampled 1/100 for public_media (volume).
  • access.denied — warn; SIEM keys off this line in addition to the event.
  • scan.callback.passed|failed — info / warn.
  • scan.callback.late — warn; scan_completion_seconds > 60.
  • optimize.callback.failed — warn; attempts >= 3.
  • cdn.invalidate.failed — error; backlog alert.
  • erasure.completed — info with counts; erasure.partial warn.
  • retention.sweep.lag — warn at lag > 30 min, error at > 60 min.
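
Two of the rules in this list reduce to small decisions, sketched below under stated assumptions. The function names are hypothetical; returning null below the 30 min threshold (no log line at all) and using a per-process sequence counter for the 1/100 sampling are assumptions, since the list does not specify either.

```typescript
type LagLevel = "warn" | "error" | null;

// retention.sweep.lag: warn above 30 min of lag, error above 60 min, otherwise no line.
function retentionSweepLagLevel(lagSeconds: number): LagLevel {
  if (lagSeconds > 60 * 60) return "error";
  if (lagSeconds > 30 * 60) return "warn";
  return null;
}

// download.issue: always log private/archive scopes; log every 100th public_media download.
// `sequence` is a hypothetical monotonically increasing counter of issued downloads.
function shouldLogDownloadIssue(scope: string, sequence: number): boolean {
  if (scope !== "public_media") return true;
  return sequence % 100 === 0;
}

console.log(retentionSweepLagLevel(45 * 60), shouldLogDownloadIssue("public_media", 200));
```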

6. Audit log export

Domain events on these topics are dual-published to BigQuery via Pub/Sub-to-BQ subscription with a 7-year table retention:

  • melmastoon.file.scan.results.v1
  • melmastoon.file.access.v1 (denied; granted for non-public scopes only)
  • melmastoon.file.deletion.v1
  • melmastoon.file.retention.v1
  • melmastoon.file.erasure.v1
  • melmastoon.file.upload.lifecycle.v1

access_grants table is exported nightly to BigQuery with a 13-month rolling partition retention; older partitions are dropped from Postgres and queried from BigQuery for historical audits.

7. Alerts (Cloud Monitoring + PagerDuty)

Alert | Condition | Severity | Runbook
FileStorage_ErrorBudgetFastBurn | 14.4× burn over 1 h on availability_write | P1 | file/error-budget-burn
FileStorage_CrossTenantAccessDenied | ≥ 1 in 5 min from same actor | P1 | file/cross-tenant-leak-suspected
FileStorage_OutboxLagHigh | publish_lag p95 > 30 s for 5 min | P2 | file/outbox-lag
FileStorage_PubSubDLQGrowth | DLQ count > 10 in 5 min on any topic | P2 | file/dlq-growth
FileStorage_ScanLatencyHigh | scan_completion p95 > 60 s for 10 min | P2 | file/scan-latency
FileStorage_ScanQueueBacklog | ClamAV queue depth > 1000 for 5 min | P2 | file/quarantine-storm
FileStorage_QuarantineRateHigh | quarantine_rate_24h > 5 % for 30 min | P2 | file/quarantine-anomaly
FileStorage_QuotaBlock | any tenant blocked_at not null & uploads attempted ≥ 100/min | P3 | file/quota-runaway
FileStorage_OptimizerDLQ | optimizer DLQ depth > 50 | P3 | file/optimizer-dlq
FileStorage_CdnInvalidationBacklog | backlog > 50 for 15 min | P3 | file/cdn-invalidation-backlog
FileStorage_RetentionSweepLag | sweep_lag > 60 min | P2 | file/retention-sweep-lag
FileStorage_ErasureFailed | any erasure_request status='failed' or 'partial' | P2 | file/erasure-cert-failure
FileStorage_AIBudgetExhausted | budget refusals > 100/min for 10 min | P3 | file/ai-budget
FileStorage_HITLBacklog | hitl_oldest_age > 24 h | P3 | file/hitl-backlog
FileStorage_KMS_Unavailable | KMS error rate > 1 % for 5 min | P1 | file/kms-unavailable
FileStorage_DBConnPoolSaturated | in_use / max > 0.9 for 10 min | P2 | file/db-saturation
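
Several conditions in this table have the shape "value above threshold for N minutes". The sketch below shows one way to evaluate that shape over a series of samples: the alert fires only if the value has stayed above the threshold continuously for at least the stated duration. It is an assumption-laden illustration, not how Cloud Monitoring evaluates its policies; the `Sample` type and function name are hypothetical.

```typescript
interface Sample {
  atMs: number;  // sample timestamp in milliseconds
  value: number; // e.g. in_use / max for the DB pool
}

// True when the most recent run of consecutive samples above `threshold`
// spans at least `forMs`. Any sample at or below the threshold resets the run.
function sustainedAbove(samples: Sample[], threshold: number, forMs: number): boolean {
  const sorted = [...samples].sort((a, b) => a.atMs - b.atMs);
  let breachStart: number | null = null;
  let lastAt: number | null = null;
  for (const s of sorted) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs;
    } else {
      breachStart = null;
    }
    lastAt = s.atMs;
  }
  return breachStart !== null && lastAt !== null && lastAt - breachStart >= forMs;
}

// Hypothetical one-minute samples of pool utilisation over eleven minutes.
const minute = 60_000;
const poolSamples: Sample[] = Array.from({ length: 11 }, (_, i) => ({ atMs: i * minute, value: 0.95 }));
console.log(sustainedAbove(poolSamples, 0.9, 10 * minute));
```

Resetting on any dip keeps a short spike from paging, at the cost of delaying the alert when the value oscillates around the threshold.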

8. SLO computation pipeline

SLO numbers are computed in BigQuery from the Datastream-replicated file_objects table and the raw event topics. A scheduled query refreshes the slos_file_storage table every 5 min, and the Grafana datasource for the SLO panels reads from that table. Error-budget-burn alerts run on Cloud Monitoring metrics instead, for sub-minute responsiveness.

9. Cardinality controls

  • tenant_id label on per-tenant metrics is capped at the top 100 tenants by usage; remaining tenants roll up to tenant_id="other". Dashboards include both.
  • Avoid route cardinality explosion by templatizing IDs (/files/{med} not /files/med_01HXY...).
  • scope and data_class are bounded enums (at most 9 and 3 values, respectively).
  • actor.userId is never a metric label; only a log/span attribute.
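
The first two controls can be sketched as label-normalisation helpers. This is a minimal illustration with hypothetical function names. It assumes IDs in paths look like a lowercase prefix, an underscore, and an opaque alphanumeric part starting with a digit (as in the med_ example above); the full ID in the usage line is made up.

```typescript
// Pick the top-N tenants by usage; everyone else is reported as "other".
function topTenantsByUsage(usage: Map<string, number>, n = 100): Set<string> {
  const ranked = Array.from(usage.entries()).sort((a, b) => b[1] - a[1]);
  return new Set(ranked.slice(0, n).map(([tenantId]) => tenantId));
}

function tenantLabel(tenantId: string, topTenants: ReadonlySet<string>): string {
  return topTenants.has(tenantId) ? tenantId : "other";
}

// Replace ID-like path segments (prefix_opaqueId) with a {prefix} placeholder
// so the route label stays bounded.
function templatizeRoute(path: string): string {
  return path
    .split("/")
    .map((segment) => {
      const match = /^([a-z]+)_[0-9][0-9A-Za-z]+$/.exec(segment);
      return match ? `{${match[1]}}` : segment;
    })
    .join("/");
}

const usageBytes = new Map<string, number>([["t-a", 900], ["t-b", 500], ["t-c", 10]]);
const top2 = topTenantsByUsage(usageBytes, 2);
console.log(tenantLabel("t-c", top2), templatizeRoute("/files/med_01HXYEXAMPLE/download-url"));
```

Because membership in the top set changes over time, a tenant's series can move between its own label and "other"; dashboards that sum across both stay continuous.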

10. Validation

Three CI gates on this doc:

  1. Every metric named here must be emitted by source code (script bin/check-metrics.ts greps file_storage_* against the Prometheus collector registry).
  2. Every alert here must have a corresponding Terraform resource in infra/melmastoon/modules/file-storage/alerts.tf.
  3. Every runbook URL must resolve (404 fails CI).
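
The first gate can be sketched as a set comparison. This is not the contents of bin/check-metrics.ts; the function names are hypothetical, and it assumes the registry lists histograms under their base name, so the `_bucket` suffix used in this doc is stripped before comparing.

```typescript
// Collect metric names mentioned in the doc text. Histogram series appear here
// with a _bucket suffix, while a registry usually holds the base name.
function metricNamesInDoc(docText: string): string[] {
  const matches = docText.match(/\bfile_storage_[a-z0-9_]+/g) ?? [];
  const names = new Set(matches.map((name) => name.replace(/_bucket$/, "")));
  return Array.from(names).sort();
}

// Names documented but not registered; a non-empty result would fail the gate.
function missingFromRegistry(docText: string, registered: string[]): string[] {
  const have = new Set(registered);
  return metricNamesInDoc(docText).filter((name) => !have.has(name));
}

// Hypothetical doc excerpt and registry contents.
const docExcerpt = [
  "file_storage_http_requests_total{method,route,status}",
  "file_storage_db_query_duration_seconds_bucket{op,le}",
  "file_storage_cdn_invalidate_backlog",
].join("\n");
const registeredNames = ["file_storage_http_requests_total", "file_storage_db_query_duration_seconds"];
const missing = missingFromRegistry(docExcerpt, registeredNames);
console.log(missing.length === 0 ? "ok" : `undocumented in code: ${missing.join(", ")}`);
```

A real gate would read this file from disk and obtain the registered names from the running collector, then exit non-zero when the list is non-empty.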