Observability

:::info Source
Sourced from services/certification-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: certification.certificate.issued, .revoked, .verified (AUDIT), .render.started, .render.completed, .render.failed, .offline_claim.submitted, .offline_claim.verified / .rejected, .template.updated, .kid.rotated.

Attrs: certificate_id, template_id, course_version_id, verification_token_fingerprint.
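As an illustration of how these attributes might appear on a log line, here is a minimal sketch that serializes one event as JSON. The helper names (`token_fingerprint`, `build_log_event`), the `ts` and `event` field names, and the fingerprint scheme (SHA-256 truncated to 16 hex characters) are assumptions for the example, not the service's actual implementation.

```python
import hashlib
import json
from datetime import datetime, timezone


def token_fingerprint(token: str) -> str:
    """Return a short, non-reversible fingerprint of a verification token.

    Assumption: SHA-256 truncated to 16 hex characters; the real service
    may use a different scheme.
    """
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]


def build_log_event(event: str, certificate_id: str, template_id: str,
                    course_version_id: str, verification_token: str) -> str:
    """Serialize one log event as a single JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "certificate_id": certificate_id,
        "template_id": template_id,
        "course_version_id": course_version_id,
        # Only the fingerprint is logged, never the raw token.
        "verification_token_fingerprint": token_fingerprint(verification_token),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical identifiers, for illustration only.
print(build_log_event(
    "certification.certificate.issued",
    certificate_id="cert-123",
    template_id="tmpl-7",
    course_version_id="cv-42",
    verification_token="example-token",
))
```

Logging a fingerprint rather than the token itself lets operators correlate verification events without exposing a value that could be replayed.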

2. Metrics

RED

  • cert_api_requests_total{endpoint,status} counter
  • cert_api_duration_seconds{endpoint} histogram

Domain

  • cert_issued_total{tenant_id,template_id} counter
  • cert_revoked_total{reason} counter
  • cert_verification_total{result=issued|revoked|not_found|invalid} counter
  • cert_offline_claim_total{status=verified|rejected} counter
  • cert_render_duration_seconds{format=pdf|png|openbadges} histogram
  • cert_issuance_latency_seconds histogram (completion event → cert visible, target p95 < 10s)
  • cert_public_verify_cache_hit_ratio gauge
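The p95 target on `cert_issuance_latency_seconds` has to be estimated from histogram buckets. The sketch below shows one common way to do that: linear interpolation inside the bucket that contains the target rank, similar in spirit to Prometheus' `histogram_quantile`. The bucket boundaries and counts are made-up sample data, and the function is a simplified illustration rather than the service's query.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket. Observations are assumed to
    be spread evenly inside each bucket (linear interpolation).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for cert_issuance_latency_seconds.
sample_buckets = [(1.0, 50), (2.5, 80), (5.0, 90), (10.0, 98), (30.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
print(f"p95 = {p95:.3f}s, within 10s target: {p95 < 10}")
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; placing a boundary at the 10s target keeps the comparison against the target exact.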

USE

  • cert_outbox_lag_seconds gauge
  • cert_render_queue_depth gauge

3. Traces

Spans: cert.issue, cert.render.pdf, cert.render.png, cert.sign_jws, cert.verify.public, cert.offline_claim.verify.
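To make the span names concrete, here is a small standard-library sketch of how nested spans could be recorded. The `Tracer` class is a hypothetical stand-in for a real tracing SDK, and the parent/child relationship shown (`cert.issue` wrapping `cert.render.pdf` and `cert.sign_jws`) is an assumption about how the spans nest, not something the source states.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple


class Tracer:
    """Toy tracer that records (name, parent_name, duration_seconds) per span."""

    def __init__(self) -> None:
        self.finished: List[Tuple[str, Optional[str], float]] = []
        self._stack: List[str] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Spans are recorded when they end, so children finish before parents.
            self._stack.pop()
            self.finished.append((name, parent, time.perf_counter() - start))


tracer = Tracer()
with tracer.span("cert.issue"):
    with tracer.span("cert.render.pdf"):
        pass  # rendering work would happen here
    with tracer.span("cert.sign_jws"):
        pass  # signing work would happen here

for name, parent, duration in tracer.finished:
    print(f"{name} (parent={parent}) took {duration:.6f}s")
```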

4. Dashboards

  • Issuance funnel: completion events → certs issued rate.
  • Verification: total, per-tenant, cache hit.
  • Rendering: latency + failure rate.
  • Revocations: rate + reason distribution.
  • JWKS: serve rate, rotations.

5. Alerts

| Alert | Threshold | Severity |
| --- | --- | --- |
| issuance-latency-high | p95 > 30s | P2 |
| issuance-failure-rate | > 1% | P2 |
| revocation-spike | > 100/hour for a tenant | P2 (may be AI flag candidate) |
| public-verify-4xx | > 10% | P3 |
| offline-claim-rejection-spike | > 10% | P2 |
| render-queue-backlog | > 500 | P2 |
| kid-rotation-pending-overdue | | P3 |
| jwks-serve-error | > 0.1% | P1 (verification broken) |
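The numeric thresholds above can be expressed as data and checked against a metrics snapshot. The following is a minimal sketch covering only the rules with a single numeric threshold; the per-tenant revocation rule and the key-rotation rule are left out. The snapshot keys (for example `issuance_p95_seconds`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    fires: Callable[[Dict[str, float]], bool]


# Thresholds taken from the alert table; ratios are expressed as fractions.
RULES: List[AlertRule] = [
    AlertRule("issuance-latency-high", "P2", lambda m: m["issuance_p95_seconds"] > 30),
    AlertRule("issuance-failure-rate", "P2", lambda m: m["issuance_failure_ratio"] > 0.01),
    AlertRule("public-verify-4xx", "P3", lambda m: m["public_verify_4xx_ratio"] > 0.10),
    AlertRule("offline-claim-rejection-spike", "P2",
              lambda m: m["offline_claim_rejection_ratio"] > 0.10),
    AlertRule("render-queue-backlog", "P2", lambda m: m["render_queue_depth"] > 500),
    AlertRule("jwks-serve-error", "P1", lambda m: m["jwks_error_ratio"] > 0.001),
]


def firing_alerts(snapshot: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs that fire, most severe first."""
    firing = [(rule.name, rule.severity) for rule in RULES if rule.fires(snapshot)]
    return sorted(firing, key=lambda pair: pair[1])


# Hypothetical snapshot: the render queue is backed up and JWKS is erroring.
snapshot = {
    "issuance_p95_seconds": 12.0,
    "issuance_failure_ratio": 0.002,
    "public_verify_4xx_ratio": 0.04,
    "offline_claim_rejection_ratio": 0.03,
    "render_queue_depth": 750,
    "jwks_error_ratio": 0.002,
}
print(firing_alerts(snapshot))
```

In practice these rules would more likely live in the alerting system's own rule format; the sketch only shows how the thresholds map to comparisons.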

6. SLOs

| SLI | Target |
| --- | --- |
| Issuance pipeline p95 (completion → cert) | < 10s |
| Public /verify p95 | < 100ms (cache hit); < 300ms (miss) |
| Issuance success rate | ≥ 99.9% |
| JWKS availability | 99.999% |
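To get a feel for what the percentage targets allow, the sketch below converts them into error budgets. The 30-day window and the request volume are assumptions for illustration; the source does not state a measurement window. `Fraction` is used so the percentages are handled exactly rather than with floating-point rounding.

```python
from fractions import Fraction


def error_budget(target_percent: str) -> Fraction:
    """Fraction of requests (or time) allowed to fail for a given SLO target."""
    return 1 - Fraction(target_percent) / 100


def allowed_downtime_seconds(target_percent: str, window_days: int = 30) -> float:
    """Downtime allowed in the window for an availability target."""
    return float(error_budget(target_percent) * window_days * 24 * 3600)


def allowed_failures(target_percent: str, total_requests: int) -> int:
    """Whole number of failed requests allowed out of total_requests."""
    return int(error_budget(target_percent) * total_requests)


# JWKS availability 99.999% over an assumed 30-day window.
print(allowed_downtime_seconds("99.999"))
# Issuance success rate 99.9% over a hypothetical 1,000,000 issuances.
print(allowed_failures("99.9", 1_000_000))
```

Under these assumptions the JWKS target leaves roughly 26 seconds of downtime per 30 days, which is why its serve-error alert is the only P1 in the table.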

7. RUM

  • Learner certificate page LCP < 1.5s.
  • Verifier page LCP < 1s (mostly cached).

8. Public Verify Analytics

  • Geographic distribution of verifies.
  • User-agent distribution (helps detect bot scraping).
  • Anomaly: the same certificate token verified 10,000 times within 1 minute triggers an alert.
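The per-token anomaly rule can be sketched as a sliding-window counter. This is a minimal in-memory illustration; the `VerifyRateMonitor` class is hypothetical, and a real deployment would more likely count in a shared store or in the metrics pipeline. Tokens are keyed by fingerprint so the raw token is never held by the monitor.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VerifyRateMonitor:
    """Flags a token once it is verified `threshold` times within the window."""

    def __init__(self, threshold: int = 10_000, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, token_fingerprint: str, now: float) -> bool:
        """Record one verification at time `now` (seconds); True means alert."""
        window = self._events[token_fingerprint]
        window.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while window and window[0] <= cutoff:
            window.popleft()
        return len(window) >= self.threshold


# Small threshold for demonstration; the rule above uses 10,000 per minute.
monitor = VerifyRateMonitor(threshold=3, window_seconds=60.0)
results = [monitor.record("fp-abc", t) for t in (0.0, 10.0, 20.0, 100.0)]
print(results)
```

The third call trips the alert because three verifications fall inside one 60-second window; by the fourth call the earlier timestamps have expired, so the count resets.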