sender-id-registry-service — Observability
Version: 1.0
Status: Draft
Owner: Trust & Safety + Platform SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · APPLICATION_LOGIC · FAILURE_MODES
1. Prometheus Metrics
All metrics are exposed at GET /metrics on port 3091 in Prometheus text format.
1.1 Verify gRPC (hot path)
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_verify_requests_total | Counter | status, caller_service | Verify requests by status (ACTIVE/SUSPENDED/REVOKED/UNKNOWN/TENANT_MISMATCH/PENDING/error) |
| sid_verify_duration_seconds | Histogram | status | End-to-end gRPC Verify latency |
| sid_verify_cache_hit_total | Counter | cache_layer (redis_l1, local_l2) | Cache hits by layer |
| sid_verify_cache_miss_total | Counter | — | Cache misses that required a DB lookup |
| sid_verify_db_unavail_total | Counter | — | Verify calls that fell back to UNKNOWN because the DB was unavailable |
| sid_get_reputation_requests_total | Counter | — | GetReputation calls |
| sid_get_reputation_duration_seconds | Histogram | — | GetReputation latency |
| sid_batch_verify_requests_total | Counter | batch_size_bucket | BatchVerify calls |
Histogram buckets for sid_verify_duration_seconds, as upper bounds in seconds (0.5 ms to 500 ms, sized for sub-millisecond to low-millisecond targets):
[0.0005, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
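To make the bucket layout concrete, here is a minimal sketch of how a single latency observation is counted against cumulative Prometheus-style buckets. The `Histogram` shape and the `newHistogram`, `observe`, and `smallestBucket` helpers are hypothetical names for illustration only; the service presumably relies on a metrics client library rather than hand-rolled code like this.

```typescript
// Upper bounds in seconds, copied from the bucket list above.
const VERIFY_BUCKETS: number[] = [0.0005, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5];

// Hypothetical in-memory histogram (illustration only).
interface Bucket {
  le: number;    // inclusive upper bound in seconds
  count: number; // cumulative: observations <= le
}

interface Histogram {
  buckets: Bucket[];
  infCount: number; // the implicit le="+Inf" bucket, i.e. all observations
  sum: number;      // sum of all observed values in seconds
}

function newHistogram(upperBounds: number[]): Histogram {
  return { buckets: upperBounds.map((le) => ({ le, count: 0 })), infCount: 0, sum: 0 };
}

// Buckets are cumulative, so one observation increments every bucket
// whose upper bound is at or above the observed value.
function observe(h: Histogram, seconds: number): void {
  for (const b of h.buckets) {
    if (seconds <= b.le) b.count += 1;
  }
  h.infCount += 1;
  h.sum += seconds;
}

// Smallest bucket bound that contains the value, or "+Inf" if none does.
function smallestBucket(upperBounds: number[], seconds: number): string {
  const le = upperBounds.find((bound) => seconds <= bound);
  return le === undefined ? "+Inf" : String(le);
}
```

Under these assumptions, a 1.2 ms Verify call (0.0012 s) first lands in the 0.002 bucket, while anything slower than 500 ms is only visible in +Inf. That is why the 0.5 s top bound matters for tail-latency alerts.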
1.2 Registration & KYC
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_submissions_total | Counter | type, category, restricted_pattern_matched | New submissions |
| sid_kyc_review_decisions_total | Counter | decision (APPROVE/REJECT/REQUEST_INFO), reviewer_role | Reviewer actions |
| sid_kyc_review_age_seconds | Histogram | decision | Time from submission to decision |
| sid_kyc_review_sla_breach_total | Counter | — | Submissions past 5-business-day SLA |
| sid_kyc_doc_view_total | Counter | viewer_role | KYC document inline views |
| sid_kyc_doc_view_per_reviewer | Gauge | reviewer_user_id | Daily KYC view count per reviewer (for outlier detection) |
| sid_kyc_doc_tamper_detected_total | Counter | — | Nightly hash-mismatch alerts |
| sid_kyc_upload_failures_total | Counter | reason (s3_unavailable/encryption_failed/hash_mismatch/too_large) | KYC upload failures |
1.3 Verification
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_verifications_started_total | Counter | method | Verification challenges initiated |
| sid_verifications_succeeded_total | Counter | method | Successful verifications |
| sid_verifications_failed_total | Counter | method, reason | Failed verifications |
| sid_otp_attempts_total | Counter | result (success/wrong/expired/max_attempts) | OTP submission attempts |
| sid_otp_issuance_rate_limited_total | Counter | — | OTP issuance requests rejected by the rate limit (3/h cap) |
| sid_dns_check_duration_seconds | Histogram | resolver | DNS-TXT resolution latency |
| sid_dns_check_failures_total | Counter | reason (nxdomain/timeout/mismatch/sole_resolver) | DNS verification failures |
| sid_notarised_dual_control_blocked_total | Counter | — | Same-reviewer dual-control attempts blocked |
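The 3/h OTP issuance cap behind sid_otp_issuance_rate_limited_total could be enforced with a sliding-window limiter along the following lines. This is only a sketch: the `OtpIssuanceLimiter` class, the choice of rate-limit key, and the in-memory storage are assumptions, and the document does not say how the real limiter is keyed or stored.

```typescript
// Hypothetical in-memory sliding-window limiter (illustration only).
// A multi-instance deployment would need shared state instead.
class OtpIssuanceLimiter {
  private readonly issuedAt = new Map<string, number[]>();
  private readonly maxPerWindow: number;
  private readonly windowMs: number;

  constructor(maxPerWindow: number = 3, windowMs: number = 60 * 60 * 1000) {
    this.maxPerWindow = maxPerWindow; // "3/h cap" from the table above
    this.windowMs = windowMs;
  }

  // Returns true if an OTP may be issued now, false if the request is rate-limited.
  // `key` is whatever the limit is scoped to (assumed here: one sender-ID).
  tryIssue(key: string, nowMs: number): boolean {
    // Keep only issuances that are still inside the window.
    const recent = (this.issuedAt.get(key) ?? []).filter((t) => nowMs - t < this.windowMs);
    if (recent.length >= this.maxPerWindow) {
      this.issuedAt.set(key, recent);
      return false; // the caller would increment sid_otp_issuance_rate_limited_total here
    }
    recent.push(nowMs);
    this.issuedAt.set(key, recent);
    return true;
  }
}
```

With the defaults, three issuances inside one hour succeed and a fourth is rejected; capacity frees up again once the oldest issuance is a full hour old.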
1.4 State machine
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_state_transitions_total | Counter | from_state, to_state, trigger (manual/auto/system) | All state transitions |
| sid_active_count | Gauge | category | Currently ACTIVE sender-IDs by category |
| sid_suspended_count | Gauge | reason_class (manual/reputation/fraud) | Currently SUSPENDED sender-IDs |
| sid_revoked_count | Gauge | — | Currently REVOKED (12-mo reservation) |
| sid_pending_review_count | Gauge | — | SUBMITTED + KYC_REVIEW + INFO_REQUESTED |
1.5 Reputation
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_reputation_score | Gauge | band (AUTO_SUSPEND/POOR/NEUTRAL/GOOD/EXCELLENT) | Distribution histogram |
| sid_reputation_band_distribution | Gauge | band | Count of sender-IDs per band |
| sid_reputation_band_transitions_total | Counter | from_band, to_band | Band crossings |
| sid_reputation_cron_duration_seconds | Histogram | — | Daily cron cycle time |
| sid_reputation_cron_failures_total | Counter | reason | Failed cron runs |
| sid_reputation_cron_last_success_timestamp | Gauge | — | Unix timestamp of the last successful cron run (used by the SidReputationCronStale alert) |
| sid_reputation_intraday_deltas_total | Counter | source (fraud/compliance/firewall/complaint) | Intra-day delta events processed |
| sid_auto_suspend_total | Counter | — | Sender-IDs auto-suspended due to reputation < 30 |
1.6 Public search
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_public_search_requests_total | Counter | result (hit/miss/rate_limited/tarpit) | Public search calls |
| sid_public_search_duration_seconds | Histogram | cache_layer | Latency |
| sid_public_search_per_ip_distinct_queries | Gauge | — | Top-N IPs by distinct-query count |
| sid_public_search_abuse_blocks_total | Counter | — | IPs soft-blocked for abuse pattern |
1.7 Regulator export
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_regulator_export_total | Counter | triggered_by, result | Export generations |
| sid_regulator_export_duration_seconds | Histogram | — | Generation time |
| sid_regulator_export_row_count | Histogram | — | Rows per export |
| sid_regulator_export_sftp_failures_total | Counter | — | SFTP transmission failures |
| sid_regulator_export_signing_duration_seconds | Histogram | — | HSM signing latency |
| sid_regulator_export_lag_seconds | Gauge | — | Time since last successful export |
1.8 Outbox / NATS
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_outbox_publish_total | Counter | result (ok/retry/dead) | Outbox publishes |
| sid_outbox_lag_rows | Gauge | — | Unpublished rows count |
| sid_outbox_lag_seconds | Gauge | — | Oldest unpublished row age |
| sid_inbox_processed_total | Counter | subject | Consumed events |
| sid_inbox_dedup_hit_total | Counter | subject | Duplicate events ignored |
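The relationship between sid_inbox_processed_total and sid_inbox_dedup_hit_total can be illustrated with a small idempotent-consumer sketch. The `InboxDeduplicator` class is hypothetical, and keying on an event ID held in an in-memory set is an assumption; a real inbox would presumably persist seen IDs so that deduplication survives restarts.

```typescript
// Hypothetical in-memory inbox deduplicator (illustration only).
class InboxDeduplicator {
  private readonly seen = new Set<string>();
  processed = 0; // stands in for sid_inbox_processed_total
  dedupHits = 0; // stands in for sid_inbox_dedup_hit_total

  // Runs the handler only the first time an eventId is seen.
  // Returns true if the event was handled, false if it was a duplicate.
  handle(eventId: string, handler: () => void): boolean {
    if (this.seen.has(eventId)) {
      this.dedupHits += 1;
      return false;
    }
    handler();
    // Mark as seen only after the handler returned without throwing,
    // so a failed event can be redelivered and retried.
    this.seen.add(eventId);
    this.processed += 1;
    return true;
  }
}
```

Marking the event as seen after the handler completes trades a possible double execution on crash for never silently dropping an event; the opposite ordering would make the opposite trade.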
1.9 AI assist
| Metric | Type | Labels | Description |
|---|---|---|---|
| sid_ai_kyc_ocr_duration_seconds | Histogram | — | OCR latency |
| sid_ai_llm_kyc_validation_duration_seconds | Histogram | — | LLM analysis latency |
| sid_ai_impersonation_flag_total | Counter | confidence_bucket | AI impersonation detection |
| sid_ai_restricted_fuzzy_match_total | Counter | — | Embedding-based restricted matches |
| sid_ai_llm_invalid_output_total | Counter | — | LLM output failed schema validation |
| sid_ai_reviewer_override_total | Counter | ai_signal_type, agreed | Reviewer agreement with AI signal |
2. Structured Log Events
All log output is valid JSON (Pino format). The log level is controlled by LOG_LEVEL.
2.1 Verify
{
  "level": "debug",
  "time": "2026-04-21T08:14:02.123Z",
  "event": "sid.verify",
  "senderIdValue": "BANK-XYZ",
  "senderIdType": "ALPHA",
  "tenantId": "t_...",
  "callerService": "compliance-engine",
  "status": "ACTIVE",
  "currentLevel": "NOTARISED",
  "hasDomainDns": true,
  "reputationScore": 87,
  "cacheHit": true,
  "latencyMs": 1.2,
  "traceId": "abc",
  "spanId": "def"
}
2.2 KYC review decision
{
  "level": "info",
  "event": "sid.kyc.decision",
  "senderIdInternalId": "sid_...",
  "value": "BANK-XYZ",
  "decision": "APPROVE",
  "reviewerUserId": "u_...",
  "reviewerRole": "platform.sid.reviewer",
  "reviewAgeSeconds": 86400,
  "restrictedPatternMatched": true,
  "kycDocCount": 3,
  "aiSignals": {
    "impersonationRiskScore": 0.42,
    "forgeryIndicatorsCount": 0
  }
}
2.3 Suspension
{
  "level": "warn",
  "event": "sid.suspended",
  "senderIdInternalId": "sid_...",
  "value": "SOMEBRAND",
  "trigger": "AUTO_REPUTATION",
  "reputationAtSuspension": 22,
  "reason": "LOW_REPUTATION",
  "previousState": "ACTIVE"
}
2.4 KYC document view (audit-class)
{
  "level": "info",
  "event": "sid.kyc.view",
  "kycDocId": "kyc_...",
  "senderIdInternalId": "sid_...",
  "viewerUserId": "u_...",
  "viewerRole": "platform.sid.reviewer",
  "viewerIp": "10.4.2.18",
  "watermarkApplied": true,
  "watermarkArtefactKey": "s3://ghasi-sid-kyc-views-kbl/u_.../kyc_....pdf"
}
2.5 Public search abuse detection
{
  "level": "warn",
  "event": "sid.public_search.abuse",
  "ip": "203.0.113.42",
  "distinctQueries1h": 1247,
  "rateRpm": 158,
  "action": "TARPIT_AND_ALERT"
}
2.6 Regulator export
{
  "level": "info",
  "event": "sid.regulator.export",
  "exportId": "exp_...",
  "triggeredBy": "CRON",
  "windowFrom": "2026-04-20T00:00:00Z",
  "windowTo": "2026-04-21T00:00:00Z",
  "rowCount": 12482,
  "format": "JSONL",
  "signedAtMs": 412,
  "transmittedToAtraMs": 8421,
  "acknowledgmentRef": "ATRA-RX-2026-04-21-0001"
}
The Pino redactor masks registrantContactMsisdn and registrantContactEmail, and never allows kyc_doc_content or otp_plaintext to appear in log output at all.
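The redaction rule above can be sketched as a plain function that masks some fields and drops others. This is not Pino's API (Pino ships its own `redact` option, which the service presumably configures); the `redact` helper, the `[REDACTED]` mask text, and the recursive handling of nested objects are all assumptions made for illustration.

```typescript
type LogRecord = Record<string, unknown>;

// Field names come from the redaction rule above; the mask text is an assumption.
const MASKED_FIELDS = new Set<string>(["registrantContactMsisdn", "registrantContactEmail"]);
const FORBIDDEN_FIELDS = new Set<string>(["kyc_doc_content", "otp_plaintext"]);
const MASK = "[REDACTED]";

// Returns a deep copy of the value with masked fields censored and
// forbidden fields removed, at any nesting depth. The input is not mutated.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== "object") return value;
  const out: LogRecord = {};
  for (const [key, inner] of Object.entries(value as LogRecord)) {
    if (FORBIDDEN_FIELDS.has(key)) continue; // dropped entirely
    out[key] = MASKED_FIELDS.has(key) ? MASK : redact(inner);
  }
  return out;
}
```

Masking keeps the key visible, so a reader can tell the field was present, whereas dropping leaves no trace of the forbidden fields in the emitted record.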
3. OpenTelemetry Tracing
Parent span: sid.Verify (or sid.SubmitRegistration, sid.KycDecision, sid.RegulatorExport).
| Span | Operation | Attributes |
|---|---|---|
| sid.cache.check | Redis GET | cache.hit, cache.layer |
| sid.db.lookup | PG SELECT sender_ids | db.rows, state |
| sid.kyc.encrypt | Vault Transit encrypt | tenant_id, kek_ref |
| sid.kyc.upload | S3 PUT | bytes, mime |
| sid.dns.resolve | DoT TXT lookup | domain, resolver, consensus |
| sid.ai.ocr | OCR sidecar | model, confidence |
| sid.ai.llm | local-llm | model_version, cache_hit, latency_ms |
| sid.outbox.write | PG INSERT outbox | subject, event_id |
| sid.outbox.publish | NATS publish | subject, attempt, result |
| sid.regulator.sign | HSM sign | key_slot, latency_ms |
| sid.regulator.sftp | SFTP put | host, bytes, latency_ms |
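Trace context is propagated from the incoming gRPC metadata. grpc-trace-bin is the binary, OpenCensus-style gRPC header; W3C Trace Context is carried in the traceparent and tracestate entries instead.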
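For the W3C Trace Context format mentioned above, the following sketch parses a version-00 `traceparent` value into its parts. In practice an OpenTelemetry propagator would do this; the `parseTraceparent` function and `TraceContext` type are hypothetical names, and the parser is deliberately strict (it rejects anything other than version 00).

```typescript
interface TraceContext {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed context, or null if the header is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === undefined || traceId === undefined || parentId === undefined || flags === undefined) {
    return null;
  }
  if (version !== "00") return null;                  // simplification: only version 00
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

The resulting traceId and parentId correspond to the traceId and spanId fields shown in the sid.verify log event, which is what lets a log line be joined to its trace.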
4. Alerting Rules
groups:
  - name: sender-id-registry
    rules:
      - alert: SidVerifyLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(sid_verify_duration_seconds_bucket[5m])) by (le)) > 0.010
        for: 5m
        labels: { severity: high, team: platform-sre }
        annotations:
          summary: "sid.Verify P95 > 10ms (target ≤ 5ms)"
          runbook: "https://runbooks.ghasi.local/sid/verify-latency"
      - alert: SidVerifyLatencyCritical
        expr: histogram_quantile(0.99, sum(rate(sid_verify_duration_seconds_bucket[5m])) by (le)) > 0.050
        for: 2m
        labels: { severity: critical, team: platform-sre }
      - alert: SidVerifyDbUnavailable
        expr: rate(sid_verify_db_unavail_total[5m]) > 1
        for: 2m
        labels: { severity: critical, team: platform-sre }
        annotations:
          summary: "Verify falling back to UNKNOWN due to DB unavailability — compliance HOLD storm imminent"
      # A raw counter stays above 0 forever after the first breach, so alert on recent increase.
      - alert: SidKycReviewSlaBreach
        expr: increase(sid_kyc_review_sla_breach_total[24h]) > 0
        for: 0m
        labels: { severity: medium, team: trust-safety }
        annotations:
          summary: "{{ $value }} registrations went past the 5-business-day SLA in the last 24h"
      - alert: SidKycBulkAccessAlert
        expr: sid_kyc_doc_view_per_reviewer > 100
        for: 10m
        labels: { severity: high, team: security }
        annotations:
          summary: "Reviewer {{ $labels.reviewer_user_id }} viewed > 100 KYC docs / day — investigate"
      - alert: SidPublicSearchAbuse
        expr: rate(sid_public_search_requests_total{result="rate_limited"}[5m]) > 10
        for: 5m
        labels: { severity: medium, team: trust-safety }
      # rate() is per second; increase() gives the count over the 1h window the summary describes.
      - alert: SidReputationCollapse
        expr: increase(sid_auto_suspend_total[1h]) > 50
        for: 10m
        labels: { severity: high, team: trust-safety }
        annotations:
          summary: "Auto-suspension storm — > 50 sender-IDs auto-suspended in 1h. Possible reputation-formula bug or coordinated attack."
      # The cron runs daily, so the threshold follows the 25 h freshness SLO (90000 s).
      - alert: SidReputationCronStale
        expr: time() - sid_reputation_cron_last_success_timestamp > 90000
        for: 5m
        labels: { severity: high, team: trust-safety }
        annotations:
          summary: "Reputation cron has not completed in > 25 h"
      - alert: SidRegulatorExportLag
        expr: sid_regulator_export_lag_seconds > 172800
        for: 5m
        labels: { severity: high, team: regulator-liaison }
        annotations:
          summary: "No successful regulator export for > 48h"
      - alert: SidExportSignerDown
        expr: rate(sid_regulator_export_total{result="signing_failed"}[15m]) > 0
        for: 5m
        labels: { severity: critical, team: security }
        annotations:
          summary: "HSM signing failing for regulator export"
      - alert: SidOutboxLag
        expr: sid_outbox_lag_rows > 1000
        for: 5m
        labels: { severity: high, team: platform-sre }
      - alert: SidNotarisedDualControlAttempt
        expr: rate(sid_notarised_dual_control_blocked_total[1h]) > 0
        for: 0m
        labels: { severity: high, team: security }
        annotations:
          summary: "Same-reviewer attempted both notarised approvals — possible insider abuse"
      - alert: SidSplitBrainConflict
        expr: rate(sid_replication_conflict_total[15m]) > 0
        for: 0m
        labels: { severity: critical, team: platform-sre }
        annotations:
          summary: "Multi-region conflict detected on sender-IDs — manual reconciliation needed"
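The latency alerts above depend on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for intuition only: it is not Prometheus code, the `histogramQuantile` name and `CumulativeBucket` type are hypothetical, and edge cases such as negative bounds are ignored.

```typescript
interface CumulativeBucket {
  le: number;    // upper bound in seconds; the last bucket uses Infinity
  count: number; // cumulative count of observations <= le
}

// Simplified quantile estimate over cumulative buckets.
function histogramQuantile(q: number, buckets: CumulativeBucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.le !== Infinity || last.count === 0 || q < 0 || q > 1) {
    return NaN;
  }
  const rank = q * last.count; // how many observations lie at or below the quantile
  let lowerBound = 0;          // the lowest bucket is assumed to start at 0
  let lowerCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank falls in +Inf: the best available answer is the highest finite bound.
      if (b.le === Infinity) return lowerBound;
      const inBucket = b.count - lowerCount;
      if (inBucket === 0) return lowerBound;
      // Assume observations are spread evenly inside the bucket.
      return lowerBound + (b.le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = b.le;
    lowerCount = b.count;
  }
  return NaN;
}
```

As a worked example under these assumptions: with cumulative counts of 50 at le=0.002, 90 at le=0.005, and 100 at both le=0.01 and +Inf, the P95 rank is 95, which falls halfway through the 0.005 to 0.01 bucket, giving roughly 0.0075 s. That is below the 0.010 threshold, so SidVerifyLatencyHigh would not fire. The estimate can never be finer than the bucket layout, which is why the buckets are dense around the 5 ms target.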
5. Grafana Dashboard Panels
Dashboard: dashboards/sender-id-registry.json.
| Panel | Query | Visualization |
|---|---|---|
| Verify P50/P95/P99 latency | histogram_quantile(q, sum by (le) (rate(sid_verify_duration_seconds_bucket[5m]))) for q = 0.5, 0.95, 0.99 | Multi-line time series |
| Verify rate by status | sum by (status) (rate(sid_verify_requests_total[5m])) | Stacked area |
| Verify cache hit rate | sum(sid_verify_cache_hit_total) / (sum(sid_verify_cache_hit_total) + sum(sid_verify_cache_miss_total)) | Gauge |
| Submissions per day | increase(sid_submissions_total[1d]) | Bar chart |
| KYC review queue depth | sid_pending_review_count | Time series + threshold |
| KYC review SLA tracker | sid_kyc_review_age_seconds heatmap | Heatmap |
| KYC views per reviewer (daily) | topk(10, sid_kyc_doc_view_per_reviewer) | Bar chart |
| State distribution | sid_active_count, sid_suspended_count, sid_revoked_count | Pie chart |
| Reputation band distribution | sum by (band) (sid_reputation_band_distribution) | Pie chart |
| Reputation band transitions (24h) | sum by (from_band, to_band) (increase(sid_reputation_band_transitions_total[24h])) | Heatmap |
| Auto-suspension rate | rate(sid_auto_suspend_total[1h]) | Time series |
| Verification success rate by method | sum by (method) (sid_verifications_succeeded_total) / sum by (method) (sid_verifications_started_total) | Gauge × 4 |
| OTP issuance rate-limited | rate(sid_otp_issuance_rate_limited_total[5m]) | Time series |
| Public search abuse blocks | rate(sid_public_search_abuse_blocks_total[1h]) | Time series |
| Regulator export status | sid_regulator_export_lag_seconds + last success ts | Stat + gauge |
| Outbox lag | sid_outbox_lag_rows, sid_outbox_lag_seconds | Time series |
| AI assist agreement rate | sum by (ai_signal_type) (sid_ai_reviewer_override_total{agreed="true"}) / sum by (ai_signal_type) (sid_ai_reviewer_override_total) | Gauge per signal type |
6. SLI / SLO
| SLI | SLO | Window |
|---|---|---|
| Verify P95 latency | ≤ 5 ms | 7 days |
| Verify P99 latency | ≤ 15 ms | 7 days |
| Verify availability (non-error rate) | ≥ 99.99% | 30 days |
| GetReputation P95 | ≤ 5 ms | 7 days |
| KYC review SLA | 95% reviewed within 5 business days | 30 days |
| Reputation cron freshness | last successful run ≤ 25 h old | continuous |
| Regulator export freshness | ≤ 48 h since last successful export | continuous |
| Outbox lag | < 1000 unpublished rows | continuous |
These SLOs are bound to the platform NFR catalog (docs/15-nfr-sla-catalog.md).
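To show what the 99.99% Verify availability target implies, here is a minimal sketch that turns request counts into an availability figure and an error-budget consumption ratio. The function names are hypothetical, and treating the SLI as a simple ratio of non-error requests to total requests over the window is an assumption based on the "non-error rate" wording in the table.

```typescript
// Fraction of requests in the window that did not end in an error.
function availability(totalRequests: number, errorRequests: number): number {
  if (totalRequests === 0) return 1; // no traffic: treat as fully available
  return (totalRequests - errorRequests) / totalRequests;
}

// Share of the error budget used so far: 1.0 means the budget is exactly spent.
function errorBudgetConsumed(totalRequests: number, errorRequests: number, sloTarget: number): number {
  if (totalRequests === 0) return 0;
  const allowedErrorRatio = 1 - sloTarget; // about 0.0001 for a 99.99% target
  return errorRequests / totalRequests / allowedErrorRatio;
}
```

For example, 50 errors out of 1,000,000 Verify calls in a 30-day window gives an availability of 99.995% and uses about half of the budget, while 150 errors would overspend it by roughly 50%. A 99.99% target leaves room for only about one failed call in every 10,000.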