OTEL Ingest Guide
OpenTelemetry (OTEL) lets ServiceRadar receive traces, metrics, and logs from cloud-native workloads. The platform includes an OTLP gateway that normalizes telemetry before it lands in CNPG and the ServiceRadar registry.
Endpoint Overview
- Protocol: OTLP over gRPC (`0.0.0.0:4317`) and OTLP over HTTP (`0.0.0.0:4318`).
- Kubernetes: access the service via `serviceradar-otel` (ClusterIP by default). For internet exposure, front it with an ingress or load balancer that terminates TLS.
- Docker Compose: ports 4317/4318 map directly to the host for local deployments.
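As a sketch, a collector running in the same cluster could export to the gateway on either port like this (the `serviceradar` namespace in the DNS name is an assumption; adjust it to your deployment):

```yaml
# Illustrative OTLP exporter settings for an in-cluster collector.
# The service DNS name assumes a "serviceradar" namespace.
exporters:
  otlp:        # gRPC on 4317
    endpoint: serviceradar-otel.serviceradar.svc.cluster.local:4317
  otlphttp:    # HTTP on 4318
    endpoint: http://serviceradar-otel.serviceradar.svc.cluster.local:4318
```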
Authentication
- Require client certificates by enabling mTLS in the gateway deployment. Reuse the certificates generated in the Self-Signed Certificates guide or your enterprise PKI.
- If you must expose OTLP to untrusted networks, front the OTLP service with an ingress/load balancer that terminates TLS and enforce network policy or source IP allow-lists.
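On the collector side, mTLS toward the gateway is configured through the exporter's `tls` block. A minimal sketch, assuming certificates laid out as in the Self-Signed Certificates guide (the paths and hostname are illustrative):

```yaml
exporters:
  otlp:
    endpoint: serviceradar-otel.example.com:4317
    tls:
      ca_file: /etc/otel/certs/root.pem        # CA that signed the gateway certificate
      cert_file: /etc/otel/certs/client.pem    # client certificate presented for mTLS
      key_file: /etc/otel/certs/client-key.pem # private key for the client certificate
```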
Pipeline Configuration
- Point OTEL Collectors at the ServiceRadar OTLP endpoint.
- Configure resource attributes (`service.name`, `deployment.environment`, `account`) so SRQL filters can scope telemetry.
- Enable span metrics export if you plan to correlate traces with SNMP or NetFlow (see the SRQL reference).
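One way to guarantee those attributes are present is the collector's `resource` processor, which upserts values before export. A sketch with placeholder values (`service.name` is normally set by the SDK rather than the collector):

```yaml
processors:
  resource:
    attributes:
      - key: service.name
        value: payments-api      # placeholder; usually supplied by the instrumented service
        action: upsert
      - key: deployment.environment
        value: production
        action: upsert
      - key: account
        value: acme-prod
        action: upsert
```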
Storage and Querying
- Metrics land in the Timescale hypertable `otel_metrics` inside CNPG. Retention defaults to three days via the Ash migration in `elixir/serviceradar_core/priv/repo/migrations/`; extend it there and re-run `mix ash.migrate`.
- Traces use the `otel_traces` hypertable. SRQL simply proxies the query to CNPG, so joins such as `SELECT * FROM otel_traces JOIN logs USING (trace_id)` stay performant.
- Logs from OTEL exporters flow into the shared `logs` hypertable through the `serviceradar-db-event-writer`. The syslog pipeline can still mirror events if you need unified retention or GoRules enrichment.
Use the CNPG Monitoring dashboards to watch ingestion volume and Timescale retention jobs, or run ad-hoc SQL directly from the `serviceradar-tools` pod (`cnpg-sql "SELECT COUNT(*) FROM otel_traces WHERE created_at > now() - INTERVAL '5 minutes';"`).
Troubleshooting
- Validate connectivity with `otelcol --config test-collector.yaml --dry-run`.
- If running in Kubernetes, check the gateway logs (`kubectl logs deploy/serviceradar-otel -n <namespace>`) for schema rejection or TLS errors.
- Refer to the Troubleshooting Guide for rate limiting and export lag scenarios.
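If exports appear to vanish silently, one option is to tee telemetry into the collector's `debug` exporter alongside the OTLP one, so payloads show up in the collector logs. A sketch; the pipeline must reference receivers and processors that exist in your config:

```yaml
exporters:
  debug:
    verbosity: detailed   # prints full span payloads to the collector's own logs

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, debug]
```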
Core Capability Metrics
ServiceRadar emits capability lifecycle metrics whenever the core service records a capability event:
`serviceradar_core_capability_events_total` (counter) – increments on every capability snapshot written to CNPG. Key attributes:
- `capability`: logical capability string (`icmp`, `snmp`, `sysmon`, `gateway`, …)
- `service_type`: gateway/agent/checker service type (if available)
- `recorded_by`: gateway ID or component that produced the event
- `state`: normalized state stored alongside the snapshot (`ok`, `failed`, `degraded`, `unknown`)
Suggested PromQL examples once the OTEL collector exports to Prometheus:
```promql
# Track per-capability event cadence across the fleet
sum(rate(serviceradar_core_capability_events_total[5m])) by (capability)

# Alert if ICMP capability reports go silent for 10 minutes
sum(rate(serviceradar_core_capability_events_total{capability="icmp"}[10m])) < 0.1
```
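The silence check can also be packaged as a Prometheus alerting rule (the group name, alert name, and severity label below are illustrative):

```yaml
groups:
  - name: serviceradar-capabilities
    rules:
      - alert: IcmpCapabilityEventsSilent
        # Fires when ICMP capability events stop arriving for 10 minutes
        expr: sum(rate(serviceradar_core_capability_events_total{capability="icmp"}[10m])) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ICMP capability events have gone quiet across the fleet
```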
Grafana tip: plot the per-capability series as a stacked area chart to spot imbalances between collectors; overlay `recorded_by` to see which gateways stop reporting first during outages.