Troubleshooting Guide

Use this guide as a first stop when onboarding ServiceRadar or operating the demo cluster. Each section lists fast diagnostics, common failure modes, and references for deeper dives.

Core Services

Check pod health: kubectl get pods -n demo (or the equivalent Docker Compose status). Pods stuck in CrashLoopBackOff usually point to missing secrets or PVC mounts.
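Verify API availability: curl -k https://<core-host>/healthz. The -k flag skips certificate verification, so rerun without it when you suspect TLS trouble. TLS errors usually trace back to mismatched certificates; reissue them with the Self-Signed Certificates guide.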
Configuration drift: Reconcile changes with the Configuration Basics checklist and commit updates to KV.
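
The pod health check above can be scripted when there are many pods to scan. The following is a minimal sketch, assuming kubectl is on the PATH and the demo namespace from this guide; the helper names crashlooping_pods and fetch_pods are illustrative and not part of ServiceRadar. It reads the standard Kubernetes pod status fields and lists pods that have a container waiting in CrashLoopBackOff.

```python
import json
import subprocess


def crashlooping_pods(pod_list):
    """Return names of pods with at least one container waiting in CrashLoopBackOff.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for container in statuses:
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one failing container is enough to flag the pod
    return names


def fetch_pods(namespace="demo"):
    """Run kubectl and return the parsed pod list (namespace is an assumption)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling crashlooping_pods(fetch_pods()) returns the pod names to inspect next for missing secrets or PVC mounts.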

SNMP

Credential failures: Review poller logs for snmp_auth_error. Ensure v3 auth/privacy keys match the SNMP ingest guide recommendations.
Packet loss: Confirm firewall rules allow UDP 161 from pollers to devices (polling) and UDP 162 from devices (traps). Use snmpwalk -v3 ... from the poller pod to validate.
Slow polls: Trim OID lists or increase poller replicas. Long runtimes delay alerting.

Syslog

No events: Ensure devices forward to the correct address and protocol (UDP/TCP 514). Validate listener status via kubectl logs deploy/serviceradar-syslog -n demo.
Parsing issues: Update the CNPG grok rules when you onboard devices from a new vendor; refer to the Syslog ingest guide.
Clock drift: Devices without NTP synchronization produce out-of-order events; sync their clocks and send timestamps in UTC.
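
To separate listener problems from device forwarding problems, it can help to send one known message yourself. The sketch below builds an RFC 5424 message with a UTC timestamp and sends it over UDP. The hostname, app name, and helper names are illustrative assumptions, and whether the ServiceRadar listener expects RFC 5424 or the older BSD format is not stated in this guide.

```python
import socket
from datetime import datetime, timezone


def build_syslog_message(msg, hostname="test-host", app="sr-probe",
                         facility=16, severity=6, when=None):
    """Build an RFC 5424 syslog message as bytes.

    PRI is facility * 8 + severity; 16 is local0 and 6 is informational.
    PROCID, MSGID, and structured data are left as the nil value "-".
    """
    pri = facility * 8 + severity
    when = when or datetime.now(timezone.utc)
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"<{pri}>1 {timestamp} {hostname} {app} - - - {msg}".encode("utf-8")


def send_udp(payload, host, port=514):
    """Send one datagram; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

After send_udp(build_syslog_message("listener check"), "<syslog-host>"), look for the message in the listener logs. If it arrives but device events do not, the problem is on the forwarding side.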

NetFlow

Missing flows: Exporters must send to UDP 2055. Use tcpdump on the poller host to confirm arrival.
Template errors: Reset exporters or clear caches when poller logs complain about unknown IPFIX templates. See the NetFlow ingest guide.
High load: Increase NETFLOW_QUEUE_DEPTH and allocate more CPU to pollers.
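
When tcpdump shows packets arriving but flows are still missing, checking which export protocol the device actually sends can narrow things down. This is a minimal sketch: NetFlow v5, NetFlow v9, and IPFIX all start with a 16-bit big-endian version field (5, 9, and 10 respectively). The function names are illustrative. The receive_one helper binds the port itself, so run it only on a test host where nothing else is listening on UDP 2055.

```python
import socket
import struct

VERSION_NAMES = {5: "NetFlow v5", 9: "NetFlow v9", 10: "IPFIX"}


def export_version(datagram):
    """Classify a flow export datagram by its leading version field."""
    if len(datagram) < 2:
        return "too short"
    (version,) = struct.unpack("!H", datagram[:2])
    return VERSION_NAMES.get(version, f"unknown ({version})")


def receive_one(port=2055, timeout=10.0):
    """Wait for one datagram and return (sender address, protocol name).

    Raises a timeout error (socket.timeout) if nothing arrives in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind(("0.0.0.0", port))
        data, (address, _port) = sock.recvfrom(65535)
        return address, export_version(data)
```

An exporter that reports an unexpected version, or nothing at all within the timeout, points to exporter configuration rather than poller load.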

OTEL

TLS failures: Double-check the OTLP gateway certificate bundle. Clients should trust the CA described in Self-Signed Certificates.
Backpressure: Inspect the gateway metrics; enable batching in exporters. Follow the OTEL guide for tuning tips.
Missing spans: Ensure service.name and the other attributes you filter on are populated, because SRQL filters rely on them.
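
A quick way to confirm the service.name check is to inspect a captured export. The sketch below assumes you have a trace payload in the OTLP JSON encoding (for example, written out by a debugging exporter) already parsed into a dict. The function name is illustrative. It returns the positions of resourceSpans entries whose resource has no non-empty service.name.

```python
def resources_missing_service_name(otlp_payload):
    """Return indexes of resourceSpans entries lacking a usable service.name.

    Expects the OTLP JSON layout: resourceSpans[].resource.attributes[] with
    {"key": ..., "value": {"stringValue": ...}} items.
    """
    missing = []
    for index, resource_span in enumerate(otlp_payload.get("resourceSpans", [])):
        attributes = resource_span.get("resource", {}).get("attributes", [])
        names = [
            attr.get("value", {}).get("stringValue", "")
            for attr in attributes
            if attr.get("key") == "service.name"
        ]
        # Treat an absent, empty, or whitespace-only name as missing.
        if not any(name.strip() for name in names):
            missing.append(index)
    return missing
```

An empty result means every resource in the payload carries a service name, so the gap is more likely in the gateway or in the query itself.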

Discovery

Empty results: Confirm scopes exist in KV under discovery/jobs/*. Reconcile job ownership using the Discovery guide.
Mapper stalled: Tail serviceradar-mapper logs for scheduler messages. Ensure /etc/serviceradar/mapper.json has at least one enabled scheduled_jobs entry and that credentials cover the target CIDRs.
Missing interfaces/topology: Verify stream_config in mapper.json still points to discovered_interfaces and topology_discovery_events. Mapper only emits interface/topology data when those fields are present.
Duplicate devices: Enable canonical matching in Sync so NetBox and Armis merges succeed.
Sweep failures: Check poller network reachability and throttling limits.
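
The two mapper.json checks above can be combined into one pre-flight script. This is a sketch under stated assumptions: the real schema is not shown in this guide, so it assumes scheduled_jobs is a list of objects carrying an enabled flag, and it only checks that the two stream names appear somewhere inside stream_config. Adjust both to match your actual file.

```python
import json

REQUIRED_STREAMS = ("discovered_interfaces", "topology_discovery_events")


def check_mapper_config(config):
    """Return a list of human-readable problems found in a parsed mapper.json."""
    problems = []

    # Assumption: each job is an object with a boolean "enabled" field.
    jobs = config.get("scheduled_jobs") or []
    if not any(isinstance(job, dict) and job.get("enabled") for job in jobs):
        problems.append("no enabled scheduled_jobs entry")

    # Schema-agnostic check: look for the stream names anywhere in stream_config.
    stream_text = json.dumps(config.get("stream_config") or {})
    for name in REQUIRED_STREAMS:
        if name not in stream_text:
            problems.append(f"stream_config does not mention {name}")
    return problems


def load_mapper_config(path="/etc/serviceradar/mapper.json"):
    """Read and parse the mapper configuration file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Running check_mapper_config(load_mapper_config()) on the mapper host returns an empty list when both conditions hold.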

Integrations

Armis

Refresh client secrets and inspect serviceradar-sync logs. The Armis integration doc covers Faker resets and pagination tuning.
Compare Faker vs. production counts to spot ingestion gaps.

NetBox

Verify API token scopes and rate limits. See the NetBox integration guide for advanced settings.
Check that prefixes are importing as expected; toggle expand_subnets if sweep jobs look incomplete.

Dashboards and UI

Login problems: Ensure a local user with the admin role exists and JWT secrets are configured as described in Authentication configuration.
Missing charts: Import default dashboards from the Web UI configuration and double-check CNPG retention windows.
SRQL errors: Reference the SRQL language guide when writing complex joins.

Still Stuck?

Review the operational runbooks in Agents & Demo Operations for environment resets.
Capture failing commands, logs, and SRQL queries before escalating to the core team.
File follow-up work items in Beads (bd) so the broader team can track remediations.
