Troubleshooting Guide
Use this guide as a first stop when onboarding ServiceRadar or operating the demo cluster. Each section lists fast diagnostics, common failure modes, and references for deeper dives.
Core Services
- Check pod health: `kubectl get pods -n demo` (or the equivalent Docker Compose status). Pods stuck in `CrashLoopBackOff` usually point to missing secrets or PVC mounts.
- Verify API availability: `curl -k https://<core-host>/healthz`. TLS errors tie back to mismatched certificates; reissue them with the Self-Signed Certificates guide.
- Configuration drift: Reconcile changes with the Configuration Basics checklist and commit updates to KV.
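The pod-health check above can be scripted. As a minimal sketch, the helper below scans the JSON printed by `kubectl get pods -n demo -o json` for containers waiting in `CrashLoopBackOff`. The function name `crashlooping_pods` is illustrative and not part of ServiceRadar; the field paths follow the standard Kubernetes Pod status layout.

```python
import json


def crashlooping_pods(kubectl_json: str) -> list[str]:
    """Return names of pods that have a container waiting in CrashLoopBackOff.

    Expects the text produced by `kubectl get pods -n <namespace> -o json`.
    """
    pod_list = json.loads(kubectl_json)
    stuck = []
    for pod in pod_list.get("items") or []:
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        for status in statuses:
            waiting = (status.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                stuck.append(pod["metadata"]["name"])
                break  # one crash-looping container is enough to flag the pod
    return stuck
```

One way to use it is to save the `kubectl` output to a file and pass the file contents to the function; each returned pod is a candidate for the missing-secret or PVC-mount checks described above.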
SNMP
- Credential failures: Review `poller` logs for `snmp_auth_error`. Ensure v3 auth/privacy keys match the SNMP ingest guide recommendations.
- Packet loss: Confirm firewall rules allow UDP 161/162 from pollers. Use `snmpwalk -v3 ...` from the poller pod to validate.
- Slow polls: Trim OID lists or increase poller replicas. Long runtimes delay alerting.
Syslog
- No events: Ensure devices forward to the correct address and protocol (`UDP/TCP 514`). Validate listener status via `kubectl logs deploy/serviceradar-syslog -n demo`.
- Parsing issues: Update Proton grok rules when new vendors join; refer to the Syslog ingest guide.
- Clock drift: Systems with unsynchronized NTP create out-of-order events; align to UTC.
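When no events arrive, it can help to rule out the devices by sending a known test message yourself. The sketch below sends a single UDP datagram with a BSD-syslog-style priority prefix; the function name, the `troubleshoot` tag, and the default message are illustrative assumptions. The timestamp and hostname fields are deliberately left out, so how the message is parsed depends on the receiver.

```python
import socket


def send_test_syslog(host: str, port: int = 514,
                     message: str = "serviceradar syslog test") -> bytes:
    """Send one UDP syslog-style datagram and return the bytes that were sent."""
    facility, severity = 1, 6          # user-level facility, informational severity
    pri = facility * 8 + severity      # priority value 14
    payload = f"<{pri}>troubleshoot: {message}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

After calling `send_test_syslog("<listener-address>")`, the message should show up in the listener logs checked above. If it does not, the problem is more likely on the network path or the listener than on the forwarding devices.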
NetFlow
- Missing flows: Exporters must send to UDP 2055. Use `tcpdump` on the poller host to confirm arrival.
- Template errors: Reset exporters or clear caches when poller logs complain about unknown IPFIX templates. See the NetFlow ingest guide.
- High load: Increase `NETFLOW_QUEUE_DEPTH` and allocate more CPU to pollers.
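When chasing template errors, it is useful to know which export protocol a device is actually sending. NetFlow v5, NetFlow v9, and IPFIX all begin with a 16-bit big-endian version field, so a captured UDP payload can be classified with a few lines of Python. This is a minimal sketch; the function name is hypothetical and it does not reflect how the poller itself decodes flows.

```python
import struct

# Version numbers carried in the first two bytes of an export packet.
KNOWN_VERSIONS = {5: "NetFlow v5", 9: "NetFlow v9", 10: "IPFIX"}


def flow_export_version(datagram: bytes) -> str:
    """Identify the export protocol from the leading 16-bit version field."""
    if len(datagram) < 2:
        raise ValueError("datagram too short to contain a version field")
    (version,) = struct.unpack_from("!H", datagram, 0)
    return KNOWN_VERSIONS.get(version, f"unknown (version {version})")
```

Feed it the UDP payload of a packet captured on port 2055. A result other than the protocol you expect points at exporter configuration rather than at the poller.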
OTEL
- TLS failures: Double-check the OTLP gateway certificate bundle. Clients should trust the CA described in Self-Signed Certificates.
- Backpressure: Inspect the gateway metrics; enable batching in exporters. Follow the OTEL guide for tuning tips.
- Missing spans: Ensure `service.name` and other attributes are populated; SRQL filters rely on them.
Discovery
- Empty results: Confirm scopes exist in KV under `discovery/jobs/*`. Reconcile job ownership using the Discovery guide.
- Mapper stalled: Tail `serviceradar-mapper` logs for `scheduler` messages. Ensure `/etc/serviceradar/mapper.json` has at least one enabled `scheduled_jobs` entry and that credentials cover the target CIDRs.
- Missing interfaces/topology: Verify `stream_config` in `mapper.json` still points to `discovered_interfaces` and `topology_discovery_events`. Mapper only emits interface/topology data when those fields are present.
- Duplicate devices: Enable canonical matching in Sync so NetBox and Armis merges succeed.
- Sweep failures: Check poller network reachability and throttling limits.
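The two `mapper.json` checks above can be run as a quick lint. The sketch below rests on assumptions about the file layout that this guide does not confirm: that each `scheduled_jobs` entry has a boolean `enabled` key, and that `stream_config` is an object whose string values name the streams. Adjust both to match your actual file; the function name is illustrative.

```python
import json

REQUIRED_STREAMS = {"discovered_interfaces", "topology_discovery_events"}


def mapper_config_problems(config_text: str) -> list[str]:
    """Return human-readable problems found in a mapper.json document."""
    config = json.loads(config_text)
    problems = []

    # ASSUMPTION: each scheduled_jobs entry carries a boolean "enabled" key.
    jobs = config.get("scheduled_jobs") or []
    if not any(isinstance(job, dict) and job.get("enabled") for job in jobs):
        problems.append("no enabled scheduled_jobs entry")

    # ASSUMPTION: stream_config is an object whose string values name streams.
    stream_config = config.get("stream_config")
    if not isinstance(stream_config, dict):
        stream_config = {}
    configured = {v for v in stream_config.values() if isinstance(v, str)}
    for stream in sorted(REQUIRED_STREAMS - configured):
        problems.append(f"stream_config does not reference {stream}")

    return problems
```

An empty list means both conditions hold under these assumptions. Any returned message maps directly onto the "Mapper stalled" or "Missing interfaces/topology" item above.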
Integrations
Armis
- Refresh client secrets and inspect `serviceradar-sync` logs. The Armis integration doc covers faker resets and pagination tuning.
- Compare Faker vs. production counts to spot ingestion gaps.
NetBox
- Verify API token scopes and rate limits. See the NetBox integration guide for advanced settings.
- Check that prefixes are importing as expected; toggle `expand_subnets` if sweep jobs look incomplete.
Dashboards and UI
- Login problems: Ensure local users exist (`admin` role) and JWT secrets are configured as described in Authentication configuration.
- Missing charts: Import default dashboards from the Web UI configuration and double-check Proton retention windows.
- SRQL errors: Reference the SRQL language guide when writing complex joins.
Still Stuck?
- Review the operational runbooks in Agents & Demo Operations for environment resets.
- Capture failing commands, logs, and SRQL queries before escalating to the core team.
- File follow-up work items in Beads (`bd`) so the broader team can track remediations.