Troubleshooting Guide
Use this guide as a first stop when onboarding ServiceRadar or operating the demo cluster. Each section lists fast diagnostics, common failure modes, and references for deeper dives.
Core Services
- Check pod health: `kubectl get pods -n demo` (or the equivalent Docker Compose status). Pods stuck in `CrashLoopBackOff` usually point to missing secrets or PVC mounts.
- Verify API availability: `curl -k https://<core-host>/healthz`. TLS errors tie back to mismatched certificates; reissue them with the Self-Signed Certificates guide.
- Configuration drift: Reconcile changes with the Configuration Basics checklist and commit updates to KV.
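The pod health check can be narrowed to just the pods that need attention. The following is a minimal sketch, assuming `awk` is available and that `kubectl get pods --no-headers` prints its default columns (NAME, READY, STATUS, ...); the function name `unhealthy_pods` is hypothetical and not part of ServiceRadar.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints "NAME STATUS" for every pod that is not Running or Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against the demo namespace from this guide:
#   kubectl get pods -n demo --no-headers | unhealthy_pods
```

An empty result suggests every pod is in a healthy state; any line that is printed names a pod worth inspecting with `kubectl describe pod`.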
SNMP
- Credential failures: Review `poller` logs for `snmp_auth_error`. Ensure v3 auth/privacy keys match the SNMP ingest guide recommendations.
- Packet loss: Confirm firewall rules allow UDP 161 (polls from the pollers) and UDP 162 (traps from devices). Use `snmpwalk -v3 ...` from the poller pod to validate.
- Slow polls: Trim OID lists or increase poller replicas. Long runtimes delay alerting.
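To gauge how widespread credential failures are, the poller logs can be counted rather than read by eye. This is a rough sketch: the helper name `count_snmp_auth_errors` is hypothetical, and the poller deployment name is left as a placeholder because this guide does not state it.

```shell
# Hypothetical helper: counts log lines on stdin that mention snmp_auth_error.
count_snmp_auth_errors() {
  # grep -c exits non-zero when the count is 0, so mask the status.
  grep -c 'snmp_auth_error' || true
}

# Usage (replace the placeholder with your poller deployment):
#   kubectl logs deploy/<poller-deployment> -n demo --since=15m | count_snmp_auth_errors
```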
Syslog
- No events: Ensure devices forward to the correct address and protocol (`UDP/TCP 514`). Validate listener status via `kubectl logs deploy/serviceradar-syslog -n demo`.
- Parsing issues: Update Proton grok rules when new vendors join; refer to the Syslog ingest guide.
- Clock drift: Devices with unsynchronized clocks produce out-of-order events; sync them with NTP and align timestamps to UTC.
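When no events arrive, sending a known test message helps separate device-side forwarding problems from listener problems. The sketch below assumes the util-linux `logger` command is installed on the sending host; the function name `send_test_syslog`, the `sr-test` tag, and the `RUN` dry-run variable are illustrative choices, not ServiceRadar features.

```shell
# Hypothetical helper: sends one tagged test message to a syslog listener.
#   $1 = listener address, $2 = udp (default) or tcp
# Set RUN=echo to print the command instead of executing it.
send_test_syslog() {
  ${RUN:-} logger --server "$1" --port 514 "--${2:-udp}" \
    -p local0.info -t sr-test "ServiceRadar syslog test"
}

# Usage:
#   send_test_syslog <listener-address> tcp
#   kubectl logs deploy/serviceradar-syslog -n demo | grep sr-test
```

If the tagged line shows up in the listener logs, the path from sender to listener is working and the problem is more likely on the device side.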
NetFlow
- Missing flows: Exporters must send to UDP 2055. Use `tcpdump` on the poller host to confirm arrival.
- Template errors: Reset exporters or clear caches when poller logs complain about unknown IPFIX templates. See the NetFlow ingest guide.
- High load: Increase `NETFLOW_QUEUE_DEPTH` and allocate more CPU to pollers.
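The `tcpdump` check for missing flows might look like the sketch below. It assumes `tcpdump` is installed on the poller host and that you have permission to capture; the function name `capture_flows` and the `RUN` dry-run variable are hypothetical conveniences.

```shell
# Hypothetical helper: captures a handful of packets arriving on UDP 2055.
#   $1 = interface (default: any), $2 = packet count (default: 20)
# Set RUN=echo to print the command instead of executing it.
capture_flows() {
  ${RUN:-} tcpdump -n -i "${1:-any}" -c "${2:-20}" udp port 2055
}

# Usage (typically needs root):
#   capture_flows eth0 10
```

If the capture never completes, nothing is reaching the host on that port, which points at the exporter configuration or an intermediate firewall rather than the poller.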
OTEL
- TLS failures: Double-check the OTLP gateway certificate bundle. Clients should trust the CA described in Self-Signed Certificates.
- Backpressure: Inspect the gateway metrics; enable batching in exporters. Follow the OTEL guide for tuning tips.
- Missing spans: Ensure `service.name` and other attributes are populated, because SRQL filters rely on them.
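For TLS failures, one way to confirm that a client's CA bundle actually validates the gateway certificate is to inspect the output of `openssl s_client`. This is a sketch under assumptions: the helper name `tls_verified` is hypothetical, the gateway host and CA path are placeholders, and port 4317 is the conventional OTLP gRPC port, which may differ in your deployment.

```shell
# Hypothetical helper: reads `openssl s_client` output on stdin and succeeds
# only if the certificate chain verified against the supplied CA bundle.
tls_verified() {
  grep -q 'Verify return code: 0 (ok)'
}

# Usage (host, port, and CA path are placeholders):
#   openssl s_client -connect <otlp-gateway>:4317 -CAfile <ca-bundle.pem> </dev/null 2>/dev/null \
#     | tls_verified && echo "gateway certificate trusted"
```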
Discovery
- Empty results: Confirm scopes exist in KV under `discovery/jobs/*`. Reconcile job ownership using the Discovery guide.
- Mapper stalled: Tail `serviceradar-mapper` logs for `scheduler` messages. Ensure `/etc/serviceradar/mapper.json` has at least one enabled `scheduled_jobs` entry and that credentials cover the target CIDRs.
- Missing interfaces/topology: Verify `stream_config` in `mapper.json` still points to `discovered_interfaces` and `topology_discovery_events`. Mapper only emits interface/topology data when those fields are present.
- Duplicate devices: Enable canonical matching in Sync so NetBox and Armis merges succeed.
- Sweep failures: Check poller network reachability and throttling limits.
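The mapper configuration checks above can be approximated with a quick presence test. Because this guide does not spell out the `mapper.json` schema, the sketch below only confirms that the expected names appear somewhere in the file; it does not verify that a job is enabled or that the values sit under the right keys. The function name `check_mapper_config` is hypothetical.

```shell
# Hypothetical helper: reports which expected names are absent from mapper.json.
#   $1 = config path (default: /etc/serviceradar/mapper.json)
# Returns non-zero if any name is missing. This is a text match, not a schema check.
check_mapper_config() {
  cfg="${1:-/etc/serviceradar/mapper.json}"
  rc=0
  for key in scheduled_jobs stream_config discovered_interfaces topology_discovery_events; do
    if ! grep -q "$key" "$cfg"; then
      echo "missing: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_mapper_config && echo "expected names present"
```

A clean result only rules out the most obvious omissions; the file still needs a manual read to confirm the entries are enabled and correctly nested.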
Integrations
Armis
- Refresh client secrets and inspect `serviceradar-sync` logs. The Armis integration doc covers Faker resets and pagination tuning.
- Compare Faker vs. production counts to spot ingestion gaps.
NetBox
- Verify API token scopes and rate limits. See the NetBox integration guide for advanced settings.
- Check that prefixes are importing as expected; toggle `expand_subnets` if sweep jobs look incomplete.
Dashboards and UI
- Login problems: Ensure local users exist (`admin` role) and JWT secrets are configured as described in Authentication configuration.
- Missing charts: Import default dashboards from the Web UI configuration and double-check Proton retention windows.
- SRQL errors: Reference the SRQL language guide when writing complex joins.
Still Stuck?
- Review the operational runbooks in Agents & Demo Operations for environment resets.
- Capture failing commands, logs, and SRQL queries before escalating to the core team.
- File follow-up work items in Beads (`bd`) so the broader team can track remediations.