Troubleshooting Guide

Use this guide as a first stop when onboarding or operating ServiceRadar. Each section lists fast diagnostics, common failure modes, and references for deeper dives.

Edge Agents

Edge agents are Go binaries that run on monitored hosts outside the Kubernetes cluster, communicating via gRPC with mTLS.

Connection Issues

Agent not connecting: Check agent logs (journalctl -u serviceradar-agent -f) for connection errors. Verify the gateway address in /etc/serviceradar/agent.json.

TLS handshake failures: Ensure certificates are valid and the CA bundle is correct:

openssl verify -CAfile /etc/serviceradar/certs/bundle.pem \
  /etc/serviceradar/certs/svid.pem

Firewall blocking: Confirm port 50052 is open from the agent to the gateway:

```
nc -zv <gateway-host> 50052
```

Certificate Issues

Certificate expired: Check expiry dates:

openssl x509 -in /etc/serviceradar/certs/svid.pem -noout -dates

Wrong CN format: Verify the CN matches <agent_id>.<partition_id>.serviceradar:

```
openssl x509 -in /etc/serviceradar/certs/svid.pem -noout -subject
```

CA mismatch: Ensure the agent's CA bundle matches the cluster's SPIRE trust domain.
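
The CN check can be scripted against the openssl output. This is a minimal Python sketch, assuming agent and partition IDs never contain dots (the guide does not say). The helper names parse_cn and cn_matches_expected are hypothetical and not part of ServiceRadar.

```python
import re
from typing import Optional

# Matches "CN = value" (OpenSSL 1.1+/3.x) and "/CN=value" (older releases).
_CN_RE = re.compile(r"\bCN\s*=\s*([^,/\n]+)")


def parse_cn(subject_line: str) -> Optional[str]:
    """Extract the CN from the output of `openssl x509 -noout -subject`."""
    match = _CN_RE.search(subject_line)
    return match.group(1).strip() if match else None


def cn_matches_expected(cn: str) -> bool:
    """Check the <agent_id>.<partition_id>.serviceradar shape.

    Assumption: neither ID contains a dot, so the CN has exactly three labels.
    """
    labels = cn.split(".")
    return len(labels) == 3 and labels[2] == "serviceradar" and all(labels[:2])
```

Feed parse_cn the subject line printed by the command above, then pass the result to cn_matches_expected. A False result points at a certificate issued with the wrong naming scheme.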

Registration Issues

Agent not appearing in UI: Verify the agent is registered via the API:

curl -H "Authorization: Bearer $TOKEN" \
  https://core.example.com/api/v2/agents/<agent-uid>

Status stuck at "connecting": Check gateway logs for gRPC errors. The agent may be connecting but failing health checks.
Wrong deployment: Agent certificates are deployment-specific. Verify the certificate CN matches the expected deployment.

gRPC Diagnostics

Test gRPC connectivity directly (agent-gateway is the only edge-facing gRPC endpoint):

grpcurl -cert /etc/serviceradar/certs/svid.pem \
        -key /etc/serviceradar/certs/svid-key.pem \
        -cacert /etc/serviceradar/certs/bundle.pem \
        <gateway-host>:50052 list

For detailed edge agent documentation, see Edge Model.

Core Services

Check pod health: kubectl get pods -n <namespace> (or the equivalent Docker Compose status). Pods stuck in CrashLoopBackOff usually point to missing secrets, PVC mounts, or bad environment variables.
Verify API availability: curl -k https://<core-host>/healthz. TLS errors usually indicate mismatched certificates; reissue them with the Self-Signed Certificates guide.
Configuration drift: Most configuration is managed through the web UI and delivered to agents via GetConfig. If changes are not taking effect, confirm the agent is online and check agent-gateway logs for config fetch errors.

Observability Rollups

Trace cards stale, missing, or zeroed: Follow the Observability Rollup Recovery runbook. It covers verifying platform.otel_trace_summaries, platform.traces_stats_5m, and the periodic Oban jobs that maintain them.

SNMP

Credential failures: Review gateway logs for snmp_auth_error. Ensure v3 auth/privacy keys match the SNMP ingest guide recommendations.
Packet loss: Confirm firewall rules allow UDP 161 (polls) from gateways to devices and UDP 162 (traps) from devices to gateways. Use snmpwalk -v3 ... from the gateway pod to validate.
Slow polls: Trim OID lists or increase gateway replicas. Long runtimes delay alerting.

Syslog

No events: Ensure devices forward to the correct address and protocol (UDP/TCP 514). In Kubernetes, validate the listener with kubectl logs deploy/serviceradar-log-collector -n <namespace> --since=10m. If syslog enters through Gateway API, also check kubectl describe udproute -n <namespace> serviceradar-syslog.
Parsing issues: Update the Zen parsing rules when new vendors join; refer to the Syslog ingest guide.
Clock drift: Devices whose clocks are not NTP-synchronized produce out-of-order events; sync them with NTP and log timestamps in UTC.

NetFlow

Missing Flows

Symptoms:

No flows appearing in database
SRQL in:flows queries return empty results
Web UI NetFlow dashboard shows no data

Quick Diagnostics:

# 1. Check collector is running
docker ps | grep flow-collector
kubectl get pods -l app=serviceradar-flow-collector

# 2. Check if packets are arriving
sudo tcpdump -i any -n port 2055
# Should see: IP <router-ip>.12345 > <collector-ip>.2055: UDP, length 1480

# 3. Check collector logs
docker logs serviceradar-flow-collector-mtls | grep "Received.*bytes from"
kubectl logs -l app=serviceradar-flow-collector | grep "Received.*bytes from"

# 4. Check NATS stream
nats stream info events

# 5. Query database directly
psql -c "SELECT COUNT(*) FROM ocsf_network_activity WHERE time > NOW() - INTERVAL '5 minutes';"

Common Causes:

Device not configured: Router/switch/firewall must export NetFlow to collector IP and port 2055
- Verify: Check device NetFlow configuration
- Fix: Configure device per NetFlow ingest guide
Firewall blocking UDP 2055: Network firewall or host firewall blocks UDP
- Verify: sudo iptables -L | grep 2055 or cloud security group rules
- Fix: Allow UDP 2055 from exporter IPs to collector
Wrong collector IP: Device sending to old/wrong collector address
- Verify: Check device config shows current collector IP
- Fix: Update device NetFlow destination address
Collector not listening: Process crashed or misconfigured
- Verify: netstat -ulnp | grep 2055 shows listener
- Fix: Check logs for startup errors, verify config file
NATS unavailable: Collector can't publish to NATS JetStream
- Verify: Check collector logs for NATS connection errors
- Fix: Verify NATS URL in config, check NATS health

Template Errors

Symptoms:

Log warnings: "Missing template - ID: 256, Protocol: V9"
Flows from certain routers not appearing
Intermittent flow data

Understanding Templates:

NetFlow v9 and IPFIX use template-based flow encoding:

Router sends template definition (which fields are in flows)
Router sends flow data using template ID
Collector must receive template before data

Templates can be:

Lost in transit (UDP is unreliable)
Arrive after data (out of order)
Cleared on router reboot (but collector still has old version)
Expired (per TTL)
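
The ordering requirement above can be illustrated with a toy decoder. This is a simplified Python sketch, not the collector's Rust implementation. TemplateCache, its methods, and the field names are hypothetical. Expiry and eviction are left out.

```python
from typing import Dict, List, Optional, Tuple

# Key: (exporter address, template ID). Templates are scoped per exporter.
TemplateKey = Tuple[str, int]


class TemplateCache:
    def __init__(self) -> None:
        self._templates: Dict[TemplateKey, List[str]] = {}
        self.missing = 0  # data records that arrived without a known template

    def learn(self, exporter: str, template_id: int, fields: List[str]) -> None:
        """Store (or overwrite) the field layout announced by an exporter."""
        self._templates[(exporter, template_id)] = fields

    def decode(self, exporter: str, template_id: int,
               values: List[int]) -> Optional[Dict[str, int]]:
        """Decode one data record, or return None if its template is unknown."""
        fields = self._templates.get((exporter, template_id))
        if fields is None:
            self.missing += 1  # this is the "Missing template" case
            return None
        return dict(zip(fields, values))
```

A data record that arrives before its template cannot be decoded. Once the exporter re-sends the template, later records with the same ID decode normally, which is why most warnings clear on their own.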

Quick Diagnostics:

# Check for missing template warnings
docker logs serviceradar-flow-collector-mtls | grep "Missing template"

# Check for template learned events
docker logs serviceradar-flow-collector-mtls | grep "Template learned"

# Check template cache stats
docker logs serviceradar-flow-collector-mtls | grep "Template Cache"

Solutions:

Wait 60 seconds: Routers re-send templates periodically (default: 60s)
- Most "missing template" warnings resolve automatically
- Check logs to see if template arrives

Restart collector if persistent: Clears corrupted template cache

docker restart serviceradar-flow-collector-mtls
# or
kubectl rollout restart deployment/serviceradar-flow-collector

Reboot router (last resort): Clears router's template state
- Only if problem persists after collector restart
- Router will send fresh templates on startup
Check template cache size: May need larger cache for many routers
- Verify: Check "Template Cache" logs show size near max
- Fix: Increase max_templates in config (default: 2000)

High CPU Usage

Symptoms:

Collector using >80% CPU
System load high
Slow flow processing

Causes:

Very high flow rate (>50,000 flows/sec)
- Check: Look at flow ingestion rate in logs
- Fix: Enable sampling on routers (1:100 or 1:1000)
Complex templates (many fields)
- Check: Look at template learned events for field counts
- Fix: Simplify flow records on routers
Insufficient batching
- Check: batch_size in config
- Fix: Increase from 100 to 500-1000
Too many concurrent parsers
- Fix: Ensure only one collector instance per host

Tuning (raise batch_size from 100, channel_size from 10000, and publish_timeout_ms from 5000):

{
  "batch_size": 500,
  "channel_size": 50000,
  "publish_timeout_ms": 10000
}

Dropped Flows

Symptoms:

Log warnings: "Publisher channel full, dropping flow message"
Flow counts lower than expected
Gaps in flow data

Causes:

NATS JetStream slow or unavailable
- Check: NATS JetStream health and latency
- Fix: Scale NATS cluster, check network latency
Channel too small for burst traffic
- Check: Warnings appear during traffic spikes
- Fix: Increase channel_size to 50,000+
Batch publish taking too long
- Check: NATS publish latency in logs
- Fix: Reduce batch_size or improve NATS performance

Solutions:

Raise channel_size from 10000 and pick a batch_size that balances throughput against latency:

{
  "channel_size": 50000,
  "batch_size": 200
}

Backpressure behavior:

Each listener owns a bounded mpsc channel of depth channel_size. When the channel is full, the listener drops the incoming datagram (drop-newest) and increments a per-subject drop counter exposed on the metrics endpoint. The drop policy is not operator-tunable; to keep the channel drained, raise channel_size, increase batch_size, or reduce NATS publish latency.
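
The drop-newest policy can be modeled in a few lines. This Python sketch only illustrates the behavior described above; the real collector uses a Rust mpsc channel. BoundedListener and the subject name in the usage note are hypothetical.

```python
import queue
from collections import Counter


class BoundedListener:
    """Toy model of a listener feeding a bounded channel (drop-newest)."""

    def __init__(self, channel_size: int) -> None:
        self.channel = queue.Queue(maxsize=channel_size)
        self.drops = Counter()  # per-subject drop counters

    def on_datagram(self, subject: str, datagram: bytes) -> bool:
        """Enqueue a datagram; if the channel is full, drop it and count it."""
        try:
            self.channel.put_nowait(datagram)
            return True
        except queue.Full:
            self.drops[subject] += 1
            return False
```

With channel_size=2, a third datagram sent to a hypothetical subject such as "flows.raw" is rejected and counted while the two queued datagrams stay intact. Draining one entry makes room again, so a larger channel or a faster publisher reduces drops.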

Low Template Cache Hit Ratio

Symptoms:

Cache stats show hit ratio < 90%
Many cache misses in logs
Performance degradation

Example Log:

V9 Template Cache [192.168.1.1:2055] - Templates: 1850/2000, Data: 950/2000,
  Template Hits/Misses: 5000/800

Hit ratio = 5000 / (5000 + 800) ≈ 86.2% (unhealthy: below the 90% threshold)

Causes:

Cache too small: Not enough room for all templates
- Check: current_size near max_size
- Fix: Increase max_templates
Templates expiring too quickly
- Check: Many "Template expired" events
- Fix: Increase router template refresh rate
Too many unique flows: Data cache evicting frequently
- Check: Data cache size near max
- Fix: Increase max_templates (affects both caches)

Solutions:

Raise max_templates from the default of 2000:

{
  "max_templates": 5000
}

For 10+ sources, budget roughly 1000 templates per source:

{
  "max_templates": 10000
}

Memory Usage Higher Than Expected

Symptoms:

Collector using more memory than expected
OOM (Out of Memory) errors

Expected Memory Usage:

Base: ~500MB
Per Source: ~50MB per active exporter
10 sources: ~1GB total
100 sources: ~5.5GB total

AutoScopedParser keeps a separate template cache per source IP, so memory grows roughly linearly with the number of active exporters. This is expected behavior.
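
The figures above follow a simple linear model. This Python sketch reproduces them. The constants are the approximate values quoted in this section, and estimate_memory_mb is a hypothetical helper, not a collector API.

```python
BASE_MB = 500        # approximate base footprint quoted above
PER_SOURCE_MB = 50   # approximate cost per active exporter quoted above


def estimate_memory_mb(active_sources: int) -> int:
    """Rough expected memory in MB for a given number of active exporters."""
    if active_sources < 0:
        raise ValueError("active_sources must be non-negative")
    return BASE_MB + PER_SOURCE_MB * active_sources
```

estimate_memory_mb(10) gives 1000 (about 1GB) and estimate_memory_mb(100) gives 5500 (about 5.5GB), matching the list above. Usage well beyond the estimate for the known exporter count suggests more sources than expected.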

If memory exceeds expectations:

Check active source count: May have more exporters than expected

# Count unique sources in logs
grep "Template learned" /var/log/netflow-collector.log | \
  grep -oP '\d+\.\d+\.\d+\.\d+:\d+' | sort -u | wc -l

Identify rogue sources: Unexpected devices exporting

# List all sources
grep "Template Cache" /var/log/netflow-collector.log | \
  grep -oP '\[.*?\]' | sort -u

Reduce cache size if needed: Trades memory for a lower hit ratio (for example, from 2000 down to 1000)

```
{
  "max_templates": 1000
}
```

Filter unwanted sources: Firewall rules to block unauthorized exporters

NATS Connection Failures

Symptoms:

Log errors: "Failed to connect to NATS"
Log errors: "NATS publish failed"
Flows not reaching database

Quick Diagnostics:

# Check NATS is running
docker ps | grep nats
kubectl get pods -l app=nats

# Check NATS health
nats account info

# Test connection from collector host
telnet <nats-host> 4222

Solutions:

NATS not running: Start NATS

docker-compose up -d nats
kubectl scale deployment/nats --replicas=1

Wrong NATS URL in config: Verify the host and port in nats_url

{
  "nats_url": "nats://nats:4222"
}

mTLS certificate issues: Check certificates
- Verify cert files exist and are readable
- Check cert expiration: openssl x509 -in netflow-client.crt -noout -dates
- Verify CA matches
Network isolation: NATS not reachable from collector
- Check network policies (Kubernetes)
- Check Docker networks (Docker Compose)
- Verify firewall rules

Monitoring Template Cache Health

Healthy Cache Indicators:

V9 Template Cache [192.168.1.1:2055] - Templates: 15/2000, Data: 8/2000,
  Template Hits/Misses: 12500/150, Data Hits/Misses: 84200/80

✅ Good indicators:

Template hit ratio: 12500/(12500+150) = 98.8% (>95%)
Data hit ratio: 84200/(84200+80) = 99.9% (>95%)
Size: 15/2000 = 0.75% (<50% is healthy)
Few evictions

Unhealthy Cache Indicators:

V9 Template Cache [192.168.1.1:2055] - Templates: 1950/2000, Data: 1980/2000,
  Template Hits/Misses: 5000/2000, Data Hits/Misses: 10000/5000

❌ Bad indicators:

Template hit ratio: 5000/(5000+2000) = 71.4% (<90%)
Data hit ratio: 10000/(10000+5000) = 66.7% (<90%)
Size: 1950/2000 = 97.5% (near max)
Likely many evictions

Action: Increase max_templates to 5000+
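
The ratios above can be computed straight from a "Template Cache" log line. This is a minimal Python sketch, assuming the log format shown in this section. cache_health and the keys it returns are hypothetical names.

```python
import re
from typing import Dict

# Matches the "name: a/b" pairs in a "Template Cache" log line.
_PAIR_RE = re.compile(
    r"(Templates|Data|Template Hits/Misses|Data Hits/Misses):\s*(\d+)/(\d+)"
)


def _pct(part: int, whole: int) -> float:
    return 100.0 * part / whole if whole else 0.0


def cache_health(log_line: str) -> Dict[str, float]:
    """Return fill and hit percentages for whichever pairs the line contains."""
    pairs = {name: (int(a), int(b)) for name, a, b in _PAIR_RE.findall(log_line)}
    health: Dict[str, float] = {}
    if "Templates" in pairs:
        used, capacity = pairs["Templates"]
        health["template_fill_pct"] = _pct(used, capacity)
    for name, key in (("Template Hits/Misses", "template_hit_pct"),
                      ("Data Hits/Misses", "data_hit_pct")):
        if name in pairs:
            hits, misses = pairs[name]
            health[key] = _pct(hits, hits + misses)
    return health
```

Compare the returned percentages against the thresholds in this section: hit ratios above 95% are healthy, and anything under 90% or a fill level near 100% calls for a larger max_templates.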

Device Configuration Validation

Quick checklist for router/switch/firewall:

# Cisco IOS: Verify NetFlow config
show flow exporter SERVICERADAR-COLLECTOR
show flow monitor SERVICERADAR-MONITOR
show flow interface

# Cisco NX-OS: Verify NetFlow config
show flow exporter SERVICERADAR
show flow monitor SERVICERADAR-MONITOR

# Juniper: Verify IPFIX config
show services flow-monitoring
show forwarding-options sampling instance SERVICERADAR-INSTANCE

Expected output:

Destination IP matches collector
Port is 2055
Interfaces are enabled
Template refresh configured
No error messages

See Device Configuration for full examples.

Performance Degradation

Symptoms:

Flows taking longer to appear in database
High latency from UDP receipt to database write

Measurement:

Enable debug logging and measure:

UDP receipt → parse complete
Parse complete → NATS publish
NATS publish → Zen processing
Zen processing → database write
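
Once a timestamp is available for each checkpoint, finding the slowest stage is simple arithmetic. This Python sketch assumes you have already extracted one timestamp (in seconds) per checkpoint from the debug logs. How to extract them is not specified here, and the checkpoint keys are hypothetical.

```python
from typing import Dict, List, Tuple

# Checkpoints in pipeline order, matching the stages listed above.
CHECKPOINTS: List[str] = [
    "udp_receipt", "parse_complete", "nats_publish", "zen_processing", "db_write",
]


def stage_latencies(ts: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (stage, seconds) for each consecutive pair of checkpoints."""
    return [
        (f"{a} -> {b}", ts[b] - ts[a])
        for a, b in zip(CHECKPOINTS, CHECKPOINTS[1:])
    ]


def slowest_stage(ts: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the largest latency."""
    return max(stage_latencies(ts), key=lambda item: item[1])
```

The stage returned by slowest_stage indicates which component to inspect first.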

Bottleneck Identification:

# Check NATS JetStream lag
nats stream info events

# Check Zen consumer lag
docker logs zen | grep "Processing message"

# Check db-event-writer throughput
docker logs db-event-writer | grep "Batch write"

Solutions:

Collector bottleneck: Increase batch_size, more CPU
NATS bottleneck: Scale NATS cluster, check disk I/O
Zen bottleneck: Scale Zen replicas
Database bottleneck: Scale CNPG, optimize indexes, partition tables

Still Having Issues?

Collect diagnostics:

# Collector logs (last 1000 lines)
docker logs --tail 1000 serviceradar-flow-collector-mtls > netflow-collector.log

# Collector stats
docker stats --no-stream serviceradar-flow-collector-mtls

# NATS stream info
nats stream info events > nats-stream-info.txt

# Database flow count
psql -c "SELECT
  DATE_TRUNC('hour', time) as hour,
  COUNT(*) as flow_count,
  COUNT(DISTINCT src_endpoint_ip) as unique_sources
FROM ocsf_network_activity
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;" > flow-stats.txt

# Network capture (30 seconds)
sudo timeout 30 tcpdump -i any -n port 2055 -w netflow-capture.pcap

Report issue with:

Collector version (check logs for startup message)
Router/switch vendor and model
Number of exporters
Approximate flow rate
Logs and diagnostics collected above

References:

NetFlow Ingest Guide - Full configuration guide
Version-specific changes: rust/netflow-collector/CHANGELOG.md
Testing procedures: rust/netflow-collector/TESTING.md

MTR Automation

ServiceRadar can run automated MTR (My Traceroute) captures to baseline network paths and react to state transitions. The behavior is controlled with feature flags in serviceradar_core — no code changes required.

Feature Flags

MTR_AUTOMATION_ENABLED: global default for all automated MTR workers.
MTR_AUTOMATION_BASELINE_ENABLED: baseline scheduler.
MTR_AUTOMATION_TRIGGER_ENABLED: state-transition trigger worker.
MTR_AUTOMATION_CONSENSUS_ENABLED: cohort consensus and causal emitter worker.

Each MTR_AUTOMATION_*_ENABLED flag defaults to the global value when unset. After changing any flag, restart or redeploy serviceradar_core so the supervision tree is rebuilt with the new worker set.
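
The inheritance rule can be sketched as follows. This Python model is illustrative only; serviceradar_core's actual parsing is not shown in this guide. It assumes that only the strings "true" and "false" are meaningful and that an unset global flag means disabled. It models only the default rule and does not cover whether a global false overrides a per-worker flag explicitly set to true. resolve_workers is a hypothetical name.

```python
from typing import Dict, Mapping, Optional

WORKER_FLAGS = {
    "baseline": "MTR_AUTOMATION_BASELINE_ENABLED",
    "trigger": "MTR_AUTOMATION_TRIGGER_ENABLED",
    "consensus": "MTR_AUTOMATION_CONSENSUS_ENABLED",
}


def _parse_bool(raw: Optional[str]) -> Optional[bool]:
    """Return None when unset. Assumption: only "true" enables a flag."""
    if raw is None or raw.strip() == "":
        return None
    return raw.strip().lower() == "true"


def resolve_workers(env: Mapping[str, str]) -> Dict[str, bool]:
    """Per-worker flag wins when set; otherwise inherit the global flag."""
    global_on = _parse_bool(env.get("MTR_AUTOMATION_ENABLED")) or False
    resolved: Dict[str, bool] = {}
    for worker, flag in WORKER_FLAGS.items():
        value = _parse_bool(env.get(flag))
        resolved[worker] = global_on if value is None else value
    return resolved
```

Passing the planned environment through resolve_workers before a redeploy shows which workers the documented default rule would enable.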

Recommended Staged Rollout

Baseline only:
- MTR_AUTOMATION_ENABLED=true
- MTR_AUTOMATION_BASELINE_ENABLED=true
- MTR_AUTOMATION_TRIGGER_ENABLED=false
- MTR_AUTOMATION_CONSENSUS_ENABLED=false
Add state-triggered capture: set MTR_AUTOMATION_TRIGGER_ENABLED=true.
Add consensus + causal emission: set MTR_AUTOMATION_CONSENSUS_ENABLED=true.

Rollback Switches

Stop all automated MTR immediately: MTR_AUTOMATION_ENABLED=false
Stop only event-driven runs: MTR_AUTOMATION_TRIGGER_ENABLED=false
Stop only causal consensus/emission while keeping dispatch: MTR_AUTOMATION_CONSENSUS_ENABLED=false
Stop only baseline scheduling while keeping incident capture: MTR_AUTOMATION_BASELINE_ENABLED=false

Helm Values

For chart-based deploys, set the same behavior under core.mtrAutomation:

core:
  mtrAutomation:
    enabled: false
    baselineEnabled: false
    triggerEnabled: false
    consensusEnabled: false
    baselineTickMs: 60000
    consensusCohortRetentionMs: 300000

OTEL

TLS failures: Double-check the OTLP gateway certificate bundle. Clients should trust the CA described in Self-Signed Certificates.
Backpressure: Inspect the gateway metrics; enable batching in exporters. Follow the OTEL guide for tuning tips.
Missing spans: Ensure service.name and other attributes are populated—SRQL filters rely on them.

Discovery

Empty results: Confirm discovery jobs exist and are scoped correctly in the admin UI or API. Reconcile job ownership using the Discovery guide.
Mapper stalled: Tail serviceradar-agent logs for mapper scheduler messages. Confirm the discovery job is enabled, scoped to the right partition/agent, and that credentials cover the target CIDRs.
Missing interfaces/topology: Confirm the mapper job discovery_type includes interfaces/topology and that results are flowing through agent-gateway into core.
Duplicate devices: Enable canonical matching in the embedded sync runtime so NetBox and Armis merges succeed.
Sweep failures: Check gateway network reachability and throttling limits.

Network Sweeps

No devices match: Confirm target criteria in Settings > Networks and verify tags exist on devices.
Sweep never runs: Ensure the group is enabled and has a valid schedule (interval or cron).
No results arriving: Check agent logs for sweep execution, and gateway logs for streaming/forwarding errors.
Unexpected targets: Review static targets and criteria operators (especially has_any vs has_all).
Stale availability: Confirm agents are polling for new configs and that the gateway is reachable.
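
The difference between the two criteria operators is easiest to see as set logic. This Python sketch assumes the semantics implied by the operator names (any overlap versus full containment). ServiceRadar's actual evaluation may differ, and the tag values are made up.

```python
from typing import Iterable, Set


def has_any(device_tags: Set[str], wanted: Iterable[str]) -> bool:
    """True if the device carries at least one of the wanted tags."""
    return not device_tags.isdisjoint(wanted)


def has_all(device_tags: Set[str], wanted: Iterable[str]) -> bool:
    """True only if the device carries every wanted tag."""
    return set(wanted) <= device_tags
```

A device tagged {"prod", "edge"} matches has_any for ["prod", "lab"] but not has_all, so switching operators can widen or narrow a sweep group considerably.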

Integrations

Armis

Refresh client secrets and inspect serviceradar-agent logs. The Armis integration doc covers Faker resets and pagination tuning.
Compare Faker vs. production counts to spot ingestion gaps.

NetBox

Verify API token scopes and rate limits. See the NetBox integration guide for advanced settings.
Check that prefixes are importing as expected; toggle expand_subnets if sweep jobs look incomplete.

Dashboards and UI

Login problems: Verify the admin bootstrap credentials are set (Helm/Docker Compose manage this) and confirm the active auth mode under Settings > Authentication. For SSO or Gateway Proxy deployments, use /auth/local for administrator password sign-in.
Missing charts: Double-check CNPG retention windows and confirm you are ingesting the underlying telemetry (SNMP, Syslog, NetFlow, OTEL).
SRQL errors: Reference the SRQL language guide when writing complex joins.

Still Stuck?

Capture failing commands, logs, and SRQL queries so the issue can be reproduced.
Open an issue with those details so the problem can be tracked and resolved.
