Agents & Demo Operations
This runbook captures the operational steps we used while debugging the canonical device pipeline in the demo cluster. It focuses on the pieces that interact with the "agent" side of the world (faker → sync → core) and the backing Timeplus/Proton database.
Armis Faker Service
- Deployment: serviceradar-faker (k8s/demo/base/serviceradar-faker.yaml).
- Persistent state lives on the PVC serviceradar-faker-data and must be mounted at /var/lib/serviceradar/faker. The deployment now mounts the same volume at /var/lib/serviceradar/faker and /data so the generator can save fake_armis_devices.json.
- The faker always generates 50,000 devices and shuffles a percentage of their IPs every minute. Restarting the pod without the PVC used to create a fresh dataset, which is why the database ballooned past 150k devices.
Useful checks:
kubectl get pods -n demo -l app=serviceradar-faker
kubectl exec -n demo deploy/serviceradar-faker -- ls /var/lib/serviceradar/faker
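To confirm the dataset actually survives a pod restart, a quick spot-check using the names above (standard kubectl; the PVC and file names come from the bullets in this section):
kubectl get pvc serviceradar-faker-data -n demo
kubectl exec -n demo deploy/serviceradar-faker -- ls -lh /var/lib/serviceradar/faker/fake_armis_devices.json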
Resetting the Device Pipeline
This clears Timeplus/Proton and repopulates it with a fresh discovery crawl from the faker service.
- Quiesce sync – stop new writes while we clear the streams:
kubectl scale deployment/serviceradar-sync -n demo --replicas=0
- Truncate Proton streams – run the following against the default database (each command can be executed with curl from a toolbox pod):
ALTER STREAM device_updates DELETE WHERE 1;
ALTER STREAM unified_devices DELETE WHERE 1;
ALTER STREAM unified_devices_registry DELETE WHERE 1;
After the deletes, verify counts:
SELECT count() FROM table(device_updates);
SELECT count() FROM table(unified_devices);
SELECT count() FROM table(unified_devices_registry);
- Ensure the materialized view exists – drop and recreate unified_device_pipeline_mv so it reflects the current schema and filters tombstoned rows (_merged_into, _deleted). The latest schemas can be found in @pkg/db/migrations (see the sketch after this list).
- Verify counts – typical numbers for the demo environment:
SELECT count() FROM table(unified_devices); -- ≈ 50–70k
SELECT uniq_exact(metadata['armis_device_id']) FROM table(unified_devices);
SELECT count() FROM table(unified_devices_registry);
- Resume discovery – start the sync pipeline again:
kubectl scale deployment/serviceradar-sync -n demo --replicas=1
kubectl logs deployment/serviceradar-sync -n demo --tail 50
Once the sync pod reports “Completed streaming results”, the canonical tables will match the faker dataset.
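For the materialized-view step, a minimal sketch run from the repo root; the migration filename is hypothetical, so substitute the actual file under pkg/db/migrations rather than writing the view definition by hand:
# Drop the stale view, then re-apply the definition from the migration file
kubectl -n demo exec deploy/serviceradar-tools -- proton-sql "DROP VIEW IF EXISTS unified_device_pipeline_mv"
kubectl -n demo exec deploy/serviceradar-tools -- proton-sql "$(cat pkg/db/migrations/<migration-file>.sql)"   # hypothetical filename; use the real migration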
Proton Reset (PVC Rotation)
If the telemetry tables balloon again, it is faster to rotate Proton’s volume than to hand-truncate every dependent stream. The helper script below scales Proton down, recreates the PVC, brings Proton back online, and restarts core so it can rebuild the schema from scratch:
# from repo root; defaults to the demo namespace
scripts/reset-proton.sh
# or explicitly choose a namespace
scripts/reset-proton.sh staging
What the script does:
- Scale Proton down: kubectl scale deployment/serviceradar-proton --replicas=0
- Delete and recreate the serviceradar-proton-data PVC (512 Gi by default, override with PVC_SIZE and STORAGE_CLASS)
- Scale Proton back up and wait for the rollout to finish
- Run kubectl rollout restart deployment/serviceradar-core so the schema is reseeded immediately
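For reference, a simplified sketch of that sequence; the real logic, including the PVC manifest it re-applies and the PVC_SIZE / STORAGE_CLASS handling, lives in scripts/reset-proton.sh:
NS=${1:-demo}
kubectl -n "$NS" scale deployment/serviceradar-proton --replicas=0
kubectl -n "$NS" delete pvc serviceradar-proton-data
# re-apply the PVC manifest here (512 Gi by default; PVC_SIZE / STORAGE_CLASS override it)
kubectl -n "$NS" scale deployment/serviceradar-proton --replicas=1
kubectl -n "$NS" rollout status deployment/serviceradar-proton
kubectl -n "$NS" rollout restart deployment/serviceradar-core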
After the reset:
- Spot-check counts with either /api/query or the Proton client (SELECT count() FROM otel_traces, otel_spans_enriched, otel_metrics, otel_trace_summaries); example after this list.
- Tail kubectl -n <ns> logs deploy/serviceradar-otel --tail=20 to confirm span batches stay in the single digits.
- Hard-refresh the dashboards so cached trace totals drop.
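For example, a count spot-check from the toolbox (the proton-sql wrapper is described in the next section; it should pick up the mounted password, otherwise pass PROTON_PASSWORD as shown there):
kubectl -n demo exec deploy/serviceradar-tools -- proton-sql "SELECT count() FROM table(otel_traces)"
kubectl -n demo exec deploy/serviceradar-tools -- proton-sql "SELECT count() FROM table(otel_trace_summaries)"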
Proton Client From serviceradar-tools
- Launch the toolbox with kubectl exec -it -n demo deploy/serviceradar-tools -- bash. The image ships the upstream Proton CLI (/usr/local/bin/proton.bin) plus a wrapper (/usr/local/bin/proton-client) that applies ServiceRadar TLS defaults and the new glibc runtime automatically.
- The toolbox pod mounts the serviceradar-secrets secret at /etc/serviceradar/credentials/proton-password. proton-client reads this path (or PROTON_PASSWORD[_FILE]) before falling back to /etc/proton-server/generated_password.txt, so manual password entry is rarely required.
- Helpful commands once you are inside the pod:
proton-info # show host/port/database/password source
proton-version # SELECT version() via the wrapper
proton-sql "SELECT 1" # preferred SQL helper (runs proton-client)
proton_sql 'SELECT count() FROM table(unified_devices)'
proton-client --query 'SHOW STREAMS'
- To run a one-off query from outside the pod, export the secret directly and hand it to the wrapper:
export PROTON_PASSWORD=$(kubectl -n demo get secret serviceradar-secrets \
  -o jsonpath='{.data.proton-password}' | base64 -d)
kubectl -n demo exec deploy/serviceradar-tools -- \
  env PROTON_PASSWORD="$PROTON_PASSWORD" proton_sql 'SELECT 1'
- The raw proton binary is also available as /usr/local/bin/proton.bin for advanced troubleshooting; pass --config-file /etc/serviceradar/proton-client/config.xml to reuse the ServiceRadar TLS material when bypassing the wrapper (see the sketch after this list).
- JetStream helpers share a context named serviceradar; either run the aliases from the MOTD (nats-streams, nats-events, …) or invoke the CLI directly:
kubectl exec -n demo deploy/serviceradar-tools -- \
  nats --context serviceradar stream ls
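A sketch of that raw-binary path, assuming the upstream client keeps its ClickHouse-style subcommand and flags (verify with proton.bin --help before relying on it):
kubectl -n demo exec -it deploy/serviceradar-tools -- \
  proton.bin client --config-file /etc/serviceradar/proton-client/config.xml --query 'SHOW STREAMS'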
Proton Streaming Safeguards
- The demo Proton config now enforces conservative streaming thresholds: queue_buffering_max_messages=50000, queue_buffering_max_kbytes=524288, fetch_message_max_bytes=524288, max_insert_block_size=2048 (with max_block_size matched in the server config), and JetStream flush caps of shared_subscription_flush_threshold_count=2000, shared_subscription_flush_threshold_size=4194304 (4 MiB), shared_subscription_flush_threshold_ms=500.
- These limits prevent TOO_LARGE_RECORD failures without raising log_max_record_size. The values live in packaging/proton/config/config.yaml and are propagated to the serviceradar-proton image and ConfigMap overlays.
- Validate the active settings from the toolbox with:
proton-sql "SELECT name, value FROM system.settings WHERE name IN \
  ('queue_buffering_max_messages','queue_buffering_max_kbytes', \
  'fetch_message_max_bytes','shared_subscription_flush_threshold_size', \
  'shared_subscription_flush_threshold_count','max_insert_block_size')"
- max_block_size is currently exposed as a session-scoped setting; if you need to override it temporarily, run proton-sql "SET max_block_size=2048" before a large replay.
- Any change for non-demo clusters should be mirrored in the shared config and rolled via bazel run //docker/images:serviceradar-proton_push followed by a kubectl rollout restart deployment/serviceradar-proton -n <namespace> (commands spelled out after this list).
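Spelled out, that roll-out for a non-demo cluster looks like this (substitute the target namespace):
bazel run //docker/images:serviceradar-proton_push
kubectl rollout restart deployment/serviceradar-proton -n <namespace>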
Canonical Identity Flow
- Sync no longer BatchGets canonical identity keys; the core registry now hydrates canonical IDs per batch using the device_canonical_map KV (WithIdentityResolver).
- Expect serviceradar-core logs to show non-zero canonicalized_by_* counters once batches replay. If they stay at 0, recheck KV health via nats-kv and ensure serviceradar-core pods run the latest image (see the log check after this list).
- Toolbox helpers to spot-check canonical entries:
proton-sql "SELECT count(), uniq_exact(metadata['armis_device_id']) FROM table(unified_devices)"
nats --context serviceradar kv get device_canonical_map/armis-id/<ARMIS_ID>
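To confirm the counters are actually moving, a quick log check (this assumes the canonicalized_by_* counter names appear verbatim in the core log lines):
kubectl -n demo logs deploy/serviceradar-core --tail=200 | grep canonicalized_by_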
Common Error Notes
- rpc error: code = Unimplemented desc = – emitted by core when the poller is stopped; safe to ignore while the pipeline is paused.
- json: cannot unmarshal object into Go value of type []*models.DeviceUpdate – happens if the discovery queue contains an object instead of an array. Clearing the streams and replaying new discovery data resolves it.
- TOO_LARGE_RECORD when inserting into unified_devices_registry – confirm the streaming safeguards above are active, replay stuck data with proton-sql "DROP VIEW IF EXISTS unified_device_pipeline_mv" followed by the migration definition, and, when necessary, re-shard replays (hash on device_id) so every insert batch remains under ~4 MiB.
Quick Reference Commands
# Run a SQL statement against Proton (default creds, database=default)
kubectl run ch-sql --rm -i --tty --image=curlimages/curl:8.9.1 -n demo --restart=Never --command -- \
sh -c "echo <base64-sql> | base64 -d >/tmp/query.sql \
&& curl -sk -u default:<password> --data-binary @/tmp/query.sql \
https://serviceradar-proton:8443/?database=default"
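# Produce the <base64-sql> payload for the command above (any SQL statement works)
echo "SELECT count() FROM table(unified_devices)" | base64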
# Check distinct Armis IDs
curl -sk -u default:<password> --data-binary \
"SELECT uniq_exact(metadata['armis_device_id']) FROM table(unified_devices)" \
https://serviceradar-proton:8443/?database=default
Keep this document up to date as we refine the tooling around the agents and the demo environment.