Search Planner Operations
The device search planner fronts /api/devices/search and selects between the in-memory registry and the SRQL service. This runbook explains how to manage the rollout, interpret diagnostics, and monitor the new telemetry.
Feature Flags
Planner usage can be toggled independently in the core service and the web UI:
- Core: features.use_device_search_planner in the serviceradar-config ConfigMap (core.json, default true).
  Edit the ConfigMap (kubectl edit configmap serviceradar-config -n demo) or update k8s/demo/base/configmap.yaml before redeploying. After changing the flag, restart the core deployment:
  kubectl rollout restart deployment/serviceradar-core -n demo
- Core enforcement: features.require_device_registry (default false).
  When set to true, the API refuses to fall back to Proton for /api/devices list/detail routes. Leave it disabled if you need the legacy Proton-backed endpoints during incident response.
- Web UI: NEXT_PUBLIC_FEATURE_DEVICE_SEARCH_PLANNER (default true).
  Update the ConfigMap before rollout (deploy.sh generates it) or patch the deployment in place:
  kubectl set env deployment/serviceradar-web NEXT_PUBLIC_FEATURE_DEVICE_SEARCH_PLANNER=true -n demo
  kubectl set env deployment/serviceradar-web FEATURE_DEVICE_SEARCH_PLANNER=true -n demo
When either flag is disabled, the UI falls back to the legacy device list results and attaches diagnostics with engine_reason: "feature_flag_disabled".
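The fallback rule above can be sketched as a small decision function. This is a minimal illustration, not the actual UI code: the function name resolve_search_path and the shape of the returned dictionary are hypothetical, and only the two flags and the feature_flag_disabled reason come from this runbook.

```python
def resolve_search_path(core_planner_enabled: bool, ui_planner_enabled: bool) -> dict:
    """Return which search path is used and any diagnostics attached by the fallback.

    Hypothetical helper: it mirrors the documented rule that disabling
    either flag sends the UI back to the legacy device list.
    """
    if core_planner_enabled and ui_planner_enabled:
        # Both flags on: the planner serves the request and supplies its
        # own diagnostics, so nothing is attached here.
        return {"path": "planner", "diagnostics": None}
    # Either flag off: legacy list results plus the documented engine_reason.
    return {
        "path": "legacy_device_list",
        "diagnostics": {"engine_reason": "feature_flag_disabled"},
    }


if __name__ == "__main__":
    print(resolve_search_path(True, False))
```

The point of the sketch is that the two flags combine with a logical AND: enabling the planner in core alone does not change what UI users see.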
Planner Diagnostics
/api/devices/search responses include a diagnostics map with the following keys:
| Field | Description |
|---|---|
| mode | Caller-supplied planner mode (auto, registry_only, srql_only). |
| engine_reason | Why a backend was chosen (query_supported, query_not_supported, registry_constraints, mode_forced, registry_only_available). |
| engine | Backend that executed the query (registry or srql). |
| duration_ms | End-to-end planner latency, including registry lookup or SRQL round trip. |
| unsupported_tokens | SRQL tokens that forced a hand-off (only present when the planner routes to SRQL). |
Example
curl -sk -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":"in:devices status:online","filters":{"search":"serviceradar"},"pagination":{"limit":10}}' \
  https://core.serviceradar-demo/api/devices/search | jq '.diagnostics'
Output:
{
  "mode": "auto",
  "engine_reason": "query_supported",
  "engine": "registry",
  "duration_ms": 2.3
}
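When reviewing many responses, it can help to collapse the diagnostics map into a single log-friendly line. The sketch below does that for the fields listed in the table. The helper name describe_diagnostics is hypothetical, and it assumes unsupported_tokens arrives as a list of strings, which this runbook does not specify.

```python
import json


def describe_diagnostics(diagnostics: dict) -> str:
    """Render a planner diagnostics map as one 'key=value' line."""
    parts = [
        f"engine={diagnostics.get('engine', 'unknown')}",
        f"reason={diagnostics.get('engine_reason', 'unknown')}",
        f"mode={diagnostics.get('mode', 'unknown')}",
    ]
    duration = diagnostics.get("duration_ms")
    if duration is not None:
        parts.append(f"duration_ms={duration}")
    # Assumption: unsupported_tokens is a list of strings, present only
    # when the planner handed the query off to SRQL.
    tokens = diagnostics.get("unsupported_tokens")
    if tokens:
        parts.append("unsupported_tokens=" + ",".join(tokens))
    return " ".join(parts)


sample = json.loads(
    '{"mode": "auto", "engine_reason": "query_supported",'
    ' "engine": "registry", "duration_ms": 2.3}'
)
print(describe_diagnostics(sample))
```

For the sample response shown above, this prints engine=registry reason=query_supported mode=auto duration_ms=2.3.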
Telemetry Metrics
Planner execution emits the following OpenTelemetry metrics (exported via the standard serviceradar.search meter):
- search_registry_duration_seconds (histogram): latency for registry-backed searches.
  Dimensions: mode, status (success or error), result_state (empty, lt10, lt50, lt100, gte100).
- search_srql_duration_seconds (histogram): latency for SRQL-backed searches, with the same dimensions.
- search_planner_fallback_total (counter): count of queries forced to SRQL because the registry could not satisfy them.
  Dimensions: reason (query_not_supported, registry_constraints, or unknown) and mode.
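The result_state dimension groups searches by how many results they returned. A minimal sketch of that bucketing follows. The exact boundaries are an assumption inferred from the label names (for example, lt10 taken to mean 1 to 9 results), and the function name is hypothetical, so check the core service before relying on them.

```python
def result_state(result_count: int) -> str:
    """Map a result count to a result_state label.

    Assumed boundaries, inferred from the label names only:
    0 -> empty, 1-9 -> lt10, 10-49 -> lt50, 50-99 -> lt100, 100+ -> gte100.
    """
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    if result_count == 0:
        return "empty"
    if result_count < 10:
        return "lt10"
    if result_count < 50:
        return "lt50"
    if result_count < 100:
        return "lt100"
    return "gte100"


if __name__ == "__main__":
    for n in (0, 9, 10, 99, 100):
        print(n, result_state(n))
```

Keeping the label set this small bounds metric cardinality while still separating empty results from large result sets when comparing latency.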
Prometheus Queries
histogram_quantile(0.95, rate(search_registry_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(search_srql_duration_seconds_bucket[5m]))
increase(search_planner_fallback_total[15m])
Use these to confirm registry latency stays sub-millisecond and to identify unexpected fallback spikes.
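The quantile queries estimate p95 from cumulative bucket counts, so the answer is only as precise as the bucket boundaries. The sketch below reproduces the linear interpolation that histogram_quantile applies to classic histograms, in simplified form: it assumes non-negative bucket bounds and a lower bound of 0 for the first bucket. The bucket boundaries and counts in the example are made-up sample values, not the planner's real bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper_bound and end with a +Inf bucket,
    matching the *_bucket series that Prometheus stores.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls past the last finite boundary, so the
                # best available answer is that boundary itself.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable when the +Inf bucket holds the total


# Hypothetical registry latency buckets, in seconds.
sample = [(0.001, 90), (0.005, 98), (0.01, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

With these sample buckets the p95 estimate is about 0.0035 seconds, interpolated between the 1 ms and 5 ms boundaries. Confirming sub-millisecond latency therefore depends on the histogram having bucket boundaries below 1 ms.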
Rollout Playbook
- Stage in core: Flip features.use_device_search_planner to true in serviceradar-config and restart serviceradar-core. Keep the UI flag disabled so only API clients exercise the planner.
- Validate: Tail planner metrics (kubectl logs deployment/serviceradar-core -n demo | grep search_planner) and verify histograms appear in Prometheus.
- Enable UI flag: Set NEXT_PUBLIC_FEATURE_DEVICE_SEARCH_PLANNER=true (and matching server flag) in the web deployment and redeploy.
- Monitor: Watch search_planner_fallback_total and SRQL latency. Sustained increases indicate unsupported SRQL patterns; inspect engine_reason diagnostics to pinpoint problem queries.
- Rollback: Set both flags to false and redeploy core/web. The legacy /api/devices list path remains available as a safe fallback.
Troubleshooting
- Unexpected SRQL traffic: Check the planner diagnostics for engine_reason: "query_not_supported". Queries containing aggregations (such as count(...) or sum(...)), joins, or metadata fan-out currently require SRQL.
- Empty registry results: Confirm the device registry is hydrated (core_device_stats_processed_records gauge) and that the trigram index contains entries (SearchDevices unit tests cover expected behavior).
- High SRQL latency: Use search_srql_duration_seconds to detect regressions and review the SRQL service logs (ocaml/srql). Consider increasing SRQL timeouts or scaling the service if demand spikes.
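To find which query patterns drive unexpected SRQL traffic, one option is to collect diagnostics from a batch of /api/devices/search responses and tally the unsupported tokens. The sketch below assumes the diagnostics have already been gathered into a list of dictionaries and that unsupported_tokens is a list of strings. The helper name and the sample token values are hypothetical.

```python
from collections import Counter


def tally_unsupported_tokens(diagnostics_list: list) -> list:
    """Count unsupported tokens across SRQL-routed queries, most frequent first."""
    counts = Counter()
    for diagnostics in diagnostics_list:
        # Only queries the planner handed to SRQL carry unsupported_tokens.
        if diagnostics.get("engine") != "srql":
            continue
        counts.update(diagnostics.get("unsupported_tokens") or [])
    return counts.most_common()


# Hypothetical sample data for illustration only.
collected = [
    {"engine": "srql", "engine_reason": "query_not_supported",
     "unsupported_tokens": ["count(", "join"]},
    {"engine": "registry", "engine_reason": "query_supported"},
    {"engine": "srql", "engine_reason": "query_not_supported",
     "unsupported_tokens": ["count("]},
]
print(tally_unsupported_tokens(collected))
```

Tokens at the top of the tally are the first candidates for either registry support or client-side query changes.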