Observability Rollup Recovery
Use this runbook when /observability?tab=traces or /analytics shows stale or obviously wrong trace-derived data, or when the UI warns that trace rollups need attention.
Symptoms
- Trace counts or durations stay at zero even though `otel_traces` is ingesting data
- `/observability?tab=traces` loads but shows only old trace rows
- `/analytics` and trace summary cards stop changing
- The UI warns that `platform.otel_trace_summaries` or `platform.traces_stats_5m` is missing or stale
Quick Verification
Connect to the primary CNPG instance:
kubectl exec -it <cnpg-primary-pod> -n <namespace> -- psql -U serviceradar -d serviceradar
Check that the maintained trace assets exist:
SELECT to_regclass('platform.otel_trace_summaries') AS trace_summary_table;
SELECT view_schema, view_name
FROM timescaledb_information.continuous_aggregates
WHERE view_schema = 'platform'
AND view_name = 'traces_stats_5m';
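If the continuous aggregate exists, you can also confirm that a refresh policy is registered for it. This is a sketch against the standard TimescaleDB catalog view; the job id and schedule will vary per deployment, and the `hypertable_name` column shows the internal materialization hypertable rather than `traces_stats_5m` itself:

```sql
-- List TimescaleDB background jobs that refresh continuous aggregates.
-- A healthy deployment should include one such job whose config targets
-- the materialization hypertable backing traces_stats_5m.
SELECT job_id, proc_name, schedule_interval, scheduled, next_start, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';
```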
Compare raw ingest with maintained outputs:
SELECT
(SELECT max(timestamp) FROM platform.otel_traces) AS raw_latest,
(SELECT max(timestamp) FROM platform.otel_trace_summaries) AS summary_latest,
(SELECT max(bucket) FROM platform.traces_stats_5m) AS rollup_latest;
If raw_latest is materially newer than summary_latest or rollup_latest, trace maintenance is stale.
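The same comparison can be expressed as explicit lag intervals, which is easier to eyeball than three raw timestamps (a sketch using the same tables as the query above):

```sql
-- Report how far each maintained output trails raw ingest.
-- Lags well beyond the refresh cadence indicate stale maintenance.
SELECT
  (SELECT max(timestamp) FROM platform.otel_traces)
    - (SELECT max(timestamp) FROM platform.otel_trace_summaries) AS summary_lag,
  (SELECT max(timestamp) FROM platform.otel_traces)
    - (SELECT max(bucket) FROM platform.traces_stats_5m)         AS rollup_lag;
```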
Check the Scheduler
Inspect the trace summary worker and stale-job reaper:
SELECT worker, state, queue, attempt, attempted_at, completed_at, scheduled_at
FROM platform.oban_jobs
WHERE worker IN (
'ServiceRadar.Jobs.RefreshTraceSummariesWorker',
'ServiceRadar.Jobs.ReapStalePeriodicJobsWorker'
)
ORDER BY inserted_at DESC
LIMIT 20;
Healthy steady state looks like:
- recent `completed` rows for `ServiceRadar.Jobs.RefreshTraceSummariesWorker`
- recent `completed` rows for `ServiceRadar.Jobs.ReapStalePeriodicJobsWorker`
- no long-lived `executing` rows for the trace refresh worker
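To check the same table at a glance, a grouped count per worker and state (a sketch against the `platform.oban_jobs` table queried above) avoids scanning individual rows:

```sql
-- Summarize job states for the two maintenance workers.
-- Healthy output: mostly 'completed' rows with a recent last_completed,
-- and no large counts in 'executing' or 'retryable'.
SELECT worker, state, count(*) AS jobs, max(completed_at) AS last_completed
FROM platform.oban_jobs
WHERE worker IN (
  'ServiceRadar.Jobs.RefreshTraceSummariesWorker',
  'ServiceRadar.Jobs.ReapStalePeriodicJobsWorker'
)
GROUP BY worker, state
ORDER BY worker, state;
```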
Remediation
1. Reap stale periodic jobs
The product fix includes an automatic reaper, but you can trigger it immediately if recovery is lagging. Open a release shell in the core runtime:
kubectl exec -it deploy/serviceradar-core-elx -n <namespace> -- /app/bin/serviceradar_core_elx remote
Then enqueue the stale-job reaper:
alias ServiceRadar.Jobs.ReapStalePeriodicJobsWorker
{:ok, _job} = Oban.insert(ReapStalePeriodicJobsWorker.new(%{}, queue: :maintenance))
2. Trigger trace summary refresh
After stale jobs are cleared, enqueue a refresh from the same shell:
alias ServiceRadar.Jobs.RefreshTraceSummariesWorker
{:ok, _job} = Oban.insert(RefreshTraceSummariesWorker.new(%{}, queue: :maintenance))
3. Verify progress
Re-run the freshness query:
SELECT
(SELECT max(timestamp) FROM platform.otel_traces) AS raw_latest,
(SELECT max(timestamp) FROM platform.otel_trace_summaries) AS summary_latest,
(SELECT max(bucket) FROM platform.traces_stats_5m) AS rollup_latest;
Also confirm recent completed jobs:
SELECT worker, state, completed_at
FROM platform.oban_jobs
WHERE worker = 'ServiceRadar.Jobs.RefreshTraceSummariesWorker'
ORDER BY inserted_at DESC
LIMIT 5;
When Recovery Fails
If the summary worker keeps retrying or timing out:
- Check `serviceradar-core-elx` logs for DB checkout or statement timeouts
- Verify `platform.otel_trace_summaries` is still being pruned and is not growing without bound
- Verify the latest migrations ran successfully, especially the migrations that create `platform.traces_stats_5m` and the stale periodic-job reaper
If the trace summary table or CAGG is missing entirely after upgrade, rerun migrations before attempting manual recovery.
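As a sketch, assuming the core release exposes the conventional Ecto migration entry point (the module and function name `ServiceRadar.Release.migrate/0` is an assumption, not confirmed by this runbook; check the release's own documentation):

```shell
# Hypothetical: run pending migrations from the release binary.
# The migrate entry point name is an assumption; verify before use.
kubectl exec -it deploy/serviceradar-core-elx -n <namespace> -- \
  /app/bin/serviceradar_core_elx eval "ServiceRadar.Release.migrate()"
```

After migrations complete, re-run the Quick Verification queries before retrying manual recovery.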