Agent Release Management
ServiceRadar agent release management lets operators publish a signed agent release, roll it out to a selected cohort, and rely on automatic rollback when the updated runtime does not come back healthy in time.
This runbook covers:
- publishing a release into the control plane,
- configuring the agent verification key,
- rollout guardrails for canary and broad fleet updates,
- rollback and diagnostics during a failed rollout.
Prerequisites
Before using release management in production:
- Install agents with the package-managed launcher and updater layout.
- Ensure the agent runtime host has write access to
/var/lib/serviceradar/agent. - Ensure the control plane has the trusted Ed25519 public key configured before operators publish releases.
- Ensure every managed agent package embeds the trusted Ed25519 public key through build-time
ReleaseSigningPublicKeyinjection. - Publish artifacts over HTTPS.
- Include per-platform artifact metadata in the release manifest, including
os,arch,url,sha256, and optionalformatandentrypoint. - If repository-hosted release assets redirect to object storage or a CDN, keep the redirect chain on HTTPS. The control plane mirrors those artifacts into internal storage at publish time, and agents still reject insecure redirects, digest mismatches, and manifest-signature failures.
Publish A Release
Use the authenticated release-management page:
- Open
/settings/agents/releases. - For production releases, prefer
Import Repository Release. The page automatically loads the latest repository releases for the selected GitHub repo or the pinned Forgejo hosthttps://code.carverauto.dev, and lets operators import a ready release with one click when the configured manifest and signature assets are present. - If the desired release is older than the recent list or uses a custom tag workflow, use the specific-tag import field and point it at the repo-hosted release tag plus the signed manifest asset and signature asset names.
- For developer and local validation workflows, keep using
Publish Release Manuallyand enter the semantic version, release notes, manifest signature, artifact URL, SHA256 digest, OS, architecture, and artifact format directly. - Publish the release.
The control plane stores:
- the desired version,
- the signed manifest,
- rollout eligibility metadata for supported agent platforms,
- internal object-store references for each mirrored rollout artifact.
The current implementation expects the manifest signature field to contain the Ed25519 signature for the canonical manifest JSON. The control plane mirrors the referenced artifacts into internal datasvc-backed object storage, and the agent verifies that same signature before staging any artifact fetched through agent-gateway.
Security guardrails:
- Repository import only trusts GitHub-owned release hosts and the pinned Forgejo host
code.carverauto.dev. - Import and mirroring reject non-HTTPS, loopback, link-local, private-network, and unresolved destinations.
- Provider auth tokens are not forwarded to untrusted asset hosts.
Recommended repository-release asset convention:
serviceradar-agent-release-manifest.jsonserviceradar-agent-release-manifest.sigserviceradar-agent_<version>_linux_amd64.tar.gz
The manifest asset should contain the full multi-platform release manifest, including the final artifact URLs, SHA256 digests, platform metadata, and optional format / entrypoint fields.
The GitHub release pipeline now publishes these assets automatically when SERVICERADAR_AGENT_RELEASE_PRIVATE_KEY is configured for the release job. Manual repository releases must attach the same three assets for one-click import to work.
Signing Key Handling
Keep signing private keys out of the fleet.
Recommended handling:
- Sign release manifests in CI or an offline release workflow.
- Distribute only the public verification key to agents.
- Rotate the public key by shipping a new package-managed agent/updater build before switching signing infrastructure.
- Never reuse the release-management UI as the source of truth for private signing material.
Relevant agent settings:
- Package-managed agents verify release manifests with the build-time embedded
ReleaseSigningPublicKey. - Package-managed agents use fixed package-owned paths for the updater, seed binary, and mutable runtime root.
- The control plane still reads
SERVICERADAR_AGENT_RELEASE_PUBLIC_KEYat runtime for release import and validation.
Onboarding propagation:
- Agent onboarding bundles do not distribute the managed release verification key for package-managed agents.
serviceradar-cli enroll --core-url ... --token ...preserves unrelated/etc/serviceradar/kv-overrides.enventries, but protected agent release trust keys are ignored if they appear in a bundle.
Migration guidance:
- Older hosts may still carry
SERVICERADAR_AGENT_RELEASE_PUBLIC_KEYin/etc/serviceradar/kv-overrides.env. - After upgrading to a hardened package-managed agent build, those stale entries no longer define the managed release trust anchor.
Rollout Guardrails
Prefer controlled staged rollouts over broad fleet pushes.
Recommended operator workflow:
- Publish the release.
- Create a canary rollout from
/settings/agents/releaseswith a small explicit cohort. - Use
batch_size: 1or another small batch for the first rollout. - Leave a non-zero
batch_delay_secondswhen validating a new train in production. - Use the rollout compatibility preview on
/settings/agents/releasesto confirm the selected cohort resolves to known agents and matches the published platform set before starting the rollout. The page disables rollout submission for unresolved agent IDs, empty cohorts, or unsupported platforms, and the control plane also rejects those invalid cohorts at submit time. - Watch
/agentsfor version distribution, rollout-state counts, and per-agent target version drift. - Pause the rollout immediately if agents begin failing verification, restart, or health checks.
- Resume only after the failure mode is understood.
Operational guidance:
- Keep cohorts explicit and audit-friendly.
- Do not mix unrelated platform variants in the same canary unless you have validated artifact coverage for each one.
- Treat digest mismatch or signature verification failures as release-pipeline problems, not agent-host problems, until proven otherwise.
- Use cancel only to stop undispatched targets; already running targets will continue until they reach a terminal state.
Rollback Behavior
Agent activation uses a separate updater and a stable package-managed launcher.
Runtime behavior:
- The agent stages the verified payload under the mutable runtime root.
- The rollout command points the agent at an authenticated HTTPS download path on
agent-gateway, not at the original GitHub/Forgejo/Harbor host. agent-gatewayresolves the authorized rollout target, fetches the mirrored object from internal storage, and streams it back to the agent.- The updater switches the
currentsymlink atomically to the new versioned payload. - The service restarts against the new runtime.
- If the updated agent does not report healthy before the reconnect deadline, the updater restores the previous target and restarts again.
The control plane records terminal per-agent states such as:
healthyfailedrolled_backcanceled
Diagnostics
Use these surfaces first during rollout triage:
/settings/agents/releases: published releases, supported platform badges, rollout compatibility preview, disabled invalid rollout submission, rollout-creation validation, recent rollouts, pause/resume/cancel actions, and per-target diagnostics./agents: version distribution, rollout-state counts, target-version filtering, per-agent last update error./agents/:uid: current version, desired version, rollout state, last update error, and recent rollout targets.
Common failure patterns:
- Invalid signature: the agent rejected the manifest before staging.
- Digest mismatch: the downloaded artifact did not match the published SHA256.
- No matching artifact: the control plane could not select an artifact for the agent's
osandarch. The releases page surfaces this asno matching release artifact for agent platform <os>/<arch>and highlights the affected target as an unsupported platform. - Unresolved custom IDs: the rollout cohort includes agent IDs that do not currently resolve to inventory records, so rollout creation is rejected until those entries are corrected.
- Rolled back: the updater switched versions, but the new runtime did not become healthy before the deadline.
Recovery Playbook
When a rollout is unhealthy:
- Pause the rollout.
- Inspect
last_update_errorand the per-agent target state. - If failures are verification-related, fix the manifest, signature, or artifact publication and publish a corrected version instead of reusing the broken one.
- If failures are post-restart regressions, let the automatic rollback settle and confirm agents return to the previous version.
- Start a fresh canary rollout for the corrected release.
Do not manually mutate files under the package-managed launcher paths during incident response. ServiceRadar-managed runtime payloads are the mutable layer; launcher, unit files, and baseline updater assets remain package-owned.