
Ansible Integration

ServiceRadar drives Ansible playbook execution against devices in its inventory by talking to a customer-side AWX/AAP controller. Every run is surfaced as a first-class resource with live status, per-host outcomes, a full audit trail, and the same OCSF-based universal-log-viewer search as the rest of the platform. It is the supported alternative to running ARA next to a stand-alone AWX deployment.

This guide covers the architecture, deployment, the operator and user guides, the configuration and RBAC references, troubleshooting, and the v1 limitations.

Architecture

```
ServiceRadar core (SaaS / control plane)
        │  AgentCommandBus.dispatch/4
        ▼
ServiceRadar agent-gateway
        │  ControlStream (gRPC)
        ▼
ServiceRadar agent (runs inside the customer's network)
        │  invokes WASM plugin
        ▼
awx WASM plugin (cmd/wasm-plugins/awx/)
        │  HTTPS REST + per-call credential broker grant
        ▼
AWX / AAP controller (in the customer's network)
```

AWX usually lives in the customer's private network. ServiceRadar core cannot reach it directly; the agent can. So every AWX REST call traverses the chain above — core → agent-gateway → agent → awx WASM plugin → AWX. The plugin is the network bridge; orchestration and persistence stay in Elixir.

Two execution patterns in the plugin:

| Entrypoint | Mode | Purpose |
|---|---|---|
| run_check | On-demand via CommandRequest | Every AWX REST verb: awx.ping, awx.list_*, awx.fetch_template, awx.launch_job, awx.fetch_job, awx.cancel_job, awx.fetch_events_for_jobs |
| inventory_sync | Scheduled assignment | Walks AWX inventories and emits a DeviceDiscovery aggregate via the same pipeline proxmox-inventory uses — DIRE merges the records and flips Device.ansible_managed = true |

Pulse-based event ingestion. ServiceRadar does not maintain a long-lived stream to AWX. Each registered controller has a RunPulseWorker Oban job that ticks every run_pulse_interval_ms (default 2000) and dispatches a single awx.fetch_events_for_jobs command covering every non-terminal PlaybookRun's (awx_job_id, last_event_id) watermark. The plugin makes one HTTP call per active job and returns aggregated events; EventIngestor persists them, drives the run state machine, and projects each result into an OCSF Application Activity event for the universal log viewer.
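The watermark mechanics can be sketched as follows. This is a simplified model, not the worker's actual code: the real pulse dispatches one aggregated awx.fetch_events_for_jobs command, and the field names here are illustrative.

```go
package main

import "fmt"

// runWatermark mirrors the (awx_job_id, last_event_id) pair the pulse
// worker sends for each non-terminal run. Field names are illustrative.
type runWatermark struct {
	AwxJobID    int
	LastEventID int
}

// nextWatermark advances a run's watermark past the events just ingested,
// so the next pulse only asks AWX for event counters > LastEventID.
func nextWatermark(w runWatermark, eventIDs []int) runWatermark {
	for _, id := range eventIDs {
		if id > w.LastEventID {
			w.LastEventID = id
		}
	}
	return w
}

func main() {
	w := runWatermark{AwxJobID: 42, LastEventID: 17}
	w = nextWatermark(w, []int{18, 19, 21})
	fmt.Println(w.LastEventID) // prints 21
}
```

Because the watermark only ever moves forward, a pulse that returns no new events (or arrives out of order) leaves it unchanged, which is what makes the polling idempotent.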

Two playbook sources, both surfaced as Playbook rows with a source_type discriminator:

  • :git — registered git repositories; cloned + parsed by GitCatalogSyncWorker. Operators bind each git-sourced playbook to an AWX job template before it becomes launchable.
  • :awx — AWX Job Templates auto-mirrored by AwxCatalogSyncWorker. Launchable by definition.

A single playbook can appear via both sources; both source types coexist in the catalog UI.

Deployment

Prerequisites

  • An AWX or AAP instance reachable from at least one ServiceRadar agent.
  • An OAuth2 personal access token from AWX with read access to inventories / projects / job_templates, plus permission to launch the templates ServiceRadar will run.
  • A ServiceRadar agent registered to the gateway and reachable from the AWX network (typical: same Kubernetes cluster, same VPC, same VLAN).

Apply the schema migration

The Ansible tables are added via a single named Ash migration. From a freshly checked out copy with the runtime running against your target database:

```shell
cd elixir/serviceradar_core
mix ash.codegen add_ansible_integration   # generates the migration
mix ash.migrate                           # applies it
```

Verify with:

```sql
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'platform' AND table_name LIKE 'ansible_%'
ORDER BY table_name;
```

You should see 14 tables (resources + their AshPaperTrail _versions mirrors for the four audited resources): controllers, playbook_repositories, playbooks, playbook_runs (+ run_targets / plays / tasks / task_results / contents), and playbook_schedules.

Deploy the awx WASM plugin

The plugin lives at go/cmd/wasm-plugins/awx/ and exposes two manifests:

  • plugin.yaml — the on-demand run_check entrypoint for REST verbs
  • plugin.inventory_sync.yaml — the scheduled inventory_sync entrypoint that emits DeviceDiscovery records

Build:

```shell
cd go/cmd/wasm-plugins/awx
tinygo build -o awx.wasm \
  -target=wasi -gc=conservative -scheduler=none -no-debug ./
```

Import the signed package through ServiceRadar's plugin staging flow (the platform requires Rekor / cosign verification for plugin imports — see the WASM Plugins guide for the publishing flow). Then assign both plugin manifests to the agent(s) that reach your AWX network. The inventory_sync assignment is what drives the Device.ansible_managed flag — without it, no devices flip to ansible-managed.

Configure environment variables

ServiceRadar exposes the operator-tunable knobs as env vars surfaced in both docker-compose.yml and helm/serviceradar/values.yaml. The defaults are conservative; tune only when needed.

| Env var | Default | What it does |
|---|---|---|
| ANSIBLE_RETENTION_RUN_DETAIL_DAYS | 90 | Prune PlaybookPlay / PlaybookTask / PlaybookTaskResult past this age. 0 disables detail pruning. |
| ANSIBLE_RETENTION_RUN_SUMMARY_DAYS | (empty) | When set, delete the entire PlaybookRun (cascades to targets / plays / tasks / results) past this age. Empty = keep forever. |
| ANSIBLE_RETENTION_INTERVAL_SECONDS | 86400 | How often RetentionWorker scans. |
| AWX_CONTROLLER_HEALTH_INTERVAL_SECONDS | 30 | ControllerHealthWorker cadence (one awx.ping per registered controller). |
| AWX_RUN_WATCHDOG_INTERVAL_SECONDS | 60 | RunWatchdog interval — flags stuck non-terminal runs (past 2× job_template timeout, or 1 h fallback). |
| AWX_SCHEDULE_EVALUATOR_INTERVAL_SECONDS | 60 | ScheduleEvaluatorWorker cron evaluation cadence. |
| ANSIBLE_CATALOG_BASE_DIR | /var/lib/serviceradar/ansible_catalog (helm), <tmp> (compose) | Base directory for GitCatalogSyncWorker repo clones. Mount a PVC at this path in Kubernetes to keep the cache warm across pod restarts. |

In Helm, these live under core.ansible.*:

```yaml
core:
  ansible:
    runDetailDays: 90
    runSummaryDays: ""   # keep forever
    retentionIntervalSeconds: 86400
    controllerHealthIntervalSeconds: 30
    runWatchdogIntervalSeconds: 60
    scheduleEvaluatorIntervalSeconds: 60
    catalogBaseDir: "/var/lib/serviceradar/ansible_catalog"
```

The current effective values are surfaced at runtime in Settings → Ansible → Retention.

Operator guide

Permissions: this guide assumes ansible.controllers.manage + ansible.repositories.manage + ansible.schedules.manage. Admins have these by default; see RBAC reference for the full set.

1. Store the AWX API token in the credential broker

The Ansible integration never sees a plaintext token — it always passes a credential broker grant referencing a stored secret. Create the secret first.

From iex -S mix against core-elx:

```elixir
alias ServiceRadar.Actors.SystemActor
alias ServiceRadar.Credentials.NetworkCredentialSecret

{:ok, secret} =
  NetworkCredentialSecret.create_secret(
    %{
      name: "awx-prod",
      provider: "awx",
      credential_kind: :api_token,
      secret_payload: "PASTE-AWX-OAUTH2-TOKEN-HERE"
    },
    actor: SystemActor.system(:setup)
  )

IO.inspect(secret.id, label: "credential_secret_id")
```

The credential broker, not ServiceRadar core, handles plaintext. SSH keys, become passwords, and vault passwords are never stored here — those live in AWX's credential vault. The only secret ServiceRadar holds is the AWX OAuth2 token, encrypted at rest via AshCloak.

2. Register an AWX controller

Navigate to Settings → Ansible → Controllers and click + Add controller. Fill in:

| Field | Notes |
|---|---|
| Name | Operator-facing label, unique. Used in run logs, OCSF events, audit trails. |
| Agent ID | The ServiceRadar agent that reaches this AWX. Must have both plugin assignments. |
| Description | Optional. |
| Base URL | https://awx.internal.example.com — must include scheme. |
| Credential secret ID | The UUID from step 1. (v1 limitation: paste manually; picker UX comes later.) |
| Inventory sync (s) | Plugin-side cadence for inventory_sync. Default 300. |
| Catalog sync (s) | AwxCatalogSyncWorker cadence (mirrors AWX templates as :awx-sourced playbooks). Default 600. |
| Run pulse (ms) | RunPulseWorker cadence — lower for snappier UI, higher for lower AWX API load. Default 2000. |

Save. Within AWX_CONTROLLER_HEALTH_INTERVAL_SECONDS (default 30), ControllerHealthWorker dispatches awx.ping → plugin → AWX → EventIngestor writes last_health_at + flips status to :ok. Refresh the row.

If the status stays :unknown past two health intervals, see Troubleshooting.

3. Register a git playbook repository (optional)

Navigate to Settings → Ansible → Repositories and click + Add repository. Fill in:

| Field | Notes |
|---|---|
| Name | Unique. |
| Ref | Branch or tag. Default main. |
| Description | Optional. |
| Git URL | HTTPS only. SSH is a v2 feature. |
| Deploy token secret ID | Optional. Public repos: leave blank. Private repos: create a NetworkCredentialSecret with the HTTPS deploy token (same shape as the AWX API secret) and paste the UUID here. |
| Sync interval (s) | GitCatalogSyncWorker cadence. Min 60s; default 600s. |

Save. GitCatalogSyncWorker clones the repo to $ANSIBLE_CATALOG_BASE_DIR/<repository_id>/, walks .yml / .yaml files, parses each as an Ansible playbook (the first play's metadata becomes the row), and upserts one Playbook row per file with source_type: :git.

Per-file YAML parse failures are surfaced inline rather than dropped — the row shows up with parse_status: :error and a diagnostic on the catalog page, so operators can spot broken playbooks instead of wondering why they're missing.
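For reference, a git-sourced playbook the catalog parser would pick up might look like the following. This is a hypothetical example file, not one shipped with ServiceRadar; per the parsing rules above, only the first play's metadata feeds the catalog row, and the vars_prompt entries become the typed launch-form inputs described later in this guide.

```yaml
# deploy_web.yml — hypothetical example in a registered repository
- name: Deploy web tier            # first play's name becomes the Playbook row
  hosts: all
  vars_prompt:
    - name: release_tag            # rendered as a required text input
      prompt: Release tag to deploy
      private: false
    - name: vault_token            # private: true renders as a password input
      prompt: Vault token
      private: true
  tasks:
    - name: Print release
      ansible.builtin.debug:
        msg: "Deploying {{ release_tag }}"
```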

Bind git-sourced playbooks to an AWX template before launching. Git-sourced rows show up in /ansible/catalog with an unbound warning badge until an awx_job_template_id is set. Operators are expected to keep the AWX project + job_template configured to match; ServiceRadar does not auto-create AWX templates from git playbooks in v1.

4. Watch inventory flow in automatically

If the inventory_sync plugin assignment is wired up on the controller's agent, within inventory_sync_interval_seconds you should see devices in your inventory flipping to ansible_managed: true with ansible_inventory_ref populated. This is driven by:

  1. Plugin runs on schedule.
  2. Plugin calls AWX inventory API.
  3. Plugin emits a DeviceDiscovery aggregate (source: "awx") via result.WithDeviceDiscovery(...).
  4. Agent → gateway → DIRE merges the records with existing devices (matching on ansible_host IP, hostname, or AWX host name in priority order).
  5. Matched devices get ansible_managed = true; AWX hosts that DIRE cannot match surface in Settings → Ansible → Controllers as a "needs review" list (v2 feature; currently they're emitted but not yet rendered).

No manual "mark Ansible-managed" toggle exists — the state is fully derived from inventory sync.
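The matching priority in step 4 can be sketched as a small function. Field and function names here are illustrative, not DIRE's actual API; only the priority order (ansible_host IP, then hostname, then AWX host name) comes from the text above.

```go
package main

import "fmt"

// device is a minimal stand-in for an inventory row.
type device struct {
	UID      string
	IP       string
	Hostname string
}

// pickMatch tries each documented match key in priority order and returns
// the first device UID that matches, or ok=false for an unmatched AWX host.
func pickMatch(awxIP, awxHostname, awxName string, devices []device) (string, bool) {
	for _, d := range devices { // 1. ansible_host IP
		if awxIP != "" && d.IP == awxIP {
			return d.UID, true
		}
	}
	for _, d := range devices { // 2. hostname
		if awxHostname != "" && d.Hostname == awxHostname {
			return d.UID, true
		}
	}
	for _, d := range devices { // 3. AWX host name
		if awxName != "" && d.Hostname == awxName {
			return d.UID, true
		}
	}
	return "", false // unmatched hosts are emitted but not yet surfaced
}

func main() {
	devs := []device{{UID: "sr:abc", IP: "10.0.0.5", Hostname: "web01"}}
	uid, ok := pickMatch("10.0.0.5", "", "", devs)
	fmt.Println(uid, ok) // prints sr:abc true
}
```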

5. Create a scheduled run (optional)

Navigate to Settings → Ansible → Schedules and click + Add schedule. Fill in:

| Field | Notes |
|---|---|
| Name | Unique. |
| Enabled | Default on. |
| Playbook | Dropdown filtered to launchable playbooks (anything with awx_job_template_id). |
| Target device UIDs | Comma-separated OCSF UIDs (e.g. sr:abc,sr:def). All must share one controller. (v2: proper multi-select picker.) |
| Cron | Standard 5-field expression — e.g. 0 3 * * * for daily at 03:00. |
| Timezone | UTC / Etc/UTC only in v1. (:tzdata is not currently a dependency.) |
| Allow concurrent runs | Default off — when the previous run is still non-terminal, the next fire is recorded as :skipped_overlap. |
| extra_vars (JSON) | Passed to AWX on each fire. |

ScheduleEvaluatorWorker fires every minute by default (configurable via AWX_SCHEDULE_EVALUATOR_INTERVAL_SECONDS); each due schedule produces a PlaybookRun exactly as if a human had launched it from the UI.
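The overlap rule can be modeled as a small decision function. This is a simplified sketch — the outcome strings mirror the schedule's recorded outcomes, but the worker's real logic lives in Elixir:

```go
package main

import "fmt"

// evaluate models one tick of the schedule evaluator for a single schedule.
// Outcome strings mirror the documented evaluation outcomes.
func evaluate(enabled, due, previousRunTerminal, allowConcurrent bool) string {
	switch {
	case !enabled:
		return "skipped_disabled"
	case !due:
		return "not_due"
	case !previousRunTerminal && !allowConcurrent:
		return "skipped_overlap"
	default:
		return "fired"
	}
}

func main() {
	// Due schedule, previous run still non-terminal, concurrency off:
	fmt.Println(evaluate(true, true, false, false)) // prints skipped_overlap
}
```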

6. Retention

Navigate to Settings → Ansible → Retention for a read-only view of the effective retention windows + worker cadences. Changes are env-var driven; redeploy after editing values.yaml / docker-compose.yml.

User guide

Permissions: this section requires ansible.runs.launch. The Run Task button is hidden for users without it. To view runs only, ansible.runs.view is enough.

Browse the catalog

/ansible/catalog shows every Playbook ServiceRadar has discovered, both :git- and :awx-sourced. Filter by:

  • Source (all / git / awx)
  • Binding (all / launchable / unbound)
  • Free-text on name + description

Bound rows show a green badge with the AWX job template id; unbound rows show a warning badge — those aren't launchable until an awx_job_template_id is set on the row.

Launch against multiple devices

  1. Visit /devices.
  2. Tick the checkbox on each ansible-managed device you want to target. The checkbox is also available on non-managed devices, but the Run Task flow rejects them at submit.
  3. Click + Run Task in the bulk-action toolbar (top-right of the inventory table, next to Bulk Edit / Bulk Delete).
  4. ServiceRadar navigates to /ansible/launch?devices=... pre-filled with your selection.

The Launch page validates the targets:

  • All devices must be ansible_managed.
  • All devices must point at the same AWX controller. Mixed-controller selections are rejected with a clear error.
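The two validation rules can be sketched as follows (names are illustrative; the real check runs server-side in Elixir):

```go
package main

import (
	"errors"
	"fmt"
)

// target is a minimal stand-in for a selected device.
type target struct {
	UID            string
	AnsibleManaged bool
	ControllerID   string
}

// validateTargets enforces the two documented launch rules: every device is
// ansible_managed, and all devices point at the same AWX controller.
func validateTargets(ts []target) error {
	if len(ts) == 0 {
		return errors.New("no targets selected")
	}
	controller := ts[0].ControllerID
	for _, t := range ts {
		if !t.AnsibleManaged {
			return fmt.Errorf("device %s is not ansible_managed", t.UID)
		}
		if t.ControllerID != controller {
			return errors.New("mixed-controller selection rejected")
		}
	}
	return nil
}

func main() {
	err := validateTargets([]target{
		{UID: "sr:abc", AnsibleManaged: true, ControllerID: "c1"},
		{UID: "sr:def", AnsibleManaged: true, ControllerID: "c2"},
	})
	fmt.Println(err) // mixed-controller selection is rejected with an error
}
```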

Launch against a single device

On a device detail page, the Run Task action button appears in the page header (next to Edit / Delete / Console) only when all of the following hold: you have ansible.runs.launch, the device is not soft-deleted, and the device is ansible_managed.

Clicking it goes to /ansible/launch?devices=<uid> with the single device pre-filled.

The launch form

  1. Targets: read-only summary of selected devices with an ansible-managed badge per row.
  2. Playbook: dropdown of launchable playbooks; for each pick, the variable form below re-renders with typed inputs derived from the playbook's variable schema:
    • AWX-sourced playbooks use the survey_spec (text / textarea / password / integer / float / multiplechoice / multiselect).
    • git-sourced playbooks use vars_prompt (text, plus password when private: true). Defaults are pre-filled; required fields are marked.
  3. Override extra_vars as raw JSON (checkbox). When toggled on, a textarea appears whose contents merge over the typed inputs at submit. Use this for variables not declared in the playbook's schema.
  4. Launch — submits; on success, you're redirected to /ansible/runs/:id.

Watch a run

/ansible/runs/:id subscribes to a per-run PubSub topic and live-updates as RunPulseWorker drains events from AWX. You'll see:

  • Header card: state pill (pending / launching / running / succeeded / partial / failed / unreachable / canceled), AWX job id, scheduled-vs-ad-hoc badge, timestamps, duration.
  • Targets table: per-host status with ok / changed / failed / skipped / unreachable counts.
  • Plays accordion: per-play status + task count; click to expand and see individual tasks.

The state machine is enforced: a terminal run (succeeded, partial, failed, unreachable, canceled) never transitions further. Late events arriving from AWX after a terminal transition are still persisted to the task table for completeness but don't move the state.
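A minimal model of that guard (state strings come from the list above; function names are illustrative):

```go
package main

import "fmt"

// terminal reports whether a run state admits no further transitions,
// matching the documented terminal set.
func terminal(state string) bool {
	switch state {
	case "succeeded", "partial", "failed", "unreachable", "canceled":
		return true
	}
	return false
}

// applyTransition moves a run to next only when the current state is
// non-terminal; late events leave a terminal state untouched.
func applyTransition(current, next string) string {
	if terminal(current) {
		return current
	}
	return next
}

func main() {
	fmt.Println(applyTransition("running", "succeeded")) // prints succeeded
	fmt.Println(applyTransition("canceled", "failed"))   // prints canceled
}
```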

Listing runs

/ansible/runs is the cross-controller index, filterable by state (all / pending / launching / running / succeeded / partial / failed / unreachable / canceled). The "Refresh" button reloads with the current filter; PubSub also live-inserts rows that match the active filter as their state changes.

Cron-driven runs

Schedules registered in Settings → Ansible → Schedules fire automatically. Each fire produces a regular PlaybookRun with schedule_id set — visible in both /ansible/runs (with a "scheduled" badge) and on the schedule's row in the settings tab (last fire + outcome badge).

Universal log viewer

Every state transition + every task result also produces an OCSF Application Activity (class 6003) event in the universal log viewer. Each event carries an unmapped.ansible block with run_id, playbook_id, controller_id, awx_job_id, task_name, awx_host_name, device_uid, etc., so you can search:

  • by run id to see one run's full event stream
  • by device uid to see every ansible activity that touched a host
  • by task name across all runs ever
  • by status to filter for failures globally
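As a hedged illustration, an ingested task-result event might carry an unmapped.ansible block shaped roughly like this — the field names are the ones listed above, while every value is hypothetical:

```json
{
  "class_uid": 6003,
  "activity_name": "task_result",
  "status": "failed",
  "unmapped": {
    "ansible": {
      "run_id": "6f1c0a2e-0000-0000-0000-000000000000",
      "playbook_id": "0b9d1c44-0000-0000-0000-000000000000",
      "controller_id": "c7a13e90-0000-0000-0000-000000000000",
      "awx_job_id": 4281,
      "task_name": "Restart nginx",
      "awx_host_name": "web01.internal",
      "device_uid": "sr:abc"
    }
  }
}
```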

This is the supported replacement for ARA's UI for cross-run search — ServiceRadar's structured per-run pages are richer than ARA for one run, and the universal log viewer is richer than ARA for cross-run aggregation.

Configuration reference

Per-controller overrides

Stored on each AnsibleController row; override the deployment-wide cadence per controller:

| Column | Default | Override |
|---|---|---|
| inventory_sync_interval_seconds | 300 | Plugin's inventory_sync assignment cadence for this controller. |
| catalog_sync_interval_seconds | 600 | AwxCatalogSyncWorker cadence for this controller. |
| run_pulse_interval_ms | 2000 | RunPulseWorker cadence — lower = snappier UI, higher = lower AWX API load. |

Editable in the Controllers tab.

Per-repository overrides

| Column | Default | Override |
|---|---|---|
| sync_interval_seconds | 600 | GitCatalogSyncWorker cadence for this repo. Min 60s. |

Per-schedule overrides

| Column | Default | Override |
|---|---|---|
| cron | required | Standard 5-field cron expression. |
| timezone | UTC | UTC / Etc/UTC only in v1. |
| allow_concurrent | false | When true, fire even if the previous run is still non-terminal. |

RBAC reference

The eight ansible permission keys, with the default role assignments:

| Key | Default roles | What it grants |
|---|---|---|
| ansible.controllers.manage | admin | Register / edit / delete AnsibleController. Required to reach the Controllers tab. |
| ansible.repositories.manage | admin | Register / edit / delete PlaybookRepository. Required to reach the Repositories tab. |
| ansible.catalog.view | all (viewer / helpdesk / operator / admin) | Browse /ansible/catalog. |
| ansible.runs.view | all | View /ansible/runs and /ansible/runs/:id. |
| ansible.runs.launch | operator / admin | Launch playbooks; the Run Task button is hidden without this. |
| ansible.runs.cancel | operator / admin | Cancel an in-progress run. |
| ansible.schedules.view | all | View existing schedules. |
| ansible.schedules.manage | operator / admin | Register / edit / enable / disable / delete schedules. |

These map to Ash resources via ServiceRadarWebNGWeb.Authorization.Permissions. Per-event Permit gates in each LiveView enforce the right verb on the right resource.

Troubleshooting

Controller status stays :unknown past two health intervals

Probable causes, in order of likelihood:

  1. Plugin not assigned. Confirm both awx plugin manifests are assigned to the controller's agent_id via /settings/plugins (or ServiceRadar.Plugins from iex). Without the run_check entrypoint assigned, the agent can't even respond to awx.ping.
  2. Agent offline. Check the agent's connection state. The launcher (RunLauncher.launch/2) explicitly fails launches with a typed error when the agent isn't connected; controller health calls fail silently. Look for [error] AWX ControllerHealthWorker: dispatch failed in the core-elx logs.
  3. Agent can't reach AWX. From inside the agent's network namespace: curl -k -H "Authorization: Bearer <token>" https://<base_url>/api/v2/ping/. If this fails, the plugin will too.
  4. TLS verification. The default is to verify; if AWX uses a self-signed cert, plugin calls fail with x509 / 401 errors. The controller resource's metadata.insecure_skip_verify = true flag bypasses verification but is not yet exposed in the form — set it manually via iex for now.

Status flips to :unauthorized

The token is wrong, expired, or missing scope. Check Controller.last_health_summary — it'll surface the operator-safe 401 message from the plugin. Update the NetworkCredentialSecret's secret_payload and re-trigger health by editing any field on the controller (re-save trips the health check).

No playbooks appear in /ansible/catalog

  • AWX-sourced: AwxCatalogSyncWorker ticks every 600s by default. The first sync after registering a controller can take that long. Lower catalog_sync_interval_seconds on the controller if you want faster turnaround for setup.
  • Git-sourced: GitCatalogSyncWorker ticks every 600s by default. Confirm the agent / pod has filesystem write access to ANSIBLE_CATALOG_BASE_DIR. Look for [warning] AWX GitCatalogSyncWorker: git sync failed in logs — the PlaybookRepository.last_sync_summary field surfaces the sanitized error.

Devices don't flip to ansible_managed: true

  • Confirm the inventory_sync plugin manifest is assigned (separate from the on-demand manifest).
  • Confirm the agent can reach AWX (same constraint as health).
  • Check DiscoveryRecord ingestion in DIRE — the AWX hosts may be matching but onto different devices (hostname collision). Look for awx in the device's discovery_sources set.
  • Hosts AWX has that DIRE can't match are emitted but not yet surfaced; check the agent logs for inventory_sync discovery records.

Run stuck in :pending

The launch command never returned a result from AWX. Likely causes:

  1. Agent disconnected after dispatch. RunPulseWorker doesn't auto-retry launches; the run will eventually be picked up by RunWatchdog (~hourly fallback) and transitioned to :unreachable.
  2. AWX rejected the launch payload. Check [info] Ansible launch failed in logs and the user's flash message at launch time.
  3. The launch's awx.launch_job command result was lost. Check platform.agent_commands for the command_id of the run's last dispatch — its status and failure_reason columns will tell you.

Run stuck in :running past its expected completion

RunWatchdog transitions any non-terminal run past 2 × job_template.timeout (or 1 h fallback) to :unreachable with a diagnostic recording the watchdog reason. If you want a tighter watchdog, set a job_template_timeout_seconds on the run's metadata at launch time (currently iex-only).

"AWX rejected the request" 401 / 403 on launch

The plugin surfaces these as operator-safe typed errors with "check controller token" in the message. Either:

  • The token expired — rotate in AWX and update the NetworkCredentialSecret.
  • The token doesn't have launch permission on this job template — adjust the AWX user / team for the token, OR use a different token with broader scope.

Schedule not firing

  • Check the row in the Schedules tab — the last_evaluation_outcome badge tells you what happened on the last tick (fired, skipped_overlap, skipped_disabled, skipped_ineligible_targets, error).
  • Confirm the schedule is enabled (toggle in the action column).
  • Confirm next_run_at is populated and in the past. If it's nil, the cron expression failed to parse — the form validates client-side via Oban.Cron.Expression.parse/1 but legacy rows might predate validation; edit and re-save.
  • Non-UTC timezones return :timezone_database_unavailable and the schedule never fires. Stick to UTC / Etc/UTC until :tzdata is added.

Per-run RBAC questions

If a user has ansible.runs.launch but launches fail with a 403:

  • They may not have ansible.runs.view. The LaunchLive page itself only checks runs.launch, but the post-launch redirect to /ansible/runs/:id requires runs.view.
  • The Permit per-event gates enforce verbs on specific resources; check ServiceRadarWebNGWeb.Authorization.Permissions for the mapping if a permission seems to not be honored.

v1 limitations

These are documented constraints, not bugs. Each is tracked for a future v2:

  • AWX-sourced execution only. Direct ansible-playbook execution by a ServiceRadar agent is reserved for a follow-up; v1 requires an AWX/AAP controller. Workflows that orbit AWX (its credential vault, its inventory plugins, its executor pool) are the supported path.
  • UTC schedules only. Non-UTC timezones need the :tzdata Elixir dependency, which isn't currently bundled.
  • Public HTTPS git repos. The GitCatalogSyncWorker supports HTTPS deploy tokens via the credential broker but not SSH keys yet.
  • Schedules require AWX-sourced playbooks. Git-sourced playbooks can be launched ad-hoc once bound to an AWX template, but the schedule worker rejects them in v1 with :git_sourced_not_supported_v1.
  • Multi-device UI launches require a single controller. AWX uses limit: to scope to specific hosts; mixed-controller selections are rejected at submit time. Multi-controller fan-out is a v2 design question.
  • Webhook ingestion is deferred. The proposal's design notes a sketched agent-side receiver that would augment pulse polling for lower-latency state-transition updates from very large AWX deployments. Pulse polling is the only ingestion path in v1.
  • OCSF class selection. Events project as Application Activity (6003). If operator search habits favor Process Activity (1007) instead, the mapping module can be swapped without touching the data model.
  • Run retention exclusion window. The proposal called for "skip runs accessed within the last hour" in retention sweeps, but the worker doesn't yet check accessed_at (the column hasn't been added). Set generous ANSIBLE_RETENTION_RUN_DETAIL_DAYS if you frequently revisit old runs.
  • Manual UUID paste for credential secret references. Both controller and repository forms expect operators to paste a UUID from Settings → Credentials. A picker UX is a planned v2 improvement.