Capability Unlocks: A Better Causal Engine from the Start
Companion to Integration-assessment.md. That document establishes what V1 of the causal engine can ship against ServiceRadar's existing schema. This one inventories what additional capabilities V1 gains when the seven identified gaps are closed. Gaps are ranked by net utility, weighted toward cross-domain reach.
Premise. V1 ships against today's data schema. Every gap closed before or during V1 makes the engine measurably better at the moment it goes live. The pre-V1 items (Gap B, an audit, and Gap G, a small upstream change) cost days. The cross-domain bridge (Gap A) is the qualitative leap. Together they raise the floor of what V1 means.
How to read this document
For each gap, four pieces:
- Domains bridged. Which previously-disjoint signal domains the closed gap connects. Cross-domain reach is the primary ranking driver.
- New capability. What the engine can do once the gap is closed that it cannot do today.
- Why ranked here. The trade-off between unlock size, engineering cost, and gating effect on other work.
- How to close. Concrete steps. Schema diffs, identifier matching, audit queries, code touchpoints. Each gap names every join key it needs and every column it adds.
The list is sorted by net utility. Skim the ranking, read the gap that matches the work you can fund, and treat the rest as the surrounding investment landscape.
Ranking criterion
Gaps are ranked by how many cross-domain causaloids each one enables once closed; single-domain unlocks rank below cross-domain ones of comparable engineering cost. Cross-domain bridges are where DeepCausality's hypergraph model pays for itself. Within a single domain, the engine can reason; across domains, the engine can reason about consequences.
Rank 1. Gap A: service to flow bridge
Domains bridged. Netflow ↔ OTEL ↔ service_status ↔ ocsf_devices.
Four domains.
New capability.
- Service dependency graph derived from OTEL spans, where parent/child pairs across services become directed edges. Runtime-observed; stays current.
- C10 (blast radius) upgraded from IP granularity to service granularity.
- A new causaloid class: "Service A latency rising ⟹ predict degradation for all services with observed call edges → A."
- Trace-anchored root cause: when N services degrade, the OTEL graph identifies the minimum upstream set that explains all N.
Why rank 1. The engine's reach jumps from infrastructure to what the business delivers. Engineering cost is low if OTEL traces already carry span parent/child IDs. The service-to-service half of the bridge is a derivation, not a schema change; the service-to-flow half needs only one small endpoint mapping. Highest leverage-per-effort in the entire list.
How to close. Two halves: service-to-service derivation, and service-to-flow binding.
- Audit otel_traces and otel_trace_summaries. Confirm the columns carry service_name (or resource.service.name), trace_id, span_id, and parent_span_id. The standard OTEL data model has all four, but ServiceRadar's normalization may have flattened or renamed them. If present, derive directed service-to-service edges by self-joining spans within the same trace_id:

      SELECT parent.service_name AS caller,
             child.service_name AS callee,
             count(*) AS observed_calls,
             avg(child.duration_ms) AS avg_latency
      FROM otel_traces child
      JOIN otel_traces parent
        ON parent.span_id = child.parent_span_id
       AND parent.trace_id = child.trace_id
      WHERE parent.service_name <> child.service_name
      GROUP BY 1, 2;

  This is the service dependency graph. No schema change required if the columns exist.
- Bind services to IP and port. The seam between service_status and netflow_metrics needs a (service_id, ip, port, protocol) mapping. Two options ranked by cost:
  - Agent self-report (preferred). Agents already know what they monitor and where it listens. Extend the agent's service_status submission to include the listen address. Add three columns to service_status (or a new service_endpoints table keyed on (gateway_id, agent_id, service_name)): listen_ip inet, listen_port integer, protocol text.
  - Operator declaration. A small CRUD surface (Ash resource) lets operators bind a service to its IP and port when self-report is not possible.
- Join netflow to services. Once the mapping exists:

      SELECT s.service_name AS callee,
             n.src_ip AS caller_ip,
             sum(n.bytes) AS bytes
      FROM netflow_metrics n
      JOIN service_endpoints s
        ON n.dst_ip = s.listen_ip
       AND n.dst_port = s.listen_port
       AND n.protocol = s.protocol
      GROUP BY 1, 2;

  caller_ip resolves to a device via ocsf_devices.ip, and from there to any service that device hosts. Now C10 runs at service granularity.
Identifier glue. The chain is
otel.service_name → service_endpoints.service_name → (ip, port) → netflow_metrics → ocsf_devices.ip → agent → service. Every join key
already exists in the schema except service_endpoints.(ip, port). That
is the only net-new data to populate.
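The service-to-service derivation can also be sketched outside SQL, for example to prototype the edge set before the engine consumes it. The following is a minimal sketch, not ServiceRadar code: the `Span` struct and both function names are hypothetical, and the field names simply mirror the four columns the audit looks for.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical in-memory span record; fields mirror the audited columns.
#[derive(Debug, Clone)]
pub struct Span {
    pub trace_id: String,
    pub span_id: String,
    pub parent_span_id: Option<String>,
    pub service_name: String,
}

/// Derive directed caller -> callee edges with observed call counts.
/// Equivalent in spirit to the self-join on (trace_id, parent_span_id).
pub fn derive_service_edges(spans: &[Span]) -> BTreeMap<(String, String), u64> {
    // Index spans by (trace_id, span_id) so each child can find its parent.
    let mut by_id: HashMap<(&str, &str), &Span> = HashMap::new();
    for s in spans {
        by_id.insert((s.trace_id.as_str(), s.span_id.as_str()), s);
    }

    let mut edges: BTreeMap<(String, String), u64> = BTreeMap::new();
    for child in spans {
        if let Some(parent_id) = child.parent_span_id.as_deref() {
            if let Some(parent) = by_id.get(&(child.trace_id.as_str(), parent_id)) {
                // Only cross-service parent/child pairs become edges.
                if parent.service_name != child.service_name {
                    let key = (parent.service_name.clone(), child.service_name.clone());
                    *edges.entry(key).or_insert(0) += 1;
                }
            }
        }
    }
    edges
}

/// Services with an observed call edge into `callee`: the set that a
/// "latency rising on callee" causaloid would flag for predicted degradation.
pub fn direct_callers(edges: &BTreeMap<(String, String), u64>, callee: &str) -> Vec<String> {
    edges
        .keys()
        .filter(|(_, c)| c.as_str() == callee)
        .map(|(p, _)| p.clone())
        .collect()
}
```

Because the edge map is a `BTreeMap`, iteration order is deterministic, which keeps downstream predictions stable between runs on the same spans.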
Rank 2. Gap G: articulation points and bridges (closing upstream)
Domains bridged. Within network topology, but unlocks a standing-predictive output class that no other gap enables.
New capability.
- C5 and C5b ship as single library calls.
- Blast-radius enumeration: for each cut vertex or edge, the connected components after removal are the affected sets. Standing predictions ranked by component size.
- Composition with pathway_betweenness_centrality. Articulation points partition the graph; pathway-restricted centrality partitions observed traffic. Operators prioritize differently.
- Composition with C1 (virt cascade). An articulation vertex that is also a virt host has blast radius equal to its component ∪ its hosted guests.
- Planning-facing output: "if we put D in maintenance, which subnets and services lose reachability?" Same algorithm, different framing.
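Blast-radius enumeration for a single vertex can be sketched as a breadth-first search that never enters the removed vertex. This is an illustrative sketch only: it uses plain adjacency lists rather than ultragraph's types, and the function name `components_after_removal` is hypothetical.

```rust
use std::collections::VecDeque;

/// Connected components that remain after deleting `removed` from an
/// undirected graph given as adjacency lists, largest component first.
pub fn components_after_removal(adj: &[Vec<usize>], removed: usize) -> Vec<Vec<usize>> {
    let mut seen = vec![false; adj.len()];
    if removed < adj.len() {
        seen[removed] = true; // marking it visited means BFS never enters it
    }

    let mut components = Vec::new();
    for start in 0..adj.len() {
        if seen[start] {
            continue;
        }
        seen[start] = true;
        let mut queue = VecDeque::from([start]);
        let mut component = Vec::new();
        while let Some(u) = queue.pop_front() {
            component.push(u);
            for &v in &adj[u] {
                if !seen[v] {
                    seen[v] = true;
                    queue.push_back(v);
                }
            }
        }
        component.sort_unstable();
        components.push(component);
    }

    // Rank affected sets by size, largest first; ties keep discovery order.
    components.sort_by(|a, b| b.len().cmp(&a.len()));
    components
}
```

The same call serves the planning question: passing the device proposed for maintenance as `removed` returns the sets that would lose reachability to one another.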
Why rank 2. Standing-predictive output is the most operationally valuable class. It answers "what should I worry about right now?" without anything having to fail. Implementation cost is roughly 200 LOC upstream, not a ServiceRadar change.
How to close. Upstream PR to deepcausality-rs/deep_causality,
ultragraph crate. No ServiceRadar-side schema or data work required.
- Implement Tarjan's algorithm on CsmGraph (the frozen state). One DFS pass over the CSR adjacency. The biconnected-components decomposition is the most general form and subsumes both outputs.
- Add three trait methods to StructuralGraphAlgorithms:

      fn articulation_points(&self, directed: bool) -> Result<Vec<usize>, GraphError>;
      fn bridges(&self) -> Result<Vec<(usize, usize)>, GraphError>;
      fn biconnected_components(&self) -> Result<Vec<Vec<usize>>, GraphError>;

- Match the existing API convention from betweenness_centrality(directed, normalized). Classical articulation-point and bridge detection are defined on undirected graphs, which matches ServiceRadar's CONNECTS_TO physical-link semantics.
- Tests against known graphs (Tarjan's original example, a Petersen graph, a star, a complete graph, two disjoint cliques connected by a bridge).
- Bench against betweenness_centrality for comparison.
Identifier glue. None. ServiceRadar passes its existing
(node_id, edge_list) to ultragraph and receives back lists of indices
into the same node space. No new keys, no joins.
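For orientation, the core of the low-link DFS can be sketched on plain adjacency lists. This is a minimal sketch under stated assumptions, not the proposed ultragraph implementation: it does not use CsmGraph or its CSR layout, it assumes an undirected simple graph (no parallel edges), it recurses rather than using an explicit stack, and the function name is hypothetical.

```rust
/// Articulation points and bridges of an undirected simple graph.
/// `adj[u]` lists the neighbours of `u`; every edge appears in both lists.
pub fn cut_points_and_bridges(adj: &[Vec<usize>]) -> (Vec<usize>, Vec<(usize, usize)>) {
    struct Dfs<'a> {
        adj: &'a [Vec<usize>],
        disc: Vec<usize>, // discovery time, 0 = unvisited
        low: Vec<usize>,  // lowest discovery time reachable via one back edge
        timer: usize,
        is_cut: Vec<bool>,
        bridges: Vec<(usize, usize)>,
    }

    impl<'a> Dfs<'a> {
        fn visit(&mut self, u: usize, parent: Option<usize>) {
            self.timer += 1;
            self.disc[u] = self.timer;
            self.low[u] = self.timer;
            let adj = self.adj; // copy of the shared reference
            let mut children = 0;
            for &v in &adj[u] {
                if Some(v) == parent {
                    continue; // skip the tree edge back to the parent
                }
                if self.disc[v] == 0 {
                    children += 1;
                    self.visit(v, Some(u));
                    self.low[u] = self.low[u].min(self.low[v]);
                    // Non-root u is a cut vertex if v's subtree cannot climb above u.
                    if parent.is_some() && self.low[v] >= self.disc[u] {
                        self.is_cut[u] = true;
                    }
                    // Edge (u, v) is a bridge if v's subtree cannot even reach u.
                    if self.low[v] > self.disc[u] {
                        self.bridges.push((u.min(v), u.max(v)));
                    }
                } else {
                    self.low[u] = self.low[u].min(self.disc[v]);
                }
            }
            // The DFS root is a cut vertex only if it has two or more tree children.
            if parent.is_none() && children > 1 {
                self.is_cut[u] = true;
            }
        }
    }

    let n = adj.len();
    let mut dfs = Dfs {
        adj,
        disc: vec![0; n],
        low: vec![0; n],
        timer: 0,
        is_cut: vec![false; n],
        bridges: Vec::new(),
    };
    for root in 0..n {
        if dfs.disc[root] == 0 {
            dfs.visit(root, None);
        }
    }
    let cut_points: Vec<usize> = (0..n).filter(|&v| dfs.is_cut[v]).collect();
    let mut bridges = dfs.bridges;
    bridges.sort_unstable();
    (cut_points, bridges)
}
```

An upstream version would likely replace the recursion with an explicit stack so that deep topologies cannot overflow the call stack, and would return the `Result<_, GraphError>` types listed above.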
Rank 3. Gap F: structured physical components
Domains bridged. SNMP timeseries ↔ device inventory ↔ AGE topology ↔ (eventually) redundancy modeling.
New capability.
- Per-component health for physical hosts (PSU, fan, temp, disk).
- Physical-host containment causaloid at parity with C1 and C2.
- Substrate for the redundancy story. Redundancy groups become hyperedges over component entities. The PSU and RAID examples from the original integration doc become writable, after this gap is closed.
- Vendor-neutral component model with OID template normalization.
Why rank 3. Largest absolute new substrate, but also the largest engineering cost (schema design, SNMP collector work, OID template curation, ingest pipeline change). High payoff, slow delivery, with cross-domain reach contained to physical infrastructure.
How to close. OpenSpec change proposal per repo convention. Four parts.
- Schema. New table platform.device_components:

      CREATE TABLE platform.device_components (
          id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
          device_uid text NOT NULL REFERENCES platform.ocsf_devices(uid),
          component_type text NOT NULL,      -- psu | fan | temp_sensor | disk | memory_module | nic
          component_index integer NOT NULL,  -- stable per device + type
          parent_index integer,              -- nests components (e.g., disk-in-bay)
          model text,
          vendor text,
          serial text,
          health text NOT NULL,              -- controlled vocabulary, see Gap C
          attributes jsonb DEFAULT '{}',
          first_seen_time timestamp,
          last_seen_time timestamp,
          UNIQUE (device_uid, component_type, component_index)
      );

- Identity matching from SNMP. The cross-vendor standard is ENTITY-MIB (entPhysicalTable). Every component is a row with entPhysicalIndex (the stable index), entPhysicalClass (mapping to component_type: powerSupply(6), fan(7), sensor(8), module(9), port(10)), entPhysicalDescr, entPhysicalSerialNum, entPhysicalModelName, entPhysicalContainedIn (mapping to parent_index). Vendors that do not implement ENTITY-MIB fully (some appliances, older gear) need per-vendor MIB overlays: Cisco CISCO-ENVMON-MIB, Juniper JUNIPER-MIB, Dell iDRAC, HP cpqHe*. Health source: entSensorValue, cefcFanTrayStatus, cefcModuleOperStatus, vendor-specific OIDs.
- Collector. Extend the SNMP polling path to walk entPhysicalTable first, populate device_components rows, then emit per-component health into health_events (joined by entity_type='component', entity_id=device_components.id). Replaces the current pattern of anonymous timeseries_metrics rows for environmental data. Keep timeseries_metrics for the numeric sample stream; reference device_components.id in metadata.
- AGE edges. Project device_components into the AGE graph as Component vertices, with Device CONTAINS Component edges. Each Component carries component_type and health as properties. The existing topology_graph.ex projection extends with one new MERGE pattern per component.
Identifier glue. The natural key is (device_uid, component_type, component_index), enforced by the UNIQUE constraint; id is the surrogate primary key. ENTITY-MIB's entPhysicalIndex is the source for component_index where the MIB is implemented. For vendors without ENTITY-MIB, the collector synthesizes a stable index from a deterministic hash of (oid_subtree, position).
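One way to synthesize such an index is a small fixed hash whose output does not depend on the compiler version or process seed. The sketch below uses 32-bit FNV-1a, which is an assumption made for illustration; the source does not name a hash function. The function name is hypothetical, and the result is masked to 31 bits so it fits a non-negative Postgres integer.

```rust
/// Deterministic component index for devices without ENTITY-MIB.
/// Hashes the OID subtree and the component's position with 32-bit FNV-1a
/// (an illustrative choice, not one the source prescribes).
pub fn synthetic_component_index(oid_subtree: &str, position: u32) -> i32 {
    const FNV_OFFSET_BASIS: u32 = 0x811c_9dc5;
    const FNV_PRIME: u32 = 0x0100_0193;

    let position_bytes = position.to_be_bytes();
    let mut hash = FNV_OFFSET_BASIS;
    // A zero byte separates the two fields so ("1.2", 3) and ("1.23", ...)
    // cannot produce the same byte stream.
    let bytes = oid_subtree
        .bytes()
        .chain(std::iter::once(0u8))
        .chain(position_bytes.iter().copied());
    for b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    // Keep 31 bits so the value is a valid non-negative `integer`.
    (hash & 0x7fff_ffff) as i32
}
```

A hash can collide, so the UNIQUE constraint on (device_uid, component_type, component_index) remains the backstop; a collector following this sketch would need a documented fallback for the rare conflict.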
Rank 4. Gap B: capacity_bps coverage audit
Domains bridged. Netflow (load) ↔ AGE topology (capacity).
New capability.
- C6 ships reliably across all edges with populated capacity.
- Time-to-saturation projections become a standing predictive output across the entire network surface.
- Composes with C9 (shared-hop bottleneck). A hop that is both shared and approaching capacity is a sharper prediction than either alone.
Why rank 4. Narrow cross-domain reach, but trivially cheap to close (audit task, not schema change). Could legitimately ship before V1.
How to close. A measurement task with a small backfill.
- Trace the source. GodViewStream reads capacity_bps on edges. Walk the projection in topology_graph.ex and the runtime graph builder to find which table or AGE edge property is read. Most likely candidates: discovered_interfaces.if_high_speed or if_speed, possibly an AGE edge property set during projection, possibly interface_settings.
- Measure coverage. Once the source is known, run:

      SELECT count(*) AS total,
             count(*) FILTER (WHERE capacity_bps IS NULL) AS null_count,
             count(*) FILTER (WHERE capacity_bps = 0) AS zero_count,
             count(DISTINCT device_uid) AS distinct_devices
      FROM <source_table>;

  Repeat partitioned by evidence_class and confidence_tier. The result tells you whether C6 is reliable on direct-physical edges only or across the broader topology.
- Backfill the gap. SNMP-derived if_high_speed (ifHighSpeed, OID 1.3.6.1.2.1.31.1.1.1.15) is the standard source. ifHighSpeed reports in units of 1,000,000 bits per second, so the backfill must scale it to obtain capacity_bps. For inferred or logical edges where no SNMP value exists, accept operator declaration via interface_settings.capacity_bps_override (one new column).
- Document the contract. Mark edges without populated capacity as ineligible for C6. The God-View envelope already carries telemetry_eligible, which is the natural place to encode this.
Identifier glue. None new. Join keys (device_uid, if_index,
interface_id) already exist.
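The eligibility contract and a time-to-saturation projection can be sketched together. The linear extrapolation below is an assumption chosen for illustration; the source does not specify how C6 projects load. The function and parameter names are hypothetical.

```rust
/// Seconds until an edge saturates, by linear extrapolation of current load.
/// Returns `None` when the edge is ineligible (capacity missing or zero)
/// or when load is not growing, so no saturation time can be projected.
pub fn seconds_to_saturation(
    capacity_bps: Option<u64>,
    load_bps: f64,
    growth_bps_per_sec: f64,
) -> Option<f64> {
    // Edges without populated capacity are ineligible, per the contract.
    let capacity = match capacity_bps {
        Some(c) if c > 0 => c as f64,
        _ => return None,
    };
    if load_bps >= capacity {
        return Some(0.0); // already saturated
    }
    if growth_bps_per_sec <= 0.0 {
        return None; // flat or shrinking load never reaches capacity
    }
    Some((capacity - load_bps) / growth_bps_per_sec)
}
```

Treating NULL and zero capacity identically matches the coverage audit, which counts both as unpopulated.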
Rank 5. Gap C: health-state controlled vocabulary
Domains bridged. Across entity types rather than across signal domains.
New capability.
- C11 generalizes uniformly across devices, services, components, and BGP peers.
- State-transition causaloids become a primitive the engine recognizes uniformly.
- Multi-state verdict output closes the producer/consumer loop on health vocabulary.
Why rank 5. Real cross-entity reach, but the unlock is quality rather than new capability. Moderate engineering cost (audit, define, migrate).
How to close. Four steps.
- Audit existing values.

      SELECT entity_type, new_state, count(*) AS occurrences
      FROM platform.health_events
      GROUP BY 1, 2
      ORDER BY 1, 3 DESC;

  This is the unconstrained alphabet the system is actually using today.
- Define the controlled vocabulary. Proposed set: healthy | degrading | degraded | reduced-redundancy | failing | failed | unknown. Seven values, ordered by severity. Each maps to a small integer for cheap comparisons. Same vocabulary applies to devices, services, components, and BGP peers.
- Migrate. Add a check constraint, or convert new_state to a Postgres ENUM. Provide a lookup table mapping legacy free-text values to the controlled set. Update every emit site (the SNMP collector, the agent health reporter, BGP/BMP handlers, the CausalSignals processor) to use the controlled values.
- Align engine output. The causal-engine emitter publishes verdicts using the same vocabulary. Symmetry between input and output closes the producer/consumer loop without translation.
Identifier glue. The health_events.(entity_type, entity_id)
polymorphic key already exists. The vocabulary change is value-side only,
not key-side.
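On the engine side, the proposed vocabulary maps naturally onto an ordered enum. The sketch below is one possible shape, with the ordering following the proposed set as listed. The legacy strings in `from_legacy` ("up", "ok", "warning", "down", "critical") are hypothetical placeholders; the real mapping comes from the audit query's output.

```rust
/// The seven-value health vocabulary, ordered as in the proposed set.
/// The discriminants are the "small integers" used for cheap comparisons.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum Health {
    Healthy = 0,
    Degrading = 1,
    Degraded = 2,
    ReducedRedundancy = 3,
    Failing = 4,
    Failed = 5,
    Unknown = 6,
}

impl Health {
    /// Canonical string form, as stored in health_events.new_state.
    pub fn as_str(self) -> &'static str {
        match self {
            Health::Healthy => "healthy",
            Health::Degrading => "degrading",
            Health::Degraded => "degraded",
            Health::ReducedRedundancy => "reduced-redundancy",
            Health::Failing => "failing",
            Health::Failed => "failed",
            Health::Unknown => "unknown",
        }
    }

    /// Strict parse of a controlled value; anything else is rejected.
    pub fn parse(s: &str) -> Option<Health> {
        match s {
            "healthy" => Some(Health::Healthy),
            "degrading" => Some(Health::Degrading),
            "degraded" => Some(Health::Degraded),
            "reduced-redundancy" => Some(Health::ReducedRedundancy),
            "failing" => Some(Health::Failing),
            "failed" => Some(Health::Failed),
            "unknown" => Some(Health::Unknown),
            _ => None,
        }
    }

    /// Lenient mapping for legacy free-text values.
    /// The legacy strings here are hypothetical examples only.
    pub fn from_legacy(raw: &str) -> Health {
        let lowered = raw.trim().to_ascii_lowercase();
        if let Some(h) = Health::parse(&lowered) {
            return h;
        }
        match lowered.as_str() {
            "up" | "ok" => Health::Healthy,
            "warning" => Health::Degraded,
            "down" | "critical" => Health::Failed,
            _ => Health::Unknown,
        }
    }
}
```

Deriving `Ord` lets a state-transition causaloid express "worsened" as `new > old` without a per-entity-type comparison table.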
Rank 6. Gap E: OOB gateway flag
A narrow false-positive class for C3. Only matters in deployments where gateways are not on a dedicated OOB network.
How to close. One column, one migration.
ALTER TABLE platform.gateways
ADD COLUMN network_class text NOT NULL DEFAULT 'in-band'
CHECK (network_class IN ('in-band', 'out-of-band', 'management'));
Update the gateway registration flow (Ash resource action) to accept the
class at create time. Default existing rows to in-band. C3 reads the
column when classifying root cause: a gateway flagged out-of-band whose
observed devices all become unavailable points to the devices' shared
in-band path, not to the gateway itself.
Identifier glue. None. The flag is a property of the existing gateway identity.
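How C3 might consume the flag can be sketched as a small classifier. Everything here is illustrative: the type and function names are hypothetical, and treating management gateways the same as in-band ones is an assumption, since the source only specifies the out-of-band case.

```rust
/// Mirrors the CHECK constraint on platform.gateways.network_class.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkClass {
    InBand,
    OutOfBand,
    Management,
}

impl NetworkClass {
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "in-band" => Some(Self::InBand),
            "out-of-band" => Some(Self::OutOfBand),
            "management" => Some(Self::Management),
            _ => None,
        }
    }
}

/// Hypothetical root-cause hint emitted by C3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum C3Hint {
    /// Not every observed device is unavailable; no shared cause inferred.
    NoSharedOutage,
    /// The gateway is out-of-band, so suspect the devices' shared in-band path.
    SharedInBandPath,
    /// The gateway shares the path, so the gateway itself stays a candidate.
    GatewayOrPath,
}

/// Classify an outage seen through one gateway.
pub fn classify_c3(class: NetworkClass, observed: usize, unavailable: usize) -> C3Hint {
    if observed == 0 || unavailable < observed {
        return C3Hint::NoSharedOutage;
    }
    match class {
        NetworkClass::OutOfBand => C3Hint::SharedInBandPath,
        // Assumption: management is treated like in-band until specified.
        NetworkClass::InBand | NetworkClass::Management => C3Hint::GatewayOrPath,
    }
}
```

Defaulting existing rows to in-band keeps today's C3 behaviour unchanged until an operator reclassifies a gateway.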
Rank 7. Gap D: reverse MANAGES edge
Pure ergonomics. A C4 performance improvement only. No new causaloids unlocked.
How to close. Either add a reverse edge in AGE, or add a Postgres
index. The AGE approach is more consistent with how other relationships
are projected. In topology_graph.ex, where the MANAGED_BY MERGE
already runs, add the inverse:
MERGE (mgmt:Device {id: '...mgmt_id...'})
MERGE (child:Device {id: '...child_id...'})
MERGE (mgmt)-[r:MANAGES]->(child)
The relational alternative is a single Postgres index on
ocsf_devices(management_device_id), which is faster to ship if AGE
projection changes are blocked.
Identifier glue. None. The pair (management_device_id, uid) already
exists in ocsf_devices. The change is index or edge direction only.
Why cross-domain bridges dominate
Today's signal domains are largely disjoint. Inventory reasons about itself; telemetry reasons about itself; routing about itself; flow about itself; service about itself; application about itself; virtualization about itself. The V1 causaloids already bridge some of these (C3 bridges inventory and service health; C7 bridges inventory and service; C8 bridges routing and reachability; C10 bridges flow and service availability), but every bridge has a seam. It uses IPs where it would prefer service IDs, or agent presence where it would prefer application health.
Closing Gap A removes the largest such seam. OTEL traces and netflow
data become the same dependency graph at different granularities, and
service_status becomes the health overlay on that graph. This is the
configuration where DC's hypergraph machinery earns its keep: cross-domain
inference whose conclusions cannot be reached by any single-domain
reasoner.
Closing Gap F is the second-largest unlock, but it deepens a single domain. It is the natural follow-on once cross-domain reasoning is in place, because component-failure predictions become more valuable when the engine can also trace their consequences through the service dependency graph that Gap A provides.
Recommended sequencing
For a stronger V1 at launch, close in this order:
| When | Gap | Cost | What V1 gains at launch |
|---|---|---|---|
| Pre-V1 (days) | G: articulation points, bridges | ~200 LOC upstream | C5 and C5b ship as library calls; standing single-point-of-failure predictions on day one. |
| Pre-V1 (days) | B: capacity_bps audit | Audit + small backfill | C6 trustworthy across the network surface, not just a subset of edges. |
| Concurrent with V1 (weeks) | A: service to flow bridge | Mapping table + OTEL audit | Service-level reasoning replaces IP-level reasoning. C10 becomes operationally meaningful. |
| Post-V1 (months) | C: health vocabulary | Audit + migrate | Multi-state reasoning generalizes across entity types. |
| Post-V1 (multi-quarter) | F: structured components | OpenSpec + collector work | Physical-host containment at parity with virtualization; substrate for redundancy modeling. |
| When convenient | E: OOB flag | One column | C3 false-positive reduction in mixed OOB deployments. |
| When convenient | D: reverse MANAGES | Index or MERGE | C4 graph-side performance. |
Closing G and B before V1 ships costs days and changes the launch deliverable from "C6 sometimes, C5 not at all" to "C5, C5b, C6 all live." That alone is worth the budget. Closing A during V1 ships service-level reasoning in the same window as the engine itself, which is where the qualitative narrative of the engine lives.