ServiceRadar Onboarding Process Review (November 2025)
Executive Summary
This document provides a comprehensive review of ServiceRadar's service onboarding processes across all deployment models: Kubernetes, Docker Compose, and edge deployments. The review examines how services register with Core, how agents associate with pollers, and identifies gaps in the current implementation.
Key Findings:
- ✅ K8s onboarding is fully automated via SPIFFE Controller and ClusterSPIFFEID CRDs
- ✅ Edge onboarding has strong foundation with zero-touch library (pkg/edgeonboarding)
- ⚠️ Service registry is implicit - no dedicated table tracking all known services
- ⚠️ Agent registration is passive - agents only appear when they report services
- ⚠️ Checker onboarding automation is partial - manual KV updates still required (GH-1909)
- ⚠️ Main docker-compose stack lacks SPIFFE automation - relies on static TLS certificates
1. Current Onboarding Mechanisms
1.1 Kubernetes Deployments
Method: Declarative SPIFFE Identity Assignment via Controller
How It Works:
- SPIRE Controller Manager runs as sidecar in SPIRE Server StatefulSet
ClusterSPIFFEIDCRDs define SPIFFE identities for workloads:apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
name: serviceradar-core
spec:
spiffeIDTemplate: spiffe://carverauto.dev/ns/demo/sa/serviceradar-core
podSelector:
matchLabels:
app: serviceradar-core- Controller watches pod creation/deletion and automatically creates/deletes SPIRE registration entries
- Workloads access SPIFFE Workload API via
/run/spire/sockets/agent.sockmounted from DaemonSet
Service Registration with Core:
- Pollers appear as "known" once they have a
ClusterSPIFFEIDresource - Core's
isKnownPoller()checks:- Static
KnownPollerslist in core config (legacy) - Edge onboarding service's allowed poller list (dynamic, KV-backed)
- Static
- No explicit "registration" API call - pollers become known declaratively
Status: ✅ Fully Automated - Zero manual intervention after CRD creation
Files:
k8s/demo/base/spire/spire-clusterspiffeid-*.yaml- Identity definitionsk8s/demo/base/spire/spire-controller-manager-rbac.yaml- Controller RBACpkg/core/pollers.go:701-isKnownPoller()function
1.2 Edge Deployments (Docker/Bare Metal)
Method: Token-Based Onboarding with Nested SPIRE
How It Works:
-
Package Creation (Admin → Core):
serviceradar-cli edge create-package \
--name "Remote Site A" \
--type poller \
--metadata-json "$(cat metadata.json)"- Core creates
EdgeOnboardingPackagerecord with statusissued - Generates SPIRE join token (15min TTL) and download token (24hr TTL)
- Package includes:
edge-poller.env- Service configurationspire/upstream-join-token- SPIRE attestation tokenspire/upstream-bundle.pem- Trust bundlemetadata.json- Core/KV endpoints, SPIFFE IDs
- Core creates
-
Package Download (Edge Host → Core):
# Via API with download token
curl -H "Authorization: Bearer $TOKEN" \
https://core/api/admin/edge-packages/download/$PACKAGE_ID- Package status changes to
delivered
- Package status changes to
-
Service Bootstrap (Edge Host):
# Zero-touch onboarding (implemented in GH-1915)
docker run -e ONBOARDING_TOKEN=$TOKEN \
ghcr.io/carverauto/serviceradar-poller:latestpkg/edgeonboardinglibrary handles:- Downloading package from Core
- Extracting SPIRE credentials
- Configuring nested SPIRE (upstream agent + downstream server)
- Generating service configuration
- Starting service
-
Activation (Poller → Core):
- Poller sends first
ReportStatusRPC to Core - Core calls
edgeOnboarding.RecordActivation()inpkg/core/services.go:869 - Package status changes to
activated - Poller added to in-memory allowed list, broadcast to all Core instances
- Poller sends first
Nested SPIRE Architecture:
┌─────────────────────────────────────────────┐
│ Edge Site (Docker Compose) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Poller Container │ │
│ │ │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ SPIRE Upstream Agent │ │ │
│ │ │ (connects to cluster SPIRE) │ │ │
│ │ └──────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ SPIRE Downstream Server │ │ │
│ │ │ (issues certs to agent) │ │ │
│ │ └──────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ Poller Service │ │ │
│ │ └──────────────────────────────┘ │ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Agent Container (shares network/PID) │ │
│ │ │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ SPIRE Downstream Agent │ │ │
│ │ │ (gets cert from poller) │ │ │
│ │ └──────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ Agent Service │ │ │
│ │ └──────────────────────────────┘ │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
│
│ SPIFFE mTLS
▼
┌─────────────────────────────────────────────┐
│ Cluster SPIRE Server (Kubernetes) │
└─────────────────────────────────────────────┘
Status: ✅ Implemented for Pollers (GH-1915/serviceradar-57)
- Zero-touch onboarding working
- DataSvc self-registration implemented
- Checker template registration added
Status: ⚠️ Partial for Agents/Checkers (GH-1909 in progress)
- Can create agent/checker packages
- KV automation not complete
- Manual KV updates still required
Files:
pkg/edgeonboarding/- Bootstrap librarypkg/core/edge_onboarding.go- Package creation/activationdocker/compose/poller-stack.compose.yml- Edge stack definitiondocker/compose/bootstrap-nested-spire.sh- SPIRE bootstrap helper
1.3 Main Docker Compose Stack
Method: Static TLS Certificates (Legacy)
How It Works:
cert-generatorcontainer runsgenerate-certs.shto create self-signed certs- Certificates stored in
poller-cert-datavolume - Services mount certificates from volume
- Services connect using TLS, not SPIFFE
Status: ⚠️ Legacy Approach - Not using SPIFFE/SPIRE
- Used for local development
- No automatic identity management
- Static certificates need manual rotation
Note: Per GH-1915, main docker-compose stack is intentionally separate from edge onboarding and should NOT be modified to include edge logic.
2. Service Registration and Tracking
2.1 Service Registry Architecture
Key Finding: ServiceRadar does NOT maintain a dedicated "service registry" table. Instead, services are tracked implicitly through the services stream.
Services Stream Schema:
CREATE STREAM IF NOT EXISTS services (
timestamp DateTime64(3),
poller_id string,
agent_id string,
service_name string,
service_type string,
config string,
partition string
) ENGINE = Stream(1, 1, rand())
TTL to_start_of_day(coalesce(timestamp, _tp_time)) + INTERVAL 3 DAY
How Service Registration Works:
-
Poller Authorization (
pkg/core/pollers.go:812):func (s *Server) ReportStatus(ctx context.Context, req *proto.PollerStatusRequest) {
if !s.isKnownPoller(ctx, req.PollerId) {
s.logger.Warn().Str("poller_id", req.PollerId).
Msg("Ignoring status report from unknown poller")
return &proto.PollerStatusResponse{Received: true}, nil
}
// Process status report...
} -
Service Payload Processing (
pkg/core/metrics.go:797):processServicePayload()handles each service in the status report- For gRPC checkers, calls
ensureServiceDevice()to register the device - Extracts host IP/hostname from checker payload
- Creates
DeviceUpdatewithDiscoverySourceSelfReported
-
Service Storage (
pkg/core/services.go:860):- Services written to
servicesstream - Includes
poller_id,agent_id,service_type,service_name - Records become the implicit service registry
- Services written to
2.2 Agent-to-Poller Association
Mechanism: Agents are associated with pollers through the services stream, NOT through explicit registration.
Query for Agent List (pkg/db/pollers.go:423):
SELECT
agent_id,
poller_id,
MAX(timestamp) as last_seen,
groupArray(DISTINCT service_type) as service_types
FROM table(services)
WHERE agent_id != ''
GROUP BY agent_id, poller_id
ORDER BY last_seen DESC
Key Implications:
- Agents only appear after they start reporting services
- No "pre-registration" of agents
- Agent-poller relationship is derived, not declared
- If agent stops reporting, it disappears from the list (TTL = 3 days)
Edge Onboarding Activation (pkg/core/services.go:876):
if s.edgeOnboarding != nil {
if agentID != "" && agentID != pollerID {
if err := s.edgeOnboarding.RecordActivation(ctx,
models.EdgeOnboardingComponentTypeAgent,
agentID, pollerID, sourceIP, "", timestamp); err != nil {
// Log error but continue
}
}
}
- When services report with both
poller_idandagent_id, Core callsRecordActivation - This promotes edge-onboarded agents from "issued" to "activated"
- Updates
edge_onboarding_packagestable
2.3 Checker Registration
Current State: Checkers are registered implicitly like all other services.
gRPC Checker Flow (pkg/core/devices.go:31):
func (s *Server) ensureServiceDevice(
ctx context.Context,
agentID, pollerID, partition string,
svc *proto.ServiceStatus,
serviceData json.RawMessage,
timestamp time.Time,
) {
// Only gRPC checkers embed host context
if svc.ServiceType != grpcServiceType {
return
}
hostIP, hostname, hostID := extractCheckerHostIdentity(serviceData)
if hostIP == "" {
return // Cannot register without IP
}
deviceID := fmt.Sprintf("%s:%s", partition, hostIP)
metadata := map[string]string{
"source": "checker",
"checker_service": svc.ServiceName,
"collector_agent_id": agentID,
"collector_poller_id": pollerID,
// ...
}
// Emit DeviceUpdate to device registry
deviceUpdate := &models.DeviceUpdate{
DeviceID: deviceID,
Source: models.DiscoverySourceSelfReported,
// ...
}
}
Checker Template Registration (Implemented in serviceradar-57):
- Checkers self-register default templates to KV on startup
- Template stored at
templates/checkers/{kind}.json - Edge onboarding fetches template, applies variable substitution
- Writes instance config to
agents/{agent_id}/checkers/{kind}.json
3. SPIFFE Identity Management
3.1 Kubernetes SPIFFE Flow
Components:
-
SPIRE Server StatefulSet (
spire-server-0)- Runs upstream SPIRE server binary
- Runs SPIRE Controller Manager sidecar
- Stores state in CloudNativePG cluster (
spire-pg)
-
SPIRE Agent DaemonSet
- Runs on every node
- Exposes Workload API at
/run/spire/sockets/agent.sock - Uses
k8s_satnode attestor (Kubernetes Service Account Tokens)
-
SPIRE Controller Manager
- Watches
ClusterSPIFFEIDCRDs - Creates/updates SPIRE registration entries via admin API
- Reconciles pod selectors automatically
- Watches
Identity Assignment:
# Example: serviceradar-core identity
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
name: serviceradar-core
spec:
spiffeIDTemplate: spiffe://carverauto.dev/ns/demo/sa/serviceradar-core
namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: demo
podSelector:
matchLabels:
app: serviceradar-core
Services with SPIFFE Identities in K8s:
- ✅ serviceradar-core
- ✅ serviceradar-datasvc
- ✅ serviceradar-poller
- ✅ serviceradar-agent
- ✅ serviceradar-mapper
- ✅ serviceradar-sync
- ✅ serviceradar-db-event-writer
- ✅ serviceradar-flowgger
- ✅ serviceradar-trapd
- ✅ serviceradar-rperf-checker
- ✅ serviceradar-snmp-checker
- ✅ serviceradar-zen
3.2 Edge SPIFFE Flow (Nested SPIRE)
Upstream Agent Attestation:
- Edge poller's upstream agent connects to cluster SPIRE server
- Uses join token attestation (not
k8s_sat) - Join token stored in package:
spire/upstream-join-token - Parent ID:
spiffe://carverauto.dev/ns/edge/poller-nested-spire - Token TTL: 15 minutes (expires quickly for security)
Downstream Server Configuration:
- Poller runs nested SPIRE server for downstream workloads (agent, checkers)
- Bootstrap script (
bootstrap-nested-spire.sh) creates downstream entries - Downstream agent (in agent container) uses Unix selectors for attestation:
unix:uid:0
unix:gid:0
unix:user:root
unix:path:/opt/spire/bin/spire-server
Network/PID Sharing:
# poller-stack.compose.yml
agent:
network_mode: "service:poller"
pid: "service:poller"
- Agent shares network namespace with poller (accesses localhost:50051)
- Agent shares PID namespace for SPIRE Workload API attestation
4. Identified Gaps
4.1 No Centralized Service Registry
Gap: There is no dedicated table or persistent store listing all registered services across the system.
Current Behavior:
- Services are tracked only in the ephemeral
servicesstream (3-day TTL) - Query
ListAgentsWithPollers()derives agents from recent service reports - If agent stops reporting, it disappears from system view after 3 days
- No historical record of what services existed
Impact:
- Cannot answer "what services are registered in the system?" without querying time-series data
- Cannot distinguish between "service stopped reporting" vs "service was never onboarded"
- No persistent association between agent and poller
- Cannot pre-register agents/checkers before they start reporting
Recommendation: Create dedicated registration tables:
-- Persistent poller registry
CREATE TABLE pollers_registry (
poller_id string,
component_id string, -- from edge onboarding
status string, -- 'pending', 'active', 'inactive', 'revoked'
first_registered DateTime64(3),
last_seen DateTime64(3),
registration_source string, -- 'edge_onboarding', 'k8s_spiffe', 'config'
metadata string -- JSON
) PRIMARY KEY poller_id;
-- Persistent agent registry
CREATE TABLE agents_registry (
agent_id string,
poller_id string, -- parent poller
component_id string, -- from edge onboarding
status string,
first_registered DateTime64(3),
last_seen DateTime64(3),
registration_source string,
metadata string
) PRIMARY KEY agent_id;
-- Persistent checker registry
CREATE TABLE checkers_registry (
checker_id string,
agent_id string, -- parent agent
poller_id string, -- grandparent poller
checker_kind string, -- 'snmp', 'sysmon', 'rperf', etc.
status string,
first_registered DateTime64(3),
last_seen DateTime64(3),
registration_source string,
metadata string
) PRIMARY KEY checker_id;
Benefits:
- Clear audit trail of all services ever registered
- Pre-registration support (mark as 'pending' before first report)
- Distinguish between "never reported" vs "stopped reporting"
- Persistent agent-to-poller associations
- Support for "list all pollers" without relying on time-series data
4.2 Agent/Checker Onboarding Not Fully Automated
Gap: While poller onboarding is fully automated, agent and checker onboarding still requires manual KV updates (GH-1909 not complete).
Current Agent Onboarding:
- ✅ Can create agent package via API
- ⚠️ Package creation does NOT update KV automatically
- ❌ Must manually update
config/pollers/<poller-id>/agents/<agent-id>.jsonin KV - ⚠️ Agent only becomes "known" after it starts reporting services
Current Checker Onboarding:
- ✅ Checker templates can be registered automatically (serviceradar-57)
- ⚠️ Must manually update
config/agents/<agent-id>/checkers/<checker-id>.jsonin KV - ❌ No UI/API to create checker packages linked to agents
Expected Flow (from GH-1909):
Admin creates agent package with parent_id=poller-123
→ Core automatically writes config/pollers/poller-123/agents/agent-456.json
→ Status: 'pending'
Agent first reports in
→ Core updates agent config status to 'active'
→ Poller immediately starts routing work to agent
Recommendation:
-
Complete GH-1909 implementation:
- Auto-create KV entries on package creation
- Support agent and checker package types
- Validate parent exists (poller for agent, agent for checker)
- Update status from 'pending' → 'active' on first report
-
Implement checker package creation:
serviceradar-cli edge create-package \
--type checker \
--parent-id agent-456 \
--checker-kind sysmon \
--metadata-json "$(cat sysmon-config.json)"
4.3 No Pre-Registration Support
Gap: Services must start reporting before they appear in the system. Cannot pre-register services.
Use Cases That Don't Work:
-
Admin wants to prepare 10 agents for a new site
- Cannot create agent records in advance
- Cannot verify configuration before deployment
- Cannot assign agents to pollers before they start
-
Admin wants to see "pending" services that haven't checked in yet
- No way to distinguish "hasn't reported yet" from "doesn't exist"
- Cannot track deployment progress
-
Admin wants to allocate agent IDs before deployment
- Agent IDs are generated at package creation
- But agent doesn't appear in system until first report
- Cannot verify agent configuration remotely before install
Recommendation:
Use the proposed *_registry tables (Gap 4.1) with status tracking:
pending- Package created, waiting for first reportactive- Service is reportinginactive- Service stopped reporting (exceeded threshold)revoked- Package was revoked
This enables:
- Pre-registration via edge package creation
- Clear deployment progress tracking
- Verification of expected vs actual services
4.4 Main Docker Compose Stack Not Using SPIFFE
Gap: The main docker-compose stack (used for local development) still uses static TLS certificates, not SPIFFE/SPIRE.
Current Behavior:
cert-generatorcontainer creates self-signed certificates- Services mount certificates from shared volume
- No automatic rotation, no central identity management
- Different authentication path than K8s and edge deployments
Rationale (from GH-1915):
- Main stack is for local development and trusted environments
- Edge onboarding is intentionally separate
- K8s uses SPIFFE Controller, edge uses nested SPIRE, main uses static certs
Impact:
- Developers test with different auth model than production
- Certificate rotation is manual
- No unified SPIFFE story across all deployment models
Recommendation: This appears to be intentional architectural decision. Document the three authentication models clearly:
- K8s Production: SPIFFE Controller + ClusterSPIFFEID CRDs
- Edge Sites: Nested SPIRE with join tokens
- Local Development: Static TLS certificates
If unifying is desired, consider running a lightweight SPIRE setup in docker-compose, but this may add unnecessary complexity for local development.
4.5 Join Token Expiration Friction
Gap: SPIRE join tokens expire after 15 minutes, creating deployment friction for edge sites.
Current Behavior:
- Admin creates edge package with join token
- Token expires in 15 minutes
- If edge site doesn't deploy within 15 minutes, installation fails
- Must create new package and re-distribute
Impact:
- Manual coordination required between package creation and deployment
- Cannot pre-stage packages for later deployment
- Difficult to automate deployment pipelines
Workaround:
pkg/edgeonboarding library could implement automatic token refresh:
- Package includes refresh token (longer TTL)
- On bootstrap, if join token expired, request new one via refresh token
- Continue with onboarding
Recommendation: This is marked as "Out of Scope (Future Work)" in GH-1915. Consider prioritizing if edge deployments become frequent.
Alternative: Support X.509 PoP attestation for edge sites where join tokens are impractical.
4.6 No Service Health Dashboard
Gap: While individual services report health, there's no centralized view of all registered services and their health status.
Current Behavior:
- Can query
/api/admin/pollersto see pollers - Can query
/api/admin/agentsto see agents (recently added) - No unified view of:
- Which services are registered
- Which are healthy vs unhealthy
- Which are pending activation
- Which have stopped reporting
Recommendation: Create unified service registry dashboard:
- List all pollers, agents, checkers with status
- Show parent-child relationships (agent → poller, checker → agent)
- Highlight services that haven't reported recently
- Show pending edge onboarding packages
This requires implementing Gap 4.1 (centralized service registry) first.
4.7 SPIRE Server Credentials in Kubernetes
Gap: From GH-1891, SPIRE database credentials are stored in a YAML file with placeholder password.
File: k8s/demo/base/spire/spire-db-credentials.yaml
Current Approach:
- Placeholder password in source control
- Admin must manually generate and apply real password
- Easy to forget, easy to commit secrets accidentally
Recommendation: Use Kubernetes External Secrets Operator or similar:
- Store SPIRE database credentials in HashiCorp Vault / AWS Secrets Manager
- External Secrets Operator syncs to Kubernetes Secret
- SPIRE StatefulSet references the synced Secret
- No secrets in source control, automatic rotation support
5. Architecture Strengths
5.1 Clean Separation of Onboarding Models
Strength: The system correctly separates three distinct onboarding models without conflating them:
-
K8s (Trusted): ClusterSPIFFEID CRDs + Controller Manager
- Fully declarative
- Native Kubernetes integration
- Zero touch for pod-based services
-
Edge (Untrusted): Package-based onboarding with nested SPIRE
- Token-based authentication
- Supports disconnected/intermittent connectivity
- Designed for hostile networks
-
Local Dev: Static certificates in docker-compose
- Simple, no external dependencies
- Fast iteration for developers
Benefit: Each model is optimized for its use case. No "one size fits all" compromise.
5.2 Edge Onboarding Library (pkg/edgeonboarding)
Strength: The common onboarding library (from GH-1915/serviceradar-57) provides excellent abstraction.
Features:
- ✅ Zero-touch bootstrap with single token
- ✅ Automatic package download and extraction
- ✅ SPIRE credential configuration
- ✅ Service config generation
- ✅ Deployment type detection (Docker vs bare metal)
Code Quality:
- Well-tested
- Clear separation of concerns
- Reusable across all edge services
Impact: Drastically reduces edge deployment complexity. Single command vs 5+ shell scripts.
5.3 Implicit Service Discovery
Strength: The implicit service registry (services via heartbeat) has advantages:
Benefits:
- Self-healing - services that stop reporting automatically disappear
- No stale data - only active services are visible
- Simple implementation - no explicit registration API needed
- Loose coupling - services don't need to know about registration
Trade-offs:
- Lack of pre-registration (Gap 4.3)
- No historical view (Gap 4.1)
- Cannot distinguish "never existed" from "stopped reporting"
Verdict: Good design for operational visibility, needs supplement for management/audit use cases.
5.4 Nested SPIRE for Edge
Strength: Using nested SPIRE for edge deployments is architecturally sound.
Benefits:
- Edge sites get their own trust boundary
- Reduces load on central SPIRE server
- Edge continues working if connection to cluster is lost (downstream SVIDs still issued)
- Proper security model for untrusted networks
Implementation:
- Poller runs both upstream agent (to cluster) and downstream server (for edge)
- Agent/checkers get SVIDs from poller's downstream server
- Network/PID namespace sharing enables Workload API attestation
6. Recommendations Summary
Priority 1: Complete Agent/Checker Automation (GH-1909)
Action Items:
-
Complete automatic KV updates on agent package creation
- Write to
config/pollers/<poller-id>/agents/<agent-id>.json - Set status: 'pending'
- Write to
-
Implement checker package creation
- Support parent_id (agent) validation
- Auto-update
config/agents/<agent-id>/checkers/<checker-id>.json - Include checker-specific metadata (credentials, targets)
-
Update UI to support component type selection
- Dropdown for poller/agent/checker
- Parent selector for agents and checkers
- Metadata forms tailored to component type
Impact: Eliminates manual KV updates, major friction point for edge deployments
Priority 2: Create Centralized Service Registry
Action Items:
-
Add registry tables:
pollers_registryagents_registrycheckers_registry
-
Update Core to write to registry on:
- Edge package activation
- First service report from new service
- K8s pod creation (via ClusterSPIFFEID reconciliation hook)
-
Implement status tracking:
- 'pending' → 'active' → 'inactive' → 'revoked'
- Background job to mark as 'inactive' if no reports for X hours
-
Add registry queries to API:
GET /api/admin/services/pollersGET /api/admin/services/agentsGET /api/admin/services/checkers- Include status, parent relationships, last_seen
Impact: Enables pre-registration, historical audit, service lifecycle management
Priority 3: Service Health Dashboard
Action Items:
-
Create unified UI dashboard showing:
- All registered services (from registry)
- Health status (active/inactive/pending)
- Parent-child relationships
- Pending edge packages awaiting activation
-
Add filters:
- By status
- By component type
- By poller/agent
- By registration source
-
Add alerting for:
- Services pending activation > 30 minutes
- Services that stopped reporting unexpectedly
- Edge packages with expired join tokens
Impact: Operational visibility, proactive issue detection
Priority 4: Document Onboarding Models
Action Items:
-
Update documentation to clearly explain three models:
- K8s (ClusterSPIFFEID)
- Edge (pkg/edgeonboarding + nested SPIRE)
- Local dev (static TLS)
-
Document when to use each model
-
Document trade-offs and security implications
-
Create decision tree for onboarding new services
Impact: Reduces confusion, helps operators choose correct approach
Priority 5: Secret Management in K8s
Action Items:
- Replace static SPIRE DB credentials with External Secrets Operator
- Document secret management best practices
- Implement credential rotation procedures
Impact: Security hardening, prevent accidental credential leakage
7. Conclusion
ServiceRadar has a solid onboarding foundation with excellent separation of concerns across deployment models. The K8s SPIFFE Controller provides fully automated onboarding for cluster deployments, and the new edge onboarding library (GH-1915) drastically simplifies edge deployments.
Key Strengths:
- ✅ Clean architectural separation (K8s/Edge/Dev)
- ✅ Zero-touch edge onboarding library
- ✅ Nested SPIRE for edge security
- ✅ Implicit service discovery via heartbeat
Key Gaps:
- ⚠️ No centralized service registry
- ⚠️ Agent/checker automation incomplete (GH-1909)
- ⚠️ No pre-registration support
- ⚠️ Main docker-compose uses static TLS
- ⚠️ Join token expiration friction
- ⚠️ No unified service health dashboard
Recommended Next Steps:
- Complete GH-1909 (agent/checker automation) - Highest Priority
- Implement centralized service registry
- Build service health dashboard
- Document onboarding models clearly
These improvements will eliminate remaining manual steps, provide better operational visibility, and complete the zero-touch onboarding vision outlined in GH-1891.
References
- GH-1891: Implement zero-touch onboarding across the ServiceRadar stack
- GH-1909: Edge onboarding: support agents and checkers
- GH-1915 / serviceradar-57: Create common onboarding library to eliminate edge deployment friction
- GH-1899: feat: onboarding agents
docs/docs/edge-agent-onboarding.md- Edge agent onboarding guidedocs/docs/edge-onboarding.md- Edge onboarding runbookdocs/docs/custom-checkers.md- Custom checker development guidedocs/docs/spiffe-identity.md- SPIFFE/SPIRE platform documentationdocs/docs/spire-onboarding-plan.md- SPIRE integration planningpkg/edgeonboarding/- Edge onboarding library implementationpkg/core/edge_onboarding.go- Core edge onboarding servicepkg/core/pollers.go- Poller management and isKnownPoller()pkg/core/devices.go- Device registration via checkerspkg/db/pollers.go- Poller/agent database queriesk8s/demo/base/spire/- Kubernetes SPIFFE/SPIRE manifestsdocker/compose/poller-stack.compose.yml- Edge deployment stack
Document created: November 1, 2025 Author: ServiceRadar Core Team Status: Review Complete