ADR-009: Control Plane Architecture
Status
Accepted
Context
As Aragora scaled to enterprise deployments, the system needed:
- Multi-tenant isolation for workspace separation
- Centralized agent management across distributed instances
- Task scheduling with priority queuing and retry logic
- Health monitoring with circuit breaker patterns
- Policy enforcement for access control and resource limits
The original design tightly coupled agents to specific debates, limiting scalability.
Decision
We implemented a Unified Control Plane with Redis-backed coordination for distributed agent management.
Core Components
ControlPlaneCoordinator (aragora/control_plane/coordinator.py):
class ControlPlaneCoordinator:
"""
Unified coordinator providing high-level operations that coordinate:
- AgentRegistry: Service discovery and agent management
- TaskScheduler: Task distribution and lifecycle
- HealthMonitor: Health tracking and circuit breakers
"""
async def register_agent(
agent_id: str,
capabilities: List[str],
model: str,
) -> None
async def submit_task(
task_type: str,
payload: Dict[str, Any],
required_capabilities: List[str],
) -> str
async def wait_for_result(task_id: str, timeout: float) -> Any
ControlPlaneConfig:
@dataclass
class ControlPlaneConfig:
redis_url: str = "redis://localhost:6379"
key_prefix: str = "aragora:cp:"
heartbeat_timeout: float = 30.0
heartbeat_interval: float = 10.0
task_timeout: float = 300.0
max_task_retries: int = 3
Subsystems
AgentRegistry (aragora/control_plane/registry.py):
- Agent capability declaration (debate, code, search)
- Status tracking (IDLE, BUSY, OFFLINE)
- Heartbeat-based health detection
- Capability-based task routing
TaskScheduler (aragora/control_plane/scheduler.py):
- Priority queuing (CRITICAL, HIGH, NORMAL, LOW)
- Task lifecycle management (PENDING, RUNNING, COMPLETED, FAILED)
- Automatic retry with exponential backoff
- Result caching and cleanup
HealthMonitor (aragora/control_plane/health.py):
- Liveness and readiness probes
- Circuit breaker pattern for failing agents
- Health aggregation across instances
- Alerting integration points
Multi-Tenancy
TenancyMiddleware (aragora/server/middleware/tenancy.py):
- Workspace isolation via headers/tokens
- Per-tenant resource quotas
- Audit logging for compliance
PolicyEngine (aragora/policy/engine.py):
- Rule-based access control
- Rate limiting per workspace
- Feature flag evaluation
- Cost tracking and budgets
HTTP Handlers
Control Plane API (aragora/server/handlers/control_plane.py):
POST /api/control-plane/agents/register
GET /api/control-plane/agents
POST /api/control-plane/tasks/submit
GET /api/control-plane/tasks/\{task_id\}
GET /api/control-plane/health
Consequences
Positive:
- Horizontal scaling via Redis coordination
- True multi-tenancy with workspace isolation
- Unified agent management across instances
- Circuit breakers prevent cascade failures
- Policy-based governance for enterprise compliance
Negative:
- Redis dependency for distributed state
- Added latency for cross-instance coordination
- Complexity in debugging distributed flows
- Heartbeat overhead for large agent pools
Mitigations:
- Local caching with Redis as source of truth
- Connection pooling for Redis
- Structured logging with correlation IDs
- Configurable heartbeat intervals
Security Considerations
- Workspace tokens validated per request
- Agent registration requires authentication
- Task payloads can be encrypted at rest
- Audit trail for all control plane operations
References
aragora/control_plane/coordinator.py- Main coordinatoraragora/control_plane/registry.py- Agent registryaragora/control_plane/scheduler.py- Task scheduleraragora/control_plane/health.py- Health monitoringaragora/policy/engine.py- Policy enforcementaragora/server/middleware/tenancy.py- Multi-tenant middlewarearagora/server/handlers/control_plane.py- HTTP API