Aragora Operations Runbook
This document provides operational guidance for running, monitoring, and troubleshooting Aragora in production.
Table of Contents
- Quick Start
- Server Management
- Monitoring & Observability
- Admin Console & Developer Portal
- Common Issues & Debugging
- Scaling Guide
- Incident Response
- Database Operations
- Backup & Recovery
- Storage Cleanup
- Knowledge Mound Operations
- Security & Governance Hardening
- Decision Router Cache Invalidation
- Webhook Delivery Manager
- Integration Store Metrics
Quick Start
Starting the Server
# Production mode
aragora serve --api-port 8080 --ws-port 8765
# Development mode (same entrypoint; tune via env vars)
ARAGORA_LOG_LEVEL=DEBUG aragora serve --api-port 8080 --ws-port 8765
# With custom data directory
ARAGORA_DATA_DIR=/data/aragora aragora serve --api-port 8080 --ws-port 8765
Verifying Server Health
# HTTP health check
curl http://localhost:8080/api/health
# Expected response:
# {"status": "healthy", "version": "1.0.0", "uptime": 3600}
# WebSocket health check (public endpoint, no auth required)
curl http://localhost:8080/api/health/ws
# Expected response:
# {"status": "healthy", "clients": 5}
# or when unavailable:
# {"status": "unavailable", "clients": 0, "message": "WebSocket manager not configured"}
# WebSocket connectivity test
wscat -c ws://localhost:8765/ws
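For automated deployment checks, the same health endpoint can be polled until the server reports ready. A minimal sketch using only the Python standard library; the endpoint and response shape follow the examples above, while the timeout and interval values are illustrative:

import json
import time
import urllib.request

def wait_for_healthy(url="http://localhost:8080/api/health", timeout=60, interval=2):
    """Poll the health endpoint until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = json.load(resp)
                if body.get("status") == "healthy":
                    return body
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval)
    raise TimeoutError(f"{url} did not report healthy within {timeout}s")

if __name__ == "__main__":
    print(wait_for_healthy())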
Environment Variables
AI Providers (at least one required):
| Variable | Required | Default | Description |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | Yes* | - | Anthropic Claude API key |
| `OPENAI_API_KEY` | Yes* | - | OpenAI API key |
| `OPENROUTER_API_KEY` | No | - | Fallback provider (auto-used on 429) |
| `MISTRAL_API_KEY` | No | - | Mistral API key |
| `GEMINI_API_KEY` | No | - | Google Gemini API key |
| `XAI_API_KEY` | No | - | xAI Grok API key |
Server Configuration:
| Variable | Required | Default | Description |
|---|---|---|---|
| `ARAGORA_DATA_DIR` | No | .nomic | Runtime data directory (databases, backups) |
| `ARAGORA_API_TOKEN` | No | - | API authentication token |
| `ARAGORA_ALLOWED_ORIGINS` | No | See ENVIRONMENT.md | CORS allowed origins (wildcard disallowed) |
| `ARAGORA_LOG_LEVEL` | No | INFO | Log level (DEBUG, INFO, WARNING, ERROR) |
Authentication (required for production):
| Variable | Required | Default | Description |
|---|---|---|---|
| `ARAGORA_JWT_SECRET` | Prod | - | Secret for JWT signing (min 32 chars) |
| `ARAGORA_JWT_EXPIRY_HOURS` | No | 24 | Token expiration in hours |
| `ARAGORA_REFRESH_TOKEN_EXPIRY_DAYS` | No | 30 | Refresh token expiration in days |
| `GOOGLE_OAUTH_CLIENT_ID` | No | - | Google OAuth client ID |
| `GOOGLE_OAUTH_CLIENT_SECRET` | No | - | Google OAuth client secret |
| `GOOGLE_OAUTH_REDIRECT_URI` | No | - | OAuth callback URL |
| `OAUTH_SUCCESS_URL` | No | - | Post-login redirect |
| `OAUTH_ERROR_URL` | No | - | Auth error redirect |
| `OAUTH_ALLOWED_REDIRECT_HOSTS` | No | - | Allowed redirect hosts |
Persistence (optional):
| Variable | Required | Default | Description |
|---|---|---|---|
| `SUPABASE_URL` | No | - | Supabase project URL |
| `SUPABASE_KEY` | No | - | Supabase anon key |
| `ARAGORA_REDIS_URL` | No | - | Redis URL for distributed caching |
*At least one AI provider key is required.
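Because the provider requirement is "at least one of", a startup preflight can fail fast before launching the server. A small sketch; the variable names come from the table above:

import os
import sys

PROVIDER_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "OPENROUTER_API_KEY",
    "MISTRAL_API_KEY",
    "GEMINI_API_KEY",
    "XAI_API_KEY",
]

# Exit with a clear message if no provider key is configured
configured = [k for k in PROVIDER_KEYS if os.environ.get(k)]
if not configured:
    sys.exit("No AI provider key set; export at least one of: " + ", ".join(PROVIDER_KEYS))
print("Configured providers:", ", ".join(configured))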
Server Management
Process Supervision
Use systemd for production deployments:
# /etc/systemd/system/aragora.service
[Unit]
Description=Aragora Multi-Agent Debate Server
After=network.target
[Service]
Type=simple
User=aragora
WorkingDirectory=/opt/aragora
Environment=PYTHONPATH=/opt/aragora
EnvironmentFile=/opt/aragora/.env
ExecStart=/opt/aragora/venv/bin/aragora serve --api-port 8080 --ws-port 8765
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl enable aragora
sudo systemctl start aragora
# Check status
sudo systemctl status aragora
# View logs
journalctl -u aragora -f
Graceful Shutdown
The server handles SIGTERM gracefully:
# Graceful shutdown (waits for active debates to complete)
kill -TERM $(pgrep -f "aragora.server")
# Force shutdown (immediate)
kill -9 $(pgrep -f "aragora.server")
Monitoring & Observability
Prometheus Metrics
Metrics are exposed at /api/metrics:
curl http://localhost:8080/api/metrics
Key metrics:
| Metric | Type | Description |
|---|---|---|
| `aragora_debates_total` | Counter | Total debates started |
| `aragora_debates_completed` | Counter | Completed debates |
| `aragora_agent_latency_seconds` | Histogram | Agent response latency |
| `aragora_agent_tokens_total` | Counter | Total tokens consumed |
| `aragora_fallback_activations_total` | Counter | Fallback activations |
| `aragora_fallback_success_total` | Counter | Fallback outcomes |
| `aragora_fallback_latency_seconds` | Histogram | Fallback latency |
| `aragora_websocket_connections` | Gauge | Active WebSocket connections |
| `aragora_circuit_breaker_state` | Gauge | Circuit breaker status per agent |
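For ad hoc checks or smoke tests outside Prometheus, the exposition text can be parsed with the prometheus_client library. A sketch, assuming prometheus_client is installed and the endpoint above is reachable:

import urllib.request

from prometheus_client.parser import text_string_to_metric_families

def scrape(url="http://localhost:8080/api/metrics"):
    """Fetch the metrics endpoint and index samples by metric name."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            samples.setdefault(sample.name, []).append((sample.labels, sample.value))
    return samples

metrics = scrape()
for labels, value in metrics.get("aragora_websocket_connections", []):
    print("active websocket connections:", value, labels)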
Grafana Dashboard
Import the provided dashboards:
# Dashboard files at:
deploy/grafana/dashboards/
├── debate-metrics.json # Debate success rates, rounds, outcomes
├── api-latency.json # API endpoint latency tracking
├── agent-performance.json # Agent response times and errors
└── slo-tracking.json # Service level objectives
Log Levels
Set log level via environment:
export ARAGORA_LOG_LEVEL=DEBUG # DEBUG, INFO, WARNING, ERROR
Log output includes:
- Request correlation IDs (`X-Request-ID`)
- Debate IDs in all related log lines
- Agent response times and token counts
Alerting Rules
Recommended Prometheus alerting rules:
groups:
- name: aragora
rules:
- alert: HighAgentLatency
expr: histogram_quantile(0.95, rate(aragora_agent_latency_seconds_bucket[5m])) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "Agent response time > 30s at p95"
- alert: CircuitBreakerOpen
expr: aragora_circuit_breaker_state == 2
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker open for {{ $labels.agent }}"
- alert: HighErrorRate
expr: rate(aragora_debates_failed[5m]) / rate(aragora_debates_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Debate failure rate > 10%"
Admin Console & Developer Portal
Admin Console (/admin)
The admin console surfaces system health, circuit breaker state, recent errors, and rate-limit status. It is intended for on-call and operations use.
Operational dependencies:
- Auth must be enabled (JWT) and the user role must be `admin`.
- The console reads from these endpoints:
  - `GET /api/health`
  - `GET /api/system/circuit-breakers`
  - `GET /api/system/errors?limit=20` (optional)
  - `GET /api/system/rate-limits` (optional)
Notes:
- If optional endpoints are not available, the UI degrades gracefully.
- For production, restrict `/admin` behind SSO, an allowlist, or an internal network boundary.
Developer Portal (/developer)
The developer portal provides API key management and usage telemetry for authenticated users.
Operational dependencies:
- Auth must be enabled (JWT). Standard users can access their own portal.
- The portal reads from these endpoints:
  - `GET /api/auth/me`
  - `POST /api/auth/api-key`
  - `DELETE /api/auth/api-key`
  - `GET /api/billing/usage`
Notes:
- API keys are bearer credentials; display them once and store only client-side.
- Encourage users to rotate keys on compromise or after team changes.
Common Issues & Debugging
Agent Timeouts
Symptoms: Debates hang, agents stop responding
Diagnosis:
# Check circuit breaker state
curl http://localhost:8080/api/agents/health
# Check agent-specific logs
grep "agent=openai-api" /var/log/aragora/server.log | tail -100
Resolution:
- Check API key validity
- Verify rate limits not exceeded
- Check network connectivity to provider
- Circuit breaker will auto-recover after 60s (a polling sketch follows below)
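Because the breaker recovers on its own, an on-call script can poll agent health until it closes instead of restarting the server. A sketch, assuming /api/agents/health returns a JSON object keyed by agent name with a circuit_breaker state string; the exact response shape may differ:

import json
import time
import urllib.request

def wait_for_breaker_close(agent, url="http://localhost:8080/api/agents/health", timeout=120):
    """Poll agent health until the named agent's circuit breaker reports closed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(url, timeout=5) as resp:
            health = json.load(resp)
        if health.get(agent, {}).get("circuit_breaker") == "closed":  # assumed field name
            return True
        time.sleep(5)
    return False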
WebSocket Disconnections
Symptoms: Real-time updates stop, clients disconnect
Diagnosis:
# Check active connections
curl http://localhost:8080/api/ws/stats
# Check for connection errors
grep "WebSocket" /var/log/aragora/server.log | grep -i error
Resolution:
- Check nginx/proxy timeout settings (increase to 300s)
- Verify client heartbeat interval matches server
- Check for network issues (firewall, NAT timeout)
Database Locks
Symptoms: Slow queries, write failures
Diagnosis:
# Check for WAL bloat
ls -la /data/*.db*
# Check active locks
sqlite3 /data/aragora_memory.db ".shell fuser /data/aragora_memory.db"
Resolution:
# Force WAL checkpoint
sqlite3 /data/aragora_memory.db "PRAGMA wal_checkpoint(TRUNCATE);"
# Vacuum database (offline)
sqlite3 /data/aragora_memory.db "VACUUM;"
Memory Leaks
Symptoms: Increasing memory usage over time
Diagnosis:
# Check process memory
ps aux | grep aragora
# Profile memory (example)
python -m memray run -m aragora.server --http-port 8080 --port 8765
Resolution:
- Check for unclosed database connections
- Review event buffer sizes
- Restart server (gracefully) during low-traffic periods
Scaling Guide
Horizontal Scaling
Aragora supports horizontal scaling with shared state:
┌─────────────────┐
│ Load Balancer │
└────────┬────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ :8080 │ │ :8080 │ │ :8080 │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌────────▼────────┐
│ Redis/Postgres │
│ (shared state) │
└─────────────────┘
Requirements for horizontal scaling:
- Shared database (PostgreSQL or Redis for session state)
- Sticky sessions for WebSocket connections
- Shared file storage for replays/checkpoints
Vertical Scaling
Single-node optimization:
# Increase worker threads
export ARAGORA_WORKERS=4
# Increase connection pool
export ARAGORA_DB_POOL_SIZE=20
# Increase event buffer
export ARAGORA_EVENT_BUFFER_SIZE=10000
Load Balancer Configuration
nginx example for WebSocket support:
upstream aragora {
ip_hash; # Sticky sessions for WebSocket
server 127.0.0.1:8080;
server 127.0.0.1:8081;
}
server {
listen 443 ssl http2;
server_name api.aragora.ai;
location / {
proxy_pass http://aragora;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Service down | 15 min | Server crash, all agents failing |
| P2 | Major degradation | 1 hour | 50%+ debates failing, high latency |
| P3 | Minor issue | 4 hours | Single agent failing, UI bugs |
| P4 | Low priority | 1 day | Documentation issues, minor UX |
Response Playbook
P1: Service Down
- Verify outage scope
  curl -I http://localhost:8080/api/health
- Check server process
  systemctl status aragora
  journalctl -u aragora --since "5 minutes ago"
- Attempt restart
  sudo systemctl restart aragora
- Check database integrity
  sqlite3 /data/aragora_memory.db "PRAGMA integrity_check;"
- Rollback if needed
  cd /opt/aragora && git checkout v1.x.x
  sudo systemctl restart aragora
P2: High Error Rate
- Identify failing component
  curl http://localhost:8080/api/agents/health
- Check rate limits
  grep "rate_limit\|429" /var/log/aragora/server.log
- Check API provider status
  - Anthropic: https://status.anthropic.com
  - OpenAI: https://status.openai.com
- Enable fallback providers
  export OPENROUTER_API_KEY="..."
  sudo systemctl restart aragora
Post-Incident Review
Document in .nomic/incidents/:
# Incident: YYYY-MM-DD-title
## Summary
Brief description of what happened
## Timeline
- HH:MM - First alert
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Service restored
## Root Cause
Technical description
## Action Items
- [ ] Preventive measure 1
- [ ] Preventive measure 2
Database Operations
Routine Maintenance
# Daily: WAL checkpoint
sqlite3 /data/aragora_memory.db "PRAGMA wal_checkpoint(PASSIVE);"
# Weekly: Analyze for query optimization
sqlite3 /data/aragora_memory.db "ANALYZE;"
# Monthly: Vacuum (during maintenance window)
sqlite3 /data/aragora_memory.db "VACUUM;"
Schema Migrations
Migrations are in aragora/migrations/:
# Apply pending migrations
python -m aragora.migrations.apply
# Check migration status
python -m aragora.migrations.status
# Rollback last migration
python -m aragora.migrations.rollback
Backup Procedures
# Hot backup (while server running)
sqlite3 /data/aragora_memory.db ".backup /backups/aragora_memory_$(date +%Y%m%d).db"
# Full backup script
#!/bin/bash
BACKUP_DIR=/backups/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
for db in /data/*.db; do
  sqlite3 "$db" ".backup $BACKUP_DIR/$(basename "$db")"
done
# Compress and upload
tar -czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
aws s3 cp "$BACKUP_DIR.tar.gz" s3://aragora-backups/
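Backups are only useful if they restore cleanly, so verify them after each run. A minimal sketch using the standard library sqlite3 module to run PRAGMA integrity_check against every backup file; the directory layout follows the script above:

import sqlite3
from pathlib import Path

def verify_backups(backup_dir):
    """Run PRAGMA integrity_check on each .db file and collect the results."""
    results = {}
    for db_path in Path(backup_dir).glob("*.db"):
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)  # read-only open
        try:
            results[db_path.name] = conn.execute("PRAGMA integrity_check;").fetchone()[0]
        finally:
            conn.close()
    return results

for name, status in verify_backups("/backups/20240115").items():
    print(f"{name}: {status}")  # "ok" means the database passed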
Backup & Recovery
Backup Schedule
| Type | Frequency | Retention | Storage |
|---|---|---|---|
| WAL checkpoint | Hourly | 24 hours | Local |
| Hot backup | Daily | 7 days | S3 |
| Full backup | Weekly | 30 days | S3 + Glacier |
| Debate traces | On completion | 90 days | S3 |
Recovery Procedures
Point-in-Time Recovery
# Stop server
sudo systemctl stop aragora
# Restore from backup
cp /backups/20240115/aragora_memory.db /data/aragora_memory.db
# Replay WAL if available
sqlite3 /data/aragora_memory.db "PRAGMA wal_checkpoint(RESTART);"
# Verify integrity
sqlite3 /data/aragora_memory.db "PRAGMA integrity_check;"
# Start server
sudo systemctl start aragora
Disaster Recovery
- Provision new server
- Install Aragora from git
- Restore latest backup from S3
- Update DNS to point to new server
- Verify functionality
Storage Cleanup
The .nomic/ directory accumulates data over time: backups, checkpoints, session telemetry, and artifacts. Use the cleanup script to manage storage.
Cleanup Script
# Preview what would be cleaned up (always run first)
python scripts/cleanup_nomic_state.py --dry-run
# Actually perform cleanup with default settings (7 days retention)
python scripts/cleanup_nomic_state.py
# Customize retention periods
python scripts/cleanup_nomic_state.py \
--backup-days 14 \
--session-days 3 \
--checkpoint-days 7
# Archive old data before deleting
python scripts/cleanup_nomic_state.py --archive-to /backups/nomic_archive/
What Gets Cleaned
| Category | Default Retention | Description |
|---|---|---|
| Backups | 7 days (keep 5 latest) | Nomic loop cycle backups |
| Sessions | 3 days (test: 1 day) | Session telemetry data |
| Checkpoints | 7 days | Debate state checkpoints |
| Artifacts | 7 days | Timestamped debug artifacts |
| Orphaned WAL | Immediate | WAL/SHM files without parent DB |
Essential Databases (Never Deleted)
The cleanup script preserves these core databases:
- `core.db`, `memory.db`, `agents.db`, `debates.db`
- `agent_elo.db`, `agent_memories.db`, `agent_calibration.db`
- `consensus_memory.db`, `continuum.db`, `continuum_memory.db`
- `users.db`, `usage.db`, `scheduled_debates.db`
Cleanup Schedule
Recommended: Run cleanup weekly via cron:
# Add to crontab
0 3 * * 0 cd /opt/aragora && python scripts/cleanup_nomic_state.py --yes >> /var/log/aragora/cleanup.log 2>&1
Manual Analysis
To analyze storage usage without cleaning:
python scripts/cleanup_nomic_state.py --analyze-only
Decision Receipt Retention
Decision receipts (cryptographic audit trails for debates) are automatically cleaned up by the Receipt Retention Scheduler. This runs as a background async task.
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| `ARAGORA_RECEIPT_RETENTION_DAYS` | 2555 (~7 years) | How long to keep receipts |
| `ARAGORA_RECEIPT_CLEANUP_INTERVAL_HOURS` | 24 | How often to run cleanup |
Monitoring
Check scheduler status via the API:
curl -s http://localhost:8080/api/admin/schedulers/receipt-retention/status | jq .
# Expected response:
# {
# "running": true,
# "interval_hours": 24,
# "retention_days": 2555,
# "stats": {
# "total_runs": 5,
# "total_receipts_deleted": 0,
# "failures": 0,
# "success_rate": 1.0
# }
# }
Manual Cleanup
Trigger immediate cleanup (for maintenance or testing):
curl -X POST http://localhost:8080/api/admin/schedulers/receipt-retention/cleanup
Prometheus Metrics
The scheduler exposes these metrics:
- `aragora_receipt_cleanup_total` - Total cleanup operations
- `aragora_receipt_cleanup_duration_seconds` - Cleanup duration histogram
- `aragora_receipts_deleted_total` - Total receipts deleted
Knowledge Mound Operations
The Knowledge Mound (KM) is Aragora's unified knowledge storage system that enables cross-debate learning and organizational knowledge accumulation.
Health Monitoring
# Comprehensive KM health check
curl -s http://localhost:8080/api/health/knowledge-mound | jq .
# Expected healthy response:
# {
# "status": "healthy",
# "summary": {"total_components": 11, "healthy": 11, "active": 8},
# "components": {...}
# }
Environment Configuration
| Variable | Default | Description |
|---|---|---|
| `KNOWLEDGE_MOUND_DATABASE_URL` | SQLite | PostgreSQL URL for production |
| `KNOWLEDGE_MOUND_REDIS_URL` | - | Redis URL for KM caching |
| `CP_ENABLE_KM` | true | Enable Control Plane → KM integration |
| `CP_KM_WORKSPACE` | default | Default workspace for Control Plane |
Prometheus Metrics
Key KM metrics exposed at /metrics:
| Metric | Type | Description |
|---|---|---|
| `aragora_km_operations_total` | Counter | KM operations by type and status |
| `aragora_km_operation_latency_seconds` | Histogram | Operation latency |
| `aragora_km_adapter_syncs_total` | Counter | Adapter sync operations |
| `aragora_km_cp_task_outcomes_total` | Counter | Control Plane task outcomes stored |
| `aragora_km_forward_sync_latency_seconds` | Histogram | Forward sync latency by adapter |
| `aragora_km_reverse_query_latency_seconds` | Histogram | Reverse query latency |
| `aragora_km_semantic_search_total` | Counter | Semantic search operations |
| `aragora_km_cross_debate_reuse_total` | Counter | Knowledge reused across debates |
Bidirectional Adapters
KM uses adapters to sync data bidirectionally between subsystems:
| Adapter | Direction | Data Synced |
|---|---|---|
| ContinuumAdapter | ↔ | Multi-tier memory entries |
| ConsensusAdapter | ↔ | Debate consensus outcomes |
| CritiqueAdapter | ↔ | Critique patterns and feedback |
| EvidenceAdapter | ↔ | Evidence snippets with quality scores |
| BeliefAdapter | ↔ | Belief network nodes and cruxes |
| InsightsAdapter | ↔ | Debate insights and Trickster flips |
| EloAdapter | ↔ | Agent rankings and calibration |
| PulseAdapter | ↔ | Trending topics and scheduled debates |
| CostAdapter | ↔ | Budget alerts and cost patterns |
| RankingAdapter | ↔ | Agent expertise by domain |
| CultureAdapter | ↔ | Organizational culture patterns |
| ControlPlaneAdapter | ↔ | Task outcomes and agent capabilities |
Control Plane Integration
The Control Plane stores task outcomes and capability records in KM:
# Task outcomes automatically stored on completion
coordinator.complete_task(task_id, result, agent_id="claude-3", latency_ms=5000)
# Query agent recommendations from KM
recommendations = await coordinator.get_agent_recommendations("debate")
Cross-Workspace Learning
Share insights across workspaces via the Control Plane adapter:
from datetime import datetime

from aragora.knowledge.mound.adapters import ControlPlaneAdapter, CrossWorkspaceInsight
adapter = ControlPlaneAdapter(knowledge_mound=km, workspace_id="workspace_a")
# Share insight
insight = CrossWorkspaceInsight(
insight_id="insight_001",
source_workspace="workspace_a",
target_workspaces=["workspace_b", "workspace_c"],
task_type="debate",
content="Structured 3-round debates work best for consensus",
confidence=0.85,
created_at=datetime.now().isoformat(),
)
await adapter.share_insight_cross_workspace(insight)
# Query insights from other workspaces
insights = await adapter.get_cross_workspace_insights("debate")
Troubleshooting
KM not storing data:
- Check the health endpoint: `curl localhost:8080/api/health/knowledge-mound`
- Verify database connectivity
- Check adapter sync metrics: `aragora_km_adapter_syncs_total`
Slow queries:
- Monitor the `aragora_km_operation_latency_seconds` histogram
- Check the Redis cache hit rate: `aragora_km_cache_hits_total / (hits + misses)`
- Consider enabling RLM summaries: `enable_rlm_summaries: true`
Cross-workspace insights not appearing:
- Verify workspace IDs match `target_workspaces`
- Check the `aragora_km_cp_cross_workspace_shares_total` metric
- Ensure the minimum confidence threshold is met (default 0.6)
Appendix
Useful Commands
# Watch server logs
tail -f /var/log/aragora/server.log | jq .
# Count active debates
curl -s http://localhost:8080/api/debates | jq '.debates | length'
# List agents and their ELO
curl -s http://localhost:8080/api/leaderboard/rankings | jq '.agents[] | {name, elo}'
# View circuit breaker status
curl -s http://localhost:8080/api/circuit-breakers | jq .
# Export debate history
curl -s http://localhost:8080/api/debates/export?format=json > debates.json
Configuration Reference
Full configuration options live in aragora/config/settings.py (Pydantic settings)
and aragora/config/legacy.py (legacy constants). Environment variables are the
primary configuration surface; see docs/ENVIRONMENT.md.
# Example overrides
export ARAGORA_API_TOKEN="your-secret-token"
export ARAGORA_WS_MAX_MESSAGE_SIZE=65536
export ARAGORA_DB_POOL_SIZE=10
export ARAGORA_DB_POOL_TIMEOUT=30
Support Contacts
- GitHub Issues: https://github.com/an0mium/aragora/issues
- Documentation: https://docs.aragora.ai
- Status Page: https://status.aragora.ai
Security & Governance Hardening
This section covers security hardening features, including encryption at rest, RBAC, and data migration.
Encryption at Rest
Sensitive data is encrypted using AES-256-GCM with field-level encryption. The following stores support encryption:
- SyncStore: Connector credentials (api_key, secret, password, token, etc.)
- IntegrationStore: Integration settings with sensitive fields
- GmailTokenStore: OAuth access and refresh tokens
Configuration
# Required: Set a 32-byte (64 hex characters) encryption key
export ARAGORA_ENCRYPTION_KEY="your-64-hex-character-encryption-key-here"
# Generate a secure key:
python -c "import secrets; print(secrets.token_hex(32))"
Migrating Existing Data
If you have existing unencrypted data, use the migration utility:
# Dry run (preview what will be migrated)
python -m aragora.storage.migrations.encrypt_existing_data --all
# Execute migration
python -m aragora.storage.migrations.encrypt_existing_data --all --execute
# Migrate specific stores
python -m aragora.storage.migrations.encrypt_existing_data --sync-store --execute
python -m aragora.storage.migrations.encrypt_existing_data --integration-store --execute
python -m aragora.storage.migrations.encrypt_existing_data --gmail-tokens --execute
The migration is backward-compatible: encrypted fields are marked with _encrypted: true, so the system can read both encrypted and plaintext data during transition.
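The dual-read convention can be illustrated with a short sketch. This is not the actual store code; decrypt_field is a hypothetical stand-in for the real encryption service, and only the _encrypted marker logic mirrors the behavior described above:

def read_field(record, field, decrypt_field):
    """Read a field that may be encrypted (marked) or plaintext (legacy)."""
    value = record[field]
    if record.get("_encrypted"):
        return decrypt_field(value)  # ciphertext written after migration
    return value  # plaintext record not yet migrated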
Verifying Encryption
# Check if encryption is properly configured
python -c "
from aragora.security.encryption import get_encryption_service, CRYPTO_AVAILABLE
print(f'Crypto available: {CRYPTO_AVAILABLE}')
if CRYPTO_AVAILABLE:
svc = get_encryption_service()
print(f'Active key: {svc.get_active_key_id()}')
"
Encryption Key Rotation
Key rotation is critical for maintaining security. Follow this procedure to rotate encryption keys without data loss.
When to Rotate Keys
- Regular rotation: Every 90 days (recommended) or per compliance requirements
- Incident response: If key compromise is suspected
- Personnel changes: When staff with key access leave
Key Rotation Procedure
Step 1: Generate a new encryption key
# Generate a new 32-byte (64 hex character) key
python -c "import secrets; print(secrets.token_hex(32))"
Step 2: Add the new key to environment (multi-key support)
The encryption service supports multiple keys via comma-separated values. The first key is used for encryption, others for decryption only:
# Format: new_key,old_key1,old_key2,...
export ARAGORA_ENCRYPTION_KEY="new-64-hex-key,old-64-hex-key"
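Conceptually, this multi-key scheme encrypts with the first key and tries each configured key in order when decrypting. A sketch of the pattern (not the actual service) using the cryptography package's AES-256-GCM primitive:

import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def parse_keys(env_value):
    """First key encrypts; all keys are candidates for decryption."""
    return [bytes.fromhex(k.strip()) for k in env_value.split(",")]

def encrypt(keys, plaintext):
    nonce = os.urandom(12)  # fresh 96-bit nonce per message
    return nonce + AESGCM(keys[0]).encrypt(nonce, plaintext, None)

def decrypt(keys, blob):
    nonce, ciphertext = blob[:12], blob[12:]
    for key in keys:  # newest key first, older keys as fallback
        try:
            return AESGCM(key).decrypt(nonce, ciphertext, None)
        except InvalidTag:
            continue
    raise ValueError("no configured key can decrypt this record")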
Step 3: Re-encrypt existing data
# Dry run to verify (reads with old key, would write with new key)
python -m aragora.storage.migrations.encrypt_existing_data --all
# Execute re-encryption
python -m aragora.storage.migrations.encrypt_existing_data --all --execute
Step 4: Verify re-encryption
# Check that all records use the new key
python -c "
from aragora.security.encryption import get_encryption_service
svc = get_encryption_service()
print(f'Active encryption key: {svc.get_active_key_id()}')
"
Step 5: Remove old keys (after verification period)
After a grace period (e.g., 7 days) to ensure no issues:
# Remove old keys from environment
export ARAGORA_ENCRYPTION_KEY="new-64-hex-key"
Rolling Deployment Key Rotation
For zero-downtime rotation across multiple instances:
- Phase 1: Deploy with both keys (new primary, old for decryption)
  export ARAGORA_ENCRYPTION_KEY="new_key,old_key"
- Phase 2: Run the migration job to re-encrypt all data
  python -m aragora.storage.migrations.encrypt_existing_data --all --execute
- Phase 3: After all instances are updated and the migration is complete, remove the old key
Emergency Key Compromise Procedure
If a key compromise is detected:
- Immediately generate and deploy new key to all instances
- Re-encrypt all data using the migration utility
- Revoke any API tokens that may have been issued during the compromised period
- Review audit logs for suspicious access patterns
- Document the incident per your incident response policy
Key Storage Best Practices
| Environment | Recommended Storage |
|---|---|
| Development | .env file (gitignored) |
| Staging | Environment variables or secrets manager |
| Production | AWS Secrets Manager, HashiCorp Vault, or cloud KMS |
# Example: AWS Secrets Manager integration
export ARAGORA_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
--secret-id aragora/encryption-key \
--query SecretString --output text)
RBAC (Role-Based Access Control)
Workflow endpoints are protected with RBAC. The following permissions are enforced:
| Permission | Endpoints |
|---|---|
| `workflows.read` | GET /api/workflows, GET /api/workflow-templates, GET /api/workflow-approvals, GET /api/workflow-executions |
| `workflows.create` | POST /api/workflows |
| `workflows.update` | PATCH /api/workflows/{id} |
| `workflows.delete` | DELETE /api/workflows/{id} |
| `workflows.execute` | POST /api/workflows/{id}/execute |
| `workflows.approve` | POST /api/workflow-approvals/{id}/resolve |
Authentication Methods
RBAC context is extracted from:
- JWT Token: Primary authentication (set via `Authorization: Bearer <token>`)
- Request Headers: Fallback for service-to-service calls
  - `X-User-ID`: User identifier
  - `X-Org-ID`: Organization/tenant identifier
  - `X-User-Roles`: Comma-separated list of roles
Example Request with RBAC
# Using JWT token
curl -H "Authorization: Bearer $JWT_TOKEN" \
http://localhost:8080/api/workflows
# Using headers (service-to-service)
curl -H "X-User-ID: user-123" \
-H "X-Org-ID: org-456" \
-H "X-User-Roles: admin,member" \
http://localhost:8080/api/workflows
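Server-side, the extraction order above can be sketched as follows. This is illustrative rather than the actual middleware; decode_jwt stands in for the real token verification:

from dataclasses import dataclass, field

@dataclass
class RBACContext:
    user_id: str
    org_id: str | None = None
    roles: list[str] = field(default_factory=list)

def extract_rbac_context(headers, decode_jwt):
    """Prefer the JWT bearer token; fall back to service-to-service headers."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        claims = decode_jwt(auth.removeprefix("Bearer "))
        return RBACContext(claims["sub"], claims.get("org_id"), claims.get("roles", []))
    if "X-User-ID" in headers:
        roles = [r.strip() for r in headers.get("X-User-Roles", "").split(",") if r.strip()]
        return RBACContext(headers["X-User-ID"], headers.get("X-Org-ID"), roles)
    return None  # unauthenticated request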
Observability Metrics
New Prometheus metrics are available for security monitoring:
Encryption Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `aragora_encryption_operations_total` | Counter | operation, store | Total encrypt/decrypt operations |
| `aragora_encryption_operation_latency_seconds` | Histogram | operation | Operation latency |
| `aragora_encryption_errors_total` | Counter | operation, error_type | Encryption errors |
RBAC Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `aragora_rbac_permission_checks_total` | Counter | permission, result | Permission check counts |
| `aragora_rbac_permission_denied_total` | Counter | permission, handler | Permission denials |
| `aragora_rbac_check_latency_seconds` | Histogram | - | Permission check latency |
Migration Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `aragora_migration_records_total` | Counter | store, status | Records processed (migrated/skipped/failed) |
| `aragora_migration_errors_total` | Counter | store, error_type | Migration errors |
Grafana Queries
# Encryption operation rate
rate(aragora_encryption_operations_total[5m])
# Permission denial rate
rate(aragora_rbac_permission_denied_total[5m])
# Migration progress
sum by (status) (aragora_migration_records_total)
Storage Backends
Critical data stores support multiple backends for durability:
| Store | Backends | Default |
|---|---|---|
| ApprovalRequestStore | SQLite, PostgreSQL, Redis | SQLite |
| SessionStore | In-memory, Redis | Redis if available |
| OAuthStateStore | In-memory, Redis | Redis if available |
Configure via environment variables:
# Use PostgreSQL for approval requests
export ARAGORA_APPROVAL_STORE_BACKEND=postgres
export DATABASE_URL=postgresql://user:pass@host:5432/db
# Use Redis for sessions
export ARAGORA_REDIS_URL=redis://localhost:6379
Troubleshooting
Encryption Key Issues
# Error: "Data will not be recoverable after restart"
# Fix: Set ARAGORA_ENCRYPTION_KEY environment variable
# Error: "Key not found" during decryption
# Cause: Data was encrypted with a different key
# Fix: Use the same key that was used for encryption, or re-migrate data
RBAC Permission Denied
# Check what permissions a user has
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8080/api/auth/me | jq '.permissions'
# Common causes:
# - Missing required role for the permission
# - Token expired (check exp claim)
# - Org ID mismatch between token and resource
Migration Failures
# Check migration status
python -m aragora.storage.migrations.encrypt_existing_data --all -v
# Common issues:
# - ARAGORA_ENCRYPTION_KEY not set
# - Database connection failures
# - Records with corrupt data (logged as errors, others continue)
Security Audit Checklist
Use this checklist to verify security posture before production deployment or during periodic security reviews.
Pre-Deployment Checklist
Encryption
- `ARAGORA_ENCRYPTION_KEY` is set (64 hex characters)
- Key is stored in a secrets manager (not an environment file) in production
- `python -c "from aragora.security.encryption import CRYPTO_AVAILABLE; print(CRYPTO_AVAILABLE)"` returns `True`
- Migration utility has been run to encrypt existing plaintext data
Authentication
- `ARAGORA_JWT_SECRET` is set (32+ characters, cryptographically random)
- JWT expiration is reasonable (default: 24 hours)
- Refresh token expiration is reasonable (default: 30 days)
- OAuth redirect URLs are validated against allowlist
- MFA is enabled for admin users (SOC 2 CC5-01)
Authorization (RBAC)
- All sensitive endpoints require authentication
- Permission checks are in place for CRUD operations
- Admin endpoints require admin role + MFA
- Service-to-service calls use proper authentication headers
Network Security
- CORS origins are explicitly configured (no wildcards in production)
- TLS is enabled for all external connections
- API rate limiting is configured
- WebSocket connections are authenticated
Data Protection
- Sensitive fields are encrypted at rest (credentials, tokens, secrets)
- Audit logging is enabled
- PII handling complies with privacy regulations
- Backup encryption is enabled
Periodic Security Review
Weekly
- Review `aragora_rbac_permission_denied_total` for unusual patterns
- Check `aragora_encryption_errors_total` for failures
- Review admin impersonation audit logs
Monthly
- Review user access and remove inactive accounts
- Verify backup integrity and restore capability
- Check for unused API tokens and revoke them
- Review OAuth provider connections
Quarterly
- Rotate encryption keys (see Key Rotation section)
- Update dependencies for security patches
- Review and update security documentation
- Conduct access review with team leads
Security Metrics Dashboard
Monitor these metrics continuously:
| Metric | Alert Threshold | Description |
|---|---|---|
| `aragora_rbac_permission_denied_total` | >10/min sustained | Potential attack or misconfiguration |
| `aragora_encryption_errors_total` | Any | Encryption failures need immediate attention |
| `aragora_auth_failures_total` | >5/min per IP | Potential brute force attack |
| `aragora_admin_impersonate_total` | Any | All impersonations should be reviewed |
Grafana Alert Rules
# Alert on high permission denial rate
- alert: HighPermissionDenialRate
expr: rate(aragora_rbac_permission_denied_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: High rate of permission denials
# Alert on encryption errors
- alert: EncryptionErrors
expr: increase(aragora_encryption_errors_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: Encryption errors detected
# Alert on admin impersonation
- alert: AdminImpersonation
expr: increase(aragora_admin_impersonate_total[5m]) > 0
labels:
severity: info
annotations:
summary: Admin impersonation event - review audit log
Compliance Mapping
| Control | Implementation | Verification |
|---|---|---|
| SOC 2 CC5-01 | Admin MFA enforcement | Check enforce_admin_mfa_policy() in admin handlers |
| SOC 2 CC6-01 | Encryption at rest | Verify CRYPTO_AVAILABLE and key configuration |
| SOC 2 CC6-07 | Access control | RBAC permission checks in all handlers |
| GDPR Art. 32 | Data protection | Field-level encryption, audit logging |
| HIPAA 164.312 | Access controls | RBAC, MFA, audit trails |
Incident Response Contacts
Maintain this section with current contact information:
Security Team Lead: [Name] - [Contact]
On-Call Engineer: [Rotation Schedule/PagerDuty]
Legal/Compliance: [Name] - [Contact]
Data Protection Officer: [Name] - [Contact]
Decision Router Cache Invalidation
The DecisionRouter includes a response cache that improves performance by caching routing decisions. The cache supports multiple invalidation strategies for different scenarios.
Cache Configuration
from aragora.server.middleware.decision_routing import (
get_decision_router,
invalidate_cache_for_workspace,
invalidate_cache_for_policy_change,
invalidate_cache_for_agent_upgrade,
get_cache_stats,
)
# Get the router with caching enabled
router = await get_decision_router()
Invalidation Strategies
Workspace-Scoped Invalidation
Invalidate all cached decisions for a specific workspace:
# When workspace configuration changes
count = await invalidate_cache_for_workspace("workspace-123")
print(f"Invalidated \{count\} cached entries for workspace")
Use cases:
- Workspace policy updates
- Team membership changes
- Per-workspace routing rule changes
Policy Version Invalidation
When global routing policies change, update the policy version:
# Deploy new routing policy
await invalidate_cache_for_policy_change("v2.0.0")
# All cached entries with older policy versions become stale
# and will be re-computed on next access
This uses lazy invalidation - entries are marked stale but not immediately removed.
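Lazy invalidation means staleness is checked at read time instead of walking the cache when the policy changes. A conceptual sketch of the lookup path (not the actual ResponseCache internals):

def get_if_fresh(entry, current_policy_version):
    """Serve a cached result only if it was computed under the current policy."""
    if entry is None:
        return None  # miss: caller recomputes and stores
    if entry["policy_version"] != current_policy_version:
        return None  # stale: entry lingers until evicted but is never served
    return entry["result"]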
Agent Version Invalidation
When agents are upgraded, invalidate responses that used old versions:
# After upgrading Claude from v3.5 to v4
count = await invalidate_cache_for_agent_upgrade("claude", "3.5")
print(f"Invalidated \{count\} cached entries using claude v3.5")
Tag-Based Invalidation
Entries can be tagged for granular invalidation:
from aragora.server.middleware.decision_routing import ResponseCache
cache = ResponseCache(max_size=1000, ttl_seconds=3600)
# Store with tags
cache.set(
key="debate-123",
result={"decision": "approve"},
workspace_id="ws-1",
tags=["debate", "high-priority"],
)
# Invalidate all entries with a specific tag
count = cache.invalidate_by_tag("high-priority")
Cache Monitoring
# Get cache statistics
stats = await get_cache_stats()
print(f"Total entries: {stats['total_entries']}")
print(f"Hit rate: {stats['hit_rate']:.2%}")
print(f"Policy version: {stats['current_policy_version']}")
Prometheus Metrics
| Metric | Type | Description |
|---|---|---|
| `aragora_decision_cache_hits_total` | Counter | Cache hit count |
| `aragora_decision_cache_misses_total` | Counter | Cache miss count |
| `aragora_decision_cache_invalidations_total` | Counter | Invalidation events by type |
| `aragora_decision_cache_size` | Gauge | Current cache size |
Deployment Recommendations
- Rolling deployments: Set new policy version before deploying new code
- Agent upgrades: Invalidate old agent versions after confirming new agent is healthy
- Emergency changes: Use workspace invalidation for targeted cache clearing
- Monitoring: Alert on sudden drops in cache hit rate
Webhook Delivery Manager
The Webhook Delivery Manager provides reliable webhook delivery with retry logic, dead-letter queuing, and circuit breaker protection.
Quick Start
from aragora.server.webhook_delivery import (
deliver_webhook,
get_delivery_manager,
WebhookDeliveryManager,
)
# Simple delivery
delivery = await deliver_webhook(
webhook_id="wh-123",
event_type="debate_end",
payload={"debate_id": "d-456", "result": "consensus"},
url="https://example.com/webhook",
secret="your-hmac-secret", # Optional: enables signature verification
)
print(f"Status: {delivery.status}") # DELIVERED, RETRYING, or DEAD_LETTERED
Configuration
manager = WebhookDeliveryManager(
max_retries=5, # Maximum retry attempts (default: 5)
base_delay_seconds=1.0, # Initial retry delay (default: 1.0)
max_delay_seconds=300.0, # Maximum retry delay cap (default: 300)
timeout_seconds=30.0, # Request timeout (default: 30)
circuit_breaker_threshold=5, # Failures before circuit opens (default: 5)
)
Delivery States
| State | Description |
|---|---|
| `PENDING` | Queued for delivery |
| `IN_PROGRESS` | Currently being delivered |
| `DELIVERED` | Successfully delivered (2xx response) |
| `RETRYING` | Failed, scheduled for retry |
| `DEAD_LETTERED` | Max retries exceeded, moved to dead-letter queue |
Retry Behavior
Retries use exponential backoff with jitter:
delay = min(base_delay * 2 ** attempt, max_delay) + random_jitter
Default progression: 1s → 2s → 4s → 8s → 16s → ... → 300s max
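The schedule is easy to reproduce when tuning max_retries against recipient SLAs. A sketch matching the formula above; the 10% full jitter here is one reasonable choice, not necessarily what the manager uses:

import random

def backoff_schedule(max_retries=5, base_delay=1.0, max_delay=300.0):
    """Yield the delay before each retry: exponential, capped, jittered."""
    for attempt in range(max_retries):
        delay = min(base_delay * 2 ** attempt, max_delay)
        yield delay + random.uniform(0, delay * 0.1)  # jitter avoids thundering herds

print([round(d, 1) for d in backoff_schedule()])  # roughly [1, 2, 4, 8, 16]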
Circuit Breaker
The circuit breaker prevents overwhelming failing endpoints:
# Circuit opens after threshold consecutive failures
# While open, deliveries to that URL are immediately queued for retry
# Check circuit state
manager = await get_delivery_manager()
is_open = manager._is_circuit_open("https://example.com/webhook")
# Circuit automatically closes after the cooldown period (default: 60s)
Dead-Letter Queue
Failed deliveries are moved to the dead-letter queue after max retries:
# List dead-lettered deliveries
dead_letters = await manager.get_dead_letter_queue(limit=100)
for delivery in dead_letters:
print(f"ID: {delivery.delivery_id}")
print(f"Event: {delivery.event_type}")
print(f"Last error: {delivery.last_error}")
print(f"Attempts: {delivery.attempts}")
# Retry a dead-lettered delivery
success = await manager.retry_dead_letter(delivery_id="dlv-123")
HMAC Signature Verification
When a secret is provided, webhooks include an HMAC-SHA256 signature:
# Delivery includes X-Webhook-Signature header
delivery = await deliver_webhook(
webhook_id="wh-123",
event_type="debate_end",
payload={"debate_id": "d-456"},
url="https://example.com/webhook",
secret="your-secret-key",
)
# Recipient verifies signature:
# X-Webhook-Signature: sha256=<hex-encoded-hmac>
Verification code for recipients:
import hmac
import hashlib
import json
def verify_signature(payload: dict, signature: str, secret: str) -> bool:
expected = "sha256=" + hmac.new(
secret.encode(),
json.dumps(payload).encode(),
hashlib.sha256
).hexdigest()
return hmac.compare_digest(expected, signature)
Monitoring
# Get delivery metrics
metrics = await manager.get_metrics()
print(f"Total deliveries: {metrics['total_deliveries']}")
print(f"Success rate: {metrics['success_rate']}%")
print(f"Average latency: {metrics['avg_latency_ms']}ms")
print(f"Dead-lettered: {metrics['dead_lettered']}")
print(f"Retries pending: {metrics['retries_pending']}")
Prometheus Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `aragora_webhook_deliveries_total` | Counter | status, event_type | Total delivery attempts |
| `aragora_webhook_delivery_latency_seconds` | Histogram | event_type | Delivery latency |
| `aragora_webhook_retries_total` | Counter | event_type | Retry attempts |
| `aragora_webhook_dead_lettered_total` | Counter | event_type | Dead-lettered count |
| `aragora_webhook_circuit_breaker_state` | Gauge | endpoint | Circuit breaker state (0=closed, 1=open) |
Operational Recommendations
- Monitor dead-letter queue: Set up alerts when queue grows
- Review circuit breaker opens: Investigate endpoints that frequently trip
- Set appropriate timeouts: Match recipient's expected processing time
- Use secrets: Always use HMAC signatures in production
- Idempotency: Include `delivery_id` in the payload for recipient-side deduplication (see the sketch below)
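On the recipient side, deduplication can be a seen-set with a TTL. A minimal sketch, assuming the payload carries the delivery_id as recommended above:

import time

class DeliveryDeduper:
    """Remember recently seen delivery IDs so retried webhooks are processed once."""

    def __init__(self, ttl_seconds=3600.0):
        self.ttl = ttl_seconds
        self._seen = {}

    def is_duplicate(self, delivery_id):
        now = time.monotonic()
        # drop expired entries so memory stays bounded
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        if delivery_id in self._seen:
            return True
        self._seen[delivery_id] = now
        return False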
Integration Store Metrics
The Integration Store Metrics system provides observability into integration storage operations, including latency tracking, error rates, and health monitoring.
Quick Start
from aragora.storage.integration_store_metrics import (
InstrumentedIntegrationStore,
get_integration_metrics,
get_integration_health,
)
# Wrap an existing store with instrumentation
instrumented = InstrumentedIntegrationStore(base_store, backend_type="postgresql")
# Use normally - metrics are collected automatically
await instrumented.save({"type": "slack", "user_id": "u-123", "token": "..."})
config = await instrumented.get("slack", "u-123")
Tracked Operations
| Operation | Metrics Collected |
|---|---|
| `get` | Latency, success/failure count, cache hit/miss |
| `save` | Latency, success/failure count |
| `delete` | Latency, success/failure count |
| `list` | Latency, active integration count |
| `refresh_token` | Latency, success/failure count |
Health Monitoring
The store tracks consecutive failures and marks itself unhealthy after 3 failures:
# Check health status
health = await instrumented.health_check()
print(f"Healthy: {health['healthy']}")
print(f"Backend: {health['backend_type']}")
print(f"Consecutive failures: {health['consecutive_failures']}")
print(f"Active integrations: {health['active_integrations']}")
Health automatically recovers after a successful operation.
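That failure-tracking behavior reduces to a counter that trips at three and resets on success. A conceptual sketch of the logic (not the actual store code):

class HealthTracker:
    """Mark a store unhealthy after 3 consecutive failures; recover on success."""

    UNHEALTHY_THRESHOLD = 3

    def __init__(self):
        self.consecutive_failures = 0

    def record_success(self):
        self.consecutive_failures = 0  # any success restores health

    def record_failure(self):
        self.consecutive_failures += 1

    @property
    def healthy(self):
        return self.consecutive_failures < self.UNHEALTHY_THRESHOLD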
Getting Metrics
# Get comprehensive metrics
metrics = await get_integration_metrics()
# Structure:
# {
# "backend_type": "postgresql",
# "is_healthy": true,
# "cache_hit_rate": 85.5,
# "operations": {
# "get": {"total_calls": 1000, "success_rate": 99.5, "avg_latency_seconds": 0.012},
# "save": {"total_calls": 200, "success_rate": 100.0, "avg_latency_seconds": 0.025},
# ...
# }
# }
# Get health summary
health = await get_integration_health()
# {
# "healthy": true,
# "backend_type": "postgresql",
# "consecutive_failures": 0,
# "operations_summary": {
# "get_success_rate": 99.5,
# "save_success_rate": 100.0,
# "cache_hit_rate": 85.5
# }
# }
Prometheus Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `aragora_integration_store_operations_total` | Counter | operation, status | Operation count by type |
| `aragora_integration_store_latency_seconds` | Histogram | operation | Operation latency |
| `aragora_integration_store_health` | Gauge | backend | Health status (1=healthy, 0=unhealthy) |
| `aragora_integration_store_active_integrations` | Gauge | - | Count of active integrations |
| `aragora_integration_store_cache_hit_rate` | Gauge | - | Cache hit rate percentage |
| `aragora_integration_store_consecutive_failures` | Gauge | - | Consecutive failure count |
Alerting Rules
groups:
- name: integration_store
rules:
- alert: IntegrationStoreUnhealthy
expr: aragora_integration_store_health == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Integration store unhealthy (3+ consecutive failures)"
- alert: IntegrationStoreHighLatency
expr: histogram_quantile(0.95, rate(aragora_integration_store_latency_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Integration store p95 latency > 1s"
- alert: IntegrationStoreLowSuccessRate
expr: |
sum(rate(aragora_integration_store_operations_total{status="success"}[5m])) /
sum(rate(aragora_integration_store_operations_total[5m])) < 0.95
for: 5m
labels:
severity: warning
annotations:
summary: "Integration store success rate below 95%"
Troubleshooting
High consecutive failure count:
- Check database connectivity
- Review recent error logs for the backend
- Verify credentials and connection strings
Low cache hit rate:
- Review access patterns - are integrations being fetched repeatedly?
- Consider increasing cache TTL
- Check if cache invalidation is too aggressive
High latency:
- Check database performance (slow queries, locks)
- Review network latency to database
- Consider connection pool sizing