Aragora Operations Runbook

This runbook provides procedures for common operational tasks and incident response.

Health Checks
Common Issues
Incident Response
Maintenance Procedures
Scaling Operations
Backup & Recovery

Health Checks

Service Health

# Check backend health
curl -s http://localhost:8080/api/health | jq

# Expected response:
{
  "status": "healthy",
  "version": "1.0.0",
  "components": {
    "database": "healthy",
    "redis": "healthy",
    "agents": "healthy"
  }
}
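
For scripted checks, for example after a deployment, the probe above can be wrapped in a retry loop. The following is a minimal sketch rather than part of the Aragora tooling: `wait_until` and `backend_healthy` are hypothetical helper names, `ARAGORA_URL` is an assumed variable that defaults to the local endpoint used above, and the match assumes the response shape shown in the expected response.

```shell
# Run a probe command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  exit 1
)

# Probe (hypothetical): succeeds when the response contains "status": "healthy".
backend_healthy() {
  curl -fsS --max-time 5 "${ARAGORA_URL:-http://localhost:8080}/api/health" |
    grep -q '"status": *"healthy"'
}
```

A call such as `wait_until 10 3 backend_healthy` would then poll up to ten times, three seconds apart, and return non-zero if the backend never reports healthy.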

Kubernetes Health

# Check pod status
kubectl -n aragora get pods

# Check resource usage
kubectl -n aragora top pods

# Check recent events
kubectl -n aragora get events --sort-by=.metadata.creationTimestamp | tail -20

Database Health

# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('aragora'));"

Common Issues

Issue: High API Latency

Symptoms:

Response times >2s
Grafana alerts firing
User complaints

Diagnosis:

# Check current latency
curl -w "Time: %{time_total}s\n" -o /dev/null -s http://localhost:8080/api/health

# Check active debates
curl -s http://localhost:8080/api/admin/stats | jq '.active_debates'

# Check agent response times
kubectl -n aragora logs deployment/aragora-backend --tail=100 | grep "agent_response_time"
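
A single request can be misleading, so it may help to sample latency several times and flag anything above the 2s symptom threshold. This is a minimal sketch: `exceeds_threshold` and `sample_latency` are hypothetical helpers, and the default sample count and URL are assumptions.

```shell
# Succeeds (exit 0) when the first number is strictly greater than the second.
# awk is used because POSIX shell arithmetic is integer-only.
exceeds_threshold() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !((t + 0) > (limit + 0)) }'
}

# Time <count> requests against the health endpoint and flag the slow ones.
sample_latency() (
  count=${1:-5}
  url=${2:-http://localhost:8080/api/health}
  i=0
  while [ "$i" -lt "$count" ]; do
    t=$(curl -s -o /dev/null -w '%{time_total}' "$url")
    if exceeds_threshold "$t" 2; then
      echo "SLOW ${t}s"
    else
      echo "ok   ${t}s"
    fi
    i=$((i + 1))
  done
)
```

Running `sample_latency 10` prints one line per request, which makes intermittent slowness easier to spot than a single measurement.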

Resolution:

Scale up backend replicas: kubectl -n aragora scale deployment/aragora-backend --replicas=4
Check for slow database queries
Verify external API quotas (Anthropic, OpenAI)
Clear Redis cache if stale: redis-cli FLUSHDB (this deletes every key in the selected database, so confirm it holds only cache data first)

Issue: WebSocket Disconnections

Symptoms:

Clients losing real-time updates
"Connection lost" errors in UI
High reconnection rate in metrics

Diagnosis:

# Check WebSocket connections
kubectl -n aragora logs deployment/aragora-backend | grep "websocket"

# Check nginx ingress logs
kubectl -n ingress-nginx logs -l app.kubernetes.io/name=ingress-nginx | grep "websocket"

Resolution:

Verify ingress WebSocket configuration
Check proxy timeout settings (should be >60s)
Verify keep-alive settings
Check for memory pressure causing pod restarts
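
If the ingress is ingress-nginx, as the diagnosis commands above suggest, the proxy timeouts can be raised with annotations. The sketch below is illustrative only: the ingress name passed in, the 3600-second default, and the `KUBECTL` override variable (so the function can be pointed at a wrapper or stub) are all assumptions.

```shell
# Set the ingress-nginx proxy read/send timeouts (in seconds) on an ingress.
# Usage: set_ws_timeouts <ingress-name> [seconds]
set_ws_timeouts() {
  ingress=$1
  seconds=${2:-3600}   # assumed default; anything above 60s satisfies the check above
  "${KUBECTL:-kubectl}" -n aragora annotate ingress "$ingress" --overwrite \
    "nginx.ingress.kubernetes.io/proxy-read-timeout=$seconds" \
    "nginx.ingress.kubernetes.io/proxy-send-timeout=$seconds"
}
```

For example, `set_ws_timeouts aragora-ingress 3600`, where `aragora-ingress` is a hypothetical ingress name; `kubectl -n aragora get ingress` lists the real ones.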

Issue: Agent API Errors

Symptoms:

Debates failing to complete
429 errors in logs
Agent timeouts

Diagnosis:

# Check error rates
kubectl -n aragora logs deployment/aragora-backend | grep "rate_limit\|429\|timeout"

# Check API key status
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-opus-20240229","max_tokens":1,"messages":[{"role":"user","content":"test"}]}'
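
To see which failure mode dominates, the grep above can be turned into a rough count per category. This is a sketch under the assumption that the log lines contain the same markers the grep looks for; `tally_agent_errors` is a hypothetical helper, and a bare `429` match may also hit unrelated numbers in the logs.

```shell
# Read log lines on stdin and count rate-limit and timeout markers.
# A line that contains both markers is counted in both totals.
tally_agent_errors() {
  awk '
    /rate_limit|429/ { rate_limited++ }
    /timeout/        { timeouts++ }
    END { printf "rate_limited=%d timeouts=%d\n", rate_limited, timeouts }
  '
}

# Usage:
#   kubectl -n aragora logs deployment/aragora-backend --since=15m | tally_agent_errors
```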

Resolution:

Check API quotas with provider
Enable OpenRouter fallback
Reduce concurrent debate limit
Implement request queuing
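
While quotas are being checked, retrying rate-limited calls with exponential backoff is a common complement to the steps above. Note that this is a different technique from request queuing, shown here only as a sketch; `backoff_delay` and `retry_with_backoff` are hypothetical helpers, and the 1s base and 30s cap are assumed values.

```shell
# Print the delay in seconds for a given retry attempt (0-based):
# base * 2^attempt, capped at <cap>.
# Usage: backoff_delay <attempt> [base] [cap]
backoff_delay() (
  attempt=$1
  base=${2:-1}
  cap=${3:-30}
  delay=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
)

# Run a command up to <max> times, sleeping with backoff between failures.
# Usage: retry_with_backoff <max> <command> [args...]
retry_with_backoff() (
  max=$1
  shift
  n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      exit 1
    fi
    sleep "$(backoff_delay "$((n - 1))")"
  done
)
```

With the defaults, successive retries wait 1, 2, 4, 8, 16 and then 30 seconds.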

Issue: Database Connection Exhaustion

Symptoms:

"too many connections" errors
Slow queries
Backend pods failing health checks

Diagnosis:

# Check connection count
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check waiting queries
psql -c "SELECT query, state, wait_event FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
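
The raw connection count is easier to judge against the server's `max_connections` setting. A minimal sketch, assuming `psql` can connect with the same defaults as the commands above; `conn_usage_pct` and `check_conn_usage` are hypothetical helpers and the 80% warning threshold is an assumed value.

```shell
# Integer percentage of used connections.
# Usage: conn_usage_pct <used> <max>
conn_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Query PostgreSQL and return non-zero at or above the threshold.
check_conn_usage() (
  threshold=${1:-80}
  # -A (unaligned) and -t (tuples only) make psql print just the value.
  used=$(psql -At -c "SELECT count(*) FROM pg_stat_activity;")
  max=$(psql -At -c "SHOW max_connections;")
  pct=$(conn_usage_pct "$used" "$max")
  echo "connections: $used/$max (${pct}%)"
  [ "$pct" -lt "$threshold" ]
)
```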

Resolution:

Increase connection pool size
Kill connections idle for more than 10 minutes: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes' AND pid <> pg_backend_pid();
Scale database instance
Add PgBouncer for connection pooling

Incident Response

Severity Levels

Level	Description	Response Time	Examples
P1	Service down	15 min	Complete outage, data loss
P2	Major degradation	1 hour	50%+ errors, major feature broken
P3	Minor degradation	4 hours	Single endpoint slow, minor bugs
P4	Low impact	Next business day	Cosmetic issues, minor UX

P1 Incident Procedure

Acknowledge (within 5 min)
- Join incident channel
- Assign incident commander
Assess (within 15 min)
- Check all dashboards
- Identify affected systems
- Determine blast radius
Mitigate (ASAP)
- Rollback recent deployments
- Scale up resources
- Enable maintenance mode if needed
Communicate
- Update status page
- Notify affected customers
- Regular updates every 30 min
Resolve
- Implement fix
- Verify fix in production
- Close incident
Post-mortem (within 48 hours)
- Document timeline
- Root cause analysis
- Action items

Rollback Procedure

# Get previous deployment
kubectl -n aragora rollout history deployment/aragora-backend

# Rollback to previous revision
kubectl -n aragora rollout undo deployment/aragora-backend

# Verify rollback
kubectl -n aragora rollout status deployment/aragora-backend
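
When the previous revision is also bad, the rollback can target a specific revision from the history output. The sketch below assumes the revision number has been read from `rollout history`; `rollback_to` is a hypothetical helper, the 120s timeout is an assumed value, and `KUBECTL` is an assumed override variable for pointing at a wrapper or stub.

```shell
# Roll the backend back to a given revision, then wait for the rollout to settle.
# Usage: rollback_to <revision>
rollback_to() (
  revision=$1
  k=${KUBECTL:-kubectl}
  "$k" -n aragora rollout undo deployment/aragora-backend --to-revision="$revision" &&
    "$k" -n aragora rollout status deployment/aragora-backend --timeout=120s
)
```

`rollback_to 3` returns non-zero if either the undo fails or the rollout does not complete within the timeout.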

Maintenance Procedures

Planned Maintenance Window

Schedule (48 hours notice)
- Update status page
- Notify customers via email
- Set maintenance window in monitoring
Pre-maintenance
- Complete running debates gracefully
- Disable new debate creation
- Backup databases
During maintenance
- Apply updates/changes
- Run migrations
- Test functionality
Post-maintenance
- Enable services
- Verify health checks
- Monitor for issues
- Update status page

Database Migration

# Backup first
pg_dump aragora > backup_$(date +%Y%m%d).sql

# Run migrations
python -m aragora.migrations.runner migrate

# Verify
python -m aragora.migrations.runner status

Certificate Renewal

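cert-manager renews certificates automatically. To force a renewal manually:

# Check certificate status
kubectl -n aragora get certificate

# Force renewal: deleting the TLS secret prompts cert-manager to reissue the certificate
kubectl -n aragora delete secret aragora-tls
# Optional: serve a temporary certificate while the new one is being issued
kubectl -n aragora annotate certificate aragora-cert cert-manager.io/issue-temporary-certificate="true"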

Scaling Operations

Horizontal Scaling

# Scale backend
kubectl -n aragora scale deployment/aragora-backend --replicas=5

# Scale frontend
kubectl -n aragora scale deployment/aragora-frontend --replicas=3

# Verify
kubectl -n aragora get pods -w

Vertical Scaling

# Update resource limits
kubectl -n aragora set resources deployment/aragora-backend \
  --limits=memory=4Gi,cpu=2000m \
  --requests=memory=1Gi,cpu=500m

Database Scaling

For PostgreSQL on cloud:

Create read replica
Update connection string for read operations
Monitor replication lag

Backup & Recovery

Daily Backups

Daily backups run automatically via cron. To take one manually:

# Manual backup (no -it: a TTY can corrupt the redirected dump)
kubectl -n aragora exec postgres-0 -- pg_dump -U aragora aragora > backup.sql

# Verify backup
head -100 backup.sql
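
Looking at the head of the file does not show whether the dump was cut short. Plain-format `pg_dump` output normally ends with a "PostgreSQL database dump complete" comment, so checking the tail is a quick sanity test. This is a sketch, not proof that the backup restores cleanly; `backup_looks_complete` is a hypothetical helper.

```shell
# Succeeds when the file is non-empty and its last lines contain
# the completion comment that plain-format pg_dump writes at the end.
backup_looks_complete() {
  [ -s "$1" ] && tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete'
}

# Usage:
#   backup_looks_complete backup.sql || echo "backup.sql looks truncated" >&2
```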

Point-in-Time Recovery

# Restore from backup
kubectl -n aragora exec -i postgres-0 -- psql -U aragora aragora < backup.sql

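# Or restore to a specific time (requires WAL archiving and a base backup).
# pg_restore has no target-time option; on PostgreSQL 12+ set this in postgresql.conf
# alongside restore_command, create recovery.signal in the data directory, and restart:
#   recovery_target_time = '2026-01-20 12:00:00'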

Knowledge Base Backup

# Export knowledge mound
curl -X POST http://localhost:8080/api/knowledge/export \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -o knowledge_backup.json

# Restore
curl -X POST http://localhost:8080/api/knowledge/import \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  --data-binary @knowledge_backup.json

Useful Commands

Quick Diagnostics

# All pods status
kubectl -n aragora get pods -o wide

# Recent logs
kubectl -n aragora logs deployment/aragora-backend --tail=100 -f

# Resource usage
kubectl -n aragora top pods

# Events
kubectl -n aragora get events --sort-by=.lastTimestamp | tail -20

Debug Mode

# Enable debug logging
kubectl -n aragora set env deployment/aragora-backend ARAGORA_LOG_LEVEL=DEBUG

# Disable when done
kubectl -n aragora set env deployment/aragora-backend ARAGORA_LOG_LEVEL=INFO

Emergency Contacts

Role	Contact	Escalation
On-call Engineer	PagerDuty	-
Engineering Lead	[email]	After 30 min
Platform Team	Slack #platform	P1/P2 only

Appendix: Monitoring Queries

Prometheus Queries

# Error rate
sum(rate(aragora_http_requests_total{status=~"5.."}[5m])) / sum(rate(aragora_http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, sum(rate(aragora_http_request_duration_seconds_bucket[5m])) by (le))

# Active debates
aragora_active_debates

# Memory usage
process_resident_memory_bytes{job="aragora-backend"} / 1024 / 1024 / 1024
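
The error-rate query is a plain ratio: 5xx requests divided by all requests, times 100. As a worked example of the same arithmetic outside Prometheus, for instance on counts taken from logs, the sketch below uses a hypothetical `error_rate_pct` helper; the result can be compared against the 50% figure in the P2 severity row.

```shell
# Print errors / total * 100 with two decimals; prints 0.00 when total is 0.
# Usage: error_rate_pct <error_count> <total_count>
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t + 0 == 0) { print "0.00"; exit }
    printf "%.2f\n", e / t * 100
  }'
}
```

For example, 5 errors out of 200 requests gives `error_rate_pct 5 200`, which prints `2.50`.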
