Aragora Operations Runbook
This runbook provides procedures for common operational tasks and incident response.
Table of Contents
- Health Checks
- Common Issues
- Incident Response
- Maintenance Procedures
- Scaling Operations
- Backup & Recovery
- Useful Commands
- Emergency Contacts
- Appendix: Monitoring Queries
Health Checks
Service Health
# Check backend health
curl -s http://localhost:8080/api/health | jq
# Expected response:
{
"status": "healthy",
"version": "1.0.0",
"components": {
"database": "healthy",
"redis": "healthy",
"agents": "healthy"
}
}
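For scripted checks, the expected response above can be reduced to a list of anything that is not reporting healthy. The following is a minimal sketch; the function name `unhealthy_components` is illustrative and not part of Aragora, and it assumes the payload keeps the shape shown above.

```python
from typing import Any, Dict, List


def unhealthy_components(payload: Dict[str, Any]) -> List[str]:
    """Return the names of anything in the health payload not reporting "healthy"."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append("status")
    for name, state in sorted(payload.get("components", {}).items()):
        if state != "healthy":
            problems.append(name)
    return problems


# Sample payload in the documented shape, with one degraded component.
sample = {
    "status": "healthy",
    "version": "1.0.0",
    "components": {"database": "healthy", "redis": "degraded", "agents": "healthy"},
}
print(unhealthy_components(sample))  # ['redis']
```

An empty list means every documented field reported healthy.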
Kubernetes Health
# Check pod status
kubectl -n aragora get pods
# Check resource usage
kubectl -n aragora top pods
# Check recent events
kubectl -n aragora get events --sort-by=.metadata.creationTimestamp | tail -20
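When the pod list is long, the same information can be filtered from `kubectl -n aragora get pods -o json`. The sketch below assumes the standard pod list fields (`status.phase`, `status.containerStatuses[].ready`, `restartCount`); the function name and the restart threshold of 3 are illustrative choices, not values from this runbook.

```python
from typing import Any, Dict, List


def pods_needing_attention(pod_list: Dict[str, Any], max_restarts: int = 3) -> List[str]:
    """Flag pods that are not Running, have unready containers, or restart often.

    Note: completed Job pods (phase "Succeeded") are flagged too in this simple version.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        not_running = status.get("phase") != "Running"
        unready = any(not c.get("ready", False) for c in containers)
        restarts = sum(c.get("restartCount", 0) for c in containers)
        if not_running or unready or restarts > max_restarts:
            flagged.append(f"{name} (phase={status.get('phase')}, restarts={restarts})")
    return flagged
```

To use it, pipe the kubectl JSON output into a script that passes `json.load(sys.stdin)` to the function.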
Database Health
# Check PostgreSQL connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('aragora'));"
Common Issues
Issue: High API Latency
Symptoms:
- Response times >2s
- Grafana alerts firing
- User complaints
Diagnosis:
# Check current latency
curl -w "Time: %{time_total}s\n" -o /dev/null -s http://localhost:8080/api/health
# Check active debates
curl -s http://localhost:8080/api/admin/stats | jq '.active_debates'
# Check agent response times
kubectl -n aragora logs deployment/aragora-backend --tail=100 | grep "agent_response_time"
Resolution:
- Scale up backend replicas:
kubectl -n aragora scale deployment/aragora-backend --replicas=4
- Check for slow database queries
- Verify external API quotas (Anthropic, OpenAI)
- Clear the Redis cache if it is stale (note that FLUSHDB deletes every key in the currently selected database):
redis-cli FLUSHDB
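A single curl timing is noisy, so it can help to collect several `time_total` values and compare their 95th percentile against the 2s symptom threshold before acting. This is a minimal sketch using the nearest-rank method; the function names are illustrative.

```python
from typing import List


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of durations in seconds."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def latency_breached(samples: List[float], threshold_s: float = 2.0) -> bool:
    """True when the p95 of the samples exceeds the threshold (2s in this runbook)."""
    return p95(samples) > threshold_s
```

One slow request out of twenty does not trip the check, while two do, which filters out isolated outliers.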
Issue: WebSocket Disconnections
Symptoms:
- Clients losing real-time updates
- "Connection lost" errors in UI
- High reconnection rate in metrics
Diagnosis:
# Check WebSocket connections
kubectl -n aragora logs deployment/aragora-backend | grep "websocket"
# Check nginx ingress logs
kubectl -n ingress-nginx logs -l app.kubernetes.io/name=ingress-nginx | grep "websocket"
Resolution:
- Verify ingress WebSocket configuration
- Check proxy timeout settings (should be >60s)
- Verify keep-alive settings
- Check for memory pressure causing pod restarts
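The timeout check above can be sketched as a small validation over the Ingress annotations. This assumes the ingress-nginx controller, whose `proxy-read-timeout` and `proxy-send-timeout` annotations take a value in seconds; the function name is illustrative, and the annotations would come from `kubectl get ingress -o json`.

```python
from typing import Dict, List

# ingress-nginx annotations that govern how long an idle proxied connection may live.
TIMEOUT_ANNOTATIONS = (
    "nginx.ingress.kubernetes.io/proxy-read-timeout",
    "nginx.ingress.kubernetes.io/proxy-send-timeout",
)


def short_ws_timeouts(annotations: Dict[str, str], minimum_s: int = 60) -> List[str]:
    """List timeout annotations that are missing or not above minimum_s seconds."""
    problems = []
    for key in TIMEOUT_ANNOTATIONS:
        raw = annotations.get(key)
        if raw is None:
            problems.append(f"{key}: not set (controller default applies)")
        elif not raw.isdigit() or int(raw) <= minimum_s:
            problems.append(f"{key}: {raw!r} is not above {minimum_s}s")
    return problems
```

An empty result means both timeouts are explicitly set above the 60s floor.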
Issue: Agent API Errors
Symptoms:
- Debates failing to complete
- 429 errors in logs
- Agent timeouts
Diagnosis:
# Check error rates
kubectl -n aragora logs deployment/aragora-backend | grep "rate_limit\|429\|timeout"
# Check API key status
curl -s https://api.anthropic.com/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{"model":"claude-3-opus-20240229","max_tokens":1,"messages":[{"role":"user","content":"test"}]}'
Resolution:
- Check API quotas with provider
- Enable OpenRouter fallback
- Reduce concurrent debate limit
- Implement request queuing
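One common way to absorb 429 responses while the longer-term fixes above are put in place is to retry with exponential backoff. The sketch below is not Aragora's implementation: `RateLimited` is a hypothetical exception standing in for whatever the agent client raises on HTTP 429, and the delays are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client on HTTP 429."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimited with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1x, 2x, 4x ... of base) plus up to 10% jitter.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without waiting.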
Issue: Database Connection Exhaustion
Symptoms:
- "too many connections" errors
- Slow queries
- Backend pods failing health checks
Diagnosis:
# Check connection count
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Check waiting queries
psql -c "SELECT query, state, wait_event FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
Resolution:
- Increase connection pool size
- Kill connections that have been idle for more than 10 minutes:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';
- Scale database instance
- Add PgBouncer for connection pooling
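Before increasing the pool size, it is worth checking the arithmetic: every backend replica opens its own pool, so the total is roughly replicas times pool size and must stay under the server's `max_connections`. A minimal sketch follows; the function name and the reserve of 10 connections for admin and monitoring sessions are assumptions, not Aragora settings.

```python
def max_pool_size(max_connections: int, replicas: int, reserved: int = 10) -> int:
    """Largest per-replica pool size that keeps total connections under the server limit."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    usable = max_connections - reserved
    return max(usable // replicas, 0)


# With PostgreSQL's default max_connections of 100 and 4 backend replicas:
print(max_pool_size(100, 4))  # 22
```

Scaling the backend from 4 to 5 replicas lowers the safe per-replica pool from 22 to 18, which is one reason PgBouncer becomes attractive as replica counts grow.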
Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Service down | 15 min | Complete outage, data loss |
| P2 | Major degradation | 1 hour | 50%+ errors, major feature broken |
| P3 | Minor degradation | 4 hours | Single endpoint slow, minor bugs |
| P4 | Low impact | Next business day | Cosmetic issues, minor UX |
P1 Incident Procedure
1. Acknowledge (within 5 min)
   - Join incident channel
   - Assign incident commander
2. Assess (within 15 min)
   - Check all dashboards
   - Identify affected systems
   - Determine blast radius
3. Mitigate (ASAP)
   - Rollback recent deployments
   - Scale up resources
   - Enable maintenance mode if needed
4. Communicate
   - Update status page
   - Notify affected customers
   - Regular updates every 30 min
5. Resolve
   - Implement fix
   - Verify fix in production
   - Close incident
6. Post-mortem (within 48 hours)
   - Document timeline
   - Root cause analysis
   - Action items
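The time limits in the procedure above can be turned into concrete clock times for the incident channel. This is a sketch only: the function names are illustrative, and it assumes the 48-hour post-mortem window starts at resolution, which the procedure does not state.

```python
from datetime import datetime, timedelta


def p1_milestones(opened_at: datetime, update_count: int = 4) -> dict:
    """Deadlines implied by the P1 procedure, measured from when the incident opened."""
    return {
        "acknowledge_by": opened_at + timedelta(minutes=5),
        "assess_by": opened_at + timedelta(minutes=15),
        # One customer/status-page update every 30 minutes.
        "updates_at": [opened_at + timedelta(minutes=30 * n) for n in range(1, update_count + 1)],
    }


def postmortem_due(resolved_at: datetime) -> datetime:
    # Assumption: the 48-hour window is counted from resolution.
    return resolved_at + timedelta(hours=48)
```

For an incident opened at 12:00, this yields acknowledgement by 12:05, assessment by 12:15, and updates at 12:30, 13:00, 13:30, and 14:00.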
Rollback Procedure
# Get previous deployment
kubectl -n aragora rollout history deployment/aragora-backend
# Rollback to previous revision
kubectl -n aragora rollout undo deployment/aragora-backend
# Verify rollback
kubectl -n aragora rollout status deployment/aragora-backend
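When a specific revision is needed rather than the immediately previous one, `kubectl rollout undo` accepts `--to-revision`. The sketch below picks the second-newest revision from the history output and builds that command; it assumes the usual two-column `REVISION  CHANGE-CAUSE` layout, and the function names are illustrative.

```python
def previous_revision(history_output: str) -> int:
    """Return the second-highest revision number listed by `kubectl rollout history`."""
    revisions = sorted(
        int(line.split()[0])
        for line in history_output.splitlines()
        if line.strip() and line.split()[0].isdigit()
    )
    if len(revisions) < 2:
        raise ValueError("no earlier revision to roll back to")
    return revisions[-2]


def undo_command(revision: int) -> str:
    """Build the explicit rollback command for a given revision."""
    return f"kubectl -n aragora rollout undo deployment/aragora-backend --to-revision={revision}"
```

Printing the command instead of executing it leaves the final decision with the operator.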
Maintenance Procedures
Planned Maintenance Window
1. Schedule (48 hours notice)
   - Update status page
   - Notify customers via email
   - Set maintenance window in monitoring
2. Pre-maintenance
   - Complete running debates gracefully
   - Disable new debate creation
   - Backup databases
3. During maintenance
   - Apply updates/changes
   - Run migrations
   - Test functionality
4. Post-maintenance
   - Enable services
   - Verify health checks
   - Monitor for issues
   - Update status page
Database Migration
# Backup first
pg_dump aragora > backup_$(date +%Y%m%d).sql
# Run migrations
python -m aragora.migrations.runner migrate
# Verify
python -m aragora.migrations.runner status
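The order matters here: the migration should not run if the backup failed. A minimal sketch of a wrapper that stops at the first failing step is shown below; `run_in_order` and `MIGRATION_STEPS` are illustrative names, and `pg_dump -f` is used in place of the shell redirect so each step is a plain argument list.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]

# The three steps from this section, in order.
MIGRATION_STEPS = [
    ["pg_dump", "-f", "backup.sql", "aragora"],
    ["python", "-m", "aragora.migrations.runner", "migrate"],
    ["python", "-m", "aragora.migrations.runner", "status"],
]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_in_order(steps: Sequence[Sequence[str]], run: Runner = _run) -> List[str]:
    """Run each command in turn; raise at the first non-zero exit code."""
    completed = []
    for cmd in steps:
        code = run(cmd)
        if code != 0:
            raise RuntimeError(f"step failed ({code}): {' '.join(cmd)}")
        completed.append(" ".join(cmd))
    return completed
```

Calling `run_in_order(MIGRATION_STEPS)` executes the real commands; passing a fake `run` makes the sequencing testable without a database.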
Certificate Renewal
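Certificates are renewed automatically by cert-manager. To renew manually if needed:
# Check certificate status
kubectl -n aragora get certificate
# Force renewal: deleting the secret makes cert-manager re-issue the certificate
kubectl -n aragora delete secret aragora-tls
# Optional: serve a temporary self-signed certificate while issuance is pending
kubectl -n aragora annotate certificate aragora-cert cert-manager.io/issue-temporary-certificate="true"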
Scaling Operations
Horizontal Scaling
# Scale backend
kubectl -n aragora scale deployment/aragora-backend --replicas=5
# Scale frontend
kubectl -n aragora scale deployment/aragora-frontend --replicas=3
# Verify
kubectl -n aragora get pods -w
Vertical Scaling
# Update resource limits
kubectl -n aragora set resources deployment/aragora-backend \
--limits=memory=4Gi,cpu=2000m \
--requests=memory=1Gi,cpu=500m
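A quick sanity check before applying new values is that requests do not exceed limits, since Kubernetes rejects such a spec. The sketch below parses only the quantity forms used in this runbook (millicores or whole cores for CPU, binary suffixes for memory); the function names are illustrative and decimal suffixes such as `M` or `G` are deliberately not handled.

```python
from typing import Dict

_MEMORY_FACTORS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_millicores(quantity: str) -> int:
    """Convert "500m" or "2" (cores) to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert "4Gi"-style quantities (or a plain byte count) to bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already bytes


def requests_fit_limits(requests: Dict[str, str], limits: Dict[str, str]) -> bool:
    """True when both CPU and memory requests are at or below their limits."""
    return (
        cpu_millicores(requests["cpu"]) <= cpu_millicores(limits["cpu"])
        and memory_bytes(requests["memory"]) <= memory_bytes(limits["memory"])
    )
```

With the values from the command above (`memory=1Gi,cpu=500m` against `memory=4Gi,cpu=2000m`) the check passes.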
Database Scaling
For PostgreSQL on cloud:
- Create read replica
- Update connection string for read operations
- Monitor replication lag
Backup & Recovery
Daily Backups
Daily backups are automated via cron. To take a backup manually:
# Manual backup (no -t flag: a TTY would corrupt the redirected dump)
kubectl -n aragora exec postgres-0 -- pg_dump -U aragora aragora > backup.sql
# Verify backup
head -100 backup.sql
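Looking at the first 100 lines does not show whether the dump was cut short. Plain-format `pg_dump` output normally starts with a `-- PostgreSQL database dump` comment and ends with `-- PostgreSQL database dump complete`, so checking for both markers catches truncated files. This is a minimal sketch; the function name and the 20-line search window are illustrative.

```python
HEADER = "-- PostgreSQL database dump"
FOOTER = "-- PostgreSQL database dump complete"


def dump_looks_complete(text: str, window: int = 20) -> bool:
    """Check that a plain-format dump has its opening and closing marker comments."""
    lines = [line.strip() for line in text.splitlines()]
    return HEADER in lines[:window] and FOOTER in lines[-window:]
```

To use it on a file, pass the contents of `backup.sql` as a string. For very large dumps, reading only the head and tail of the file would be more economical than loading it whole.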
Point-in-Time Recovery
# Restore from backup
kubectl -n aragora exec -i postgres-0 -- psql -U aragora aragora < backup.sql
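# Or restore to a specific time (requires WAL archiving and a base backup; pg_restore has no target-time option):
# restore the base backup, set restore_command and
# recovery_target_time = '2026-01-20 12:00:00' in postgresql.conf,
# create recovery.signal in the data directory (PostgreSQL 12+), then start PostgreSQL.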
Knowledge Base Backup
# Export knowledge mound
curl -X POST http://localhost:8080/api/knowledge/export \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-o knowledge_backup.json
# Restore
curl -X POST http://localhost:8080/api/knowledge/import \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
--data-binary @knowledge_backup.json
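Before restoring, it is cheap to confirm that the export actually parses, since an interrupted download or an error page saved to disk will not. The sketch below only checks that the text is non-empty JSON; it makes no assumption about the export's schema, and the function name is illustrative.

```python
import json


def backup_is_valid_json(text: str) -> bool:
    """True when the text parses as JSON and is not an empty value."""
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return bool(data)  # treat null, {}, [] and "" as unusable
```

To use it, pass the contents of `knowledge_backup.json` as a string before calling the import endpoint.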
Useful Commands
Quick Diagnostics
# All pods status
kubectl -n aragora get pods -o wide
# Recent logs
kubectl -n aragora logs deployment/aragora-backend --tail=100 -f
# Resource usage
kubectl -n aragora top pods
# Events
kubectl -n aragora get events --sort-by=.lastTimestamp | tail -20
Debug Mode
# Enable debug logging
kubectl -n aragora set env deployment/aragora-backend ARAGORA_LOG_LEVEL=DEBUG
# Disable when done
kubectl -n aragora set env deployment/aragora-backend ARAGORA_LOG_LEVEL=INFO
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| On-call Engineer | PagerDuty | - |
| Engineering Lead | [email] | After 30 min |
| Platform Team | Slack #platform | P1/P2 only |
Appendix: Monitoring Queries
Prometheus Queries
# Error rate
sum(rate(aragora_http_requests_total{status=~"5.."}[5m])) / sum(rate(aragora_http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95, sum(rate(aragora_http_request_duration_seconds_bucket[5m])) by (le))
# Active debates
aragora_active_debates
# Memory usage
process_resident_memory_bytes{job="aragora-backend"} / 1024 / 1024 / 1024
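These expressions can also be evaluated from a script through the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below builds the request URL and reads the first sample from a vector result; the base URL `http://prometheus:9090` in the usage note is an assumption about the deployment, and the function names are illustrative.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def first_value(response: Dict[str, Any]) -> Optional[float]:
    """Return the first sample value from a successful vector response, if any."""
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

For example, `instant_query_url("http://prometheus:9090", "aragora_active_debates")` gives a URL that can be fetched with curl or `urllib.request`, and the decoded JSON body can then be passed to `first_value`.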