Alert Runbooks
This document provides operational procedures for responding to Aragora alerts. Each runbook includes:
- Alert description and severity
- Initial triage steps
- Resolution procedures
- Escalation paths
Table of Contents
- API Availability Alerts
- Latency Alerts
- Agent Alerts
- Debate Alerts
- Security Alerts
- Infrastructure Alerts
- SLO Error Budget Alerts
- General Procedures
API Availability Alerts
ServiceDown
Severity: Critical | SLO Impact: API Availability (99.9%)
Alert Condition:
```
up{job="aragora"} == 0
```
Initial Triage:
- Check Kubernetes pod status:
  ```shell
  kubectl get pods -l app=aragora
  ```
- View pod logs:
  ```shell
  kubectl logs -l app=aragora --tail=100
  ```
- Check recent deployments:
  ```shell
  kubectl rollout history deployment/aragora
  ```
Resolution Steps:
- If pod is CrashLoopBackOff:
- Check logs for startup errors
- Verify environment variables are set
- Check secrets/configmaps are mounted
- If pod is Pending:
- Check node resources:
  ```shell
  kubectl describe node
  ```
- Verify PVC bindings
- If OOMKilled:
- Increase memory limits
- Review for memory leaks
Escalation:
- After 5 min: Page on-call engineer
- After 15 min: Escalate to team lead
- After 30 min: Incident commander takes over
HighErrorRate
Severity: Critical | SLO Impact: API Availability (99.9%)
Alert Condition:
```
sum(rate(aragora_api_requests_total{status="error"}[5m])) /
sum(rate(aragora_api_requests_total[5m])) > 0.01
```
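For intuition, the ratio this expression evaluates can be sketched from raw counter deltas (the numbers below are hypothetical, not live metrics):

```python
# Sketch: the error-rate ratio the alert evaluates, using hypothetical
# 5-minute counter deltas rather than live Prometheus data.

def error_rate(error_delta: float, total_delta: float) -> float:
    """Fraction of requests that errored over the window."""
    if total_delta == 0:
        return 0.0
    return error_delta / total_delta

# 120 errors out of 10,000 requests in the window -> 1.2%, above the 1% threshold.
rate = error_rate(120, 10_000)
print(rate, rate > 0.01)  # 0.012 True
```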
Initial Triage:
- Check error breakdown by endpoint:
  ```
  sum by (endpoint) (rate(aragora_api_requests_total{status="error"}[5m]))
  ```
- Check recent error logs:
  ```shell
  kubectl logs -l app=aragora | grep ERROR
  ```
- Review recent deployments or config changes
Resolution Steps:
- If specific endpoint failing:
- Check dependent services (database, Redis, AI providers)
- Review endpoint-specific logs
- If all endpoints affected:
- Check database connectivity
- Verify Redis is healthy
- Check AI provider API status
- If caused by bad deployment:
- Roll back:
  ```shell
  kubectl rollout undo deployment/aragora
  ```
Escalation:
- Page on-call immediately for >5% error rate
- Engage database team if DB-related
- Engage AI team if provider issues
Latency Alerts
HighAPILatency
Severity: Warning | SLO Impact: API Latency p99 (<2s)
Alert Condition:
```
histogram_quantile(0.99, rate(aragora_api_latency_seconds_bucket[5m])) > 2
```
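`histogram_quantile` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. A minimal sketch of that calculation, using made-up bucket counts rather than the real metric:

```python
# Sketch of Prometheus-style quantile estimation over cumulative histogram
# buckets: locate the bucket containing the target rank, then interpolate
# linearly within it. Bucket data here is made up for illustration.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the overflow bucket
            if count == prev_count:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.5, 50.0), (1.0, 80.0), (2.0, 95.0), (float("inf"), 100.0)]
print(round(histogram_quantile(0.90, buckets), 3))  # 1.667
```

With these counts the p90 lands in the 1s-2s bucket, two thirds of the way through it; the same mechanism drives the p99 comparison in the alert.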
Initial Triage:
- Identify slow endpoints (note the `le` label must be kept for `histogram_quantile` to work):
  ```
  histogram_quantile(0.99, sum by (endpoint, le) (rate(aragora_api_latency_seconds_bucket[5m]))) > 2
  ```
- Check database query times
- Check external API latencies (AI providers)
Resolution Steps:
- If database slow:
- Check for missing indexes
- Review recent schema changes
- Check for table locks
- If AI provider slow:
- Check provider status page
- Consider fallback to OpenRouter
- If memory pressure:
- Scale horizontally
- Review memory usage patterns
Escalation:
- Warning: Monitor for 15 min before escalating
- If sustained >30 min: Page backend engineer
Agent Alerts
HighAgentLatency
Severity: Warning | SLO Impact: Agent Reliability (98%)
Alert Condition:
```
histogram_quantile(0.99, rate(aragora_agent_latency_seconds_bucket[5m])) > 30
```
Initial Triage:
- Check which agents are slow (the `le` label must be kept for `histogram_quantile` to work):
  ```
  histogram_quantile(0.99, sum by (agent, le) (rate(aragora_agent_latency_seconds_bucket[5m])))
  ```
- Check AI provider status pages
- Review circuit breaker status
Resolution Steps:
- If specific provider slow:
- Check provider status
- Temporarily reduce traffic to that provider
- Enable fallback routing
- If all providers slow:
- Check network connectivity
- Review request complexity (token count)
- Enable circuit breaker if needed:
  ```python
  from aragora.resilience import get_circuit_breaker

  cb = get_circuit_breaker("anthropic")
  cb.trip()  # Force the circuit open
  ```
Escalation:
- Contact AI provider support if issue persists >1 hour
Debate Alerts
HighDebateFailureRate
Severity: Critical | SLO Impact: Debate Completion (99.5%)
Alert Condition:
```
sum(rate(aragora_debates_completed_total{status=~"error|timeout"}[5m])) /
sum(rate(aragora_debates_completed_total[5m])) > 0.005
```
Initial Triage:
- Check failure reasons:
  ```
  sum by (status, reason) (rate(aragora_debates_completed_total{status!="completed"}[5m]))
  ```
- Check for specific agent failures
- Review debate logs for patterns
Resolution Steps:
- If timeout failures:
- Increase debate timeout
- Check for slow agents
- Review debate complexity
- If agent failures:
- Check circuit breakers
- Enable fallback agents
- Review agent error logs
- If consensus failures:
- Review convergence settings
- Check for conflicting agent configurations
Escalation:
- Failure rate above 1%: Page immediately
- Engage ML team for consensus issues
DebateTakingTooLong
Severity: Warning | SLO Impact: Debate Duration (95% < 5min)
Alert Condition:
```
histogram_quantile(0.95, rate(aragora_debate_duration_seconds_bucket[5m])) > 300
```
Initial Triage:
- Check which debate types are slow
- Review agent response times
- Check round count distribution
Resolution Steps:
- If too many rounds:
- Review convergence threshold
- Check for hollow consensus
- If agents slow:
- See HighAgentLatency runbook
- If memory operations slow:
- Check ContinuumMemory performance
- Review Knowledge Mound latency
Security Alerts
HighAuthFailureRate
Severity: High | SLO Impact: Authentication (99.9%)
Alert Condition:
```
sum(rate(aragora_auth_failures_total[5m])) /
sum(rate(aragora_api_requests_total{endpoint=~"/api/auth/.*"}[5m])) > 0.1
```
Initial Triage:
- Check failure reasons:
  ```
  sum by (reason) (rate(aragora_auth_failures_total[5m]))
  ```
- Check for brute force patterns (same IP)
- Review JWT service health
Resolution Steps:
- If brute force detected:
- Enable IP rate limiting
- Consider temporary IP block
- Review anomaly detection alerts
- If JWT issues:
- Check secret rotation status
- Verify JWT signing key
- If provider issues (SSO):
- Check IdP status
- Verify OIDC configuration
Escalation:
- Potential security incident: Page security team immediately
- Enable enhanced logging for forensics
BruteForceAttemptDetected
Severity: High
Alert Condition:
```
sum by (ip_address) (rate(aragora_auth_failures_total[5m])) > 10
```
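The per-IP windowed count this expression encodes can be mimicked with a sliding-window counter; the window size and threshold below are illustrative, not the production values:

```python
from collections import defaultdict, deque

# Sketch: per-IP sliding-window failure counter, mimicking the per-IP rate
# the alert computes. Window size and threshold are illustrative.
WINDOW_SECONDS = 300
THRESHOLD = 10

class FailureTracker:
    def __init__(self) -> None:
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def record(self, ip: str, now: float) -> bool:
        """Record one auth failure; return True if the IP breaches the threshold."""
        q = self._events[ip]
        q.append(now)
        while q and q[0] <= now - WINDOW_SECONDS:
            q.popleft()  # drop failures that aged out of the window
        return len(q) > THRESHOLD

tracker = FailureTracker()
flagged = [tracker.record("203.0.113.7", t) for t in range(12)]
print(flagged[0], flagged[-1])  # False True
```

Twelve failures from one IP inside the window trips the threshold on the eleventh attempt.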
Initial Triage:
- Identify source IP addresses
- Check if any accounts were compromised
- Review affected user accounts
Resolution Steps:
- Block the offending IP (at the ingress or firewall); the anomaly detector tracks repeat offenders automatically:
  ```shell
  kubectl exec -it aragora-pod -- python -c "
  from aragora.security.anomaly_detection import get_anomaly_detector
  detector = get_anomaly_detector()  # offending IPs are tracked automatically
  "
  ```
- Notify affected users
- Force password reset if needed
- Review audit logs
Escalation:
- Invoke incident response if account compromise confirmed
- Engage security team
Infrastructure Alerts
CircuitBreakerOpen
Severity: Warning
Alert Condition:
```
aragora_circuit_breakers_open > 0
```
Initial Triage:
- Check which circuit breakers are open:
  ```
  aragora_circuit_breaker_state{state="open"}
  ```
- Review failure rate for affected service
- Check service health
Resolution Steps:
- Check underlying service health
- Review error patterns in logs
- Wait for automatic recovery, or reset manually:
  ```python
  from aragora.resilience import get_circuit_breaker

  cb = get_circuit_breaker("service_name")
  cb.reset()
  ```
Escalation:
- If multiple circuits open: Page infrastructure team
HighMemoryUsage
Severity: Warning
Alert Condition:
```
process_resident_memory_bytes / process_virtual_memory_bytes > 0.9
```
Initial Triage:
- Check memory trend over time
- Identify memory-heavy operations
- Review recent traffic patterns
Resolution Steps:
- If gradual increase (leak):
- Identify leak with memory profiler
- Schedule pod restart
- If spike (traffic):
- Scale horizontally
- Enable request queuing
- If cache issue:
- Review cache eviction policy
- Clear caches if needed
SLO Error Budget Alerts
FastBurnRate
Severity: Critical
Alert Condition:
```yaml
burn_rate: 14.4  # Budget exhausted in ~2 days
duration: 1h
```
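Burn rate converts directly into time-to-exhaustion: at burn rate B, the error budget for the SLO window is consumed in window / B. A quick check of the numbers, assuming the conventional 30-day SLO window:

```python
# Burn rate vs. time-to-exhaustion: at burn rate B, a 30-day error budget
# is consumed in 30 / B days. The 30-day window is an assumption here.
SLO_WINDOW_DAYS = 30

def days_to_exhaustion(burn_rate: float) -> float:
    return SLO_WINDOW_DAYS / burn_rate

print(round(days_to_exhaustion(14.4), 2))  # 2.08  (fast burn)
print(round(days_to_exhaustion(6.0), 2))   # 5.0   (slow burn)
```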
Initial Triage:
- Identify which SLO is burning
- Check for recent changes (deploy, config)
- Review error patterns
Resolution Steps:
- Identify root cause using SLO-specific runbook
- Consider rollback if deploy-related
- Implement immediate mitigation
Escalation:
- Immediate page to on-call
- 30-minute status update cadence
SlowBurnRate
Severity: Warning
Alert Condition:
```yaml
burn_rate: 6.0  # Budget exhausted in ~5 days
duration: 6h
```
Initial Triage:
- Review error budget dashboard
- Identify contributing factors
- Project budget exhaustion date
Resolution Steps:
- Create ticket for investigation
- Schedule remediation work
- Consider feature freeze if needed
Escalation:
- Team standup discussion
- Engineering manager if budget <50%
General Procedures
Incident Response Flow
- Acknowledge alert within 5 minutes
- Assess severity and impact
- Communicate via incident channel
- Mitigate to restore service
- Resolve root cause
- Review in post-incident meeting
Communication Templates
Initial Update:
```
[INCIDENT] Aragora - {AlertName}
Impact: {Description of user impact}
Status: Investigating
ETA: Assessing
```
Resolution Update:
```
[RESOLVED] Aragora - {AlertName}
Impact: {Description of user impact}
Resolution: {What fixed it}
Duration: {How long}
Follow-up: {Any follow-up actions}
```
Useful Commands
```shell
# Check pod status
kubectl get pods -l app=aragora -o wide

# View logs
kubectl logs -l app=aragora --tail=100 -f

# Check metrics (quote the URL so the shell does not mangle the braces)
curl 'http://localhost:9090/api/v1/query?query=up{job="aragora"}'

# Rollback deployment
kubectl rollout undo deployment/aragora

# Scale up
kubectl scale deployment/aragora --replicas=5

# Check circuit breakers
curl http://aragora:8080/api/health/circuits
```
Contact Information
| Role | Contact | Escalation Time |
|---|---|---|
| On-Call Engineer | PagerDuty | Immediate |
| Backend Lead | Slack @backend-lead | 15 min |
| Security Team | Slack #security-ops | For security alerts |
| Infrastructure | Slack #infrastructure | For infra alerts |
| Incident Commander | PagerDuty escalation | 30 min |
Document Version: 1.0 Last Updated: January 21, 2026 Owner: Platform Team