Alert Runbooks
This document provides operational procedures for responding to Aragora alerts. Each runbook includes:
- Alert description and severity
- Initial triage steps
- Resolution procedures
- Escalation paths
Table of Contents
- API Availability Alerts
- Latency Alerts
- Agent Alerts
- Debate Alerts
- Security Alerts
- Infrastructure Alerts
- SLO Error Budget Alerts
- General Procedures
API Availability Alerts
ServiceDown
Severity: Critical | SLO Impact: API Availability (99.9%)
Alert Condition:
```
up{job="aragora"} == 0
```
Initial Triage:
- Check Kubernetes pod status:
  ```shell
  kubectl get pods -l app=aragora
  ```
- View pod logs:
  ```shell
  kubectl logs -l app=aragora --tail=100
  ```
- Check recent deployments:
  ```shell
  kubectl rollout history deployment/aragora
  ```
Resolution Steps:
- If pod is CrashLoopBackOff:
- Check logs for startup errors
- Verify environment variables are set
- Check secrets/configmaps are mounted
- If pod is Pending:
- Check node resources:
  ```shell
  kubectl describe node
  ```
- Verify PVC bindings
- If OOMKilled:
- Increase memory limits
- Review for memory leaks
Escalation:
- After 5 min: Page on-call engineer
- After 15 min: Escalate to team lead
- After 30 min: Incident commander takes over
HighErrorRate
Severity: Critical | SLO Impact: API Availability (99.9%)
Alert Condition:
```
sum(rate(aragora_api_requests_total{status="error"}[5m])) /
sum(rate(aragora_api_requests_total[5m])) > 0.01
```
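For intuition, the ratio this expression evaluates can be sketched from raw counter deltas (the numbers below are hypothetical, not live metrics):

```python
# Sketch: the error-rate ratio the alert evaluates, using hypothetical
# 5-minute counter deltas rather than live Prometheus data.

def error_rate(error_delta: float, total_delta: float) -> float:
    """Fraction of requests that errored over the window."""
    if total_delta == 0:
        return 0.0
    return error_delta / total_delta

# 120 errors out of 10,000 requests in the window -> 1.2%, above the 1% threshold.
rate = error_rate(120, 10_000)
print(rate, rate > 0.01)  # 0.012 True
```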
Initial Triage:
- Check error breakdown by endpoint:
  ```
  sum by (endpoint) (rate(aragora_api_requests_total{status="error"}[5m]))
  ```
- Check recent error logs:
  ```shell
  kubectl logs -l app=aragora | grep ERROR
  ```
- Review recent deployments or config changes
Resolution Steps:
- If specific endpoint failing:
- Check dependent services (database, Redis, AI providers)
- Review endpoint-specific logs
- If all endpoints affected:
- Check database connectivity
- Verify Redis is healthy
- Check AI provider API status
- If caused by bad deployment:
- Roll back:
  ```shell
  kubectl rollout undo deployment/aragora
  ```
Escalation:
- Page on-call immediately for >5% error rate
- Engage database team if DB-related
- Engage AI team if provider issues
Latency Alerts
HighAPILatency
Severity: Warning | SLO Impact: API Latency p99 (<2s)
Alert Condition:
```
histogram_quantile(0.99, rate(aragora_api_latency_seconds_bucket[5m])) > 2
```
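`histogram_quantile` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. A minimal sketch of that calculation, using made-up bucket counts rather than the real metric:

```python
# Sketch of Prometheus-style quantile estimation over cumulative histogram
# buckets: locate the bucket containing the target rank, then interpolate
# linearly within it. Bucket data here is made up for illustration.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the overflow bucket
            if count == prev_count:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.5, 50.0), (1.0, 80.0), (2.0, 95.0), (float("inf"), 100.0)]
print(round(histogram_quantile(0.90, buckets), 3))  # 1.667
```

With these counts the p90 lands in the 1s-2s bucket, two thirds of the way through it; the same mechanism drives the p99 comparison in the alert.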
Initial Triage:
- Identify slow endpoints (note the `le` label must be kept for `histogram_quantile` to work):
  ```
  histogram_quantile(0.99, sum by (endpoint, le) (rate(aragora_api_latency_seconds_bucket[5m]))) > 2
  ```
- Check database query times
- Check external API latencies (AI providers)
Resolution Steps:
- If database slow:
- Check for missing indexes
- Review recent schema changes
- Check for table locks
- If AI provider slow:
- Check provider status page
- Consider fallback to OpenRouter
- If memory pressure:
- Scale horizontally
- Review memory usage patterns
Escalation:
- Warning: Monitor for 15 min before escalating
- If sustained >30 min: Page backend engineer
Agent Alerts
HighAgentLatency
Severity: Warning | SLO Impact: Agent Reliability (98%)
Alert Condition:
```
histogram_quantile(0.99, rate(aragora_agent_latency_seconds_bucket[5m])) > 30
```
Initial Triage:
- Check which agents are slow (the `le` label must be kept for `histogram_quantile` to work):
  ```
  histogram_quantile(0.99, sum by (agent, le) (rate(aragora_agent_latency_seconds_bucket[5m])))
  ```
- Check AI provider status pages
- Review circuit breaker status
Resolution Steps:
- If specific provider slow:
- Check provider status
- Temporarily reduce traffic to that provider
- Enable fallback routing
- If all providers slow:
- Check network connectivity
- Review request complexity (token count)
- Enable circuit breaker if needed:
  ```python
  from aragora.resilience import get_circuit_breaker

  cb = get_circuit_breaker("anthropic")
  cb.trip()  # Force the circuit open
  ```
Escalation:
- Contact AI provider support if issue persists >1 hour
Debate Alerts
HighDebateFailureRate
Severity: Critical | SLO Impact: Debate Completion (99.5%)
Alert Condition:
```
sum(rate(aragora_debates_completed_total{status=~"error|timeout"}[5m])) /
sum(rate(aragora_debates_completed_total[5m])) > 0.005
```
Initial Triage:
- Check failure reasons:
  ```
  sum by (status, reason) (rate(aragora_debates_completed_total{status!="completed"}[5m]))
  ```
- Check for specific agent failures
- Review debate logs for patterns
Resolution Steps:
- If timeout failures:
- Increase debate timeout
- Check for slow agents
- Review debate complexity
- If agent failures:
- Check circuit breakers
- Enable fallback agents
- Review agent error logs
- If consensus failures:
- Review convergence settings
- Check for conflicting agent configurations
Escalation:
- Failure rate above 1%: Page immediately
- Engage ML team for consensus issues
DebateTakingTooLong
Severity: Warning | SLO Impact: Debate Duration (95% < 5min)
Alert Condition:
```
histogram_quantile(0.95, rate(aragora_debate_duration_seconds_bucket[5m])) > 300
```
Initial Triage:
- Check which debate types are slow
- Review agent response times
- Check round count distribution
Resolution Steps:
- If too many rounds:
- Review convergence threshold
- Check for hollow consensus
- If agents slow:
- See HighAgentLatency runbook
- If memory operations slow:
- Check ContinuumMemory performance
- Review Knowledge Mound latency
Security Alerts
HighAuthFailureRate
Severity: High | SLO Impact: Authentication (99.9%)
Alert Condition:
```
sum(rate(aragora_auth_failures_total[5m])) /
sum(rate(aragora_api_requests_total{endpoint=~"/api/auth/.*"}[5m])) > 0.1
```
Initial Triage:
- Check failure reasons:
  ```
  sum by (reason) (rate(aragora_auth_failures_total[5m]))
  ```
- Check for brute force patterns (same IP)
- Review JWT service health
Resolution Steps:
- If brute force detected:
- Enable IP rate limiting
- Consider temporary IP block
- Review anomaly detection alerts
- If JWT issues:
- Check secret rotation status
- Verify JWT signing key
- If provider issues (SSO):
- Check IdP status
- Verify OIDC configuration
Escalation:
- Potential security incident: Page security team immediately
- Enable enhanced logging for forensics
BruteForceAttemptDetected
Severity: High
Alert Condition:
```
sum by (ip_address) (rate(aragora_auth_failures_total[5m])) > 10
```
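The per-IP windowed count this expression encodes can be mimicked with a sliding-window counter; the window size and threshold below are illustrative, not the production values:

```python
from collections import defaultdict, deque

# Sketch: per-IP sliding-window failure counter, mimicking the per-IP rate
# the alert computes. Window size and threshold are illustrative.
WINDOW_SECONDS = 300
THRESHOLD = 10

class FailureTracker:
    def __init__(self) -> None:
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def record(self, ip: str, now: float) -> bool:
        """Record one auth failure; return True if the IP breaches the threshold."""
        q = self._events[ip]
        q.append(now)
        while q and q[0] <= now - WINDOW_SECONDS:
            q.popleft()  # drop failures that aged out of the window
        return len(q) > THRESHOLD

tracker = FailureTracker()
flagged = [tracker.record("203.0.113.7", t) for t in range(12)]
print(flagged[0], flagged[-1])  # False True
```

Twelve failures from one IP inside the window trips the threshold on the eleventh attempt.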
Initial Triage:
- Identify source IP addresses
- Check if any accounts were compromised
- Review affected user accounts
Resolution Steps:
- Block the offending IP (at the ingress or firewall); the anomaly detector tracks repeat offenders automatically:
  ```shell
  kubectl exec -it aragora-pod -- python -c "
  from aragora.security.anomaly_detection import get_anomaly_detector
  detector = get_anomaly_detector()  # offending IPs are tracked automatically
  "
  ```
- Notify affected users
- Force password reset if needed
- Review audit logs
Escalation:
- Invoke incident response if account compromise confirmed
- Engage security team
Infrastructure Alerts
CircuitBreakerOpen
Severity: Warning
Alert Condition:
```
aragora_circuit_breakers_open > 0
```
Initial Triage:
- Check which circuit breakers are open:
  ```
  aragora_circuit_breaker_state{state="open"}
  ```
- Review failure rate for affected service
- Check service health
Resolution Steps:
- Check underlying service health
- Review error patterns in logs
- Wait for automatic recovery, or reset manually:
  ```python
  from aragora.resilience import get_circuit_breaker

  cb = get_circuit_breaker("service_name")
  cb.reset()
  ```
Escalation:
- If multiple circuits open: Page infrastructure team
HighMemoryUsage
Severity: Warning
Alert Condition:
```
process_resident_memory_bytes / process_virtual_memory_bytes > 0.9
```
Initial Triage:
- Check memory trend over time
- Identify memory-heavy operations
- Review recent traffic patterns
Resolution Steps:
- If gradual increase (leak):
- Identify leak with memory profiler
- Schedule pod restart
- If spike (traffic):
- Scale horizontally
- Enable request queuing
- If cache issue:
- Review cache eviction policy
- Clear caches if needed
SLO Error Budget Alerts
FastBurnRate
Severity: Critical
Alert Condition:
```yaml
burn_rate: 14.4  # Budget exhausted in ~2 days
duration: 1h
```
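Burn rate converts directly into time-to-exhaustion: at burn rate B, the error budget for the SLO window is consumed in window / B. A quick check of the numbers, assuming the conventional 30-day SLO window:

```python
# Burn rate vs. time-to-exhaustion: at burn rate B, a 30-day error budget
# is consumed in 30 / B days. The 30-day window is an assumption here.
SLO_WINDOW_DAYS = 30

def days_to_exhaustion(burn_rate: float) -> float:
    return SLO_WINDOW_DAYS / burn_rate

print(round(days_to_exhaustion(14.4), 2))  # 2.08  (fast burn)
print(round(days_to_exhaustion(6.0), 2))   # 5.0   (slow burn)
```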
Initial Triage:
- Identify which SLO is burning
- Check for recent changes (deploy, config)
- Review error patterns
Resolution Steps:
- Identify root cause using SLO-specific runbook
- Consider rollback if deploy-related
- Implement immediate mitigation
Escalation:
- Immediate page to on-call
- 30-minute status update cadence
SlowBurnRate
Severity: Warning
Alert Condition:
```yaml
burn_rate: 6.0  # Budget exhausted in ~5 days
duration: 6h
```
Initial Triage:
- Review error budget dashboard
- Identify contributing factors
- Project budget exhaustion date
Resolution Steps:
- Create ticket for investigation
- Schedule remediation work
- Consider feature freeze if needed
Escalation:
- Team standup discussion
- Engineering manager if budget <50%
General Procedures
Incident Response Flow
- Acknowledge alert within 5 minutes
- Assess severity and impact
- Communicate via incident channel
- Mitigate to restore service
- Resolve root cause
- Review in post-incident meeting
Communication Templates
Initial Update:
```
[INCIDENT] Aragora - {AlertName}
Impact: {Description of user impact}
Status: Investigating
ETA: Assessing
```
Resolution Update:
```
[RESOLVED] Aragora - {AlertName}
Impact: {Description of user impact}
Resolution: {What fixed it}
Duration: {How long}
Follow-up: {Any follow-up actions}
```
Useful Commands
```shell
# Check pod status
kubectl get pods -l app=aragora -o wide

# View logs
kubectl logs -l app=aragora --tail=100 -f

# Check metrics (quote the URL so the shell does not mangle the braces)
curl 'http://localhost:9090/api/v1/query?query=up{job="aragora"}'

# Rollback deployment
kubectl rollout undo deployment/aragora

# Scale up
kubectl scale deployment/aragora --replicas=5

# Check circuit breakers
curl http://aragora:8080/api/health/circuits
```
Contact Information
| Role | Contact | Escalation Time |
|---|---|---|
| On-Call Engineer | PagerDuty | Immediate |
| Backend Lead | Slack @backend-lead | 15 min |
| Security Team | Slack #security-ops | For security alerts |
| Infrastructure | Slack #infrastructure | For infra alerts |
| Incident Commander | PagerDuty escalation | 30 min |
Document Version: 1.0 Last Updated: January 21, 2026 Owner: Platform Team