Production Readiness Checklist

Comprehensive checklist for deploying Aragora to production. Complete all sections before go-live.

Document Version: 1.0.0
Last Updated: 2026-01-13


Quick Status Check

# Run production readiness validator
python -m aragora.config --validate-production

# Expected output:
# [PASS] Environment mode is production
# [PASS] JWT secret configured (32+ chars)
# [PASS] At least one AI provider configured
# [PASS] Database connection verified
# [WARN] Redis not configured (rate limiting will use in-memory)
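To gate a deployment pipeline on the validator, its output can be summarized and turned into an exit code. The sketch below is a minimal example that reads the output on stdin. It assumes each result line starts with a bracketed tag as in the sample above, and it assumes failures are tagged `[FAIL]`, which the sample output does not show. The function name `summarize_validation` is hypothetical.

```shell
# Count result lines by tag and exit non-zero if any failure is present.
# Assumption: failed checks are printed as "[FAIL] ..." (not confirmed by the sample output).
summarize_validation() {
    awk '
        /^\[PASS\]/ { pass++ }
        /^\[WARN\]/ { warn++ }
        /^\[FAIL\]/ { fail++ }
        END {
            printf "pass=%d warn=%d fail=%d\n", pass, warn, fail
            exit (fail > 0 ? 1 : 0)
        }
    '
}

# Usage (not run here):
#   python -m aragora.config --validate-production | summarize_validation
```

Warnings are counted but do not fail the run, so a missing Redis configuration would still pass this gate.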

Pre-Launch Checklist

1. Security Configuration

Item                          Status  Verification Command
JWT Secret (32+ chars)        [ ]     echo $ARAGORA_JWT_SECRET | wc -c (should be >32)
Production environment mode   [ ]     echo $ARAGORA_ENV (should be production)
API keys in secret manager    [ ]     Not in .env file, use K8s secrets or vault
CORS origins restricted       [ ]     Check ARAGORA_ALLOWED_ORIGINS excludes localhost
Rate limiting enabled         [ ]     curl /api/rate-limits/config returns limits
TLS/HTTPS enforced            [ ]     curl -I https://your-domain/api/health
Token blacklist backend       [ ]     echo $ARAGORA_BLACKLIST_BACKEND (recommend redis or sqlite)

Critical Security Variables:

# REQUIRED in production
ARAGORA_ENV=production
ARAGORA_JWT_SECRET=<32+ character random string>

# Generate secure JWT secret:
python -c "import secrets; print(secrets.token_urlsafe(48))"
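Note that `echo ... | wc -c` counts the trailing newline, so a 32-character secret reports 33. A length check that avoids this off-by-one can be sketched as follows. The function name `jwt_secret_ok` is hypothetical, and the 32-character minimum is taken from the checklist above.

```shell
# Succeed when the secret passed as $1 has at least 32 characters.
jwt_secret_ok() {
    secret=$1
    [ "${#secret}" -ge 32 ]
}

# Usage (not run here):
#   jwt_secret_ok "$ARAGORA_JWT_SECRET" && echo "OK" || echo "JWT secret too short"
```

This only checks length. It says nothing about randomness, so the secret should still come from a generator such as the `secrets.token_urlsafe` call above.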

2. High Availability

Item                      Status  Verification Command
Minimum 2 replicas        [ ]     kubectl get deploy aragora -o jsonpath='{.spec.replicas}'
HPA configured            [ ]     kubectl get hpa aragora
PDB configured            [ ]     kubectl get pdb aragora-pdb
Anti-affinity rules       [ ]     Check deployment manifest
Multi-zone distribution   [ ]     kubectl get pods -o wide (different nodes)

Kubernetes HA Manifest:

# deploy/kubernetes/deployment.yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
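The replica requirement can be checked mechanically from a "ready/desired" string, such as the one produced by `kubectl get deploy aragora -n aragora -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'`. The sketch below is one way to do that. The function name `replicas_ok` is hypothetical, and the default minimum of 2 comes from the checklist above.

```shell
# Succeed when at least MIN replicas are desired and every desired replica is ready.
#   $1: "ready/desired", e.g. "3/3"
#   $2: minimum desired replicas (default 2)
replicas_ok() {
    status=$1
    min=${2:-2}
    ready=${status%%/*}
    desired=${status##*/}
    # Kubernetes omits readyReplicas from the status when no pod is ready.
    [ -n "$ready" ] || ready=0
    [ "$desired" -ge "$min" ] && [ "$ready" -eq "$desired" ]
}

# Usage (not run here):
#   replicas_ok "$(kubectl get deploy aragora -n aragora \
#       -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')" || echo "replicas not ready"
```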

3. Database & Persistence

Item                          Status  Verification Command
Database connection works     [ ]     curl /api/health | jq '.checks.database'
Connection pool configured    [ ]     Check ARAGORA_DB_POOL_SIZE
Automated backups scheduled   [ ]     Check cron/backup job
Backup restore tested         [ ]     Document last test date
Migration scripts applied     [ ]     kubectl logs job/aragora-migrate

Database Configuration:

# Supabase (recommended for production)
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_KEY=<service_role_key>

# Connection tuning
ARAGORA_DB_POOL_SIZE=20
ARAGORA_DB_MAX_OVERFLOW=10
ARAGORA_DB_TIMEOUT=60.0
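When sizing the pool, it helps to compare the worst-case connection count against the database's connection limit. The arithmetic below is a rough sketch. It assumes each replica runs one process that may open up to pool size plus overflow connections; the source does not confirm this is how Aragora applies these settings. The function name `max_db_connections` is hypothetical.

```shell
# Upper bound on connections opened by the whole deployment.
# Assumption: each replica can open pool_size + max_overflow connections.
#   $1: replicas, $2: pool size, $3: max overflow
max_db_connections() {
    echo $(( $1 * ($2 + $3) ))
}

# 3 replicas with ARAGORA_DB_POOL_SIZE=20 and ARAGORA_DB_MAX_OVERFLOW=10
max_db_connections 3 20 10   # prints 90
```

If an autoscaler can raise the replica count, the maximum replica count is the number to plug in here.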

4. Redis Configuration

Item                        Status  Verification Command
Redis connection works      [ ]     redis-cli ping
Redis URL configured        [ ]     Check ARAGORA_REDIS_URL
Rate limit uses Redis       [ ]     curl /api/rate-limits/config shows backend: redis
Redis persistence enabled   [ ]     redis-cli CONFIG GET save
Redis memory limit set      [ ]     redis-cli CONFIG GET maxmemory

Redis Configuration:

ARAGORA_REDIS_URL=redis://redis.aragora.svc:6379/0
ARAGORA_REDIS_KEY_PREFIX=aragora:prod:
ARAGORA_BLACKLIST_BACKEND=redis

# Fail-open for high availability (optional)
ARAGORA_RATE_LIMIT_FAIL_OPEN=true
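The memory limit check can be scripted. When piped, `redis-cli CONFIG GET maxmemory` prints the parameter name on one line and its value in bytes on the next, and a value of 0 means no limit. The sketch below reads that output on stdin. The function name `redis_maxmemory_set` is hypothetical.

```shell
# Succeed when the second line of "CONFIG GET maxmemory" output is a positive number.
redis_maxmemory_set() {
    value=$(sed -n '2p')
    [ -n "$value" ] && [ "$value" -gt 0 ] 2>/dev/null
}

# Usage (not run here):
#   redis-cli CONFIG GET maxmemory | redis_maxmemory_set || echo "maxmemory is unlimited"
```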

5. Monitoring & Observability

Item                          Status  Verification Command
Prometheus scraping           [ ]     curl /metrics | head
Grafana dashboards imported   [ ]     Check Grafana UI
Alert rules configured        [ ]     Check Alertmanager
Sentry DSN configured         [ ]     Check SENTRY_DSN
Log aggregation working       [ ]     Check logging platform

Monitoring Setup:

# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aragora
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aragora
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

6. SSL/TLS Certificates

Item                       Status  Verification Command
cert-manager installed     [ ]     kubectl get pods -n cert-manager
ClusterIssuer configured   [ ]     kubectl get clusterissuer
Certificate issued         [ ]     kubectl get certificate -n aragora
Certificate valid          [ ]     echo | openssl s_client -connect domain:443 2>/dev/null | openssl x509 -noout -dates
Auto-renewal working       [ ]     cert-manager handles automatically

Certificate Verification:

# Check certificate status
kubectl describe certificate aragora-tls -n aragora

# Check expiration
kubectl get certificate aragora-tls -n aragora -o jsonpath='{.status.notAfter}'

# Test HTTPS
curl -vI https://your-domain.com/api/health 2>&1 | grep -E "SSL|subject|expire"

7. OAuth & Authentication

Item                         Status  Verification Command
OAuth redirect URLs set      [ ]     Check GOOGLE_OAUTH_REDIRECT_URI, OAUTH_SUCCESS_URL
Allowed hosts configured     [ ]     Check OAUTH_ALLOWED_REDIRECT_HOSTS
SSO configured (if used)     [ ]     See SSO_SETUP.md
Session timeout reasonable   [ ]     Check ARAGORA_JWT_EXPIRY_HOURS (max 168h)

Production OAuth Configuration:

# REQUIRED for OAuth in production
GOOGLE_OAUTH_CLIENT_ID=1234567890-abc.apps.googleusercontent.com
GOOGLE_OAUTH_CLIENT_SECRET=your-client-secret
GOOGLE_OAUTH_REDIRECT_URI=https://api.yourdomain.com/api/auth/oauth/google/callback
OAUTH_SUCCESS_URL=https://yourdomain.com/auth/success
OAUTH_ERROR_URL=https://yourdomain.com/auth/error
OAUTH_ALLOWED_REDIRECT_HOSTS=yourdomain.com,api.yourdomain.com

# Frontend URLs
NEXT_PUBLIC_API_URL=https://api.yourdomain.com
NEXT_PUBLIC_WS_URL=wss://api.yourdomain.com
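A common misconfiguration is a redirect URL whose host is missing from OAUTH_ALLOWED_REDIRECT_HOSTS. The sketch below is a pre-flight check for that case. It assumes hosts are compared by exact match against the comma-separated list; how Aragora itself validates redirects is not stated here. The helper names `url_host` and `host_allowed` are hypothetical, and URLs containing userinfo (`user@host`) are not handled.

```shell
# Print the host part of a URL such as https://api.example.com:8443/path.
url_host() {
    rest=${1#*://}       # drop the scheme
    rest=${rest%%/*}     # drop the path
    echo "${rest%%:*}"   # drop an optional port
}

# Succeed when the host of URL $1 appears in the comma-separated list $2.
host_allowed() {
    host=$(url_host "$1")
    case ",$2," in
        *",$host,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not run here):
#   for url in "$GOOGLE_OAUTH_REDIRECT_URI" "$OAUTH_SUCCESS_URL" "$OAUTH_ERROR_URL"; do
#       host_allowed "$url" "$OAUTH_ALLOWED_REDIRECT_HOSTS" || echo "host not allowed: $url"
#   done
```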

8. API Provider Configuration

Item                           Status  Verification Command
Primary provider configured    [ ]     ANTHROPIC_API_KEY or OPENAI_API_KEY set
Fallback provider configured   [ ]     OPENROUTER_API_KEY for quota exhaustion
Circuit breakers enabled       [ ]     curl /api/agents/circuit-breakers
Agent timeout configured       [ ]     Check ARAGORA_AGENT_TIMEOUT_SECONDS

Provider Configuration:

# Primary providers (at least one required)
ANTHROPIC_API_KEY=sk-ant-xxx
OPENAI_API_KEY=sk-xxx

# Fallback provider (highly recommended)
OPENROUTER_API_KEY=sk-or-xxx

# Optional additional providers
MISTRAL_API_KEY=xxx
GEMINI_API_KEY=AIzaSy...
XAI_API_KEY=xai-xxx
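The first two rows of the table above can be checked from the environment before startup. The sketch below treats a missing primary key as a failure and a missing fallback key as a warning. The function name `provider_check` and its messages are hypothetical, and the actual validator may count the optional providers as well.

```shell
# Report provider key status from the current environment.
# Fails only when neither primary provider key is set.
provider_check() {
    if [ -z "${ANTHROPIC_API_KEY:-}" ] && [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "FAIL: no primary provider key set"
        return 1
    fi
    if [ -z "${OPENROUTER_API_KEY:-}" ]; then
        echo "WARN: no fallback provider key set"
    else
        echo "OK"
    fi
    return 0
}
```

The check only tests that the variables are non-empty. It does not verify that a key is valid or has remaining quota.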

9. RLM (Recursive Language Models)

Item                              Status  Verification Command
RLM package installed             [ ]     python -c "from aragora.rlm import HAS_OFFICIAL_RLM; print(HAS_OFFICIAL_RLM)"
TRUE RLM enabled                  [ ]     Check server logs for Created AragoraRLM with TRUE RLM support
Compression fallback NOT active   [ ]     Logs should NOT show Created AragoraRLM with compression fallback

Verification Commands:

# Check if TRUE RLM is available
python -c "from aragora.rlm import HAS_OFFICIAL_RLM; print(f'TRUE RLM: {HAS_OFFICIAL_RLM}')"
# Expected: TRUE RLM: True

# Check server logs after startup
docker logs aragora 2>&1 | grep -E "RLM Factory|TRUE RLM|compression fallback"
# Expected: [RLM Factory] Created AragoraRLM with TRUE RLM support
# NOT Expected: [RLM Factory] Created AragoraRLM with compression fallback

# Verify rlm package is installed
pip show rlm || echo "RLM package not installed - install with: pip install '.[rlm]'"

Why TRUE RLM Matters:

  • TRUE RLM uses REPL-based context navigation for efficient memory retrieval
  • Compression fallback uses a 5-level hierarchy (50%→80%→95% compression) which loses context fidelity
  • TRUE RLM significantly improves knowledge retrieval quality in production workloads

Troubleshooting:

  • If HAS_OFFICIAL_RLM is False: Ensure the rlm extra is in your pip install command
  • Docker builds: Verify .[rlm] is included in the Dockerfile pip install line
  • Lightsail/EC2: Ensure setup scripts include pip install -e ".[rlm]"
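The log check above can be turned into a single verdict for use in a startup gate. The sketch below reads log text on stdin and matches the two messages quoted in this section. The function name `rlm_mode` and its output labels are hypothetical.

```shell
# Classify the RLM startup mode from log text on stdin.
# The fallback message is tested first so mixed logs are flagged.
rlm_mode() {
    logs=$(cat)
    case $logs in
        *"Created AragoraRLM with compression fallback"*) echo "fallback" ;;
        *"Created AragoraRLM with TRUE RLM support"*) echo "true-rlm" ;;
        *) echo "unknown" ;;
    esac
}

# Usage (not run here):
#   [ "$(docker logs aragora 2>&1 | rlm_mode)" = "true-rlm" ] || echo "TRUE RLM not active"
```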

10. Resource Limits

Item                      Status  Verification Command
CPU limits set            [ ]     Check deployment manifest
Memory limits set         [ ]     Check deployment manifest
Concurrent debate limit   [ ]     Check ARAGORA_MAX_CONCURRENT_DEBATES
Rate limits configured    [ ]     Check ARAGORA_RATE_LIMIT

Recommended Resource Limits:

# Kubernetes deployment
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

11. Disaster Recovery

Item                            Status  Verification Command
Backup schedule configured      [ ]     Check cron jobs
Backup retention policy         [ ]     Document: ___ days
Restore procedure tested        [ ]     Document last test: ______
Rollback procedure documented   [ ]     See DISASTER_RECOVERY.md
Incident response plan          [ ]     See RUNBOOK.md

Go-Live Verification

Run these commands immediately before and after go-live:

Pre-Launch Health Check

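#!/bin/bash
# pre_launch_check.sh

# Without pipefail, a failed curl is masked by jq's exit status and "FAILED" never prints
set -o pipefail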

echo "=== Pre-Launch Health Check ==="

# 1. Health endpoint
echo -n "Health endpoint: "
curl -sf https://your-domain/api/health | jq -r '.status' || echo "FAILED"

# 2. Database
echo -n "Database: "
curl -sf https://your-domain/api/health | jq -r '.checks.database.status' || echo "FAILED"

# 3. Redis
echo -n "Redis: "
curl -sf https://your-domain/api/health | jq -r '.checks.redis.status' || echo "FAILED"

# 4. Certificate validity
echo -n "TLS Certificate: "
echo | openssl s_client -connect your-domain:443 2>/dev/null | openssl x509 -noout -checkend 604800 && echo "OK (>7 days)" || echo "EXPIRING SOON"

# 5. Replicas
echo -n "Replicas: "
kubectl get deploy aragora -n aragora -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
echo ""

# 6. Pod distribution
echo "Pod distribution:"
kubectl get pods -n aragora -o wide | awk '{print $1, $7}' | tail -n +2

# 7. Circuit breakers
echo -n "Circuit breakers open: "
curl -sf https://your-domain/api/agents/circuit-breakers | jq '[.[] | select(.state=="open")] | length' || echo "?"

echo "=== Check Complete ==="

Post-Launch Smoke Test

#!/bin/bash
# smoke_test.sh

echo "=== Post-Launch Smoke Test ==="

BASE_URL="https://your-domain"
TOKEN="your-test-token"

# 1. Create a test debate
echo "Creating test debate..."
DEBATE_ID=$(curl -sf -X POST "$BASE_URL/api/debates" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"topic": "Smoke test debate", "agents": ["claude", "gpt-4"], "max_rounds": 1}' \
    | jq -r '.debate_id')

if [ "$DEBATE_ID" == "null" ] || [ -z "$DEBATE_ID" ]; then
    echo "FAILED: Could not create debate"
    exit 1
fi
echo "Created debate: $DEBATE_ID"

# 2. Check debate status
sleep 5
echo "Checking debate status..."
STATUS=$(curl -sf "$BASE_URL/api/debates/$DEBATE_ID" \
    -H "Authorization: Bearer $TOKEN" \
    | jq -r '.status')
echo "Debate status: $STATUS"

# 3. WebSocket connectivity
echo "Testing WebSocket..."
timeout 5 websocat "wss://your-domain/ws" -1 && echo "WebSocket: OK" || echo "WebSocket: FAILED"

# 4. Metrics endpoint
echo -n "Metrics endpoint: "
curl -sf "$BASE_URL/metrics" -o /dev/null && echo "OK" || echo "FAILED"

echo "=== Smoke Test Complete ==="
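The fixed `sleep 5` in the smoke test can be too short or needlessly long, depending on how quickly the service responds. A small retry helper is one alternative. The sketch below is a generic version that runs a command until it succeeds or the attempts run out. The function name `retry` is hypothetical. It uses plain shell variables, so the wrapped command should not reuse the names `attempts`, `delay`, or `n`.

```shell
# Run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
#   $1: attempts, $2: delay in seconds, remaining args: the command to run
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        # Sleep only when another attempt follows.
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage (not run here): wait up to 30 seconds for the health endpoint
#   retry 6 5 curl -sf "$BASE_URL/api/health" -o /dev/null || echo "health check never passed"
```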

Environment Variable Quick Reference

Minimum Production Configuration

# Environment mode
ARAGORA_ENV=production

# Authentication (REQUIRED)
ARAGORA_JWT_SECRET=<64-character-random-string>

# AI Providers (at least one required)
ANTHROPIC_API_KEY=sk-ant-xxx
OPENAI_API_KEY=sk-xxx
OPENROUTER_API_KEY=sk-or-xxx # Fallback

# Database
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_KEY=<service_role_key>

# Redis (recommended)
ARAGORA_REDIS_URL=redis://redis:6379/0

# OAuth URLs (required for authentication)
GOOGLE_OAUTH_REDIRECT_URI=https://api.yourdomain.com/api/auth/oauth/google/callback
OAUTH_SUCCESS_URL=https://yourdomain.com/auth/success
OAUTH_ERROR_URL=https://yourdomain.com/auth/error
OAUTH_ALLOWED_REDIRECT_HOSTS=yourdomain.com,api.yourdomain.com

# Frontend
NEXT_PUBLIC_API_URL=https://api.yourdomain.com
NEXT_PUBLIC_WS_URL=wss://api.yourdomain.com

Production Tuning

# Rate limiting
ARAGORA_RATE_LIMIT=100 # Requests per minute per token
ARAGORA_IP_RATE_LIMIT=200 # Requests per minute per IP
ARAGORA_BURST_MULTIPLIER=2.0 # Allow short bursts

# Debate limits
ARAGORA_MAX_CONCURRENT_DEBATES=20
ARAGORA_AGENT_TIMEOUT_SECONDS=90

# WebSocket
ARAGORA_WS_MAX_MESSAGE_SIZE=131072 # 128KB
ARAGORA_WS_HEARTBEAT=30

# Observability
SENTRY_DSN=https://xxx@sentry.io/xxx
ARAGORA_LOG_LEVEL=INFO
ARAGORA_TELEMETRY_LEVEL=CONTROLLED

# Security
ARAGORA_BLACKLIST_BACKEND=redis

Rollback Procedures

Quick Rollback (Kubernetes)

# View rollout history
kubectl rollout history deployment/aragora -n aragora

# Rollback to previous version
kubectl rollout undo deployment/aragora -n aragora

# Rollback to specific revision
kubectl rollout undo deployment/aragora -n aragora --to-revision=3

# Verify rollback
kubectl rollout status deployment/aragora -n aragora

Manual Rollback (Docker/Systemd)

# Stop current version
systemctl stop aragora

# Switch symlink to previous version
cd /opt/aragora
rm current
ln -s releases/v0.9.0 current # Previous version

# Start
systemctl start aragora

# Verify
curl http://localhost:8080/api/health | jq '.version'

Database Rollback

# Stop service first
kubectl scale deployment aragora --replicas=0 -n aragora

# Restore from backup
pg_restore -d aragora /backups/aragora_pre_release.dump

# Restart service
kubectl scale deployment aragora --replicas=3 -n aragora

Alert Configuration

Critical Alerts (PagerDuty/Immediate)

Alert                 Condition                      Response Time
AragoraDown           Health check fails for 2 min   Immediate
HighErrorRate         >5% error rate for 5 min       5 min
DatabaseDown          DB check fails for 1 min       Immediate
CertificateExpiring   <7 days until expiry           24 hours

Warning Alerts (Slack/Email)

Alert                Condition                Response Time
HighMemory           >80% memory for 10 min   1 hour
CircuitBreakerOpen   Any agent circuit open   15 min
HighLatency          p95 >500ms for 10 min    1 hour
RateLimitSpike       >1000 429s in 5 min      30 min

Sign-Off

Role               Name         Date         Signature
Engineering Lead
Security Review
Operations
Product Owner