
Aragora Deployment Guide

Version: 2.3.0 Last Updated: January 25, 2026

This guide covers deploying Aragora to production environments.

Prerequisites

  • Docker 20.10+
  • Kubernetes 1.25+ (for K8s deployment)
  • At least one AI provider API key (Anthropic, OpenAI, etc.)

Quick Start: Docker Compose

The simplest way to run Aragora in production:

# 1. Copy environment template
cp .env.example .env

# 2. Edit .env with your API keys
vim .env

# 3. Start services
docker compose up -d

# 4. Check health
curl http://localhost:8080/api/health
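Containers can take a few seconds to start accepting requests after `docker compose up -d`, so a single curl may fail spuriously. The following is a minimal polling sketch, assuming the health endpoint returns HTTP 200 once the service is up; `http_status` and `wait_for_health` are hypothetical helper names, not part of Aragora.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_health(url, attempts=30, delay=2.0, fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200; True on success, False once attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_for_health("http://localhost:8080/api/health")` would poll the Quick Start endpoint for up to about a minute with the default settings. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.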

Production Readiness Checklist

Use this checklist before deploying to production:

Environment Setup

  • All required environment variables configured (see ENVIRONMENT.md)
  • At least one AI provider API key set (ANTHROPIC_API_KEY or OPENAI_API_KEY)
  • Database connection configured (ARAGORA_POSTGRES_DSN)
  • Redis URL configured for caching (ARAGORA_REDIS_URL)
  • Secrets properly encrypted (not plain text in configs)
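The variable-related items in this list can be checked mechanically before a deploy. This is a small sketch, assuming only the variable names given above; `missing_settings` is a hypothetical helper, and it checks presence only, not whether the values are valid.

```python
def missing_settings(env):
    """Return the checklist settings that are unset or empty in env (a dict-like mapping)."""
    missing = []
    # At least one AI provider key must be present.
    if not (env.get("ANTHROPIC_API_KEY") or env.get("OPENAI_API_KEY")):
        missing.append("ANTHROPIC_API_KEY or OPENAI_API_KEY")
    # Database and cache connection settings.
    for name in ("ARAGORA_POSTGRES_DSN", "ARAGORA_REDIS_URL"):
        if not env.get(name):
            missing.append(name)
    return missing
```

Passing `os.environ` (after `import os`) checks the current process environment; an empty list means all three checklist items are satisfied.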

Security

  • TLS/HTTPS configured with valid certificates
  • CORS origins restricted (ARAGORA_ALLOWED_ORIGINS)
  • Rate limiting enabled
  • API authentication enabled (API keys or OAuth)
  • RBAC roles and permissions configured
  • Audit logging enabled

Infrastructure

  • Container image built and pushed to registry
  • Resource limits set (CPU, memory)
  • Health checks configured (/api/health, /api/ready)
  • Persistent storage configured for database
  • Backup strategy in place

Monitoring

  • Prometheus metrics endpoint enabled (/metrics)
  • Grafana dashboards deployed
  • Alerting rules configured
  • Log aggregation configured
  • Error tracking (Sentry) configured

Performance

  • Load tested with expected traffic
  • Response time < 500ms for P99
  • Memory usage stable under load
  • Connection pooling configured
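To check the P99 target against raw latency samples from a load test, a nearest-rank percentile is usually enough. The sketch below is one way to compute it; the function names are illustrative, and the nearest-rank method is an assumption (load-testing tools may interpolate differently).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_p99_target(latencies_ms, target_ms=500.0):
    """True if the P99 latency is strictly below the target (500 ms by default)."""
    return percentile(latencies_ms, 99) < target_ms
```

With 100 samples, P99 is the 99th-smallest value, so one slow outlier is tolerated but two are not.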

Verification Commands

# Verify health endpoints
curl -f https://your-domain/api/health
curl -f https://your-domain/api/ready

# Verify metrics
curl https://your-domain/metrics | head -20

# Verify WebSocket
wscat -c wss://your-domain/ws

# Verify API authentication
curl -H "Authorization: Bearer $TOKEN" https://your-domain/api/v1/agents

Kubernetes Deployment

1. Prepare Secrets

First, create your secrets file from the template:

cp deploy/k8s/secret.yaml deploy/k8s/secret-local.yaml
# Edit with your actual values
vim deploy/k8s/secret-local.yaml

For production, use Sealed Secrets:

kubeseal --format yaml < deploy/k8s/secret-local.yaml > deploy/k8s/sealed-secret.yaml

2. Build and Push Docker Image

# Build production image
docker build -t your-registry/aragora:latest .

# Push to registry
docker push your-registry/aragora:latest

3. Update Kustomization

Edit deploy/k8s/kustomization.yaml:

images:
  - name: aragora
    newName: your-registry/aragora
    newTag: v1.0.0

4. Deploy

# Apply all resources
kubectl apply -k deploy/k8s/

# Watch rollout
kubectl -n aragora rollout status deployment/aragora

# Check pods
kubectl -n aragora get pods

5. Configure Ingress

Edit deploy/k8s/ingress.yaml with your domain:

spec:
  tls:
    - hosts:
        - aragora.yourdomain.com
      secretName: aragora-tls
  rules:
    - host: aragora.yourdomain.com

6. TLS Configuration with cert-manager

Aragora includes cert-manager ClusterIssuers for automatic TLS certificate management.

Install cert-manager

# Install cert-manager (v1.14.0+)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl wait --for=condition=ready pod -l app=cert-manager -n cert-manager --timeout=120s
kubectl wait --for=condition=ready pod -l app=webhook -n cert-manager --timeout=120s
kubectl wait --for=condition=ready pod -l app=cainjector -n cert-manager --timeout=120s

Apply ClusterIssuers

The deploy/k8s/cert-manager.yaml file defines three ClusterIssuers:

| Issuer | Use Case |
|---|---|
| letsencrypt-staging | Testing (avoids rate limits, issues untrusted certs) |
| letsencrypt-prod | Production (real trusted certificates) |
| selfsigned-issuer | Local/dev environments |

# Update email in cert-manager.yaml first
vim deploy/k8s/cert-manager.yaml # Change admin@aragora.ai to your email

# Apply ClusterIssuers
kubectl apply -f deploy/k8s/cert-manager.yaml

Configure Ingress for TLS

The ingress is already configured to use cert-manager. Update the domain:

# deploy/k8s/ingress.yaml
metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod" # or letsencrypt-staging for testing
spec:
  tls:
    - hosts:
        - aragora.yourdomain.com
      secretName: aragora-tls
  rules:
    - host: aragora.yourdomain.com

Verify TLS Setup

# Check ClusterIssuers are ready
kubectl get clusterissuers

# Check certificate is issued
kubectl -n aragora get certificate

# Check certificate secret
kubectl -n aragora get secret aragora-tls

# Test HTTPS
curl -v https://aragora.yourdomain.com/api/health

Troubleshooting TLS

# Check certificate status
kubectl -n aragora describe certificate aragora-tls

# Check cert-manager logs
kubectl -n cert-manager logs -l app=cert-manager

# Check ACME challenges
kubectl -n aragora get challenges

Common Issues:

  1. Challenge failed: Ensure DNS points to your ingress controller
  2. Rate limited: Switch to letsencrypt-staging while testing
  3. Webhook timeout: Restart cert-manager pods

7. PostgreSQL Configuration (Multi-Instance Required)

For production multi-instance deployments, PostgreSQL is required instead of SQLite.

Deploy PostgreSQL StatefulSet

# Apply PostgreSQL resources
kubectl apply -f deploy/k8s/postgres-statefulset.yaml

# Wait for PostgreSQL to be ready
kubectl -n aragora wait --for=condition=ready pod postgres-0 --timeout=120s

Or Use Managed PostgreSQL

For production, consider managed services:

  • AWS RDS: postgresql://user:pass@rds-instance.region.rds.amazonaws.com:5432/aragora?sslmode=require
  • Google Cloud SQL: Use Cloud SQL Auth Proxy
  • Supabase: postgresql://postgres.project-ref:password@aws-0-region.pooler.supabase.com:6543/postgres

Configure via secrets:

# In aragora-secrets
stringData:
  ARAGORA_POSTGRES_DSN: "postgresql://aragora:password@postgres-primary:5432/aragora?sslmode=require"
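A malformed DSN usually surfaces only as a crash at startup, so it can help to sanity-check the string before sealing it into a secret. The sketch below uses only the Python standard library; `check_postgres_dsn` is a hypothetical helper, and treating a missing `sslmode` as a problem is an assumption based on the examples in this section.

```python
from urllib.parse import parse_qs, urlsplit


def check_postgres_dsn(dsn):
    """Return a list of problems found in a PostgreSQL DSN (an empty list means it looks OK)."""
    parts = urlsplit(dsn)
    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append("scheme must be postgresql://")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    # sslmode is passed as a query parameter, e.g. ?sslmode=require
    sslmode = parse_qs(parts.query).get("sslmode", [""])[0]
    if sslmode not in ("require", "verify-ca", "verify-full"):
        problems.append("sslmode should be require or stricter")
    return problems
```

The check is purely syntactic: it does not open a connection or verify credentials.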

Initialize Schema

# Run schema initialization
kubectl -n aragora exec -it deploy/aragora -- python scripts/init_postgres_db.py

# Verify tables
kubectl -n aragora exec -it deploy/aragora -- python scripts/init_postgres_db.py --verify

8. Database Migrations

For PostgreSQL deployments, run migrations before starting the application:

# Option 1: Manual migration (before first deploy)
kubectl apply -f deploy/k8s/migration-job.yaml
kubectl -n aragora wait --for=condition=complete job/aragora-migrate --timeout=120s
kubectl -n aragora logs job/aragora-migrate

# Option 2: With Argo CD (automatic)
# The migration job has PreSync hook annotations - runs automatically before each sync

# Check migration status
kubectl apply -f deploy/k8s/migration-job.yaml --dry-run=client -o yaml | \
grep -A 100 'name: aragora-migrate-status' | kubectl apply -f -
kubectl -n aragora logs job/aragora-migrate-status

For more database setup details, see POSTGRESQL_MIGRATION.md.

Environment Variables

Required

| Variable | Description |
|---|---|
| ANTHROPIC_API_KEY | Anthropic API key for Claude |
| OPENAI_API_KEY | OpenAI API key (alternative to Anthropic) |

| Variable | Default | Description |
|---|---|---|
| ARAGORA_REDIS_URL | redis://localhost:6379/0 | Redis for rate limiting and caching |
| REDIS_URL | redis://localhost:6379 | Legacy Redis URL used by queues/oauth/token revocation |
| ARAGORA_JWT_SECRET | (required for auth) | 32+ character secret for JWT tokens |
| ARAGORA_API_TOKEN | (optional) | API token for authenticated endpoints |
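ARAGORA_JWT_SECRET needs at least 32 characters. One way to produce a suitable value is Python's `secrets` module; the sketch below is a suggestion rather than a documented Aragora procedure, and the function names are hypothetical.

```python
import secrets


def generate_jwt_secret(num_bytes=48):
    """Return a URL-safe random string; 48 random bytes encode to 64 characters."""
    return secrets.token_urlsafe(num_bytes)


def is_valid_jwt_secret(value, min_length=32):
    """Check only the documented length requirement (32+ characters)."""
    return len(value) >= min_length
```

The generated value can be pasted into `.env` or the Kubernetes secret; it should never be committed to version control.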

Optional Providers

| Variable | Description |
|---|---|
| OPENROUTER_API_KEY | OpenRouter for fallback |
| GEMINI_API_KEY | Google Gemini |
| XAI_API_KEY | xAI Grok |
| MISTRAL_API_KEY | Mistral AI |

Billing (Optional)

| Variable | Description |
|---|---|
| STRIPE_SECRET_KEY | Stripe API key |
| STRIPE_WEBHOOK_SECRET | Stripe webhook signing secret |

Multi-Tenant Configuration (v2.0.0+)

| Variable | Default | Description |
|---|---|---|
| ARAGORA_MULTI_TENANT | false | Enable multi-tenant isolation |
| ARAGORA_DEFAULT_TENANT | default | Default tenant ID for legacy requests |
| ARAGORA_TENANT_HEADER | X-Tenant-ID | HTTP header for tenant identification |

Tenant Quotas

| Variable | Default | Description |
|---|---|---|
| ARAGORA_QUOTA_API_CALLS | 100000 | Monthly API call limit per tenant |
| ARAGORA_QUOTA_TOKENS | 10000000 | Monthly token limit per tenant |
| ARAGORA_QUOTA_STORAGE_GB | 100 | Storage limit in GB per tenant |
| ARAGORA_QUOTA_DEBATES | 1000 | Monthly debate limit per tenant |
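To see how these limits relate to the 90% quota alert threshold mentioned under Monitoring, the following sketch computes per-quota utilization from the default values above. The dictionary keys and function names are illustrative only; how Aragora tracks usage internally is not specified here.

```python
# Default limits from the table above (hypothetical short key names).
DEFAULT_QUOTAS = {
    "api_calls": 100_000,
    "tokens": 10_000_000,
    "storage_gb": 100,
    "debates": 1_000,
}


def quota_usage(usage, quotas=DEFAULT_QUOTAS):
    """Return usage / limit for each quota key; usage missing for a key counts as 0."""
    return {key: usage.get(key, 0) / limit for key, limit in quotas.items()}


def near_limit(usage, threshold=0.9, quotas=DEFAULT_QUOTAS):
    """Return the sorted quota keys whose utilization is strictly above the threshold."""
    ratios = quota_usage(usage, quotas)
    return sorted(key for key, ratio in ratios.items() if ratio > threshold)
```

For example, a tenant with 95,000 API calls this month is at 0.95 utilization and would be flagged, while one at exactly 900 debates (0.9) would not.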

API Versioning (v2.0.0+)

| Variable | Default | Description |
|---|---|---|
| ARAGORA_API_VERSION | v2 | Current API version |
| ARAGORA_API_LEGACY_ENABLED | true | Support legacy unversioned endpoints |
| ARAGORA_API_V1_SUNSET | 2026-12-31 | Sunset date for API v1 |

The API supports both URL prefix versioning (/api/v2/debates) and header-based versioning (X-API-Version: v2).
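The two versioning mechanisms can be pictured as a small resolution function. The sketch below is an illustration under assumptions: the guide does not say which mechanism wins when both are present, so the precedence (URL prefix first, then header, then default) and the name `resolve_api_version` are hypothetical.

```python
import re

# Matches /api/v2, /api/v2/..., but not /api/version or /api/health.
_PATH_VERSION = re.compile(r"^/api/(v\d+)(?:/|$)")


def resolve_api_version(path, headers, default="v2"):
    """Pick the version from the URL prefix, then the X-API-Version header, then the default."""
    match = _PATH_VERSION.match(path)
    if match:
        return match.group(1)
    # HTTP header names are case-insensitive.
    for name, value in headers.items():
        if name.lower() == "x-api-version" and value.strip():
            return value.strip().lower()
    return default
```

Under these assumptions, `/api/v2/debates` resolves to v2 regardless of headers, and an unversioned path with `X-API-Version: v1` resolves to v1.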

Metering & Usage Tracking (v2.0.0+)

| Variable | Default | Description |
|---|---|---|
| ARAGORA_METERING_ENABLED | true | Enable usage metering |
| ARAGORA_METERING_FLUSH_INTERVAL | 60 | Seconds between metering flushes |
| ARAGORA_METERING_BACKEND | prometheus | Metering backend (prometheus/statsd) |

Resource Requirements

Minimum (Development)

  • CPU: 0.5 cores
  • Memory: 512MB
  • Storage: 1GB

Recommended (Production)

  • CPU: 2 cores
  • Memory: 2GB
  • Storage: 10GB
  • Redis: 256MB

Scaling Guidelines

| Concurrent Debates | Replicas | CPU | Memory |
|---|---|---|---|
| 1-5 | 1 | 1 core | 1GB |
| 5-20 | 2-3 | 2 cores | 2GB |
| 20-50 | 3-5 | 4 cores | 4GB |
| 50+ | 5-10 | 8 cores | 8GB |
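The scaling guidelines can be expressed as a simple lookup for use in capacity-planning scripts. This is a sketch: the tiers share their boundary values (5, 20, 50), so assigning a boundary value to the smaller tier is an assumption, and `recommend_sizing` is a hypothetical name.

```python
# (upper bound of concurrent debates, replicas, CPU, memory); None means no upper bound.
SCALING_TIERS = [
    (5, "1", "1 core", "1GB"),
    (20, "2-3", "2 cores", "2GB"),
    (50, "3-5", "4 cores", "4GB"),
    (None, "5-10", "8 cores", "8GB"),
]


def recommend_sizing(concurrent_debates):
    """Return the guideline row for a given number of concurrent debates."""
    if concurrent_debates < 1:
        raise ValueError("concurrent_debates must be at least 1")
    for upper, replicas, cpu, memory in SCALING_TIERS:
        # Boundary values fall into the smaller tier (assumption).
        if upper is None or concurrent_debates <= upper:
            return {"replicas": replicas, "cpu": cpu, "memory": memory}
```

These figures are starting points; the load test results should drive the final resource requests.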

Health Checks

Liveness Probe

GET /api/v2/health

Returns 200 if server is running.

Readiness Probe

GET /api/v2/health/ready

Returns 200 if server can accept requests.

Monitoring

Prometheus Metrics

Metrics are exposed at /metrics (port 8080):

# prometheus.yml
scrape_configs:
  - job_name: 'aragora'
    static_configs:
      - targets: ['aragora:8080']

Key Metrics

| Metric | Description |
|---|---|
| aragora_debates_total | Total debates run |
| aragora_debate_duration_seconds | Debate duration histogram |
| aragora_agent_errors_total | Agent error count by type |
| aragora_consensus_rate | Consensus achievement rate |

v2.0.0 Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| aragora_rlm_compression_ratio | RLM context compression | < 0.5 |
| aragora_tenant_requests_total | Per-tenant request rate | - |
| aragora_connector_sync_duration_seconds | Connector sync time | p95 > 60s |
| aragora_billing_events_total | Billing events by tenant | - |
| aragora_quota_usage_ratio | Quota utilization per tenant | > 0.9 |

Grafana Dashboard

Import the pre-built dashboard from k8s/monitoring/aragora-dashboard.json:

# Port-forward Grafana
kubectl -n monitoring port-forward svc/grafana 3000:3000

# Import via Grafana UI:
# 1. Go to Dashboards > Import
# 2. Upload k8s/monitoring/aragora-dashboard.json
# 3. Select Prometheus data source

Alerting Rules

Apply alert rules from k8s/monitoring/alerts.yaml:

kubectl apply -f k8s/monitoring/alerts.yaml

Key alerts included:

  • AragoraHighErrorRate - Agent error rate > 10/min
  • AragoraSlowDebates - p95 debate duration > 5min
  • AragoraQuotaNearLimit - Tenant quota > 90%
  • AragoraConnectorSyncFailed - Connector sync failures

See docs/RUNBOOK_METRICS.md for alert response procedures.

Troubleshooting

Pod CrashLoopBackOff

  1. Check logs: kubectl -n aragora logs deploy/aragora
  2. Verify secrets: kubectl -n aragora get secret aragora-secrets -o yaml
  3. Check resource limits

Redis Connection Failed

  1. Verify Redis is running: kubectl -n aragora get pods -l app.kubernetes.io/name=aragora-redis
  2. Check service: kubectl -n aragora get svc aragora-redis
  3. Test connection: kubectl -n aragora exec -it deploy/aragora -- redis-cli -h aragora-redis ping

High Memory Usage

  1. Check debate queue: Reduce ARAGORA_MAX_CONCURRENT_DEBATES
  2. Enable memory limits in deployment
  3. Consider horizontal scaling via HPA

High Availability Deployment

For production deployments requiring high availability, Aragora includes pre-configured manifests for multi-replica, zone-distributed deployments.

HA Architecture

                    ┌─────────────────┐
                    │   Ingress/LB    │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │  Aragora  │  │  Aragora  │  │  Aragora  │
        │ Replica 1 │  │ Replica 2 │  │ Replica 3 │
        │  Zone A   │  │  Zone B   │  │  Zone C   │
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │      Redis      │
                    │   (sessions/    │
                    │  token store)   │
                    └─────────────────┘

Deploy HA Configuration

# Apply HA deployment (uses deploy/kubernetes/)
kubectl apply -k deploy/kubernetes/

# Verify replicas
kubectl -n aragora get pods -l app.kubernetes.io/name=aragora

# Check HPA status
kubectl -n aragora get hpa

# Check PodDisruptionBudget
kubectl -n aragora get pdb

Key HA Components

1. Horizontal Pod Autoscaler (HPA)

Automatically scales pods based on load:

# deploy/kubernetes/hpa.yaml
spec:
  minReplicas: 2 # Minimum for HA
  maxReplicas: 10 # Scale up under load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
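To reason about how this HPA reacts to load, it helps to work through the replica calculation. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current × currentUtilization / targetUtilization), with the default 10% tolerance, clamped to the min/max above. It is a simplification: stabilization windows, pod readiness, and missing metrics are ignored, and the function name is illustrative.

```python
import math


def desired_replicas(current, current_util, target_util=70,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica decision for a single CPU utilization metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: the HPA leaves the replica count unchanged.
        desired = current
    else:
        desired = math.ceil(current * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For instance, 3 replicas averaging 140% of requested CPU would scale to 6, while 3 replicas at 72% stay at 3 because the deviation is inside the tolerance band.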

2. Pod Disruption Budget (PDB)

Ensures availability during node maintenance:

# deploy/kubernetes/pdb.yaml
spec:
  minAvailable: 1 # At least 1 pod always running
  # OR: maxUnavailable: 1
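The effect of `minAvailable` on voluntary evictions (for example during a node drain) comes down to simple arithmetic. A minimal sketch, with a hypothetical function name:

```python
def allowed_disruptions(healthy_pods, min_available=1):
    """Number of pods that may be evicted voluntarily without violating minAvailable."""
    return max(0, healthy_pods - min_available)
```

With the HPA minimum of 2 replicas and `minAvailable: 1`, one pod can be evicted at a time. If only one pod is healthy, evictions are blocked until another becomes ready.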

3. Pod Anti-Affinity

Spreads pods across nodes/zones:

# In deployment.yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: aragora
          topologyKey: kubernetes.io/hostname

4. Topology Spread Constraints

Distributes across availability zones:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
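`maxSkew` limits how unevenly pods may be spread: for a candidate zone, the skew is the zone's pod count after placement minus the smallest count in any zone. The sketch below illustrates that arithmetic only. With `ScheduleAnyway` the scheduler treats the limit as a preference rather than a hard rule, and the function names are hypothetical.

```python
def zone_skew(pods_per_zone):
    """Current skew: the difference between the fullest and emptiest zone."""
    counts = pods_per_zone.values()
    return max(counts) - min(counts)


def zones_within_skew(pods_per_zone, max_skew=1):
    """Zones where placing one more pod keeps that zone's skew at or below max_skew."""
    minimum = min(pods_per_zone.values())
    return sorted(
        zone for zone, count in pods_per_zone.items()
        if count + 1 - minimum <= max_skew
    )
```

With two pods in zone A and one each in zones B and C, the next pod is preferred in B or C, since placing it in A would give a skew of 2.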

Redis for Shared State

The HA deployment uses Redis for:

  • Session storage (enables sticky-session-free load balancing)
  • Token blacklist (for logout across all replicas)
  • Rate limiting state

Deploy Redis:

kubectl apply -f deploy/k8s/redis/statefulset.yaml
kubectl apply -f deploy/k8s/redis/service.yaml

For Redis HA in production, consider:

  • Redis Sentinel for automatic failover
  • Redis Cluster for horizontal scaling
  • Managed Redis (AWS ElastiCache, GCP Memorystore)

Load Testing

Verify HA setup with included Locust tests:

# Install locust
pip install locust

# Run load test (headless)
locust -f tests/load/locustfile.py --host=https://aragora.yourdomain.com \
--headless -u 100 -r 10 --run-time 5m

# Or with web UI
locust -f tests/load/locustfile.py --host=https://aragora.yourdomain.com
# Open http://localhost:8089

HA Checklist

  • At least 2 replicas running
  • HPA configured and active
  • PDB prevents total outage during updates
  • Redis deployed for shared state
  • Pods spread across zones (check with kubectl get pods -o wide)
  • Health checks passing (/healthz, /readyz)
  • Load tested with expected traffic

Rollback Procedures (v2.0.0+)

Kubernetes Rollback

# View rollout history
kubectl -n aragora rollout history deployment/aragora

# Rollback to previous version
kubectl -n aragora rollout undo deployment/aragora

# Rollback to specific revision
kubectl -n aragora rollout undo deployment/aragora --to-revision=2

# Verify rollback
kubectl -n aragora rollout status deployment/aragora

Database Rollback

# Rollback one alembic migration
alembic downgrade -1

# Rollback to specific revision
alembic downgrade abc123

# Restore from backup
pg_restore -d aragora backup_20260118.dump

API Version Rollback

If you need to revert API changes while keeping the v2.0.0 server:

# Set environment to use legacy endpoints
export ARAGORA_API_VERSION=v1
export ARAGORA_API_LEGACY_ENABLED=true

# Apply config change
kubectl -n aragora set env deployment/aragora ARAGORA_API_VERSION=v1

Security Recommendations

  1. Use Sealed Secrets or External Secrets for API keys
  2. Enable TLS via cert-manager or your ingress controller
  3. Set resource limits to prevent resource exhaustion
  4. Use NetworkPolicies to restrict traffic
  5. Enable Pod Security Standards (restricted profile)
  6. Enable audit logging for multi-tenant environments
  7. Configure tenant isolation for shared deployments

Backup and Recovery

Database Backup

# SQLite backup (if using default storage)
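# Resolve one pod name so that exec and cp target the same pod
POD=$(kubectl -n aragora get pods -l app.kubernetes.io/name=aragora -o jsonpath='{.items[0].metadata.name}')
kubectl -n aragora exec "$POD" -- sqlite3 /app/data/aragora.db ".backup /tmp/backup.db"
kubectl -n aragora cp "$POD":/tmp/backup.db ./aragora-backup.db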

Redis Backup

# Trigger RDB snapshot
kubectl -n aragora exec aragora-redis-0 -- redis-cli BGSAVE

Pod Security Standards (PSS) Enforcement

Aragora enforces Kubernetes Pod Security Standards at the restricted level for maximum security hardening.

Namespace Configuration

The aragora namespace (deploy/k8s/namespace.yaml) enforces PSS with the following labels:

metadata:
  labels:
    # Enforce restricted policy - pods violating this will be rejected
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Audit restricted violations (logged but not rejected)
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    # Warn about restricted violations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

Security Context Requirements

All pods in the aragora namespace must comply with the restricted profile:

Pod-Level Security Context

All deployments, statefulsets, jobs, and cronjobs include:

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000 # Non-root user (varies by workload)
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault # Required for restricted profile

Container-Level Security Context

All containers include:

securityContext:
  allowPrivilegeEscalation: false # Required for restricted profile
  readOnlyRootFilesystem: true # Where possible
  capabilities:
    drop:
      - ALL # Required for restricted profile

Workload-Specific Notes

| Workload | UID | readOnlyRootFilesystem | Notes |
|---|---|---|---|
| aragora (main) | 1000 | true | Uses emptyDir for /tmp and /app/logs |
| aragora-backend | 1000 | true | Uses emptyDir for /tmp and /app/logs |
| aragora-frontend | 1001 | true | Uses emptyDir for /tmp and /.next/cache |
| aragora-redis | 999 | true | Uses volumeClaimTemplate for /data |
| postgres | 70 | false | PostgreSQL requires writable /var/run/postgresql |
| migration jobs | 1000 | false | Migration may need temp file writes |
| secrets-rotation | 1000 | true | Read-only rotation job |

Volume Mounts for Read-Only Root Filesystems

When readOnlyRootFilesystem: true, applications need writable directories via emptyDir volumes:

volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: cache
    mountPath: /app/.cache
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}

Verifying PSS Compliance

# Check namespace labels
kubectl get namespace aragora --show-labels

# Dry-run a pod to check compliance
kubectl run test --image=nginx --dry-run=server -n aragora

# Check for PSS violations in audit logs
kubectl logs -n kube-system -l component=kube-apiserver | grep "pod-security"

# List pods with their security contexts
kubectl get pods -n aragora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext}{"\n"}{end}'

Troubleshooting PSS Violations

Error: "pods violate PodSecurity 'restricted'"

  1. Check the specific violation in the error message
  2. Common fixes:
    • Add seccompProfile.type: RuntimeDefault to pod spec
    • Add allowPrivilegeEscalation: false to container spec
    • Add capabilities.drop: ["ALL"] to container spec
    • Set runAsNonRoot: true in pod spec

Error: "container has runAsNonRoot and image will run as root"

  1. Specify explicit runAsUser in the pod or container spec
  2. Or rebuild the container image to run as non-root
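The common fixes above can be checked before a manifest ever reaches the cluster. The following sketch inspects a pod spec that has already been loaded into a Python dict (for example from YAML). It covers only the four settings listed in this section, not the full restricted profile, and `restricted_violations` is a hypothetical helper.

```python
def restricted_violations(pod_spec):
    """Return human-readable problems for the four common restricted-profile settings."""
    problems = []
    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        ctx = container.get("securityContext") or {}

        # seccompProfile may be set on the container or inherited from the pod.
        seccomp = (ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}).get("type")
        if seccomp not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: set seccompProfile.type to RuntimeDefault")

        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: set allowPrivilegeEscalation: false")

        drops = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drops:
            problems.append(f"{name}: add capabilities.drop: [ALL]")

        # A container-level value overrides the pod-level one.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if run_as_non_root is not True:
            problems.append(f"{name}: set runAsNonRoot: true")
    return problems
```

An empty list means none of these four issues were found; the server-side dry run shown under Verifying PSS Compliance remains the authoritative check.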

Migrating from Baseline to Restricted

If migrating from baseline to restricted PSS:

  1. First, keep enforcement at baseline and set audit/warn to restricted:

    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

  2. Review the warnings returned to clients (for example by kubectl) and the audit annotations in the API server audit logs

  3. Update workloads to comply with restricted

  4. Enable restricted enforcement:

    pod-security.kubernetes.io/enforce: restricted

Reference