Disaster Recovery Runbook
Comprehensive disaster recovery procedures for Aragora.
Table of Contents
- Recovery Objectives
- Incident Classification
- Backup Strategy
- Recovery Procedures
- Service Recovery Checklist
- Communication Templates
- Post-Incident Review
- Emergency Contacts
- Component-Specific Recovery
- Automated Recovery Integration
- Disaster Recovery Testing Schedule
- Programmatic Backup API
- SOC 2 Compliance
- Related Documentation
Recovery Objectives
Recovery Time Objective (RTO)
| Service Tier | RTO Target | Description |
|---|---|---|
| Critical | < 1 hour | Core debate engine, authentication |
| High | < 4 hours | API endpoints, WebSocket streaming |
| Medium | < 8 hours | Analytics, leaderboards, telemetry |
| Low | < 24 hours | Historical data, archives |
Recovery Point Objective (RPO)
| Data Type | RPO Target | Backup Frequency |
|---|---|---|
| User accounts | < 1 hour | Continuous replication |
| Active debates | < 15 minutes | Transaction log shipping |
| Debate history | < 1 hour | Hourly snapshots |
| Agent ratings | < 1 hour | Hourly snapshots |
| Audit logs | < 24 hours | Daily backups |
| Telemetry | < 24 hours | Daily exports |
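The RPO targets above can be checked mechanically against backup timestamps. A minimal sketch using only the standard library; the data-type keys mirror the table and are illustrative slugs, not identifiers from the Aragora codebase:

```python
import time

# RPO targets from the table above, in seconds (illustrative keys)
RPO_TARGETS = {
    "user_accounts": 3600,
    "active_debates": 15 * 60,
    "debate_history": 3600,
    "agent_ratings": 3600,
    "audit_logs": 24 * 3600,
    "telemetry": 24 * 3600,
}

def rpo_violations(backup_mtimes, now=None):
    """Return data types whose newest backup is older than its RPO target.

    backup_mtimes maps a data-type slug to the mtime (epoch seconds)
    of its most recent backup.
    """
    now = time.time() if now is None else now
    return [
        data_type
        for data_type, mtime in backup_mtimes.items()
        if now - mtime > RPO_TARGETS.get(data_type, 0)
    ]
```

Wiring this into the daily verification script would flag any tier drifting past its objective before an incident forces the question.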
Incident Classification
Severity Levels
| Level | Name | Definition | Response Time |
|---|---|---|---|
| SEV-1 | Critical | Complete service outage, data loss risk | < 15 minutes |
| SEV-2 | High | Major functionality impaired | < 1 hour |
| SEV-3 | Medium | Partial service degradation | < 4 hours |
| SEV-4 | Low | Minor issues, workarounds exist | < 24 hours |
Common Incident Types
| Type | Severity | Description |
|---|---|---|
| Database corruption | SEV-1 | Data integrity compromised |
| Complete outage | SEV-1 | All services unavailable |
| Authentication failure | SEV-1 | Users cannot log in |
| Data breach | SEV-1 | Unauthorized data access |
| API degradation | SEV-2 | Slow or failing API calls |
| Agent failures | SEV-2 | LLM providers unavailable |
| Memory exhaustion | SEV-2 | Server OOM conditions |
| Partial outage | SEV-3 | Some features unavailable |
| Performance issues | SEV-3 | Slow response times |
| Scheduled maintenance | SEV-4 | Planned downtime |
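For triage tooling, the two tables above collapse into a pair of lookups. A hedged sketch — the incident-type slugs and the SEV-3 default for unknown types are my choices, not codebase identifiers:

```python
from datetime import datetime, timedelta

# Severity -> maximum response time, from the severity table above
RESPONSE_TIME = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
    "SEV-4": timedelta(hours=24),
}

# Incident type -> default severity, from the classification table above
DEFAULT_SEVERITY = {
    "database_corruption": "SEV-1",
    "complete_outage": "SEV-1",
    "authentication_failure": "SEV-1",
    "data_breach": "SEV-1",
    "api_degradation": "SEV-2",
    "agent_failures": "SEV-2",
    "memory_exhaustion": "SEV-2",
    "partial_outage": "SEV-3",
    "performance_issues": "SEV-3",
    "scheduled_maintenance": "SEV-4",
}

def triage(incident_type, detected_at):
    """Return (severity, response deadline) for a detected incident.

    Unknown types default to SEV-3 (a conservative illustrative choice).
    """
    severity = DEFAULT_SEVERITY.get(incident_type, "SEV-3")
    return severity, detected_at + RESPONSE_TIME[severity]
```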
Backup Strategy
Automated Backups
Database Backups
# SQLite backup (default)
# Location: .nomic/backups/
# Frequency: Hourly
# Retention: 14 days
# Verify backup schedule
ls -la .nomic/backups/
# Manual backup
python scripts/migrate_databases.py --backup
PostgreSQL Backups (Production)
# Continuous archiving (WAL)
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
# Full backup (daily)
pg_dump -Fc aragora > /backup/aragora_$(date +%Y%m%d).dump
# Point-in-time recovery enabled
restore_command = 'cp /backup/wal/%f %p'
Redis Backups
# RDB snapshots (default every 15 minutes)
save 900 1
save 300 10
save 60 10000
# AOF persistence
appendonly yes
appendfsync everysec
# Manual backup
redis-cli BGSAVE
cp /var/lib/redis/dump.rdb /backup/redis_$(date +%Y%m%d).rdb
Backup Verification
Daily Verification Script
#!/bin/bash
# scripts/verify_backups.sh
set -e
echo "=== Backup Verification $(date) ==="
# 1. Check backup files exist
BACKUP_DIR=".nomic/backups"
LATEST_BACKUP=$(ls -t "$BACKUP_DIR" | head -1)
if [ -z "$LATEST_BACKUP" ]; then
echo "ERROR: No backups found!"
exit 1
fi
echo "Latest backup: $LATEST_BACKUP"
# 2. Verify backup age (should be < 24 hours old)
# Note: 'stat -f %m' is BSD/macOS syntax; use 'stat -c %Y' on Linux
BACKUP_AGE=$(( ($(date +%s) - $(stat -f %m "$BACKUP_DIR/$LATEST_BACKUP")) / 3600 ))
if [ "$BACKUP_AGE" -gt 24 ]; then
echo "WARNING: Backup is $BACKUP_AGE hours old!"
fi
# 3. Validate database integrity
echo "Validating database integrity..."
python scripts/migrate_databases.py --validate
# 4. Test restore to temporary location
TEMP_DIR=$(mktemp -d)
echo "Testing restore to $TEMP_DIR..."
cp -r "$BACKUP_DIR/$LATEST_BACKUP"/* "$TEMP_DIR/"
# 5. Verify restored data
sqlite3 "$TEMP_DIR/aragora_users.db" "SELECT COUNT(*) FROM users" > /dev/null
sqlite3 "$TEMP_DIR/aragora_debates.db" "SELECT COUNT(*) FROM debates" > /dev/null
echo "Restore verification: PASSED"
# Cleanup
rm -rf "$TEMP_DIR"
echo "=== Verification Complete ==="
Weekly Full Recovery Test
#!/bin/bash
# scripts/test_full_recovery.sh
# Run in isolated environment (Docker)
docker-compose -f docker-compose.recovery-test.yml up -d
# Restore from backup
docker exec aragora-recovery python scripts/migrate_databases.py --restore /backup/latest
# Run health checks
docker exec aragora-recovery curl -f http://localhost:8080/api/health
# Run integration tests
docker exec aragora-recovery pytest tests/integration/ -v --timeout=300
# Cleanup
docker-compose -f docker-compose.recovery-test.yml down -v
Recovery Procedures
Procedure 1: Database Recovery (SQLite)
Scenario: Database corruption or data loss
Steps:
1. Stop the service
# Kubernetes
kubectl scale deployment aragora --replicas=0
# Systemd
sudo systemctl stop aragora
2. Assess the damage
# Check database integrity
sqlite3 .nomic/aragora_users.db "PRAGMA integrity_check"
sqlite3 .nomic/aragora_debates.db "PRAGMA integrity_check"
3. List available backups
ls -la .nomic/backups/
# Output shows: backup_YYYYMMDD_HHMMSS/
4. Restore from backup
# Move the corrupted files aside
mv .nomic/aragora_users.db .nomic/aragora_users.db.corrupted
mv .nomic/aragora_debates.db .nomic/aragora_debates.db.corrupted
# Restore from the chosen backup
BACKUP="backup_20260113_100000"
cp ".nomic/backups/$BACKUP/aragora_users.db" .nomic/
cp ".nomic/backups/$BACKUP/aragora_debates.db" .nomic/
5. Verify the restoration
python scripts/migrate_databases.py --validate
6. Restart the service
kubectl scale deployment aragora --replicas=3
# OR
sudo systemctl start aragora
7. Verify service health
curl http://localhost:8080/api/health
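The move-aside, copy, and integrity-check steps above can be scripted. A minimal stdlib-only sketch; the `.corrupted` suffix follows the manual procedure, but this helper is illustrative, not a shipped Aragora script:

```python
import shutil
import sqlite3
from pathlib import Path

def restore_sqlite(db_path: Path, backup_path: Path) -> None:
    """Set the corrupted file aside, copy the backup in, and verify integrity."""
    if db_path.exists():
        # Preserve the corrupted file for later analysis
        db_path.rename(db_path.with_suffix(db_path.suffix + ".corrupted"))
    shutil.copy2(backup_path, db_path)
    conn = sqlite3.connect(str(db_path))
    try:
        (status,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    if status != "ok":
        raise RuntimeError(f"integrity_check failed for {db_path}: {status}")
```

Keeping the corrupted copy on disk matters: the post-incident review needs it, and the rename is cheaper than a copy.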
Procedure 2: Database Recovery (PostgreSQL)
Scenario: PostgreSQL database failure
Steps:
1. Stop application connections
kubectl scale deployment aragora --replicas=0
2. Connect to PostgreSQL
psql -h $DB_HOST -U postgres
3. Drop and recreate the database (if necessary)
DROP DATABASE IF EXISTS aragora;
CREATE DATABASE aragora;
4. Restore from backup
# From pg_dump
pg_restore -d aragora /backup/aragora_YYYYMMDD.dump
# OR point-in-time recovery
# Edit recovery.conf:
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2026-01-13 10:00:00'
5. Run migrations
python scripts/migrate_databases.py --migrate
6. Restart the application
kubectl scale deployment aragora --replicas=3
Procedure 3: Redis Recovery
Scenario: Redis data loss or corruption
Steps:
1. Stop Redis
redis-cli SHUTDOWN NOSAVE
2. Restore the RDB snapshot
cp /backup/redis_YYYYMMDD.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
3. Start Redis
systemctl start redis
4. Verify the data
redis-cli INFO keyspace
redis-cli DBSIZE
Note: Aragora automatically falls back to in-memory storage if Redis is unavailable. Rate limiting and session data may be temporarily reset.
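The fallback behavior described in the note can be illustrated with a small wrapper. This is a generic pattern sketch, not Aragora's actual storage layer; `redis_client` stands in for any client whose calls may raise ConnectionError:

```python
class InMemoryStore:
    """Dict-backed stand-in used when Redis is unreachable."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

class FallbackStore:
    """Try Redis first; on connection errors, degrade to in-memory storage.

    Illustrative only: rate-limit counters and sessions stored here are
    lost on restart, which is the "temporarily reset" caveat above.
    """
    def __init__(self, redis_client):
        self._redis = redis_client
        self._memory = InMemoryStore()
        self.degraded = False
    def _backend(self):
        return self._memory if self.degraded else self._redis
    def get(self, key):
        try:
            return self._backend().get(key)
        except ConnectionError:
            self.degraded = True
            return self._memory.get(key)
    def set(self, key, value):
        try:
            self._backend().set(key, value)
        except ConnectionError:
            self.degraded = True
            self._memory.set(key, value)
```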
Procedure 4: Complete Service Recovery
Scenario: Total infrastructure failure
Steps:
1. Provision infrastructure
# Terraform (if using IaC)
cd terraform/
terraform apply -auto-approve
# OR Kubernetes
kubectl apply -f k8s/
2. Deploy the application
# Build and push the image
docker build -t aragora:recovery .
docker push registry/aragora:recovery
# Deploy
kubectl set image deployment/aragora aragora=registry/aragora:recovery
3. Restore databases
# Follow the database recovery procedures above
4. Restore secrets
# From backup or a secret manager
kubectl create secret generic aragora-secrets \
  --from-literal=ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  --from-literal=OPENAI_API_KEY=$OPENAI_API_KEY \
  --from-literal=ARAGORA_JWT_SECRET=$ARAGORA_JWT_SECRET
5. Verify all services
# Health check
curl http://aragora.example.com/api/health
# Run smoke tests
pytest tests/smoke/ -v
Procedure 5: Security Incident Response
Scenario: Suspected data breach or security incident
Steps:
1. Immediate containment
# Revoke all active sessions
redis-cli FLUSHDB
# Rotate the JWT secret (value must be base64-encoded for the data field)
kubectl patch secret aragora-secrets -p '{"data":{"ARAGORA_JWT_SECRET":"'$(openssl rand -base64 32 | tr -d '\n' | base64)'"}}'
# Restart the application (forces re-authentication)
kubectl rollout restart deployment/aragora
2. Preserve evidence
# Snapshot the current state
kubectl logs deployment/aragora > /evidence/logs_$(date +%Y%m%d_%H%M%S).log
# Back up databases before any changes
python scripts/migrate_databases.py --backup
# Export audit logs
sqlite3 .nomic/aragora_audit.db ".dump" > /evidence/audit_$(date +%Y%m%d).sql
3. Investigate
# Review audit logs
sqlite3 .nomic/aragora_audit.db "SELECT * FROM audit_log WHERE timestamp > datetime('now', '-24 hours')"
# Check for suspicious activity
grep -E "(failed|unauthorized|suspicious)" /evidence/logs_*.log
4. Notify stakeholders
- Use the communication templates below
- Contact security@aragora.ai
5. Remediate
- Patch the vulnerability if identified
- Reset affected user passwords
- Update firewall rules if necessary
6. Document
- Complete the post-incident review
- Update runbooks as needed
Service Recovery Checklist
Pre-Recovery
- Incident classified and severity assigned
- On-call personnel notified
- Communication sent to stakeholders
- Backup availability confirmed
- Recovery environment prepared
During Recovery
- Services stopped gracefully
- Corrupted data preserved for analysis
- Backup restored successfully
- Data integrity verified
- Services restarted in correct order:
- Database (PostgreSQL/SQLite)
- Cache (Redis)
- Application servers
- Load balancer health checks passing
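The restart order in the checklist is a dependency problem, and a small topological sort makes the ordering explicit. The service names and dependency map below are illustrative, not configuration the codebase ships:

```python
# Dependency map from the checklist above: each service lists what must be up first
DEPENDENCIES = {
    "database": [],
    "cache": ["database"],
    "app": ["database", "cache"],
    "load_balancer": ["app"],
}

def startup_order(deps):
    """Topologically sort services so dependencies start first."""
    order = []
    visiting = set()

    def visit(svc):
        if svc in order:
            return
        if svc in visiting:
            raise ValueError(f"dependency cycle at {svc}")
        visiting.add(svc)
        for dep in deps.get(svc, []):
            visit(dep)
        visiting.remove(svc)
        order.append(svc)

    for svc in deps:
        visit(svc)
    return order
```

Encoding the order this way keeps the runbook and any orchestration script from drifting apart: add a service to the map and the ordering follows.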
Post-Recovery
- All health checks passing
- User authentication working
- Debate creation/retrieval functional
- WebSocket connections established
- Agent API calls successful
- Rate limiting operational
- Metrics and logging flowing
- Users notified of recovery
- Post-incident review scheduled
Communication Templates
Initial Incident Notification
Subject: [ARAGORA] Service Incident - [SEVERITY]
Team,
We are currently investigating an incident affecting Aragora services.
**Status:** Investigating
**Severity:** [SEV-1/SEV-2/SEV-3/SEV-4]
**Impact:** [Brief description of user impact]
**Started:** [Timestamp]
We will provide updates every [15/30/60] minutes.
Current actions:
- [Action 1]
- [Action 2]
Next update: [Timestamp]
---
Aragora Incident Response Team
Status Update
Subject: [ARAGORA] Incident Update - [STATUS]
**Incident ID:** [ID]
**Status:** [Investigating/Identified/Monitoring/Resolved]
**Duration:** [X hours Y minutes]
**Update:**
[Description of current status and actions taken]
**Next Steps:**
- [Step 1]
- [Step 2]
**ETA to Resolution:** [Estimate or "TBD"]
Next update: [Timestamp]
Resolution Notification
Subject: [ARAGORA] Incident Resolved - [BRIEF SUMMARY]
Team,
The incident affecting Aragora services has been resolved.
**Resolution Time:** [Timestamp]
**Total Duration:** [X hours Y minutes]
**Root Cause:** [Brief summary]
**Actions Taken:**
- [Action 1]
- [Action 2]
**Data Impact:**
- [Any data loss or recovery details]
- [Time range affected]
A post-incident review will be conducted within 48 hours.
---
Aragora Incident Response Team
User-Facing Status Page
**Current Status: [Operational/Degraded/Outage]**
[Date] [Time] - [Update Message]
We are currently experiencing [issues with X / degraded performance / an outage].
Affected services:
- [Service 1]
- [Service 2]
What you may experience:
- [Symptom 1]
- [Symptom 2]
We are actively working to resolve this issue. Updates will be posted here.
Last updated: [Timestamp]
Post-Incident Review
Review Timeline
| Task | Deadline |
|---|---|
| Incident timeline documented | 24 hours |
| Post-incident review meeting | 48 hours |
| Action items assigned | 72 hours |
| Root cause analysis complete | 1 week |
| Preventive measures implemented | 2 weeks |
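The review deadlines above derive directly from the resolution timestamp. A sketch (task keys are illustrative slugs, not codebase identifiers):

```python
from datetime import datetime, timedelta

# Deadlines from the review timeline table, measured from incident resolution
REVIEW_DEADLINES = {
    "timeline_documented": timedelta(hours=24),
    "review_meeting": timedelta(hours=48),
    "action_items_assigned": timedelta(hours=72),
    "root_cause_complete": timedelta(weeks=1),
    "preventive_measures": timedelta(weeks=2),
}

def review_schedule(resolved_at):
    """Return the due date for each post-incident task."""
    return {task: resolved_at + delta for task, delta in REVIEW_DEADLINES.items()}
```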
Review Template
# Post-Incident Review: [Incident Title]
**Date:** [Date]
**Duration:** [X hours Y minutes]
**Severity:** [SEV-X]
**Author:** [Name]
## Summary
[1-2 paragraph summary of what happened]
## Timeline
| Time | Event |
|------|-------|
| HH:MM | [Event description] |
| HH:MM | [Event description] |
## Root Cause
[Detailed explanation of what caused the incident]
## Impact
- **Users affected:** [Number/percentage]
- **Data loss:** [Yes/No, details]
- **Revenue impact:** [If applicable]
## What Went Well
- [Item 1]
- [Item 2]
## What Could Be Improved
- [Item 1]
- [Item 2]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action] | [Name] | [Date] | [Open/Done] |
## Lessons Learned
[Key takeaways and recommendations]
Emergency Contacts
| Role | Contact | Escalation Time |
|---|---|---|
| On-call Engineer | [PagerDuty/Slack] | Immediate |
| Engineering Lead | [Contact] | 15 minutes |
| Security Team | security@aragora.ai | SEV-1 only |
| Infrastructure | [Contact] | As needed |
Component-Specific Recovery
Knowledge Mound Recovery
Scenario: Knowledge Mound data loss or corruption
Backup Procedures:
# Knowledge Mound stores data in SQLite
# Location: .nomic/knowledge_mound.db
# Manual backup
sqlite3 .nomic/knowledge_mound.db ".backup '/backup/km_$(date +%Y%m%d_%H%M%S).db'"
# Verify backup integrity
sqlite3 /backup/km_*.db "PRAGMA integrity_check"
sqlite3 /backup/km_*.db "SELECT COUNT(*) FROM knowledge_nodes"
Recovery Steps:
1. Stop KM operations
# Disable KM endpoints temporarily
curl -X POST http://localhost:8080/api/admin/features/knowledge-mound/disable
2. Restore from backup
mv .nomic/knowledge_mound.db .nomic/knowledge_mound.db.corrupted
cp /backup/km_YYYYMMDD_HHMMSS.db .nomic/knowledge_mound.db
3. Rebuild vector indices (if needed)
python -c "
from aragora.knowledge.mound import get_knowledge_mound
km = get_knowledge_mound()
km.rebuild_indices()
"
4. Re-enable KM
curl -X POST http://localhost:8080/api/admin/features/knowledge-mound/enable
Job Queue Recovery
Scenario: Interrupted jobs (transcription, routing, gauntlet)
The job queue system (aragora/queue/) automatically recovers interrupted jobs on startup. However, manual recovery may be needed after certain failures.
Recovery Commands:
# Check pending/interrupted jobs
python -c "
import asyncio
from aragora.queue.job_queue import JobQueueStore

store = JobQueueStore('.nomic/job_queue.db')

async def check():
    pending = await store.list_jobs(status='pending')
    processing = await store.list_jobs(status='processing')
    failed = await store.list_jobs(status='failed')
    print(f'Pending: {len(pending)}, Processing: {len(processing)}, Failed: {len(failed)}')

asyncio.run(check())
"
# Recover specific job types
python -c "
import asyncio
from aragora.queue.workers import (
    recover_interrupted_transcriptions,
    recover_interrupted_gauntlets,
    recover_interrupted_routing,
)

async def recover():
    t = await recover_interrupted_transcriptions()
    g = await recover_interrupted_gauntlets()
    r = await recover_interrupted_routing()
    print(f'Recovered: {t} transcriptions, {g} gauntlets, {r} routing jobs')

asyncio.run(recover())
"
# Manually retry failed jobs
python -c "
import asyncio
from aragora.queue.job_queue import JobQueueStore

store = JobQueueStore('.nomic/job_queue.db')

async def retry_failed():
    failed = await store.list_jobs(status='failed')
    for job in failed:
        if job.retry_count < 3:
            await store.update_job_status(job.job_id, 'pending')
            print(f'Retried job {job.job_id}')

asyncio.run(retry_failed())
"
Encrypted Secrets Recovery
Scenario: Encryption key loss or secrets corruption
CRITICAL: The encryption key (ARAGORA_ENCRYPTION_KEY) must be backed up securely. Without it, encrypted secrets cannot be recovered.
Key Backup:
# NEVER store encryption key in plain text alongside backups
# Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.)
# Example: Store in AWS Secrets Manager
aws secretsmanager create-secret \
--name aragora/encryption-key \
--secret-string "$ARAGORA_ENCRYPTION_KEY"
# Example: Store in HashiCorp Vault
vault kv put secret/aragora/encryption-key value="$ARAGORA_ENCRYPTION_KEY"
Recovery Steps:
1. Retrieve the encryption key from backup
# From AWS Secrets Manager
export ARAGORA_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
  --secret-id aragora/encryption-key \
  --query SecretString --output text)
# From HashiCorp Vault
export ARAGORA_ENCRYPTION_KEY=$(vault kv get -field=value secret/aragora/encryption-key)
2. Verify the key matches the encrypted data
python -c "
from aragora.security.encryption import get_encryption_service
svc = get_encryption_service()
# Will fail if the key is wrong
print('Encryption service initialized successfully')
"
3. If the key is lost and secrets cannot be recovered:
- Generate a new encryption key: openssl rand -hex 32
- Users must re-enter sensitive credentials (API keys, tokens)
- Update integration store entries with the new encrypted values
Debate Origin Recovery
Scenario: Lost debate origin mapping (debates not routed back to source)
Origin data (aragora/server/debate_origin.py) maps debates to their source (Slack, Discord, email, etc.).
Recovery Steps:
1. Check the origin store
python -c "
import asyncio
from aragora.server.debate_origin import get_origin_store

async def check():
    # SQLite-backed store
    store = get_origin_store()
    count = await store.count()
    print(f'Stored origins: {count}')

asyncio.run(check())
"
2. Rebuild from debate metadata (partial recovery)
python -c "
import asyncio
from aragora.storage.debate_store import get_debate_store
from aragora.server.debate_origin import get_origin_store, DebateOrigin

async def rebuild():
    debate_store = get_debate_store()
    origin_store = get_origin_store()
    debates = await debate_store.list_all()
    recovered = 0
    for d in debates:
        if d.metadata and 'source' in d.metadata:
            origin = DebateOrigin(
                debate_id=d.debate_id,
                platform=d.metadata['source'],
                channel_id=d.metadata.get('channel_id'),
                user_id=d.metadata.get('user_id'),
            )
            await origin_store.save(origin)
            recovered += 1
    print(f'Recovered {recovered} origins from debate metadata')

asyncio.run(rebuild())
"
Consensus Healing Recovery
Scenario: Multiple stale or failed consensus states
The consensus healing worker (aragora/queue/workers/consensus_healing_worker.py) automatically identifies and heals problematic consensus states.
Manual Healing:
# Start consensus healing with a custom config
python -c "
import asyncio
from aragora.queue.workers import (
    ConsensusHealingWorker,
    HealingConfig,
    HealingAction,
)

async def heal():
    config = HealingConfig(
        enabled=True,
        scan_interval_seconds=60,
        stale_threshold_hours=24,
        max_auto_actions_per_scan=10,
        allowed_actions=[
            HealingAction.RE_DEBATE,
            HealingAction.EXTEND_ROUNDS,
            HealingAction.ARCHIVE,
        ],
    )
    worker = ConsensusHealingWorker(config=config)
    # Single scan
    candidates = await worker._scan_for_candidates()
    print(f'Found {len(candidates)} healing candidates')
    for c in candidates:
        print(f'  - {c.debate_id}: {c.reason.value}')

asyncio.run(heal())
"
# Force-archive old stale debates
python -c "
import asyncio
from aragora.queue.workers import get_consensus_healing_worker

async def archive_stale():
    worker = get_consensus_healing_worker()
    results = await worker.force_archive_stale(older_than_days=30)
    print(f'Archived {len(results)} stale debates')

asyncio.run(archive_stale())
"
Human Checkpoint Recovery
Scenario: Lost pending approvals after restart
Human checkpoints (aragora/workflow/nodes/human_checkpoint.py) persist to the GovernanceStore, so pending approvals survive restarts.
Recovery Steps:
# List all pending approvals
python -c "
import asyncio
from aragora.storage.governance_store import get_governance_store

async def list_pending():
    store = get_governance_store()
    pending = await store.list_approvals(status='pending')
    print(f'Pending approvals: {len(pending)}')
    for p in pending:
        print(f'  - {p.approval_id}: {p.title} (requested: {p.requested_at})')

asyncio.run(list_pending())
"
# Recover pending approvals on startup
python -c "
import asyncio
from aragora.workflow.nodes.human_checkpoint import HumanCheckpointNode

async def recover():
    node = HumanCheckpointNode()
    recovered = await node.recover_pending_approvals()
    print(f'Recovered {recovered} pending approvals')

asyncio.run(recover())
"
Automated Recovery Integration
Startup Recovery Hooks
Add to your application startup sequence:
# aragora/server/startup.py
async def run_recovery_hooks():
    """Run automated recovery on startup."""
    from aragora.queue.workers import (
        recover_interrupted_transcriptions,
        recover_interrupted_gauntlets,
        recover_interrupted_routing,
    )
    from aragora.workflow.nodes.human_checkpoint import HumanCheckpointNode

    # Recover interrupted jobs
    await recover_interrupted_transcriptions()
    await recover_interrupted_gauntlets()
    await recover_interrupted_routing()

    # Recover pending approvals
    checkpoint = HumanCheckpointNode()
    await checkpoint.recover_pending_approvals()

    # Start consensus healing (background)
    from aragora.queue.workers import start_consensus_healing
    await start_consensus_healing()
Deployment Validation
Before accepting traffic after recovery, run deployment validation:
python -c "
import asyncio
from aragora.deploy.validator import validate_deployment, ValidationLevel

async def validate():
    result = await validate_deployment(level=ValidationLevel.FULL)
    if not result.passed:
        print('Deployment validation FAILED:')
        for check in result.checks:
            if not check.passed:
                print(f'  - {check.name}: {check.message}')
        exit(1)
    print('Deployment validation PASSED')

asyncio.run(validate())
"
Disaster Recovery Testing Schedule
Regular DR testing ensures procedures remain valid and teams stay practiced.
Testing Cadence
| Test Type | Frequency | Last Tested | Next Scheduled |
|---|---|---|---|
| Backup restoration | Monthly | - | TBD |
| Database failover | Quarterly | - | TBD |
| Full DR drill | Annually | - | TBD |
| Security incident drill | Quarterly | - | TBD |
| Communication drill | Quarterly | - | TBD |
Monthly Backup Restoration Test
# 1. Select a random backup from the past week
BACKUP=$(find .nomic/backups/ -mindepth 1 -maxdepth 1 -mtime -7 | shuf -n 1)
# 2. Restore to isolated environment
aragora backup restore $BACKUP --output /tmp/dr-test/aragora.db
# 3. Verify data integrity
sqlite3 /tmp/dr-test/aragora.db "PRAGMA integrity_check"
sqlite3 /tmp/dr-test/aragora.db "SELECT COUNT(*) FROM debates"
# 4. Document results
echo "$(date): Tested $BACKUP - PASSED" >> /var/log/dr-test.log
# 5. Cleanup
rm -rf /tmp/dr-test/
Quarterly Failover Drill
- Announce drill to team (30 min notice)
- Simulate primary database failure
- Execute failover procedure
- Verify service restoration within RTO
- Document time to recovery
- Fail back to primary
- Post-drill review meeting
Annual Full DR Exercise
- Schedule 4-hour window
- Notify all stakeholders
- Simulate complete infrastructure loss
- Execute full recovery from backups
- Verify all services operational
- Measure RTO/RPO compliance
- Document lessons learned
- Update runbook as needed
Programmatic Backup API
BackupManager
The BackupManager provides comprehensive backup operations with verification:
from aragora.backup.manager import (
    BackupManager,
    BackupType,
    RetentionPolicy,
)

# Initialize with custom retention
policy = RetentionPolicy(
    keep_daily=7,
    keep_weekly=4,
    keep_monthly=3,
    min_backups=1,
)
manager = BackupManager(
    backup_dir="/var/backups/aragora",
    retention_policy=policy,
    compression=True,
    verify_after_backup=True,
)

# Create a backup
backup = manager.create_backup(
    source_path="/path/to/aragora.db",
    backup_type=BackupType.FULL,
    metadata={"reason": "pre-deployment"},
)
print(f"Backup created: {backup.id}, verified: {backup.verified}")

# Comprehensive verification
result = manager.verify_restore_comprehensive(backup.id)
print(f"Schema valid: {result.schema_validation.valid}")
print(f"Integrity valid: {result.integrity_check.valid}")
print(f"Table checksums valid: {result.table_checksums_valid}")

# Restore with dry run
manager.restore_backup(backup.id, "/path/to/restore.db", dry_run=True)

# Clean up expired backups
deleted = manager.cleanup_expired()
print(f"Deleted {len(deleted)} expired backups")
BackupScheduler
Automated backup scheduling with DR drill integration:
from datetime import time

from aragora.backup.scheduler import (
    BackupScheduler,
    BackupSchedule,
    start_backup_scheduler,
)

# Configure the schedule
schedule = BackupSchedule(
    hourly_minute=30,           # :30 every hour
    daily=time(2, 0),           # 2 AM daily
    weekly_day=6,               # Sunday
    weekly_time=time(3, 0),     # 3 AM Sunday
    monthly_day=1,              # 1st of the month
    monthly_time=time(4, 0),    # 4 AM
    verify_after_backup=True,
    retention_cleanup_after=True,
    enable_dr_drills=True,
    dr_drill_interval_days=30,  # Monthly DR drills
)

# Start the scheduler
scheduler = await start_backup_scheduler(backup_manager, schedule)

# Manual backup
job = await scheduler.backup_now(verify=True, cleanup=True)
print(f"Backup job: {job.id}, status: {job.status}")

# Get stats
stats = scheduler.get_stats()
print(f"Total: {stats.total_backups}, Success: {stats.successful_backups}")
print(f"Next daily: {stats.next_daily}")
DRDrillScheduler
Automated DR drill scheduling for SOC 2 CC9 compliance:
from aragora.scheduler.dr_drill_scheduler import (
    DRDrillScheduler,
    DRDrillConfig,
    DrillType,
    ComponentType,
    get_dr_drill_scheduler,
)

# Configure drills
config = DRDrillConfig(
    monthly_drill_day=15,        # 15th of each month
    quarterly_drill_months=[3, 6, 9, 12],
    annual_drill_month=1,        # January
    target_rto_seconds=3600,     # 1 hour
    target_rpo_seconds=300,      # 5 minutes
)
scheduler = DRDrillScheduler(config)
await scheduler.start()

# Manual drill execution
drill = await scheduler.execute_drill(
    drill_type=DrillType.BACKUP_RESTORATION,
    components=[ComponentType.DATABASE, ComponentType.OBJECT_STORAGE],
)
print(f"Drill: {drill.drill_id}")
print(f"Status: {drill.status.value}")
print(f"RTO: {drill.rto_seconds:.1f}s (target: {drill.target_rto_seconds}s)")
print(f"RPO: {drill.rpo_seconds:.1f}s (target: {drill.target_rpo_seconds}s)")
print(f"Compliant: {drill.is_compliant}")
print(f"Recommendations: {drill.recommendations}")

# Compliance report
report = scheduler.get_compliance_report()
print(f"Compliance rate: {report['compliance_rate']:.0%}")
print(f"Average RTO: {report['average_rto_seconds']:.0f}s")
DR Drill CLI
# Full DR drill
python scripts/dr_drill.py --mode full \
--api-url https://api.aragora.ai \
--backup-dir /var/backups/aragora
# Backup verification only
python scripts/dr_drill.py --mode backup
# Failover testing only
python scripts/dr_drill.py --mode failover
# Generate report
python scripts/dr_drill.py --mode full --output dr_report.md
SOC 2 Compliance
This runbook supports the following SOC 2 Trust Services Criteria:
| Control | Description | How Addressed |
|---|---|---|
| CC9.1 | Business continuity planning | Automated DR drills, documented procedures |
| CC9.2 | Recovery testing | Monthly backup restoration, quarterly failover |
| A1.2 | Backup procedures | Automated daily/weekly/monthly backups |
| A1.3 | Recovery procedures | Step-by-step restore procedures |
Compliance Metrics
Track these metrics for SOC 2 audits:
| Metric | Target | Measurement |
|---|---|---|
| RTO | < 1 hour | Measured during DR drills |
| RPO | < 5 minutes | Backup frequency verification |
| Backup verification rate | 100% | All backups verified after creation |
| DR drill success rate | > 95% | Monthly drill pass rate |
| Backup retention compliance | 100% | Retention policy enforcement |
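These metrics are simple to compute from drill records. A sketch with the table's thresholds hard-coded; this is illustrative glue, not the DRDrillScheduler compliance-report API:

```python
def drill_success_rate(results):
    """Fraction of drills that passed, from a list of booleans."""
    if not results:
        return 0.0
    return sum(results) / len(results)

def meets_targets(rto_seconds, rpo_seconds, success_rate):
    """Check measured values against the targets in the table above:
    RTO < 1 hour, RPO < 5 minutes, drill success rate > 95%."""
    return rto_seconds < 3600 and rpo_seconds < 300 and success_rate > 0.95
```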
Related Documentation
- SECURITY.md - Security policies and incident response
- TROUBLESHOOTING.md - Common issues and solutions
- RUNBOOK.md - Operational procedures
- DATABASE.md - Database operations and encryption
- SECRETS_MANAGEMENT.md - Encryption key management
- QUEUE.md - Job queue operations
- RBAC_MATRIX.md - Role-based access control permissions