ELO Rating and Calibration System

This document describes how Aragora tracks agent skill ratings and calibration quality.

Overview

The ELO system provides:

  • Global ELO ratings for overall agent skill
  • Domain-specific ELO for specialized expertise
  • Calibration scores measuring prediction accuracy
  • Relationship tracking for agent dynamics

ELO Rating Basics

Standard ELO Formula

Expected score for agent A against agent B:

E(A) = 1 / (1 + 10^((R_B - R_A) / 400))

New rating after a match:

R'_A = R_A + K * (S_A - E(A))

Where:

  • R_A, R_B: Current ratings
  • K: K-factor (default: 32)
  • S_A: Actual score (1=win, 0.5=draw, 0=loss)
  • E(A): Expected score
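A worked example of the formulas above (the `expected_score` helper is local to this sketch, not a library call): a 1500-rated agent beats a 1700-rated opponent.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

K = 32
r_a, r_b = 1500.0, 1700.0

e_a = expected_score(r_a, r_b)         # ~0.24: A is the underdog
new_r_a = r_a + K * (1.0 - e_a)        # upset win (S_A = 1) gains ~24 points
new_r_b = r_b + K * (0.0 - (1 - e_a))  # B loses the same amount
```

Note the update is zero-sum: whatever A gains, B loses.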

Default Values

Parameter          Value   Description
Initial ELO        1500    Starting rating for new agents
K-factor           32      Rating volatility
Min calibration    5       Predictions needed for calibration score

Recording Matches

from aragora.ranking.elo import EloSystem

elo = EloSystem()

# Modern API: multi-agent with scores
changes = elo.record_match(
    debate_id="debate-001",
    participants=["claude", "gpt4", "gemini"],
    scores={"claude": 0.8, "gpt4": 0.6, "gemini": 0.4},
    domain="security",
    confidence_weight=0.9,
)

# Legacy API: two-player
changes = elo.record_match(
    winner="claude",
    loser="gpt4",
    draw=False,
    task="code review",
)

Confidence Weighting

Low-confidence debates have reduced ELO impact:

confidence_weight = max(0.1, min(1.0, confidence_weight))
effective_k = K_FACTOR * confidence_weight
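The clamp means a debate at confidence 0.5 moves ratings half as far as usual, and even a zero-confidence debate still nudges them slightly. A small standalone sketch of that behavior:

```python
K_FACTOR = 32

def effective_k(confidence_weight: float) -> float:
    # Clamp to [0.1, 1.0] so low-confidence debates still count a little.
    w = max(0.1, min(1.0, confidence_weight))
    return K_FACTOR * w

effective_k(0.5)   # 16.0: half the usual rating movement
effective_k(0.0)   # floored at 10% of K
effective_k(2.0)   # capped at full K
```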

Pairwise ELO Changes

For multi-agent matches, ELO changes are computed pairwise:

from itertools import combinations

def calculate_pairwise_elo_changes(participants, scores, ratings):
    changes = {p: 0.0 for p in participants}

    for a, b in combinations(participants, 2):
        expected_a = expected_score(ratings[a].elo, ratings[b].elo)

        # Normalize the pair's scores to a 0-1 share for agent a
        total = scores[a] + scores[b]
        actual_a = scores[a] / total if total > 0 else 0.5

        delta = K_FACTOR * (actual_a - expected_a)
        changes[a] += delta
        changes[b] -= delta

    return changes
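Applying that recipe to the three-agent match from the Recording Matches example, with all three agents assumed to start at 1500 (so every pairwise expected score is 0.5), gives zero-sum changes:

```python
from itertools import combinations

K = 32
scores = {"claude": 0.8, "gpt4": 0.6, "gemini": 0.4}
ratings = {name: 1500.0 for name in scores}  # assumed equal starting ratings

changes = {name: 0.0 for name in scores}
for a, b in combinations(scores, 2):
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
    total = scores[a] + scores[b]
    actual_a = scores[a] / total if total > 0 else 0.5
    delta = K * (actual_a - expected_a)
    changes[a] += delta
    changes[b] -= delta

# claude ~ +7.6, gpt4 ~ +0.9, gemini ~ -8.5; the deltas sum to zero.
```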

Domain-Specific ELO

Agents have separate ratings per domain:

rating = elo.get_rating("claude")

# Global ELO
print(rating.elo) # 1623

# Domain ELOs
print(rating.domain_elos) # {"security": 1580, "legal": 1490, ...}

Domain Rating Update

When a match has a domain:

# Update domain ELO (starts at global ELO if new)
if domain:
    domain_elo = rating.domain_elos.get(domain, rating.elo)
    rating.domain_elos[domain] = domain_elo + elo_change

Calibration Scoring

Calibration measures how well agents predict outcomes with appropriate confidence.

Brier Score

The Brier score measures prediction accuracy:

Brier = (predicted_probability - actual_outcome)^2

Range: 0 (perfect) to 1 (worst)
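A quick illustration of why the Brier score rewards honest confidence (a local helper, not the library API): being confidently wrong is penalized far more than hedging.

```python
def brier(predicted_probability: float, actual_outcome: int) -> float:
    """Squared error of a single probabilistic prediction."""
    return (predicted_probability - actual_outcome) ** 2

brier(0.9, 1)  # ~0.01: confident and right, near-perfect
brier(0.9, 0)  # ~0.81: confident and wrong, heavily penalized
brier(0.5, 1)  # 0.25: hedging always costs 0.25
```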

Recording Predictions

# Tournament winner prediction
elo.record_winner_prediction(
    tournament_id="tourney-001",
    predictor_agent="claude",
    predicted_winner="gpt4",
    confidence=0.75,
)

# Resolve tournament
brier_scores = elo.resolve_tournament_calibration(
    tournament_id="tourney-001",
    actual_winner="claude",
)

Calibration Score

Combined calibration score (higher is better):

@property
def calibration_score(self) -> float:
    if self.calibration_total < MIN_COUNT:
        return 0.0

    # Confidence scales with sample size
    confidence = min(1.0, 0.5 + 0.5 * (self.calibration_total - MIN_COUNT) / 40)

    # Score = (1 - mean Brier score) weighted by confidence
    return (1 - self.calibration_brier_score) * confidence
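Working the numbers through as a standalone function (an illustrative re-statement of the property above, with `MIN_COUNT` assumed to be the default of 5): an agent with a mean Brier score of 0.2 reaches full sample-size confidence at 45 predictions.

```python
MIN_COUNT = 5

def calibration_score(brier_mean: float, count: int) -> float:
    """Standalone sketch of the calibration_score property."""
    if count < MIN_COUNT:
        return 0.0
    confidence = min(1.0, 0.5 + 0.5 * (count - MIN_COUNT) / 40)
    return (1 - brier_mean) * confidence

calibration_score(0.2, 25)  # ~0.6  (0.8 accuracy term * 0.75 confidence)
calibration_score(0.2, 45)  # ~0.8  (full confidence reached)
calibration_score(0.2, 3)   # 0.0   (too few predictions)
```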

Calibration-Based K-Factor

Poorly calibrated agents receive higher K-factors, making their ratings more volatile:

def compute_calibration_k_multipliers(participants, calibration_tracker):
    multipliers = {}
    for agent in participants:
        if calibration_tracker:
            quality = calibration_tracker.get_quality(agent)
            # Poor calibration -> higher K -> more volatile
            multipliers[agent] = 2.0 - quality  # Range: 1.0-2.0
        else:
            multipliers[agent] = 1.0
    return multipliers

Domain Calibration

Track calibration per domain for grounded personas:

# Record domain-specific prediction
elo.record_domain_prediction(
    agent_name="claude",
    domain="legal",
    confidence=0.8,
    correct=True,
)

# Get calibration curve (by confidence bucket)
buckets = elo.get_calibration_by_bucket("claude", domain="legal")
# Returns: [{"bucket_key": "0.8-0.9", "accuracy": 0.82, ...}, ...]

# Expected Calibration Error
ece = elo.get_expected_calibration_error("claude")

Confidence Buckets

Predictions are grouped into 10% buckets:

Bucket     Confidence Range   Expected Accuracy
0.0-0.1    0-10%              5%
0.1-0.2    10-20%             15%
...        ...                ...
0.9-1.0    90-100%            95%

Well-calibrated agents have actual accuracy matching expected accuracy.
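Expected Calibration Error summarizes those buckets into one number. The exact bucketing Aragora uses isn't shown here; a common definition, sketched for illustration, is the accuracy-vs-confidence gap per bucket, weighted by how many predictions fell in each bucket.

```python
def expected_calibration_error(buckets):
    """ECE over (mean_confidence, accuracy, count) bucket tuples."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return 0.0
    return sum(count * abs(acc - conf) for conf, acc, count in buckets) / total

# 40 predictions at 80% confidence with 82% accuracy,
# 10 predictions at 60% confidence with only 50% accuracy:
expected_calibration_error([(0.8, 0.82, 40), (0.6, 0.5, 10)])
# (40*0.02 + 10*0.10) / 50 = 0.036 -- lower is better, 0 is perfect
```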

Agent Rating Structure

from dataclasses import dataclass, field

@dataclass
class AgentRating:
    agent_name: str
    elo: float = 1500
    domain_elos: dict[str, float] = field(default_factory=dict)

    # Win/loss record
    wins: int = 0
    losses: int = 0
    draws: int = 0
    debates_count: int = 0

    # Critique tracking
    critiques_accepted: int = 0
    critiques_total: int = 0

    # Calibration
    calibration_correct: int = 0
    calibration_total: int = 0
    calibration_brier_sum: float = 0.0

    # Computed properties (not stored fields):
    #   win_rate: wins / total games
    #   critique_acceptance_rate
    #   calibration_accuracy: correct / total predictions
    #   calibration_brier_score: mean Brier score
    #   calibration_score: combined metric

Leaderboards

# Global leaderboard
top_agents = elo.get_leaderboard(limit=20)

# Domain leaderboard
top_security = elo.get_leaderboard(limit=10, domain="security")

# Calibration leaderboard
best_calibrated = elo.get_calibration_leaderboard(limit=10)

Leaderboard Caching

Leaderboards are cached with configurable TTL:

# Cache settings (from config)
CACHE_TTL_LEADERBOARD = 300 # 5 minutes
CACHE_TTL_RECENT_MATCHES = 60 # 1 minute
CACHE_TTL_LB_STATS = 600 # 10 minutes
CACHE_TTL_CALIBRATION_LB = 300 # 5 minutes

Relationship Tracking

Track dynamics between agent pairs:

# Update relationship
elo.update_relationship(
    agent_a="claude",
    agent_b="gpt4",
    debate_increment=1,
    agreement_increment=1,
    critique_a_to_b=2,
    critique_accepted_a_to_b=1,
    a_win=1,
)

# Get relationship metrics
metrics = elo.compute_relationship_metrics("claude", "gpt4")
# Returns: {"rivalry_score": 0.4, "alliance_score": 0.6, ...}

# Find rivals and allies
rivals = elo.get_rivals("claude", limit=5)
allies = elo.get_allies("claude", limit=5)

Relationship Metrics

Metric            Description
rivalry_score     Competition intensity (0-1)
alliance_score    Collaboration tendency (0-1)
relationship      Classification: "rival", "ally", "neutral"
agreement_rate    How often they agree
head_to_head      Win/loss record

Red Team Integration

Adjust ELO based on vulnerability testing:

elo_change = elo.record_redteam_result(
    agent_name="claude",
    robustness_score=0.85,
    successful_attacks=3,
    total_attacks=20,
    critical_vulnerabilities=0,
    session_id="session-001",
)

summary = elo.get_vulnerability_summary("claude")

Formal Verification Integration

Adjust ELO based on verified/disproven claims:

elo_change = elo.update_from_verification(
    agent_name="claude",
    domain="mathematics",
    verified_count=5,
    disproven_count=1,
    k_factor=16.0,
)

Verified claims boost ELO; disproven claims reduce it.

Performance Optimizations

Batch Operations

# Batch rating fetch
ratings = elo.get_ratings_batch(["claude", "gpt4", "gemini"])

# Batch relationship updates
elo.update_relationships_batch([
    {"agent_a": "claude", "agent_b": "gpt4", "debate_increment": 1},
    {"agent_a": "gpt4", "agent_b": "gemini", "agreement_increment": 1},
])

JSON Snapshots

For read-heavy workloads, JSON snapshots avoid SQLite lock contention:

# Fast leaderboard from snapshot
leaderboard = elo.get_snapshot_leaderboard(limit=20)

# Fast recent matches from snapshot
matches = elo.get_cached_recent_matches(limit=10)