Document Analysis

Natural language querying and analysis capabilities for uploaded documents.

Overview

The Analysis module answers natural language questions about documents, synthesizing each answer with an LLM and citing the sources it drew on. Retrieval uses hybrid search, combining semantic (vector) and keyword matching, before the LLM generates the answer.

Key Features:

Natural language question answering
Multiple query modes (factual, analytical, comparative, summary, extractive)
Source citations with relevance scores
Multi-turn conversation context
Streaming response support
Document comparison and summarization

Quick Start

from aragora.analysis import query_documents, summarize_document

# Quick query
result = await query_documents(
    question="What are the key terms in this contract?",
    document_ids=["doc_123"]
)
print(result.answer)
for citation in result.citations:
    print(f"  - {citation.document_name}: {citation.snippet}")

# Summarize a document
summary = await summarize_document(
    document_id="doc_123",
    focus="payment terms"
)
print(summary.answer)

Full Engine Usage

from aragora.analysis import DocumentQueryEngine, QueryConfig

# Create engine with custom config
config = QueryConfig(
    max_chunks=15,
    min_relevance=0.4,
    model="claude-3.5-sonnet",
    include_quotes=True,
)

engine = await DocumentQueryEngine.create(config=config)

# Simple query
result = await engine.query(
    question="What contracts mention exclusivity clauses?",
    workspace_id="ws_123",
    document_ids=["doc1", "doc2"]
)

# Multi-turn conversation
result1 = await engine.query(
    question="What is the notice period?",
    conversation_id="conv_456"
)
result2 = await engine.query(
    question="Are there any exceptions to that?",  # Context preserved
    conversation_id="conv_456"
)

# Compare documents
comparison = await engine.compare_documents(
    document_ids=["contract_v1", "contract_v2"],
    aspects=["payment terms", "liability clauses"]
)

# Extract structured information
fields = await engine.extract_information(
    document_ids=["contract_123"],
    extraction_template={
        "parties": "Who are the parties to this agreement?",
        "effective_date": "What is the effective date?",
        "term": "What is the term or duration?",
        "payment": "What are the payment terms?",
    }
)

Query Modes

The engine detects query intent automatically, or you can specify the mode explicitly.

Mode	Description	Triggers
`FACTUAL`	Direct fact extraction	Default mode
`ANALYTICAL`	Analysis and reasoning	"why", "analyze", "explain"
`COMPARATIVE`	Compare across documents	"compare", "difference", "vs"
`SUMMARY`	Summarize content	"summarize", "overview"
`EXTRACTIVE`	Extract specific information	"list", "extract", "find all"
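
The trigger words in the table suggest a simple keyword-based detector. The sketch below only illustrates that idea and is not the engine's actual implementation: the local `QueryMode` enum, the `TRIGGERS` list, and `detect_mode` are hypothetical stand-ins, and the order in which modes are checked is an assumption.

```python
import re
from enum import Enum


class QueryMode(Enum):
    """Local stand-in for the library's QueryMode (hypothetical)."""
    FACTUAL = "factual"
    ANALYTICAL = "analytical"
    COMPARATIVE = "comparative"
    SUMMARY = "summary"
    EXTRACTIVE = "extractive"


# Trigger phrases copied from the table; the check order is an assumption.
TRIGGERS = [
    (QueryMode.COMPARATIVE, ("compare", "difference", "vs")),
    (QueryMode.SUMMARY, ("summarize", "overview")),
    (QueryMode.EXTRACTIVE, ("list", "extract", "find all")),
    (QueryMode.ANALYTICAL, ("why", "analyze", "explain")),
]


def detect_mode(question: str) -> QueryMode:
    """Return the first mode whose trigger appears as a whole word or phrase."""
    text = question.lower()
    for mode, phrases in TRIGGERS:
        for phrase in phrases:
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                return mode
    return QueryMode.FACTUAL  # default mode per the table
```

Matching on whole words keeps a trigger such as "list" from firing inside an unrelated word; a question with no trigger falls back to `FACTUAL`.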

Configuration

QueryConfig Options

Option	Default	Description
`max_chunks`	`10`	Maximum chunks to retrieve per query
`min_relevance`	`0.3`	Minimum relevance score threshold
`vector_weight`	`0.7`	Weight given to semantic (vector) search relative to keyword search
`max_answer_length`	`500`	Maximum answer length in words
`include_quotes`	`True`	Include direct quotes from sources
`require_citations`	`True`	Always cite sources in answers
`model`	`claude-3.5-sonnet`	Primary model for answer generation
`fallback_model`	`gemini-1.5-flash`	Fallback if primary fails
`expand_query`	`True`	Generate query variations for better retrieval
`detect_intent`	`True`	Auto-detect question type/intent
`enable_context`	`True`	Use conversation history
`max_context_turns`	`3`	Maximum conversation turns to include
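
To make the interaction between `vector_weight`, `min_relevance`, and `max_chunks` concrete, here is a minimal sketch of one way a hybrid retriever could apply them. The linear blend of the two scores is an assumption made for illustration; the engine's real scoring formula is not documented here, and `ScoredChunk` and `blend_and_select` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical container for one retrieved chunk's scores (0-1 each)."""
    chunk_id: str
    vector_score: float
    keyword_score: float


def blend_and_select(
    chunks: list[ScoredChunk],
    vector_weight: float = 0.7,
    min_relevance: float = 0.3,
    max_chunks: int = 10,
) -> list[tuple[str, float]]:
    """Blend scores linearly, drop weak chunks, keep the top max_chunks.

    The linear blend is an assumption for illustration only.
    """
    scored = [
        (
            c.chunk_id,
            vector_weight * c.vector_score + (1 - vector_weight) * c.keyword_score,
        )
        for c in chunks
    ]
    # Keep only chunks that meet the relevance threshold.
    relevant = [(cid, score) for cid, score in scored if score >= min_relevance]
    # Highest blended score first, then cap the result count.
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:max_chunks]
```

Under this model, raising `vector_weight` favors paraphrased matches, while lowering it favors exact keyword hits.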

Response Types

QueryResult

@dataclass
class QueryResult:
    query_id: str           # Unique query identifier
    question: str           # Original question
    answer: str             # Generated answer
    confidence: AnswerConfidence  # HIGH, MEDIUM, LOW, NONE
    citations: list[Citation]     # Source citations
    query_mode: QueryMode         # Detected query mode
    chunks_searched: int          # Total chunks searched
    chunks_relevant: int          # Chunks meeting relevance threshold
    processing_time_ms: int       # Processing time
    model_used: str              # Model that generated answer

Citation

@dataclass
class Citation:
    document_id: str        # Source document ID
    document_name: str      # Document name
    chunk_id: str           # Specific chunk ID
    snippet: str            # Relevant excerpt (200 chars)
    page: int | None        # Page number if available
    relevance_score: float  # Relevance score (0-1)
    heading_context: str    # Section heading context

AnswerConfidence

Level	Description
`HIGH`	Strong evidence, clear answer (max relevance > 0.8, avg > 0.5)
`MEDIUM`	Moderate evidence, likely answer (max relevance > 0.5)
`LOW`	Weak evidence, uncertain
`NONE`	No relevant information found
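
The thresholds in the table can be read as a small decision rule. The sketch below applies them to a list of relevance scores; it assumes that `NONE` corresponds to an empty list of relevant chunks and that everything below the `MEDIUM` threshold is `LOW`. The enum is a local stand-in and `classify_confidence` is a hypothetical helper, not a function from the library.

```python
from enum import Enum


class AnswerConfidence(Enum):
    """Local stand-in mirroring the documented levels (hypothetical)."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    NONE = "none"


def classify_confidence(relevance_scores: list[float]) -> AnswerConfidence:
    """Map the relevance scores of the retrieved chunks to a confidence level."""
    if not relevance_scores:
        # Assumption: no relevant chunks at all means NONE.
        return AnswerConfidence.NONE
    top = max(relevance_scores)
    mean = sum(relevance_scores) / len(relevance_scores)
    if top > 0.8 and mean > 0.5:
        return AnswerConfidence.HIGH
    if top > 0.5:
        return AnswerConfidence.MEDIUM
    return AnswerConfidence.LOW
```

Note that a single strong chunk among many weak ones yields `MEDIUM` rather than `HIGH`, because the average must also exceed 0.5.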

Streaming Responses

For real-time UI updates, use streaming:

async for chunk in engine.query_stream(
    question="Summarize the agreement",
    document_ids=["doc_123"]
):
    if chunk.is_final:
        # Final chunk includes citations
        for citation in chunk.citations:
            print(f"Source: {citation.document_name}")
    else:
        # Partial answer text
        print(chunk.text, end="", flush=True)
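
When the full answer is also needed after streaming (for logging or caching, say), the partial chunks can be collected as they arrive. The sketch below runs without the engine: `FakeChunk` and `fake_stream` are hypothetical stand-ins for the chunks yielded by `query_stream`, and citations are simplified to plain strings.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk (text, is_final, citations)."""
    text: str = ""
    is_final: bool = False
    citations: list[str] = field(default_factory=list)


async def fake_stream() -> AsyncIterator[FakeChunk]:
    """Stub generator used in place of engine.query_stream for this sketch."""
    for piece in ("Payment is due ", "within 30 days."):
        yield FakeChunk(text=piece)
    yield FakeChunk(is_final=True, citations=["Service Agreement.pdf"])


async def collect(stream: AsyncIterator[FakeChunk]) -> tuple[str, list[str]]:
    """Join partial texts and return them with the final chunk's citations."""
    parts: list[str] = []
    citations: list[str] = []
    async for chunk in stream:
        if chunk.is_final:
            citations = chunk.citations
        else:
            parts.append(chunk.text)
    return "".join(parts), citations


answer, sources = asyncio.run(collect(fake_stream()))
```

Collecting parts in a list and joining once at the end avoids repeated string concatenation on long answers.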

API Endpoints

The Analysis module is exposed via the Knowledge API:

Method	Endpoint	Description
POST	`/api/knowledge/query`	Natural language query
POST	`/api/knowledge/mound/query`	Semantic query against knowledge mound

Example Request

curl -X POST http://localhost:8080/api/knowledge/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the payment terms?",
    "workspace_id": "enterprise",
    "document_ids": ["contract_001"],
    "options": {
      "max_chunks": 10,
      "min_relevance": 0.4,
      "include_quotes": true
    }
  }'
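
The same request can be issued from Python using only the standard library. This is a minimal sketch: the URL and payload mirror the curl example, while the `send` helper, its 30-second timeout, and the absence of authentication headers are assumptions.

```python
import json
import urllib.request

# Payload copied from the curl example above.
payload = {
    "question": "What are the payment terms?",
    "workspace_id": "enterprise",
    "document_ids": ["contract_001"],
    "options": {
        "max_chunks": 10,
        "min_relevance": 0.4,
        "include_quotes": True,
    },
}

request = urllib.request.Request(
    "http://localhost:8080/api/knowledge/query",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(request)` requires a server listening on `localhost:8080`; building the request separately makes it easy to inspect before sending.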

Example Response

{
  "query_id": "query_abc123def456",
  "question": "What are the payment terms?",
  "answer": "According to the contract [Source 1], payment is due within 30 days of invoice...",
  "confidence": "high",
  "citations": [
    {
      "document_id": "contract_001",
      "document_name": "Service Agreement.pdf",
      "chunk_id": "chunk_789",
      "snippet": "Payment Terms: All invoices are due and payable within thirty (30) days...",
      "page": 4,
      "relevance_score": 0.92,
      "heading_context": "Section 5: Payment"
    }
  ],
  "query_mode": "factual",
  "chunks_searched": 42,
  "chunks_relevant": 8,
  "processing_time_ms": 1250,
  "model_used": "claude-3.5-sonnet"
}
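
The example answer refers to its source as `[Source 1]`. Assuming those markers are 1-based indexes into the `citations` array (this page does not state that explicitly), a client could resolve them as sketched below. `resolve_source_markers` is a hypothetical helper, and the response is trimmed to the fields it uses.

```python
import json
import re

# Trimmed copy of the example response above.
response_text = """
{
  "answer": "According to the contract [Source 1], payment is due within 30 days of invoice...",
  "citations": [
    {
      "document_name": "Service Agreement.pdf",
      "page": 4,
      "relevance_score": 0.92
    }
  ]
}
"""


def resolve_source_markers(response: dict) -> dict[int, str]:
    """Map each [Source N] marker in the answer to a display label.

    Assumes N is a 1-based index into response["citations"].
    """
    citations = response.get("citations", [])
    resolved: dict[int, str] = {}
    for match in re.finditer(r"\[Source (\d+)\]", response.get("answer", "")):
        index = int(match.group(1))
        if 1 <= index <= len(citations):
            citation = citations[index - 1]
            page = citation.get("page")
            location = f", p.{page}" if page is not None else ""
            resolved[index] = f"{citation['document_name']}{location}"
    return resolved


labels = resolve_source_markers(json.loads(response_text))
```

Markers that point past the end of the array are skipped rather than raising, so a malformed answer does not break rendering.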

Integration with Knowledge Mound

The Analysis module integrates with the Knowledge Mound for broader knowledge queries:

from aragora.knowledge.mound import KnowledgeMound
from aragora.analysis import DocumentQueryEngine

# Initialize both systems
mound = KnowledgeMound(workspace_id="enterprise")
await mound.initialize()

engine = await DocumentQueryEngine.create()

# Query documents
doc_result = await engine.query("What are our SLA requirements?")

# Query knowledge mound for related facts
mound_result = await mound.query("SLA requirements", limit=5)

# Combine insights from both
combined_answer = f"""
Document Analysis:
{doc_result.answer}

Related Knowledge:
{chr(10).join(item.content for item in mound_result.items)}
"""

Best Practices

Scope Queries When Possible - Use `document_ids` to limit search scope for faster, more relevant results.
Use Conversation IDs for Follow-ups - Pass the same `conversation_id` on related questions to enable multi-turn context.
Adjust Relevance Thresholds - Lower `min_relevance` for broader searches; raise it for precision.
Monitor Confidence Levels - Check `result.confidence` before displaying answers to users.
Handle NONE Confidence - When confidence is `NONE`, suggest that users refine their question or upload more relevant documents.
Use Streaming for Long Answers - Use `query_stream` to show answers as they are generated.
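
Two of these practices, adjusting `min_relevance` and handling `NONE` confidence, can be combined into a retry loop that starts strict and widens the search only when nothing is found. The sketch below does not call the real engine: `run_query` is a hypothetical caller-supplied function that takes a threshold and returns a `(confidence, answer)` pair of strings, the threshold values are illustrative, and the loop is synchronous for brevity.

```python
from typing import Callable


def query_with_widening(
    run_query: Callable[[float], tuple[str, str]],
    thresholds: tuple[float, ...] = (0.5, 0.3, 0.15),
) -> tuple[str, str, float]:
    """Try stricter thresholds first; relax only while confidence is 'none'.

    Returns (confidence, answer, threshold_used) from the last attempt.
    """
    if not thresholds:
        raise ValueError("at least one threshold is required")
    for threshold in thresholds:
        confidence, answer = run_query(threshold)
        if confidence != "none":
            break  # found something; stop widening
    return confidence, answer, threshold
```

Answers found only at a low threshold are more likely to carry low confidence, so the caller should still check the returned confidence before displaying them.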

Error Handling

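# Assumption: AnswerConfidence is exported from aragora.analysis like the other names
from aragora.analysis import AnswerConfidence

result = await engine.query("What is the penalty clause?")

if result.confidence == AnswerConfidence.NONE:
    print("No relevant information found in the documents.")
elif result.confidence == AnswerConfidence.LOW:
    print("Warning: Low confidence answer - please verify manually.")
    print(f"Answer: {result.answer}")
else:
    print(result.answer)
    for citation in result.citations:
        # page is None when the source has no page information
        page = f", p.{citation.page}" if citation.page is not None else ""
        print(f"  Source: {citation.document_name}{page}")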
