Formats citations and validates source attribution for answers generated from multiple document sources (ChromaDB and web search). Use when adding source references to generated answers, validating citation accuracy, or ensuring quality of source attribution across different source types.
Format citations and validate source attribution using functions in components/generator.py. Handles both internal knowledge base (ChromaDB) sources and external web search sources with appropriate citation formats.
Default workflow:
# After generating answer, format citations based on source type
citations = format_citations(documents)
validated = validate_citations(answer, documents)
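Key capabilities: citation formatting for internal knowledge base (ChromaDB) sources, web search sources, and mixed source sets, plus validation that an answer is grounded in the provided documents.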
Documents from internal knowledge base (catalog, faq, troubleshooting collections):
Document structure:
{
'document': 'product_id: SKU001 | name: TechBook Pro...',
'collection': 'catalog', # 'faq' or 'troubleshooting'
'metadata': {
'source': 'techmart_catalog.csv',
'row_index': 0
}
}
Citation format:
sources: ["techmart_catalog.csv", "techmart_faq.csv"]
collections_used: ["catalog", "faq"]
Documents from external web search (Exa API):
Document structure:
{
'document': '<web page content>',
'collection': 'web_search',
'source': 'https://techradar.com/article', # URL
'title': 'Best Gaming Laptops 2024',
'author': 'John Doe',
'published_date': '2024-03-15'
}
Citation format:
formatted_citations: [Title](URL)
sources: ["https://techradar.com/article", "https://pcmag.com/review"]
collections_used: ["web_search"]
Example, ChromaDB sources only:
# Input documents
documents = [
{'collection': 'catalog', 'metadata': {'source': 'techmart_catalog.csv'}},
{'collection': 'faq', 'metadata': {'source': 'techmart_faq.csv'}}
]
# Output
{
'sources': ['techmart_catalog.csv', 'techmart_faq.csv'],
'collections_used': ['catalog', 'faq'],
'source_types': ['chromadb', 'chromadb'],
'citation_format': 'internal_kb'
}
Example, web search sources only:
# Input documents
documents = [
{
'collection': 'web_search',
'source': 'https://techradar.com/best-laptops',
'title': 'Best Gaming Laptops 2024'
},
{
'collection': 'web_search',
'source': 'https://pcmag.com/laptop-review',
'title': 'Gaming Laptop Reviews'
}
]
# Output
{
'sources': [
'https://techradar.com/best-laptops',
'https://pcmag.com/laptop-review'
],
'collections_used': ['web_search'],
'source_types': ['web', 'web'],
'citation_format': 'web_urls',
'formatted_citations': [
'[Best Gaming Laptops 2024](https://techradar.com/best-laptops)',
'[Gaming Laptop Reviews](https://pcmag.com/laptop-review)'
]
}
Example, mixed sources:
# Input documents
documents = [
{'collection': 'catalog', 'metadata': {'source': 'techmart_catalog.csv'}},
{
'collection': 'web_search',
'source': 'https://example.com/article',
'title': 'External Review'
}
]
# Output
{
'sources': [
'techmart_catalog.csv',
'https://example.com/article'
],
'collections_used': ['catalog', 'web_search'],
'source_types': ['chromadb', 'web'],
'citation_format': 'mixed',
'formatted_citations': [
'Product Catalog',
'[External Review](https://example.com/article)'
]
}
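The three input/output pairs above suggest the following shape for format_citations. This is a minimal sketch that reproduces those outputs, not the actual code in components/generator.py. The COLLECTION_LABELS mapping is an assumption: only 'Product Catalog' appears in the examples, and the labels for 'faq' and 'troubleshooting' are placeholders.

```python
from typing import Any, Dict, List

# Display labels for internal collections. Only 'Product Catalog' comes from
# the examples; the other two labels are assumed for illustration.
COLLECTION_LABELS = {
    "catalog": "Product Catalog",
    "faq": "FAQ",
    "troubleshooting": "Troubleshooting Guide",
}


def format_citations(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build citation metadata for ChromaDB, web, or mixed document lists."""
    sources: List[str] = []
    collections_used: List[str] = []
    source_types: List[str] = []
    formatted: List[str] = []

    for doc in documents:
        collection = doc.get("collection", "unknown")
        if collection == "web_search":
            # Web documents keep the URL at the top level.
            source = doc.get("source", "")
            title = doc.get("title") or source
            formatted.append(f"[{title}]({source})")
            source_types.append("web")
        else:
            # ChromaDB documents keep the file name under 'metadata'.
            source = doc.get("metadata", {}).get("source", "")
            formatted.append(COLLECTION_LABELS.get(collection, collection))
            source_types.append("chromadb")
        sources.append(source)
        if collection not in collections_used:
            collections_used.append(collection)

    kinds = set(source_types)
    if kinds == {"web"}:
        citation_format = "web_urls"
    elif "web" in kinds:
        citation_format = "mixed"
    else:
        citation_format = "internal_kb"

    result: Dict[str, Any] = {
        "sources": sources,
        "collections_used": collections_used,
        "source_types": source_types,
        "citation_format": citation_format,
    }
    # The ChromaDB-only example has no 'formatted_citations' key.
    if citation_format != "internal_kb":
        result["formatted_citations"] = formatted
    return result
```

In this sketch 'sources' and 'source_types' stay parallel (one entry per document), while 'collections_used' is de-duplicated in first-seen order, matching the web-only example.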
# Validate that generated answer only uses information from provided documents
answer = "The TechBook Pro costs $1,499"
documents = [
{'document': 'product_id: SKU001 | price: 1499.0 | name: TechBook Pro'}
]
validation = validate_citations(answer, documents)
# Returns: {'valid': True, 'grounded': True, 'hallucination_detected': False}
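How validate_citations decides that an answer is grounded is not described here. As one naive illustration, the sketch below checks only that every number in the answer also appears in some document, after normalizing formats such as "$1,499" and "1499.0". The regex and the numeric-only heuristic are assumptions; a real grounding check would likely also compare names and other claims.

```python
import re
from typing import Any, Dict, List, Set

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _numbers(text: str) -> Set[float]:
    """Extract numeric values, so '1,499' and '1499.0' compare equal."""
    return {float(m.replace(",", "")) for m in _NUMBER.findall(text)}


def validate_citations(answer: str, documents: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Flag the answer if it states a number found in none of the documents."""
    doc_numbers: Set[float] = set()
    for doc in documents:
        doc_numbers |= _numbers(doc.get("document", ""))

    unsupported = _numbers(answer) - doc_numbers
    grounded = not unsupported
    return {
        "valid": grounded,
        "grounded": grounded,
        "hallucination_detected": not grounded,
    }
```

With the documents from the example, an answer quoting $1,599 instead of $1,499 would come back with 'hallucination_detected': True under this heuristic.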
The citation skill is used AFTER generation to format and validate sources:
# Step 1: Generate answer (generation_skill)
answer_result = generate_answer(query, documents)
# Step 2: Format citations (citation_skill)
citation_result = format_citations(documents)
# Step 3: Combine results
final_result = {
'answer': answer_result['answer'],
'sources': citation_result['sources'],
'collections_used': citation_result['collections_used'],
'source_types': citation_result['source_types'],
'formatted_citations': citation_result.get('formatted_citations', [])
}
The citation skill performs these quality validations:
Checks collection == 'web_search' to identify web sources.
Uses the [Title](URL) format for web citations so they render properly.
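The two checks can be sketched as small helpers. The names is_web_source and format_web_citation are hypothetical, and the fallback to a bare URL and the bracket escaping are assumptions added so the Markdown link stays well-formed when a title is missing or contains square brackets.

```python
from typing import Any, Dict


def is_web_source(doc: Dict[str, Any]) -> bool:
    """A document counts as a web source when its collection is 'web_search'."""
    return doc.get("collection") == "web_search"


def format_web_citation(doc: Dict[str, Any]) -> str:
    """Render a web document as a Markdown link in [Title](URL) form."""
    url = doc.get("source", "")
    title = (doc.get("title") or "").strip()
    if not title:
        # Assumed fallback: with no title, cite the bare URL.
        return url
    # Escape square brackets so they do not end the link text early.
    safe_title = title.replace("[", "\\[").replace("]", "\\]")
    return f"[{safe_title}]({url})"
```

For instance, a web_search document titled 'Best Gaming Laptops 2024' at https://techradar.com/best-laptops renders as [Best Gaming Laptops 2024](https://techradar.com/best-laptops).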