Analyze PDF documents and visualize their structure, citations, relationships, and summaries. Use this skill whenever the user wants to understand a PDF's structure, see how sections relate to each other, visualize citation networks, get section-by-section summaries, or see information density across pages. Trigger on phrases like "analyze this PDF", "show me the structure", "visualize this document", "summarize this paper", "how is this document organized", "show citation map", "what are the key sections", or when a user uploads/references a PDF and wants to understand its content at a structural level.
Extract text from the input PDF using pdftotext (from poppler-utils) as the primary method:
pdftotext -layout <input.pdf> extracted.txt
If pdftotext is not installed or fails, fall back to the Python extraction script:
python scripts/extract_pdf.py <input.pdf> extracted.txt
Output: extracted.txt — page-delimited plain text of the entire PDF.
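The primary-then-fallback order above can be sketched in Python. This is a minimal illustration, not part of the skill's scripts: the helper names `extraction_command` and `extract` are hypothetical, and it assumes that "not installed" can be detected by looking for `pdftotext` on the PATH.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def extraction_command(input_pdf, output_txt, which=shutil.which):
    """Return the argv list for the preferred extractor that is available."""
    if which("pdftotext"):
        # Primary method: poppler's pdftotext, preserving the physical layout.
        return ["pdftotext", "-layout", str(input_pdf), str(output_txt)]
    # Fallback: the bundled Python script, run with the current interpreter.
    return [sys.executable, "scripts/extract_pdf.py", str(input_pdf), str(output_txt)]


def extract(input_pdf, output_txt="extracted.txt"):
    """Try pdftotext first; fall back to the script if it is missing or fails."""
    cmd = extraction_command(input_pdf, output_txt)
    result = subprocess.run(cmd)
    if result.returncode != 0 and cmd[0] == "pdftotext":
        # pdftotext was found but failed, so retry with the fallback script.
        cmd = extraction_command(input_pdf, output_txt, which=lambda _name: None)
        result = subprocess.run(cmd)
    result.check_returncode()
    return Path(output_txt)
```

Passing `which` as a parameter keeps the command selection testable without poppler installed.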
Run the page statistics script to compute per-page metrics locally:
python scripts/page_stats.py extracted.txt page_stats.json
Output: page_stats.json — per-page statistics including word count, character count, and information density metrics.
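To give a feel for what per-page metrics might look like, here is a rough sketch. It does not reproduce scripts/page_stats.py: the field names and the density formula are assumptions, and it assumes pages in extracted.txt are delimited by form-feed characters, which is how pdftotext marks page breaks by default.

```python
def page_stats(text):
    """Compute simple per-page metrics from form-feed-delimited text."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final element; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    stats = []
    for number, page in enumerate(pages, start=1):
        non_space = sum(1 for ch in page if not ch.isspace())
        stats.append({
            "page": number,
            "wordCount": len(page.split()),
            "charCount": len(page),
            # Hypothetical density metric: share of non-whitespace characters.
            "density": round(non_space / len(page), 3) if page else 0.0,
        })
    return stats
```

The resulting list can be written out with `json.dump` to produce a file in the spirit of page_stats.json.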
Read the contents of extracted.txt and analyze the document. Produce analysis.json conforming to the following schema:
{
  "title": "Document title",
  "structure": {
    "sections": [
      {
        "id": "sec-1",
        "title": "Section Title",
        "level": 1,
        "page": 1,
        "charCount": 1234,
        "summary": "Brief summary of the section content.",
        "children": [
          {
            "id": "sec-1-1",
            "title": "Subsection Title",
            "level": 2,
            "page": 2,
            "charCount": 567,
            "summary": "Brief summary of the subsection.",
            "children": []
          }
        ]
      }
    ]
  },
  "citations": {
    "references": [
      {
        "id": "ref-1",
        "label": "[1]",
        "title": "Referenced work title",
        "authors": "Author A, Author B",
        "year": 2023
      }
    ],
    "inTextCitations": [
      {
        "referenceId": "ref-1",
        "page": 3,
        "context": "Surrounding sentence where the citation appears."
      }
    ]
  },
  "relationships": {
    "edges": [
      {
        "from": "sec-1",
        "to": "sec-2",
        "type": "prerequisite",
        "label": "Section 1 introduces concepts used in Section 2"
      }
    ]
  },
  "summary": {
    "overall": "A 2-3 sentence overall summary of the document.",
    "keywords": ["keyword1", "keyword2", "keyword3"],
    "sections": [
      {
        "id": "sec-1",
        "summary": "Summary of this section."
      }
    ]
  }
}
Relationship types: prerequisite, supports, contradicts, extends, references.
Write the completed analysis to analysis.json.
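Before generating the report, it may help to check that the analysis is internally consistent. The sketch below is an optional sanity check, not part of the skill's scripts. The function names are hypothetical, and the rules it enforces (known relationship types, ids that resolve) are inferred from the schema above rather than stated requirements.

```python
ALLOWED_TYPES = {"prerequisite", "supports", "contradicts", "extends", "references"}


def collect_section_ids(sections):
    """Gather section ids recursively, descending into children."""
    ids = set()
    for section in sections:
        ids.add(section["id"])
        ids |= collect_section_ids(section.get("children", []))
    return ids


def check_analysis(analysis):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    section_ids = collect_section_ids(analysis["structure"]["sections"])
    reference_ids = {ref["id"] for ref in analysis["citations"]["references"]}
    for cite in analysis["citations"]["inTextCitations"]:
        if cite["referenceId"] not in reference_ids:
            problems.append(f"unknown referenceId: {cite['referenceId']}")
    for edge in analysis["relationships"]["edges"]:
        if edge["type"] not in ALLOWED_TYPES:
            problems.append(f"unknown relationship type: {edge['type']}")
        for end in ("from", "to"):
            if edge[end] not in section_ids:
                problems.append(f"edge {end} points at unknown section: {edge[end]}")
    for item in analysis["summary"]["sections"]:
        if item["id"] not in section_ids:
            problems.append(f"summary for unknown section: {item['id']}")
    return problems
```

Load analysis.json with `json.load` and pass the resulting dict to `check_analysis`; any returned messages point at entries worth fixing before running the report generator.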
Generate the interactive HTML visualization report:
python scripts/generate_report.py page_stats.json analysis.json synoptic_report.html
Output: synoptic_report.html — a self-contained interactive report with section structure visualization, citation network graph, relationship diagram, summaries, and page-level statistics.
Open the generated report in the user's default browser:
start synoptic_report.html # Windows
open synoptic_report.html # macOS
xdg-open synoptic_report.html # Linux
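The three platform-specific commands can be selected programmatically. The following is a small sketch with a hypothetical helper name, `open_command`; it assumes `sys.platform` is enough to tell the platforms apart.

```python
import subprocess
import sys


def open_command(path, platform=sys.platform):
    """Return the argv list that opens `path` with the platform's default handler."""
    if platform.startswith("win"):
        # `start` is a cmd.exe builtin; the empty string is its window-title argument.
        return ["cmd", "/c", "start", "", path]
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]


def open_report(path="synoptic_report.html"):
    """Launch the report in the default browser."""
    subprocess.run(open_command(path), check=True)
```

Python's standard `webbrowser` module is another option when a single cross-platform call is preferred.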
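Adapt the analysis strategy based on document length. When a document is too long to analyze in a single pass, analyze it in chunks and merge the partial results into a single analysis.json. Ensure cross-chunk references and relationships are resolved during the merge step.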
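One way the merge step could resolve cross-chunk references is sketched below. Everything here is an assumption made for illustration: each chunk is taken to yield a partial analysis in the same schema, section ids are assumed to be globally unique already, and references are treated as identical when their labels match. The function name `merge_chunks` is hypothetical.

```python
def merge_chunks(chunks):
    """Merge partial analyses, deduplicating references by label."""
    merged = {
        "structure": {"sections": []},
        "citations": {"references": [], "inTextCitations": []},
        "relationships": {"edges": []},
    }
    ref_id_by_label = {}
    seen_edges = set()
    for chunk in chunks:
        merged["structure"]["sections"].extend(chunk["structure"]["sections"])
        # Map this chunk's local reference ids to global ones, keyed by label.
        local_to_global = {}
        for ref in chunk["citations"]["references"]:
            label = ref["label"]
            if label not in ref_id_by_label:
                global_id = f"ref-{len(ref_id_by_label) + 1}"
                ref_id_by_label[label] = global_id
                merged["citations"]["references"].append({**ref, "id": global_id})
            local_to_global[ref["id"]] = ref_id_by_label[label]
        for cite in chunk["citations"]["inTextCitations"]:
            old_id = cite["referenceId"]
            new_id = local_to_global.get(old_id, old_id)
            merged["citations"]["inTextCitations"].append({**cite, "referenceId": new_id})
        # Keep each (from, to, type) relationship only once across chunks.
        for edge in chunk["relationships"]["edges"]:
            key = (edge["from"], edge["to"], edge["type"])
            if key not in seen_edges:
                seen_edges.add(key)
                merged["relationships"]["edges"].append(edge)
    return merged
```

The title and summary fields are left out on purpose, since an overall summary is better written once over the merged result than concatenated from chunk summaries.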