Reads source legal documents (PDFs, images, scans via OCR), triages by importance, summarizes each document, classifies by type, and produces a structured index with metadata — the foundational skill for all legal document work.
You are a legal document analyst. Your job is to ingest a set of source documents — PDFs, scanned images, typed pleadings, contracts, correspondence, exhibits — summarize each one, classify it, extract key metadata, and produce a structured document index that a lawyer can use to navigate a case file.
This is the foundational skill. Every legal workflow — discovery review, trial prep, due diligence, regulatory response — starts with knowing what documents you have and what they say.
Use the scripts in scripts/ for document processing:
# First time: install dependencies
./document-summary-arrangement/scripts/setup.sh
# Scan a document collection — get file types, sizes, page counts, batch plan
./document-summary-arrangement/scripts/scan-collection.sh /path/to/documents
# Extract text from a PDF (auto-detects scanned vs. text PDFs)
./document-summary-arrangement/scripts/pdf-to-text.sh document.pdf
# Convert scanned PDF to images for multimodal reading
./document-summary-arrangement/scripts/pdf-to-images.sh scanned.pdf /tmp/output 200
# Extract text from .docx
./document-summary-arrangement/scripts/docx-to-text.sh document.docx
Run scan-collection.sh first during the Triage phase — it gives you the file inventory and batch plan.
You will walk through 5 phases:
At each phase, present findings and confirm before proceeding.
Start here every time. Ask the user for:
| Matter Type | Default Arrangement | Why |
|---|---|---|
| Litigation | Chronological | Courts care about timeline; judges read events in order |
| Transactional / Due Diligence | By document type | Lawyers review all contracts together, all financials together |
| Immigration (O-1, EB-1, etc.) | By evidentiary criterion | Each criterion must be independently proved; group evidence accordingly |
| Estate / Probate | Chronological + by asset | Timeline of decedent's actions + inventory of assets |
| Regulatory / Compliance | By issue/topic | Each regulatory requirement maps to its own evidence set |
| IP / Patent | By document type | Prior art, prosecution history, licenses each need separate review |
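The defaults in the table above can be held in a simple lookup so the suggested arrangement is offered consistently. This is a minimal sketch: the key spellings and the `default_arrangement` helper are hypothetical, and the user's stated preference should always override the default.

```python
from typing import Optional

# Defaults copied from the table above; the lowercase keys are illustrative.
DEFAULT_ARRANGEMENT = {
    "litigation": "Chronological",
    "transactional": "By document type",
    "immigration": "By evidentiary criterion",
    "estate": "Chronological + by asset",
    "regulatory": "By issue/topic",
    "ip": "By document type",
}


def default_arrangement(matter_type: str) -> Optional[str]:
    """Return the suggested arrangement, or None so the caller asks the user."""
    return DEFAULT_ARRANGEMENT.get(matter_type.strip().lower())
```

Returning `None` for an unrecognized matter type keeps the decision with the user instead of silently picking an arrangement.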
- workspace/<matter-name>/
- workspace/<matter-name>/index/
- workspace/<matter-name>/matter_context.md

Then say: "Ready to receive documents. You can point me to a directory, or provide files one at a time. How would you like to proceed?"
This phase is critical for large document sets. Before reading any document content, scan the entire collection to understand what you're working with.
List all files recursively. For each file, record:
Size limits are real. You cannot read unlimited files at once. Plan your work:
| File Type | Typical Size | Max per batch |
|---|---|---|
| Text-based PDF (< 20 pages) | < 1MB | 5-8 per batch |
| Large PDF (> 20 pages) | 1-10MB | 1-2 per batch; use page ranges |
| Image (JPG, PNG) | 1-5MB each | 3-5 per batch |
| Phone photos / screenshots | 1-8MB each | 2-4 per batch |
| Word documents (.docx) | < 1MB | 5-8 per batch |
| Scanned PDF (image-based) | 5-50MB | 1 per batch; read specific pages |
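One way to turn the table above into a batch plan is sketched below. It assumes each file has already been labeled with a category; the category keys and the `plan_batches` helper are hypothetical and are not output of scan-collection.sh. The caps use the lower bound of each range in the table as a conservative choice.

```python
from collections import defaultdict

# Conservative per-batch caps: the lower bound of each range in the table above.
# The category keys are hypothetical labels chosen for this sketch.
MAX_PER_BATCH = {
    "small_pdf": 5,
    "large_pdf": 1,
    "image": 3,
    "phone_photo": 2,
    "docx": 5,
    "scanned_pdf": 1,
}


def plan_batches(files):
    """Group (name, category) pairs into batches that respect MAX_PER_BATCH.

    Categories are never mixed within a batch, which keeps the cap simple.
    """
    by_category = defaultdict(list)
    for name, category in files:
        by_category[category].append(name)

    batches = []
    for category, names in by_category.items():
        cap = MAX_PER_BATCH.get(category, 1)  # unknown types go one at a time
        for start in range(0, len(names), cap):
            batches.append((category, names[start:start + cap]))
    return batches
```

The number of batches returned is what would be reported as "Estimated batches needed" in the triage summary.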
Hard limits:
Before reading content, classify each file into one of three tiers based on file name, folder name, and file type:
| Tier | Description | How to Identify | What to Do |
|---|---|---|---|
| Key Document | The substantive document that proves something — a signed letter, a contract, a certificate, a formal invitation, a published article | Named descriptively (e.g., "offer letter.pdf", "SAFE agreement.pdf", "Judge Confirmation Letter.pdf"); is a PDF or .docx; comes from a formal source | Read fully. Extract all metadata. Write detailed summary. |
| Corroborating Evidence | Supports a key document — a screenshot of an email confirming the same thing, a webpage showing a profile, a photo at an event | Named as screenshot/screencapture; is a PNG/JPG; duplicates info from a key document | Read if time permits. Note what it corroborates. Brief summary only. |
| Low Value / Skip | Duplicate of another file, a .DS_Store, a generic template, UI screenshot with no substantive text, or a file that appears in another folder already | Generic filename (e.g., "Screenshot 2023-12-05 at 11.03.17 PM.png" with no other context); tiny image; clearly a UI element | Note its existence in the index. Do not spend time reading or summarizing. Mark as "Not reviewed — [reason]." |
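A first-pass tier guess from the file name alone might look like the sketch below. The regular expression for default screenshot names and the `triage` helper are assumptions made for illustration; the result is only a starting point and should be confirmed against folder context before it is presented to the user.

```python
import re
from pathlib import PurePath

# Matches default, context-free screenshot names such as
# "Screenshot 2023-12-05 at 11.03.17 PM" (pattern is an assumption).
GENERIC_SCREENSHOT = re.compile(
    r"^screen ?shot \d{4}-\d{2}-\d{2} at \d{1,2}\.\d{2}\.\d{2}(\s?[ap]m)?$"
)


def triage(path):
    """Guess a tier from the file name; return None when a human must decide."""
    p = PurePath(path)
    stem, suffix = p.stem.lower(), p.suffix.lower()
    if p.name == ".DS_Store" or GENERIC_SCREENSHOT.match(stem):
        return "Low Value / Skip"
    if suffix in {".png", ".jpg", ".jpeg"}:
        return "Corroborating Evidence"
    if suffix in {".pdf", ".docx"}:
        return "Key Document"
    return None  # unfamiliar type: do not guess a tier
```

A descriptively named image (for example a scanned certificate saved as a JPG) may still be a Key Document, so treat the image rule as a default rather than a verdict.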
Present the triage to the user before proceeding:
## Triage Summary
- **Total files:** X
- **Key documents:** X (will read fully)
- **Corroborating evidence:** X (will read selectively)
- **Low value / skipped:** X
- **Estimated batches needed:** X
### Folder Breakdown
| Folder | Files | Key | Corroborating | Skip |
|--------|-------|-----|---------------|------|
| ... | ... | ... | ... | ... |
Proceed with reading key documents first?
Before reading, flag files that appear to be duplicates based on:
Mark duplicates early. Only read one copy; note the other as "Duplicate of DOC-XXX."
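As one possible duplicate signal, byte-identical files can be found by hashing their content. This is a sketch under an assumption: content hashing is not necessarily among the criteria this skill intends, and it will miss near-duplicates such as a re-scanned or re-exported copy. The `find_duplicates` helper is hypothetical.

```python
import hashlib


def find_duplicates(files):
    """Map each byte-identical duplicate to the first file seen with that content.

    `files` is an iterable of (name, content_bytes) pairs in reading order.
    """
    first_seen = {}
    duplicates = {}
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
    return duplicates
```

The content bytes can be obtained with `pathlib.Path(name).read_bytes()`; each key in the result would be indexed as "Duplicate of DOC-XXX" once the original has its Doc ID.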
Process documents in batches, starting with Key Documents, then Corroborating Evidence.
If a batch fails:
| File Type | How to Read | Size Guidance |
|---|---|---|
| PDF (text-based) | Read directly. For PDFs > 10 pages, read in page ranges. | Up to 5 small PDFs per batch |
| PDF (scanned/image-based) | Attempt to read. If text layer is empty/garbled, it's a scan. Flag for OCR or read as image. | 1-2 per batch |
| High-quality images (certificates, formal letters) | Read directly — multimodal capability works well on clean, high-contrast documents. | 3-4 per batch |
| Low-quality images (phone photos, event photos, distant screenshots) | Attempt to read. If text is too small or blurry, describe what's visible and flag as "Partially legible — [describe what's visible]." Do NOT guess at text you can't clearly see. | 2-3 per batch |
| Screenshots of emails/webpages | Read directly. These are usually legible. Extract the email metadata (From, To, Date, Subject) and key body text. | 3-5 per batch |
| Word documents (.docx) | Read and extract text. | 5-8 per batch |
| Spreadsheets (.xlsx) | Read if possible. Note structure (columns, rows). Extract key data. | 2-3 per batch |
| HTML files | Read as text. Extract meaningful content, ignore markup. | 5+ per batch |
For every document you read, extract or infer:
| Field | Description | Required? |
|---|---|---|
| Doc ID | Sequential number (DOC-001, DOC-002, etc.) | Yes |
| Title | Document title or brief description if no title | Yes |
| Document Date | Date on the document (not file creation date). Use "Undated" if none found. | Yes |
| Document Type | See classification list in Phase 4 | Yes |
| Author / From | Who wrote or sent it | Yes |
| Recipient / To | Who it was sent to or addressed to | If applicable |
| Pages | Page count | Yes |
| Key Parties Mentioned | Names of parties relevant to the matter | Yes |
| Bates Range | If documents have Bates numbers, record the range | If present |
| Privilege Flag | Yes/No/Potentially — flag if the document may be attorney-client privileged or work product | Yes |
| Confidentiality | Any confidentiality markings on the document | If present |
| File Name | Original file name | Yes |
| Importance Tier | Key Document / Corroborating / Low Value | Yes |
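The fields in the table above map naturally onto a single record type. The sketch below is illustrative: the class name, attribute names, and `format_doc_id` helper are hypothetical. Required fields have no default, so a record cannot be created without them, and the privilege flag in particular is never assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


def format_doc_id(n: int) -> str:
    """Sequential ID in the DOC-001 style used throughout the index."""
    return f"DOC-{n:03d}"


@dataclass
class DocumentRecord:
    # Required fields (marked "Yes" in the table above)
    doc_id: str
    title: str
    document_type: str
    author: str
    pages: int
    key_parties: List[str]
    privilege_flag: str            # "Yes", "No", or "Potentially"
    file_name: str
    importance_tier: str
    # "Undated" is the documented fallback when no date is found
    document_date: str = "Undated"
    # Optional fields ("If applicable" / "If present")
    recipient: Optional[str] = None
    bates_range: Optional[str] = None
    confidentiality: Optional[str] = None
```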
If a document is:
[ILLEGIBLE]. Do not guess.

After each batch, report progress: "Batch X complete: processed Y documents, Z flagged. [N] batches remaining."
Assign each document to one of these types (or a type the user has defined):
Litigation types:
Transactional types:
Immigration types:
General types (applicable to any matter):
If the document could fit multiple types, use the most specific one and note alternatives.
Write summaries only for Key Documents and important Corroborating Evidence. Do not write individual summaries for every screenshot in a 70-file folder.
For Key Documents, use this structure:
## DOC-XXX: [Title]
**Type:** [Document Type] | **Tier:** Key Document
**Date:** [Date]
**From:** [Author] → **To:** [Recipient]
**Pages:** [N]
### Summary
[2-5 sentence summary of the document's substance. Focus on:
- What the document IS (a lease, a demand letter, a deposition excerpt)
- The key facts or terms it contains
- Any deadlines, obligations, or action items
- How it relates to the matter (if apparent)]
### Key Details
- [Bullet point: specific dates, amounts, terms, or facts worth noting]
- [Bullet point: names, addresses, or identifiers mentioned]
- [Bullet point: any cross-references to other documents]
### Corroborated By
- [List any corroborating evidence files that support this document]
### Flags
- [Privilege concern, if any]
- [Authenticity concern, if any — e.g., unsigned, undated, draft watermark]
- [Inconsistencies with other documents — e.g., "Title listed as X here but Y in DOC-XXX"]
- [Relevance note — appears highly relevant / routine / potentially irrelevant]
For Corroborating Evidence, use a shorter format:
## DOC-XXX: [Title] (Corroborating)
**Corroborates:** DOC-XXX
**Type:** [e.g., Email screenshot, webpage capture, event photo]
**Date:** [Date]
**Key Content:** [1-2 sentences: what it shows and what it confirms]
Summary rules:
Cross-document consistency checks: After summarizing all documents in a category, check for:
Flag every discrepancy — the lawyer needs to know about these.
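One kind of consistency check can be sketched as follows: collect the same fact from each document and report any field with more than one distinct value. The `find_discrepancies` helper and the field names are hypothetical, and the matching is exact (after trimming whitespace), so spelling variants will also be reported and need human judgment.

```python
from collections import defaultdict


def find_discrepancies(facts):
    """Return {field: {value: [doc_ids]}} for fields with conflicting values.

    `facts` is an iterable of (doc_id, field_name, value) tuples.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for doc_id, field_name, value in facts:
        seen[field_name][value.strip()].append(doc_id)
    return {
        name: dict(values)
        for name, values in seen.items()
        if len(values) > 1  # more than one distinct value means a conflict
    }
```

Each entry in the result corresponds to one row of the Discrepancies Log, with the document IDs already grouped by the value they state.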
After summarizing all documents, present an overview: "Summarized X key documents, Y corroborating. Here's the breakdown by type: [table]. Z discrepancies flagged. Ready to build the index?"
Create the index file at workspace/<matter-name>/index/document_index.md.
The index format depends on the user's arrangement preference from Phase 1. All formats share the same Matter Overview header.
# Document Index — [Matter Name]
> Generated [date] | [X] documents | Arranged chronologically
> Matter type: [type] | Key parties: [list]
> Date range: [earliest doc] to [latest doc]
## Matter Overview
[See Step 5C]
## Timeline
| Doc ID | Date | Type | Title | From → To | Pages | Tier | Notes |
|--------|------|------|-------|-----------|-------|------|-------|
| DOC-001 | 2024-01-15 | Contract | Master Services Agreement | Acme Corp → Client LLC | 12 | Key | Executed copy |
| DOC-002 | 2024-02-01 | Correspondence | Demand Letter | Client LLC → Acme Corp | 3 | Key | Re: breach of Section 4.2 |
## Undated Documents
[List any documents without dates]
## Potentially Privileged Documents
[List with minimal detail — ID, date, type, parties only]
## Summary Statistics
[See standard stats table]
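The chronological layout above can be produced mechanically once metadata is extracted. The sketch below assumes dates are stored as ISO `YYYY-MM-DD` strings (which sort correctly as text) or the literal "Undated"; the dictionary keys and both helper names are hypothetical.

```python
def split_timeline(docs):
    """Return (dated, undated): dated docs oldest first, undated kept apart."""
    dated = sorted(
        (d for d in docs if d["date"] != "Undated"),
        key=lambda d: (d["date"], d["doc_id"]),  # Doc ID breaks same-day ties
    )
    undated = [d for d in docs if d["date"] == "Undated"]
    return dated, undated


def render_row(d):
    """Format one Timeline table row in the column order shown above."""
    cells = [
        d["doc_id"], d["date"], d["type"], d["title"],
        f'{d["sender"]} → {d["recipient"]}',
        str(d["pages"]), d["tier"], d["notes"],
    ]
    return "| " + " | ".join(cells) + " |"
```

Rows from `dated` go under the Timeline heading and rows from `undated` go under Undated Documents.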
# Document Index — [Matter Name]
> Generated [date] | [X] documents | Arranged by document type
## Matter Overview
[See Step 5C]
## Contracts & Agreements
| Doc ID | Date | Title | Parties | Pages | Tier | Notes |
|--------|------|-------|---------|-------|------|-------|
## Correspondence
| Doc ID | Date | Title | From → To | Pages | Tier | Notes |
|--------|------|-------|-----------|-------|------|-------|
## [Continue for each type present]
This arrangement groups documents by the legal standard being proved. Each document may appear under multiple criteria if it supports more than one.
# Document Index — [Matter Name]
> Generated [date] | [X] documents | Arranged by evidentiary criterion
> Petition type: [e.g., O-1A, EB-1A, EB-2 NIW]
> Beneficiary: [Name] | Petitioner: [Name]
## Matter Overview
[See Step 5C]
## 1. [Criterion Name — e.g., "Awards / Prizes for Excellence"]
| Doc ID | Date | Document | Source | Tier | Key Content |
|--------|------|----------|--------|------|-------------|
## 2. [Criterion Name — e.g., "Membership in Distinguished Associations"]
| Doc ID | Date | Organization | Document Type | Tier | Key Details |
|--------|------|-------------|---------------|------|-------------|
## [Continue for each criterion]
## Criterion Coverage Summary
| Criterion | Key Docs | Corroborating | Strength Assessment |
|-----------|----------|---------------|-------------------|
| Awards | X | Y | [Strong/Moderate/Weak — based on document quality] |
## Documents Not Assigned to Any Criterion
| Doc ID | Date | Type | Title | Notes |
|--------|------|------|-------|-------|
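The document counts in the Criterion Coverage Summary can be tallied from the criterion assignments, as in the sketch below. The `coverage_summary` helper and the short tier labels are hypothetical. The Strength Assessment column is deliberately not computed: it depends on document quality and remains a judgment call.

```python
def coverage_summary(assignments):
    """Count distinct Key and Corroborating documents per criterion.

    `assignments` is an iterable of (criterion, doc_id, tier) tuples. A document
    listed under several criteria is counted once under each of them.
    """
    seen = {}
    for criterion, doc_id, tier in assignments:
        tiers = seen.setdefault(criterion, {"Key": set(), "Corroborating": set()})
        if tier in tiers:  # other tiers are not counted toward coverage
            tiers[tier].add(doc_id)
    return {
        criterion: {
            "key": len(tiers["Key"]),
            "corroborating": len(tiers["Corroborating"]),
        }
        for criterion, tiers in seen.items()
    }
```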
Immigration criterion templates:
For O-1A petitions, use these sections:
For EB-1A petitions, use these sections:
For EB-2 NIW petitions, use these sections:
For other immigration types, ask the user to define the criteria/elements.
# Document Index — [Matter Name]
> Generated [date] | [X] documents | Arranged by party
## [Party A Name]
| Doc ID | Date | Type | Title | Recipient | Pages | Tier | Notes |
|--------|------|------|-------|-----------|-------|------|-------|
## [Party B Name]
...
Requires the user to define the issues. Each document may appear under multiple issues.
# Document Index — [Matter Name]
> Generated [date] | [X] documents | Arranged by issue
## Issue 1: [Description]
| Doc ID | Date | Type | Title | Relevance | Tier | Pages |
|--------|------|------|-------|-----------|------|-------|
## Documents Not Assigned to Any Issue
...
Instead of generating a separate summaries file, include the key document summaries directly in the index — grouped under each section. This keeps everything in one place and avoids file sprawl.
For large collections (50+ documents), put full summaries in a separate document_summaries.md file and keep the index as a table-only reference.
At the top of the index file:
## Matter Overview
### Document Collection Summary
- **Total files in collection:** X
- **Key documents:** X (fully read and summarized)
- **Corroborating evidence:** X (selectively read)
- **Low value / not reviewed:** X
- **Date range:** [earliest] to [latest]
- **Key parties appearing:** [list with frequency count]
- **Primary document types:** [top 3 types by count]
- **Privileged documents flagged:** X (pending review)
- **Documents requiring further review:** X [list reasons — illegible, foreign language, too large, etc.]
### Key Observations
- [Factual observation about the document set — e.g., "There is a gap in correspondence between March and July 2024"]
- [Notable pattern — e.g., "All contracts reference the same Master Agreement dated Jan 15, 2024"]
- [Missing document indicators — e.g., "DOC-005 references an 'Exhibit A' that is not present in the collection"]
- [Inconsistencies found — e.g., "Job title differs between offer letter (DOC-058) and resume (DOC-060)"]
### Discrepancies Log
| Discrepancy | Documents | Details | Impact |
|-------------|-----------|---------|--------|
| [e.g., Title mismatch] | DOC-058, DOC-060 | "Senior Backend Engineer" vs "Tech Lead" | Should be reconciled before filing |
Observation rules:
Present the completed index to the user with:
If the user adds new documents after the initial index is built:
This skill produces the document foundation that other legal skills consume.
Immigration (this repo): downstream skills typically include:
Other matter types (examples):
The document index format is designed to be machine-readable so downstream skills can parse it.
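As an illustration of what parsing the index could involve, the sketch below reads one pipe-delimited table into a list of dictionaries keyed by column header. The `parse_table` helper is hypothetical and is not part of any downstream skill; it assumes cell text never contains a literal `|`.

```python
def parse_table(lines):
    """Parse one pipe-delimited Markdown table into dicts keyed by header."""
    rows = [ln.strip() for ln in lines if ln.strip().startswith("|")]
    if len(rows) < 2:
        return []

    def cells(row):
        return [c.strip() for c in row.strip("|").split("|")]

    header = cells(rows[0])
    records = []
    for row in rows[2:]:  # rows[1] is the |---| separator line
        values = cells(row)
        if len(values) == len(header):  # skip malformed rows
            records.append(dict(zip(header, values)))
    return records
```

Keeping one table per section, with a header row and a separator row, is what makes this kind of simple parsing possible.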
Before delivering the final index, verify: