Extract structured data from multiple documents into a comparison matrix with citations. Use for bulk document review.
A production-ready toolkit for extracting structured data from multiple legal documents into a comparison matrix with citations. It supports user-defined extraction columns, parallel processing with up to 10 agents, confidence scoring, and output as a markdown table or structured JSON. It is designed for legal teams performing bulk contract review, NDA comparison, employment agreement analysis, and lease review.
scripts/document_discovery.py: Scan a directory for legal documents and generate an inventory manifest.

    python scripts/document_discovery.py /path/to/contracts
    python scripts/document_discovery.py /path/to/ndas --types pdf,docx --json
    python scripts/document_discovery.py /path/to/leases --types pdf,docx,txt,md --min-size 1024
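The `--types` and `--min-size` flags above suggest that discovery filters files by extension and size. The following is a minimal sketch of that idea, not the actual implementation of `document_discovery.py`; the function name `discover_documents` and the manifest fields (`path`, `type`, `size_bytes`) are assumptions made for illustration.

```python
from pathlib import Path


def discover_documents(root, types=("pdf", "docx"), min_size=0):
    """Return a manifest of files under root that match the given extensions.

    Hypothetical helper: the field names are illustrative and are not
    taken from the real script's manifest schema.
    """
    wanted = {t.lower().lstrip(".") for t in types}
    manifest = []
    # rglob("*") walks the directory tree recursively; sorting keeps output stable.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower().lstrip(".")
        size = path.stat().st_size
        if ext in wanted and size >= min_size:
            manifest.append({"path": str(path), "type": ext, "size_bytes": size})
    return manifest
```

Extension matching is case-insensitive in this sketch, so `LEASE.PDF` and `lease.pdf` are treated alike; whether the real script does the same is not stated here.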
scripts/extraction_aggregator.py: Aggregate multiple extraction result JSONs into a unified comparison matrix.

    python scripts/extraction_aggregator.py \
      --results extraction_1.json extraction_2.json extraction_3.json

    python scripts/extraction_aggregator.py \
      --results-dir ./extraction_results/ --json

    python scripts/extraction_aggregator.py \
      --results-dir ./extraction_results/ \
      --format markdown \
      --output review_matrix.md

    python scripts/extraction_aggregator.py \
      --results extraction_1.json extraction_2.json \
      --columns "Parties,Effective Date,Term,Governing Law"
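To make the aggregation step concrete, here is a rough sketch of how per-document results could be merged into a matrix, with optional column filtering as in `--columns`. It is not the code of `extraction_aggregator.py`. The input shape shown in the docstring is an assumption; the authoritative schema is the one in references/extraction_methodology.md.

```python
def aggregate(results, columns=None):
    """Merge extraction results into {document: {column: cell}}.

    Assumed (hypothetical) input shape for each result:
      {"document": "a.pdf",
       "extractions": {"Term": {"value": "3 years",
                                "citation": "p.3",
                                "confidence": "HIGH"}}}
    """
    candidates = {}
    for result in results:
        row = candidates.setdefault(result["document"], {})
        for column, cell in result["extractions"].items():
            if columns is None or column in columns:
                row.setdefault(column, []).append(cell)

    matrix = {}
    for document, row in candidates.items():
        matrix[document] = {}
        for column, cells in row.items():
            distinct_values = {cell["value"] for cell in cells}
            # Keep the first value seen and flag disagreement for manual review.
            matrix[document][column] = {**cells[0], "conflict": len(distinct_values) > 1}
    return matrix
```

Keeping the first value and setting a `conflict` flag is one possible policy; it leaves the decision to a human reviewer instead of silently picking a winner.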
| Reference | Purpose |
|---|---|
| references/extraction_methodology.md | Document extraction best practices, JSON schema, agent prompts |
| references/common_extraction_columns.md | Pre-defined column sets for contracts, NDAs, employment, leases |
| Step | Action | Tool | Output |
|---|---|---|---|
| 1. Gather Requirements | Define document folder, output filename, columns to extract | Manual | Column list, file path |
| 2. Discover Documents | Scan directory for target documents | document_discovery.py | Document manifest JSON |
| 3. Process Documents | Extract values per column with citations (parallel agents) | AI agents (external) | Per-document extraction JSONs |
| 4. Collect Results | Aggregate extraction JSONs into unified matrix | extraction_aggregator.py | Consolidated matrix |
| 5. Generate Output | Export as markdown table or structured JSON | extraction_aggregator.py | Final deliverable |
| Agents | Documents per Agent | Use When |
|---|---|---|
| 1 | All | 1-5 documents |
| 2-3 | ceil(N/agents) | 6-15 documents |
| 4-6 | ceil(N/agents) | 16-40 documents |
| 7-10 | ceil(N/agents) | 41-100 documents |
| 10 (max) | ceil(N/10) | 101+ documents |
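The allocation table can be expressed as a small planning function. This is a sketch under one stated assumption: where the table gives a range of agents (for example 2-3), the code picks the upper end of the range. The names `plan_agents` and `split_batches` are hypothetical.

```python
import math

MAX_AGENTS = 10
# (largest document count in the tier, agents to use). Using the upper end
# of each agent range is an assumption; the table leaves that choice open.
_TIERS = [(5, 1), (15, 3), (40, 6), (100, 10)]


def plan_agents(doc_count):
    """Return (agents, documents_per_agent) for a given document count."""
    if doc_count <= 0:
        return 0, 0
    agents = MAX_AGENTS
    for limit, tier_agents in _TIERS:
        if doc_count <= limit:
            agents = tier_agents
            break
    agents = min(agents, doc_count)
    return agents, math.ceil(doc_count / agents)


def split_batches(documents, agents):
    """Split documents into batches of ceil(N/agents) items.

    Because of rounding up, this can yield fewer batches than agents.
    """
    size = math.ceil(len(documents) / agents)
    return [documents[i:i + size] for i in range(0, len(documents), size)]
```

For 25 documents this plan uses 6 agents with at most 5 documents each, which `split_batches` turns into 5 full batches.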
Each agent receives a prompt structured as:
    You are reviewing {count} legal documents. For each document, extract the
    following columns:

    {column_definitions}

    For each value extracted:
    1. Provide the exact value found
    2. Include the page number (PDF) or section/paragraph (DOCX/MD)
    3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
    4. If not found, record "NOT FOUND" with confidence LOW

    Output as JSON per the extraction schema.
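Filling the `{count}` and `{column_definitions}` placeholders can be done with plain string formatting. The sketch below assumes column definitions are supplied as a name-to-description mapping and rendered one per line; that rendering and the function name `build_prompt` are illustrative choices, not something the template prescribes.

```python
PROMPT_TEMPLATE = """You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema."""


def build_prompt(documents, columns):
    """Render the agent prompt for one batch.

    documents: list of document names assigned to the agent.
    columns: dict mapping column name to its definition (assumed format).
    """
    definitions = "\n".join(f"- {name}: {text}" for name, text in columns.items())
    return PROMPT_TEMPLATE.format(count=len(documents), column_definitions=definitions)
```

In practice the document contents or paths would also need to reach the agent; how that happens depends on the agent runtime and is outside this sketch.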
| Level | Color Code | Definition |
|---|---|---|
| HIGH | Green | Exact value found with clear citation |
| MEDIUM | Yellow | Value inferred from context; multiple possible interpretations |
| LOW | Red / Not Found | Value uncertain or not found in document |
Sheet 1: Document Review
| Document | Parties | Effective Date | Term | Governing Law | ... |
|---|---|---|---|---|---|
| contract_a.pdf | Acme / Beta [p.1] | 2026-01-15 [p.2] | 3 years [p.3] | Delaware [p.12] | ... |
| contract_b.pdf | Gamma / Delta [p.1] | NOT FOUND | 2 years [p.4] | New York [p.10] | ... |
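A matrix in the layout above can be produced with a few lines of string handling. This sketch assumes each cell is a `(value, citation)` pair and that missing cells should read NOT FOUND; the function name `render_matrix` is hypothetical. Pipe characters inside values are escaped, since unescaped pipes are a common reason for misaligned markdown tables.

```python
def render_matrix(rows, columns):
    """Render {document: {column: (value, citation)}} as a markdown table."""

    def escape(text):
        # A literal "|" inside a cell would otherwise start a new column.
        return str(text).replace("|", "\\|")

    lines = [
        "| Document | " + " | ".join(columns) + " |",
        "|" + "---|" * (len(columns) + 1),
    ]
    for document, cells in rows.items():
        rendered = []
        for column in columns:
            value, citation = cells.get(column, ("NOT FOUND", None))
            if citation:
                rendered.append(f"{escape(value)} [{citation}]")
            else:
                rendered.append(escape(value))
        lines.append(f"| {escape(document)} | " + " | ".join(rendered) + " |")
    return "\n".join(lines)
```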
Sheet 2: Summary
| Metric | Value |
|---|---|
| Documents processed | 25 |
| Columns extracted | 8 |
| Average confidence | 87% |
| Not found rate | 12% |
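Most of the summary metrics follow directly from the extracted cells. The sketch below computes document and column counts, the not-found rate, and a count per confidence level. It deliberately does not compute an "Average confidence" percentage, because the mapping from HIGH/MEDIUM/LOW to a number is not defined here. The cell shape and the name `summarize` are assumptions.

```python
from collections import Counter


def summarize(cells_by_document):
    """Summarize {document: {column: {"value": ..., "confidence": ...}}}."""
    cells = [cell for row in cells_by_document.values() for cell in row.values()]
    columns = {column for row in cells_by_document.values() for column in row}
    not_found = sum(1 for cell in cells if cell["value"] == "NOT FOUND")
    total = len(cells)
    return {
        "documents_processed": len(cells_by_document),
        "columns_extracted": len(columns),
        "confidence_counts": dict(Counter(cell["confidence"] for cell in cells)),
        # Guard against division by zero when nothing was extracted.
        "not_found_rate": not_found / total if total else 0.0,
    }
```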
Contract columns:

| Column | What to Extract |
|---|---|
| Parties | All contracting parties with full legal names |
| Effective Date | Contract effective or execution date |
| Term | Duration of the agreement |
| Renewal | Auto-renewal terms and notice period |
| Governing Law | Jurisdiction governing the agreement |
| Liability Cap | Maximum liability amount or formula |
| Indemnification | Indemnification obligations and scope |
| IP Ownership | Intellectual property ownership provisions |
| Termination Rights | Termination triggers and notice requirements |
| Data Protection | Data protection or privacy obligations |
NDA columns:

| Column | What to Extract |
|---|---|
| Parties | Disclosing and receiving parties |
| Type | Mutual or one-way |
| Definition Scope | How "confidential information" is defined |
| Exceptions | Standard exceptions to confidentiality |
| Term | Duration of confidentiality obligations |
| Survival | Survival period after termination |
| Return/Destruction | Obligations on termination |
| Remedies | Available remedies for breach |
| Problem | Cause | Solution |
|---|---|---|
| Discovery finds 0 documents | Wrong path or file types | Verify path exists; check --types matches actual file extensions |
| Extraction JSONs have wrong schema | Agent prompt incomplete | Use the extraction schema from extraction_methodology.md |
| Aggregator shows conflicts | Multiple values for same cell | Review source documents; aggregator marks conflicts for manual review |
| High "NOT FOUND" rate | Columns too specific for document type | Use column definitions from common_extraction_columns.md; broaden definitions |
| Confidence all LOW | Agent unable to locate values | Check column definitions are specific enough; verify document is readable |
| Aggregator crashes on large set | Too many result files loaded at once | Process in batches of 50 results; use --columns to limit output width |
| Markdown table misaligned | Long values or special characters | Use --format json for machine processing; truncate long values |
| Missing citations | Agent did not include page/section references | Reinforce citation requirement in agent prompt; check extraction schema |
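The batching advice for large result sets can be scripted. This sketch groups result files into batches of 50 and builds one aggregator command per batch using the `--results` flag shown earlier. It only constructs the argument lists; running them (for example with `subprocess.run`) and merging the per-batch outputs are left out. The function name `batch_commands` is hypothetical.

```python
def batch_commands(result_files, batch_size=50,
                   script="scripts/extraction_aggregator.py"):
    """Build one aggregator command (as an argv list) per batch of files."""
    files = sorted(str(f) for f in result_files)
    commands = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        commands.append(["python", script, "--results", *batch])
    return commands
```

Returning argv lists instead of shell strings avoids quoting problems when file names contain spaces.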
This skill covers:
This skill does NOT cover:
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Vague column definitions | "Date" could match dozens of dates in a contract | Use specific definitions: "Effective Date" with guidance on where to look |
| Skipping document discovery | Unknown document count leads to wrong agent allocation | Always run discovery first; use manifest for pipeline planning |
| Ignoring LOW confidence results | Missing or uncertain data treated as fact | Review all LOW confidence cells manually; flag in final report |
| Processing 100+ docs with 1 agent | Slow, context window overflow, quality degradation | Use parallel processing: ceil(N/10) documents per agent, max 10 agents |
| No citation requirement | Cannot verify extracted values against source | Require page/section citation for every extraction; reject uncited values |
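The last row, rejecting uncited values, is straightforward to enforce before aggregation. The sketch below partitions one document's cells into accepted and rejected sets. It assumes each cell is a dict with `value` and `citation` keys, and it lets NOT FOUND cells through since there is nothing to cite; both are assumptions, as is the name `split_by_citation`.

```python
def split_by_citation(extractions):
    """Partition cells into (accepted, rejected) by presence of a citation.

    extractions: {column: {"value": ..., "citation": ...}} (assumed shape).
    """
    accepted, rejected = {}, {}
    for column, cell in extractions.items():
        # Treat None, empty, and whitespace-only citations as missing.
        has_citation = bool(str(cell.get("citation") or "").strip())
        if has_citation or cell.get("value") == "NOT FOUND":
            accepted[column] = cell
        else:
            rejected[column] = cell
    return accepted, rejected
```

Rejected cells can then be sent back to the agent with the citation requirement restated, or routed to manual review.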