Extract and process Fidelity Tax Reporting Statements (1099-B and 1099-DIV). Use for parsing Fidelity consolidated 1099 PDFs to extract sales transactions, dividends, and distributions. Generates estate accounting schedules with monthly totals using auditor terminology (Total Income, Total Disbursements, Net FMV Change). Triggers on Fidelity 1099, tax statement, estate accounting, brokerage tax forms, 1099-B sales, 1099-DIV dividends.
Parse Fidelity Tax Reporting Statement PDFs to extract sales and dividend transactions, with accurate 1099-B section classification and 1099-DIV box mapping.
# Auto-generates YYYY-XXXX-Estate-Report.xlsx from PDF filename
python scripts/generate_estate_report.py path/to/fidelity_1099.pdf
# Or specify custom output path
python scripts/generate_estate_report.py path/to/fidelity_1099.pdf output_report.xlsx
Output filename is normalized to YYYY-XXXX-Estate-Report.xlsx format (e.g., 2020-9147-Estate-Report.xlsx).
from scripts.generate_estate_report import generate_report
generate_report("path/to/fidelity_1099.pdf") # Auto-generates output filename
# Or use individual extractors:
from scripts.extract_sales import extract_sales_from_pdf
from scripts.extract_dividends import extract_dividends_from_pdf
sales = extract_sales_from_pdf("fidelity_1099.pdf")
dividends, totals = extract_dividends_from_pdf("fidelity_1099.pdf")
Extract sales from text using regex patterns:
# Symbol line format: COMPANY NAME,SYMBOL,CUSIP (3 fields)
# Also handles 2-field format: DESCRIPTION,CUSIP (for exchanged securities)
symbol_pattern_3field = r'^([A-Z0-9][A-Z0-9 &\-\.\'#/\*!()\$]+?),([A-Z]{1,5}),([A-Z0-9]{9})$'
symbol_pattern_2field = r'^([A-Z][A-Z0-9 &\-\.\'\*#/()\$]+),([A-Z0-9]{9})$'
# Sale line format: Type Quantity DateAcquired DateSold Proceeds CostBasis
sale_pattern = r'^(Sale|Cash In Lieu|Merger|Redemption|Exchange)\s+([\d,.]+)\s+(\d{2}/\d{2}/\d{2})\s+(\d{2}/\d{2}/\d{2})\s+([\d,.]+)\s+([\d,.]+)'
# Unknown term pattern: Type Quantity Unknown DateSold Proceeds Unknown
unknown_term_pattern = r'^(Sale|Cash In Lieu|...)\s+([\d,.]+)\s+Unknown\s+(\d{2}/\d{2}/\d{2})\s+([\d,.]+)\s+Unknown'
Classify by section headers:
Critical: Must exclude Capital Gains Distribution pages from dividend extraction.
def is_dividend_section(text):
"""Check if page contains dividends (not capital gains)."""
if "Dividends and Distributions" not in text:
return False
# CRITICAL: Exclude Capital Gains pages
lines = text.split('\n')[:25]
for line in lines:
if "Total Capital Gains Distributions Detail" in line:
return False
if "Capital Gain Distr" in line and "2a" in line:
return False
return True
Use word-level extraction with x-position column mapping:
words = page.extract_words()
for word in words:
x = word['x0']
if 100 <= x < 200: # Box 1a Ordinary Dividends
elif 340 <= x < 420: # Box 1b Qualified Dividends
elif 480 <= x < 575: # Box 11 Exempt Interest
elif 575 <= x < 700: # Box 12 Private Activity Bond
elif 700 <= x < 800: # Box 7 Foreign Tax Paid
Critical: Handle !C prefix (CORRECTED entries):
# Matches: "01/10/20", "!C 02/11/20", "!C02/11/20"
date_pattern = r'^!?C?\s*(\d{2}/\d{2}/\d{2})(?:\s|$)'
Track current symbol while iterating through text lines. Reset on separator (dashes):
current_symbol = None
for line in lines:
if SEPARATOR_PATTERN.match(line): # Dashes reset context
current_symbol = None
continue
if symbol_pattern.match(line):
current_symbol = extract_symbol(line)
elif date_pattern.match(line) and current_symbol:
# Associate this transaction with current_symbol
The extraction automatically validates against 1099 summary totals from the PDF:
from scripts.validate_extraction import validate_extraction
report = validate_extraction(pdf_path, sales, dividends, div_totals)
print(report) # Shows pass/fail for each field with $0.02 tolerance
Validated fields:
| Issue | Symptom | Fix |
|---|---|---|
| Capital Gains in Dividends | Box 1a over-reported by ~$2,500+ | Update is_dividend_section() to exclude pages with "Total Capital Gains Distributions Detail" header |
| Missing !C entries | Box totals under-reported | Ensure date pattern handles !C prefix with/without space |
| Wrong symbol association | Amounts on wrong symbol | Reset symbol context on separator lines (dashes) |
| Column value mismatch | Wrong box amounts | Recalibrate x-position boundaries from actual PDF |
| 2-field securities missed | Missing exchanged securities | Add pattern for DESCRIPTION,CUSIP format (no symbol) |
Example debug for Box 1a failure ($2,661.91 over):
# Found $2,499.87 AEMSX on page 226 - a Capital Gains Distribution page
# Fix: Added check for "Total Capital Gains Distributions Detail" header
Generated Excel report with 7 sheets, designed for auditor review. Uses auditor terminology:
1. START HERE - Instructions and navigation guide
2. Reconciliation - Formula-based validation (auditor's first stop)
'6. Sales Detail' and '7. Dividend Detail'3. Annual Summary - High-level totals by 1099 category
4. Monthly Detail - Monthly breakdown for abbreviated reporting
5. Monthly by Symbol - Monthly breakdown by security symbol
6. Sales Detail - Individual 1099-B transactions
7. Dividend Detail - Individual 1099-DIV transactions
Reconciliation sheet formulas reference these columns:
Reports are automatically named YYYY-XXXX-Estate-Report.xlsx:
YYYY = Tax year extracted from PDF filenameXXXX = Last 4 digits of account numberExamples:
2020-Executor-9147-Consolidated-Form-1099.pdf → 2020-9147-Estate-Report.xlsx2020-Trust-Under-Agreement-9431-Consolidated-Form-1099.pdf → 2020-9431-Estate-Report.xlsxY97-127622_2020-Individual-7622-Consolidated-Form-1099.pdf → 2020-7622-Estate-Report.xlsxextract_sales.py - Extract and classify 1099-B sales transactionsextract_dividends.py - Extract 1099-DIV dividends with x-position mappinggenerate_estate_report.py - Create 7-sheet Excel report with validationvalidate_extraction.py - Validate extracted totals against PDF summary pagescolumn_positions.md - X-position boundaries for dividend columns1099_box_reference.md - IRS 1099-B and 1099-DIV box meaningspip install pdfplumber openpyxl