Extract data from PDFs into CSV or spreadsheet format. Parse tables, invoices, reports, and structured documents into clean, usable data.
Pull data out of PDFs and put it into spreadsheets. Tables, invoices, reports, forms -- if the data is in a PDF, this skill gets it out and into a format you can actually work with.
Read the PDF content, identify structured data (tables, line items, repeated patterns), extract it, and write it to a CSV file that opens in any spreadsheet app. CSV output uses the csv module from the Python standard library; PDF text extraction uses pypdf, pdfplumber, or pdftotext, whichever is available.
Trigger phrases:
Ask for the file path or accept it from the user's message. Confirm the file exists and is readable.
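The existence and readability check can be sketched as follows. This is a minimal sketch, not a definitive implementation: `check_pdf_path` is a hypothetical helper name, and the `%PDF-` header test assumes the file starts with the usual PDF signature.

```python
from pathlib import Path


def check_pdf_path(raw_path: str) -> Path:
    """Return a Path if raw_path points at a readable PDF, otherwise raise."""
    # Strip whitespace and quotes that often surround pasted paths
    path = Path(raw_path.strip().strip('"\'')).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    # Opening the file proves it is readable; PDFs normally begin with "%PDF-"
    with path.open("rb") as f:
        header = f.read(5)
    if header != b"%PDF-":
        raise ValueError(f"{path} does not look like a PDF")
    return path
```

If the check raises, report the message to the user and ask for the path again rather than guessing at a different file.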
"Which PDF do you want me to extract data from? Give me the file path."
Use Python to extract text from the PDF. Try these approaches in order:
Approach A: PyPDF2 / pypdf (if installed)
import pypdf

reader = pypdf.PdfReader("document.pdf")
text = ""
for page in reader.pages:
    # extract_text() can come back empty for scanned or image-only pages
    text += (page.extract_text() or "") + "\n"
Approach B: pdfplumber (if installed, best for tables)
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            # Each table is a list of rows, each row a list of cells
            for row in table:
                print(row)
Approach C: pdftotext command line (if available)
pdftotext -layout "document.pdf" "output.txt"
Approach D: Read the tool output
If using Claude Code, the Read tool can read PDFs directly. Use it to view the content, then parse the structure yourself.
Check what tools are available first. Install what is needed with pip install pypdf pdfplumber if the user approves.
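One way to check what is available before asking to install anything is sketched below. `available_extractors` is a hypothetical helper name; it only reports which of the tools from Approaches A to C can be found and does not install anything.

```python
import importlib.util
import shutil


def available_extractors() -> list[str]:
    """List the extraction tools found on this machine, in the order the approaches are tried."""
    found = []
    # Approaches A and B are Python packages; find_spec returns None when a package is missing
    for module in ("pypdf", "pdfplumber"):
        if importlib.util.find_spec(module) is not None:
            found.append(module)
    # Approach C is a command-line program that must be on PATH
    if shutil.which("pdftotext") is not None:
        found.append("pdftotext")
    return found
```

An empty list means none of the three is present, which is the point at which to ask the user about installing pypdf and pdfplumber.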
Look at the extracted text and identify:
Tell the user what you found:
"I found a table on page 2 with 5 columns: Date, Description, Quantity, Unit Price, Total. It has 23 rows of data. Want me to extract that?"
Parse the data into clean rows and columns.
For tables:
For invoices:
For reports:
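Whichever document type applies, the parsing step usually comes down to splitting lines into cells and normalizing each cell. The sketch below assumes layout-preserved text (such as pdftotext -layout output) in which columns are separated by two or more spaces. `split_columns`, `clean_cell`, and `parse_amount` are hypothetical helper names, not part of any library.

```python
import re


def clean_cell(value) -> str:
    """Collapse whitespace in a cell; treat None (an empty pdfplumber cell) as an empty string."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", str(value)).strip()


def split_columns(text: str, min_columns: int = 2) -> list[list[str]]:
    """Split each line on runs of 2+ spaces and keep lines that yield at least min_columns cells."""
    rows = []
    for line in text.splitlines():
        cells = [clean_cell(c) for c in re.split(r"\s{2,}", line.strip())]
        cells = [c for c in cells if c]
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def parse_amount(value):
    """Convert text such as '$2,500.00' or '(150.00)' to a float; return None if it is not numeric."""
    text = clean_cell(value)
    # Accounting notation writes negative amounts in parentheses
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[^\d.\-]", "", text)
    try:
        number = float(digits)
    except ValueError:
        return None
    return -number if negative else number
```

Splitting on whitespace is only a heuristic: a cell that itself contains a double space will be split in two, so compare the column count of each row against the header before writing anything out.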
Use Python's csv module to write clean output:
import csv

headers = ["Date", "Description", "Quantity", "Unit Price", "Total"]
rows = [
    # ... extracted data ...
]
output_path = "extracted_data.csv"
with open(output_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)
Save the CSV next to the original PDF, or in a location the user specifies.
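Deriving the output location from the PDF path might look like the sketch below. `default_csv_path` is a hypothetical helper, and the `_data.csv` suffix is an assumed naming convention rather than a requirement.

```python
from pathlib import Path


def default_csv_path(pdf_path, output_dir=None) -> Path:
    """Build the CSV path: next to the PDF by default, or inside output_dir when one is given."""
    pdf = Path(pdf_path)
    target_dir = Path(output_dir) if output_dir is not None else pdf.parent
    # Reuse the PDF's base name so the CSV is easy to match to its source
    return target_dir / f"{pdf.stem}_data.csv"
```

For example, `default_csv_path("docs/invoice.pdf")` gives `docs/invoice_data.csv`, while passing `output_dir="out"` places the file in `out` instead.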
After extraction:
"Done! Extracted 23 rows and 5 columns from your invoice.
Saved to: C:/Users/you/Documents/invoice_data.csv
Preview:
| Date       | Description   | Qty | Unit Price | Total     |
|------------|---------------|-----|------------|-----------|
| 2025-03-01 | Web Design    | 1   | $2,500.00  | $2,500.00 |
| 2025-03-01 | Logo Package  | 1   | $800.00    | $800.00   |
| 2025-03-15 | Hosting Setup | 1   | $150.00    | $150.00   |
Open it in Excel, Google Sheets, or any spreadsheet app."
If the user has a batch of PDFs (e.g., a folder of invoices):
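One possible shape for batch processing is sketched here, under stated assumptions: every PDF in the folder shares the same column layout, and the caller supplies `extract_rows`, a function that takes a PDF path and returns a list of rows. `combine_pdfs_to_csv`, the `Source File` column, and the `combined.csv` default are all hypothetical choices.

```python
import csv
from pathlib import Path


def combine_pdfs_to_csv(folder, extract_rows, headers, output_name="combined.csv"):
    """Run extract_rows on every PDF in folder and write all rows to one CSV.

    Returns the output path and the number of data rows written.
    """
    folder = Path(folder)
    output_path = folder / output_name
    total = 0
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # The extra first column records which PDF each row came from
        writer.writerow(["Source File", *headers])
        # sorted() keeps the output order stable between runs
        for pdf_path in sorted(folder.glob("*.pdf")):
            for row in extract_rows(pdf_path):
                writer.writerow([pdf_path.name, *row])
                total += 1
    return output_path, total
```

Note that `glob("*.pdf")` is case-sensitive on some platforms, so files named `.PDF` may need a second pattern.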
| PDF Type | What to Extract | Output Columns |
|---|---|---|
| Invoice | Line items | Description, Qty, Price, Amount, Invoice#, Date |
| Bank Statement | Transactions | Date, Description, Debit, Credit, Balance |
| Report | Metrics | Category, Period, Value, Change |
| Form | Field values | Field Name, Value |
| Receipt | Items purchased | Item, Price |
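The table above can be carried in code as a simple lookup, so the CSV header row follows from the document type. This is only an illustrative sketch: `COLUMN_TEMPLATES` and `headers_for` are hypothetical names, and the column lists are copied from the table.

```python
# Output columns per PDF type, copied from the table above
COLUMN_TEMPLATES = {
    "invoice": ["Description", "Qty", "Price", "Amount", "Invoice#", "Date"],
    "bank statement": ["Date", "Description", "Debit", "Credit", "Balance"],
    "report": ["Category", "Period", "Value", "Change"],
    "form": ["Field Name", "Value"],
    "receipt": ["Item", "Price"],
}


def headers_for(pdf_type: str) -> list[str]:
    """Return a copy of the header list for a PDF type, ignoring case and surrounding spaces."""
    key = pdf_type.strip().lower()
    if key not in COLUMN_TEMPLATES:
        raise KeyError(f"No column template for PDF type: {pdf_type!r}")
    return list(COLUMN_TEMPLATES[key])
```

Treat these as starting points: when the PDF has its own header row, prefer the headers found in the document.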
The user receives:
Built by AetherKin -- AI that's family, not a framework.