Debug PDF text extraction issues. Use when the parser produces wrong data, missing units, or garbled text from a PDF.
PDFs store visual layout, not logical text structure. This causes extraction issues like merged columns, wrong reading order, and corrupted characters.
Create a temporary script to see exactly what pdfplumber extracts:
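# parser/debug_pdf.py
import pdfplumber

PAGE_NUMBER = 1  # 1-based page to inspect; change as needed

with pdfplumber.open("/path/to/SomeGame.pdf") as pdf:
    page = pdf.pages[PAGE_NUMBER - 1]  # pdf.pages is 0-indexed
    text = page.extract_text() or ""  # guard against pages with no text
    # repr() exposes hidden whitespace and odd characters
    for j, line in enumerate(text.split("\n")):
        print(f"{j:3}: {repr(line)}")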
Run with:
cd parser && uv run python debug_pdf.py
Text from sidebars gets concatenated with main content:
"sCenario 10: marChinG to CoLd harbor 4. Destroyed RR Stations: The following..."
Solution: Use a regex to extract only the part you need and ignore the trailing sidebar text.
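As a minimal sketch of that regex approach, the snippet below pulls the scenario number and title out of the merged sample line. The pattern, the `SCENARIO_RE` name, and the `extract_scenario` helper are illustrative assumptions, not part of the parser; in particular, the stop condition (the next numbered item such as " 4. ") is a guess that should be adapted to the PDF at hand.

```python
import re

# Sample line copied from the merged output above.
line = "sCenario 10: marChinG to CoLd harbor 4. Destroyed RR Stations: The following..."

# Hypothetical pattern: the title runs from "Scenario N:" up to the next
# numbered sidebar item (e.g. " 4. ") or the end of the line.
SCENARIO_RE = re.compile(
    r"scenario\s+(\d+):\s*(.+?)(?=\s+\d+\.\s|$)",
    re.IGNORECASE,
)


def extract_scenario(text):
    """Return (number, title), or None if there is no scenario heading."""
    m = SCENARIO_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


print(extract_scenario(line))
```

The lazy `.+?` combined with the lookahead keeps the match as short as possible, so the sidebar text after the title is left out.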
Small caps fonts appear as mixed case:
"sCenario" instead of "Scenario"
Solution: Use case-insensitive matching (re.IGNORECASE).
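A short sketch of case-insensitive matching on small-caps output follows. The variable names are illustrative, and the `str.title()` cleanup is an optional assumption for display purposes only; it is blunt (it would turn "RR" into "Rr"), so it may be safer to keep the raw text alongside it.

```python
import re

raw = "sCenario 10: marChinG to CoLd harbor"

# Case-insensitive match: works however the small caps were extracted.
match = re.match(r"scenario\s+(\d+)", raw, re.IGNORECASE)
scenario_number = int(match.group(1)) if match else None

# Optional, hypothetical cleanup: normalize casing for display only.
display_title = raw.title()

print(scenario_number, display_title)
```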
Tables may not parse correctly with extract_text(). Try:
tables = page.extract_tables()
for table in tables:
    for row in table:
        print(row)
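Rows returned by `extract_tables()` are lists of cells, and a cell may come back as `None` or contain embedded newlines. The helper below is a hypothetical sketch (the `clean_table` name and the sample data are assumptions, not taken from a real PDF) that normalizes such rows before further parsing.

```python
def clean_table(table):
    """Normalize one table in the shape returned by page.extract_tables()."""
    cleaned = []
    for row in table:
        # Replace None with "" and collapse newlines/runs of whitespace.
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):  # drop rows that are entirely empty
            cleaned.append(cells)
    return cleaned


# Hand-written sample in the same shape: a list of rows, each a list of cells.
sample = [["Unit", "Strength"], ["1st\nDivision", None], [None, ""]]
print(clean_table(sample))
```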
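# Find pages whose text contains a pattern
import re
import pdfplumber

with pdfplumber.open("/path/to/SomeGame.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""  # extract_text() can return None
        if re.search(r"pattern", text, re.IGNORECASE):
            print(f"Found on page {i + 1}")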
Delete parser/debug_pdf.py when done; it is only for investigation.