Skill Profile

Debug Pdf Extraction

Name: Debug Pdf Extraction
Author: Bestra

Debug PDF text extraction issues. Use when the parser produces wrong data, missing units, or garbled text from a PDF.

Bestra · 0 stars · January 14, 2026

Occupation
Category: Documentation

Skill Content

PDFs store visual layout, not logical text structure. This causes extraction issues like merged columns, wrong reading order, and corrupted characters.

Inspect Raw Page Text

Create a temporary script to see exactly what pdfplumber extracts:

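# parser/debug_pdf.py
import pdfplumber

PAGE_NUMBER = 1  # 1-based page to inspect; change as needed

with pdfplumber.open("/path/to/SomeGame.pdf") as pdf:
    page = pdf.pages[PAGE_NUMBER - 1]  # pdf.pages is 0-indexed
    text = page.extract_text() or ""  # guard against pages with no text

    # repr() exposes hidden whitespace and odd characters
    for j, line in enumerate(text.split("\n")):
        print(f"{j:3}: {repr(line)}")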

Run with:

cd parser && uv run python debug_pdf.py
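When the repr() output shows unfamiliar characters, naming them can help tell a font ligature from real corruption. The following is a minimal sketch using only the standard library; `describe_odd_chars` is a hypothetical helper introduced here for illustration, and the sample string is made up.

```python
import unicodedata


def describe_odd_chars(line):
    """Return (index, char, unicode_name) for each non-ASCII character in line."""
    found = []
    for index, char in enumerate(line):
        if ord(char) > 127:
            found.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return found


# Made-up sample: a ligature glyph stored as a single character
sample = "\ufb01re"
for index, char, name in describe_odd_chars(sample):
    print(f"col {index}: {char!r} = {name}")

# NFKC normalization folds compatibility characters such as ligatures
# back into plain letters
print(unicodedata.normalize("NFKC", sample))
```

Normalization of this kind only helps with compatibility characters; text that was mapped to the wrong code points by the PDF's font will still need manual inspection.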

Common Issues

Merged columns

Text from sidebars gets concatenated with main content:

"sCenario 10: marChinG to CoLd harbor 4. Destroyed RR Stations: The following..."
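One way to confirm a merged-column problem is to look at word positions instead of the joined text. The sketch below assumes word dictionaries shaped like the ones pdfplumber's `page.extract_words()` returns (keys `text`, `x0`, `top`). `split_columns` and the `split_x` boundary are hypothetical names for illustration, and the sample coordinates are made up.

```python
def split_columns(words, split_x):
    """Split words into (left, right) text by comparing each word's x0 to split_x."""

    def join(column):
        # Read top to bottom, then left to right
        ordered = sorted(column, key=lambda w: (round(w["top"]), w["x0"]))
        return " ".join(w["text"] for w in ordered)

    left = [w for w in words if w["x0"] < split_x]
    right = [w for w in words if w["x0"] >= split_x]
    return join(left), join(right)


# Made-up sample; with a real page use: words = page.extract_words()
words = [
    {"text": "Scenario", "x0": 40.0, "top": 100.0},
    {"text": "4.", "x0": 320.0, "top": 100.0},
    {"text": "10:", "x0": 95.0, "top": 100.0},
    {"text": "Destroyed", "x0": 335.0, "top": 100.0},
]
sidebar, main = split_columns(words, split_x=300)
print(sidebar)
print(main)
```

If the boundary between columns is known, cropping the page with `page.crop(bbox)` before calling `extract_text()` is another option worth trying.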

Weird capitalization

Words can come out with scrambled letter case, for example "sCenario" instead of "Scenario".
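A rough heuristic for spotting and repairing this kind of scrambled case is sketched below. It is an assumption, not the skill's own method: `has_weird_caps` and `fix_caps` are hypothetical helpers that flag any word where an uppercase letter follows a lowercase one, so legitimate names such as "McDonald" would be flagged too.

```python
import re

# A lowercase letter immediately followed by an uppercase one
_MIXED_CASE = re.compile(r"[a-z][A-Z]")


def has_weird_caps(word):
    """Return True if the word contains an uppercase letter right after a lowercase one."""
    return bool(_MIXED_CASE.search(word))


def fix_caps(line):
    """Capitalize flagged words (first letter upper, rest lower); leave other words alone."""
    return " ".join(
        word.capitalize() if has_weird_caps(word) else word
        for word in line.split(" ")
    )


print(fix_caps("sCenario 10: marChinG to CoLd harbor"))
```

Untouched words keep their original case, so the repair is only partial and the output should be reviewed before it is trusted.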

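Table extraction issues

Print the extracted tables row by row to see how pdfplumber split the cells:

# `page` is the pdfplumber page object from the debug script
tables = page.extract_tables()
for table in tables:
    for row in table:
        print(row)  # each row is a list of cell strings (None for empty cells)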
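Rows with empty cells are often the ones where two cells were merged or a value drifted into a neighbor. The sketch below assumes the list-of-rows shape that `extract_tables()` yields; `rows_with_gaps` is a hypothetical helper and the sample table is made up.

```python
def rows_with_gaps(table):
    """Return indexes of rows that contain an empty cell (None or blank text)."""
    return [
        index
        for index, row in enumerate(table)
        if any(cell is None or not cell.strip() for cell in row)
    ]


# Made-up sample; with a real page use: table = page.extract_tables()[0]
table = [
    ["Unit", "Strength", "Morale"],
    ["1st Corps", "12", "B"],
    ["2nd Corps 9", None, "C"],
]
print(rows_with_gaps(table))
```

Here the third row is flagged because the strength value appears to have merged into the first cell.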

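Inspect Specific Elements

# Find text containing a pattern
import re
import pdfplumber

with pdfplumber.open("/path/to/SomeGame.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""  # guard against pages with no text
        if re.search(r"pattern", text, re.IGNORECASE):
            print(f"Found on page {i + 1}")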
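Once the page is known, narrowing the search to individual lines makes it easier to compare against the numbered output of the debug script. A minimal sketch, assuming the page text has already been extracted; `matching_lines` is a hypothetical helper and the sample text is made up.

```python
import re


def matching_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.split("\n"))
        if regex.search(line)
    ]


# Made-up page text; with a real page use: text = page.extract_text() or ""
text = "Scenario 10\nDestroyed RR Stations\nVictory conditions"
for number, line in matching_lines(text, r"rr stations"):
    print(f"{number:3}: {line!r}")
```

Line numbers are 0-based so they line up with the `enumerate` output of the debug script.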

Clean Up

Delete the temporary debug script (parser/debug_pdf.py) once debugging is done.
