Validate PDF to Markdown conversion quality using multi-dimensional metrics. Assess table accuracy, style preservation (bold/italic/headings), robustness, and performance with standardized F1-scoring methodology.
Validate PDF to Markdown conversion quality using a comprehensive, multi-dimensional evaluation framework. This skill provides standardized metrics, evaluation harnesses, and reporting tools to assess conversion fidelity across table accuracy, style preservation, robustness, and performance.
Use this skill when you need to:
The validation framework computes a composite quality score (0–100) combining four independent metric dimensions:
FinalScore = (0.40 × TableAccuracy) + (0.40 × StyleAccuracy)
+ (0.10 × Robustness) + (0.10 × Performance)
Each dimension is independent and can be evaluated separately or together.
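The weighted combination above can be sketched in Python (the function name is hypothetical; the bundled scripts may structure this differently):

```python
def composite_score(table_acc, style_acc, robustness, performance):
    """Combine the four dimension scores (each in [0, 100]) into the final 0-100 score."""
    return (0.40 * table_acc + 0.40 * style_acc
            + 0.10 * robustness + 0.10 * performance)
```

Because the weights sum to 1.0, a perfect score on every dimension yields exactly 100.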
Measures how accurately tables are detected and their cell content extracted.
Components:
Table Detection F1: IoU-based matching of predicted vs. gold tables (IoU ≥ 0.5 threshold)
Cell Content Accuracy: Token-level F1 averaging across matched table cells
Formula:
TableAccuracy = (0.5 × TableDetectionF1) + (0.5 × CellContentAccuracy)
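A minimal sketch of the IoU-matched detection F1, assuming tables are represented as axis-aligned bounding boxes `(x0, y0, x1, y1)` and matched greedily one-to-one (the real harness may use a different matching strategy):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def table_detection_f1(predicted, gold, threshold=0.5):
    """Greedily match each predicted box to an unused gold box at IoU >= threshold."""
    used, tp = set(), 0
    for p in predicted:
        best_i, best_v = None, threshold
        for i, g in enumerate(gold):
            if i not in used:
                v = iou(p, g)
                if v >= best_v:
                    best_i, best_v = i, v
        if best_i is not None:
            used.add(best_i)
            tp += 1
    denom = len(predicted) + len(gold)
    return 2 * tp / denom if denom else 1.0
```

A spurious extra table hurts precision: one correct match plus one false positive against a single gold table gives F1 = 2/3.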
Interpretation:
Measures how accurately text formatting (bold, italic, heading levels) is preserved.
Components:
Formula:
StyleAccuracy = macro_average(BoldF1, ItalicF1, HeadingF1)
= (BoldF1 + ItalicF1 + HeadingF1) / 3
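The per-category F1 and its macro average can be sketched as set-level F1 over extracted styled spans (a simplification; the actual scorer may align spans positionally rather than by exact text):

```python
def span_f1(predicted, gold):
    """Set-level F1 over styled spans, e.g. the set of texts marked bold."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to find, nothing predicted
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def style_accuracy(bold_f1, italic_f1, heading_f1):
    """Unweighted macro average across the three style categories."""
    return (bold_f1 + italic_f1 + heading_f1) / 3
```

Macro averaging means losing all italics costs a third of the style score regardless of how rare italics are in the document.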
Interpretation:
Measures system stability and validity across a test corpus, particularly edge cases.
Components:
- Markdown validity: pandoc syntax validation

Formula:
Robustness = (CrashFreeRate + MarkdownValidityRate + CompletenessRate) / 3
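A sketch of the three-rate average, assuming each document in the corpus reports three booleans (field names hypothetical):

```python
def robustness(results):
    """results: one dict per document with booleans 'crashed', 'valid_markdown', 'complete'."""
    n = len(results)
    crash_free = sum(not r["crashed"] for r in results) / n
    validity = sum(r["valid_markdown"] for r in results) / n
    completeness = sum(r["complete"] for r in results) / n
    return 100 * (crash_free + validity + completeness) / 3
```

With 30 documents, one crash, and two incomplete outputs, this reproduces the (96.7 + 100 + 93.3) / 3 = 96.7 arithmetic used in the worked example below.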
Interpretation:
Measures processing speed relative to a baseline; targets 1-page PDF ≈ 200–500ms.
Components:
Formula:
Performance = 0.5 × min(1.0, baseline_median / run_median)
+ 0.5 × min(1.0, baseline_p95 / run_p95)
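The formula can be sketched directly; the `min(1.0, …)` cap means a run faster than baseline scores at most 100, so Performance rewards not regressing rather than raw speed:

```python
def performance(baseline_median, baseline_p95, run_median, run_p95):
    """Speed score vs. baseline; each ratio is capped at 1.0 (no extra credit for being faster)."""
    return 100 * (0.5 * min(1.0, baseline_median / run_median)
                  + 0.5 * min(1.0, baseline_p95 / run_p95))
```

A run exactly twice as slow as baseline on both statistics scores 50.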
Interpretation:
Create .gold.md files for each test PDF:
# Copy reference Markdown next to PDF with .gold.md extension
cp reference_output.md test.gold.md
# Format: Each section annotated with metadata
# Gold format example:
# # Heading 1
# **bold text** and *italic text*
#
# | Column A | Column B |
# |----------|----------|
# | cell 1 | cell 2 |
# Evaluate against ground truth
cargo run -p edgequake-pdf --example real_dataset_eval -- \
--input crates/edgequake-pdf/test-data/real_dataset \
--gold \
--metrics
# Generate detailed report
python3 .github/skills/pdf-markdown-validator/scripts/validate.py \
--pdf-dir crates/edgequake-pdf/test-data/real_dataset \
--gold-dir . \
--output-report metrics_report.json
# View summary scores
jq '.summary' metrics_report.json
# Analyze failures by category
python3 .github/skills/pdf-markdown-validator/scripts/analyze_failures.py \
metrics_report.json
# Embed metrics in standard cargo test output
cargo test -p edgequake-pdf -- --nocapture
# Fail CI if composite score below threshold
cargo test -p edgequake-pdf --features ci-strict
from pdf_validator import PDFValidator

validator = PDFValidator(
    pdf_dir="test-data/real_dataset",
    gold_dir="test-data/gold",
    metrics=["table", "style", "robustness", "performance"],
)

score = validator.evaluate()
print(f"Composite Score: {score.composite}/100")
# GitHub Actions example
- name: Validate PDF → Markdown
  run: |
    cargo run -p edgequake-pdf --example real_dataset_eval -- --metrics
    python .github/skills/pdf-markdown-validator/scripts/validate.py \
      --ci-mode --fail-below 75
Input PDF: 2×3 table with headers "Name, Age" and row "John, 25"
Gold Markdown:
| Name | Age |
| ---- | --- |
| John | 25 |
Generated Markdown (Perfect):
| Name | Age |
| ---- | --- |
| John | 25 |
Scores:
Generated Markdown (Partial Match):
| Name | Age |
| ---- | ---- |
| John | 25.0 |
Scores:
Gold Markdown:
# Main Heading
This is **bold** and _italic_ text.
## Sub Heading
More content here.
Generated Markdown (Perfect):
# Main Heading
This is **bold** and _italic_ text.
## Sub Heading
More content here.
Scores:
Generated Markdown (Partial):
# Main Heading
This is bold and italic text.
## Sub Heading
More content here.
Scores:
Test corpus: 30 PDFs (including 5 edge cases: corrupted, multilingual, scanned, etc.)
Results:
Robustness Score: (96.7 + 100 + 93.3) / 3 = 96.7%
Baseline (previous release):
Current run:
Scores:
# Navigate to PDF crate
cd edgequake/crates/edgequake-pdf
# Ensure ground-truth annotations exist
# Files should be named: <pdf_name>.gold.md
ls -1 test-data/real_dataset/*.gold.md
# Convert all PDFs to Markdown
cargo run -p edgequake-pdf --example real_dataset_eval -- --write
# Outputs written to: test-data/real_dataset/*.md
# Compute all metrics
python3 ../../.github/skills/pdf-markdown-validator/scripts/validate.py \
--pdf-dir test-data/real_dataset \
--gold-dir test-data/real_dataset \
--output-report validation_report.json \
--verbose
# View summary
jq '.summary' validation_report.json
# Detailed per-document breakdown
jq '.documents | .[] | {name, scores}' validation_report.json
# Identify failure patterns
python3 ../../.github/skills/pdf-markdown-validator/scripts/analyze_failures.py \
validation_report.json --group-by failure_type
The `.gold.md` files serve as the ground-truth references. Use this structure:
# Document Title (H1)
## Section Heading (H2)
This paragraph contains **bold text** and _italic text_ and **_bold-italic text_**.
### Subsection (H3)
#### Sub-subsection (H4)
**Note:** Use standard Markdown syntax. Be precise with:
- Bold: **text**
- Italic: _text_
- Bold-Italic: **_text_**
- Headings: # through #### for H1–H4
### Tables
| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | Cell 6 |
Ensure:
- Pipes align properly
- Headers separated by `---|---` row
- No trailing spaces (can affect parsing)
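When authoring gold tables, a quick consistency check like the following (a hypothetical helper, not part of the bundled scripts) catches the most common mistake, rows with mismatched cell counts:

```python
def table_column_counts_ok(lines):
    """True if every non-empty row of a pipe table splits into the same number of cells."""
    counts = {len(line.strip().strip("|").split("|"))
              for line in lines if line.strip()}
    return len(counts) == 1
```

Run it over each table block before committing a `.gold.md` file; a row with an extra or missing pipe will make the set contain more than one count.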
### Code Blocks
\`\`\`python
def hello():
    print("world")
\`\`\`
Use triple backticks with language identifier.
### Lists
Bullet list:
- Item 1
- Item 2
  - Nested item
- Item 3
Numbered list:
1. First
2. Second
3. Third
### Edge Cases
- **Multi-line table cells**: Not standard Markdown; flatten to single line
- **Merged cells**: Not representable in Markdown tables; split into separate rows
- **Vertical headers**: Use first row convention (all cells with **bold**)
# Full validation pipeline
python3 scripts/validate.py \
--pdf-dir <path/to/pdfs> \
--gold-dir <path/to/gold> \
[--output-report <report.json>] \
[--metrics table,style,robustness,performance] \
[--ci-mode] \
[--fail-below 75]
Options:
- `--pdf-dir`: Directory containing PDFs and generated `.md` files
- `--gold-dir`: Directory containing `.gold.md` reference files
- `--output-report`: JSON file for machine-readable results (default: `validation_report.json`)
- `--metrics`: Comma-separated metrics to compute (default: all)
- `--ci-mode`: Fail with non-zero exit code if score below threshold
- `--fail-below`: Minimum acceptable score (default: 75)

# Identify and categorize failures
python3 scripts/analyze_failures.py \
<report.json> \
[--group-by failure_type|document|metric] \
[--export <output.csv>]
# Compare two validation runs
python3 scripts/compare_runs.py \
<baseline_report.json> \
<current_report.json> \
[--show-improvements] \
[--show-regressions]