Name: Post-OCR Text Cleanup for Research Corpora
Author: scdenney

Post-OCR Text Cleanup for Research Corpora

Guide post-OCR text cleanup for research corpora. Covers LLM-based correction, rule-based fixes, quality diagnostics, multilingual considerations, and corpus-level quality assurance. Use when (1) choosing between LLM and rule-based OCR error correction, (2) designing prompts for LLM-based OCR cleanup, (3) applying constrained decoding to prevent correction hallucination, (4) building rule-based fixes for Unicode normalization or repetition artifacts, (5) evaluating cleanup quality beyond CER/WER, (6) handling diacritics restoration or script-specific spacing, (7) sampling and flagging documents for human review at corpus scale, or (8) tracking correction provenance for reproducibility.

scdenney · 15 stars · April 5, 2026

Occupation
Category: Knowledge Base

Instructions

1. Cleanup Strategy Selection

Choose between LLM correction, rule-based fixes, or a hybrid pipeline based on error type. LLM correction excels at context-dependent errors (wrong but plausible characters, broken words, missing diacritics). Rule-based fixes handle deterministic patterns (control characters, Unicode normalization, repetition artifacts, whitespace) with zero risk of content alteration. Use rule-based fixes unconditionally for these categories.
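A minimal sketch of the deterministic fixes named above, using only the Python standard library. The function name `rule_based_cleanup`, the choice of NFC as the normalization form, the six-character run threshold, and the decision to collapse only punctuation runs and consecutive duplicate lines are all illustrative assumptions, not settings prescribed by this guide; tune them against a sample of your own corpus.

```python
import re
import unicodedata

# Control characters other than tab and newline (assumed scope).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# A punctuation/symbol character repeated 6+ times (assumed threshold).
# Letters and digits are excluded so values such as "1000000" are untouched.
_SYMBOL_RUN = re.compile(r"([^\w\s])\1{5,}")
_SPACES = re.compile(r"[ \t]+")


def rule_based_cleanup(text: str, max_run: int = 3) -> str:
    """Apply deterministic fixes: NFC, control chars, runs, whitespace."""
    text = unicodedata.normalize("NFC", text)   # compose base + combining marks
    text = _CONTROL.sub("", text)               # strip stray control characters
    text = _SYMBOL_RUN.sub(lambda m: m.group(1) * max_run, text)

    # Collapse horizontal whitespace and trim each line.
    lines = [_SPACES.sub(" ", line).strip() for line in text.split("\n")]

    # Drop a non-empty line that exactly repeats the line before it.
    cleaned = []
    for line in lines:
        if line and cleaned and cleaned[-1] == line:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)
```

Because every step is a fixed pattern, the same input always yields the same output, which is what makes these fixes safe to apply to every page without review.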
Default to the hybrid approach for research corpora. Run LLM correction first on all pages, then apply deterministic rule fixes on top. This order matters: LLM correction may introduce formatting artifacts that rule fixes clean up, while the reverse order wastes rule-fix effort on text the LLM will rewrite (Machidon & Machidon 2025).
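The ordering described above can be sketched as a small wrapper that takes the two stages as callables. Everything here is hypothetical scaffolding: `hybrid_cleanup`, `strip_trailing_whitespace`, and the returned dictionary keys are names invented for illustration, and `llm_correct` stands in for whatever model call a project actually uses.

```python
from typing import Callable, Dict


def strip_trailing_whitespace(text: str) -> str:
    """Toy rule fix used for the example: trim the end of every line."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def hybrid_cleanup(
    page_text: str,
    llm_correct: Callable[[str], str],
    rule_fix: Callable[[str], str] = strip_trailing_whitespace,
) -> Dict[str, str]:
    """Run LLM correction first, then deterministic rule fixes on its output."""
    corrected = llm_correct(page_text)   # stage 1: context-dependent errors
    final = rule_fix(corrected)          # stage 2: clean up formatting artifacts
    # Keep every intermediate version so each change can be traced later.
    return {"raw": page_text, "llm": corrected, "final": final}
```

Returning the raw, LLM, and final versions together is one way to keep the stages auditable; a real pipeline would likely also record the model and rule versions used.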
Pilot-test LLM correction per language before corpus-wide deployment. LLM post-correction effectiveness is highly language-dependent: English achieves 7-58% CER reduction, while some languages see no improvement or degradation (Kanerva et al. 2025). Never assume cross-language transferability.
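A per-language pilot needs a small set of pages with ground-truth transcriptions and a way to compare CER before and after correction. The sketch below assumes samples are supplied as `(language, ocr_text, corrected_text, reference_text)` tuples; the function names and the report layout are illustrative, and CER is computed here as summed character edit distance divided by summed reference length, which is one common convention among several.

```python
from typing import Dict, Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def pilot_by_language(
    samples: Iterable[Tuple[str, str, str, str]]
) -> Dict[str, Dict[str, float]]:
    """Aggregate CER before/after correction for each language."""
    totals: Dict[str, list] = {}
    for lang, ocr, corrected, ref in samples:
        t = totals.setdefault(lang, [0, 0, 0])
        t[0] += levenshtein(ocr, ref)        # errors before correction
        t[1] += levenshtein(corrected, ref)  # errors after correction
        t[2] += len(ref)                     # reference characters
    report = {}
    for lang, (before, after, n) in totals.items():
        if n == 0:
            continue  # no reference text, CER undefined
        report[lang] = {
            "cer_before": before / n,
            "cer_after": after / n,
            # Negative values mean correction made the text worse.
            "relative_reduction": (before - after) / before if before else 0.0,
        }
    return report
```

A negative `relative_reduction` for a language is the signal to withhold LLM correction for that language rather than assume results transfer from another one.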
Consider whether correction is needed at all. If the downstream analysis tolerates OCR noise (e.g., topic modeling is robust to moderate error rates), the risk of correction-introduced errors may outweigh the benefit. Define the quality threshold before choosing a strategy.

2. LLM-Based Correction

3. Rule-Based Fixes

4. Quality Diagnostics and Metrics

5. Multilingual Considerations

6. Corpus-Level Quality Assurance

7. Provenance and Documentation

Quality Checks
