Detecting OCR corruption, encoding artifacts, diacritic quality, and Shamela-specific issues in Arabic scholarly text. Use when processing source text, evaluating text quality, debugging encoding problems, or assessing whether text needs re-extraction.
Arabic scholarly text arrives through multiple digitization paths, each with characteristic quality issues. This skill teaches how to detect corruption, assess quality, and decide whether text is pipeline-ready or needs re-extraction.
Arabic OCR has systematic confusion patterns due to letter shape similarity. These create SILENT corruption — the text looks plausible but is wrong.
| Correct | Confused With | Context | Detection Heuristic |
|---|---|---|---|
| ه (ha) | ة (ta marbuta) | Word-final | If word-final ه follows a fatha or appears in a known pattern (e.g., كتابه vs كتابة), check against dictionary |
| ة (ta marbuta) | ه (ha) | Word-final | Reverse of above — ة misread as ه changes meaning (صلاة→صلاه is invalid) |
| ي (ya) | ئ (ya+hamza) | All positions | Missing hamza seat — الشيء→الشي is common OCR error |
| ا (alif) | أ (alif+hamza) | Word-initial | أحمد→احمد — OCR frequently drops initial hamza |
| ة (ta marbuta) | ۃ (ta marbuta goal) | Word-final | Urdu ta marbuta substituted — encoding issue, not OCR |
| ك (kaf) | ک (keheh) | All positions | Persian/Urdu keheh (U+06A9) in place of Arabic kaf (U+0643) |
| ي (ya) | ى (alif maqsura) | Word-final | ى and ي are DISTINCT in Arabic but OCR confuses them. مصطفى≠مصطفي |
| ر (ra) | ز (zayn) | All positions | Dot presence/absence — OCR dot detection failure |
| د (dal) | ذ (dhal) | All positions | Same shape, dot above — OCR dot detection |
| ع (ayn) | غ (ghayn) | All positions | Dot above — same issue |
| ح (ha) | خ (kha) / ج (jim) | All positions | Three similar shapes, differentiated by dots |
| ب (ba) | ت (ta) / ث (tha) | Initial/medial | Dot count and placement |
| ن (nun) | dot of adjacent letter | Isolated | OCR merges nun dot with adjacent letter dot |
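Two of the rows above (keheh and ta marbuta goal) can be caught mechanically, because the wrong character is a different code point. The dot-based confusions cannot, but counting word-final letters at least shows whether the ه/ة and ي/ى ratios look skewed. A minimal sketch, assuming the hypothetical helper names `find_lookalikes` and `word_final_counts` and a deliberately small mapping:

```python
import unicodedata

# Non-Arabic lookalikes named in the table, mapped to the Arabic letter they
# usually stand in for. Illustrative only; extend for your own sources.
LOOKALIKES = {
    "\u06A9": "\u0643",  # ARABIC LETTER KEHEH -> ARABIC LETTER KAF
    "\u06C3": "\u0629",  # ARABIC LETTER TEH MARBUTA GOAL -> TEH MARBUTA
}

# Word-final letters that OCR commonly swaps: ha, ta marbuta, ya, alif maqsura.
WATCHED_FINALS = "\u0647\u0629\u064A\u0649"


def find_lookalikes(text):
    """Return (index, found_char, suggested_char) for each lookalike."""
    return [(i, ch, LOOKALIKES[ch]) for i, ch in enumerate(text) if ch in LOOKALIKES]


def word_final_counts(text):
    """Count watched letters in word-final position, ignoring marks and punctuation."""
    counts = {ch: 0 for ch in WATCHED_FINALS}
    for word in text.split():
        # Keep only letters (Unicode category L*), so trailing diacritics
        # and punctuation do not hide the real final letter.
        letters = [c for c in word if unicodedata.category(c).startswith("L")]
        if letters and letters[-1] in counts:
            counts[letters[-1]] += 1
    return counts
```

The counts are a signal for review, not a verdict: deciding whether a given word should end in ه or ة still needs a dictionary or a human reader.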
All KR text MUST be UTF-8. Common encoding artifacts:
| Symptom | Cause | Fix |
|---|---|---|
| Ù…Ø­Ù…Ø¯ instead of محمد | UTF-8 bytes decoded as Latin-1/Windows-1252, then re-encoded as UTF-8 | Encode the garbled string back to Latin-1/Windows-1252 bytes, then decode those bytes as UTF-8 |
| Â followed by Arabic | Double UTF-8 encoding | Decode once, not twice |
| ? or □ replacing characters | Encoding truncation or unsupported char | Re-extract from source |
| ﻻ (U+FEFB) instead of لا | Arabic presentation form | Normalize (e.g., NFKC) to the base letters لا (lam U+0644 + alif U+0627) |
| Isolated diacritics without base letters | Encoding split | Invalid text — re-extract |
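Two of the fixes in the table can be sketched in a few lines of Python. The function names are hypothetical, and the presentation-form ranges are the Unicode blocks Arabic Presentation Forms-A (U+FB50 to U+FDFF) and Forms-B (U+FE70 to U+FEFC, stopping short of U+FEFF, which is the byte order mark). Treat this as a starting point under those assumptions, not a complete repair tool:

```python
import unicodedata


def repair_mojibake(garbled):
    """Undo UTF-8 text that was wrongly decoded as a single-byte code page.

    Returns the repaired string, or None if the input is not this kind of
    mojibake (for example, text that is already correct Arabic).
    """
    for codec in ("cp1252", "latin-1"):
        try:
            return garbled.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return None


def is_presentation_form(ch):
    """True for code points in Arabic Presentation Forms-A or -B."""
    return "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"


def find_presentation_forms(text):
    """Return (index, char) for every presentation-form character."""
    return [(i, ch) for i, ch in enumerate(text) if is_presentation_form(ch)]


def normalize_presentation_forms(text):
    """Apply NFKC only to presentation forms, leaving all other text untouched."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
```

Normalization is limited to the flagged characters on purpose: running NFKC over the whole string could also rewrite characters that are legitimate content.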
Arabic presentation forms are rendering-level characters that should NOT appear in stored text. Their presence indicates:
Shamela originally used Windows-1256. Conversion artifacts include stray `?` or `\x00` characters.
Arabic text exists on a spectrum of diacritization:
| Level | Description | Typical Source | KR Handling |
|---|---|---|---|
| Full (مشكول بالكامل) | Every letter has explicit diacritic | Quran, classical mutun, learning texts | Preserve exactly — diacritics ARE content |
| Partial (مشكول جزئياً) | Ambiguous words diacritized, common words bare | Most scholarly texts | Preserve as-is — partial diacritization is intentional |
| Minimal (غير مشكول) | No diacritics except where essential | Modern prints, newspapers | Record as undiacritized — DO NOT add diacritics |
| Inconsistent | Random diacritization — some pages full, some bare | OCR artifacts, mixed sources | Flag for quality review — likely OCR problem |
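One way to place a text on this spectrum is to measure diacritics per Arabic letter on each page and compare pages. The sketch below assumes diacritics are U+064B to U+0652 plus superscript alef (U+0670), and letters are U+0621 to U+064A excluding tatweel. The thresholds `full_at` and `minimal_at` are illustrative guesses, not values from this document, and should be calibrated on known texts:

```python
from collections import Counter


def diacritic_ratio(text):
    """Diacritics per Arabic letter, or None if the text has no Arabic letters."""
    letters = sum(1 for c in text if "\u0621" <= c <= "\u064A" and c != "\u0640")
    marks = sum(1 for c in text if "\u064B" <= c <= "\u0652" or c == "\u0670")
    return marks / letters if letters else None


def page_level(text, full_at=0.7, minimal_at=0.05):
    """Classify one page as 'full', 'partial', or 'minimal' (None if no letters)."""
    ratio = diacritic_ratio(text)
    if ratio is None:
        return None
    if ratio >= full_at:
        return "full"
    if ratio <= minimal_at:
        return "minimal"
    return "partial"


def book_level(pages, full_at=0.7, minimal_at=0.05):
    """Classify a book; mixing fully marked and bare pages counts as inconsistent."""
    levels = [page_level(p, full_at, minimal_at) for p in pages]
    levels = [lv for lv in levels if lv is not None]  # skip blank pages
    if not levels:
        raise ValueError("no pages with Arabic letters to assess")
    if "full" in levels and "minimal" in levels:
        return "inconsistent"
    return Counter(levels).most_common(1)[0][0]
```

The result only records the level. Consistent with the handling column above, nothing here adds or removes diacritics.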
| Issue | Pattern | Detection | Severity |
|---|---|---|---|
| Missing pages | Page number jumps (e.g., p.45 → p.48) | Page sequence gap detection | HIGH — content loss |
| Concatenated books | Two different books merged in one entry | Author/topic sudden change mid-text | HIGH — misattribution |
| Metadata mismatch | Shamela author field doesn't match actual author | Compare metadata vs. colophon/introduction | MODERATE |
| HTML artifacts | Leftover &, <br>, </span> in text | Regex for HTML entities in content | LOW — cleanup |
| Empty sections | Division headers with no content | Empty div after heading | LOW — structural |
| Encoding mix | Arabic + Latin numeral confusion in page refs | ١٢٣ vs 123 mixed in same field | LOW — normalize |
| Footnote displacement | Footnotes placed at wrong location | Footnote reference without corresponding note | MODERATE |
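Three of the detections in the table are simple enough to sketch directly: page sequence gaps, leftover HTML, and mixed digit systems. The function names and the regex are assumptions made for illustration. The regex is intentionally loose and will also match angle-bracketed text that is not HTML:

```python
import re

# Matches tags such as <br> or </span>, and entities such as &amp; or &#160;
HTML_ARTIFACT = re.compile(r"</?[A-Za-z][^<>]*>|&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);")


def page_gaps(page_numbers):
    """Return (previous, current) pairs where the page sequence skips ahead."""
    return [(a, b) for a, b in zip(page_numbers, page_numbers[1:]) if b - a > 1]


def html_artifacts(text):
    """Return every HTML tag or entity found in the text, in order."""
    return HTML_ARTIFACT.findall(text)


def has_mixed_digits(field):
    """True if a field mixes ASCII digits with Arabic-Indic digits."""
    has_ascii = any("0" <= c <= "9" for c in field)
    has_arabic_indic = any("\u0660" <= c <= "\u0669" for c in field)
    return has_ascii and has_arabic_indic
```

For the example in the table, `page_gaps([44, 45, 48, 49])` reports the jump from page 45 to page 48. Concatenated books, metadata mismatches, and displaced footnotes need content-level comparison and are not covered by this sketch.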
For each source text, compute a quality assessment:
After any processing step, verify against the frozen source:
# Pseudocode: do NOT modify the frozen source
import hashlib
original_hash = hashlib.sha256(frozen_source_bytes).hexdigest()
processed_text_bytes = processed_text.encode('utf-8')
processed_hash = hashlib.sha256(processed_text_bytes).hexdigest()
# The two hashes will differ (processing changes text), but:
# - Arabic letter count must be >= original (no lost letters)
# - Diacritic count must be == original (no lost/added diacritics)
# - Quran citations must be byte-identical to original
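The three conditions in the comments above could be checked along these lines. This is a sketch under stated assumptions: `check_invariants` and its `quran_citations` parameter (citation strings taken from the frozen source) are hypothetical, and the letter and diacritic ranges are the basic Arabic block only:

```python
def count_letters(text):
    """Count Arabic letters U+0621..U+064A, excluding tatweel (U+0640)."""
    return sum(1 for c in text if "\u0621" <= c <= "\u064A" and c != "\u0640")


def count_diacritics(text):
    """Count diacritics U+064B..U+0652 plus superscript alef (U+0670)."""
    return sum(1 for c in text if "\u064B" <= c <= "\u0652" or c == "\u0670")


def check_invariants(frozen_source_bytes, processed_text, quran_citations=()):
    """Return a list of violated invariants; an empty list means all hold.

    The frozen source is only read, never modified.
    """
    original = frozen_source_bytes.decode("utf-8")
    problems = []
    if count_letters(processed_text) < count_letters(original):
        problems.append("lost Arabic letters")
    if count_diacritics(processed_text) != count_diacritics(original):
        problems.append("diacritic count changed")
    processed_bytes = processed_text.encode("utf-8")
    for citation in quran_citations:
        # Byte-identical means the exact UTF-8 byte sequence must survive.
        if citation.encode("utf-8") not in processed_bytes:
            problems.append("Quran citation altered: " + citation)
    return problems
```

Counts are a coarse check: a swap of one letter for another leaves them unchanged, so this complements rather than replaces a diff against the frozen source.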
After ANY processing step, these must hold: