Read, convert, and OCR any document -- PDF, Word, ePub, images
You have these tools installed for working with documents. Use them directly via shell commands.
Extract text from a PDF:
pdftotext input.pdf output.txt
Extract text preserving layout:
pdftotext -layout input.pdf output.txt
Get PDF metadata (page count, dimensions, etc.):
pdfinfo input.pdf
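When a script only needs the page count, the pdfinfo output can be filtered. A minimal sketch, assuming pdfinfo prints a line beginning with "Pages:" (as Poppler's pdfinfo does); the function name parse_page_count is hypothetical and not part of any tool listed here:

```shell
# Hypothetical helper: read pdfinfo output on stdin and print the page count.
# Assumes a line of the form "Pages:          12" is present.
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Usage (assumes input.pdf exists):
#   pdfinfo input.pdf | parse_page_count
```

Reading from stdin keeps the parsing step separate from the pdfinfo call, so it can be checked against saved output.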
Extract images from a PDF:
pdfimages -png input.pdf output_prefix
OCR an image to text (tesseract appends .txt, so this writes output_text.txt):
tesseract image.png output_text
OCR with a specific language:
tesseract image.png output_text -l eng
OCR a PDF (convert pages to images first, then OCR):
# Convert PDF pages to images at 300 DPI (ImageMagick 7 names this command magick)
convert -density 300 input.pdf -quality 100 page-%03d.png
# OCR each page
for f in page-*.png; do tesseract "$f" "${f%.png}" -l eng; done
# Combine results
cat page-*.txt > full_text.txt
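Many PDFs already carry a text layer, in which case OCR is unnecessary. The sketch below tries pdftotext first and falls back to the OCR steps above only when little text comes out. The function names count_text_chars and pdf_to_text and the 50-character threshold are assumptions chosen for illustration, not part of any tool:

```shell
# Hypothetical helper: count non-whitespace characters in a text file.
count_text_chars() {
  tr -d '[:space:]' < "$1" | wc -c | tr -d ' '
}

# Hypothetical wrapper: pdf_to_text INPUT.pdf OUTPUT.txt [MIN_CHARS]
# The body runs in a subshell ( ... ) so its variables stay local.
pdf_to_text() (
  pdf="$1"
  out="$2"
  min_chars="${3:-50}"   # arbitrary threshold; tune for your documents

  # First attempt: extract the existing text layer.
  pdftotext -layout "$pdf" "$out"
  if [ "$(count_text_chars "$out")" -ge "$min_chars" ]; then
    exit 0
  fi

  # Fallback: render pages to images and OCR each one.
  tmp="$(mktemp -d)"
  convert -density 300 "$pdf" -quality 100 "$tmp/page-%03d.png"
  for f in "$tmp"/page-*.png; do
    tesseract "$f" "${f%.png}" -l eng
  done
  cat "$tmp"/page-*.txt > "$out"
  rm -rf "$tmp"
)

# Usage (assumes input.pdf exists):
#   pdf_to_text input.pdf full_text.txt
```

Working in a temporary directory keeps the intermediate page images out of the current directory.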
Convert between formats (Word, ePub, HTML, Markdown, etc.):
pandoc input.docx -t markdown -o output.md
pandoc input.epub -t markdown -o output.md
pandoc input.html -t markdown -o output.md
pandoc input.md -t html -o output.html
ePub to plain text:
pandoc input.epub -t plain -o output.txt
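The same pandoc invocation can be looped over a directory. A minimal sketch, assuming the inputs are .docx files; the helper names md_name and convert_docx_dir are hypothetical:

```shell
# Hypothetical helper: derive the Markdown output name from an input path
# by replacing the final extension with .md.
md_name() {
  printf '%s\n' "${1%.*}.md"
}

# Hypothetical helper: convert every .docx file in a directory to Markdown.
convert_docx_dir() {
  for f in "$1"/*.docx; do
    [ -e "$f" ] || continue   # skip the unmatched glob when no .docx exists
    pandoc "$f" -t markdown -o "$(md_name "$f")"
  done
}

# Usage (assumes ./docs contains .docx files):
#   convert_docx_dir ./docs
```

Each output file is written next to its source file.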
Resize an image (fits within 800x800 pixels, preserving aspect ratio):
convert input.png -resize 800x800 output.png
Crop an image (replace WIDTH, HEIGHT, X, and Y with pixel values; +repage resets the canvas offset):
convert input.png -crop WIDTHxHEIGHT+X+Y +repage output.png
Convert image format:
convert input.png output.jpg
Get image metadata:
exiftool image.png
If any of these tools are missing, run bash /workspace/scripts/install-tools.sh as root to reinstall them.