Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.
Convert documents (PDF, DOCX, XLSX, TXT) to clean, structured Markdown.
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai gemini
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai groq
python3 {baseDir}/cleaner.py --input "{{directory}}" --ai none --output-dir "{{output_dir}}"
python3 {baseDir}/cleaner.py --input "{{file_path}}" --dry-run --verbose
python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none --summary
The --summary flag prints a JSON summary to stdout after processing:
{"version":"1.0.0","total":3,"success":2,"failed":1,"files":[{"file":"report.pdf","output":"./output/report.md","status":"ok"},{"file":"scan.pdf","output":null,"status":"no_content"},{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}
| Flag | Description |
|---|---|
| --input, -i | File or directory to process (required, non-recursive) |
| --output-dir, -o | Output directory (default: ./output) |
| --ai | gemini, groq, ollama, or none (default: from config or gemini) |
| --password | PDF decryption password |
| --config | Path to config JSON |
| --summary | Print JSON summary to stdout after processing |
| --dry-run | Preview without writing files |
| --verbose | Enable debug logging |
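When the tool is driven from another script, the flags in the table above can be assembled into an argument list. This is a sketch only: `build_command` and its parameters are hypothetical names, and `cleaner_path` stands in for the `{baseDir}/cleaner.py` path used in the commands above.

```python
# Values accepted by --ai, as listed in the flag table.
AI_CHOICES = {"gemini", "groq", "ollama", "none"}

def build_command(cleaner_path, input_path, ai="none", output_dir=None,
                  summary=False, dry_run=False, verbose=False):
    """Build an argv list for cleaner.py from the documented flags."""
    if ai not in AI_CHOICES:
        raise ValueError(f"unsupported --ai value: {ai}")
    cmd = ["python3", cleaner_path, "--input", input_path, "--ai", ai]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if summary:
        cmd.append("--summary")
    if dry_run:
        cmd.append("--dry-run")
    if verbose:
        cmd.append("--verbose")
    return cmd

print(build_command("cleaner.py", "report.pdf", summary=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.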
PDF (native, scanned, encrypted), DOCX, XLSX, XLS, CSV, TXT, MD
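To preview which files in a directory are likely to be picked up, a script can filter by extension. The sketch below is an assumption about how the formats above map to file extensions, not the tool's actual detection logic; `candidate_files` is a hypothetical helper. It looks only at the top level of the directory, matching the non-recursive behaviour of `--input`.

```python
from pathlib import Path

# Assumed extension list, derived from the formats named above.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".xls", ".csv", ".txt", ".md"}

def candidate_files(directory):
    """List files directly inside `directory` (no recursion) with a supported extension."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```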
| Code | Meaning |
|---|---|
| 0 | All files processed successfully |
| 1 | Some files failed (partial success) |
| 2 | No processable files found or config error |
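A wrapper script can branch on these exit codes. The following sketch maps each documented code to its meaning; `describe_exit` and `run_cleaner` are hypothetical names, and the command passed to `run_cleaner` is whatever argument list the caller builds.

```python
import subprocess

# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "All files processed successfully",
    1: "Some files failed (partial success)",
    2: "No processable files found or config error",
}

def describe_exit(code):
    """Translate an exit code into the documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")

def run_cleaner(argv):
    """Run the given command and return (exit_code, meaning)."""
    result = subprocess.run(argv)
    return result.returncode, describe_exit(result.returncode)
```

Treating code 1 as a partial success lets a batch job keep the files that did convert and retry only the failures.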
- Output is written to ./output/ relative to the current directory
- Using an AI backend (gemini, groq, or ollama) gives much better results than --ai none
- --ai none requires zero API keys and zero network access