Use case: I have a multi-page PDF that was produced by scanning a physical document. I want to read it on my Kindle and make notes.
The problem: Since the PDF is made up of images, the Kindle renders each page as an image, so the text on the pages is not selectable.
Solution:
Use OCR.
Ref:
http://ubuntuforums.org/showthread.php?t=880471
Steps:
- pdftoppm generates a ppm of about 100MB per page (at 600 dpi). The script should ideally process one page at a time and delete each ppm when done.
- Convert each ppm to tif, since tesseract accepts tif.
- Run tesseract for OCR: it generates one txt per page.
- Append every txt to the output file.
- Create a pdf out of the txt.
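The last step is not covered by the script in this post. A minimal sketch of one way to do it, assuming enscript and ps2pdf (Ghostscript) are installed. Neither tool is part of the steps above, and the function names are made up for illustration. It relies on the "[pagebreak]" marker lines that the script writes between pages and turns each one into a form feed, which enscript interprets as a page break by default.

```shell
#!/bin/sh
# Sketch only: enscript and ps2pdf are assumed to be installed;
# pagebreaks_to_formfeeds and txt_to_pdf are hypothetical helper names.

# Replace every line that is exactly "[pagebreak]" with a form feed character.
pagebreaks_to_formfeeds() {
    awk '{ if ($0 == "[pagebreak]") printf "\f"; else print }' "$1"
}

# txt_to_pdf INPUT.txt OUTPUT.pdf
# -B drops the page header, -p - writes PostScript to stdout.
txt_to_pdf() {
    pagebreaks_to_formfeeds "$1" | enscript -B -p - | ps2pdf - "$2"
}

# Only run when called with two arguments: input text and output pdf.
if [ $# -eq 2 ]; then
    txt_to_pdf "$1" "$2"
fi
```

Saved as, say, txt2pdf.sh, it would be run as: sh txt2pdf.sh pdf-ocr-output.txt book.pdf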
Script:
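#!/bin/sh
# Usage: sh pdf-ocr.sh scanned.pdf
# Work in a scratch directory so the cleanup at the end is safe.
mkdir tmp
cp "$@" tmp
cd tmp
# Render pages 1 to 10 at 600 dpi as ocrbook-*.ppm
pdftoppm -f 1 -l 10 -r 600 * ocrbook
for i in *.ppm; do
    base=`basename "$i" .ppm`
    # tesseract is given the tif, not the ppm
    convert "$i" "$base.tif"
    tesseract "$base.tif" "$base" -l eng
    cat "$base.txt" >> pdf-ocr-output.txt
    echo "[pagebreak]" >> pdf-ocr-output.txt
done
mv pdf-ocr-output.txt ..
rm *
cd ..
rmdir tmp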
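The first step notes that the script should ideally iterate per page and delete the large ppm files as it goes. A rough sketch of that variant follows. It assumes pdfinfo (from the same poppler tools as pdftoppm) is available to count pages, and the names page_count, ocr_pdf and ocrpage are made up for illustration.

```shell
#!/bin/sh
# Sketch only: OCR one page at a time so a single ppm/tif pair is on disk at once.
# Assumes pdfinfo, pdftoppm, convert and tesseract are installed.

# Print the number of pages, taken from the "Pages:" line of pdfinfo.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# ocr_pdf INPUT.pdf OUTPUT.txt
ocr_pdf() {
    pdf=$1
    out=$2
    : > "$out"
    n=`page_count "$pdf"`
    p=1
    while [ "$p" -le "$n" ]; do
        # Render only page $p; pdftoppm appends a page-number suffix, hence the glob.
        pdftoppm -f "$p" -l "$p" -r 600 "$pdf" ocrpage
        for ppm in ocrpage*.ppm; do
            convert "$ppm" ocrpage.tif
            tesseract ocrpage.tif ocrpage -l eng
            cat ocrpage.txt >> "$out"
            echo "[pagebreak]" >> "$out"
            # Delete the intermediates before the next page is rendered.
            rm -f "$ppm" ocrpage.tif ocrpage.txt
        done
        p=$((p + 1))
    done
}

# Only run when called with two arguments: input pdf and output text file.
if [ $# -eq 2 ]; then
    ocr_pdf "$1" "$2"
fi
```

This trades speed for disk space: pdftoppm is started once per page, but peak usage stays at roughly one page's worth of image data instead of the whole book.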