Escapades: Image PDF to text PDF using OCR

Friday, April 1, 2016

Image PDF to text PDF using OCR

Use case: I have a multi-page PDF that was produced by scanning a physical document. I want to read it on my Kindle and make notes.

The problem: Since the PDF is made up of images, the Kindle renders each page as an image, so the text on the pages is not selectable.

Solution:
Use OCR.

Ref:
http://ubuntuforums.org/showthread.php?t=880471

Steps:

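1. Convert the PDF pages to images with pdftoppm. At 600 dpi it generates a ppm of about 100MB per page, so the script should ideally process one page at a time and delete each ppm when done.
2. Convert each ppm to tif, since tesseract accepts tif.
3. Run tesseract on each tif for OCR. It generates one txt file per page.
4. Append all the txt files to a single output file.
5. Create a PDF out of the text.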
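The note in the first step, about iterating per page and deleting the ppm, could be sketched roughly as below. This is only an illustration: the function name ocr_pages and its arguments are made up for this example, and it assumes pdftoppm, convert (ImageMagick) and tesseract are installed.

```shell
#!/bin/sh
# Sketch: OCR one page at a time so that only one large .ppm exists on disk.
# ocr_pages is a hypothetical helper, not part of any of the tools used.
# Usage: ocr_pages PDF FIRST_PAGE LAST_PAGE OUTPUT_TXT
ocr_pages() {
    pdf=$1; first=$2; last=$3; out=$4
    : > "$out"                                   # start with an empty output file
    page=$first
    while [ "$page" -le "$last" ]; do
        # Render just this page; pdftoppm names the file page-<number>.ppm
        pdftoppm -f "$page" -l "$page" -r 600 "$pdf" page
        for ppm in page-*.ppm; do
            [ -e "$ppm" ] || continue            # nothing rendered, skip
            convert "$ppm" page.tif              # tesseract accepts tif
            tesseract page.tif page -l eng       # writes page.txt
            cat page.txt >> "$out"
            echo "[pagebreak]" >> "$out"
            rm -f "$ppm" page.tif page.txt       # free the space before the next page
        done
        page=$((page + 1))
    done
}
```

For example, ocr_pages scanned.pdf 1 10 pdf-ocr-output.txt would cover the same ten pages as the script in this post, with scanned.pdf standing in for the real file name.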

Script:

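#!/bin/sh
# Usage: pass the scanned PDF as the argument.
mkdir tmp
cp "$@" tmp
cd tmp || exit 1
# Render pages 1 to 10 at 600 dpi as ocrbook-*.ppm
pdftoppm -f 1 -l 10 -r 600 * ocrbook
for i in *.ppm; do
    base=`basename "$i" .ppm`
    # tesseract accepts tif, so convert first
    convert "$i" "$base.tif"
    # OCR the tif (not the ppm); writes $base.txt
    tesseract "$base.tif" "$base" -l eng
    cat "$base.txt" >> pdf-ocr-output.txt
    echo "[pagebreak]" >> pdf-ocr-output.txt
done
# Keep only the combined text and clean up
mv pdf-ocr-output.txt ..
rm *
cd ..
rmdir tmp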
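
The script stops at pdf-ocr-output.txt; the last step in the list, creating a PDF out of the text, is not part of it. One possible way to do that is sketched below. It assumes enscript and ps2pdf (Ghostscript) are installed, neither of which is mentioned in this post, and the helper names pagebreaks_to_formfeeds and txt_to_pdf are made up for the example.

```shell
#!/bin/sh
# Sketch: turn the OCR text into a PDF, one PDF page per "[pagebreak]" marker.
# Assumption: enscript and ps2pdf are available; the post does not name a tool.

# Replace each "[pagebreak]" line with a form feed character,
# which enscript interprets as "start a new page" by default.
pagebreaks_to_formfeeds() {
    awk '$0 == "[pagebreak]" { printf "\f"; next } { print }' "$1"
}

# Usage: txt_to_pdf INPUT_TXT OUTPUT_PDF
txt_to_pdf() {
    # -B: no page header, -q: quiet, -p -: write PostScript to stdout
    pagebreaks_to_formfeeds "$1" | enscript -B -q -p - | ps2pdf - "$2"
}
```

For example, txt_to_pdf pdf-ocr-output.txt book.pdf, where book.pdf is just a placeholder name. The result is a plain-text PDF, so the layout of the scanned pages is not preserved.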
