Monday, December 12, 2011

Converting PDF into Plain Text

At work I often receive pdf documents with no OCR information in it, and in order to modify/extract parts from them I need to convert them to plain text.

As a Unix user, I love the command line, and I have been searching for something like:

./pdf2text file.pdf

So this is what I have written in order to provide support for this useful command line tool. Is is based on tesseract, which is now under development by Google.
Remember that in order to interpret documents which are non-English, data files for the language must be installed separately, as most distributions just bring the English one with the main package.

As tesseract is able to work with TIFF images, the first tool to be developed is pdf2tif. If none is already available on your machine, you can use the following script which is based on ghostscript:

pdf2tif:

#!/bin/sh
# Derived from pdf2ps.
# Convert PDF to TIFF file.

OPTIONS=""
while true
do
        case "$1" in
                -?*) OPTIONS="$OPTIONS $1" ;;
                *) break ;;
        esac
shift
done

if [ $# -eq 2 ]
then
        outfile=$2
elif [ $# -eq 1 ]
then
        outfile=`basename "$1" .pdf`-%02d.tif
else
        echo "Usage: `basename $0` [-dASCII85EncodePages=false]
        [-dLanguageLevel=1|2|3] input.pdf [output.tif]" 1>&2
        exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"


This script will be later invoked by out pdf2text, which will take as input a pdf file, which will create a temporary folder, convert every pdf page into a separate tif image, and then feed tesseract with them.

pdf2text:

#!/bin/sh

#!/bin/sh

# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# then runs tesseract on them

mkdir $1-dir
cp $1 $1-dir
cd $1-dir

pdf2tif $1

for j in *.tif
do
        x=`basename $j .tif`
        tesseract ${j} ${x}
        rm ${x}.raw
        rm ${x}.map
        #un-comment next line if you want to remove the .tif files when done.
        #rm ${j}
done

cat *.txt > $1.txt
mv $1.txt ..

cd ..
rm -rf $1-dir

After its execution, a .txt file with the same basename as the pdf's will be created, containing the OCR'd text in it!

No comments:

Post a Comment