Free OCR solution on OS X

Tesseract OCR is probably the best open source OCR engine available. It allows you to convert text from an image.

Install tesseract

I suppose you already have homebrew installed. If not, copy and paste this into Terminal.

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

Once you got homebrew, installing tesseract is as simple as brew install tesseract

Automate the process

tesseract accepts input as image. If you input is PDF, you will have to convert it to image first, perferably TIFF format. You probably will want to automate this process with Automator.

Fire up Automator and create a new application.

Image not found

Web path: http://i.imgur.com/fAd0Xxx.png

Disk path: /static/http://i.imgur.com/fAd0Xxx.png

Using Page Bundles: false

Search for the action “Render PDF Pages as Images” and drag it to the right. Change the format to TIFF and increase DPI to 300 (The higher, the better accuracy rate)

Image not found

Web path: http://i.imgur.com/OFS1xLI.png

Disk path: /static/http://i.imgur.com/OFS1xLI.png

Using Page Bundles: false

Search and drag action “Rename Finder Items” to the workflow on the right and change it as the following image. Remember to tick “Make all number 4 digits long” just to ensure the numbering is right if you got lots of pages in the PDF file.

Image not found

Web path: http://i.imgur.com/u5UrpEs.png

Disk path: /static/http://i.imgur.com/u5UrpEs.png

Using Page Bundles: false

Add action “Move Finder Items” to the workflow. Select your desired folder as you see fit.

Image not found

Web path: http://i.imgur.com/ArXQ77c.png

Disk path: /static/http://i.imgur.com/ArXQ77c.png

Using Page Bundles: false

Now, this is the part where where OCR begins. We will convert each image to text by passing each image to tesseract. We then will append the output to our output.txt file. Remember to change tesseract path according to your tesseract version number. (It’s 3.02.02 as of this post)

The last step in the script is removing the images and the buffered text files. It’s optional.

rm ~/Desktop/out.txt
for f in "[email protected]"
do
  /usr/local/Cellar/tesseract/3.02.02/bin/tesseract "$f" "$f"
  cat "$f.txt" >> ~/Desktop/out.txt
  rm "$f" "$f.txt"
done

The last step looks like this in Automator

Image not found

Web path: http://i.imgur.com/tHYc6V2.png

Disk path: /static/http://i.imgur.com/tHYc6V2.png

Using Page Bundles: false

Save this Automator workflow to Desktop or wherever you want.

Next time you need to OCR a PDF document, just drag and drop it over the application’s icon. The output text will be saved as output.txt on Desktop.