URL: http://www.gnu.org/software/ocrad/ocrad.html

I might soon need a good OCR program to read scanned pages, but these aren't cleanly scanned pages from a novel. What I'm scanning is stuff like printed invoices, with tables, headers, logos, footers, etc.

The only program I've looked at so far is ocrad, and I've had pretty decent results with it. I scanned an invoice and, thanks to a quick Python script, I was able to find the correct rotation with 57% confidence (the second best was 37%). That's a start. ocrad seems very flexible and, judging from the mailing list, quite actively developed.
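A rough sketch of what such a rotation check could look like, not the actual script. It assumes "confidence" means the share of recognized tokens that appear in a word list, and that ocrad is on the PATH and accepts `--transform=rotate90` and friends (worth checking against `ocrad --help`). The function names and the vocabulary are made up for illustration.

```python
import re
import subprocess

# Rotation names assumed to be accepted by ocrad's --transform option.
ROTATIONS = ("none", "rotate90", "rotate180", "rotate270")


def word_score(text, vocabulary):
    """Return the fraction of alphabetic tokens (3+ letters) found in vocabulary."""
    tokens = re.findall(r"[A-Za-z]{3,}", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in vocabulary)
    return hits / len(tokens)


def rank_rotations(texts_by_rotation, vocabulary):
    """Score each rotation's OCR text and return (rotation, score) pairs, best first."""
    scores = {
        rotation: word_score(text, vocabulary)
        for rotation, text in texts_by_rotation.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def ocr_all_rotations(image_path):
    """Run ocrad once per rotation and collect the text it prints to stdout.

    Assumption: image_path is in a format ocrad can read (for example PNM).
    """
    texts = {}
    for rotation in ROTATIONS:
        result = subprocess.run(
            ["ocrad", "--transform=" + rotation, image_path],
            capture_output=True,
            text=True,
            check=True,
        )
        texts[rotation] = result.stdout
    return texts
```

With a real scan, `rank_rotations(ocr_all_rotations("invoice.pnm"), vocabulary)` would return the rotations ordered by score, where `vocabulary` is a set of lowercase words, for example loaded from `/usr/share/dict/words`. A wide gap between the first and second score is the signal that the top rotation is the right one.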

I guess I need to do more research into tuning ocrad with the right charsets, image formats and some of its more obvious options before I give up. When I scanned my invoice, the words it found did look like words, but not much of it was actually usable. For example, the name of the company that sent the invoice was nowhere among the recognized words :(

What do people use out there? I bet Amazon didn't just use ocrad when they did their Search Inside the Book.
