25 March 2006 Linux
I might soon need a good OCR program to read scanned pages, but these aren't cleanly scanned pages from a novel. The pages I'm scanning are things like printed invoices, with tables, headers, logos, footers, etc.
The only program I've looked at so far is ocrad, and I've had pretty decent results with it. I scanned an invoice and, thanks to a quick Python script, was able to work out the correct rotation with 57% confidence (the second best was 37%). That's a start.
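A rotation check of this kind might work roughly like the sketch below. It is not the original script: the scoring rule (counting recognized tokens that appear in a word list, then reporting each rotation's share of the total) is an assumption, and the names `count_known_words`, `rank_rotations` and `ocr_all_rotations` are made up for illustration. It relies on ocrad's `--transform` option to rotate the image before recognition.

```python
import re
import subprocess

# Rotation names accepted by ocrad's --transform option.
ROTATIONS = ("none", "rotate90", "rotate180", "rotate270")


def count_known_words(text, vocabulary):
    """Count tokens of three or more letters that appear in the vocabulary."""
    tokens = re.findall(r"[A-Za-z]{3,}", text)
    return sum(1 for token in tokens if token.lower() in vocabulary)


def rank_rotations(texts_by_rotation, vocabulary):
    """Return (rotation, share) pairs, best first.

    'share' is this rotation's fraction of all known words found across
    every rotation, used here as a rough confidence figure (an assumption).
    """
    scores = {
        rotation: count_known_words(text, vocabulary)
        for rotation, text in texts_by_rotation.items()
    }
    total = sum(scores.values())
    if total == 0:
        return [(rotation, 0.0) for rotation in scores]
    shares = [(rotation, score / total) for rotation, score in scores.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


def ocr_all_rotations(image_path):
    """Run ocrad once per rotation and collect the recognized text."""
    texts = {}
    for rotation in ROTATIONS:
        result = subprocess.run(
            ["ocrad", "--transform=" + rotation, image_path],
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # ocrad's default output may not be valid UTF-8
            check=True,
        )
        texts[rotation] = result.stdout
    return texts


if __name__ == "__main__":
    # Assumes a system word list and a scanned image; both paths are placeholders.
    with open("/usr/share/dict/words") as handle:
        words = {line.strip().lower() for line in handle}
    for rotation, share in rank_rotations(ocr_all_rotations("invoice.pnm"), words):
        print(f"{rotation}: {share:.0%}")
```

Keeping the scoring separate from the ocrad call makes it easy to swap in a different word list, or a different scoring rule, without touching the OCR step.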
ocrad seems very flexible, and quite active judging from the mailing list. I guess I need to do more research into tuning ocrad with the right charsets, image formats and some of its more obvious options before I give up. When I scanned my invoice, the words it found did look like words, but not much of real use could be extracted from them. The name of the company that sent the invoice, for example, was nowhere among the recognized words :(
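One way to approach that tuning is to sweep a few option combinations and check whether a word that should be on the page (such as the company name) turns up in the output. The sketch below assumes ocrad's `--charset`, `--scale` and `--threshold` options; the helper names, the value ranges and the `invoice.pgm` path are all hypothetical, and the threshold only matters for greyscale or colour input.

```python
import itertools
import subprocess


def build_ocrad_command(image_path, scale, threshold_percent, charset="iso-8859-15"):
    """Build the ocrad argument list for one combination of settings."""
    return [
        "ocrad",
        f"--charset={charset}",
        f"--scale={scale}",
        f"--threshold={threshold_percent}%",
        image_path,
    ]


def run_ocrad(command):
    """Run ocrad and return whatever text it recognized."""
    result = subprocess.run(
        command,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate non-UTF-8 bytes in the output
        check=True,
    )
    return result.stdout


def sweep(image_path, keyword, scales=(1, 2, 3), thresholds=(40, 50, 60), run=run_ocrad):
    """Return the (scale, threshold) pairs whose output contains the keyword."""
    hits = []
    for scale, threshold in itertools.product(scales, thresholds):
        text = run(build_ocrad_command(image_path, scale, threshold))
        if keyword.lower() in text.lower():
            hits.append((scale, threshold))
    return hits


if __name__ == "__main__":
    # Placeholder image path and keyword; replace with a real scan and company name.
    print(sweep("invoice.pgm", "acme"))
```

An empty result for every combination would suggest the problem lies elsewhere, for instance in the scan resolution or in a logo that is not plain text at all.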
What do people use out there? I bet Amazon didn't just use ocrad when they did their Search Inside the Book.