Teach me about OCR

25 March 2006   0 comments   Linux



I might soon need a good OCR program to read scanned-in pages, but these aren't perfectly scanned pages from a novel. What I'm scanning is things like printed invoices, with tables, headers, logos, footers, etc.

The only program I've looked at is ocrad, and I've had pretty decent results with it. I scanned an invoice and, thanks to a quick Python script, was able to work out the correct rotation with 57% confidence (the second best was 37%). That's a start. ocrad seems very flexible and, judging from the mailing list, quite actively developed.
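The script itself isn't shown here, so what follows is only a guess at what such a rotation check might look like, not the actual code. It assumes the confidence is each rotation's share of the recognized dictionary words, that the installed ocrad accepts --transform with the values none, rotate90, rotate180 and rotate270 (worth checking against ocrad --help), and that the scan has already been converted to a PNM file. The names run_ocrad, word_score and rank_rotations are made up for this example.

```python
import re
import subprocess

# Rotation names as passed to ocrad's --transform option (assumed; verify locally).
ROTATIONS = ("none", "rotate90", "rotate180", "rotate270")


def run_ocrad(image_path, rotation):
    """Return the text ocrad recognizes after applying the given rotation."""
    # Assumption: ocrad is on PATH and image_path is a PBM/PGM/PPM file.
    result = subprocess.run(
        ["ocrad", "--transform=" + rotation, image_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def word_score(text, known_words):
    """Count the alphabetic tokens in text that appear in known_words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return sum(1 for token in tokens if token in known_words)


def rank_rotations(texts, known_words):
    """Rank rotations by their share of all recognized dictionary words.

    texts maps a rotation name to the OCR output for that rotation.
    Returns (rotation, confidence) pairs, best first.
    """
    scores = {rot: word_score(txt, known_words) for rot, txt in texts.items()}
    total = sum(scores.values())
    if total == 0:
        # Nothing recognizable in any orientation.
        return [(rot, 0.0) for rot in scores]
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rot, score / total) for rot, score in ranked]
```

To try it on a real scan, something like texts = {rot: run_ocrad("invoice.pgm", rot) for rot in ROTATIONS} would feed rank_rotations, with known_words loaded from a word list such as /usr/share/dict/words. Both the file name and the word list are placeholders.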

I guess I need to do more research into tuning ocrad with the right charsets, image formats and some of its more obvious options before I give up. When I scanned my invoice, the words it found did look like words, but not much of the output was usable. The name of the company that sent the invoice, for example, was nowhere among the recognized words :(

What do people use out there? I bet Amazon didn't just use ocrad when they built their Search Inside the Book feature.

