X-Mailer: XFMail 1.4.7 on Linux
From: Karl-Heinz Herrmann (k.-h.herrmann@fz-juelich.de)
To: linux-questions-only@ssc.com
Subject: RE: [TAG] searching PDFs made from faxes
Date: Tue, 01 Jul 2003 22:25:52 +0200 (CEST)

+-+ You've asked a question of The Answer Gang, so you've been sent the
+-+ reply directly as a courtesy. The TAG list has also been copied.
+-+ Please send all replies to linux-questions-only@ssc.com so that
+-+ we can help our other readers by publishing the exchange in our monthly
+-+ Web magazine Linux Gazette, http://www.LinuxGazette.net/

On 01-Jul-2003 Faber Fedor wrote:
> Hey Gang,
> Is anyone aware of a way to search PDF files that were created from
> faxes, e.g. tiff files?
> I'm guessing that OCR has to be utilized here, right? I've come
> across things like pdftotext, but the fact that the PDF started life
> as a TIFF is, I think, a complication.
> For the record, I'm putting together a fax server solution for a
> client. The ability to search the faxes for text strings would be
> killer.


Your guess is quite right -- if the PDF contains only a large graphic and no actual text, you will need OCR. gocr or Clara OCR might come in handy (gocr seems to come already trained, while Clara OCR uses a quite different method). gocr gave me reasonable results one or two years back. BUT: I had clean 300 DPI scans. With a jagged-looking fax, I guess you are facing serious problems.
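
A rough sketch of how the search could be wired together, in Python. It assumes pdfimages (from xpdf or poppler-utils) and gocr are installed and on the PATH, and that each fax page is stored in the PDF as a single image, so image order equals page order. The function names are made up for illustration, and nothing here is tuned for fax-quality input:

```python
import subprocess
import tempfile
from pathlib import Path


def extract_images(pdf_path, out_dir):
    """Dump the images embedded in the PDF (PBM/PPM by default) via pdfimages."""
    root = Path(out_dir) / "page"
    subprocess.run(["pdfimages", str(pdf_path), str(root)], check=True)
    # pdfimages names its output page-000.pbm, page-001.pbm, ...
    return sorted(Path(out_dir).glob("page-*"))


def ocr_image(image_path):
    """Run gocr on one image and return whatever text it recognised."""
    result = subprocess.run(
        ["gocr", "-i", str(image_path)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_matches(page_texts, needle):
    """Return 1-based page numbers whose text contains needle, ignoring case."""
    needle = needle.lower()
    return [n for n, text in enumerate(page_texts, start=1)
            if needle in text.lower()]


def search_fax_pdf(pdf_path, needle):
    """Extract, OCR and search one PDF (assumes one image per page)."""
    with tempfile.TemporaryDirectory() as tmp:
        texts = [ocr_image(img) for img in extract_images(pdf_path, tmp)]
    return find_matches(texts, needle)
```

Calling search_fax_pdf("fax.pdf", "invoice") would return the page numbers where the OCR output contains the string. For a fax server it would probably make more sense to run the OCR once when a fax arrives and store the text next to the PDF, so that later searches are a plain grep. Expect misses wherever the OCR misreads a character or a phrase is split across lines.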

Date: Tue, 1 Jul 2008 11:22:57 +0100
From: Jimmy O'Regan (joregan@gmail.com)
To: The Answer Gang (tag@lists.linuxgazette.net)
Subject: [TAG] Cuneiform OCR source available for Linux

Cognitive released the source of the kernel of their OCR system (http://www.cuneiform.ru/eng/), and the Linux port (https://launchpad.net/cuneiform-linux/) has reported its first success: https://launchpad.net/cuneiform-linux/+announcement/561

With luck, they'll be able to get the layout engine working - I've used the Windows version, and it was very good at analysing a complicated mixed-language document.

[RM adds: Cited package is under a 3-clause BSD licence.]