[Project-ideas] Preliminary Survey on IR from OCR'd Indic documents

Sankarshan Mukhopadhyay sankarshan.mukhopadhyay at gmail.com
Fri Apr 12 19:26:45 PDT 2013


Thank you for taking the time to write out this note. It provides me with
an opportunity to emphasize the IR bits. Indic OCR capability has been
languishing for a while. There are reports which put some OCR
implementations at around 100% accuracy (but offer no demonstrated proof
or code), and there are published papers which put it somewhere between
70% and 80%. Neither outcome is encouraging. Additionally, with the
increasing familiarity of "Commons"-style projects (e.g. Wikipedia,
Wikimedia Commons), the need to share digitized data in an editable
format is also important (think of Distributed Proofreaders, for
example).

A strong effort on IR over OCR'd texts, spread over a few iterations,
will also help archivists and others build a corpus.

On Fri, Apr 12, 2013 at 10:43 PM, Madhura Parikh
<madhuraparikh at gmail.com> wrote:

> 1. The first issue when we attempt to do IR from Indic texts is the question
> of how the user query should be represented. The most common way to do this
> is to use some standardized transliteration scheme like ISCII; one major
> project, the Digital Library of India, uses the Om transliteration scheme.

ISCII has, over the years, evolved through workarounds built into it as
and when issues came up.
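
To make the query-representation step concrete, here is a minimal sketch
of normalizing a Devanagari query into a Latin form before it is matched
against an index. The character table is a tiny illustrative fragment I
made up; it is not the ISCII or Om scheme, which handle inherent vowels,
matras and conjuncts properly.

    # Minimal sketch: map a Devanagari query to a crude Latin form before
    # indexing/matching. The table below is an illustrative fragment only,
    # NOT the full ISCII or Om transliteration scheme.
    DEVANAGARI_TO_LATIN = {
        "\u0905": "a",    # अ
        "\u0915": "ka",   # क
        "\u0917": "ga",   # ग
        "\u092E": "ma",   # म
        "\u0930": "ra",   # र
        "\u093E": "aa",   # ा  (aa matra)
        "\u093F": "i",    # ि  (i matra)
    }

    def transliterate(query):
        """Map each character through the table; pass unknown characters through."""
        return "".join(DEVANAGARI_TO_LATIN.get(ch, ch) for ch in query)

    if __name__ == "__main__":
        # A real scheme would fold the matra into the preceding consonant;
        # this crude mapping simply concatenates the pieces.
        print(transliterate("\u0930\u092E\u093E"))  # र + म + ा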

> 2. Of course, step one is probably the least of our concerns, given that
> several challenges need to be overcome at the OCR end itself. IR from OCR'd
> documents typically takes one of two approaches. (1) A recognition-based
> approach: first perform layout recognition on the document, followed by
> character segmentation, and then a post-processing step that tries to
> correct spelling errors, etc. While this approach works well for English
> texts, it is not well suited to Indic documents (no robust layout
> recognition algorithm, the shirorekha, a large number of similar-looking
> characters, etc.)

The current implementation of Tesseract may have already solved this challenge.
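
On the post-processing step mentioned in (1), a minimal sketch of
lexicon-based correction of OCR output using edit distance is below. The
tiny lexicon and the distance threshold of 2 are assumptions for
illustration, not values from the survey.

    # Minimal sketch of the post-processing step in a recognition-based
    # pipeline: snap each OCR'd token to its nearest lexicon entry by edit
    # distance. The tiny lexicon and the threshold are illustrative only.

    def edit_distance(a, b):
        """Classic Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def correct(token, lexicon, max_dist=2):
        """Return the closest lexicon word within max_dist edits, else the token."""
        best = min(lexicon, key=lambda w: edit_distance(token, w))
        return best if edit_distance(token, best) <= max_dist else token

    if __name__ == "__main__":
        lexicon = ["bharat", "sanskrit", "pustak"]  # hypothetical word list
        print(correct("sanskrlt", lexicon))         # -> "sanskrit"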

> (2) Recognition-free approach: rather than trying to obtain the text from
> the document image, we skip the OCR recognition step entirely and instead
> match word image against word image. The user query is first rendered as a
> word image, and then the word images in the document that are similar to it
> are retrieved; with some improvements, this approach is currently the best
> suited for Indic scripts. We would use various image features, such as
> Gabor filters, to find the best-matching image of the query word, using
> algorithms like dynamic time warping (DTW). A more efficient algorithm
> using clustering with locality-sensitive hashing (LSH) may also be used.
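
As a concrete (if simplified) illustration of the matching step described
above, here is a sketch of DTW over one-dimensional column-projection
profiles of binarized word images. The profile feature and the cost
function are my assumptions to keep the example self-contained; a real
system would use richer features such as Gabor responses.

    # Minimal sketch of recognition-free word matching: compare two word
    # images by DTW over their column (vertical-projection) profiles.
    import numpy as np

    def column_profile(word_img):
        """Fraction of ink pixels in each column of a binarized word image."""
        return (word_img > 0).mean(axis=0)

    def dtw_distance(p, q):
        """Dynamic time warping distance between two 1-D feature sequences."""
        n, m = len(p), len(q)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(p[i - 1] - q[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return float(D[n, m])

    def rank_matches(query_img, candidate_imgs):
        """Rank candidate word images by DTW distance to the query image."""
        qp = column_profile(query_img)
        scored = [(dtw_distance(qp, column_profile(c)), idx)
                  for idx, c in enumerate(candidate_imgs)]
        return sorted(scored)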

The interesting part (for me at least) is what happens when the
image-under-test is a fragment. For example, if the digitized document
is of a scroll that is damaged, what would it take for an IR system to
be able to reconstruct the word/phrase/image?

> 3. Existing software: Currently Parichit is an available open-source OCR
> for some Indian languages, but it still has much to accomplish. A web OCR
> has been developed by TDIL, and there is also Chitrankan by C-DAC, but
> neither is open source. So several opportunities exist for improving the
> scenario with respect to IR.

Could you check how Tesseract compares with the options above?
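
For a quick first comparison, a minimal sketch of running Tesseract on a
scanned page through the pytesseract wrapper follows. It assumes the
Tesseract binary and the Hindi traineddata ("hin") are installed, and
"page.png" is a placeholder for a sample scan.

    # Minimal sketch: run Tesseract on a scanned Indic page via pytesseract.
    # Assumes the Tesseract binary and the "hin" traineddata are installed,
    # and that "page.png" is a sample scan; both are assumptions.
    from PIL import Image
    import pytesseract

    page = Image.open("page.png")
    text = pytesseract.image_to_string(page, lang="hin")
    print(text)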

> 4. Data and resources: One very good source of training data for our
> purposes is the Information Retrieval Society of India and FIRE, which
> tries to achieve something similar to TREC for Indian languages. It has
> very good datasets available which may be used to evaluate our baseline
> model.

ISI Kolkata's FIRE and RISOT efforts have suitable data-sets available
which can be used to evaluate a baseline.
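
Once a baseline runs against such a data-set, evaluation can start with
something as simple as precision and recall at k over the retrieved
results. The ranked list and relevance judgements below are made-up toy
data, not anything from FIRE/RISOT.

    # Minimal sketch of evaluating a retrieval baseline with precision and
    # recall at k. The ranked list and relevance judgements are toy data.

    def precision_at_k(ranked_ids, relevant_ids, k):
        top_k = ranked_ids[:k]
        return sum(1 for doc in top_k if doc in relevant_ids) / k

    def recall_at_k(ranked_ids, relevant_ids, k):
        top_k = ranked_ids[:k]
        return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

    if __name__ == "__main__":
        ranked = ["d3", "d7", "d1", "d9", "d4"]   # system output (toy)
        relevant = {"d1", "d3", "d8"}             # ground truth (toy)
        print(precision_at_k(ranked, relevant, 5))  # 0.4
        print(recall_at_k(ranked, relevant, 5))     # ~0.667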





--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>


