[Project-ideas] Preliminary Survey on IR from OCR'd Indic documents

Sat Apr 13 11:12:19 PDT 2013

Hi,
I have spent some time going through Tesseract:

The current implementation of Tesseract may have already solved this
challenge (shirorekha, etc.).
There is a publication that tries to achieve this for Hindi. But they are
able to achieve a reasonable accuracy only when they assume a predefined
font-style. Otherwise the reported accuracy is ~40%, which is of no use.
Again, Tesseract uses the character segmentation approach. I believe that if
this were replaced by a keyword spotting approach, in which we bypass
identifying individual characters and instead try to identify whole word
images, the accuracy could be considerably improved. A post-processing step
can be added in which we adapt the results based on feedback from words
that have been recognized correctly with high probability. A language model
that uses grammar or n-gram statistics to improve the accuracy may be used
together with the clustering of word images.
I found the first few slides in this
ppt <http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&ved=0CGwQFjAJ&url=http%3A%2F%2Fwww.cs.bgu.ac.il%2F~klara%2FATCS111%2FWordSpotDTW_Yaakov_Roee.pptx&ei=Xm5pUY7XH4vOrQf3g4GIBQ&usg=AFQjCNECi20vLj434RW6n8-dgwsUUtNo2Q&sig2=lQ_KMlKO7dbQcowcWGs33w>
to be a good guide to this approach.
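
To make the word spotting idea a bit more concrete, here is a minimal sketch
of matching word images with dynamic time warping (DTW). This is my own
illustration and not code from Tesseract or from the slides. The per-column
features (ink fraction, top and bottom ink row) and the names column_profile,
dtw_distance and spot are assumptions chosen for brevity; a real system would
likely need richer features and proper preprocessing.

```python
import math

def column_profile(img):
    """img: list of rows of 0/1 values (1 = ink). One feature tuple per column."""
    height, width = len(img), len(img[0])
    feats = []
    for x in range(width):
        ink_rows = [y for y in range(height) if img[y][x]]
        if ink_rows:
            # (fraction of ink, topmost ink row, bottommost ink row), all scaled by height
            feats.append((len(ink_rows) / height,
                          ink_rows[0] / height,
                          ink_rows[-1] / height))
        else:
            feats.append((0.0, 0.0, 0.0))
    return feats

def dtw_distance(a, b):
    """Length-normalised DTW cost between two sequences of feature tuples."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Allow a column to stretch, shrink, or match one-to-one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def spot(query_img, candidate_imgs):
    """Return (index of the closest candidate, list of all distances)."""
    q = column_profile(query_img)
    dists = [dtw_distance(q, column_profile(c)) for c in candidate_imgs]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists
```

The point of DTW here is that the same word printed wider or narrower (a
different font, for instance) can still align column by column, so no
character segmentation is needed. Words whose distances fall below some
threshold could then be grouped into the same cluster.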

> The interesting part (for me at least) is what happens when the
> image-under-test is a fragment. For example, if the digitized document
> is of a scroll that is damaged, what would it take for an IR system to
> be able to reconstruct the word/phrase/image?
>

Of course, this would be the true challenge for any OCR system. But if the
approach based on clustering word images (above) is used, would the chances
be better than with the conventional OCR approach?

>
> > 3. Existing software: Currently Parichit is an available open-source OCR
> > for some Indian languages. But it still has much to accomplish. A Web OCR
> > has been developed by TDIL and there is also Chitrankan by C-DAC, but
> > neither is open source. So several opportunities exist for improving the
> > scenario wrt IR.
>
> Could you check how Tesseract plays out in contrast to the above?
> Parichit seems to be at a very nascent stage currently compared to
> Tesseract, but it offers some training datasets in a handful of Indic
> languages. There is also OCRopus, which has evolved off Tesseract and
> brings Python and machine learning to OCR. However, it currently appears
> to be less efficient than Tesseract.
>

These are some very cursory observations. I still need to explore Tesseract
more thoroughly and get back.

Madhura Parikh
madhuraparikh at gmail.com