[Project-ideas] Preliminary Survey on IR from OCR'd Indic documents

Sat Apr 13 20:51:35 PDT 2013

Update: Through a thread on this mailing list, I have also come to know
about tesseract-indic, which seems to be more promising. I am looking at
that as well.

Regards,
Madhura

On Sat, Apr 13, 2013 at 11:42 PM, Madhura Parikh <madhuraparikh at gmail.com> wrote:

> Hi,
> I have spent some time going through Tesseract:
>
>
> The current implementation of Tesseract may have already solved this
> challenge (shirorekha, etc.).
> There is a publication that tries to achieve this for Hindi, but it
> reports reasonable accuracy only when a predefined font style is assumed.
> Otherwise the reported accuracy is ~40%, which is too low to be usable.
> Also, Tesseract uses a character segmentation approach. I believe that if
> this were replaced by a keyword spotting approach, in which we bypass
> identifying individual characters and instead try to identify whole word
> images, accuracy could be improved considerably. A post-processing step
> can be added in which we adapt the results based on feedback from words
> that are recognized as the correct guess with high probability. A language
> model that uses grammar / n-gram statistics to improve the accuracy may be
> used together with the clustering.
> I found the first few slides in this ppt <http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&ved=0CGwQFjAJ&url=http%3A%2F%2Fwww.cs.bgu.ac.il%2F~klara%2FATCS111%2FWordSpotDTW_Yaakov_Roee.pptx&ei=Xm5pUY7XH4vOrQf3g4GIBQ&usg=AFQjCNECi20vLj434RW6n8-dgwsUUtNo2Q&sig2=lQ_KMlKO7dbQcowcWGs33w> a good guide to this approach.
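
To make the word-spotting idea above a little more concrete, here is a rough
sketch of matching whole word images without segmenting characters. It
assumes dynamic time warping (DTW) over per-column ink counts, which is only
one possible choice of feature and distance (the file name of the linked
slides suggests DTW, but I have not checked that they use exactly this). The
function names and the tiny binary images are made up for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]


def column_profile(image):
    """Ink pixels per column of a binary word image (rows of 0/1)."""
    return [sum(column) for column in zip(*image)]


def spot(query_image, candidates):
    """Return candidate labels ordered from best to worst DTW match.

    `candidates` is a hypothetical dict mapping a label to a word image.
    """
    query = column_profile(query_image)
    scored = [(dtw_distance(query, column_profile(image)), label)
              for label, image in candidates.items()]
    return [label for _, label in sorted(scored)]
```

Because DTW tolerates stretching along the writing direction, a wider
rendering of the same word can still match its template, which is the point
of comparing word images rather than characters.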
>
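
The post-processing step mentioned above (using n-gram statistics together
with words already recognized with high confidence) could look roughly like
the following. This is only a sketch under assumptions: a bigram model with
add-one smoothing, recognizer confidences in (0, 1], and a toy English corpus
standing in for real Indic text. All names here are hypothetical.

```python
from collections import Counter
from math import log


def train_bigrams(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """log P(word | prev) with add-one smoothing over the seen vocabulary."""
    vocab_size = len(unigrams)
    return log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def rerank(prev_word, candidates, unigrams, bigrams, lm_weight=1.0):
    """Pick the best candidate for an uncertain word.

    `prev_word` is a neighbouring word that was recognized with high
    confidence; `candidates` is a list of (word, recognizer_confidence).
    """
    def score(candidate):
        word, confidence = candidate
        return log(confidence) + lm_weight * bigram_logprob(
            prev_word, word, unigrams, bigrams)
    return max(candidates, key=score)[0]
```

For example, if the recognizer slightly prefers "mouse" over "house" after a
confidently recognized "old", a corpus in which "old house" occurs can tip
the decision the other way.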
>> The interesting part (for me at least) is what happens when the
>> image-under-test is a fragment. For example, if the digitized document
>> is of a scroll that is damaged, what would it take for an IR system to
>> be able to reconstruct the word/phrase/image?
>>
>
> Of course, this would be the true challenge for any OCR system. But if the
> approach based on clustering images (above) is used, would the chances be
> better than with the conventional OCR approach?
>
>>
>> > 3. Existing software: Currently Parichit is an available open-source
>> > OCR for some Indian languages, but it still has much to accomplish. A
>> > Web OCR has been developed by TDIL, and there is also Chitrankan by
>> > C-DAC, but neither is open source. So several opportunities exist for
>> > improving the scenario wrt IR.
>>
>> Could you check how Tesseract plays out in contrast to the above?
>
> Parichit seems to be at a very nascent stage compared to Tesseract, but
> it offers some training datasets in a handful of Indic languages. There
> is also OCRopus, which originally built on Tesseract and combines Python
> and machine learning for OCR. However, it currently appears to be less
> efficient than Tesseract.
>>
>
> These are some very cursory observations. I still need to explore
> Tesseract more thoroughly and get back.
>
>
> Madhura Parikh
> madhuraparikh at gmail.com
>
>