<div dir="ltr"><div><div>Update: Through a thread on this mailing list, I have also come to know about tesseract-indic, which seems more promising. I am looking into that as well.<br><br></div>Regards,<br></div>Madhura<br>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Apr 13, 2013 at 11:42 PM, Madhura Parikh <span dir="ltr"><<a href="mailto:madhuraparikh@gmail.com" target="_blank">madhuraparikh@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">Hi, <br>I have spent some time going through Tesseract :<div class="im">
<br><br><span style="color:rgb(116,27,71)">The current implementation of Tesseract may have already solved this challenge.<br>
</span></div></div><div class="gmail_quote"><span style="color:rgb(116,27,71)">(shirorekha, etc.)<br></span></div><div class="gmail_quote"><span style="color:rgb(116,27,71)"><font color="#000000">There is a publication that attempts this for Hindi, but it achieves reasonable accuracy only when a predefined font style is assumed. Otherwise the reported accuracy is ~40%, which is too low to be useful. Also, Tesseract uses the character segmentation approach. I believe that if this were replaced by the keyword spotting approach, in which we bypass identifying individual characters and instead try to identify whole word images, accuracy could improve considerably. A post-processing step could then adapt the results based on feedback from words that were recognized with high confidence. A language model that uses grammar or n-gram statistics could be combined with the clustering to improve accuracy further...<br>
</font></span></div><div class="gmail_quote"><span style="color:rgb(116,27,71)"><font color="#000000">I found the first few slides in this <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&ved=0CGwQFjAJ&url=http%3A%2F%2Fwww.cs.bgu.ac.il%2F~klara%2FATCS111%2FWordSpotDTW_Yaakov_Roee.pptx&ei=Xm5pUY7XH4vOrQf3g4GIBQ&usg=AFQjCNECi20vLj434RW6n8-dgwsUUtNo2Q&sig2=lQ_KMlKO7dbQcowcWGs33w" target="_blank">ppt</a> a good guide to this approach.<br>
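<br>To make the idea concrete, here is a rough sketch of matching whole word images instead of segmenting characters. It assumes dynamic time warping (DTW) as the matching step, going only by the file name of the slides linked above, and it assumes binarized word images given as rows of 0/1 pixels. The helpers <code>column_profile</code> and <code>spot_word</code> are illustrative names of my own, not part of Tesseract or any other library:<br>
<pre>
```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def column_profile(image):
    """Assumed feature: number of ink pixels (1s) in each column of a binary word image."""
    return [sum(column) for column in zip(*image)]


def spot_word(query_image, labelled_templates):
    """Return the label of the template word image closest to the query under DTW."""
    query = column_profile(query_image)
    return min(
        labelled_templates,
        key=lambda label: dtw_distance(query, column_profile(labelled_templates[label])),
    )
```
</pre>
Because DTW tolerates horizontal stretching, the same word rendered wider or narrower can still match its template, which is part of why this may be less dependent on a predefined font style. Real features would need to be much richer than a single ink count per column.<br>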
</font></span></div><div class="gmail_quote"><div class="im"><span style="color:rgb(116,27,71)"><br>
</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span style="color:rgb(153,0,255)">The interesting part (for me at least) is what happens when the<br>
image-under-test is a fragment. For example, if the digitized document<br>
is of a scroll that is damaged, what would it take for an IR system to<br>
be able to reconstruct the word/phrase/image?</span><br></blockquote><div><br></div></div><div>Of course, this would be the true challenge for any OCR system. But if the approach based on clustering images (above) is used, would the chances be better than with the conventional OCR approach? <br>
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im"><div>
<div><br>
<span style="color:rgb(153,0,255)">> 3. Existing software : Currently Parichit is an available open-source OCR for<br>
> some Indian languages. But it still has much to accomplish. A Web OCR has<br>
> been developed by TDIL and there is also Chitrankan by C-DAC but they both<br>
> are not open source. So several opportunities exist for improving the<br>
> scenario wrt IR.<br>
<br>
</span></div><span style="color:rgb(153,0,255)">Could you check how Tesseract plays out in contrast to the above?</span><br></div></div>
Parichit seems to be at a very nascent stage compared to Tesseract, but it offers some training datasets in a handful of Indic languages. There is also OCRopus, which evolved from Tesseract and brings Python and machine learning to OCR. However, it currently appears to be less efficient than Tesseract.<br>
</blockquote><div><br></div><div>These are some very cursory observations. I still need to explore Tesseract more thoroughly and get back.<span class="HOEnZb"><font color="#888888"><br><br><br></font></span></div><span class="HOEnZb"><font color="#888888"><div>
Madhura Parikh<br><a href="mailto:madhuraparikh@gmail.com" target="_blank">madhuraparikh@gmail.com</a><br>
</div></font></span></div><br></div></div>
</blockquote></div><br></div>