<div dir="ltr"><div style>I have found several research papers that describe various approaches to IR from OCR'd Indic texts.</div><div style><br></div><div style>1. The first issue when we attempt IR from Indic texts is how the user query should be represented. The most common way to do this is to use a standardized encoding or transliteration scheme such as ISCII; for example, a major project, the <a href="http://www.dli.ernet.in/">Digital Library of India</a>, uses the <a href="http://www.cs.cmu.edu/~madhavi/Om/">Om</a> transliteration scheme.</div>
<div style><br></div><div style>2. Query representation may well be the least of our concerns, given that several challenges need to be overcome at the OCR end itself. IR from OCR'd documents typically follows one of two approaches. (1) A recognition-based approach: this first analyses the layout of the document, then segments and recognises the characters, and finally applies a post-processing step that tries to correct spelling errors, etc. While this is a good approach for English texts, it is not well suited to Indic documents (no proper layout recognition algorithm, the shirorekha, a large number of similar-looking characters, etc.). (2) A recognition-free approach: rather than trying to obtain the text from the document image, we skip the recognition step entirely and instead match word image against word image. The user query is first converted to a word image, and the word images in the documents that are similar to it are then retrieved. With some improvements, this approach is currently the best suited to Indic scripts. In practice we would use image features (Gabor features, etc.) to find the best-matching images for the query word, using algorithms such as dynamic time warping (DTW). A more efficient algorithm using clustering with locality-sensitive hashing (LSH) may also be used.</div>
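<div style><br></div><div style>To make the recognition-free matching step concrete, here is a minimal sketch of DTW-based word-image ranking. It assumes each word image has already been reduced to a sequence of per-column feature vectors (the feature extraction itself, e.g. Gabor responses, is not shown). The function names, the toy feature values and the length normalisation are illustrative assumptions of mine, not taken from the papers.</div>

```python
from math import sqrt, inf


def frame_distance(a, b):
    """Euclidean distance between two per-column feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(query, candidate):
    """Length-normalised DTW cost between two sequences of feature vectors."""
    n, m = len(query), len(candidate)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] is the cheapest alignment of query[:i] with candidate[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], candidate[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Dividing by (n + m) is an assumed normalisation so that long words
    # are not penalised simply for having more columns.
    return cost[n][m] / (n + m)


def rank_word_images(query, index):
    """Return (word_id, distance) pairs sorted from best to worst match."""
    scored = [(word_id, dtw_distance(query, feats)) for word_id, feats in index.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (hypothetical): each word image is a sequence of 2-D column features.
index = {
    "word_a": [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
    "word_b": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}
# The query is word_a with its first column repeated, i.e. horizontally stretched.
query = [[0.0, 1.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
ranking = rank_word_images(query, index)
print(ranking[0][0])
```

<div style>Because DTW lets one column of the query align with several columns of a candidate (and vice versa), the stretched query still matches word_a exactly. Comparing the query against every word image in this way is quadratic per pair, which is why an indexing scheme such as LSH becomes attractive for large collections.</div>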
<div style><br></div><div style>3. Existing software: Currently <a href="http://code.google.com/p/parichit/">Parichit</a> is an available open-source OCR for some Indian languages, but it still has much to accomplish. A <a href="http://tdil-dc.in/index.php?option=com_vertical&parentid=77">Web OCR</a> has been developed by TDIL, and there is also <a href="http://www.cdac.in/html/press/archives/atjp02/prs_rl114.aspx">Chitrankan</a> by C-DAC, but neither of them is open source. So several opportunities exist for improving the situation with respect to IR.</div>
<div style><br></div><div style>4. Data and resources: One very good source of training data for our purposes is the Information Retrieval Society of India and FIRE (Forum for Information Retrieval Evaluation), which tries to achieve for Indian languages something similar to TREC. It has very good <a href="http://www.isical.ac.in/~fire/data.html">datasets</a> available, which may be used to evaluate our baseline model.</div>
<div style><br></div><div style>5. References</div><div style><br></div><div style>1. <span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Manmatha, R., and C. V. Jawahar. "Challenges in the Recognition and Searching of Printed Books in Indian Languages and Scripts." </span><i style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Multimedia Information Extraction and Digital Heritage Preservation</i><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px"> 10 (2010): 119.</span></div>
<div style><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">2. </span><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Meshesha, Million, and C. V. Jawahar. "Matching word images for content-based retrieval from printed document images." </span><i style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">International Journal of Document Analysis and Recognition (IJDAR)</i><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px"> 11.1 (2008): 29-38.</span></div>
<div style><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">3. </span><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Govindaraju, Venu, and Srirangaraj Setlur. </span><i style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Guide to OCR for Indic Scripts</i><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">. Springer, 2009.</span></div>
<div style><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">4. </span><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Balajapally, Prashanth, et al. "Multilingual book reader: Transliteration, word-to-word translation and full-text translation." </span><i style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">Proceeding of the 13th Biennial Conference and Exhibition Conference of Victorian Association for Library Automation Melbourne, Feb</i><span style="font-family:Arial,sans-serif;font-size:13px;line-height:16px">. 2006.</span></div>
<div style><br></div><div>Most of the work I refer to here can be found with a simple query on Google Scholar:<br></div><div><div><br></div><div><a href="http://scholar.google.co.in/scholar?start=10&q=retrieval+from+ocr+of+Indian+texts&hl=en&as_sdt=0,5">http://scholar.google.co.in/scholar?start=10&q=retrieval+from+ocr+of+Indian+texts&hl=en&as_sdt=0,5</a><br>
</div></div><div><br></div><div style>I apologise for the length of the mail. I would be glad to know of further suggestions and guidance.</div><div style><br></div><div style>Regards,</div><div style>Madhura Parikh</div>
<div style><a href="mailto:madhuraparikh@gmail.com">madhuraparikh@gmail.com</a><br></div></div>