[Project-ideas] Preliminary Survey on IR from OCR'd Indic documents

Fri Apr 12 10:17:39 PDT 2013

I have found several research papers that describe various approaches to IR
from OCR'd Indic Texts.

1. The first issue when we attempt IR from Indic texts is how the user
query should be represented. The most common way is to use a standardized
scheme such as ISCII (strictly a character encoding, though it is often
used as the basis for transliteration). One major project, the Digital
Library of India <http://www.dli.ernet.in/>, uses the Om
<http://www.cs.cmu.edu/~madhavi/Om/> transliteration scheme.

2. Step one is probably the least of our concerns, given that several
challenges need to be overcome at the OCR end itself. IR from OCR'd
documents typically takes one of two approaches:

(1) Recognition-based approach: this first performs layout analysis of the
document, then character segmentation, then a post-processing step that
tries to correct spelling errors, etc. While this works well for English
texts, it is not well suited to Indic documents (no proper layout
recognition algorithm, the shirorekha, i.e. the headline joining the
characters of a word, a large number of similar-looking characters, etc.).

(2) Recognition-free approach: rather than trying to obtain the text from
the document image, we skip the recognition step entirely and match word
image against word image. The user query is first rendered as a word image,
and the word images in the documents that are similar to it are then
retrieved. With some improvements, this approach is currently the best
suited to Indic scripts. In practice we would extract image features such
as Gabor features and find the best matches for the query word image using
an algorithm like dynamic time warping (DTW). A more efficient alternative
using clustering with locality sensitive hashing (LSH) may also be used.
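
As a rough illustration of the word-image matching described in point 2,
here is a minimal sketch in Python. It is only my own toy version, not the
method of any of the papers below: word images are assumed to be small
binary grids (1 = ink), and instead of Gabor features it uses three simple
per-column profile features so that it needs no external libraries. The
function names (column_features, dtw_distance, rank_word_images) are
hypothetical.

```python
import math


def column_features(image):
    """Turn a binary word image (rows of 0/1, 1 = ink) into one
    feature vector per column, read left to right."""
    height = len(image)
    width = len(image[0])
    features = []
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        ink = sum(column)
        # Row of the first and last ink pixel; `height` marks a blank column.
        top = next((y for y, v in enumerate(column) if v), height)
        bottom = next(
            (y for y in range(height - 1, -1, -1) if column[y]), height
        )
        # Normalise by height so images of different sizes are comparable.
        features.append((ink / height, top / height, bottom / height))
    return features


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two feature sequences,
    divided by the combined length so long words are not penalised."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j columns.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # stretch a column of seq_b
                cost[i][j - 1],      # stretch a column of seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m] / (n + m)


def rank_word_images(query_image, candidates):
    """Return (distance, name) pairs, best match first.
    `candidates` maps a name to a binary word image."""
    query = column_features(query_image)
    scored = [
        (dtw_distance(query, column_features(image)), name)
        for name, image in candidates.items()
    ]
    return sorted(scored)
```

Because DTW may repeat columns while aligning, a candidate that is just a
horizontally stretched copy of the query (some columns duplicated) gets
distance 0, which is the tolerance to font and width variation that makes
this kind of matching attractive. The quadratic cost per comparison is also
why a faster scheme such as clustering with LSH becomes interesting for
large collections.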

3. Existing software: Currently Parichit
<http://code.google.com/p/parichit/> is an available open-source OCR for
some Indian languages, but it still has much to accomplish. A Web OCR
<http://tdil-dc.in/index.php?option=com_vertical&parentid=77> has been
developed by TDIL, and there is also Chitrankan
<http://www.cdac.in/html/press/archives/atjp02/prs_rl114.aspx> by C-DAC,
but neither is open source. So several opportunities exist for improving
the situation with respect to IR.

4. Data and Resources: One very good source of data for our purposes is
the Information Retrieval Society of India and FIRE (Forum for Information
Retrieval Evaluation), which tries to achieve something similar to TREC for
Indian languages. It has very good datasets
<http://www.isical.ac.in/~fire/data.html> available, which may be used to
evaluate our baseline model.

5. References

1. Manmatha, R., and C. V. Jawahar. "Challenges in the Recognition and
Searching of Printed Books in Indian Languages and Scripts." *Multimedia
Information Extraction and Digital Heritage Preservation* 10 (2010): 119.
2. Meshesha, Million, and C. V. Jawahar. "Matching word images for
content-based retrieval from printed document images." *International
Journal of Document Analysis and Recognition (IJDAR)* 11.1 (2008): 29-38.
3. Govindaraju, Venu, and Srirangaraj Setlur. *Guide to OCR for Indic
Scripts*. Springer, 2009.
4. Balajapally, Prashanth, et al. "Multilingual book reader:
Transliteration, word-to-word translation and full-text translation."
*Proceedings of the 13th Biennial Conference and Exhibition Conference of
Victorian Association for Library Automation Melbourne, Feb*. 2006.

Most of the work I refer to here can be found with a simple query on Google
Scholar:

<http://scholar.google.co.in/scholar?start=10&q=retrieval+from+ocr+of+Indian+texts&hl=en&as_sdt=0,5>

I apologise for the length of the mail. I would be glad to know of further
suggestions and guidance.

Regards,
Madhura Parikh
madhuraparikh at gmail.com