[Project-ideas] OCR IR

Thu Apr 18 23:03:26 PDT 2013

Hello

I am Alok Kothari. I am interested in applying to GSoC 2013 to work
with Ankur.

Background: I graduated from IIT Kharagpur in 2009 and have been involved in
research in IR/NLP and Machine Learning for nearly 2 years.

I am interested in the project on 'Improving information retrieval methods
for OCR data sets consisting of Indic scripts'. I have a few questions:

1. Could I have a look at, or get some indication of, the quality of the
files available? This would give me some idea of the kinds of errors to
expect.

2. In the project, can I assume access to some 'clean' corpus that I can
use to correct errors in the digitised corpus? For example, I could learn
n-grams from the known 'correct' text to fix likely errors in the OCR
text. There are some ways to obtain such a corpus.

3. Does the IR system have to be implemented on top of Lucene (or other
open source software), or can it be completely standalone?
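
The n-gram idea in point 2 could be sketched roughly as follows. This is
only an illustration under my own assumptions: a character trigram model
with add-one smoothing, trained on a clean text and used to pick the most
plausible of several candidate readings of an OCR token. The function
names (char_ngrams, train, log_score, best_candidate) are hypothetical, and
the English sample stands in for Indic text, which Python would handle the
same way as Unicode strings.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return all character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(clean_text, n=3):
    """Count n-grams and their (n-1)-character contexts in the clean corpus."""
    grams = char_ngrams(clean_text, n)
    ngram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    vocab_size = len(set(clean_text))
    return ngram_counts, context_counts, vocab_size, n


def log_score(model, candidate):
    """Log-probability of candidate under the model (add-one smoothing)."""
    ngram_counts, context_counts, vocab_size, n = model
    total = 0.0
    for g in char_ngrams(candidate, n):
        prob = (ngram_counts[g] + 1) / (context_counts[g[:-1]] + vocab_size)
        total += math.log(prob)
    return total


def best_candidate(model, candidates):
    """Pick the candidate reading that the clean-corpus model likes best."""
    return max(candidates, key=lambda c: log_score(model, c))


# Hypothetical usage: 'tbe' is a plausible OCR misreading of 'the'.
model = train("the cat sat on the mat")
print(best_candidate(model, ["tbe", "the"]))
```

Candidates of different lengths are not directly comparable with this
score, so a real system would need length normalisation and a way to
generate the candidates in the first place (for example from common OCR
confusion pairs).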

Thank You!

Best,
Alok