[Project-ideas] OCR IR

Thu Apr 18 23:06:00 PDT 2013

In point 2, what I meant was: should I look at the corpora that are already
available? A fairly large corpus for Indian languages is EMILLE
(http://www.lancs.ac.uk/fass/projects/corpus/emille/). Would I be able to use
that, and any others that are available?
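
As a rough illustration of the n-gram idea in point 2 quoted below, here is a
minimal sketch in Python. It assumes a character-bigram model with add-one
smoothing; the function names and the toy English text are illustrative
placeholders, not part of any Ankur code or of the EMILLE corpus, and a real
system for Indic scripts would probably need to work on grapheme clusters
rather than single code points.

```python
import math
from collections import Counter


def train_char_bigrams(clean_text):
    """Count character bigrams in a trusted ('clean') text.

    Returns (bigram counts, counts of each bigram's first character,
    number of distinct characters seen). All names here are hypothetical.
    """
    padded = " " + clean_text + " "
    bigrams = Counter(zip(padded, padded[1:]))
    firsts = Counter(padded[:-1])
    return bigrams, firsts, len(set(padded))


def log_score(word, model):
    """Add-one smoothed log score of a word under the bigram model.

    This is only a ranking score: characters never seen in training are
    not in the vocabulary count, so it is not a normalised probability.
    """
    bigrams, firsts, vocab_size = model
    padded = " " + word + " "
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (firsts[a] + vocab_size))
    return total


def pick_best(candidates, model):
    """Pick the candidate reading that looks most like the clean text.

    Scores are sums over bigrams, so this comparison is fairest between
    candidates of the same length.
    """
    return max(candidates, key=lambda w: log_score(w, model))


# Toy stand-in for a clean corpus (assumption: real data would be Indic text).
model = train_char_bigrams("the cat sat on the mat the hat")
best = pick_best(["tbe", "the"], model)
```

Here `best` is the candidate whose character sequence is better supported by
the clean text; whether this helps on real OCR output would depend on the
quality and size of the clean corpus.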

On Fri, Apr 19, 2013 at 11:33 AM, Alok Kothari <kothari.alok at gmail.com> wrote:

> Hello
>
> I am Alok Kothari. I am interested in applying to GSoC 2013 and working
> with Ankur.
>
> Background: I graduated from IIT Kharagpur in 2009 and have been involved
> in research in IR/NLP and Machine Learning for nearly 2 years.
>
> I was interested in the project on 'Improving information retrieval
> methods for OCR data sets consisting of Indic scripts'
>
> 1. I was wondering whether I could have a look at the files available, or
> get some indication of their quality. This would give me some idea of the
> kinds of errors to expect.
>
> 2. In the project, can I assume access to some 'clean' corpus that I can
> use to correct errors in the digitised corpus? For example, I could learn
> n-grams from the known 'correct' text to fix likely errors in the OCR
> text. There are some ways to obtain such a corpus.
>
> 3. Does the IR system have to be implemented on top of Lucene (or other
> open source software), or can it be completely standalone?
>
> Thank You!
>
> Best,
> Alok