[Project-ideas] OCR IR

Fri Apr 19 04:14:25 PDT 2013

On Fri, Apr 19, 2013 at 11:33 AM, Alok Kothari <kothari.alok at gmail.com> wrote:
> I am Alok Kothari. I am interested in applying to GSoC 2013 and working with
> Ankur.

Awesome!

> Background: I graduated from IIT Kharagpur in 2009 and have been involved in
> research in IR/NLP and Machine Learning for nearly 2 years.

Would it be possible to provide links to any papers, presentations or
code that you have published?

> I was interested in the project on 'Improving information retrieval methods
> for OCR data sets consisting of Indic scripts'
>
> 1. I was wondering whether I could have a look at, or some indication of,
> the quality of files available. This will give me some idea about the kinds
> of error

The project idea requires the interested candidate to propose, within
the scope of the project, the kinds of errors that the initial
iteration/release will handle.

> 2. In the project, can I assume access to some 'clean' corpus so that
> I can use it towards correcting errors in the digitised corpus? E.g. I
> could learn n-grams from the known 'correct' text to fix possible errors
> in OCR text. There are some ways to obtain such a corpus.

The FIRE team at ISI Kolkata have released a set of files which can be
used as a corpus, should you want one. Additionally, introducing errors
into a document is a reasonably active area of discussion. I'm certain
you are familiar with the methods.
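
As a rough illustration of the two ideas above (learning character
n-grams from a clean corpus, and introducing errors into clean text),
a minimal Python sketch could look like the following. The confusion
table, the function names and the Latin-letter sample words are all
made up for illustration. An Indic-script version would need a
confusion table derived from real OCR output, and this is not a
proposal for how the project should do it.

```python
import random
from collections import Counter

# Hypothetical confusion table: each key is a character that an OCR
# engine might misread as the value. These Latin pairs are examples only.
CONFUSIONS = {"l": "1", "O": "0", "e": "c", "m": "rn"}


def add_noise(text, rate, rng):
    """Replace each confusable character with probability `rate`."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def char_ngrams(word, n=2):
    """Character n-grams of a word, with ^ and $ as boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(clean_words, n=2):
    """Count character n-grams over words from the clean corpus."""
    counts = Counter()
    for word in clean_words:
        counts.update(char_ngrams(word, n))
    return counts


def score(word, counts, n=2):
    """Mean add-one count of the word's n-grams; higher looks 'cleaner'."""
    grams = char_ngrams(word, n)
    return sum(counts[g] + 1 for g in grams) / len(grams)


def best_candidate(candidates, counts):
    """Pick the candidate spelling that best matches the clean corpus."""
    return max(candidates, key=lambda w: score(w, counts))


counts = train(["level", "lemon", "melon", "model"])
noisy = add_noise("level", 0.5, random.Random(0))
choice = best_candidate(["1evel", "level"], counts)
```

Noisy copies of clean text produced this way give (noisy, clean) pairs
to evaluate against, and the n-gram counts give a crude way to rank
candidate corrections.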

Continuing from your next email, the ability to use EMILLE as a
training/seed corpus depends on the license under which it is made
available.

> 3. Does the IR system have to be implemented on top of Lucene (or other open
> source software), or can it be completely standalone?

I was hoping that we would be able to utilize Elasticsearch or
similar. Lucene is an option too.

--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>