Hello,

I am Alok Kothari. I am interested in applying to GSoC 2013 and would like to work with Ankur.

Background: I graduated from IIT Kharagpur in 2009 and have been involved in research in IR/NLP and machine learning for nearly two years.

I am interested in the project 'Improving information retrieval methods for OCR data sets consisting of Indic scripts', and have a few questions:

1. Could I have a look at, or get some indication of, the quality of the files available? This would give me an idea of the kinds of errors the OCR produces.

2. In the project, can I assume access to some 'clean' corpus that I can use towards correcting errors in the digitised corpus? For example, I could learn n-grams from known 'correct' text to correct likely errors in the OCR text. There are some ways to obtain such a corpus.

3. Does the IR system have to be implemented on top of Lucene (or other open-source software), or can it be completely stand-alone?

Thank you!

Best,
Alok
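
P.S. To illustrate the idea in (2), here is a minimal sketch, not a real implementation: all names and the toy corpus are hypothetical, and it uses a plain vocabulary lookup with edit distance rather than n-gram language-model scoring, which a real system would need for Indic scripts.

```python
from collections import Counter

def edit_distance(a, b):
    # Classic Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def correct(token, vocab_counts, max_dist=2):
    # Keep in-vocabulary tokens; otherwise pick the closest vocabulary
    # word (ties broken by frequency in the clean corpus).
    if token in vocab_counts:
        return token
    best = min(vocab_counts,
               key=lambda w: (edit_distance(token, w), -vocab_counts[w]))
    return best if edit_distance(token, best) <= max_dist else token

# Hypothetical 'clean' corpus; real data would be Indic-script text.
clean_text = "the quick brown fox jumps over the lazy dog".split()
vocab = Counter(clean_text)

print(correct("qu1ck", vocab))  # -> quick
```

The same frequency table generalises to character or word n-grams, so candidate corrections can be ranked by how plausible the surrounding context makes them, not just by edit distance.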