<p style="margin:0px 0px 0px 0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">1. INTRODUCTION</span></p>

<p style="margin:0px 0px 0px 0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">I am Gourab Saha, an M.Tech (Computer Science) student at the Indian Statistical Institute, Kolkata. I have a diverse computing background, with project experience in networking, parallel computing (using NVIDIA CUDA), and web applications, and a decent amount of coding experience in C, C++, Java, and PHP. My current studies focus on <b>information retrieval</b>. I am a quick learner and a highly motivated, hardworking person.</span></p>


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">2. PROJECT INTEREST</span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">I am interested in the project</span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">“<b>Improving information retrieval methods for OCR data sets consisting of Indic scripts</b>”</span></p>


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"><b></b></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;color:#000099"><span style>Mentor: <a href="https://fedoraproject.org/wiki/User:Sankarshan"><span style="text-decoration:underline;letter-spacing:0px">Sankarshan Mukhopadhyay</span></a> </span></p>


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"> </span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">3. FIRST DRAFT OF PROJECT PROPOSAL</span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">3.1 AVAILABLE TRAINING DATA (this data is freely available)</span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"><span class="Apple-tab-span" style="white-space:pre">     </span></span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px"><span class="Apple-tab-span" style="white-space:pre">       </span>I will focus on evaluating IR effectiveness on the Indic-script OCR'd text data of RISOT 2011. It is a relevance-judged collection of 62,825 articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each article, both the original digital text and the corresponding OCR output are given. Relevance judgments are available for 92 topics. The OCR output was obtained by rendering each digital document as a document image, which was then processed by a Bangla OCR system. The document images vary in font face, character style, and size. The character-level (more specifically, Unicode-level) accuracy of the OCR engine is about 92%. The same topic statements are also available in English. A snapshot of the available documents is attached to this mail.</span></p>
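<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">To make the accuracy figure concrete, here is a minimal sketch of one common way to compute character-level accuracy between a clean text and its OCR output. I do not know exactly how the 92% figure for RISOT 2011 was computed; the definition used here (one minus the edit distance divided by the reference length, counted in Unicode code points) is my assumption, and the function names are only illustrative.</span></p>

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """Assumed definition: 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">Calling char_accuracy(clean_text, ocr_text) on each article pair and averaging the results would give one corpus-level estimate under this definition.</span></p>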

<div><span style="letter-spacing:0px"><br></span></div><div><span style="letter-spacing:0px"><p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">3.2  PROPOSAL 1 (INTRODUCING DEGRADATION MODEL INTO CORPUS)</span></p>


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px"><span class="Apple-tab-span" style="white-space:pre">       </span>A Bengali OCR system using a feature-based template-matching approach was used to convert these images into electronic text. Automatic evaluation found the accuracy of the OCR engine to be about 92%. Manually typed clean text from the newspaper was converted into images and then into OCR'd text. Because the accuracy of the OCR'd data is as high as 92% for this corpus, I want to introduce a degradation model that adds errors to the OCR'd documents, lowering their quality to a level that may be more suitable for our experiments. On the degraded corpus, I then want to try a retrieval algorithm based on error modeling in conjunction with stemming and overlapping n-grams. I will measure performance using precision, recall, and other standard information retrieval evaluation measures.</span></p>
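<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">As a rough illustration of the idea, the sketch below shows a very simple degradation model (random character substitutions drawn from a confusion table) together with overlapping character n-grams. This is only one possible design, not a settled plan: the confusion table, the error rate parameter, and all function names are hypothetical, and a real table would have to be estimated from aligned clean and OCR'd text.</span></p>

```python
import random

# Hypothetical confusion table of visually similar Bangla characters.
# A real table would be estimated from aligned clean/OCR'd text.
CONFUSIONS = {"ব": ["র"], "র": ["ব"]}


def degrade(text, error_rate, confusions=CONFUSIONS, seed=0):
    """Swap each character listed in `confusions` for a look-alike
    with probability `error_rate`; other characters are left alone."""
    rng = random.Random(seed)  # seeded so that a degraded corpus is reproducible
    out = []
    for ch in text:
        if ch in confusions and rng.random() < error_rate:
            out.append(rng.choice(confusions[ch]))
        else:
            out.append(ch)
    return "".join(out)


def char_ngrams(word, n=3):
    """Overlapping character n-grams; the whole word if it has at most n characters."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">The reason for pairing the two is that a single substituted character changes only the n-grams that contain it, so a degraded word and its clean form still share most of their n-grams, whereas exact word matching would fail.</span></p>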


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>
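<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">For reference, the evaluation measures mentioned in Proposal 1 can be computed from a ranked result list and the set of documents judged relevant for a topic. The sketch below uses the standard textbook definitions of precision, recall, and average precision; the function names and the document identifiers in the usage note are illustrative only.</span></p>

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k documents of one ranked list."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Sum of precision at the rank of each relevant document retrieved,
    divided by the total number of relevant documents for the topic."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">For example, with the ranking d1, d2, d3, d4 and relevant set {d1, d3, d5}, precision at rank 2 is 1/2, recall at rank 2 is 1/3, and average precision is (1/1 + 2/3) / 3 = 5/9. Averaging average precision over all 92 topics would give mean average precision.</span></p>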

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">3.3 PROPOSAL 2  (CROSS-LANGUAGE INFORMATION RETRIEVAL ON OCR’ED BENGALI DATA)</span></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px"><span class="Apple-tab-span" style="white-space:pre">       </span>My second proposal is a cross-language information retrieval model for OCR'd text.</span></p>


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">Since we have a corpus of OCR'd Bengali text and English versions of the same topics, we can try to develop a cross-language information retrieval algorithm. The goal is to retrieve relevant information from the OCR'd Bengali corpus using clean-text queries in English. For example, the English query “Singur land dispute” should retrieve the relevant documents from the OCR'd data set of the Bengali newspaper Anandabazar Patrika. As part of this second proposal, I want to implement such a cross-language retrieval algorithm using the RISOT 2011 Bengali and English data.</span></p>
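<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">One simple way to approach this, given only as a sketch under assumptions, is dictionary-based query translation followed by character n-gram matching against the OCR'd documents. The tiny English-to-Bengali lexicon, the two sample documents, and all function names below are hypothetical and are not part of RISOT 2011; the scoring function is a plain overlap ratio chosen for brevity, not a proposed ranking model.</span></p>

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams; the whole string if it has at most n characters."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def translate_query(query, lexicon):
    """Word-by-word dictionary lookup; words missing from the lexicon are dropped."""
    return [lexicon[w] for w in query.lower().split() if w in lexicon]


def score(terms, document, n=3):
    """Fraction of the query's n-grams that also occur in the document."""
    query_grams = set()
    for term in terms:
        query_grams |= char_ngrams(term, n)
    if not query_grams:
        return 0.0
    doc_grams = set()
    for token in document.split():
        doc_grams |= char_ngrams(token, n)
    return len(query_grams & doc_grams) / len(query_grams)


def search(query, documents, lexicon, n=3):
    """Rank document ids by descending overlap with the translated query."""
    terms = translate_query(query, lexicon)
    scored = [(score(terms, text, n), doc_id) for doc_id, text in documents.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]


# Hypothetical toy data; "d1" contains a simulated OCR error in its first word.
LEXICON = {"singur": "সিঙ্গুর", "land": "জমি", "dispute": "বিবাদ"}
DOCUMENTS = {
    "d1": "সিঙ্গুবে জমি নিয়ে বিবাদ",
    "d2": "কলকাতায় বৃষ্টি",
}
results = search("Singur land dispute", DOCUMENTS, LEXICON)
```

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman"><span style="letter-spacing:0px">In this toy example the first document is still retrieved despite the OCR error, because most of the n-grams of the translated query terms survive. A real system would need a proper bilingual dictionary or translation resource and a better weighting scheme.</span></p>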


<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

<p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px"><span style="letter-spacing:0px"></span><br></p>

</span><span class="Apple-style-span" style="font-family:'Times New Roman';font-size:12px">I would be grateful if you could go through my project proposals and give your opinion on the viability of this project as part of GSoC 2012. I am continuing to explore the topic further.</span><span style="letter-spacing:0px"><p style="margin:0px 0px 0px 0px;line-height:18.0px;font:12.0px Times New Roman;min-height:15.0px">

<span style="letter-spacing:0px"></span></p></span></div>