<div dir="ltr">Hi,<div><br></div><div> I am an applicant for GSoC 2013. I am enthusiastic about working on <span style="font-size:15.555556297302246px">"</span><b style="font-family:'Times New Roman';font-weight:normal"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">Improving information retrieval methods for OCR data sets consisting of Indic scripts."</span></b></div>
<div><font color="#000000" face="Arial"><span style="white-space:pre-wrap">Before I post the proposal, I would like to discuss the plan I have drafted based on my understanding of the project idea.</span></font></div><div>
<font color="#000000" face="Arial"><span style="white-space:pre-wrap"><br></span></font></div><div><b style="font-family:'Times New Roman';font-weight:normal"><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt">
<span style="font-family:Arial;font-weight:bold;vertical-align:baseline;white-space:pre-wrap">Synopsis: </span></p><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;margin-left:36pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">1. Familiarize myself with the current methods and algorithms used to retrieve information from digitized text, and with their shortcomings.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;margin-left:36pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">2. Identify the causes of these shortcomings and of the degradation of the text.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;margin-left:36pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">3. Propose and implement a retrieval system that does not degrade the text, i.e., improve the text processing.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;margin-left:36pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">4. Improve the existing search algorithms by weeding out inefficiencies, and propose additions that increase efficiency.</span></p>
<div style="font-size:medium"><br></div></b></div><div><b style="font-family:'Times New Roman';font-weight:normal"><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt">
<span style="font-family:Arial;background-color:transparent;font-weight:bold;vertical-align:baseline;white-space:pre-wrap">Implementation details of the project:</span></p><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap"></span><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt">
<span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">1. Test the current methods of retrieving information from digitized text to find specific problems and shortcomings, and file these as issues. Each issue will describe in technical detail where the search falls short.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">2. Correct character-level errors where possible, and make the search tolerant of the character-level errors that remain.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">3. Develop a system to classify documents according to tags. Adding tags to the documents would help narrow down the search.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">4. Reduce errors by predicting the intended word when its characters appear to have been recognized inaccurately.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">5. Continue improving the search implementation as new errors surface.</span></p>
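<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">To make points 2 and 4 more concrete, here is a minimal sketch of one possible approach: matching an OCR output word against a known vocabulary by edit distance. This is only an illustration under my own assumptions, not a description of any existing system. The function names (edit_distance, closest_word), the sample vocabulary and the threshold max_dist are all hypothetical. The distance is counted in Unicode code points, which is a simplification for Indic scripts, where a single visible syllable can span several code points.</span></p>
<pre>
```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_word(ocr_word, vocabulary, max_dist=1):
    """Return the vocabulary word nearest to ocr_word, or None if none is within max_dist."""
    best = min(vocabulary, key=lambda w: edit_distance(ocr_word, w), default=None)
    if best is not None and max_dist >= edit_distance(ocr_word, best):
        return best
    return None


# Hypothetical vocabulary; the query has one misrecognized character.
vocabulary = ["भारत", "नगर", "पुस्तक"]
print(closest_word("मारत", vocabulary))  # prints भारत
```
</pre>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">A real implementation would probably need an index rather than a linear scan over the vocabulary, and possibly substitution costs that reflect which characters the OCR engine tends to confuse.</span></p>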
<br><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap"></span><p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-weight:bold;vertical-align:baseline;white-space:pre-wrap">Phases/Milestones with dates:</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">1. June 17 - June 27: Identify the errors in specific terms and find their causes.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">2. June 27 - July 7: Make the retrieval independent of character-level errors, i.e., improve the recognition of words as a whole.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">3. July 7 - July 24: Work around other problems in the current methods of standardized and structured text processing.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">4. July 24 - August 1: Implement the tagging system. (The bot chooses from a list of predefined tags and assigns them to each document on the basis of its first few pages, thus reducing the amount of full-text search that needs to be done.)</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">5. August 1 - August 12: Implement information retrieval by text summarization.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">6. August 12 - August 22: Implement search on the basis of text summarization.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">7. August 22 - September 2: Implement the error-correction methods to improve performance.</span></p>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">8. September 2 - September 16: Find weaknesses in the implemented system and fix them.</span></p>
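<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">As a rough illustration of the tagging step in milestone 4, the sketch below scores each predefined tag by how often its keywords occur in the first few pages of a document. Everything here is an assumption made for illustration: the function name assign_tag, the per-tag keyword lists, the first_n parameter and the whitespace tokenization, which would need to be replaced by proper tokenization for Indic-script OCR text.</span></p>
<pre>
```python
from collections import Counter


def assign_tag(pages, tag_keywords, first_n=3):
    """Pick the tag whose keywords occur most often in the first few pages.

    pages: list of page texts; tag_keywords: dict mapping a tag to its keywords.
    Returns None when no keyword of any tag occurs.
    """
    # Only the opening pages are scanned, so the full text is never read.
    words = Counter(" ".join(pages[:first_n]).split())
    scores = {tag: sum(words[k] for k in keywords)
              for tag, keywords in tag_keywords.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical tags and pages, in English for readability.
tags = {"history": ["king", "empire"], "science": ["atom", "energy"]}
pages = ["the king ruled the empire", "the empire fell", "a late page about energy"]
print(assign_tag(pages, tags, first_n=2))  # prints history
```
</pre>
<p dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap">A search could then be restricted to documents carrying the relevant tag before any full-text matching is attempted.</span></p>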
<div style="font-size:medium"><span style="font-size:15px;font-family:Arial;background-color:transparent;vertical-align:baseline;white-space:pre-wrap"><br></span></div></b><div> Have I missed anything in my understanding of the project? I would be happy to receive any clarifications.</div>
<div><br></div>-- <br><div dir="ltr">Aarti K. Dwivedi</div><div dir="ltr">(2nd year Undergraduate,</div><div dir="ltr">IIT Roorkee)<br><div><br></div></div>
</div></div>