[Project-ideas] GSoC 2013 Query

Sat Apr 20 14:17:00 PDT 2013

Hi,

I am an applicant for GSoC 2013. I am enthusiastic about working on
"Improving information retrieval methods for OCR data sets consisting of
Indic scripts."
Before I post the proposal, I would like to discuss what I have framed on
the basis of my understanding of the project idea.

Synopsis:

1. My first step would be to familiarize myself with the current methods
and algorithms used to retrieve information from digitized text, and with
their shortcomings.

2. Figure out the reasons for these shortcomings and for the degradation of
the text.

3. Propose and implement a retrieval system that does not lead to
degradation, i.e., improve the text processing.

4. Improve the existing search algorithms by weeding out inefficiencies,
and propose additions that increase efficiency.

Implementation details of the project:

1. Test the current methods of retrieving information from digitized text
to find specific problems and areas of shortcoming. File these as issues,
describing each shortcoming in terms of the technical details of where the
search falls short.

2. Remove character-level errors and make the search independent of
character-level errors.

3. Develop a system to classify documents according to tags. Adding tags to
the documents would help narrow down the search.

4. Reduce errors by predicting words when characters appear to have been
recognized inaccurately.

5. Continue improving the search implementation as errors come to light.
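
To make point 2 above more concrete, here is a minimal sketch of search
that tolerates character-level errors by using Levenshtein edit distance.
This is only one possible technique, not the project's existing method. The
names edit_distance and fuzzy_search and the max_errors threshold are my
own assumptions for illustration. Strings are compared by Unicode code
point, so in Indic scripts a dependent vowel sign counts as one character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query, words, max_errors=1):
    """Return the words within max_errors edits of the query (hypothetical helper)."""
    return [w for w in words if edit_distance(query, w) <= max_errors]


if __name__ == "__main__":
    # An OCR output that dropped a vowel sign still matches the query.
    print(fuzzy_search("भारत", ["भारत", "भरत", "नगर"]))
```

A linear scan like this would be too slow for a large index; it is only
meant to show the matching criterion.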

Phases/Milestones with dates:

1. June 17 - June 27: Identify specific errors and find out their causes.

2. June 27 - July 7: Make the retrieval independent of character-level
errors, i.e., improve the recognition of words as a whole.

3. July 7 - July 24: Work around other problems in the current methods of
standardized and structured text processing.

4. July 24 - August 1: Implement the tagging system. (The bot chooses from
a list of predefined tags and assigns them to each document on the basis of
its first few pages, thus reducing the amount of full-text search that
needs to be done.)

5. August 1 - August 12: Implement information retrieval by text
summarization.

6. August 12 - August 22: Implement search on the basis of text
summarization.

7. August 22 - September 2: Implement the error correction methods to
improve performance.

8. September 2 - September 16: Find loopholes in the implemented system and
fix them.
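
As a rough illustration of the tagging step (milestone 4), the sketch below
assigns tags by looking for keywords in the first few pages of a document.
It assumes each document is available as a list of page strings and that
each predefined tag comes with a hand-written keyword list. The assign_tags
function, the tag_keywords mapping and the first_n_pages parameter are
hypothetical names, and the real system could well use a different
classifier.

```python
def assign_tags(pages, tag_keywords, first_n_pages=3):
    """Return the sorted tags whose keywords occur in the first pages."""
    # Only the first few pages are read, so the full text is never scanned.
    text = " ".join(pages[:first_n_pages]).lower()
    words = set(text.split())
    return sorted(tag for tag, keywords in tag_keywords.items()
                  if words & set(keywords))


if __name__ == "__main__":
    # Hypothetical tag list; keywords are expected in lower case.
    tags = {"history": ["king", "empire"], "science": ["atom", "energy"]}
    pages = ["The king ruled the empire", "page two", "page three",
             "atom energy"]
    print(assign_tags(pages, tags))
```

A search could then be restricted to documents carrying the relevant tag
before any full-text matching is attempted.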

Is there anything that I have missed in my understanding of the project? I
would be happy to receive any clarifications on the project.

-- 
Aarti K. Dwivedi
(2nd year Undergraduate, IIT Roorkee)