[Project-ideas] GSoC 2013 Query

Sun Apr 21 07:15:32 PDT 2013

On Sun, Apr 21, 2013 at 2:47 AM, Aarti K. Dwivedi
<ellydwivedi2093 at gmail.com> wrote:
> Hi,
>
>      I am an applicant for GSoC 2013. I am enthusiastic about working on
> "Improving information retrieval methods for OCR data sets consisting of
> Indic scripts."
> Before posting the proposal, I wanted to discuss what I have framed based on
> my understanding of the project idea.
>
> Synopsis:
>
> 1. My first step would be to familiarize myself with the current methods and
> algorithms that are used to retrieve information from digitized text, and
> also with their shortcomings.
>
> 2. Figure out the reasons for shortcomings and the degradation of text.
>
> 3. Propose and implement a retrieval system that does not lead to
> degradation, i.e., improve the text processing.
>
> 4. Improve the existing search algorithms by weeding out inefficiencies, and
> propose additions that increase efficiency.
>
>
> Implementation details of the project:
>
> 1. Test the current methods of retrieval of information from digitized text
> to find out specific problems and areas of shortcomings. File these as
> issues. The shortcomings are described in terms of technical details of
> where the search falls short.
>
> 2. Remove character-level errors and make the search independent of
> character-level errors.
>
> 3. Develop a system to classify documents according to tags. Addition of
> tags to the documents would help in narrowing down the search.
>
> 4. Reduce the error by predicting words when characters are perceived to be
> inaccurate.
>
> 5. Continue improving the search implementation as errors surface.
>
>
> Phases/Milestones with dates:
>
> 1. June 17 - June 27: Identify the errors in specific terms and find out
> their causes.
>
> 2. June 27 - July 7: Make the retrieval independent of character level, i.e.,
> improve the recognition of words as a whole.
>
> 3. July 7 - July 24: Work around other problems in the current methods of
> standardized and structured text processing.
>
> 4. July 24 - August 1: Implement the tagging system. (The bot picks tags from
> a predefined list and assigns them to each document on the basis of its
> first few pages, thus reducing the amount of full-text search that needs to
> be done.)
>
> 5. August 1 - August 12: Implement information retrieval by text
> summarization.
>
> 6. August 12 - August 22: Implement search on the basis of text
> summarization.
>
> 7. August 22 - September 2: Implement the error correction methods to
> improve performance.
>
> 8. September 2 - September 16: Find out loopholes in the implemented system
> and improve upon them.
>
>
>  Is there something that I have missed in understanding the project? I would
> be happy to receive any clarifications on the project.
>

Hi Aarti,

Thanks for your introduction. There are many ongoing threads on the
mailing list about the same subject:

http://lists.ankur.org.in/pipermail/project-ideas-ankur.org.in/2013-April/author.html

Kindly go through them and ask any questions that are not already
answered there.

Regards,

-- 
Bhavani Shankar
Ubuntu Developer       |  www.ubuntu.com
https://launchpad.net/~bhavi