[Project-ideas] GSoC 2013 Query

Mon Apr 22 01:40:49 PDT 2013

Hi,

    Thank you. I would like to take on a preliminary task, such as a bug
fix or a small feature addition.
What could I work on? I want to work on "Improving information
retrieval methods for OCR data sets consisting of Indic scripts."

Thank you,
Aarti Dwivedi
(2nd year Undergraduate,
IIT Roorkee)

On Sun, Apr 21, 2013 at 7:45 PM, Bhavani Shankar R <bhavi at ubuntu.com> wrote:

> On Sun, Apr 21, 2013 at 2:47 AM, Aarti K. Dwivedi
> <ellydwivedi2093 at gmail.com> wrote:
> > Hi,
> >
> >      I am an applicant for GSoC 2013. I am enthusiastic about working on
> > "Improving information retrieval methods for OCR data sets consisting of
> > Indic scripts."
> > Before I post the proposal, I would like to discuss what I have framed on
> > the basis of my understanding of the project idea.
> >
> > Synopsis:
> >
> > 1. Familiarize myself with the current methods and algorithms used to
> > retrieve information from digitized text, and with their shortcomings.
> >
> > 2. Identify the reasons for these shortcomings and for the degradation of
> > the text.
> >
> > 3. Propose and implement a retrieval system that does not lead to
> > degradation, i.e., improve the text processing.
> >
> > 4. Improve the existing search algorithms by weeding out inefficiencies,
> > and propose additions that increase efficiency.
> >
> >
> > Implementation details of the project:
> >
> > 1. Test the current methods of retrieving information from digitized
> > text to find specific problems and areas of shortcoming. File these as
> > issues, describing each shortcoming in terms of the technical details of
> > where the search falls short.
> >
> > 2. Correct character-level errors where possible, and make the search
> > independent of character-level errors.
> >
> > 3. Develop a system to classify documents according to tags. Adding
> > tags to the documents would help in narrowing down the search.
> >
> > 4. Reduce the error by predicting words when characters are perceived to
> > be inaccurate.
> >
> > 5. Continue improving the search implementation as errors surface.
> >
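
A minimal sketch of how items 2 and 4 above might be approached, assuming a plain Levenshtein edit distance as the measure of character-level error. The function names (edit_distance, fuzzy_find, correct) and the sample words are hypothetical and are not taken from any existing project code. Distances are counted in Unicode code points, which is only a rough stand-in for the visual units an OCR engine confuses in Indic scripts.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def fuzzy_find(query, ocr_words, max_errors=1):
    """Return the OCR words within max_errors edits of the query (item 2)."""
    return [w for w in ocr_words if edit_distance(query, w) <= max_errors]


def correct(ocr_word, lexicon):
    """Predict the intended word as the closest lexicon entry (item 4)."""
    return min(lexicon, key=lambda w: edit_distance(ocr_word, w))


# Hypothetical OCR output in which one character of the second word was misread.
ocr_words = ["भारत", "भारन", "नगर"]
matches = fuzzy_find("भारत", ocr_words)
best_guess = correct("भारन", ["भारत", "नगर"])
```

Here fuzzy_find tolerates a single misrecognized character in the indexed text, and correct maps a suspicious token back to a known word. A real system would probably need to weight visually similar characters differently and to normalize the text first.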
> >
> > Phases/Milestones with dates:
> >
> > 1. June 17 - June 27: Pin down the errors in specific terms and find out
> > their causes.
> >
> > 2. June 27 - July 7: Make the retrieval independent of character-level
> > errors, i.e., improve the recognition of words as a whole.
> >
> > 3. July 7 - July 24: Work around other problems in the current methods of
> > standardized and structured text processing.
> >
> > 4. July 24 - August 1: Implement the tagging system. (The bot chooses from
> > a list of pre-decided tags and assigns them to each document on the basis
> > of its first few pages, thus reducing the amount of full-text search that
> > needs to be done.)
> >
> > 5. August 1 - August 12: Implement information retrieval by text
> > summarization.
> >
> > 6. August 12 - August 22: Implement search on the basis of text
> > summarization.
> >
> > 7. August 22 - September 2: Implement the error correction methods to
> > improve performance.
> >
> > 8. September 2 - September 16: Find loopholes in the implemented system
> > and improve upon them.
> >
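
A rough sketch of the tagging step in milestone 4, assuming each pre-decided tag comes with a small set of keywords and that a document is a list of page strings. The function name assign_tags, its parameters, and the sample data are hypothetical illustrations only.

```python
def assign_tags(pages, tag_keywords, first_n_pages=3, min_hits=2):
    """Assign tags using only the first few pages of a document.

    pages: list of page texts, in reading order.
    tag_keywords: dict mapping each pre-decided tag to a set of keywords.
    A tag is assigned when at least min_hits tokens match its keywords.
    """
    tokens = " ".join(pages[:first_n_pages]).split()
    tags = []
    for tag, keywords in tag_keywords.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits >= min_hits:
            tags.append(tag)
    return sorted(tags)


# Hypothetical tag list and document.
TAGS = {
    "history": {"king", "empire", "war"},
    "science": {"atom", "energy"},
}
pages = ["the king ruled the empire", "war came", "atom"]
tags = assign_tags(pages, TAGS, first_n_pages=2)
```

A search could then be restricted to documents that carry a matching tag before any full-text lookup is attempted. Exact token matching is used here for brevity, so OCR errors in the first pages would still need to be handled separately.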
> >
> > Is there something that I have missed in understanding the project? I
> > would be happy to receive any clarifications on the project.
> >
>
> Hi Aarti,
>
> Thanks for your introduction. There are many threads in progress on the
> mailing list on the same subject:
>
>
> http://lists.ankur.org.in/pipermail/project-ideas-ankur.org.in/2013-April/author.html
>
> Kindly go through them and ask any questions that they do not already
> answer.
>
> Regards,
>
>
> --
> Bhavani Shankar
> Ubuntu Developer       |  www.ubuntu.com
> https://launchpad.net/~bhavi
>

-- 
Aarti K. Dwivedi