[Project-ideas] INTRODUCTION & PROJECT PROPOSAL “Improving information retrieval methods for OCR data sets consisting of Indic scripts” Mentor-Sankarshan Mukhopadhyay

Gourab Saha gourab.isikolkata at gmail.com
Thu Mar 22 08:57:47 PDT 2012


1. INTRODUCTION


I am Gourab Saha, an M.Tech (Computer Science) student at the Indian
Statistical Institute, Kolkata. I have a diverse computing background, with
project experience in networking, parallel computing (using NVIDIA CUDA),
and web applications, and a decent amount of coding experience in C, C++,
Java, and PHP. My current studies focus on *information retrieval*. I am a
quick learner and a highly motivated, hardworking person.


2. PROJECT INTEREST


I am interested in the project:

“Improving information retrieval methods for OCR data sets consisting of
Indic scripts”

Mentor: Sankarshan Mukhopadhyay <https://fedoraproject.org/wiki/User:Sankarshan>




3. FIRST DRAFT OF PROJECT PROPOSAL


3.1 AVAILABLE TRAINING DATA (this data is freely available)

I will focus on evaluating IR effectiveness on the Indic-script OCR'd text
of RISOT 2011. It is a relevance-judged collection of 62,825 articles from
a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For each
article, both the original digital text and the corresponding OCR output
are given. Relevance judgments are available for 92 topics. The OCR output
was obtained by rendering each digital document as a document image, which
was then processed by a Bangla OCR system. The document images vary in font
face, character style, and size. The character-level (more specifically,
Unicode-level) accuracy of the OCR engine is about 92%. The same topic
statements are also available in English. A snapshot of the available
documents is attached to this mail.
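
To make the accuracy figure concrete: a character-level accuracy can be
estimated by comparing each clean text with its OCR output. The sketch below
is only one common way to do this (one minus the edit distance divided by the
length of the clean text, counted in Unicode code points). It is an
assumption for illustration, not necessarily how the 92% figure for RISOT
2011 was computed, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(clean: str, ocr: str) -> float:
    """Assumed definition: 1 - edit_distance / len(clean), floored at 0."""
    if not clean:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(clean, ocr) / len(clean))
```

Averaging char_accuracy over all (clean, OCR) article pairs would give a
corpus-level estimate under this definition.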

3.2 PROPOSAL 1 (INTRODUCING A DEGRADATION MODEL INTO THE CORPUS)


The corpus was built by rendering manually typed clean newspaper text as
images and then converting those images back into electronic text with a
Bengali OCR system that uses a feature-based template-matching approach.
Automatic evaluation found the accuracy of the OCR engine to be about 92%.
Because the OCR accuracy for this corpus is that high, I want to introduce
a degradation model that adds errors to the OCR'd documents, lowering their
quality to a level that may be more suitable for our experiments. On the
degraded corpus I then want to try a retrieval algorithm based on error
modeling in conjunction with stemming and overlapping n-grams. I will
measure performance using precision, recall, and other information
retrieval evaluation measures.
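
The pieces of this proposal can be sketched in a few lines. The code below is
a minimal illustration under stated assumptions, not the proposed system: the
degradation model is a simple uniform random substitute/delete/insert model
(a real model would more likely use OCR confusion statistics), the n-gram
matching uses Jaccard overlap of padded character trigrams, and all function
names and parameters are hypothetical.

```python
import random


def degrade(text, error_rate, alphabet, seed=0):
    """Corrupt non-space characters with probability error_rate (assumed model)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isspace() or rng.random() >= error_rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.extend((ch, rng.choice(alphabet)))
        # "delete": append nothing
    return "".join(out)


def char_ngrams(word, n=3):
    """Set of overlapping character n-grams, with '#' marking word boundaries."""
    padded = f"#{word}#"
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' n-gram sets (1.0 means identical sets)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

A query term and a corrupted document term that differ in one character still
share most of their trigrams, which is why overlapping n-grams are a natural
fit for matching against noisy OCR text.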



3.3 PROPOSAL 2 (CROSS-LANGUAGE INFORMATION RETRIEVAL ON OCR'D BENGALI DATA)


My second proposal is a cross-language information retrieval model for
OCR'd text.

Since we have a corpus of OCR'd Bengali text and an English corpus covering
the same topics, we can try to develop a cross-language information
retrieval algorithm. The goal is to retrieve relevant information from the
OCR'd Bengali corpus using clean-text queries in English. For example, the
English query “Singur land dispute” should retrieve the relevant documents
from the OCR'd data set of the Bengali newspaper Anandabazar Patrika. As
part of this second proposal I want to implement such a cross-language
retrieval algorithm using the RISOT 2011 Bengali and English corpus.
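
The proposal does not yet fix a translation method. As one possible starting
point, the sketch below assumes dictionary-based query translation: each
English query word is mapped to Bengali candidates, and documents are ranked
by how many translated terms they contain. The tiny dictionary, the document
texts, and all function names are hypothetical placeholders, not RISOT 2011
data.

```python
# Toy English-to-Bengali dictionary (hypothetical; a real system would need
# a full bilingual lexicon or another translation resource).
TOY_DICTIONARY = {
    "singur": ["সিঙ্গুর"],
    "land": ["জমি"],
    "dispute": ["বিবাদ", "বিতর্ক"],
}


def translate_query(query, dictionary):
    """Map each English query word to all of its Bengali candidates."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, []))  # unknown words are dropped
    return terms


def score(doc_text, terms):
    """Count how many translated terms occur as whole tokens in the document."""
    tokens = set(doc_text.split())
    return sum(1 for term in terms if term in tokens)


def retrieve(query, docs, dictionary):
    """Return ids of matching documents, best score first (ties broken by id)."""
    terms = translate_query(query, dictionary)
    scored = [(score(text, terms), doc_id) for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]
```

Exact token matching as used here would miss terms corrupted by OCR errors,
so in practice the score function would need an error-tolerant comparison.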



I would be highly obliged if you would kindly go through my project
proposals and give your opinion on the viability of this project as a part
of GSoC 2012. I am continuing to explore the topic.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: oc.tiff
Type: image/tiff
Size: 141498 bytes
Desc: not available
URL: <http://lists.ankur.org.in/pipermail/project-ideas-ankur.org.in/attachments/20120322/9170383a/attachment-0002.tiff>

