[Project-ideas] OCR tools for Bengali language to 98%

sourav dutta mailsouravdutta at gmail.com
Mon Apr 2 20:42:48 PDT 2012


Hi,
Sorry for coming in so late; I came to know about GSoC only a few days ago.
I am an undergraduate doing my B.Tech (Hons) at IIIT Hyderabad. I have worked
in OCR, vision (SfM), image processing, and information retrieval.

I have worked with OCR (English) before and would like to work with
Bengali text too.
To achieve 98% accuracy we first need to find the bottlenecks in the
recognition process, which occur mainly in the preprocessing step. Also, to
increase the accuracy I plan to use a dictionary look-up based post-process.
Here is a basic overview of my idea.

A) Preprocessing
       i) Image acquisition and binarization - convert the image to
          grayscale and then binarize it using Otsu's method.
      ii) Noise elimination - this is a huge area in itself, since many
          kinds of noise are possible. Background noise can be removed with
          salt-and-pepper noise filtering and connected component analysis.
     iii) Skew detection and correction - first identify the upper envelope
          of the text, then apply the Radon transform to that envelope to
          get the skew angle.
      iv) Line, word and character level segmentation - segment and isolate
          each character (noise can add to splitting errors).
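
A minimal sketch of the binarization in step A.i, assuming the page has already been converted to 8-bit grayscale and is given as a plain list of pixel values. The function names `otsu_threshold` and `binarize` are only illustrative; a real implementation would more likely use an image library. It picks the gray level that maximizes the between-class variance, which is the idea behind Otsu's method.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the class <= t
    sum0 = 0  # sum of gray levels in the class <= t
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray_rows, t):
    """Mark dark pixels (value <= t) as ink (1) and the rest as background (0)."""
    return [[1 if p <= t else 0 for p in row] for row in gray_rows]


page = [[10, 200, 200], [10, 10, 200]]          # toy 2x3 grayscale image
t = otsu_threshold([p for row in page for p in row])
ink = binarize(page, t)
```

Treating dark pixels as ink assumes dark text on a light background, which is the usual case for scanned documents.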
B) Pattern classification - pattern matching can itself be done in various
   ways: template matching, neural networks, HMMs, SVMs. HMMs are known to
   have the best accuracy for character recognition.
       i) For the features we can use vertically segmented characters in
          the DCT domain.
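
One possible reading of the feature idea in B.i, sketched under assumptions: the binarized character is cut into vertical strips, each strip is reduced to a per-row ink count, and the first few DCT-II coefficients of that profile form one feature vector. The left-to-right sequence of vectors could then serve as the observation sequence for an HMM. The strip width, the number of coefficients kept, and the names `dct2_1d` and `strip_features` are illustrative choices, not fixed parts of the proposal.

```python
import math


def dct2_1d(x):
    """Unnormalized 1-D DCT-II of the sequence x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def strip_features(glyph, strip_width=2, n_coeffs=4):
    """Cut a binary glyph (list of rows of 0/1) into vertical strips and
    return the first n_coeffs DCT coefficients of each strip's row profile."""
    width = len(glyph[0])
    features = []
    for x0 in range(0, width, strip_width):
        # Ink count per row inside this vertical strip.
        profile = [sum(row[x0:x0 + strip_width]) for row in glyph]
        features.append(dct2_1d(profile)[:n_coeffs])
    return features


glyph = [[1, 1, 1, 1] for _ in range(4)]   # toy 4x4 all-ink glyph
feats = strip_features(glyph, strip_width=2, n_coeffs=2)
```

Keeping only the low-order coefficients keeps the feature vector short while retaining the coarse shape of each strip.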
C) Training - any of these classifiers needs supervised training with an
   annotated dataset.
D) Post-processing - we can use a spelling checker to correct erroneously
   recognized words with dictionary lookup.
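
The dictionary lookup in step D could be sketched roughly as below, assuming a plain word list and Levenshtein edit distance as the similarity measure. The function names, the `max_dist` cutoff, and the romanized sample words are hypothetical. Python compares strings code point by code point, so Bengali words work unchanged, although a real system would probably want to compare whole characters with their vowel signs rather than single code points.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def correct(word, dictionary, max_dist=2):
    """Return the closest dictionary word within max_dist edits,
    or the word unchanged if nothing is close enough."""
    if not dictionary or word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word


words = ["bangla", "bhasha"]            # hypothetical romanized entries
fixed = correct("bamgla", words)        # one wrong character from the OCR
```

Leaving far-off words untouched avoids replacing names and rare words that are simply missing from the dictionary.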



-- 
Sourav Dutta
CSE,UG3
IIIT H