[Project-ideas] OCR tools for Bengali language to 98%

Sankarshan Mukhopadhyay sankarshan.mukhopadhyay at gmail.com
Mon Apr 2 21:30:15 PDT 2012


On Tue, Apr 3, 2012 at 9:12 AM, sourav dutta <mailsouravdutta at gmail.com> wrote:

> Sorry for coming in so late; I came to know about GSoC only a few days ago.
> I am an undergraduate doing my B.Tech (Hons) at IIIT Hyderabad. I have worked
> on OCR, vision (SfM), image processing and information retrieval.

You shouldn't wait for the list admin to approve a non-member post. We
make a point of asking everyone to subscribe to the list.

> Here is the basic overview of my idea.
>
> A) Preprocessing
>      i) Image acquisition and binarization - convert the image to gray
>         scale and then binarize it using Otsu's method.
>     ii) Noise elimination - this is a huge area in itself, and many kinds
>         of noise are possible. Background noise such as salt-and-pepper
>         noise can be removed with connected component analysis.
>    iii) Skew detection and correction - first we identify the upper
>         envelope of the text, then apply the Radon transform to it to
>         get the skew angle.
>     iv) Line, word and character level segmentation - segment and isolate
>         each character (noise can add to splitting errors).
> B) Pattern classification - pattern matching can itself be done in various
>    ways: template matching, neural networks, HMMs, SVMs. HMMs are known to
>    have the best accuracy for character recognition.
>      i) For the features we can use vertically segmented characters in the
>         DCT domain.
> C) Training - any of these classifiers needs supervised training with an
>    annotated dataset.
> D) Post-processing - we can use a spelling checker with dictionary lookup
>    to correct erroneously recognized words.

The above is the basic theory, which can be turned into an
implementation. I don't see much originality in how it addresses the
problem. Since proposal submissions close on 06 Apr 2012, would you
like to take the above and convert it into a reasonable proposal?
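
As a rough illustration of how small a first step can be, here is a
minimal sketch of step A(i) from the quoted outline: gray-scale
conversion followed by Otsu binarization. It is plain Python with no
dependencies. The function names (to_gray, otsu_threshold, binarize),
the luminance weights and the assumption of dark text on a light
background are illustrative assumptions, not part of the quoted outline
or of any existing OCR tool.

```python
def to_gray(r, g, b):
    """Convert one RGB pixel to an 8-bit gray level.

    The 0.299/0.587/0.114 luminance weights are an assumed, commonly
    used choice; the quoted outline does not specify a conversion.
    """
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def otsu_threshold(pixels):
    """Return the gray level t in 0..255 that maximises the between-class
    variance, where class 0 holds levels <= t and class 1 holds the rest.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of gray levels in class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray_rows):
    """Binarize a gray image given as a list of rows of 0..255 integers.

    Assumes dark ink on a light page: pixels at or below the Otsu
    threshold become 1 (ink), all others become 0 (background).
    """
    flat = [p for row in gray_rows for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in gray_rows]
```

For example, binarize([[10, 200], [205, 12]]) marks the two dark pixels
as ink. A proposal would still need to say how it copes with unevenly
lit scans, where a single global threshold often falls short.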

-- 
sankarshan mukhopadhyay
<http://sankarshan.randomink.org/blog/>


