[Project-ideas] follow up discussions - improve accuracy of bengali OCR

Sat Apr 13 19:02:01 PDT 2013

Hi Sankarshan,
As last discussed on IRC, I have been reading about the existing tools,
mainly "tesseractIndic". The project appears to be dormant: it has been
almost a month and my membership of the Google group is still pending.
However, I like its approach of using separate Python scripts for
pre-processing, and I believe the same style could be used for further
improvements. The blog is really helpful.
I also read about the "banglaOCR" project developed at CRBLP, BRAC
University, Bangladesh, and am currently going through the details.
Most probably I would like to build on one of the two systems, or maybe
on a hybrid of both.

I have a few questions at this point:
1. The idea's objective is to improve accuracy to 98%. My question is: do
we have some benchmark data, or shall we define it for our purpose? I read
about the FIRE
Also, M. A. Hasnat, the developer of BanglaOCR, pointed out to me that the
accuracy may not be the same for all domains (e.g., newspapers, books,
typewritten documents), so domain adaptability should be considered.
Personally, I feel we should focus on perfecting the system for one domain
first and then look into the other domains.
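
To make the benchmark question concrete, here is a rough sketch of how a
per-domain accuracy figure might be computed once we have reference
transcriptions. It is only an illustration under my own assumptions:
accuracy is taken as 1 minus (character edit distance / reference length),
and the function names and the (domain, reference, OCR output) sample
layout are hypothetical, not taken from tesseractIndic or BanglaOCR.
It counts Unicode code points, so a Bengali conjunct counts as several
characters; a real benchmark may prefer to count grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (counted in code points)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / len(ref), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def accuracy_by_domain(samples):
    """samples: iterable of (domain, reference_text, ocr_text) tuples.

    Errors and reference lengths are pooled per domain, so long pages
    weigh more than short ones.
    """
    totals = {}
    for domain, ref, hyp in samples:
        errors, chars = totals.get(domain, (0, 0))
        totals[domain] = (errors + edit_distance(ref, hyp), chars + len(ref))
    return {
        domain: (max(0.0, 1.0 - errors / chars) if chars else 0.0)
        for domain, (errors, chars) in totals.items()
    }
```

Reporting one number per domain (newspaper, book, typewritten) rather
than a single overall figure would let us see whether a 98% target is met
in the domain we choose to focus on first.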
I would appreciate some clarification on these points.

-- 
-Regards,
Debajyoti Nag
http://twitter.com/aramis7d