<div dir="ltr">Hi Sankarshan,<br>As we last discussed on IRC, I have been reading about the existing tools, mainly "tesseractIndic". The project appears to be dormant, though: it has been almost a month and my membership request for the Google group is still pending.<br>
However, I like the approach of using separate Python scripts for pre-processing, and I believe the same style could be used for further improvements. The blog is really helpful.<br>I also read about the "banglaOCR" project developed at CRBLP, BRAC University, Bangladesh, and I am currently going through the details.<div>
Most probably I would like to build on one of the two systems, or perhaps a hybrid of both.</div><div><br>I have some questions at this point:<br>1. The idea's objective states that accuracy should be improved to 98%. Do we have benchmark data for this, or shall we define it for our purpose? I read about the FIRE <br>
Also, M. A. Hasnat, the developer of BanglaOCR, pointed out to me that the accuracy may not be the same for all domains (e.g., newspapers, books, typewritten documents), so domain adaptability should be considered.<br>Personally, I feel we should focus on perfecting the system for one domain first and then look into the other domains.<br>
I would appreciate some clarification on these points.<br><div><div><br></div>-- <br>-Regards,<div>Debajyoti Nag</div><div><a href="http://twitter.com/aramis7d" target="_blank">http://twitter.com/aramis7d</a><br></div><div>
<br></div>
</div></div></div>