[Project-ideas] INTRODUCTION & PROJECT PROPOSAL

Sankarshan Mukhopadhyay sankarshan.mukhopadhyay at gmail.com
Thu Mar 22 22:07:41 PDT 2012


I didn't realize that Gourab had also mailed the list. My mistake; I
should have checked first.

On Thu, Mar 22, 2012 at 9:30 PM, Gourab Saha
<gourab.isikolkata at gmail.com> wrote:

> I will focus on evaluating IR effectiveness on the Indic-script OCR'd text
> data of RISOT2011. It is a relevance-judged collection of 62,825
> articles from a leading Bangla newspaper, Anandabazar Patrika (2004-2006). For
> each article, both the original digital text and the corresponding OCR
> results are given. Relevance judgments are available for 92 topics. The OCR
> output is obtained by rendering each digital document as a document image,
> which is then processed by a Bangla OCR system. The document images vary in
> font faces, character styles and sizes. The character-level (more
> specifically, Unicode-level) accuracy of the OCR engine is about 92%. The
> same topic statements are also available in English. Here is a snapshot of
> available documents.

To keep the conversation on the list, I'll summarize what I wrote to
him in response.

I am familiar with the RISOT work
(<http://www.isical.ac.in/~clia/risot/risot.html>) and the task list.
My concern is mostly about extending the RISOT work using the tools
and data sources from that work. At this point I have yet to receive a
statement on how the content of ABP can be re-used, or on whether
the OCR system he mentions has its source code released and
available under an appropriate free license.

Aside from that, the RISOT task list and approach work out reasonably
well when attempting to evolve the end objective of the project.

-- 
sankarshan mukhopadhyay
<http://sankarshan.randomink.org/blog/>


