[Project-ideas] OCR IR
Alok Kothari
kothari.alok at gmail.com
Fri Apr 19 04:33:46 PDT 2013
Thanks for your reply!
> > Background: I graduated from IIT Kharagpur in 2009 and have been involved in
> > research in IR/NLP and Machine Learning for nearly 2 years.
>
> Would it be possible to provide links to any papers/presentations or
> code that you have published?
>
Yes, definitely. Unfortunately, my old website at the organisation I
worked at is down. It contained more details of the projects.
However, here are the links to the papers:
https://dl.acm.org/citation.cfm?id=2010069&dl=ACM&coll=DL&CFID=316248085&CFTOKEN=31366376
http://www.aclweb.org/anthology/D11-1073
http://www.icwsm.org/2013/program/accepted-papers/ A recent one
('Detecting Comments on News Articles in Microblogs')
>
> > 1. I was wondering whether I could have a look at or have some indication of
> > the quality of files available. This will give me some idea about the kinds
> > of error
>
> The project idea requires the interested candidate to propose within
> the scope of the project the kind of errors the initial
> iteration/release will handle.
>
I would be happy to propose some methods to tackle errors. I was wondering
whether I could have a look at the digitized text corpora themselves. For
example, I know there can be wrongly recognized characters, spelling
mistakes and such. However, I thought I would get a better idea of other
kinds of errors if I saw some of the documents for which such a search would
be built. Do you think this is possible?
>
> > 3. Does the IR system have to be implemented on top of Lucene (or other open
> > source software) or can it be completely standalone?
>
> I was hoping that we would be able to utilize ElasticSearch or
> similar. Lucene is an option too.
>
>
I will look at ElasticSearch.
Thanks again!
Alok