[Project-ideas] Queries regarding 'Improving models for Cross Language Text Re-use'

Ravi Kumar ravik.iiit at gmail.com
Sat Mar 31 15:42:06 PDT 2012


This is Ravi Kumar Singh from IIIT-Hyderabad. I know I am a little late
with my queries, but before I pitch my proposal I would like to inquire a
little more about the project *Improving models for Cross Language Text
Re-use*. The first question is whether we are looking at a
language-specific model or aiming to develop a basic language-independent
model. I know a language-independent model would definitely be the better
option, but without much prior research in this area, the accuracy of any
such model cannot be expected to be very high.
Secondly, are we supposed to choose our entire method of implementation
now, or does the project give us some time to research the existing methods
and see how they can be improved? To be clear, I am not being vague when I
say 'some methods'. Below I describe a few ideas I have in mind, along with
some basic ways in which they can be improved. The accuracy percentages I
mention are based on previous research done in this field, but testing them
on the actual dataset will still need some time.

Implementation methods-

1. RDF-ing the data and comparing the taxonomy trees thus generated -
With RDF (Resource Description Framework), we represent each sentence in
the form of a subject, predicate and object, and each object is uniquely
referenced by a URI (Uniform Resource Identifier). For example, a book can
be referenced by its ISBN number, and every material object can be
referenced by the link to its Wikipedia page. We still need to decide on a
unique way of referencing the objects; the W3C has given specific
guidelines for this under its Semantic Web Activity. Thus 'mango' in
English and 'aam' in Hindi would have the same URI. With every sentence we
also store the context, and that helps us in forming the RDF tree. Even if
a sentence is rewritten in another language, the URIs for the subject and
object, as well as the context in which they are used, remain the same, so
they will lead to a more or less similar RDF tree. Using a weighted RDF
tree, we can even remove the trivial cases. There is also an extensive
query language, SPARQL, that can help us in querying the RDF structures
thus formed. Since this is a new idea, we cannot estimate the accuracy
percentage without extensive testing. But it can very easily be deployed in
multilingual scenarios, because the query terms are generally signified by
nouns, which can easily be identified with a URI after simple translation.
The corresponding context can also be preserved by using techniques like
POS tagging in the source language.
http://en.wikipedia.org/wiki/Resource_Description_Framework
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
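
To make this idea concrete, here is a minimal sketch in Python using the
rdflib library (my choice, not something fixed by the project; any RDF
toolkit would do). The example.org namespace and the toy sentences are
hypothetical; in practice the URIs could be the Wikipedia/DBpedia links
described above.

    from rdflib import Graph, Namespace

    # Hypothetical concept namespace; real URIs could point to Wikipedia pages.
    EX = Namespace("http://example.org/concept/")

    g = Graph()

    # English: "The boy ate a mango."  ->  (boy, ate, mango)
    # Hindi:   "ladke ne aam khaya."   ->  (boy, ate, mango), since 'aam'
    # maps to the same URI as 'mango'. RDF graphs are sets, so the
    # duplicate triple collapses and the two sentences align exactly.
    g.add((EX.boy, EX.ate, EX.mango))   # from the English sentence
    g.add((EX.boy, EX.ate, EX.mango))   # from the Hindi sentence

    # SPARQL query over the graph: what did the boy eat?
    results = g.query("""
        SELECT ?obj WHERE {
            <http://example.org/concept/boy> <http://example.org/concept/ate> ?obj .
        }
    """)
    for row in results:
        print(row.obj)   # http://example.org/concept/mango

In a real system the comparison would be between the weighted RDF trees of
the suspicious and source documents; the point here is only that
translation-invariant URIs make the triples comparable across languages.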

2. Dictionary Based Approach - This method is widely used for
cross-language information retrieval, but it is highly language specific.
We have a bilingual dictionary which translates the query into the target
language. This method is expected to have a higher accuracy.
According to some previous research:

The average precision for document retrieval with the query in Oromo (a
major language of Ethiopia) and the target documents in English was 19.4%
(according to the paper 'Evaluation of Oromo-English Cross-Language
Information Retrieval' by Kula Kekeba Tune, Vasudeva Varma and Prasad
Pingali). Similarly, for English to German the figure was 56%.

This precision percentage depends on the specific language pair, the
dictionary used, and even the specific test set. In some cases the
vocabularies of two languages can also be very disparate. To improve the
accuracy of this technique, we need to improve:

-> how we handle the terms that could not be translated using the
dictionary. For this we could use web-based translation techniques. For
example, when we are searching for proper nouns, we could look for specific
web pages that exist in both the source and target languages; some search
engines provide this service. Then, using the tf-idf algorithm, we could
find the terms that are most related to the current word.
-> how we handle terms that have multiple translations in the dictionary.
For this we use a statistical similarity value among terms, i.e., we
determine the best translation by seeing how frequently the individual
translations of each query term appear together in documents (a small
sketch follows this list).
-> the query terms may never have an exact translation. To maintain the
effectiveness of the query, we also add a few terms that are related to the
exact translation without compromising the context.
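
To illustrate the second improvement above, here is a minimal sketch of
co-occurrence-based translation disambiguation in Python. The dictionary
entries and the toy corpus are made up for illustration; a real system
would use an actual bilingual lexicon and document collection, and the
first improvement could similarly reuse any standard tf-idf implementation.

    from itertools import product

    # Hypothetical Hindi -> English dictionary with ambiguous entries.
    DICTIONARY = {
        "aam": ["mango", "common"],    # fruit vs. adjective
        "ras": ["juice", "essence"],
    }

    def cooccurrence(terms, documents):
        """Number of target-language documents containing all the terms."""
        return sum(all(t in doc for t in terms) for doc in documents)

    def translate_query(query_terms, documents):
        """Pick the combination of translations that co-occurs most often."""
        # Unknown terms pass through untranslated.
        candidates = [DICTIONARY.get(t, [t]) for t in query_terms]
        return max(product(*candidates),
                   key=lambda combo: cooccurrence(combo, documents))

    corpus = [                          # toy English documents as term sets
        {"mango", "juice", "sweet"},
        {"common", "essence", "people"},
        {"mango", "juice", "summer"},
    ]
    print(translate_query(["aam", "ras"], corpus))   # ('mango', 'juice')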
Stemming is also easier for languages that are morphologically rich:
stemming means reducing words like 'sleeping' and 'slept' to the root word
'sleep'. So if we consider languages like Hindi, we could exploit the
grammatical structure of the language for this method.
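
As a small illustration of the stemming step (using NLTK, which is only my
assumption for the toolkit): a rule-based stemmer handles regular suffixes,
while irregular forms like 'slept' need a dictionary-backed lemmatizer; an
analogous rule set could be written for Hindi by exploiting its grammar.

    # Requires: pip install nltk, then nltk.download('wordnet')
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("sleeping"))                # sleep  (regular suffix stripped)
    print(stemmer.stem("slept"))                   # slept  (suffix rules alone fail)
    print(lemmatizer.lemmatize("slept", pos="v"))  # sleep  (resolved via WordNet)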

I would also like to research the other methods that are currently in use.
For the second method described above, I'll basically be working to
implement these improvement techniques for the Indic languages. A basic
problem is that bilingual dictionaries between Indian languages and English
are not very well populated; in such cases these improvement techniques
become very significant.


I would request the mentors to kindly answer my queries. Also, if I am
expected to give my entire plan of implementation in the proposal, kindly
guide me on which of the above two implementations I should use.

Regards,
Ravi Kumar Singh
Undergraduate, 3rd Year
B. Tech, Computer Science and Engineering,
IIIT-Hyderabad
+91-8688566310

