[Project-ideas] Reg- Inquiring More Details for the Project [Add language grammar rules to a machine translation system]

Sankarshan Mukhopadhyay sankarshan.mukhopadhyay at gmail.com
Mon Apr 22 21:13:47 PDT 2013


On Mon, Apr 22, 2013 at 7:03 PM, piyush arora <piyusharora07 at gmail.com> wrote:

> I looked into two main aspects: 1) learning different models using Moses, and
> 2) data sources for the English-Bengali language pair.
>
> Models of interest to us:
> 1) Phrase-based approach: relies on parallel corpora, plus a monolingual
> corpus for building the language model. Dictionaries can also be incorporated
> to enrich the information and reduce missing or erroneous translations.
> 2) Factored approach (http://www.statmt.org/moses/?n=Moses.FactoredModels):
> relies on parallel data annotated with additional information such as part of
> speech, morphology and other information.
>
> From the papers I read, the factored approach seems to perform better than
> the phrase-based approach. It can be interpreted as a phrase-based model that
> incorporates linguistic information and cues.
>
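
For illustration only, here is a minimal sketch of what "parallel data with
more enhanced information" looks like in practice. Moses factored input
represents each token as its factors joined by "|", with tokens separated by
spaces. The function name, the factor order (surface, lemma, part of speech)
and the example tags are assumptions made for this sketch; they are not taken
from any existing English-Bengali pipeline.

```python
def to_factored_line(tokens):
    """Join per-token factor tuples into one line of factored text.

    `tokens` is a list of tuples such as (surface, lemma, pos).
    The factor order is an assumption of this sketch.
    """
    parts = []
    for factors in tokens:
        for factor in factors:
            # "|" and whitespace are delimiters in this format, so a factor
            # containing them (or an empty factor) would corrupt the line.
            if not factor or "|" in factor or any(c.isspace() for c in factor):
                raise ValueError(f"invalid factor: {factor!r}")
        parts.append("|".join(factors))
    return " ".join(parts)


# Hypothetical annotated sentence: (surface, lemma, part of speech)
sentence = [("houses", "house", "NNS"), ("are", "be", "VBP"), ("big", "big", "JJ")]
print(to_factored_line(sentence))
```

The annotations themselves (lemma, part of speech, morphology) would have to
come from separate taggers for each language, which is where most of the
effort for a Bengali factored model would likely go.
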
> Data sources:
> The main 2-3 data sources that have been used are as follows:
> 1) EMILLE corpus (http://www.elda.org/catalogue/en/text/W0037.html), which
> has about 18k Bengali-English sentences

The EMILLE license is somewhat restrictive. I am not a lawyer, but
using the EMILLE corpus to train a system may make the resulting
application not useful.

> 2) Joshua Corpus
>
> http://joshua-decoder.org

Have you tried reaching out to ISI Kolkata to ask whether they have data sources?

> 3) Learning information from a Wikipedia dump, which has about 25k articles
>
>>Golam did a stellar job with Anubadok. However, it does have a current
>>limitation in being unable to be turned on to a data source of scale
>>in order to have inorganic generation of content in Bengali
>
> Can you provide more information about this? Is it more of a rule-based /
> syntax model?

Anubadok had an effort to turn it into an Apertium project (probably
during an earlier iteration of GSoC). You should get in touch with
Golam and get more detail.

/sankarshan


