[Project-ideas] Reg- Inquiring More Details for the Project [Add language grammar rules to a machine translation system]

Mon Apr 22 06:33:06 PDT 2013

Hi Sankarshan,

I looked into two main aspects: 1) learning different models using Moses, and
2) data sources for the English-Bengali language pair.

Models of interest:
1) Phrase-based approach: relies on a parallel corpus for the translation
model and a monolingual corpus for the language model. Dictionaries can also
be incorporated to enrich the model and reduce missing or erroneous
translations.
2) Factored approach (http://www.statmt.org/moses/?n=Moses.FactoredModels):
relies on parallel data annotated with additional information such as part
of speech, morphology and other features.
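
To make the factored input concrete: Moses factored models read each token
as a set of factors joined by "|". Below is a minimal Python sketch, assuming
we already have lemma and part-of-speech annotations from some tagger. The
function name to_factored and the sample annotations are hypothetical, only
for illustration.

```python
def to_factored(tokens):
    """Render annotated tokens in the 'factor|factor|...' style used by
    Moses factored models: factors joined by '|', tokens by spaces."""
    rendered = []
    for factors in tokens:
        for factor in factors:
            # '|' and whitespace are separators, so they cannot appear
            # inside a factor value.
            if not factor or "|" in factor or any(c.isspace() for c in factor):
                raise ValueError(f"invalid factor value: {factor!r}")
        rendered.append("|".join(factors))
    return " ".join(rendered)


# Hypothetical annotation: (surface form, lemma, part-of-speech tag)
sentence = [
    ("the", "the", "DT"),
    ("houses", "house", "NNS"),
    ("are", "be", "VBP"),
    ("old", "old", "JJ"),
]

print(to_factored(sentence))
```

The same conversion would be applied to both sides of the parallel corpus,
with whatever factors are actually available for each language.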

I read some papers, and it seems the factored approach performs better than
the phrase-based approach. It can be interpreted as a phrase-based model that
incorporates linguistic information and cues.

Data sources:
The 2-3 main data sources that have been used are as follows:
1) EMILLE corpus (http://www.elda.org/catalogue/en/text/W0037.html): it has
about 18k Bengali-English sentences
2) Joshua corpus (http://joshua-decoder.org)
3) Learning information from the Wikipedia dump, which has about 25k articles
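
Whichever source we pick, the training data has to be sentence-aligned: one
sentence per line, with line i of the English side matching line i of the
Bengali side. Here is a rough Python sketch of a sanity check we could run
before training. The function name clean_parallel, the max_len default and
the sample sentences are my own assumptions for illustration, not part of
any of the corpora above.

```python
def clean_parallel(src_lines, tgt_lines, max_len=80):
    """Return the (source, target) pairs that are usable for training.

    Raises ValueError if the two sides differ in line count, since that
    means the files are not sentence-aligned. Pairs with an empty side or
    with more than max_len whitespace-separated tokens are dropped.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty, so there is nothing to align
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # very long sentences tend to align poorly
        kept.append((src, tgt))
    return kept


# Illustrative sample only; the second pair has an empty English side.
english = ["I eat rice .", "", "He goes home ."]
bengali = ["আমি ভাত খাই ।", "খালি", "সে বাড়ি যায় ।"]

pairs = clean_parallel(english, bengali)
print(len(pairs), "of", len(english), "pairs kept")
```

On the sample above this keeps 2 of the 3 pairs. In practice the two lists
would come from reading the two corpus files line by line.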

>Golam did a stellar job with Anubadok. However, it does have a current
>limitation in being unable to be turned on to a data source of scale
>in order to have inorganic generation of content in Bengali

Can you provide more information about this? Is it more of a rule-based /
syntax-based model?

>We are aiming to create a reasonably robust MT system that we can
>deploy and point to a content source of significant volume and obtain
>translated content (in Bengali, primarily) which can be curated and
>the MT system can continue to learn from the curation/editing. In
>short, a sentient continuous MT system.
The idea seems good. I guess we need to chart out and fix a few things, which
will help us plan and design accordingly.

Regards
Piyush Arora

On Fri, Apr 19, 2013 at 4:33 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadhyay at gmail.com> wrote:

> On Thu, Apr 18, 2013 at 4:26 PM, piyush arora <piyusharora07 at gmail.com>
> wrote:
> > Hi Sanskarshan,
>
> 'Sankarshan'
>
> > Sure, not a problem, we can discuss more during the weekend. Sampark is a
> > government-funded project and, as far as I know, the code for the
> > implementation is not available. We can look into the details.
>
> Alright. The immediate issue this brings forth is that you'd have to
> bring in a "clean room" implementation. Ideas, models shouldn't
> overlap with the Sampark system and nor should they be strikingly
> similar.
>
> > We can start by looking at how Moses performs, do the error analysis, and
> > make improvements over it using the necessary methods. What data are we
> > using for learning? Can you provide more details about the corpus that we
> > have, in terms of number of sentences?
>
> An aspect of the project idea is that the student would propose
> appropriate data sources. The corpus chosen need not be on a giant
> scale, but it should be promising enough to be expanded. I envisage
> the system to be deployed and actively learning (as opposed to a toy
> project in a show-case).
>
> > I was also thinking it would be a good idea to first do some groundwork,
> > researching other English-Bengali systems and using the knowledge from them.
> >
> > Two important systems which I found are as follows:
> > 1) http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php: this is a
> > government project built on a hybrid mechanism, a kind of pipeline
> > architecture. We can discuss the details as needed; I know the
> > architecture and other detailed information about it.
>
> If the TDIL system is not freely licensed, or under an appropriate
> libre license, we may not want to spend time on it.
>
> > 2) Anubadok (http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl):
> > it seems this is an open source project, and it uses some of the resources
> > built by the Ankur organization, such as the English-Bengali dictionary
> > (http://www.bengalinux.org/cgi-bin/abhidhan/statistics.pl). If you have
> > some more details about it, that would be great. I downloaded the
> > Anubadok system and am trying to get some hands-on experience with it and
> > look into the source code.
>
> Golam did a stellar job with Anubadok. However, it does have a current
> limitation in being unable to be turned on to a data source of scale
> in order to have inorganic generation of content in Bengali
>
> > Apart from this there is also an Apertium project
> > (http://wiki.apertium.org/wiki/Apertium-bn-en) for the English-Bengali
> > language pair, which has some of the tools and resources available.
>
> Apertium has promise. During the previous year of GSoC we had a
> proposal around Apertium and extending/enhancing it.
>
> > I have a few queries.
> > What are we aiming for with this project? As far as I can see, there can
> > be 3 different aspects:
>
> We are aiming to create a reasonably robust MT system that we can
> deploy and point to a content source of significant volume and obtain
> translated content (in Bengali, primarily) which can be curated and
> the MT system can continue to learn from the curation/editing. In
> short, a sentient continuous MT system.
>
> > 1) We begin from scratch, use statistical MT, and see how it works for
> > the English-Bengali language pair; on top of this statistical approach we
> > use other knowledge to learn rules and build a translation model / prototype.
>
> Works well for a long-haul project, but not for the duration of GSoC.
>
> > 2) Search for other openly available models and resources, such as a
> > chunker and a POS tagger, and build an MT system by combining them.
>
> An option that can be investigated.
>
> > 3) Take one of the existing systems and improve on it using statistical
> > approaches.
>
> At this stage, probably the option we need to assess quickly and first.
>
>
> --
> sankarshan mukhopadhyay
> <https://twitter.com/#!/sankarshan>
>