[Project-ideas] Reg- Inquiring More Details for the Project [Add language grammar rules to a machine translation system]

Fri Apr 19 04:03:07 PDT 2013

On Thu, Apr 18, 2013 at 4:26 PM, piyush arora <piyusharora07 at gmail.com> wrote:
> Hi Sanskarshan,

'Sankarshan'

> Sure, not a problem; we can discuss more during the weekend. Sampark is a
> government-funded project and, as far as I know, the code for the
> implementation is not available. We can look into the details.

Alright. The immediate issue this brings forth is that you'd have to
bring in a "clean room" implementation. Ideas and models shouldn't
overlap with the Sampark system, nor should they be strikingly
similar.

> We can start by looking at how Moses performs, do the error analysis, and
> make improvements on it using the necessary methods. What data are we
> using for learning? Can you provide more details about the corpus that we
> have, in terms of number of sentences?

An aspect of the project idea is that the student would propose
appropriate data sources. The corpus chosen need not be on a giant
scale, but it should be promising enough to be expanded. I envisage
the system to be deployed and actively learning (as opposed to a toy
project in a show-case).

>  I was also thinking it would be a good idea to first do a ground research
> about other English-Bengali systems and use the knowledge from same.
>
> Two important systems which I found are as follows-
> 1) http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php This is a
> government project based on a hybrid mechanism with a pipeline-style
> architecture. We can discuss the details as needed; I know the
> architecture and other detailed information about it.

If the TDIL system is not freely licensed, or not under an appropriate
libre license, we may not want to spend time on it.

> 2) Anubadok (http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl):
> this seems to be an open source project, and it uses some of the resources
> built by the Ankur organization, such as the English-Bengali dictionary
> (http://www.bengalinux.org/cgi-bin/abhidhan/statistics.pl). If you have
> more details about it, that would be great. I downloaded the
> Anubadok system and am trying to get some hands-on experience with it and
> look into the source code.

Golam did a stellar job with Anubadok. However, it currently has a
limitation: it cannot be pointed at a data source of scale in order to
generate content in Bengali inorganically.

> Apart from this, there is also an Apertium project
> (http://wiki.apertium.org/wiki/Apertium-bn-en) for the English-Bengali
> language pair, which has some tools and resources available.

Apertium has promise. During the previous year of GSoC we had a
proposal around Apertium and extending/enhancing it.

> I have a few queries.
> What are we aiming for with this project? As far as I can see, there can be
> 3 different aspects:

We are aiming to create a reasonably robust MT system that we can
deploy and point to a content source of significant volume to obtain
translated content (in Bengali, primarily). That content can be curated,
and the MT system can continue to learn from the curation/editing. In
short, a sentient continuous MT system.

> 1) We begin from scratch, use statistical MT, and see how it works for the
> English-Bengali language pair; then, on top of this statistical approach,
> use other knowledge to learn rules and build a translation model / prototype.

Works well for a long-haul project, but not for the duration of GSoC.

> 2) Search for other openly available models and resources, such as a
> chunker and a POS tagger, and build an MT system by combining the
> available resources.

An option that can be investigated.

> 3) Take some of the existing systems and improve on them using statistical
> approaches.

At this stage, this is probably the option we need to assess first, and quickly.

--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>