Hi Sankarshan,<br><br>I looked into two main aspects: 1) learning different models using Moses and 2) data sources for the English-Bengali language pair.<div><br></div><div>Models of interest-</div><div>1) Phrase-based approach: relies on parallel corpora for the translation model and a monolingual corpus for the language model. Dictionaries can also be incorporated to enrich the model and reduce missing or erroneous translations.</div>


<div>2) Factored approach (<a href="http://www.statmt.org/moses/?n=Moses.FactoredModels">http://www.statmt.org/moses/?n=Moses.FactoredModels</a>): relies on parallel data annotated with additional information such as part of speech, morphology, and other linguistic features.</div>


<div><br></div><div>I read some papers, and it seems the factored approach performs better than the phrase-based approach. It can be interpreted as a phrase-based model that incorporates linguistic information and cues.</div><div>


<br></div><div>Data sources-</div><div>The main 2-3 data sources that have been used are as follows-</div><div>1) EMILLE corpus (<a href="http://www.elda.org/catalogue/en/text/W0037.html">http://www.elda.org/catalogue/en/text/W0037.html</a>): it has about 18k Bengali-English sentences</div>


<div>2) Joshua Corpus <a href="http://joshua-decoder.org/">http://joshua-decoder.org</a></div>
<div>3) Learning information from a Wikipedia dump, which has about 25k articles</div><div><br></div><div><div>>Golam did a stellar job with Anubadok. However, it does have a current</div><div>>limitation in being unable to be turned on to a data source of scale<br>


>in order to have inorganic generation of content in Bengali</div><div><br></div><div>Can you provide more information about this? Is it more of a rule-based / syntax-based model?</div><div><br></div><div>>We are aiming to create a reasonably robust MT system that we can<br>


>deploy and point to a content source of significant volume and obtain<br>>translated content (in Bengali, primarily) which can be curated and<br>>the MT system can continue to learn from the curation/editing. In<br>


>short, a sentient continuous MT system.<br><div class="gmail_quote">The idea seems good. I guess we need to chart out and pin down a few things, which will help us plan and design accordingly.</div><div class="gmail_quote"><br>


</div><div class="gmail_quote">Regards</div><div class="gmail_quote">Piyush Arora</div><div class="gmail_quote"><br></div><div class="gmail_quote"><br></div><div class="gmail_quote">On Fri, Apr 19, 2013 at 4:33 PM, Sankarshan Mukhopadhyay <span dir="ltr"><<a href="mailto:sankarshan.mukhopadhyay@gmail.com" target="_blank">sankarshan.mukhopadhyay@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, Apr 18, 2013 at 4:26 PM, piyush arora <<a href="mailto:piyusharora07@gmail.com">piyusharora07@gmail.com</a>> wrote:<br>


> Hi Sanskarshan,<br>

<br>

'Sankarshan'<br>

<div class="im"><br>

> Sure not a problem, we can discuss more during the weekend. Sampark is a<br>

> government funded project and the code for the implementation is not<br>

> available as far as I know; we can look into the details of this.<br>

<br>

</div>Alright. The immediate issue this brings forth is that you'd have to<br>

bring in a "clean room" implementation. Ideas, models shouldn't<br>

overlap with the Sampark system and nor should they be strikingly<br>

similar.<br>

<div class="im"><br>

> We can start by looking at how Moses performs, do the error analysis, and<br>

> make improvements over it using the necessary methods. What data are we<br>

> using for learning? Can you provide more details about the corpus that we<br>

> have, in terms of the number of sentences?<br>

<br>

</div>An aspect of the project idea is that the student would propose<br>

appropriate data sources. The corpus chosen need not be on a giant<br>

scale, but it should be promising enough to be expanded. I envisage<br>

the system to be deployed and actively learning (as opposed to a toy<br>

project in a show-case).<br>

<div class="im"><br>

>  I was also thinking it would be a good idea to first do some background research<br>

> on other English-Bengali systems and use what we learn from them.<br>

><br>

> Two important systems which I found are as follows-<br>

> 1) <a href="http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php" target="_blank">http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php</a> this is a<br>

> government project and it's more of a hybrid mechanism, a kind of pipeline<br>

> architecture. We can discuss the details as needed; I know the<br>

> architecture and other detailed information about it.<br>

<br>

</div>If the TDIL system is not freely licensed, or not under an appropriate<br>

libre license, we may not want to spend time on it.<br>

<div class="im"><br>

> 2) Anubadok (<a href="http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl" target="_blank">http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl</a>): it<br>

> seems this is an open source project, and it uses some of the resources<br>

> built by the Ankur organization, such as the English-Bengali dictionary<br>

> (<a href="http://www.bengalinux.org/cgi-bin/abhidhan/statistics.pl" target="_blank">http://www.bengalinux.org/cgi-bin/abhidhan/statistics.pl</a>), so if you have<br>

> some more details about it, that would be great. I downloaded the<br>

> Anubadok system and am trying to get some hands-on experience with it and<br>

> look into the source code.<br>

<br>

</div>Golam did a stellar job with Anubadok. However, it does have a current<br>

limitation in being unable to be turned on to a data source of scale<br>

in order to have inorganic generation of content in Bengali<br>

<div class="im"><br>

> Apart from this, there is also an Apertium project<br>

> (<a href="http://wiki.apertium.org/wiki/Apertium-bn-en" target="_blank">http://wiki.apertium.org/wiki/Apertium-bn-en</a>) for English-Bengali language<br>

> pair which has some of the tools and resources available.<br>

<br>

</div>Apertium has promise. During the previous year of GSoC we had a<br>

proposal around Apertium and extending/enhancing it.<br>

<div class="im"><br>

> I have a few queries.<br>

> What are we aiming for with this project? As far as I can see, there can be 3 different<br>

> aspects-<br>

<br>

</div>We are aiming to create a reasonably robust MT system that we can<br>

deploy and point to a content source of significant volume and obtain<br>

translated content (in Bengali, primarily) which can be curated and<br>

the MT system can continue to learn from the curation/editing. In<br>

short, a sentient continuous MT system.<br>

<div class="im"><br>

> 1) We begin from scratch, use statistical MT and see how it works<br>

> for the English-Bengali language pair, and on top of this statistical approach use<br>

> other knowledge to learn rules and build a translation model / prototype.<br>

<br>

</div>Works good for a long haul project, but not for the duration of the GSoC<br>

<div class="im"><br>

> 2) Survey the other openly available models and resources, such as a<br>

> chunker and POS tagger, and build an MT system by combining the<br>

> available resources.<br>

<br>

</div>An option that can be investigated.<br>

<div class="im"><br>

> 3) Take some of the existing systems and improve on them using statistical<br>

> approaches.<br>

<br>

</div>At this stage, probably the option we need to assess quickly and first.<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

--<br>

sankarshan mukhopadhyay<br>

<<a href="https://twitter.com/#!/sankarshan" target="_blank">https://twitter.com/#!/sankarshan</a>><br>

</div></div></blockquote></div><br></div></div>