DORAS | DCU Research Repository



The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content

Zadeh Kaljahi, Rasoul Samad (2015) The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content. PhD thesis, Dublin City University.

Abstract
The availability of the Internet has led to a steady increase in the volume of online user-generated content, the majority of which is in English. Machine-translating this content to other languages can help disseminate the information contained in it to a broader audience. However, reliably publishing these translations requires a prior estimate of their quality. This thesis is concerned with the statistical machine translation of Symantec's Norton forum content, focusing in particular on its quality estimation (QE) using syntactic and semantic information. We compare the output of phrase-based and syntax-based English-to-French and English-to-German machine translation (MT) systems automatically and manually, and find that the syntax-based methods do not necessarily handle grammar-related phenomena in translation better than the phrase-based methods. Although these systems generate sufficiently different outputs, the apparent lack of a systematic difference between these outputs impedes its utilisation in a combination framework. To investigate the role of syntax and semantics in quality estimation of machine translation, we create SymForum, a data set containing French machine translations of English sentences from Norton forum content, their post-edits and their adequacy and fluency scores. We use syntax in quality estimation via tree kernels, hand-crafted features and their combination, and find it useful both alone and in combination with surface-driven features. Our analyses show that neither the accuracy of the syntactic parses used by these systems nor the parsing quality of the MT output affect QE performance. We also find that adding more structure to French Treebank parse trees can be useful for syntax-based QE. We use semantic role labelling (SRL) for our semantic-based QE experiments.
We experiment with the limited resources that are available for French and find that a small manually annotated training set is substantially more useful than a much larger artificially created set. We use SRL in quality estimation using tree kernels, hand-crafted features and their combination. Additionally, we introduce PAM, a QE metric based on the predicate-argument structure match between source and target. We find that the SRL quality, especially on the target side, is the major factor negatively affecting the performance of the semantic-based QE. Finally, we annotate English and French Norton forum sentences with their phrase structure syntax using an annotation strategy adapted for user-generated text. We find that user errors occur in only a small fraction of the data, but their correction does improve parsing performance. These treebanks (Foreebank) prove to be useful as supplementary training data in adapting the parsers to the forum text. The improved parses ultimately increase the performance of the semantic-based QE. However, a reliable semantic-based QE system requires further improvements in the quality of the underlying semantic role labelling.
Metadata
Item Type: Thesis (PhD)
Date of Award: November 2015
Refereed: No
Supervisor(s): Foster, Jennifer and Roturier, Johann
Subjects: Computer Science > Computational linguistics; Computer Science > Machine translating; Computer Science > Machine learning
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > National Centre for Language Technology (NCLT); Research Initiatives and Centres > ADAPT
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Irish Research Council for Science, Engineering and Technology
ID Code: 20499
Deposited On: 25 Nov 2015 14:23 by Jennifer Foster. Last Modified: 25 Oct 2018 09:23