DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Domain adaptation for statistical machine translation of corporate and user-generated content

Banerjee, Pratyush (2013) Domain adaptation for statistical machine translation of corporate and user-generated content. PhD thesis, Dublin City University.

Abstract
The growing popularity of Statistical Machine Translation (SMT) techniques in recent years has led to the development of multiple domain-specific resources and adaptation scenarios. In this thesis we address two important and industrially relevant adaptation scenarios, each suited to different kinds of content. Initially focussing on professionally edited 'enterprise-quality' corporate content, we address a specific scenario of data translation from a mixture of different domains where, for each of them, domain-specific data is available. We utilise an automatic classifier to combine multiple domain-specific models and empirically show that such a configuration results in better translation quality compared to both traditional and state-of-the-art techniques for handling mixed-domain translation. In the second phase of our research we shift our focus to the translation of possibly 'noisy' user-generated content in web forums created around products and services of a multinational company. Using professionally edited translation memory (TM) data for training, we use different normalisation and data selection techniques to adapt SMT models to noisy forum content. In this scenario, we also study the effect of mixture adaptation using a combination of in-domain and out-of-domain data at different component levels of an SMT system. Finally, we focus on the task of optimal supplementary training data selection from out-of-domain corpora, using a novel incremental model merging mechanism to adapt TM-based models to improve forum-content translation quality.
Metadata
Item Type:Thesis (PhD)
Date of Award:March 2013
Refereed:No
Supervisor(s):Way, Andy; van Genabith, Josef; Roturier, Johann
Uncontrolled Keywords:Statistical Machine Translation; SMT
Subjects:Computer Science > Computational linguistics
Computer Science > Machine translating
Computer Science > Machine learning
DCU Faculties and Centres:DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
ID Code:17722
Deposited On:03 Apr 2013 12:44 by Jennifer Foster. Last Modified 03 Apr 2013 12:44
Documents

Full text available as:

PratsPhDThesis-FinalCorrected-14thJan.pdf (PDF, 2MB)