DORAS | DCU Research Repository

Improving transductive data selection algorithms for machine translation

Poncelas, Alberto (ORCID: 0000-0002-5089-1687) (2019) Improving transductive data selection algorithms for machine translation. PhD thesis, Dublin City University.

Abstract
In this work, we study ways of improving Machine Translation (MT) models by training on the subset of the training data that is most relevant to the test set. This is achieved by using Transductive Algorithms (TA) for data selection. In particular, we explore two methods: Infrequent N-gram Recovery (INR) and Feature Decay Algorithms (FDA). Statistical Machine Translation (SMT) models do not always perform better when more data are used for training. Using these techniques to extract the training sentences leads to better performance when translating a particular test set than using the complete training dataset. Neural Machine Translation (NMT) models can outperform SMT models, but they require more data to achieve their best performance. In this thesis, we explore how INR and FDA can also help improve NMT models using just a fraction of the available data. In addition, we propose several improvements to these data-selection methods that exploit information on the target side. First, we use the alignment between words on the source and target sides to modify the selection criteria of these methods: sentences containing n-grams that are more difficult to translate are promoted so that more occurrences of these n-grams are selected. A second extension selects sentences based not on the test set but on an MT-generated approximate translation of it, so that the target side of the sentences is considered in the selection criteria. Finally, target-language sentences can be translated into the source language so that INR and FDA have more candidates to select sentences from.
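As an illustration of the general idea behind transductive data selection with decaying features, the following is a minimal sketch and not the formulation used in the thesis. It assumes that features are the unigrams and bigrams of the test set, that a feature's value is multiplied by a fixed factor each time it is selected, and that scores are normalised by sentence length. The names `ngrams` and `select_sentences` and the `decay=0.5` default are hypothetical choices made for this example.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return the set of all n-grams (as tuples) up to length max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_sentences(candidates, test_set, k, decay=0.5, max_n=2):
    """Greedily pick up to k candidate sentences relevant to test_set.

    Assumption for this sketch: each test-set n-gram is a feature whose
    value is decay ** count, where count is the number of already
    selected sentences that contain it.
    """
    features = set()
    for sentence in test_set:
        features |= ngrams(sentence.split(), max_n)

    selected_counts = Counter()
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(sentence):
            tokens = sentence.split()
            shared = ngrams(tokens, max_n) & features
            total = sum(decay ** selected_counts[f] for f in shared)
            # Length normalisation (an assumption of this sketch).
            return total / max(len(tokens), 1)

        best = max(remaining, key=score)
        if score(best) == 0:
            break  # no remaining candidate shares an n-gram with the test set
        remaining.remove(best)
        selected.append(best)
        # Decay the features that the chosen sentence covers.
        selected_counts.update(ngrams(best.split(), max_n) & features)
    return selected


candidates = [
    "the cat sat",
    "the cat slept on the mat",
    "a dog barked",
    "stock markets fell",
]
print(select_sentences(candidates, ["the cat barked"], k=2))
```

In this toy run, once "the cat sat" has been chosen, the features it covers lose value, so the next pick favours a sentence that covers a still-unseen test-set n-gram ("barked") over a second sentence about "the cat".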
Metadata
Item Type: Thesis (PhD)
Date of Award: November 2019
Refereed: No
Supervisor(s): Way, Andy and Maillette de Buy Wenniger, Gideon
Subjects: Computer Science > Computational linguistics; Computer Science > Machine learning; Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Science Foundation Ireland, Research Centres Programme (Grant 13/RC/2106)
ID Code: 23726
Deposited On: 19 Nov 2019 12:55 by Andrew Way. Last Modified: 22 Jan 2021 14:18
Documents

Full text available as: thesis_AlbertoPoncelas.pdf (PDF, 1MB)