Way, Andy ORCID: 0000-0001-5736-5930, Poncelas, Alberto ORCID: 0000-0002-5089-1687 and Maillette de Buy Wenniger, Gideon (2018) Data selection with feature decay algorithms using an approximated target side. In: 15th International Workshop on Spoken Language Translation (IWSLT 2018), 29-30 Apr 2018, Bruges, Belgium.
Abstract
Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data.

One family of data selection techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions.

To address this problem, we propose to use an approximated target side in addition to the source side when selecting suitable sentence pairs for training a model. This approximated target side is built by pre-translating the source side.

In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA).

We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing Research Initiatives and Centres > ADAPT |
Published in: | Turchi, Marco, Niehues, Jan and Federico, Marcello (eds.) Proceedings of the 15th International Workshop on Spoken Language Translation. IWSLT. |
Publisher: | IWSLT |
Official URL: | https://workshop2018.iwslt.org/downloads/Proceedin... |
Copyright Information: | © 2018 The authors |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund; European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 713567 |
ID Code: | 23879 |
Deposited On: | 25 Oct 2019 10:20 by Andrew Way. Last Modified 25 Oct 2019 10:20 |