Popović, Maja ORCID: 0000-0001-8234-8745 and Poncelas, Alberto ORCID: 0000-0002-5089-1687 (2020) Neural machine translation between similar south-Slavic languages. In: 2020 Fifth Conference on Machine Translation (WMT20), 19-20 Nov 2020, Dominican Republic (Online).
Abstract
This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on Similar Language Translation.
We explored several set-ups for NMT for Croatian–Slovenian and Serbian–Slovenian language pairs in both translation directions. Our experiments focus on different amounts and types of training data: we first apply basic filtering on the OpenSubtitles training corpora, then we perform additional cleaning of remaining misaligned segments based on character n-gram matching.
Finally, we make use of additional monolingual data by creating synthetic parallel data through back-translation. Automatic evaluation shows that multilingual systems trained on joint Serbian and Croatian data outperform bilingual ones, and that character-based cleaning improves scores while using less data.
The results also confirm once more that adding back-translated data further improves performance, especially when the synthetic data is similar to the desired domain of the development and test sets. This, however, might come at the price of prolonged training time, especially for multitarget systems.
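The character n-gram cleaning step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that, because the languages are closely related, a correctly aligned source/target pair shares many character n-grams, so pairs with low overlap are likely misaligned. The abstract does not give the exact scoring formula or cut-off used by the authors; the function names, the averaged F1 score and the `threshold` value below are hypothetical choices for illustration only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of a lower-cased, whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_f1(src, tgt, max_n=3):
    """Average F1 overlap of character n-grams for n = 1..max_n.

    Illustrative score only; not necessarily the metric used in the paper.
    """
    scores = []
    for n in range(1, max_n + 1):
        a, b = char_ngrams(src, n), char_ngrams(tgt, n)
        if not a or not b:
            continue  # segment shorter than n characters
        overlap = sum((a & b).values())  # multiset intersection
        precision = overlap / sum(b.values())
        recall = overlap / sum(a.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def clean_corpus(pairs, threshold=0.3):
    """Keep only pairs whose overlap score reaches the (hypothetical) threshold."""
    return [(s, t) for s, t in pairs if ngram_f1(s, t) >= threshold]
```

In practice a threshold of this kind would have to be tuned on held-out data, since even correct translations between similar languages differ in many words.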
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Computational linguistics; Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT |
Published in: | Fifth Conference on Machine Translation (at EMNLP-2020). Association for Computational Linguistics (ACL). |
Publisher: | Association for Computational Linguistics (ACL) |
Official URL: | https://www.aclweb.org/anthology/2020.wmt-1.51 |
Copyright Information: | © 2020 The Authors. CC-BY-4.0 |
Funders: | Science Foundation Ireland through the SFI Research Centres Programme 13/RC/2106, European Regional Development Fund (ERDF) |
ID Code: | 25080 |
Deposited On: | 18 Nov 2020 17:01 by Alberto Poncelas. Last Modified 25 Jun 2021 13:02 |
Documents
Full text available as: PDF (197kB)