Lohar, Pintu ORCID: 0000-0002-5328-1585, Popović, Maja ORCID: 0000-0001-8234-8745, Alfi, Haithem ORCID: 0000-0002-7449-4707 and Way, Andy ORCID: 0000-0001-5736-5930 (2019) A systematic comparison between SMT and NMT on translating user-generated content. In: 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019), 7 - 13 Apr 2019, La Rochelle, France.
Abstract
Twitter has become an immensely popular platform where users can share information within a character limit (280 characters), which encourages them to post short and informal messages (tweets). In general, machine translation (MT) of tweets is a challenging task. However, for translating German tweets about football into English, it has been shown that moderate translation performance in terms of BLEU score can be achieved using phrase-based translation engines built on a tiny parallel Twitter data set [1]. In this work, we propose to further increase translation quality by using neural machine translation models and applying the following strategies: (i) we back-translate a set of out-of-domain English tweets from the "Harvard data set", released in 2017, into German and add the resulting synthetic parallel data to the tiny parallel data used in [1]; (ii) as tweets are generally short, we extract short text pairs from the large news-commentary parallel data and add them to the tiny Twitter parallel data set, thereby restricting the length of the out-of-genre text segments. We build both phrase-based and neural MT systems (PBMT and NMT) using the above data combinations in order to perform a systematic comparison between the two approaches on translating tweets. Our experimental results reveal that the NMT system performs significantly worse than the PBMT system when only the tiny Twitter data set is used for MT training. In contrast, when additional data is used for training, the NMT system improves substantially and produces BLEU scores very similar to those of the PBMT system, even with only a few hundred thousand additional synthetic parallel segments.
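The two data-augmentation strategies in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the whitespace tokenisation, and the `max_tokens` threshold are assumptions made for illustration (the abstract does not state a length cut-off), and `en_to_de` stands in for whatever English-to-German system is used for back-translation.

```python
from typing import Callable, Iterable, List, Tuple

# A parallel segment: (German source, English target).
Pair = Tuple[str, str]


def back_translate(english_tweets: Iterable[str],
                   en_to_de: Callable[[str], str]) -> List[Pair]:
    """Strategy (i): build synthetic German-English pairs.

    `en_to_de` is a hypothetical English-to-German translation function.
    The machine-translated German becomes the source side and the original
    English tweet stays on the target side.
    """
    return [(en_to_de(en), en) for en in english_tweets]


def extract_short_pairs(pairs: Iterable[Pair],
                        max_tokens: int = 20) -> List[Pair]:
    """Strategy (ii): keep only short pairs from an out-of-genre corpus.

    A pair is kept when both sides are non-empty and have at most
    `max_tokens` whitespace-separated tokens (threshold is an assumption).
    """
    short = []
    for src, tgt in pairs:
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if 0 < src_len <= max_tokens and 0 < tgt_len <= max_tokens:
            short.append((src, tgt))
    return short


def build_training_set(in_domain: Iterable[Pair],
                       additional: Iterable[Pair]) -> List[Pair]:
    """Concatenate the tiny in-domain set with the additional data."""
    return list(in_domain) + list(additional)
```

Either function's output can be passed as `additional` to `build_training_set`, which mirrors the data combinations on which the PBMT and NMT systems are compared.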
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT |
Published in: | Proceedings of CICLing 2019, the 20th International Conference on Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science (LNCS). Springer. |
Publisher: | Springer |
Copyright Information: | © 2019 The Authors |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland through ADAPT Centre (Grant 13/RC/2106) |
ID Code: | 23869 |
Deposited On: | 21 Oct 2019 14:49 by Andrew Way. Last Modified 05 May 2023 16:31 |
Documents
Full text available as:
PDF (637kB)