Improving machine translation of educational content via crowdsourcing

Behnke, Maximiliana, Miceli Barone, Antonio Valerio, Sennrich, Rico, Sosoni, Vilelmini, Naskos, Thanasis, Takoulidou, Eirini, Stasimioti, Maria, van Zaanen, Menno, Castilho, Sheila ORCID: 0000-0002-8416-6555, Gaspari, Federico ORCID: 0000-0003-3808-8418, Georgakopoulou, Panayota ORCID: 0000-0001-9780-1813, Kordoni, Valia, Egg, Markus and Kermanidis, Katia Lida ORCID: 0000-0002-3270-5078 (2018) Improving machine translation of educational content via crowdsourcing. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan. ISBN 979-10-95546-19-1

Abstract

The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations can be of lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English into eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. Our results indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with pre-existing in-domain corpora.

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	MOOCs; neural machine translation; crowdsourcing
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing Research Initiatives and Centres > ADAPT
Published in:	McCrae, John P., Chiarcos, Christian, Declerck, Thierry, Gracia, Jorge and Klimek, Bettina (eds.) Proceedings of the 6th Workshop on Linked Data in Linguistics (LDL-2018). European Language Resources Association. ISBN 979-10-95546-19-1
Publisher:	European Language Resources Association
Official URL:	https://www.aclweb.org/anthology/L18-1528
Copyright Information:	© 2018 ELRA
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	TraMOOC project (Translation for Massive Open Online Courses) funded by the European Commission under H2020-ICT2014/H2020-ICT-2014-1 under grant agreement number 644333; grant EP/L01503X/1 for the University of Edinburgh School of Informatics Centre for Doctoral Training in Pervasive Parallelism from the UK Engineering and Physical Sciences Research Council (EPSRC).
ID Code:	23201
Deposited On:	24 Apr 2019 13:51 by Thomas Murtagh. Last Modified 20 Jan 2021 16:36

Documents

Full text available as:

PDF (233kB)

DORAS | DCU Research Repository