DORAS | DCU Research Repository

Extracting correctly aligned segments from unclean parallel data using character n-gram matching

Popović, Maja (ORCID: 0000-0001-8234-8745) and Poncelas, Alberto (ORCID: 0000-0002-5089-1687) (2020) Extracting correctly aligned segments from unclean parallel data using character n-gram matching. In: Conference on Language Technologies and Digital Humanities 2020, 24-25 Sept 2020, Ljubljana, Slovenia (Online).

Abstract
Training Neural Machine Translation systems is a time- and resource-demanding task, especially when large amounts of parallel text are used. In addition, training is sensitive to unclean parallel data. In this work, we explore a data cleaning method based on character n-gram matching. The method is particularly convenient for closely related languages, since the n-gram matching scores can be calculated directly on the source and target parts of the training corpus. For more distant languages, a translation step is needed, after which the MT output is compared with the corresponding original part. We show that the proposed method not only reduces the size of the training corpus but can also increase the system’s performance.
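
The cleaning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified chrF-style score (character n-gram precision and recall averaged over orders 1 to 6, combined into an F-score with beta = 2), and the function names and the threshold value are hypothetical choices made for illustration. The sketch covers only the closely related language case, where source and target are compared directly; for more distant languages, the source side would first have to be machine translated, and the MT output would be passed in place of the source.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_fscore(candidate, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1] (illustrative, not the official metric)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one segment is too short for this n-gram order
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


def filter_pairs(pairs, threshold=0.3):
    """Keep segment pairs whose score reaches the (hypothetical) threshold."""
    return [(src, tgt) for src, tgt in pairs if ngram_fscore(src, tgt) >= threshold]
```

Calling `filter_pairs(corpus, threshold)` on a list of `(source, target)` tuples returns the pairs judged to be correctly aligned. A suitable threshold depends on the language pair and the corpus, so it would need to be tuned rather than taken from this sketch.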
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: Research Initiatives and Centres > ADAPT
Published in: Proceedings of the Conference on Language Technologies and Digital Humanities 2020. SDJT – Slovensko društvo za jezikovne tehnologije.
Publisher: SDJT – Slovensko društvo za jezikovne tehnologije
Official URL: http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_Popovic-et-...
Copyright Information: © 2020 The Authors
Funders: Science Foundation Ireland and co-funded by the European Regional Development Fund (ERDF) through Grant 13/RC/2106, European Association for Machine Translation under its programme “2019 Sponsorship of Activities”.
ID Code: 25025
Deposited On: 18 Sep 2020 14:23 by Maja Popovic. Last Modified 08 Apr 2021 13:41
Documents

Full text available as: chrf-cleaning.pdf (PDF, 141kB)