
DORAS | DCU Research Repository

Explore open access research and scholarly works from DCU

Transformers for low-resource languages: is féidir linn!

Lankford, Séamus, Afli, Haithem (ORCID: 0000-0002-7449-4707) and Way, Andy (ORCID: 0000-0001-5736-5930) (2021) Transformers for low-resource languages: is féidir linn! In: Machine Translation Summit XVIII: Research Track, 16 - 20 Aug 2021, Virtual.

Abstract
The Transformer model is the state-of-the-art in Machine Translation. However, in general, neural translation models often underperform on language pairs with insufficient training data. As a consequence, relatively few experiments have been carried out using this architecture on low-resource language pairs. In this study, hyperparameter optimization of Transformer models in translating the low-resource English-Irish language pair is evaluated. We demonstrate that choosing appropriate parameters leads to considerable performance improvements. Most importantly, the correct choice of subword model is shown to be the biggest driver of translation performance. SentencePiece models using both unigram and BPE approaches were appraised. Variations on model architectures included modifying the number of layers, testing various regularization techniques and evaluating the optimal number of heads for attention. A generic 55k DGT corpus and an in-domain 88k public admin corpus were used for evaluation. A Transformer optimized model demonstrated a BLEU score improvement of 7.8 points when compared with a baseline RNN model. Improvements were observed across a range of metrics, including TER, indicating a substantially reduced post-editing effort for Transformer optimized models with 16k BPE subword models. Benchmarked against Google Translate, our translation engines demonstrated significant improvements. The question of whether or not Transformers can be used effectively in a low-resource setting of English-Irish translation has been addressed. Is féidir linn - yes we can.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Artificial intelligence; Computer Science > Computational linguistics; Computer Science > Machine learning; Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT
Published in: Proceedings of the 18th Biennial Machine Translation Summit. Association for Machine Translation in the Americas (AMTA).
Publisher: Association for Machine Translation in the Americas (AMTA)
Official URL: https://aclanthology.org/2021.mtsummit-research.5
Copyright Information: © 2021 AMTA
Funders: Science Foundation Ireland (SFI) Research Centres Programme (Grant 13/RC/2016), European Regional Development Fund, Munster Technological University
ID Code: 28381
Deposited On: 29 May 2023 13:49 by Seamus Lankford. Last Modified: 29 May 2023 13:49
Documents

Full text available as:

PDF: slankford-isfeidirlinn.pdf (804kB)
Creative Commons: Attribution-No Derivative Works 4.0