DORAS | DCU Research Repository

gaHealth: An English–Irish bilingual corpus of health data

Lankford, Séamus, Afli, Haithem (ORCID: 0000-0002-7449-4707), Ní Loinsigh, Orla and Way, Andy (ORCID: 0000-0001-5736-5930) (2022) gaHealth: An English–Irish bilingual corpus of health data. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 20-25 June 2022, Marseille, France.

Abstract
Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation. The benefits and development of smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English to Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top-performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource datasets. gaHealth is now freely available online and is ready to be explored for further research.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: Health data; parallel corpus; machine translation; Irish
Subjects: Computer Science > Computational linguistics
Computer Science > Machine translating
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Initiatives and Centres > ADAPT
Published in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA).
Publisher: European Language Resources Association (ELRA)
Official URL: http://www.lrec-conf.org/proceedings/lrec2022/pdf/...
Copyright Information: © 2022 European Language Resources Association (ELRA)
Funders: Science Foundation Ireland through ADAPT Centre (Grant 13/RC/2106), Munster Technological University, National Relay Station (NRS) of Ireland
ID Code: 28339
Deposited On: 18 May 2023 12:09 by Seamus Lankford. Last Modified: 19 May 2023 11:29
Documents

Full text available as:

slankford-lrec.pdf (PDF)
Creative Commons: Attribution-Noncommercial 4.0
359kB