
DORAS | DCU Research Repository

gaBERT — an Irish Language model

Barry, James (ORCID: 0000-0003-3051-585X), Wagner, Joachim (ORCID: 0000-0002-8290-3849), Cassidy, Lauren, Cowap, Alan (ORCID: 0000-0002-6300-6034), Lynn, Teresa, Walsh, Abigail, Ó Meachair, Mícheál J. (ORCID: 0000-0003-3931-5571) and Foster, Jennifer (ORCID: 0000-0002-7789-4853) (2022) gaBERT — an Irish Language model. In: Thirteenth Language Resources and Evaluation Conference, LREC 2022, 20-25 June 2022, Marseille, France.

Abstract
The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Artificial intelligence
Computer Science > Computational linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
DCU Faculties and Schools > Faculty of Humanities and Social Science > Fiontar agus Scoil na Gaeilge
Research Initiatives and Centres > ADAPT
Published in: Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association (ELRA).
Publisher: European Language Resources Association (ELRA)
Official URL: https://aclanthology.org/2022.lrec-1.511/
Copyright Information: © European Language Resources Association (ELRA)
Funders: Science Foundation Ireland (Grant 13/RC/2106), European Regional Development Fund, Irish Government Department of Culture, Heritage and the Gaeltacht, Science Foundation Ireland (SFI) Frontiers for the Future programme (19/FFP/6942), Science Foundation Ireland (SFI) Centre for Research Training in Machine Learning (18/CRT/6183)
ID Code: 28293
Deposited On: 28 Apr 2023 09:12 by Joachim Wagner. Last Modified: 28 Apr 2023 09:12
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution-Noncommercial 4.0
963kB