Barry, James ORCID: 0000-0003-3051-585X, Wagner, Joachim ORCID: 0000-0002-8290-3849, Cassidy, Lauren, Cowap, Alan ORCID: 0000-0002-6300-6034, Lynn, Teresa, Walsh, Abigail, Ó Meachair, Mícheál J. ORCID: 0000-0003-3931-5571 and Foster, Jennifer ORCID: 0000-0002-7789-4853 (2022) gaBERT — an Irish Language model. In: Thirteenth Language Resources and Evaluation Conference, LREC 2022, 20-25 June 2022, Marseille, France.
Abstract
The BERT family of neural language models has become highly popular because these models provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with those of fine-tuning an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Artificial intelligence; Computer Science > Computational linguistics |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Humanities and Social Science > Fiontar agus Scoil na Gaeilge; Research Initiatives and Centres > ADAPT |
Published in: | Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association (ELRA). |
Publisher: | European Language Resources Association (ELRA) |
Official URL: | https://aclanthology.org/2022.lrec-1.511/ |
Copyright Information: | © European Language Resources Association (ELRA) |
Funders: | Science Foundation Ireland (Grant 13/RC/2106), European Regional Development Fund, Irish Government Department of Culture, Heritage and the Gaeltacht, Science Foundation Ireland (SFI) Frontiers for the Future programme (19/FFP/6942), Science Foundation Ireland (SFI) Centre for Research Training in Machine Learning (18/CRT/6183) |
ID Code: | 28293 |
Deposited On: | 28 Apr 2023 09:12 by Joachim Wagner. Last Modified 28 Apr 2023 09:12 |
Documents
Full text available as:
PDF (963kB), Creative Commons: Attribution-Noncommercial 4.0