Passban, Peyman, Way, Andy ORCID: 0000-0001-5736-5930 and Liu, Qun ORCID: 0000-0002-7000-1792 (2015) Benchmarking SMT performance for Farsi using the TEP++ Corpus. In: 18th Annual Conference of the European Association for Machine Translation, 11-13 May 2015, Antalya, Turkey.
Abstract
Statistical machine translation (SMT) suffers from various problems which are exacerbated where training data is in short supply. In this paper we address the data sparsity problem in the Farsi (Persian) language and introduce a new parallel corpus, TEP++. Compared to previous results the new dataset is more efficient for Farsi SMT engines and yields better output. In our experiments using TEP++ as bilingual training data and BLEU as a metric, we achieved improvements of +11.17 (60%) and +7.76 (63.92%) in the Farsi–English and English–Farsi directions, respectively. Furthermore we describe an engine (SF2FF) to translate between formal and informal Farsi which in terms of syntax and terminology can be seen as different languages. The SF2FF engine also works as an intelligent normalizer for Farsi texts. To demonstrate its use, SF2FF was used to clean the IWSLT–2013 dataset to produce normalized data, which gave improvements in translation quality over FBK’s Farsi engine when used as training data.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | BLEU; SF2FF engine; FBK |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing Research Initiatives and Centres > ADAPT |
Published in: | Proceedings of the 18th Annual Conference of the European Association for Machine Translation. Association for Computational Linguistics. |
Publisher: | Association for Computational Linguistics |
Official URL: | https://www.aclweb.org/anthology/W15-4911 |
Copyright Information: | © 2015 The Authors |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University. |
ID Code: | 23218 |
Deposited On: | 01 May 2019 15:32 by Thomas Murtagh. Last Modified 01 May 2019 15:32 |