Afli, Haithem ORCID: 0000-0002-7449-4707, Qui, Zhengwei, Way, Andy ORCID: 0000-0001-5736-5930 and Sheridan, Páraic (2016) Using SMT for OCR error correction of historical texts. In: Tenth International Conference on Language Resources and Evaluation (LREC 2016), 23-28 May 2016, Portorož, Slovenia. ISBN 978-2-9517408-9-1
Abstract
A trend to digitize historical paper-based archives has emerged in recent years with the advent of digital optical scanners. Many
paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated
by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital
text into editable computer text. However, OCR output contains different kinds of errors, and Automatic Error
Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a
qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experiments show
that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly
13% relative improvement compared to the initial baseline.
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | Optical Character Recognition; Language Modelling; SpeechToSpeech Translation |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT; Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL) |
Published in: | Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry and Goggi, Sara (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association. ISBN 978-2-9517408-9-1 |
Publisher: | European Language Resources Association |
Copyright Information: | © 2016 ELRA |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland through the TIDA Programme (Grant 14/TIDA/2384), ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University |
ID Code: | 23226 |
Deposited On: | 02 May 2019 08:35 by Thomas Murtagh. Last Modified 16 May 2019 11:05 |
Documents
Full text available as: PDF (565kB)