Poncelas, Alberto ORCID: 0000-0002-5089-1687, Aboomar, Mohammad ORCID: 0000-0002-1391-5061, Buts, Jan ORCID: 0000-0002-7657-804X, Hadley, James ORCID: 0000-0003-1950-2679 and Way, Andy ORCID: 0000-0001-5736-5930 (2020) A tool for facilitating OCR postediting in historical documents. In: Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020), 11-16 May 2020, Marseille, France.
Abstract
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. Each assumed error is replaced in the postedited output by a presumably correct alternative, chosen according to the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As the paper demonstrates, the tool is successful in correcting a number of common errors. Although sometimes unreliable, it is transparent and open to human intervention.
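The workflow summarized in the abstract (flag word forms missing from a vocabulary, propose alternatives, pick one by LM score) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy corpus, the add-one-smoothed bigram model, the use of `difflib.get_close_matches` to generate candidates, and the names `train_bigram_lm`, `log_score` and `postedit` are all assumptions made for illustration.

```python
import math
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy corpus; it stands in for both the vocabulary and the LM data.
CORPUS = [
    "the trade of this kingdom",
    "employing the poor of this kingdom",
    "the poor of the kingdom",
    "the door of the house",
]


def train_bigram_lm(sentences):
    """Count unigrams and bigrams, with a start-of-sentence marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_score(tokens, unigrams, bigrams):
    """Log-probability of a token list under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def postedit(line, vocabulary, unigrams, bigrams):
    """Replace out-of-vocabulary words with the candidate the LM scores highest.

    Returns the corrected line and a dict of the candidates considered for
    each replaced word, so that a human can review every decision.
    """
    tokens = line.split()
    suggestions = {}
    for i, word in enumerate(tokens):
        if word in vocabulary:
            continue
        # Assumption: candidates are vocabulary words with similar spelling.
        candidates = get_close_matches(word, sorted(vocabulary), n=5, cutoff=0.6)
        if not candidates:
            continue  # leave the word untouched when nothing plausible exists
        suggestions[word] = candidates
        tokens[i] = max(
            candidates,
            key=lambda c: log_score(tokens[:i] + [c] + tokens[i + 1:], unigrams, bigrams),
        )
    return " ".join(tokens), suggestions


UNIGRAMS, BIGRAMS = train_bigram_lm(CORPUS)
VOCABULARY = {w for sentence in CORPUS for w in sentence.split()}

corrected, suggestions = postedit(
    "employing the hoor of this kingdorn", VOCABULARY, UNIGRAMS, BIGRAMS
)
print(corrected)
print(suggestions)
```

Returning the candidate lists alongside the corrected line is one way to keep the process transparent: a reviewer can see which alternatives were considered and override the LM's choice.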
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Workshop |
Refereed: | Yes |
Additional Information: | Workshop colocated with LREC 2020 (Language Resources and Evaluation Conference). Due to the COVID-19 pandemic, the workshop will not take place. However, the proceedings are published online. |
Subjects: | Computer Science > Digital electronics |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT |
Published in: | Sprugnoli, Rachele and Passarotti, Marco (eds.) Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages. LREC. |
Publisher: | LREC |
Official URL: | https://aclanthology.org/2020.lt4hala-1.7.pdf |
Copyright Information: | © 2020 The Authors |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Irish Research Council’s COALESCE scheme (COALESCE/2019/117), SFI Research Centres Programme (Grant 13/RC/2106) |
ID Code: | 24441 |
Deposited On: | 11 May 2020 15:11 by Alberto Poncelas. Last Modified 07 Jan 2022 16:41 |