Ji, Tianbo ORCID: 0000-0003-0143-6220 (2022) Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue. PhD thesis, Dublin City University.
Abstract
Evaluation is a critical element in the development of many natural-language-based systems. In this thesis, we present critical analyses of the standard evaluation methodologies applied in the following Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks such as MRC are usually evaluated by comparing system outputs against hand-crafted references with automatic evaluation metrics; these metrics are largely borrowed from other, better-established NLP tasks such as machine translation and text summarization. The evaluation of QG and open-domain dialogue, by contrast, remains a known open problem, since these tasks lack references against which similarity can be computed, and human evaluation is therefore indispensable when assessing system performance. However, human evaluation is not always valid because: i) it can be too costly and hard to deploy when experts are involved; and ii) human assessors can lack reliability in a crowd-sourcing environment. To overcome the challenges posed by both automatic metrics and human evaluation, we first design crowd-sourced human evaluation methods tailored to each of the three target tasks. We then show that these human evaluation methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, using the data collected from our experiments, we measure the accuracy of existing automatic metrics and analyse the potential limitations and disadvantages of applying these metrics directly. Furthermore, taking the specific features of the different tasks into account, we provide detailed statistical analyses of the collected data to uncover underlying trends, and we offer suggestions on directions for improving systems along different aspects.
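The abstract describes measuring the accuracy of automatic metrics against collected human evaluation data. As a minimal sketch of that general idea (not the thesis's actual experimental setup), the agreement between metric scores and human judgements is commonly quantified with correlation coefficients; the scores and the choice of Pearson and Spearman correlation below are illustrative assumptions.

```python
# Illustrative sketch only: quantifying how well an automatic metric agrees
# with human evaluation scores. The scores below are hypothetical placeholders,
# not data from the thesis.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores:
metric_scores = [0.42, 0.55, 0.61, 0.38, 0.70]   # e.g. reference-similarity metric scores
human_scores = [62.0, 71.5, 68.0, 55.0, 80.5]    # e.g. mean crowd-sourced human ratings

# Pearson measures linear agreement; Spearman measures rank agreement.
r, r_pval = pearsonr(metric_scores, human_scores)
rho, rho_pval = spearmanr(metric_scores, human_scores)

print(f"Pearson r = {r:.3f} (p = {r_pval:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_pval:.3f})")
```

A high correlation under such an analysis would suggest the metric tracks human judgement for the task at hand; a low one would point to the kind of limitations the thesis examines.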
Metadata
| Item Type: | Thesis (PhD) |
|---|---|
| Date of Award: | November 2022 |
| Refereed: | No |
| Supervisor(s): | Jones, Gareth; Graham, Yvette; Liu, Qun |
| Uncontrolled Keywords: | natural language processing evaluation; human evaluation; machine reading comprehension evaluation; question generation evaluation; open-domain dialogue evaluation |
| Subjects: | Computer Science > Computational linguistics; Computer Science > Machine learning |
| DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT |
| Funders: | Science Foundation Ireland |
| ID Code: | 27703 |
| Deposited On: | 10 Nov 2022 14:19 by Gareth Jones. Last Modified 10 Nov 2022 14:19 |
Documents
Full text available as:
PDF (4MB) - Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0