Rehbein, Ines (2009) Treebank-based grammar acquisition for German. PhD thesis, Dublin City University.
Abstract
Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic presentations
from labelled data (treebanks) and can be applied to large data sets.
My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al.,2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC
700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill,
2004; Cahill et al., 2005).
While first results have been promising, a number of important research questions have been raised. The original approach presented first in Cahill et al. (2002) is strongly tailored to English and the datastructures
provided by the Penn-II treebank (Marcus et al., 1993).
English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards data structures and encoding schemes underlying the grammar acquisition task.
In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition. I present experiments investigating the influence of treebank design on PCFG parsing and show which type of representations are useful for the PCFG and
LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank
trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new testsuite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides
new insights on the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2009 |
Refereed: | No |
Supervisor(s): | van Genabith, Josef |
Uncontrolled Keywords: | German; treebanks; lexical-functional grammar; LFG; parsing; PCFG; grammar acquisistion; |
Subjects: | Computer Science > Computational linguistics Humanities > German language |
DCU Faculties and Centres: | Research Initiatives and Centres > National Centre for Language Technology (NCLT) DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. View License |
Funders: | SFI Grant 04/IN/I527, Science Foundation Ireland |
ID Code: | 14900 |
Deposited On: | 12 Nov 2009 10:46 by Josef Vangenabith . Last Modified 19 Jul 2018 14:48 |
Documents
Full text available as:
Preview |
PDF
- Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
3MB |
Downloads
Downloads
Downloads per month over past year
Archive Staff Only: edit this record