Okita, Tsuyoshi (2012) Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting. PhD thesis, Dublin City University.
Abstract
This thesis discusses how to incorporate linguistic knowledge into an SMT system. One important category of linguistic knowledge is that obtained from a constituent / dependency parser, a POS / super tagger, and a morphological analyser, but linguistic knowledge here covers a broader range than this: Multi-Word Expressions (MWEs), Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named entities, coreferences, and transliterations. The first discussion concerns word alignment, where we propose an MWE-sensitive word aligner. The second concerns smoothing methods for a language model and a translation model, where we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that it can capture both frequent and linguistically important phenomena. This can be seen as one example of how to address a weakness of statistical models, which often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | March 2012 |
Refereed: | No |
Supervisor(s): | Way, Andy |
Uncontrolled Keywords: | statistical machine translation; SMT; Multi-Word Expressions; linguistic knowledge |
Subjects: | Computer Science > Computational linguistics; Computer Science > Machine translating |
DCU Faculties and Centres: | Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 16759 |
Deposited On: | 28 Mar 2012 13:20 by Declan Groves. Last Modified 19 Jul 2018 14:55 |
Documents
Full text available as:
PDF (Word Alignment and Smoothing Methods in Statistical Machine Translation: Noise, Prior Knowledge and Overfitting), 1MB