Yu, Dahai (2008) The application of manifold based visual speech units for visual speech recognition. PhD thesis, Dublin City University.
Abstract
This dissertation presents a new learning-based representation, referred to as a Visual Speech Unit (VSU), for visual speech recognition (VSR). The automated recognition of human speech using only features from the visual domain has become a significant research topic that plays an essential role in the development of many multimedia systems such as audio-visual speech recognition (AVSR), mobile phone applications, human-computer interaction (HCI) and sign language recognition. The inclusion of visual lip information is opportune, since it can improve the overall accuracy of audio or hand recognition algorithms, especially when such systems are operated in environments characterized by a high level of acoustic noise.
The main contribution of the work presented in this thesis is the development of a new learning-based representation, referred to as the Visual Speech Unit, for Visual Speech Recognition (VSR). The main components of the developed VSR system are applied to: (a) segment the mouth region of interest, (b) extract the visual features from the real-time input video image and (c) identify the visual speech units. The major difficulty associated with VSR systems resides in the identification of the smallest elements contained in the image sequences that represent the lip movements in the visual domain.
The Visual Speech Unit (VSU) concept as proposed represents an extension of the standard viseme model that is currently applied for VSR. The VSU model augments the standard viseme approach by including in this new representation not only the data associated with the articulation of the visemes but also the transitory information between consecutive visemes. A large section of this thesis is dedicated to analysing the performance of the new visual speech unit model when compared with that attained for standard (MPEG-4) viseme models. Two experimental results indicate that:
1. The developed VSR system achieved 80-90% correct recognition when applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 62-72%.
2. When VSUs and visemes were each employed as the visual speech element to identify 15 words, the accuracy rate for word recognition based on VSUs was 7%-12% higher than the accuracy rate based on visemes.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2008 |
Refereed: | No |
Supervisor(s): | Sutherland, Alistair and Whelan, Paul F. |
Uncontrolled Keywords: | Visual Speech Recognition; Lip-Reading |
Subjects: | Computer Science > Interactive computer systems; Computer Science > Image processing |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 598 |
Deposited On: | 10 Nov 2008 11:29 by Alistair Sutherland. Last Modified 19 Jul 2018 14:41 |
Documents
Full text available as: PDF (7MB)