Moriya, Yasufumi, Sanabria, Ramon, Metze, Florian ORCID: 0000-0002-6663-8600 and Jones, Gareth J.F. ORCID: 0000-0003-2923-8365 (2018) Eyes and ears together: new task for multimodal spoken content analysis. In: MediaEval’18, 29-31 Oct 2018, Sophia Antipolis, France.
Abstract
Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline the tasks themselves.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: Human speech processing
Subjects: UNSPECIFIED
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT
Published in: Larson, Martha, Arora, Piyush, Demarty, Claire-Hélène and Riegler, Michael (eds.) Working Notes Proceedings of the MediaEval 2018 Workshop. CEUR Workshop Proceedings, Vol. 2283. CEUR-WS.
Publisher: CEUR-WS
Official URL: http://ceur-ws.org/Vol-2283/MediaEval_18_paper_59....
Copyright Information: ©2018 The Authors
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
ID Code: 23384
Deposited On: 30 May 2019 15:31 by Thomas Murtagh. Last Modified: 31 Jul 2019 08:51
Documents
Full text available as:
PDF (776kB)