Hu, Feiyan ORCID: 0000-0001-7451-6438, Mohedano, Eva, O'Connor, Noel E. ORCID: 0000-0002-4033-9135 and McGuinness, Kevin ORCID: 0000-0003-1336-6477 (2021) Temporal bilinear encoding network of audio-visual features at low sampling rates. In: 16th International Conference on Computer Vision Theory and Applications - VISAPP 2021, 8-10 Feb 2021, Vienna, Austria (Online). ISBN 978-989-758-488-6
Abstract
Current deep-learning-based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information for video classification at a sampling rate of 1 frame per second (FPS). We propose Temporal Bilinear Encoding Networks (TBEN), which encode long-range temporal information from both audio and visual features using bilinear pooling, and we demonstrate that bilinear pooling outperforms average pooling along the temporal dimension for videos sampled at a low rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments with TBEN on the FGA240 fine-grained classification dataset achieve a new state of the art (hit@1=47.95%). We also explore combining TBEN with multiple decoupled modalities such as visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly fewer computational resources than competing approaches for both training and prediction.
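To illustrate why pooling choice matters along the temporal dimension, the following is a minimal sketch contrasting average pooling with plain (full) bilinear pooling over per-frame feature vectors. It is not the authors' TBEN implementation: the function names, the toy two-dimensional features, and the use of a full outer product are assumptions made for illustration only (the keywords above indicate that the paper works with compact bilinear pooling).

```python
def average_pool(frames):
    """Mean of per-frame feature vectors over time: (T, D) -> (D,)."""
    t = len(frames)
    d = len(frames[0])
    return [sum(f[i] for f in frames) / t for i in range(d)]


def bilinear_pool(frames):
    """Time-averaged outer product of each frame feature with itself.

    (T, D) -> flattened (D * D,). Illustrative full bilinear pooling,
    not the compact approximation and not the paper's exact method.
    """
    t = len(frames)
    d = len(frames[0])
    pooled = [0.0] * (d * d)
    for f in frames:
        for i in range(d):
            for j in range(d):
                pooled[i * d + j] += f[i] * f[j] / t
    return pooled


# Two hypothetical clips, each with T=2 frames of D=2 features.
clip_a = [[1.0, 0.0], [0.0, 1.0]]  # features alternate between frames
clip_b = [[0.5, 0.5], [0.5, 0.5]]  # features are constant over time

# Average pooling maps both clips to the same vector.
print(average_pool(clip_a), average_pool(clip_b))
# Bilinear pooling keeps second-order statistics, so the clips differ.
print(bilinear_pool(clip_a), bilinear_pool(clip_b))
```

In this toy case the two clips are indistinguishable after temporal averaging but remain distinct after bilinear pooling, which is the kind of information a second-order encoding can retain when only a few frames per clip are available.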
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | Video classification; bilinear pooling; Action classification; Deep learning; Audio-visual; Compact Bilinear Pooling |
Subjects: | Computer Science > Artificial intelligence; Computer Science > Image processing; Computer Science > Machine learning; Computer Science > Digital video; Computer Science > Video compression |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Initiatives and Centres > INSIGHT Centre for Data Analytics |
Published in: | Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications: VISAPP, 5. SciTePress. ISBN 978-989-758-488-6 |
Publisher: | SciTePress |
Official URL: | http://dx.doi.org/10.5220/0010337306370644 |
Copyright Information: | © 2021 The Authors (CC BY-NC-ND 4.0) |
Funders: | Science Foundation Ireland (SFI) under grant numbers SFI/15/SIRG/3283 and SFI/12/RC/2289_P2. |
ID Code: | 25289 |
Deposited On: | 09 Feb 2021 14:05 by Feiyan Hu. Last Modified 13 Sep 2021 10:15 |
Available Versions of this Item
- Temporal bilinear encoding network of audio-visual features at low sampling rates. (deposited 09 Feb 2021 14:05) [Currently Displayed]