Open-ended Video Question Answering with Multi-stream 3D Convolutional Networks
-
- MIYANISHI Taiki
- Advanced Telecommunications Research Institute International / RIKEN Center for Advanced Intelligence Project
-
- KAWANABE Motoaki
- Advanced Telecommunications Research Institute International / RIKEN Center for Advanced Intelligence Project
Bibliographic Information
- Other Title
-
- マルチストリーム3次元畳み込みネットワークによる外観・動作・音声情報を統合した映像質問応答 (Video question answering integrating appearance, motion, and audio information via multi-stream 3D convolutional networks)
Abstract
<p>We propose an open-ended multimodal video question answering (VideoQA) method that simultaneously takes motion, appearance, and audio signals as input and outputs textual answers. Although audio information is useful for understanding video content alongside visual information, standard open-ended VideoQA methods exploit only motion and appearance signals and ignore audio. Moreover, because they lack fine-grained modeling of multimodal data and effective fusion across modalities, the few prior works that use motion, visual appearance, and audio signals showed poor results on public benchmarks. To address these problems, we propose multi-stream 3-dimensional convolutional networks (3D ConvNets) modulated with textual conditioning information. Our model feeds the fine-grained motion-appearance and audio information into multiple 3D ConvNets and then modulates their intermediate representations using question-guided spatiotemporal information. Experimental results on public open-ended VideoQA datasets with audio tracks show that our VideoQA method effectively combines motion, appearance, and audio signals and outperforms state-of-the-art methods.</p>
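The abstract describes per-stream 3D ConvNets whose intermediate representations are modulated by question-guided conditioning before fusion. A minimal NumPy sketch of that idea is below, assuming FiLM-style (scale-and-shift) conditioning and late fusion by concatenation; the feature dimensions, the projection matrices `W_gamma`/`W_beta`, and the random stand-in features are all hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): each stream's 3D ConvNet
# yields a D-dimensional clip feature; the question encoder yields Q dims.
D, Q = 16, 8

# Stand-ins for motion, appearance, and audio 3D ConvNet features.
streams = {name: rng.standard_normal(D)
           for name in ("motion", "appearance", "audio")}

# Stand-in for a question embedding from a text encoder.
question = rng.standard_normal(Q)

# FiLM-style conditioning: project the question to a per-channel scale
# (gamma) and shift (beta), then modulate each stream's representation.
W_gamma = rng.standard_normal((D, Q)) * 0.1
W_beta = rng.standard_normal((D, Q)) * 0.1
gamma = 1.0 + W_gamma @ question
beta = W_beta @ question

modulated = {name: gamma * feat + beta for name, feat in streams.items()}

# Late fusion: concatenate the modulated streams into one vector that a
# softmax classifier over the answer vocabulary would consume.
fused = np.concatenate([modulated[n] for n in ("motion", "appearance", "audio")])
print(fused.shape)  # (48,)
```

In a real model the modulation would be applied to intermediate convolutional feature maps rather than pooled vectors, but the conditioning mechanism is the same.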
Journal
-
- Proceedings of the Annual Conference of JSAI
-
Proceedings of the Annual Conference of JSAI, JSAI2021 (0), 2Yin505-2Yin505, 2021
The Japanese Society for Artificial Intelligence
Details
-
- CRID
- 1390851320454046720
-
- NII Article ID
- 130008051725
-
- Text Lang
- ja
-
- Data Source
-
- JaLC
- CiNii Articles
-
- Abstract License Flag
- Disallowed