Multi-Step Spatio-Temporal Reasoning for Open-Ended Video Question Answering

Bibliographic Information

Other Title
  • 多段階時空間推論による映像質問応答 (Video Question Answering by Multi-Step Spatio-Temporal Reasoning)

Abstract

This study tackles open-ended video question answering (Video QA), which aims to generate correct text answers to questions about video content. Although a video consists of sequential frames that involve multiple objects, methods that capture both its spatial and temporal structures remain underexplored. Moreover, existing methods that focus mainly on the temporal structure of video still achieve competitive performance on public Video QA datasets. For more complex and precise reasoning, however, it is essential to jointly model the spatial and temporal structures of video, guided by the content of the textual question. In this paper, we propose the two-stream spatiotemporal MAC network, which performs question-aware sequential reasoning over video frames to infer correct answers. Our network computes spatial and temporal representations of video from clip-wise motion features and frame-wise object-based appearance features, and then weights them with spatial and temporal attention to output answers. It can also repeatedly refine this reasoning process by attending to the objects and clips relevant to the important words of the question. On both short- and long-form open-ended Video QA datasets, our new model significantly outperforms state-of-the-art Video QA models.
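
The question-guided attention over objects and clips described in the abstract can be illustrated with a minimal sketch. This is not the authors' two-stream spatiotemporal MAC network: the function names (`attend`, `reason`), the plain dot-product attention, the averaging memory update, and the fixed step count are all assumptions chosen for illustration, and the feature vectors are random placeholders rather than real motion or appearance features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, items):
    """Dot-product attention: return the weighted sum of `items` and the weights."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(query)
    summary = [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]
    return summary, weights


def reason(question_words, object_feats, clip_feats, steps=3):
    """Hypothetical multi-step reasoning loop (an illustration, not the paper's model).

    question_words: list of W word vectors
    object_feats:   list of T frames, each a list of K object vectors
    clip_feats:     list of C clip vectors
    """
    dim = len(question_words[0])
    # Assumption: start the memory as the mean of the question word vectors.
    memory = [sum(w[i] for w in question_words) / len(question_words) for i in range(dim)]
    # Flatten objects across frames so spatial attention ranges over every object.
    flat_objects = [obj for frame in object_feats for obj in frame]
    obj_w, clip_w = [], []
    for _ in range(steps):
        # Pick the question words most relevant to the current memory.
        control, _ = attend(memory, question_words)
        # Spatial attention over objects, temporal attention over clips.
        spatial, obj_w = attend(control, flat_objects)
        temporal, clip_w = attend(control, clip_feats)
        # Assumption: a simple average stands in for a learned memory update.
        memory = [(m + s + t) / 3.0 for m, s, t in zip(memory, spatial, temporal)]
    return memory, obj_w, clip_w


if __name__ == "__main__":
    rng = random.Random(0)

    def vec(d=8):
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    words = [vec() for _ in range(6)]                         # 6 question words
    objects = [[vec() for _ in range(5)] for _ in range(4)]   # 4 frames x 5 objects
    clips = [vec() for _ in range(3)]                         # 3 clips
    memory, obj_w, clip_w = reason(words, objects, clips)
    print(len(memory), len(obj_w), len(clip_w))
```

In this sketch the final `memory` vector would be passed to an answer decoder, and `obj_w` and `clip_w` show which objects and clips the last reasoning step relied on. A learned model would replace the dot products and the averaging step with trained projections.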
