Contents
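1 Introduction
In recent years, research on vision-language understanding has been very active, and video question answering (VideoQA) has drawn particular attention because of its promise for developing interactive AI that communicates with the dynamic visual world in natural language. VideoQA demands models that comprehensively understand videos and answer questions correctly, which involves not only recognizing visual objects, actions, activities, and events, but also inferring their semantic, spatial, temporal, and causal relationships.
To tackle these challenges, techniques such as spatio-temporal attention, motion-appearance memory, and spatio-temporal or hierarchical graph models have been proposed. However, the datasets used, the challenges defined, and the corresponding algorithms are varied and somewhat disorganized, which seriously impedes research in this area. To address this problem, this paper provides a more comprehensive and meaningful survey of VideoQA.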
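2 VideoQA Tasks and Datasets
2.1 Problem Definition
The VideoQA task is to predict the correct answer $a^*$ given a video $V$ and a question $q$. There are two main types: multi-choice QA and open-ended QA.
In multi-choice QA, the model is presented with several candidate answers $A_{mc}$ for each question and must select the correct one. This is expressed as:
\[a^* = F(a | q, V, A_{mc})\]
In open-ended QA, the problem is most often set as a multi-class classification problem, in which the model must classify a video-question pair into a predefined global answer set $A_{oe}$:
\[a^* = F(a | q, V) \quad \text{where } a \in A_{oe}\]
These tasks are evaluated with a variety of metrics and datasets.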
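2.2 Evaluation Metrics
Accuracy is defined as follows.
\[\text{Accuracy} = \frac{\sum_{i=1}^N \min\left(1, \frac{\text{WUP}(a_i, a^*_i)}{\gamma}\right)}{N}\]
In the expression above, $a_i$ and $a^*_i$ denote the predicted answer and the ground-truth answer, respectively, $N$ is the number of questions, and $\gamma$ is the threshold on the WUP score, which measures similarity.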
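The soft accuracy above can be sketched in a few lines. This is a minimal sketch, assuming the word-similarity function is supplied by the caller; `soft_accuracy`, `toy_wup`, and the lookup table are hypothetical names and values chosen for illustration, and the table merely stands in for a WordNet-based WUP score.

```python
from typing import Callable, Sequence

# Hypothetical similarity values standing in for a WordNet-based WUP score.
TOY_SIMILARITY = {("dog", "puppy"): 0.9, ("dog", "car"): 0.2}


def toy_wup(a: str, b: str) -> float:
    """Toy, symmetric word similarity in [0, 1] (assumption, not real WUP)."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def soft_accuracy(predictions: Sequence[str],
                  ground_truths: Sequence[str],
                  similarity: Callable[[str, str], float],
                  gamma: float) -> float:
    """Average of min(1, similarity(a_i, a_i*) / gamma) over all N answers."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must align")
    if not predictions:
        raise ValueError("at least one answer pair is required")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    total = 0.0
    for pred, truth in zip(predictions, ground_truths):
        # Similarities at or above the threshold count as fully correct.
        total += min(1.0, similarity(pred, truth) / gamma)
    return total / len(predictions)
```

With a threshold of 0.9, an exact match and a near synonym both receive full credit, while an unrelated word receives only partial credit proportional to its similarity.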
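2.3 Datasets
VideoQA can be understood from different perspectives; the aim is a multi-view understanding of videos under the guidance of specific questions. Datasets can be classified into normal VideoQA, multi-modal VideoQA, and knowledge-based VideoQA. This classification depends on the data modality invoked in the questions and answers.
2.4 Main Framework
A common solution framework for VideoQA comprises a video encoder, a question encoder, cross-modal interaction, and an answer decoder. These components extract information from the video and the question, respectively, and integrate it to produce the final answer. Combined with various techniques, the framework uses multi-step reasoning and hierarchical learning to reflect the structure and relationships of video elements and to answer questions hierarchically.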
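3 Algorithms
3.1 Methods
Early Attention-based Works
Attention is a human-inspired mechanism that locates the important parts of the input and selectively focuses on useful information. In VideoQA, it is used to attend to specific parts of a video along the spatial and temporal dimensions. Early works attempted to fuse the global video and question representations by applying simple temporal attention.
Memory Networks
Memory networks cache sequential inputs in memory slots and exploit them explicitly, which lets them understand long stories. They have received attention in long video story understanding, such as movies and TV shows.
Transformer
Transformers are well suited to modeling long-term relationships and show promising performance on various VideoQA datasets, mainly through pre-training on large-scale datasets.
Graph Neural Networks
Graph-structured techniques help improve the reasoning ability of VideoQA models and have become more favored as Inference VideoQA draws attention. These techniques were developed on the observation that video elements are hierarchical in semantic space.
3.2 Performance Analysis
Tables 2 and 3 analyze methods based on results reported on popular VideoQA benchmarks. For both Factoid VideoQA and Inference VideoQA, graph-structured techniques and hierarchical learning in particular show promising performance.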
Recent years have witnessed flourishing research in vision-language understanding (Xu et al., 2016; Chen et al., 2017; Antol et al., 2015; Chen et al., 2018; Jang et al., 2017), of which video Question Answering (VideoQA) is one of the most prominent, given its promise to develop interactive AI that communicates with the dynamic visual world via natural language. Despite its popularity, VideoQA remains one of the greatest challenges, because it demands that models comprehensively understand videos in order to answer questions correctly. The questions involve not only the recognition of visual objects, actions, activities, and events, but also the inference of their semantic, spatial, temporal, and causal relationships (Xu et al., 2017; Jang et al., 2017; Shang et al., 2019, 2021; Yang et al., 2021b; Xiao et al., 2021, 2022a).
To tackle the challenges, techniques such as spatio-temporal attention (Jang et al., 2017), motion-appearance memory (Gao et al., 2018), and spatio-temporal or hierarchical graph models (Cherian et al., 2022; Xiao et al., 2022a) have been proposed and have demonstrated their effectiveness on different VideoQA datasets. However, we find that the datasets, the defined challenges, and the corresponding algorithms are varied and somewhat disorganized. There is no meaningful survey that categorizes the datasets and organizes the techniques developed, which seriously impedes research.
Although a handful of recent works (Sun et al., 2021; Khurana and Deshpande, 2021; Patel et al., 2021) have tried to review VideoQA, they mostly summarize the literature in old-to-new order and lack an effective taxonomy to classify it. In terms of content, these works focus merely on factoid questions and neglect inference questions (see Fig. 1 for the difference). Furthermore, many recent techniques (e.g., pre-training and Transformer) are missing.
This paper thus gives a more comprehensive and meaningful survey of VideoQA, in the hope of learning from the past and shaping the future. Our contributions are as follows. (1) We provide a clear taxonomy for VideoQA. We can either classify existing VideoQA tasks into Factoid VideoQA and Inference VideoQA according to the fundamental challenges embodied in the QAs, or classify them into normal VideoQA, Multi-modal VideoQA, and Knowledge-based VideoQA according to the multimodal information invoked in the QAs. (2) We categorize existing VideoQA techniques as Memory, Transformer, Graph, Neural Modular Network, and Neural-Symbolic methods. Along with the techniques, some meaningful insights are also included: attention modeling, cross-modal pre-training, hierarchical learning, multi-granular ensemble, and progressive reasoning. (3) We analyze existing methods from the perspective of the challenges encountered in the various VideoQA tasks and provide our prospects for future research.
VideoQA is a task to predict the correct answer $a^*$ based on a question $q$ and a video $V$. There are mainly two types of tasks in VideoQA: multi-choice QA and open-ended QA.
For multi-choice QA, the models are presented with several candidate answers $A_{mc}$ for each question and are required to pick the correct one: $a^* = F(a \mid q, V, A_{mc})$. For open-ended QA, the problem can be classification (the most popular), generation (word-by-word), or regression (for counting), depending on the specific dataset. Specifically, open-ended QA is popularly set as a multi-class classification problem, which requires the models to classify a video-question pair into a predefined global answer set $A_{oe}$: $a^* = F(a \mid q, V)$, where $a \in A_{oe}$. Open-ended QA can also be formulated as a generation problem, which might have more practical use and is receiving increasing attention. Usually the answer is denoted as $a = (a_1, a_2, \ldots, a_t, \ldots, a_M)$ of length $M$, where $a_t$ is the $t$-th word, and the model is required to predict the next word $a_t$ in the vocabulary set $W$: $a_t^* = F(a_t \mid q, V, (a_1, a_2, \ldots, a_{t-1}))$, where $a_t \in W$. The counting task, which is defined as an open-ended question about the number of repetitions of an action (Jang et al., 2017), is formulated as a regression problem that requires the model to compute an integer-valued answer close to the ground truth.
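The formulations above can be sketched as follows. This is only an illustration under stated assumptions: the scoring function `score` and the next-word scorer `next_word_scores` stand in for a trained model $F$, and all function names, the `<eos>` end token, and the rounding step for counting are hypothetical choices rather than details taken from any cited system.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature for F(a | q, V): (answer, question, video) -> score.
Score = Callable[[str, str, object], float]
NextWordScorer = Callable[[str, object, List[str]], Dict[str, float]]


def pick_answer(score: Score, question: str, video: object,
                candidates: Sequence[str]) -> str:
    """Return the candidate a that maximizes the score F(a | q, V).

    With per-question candidates this is multi-choice QA; with one fixed
    global answer set it is open-ended QA posed as classification.
    """
    if not candidates:
        raise ValueError("candidate set must not be empty")
    return max(candidates, key=lambda a: score(a, question, video))


def greedy_generate(next_word_scores: NextWordScorer, question: str,
                    video: object, max_len: int,
                    end_token: str = "<eos>") -> List[str]:
    """Generate an answer word by word, always taking the top-scoring word."""
    answer: List[str] = []
    for _ in range(max_len):
        # Scores over the vocabulary, conditioned on the words produced so far.
        scores = next_word_scores(question, video, answer)
        word = max(scores, key=scores.get)
        if word == end_token:
            break
        answer.append(word)
    return answer


def round_count(prediction: float) -> int:
    """Counting as regression: map a real-valued output to a non-negative integer."""
    return max(0, round(prediction))
```

Greedy decoding is used here only because it is the simplest decoding strategy; beam search or sampling would fit the same interface.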
Compared with open-ended QA, multi-choice QA is typically defined to study beyond factoid QA to inference QA (Xiao et al., 2021; Wu et al., 2021a), as it dispenses with the generation and evaluation of natural languages.
Accuracy. For multi-choice QA and open-ended QA (classification), accuracy is defined based on the entire testing question set Q, given by:
where $L$ denotes the length of the shorter answer.
WUPS. WUPS is a soft measure of accuracy that takes word synonyms into account. It is based on the WUP score (Wu and Palmer, 1994) and is used to evaluate the quality of generated answers (Zhao et al., 2017b, 2018; Xiao et al., 2021). WUP measures word similarity based on WordNet (Fellbaum, 1998). The WUPS score with threshold $\gamma$ is defined as,
in which a and a* are predicted and ground-truth numbers respectively.
The evaluation metrics mainly serve for different task settings, while there are also some novel and diagnostic ones (Gandhi et al., 2022; Li et al., 2022b; Castro et al., 2022a) that may be helpful for robustness and interpretation of VideoQA models.
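As a rough illustration of the two simplest metrics, the sketch below assumes that classification accuracy is the fraction of exact matches over the test questions and that counting is scored with mean squared error, as reported for the TGIF-QA counting task. The function names are hypothetical, and the definitions are the common textbook forms rather than a transcription of the survey's equations.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truths: Sequence[str]) -> float:
    """Fraction of questions whose predicted answer exactly matches the ground truth."""
    if len(predictions) != len(ground_truths) or not predictions:
        raise ValueError("need aligned, non-empty answer lists")
    correct = sum(1 for p, t in zip(predictions, ground_truths) if p == t)
    return correct / len(predictions)


def mean_squared_error(predicted_counts: Sequence[float],
                       true_counts: Sequence[float]) -> float:
    """Average squared difference between predicted and ground-truth numbers."""
    if len(predicted_counts) != len(true_counts) or not predicted_counts:
        raise ValueError("need aligned, non-empty count lists")
    squared = [(p - t) ** 2 for p, t in zip(predicted_counts, true_counts)]
    return sum(squared) / len(squared)
```

Higher is better for accuracy, whereas lower is better for mean squared error, which matters when reading tables that mix the two.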
VideoQA can be understood from different perspectives, since the aim is to gain multi-view and multi-grained understanding of videos under the guidance of specific questions.
Table 1: VideoQA datasets in the literature.
Modality-based Taxonomy. According to the data modality invoked in the questions and answers, VideoQA can be classified into normal VideoQA, multi-modal VideoQA (MM VideoQA), and knowledge VideoQA (KB VideoQA). Normal VideoQA only invokes visual resources to understand the question and to derive the correct answer. It emphasizes visual understanding of the video elements and reasoning about their relations. Usually, the videos are short and are typically user-generated on social platforms. Different from normal VideoQA, MM VideoQA often involves other resources aside from visual contents, such as subtitles/transcripts and text plots of movies (Tapaswi et al., 2016) and TV shows (Lei et al., 2018). MM VideoQA mainly challenges multi-modal information fusion and long video story understanding. Finally, KB VideoQA (Garcia et al., 2020) demands external knowledge distillation from explicit knowledge bases or commonsense reasoning (Fang et al., 2020). Different from MM VideoQA, KB VideoQA provides a global knowledge base for the whole dataset, instead of giving paired “knowledge” for each question. For better understanding of the three kinds of VideoQA, we show typical examples in Figure 1 (right).
Question-based Taxonomy. According to the type of question (or the challenges posed in the questions), VideoQA can be classified into factoid VideoQA and inference VideoQA. A factoid question directly asks about a visual fact, such as the location (where is) or objects/attributes (who/what (color) is), and involves few relations in understanding the question and inferring the correct answer.
Factoid QA emphasizes the holistic understanding of the questions and the recognition of the visual elements. In contrast, inference VideoQA aims to explore logical and knowledge-based reasoning in dynamic scenarios. It features various relationships between the visual facts. Though rich in relation types, VideoQA emphasizes temporal (before/after) and causal (why/how/what if) relationships that feature temporal dynamics, as highlighted by recent works (Zadeh et al., 2019; Yi et al., 2020; Xiao et al., 2021; Li et al., 2022b).
Datasets Analysis. The timeline of some established VideoQA datasets is shown in Figure 2. We categorize all the datasets according to our defined taxonomy in Table 1, and their details are listed in Table A1 (see Appendix). VideoQA and MM VideoQA appeared almost simultaneously and have been studied separately by the community. Despite the unique challenges of MM VideoQA in reasoning over multiple modalities (Kim et al., 2020), algorithms targeting VideoQA and MM VideoQA share a similar spirit. The modality-based taxonomy stems from research preferences for video domains, while the question-based taxonomy is affected more by methodological considerations: the recently proposed Inference VideoQA brings new technical challenges that push artificial intelligence beyond merely learning correlations in data.
As shown in Figure 3, a common framework comprises four parts: video encoder, question encoder, cross-modal interaction, and answer decoder. The video encoder often encodes raw videos by jointly extracting frame appearance and clip motion features. Recent works also show that object-level visual and semantic features (e.g., category and attribute labels) are important. These features are usually extracted with pre-trained 2D or 3D neural networks, as summarized in Table 2. The question encoder extracts token-level representations, such as GloVe and BERT features (Kenton and Toutanova, 2019). Then, the sequential data of vision and language can be further processed by sequential models (e.g., RNN, CNN, and Transformer) for the convenience of cross-modal interaction, which will be detailed further. For multi-choice QA, the answer decoder can be a 1-way classifier to select the correct answer from the provided multiple choices. For open-ended QA, it can be either an n-way classifier to select an answer from a pre-defined global answer set, or a language generator to generate an answer word by word. The video and language encoders can be pre-trained or, more recently, fine-tuned end-to-end (Lei et al., 2021).
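The four-part framework can be sketched as a chain of small functions. This is a deliberately toy illustration under stated assumptions: mean pooling stands in for the video and question encoders, element-wise multiplication stands in for cross-modal interaction, and a linear n-way classifier stands in for the answer decoder. All names and the hand-set vectors are hypothetical; real systems use pre-trained networks at each stage.

```python
from typing import Dict, List

Vector = List[float]


def encode_video(frame_features: List[Vector]) -> Vector:
    """Stand-in video encoder: mean-pool per-frame feature vectors."""
    dim = len(frame_features[0])
    count = len(frame_features)
    return [sum(frame[d] for frame in frame_features) / count for d in range(dim)]


def encode_question(tokens: List[str], embeddings: Dict[str, Vector]) -> Vector:
    """Stand-in question encoder: average the embeddings of known tokens."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def interact(video_vec: Vector, question_vec: Vector) -> Vector:
    """Stand-in cross-modal interaction: element-wise multiplication."""
    return [v * q for v, q in zip(video_vec, question_vec)]


def decode_answer(joint: Vector, classifier: Dict[str, Vector]) -> str:
    """n-way classifier: one weight vector per answer in the global answer set."""
    def logit(weights: Vector) -> float:
        return sum(w * x for w, x in zip(weights, joint))
    return max(classifier, key=lambda answer: logit(classifier[answer]))


def answer_question(frame_features: List[Vector], tokens: List[str],
                    embeddings: Dict[str, Vector],
                    classifier: Dict[str, Vector]) -> str:
    """Run the four stages in order and return the predicted answer."""
    video_vec = encode_video(frame_features)
    question_vec = encode_question(tokens, embeddings)
    return decode_answer(interact(video_vec, question_vec), classifier)
```

Keeping each stage behind its own function mirrors how the surveyed methods differ mostly in the interaction stage while reusing similar encoders and decoders.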
Unique Challenges. Compared with ImageQA (Lu et al., 2016; Anderson et al., 2018), VideoQA is much more challenging because of the spatio-temporal nature of videos (Xiao et al., 2020, 2021). Thus, a simple extension of existing ImageQA techniques to answer queries about videos will lead to sub-optimal results. Compared with other video tasks, question answering requires a comprehensive understanding of videos in different aspects and granularity, such as from fine-grained to coarse-grained in both temporal and spatial domains, and from factoid questions to inference questions. To tackle these challenges, much research effort has been devoted to cross-modal interaction, which aims to gain understanding of videos under the guidance of questions. We summarize some common and meaningful insights as follows:
Attention. Attention is a human-inspired mechanism that locates the important part of the input and selectively focuses on useful information. In VideoQA, to attend to a specific part of videos in both spatial and temporal dimensions, temporal attention and spatial attention are widely used. Self-attention has a good ability to model long-range dependencies and can be used in intra-modal modeling, such as temporal information in the video and global dependencies of questions. Co-attention (cross-modal attention) can attend to both relevant and critical multi-modal information, such as the question-guided video representation and the video-guided question representation.
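Question-guided temporal attention can be sketched as a softmax over frame-question similarities followed by a weighted sum of frame features. This is a minimal sketch, assuming dot-product scoring and features already projected to a shared dimension; the function names are hypothetical, and learned projections are omitted.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features: List[Vector],
                       question_vec: Vector) -> Tuple[List[float], Vector]:
    """Weight each frame by its relevance to the question, then pool.

    Returns the attention weights (one per frame) and the attended video vector.
    """
    # Dot product between each frame feature and the question vector.
    scores = [sum(f * q for f, q in zip(frame, question_vec))
              for frame in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    attended = [sum(w * frame[d] for w, frame in zip(weights, frame_features))
                for d in range(dim)]
    return weights, attended
```

Spatial attention follows the same pattern with regions of a frame in place of frames, and self-attention replaces the question vector with each element of the same sequence in turn.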
Figure 3: A common solution framework for VideoQA. It includes: a video encoder, a question encoder, a cross-modal interaction, and an answer decoder. Some common insights are also involved in the model design.
Specifically, linguistic concepts are analyzed from word to sentence. Similarly, video elements are processed from objects to actions, activities, and global events. Compared with the multi-granularity ensemble, hierarchical learning processes the multi-granular information progressively. It gradually reasons over and aggregates the low-level, local visual information into the high-level, global video representation. Thus, hierarchical learning can better reflect the structure and relationships of video elements and accomplish question answering hierarchically.
Others. Aside from the above, multi-step reasoning (Wang et al., 2021; Mao et al., 2022) and causal discovery (Li et al., 2022d) also demonstrate their effectiveness. Most importantly, these insights are not mutually exclusive; they can be coordinated in a single model for good performance.
Early Attention-based Works.
(Zeng et al., 2017) try to directly apply element-wise multiplication to fuse the global video and question representations for answer prediction. They additionally demonstrate the advantage of simple temporal attention. Attention is also explored in more complex scenarios in conjunction with various other ideas, such as multi-granularity ensemble (Xu et al., 2017) and hierarchical learning (Zhao et al., 2017a). In particular, (Jang et al., 2017) propose a dual-LSTM based approach with spatial and temporal attention mechanisms, which can focus better on critical frames in a video and critical regions in a frame. (Xu et al., 2017) refine attention over both frame-level and clip-level visual features, conditioned on both the coarse-grained question feature and the fine-grained word feature. (Zhao et al., 2017a) propose hierarchical dual-level attention networks (DLAN) to learn question-aware video representations with word-level and question-level attention based on appearance and motion.
Despite the ability to attend to video frames and clips, these works rely on RNN for history information modeling, which has later been shown to be weak in capturing long-term dependency.
Memory Networks. Memory networks can cache sequential inputs in memory slots and explicitly utilize even very early information. Memory receives particular attention in long video story understanding, such as movies and TV shows, because the QAs in these VideoQA tasks involve understanding not only the visual contents but also the long stories they convey.
(Tapaswi et al., 2016) first incorporate and modify the memory network (Sukhbaatar et al., 2015) into VideoQA, to store video and subtitle features in the memory bank. To enable memory read and write operations with high capacity and flexibility, (Na et al., 2017) design a memory network with multiple convolution layers. Considering dual-modal information in the movie story, (Kim et al., 2019) introduce a progressive attention mechanism to progressively prune out irrelevant temporal parts in the memory bank for each modality, and adaptively integrate the outputs of each memory.
Memory has also been explored in normal VideoQA. (Gao et al., 2018) propose a two-stream framework (CoMem) to deal with motion and appearance information with a co-memory attention module, introducing multi-level contextual information and producing dynamic fact ensembles for diverse questions. Considering that CoMem synchronizes the attention detected by appearance and motion features and could thus generate incorrect attention, (Fan et al., 2019) further introduce a heterogeneous external memory module (HME) with attentional read and write operations to integrate the motion and appearance features and learn the spatio-temporal attention simultaneously.
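The core read operation of a memory network can be sketched as repeated soft addressing over memory slots. This is a simplified sketch, assuming the same vectors serve for addressing and for output and that no parameters are learned; it illustrates the multi-hop idea only and is not the architecture of any specific model cited above. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def _softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query: Vector, memory: List[Vector]) -> Vector:
    """One read: address slots by softmax(query . slot), return the weighted sum."""
    weights = _softmax([sum(q * m for q, m in zip(query, slot)) for slot in memory])
    dim = len(memory[0])
    return [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]


def multi_hop_read(query: Vector, memory: List[Vector], hops: int) -> Vector:
    """Refine the query over several hops by adding what was read each time."""
    state = list(query)
    for _ in range(hops):
        read = memory_read(state, memory)
        state = [s + r for s, r in zip(state, read)]
    return state
```

Because every slot stays addressable at every hop, information stored early in a long video or subtitle stream remains reachable, which is the property that recurrent encoders tend to lose.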
Transformer. Transformer (Vaswani et al., 2017) has a good ability to model long-term relationships and has demonstrated promising performance for modeling multi-modal vision-language tasks such as VideoQA, with pre-training on large-scale datasets (Zhu and Yang, 2020). Motivated by the success of Transformer, (Li et al., 2019) first introduce the architecture of Transformer without pre-training to VideoQA (PSAC), which consists of two positional self-attention blocks to replace LSTM, and a video-question co-attention block to simultaneously attend to both visual and textual information. (Yang et al., 2020) and (Urooj et al., 2020) incorporate the pre-trained language-based Transformer (BERT) (Kenton and Toutanova, 2019) into movie and story understanding, which requires more modeling of language such as subtitles and dialogues. Both works process each input modality, such as video and subtitles, together with the question and candidate answer in a separate stream, and fuse the streams at a late stage for the final answer.
More recently, (Lei et al., 2021) apply the image-text pre-trained Transformer for cross-modal pre-training and fine-tune it for downstream video-text tasks, such as VideoQA. (Yang et al., 2021a) train a VideoQA model on a large-scale dataset with 69M video-question-answer triplets, using contrastive learning between a multi-modal video-question Transformer and an answer Transformer. This video-text pre-trained Transformer can be further fine-tuned on other downstream VideoQA tasks, which shows the benefits of task-specific pre-training for the target VideoQA task. Furthermore, (Zellers et al., 2021) train a cross-modal Transformer (MERLOT) in a label-free, self-supervised manner, based on 180M video segments with image frames and words. Similar to MERLOT, VIOLET (Fu et al., 2021) is another end-to-end video-text pre-trained Transformer model but with a more advanced video encoder and proxy tasks.
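Contrastive learning between a video-question embedding and answer embeddings can be illustrated with a softmax cross-entropy over in-batch candidates. This is a generic sketch, assuming dot-product similarity and that the i-th answer is the positive for the i-th video-question pair; it is not claimed to be the exact objective of any cited model, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(video_question_embeddings: List[Vector],
                     answer_embeddings: List[Vector]) -> float:
    """Mean cross-entropy where answer i is the positive for video-question i.

    All other answers in the batch act as negatives.
    """
    if len(video_question_embeddings) != len(answer_embeddings):
        raise ValueError("need one answer embedding per video-question embedding")
    losses = []
    for i, vq in enumerate(video_question_embeddings):
        logits = [dot(vq, answer) for answer in answer_embeddings]
        peak = max(logits)
        # log of the softmax normalizer, computed stably.
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)
```

Minimizing this loss pulls each video-question embedding toward its own answer and away from the other answers in the batch, which is what lets the answer encoder be reused at test time to score arbitrary candidate answers.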
While the aforementioned Transformer-style models have demonstrated strong performance on popular Factoid VideoQA datasets (refer to our analysis in Sec. 3.2), recent works (Buch et al., 2022; Xiao et al., 2022b) reveal that their performance is weak in answering questions that emphasize visual relation reasoning, especially the temporal and causal relations which feature video dynamics. Furthermore, their demand for large-scale video data for pre-training and their lack of explainability largely limit their adoption. Such weaknesses call for more effort in developing foundation models for fine-grained video reasoning that also require fewer computational resources and offer better interpretability.
Graph Neural Networks. Graph-structured techniques (Kipf and Welling, 2017; Zhang et al., 2022) have recently become more favoured for improving the reasoning ability of VideoQA models, especially as Inference VideoQA draws the community's attention (Jang et al., 2017; Xiao et al., 2021). HGA (Jiang and Han, 2020), and more recent works, B2A (Park et al., 2021) and DualVGR (Wang et al., 2021), build the graphs based on coarse-grained video segments. Yet, they incorporate both intra- and inter-modal relationship learning and achieve good performance. To gain object-level information, (Huang et al., 2020) build the graph (LGCN) based on objects represented by their appearance and location features. They model the interaction between objects related to questions with GNN (Kipf and Welling, 2017).
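A single graph convolution step in the style of Kipf and Welling (2017) can be sketched as follows: node features (for example, object or clip features) are averaged over neighbours with a symmetrically normalized adjacency matrix that includes self-loops, then linearly transformed and passed through ReLU. This is a minimal pure-Python sketch with hypothetical function names; how the graph itself is built from a video is left out and differs across the cited methods.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(adjacency: Matrix) -> Matrix:
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adjacency)
    with_loops = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One layer: ReLU(normalized_adjacency @ features @ weights)."""
    propagated = matmul(normalize_adjacency(adjacency), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]
```

Stacking such layers lets information travel over longer paths in the graph, so relations between objects that never co-occur in one frame can still influence each other's representation.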
Considering that the video elements are hierarchical in semantic space, (Liu et al., 2021a), (Peng et al., 2021) and (Xiao et al., 2022a) incorporate hierarchical learning into graph networks. Specifically, (Liu et al., 2021a) propose a graph memory mechanism (HAIR) to perform relational vision-semantic reasoning from object level to frame level; (Peng et al., 2021) concatenate different-level graphs, that is, object-level, frame-level, and clip-level, progressively to learn the visual relations (PGAT). (Xiao et al., 2022a) propose a hierarchical conditional graph model (HQGA) to weave together visual facts from low-level entities to higher-level video elements through graph aggregation and pooling, to enable vision-text matching at multi-granularity levels. To leverage the semantics of the 3D scene, (Cherian et al., 2022) transfer the video frames to a 2.5D (pseudo-3D) scene graph and then split it into static and dynamic sub-graphs, allowing the pruning of redundant detections.
With a good ability for information communication, graph architectures have shown promising results on inference VideoQA. Nonetheless, the emphasis and difficulty lie in how to skillfully design the graph structure for video representation.
Modular Networks. (Le et al., 2020) find that most VideoQA models design tailor-made network architectures. They point out that such hand-crafted architectures are inflexible in dealing with varied data modalities, video lengths, and question types. Therefore, they design a reusable neural unit, the Conditional Relation Network (CRN), which captures the relations of input features given the global context and encapsulates them hierarchically to form networks. Such a constituted architecture has shown better generalization ability and flexibility in handling different types of questions. Following a similar design philosophy, (Dang et al., 2021) and (Xiao et al., 2022a) design the spatio-temporal graph and conditional graph, respectively, as neural building blocks. The neural building blocks are hierarchically stacked to achieve good reasoning performance. While the above works repeat a single module for VideoQA, (Qian et al., 2022) recently design multiple modules tailored for compositional video question answering (Grunde-McLaughlin et al., 2021) and have also demonstrated success. Overall, modular networks offer improved flexibility and transparency. Nonetheless, they either lack explicit logic for reasoning (Le et al., 2020; Dang et al., 2021; Xiao et al., 2022a), or can only handle questions that can be parsed into pre-defined subtasks of limited scope.
Neural-Symbolic. (Yi et al., 2020) point out two essential elements for causal reasoning in VideoQA are object-centric video representation that is aware of the temporal and causal relations between the objects and events, and a dynamics model that is able to predict the object dynamics under unobserved or counterfactual scenarios. Motivated by the neural-symbolic method in ImageQA (Yi et al., 2018), (Yi et al., 2020) propose the NS-DR model, which extracts object-level representation with a video parser, turns a question into a functional program, extracts and predicts the dynamic scene of the video with a dynamics predictor, and runs the program on the dynamic scene to obtain an answer. NS-DR aims to combine neural nets for pattern recognition and dynamics prediction, and symbolic logic for causal reasoning. It achieves significant gain on the explanatory, predictive, and counterfactual questions on the synthetic object dataset (Yi et al., 2020). (Chen et al., 2021) and (Ding et al., 2021) promote further progress.
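The final step of a neural-symbolic pipeline, running a functional program on a symbolic scene, can be illustrated with a toy executor. This sketch assumes the video parser and question parser have already produced their outputs; the scene layout, the operation names (`filter`, `collision_partners`, `query`, `count`), and the example data are all hypothetical and much simpler than the program language of NS-DR.

```python
from typing import Any, Dict, List, Tuple

Scene = Dict[str, Any]
Step = Tuple[str, Any]

# Hypothetical parsed scene: object attributes plus observed collision events.
EXAMPLE_SCENE: Scene = {
    "objects": [
        {"id": 0, "color": "red", "shape": "cube"},
        {"id": 1, "color": "blue", "shape": "sphere"},
        {"id": 2, "color": "green", "shape": "cylinder"},
    ],
    "collisions": [(0, 1)],
}

# Hypothetical program for "What shape is the object that hits the red object?"
EXAMPLE_PROGRAM: List[Step] = [
    ("filter", ("color", "red")),
    ("collision_partners", None),
    ("query", "shape"),
]


def run_program(program: List[Step], scene: Scene) -> Any:
    """Execute (operation, argument) steps in order on the symbolic scene."""
    selected = list(scene["objects"])
    for operation, argument in program:
        if operation == "filter":
            key, value = argument
            selected = [obj for obj in selected if obj.get(key) == value]
        elif operation == "collision_partners":
            ids = {obj["id"] for obj in selected}
            partners = set()
            for first, second in scene["collisions"]:
                if first in ids:
                    partners.add(second)
                if second in ids:
                    partners.add(first)
            selected = [obj for obj in scene["objects"] if obj["id"] in partners]
        elif operation == "query":
            return [obj[argument] for obj in selected]
        elif operation == "count":
            return len(selected)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return selected
```

Because every intermediate selection is an explicit list of objects, each reasoning step can be inspected, which is the interpretability benefit usually attributed to this family of methods. Counterfactual or predictive questions would additionally require a dynamics model to supply the events, which this sketch does not include.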
Despite the good reasoning ability of Neural-Symbolic methods on synthetic datasets, they are currently hard to apply to unconstrained videos with open-form natural questions.
Others. There are also flexibly designed networks that address specific problems. For example, (Kim et al., 2020) propose a framework that first detects a specific temporal moment from moments-of-interest candidates for temporally aligned video and subtitles using pre-defined sliding windows, and then fuses information based on the localized moment using intra-modal and cross-modal attention mechanisms. Because these networks focus on specific purposes, the question remains whether they can be generalized to other VideoQA tasks. Studies are also conducted in terms of input information. (Falcon et al., 2020) explore several data augmentation techniques to prevent overfitting with only small-scale datasets. (Kim et al., 2021a) point out that existing works suffer from significant computational complexity and insufficient representation capability, and they introduce VideoQA features obtained from the coded video bit-stream to address the problem. To overcome spurious visual-linguistic correlations, (Li et al., 2022d,c) explore robust and trustworthy grounding frameworks from causal theory, which are promising for enhancing the accuracy and trustworthiness of SOTA models.
While the above efforts focus on a better video representation for question answering, a handful of works (Xue et al., 2018; Hong et al., 2019) also pay attention to the language side by preserving the syntactic structure (Fei et al., 2022, 2021) of the questions, which also shows advantages.
We analyze the advanced methods for Factoid VideoQA in Table 2 and Inference VideoQA in Table 3 based on the results reported on popular VideoQA benchmarks. Apart from normal VideoQA, advanced methods for MM VideoQA and KB VideoQA are also summarized in Table 4. Table 2 reveals that the cross-modal pre-trained Transformer-style models achieve better performance on factoid QA than other methods. Among methods without pre-training, graph-structured techniques are the most popular and have also shown great potential. It would be interesting to explore cross-modal pre-training of graph models for VideoQA. Besides, hierarchical learning and fine-grained object features usually help to improve performance. In addition to the datasets given in Table 2, the recent iVQA (Yang et al., 2021a) dataset has also received increasing attention, and we believe its high quality could make it a more effective dataset for open-ended VideoQA.
Inference VideoQA is a nascent task that mainly challenges visual relation reasoning over video information. It also receives increasing attention. Graph-structured techniques, causal discovery, and hierarchical learning have shown promising performance (see Table 3). Notably, we find that cross-modal pre-training and fine-tuning not only achieve good performance on factoid VideoQA, but also significantly improve the results on inference VideoQA. In particular, the accuracies of reasoning tasks on TGIF-QA reach unprecedentedly high levels. This dataset is likely not challenging enough and has serious language bias, as revealed by recent studies (Peng et al., 2021; Piergiovanni et al., 2022; Xiao et al., 2022b). In contrast, NExT-QA is much more challenging; it emphasizes causal and temporal relation reasoning between multiple objects in real-world videos. Table 3 shows that SOTA methods still struggle on NExT-QA. As such, NExT-QA could be a more effective benchmark for visual reasoning over realistic video contents under natural language instructions. Additionally, NExT-QA also contains an open-ended QA task that provides ample challenge for existing research.
Table 2: Performance on Factoid VideoQA tasks. (Att: Attention, MG: Multi-Granularity, HL: Hierarchical Learning, CM-PF: Cross-modal Pre-training and Fine-tuning, Mem: Memory, GNN: Graph Neural Networks, MN: Modular Networks, TF: Transformer. RN: ResNet at frame-level, RX(3D): 3D ResNeXt at clip-level, RoI: Region-of-interest features from Faster R-CNN, GV: GloVe, BT: BERT, VG: Visual Genome (Krishna et al., 2017), YT-T: Youtube-Temporal-180M (Zellers et al., 2021), Web: WebVid2M (Bain et al., 2021), CC: Conceptual Captions3M (Sharma et al., 2018). ViT (Dosovitskiy et al., 2020) and VSwin (Liu et al., 2021b) are Transformer-style visual encoders. Attention is found in all methods, but we omit it for those methods that do not emphasize attention.)
MM and KB VideoQA require models to locate and perform reasoning in all heterogeneous modalities for answering the question. Similar to normal VideoQA, MM VideoQA also benefits from advanced networks and large-scale datasets. However, it is worth noting that modality shifting ability is essential (Kim et al., 2020; Engin et al., 2021).
From Recognition to Reasoning. Advanced neural network models excel at recognizing objects, attributes, and even actions in visual data. Thus, answering questions like “what is” is no longer the core of VideoQA. To enable more meaningful and in-depth human-machine interaction, it is urgent to study the causal and temporal relations between objects, actions, and events (Xiao et al., 2021). Such problems feature video-level understanding and demand inference ability for question answering. The focus on inference questions promotes research towards the core of human intelligence, which could be one of the “north stars” towards groundbreaking works (Fei-Fei and Krishna, 2022).
Knowledge VideoQA. To answer questions that go beyond the visual scene, it is crucial to inject knowledge into the reasoning stage (Jin et al., 2019; Garcia et al., 2020; Zhuang et al., 2020; Wu et al., 2021b). Knowledge incorporation can not only greatly extend the scope of questions that can be asked about videos, but also enable the exploration of more human-like inference, because humans naturally answer questions that may involve commonsense (Fang et al., 2020; Chadha et al., 2020) or domain-specific knowledge (Xu et al., 2021; Gao et al., 2021). Reasoning with knowledge and diagnosing the retrieved knowledge for a specific question will help to enhance the model’s interpretability and trustworthiness. It will also serve as important groundwork for future multi-modal conversation systems (Nie et al., 2019; Li et al., 2022e).
Cross-modal Pre-training and Fine-tuning. Cross-modal pre-trained representations (Zellers et al., 2021; Fu et al., 2021) have shown great benefit for VideoQA (see Table 2 and 3). However, most models only demonstrate their good performance on VideoQA tasks that challenge the recognition or shallow description of the video contents. Also, it significance towards practical multi-modal QA systems and multi-modal conversation systems.
Table 3: Performance on Inference VideoQA tasks. For the counting (Cnt) task in TGIF-QA, the mean squared error (MSE) is reported.
This paper gives a quick overview of the broad field of video question answering. We mainly categorized the related datasets and techniques. We also discussed some meaningful insights and analyzed the performance of different techniques on different types of datasets. Finally, we outlined several promising future directions. With these efforts, we hope this survey can shed light on VideoQA and attract more research to it, and eventually foster more effort towards strong AI systems that can demonstrate their understanding of the dynamic visual world by making meaningful responses to our natural language instructions or queries.