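1 Introduction
Combining image understanding with natural language processing is a long-standing challenge in artificial intelligence. This paper addresses the learning problem of answering questions by combining image and text information, namely the image question answering (QA) task. Recent work has actively studied learning from images and text using techniques such as convolutional neural networks (CNNs) and word embeddings. Through object recognition and the analysis of large-scale text corpora, these studies have made it possible to learn high-level representations from text and image data. In this work, we propose a new question-answering model that uses visual-semantic embeddings to combine a CNN with a recurrent neural network (RNN). The model treats the image as part of the text question and fuses the information extracted from both types of data to produce an answer.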
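2 Related Work
Research on image-based QA has mainly proceeded by extracting information from an image and answering natural language questions based on that information. Malinowski and Fritz developed DAQUAR, the first dataset for QA on real-world images, which contains question-answer pairs about indoor images. This line of work mainly attempted to answer questions by combining semantic parsing with image segmentation. However, such approaches depend on a specific dataset and rely heavily on the accuracy of the image segmentation algorithm. They are also inefficient, for example because they must compute all possible spatial relations within an image. To overcome these limitations, this paper proposes a new approach that integrates a CNN and an RNN.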
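3 Proposed Method
3.1 Model Design
The model proposed in this work treats the image as part of the question and fuses it with the text information to answer the question. Specifically, a CNN extracts features from the image, and these features are used as input to the RNN. A linear transformation between the image feature vector and the word embedding vectors is applied so that the two data sources are integrated efficiently.
\[\textbf{v}_{\text{img}} = W_{\text{img}} \cdot \textbf{f}_{\text{CNN}}(\text{image}) + \textbf{b}_{\text{img}}\]
Here \(\textbf{v}_{\text{img}}\) is the transformed image feature vector, \(W_{\text{img}}\) is a learnable weight matrix, \(\textbf{f}_{\text{CNN}}(\text{image})\) is the original feature vector extracted from the image, and \(\textbf{b}_{\text{img}}\) is a bias vector. The transformed image vector is used as the initial input to the RNN, so that image information is integrated directly into the interpretation of the question.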
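3.2 Question-Answer Generation
To create a new QA dataset, we developed an algorithm that automatically generates questions from image descriptions. Image labeling information is used in this process to generate more accurate and more varied questions. The generated questions mainly take forms such as ‘what’, ‘where’, and ‘how many’. The automatic question generation process involves the following mathematical procedure.
Questions obtained through this process are used to train a machine learning model to produce answers based on information associated with the image.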
Combining image understanding and natural language interaction is one of the grand dreams of artificial intelligence. We are interested in the problem of jointly learning image and text through a question-answering task. Recently, researchers studying image caption generation [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] have developed powerful methods of jointly learning from image and text inputs to form higher level representations from models such as convolutional neural networks (CNNs) trained on object recognition, and word embeddings trained on large scale text corpora. Image QA involves an extra layer of interaction between humans and computers. Here the model needs to pay attention to details of the image instead of describing it in a vague sense. The problem also combines many computer vision sub-problems such as image labeling and object detection.
In this paper we present our contributions to the problem: a generic end-to-end QA model using visual semantic embeddings to connect a CNN and a recurrent neural net (RNN), as well as comparisons to a suite of other models; an automatic question generation algorithm that converts description sentences into questions; and a new QA dataset (COCO-QA) that was generated using the algorithm, and a number of baseline results on this new dataset.
In this work we assume that the answers consist of only a single word, which allows us to treat the problem as a classification problem. This also makes the evaluation of the models easier and more robust, avoiding the thorny evaluation issues that plague multi-word generation problems.
Malinowski and Fritz [11] released a dataset with images and question-answer pairs, the DAtaset for QUestion Answering on Real-world images (DAQUAR). All images are from the NYU depth v2 dataset [12], and are taken from indoor scenes. Human segmentation, image depth values, and object labeling are available in the dataset. The QA data has two sets of configurations, which differ by the number of object classes appearing in the questions (37-class and 894-class). There are mainly three types of questions in this dataset: object type, object color, and number of objects. Some questions are easy but many questions are very hard to answer even for humans. Since DAQUAR is the only publicly available image-based QA dataset, it is one of our benchmarks to evaluate our models.
DAQUAR 1553: What is there in front of the sofa? Ground truth: table. IMG+BOW: table (0.74). 2-VIS+BLSTM: table (0.88). LSTM: chair (0.47).
COCOQA 5078: How many leftover donuts is the red bicycle holding? Ground truth: three. IMG+BOW: two (0.51). 2-VIS+BLSTM: three (0.27). BOW: one (0.29).
COCOQA 1238: What is the color of the teeshirt? Ground truth: blue. IMG+BOW: blue (0.31). 2-VIS+BLSTM: orange (0.43). BOW: green (0.38).
COCOQA 26088: Where is the gray cat sitting? Ground truth: window. IMG+BOW: window (0.78). 2-VIS+BLSTM: window (0.68). BOW: suitcase (0.31).
Figure 1: Sample questions and responses of a variety of models. Correct answers are in green and incorrect in red. The numbers in parentheses are the probabilities assigned to the top-ranked answer by the given model. The leftmost example is from the DAQUAR dataset, and the others are from our new COCO-QA dataset.
Together with the release of the DAQUAR dataset, Malinowski and Fritz presented an approach which combines semantic parsing and image segmentation. Their approach is notable as one of the first attempts at image QA, but it has a number of limitations. First, the human-defined set of possible predicates is very dataset-specific. To obtain the predicates, their algorithm also depends on the accuracy of the image segmentation algorithm and image depth information. Second, their model needs to compute all possible spatial relations in the training images. Even though the model limits this to the nearest neighbors of the test images, it could still be an expensive operation in larger datasets. Lastly, the accuracy of their model is not very strong. We show below that some simple baselines perform better.
Very recently there have been a number of parallel efforts on both creating datasets and proposing new models [13, 14, 15, 16]. Both Antol et al. [13] and Gao et al. [15] used MS-COCO [17] images and created an open domain dataset with human generated questions and answers. In Antol et al.’s work, the authors also included cartoon pictures besides real images. Some questions require logical reasoning in order to answer correctly. Both Malinowski et al. [14] and Gao et al. [15] use recurrent networks to encode the sentence and output the answer. Whereas Malinowski et al. use a single network to handle both encoding and decoding, Gao et al. used two networks, a separate encoder and decoder. Lastly, bilingual (Chinese and English) versions of the QA dataset are available in Gao et al.’s work. Ma et al. [16] use CNNs to extract both image features and sentence features, and fuse the features together with another multi-modal CNN.
Our approach was developed independently of the work above. Similar to the work of Malinowski et al. and Gao et al., we also experimented with recurrent networks to consume the sequential question input. Unlike Gao et al., we formulate the task as a classification problem, as there is no single well-accepted metric to evaluate sentence-form answer accuracy [18]. Thus, we place more focus on a limited domain of questions that can be answered with one word. We also formulate and evaluate a range of other algorithms that utilize various representations drawn from the question and image, on these datasets.
The methodology presented here is two-fold. On the model side we develop and apply various forms of neural networks and visual-semantic embeddings on this task, and on the dataset side we propose new ways of synthesizing QA pairs from currently available image description datasets.
Figure 2: VIS+LSTM Model
In recent years, recurrent neural networks (RNNs) have enjoyed some successes in the field of natural language processing (NLP). Long short-term memory (LSTM) [19] is a form of RNN which is easier to train than standard RNNs because of its linear error propagation and multiplicative gatings. Our model builds directly on top of the LSTM sentence model and is called the “VIS+LSTM” model. It treats the image as one word of the question. We borrowed this idea of treating the image as a word from caption generation work done by Vinyals et al. [1]. We compare this newly proposed model with a suite of simpler models in the Experimental Results section.
We use the last hidden layer of the 19-layer Oxford VGG Conv Net [20] trained on the ImageNet 2014 Challenge [21] as our visual embeddings. The CNN part of our model is kept frozen during training.
We experimented with several different word embedding models: randomly initialized embedding, dataset-specific skip-gram embedding and general-purpose skip-gram embedding model [22]. The word embeddings are trained with the rest of the model.
We then treat the image as if it is the first word of the sentence. Similar to DeViSE [23], we use a linear or affine transformation to map 4096 dimension image feature vectors to a 300 or 500 dimensional vector that matches the dimension of the word embeddings.
We can optionally treat the image as the last word of the question as well through a different weight matrix and optionally add a reverse LSTM, which gets the same content but operates in a backward sequential fashion.
The LSTM(s) outputs are fed into a softmax layer at the last timestep to generate answers.
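The pipeline described above (project the image feature, feed it as the first "word", run an LSTM over the question, and apply a softmax at the last timestep) can be sketched in a few lines of NumPy. This is only an illustrative sketch with randomly initialised, untrained weights: the hidden size, vocabulary size, number of answer classes, and all function names are assumptions made for the example, and the gate layout is that of a standard LSTM rather than a detail taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, EMB_DIM = 4096, 300          # dimensions mentioned in the text
HID_DIM = 150                         # assumed hidden size
VOCAB_SIZE, NUM_ANSWERS = 1000, 50    # assumed toy sizes


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()


# Randomly initialised parameters (untrained; for shape illustration only).
W_img = rng.normal(0, 0.01, (EMB_DIM, IMG_DIM))   # maps image into word space
b_img = np.zeros(EMB_DIM)
E = rng.normal(0, 0.1, (VOCAB_SIZE, EMB_DIM))     # word embedding table
W_lstm = rng.normal(0, 0.1, (4 * HID_DIM, EMB_DIM + HID_DIM))
b_lstm = np.zeros(4 * HID_DIM)
W_out = rng.normal(0, 0.1, (NUM_ANSWERS, HID_DIM))
b_out = np.zeros(NUM_ANSWERS)


def lstm_step(x, h, c):
    """One standard LSTM step: input, forget, output gates and a candidate."""
    z = W_lstm @ np.concatenate([x, h]) + b_lstm
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def vis_lstm_forward(cnn_features, question_word_ids):
    """Return a probability distribution over single-word answers."""
    v_img = W_img @ cnn_features + b_img                   # affine map to EMB_DIM
    inputs = [v_img] + [E[w] for w in question_word_ids]   # image as first "word"
    h, c = np.zeros(HID_DIM), np.zeros(HID_DIM)
    for x in inputs:
        h, c = lstm_step(x, h, c)
    return softmax(W_out @ h + b_out)                      # softmax at last timestep


# Stand-in CNN feature vector and hypothetical word ids for a 4-word question.
probs = vis_lstm_forward(rng.normal(size=IMG_DIM), [12, 7, 404, 3])
print(probs.shape, float(probs.sum()))
```

Because the image occupies a timestep like any other token, the same loop would also cover the optional variant in which a second projection of the image is appended as the last word.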
The currently available DAQUAR dataset contains approximately 1500 images and 7000 questions on 37 common object classes, which might not be enough for training large complex models. Another problem with the current dataset is that simply guessing the modes can yield very good accuracy.
We aim to create another dataset, to produce a much larger number of QA pairs and a more even distribution of answers. While collecting human generated QA pairs is one possible approach, and another is to synthesize questions based on image labeling, we instead propose to automatically convert descriptions into QA form. In general, objects mentioned in image descriptions are easier to detect than the ones in DAQUAR’s human generated questions or in synthetic QAs based on ground truth labeling. This allows the model to rely more on rough image understanding without any logical reasoning. Lastly, the conversion process preserves the language variability in the original description, and results in more human-like questions than questions generated from image labeling.
As a starting point we used the MS-COCO dataset [17], but the same method can be applied to any other image description dataset, such as Flickr [24], SBU [25], or even the internet.
We used the Stanford parser [26] to obtain the syntactic structure of the original image description. We also utilized the following strategies for forming the questions.
In English, questions tend to start with interrogative words such as “what”. The algorithm needs to move the verb as well as the “wh-” constituent to the front of the sentence. For example: “A man is riding a horse” becomes “What is the man riding?” In this work we consider the following two simple constraints: (1) the A-over-A principle, which restricts the movement of a wh-word inside a noun phrase (NP) [27]; (2) our algorithm does not move any wh-word that is contained in a clause constituent.
Question generation is still an open-ended topic. Overall, we adopt a conservative approach to generating questions in an attempt to create high-quality questions. We consider generating four types of questions below:
1. Object Questions: First, we consider asking about an object using “what”. This involves replacing the actual object with a “what” in the sentence, and then transforming the sentence structure so that the “what” appears in the front of the sentence. The entire algorithm has the following stages: (1) Split long sentences into simple sentences; (2) Change indefinite determiners to definite determiners; (3) Traverse the sentence and identify potential answers and replace with “what”. During the traversal of object-type question generation, we currently ignore all the prepositional phrase (PP) constituents; (4) Perform wh-movement. In order to identify a possible answer word, we used WordNet [28] and the NLTK software package [29] to get noun categories.
2. Number Questions: We follow a similar procedure as the previous algorithm, except for a different way to identify potential answers: we extract numbers from original sentences. Splitting compound sentences, changing determiners, and wh-movement parts remain the same.
3. Color Questions: Color questions are much easier to generate. This only requires locating the color adjective and the noun to which the adjective attaches. Then it simply forms a sentence “What is the color of the [object]” with the “object” replaced by the actual noun.
4. Location Questions: These are similar to generating object questions, except that now the answer traversal will only search within PP constituents that start with the preposition “in”. We also added rules to filter out clothing so that the answers will mostly be places, scenes, or large objects that contain smaller objects.
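Of the four question types, color questions are the simplest to illustrate. The sketch below is a simplified stand-in, not the authors' implementation: it assumes the description has already been tagged with Penn Treebank part-of-speech tags (instead of using a full parse), uses a small hypothetical color vocabulary, and attaches a color adjective to the last noun of the noun run that follows it.

```python
# Assumed color vocabulary for illustration only.
COLORS = {"red", "blue", "green", "white", "black", "yellow",
          "orange", "brown", "gray", "pink", "purple"}


def color_questions(tagged_tokens):
    """Generate (question, answer) pairs from one POS-tagged description.

    tagged_tokens: list of (word, tag) pairs using Penn Treebank tags.
    A color adjective is attached to the head of the noun run that follows
    it, skipping any other adjectives in between (a simplification).
    """
    pairs = []
    for idx, (word, tag) in enumerate(tagged_tokens):
        if tag != "JJ" or word.lower() not in COLORS:
            continue
        j = idx + 1
        # Skip further adjectives, e.g. "red wooden chair".
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "JJ":
            j += 1
        # Take the last noun of a compound such as "leather suitcase".
        head = None
        while j < len(tagged_tokens) and tagged_tokens[j][1].startswith("NN"):
            head = tagged_tokens[j][0]
            j += 1
        if head is not None:
            question = f"What is the color of the {head.lower()}?"
            pairs.append((question, word.lower()))
    return pairs


sentence = [("A", "DT"), ("gray", "JJ"), ("cat", "NN"), ("sits", "VBZ"),
            ("on", "IN"), ("a", "DT"), ("red", "JJ"), ("leather", "NN"),
            ("suitcase", "NN")]
for question, answer in color_questions(sentence):
    print(question, "->", answer)
```

The object, number, and location types additionally need sentence splitting and wh-movement over a parse tree, which is why they are not covered by this token-level sketch.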
We rejected the answers that appear too rarely or too often in our generated dataset. After this QA rejection process, the frequency of the most common answer words was reduced from 24.98% down to 7.30% in the test set of COCO-QA.
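The text does not spell out the rejection rule, so the following is only one possible scheme, shown to make the idea concrete: answers seen fewer than a minimum number of times are dropped, and each remaining answer is capped at a maximum number of QA pairs. Both thresholds and the function name are hypothetical and are not values from the paper.

```python
from collections import Counter, defaultdict


def filter_qa_pairs(qa_pairs, min_count=2, max_count=3):
    """Drop rare answers and cap frequent ones.

    qa_pairs: list of (question, answer) tuples.
    min_count, max_count: hypothetical thresholds, not values from the paper.
    """
    totals = Counter(answer for _, answer in qa_pairs)
    kept, seen = [], defaultdict(int)
    for question, answer in qa_pairs:
        if totals[answer] < min_count:
            continue                  # answer appears too rarely overall
        if seen[answer] >= max_count:
            continue                  # answer already kept often enough
        seen[answer] += 1
        kept.append((question, answer))
    return kept


pairs = [("q1", "two"), ("q2", "two"), ("q3", "two"), ("q4", "two"),
         ("q5", "red"), ("q6", "red"), ("q7", "eagle")]
filtered = filter_qa_pairs(pairs)
print(Counter(answer for _, answer in filtered))
```

Any rule of this kind flattens the answer distribution, which is the effect the reported drop in the most common answer's frequency reflects.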
Table 1 summarizes the statistics of COCO-QA. It should be noted that since we applied the QA pair rejection process, mode-guessing performs very poorly on COCO-QA. However, COCO-QA questions are actually easier to answer than DAQUAR from a human point of view. This encourages the model to exploit salient object relations instead of exhaustively searching all possible relations. The COCO-QA dataset can be downloaded at http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa
Table 1: COCO-QA question type break-down
Here we provide some brief statistics of the new dataset. The maximum question length is 55, and the average is 9.65. The most common answers are “two” (3116, 2.65%), “white” (2851, 2.42%), and “red” (2443, 2.08%). The least common are “eagle” (25, 0.02%), “tram” (25, 0.02%), and “sofa” (25, 0.02%). The median answer is “bed” (867, 0.737%). Across the entire test set (38,948 QAs), 9072 (23.29%) overlap in training questions, and 7284 (18.70%) overlap in training question-answer pairs.
VIS+LSTM: The first model is the CNN and LSTM with a dimensionality-reduction weight matrix in the middle; we call this “VIS+LSTM” in our tables and figures.
2-VIS+BLSTM: The second model has two image feature inputs, at the start and the end of the sentence, with different learned linear transformations, and also has LSTMs going in both the forward and backward directions. Both LSTMs output to the softmax layer at the last timestep. We call the second model “2-VIS+BLSTM”.
IMG+BOW: This simple model performs multinomial logistic regression based on the image features without dimensionality reduction (4096 dimension), and a bag-of-word (BOW) vector obtained by summing all the learned word vectors of the question.
FULL: Lastly, the “FULL” model is a simple average of the three models above.
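As a rough illustration of the two simplest entries in this list, the sketch below implements an IMG+BOW style forward pass and a FULL style ensemble in NumPy. Several details are assumptions made for the example: the image features and the summed word vectors are concatenated before a single softmax layer, the ensemble averages the models' output probabilities, and the weights, vocabulary size, and number of answers are random toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, EMB_DIM = 4096, 300          # dimensions mentioned in the text
VOCAB_SIZE, NUM_ANSWERS = 1000, 50    # assumed toy sizes

# Randomly initialised parameters (untrained; for illustration only).
E = rng.normal(0, 0.1, (VOCAB_SIZE, EMB_DIM))
W = rng.normal(0, 0.01, (NUM_ANSWERS, IMG_DIM + EMB_DIM))
b = np.zeros(NUM_ANSWERS)


def img_bow_forward(cnn_features, question_word_ids):
    """Multinomial logistic regression on [image features; summed word vectors]."""
    bow = E[question_word_ids].sum(axis=0)        # bag-of-words vector
    x = np.concatenate([cnn_features, bow])       # assumed fusion: concatenation
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()


def average_ensemble(prob_list):
    """Average the answer distributions of several models (assumed FULL rule)."""
    return np.mean(np.stack(prob_list), axis=0)


image = rng.normal(size=IMG_DIM)                  # stand-in CNN features
p_bow = img_bow_forward(image, [12, 7, 404, 3])
# Stand-ins for the outputs of the two LSTM-based models.
p_a = rng.dirichlet(np.ones(NUM_ANSWERS))
p_b = rng.dirichlet(np.ones(NUM_ANSWERS))
p_full = average_ensemble([p_bow, p_a, p_b])
print(int(p_full.argmax()), float(p_full.sum()))
```

Since the word vectors are summed, the IMG+BOW input is invariant to word order, which is the property the later comparison with the LSTM models relies on.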
We release the complete details of the models at https://github.com/renmengye/imageqa-public.
To evaluate the effectiveness of our models, we designed a few baselines.
GUESS: One very simple baseline is to predict the mode based on the question type. For example, if the question contains “how many” then the model will output “two.” In DAQUAR, the modes are “table”, “two”, and “white” and in COCO-QA, the modes are “cat”, “two”, “white”, and “room”.
BOW: We designed a set of “blind” models which are given only the questions without the images. One of the simplest blind models performs logistic regression on the BOW vector to classify answers.
LSTM: Another “blind” model we experimented with simply inputs the question words into the LSTM alone.
IMG: We also trained a counterpart “deaf” model. For each type of question, we train a separate CNN classification layer (with all lower layers frozen during training). Note that this model knows the type of question, in order to make its performance somewhat comparable to models that can take into account the words to narrow down the answer space. However the model does not know anything about the question except the type.
IMG+PRIOR: This baseline combines the prior knowledge of an object and the image understanding from the “deaf model”. For example, a question asking the color of a white bird flying in the blue sky may output white rather than blue simply because the prior probability of the bird being blue is lower. We denote c as the color, o as the class of the object of interest, and x as the image. Assuming o and x are conditionally independent given the color,
\[p(c \mid o, x) = \frac{p(c, o \mid x)}{\sum_{c'} p(c', o \mid x)} = \frac{p(o \mid c)\, p(c \mid x)}{\sum_{c'} p(o \mid c')\, p(c' \mid x)}\]
where the sums run over all colors \(c'\). This can be computed if \(p(c \mid x)\) is the output of a logistic regression given the CNN features alone, and we simply estimate \(p(o \mid c)\) empirically: \(\hat{p}(o \mid c) = \frac{\text{count}(o, c)}{\text{count}(c)}\). We use Laplace smoothing on this empirical distribution.
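The IMG+PRIOR computation can be illustrated with a small numerical sketch. Under the conditional independence assumption, the posterior over colors is proportional to p(o|c) multiplied by p(c|x), normalised over colors. The counts, the classifier output, the add-one smoothing constant, and the function names below are all made up for the example.

```python
import numpy as np


def estimate_p_o_given_c(pairs, objects, colors, alpha=1.0):
    """Laplace-smoothed estimate of p(o | c) from (object, color) pairs.

    Returns an array of shape (len(colors), len(objects)); each row sums to 1.
    alpha is the smoothing constant (alpha=1 is add-one smoothing, an assumption).
    """
    o_idx = {o: i for i, o in enumerate(objects)}
    c_idx = {c: i for i, c in enumerate(colors)}
    counts = np.zeros((len(colors), len(objects)))
    for o, c in pairs:
        counts[c_idx[c], o_idx[o]] += 1
    totals = counts.sum(axis=1, keepdims=True)    # count(c) for each color
    return (counts + alpha) / (totals + alpha * len(objects))


def color_posterior(p_c_given_x, p_o_given_c, obj_index):
    """p(c | o, x), proportional to p(o | c) * p(c | x), normalised over colors."""
    unnorm = p_o_given_c[:, obj_index] * p_c_given_x
    return unnorm / unnorm.sum()


objects = ["bird", "sky"]
colors = ["white", "blue"]
# Made-up training co-occurrences of (object, color).
pairs = [("bird", "white")] * 8 + [("bird", "blue")] + [("sky", "blue")] * 9
p_o_c = estimate_p_o_given_c(pairs, objects, colors)

# Hypothetical image-only classifier output: blue dominates the picture.
p_c_x = np.array([0.4, 0.6])
posterior = color_posterior(p_c_x, p_o_c, objects.index("bird"))
print(dict(zip(colors, posterior.round(3))))
```

In this toy case the image-only classifier prefers blue, but the prior that birds are rarely blue shifts the posterior toward white, mirroring the white-bird example in the text.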
To evaluate model performance, we used the plain answer accuracy as well as the Wu-Palmer similarity (WUPS) measure [31, 32]. The WUPS calculates the similarity between two words based on the depth of their least common subsumer (their deepest shared ancestor) in the taxonomy tree. If the similarity between two words is less than a threshold then a score of zero will be given to the candidate answer. Following Malinowski and Fritz [32], we measure all models in terms of accuracy, WUPS 0.9, and WUPS 0.0.
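To make the metric concrete, the sketch below computes a thresholded Wu-Palmer score on a tiny hand-made taxonomy. The taxonomy is hypothetical and stands in for WordNet; the similarity is the usual Wu-Palmer formula, twice the depth of the deepest shared ancestor divided by the sum of the two word depths, with the root at depth 1. Scores below the threshold are set to zero, as described in the text.

```python
# Toy taxonomy (child -> parent); hypothetical, used instead of WordNet.
PARENT = {
    "entity": None,
    "furniture": "entity",
    "animal": "entity",
    "table": "furniture",
    "chair": "furniture",
    "cat": "animal",
}


def path_to_root(word):
    """Return the chain of nodes from the word up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def depth(word):
    return len(path_to_root(word))    # the root has depth 1


def wu_palmer(a, b):
    """2 * depth(deepest shared ancestor) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b (walking upward) that is also above a.
    shared = next(node for node in path_to_root(b) if node in ancestors_a)
    return 2 * depth(shared) / (depth(a) + depth(b))


def wups(predictions, ground_truths, threshold):
    """Mean thresholded Wu-Palmer score over single-word answers."""
    scores = []
    for pred, truth in zip(predictions, ground_truths):
        s = wu_palmer(pred, truth)
        scores.append(s if s >= threshold else 0.0)
    return sum(scores) / len(scores)


preds = ["table", "chair", "cat"]
truth = ["table", "table", "table"]
print(wups(preds, truth, 0.9), wups(preds, truth, 0.0))
```

With a threshold of 0.9 only near-exact matches score, so the measure behaves almost like plain accuracy, while a threshold of 0.0 gives partial credit to related words such as "chair" for "table".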
Table 2 summarizes the learning results on DAQUAR and COCO-QA. For DAQUAR we compare our results with [32] and [14]. It should be noted that our DAQUAR results are for the portion of the dataset (98.3%) with single-word answers. After the release of our paper, Ma et al. [16] claimed to achieve better results on both datasets.
Table 2: DAQUAR and COCO-QA results
Models compared: MULTI-WORLD [32], GUESS, BOW, LSTM, IMG, IMG+PRIOR, K-NN (K=31, 13), IMG+BOW, VIS+LSTM, ASK-NEURON [14], 2-VIS+BLSTM, FULL, HUMAN
From the above results we observe that our model outperforms the baselines and the existing approach in terms of answer accuracy and WUPS. Our VIS+LSTM and Malinowski et al.’s recurrent neural network model [14] achieved somewhat similar performance on DAQUAR. A simple average of all three models further boosts the performance by 1-2%, outperforming other models.
It is surprising to see that the IMG+BOW model is very strong on both datasets. One limitation of our VIS+LSTM model is that we are not able to consume image features as large as 4096 dimensions at one time step, so the dimensionality reduction may lose some useful information. We tried to give IMG+BOW a 500 dim. image vector, and it does worse than VIS+LSTM (≈48%).
Table 3: COCO-QA accuracy per category
By comparing the blind versions of the BOW and LSTM models, we hypothesize that in Image QA tasks, and in particular on the simple questions studied here, sequential word interaction may not be as important as in other natural language tasks.
It is also interesting that the blind model does not lose much on the DAQUAR dataset. We speculate that this is because the ImageNet images are very different from the indoor scene images, which are mostly composed of furniture. However, the non-blind models outperform the blind models by a large margin on COCO-QA. There are three possible reasons: (1) the objects in MS-COCO resemble the ones in ImageNet more; (2) MS-COCO images have fewer objects whereas the indoor scenes have considerable clutter; and (3) COCO-QA has more data to train complex models.
There are many interesting examples but due to space limitations we can only show a few in Figure 1 and Figure 3; full results are available at http://www.cs.toronto.edu/~mren/imageqa/results. For some of the images, we added some extra questions (the ones that have an “a” in the question ID); these provide more insight into a model’s representation of the image and question information, and help elucidate questions that our models may accidentally get correct. The parentheses in the figures represent the confidence score given by the softmax layer of the respective model.
Model Selection: We did not find that using different word embeddings has a significant impact on the final classification results. We observed that fine-tuning the word embeddings results in better performance, and that normalizing the CNN hidden image features to zero mean and unit variance helps achieve faster training. The bidirectional LSTM model can further boost the result slightly.
Object Questions: As the original CNN was trained for the ImageNet challenge, the IMG+BOW benefited significantly from its single object recognition ability. However, the challenging part is to consider spatial relations between multiple objects and to focus on details of the image. Our models only did a moderately acceptable job on this; see for instance the first picture of Figure 1 and the fourth picture of Figure 3. Sometimes a model fails to make a correct decision but outputs the most salient object, while sometimes the blind model can equally guess the most probable objects based on the question alone (e.g., chairs should be around the dining table). Nonetheless, the FULL model improves accuracy by 50% compared to the IMG model, which shows the difference between pure object classification and image question answering.
Counting: In DAQUAR, we could not observe any advantage in the counting ability of the IMG+BOW and the VIS+LSTM model compared to the blind baselines. In COCO-QA there is some observable counting ability in very clean images with a single object type. The models can sometimes count up to five or six. However, as shown in the second picture of Figure 3, the ability is fairly weak as they do not count correctly when different object types are present. There is a lot of room for improvement in the counting task, and in fact this could be a separate computer vision problem on its own.
Color: In COCO-QA there is a significant win for the IMG+BOW and the VIS+LSTM against the blind ones on color-type questions. We further discovered that these models are not only able to recognize the dominant color of the image but sometimes associate different colors to different objects, as shown in the first picture of Figure 3. However, they still fail on a number of easy examples. Adding prior knowledge provides an immediate gain on the IMG model in terms of accuracy on Color and Number questions. The gap between the IMG+PRIOR and IMG+BOW shows some localized color association ability in the CNN image representation.
Figure 3: Sample questions and responses of our system
In this paper, we consider the image QA problem and present our end-to-end neural network models. Our model shows a reasonable understanding of the question and some coarse image understanding, but it is still very naïve in many situations. While recurrent networks are becoming a popular choice for learning image and text, we showed that a simple bag-of-words can perform equally well compared to a recurrent network that is borrowed from an image caption generation framework [1]. We proposed a more complete set of baselines which can provide potential insight for developing more sophisticated end-to-end image question answering systems. As the currently available dataset is not large enough, we developed an algorithm that helps us collect a large-scale image QA dataset from image descriptions. Our question generation algorithm is extensible to many image description datasets and can be automated without requiring extensive human effort. We hope that the release of the new dataset will encourage more data-driven approaches to this problem in the future.
Image question answering is a fairly new research topic, and the approach we present here has a number of limitations. First, our models are just answer classifiers. Ideally we would like to permit longer answers, which will involve some sophisticated text generation model or structured output. But this will require an automatic free-form answer evaluation metric. Second, we are only focusing on a limited domain of questions. However, this limited range of questions allows us to study the results in more depth. Lastly, it is also hard to interpret why the models output a certain answer. By comparing our models with some baselines we can roughly infer whether they understood the image. Visual attention is another future direction, which could both improve the results (based on recent successes in image captioning [8]) and help explain the model prediction by examining the attention output at every timestep.