Review - Show and Tell: A Neural Image Caption Generator

Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." CVPR 2015.

Goal of the Paper

The goal of this paper is to propose an end-to-end neural network that automatically generates an English description of a given image.

Key Ideas and Intuitions

The paper uses a convolutional neural network (CNN) to encode the input image into a compact representation, and then feeds that representation into a recurrent neural network (RNN) that generates the English sentence. The model is denoted Neural Image Caption (NIC).

Proposed Model

The model is trained to maximize the probability of the correct description given the image, using the following formulation: $$\theta^* = \text{argmax}_\theta \sum_{(I, S)}\log p(S|I; \theta)$$ where $I$ is the input image, $\theta$ are the parameters of the model, $S$ is the description sentence, and the sum runs over the training pairs $(I, S)$. It is worth noting that the input image $I$ is encoded by the last hidden layer of the convolutional neural network (CNN) defined in (Ioffe, S. et al. 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift).

Using the chain rule, the log probability of the sentence sequence $S$ can be represented by $$\log p(S|I) = \sum_{t=0}^N \log p(S_t|I, S_0, \dots, S_{t-1})$$ where $S=(S_0, \dots, S_{N})$ and the parameters $\theta$ are omitted for convenience.
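As a small numerical illustration of this decomposition, the sketch below sums per-word conditional log-probabilities to obtain the log-probability of a whole sentence. The probabilities are made-up values chosen for illustration, not outputs of the paper's model, and `sentence_log_prob` is a hypothetical helper name.

```python
import math

def sentence_log_prob(cond_probs):
    """Sum log p(S_t | I, S_0..S_{t-1}) over all positions t."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical conditional probabilities for a 4-token sentence.
cond_probs = [0.9, 0.5, 0.4, 0.8]

log_p = sentence_log_prob(cond_probs)

# Exponentiating the summed log-probabilities recovers the product
# of the conditionals, i.e. the joint probability p(S | I).
joint = math.exp(log_p)
```

Working in log space turns the product of conditionals into a sum, which avoids numerical underflow for long sentences.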

In this paper, $p(S_t|I, S_0, \dots, S_{t-1})$ is modeled by a recurrent neural network (RNN), in which the variable number of words up to $t-1$ is summarized by a fixed-length hidden state $h_t$. The hidden state is then updated by a nonlinear function $f$ with each new input $x_t$, $$h_{t+1} = f(h_t, x_t)$$ where $f$ is a Long Short-Term Memory (LSTM) network, and $x_t$ is produced by an embedding model. The LSTM is trained to predict each word $S_t$ of the sentence after being fed the input image $I$ and the preceding words $(S_0, \dots, S_{t-1})$.
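To make the recurrence concrete, here is a minimal pure-Python sketch of one LSTM update, iterated over a short input sequence. It assumes the commonly used LSTM formulation with input, forget, and output gates. This review does not spell out the gate equations, so the exact cell, the omitted biases, the random weights, and the tiny sizes are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_params(input_size, hidden_size, seed=0):
    """Random weights, one matrix per gate, acting on [x_t, h_t]. Biases omitted."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: mat(hidden_size, input_size + hidden_size) for g in ("i", "f", "o", "g")}

def lstm_step(params, x_t, h_t, c_t):
    """One update h_{t+1}, c_{t+1} = f(h_t, c_t, x_t)."""
    z = x_t + h_t  # list concatenation of input and previous hidden state
    i = [sigmoid(a) for a in matvec(params["i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(params["f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(params["o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(params["g"], z)]  # candidate cell values
    c_next = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_t, i, g)]
    h_next = [ok * math.tanh(ck) for ok, ck in zip(o, c_next)]
    return h_next, c_next

# Illustrative sizes only (the review reports 512 for the real model).
params = init_params(input_size=3, hidden_size=4)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(params, x, h, c)
```

However many inputs are consumed, `h` keeps the same length, which is the sense in which a variable-length history is summarized by a fixed-length state.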

As a result, the framework of the entire model is as follows: $$x_{-1} = \text{CNN}(I)$$ $$x_t = W_e S_t, \qquad t \in \{0, \dots, N-1\}$$ $$p_{t+1} = \text{LSTM}(x_t), \qquad t \in \{0, \dots, N-1\}$$ where $x_{-1}$ is the compact CNN encoding of the input image $I$, $S_t$ is the one-hot representation of the input word, and $W_e$ is the word embedding matrix. It should be noted that, at each step, the hidden state from the previous step is also passed into the LSTM.
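The embedding step $x_t = W_e S_t$ can be checked with a toy example: because $S_t$ is one-hot, the product simply selects one column of $W_e$. The matrix values and the 3-word vocabulary below are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Hypothetical 2-dimensional embeddings for a 3-word vocabulary
# (one column per word).
W_e = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

word_index = 2
x_t = matvec(W_e, one_hot(word_index, 3))

# The same vector obtained by reading the column directly.
column = [row[word_index] for row in W_e]
```

In practice the product is usually implemented as a table lookup rather than an actual matrix multiplication, since the result is the same.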

Training minimizes the sum of the negative log-likelihoods of the correct word at each step: $$\mathcal{L}(I, S) = -\sum_{t=1}^N \log p_t(S_t)$$ The loss is minimized with respect to all parameters of the LSTM, the top layer of the CNN encoder, and the word embeddings $W_e$.
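A minimal sketch of this loss for a single caption is shown below. It assumes the model has already produced one probability distribution over the vocabulary per step. The distributions, the 3-word vocabulary, and the name `caption_nll` are hypothetical.

```python
import math

def caption_nll(step_distributions, target_indices):
    """Negative log-likelihood of the correct word, summed over the steps."""
    return -sum(
        math.log(p[s]) for p, s in zip(step_distributions, target_indices)
    )

# Hypothetical per-step distributions over a 3-word vocabulary.
step_distributions = [
    [0.5, 0.25, 0.25],
    [0.1, 0.8, 0.1],
]
# Index of the correct word at each step.
target_indices = [0, 1]

loss = caption_nll(step_distributions, target_indices)
```

The loss shrinks toward zero as the model puts more probability mass on the correct word at every step.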



Pascal VOC 2008, Flickr8k, Flickr30k, MSCOCO, SBU.


BLEU, METEOR, CIDEr, and human evaluation.


The CNN encoder weights were initialized from a pre-trained model (e.g., on ImageNet), while the LSTM and the word embeddings $W_e$ were randomly initialized.

The image descriptions were preprocessed with basic tokenization, keeping all words that appear at least 5 times in the training set.
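The vocabulary filtering can be sketched as follows. The review does not say what "basic tokenization" consists of, so lowercasing and whitespace splitting are assumptions here, and `build_vocab` is a hypothetical helper.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep every word that occurs at least `min_count` times in the captions."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()  # assumed tokenization
    )
    return sorted(word for word, count in counts.items() if count >= min_count)

# Toy training captions: "a" occurs 7 times, "dog" and "runs" 5 times,
# "cat" and "sits" only twice.
captions = ["a dog runs"] * 5 + ["a cat sits"] * 2
vocab = build_vocab(captions)
```

Words below the threshold are dropped from the vocabulary. How they are then represented in the training captions (for example with an unknown-word token) is not covered by this sketch.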

The size of the word embeddings and of the LSTM hidden state was set to 512. Dropout and ensembling were applied.

Stochastic gradient descent with a fixed learning rate and no momentum was used for optimization.
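The update rule behind this optimizer is short enough to write out. The sketch below applies it to a toy quadratic objective rather than to the captioning loss. The learning rate, the objective, and the function names are assumptions chosen for illustration.

```python
def sgd_step(params, grads, learning_rate):
    """Plain SGD: fixed learning rate, no momentum term."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

def grad_quadratic(params):
    """Gradient of the toy objective sum((p - 3)^2)."""
    return [2.0 * (p - 3.0) for p in params]

w = [0.0]
for _ in range(100):
    w = sgd_step(w, grad_quadratic(w), learning_rate=0.1)
# w[0] has moved from 0.0 to (approximately) the minimizer 3.0
```

With no momentum, each update depends only on the current gradient, so there is no optimizer state to carry between steps.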


Comparative analysis

The experiments compared NIC against a set of other algorithms and against human captions on the above data sets, using the above metrics. Results are shown in Tables 1 and 2. In this study, 5 human raters participated in the experiment. The human scores in Table 2 were computed by comparing one of the human captions against the other four. The authors did this for each of the five raters and averaged their BLEU scores.
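The leave-one-out procedure for the human score can be sketched as below. The `bleu1` function is a simplified sentence-level BLEU-1 (clipped unigram precision with a brevity penalty) written for illustration. The paper's actual scoring implementation is not described in this review, so this is an assumption rather than the authors' code, and the captions are invented.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Simplified sentence-level BLEU-1 against multiple references."""
    cand = candidate.split()
    if not cand:
        return 0.0
    refs = [r.split() for r in references]
    # Maximum count of each word in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty based on the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * precision

def human_bleu1(captions):
    """Score each caption against the others, then average."""
    scores = [
        bleu1(c, captions[:k] + captions[k + 1:]) for k, c in enumerate(captions)
    ]
    return sum(scores) / len(scores)

# Five invented human captions for one image.
captions = [
    "a dog runs on the grass",
    "a dog is running on grass",
    "the dog runs across a lawn",
    "a brown dog running outside",
    "dog running on the grass",
]
score = human_bleu1(captions)
```

Because every human caption is scored against only four references, this gives a rough human reference level rather than an upper bound.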

Table 1: Scores on the MSCOCO development set

Approach          BLEU-4  METEOR  CIDEr
NIC               27.7    23.7    85.5
Random            4.6     9.0     5.1
Nearest Neighbor  9.9     15.7    36.5
Human             21.7    25.2    85.4

Table 2: BLEU-1 scores. SOTA stands for the current state-of-the-art

Approach  PASCAL (xfer)  Flickr 30k  Flickr 8k  SBU
SOTA      25             56          58         19
NIC       59             66          63         28
Human     69             68          70         -

Human Evaluation

Figure 1 shows the results of the human evaluation of the descriptions provided by NIC, a reference system, and the ground truth on the data sets listed above. NIC is rated better than the reference system, but worse than the ground truth. The evaluation also suggests that BLEU is not a good metric, since the human ratings of the ground-truth and NIC descriptions differ much more than their BLEU scores do.

Figure 1: Human evaluation


The major contribution is an end-to-end neural network, NIC, that takes an image as input and generates an English description. A CNN encodes the input image into a compact representation, and an LSTM then models the description sentence. The authors expect NIC to perform better, both qualitatively and quantitatively, as the amount of training data grows.