Image GPT (iGPT) was proposed by researchers at OpenAI in 2020 in the paper "Generative Pretraining from Pixels". The paper experiments with applying GPT-like generative pre-training to images and evaluating the learned representations on image recognition/classification tasks. However, the authors faced some challenges, notably the computational cost of processing large images.
The architecture of Image GPT (iGPT) is similar to GPT-2, i.e. it is built from transformer decoder blocks. The decoder takes an input sequence $x_1, \ldots, x_n$ of discrete tokens and outputs a $d$-dimensional embedding for each position. The model can be viewed as a stack of $L$ decoder blocks, the $l$-th of which produces embeddings $h_1^l, \ldots, h_n^l$. Each block transforms its input as follows:
- $n^l = \mathrm{layer\_norm}(h^l)$
- $a^l = h^l + \mathrm{multihead\_attention}(n^l)$
- $h^{l+1} = a^l + \mathrm{mlp}(\mathrm{layer\_norm}(a^l))$
where $\mathrm{layer\_norm}$ is layer normalization and $\mathrm{mlp}$ is a multi-layer perceptron.
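To make this update rule concrete, here is a minimal PyTorch sketch of one such pre-norm decoder block; the hidden size, head count, and 4× MLP width are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

class IGPTBlock(nn.Module):
    """One pre-norm transformer decoder block, mirroring the update
    rule above. d_model and n_head are illustrative choices."""

    def __init__(self, d_model: int = 512, n_head: int = 8):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # n^l = layer_norm(h^l)
        n = self.ln_1(h)
        # Causal mask so position i attends only to positions <= i.
        L = h.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=h.device), 1)
        # a^l = h^l + multihead_attention(n^l)
        a = h + self.attn(n, n, n, attn_mask=mask, need_weights=False)[0]
        # h^{l+1} = a^l + mlp(layer_norm(a^l))
        return a + self.mlp(self.ln_2(a))
```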
Below is the list of model variants:

| Model Variant | Input Resolution | Parameters (M) | Features |
|---------------|------------------|----------------|----------|
| iGPT-L        | 32×32×3          | 1362           | 1536     |
| iGPT-L        | 48×48×3          | 1362           | 1536     |
| iGPT-XL       | 64×64×3          | 6801           | 3072 (15360 when features from 5 layers are concatenated) |
The memory requirements of the transformer decoder scale quadratically with context length when dense attention is used, so even a single-layer transformer is very expensive to train on full-resolution images. To deal with this, the authors resize images to lower resolutions, called Input Resolutions (IRs). The iGPT models use IRs of 32×32×3, 48×48×3, and 64×64×3.
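As a rough illustration, the sketch below (a hypothetical `to_raster_sequence` helper built on Pillow and NumPy, not the paper's exact preprocessing pipeline) resizes an image to an IR and flattens it into the raster-order pixel sequence the model consumes:

```python
import numpy as np
from PIL import Image

def to_raster_sequence(path: str, ir: int = 32) -> np.ndarray:
    """Resize an image to an ir x ir input resolution and flatten it
    in raster order (row by row) into a 1-D pixel sequence.
    Illustrative preprocessing, not the paper's exact pipeline."""
    img = Image.open(path).convert("RGB").resize((ir, ir))
    return np.asarray(img).reshape(-1)   # length = ir * ir * 3

# Why the resize matters: dense attention costs O(n^2) in sequence
# length. A 224x224x3 image gives n = 150,528, while IR 32x32x3 gives
# n = 3,072 -- roughly a 2,400x reduction in the quadratic term.
```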
The training of Image GPT consists of two steps, pre-training and fine-tuning:
- Given an unlabeled dataset $X$ consisting of high-dimensional data $x = (x_1, \ldots, x_n)$, we can pick a permutation $\pi$ of the set $[1, n]$ and model the density $p(x)$ auto-regressively as follows:
  $p(x) = \prod_{i=1}^{n} p(x_{\pi_i} \mid x_{\pi_1}, \ldots, x_{\pi_{i-1}}, \theta)$
- For images, the authors pick the identity permutation $\pi_i = i$ for $1 \le i \le n$, also known as raster order. The model is trained to minimize the negative log-likelihood (see the sketch after this list):
  $L_{AR} = \mathbb{E}_{x \sim X}\left[-\log p(x)\right]$
- The authors also used a loss similar to masked language modeling in BERT, which samples a sub-sequence $M \subset [1, n]$ such that each index $i$ independently has probability 0.15 of appearing in $M$, and trains the model to reconstruct the masked pixels:
  $L_{BERT} = \mathbb{E}_{x \sim X}\,\mathbb{E}_{M} \sum_{i \in M} \left[-\log p(x_i \mid x_{[1,n] \setminus M})\right]$
- During pre-training, one of $L_{AR}$ or $L_{BERT}$ is picked and the loss is minimized over the pre-training dataset.
- For fine-tuning, the authors average-pool the final-layer embeddings $h^L$ across the sequence dimension to extract a $d$-dimensional feature vector per example, $f^L = \langle h_i^L \rangle_i$, and learn a projection from $f^L$ to class logits. This projection is used to minimize a cross-entropy loss $L_{CLF}$, which makes the total objective
- $L_{GEN} + L_{CLF}$, where $L_{GEN}$ is either $L_{AR}$ or $L_{BERT}$.
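A minimal PyTorch sketch of these objectives, assuming the decoder returns per-position logits over a vocabulary of pixel values; all function and variable names here are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def l_ar(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive objective L_AR: average negative log-likelihood
    of each token given the tokens before it (raster order).
    logits: (B, n, V) per-position predictions, tokens: (B, n)."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 2..n
        tokens[:, 1:].reshape(-1),                    # next-token targets
    )

def sample_bert_mask(n: int, p: float = 0.15) -> torch.Tensor:
    """Sample the sub-sequence M for L_BERT: each index is included
    independently with probability 0.15."""
    return torch.rand(n) < p

def fine_tune_loss(h_last, proj, labels, l_gen):
    """Joint fine-tuning objective L_GEN + L_CLF: average-pool the
    final-layer embeddings h^L (B, n, d) into f^L (B, d), project to
    class logits, and add the cross-entropy term to the generative
    loss l_gen (either L_AR or L_BERT)."""
    f = h_last.mean(dim=1)                 # f^L: average pool over sequence
    l_clf = F.cross_entropy(proj(f), labels)
    return l_gen + l_clf
```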
The authors also experimented with linear probing, which differs from fine-tuning in that the pretrained model is kept frozen: features are extracted by average-pooling the embeddings at some layer, and only a linear classifier is trained on top of them, as sketched below.
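A rough sketch of that feature-extraction step, where `embeddings_at` is a hypothetical accessor for a layer's activations rather than a real API:

```python
import torch

@torch.no_grad()
def probe_features(model, x: torch.Tensor, layer: int) -> torch.Tensor:
    """Extract frozen features for a linear probe: run the pretrained
    model without gradient updates and average-pool the layer-l
    embeddings. `embeddings_at` is a hypothetical accessor."""
    h_l = model.embeddings_at(x, layer)   # (B, n, d) activations h^l
    return h_l.mean(dim=1)                # (B, d) pooled features

# A single linear classifier, e.g. torch.nn.Linear(d, num_classes)
# trained with cross-entropy on these fixed features, is then the
# only component that learns.
```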
- On CIFAR-10, iGPT-L achieves 99.0% accuracy, and on CIFAR-100 it achieves 88.5% accuracy after fine-tuning, outperforming AutoAugment, the best supervised model on these datasets.
- On ImageNet, iGPT-L achieves 66.3% accuracy after fine-tuning at an input resolution of 32×32, an improvement of 6% over linear probing. When fine-tuning at IR 48×48, the model achieves 72.6% accuracy, a similar 7% improvement over linear probing.