
Image GPT

Image GPT (iGPT) was proposed by researchers at OpenAI in the 2020 paper "Generative Pretraining From Pixels". The paper experiments with applying GPT-style generative pre-training, which had proven successful for text, to images in order to learn useful image representations. However, the authors faced challenges such as the large memory and compute cost of processing full-size images.

Architecture:



The architecture of Image GPT (iGPT) is similar to GPT-2, i.e. it is built from transformer decoder blocks. The transformer decoder takes an input sequence x1, …, xn of discrete tokens and outputs a d-dimensional embedding for each position. The model can be considered a stack of L decoder blocks, the l-th of which produces embeddings h1^l, …, hn^l. Each block transforms its input as follows:

n^l = layer_norm(h^l)
a^l = h^l + multihead_attention(n^l)
h^(l+1) = a^l + mlp(layer_norm(a^l))

where layer_norm is layer normalization and mlp is a multi-layer perceptron (feed-forward network). Below is the list of the different model variants:
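As a rough illustration, the decoder block described above (pre-norm attention and MLP sub-layers, each with a residual connection) can be sketched in plain NumPy. All weight shapes, the single attention head, and the initialization scale here are toy assumptions for demonstration, not the paper's actual configuration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's d-dimensional embedding to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, wq, wk, wv):
    # Single-head attention with a causal mask so each position only
    # attends to earlier positions (autoregressive decoding).
    n, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -1e9
    return softmax(scores) @ v

def mlp(x, w1, w2):
    # Two-layer perceptron with a tanh-based GELU approximation.
    h = x @ w1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def decoder_block(h, params):
    # n^l     = layer_norm(h^l)
    # a^l     = h^l + attention(n^l)
    # h^{l+1} = a^l + mlp(layer_norm(a^l))
    n = layer_norm(h)
    a = h + causal_self_attention(n, params["wq"], params["wk"], params["wv"])
    return a + mlp(layer_norm(a), params["w1"], params["w2"])

rng = np.random.default_rng(0)
d, seq = 16, 10
params = {k: rng.normal(scale=0.02, size=(d, d)) for k in ("wq", "wk", "wv")}
params["w1"] = rng.normal(scale=0.02, size=(d, 4 * d))
params["w2"] = rng.normal(scale=0.02, size=(4 * d, d))
h = rng.normal(size=(seq, d))
out = decoder_block(h, params)
print(out.shape)  # (10, 16) -- the embedding dimension is preserved across the block
```

Because the output has the same shape as the input, blocks like this can be stacked L times, matching the description of the model as a stack of decoders.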



Model Name/Variant | Input Resolution(s) | Params (M) | Features
iGPT-Large (L)     | 32*32*3, 48*48*3    | 1362       | 1536
iGPT-XL            | 64*64*3             | 6801       | 3072, 15360

Context Reduction:

The memory requirements of the transformer decoder scale quadratically with context length when dense attention is used, so training even a single-layer transformer on full-size images requires a large amount of computation. To deal with this, the authors resize images to lower resolutions, called Input Resolutions (IRs). The iGPT models use IRs of 32*32*3, 48*48*3, and 64*64*3.
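A quick back-of-the-envelope calculation shows why resizing matters. The 224*224 "full" resolution below is an illustrative choice (a common image-classification input size), not a figure from the paper:

```python
def attention_matrix_entries(height, width):
    # Dense self-attention stores an n x n score matrix, where n is the
    # number of token positions -- here, one token per pixel position.
    n = height * width
    return n * n

full = attention_matrix_entries(224, 224)   # hypothetical full-resolution input
small = attention_matrix_entries(32, 32)    # iGPT's smallest IR

print(small)         # 1048576 entries for a 32*32 grid
print(full // small) # 2401 -- the 32*32 IR needs ~2400x less attention memory
```

This quadratic blow-up is why even the largest iGPT variant stops at a 64*64 IR rather than modelling full-resolution pixels.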

Training Methodology:

The training of Image GPT consists of two steps:

Pre-training: the model is trained without labels to predict pixel tokens; the authors use an autoregressive (next-pixel) objective and also experiment with a BERT-style masked objective.

Fine-tuning: an average pooling layer and a classification head are added on top of the pre-trained model, which is then trained on the labelled downstream task.
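A hedged sketch of the autoregressive pre-training loss: each pixel token is predicted from the tokens before it and scored with the average negative log-likelihood. The 512-entry vocabulary reflects the paper's clustered 9-bit colour palette; the random logits here merely stand in for an actual model's predictions:

```python
import numpy as np

def autoregressive_nll(logits, tokens):
    # logits[t] are the model's scores for position t's token, produced
    # from tokens[:t]; the loss is the mean negative log-likelihood.
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(tokens)), tokens].mean()

rng = np.random.default_rng(0)
vocab, n = 512, 8                 # 512 clustered pixel colours, 8 pixel positions
logits = rng.normal(size=(n, vocab))
tokens = rng.integers(0, vocab, size=n)
print(autoregressive_nll(logits, tokens))
```

A useful sanity check: with uniform (all-zero) logits the loss equals log(512), the entropy of guessing a colour at random.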

The authors also experimented with linear probing, in which the pre-trained model is kept frozen and only a linear classifier is trained on average-pooled features taken from an intermediate layer, rather than updating the whole model as in fine-tuning.
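The probing setup can be illustrated with a minimal NumPy sketch; the feature shapes and the random "backbone" outputs are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, seq_len, d, n_classes = 4, 1024, 16, 10

# Per-position embeddings from some layer of the (frozen) pre-trained model;
# random values stand in for real backbone outputs.
features = rng.normal(size=(n_images, seq_len, d))

# Average-pool across the sequence dimension to get one vector per image,
# then apply a linear classification head. In a linear probe only w_head
# is trained; in fine-tuning the backbone weights are updated as well.
pooled = features.mean(axis=1)    # (n_images, d)
w_head = rng.normal(size=(d, n_classes))
logits = pooled @ w_head          # (n_images, n_classes)
print(logits.shape)               # (4, 10)
```

Because the backbone is frozen, the probe's accuracy directly measures how linearly separable the pre-trained representations are.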

Results:

Generative pre-training proved effective for representation learning: both linear probes on frozen iGPT features and full fine-tuning achieve classification accuracies on CIFAR-10, CIFAR-100, and ImageNet that are competitive with contemporaneous self-supervised approaches.

References:

Chen, Radford, Child, Wu, Jun, Luan, Sutskever: "Generative Pretraining From Pixels", ICML 2020.
