Image-to-Image Translation using Pix2Pix
Pix2pix was proposed by researchers at UC Berkeley in 2017. It uses a conditional Generative Adversarial Network to perform image-to-image translation tasks, i.e., converting an image from one representation to another, such as facade labels to building photos or Google Maps views to Google Earth imagery.
Pix2pix builds on conditional generative adversarial networks (conditional GANs). The reason is that a model trained with only a simple L1/L2 loss for a particular image-to-image translation task does not capture the nuances of the images: such pixel-wise losses average over all plausible outputs and therefore produce blurry results, whereas an adversarial loss pushes the generator toward sharp, realistic images.
The generator uses the U-Net architecture. It is similar to an encoder-decoder architecture, except for the skip connections added between mirrored encoder and decoder layers. These skip connections let low-level information that is shared between the input and the output, such as the location of edges, bypass the bottleneck.
- Encoder Architecture: The encoder network of the generator has seven convolutional blocks. Each block has a convolutional layer followed by a LeakyReLU activation function (with a slope of 0.2 in the paper). Every block except the first also has a batch normalization layer.
- Decoder Architecture: The decoder network of the generator has seven transposed convolutional blocks. Each upsampling block has a transposed convolutional layer, followed by a batch normalization layer and a ReLU activation function.
- The generator architecture contains skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
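A minimal sketch of these encoder/decoder blocks in TensorFlow/Keras, in the spirit of the official pix2pix tutorial; the helper names `downsample` and `upsample` are our own, and the dropout option reflects the paper's decoder, which applies dropout in its first blocks:

```python
import tensorflow as tf

def downsample(filters, size, apply_batchnorm=True):
    """Conv -> (BatchNorm) -> LeakyReLU encoder block."""
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2D(filters, size, strides=2, padding='same',
                                     kernel_initializer=initializer, use_bias=False))
    if apply_batchnorm:  # the first encoder block skips batch norm
        block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU(0.2))  # slope 0.2, as in the paper
    return block

def upsample(filters, size, apply_dropout=False):
    """Transposed conv -> BatchNorm -> (Dropout) -> ReLU decoder block."""
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2DTranspose(filters, size, strides=2, padding='same',
                                              kernel_initializer=initializer, use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:  # the paper applies dropout in the first decoder blocks
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block
```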
The discriminator uses the PatchGAN architecture, which also reappears in later work such as CycleGAN. A PatchGAN is built from ordinary strided convolutional blocks (not transposed ones). It takes an N×N patch of the image and classifies whether that patch is real or fake. The discriminator is applied convolutionally across the whole image, and the patch responses are averaged to produce the final output of the discriminator D.
Each block of the discriminator contains a convolutional layer, a batch normalization layer, and a LeakyReLU activation. The discriminator receives two inputs:
- The input image paired with the target image (which the discriminator should classify as real).
- The input image paired with the generated image (which the discriminator should classify as fake).
The PatchGAN is used because the authors argue that it preserves the high-frequency details of the image, while the low-frequency structure is already taken care of by the L1 loss.
The generator loss used in the paper is a linear combination of the GAN (adversarial) loss and an L1 loss between the generated image and the target image. The conditional GAN loss is

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]$$

and the L1 reconstruction loss is

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]$$

Therefore, our total loss for the generator is

$$G^* = \arg\min_G \max_D \, \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)$$

where the paper sets $\lambda = 100$.
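In code, this objective is usually implemented with sigmoid cross-entropy, as in the official TensorFlow pix2pix tutorial; a sketch, assuming `disc_generated_output` is the discriminator's patch output for the generated pair:

```python
import tensorflow as tf

loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # weight of the L1 term, as in the paper

def generator_loss(disc_generated_output, gen_output, target):
    # adversarial term: the generator wants the discriminator to say "real" (ones)
    gan_loss = loss_object(tf.ones_like(disc_generated_output), disc_generated_output)
    # L1 term: mean absolute pixel difference between generated and target images
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    total_gen_loss = gan_loss + LAMBDA * l1_loss
    return total_gen_loss, gan_loss, l1_loss
```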
The discriminator loss takes two inputs: the real image pair and the generated image pair.
- real_loss is a sigmoid cross-entropy loss between the real images and an array of ones (since these are the real images).
- generated_loss is a sigmoid cross-entropy loss between the generated images and an array of zeros (since these are the fake images).
- The total discriminator loss is the sum of real_loss and generated_loss.
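A matching sketch of the discriminator loss, reusing the `loss_object` defined above:

```python
def discriminator_loss(disc_real_output, disc_generated_output):
    # real (input, target) pairs should be classified as real (ones)
    real_loss = loss_object(tf.ones_like(disc_real_output), disc_real_output)
    # (input, generated) pairs should be classified as fake (zeros)
    generated_loss = loss_object(tf.zeros_like(disc_generated_output), disc_generated_output)
    return real_loss + generated_loss
```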
- First, we download and preprocess the image dataset. We will use the CMP Facade dataset, provided by the Center for Machine Perception at the Czech Technical University in Prague and repackaged by the authors of the pix2pix paper. We preprocess the dataset before training.
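A sketch of the download and loading step; the URL below is the pix2pix authors' dataset mirror used by the official TensorFlow tutorial and may change over time. Each file in the dataset stores the photo and the label map side by side in a single image:

```python
import tensorflow as tf

URL = 'http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/facades.tar.gz'
path_to_zip = tf.keras.utils.get_file('facades.tar.gz', origin=URL, extract=True)

def load(image_file):
    """Split a combined image into (input, real) halves and scale to [-1, 1]."""
    image = tf.io.read_file(image_file)
    image = tf.io.decode_jpeg(image)
    w = tf.shape(image)[1] // 2
    real_image = tf.cast(image[:, :w, :], tf.float32)   # left half: building photo
    input_image = tf.cast(image[:, w:, :], tf.float32)  # right half: facade label map
    input_image = (input_image / 127.5) - 1             # normalize to [-1, 1]
    real_image = (real_image / 127.5) - 1
    return input_image, real_image
```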
- Now, we load the train and test data using the function defined above.
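A sketch of the input pipeline with `tf.data`; random-jitter augmentation (resize, random crop, random flip) is omitted for brevity, and the location of the extracted folder may differ across Keras versions:

```python
import os

# with older Keras versions get_file returns the archive path; the
# extracted 'facades/' folder sits in the same directory
PATH = os.path.join(os.path.dirname(path_to_zip), 'facades/')

train_dataset = tf.data.Dataset.list_files(PATH + 'train/*.jpg')
train_dataset = train_dataset.map(load, num_parallel_calls=tf.data.AUTOTUNE)
train_dataset = train_dataset.shuffle(400).batch(1)  # batch size 1, as in the paper

test_dataset = tf.data.Dataset.list_files(PATH + 'test/*.jpg')
test_dataset = test_dataset.map(load).batch(1)
```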
- After the data processing, we write the code for the generator architecture. The generator consists of two parts: an encoder built from downsampling convolutional blocks and a decoder built from upsampling transposed convolutional blocks, as sketched below.
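A sketch of the U-Net generator assembled from the `downsample`/`upsample` helpers defined earlier. For 256×256 inputs, the TensorFlow tutorial uses eight downsampling blocks so that the bottleneck reaches 1×1; the exact depth depends on the input resolution:

```python
def Generator():
    """U-Net generator: encoder-decoder with skip connections."""
    inputs = tf.keras.layers.Input(shape=[256, 256, 3])

    down_stack = [
        downsample(64, 4, apply_batchnorm=False),  # (128, 128, 64)
        downsample(128, 4),                        # (64, 64, 128)
        downsample(256, 4),                        # (32, 32, 256)
        downsample(512, 4),                        # (16, 16, 512)
        downsample(512, 4),                        # (8, 8, 512)
        downsample(512, 4),                        # (4, 4, 512)
        downsample(512, 4),                        # (2, 2, 512)
        downsample(512, 4),                        # (1, 1, 512)
    ]
    up_stack = [
        upsample(512, 4, apply_dropout=True),  # (2, 2, 1024)
        upsample(512, 4, apply_dropout=True),  # (4, 4, 1024)
        upsample(512, 4, apply_dropout=True),  # (8, 8, 1024)
        upsample(512, 4),                      # (16, 16, 1024)
        upsample(256, 4),                      # (32, 32, 512)
        upsample(128, 4),                      # (64, 64, 256)
        upsample(64, 4),                       # (128, 128, 128)
    ]
    initializer = tf.random_normal_initializer(0., 0.02)
    last = tf.keras.layers.Conv2DTranspose(3, 4, strides=2, padding='same',
                                           kernel_initializer=initializer,
                                           activation='tanh')  # (256, 256, 3)

    x = inputs
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)
    skips = reversed(skips[:-1])

    # skip connections: concatenate channels at layer i with layer n - i
    for up, skip in zip(up_stack, skips):
        x = up(x)
        x = tf.keras.layers.Concatenate()([x, skip])

    x = last(x)
    return tf.keras.Model(inputs=inputs, outputs=x)
```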
- Now, we define the discriminator architecture. The discriminator uses the PatchGAN model described above, and we can reuse the downsampling convolutional block defined earlier. Its loss is the discriminator_loss defined above: the sum of the real loss (sigmoid cross-entropy between the output for the real pair and an array of ones) and the generated loss (sigmoid cross-entropy between the output for the generated pair and an array of zeros).
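A sketch of a 70×70 PatchGAN discriminator following the TensorFlow tutorial's layout; it concatenates the input and the candidate image and outputs a 30×30 grid of patch logits:

```python
def Discriminator():
    initializer = tf.random_normal_initializer(0., 0.02)
    inp = tf.keras.layers.Input(shape=[256, 256, 3], name='input_image')
    tar = tf.keras.layers.Input(shape=[256, 256, 3], name='target_image')

    x = tf.keras.layers.concatenate([inp, tar])      # (256, 256, 6)
    x = downsample(64, 4, apply_batchnorm=False)(x)  # (128, 128, 64)
    x = downsample(128, 4)(x)                        # (64, 64, 128)
    x = downsample(256, 4)(x)                        # (32, 32, 256)

    x = tf.keras.layers.ZeroPadding2D()(x)           # (34, 34, 256)
    x = tf.keras.layers.Conv2D(512, 4, strides=1,
                               kernel_initializer=initializer,
                               use_bias=False)(x)    # (31, 31, 512)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(0.2)(x)
    x = tf.keras.layers.ZeroPadding2D()(x)           # (33, 33, 512)

    # one logit per patch; no sigmoid since the loss uses from_logits=True
    last = tf.keras.layers.Conv2D(1, 4, strides=1,
                                  kernel_initializer=initializer)(x)  # (30, 30, 1)
    return tf.keras.Model(inputs=[inp, tar], outputs=last)
```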
- In this step, we define the optimizers and checkpoints. We will use the Adam optimizer for both the generator and the discriminator.
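A sketch, using the learning rate and momentum settings from the paper (lr = 2e-4, beta_1 = 0.5); the checkpoint directory name is our own choice:

```python
generator = Generator()
discriminator = Discriminator()

generator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

checkpoint_prefix = './training_checkpoints/ckpt'
checkpoint = tf.train.Checkpoint(generator_optimizer=generator_optimizer,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator=generator,
                                 discriminator=discriminator)
```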
- Now, we define the training procedure. It consists of the following steps (a sketch of the training step follows this list):
- For each example input, we pass the image through the generator to obtain the generated image.
- The discriminator receives the (input_image, generated_image) pair as its first input; the second input is the (input_image, target_image) pair.
- Next, we calculate the generator loss and the discriminator loss.
- Then, we calculate the gradients of the losses with respect to the generator and discriminator variables and apply them through the respective optimizers.
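A sketch of one training step with tf.GradientTape, plus a minimal epoch loop; it assumes the generator, discriminator, losses, and optimizers defined above:

```python
@tf.function
def train_step(input_image, target):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        gen_output = generator(input_image, training=True)

        disc_real_output = discriminator([input_image, target], training=True)
        disc_generated_output = discriminator([input_image, gen_output], training=True)

        gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(
            disc_generated_output, gen_output, target)
        disc_loss = discriminator_loss(disc_real_output, disc_generated_output)

    generator_gradients = gen_tape.gradient(gen_total_loss,
                                            generator.trainable_variables)
    discriminator_gradients = disc_tape.gradient(disc_loss,
                                                 discriminator.trainable_variables)

    generator_optimizer.apply_gradients(
        zip(generator_gradients, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(
        zip(discriminator_gradients, discriminator.trainable_variables))

def fit(train_ds, epochs):
    for epoch in range(epochs):
        for input_image, target in train_ds:
            train_step(input_image, target)
        checkpoint.save(file_prefix=checkpoint_prefix)  # save after each epoch
```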
- Finally, we run the trained generator on the test data to generate images.
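A sketch of the visualization step; following the TensorFlow tutorial, the generator is called with training=True at test time so that batch statistics come from the test batch itself:

```python
import matplotlib.pyplot as plt

def generate_images(model, test_input, target):
    prediction = model(test_input, training=True)
    plt.figure(figsize=(15, 5))
    images = [test_input[0], target[0], prediction[0]]
    titles = ['Input Image', 'Ground Truth', 'Predicted Image']
    for i, (img, title) in enumerate(zip(images, titles)):
        plt.subplot(1, 3, i + 1)
        plt.title(title)
        plt.imshow(img * 0.5 + 0.5)  # rescale from [-1, 1] to [0, 1]
        plt.axis('off')
    plt.show()

for example_input, example_target in test_dataset.take(3):
    generate_images(generator, example_input, example_target)
```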