Generating a caption for a given image is a challenging problem in the deep learning domain. In this article, we will use techniques from computer vision and NLP to recognize the context of an image and describe it in a natural language such as English. We will build a working model of an image caption generator using a CNN (Convolutional Neural Network) together with LSTM (Long Short-Term Memory) units.
For training our model I’m using the Flickr8K dataset. It consists of 8,000 unique images, and each image is mapped to five different sentences that describe it.
Step 1: Import the required libraries
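Below is a minimal set of imports for the pipeline sketched in the following steps, assuming TensorFlow 2.x with its bundled Keras (exact import paths vary slightly across versions):

```python
import string
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
```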
Step 2: Load the descriptions
The format of our file is one image-caption pair per line, with entries separated by newlines (“\n”): the name of the image, followed by a separator, followed by the description of the image. Here we need to map each image to its descriptions by storing them in a dictionary.
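A sketch of the loading step, assuming a Flickr8k-style token file (the file name Flickr8k.token.txt and the whitespace separator are assumptions; adjust them to your copy of the dataset):

```python
def load_descriptions(filename):
    """Map each image name (without extension) to a list of its captions."""
    descriptions = {}
    with open(filename) as f:
        for line in f:
            tokens = line.strip().split()
            if len(tokens) < 2:
                continue
            # First token is the image identifier, the rest is the caption.
            image_id, caption = tokens[0], ' '.join(tokens[1:])
            image_id = image_id.split('.')[0]  # drop ".jpg" (and any "#0" caption index)
            descriptions.setdefault(image_id, []).append(caption)
    return descriptions

descriptions = load_descriptions('Flickr8k.token.txt')  # hypothetical path
```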
['A child in a pink dress is climbing up a set of stairs in an entry way .', 'A girl going into a wooden building .', 'A little girl climbing into a wooden playhouse .', 'A little girl climbing the stairs to her playhouse .', 'A little girl in a pink dress going into a wooden cabin .']
Step 3: Cleaning the text
One of the main steps in NLP is to remove noise so that the machine can detect patterns in the text more easily. Noise is present in the form of special characters such as hashtags, punctuation, and numbers, all of which are difficult for computers to make sense of, so we remove them for better results. Additionally, you can remove stop words and perform stemming and lemmatization using the NLTK library.
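A minimal cleaning pass over the caption dictionary: lower-case everything, strip punctuation, and drop numeric and single-character tokens:

```python
def clean_descriptions(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for caption_list in descriptions.values():
        for i, caption in enumerate(caption_list):
            words = caption.lower().translate(table).split()
            # Keep only purely alphabetic words longer than one character.
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            caption_list[i] = ' '.join(words)

clean_descriptions(descriptions)
```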
Step 4: Generate the Vocabulary
The vocabulary is the set of unique words present in our text corpus. When processing raw text for NLP, everything revolves around the vocabulary.
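Building the vocabulary is then just collecting the unique words across all cleaned captions:

```python
def build_vocabulary(descriptions):
    vocab = set()
    for caption_list in descriptions.values():
        for caption in caption_list:
            vocab.update(caption.split())
    return vocab

vocabulary = build_vocabulary(descriptions)
print('Vocabulary size:', len(vocabulary))
```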
Step 5: Load the images
Here we need to map the images in the training set to their corresponding descriptions, which are stored in our descriptions variable. Create a list of the names of all training images, then create an empty dictionary and map each image to its descriptions, using the image name as key and a list of descriptions as its value. While mapping the descriptions, add unique tokens at the beginning and end of each one to mark where the sentence starts and ends, as in the sketch below.
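A sketch of this mapping, assuming train_image_names holds the file names of the training images (read, for example, from the dataset's Flickr_8k.trainImages.txt split file):

```python
train_descriptions = {}
for name in train_image_names:  # e.g. '1000268201_693b08cb0e.jpg'
    image_id = name.split('.')[0]
    if image_id in descriptions:
        # Wrap each caption with markers so the decoder knows where
        # a sentence starts and ends.
        train_descriptions[image_id] = [
            'startseq ' + caption + ' endseq'
            for caption in descriptions[image_id]
        ]
```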
['startseq child in pink dress is climbing up set of stairs in an entry way endseq', 'startseq girl going into wooden building endseq', 'startseq little girl climbing into wooden playhouse endseq', 'startseq little girl climbing the stairs to her playhouse endseq', 'startseq little girl in pink dress going into wooden cabin endseq']
Step 6: Extract the feature vector from all images
Now we will give an image as input to our model, but unlike humans, a machine cannot understand an image just by looking at it. We need to convert the image into an encoding so that the machine can detect the patterns in it. For this task I’m using transfer learning, i.e., a pre-trained model that has already been trained on a large dataset, from which we extract features and use them for our work. Here I’m using the InceptionV3 model, which was trained on the ImageNet dataset with 1,000 different classes to classify. We can import this model directly from the keras.applications module.
We need to remove the last classification layer to get the (2048,)-dimensional feature vector from the InceptionV3 model, as sketched below.
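A sketch of the extraction step: load InceptionV3 with ImageNet weights, cut the network at the 2048-unit pooling layer just before the classifier, and encode every training image (the Flickr8k_Dataset/ path is an assumption):

```python
# Keep everything up to the 2048-unit global average pooling layer,
# dropping the final 1000-way softmax classifier.
base = InceptionV3(weights='imagenet')
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def encode_image(path):
    image = load_img(path, target_size=(299, 299))  # InceptionV3's input size
    x = img_to_array(image)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return feature_extractor.predict(x, verbose=0).reshape(2048)

encoding_train = {name.split('.')[0]: encode_image('Flickr8k_Dataset/' + name)
                  for name in train_image_names}
```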
Step 7: Tokenizing the vocabulary
In this step, we need to tokenize all the words present in our vocabulary, i.e., assign each word a unique integer index. Alternatively, the Tokenizer class in Keras can do this task for us.
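A hand-rolled version of the tokenization, building word-to-index and index-to-word maps plus two quantities we will need later (the Keras Tokenizer class produces the same mapping):

```python
all_captions = [c for caps in train_descriptions.values() for c in caps]

wordtoix, ixtoword = {}, {}
ix = 1  # index 0 is reserved for sequence padding
for caption in all_captions:
    for word in caption.split():
        if word not in wordtoix:
            wordtoix[word] = ix
            ixtoword[ix] = word
            ix += 1

vocab_size = len(wordtoix) + 1                          # +1 for the padding index
max_length = max(len(c.split()) for c in all_captions)  # longest caption
```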
Step 8: Glove vector embeddings
GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm developed at Stanford that generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. Also, each of the 6,000 images in the standard Flickr8k training split has 5 captions associated with it, which means we have 30,000 examples for training our model. With this many examples you can also use a data generator to feed input to the model in batches rather than all at once; for simplicity, I’m not doing that here.
We are also going to use an embedding matrix to store the relations between the words in our vocabulary. An embedding matrix is a linear mapping from the original space to a real-valued space where entities have meaningful relationships.
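A sketch of filling the embedding matrix from pre-trained GloVe vectors; the 200-dimensional glove.6B.200d.txt file is an assumption, and any GloVe file works as long as embedding_dim matches:

```python
embedding_dim = 200
embeddings_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:  # hypothetical path
    for line in f:
        values = line.split()
        # First token is the word, the remainder is its vector.
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Row i of the matrix holds the vector for the word with index i;
# words missing from GloVe keep an all-zero row.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoix.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```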
Step 9: Define the model
For defining the structure of our model, we will be using the Keras Model from the functional API. It has three major parts (a sketch follows the list):
- Processing the sequence from the text
- Extracting the feature vector from the image
- Decoding the output by concatenating the above two layers
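One way to realize these three parts with the functional API is sketched below; the 256-unit layer sizes are common choices rather than requirements, and the two branches are merged here with element-wise addition (concatenation is an equally valid variant):

```python
# Image branch: 2048-d InceptionV3 features -> 256-d representation.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Text branch: partial caption -> GloVe embeddings -> LSTM summary.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True,
                name='word_embedding')(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge the two branches and predict the next word.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)

# Load the pre-trained GloVe weights and freeze the embedding layer.
model.get_layer('word_embedding').set_weights([embedding_matrix])
model.get_layer('word_embedding').trainable = False
```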
Step 10: Training the model
For training our model I’m using the Adam optimizer with categorical cross-entropy as the loss function. I’m training the model for 50 epochs, which is enough for predicting reasonable output. In case you have more computational power (a number of GPUs), you can train it by decreasing the batch size and increasing the number of epochs.
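A sketch of the training step without a data generator: every caption is expanded into (image feature, partial sequence) -> next-word pairs, all held in memory at once, then the model is compiled and fit (the batch size of 256 is an assumption):

```python
def create_sequences(descriptions, features, wordtoix, max_length, vocab_size):
    X1, X2, y = [], [], []
    for image_id, caption_list in descriptions.items():
        for caption in caption_list:
            seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
            # Each caption yields several (prefix -> next word) training pairs.
            for i in range(1, len(seq)):
                in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
                X1.append(features[image_id])
                X2.append(in_seq)
                y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

X1, X2, y = create_sequences(train_descriptions, encoding_train,
                             wordtoix, max_length, vocab_size)

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit([X1, X2], y, epochs=50, batch_size=256)
```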
Epoch 1/1
292328/292328 [==============================] - 55s 189us/step - loss: 3.8895
Epoch 1/1
292328/292328 [==============================] - 55s 187us/step - loss: 3.1549
Epoch 1/1
292328/292328 [==============================] - 54s 186us/step - loss: 2.9185
Epoch 1/1
292328/292328 [==============================] - 54s 186us/step - loss: 2.7652
Epoch 1/1
292328/292328 [=================>.........] - ETA: 15s - loss: 2.6496
Step 11: Predicting the output
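A greedy decoding sketch: start from startseq, repeatedly feed the image features plus the words generated so far, pick the most probable next word, and stop at endseq or the maximum caption length:

```python
def greedy_caption(photo_feature, max_length):
    in_text = 'startseq'
    for _ in range(max_length):
        seq = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        seq = pad_sequences([seq], maxlen=max_length)
        # Probability distribution over the whole vocabulary.
        yhat = model.predict([photo_feature.reshape(1, 2048), seq], verbose=0)
        word = ixtoword.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    return in_text.replace('startseq', '').strip()

# Example: caption one of the encoded training images.
example_id = list(encoding_train)[0]
print(greedy_caption(encoding_train[example_id], max_length))
```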