
CLIP (Contrastive Language-Image Pretraining)

CLIP is short for Contrastive Language-Image Pretraining. It is an advanced AI model developed by OpenAI. The model is capable of understanding both textual descriptions and images, leveraging a training approach that contrasts matching and non-matching pairs of images and text. In this article, we explore the fundamentals and working of CLIP, along with its applications.

CLIP is a neural network that learns visual concepts through natural language supervision. In essence, it understands the relationship between images and text: given an image and a set of different text descriptions, the model can accurately tell which description best describes the image. Note that the model does not generate a caption for an image; it only tells how well a given description fits a given image. The model has been revolutionary since its introduction, as it has become part of many text-to-image and text-to-video models that have become popular recently.
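The snippet below is a minimal sketch of this image-text matching, assuming the Hugging Face transformers library (with PyTorch), Pillow and requests are installed; the model name, image URL and candidate descriptions are illustrative placeholders, not part of the original article.

```python
# Minimal sketch: score candidate text descriptions against an image with CLIP.
# The model ranks descriptions; it does not generate captions itself.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical example image; replace with any image you like.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

candidate_texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=candidate_texts, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=1)
for text, p in zip(candidate_texts, probs[0].tolist()):
    print(f"{text}: {p:.3f}")
```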

Origins and Development of CLIP

Before CLIP, the state-of-the-art (SOTA) computer vision classification models were trained to predict a fixed set of predetermined classes. These models could not generalize to categories other than those they were trained on. To use them for other categories, one always had to fine-tune them further, which required computing resources as well as a good dataset, which was often challenging to collect.



This is what led to the development of the CLIP model. It was designed to solve the problem of fine-tuning an image classifier for categories it was not trained on. Thus, one significant aspect of CLIP is that it is 'zero-shot': it can classify images into any category without needing to be trained on that specific category. It uses both text and images as input for training and was trained on a dataset of 400 million image-text pairs collected from the Internet.

How Does CLIP Work?

Let us understand the architectural details of CLIP. At a high level, CLIP consists of two encoders: an image encoder (a ResNet or Vision Transformer) and a text encoder (a Transformer). Both map their inputs into a shared embedding space, where matching image-text pairs are pulled close together and mismatched pairs are pushed apart.

CLIP’s Unique Approach

CLIP’s unique architecture design and training approach departed from many of the standard norms that popular SOTA models like ResNet (trained on ImageNet) followed before its introduction. This resulted in many firsts in the field of computer vision, such as:

  1. Multimodal training: CLIP processes images and text simultaneously rather than treating them separately.
  2. Contrastive training objective: The training objective is to distinguish correct image-text pairs from incorrect ones. This results in a self-supervised training setup, unlike supervised training, where the model must be explicitly told the correct label or class category of each image (see the sketch after this list).
  3. Zero-shot learning: Since the model is trained on images and text simultaneously, it learns visual concepts through natural language supervision. The trained model can classify images into any category without further fine-tuning, and it transfers its learning to most other computer vision tasks, performing competitively without dataset-specific training.
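Below is a simplified sketch of this contrastive objective in PyTorch, modeled on the pseudocode in the CLIP paper. The function and tensor names are placeholders; the real model also learns the temperature and applies projection layers on top of each encoder.

```python
# Simplified sketch of CLIP's symmetric contrastive loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: [N, dim] outputs of the two encoders
    for N matching image-text pairs (row i of each tensor belongs together)."""
    # Normalize so dot products equal cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature  # [N, N]

    # The correct pairing is the diagonal: image i matches text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image and
    # the right image for each text, then average.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Usage with random stand-in features for a batch of 8 pairs:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```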

Key Applications and Uses of CLIP in Real-World Scenarios

CLIP has been very successful since its introduction and has become part of several other models and pipelines:

  1. Image generation models: SOTA text-to-image generation models such as DALL-E and Midjourney use CLIP to map input text embeddings to image embeddings so that the generated images stay consistent with the text.
  2. Image segmentation models: The popular Segment Anything Model (SAM) from Meta uses CLIP-style text embeddings to understand user prompts and generate segments from images based on those prompts.
  3. Image captioning: CLIP can score candidate captions and select the one that best fits an image.
  4. Content moderation: Social media sites employ CLIP to detect images that may be harmful or that do not comply with their policies and filter them out.
  5. Semantic retrieval: Text-to-image or image-to-text searches are possible with CLIP embeddings.
  6. Image search: CLIP can be utilized to find images corresponding to a specific text query, as shown in the sketch after this list.
  7. Visual Question Answering (VQA): CLIP can help answer questions about the visual elements of an image when given a text query.
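As an illustration of the retrieval and image-search use cases above, here is a minimal sketch using the same Hugging Face CLIP model as before: embed a small collection of images once, then rank them against a text query by cosine similarity. The file names are hypothetical placeholders.

```python
# Minimal CLIP-based image search sketch: rank images by similarity to a text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["cat.jpg", "beach.jpg", "city.jpg"]  # hypothetical files
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Embed the images and the query text, then L2-normalize both.
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    text_inputs = processor(text=["a sunny beach with waves"],
                            return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image, highest first.
scores = (text_embeds @ image_embeds.t()).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```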

Conclusion

In this article, we saw an overview of the CLIP model, understood how it works, explored its applications and saw how it has become part of many current SOTA models.

