Object Detection with Detection Transformer (DETR) by Facebook

Facebook released its state-of-the-art object detection model on 27 May 2020. They call it DETR, which stands for DEtection TRansformer, because it uses transformers to detect objects. This is the first time a transformer has been used for object detection together with a Convolutional Neural Network. There are other object detection models, such as the RCNN family, YOLO (You Only Look Once) and SSD (Single Shot Detector), but none of them use a transformer for this task. The best part of this model is that, because it relies on a transformer, the architecture is very simple, unlike the other techniques mentioned, which come with all kinds of hand-designed components and hyperparameters. So without further ado, let's get started.

What is object detection?
Given a photo, if you need to determine whether it contains a single particular object, you can do that with classification. If you also want the location of that object inside the image, that is still not object detection; it is called classification and localization. But if there are multiple objects in an image and you want the class and location of every one of them, then that is object detection.
Some previous techniques use an RPN (Region Proposal Network) to come up with potential regions that may contain an object, and then rely on concepts such as anchor boxes, NMS (non-maximum suppression) and IoU to generate the relevant boxes and identify the objects. Although these concepts work, inference takes some time, so real-time use with high accuracy is hard to achieve due to this complexity.
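To make the IoU idea concrete, here is a minimal sketch (not tied to any particular detector) of how the overlap between two boxes is scored; NMS uses this score to discard near-duplicate detections:

# Minimal IoU (intersection over union) sketch; boxes are (x0, y0, x1, y1)
def iou(box_a, box_b):
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.14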
At a high level, DETR uses a CNN followed by a transformer to detect objects, and it is trained with a bipartite matching objective. This is the main reason it is so simple.

Source – https://arxiv.org/pdf/2005.12872.pdf

Step 1:
We pass the image through a Convolutional Neural Network, because CNNs work best with images. After passing through the CNN we are left with an image feature map: a higher-order representation of the image with many more feature channels but a much smaller spatial resolution.
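As a quick illustration (a minimal sketch, assuming the ResNet-50 backbone used in the paper's demo code shown later), this is the shape of the feature map the CNN hands over:

import torch
from torchvision.models import resnet50

# keep only the convolutional layers of ResNet-50 (drop pooling and fc head)
backbone = torch.nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
image = torch.randn(1, 3, 800, 1200)   # a dummy 800x1200 RGB image
features = backbone(image)
print(features.shape)                  # torch.Size([1, 2048, 25, 38]), i.e. roughly H/32 x W/32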
Step 2:
This enriched feature map of the image is given to a transformer encoder-decoder, which outputs a set of box predictions. Each prediction is a tuple consisting of a class and a bounding box. Note: the class set also includes a NULL (no object) class, and such predictions still come with a position.

Now, this is a real problem, because in the annotations there is no object class labeled "nothing". Comparing and dealing with similar objects next to each other is another major issue, and in this paper it is tackled with a bipartite matching loss. The N predicted boxes (each a class plus a bounding box, where the class may be the "no object" class) are compared with the ground-truth annotations, which are padded with "no object" entries so that they also total N. The assignment of predictions to annotations is one-to-one, chosen so that the total loss is minimized. A well-known algorithm called the Hungarian method computes this minimum-cost matching.
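Here is a minimal sketch of that matching step, using SciPy's implementation of the Hungarian algorithm. It is not the authors' implementation: the cost below is just negative class probability plus an L1 box distance, whereas the actual DETR loss also includes a generalized IoU term.

import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_classes, gt_boxes):
    # pred_probs: [N, num_classes + 1] softmax scores, pred_boxes: [N, 4]
    # gt_classes: [M] class indices,                   gt_boxes:   [M, 4]
    class_cost = -pred_probs[:, gt_classes]             # [N, M]
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)   # [N, M] L1 distance
    cost = (class_cost + box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # one-to-one, minimum total cost
    return pred_idx, gt_idx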
The main components:

source – https://arxiv.org/pdf/2005.12872.pdf


The backbone – features extracted by a Convolutional Neural Network, together with a positional encoding, are passed to the transformer.

The transformer encoder – a transformer is naturally a sequence-processing unit, so the incoming feature tensors are flattened into a sequence. The encoder transforms this sequence into an equally long sequence of enriched features.

The transformer decoder – takes a fixed set of learned object queries as its input and attends to the encoder output as side (conditioning) information.

Prediction feed-forward network (FFN) – the decoder output goes through a classifier head, which produces the class labels and bounding boxes discussed earlier. A shape-level sketch of how these pieces connect is shown below.
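This is a minimal sketch under assumed values (hidden size 256, 8 heads, 100 queries, matching the demo code from the paper further down); positional encodings are omitted here for brevity:

import torch
from torch import nn

hidden_dim, nheads, num_queries = 256, 8, 100
transformer = nn.Transformer(hidden_dim, nheads, num_encoder_layers=6, num_decoder_layers=6)

feat = torch.randn(1, hidden_dim, 25, 38)           # CNN feature map after a 1x1 conv
src = feat.flatten(2).permute(2, 0, 1)              # [950, 1, 256] sequence fed to the encoder
queries = torch.randn(num_queries, 1, hidden_dim)   # learned object queries (decoder side input)
out = transformer(src, queries)                     # [100, 1, 256], one vector per predicted box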

Evaluation:
The evaluation is done on the COCO dataset, and its primary competitor is the RCNN family, which has ruled this category for quite some time and is considered the classic technique for object detection.

Source – https://arxiv.org/pdf/2005.12872.pdf

Pros:

  • This new model is quite simple, and you don't have to install any specialized library to use it.
  • DETR demonstrates significantly better performance on large objects; its performance on small objects is weaker and can be further improved.
  • A good thing is that they have even provided the code in the paper, so we will implement it ourselves to see what it is really capable of.

Code:

import torch
from torch import nn
from torchvision.models import resnet50


class DETR(nn.Module):

    def __init__(self, num_classes, hidden_dim, nheads,
                 num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only the convolutional layers from the ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
        # 1x1 convolution to reduce the 2048 ResNet channels to hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # prediction heads: class (including the "no object" class) and bounding box
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # learned object queries and positional encodings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        # build a 2D positional encoding from the row and column embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # flatten the feature map into a sequence and run the transformer
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()


detr = DETR(num_classes=91, hidden_dim=256, nheads=8,
            num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
  
  
Listing 1: DETR PyTorch inference code. For clarity, it uses learned positional encodings in the encoder instead of fixed ones, and positional encodings are added to the input only instead of at each transformer layer. Making these changes requires going beyond the PyTorch implementation of transformers, which hampers readability. The entire code to reproduce the experiments will be made available before the conference.



Note that we take only the convolutional layers from the ResNet-50 model. The code above is taken from the paper.

Code: Try running this code on Colab, or just go to this link and copy and run the complete file.


import torch as th
import torchvision.transforms as T
import requests
from PIL import Image, ImageDraw, ImageFont



We will be using ResNet-101 as the backbone architecture, loading the pretrained DETR model directly from PyTorch Hub.
Code:


model = th.hub.load('facebookresearch/detr', 'detr_resnet101', pretrained=True)
model.eval()
model = model.cuda()




# standard PyTorch mean-std input image normalization
transform = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
  
CLASSES = [
    'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
    'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
    'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
    'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
    'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake',
    'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A',
    'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A',
    'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier',
    'toothbrush'
]



Enter the URL of an image here. The one I have used is https://i.ytimg.com/vi/vrlX3cwr3ww/maxresdefault.jpg
Code:


url = input()



Displaying the image


img = Image.open(requests.get(url, stream=True).raw).resize((800,600)).convert('RGB')
img



Code:


# normalize the image, add a batch dimension and move it to the GPU
img_tens = transform(img).unsqueeze(0).cuda()
with th.no_grad():
    output = model(img_tens)

draw = ImageDraw.Draw(img)
# take the first (and only) image in the batch and drop the "no object" logit
pred_logits = output['pred_logits'][0][:, :len(CLASSES)]
pred_boxes = output['pred_boxes'][0]

# keep the 15 most confident predictions
max_output = pred_logits.softmax(-1).max(-1)
topk = max_output.values.topk(15)

pred_logits = pred_logits[topk.indices]
pred_boxes = pred_boxes[topk.indices]
pred_logits.shape



Code:


for logits, box in zip(pred_logits, pred_boxes):
    cls = logits.argmax()
    if cls >= len(CLASSES):
        continue
    label = CLASSES[cls]
    print(label)
    # boxes are predicted as normalized (center x, center y, width, height);
    # scale them to the 800x600 resized image and convert to corner coordinates
    box = box.cpu() * th.Tensor([800, 600, 800, 600])
    x, y, w, h = box
    x0, x1 = x - w // 2, x + w // 2
    y0, y1 = y - h // 2, y + h // 2
    draw.rectangle([x0, y0, x1, y1], outline='red', width=5)
    draw.text((x, y), label, fill='white')



Code: Displaying the detected image


img



Here is the link to the Colab notebook and GitHub code. Also, feel free to check out the official GitHub repository for the same.
Drawbacks:
Training takes a long time: the model was trained for six days on 8 GPUs. That is not unusual compared with transformer-based language models of a similar scale, but it is still a significant cost.



