
Object Detection using TensorFlow

Identifying and detecting objects within images or videos is a key task in computer vision. It is critical in a variety of applications, ranging from autonomous vehicles and surveillance systems to augmented reality and medical imaging. TensorFlow, Google’s open-source machine learning framework, provides a robust collection of tools for developing and deploying object detection models.

In this article, we will go over the fundamentals of using TensorFlow for object detection. TensorFlow provides a flexible and efficient framework to match your needs, whether you’re working on a computer vision research project or building applications that require real-time object detection. Let’s get into the specifics of using TensorFlow to build an object detection pipeline and realize the full potential of this technology.



What is Object detection?

Object detection is a computer vision task that involves identifying and locating multiple objects within an image or video. The goal is not just to classify what is in the image but also to precisely outline and pinpoint where each object is located.
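
For intuition, a detector’s output for a single image can be thought of as a list of (box, label, score) triples. The structure below is purely illustrative and not the exact format returned by any particular library:

# Illustrative only: one way to represent the detections for a single image.
# Each entry pairs a bounding box with a class label and a confidence score.
detections_example = [
    {"box": [0.12, 0.08, 0.55, 0.43], "label": "dog", "score": 0.91},     # box = [ymin, xmin, ymax, xmax], normalized
    {"box": [0.30, 0.50, 0.95, 0.98], "label": "person", "score": 0.87},
]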

Key Concepts in Object Detection:

- Bounding boxes: rectangular regions that localize each detected object within the image.
- Class labels: the category (such as person, car, or dog) assigned to each detected object.
- Confidence scores: values between 0 and 1 indicating how certain the model is about each detection.
- Normalized coordinates: box positions expressed as fractions of the image width and height, so they are independent of resolution.

Object Detection using TensorFlow

Setting Up TensorFlow

Begin by installing TensorFlow using pip:



!pip install tensorflow

Ensure that you have the necessary dependencies, and if you have a compatible GPU, consider installing TensorFlow with GPU support for faster training.
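
As a quick sanity check, you can ask TensorFlow whether it can see a GPU (this assumes TensorFlow is already installed; an empty list simply means it will run on the CPU):

import tensorflow as tf

# List the physical GPUs visible to TensorFlow; an empty list means CPU-only execution.
print("TensorFlow version:", tf.__version__)
print("GPUs available:", tf.config.list_physical_devices('GPU'))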

Choosing a Pre-trained Model

TensorFlow provides pre-trained models on large datasets like COCO (Common Objects in Context). These models serve as a starting point for transfer learning. Common models include Faster R-CNN, SSD (Single Shot Multibox Detector), and YOLO (You Only Look Once). For this tutorial we will be using the ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8 model.

Understanding the ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8 Model

The model’s name describes its design and training setup: it is an SSD (Single Shot Multibox Detector) with a MobileNetV2 backbone and an FPN-Lite feature pyramid for detecting objects at multiple scales, it expects 640x640 input images, and it was trained on the COCO 2017 dataset (on TPUs, hence the tpu-8 suffix). This combination offers a good balance between inference speed and accuracy, which makes it well suited for real-time applications.

Now that we have everything needed, let’s begin with the code:

Step 1: Import Libraries

First, let’s import the necessary libraries: TensorFlow, NumPy, OpenCV, Pillow, Matplotlib, and Python’s random module (for generating random box colors).




import tensorflow as tf
import numpy as np
import cv2
from PIL import Image
from matplotlib import pyplot as plt
from random import randint

Step 2: Download, Extract and Load the Pre-trained Model

Download the model archive from the TensorFlow 2 Detection Model Zoo, extract it, and load the pre-trained model using TensorFlow’s SavedModel format.




# Download the model archive from the TF2 Detection Model Zoo and extract it
!wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
!tar -xzvf ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz

# Load the extracted SavedModel
model = tf.saved_model.load("ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/saved_model")
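
Optionally, you can inspect the loaded SavedModel’s default serving signature to confirm the inputs and outputs it exposes. A small sketch (the exact output keys can vary slightly between exports):

# Inspect the default serving signature of the loaded SavedModel.
detect_fn = model.signatures['serving_default']
print(detect_fn.structured_input_signature)        # expected input spec (a batched uint8 image tensor)
print(list(detect_fn.structured_outputs.keys()))   # outputs such as detection_boxes, detection_classes, detection_scores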

Step 3: Load and Preprocess Image

In this step, load an image with Pillow, convert it to a NumPy array, and preprocess it for input to the model. The model cannot work on an image object directly, so we convert it into a tensor and add a batch dimension.




image = Image.open("detect.jpg")
image_np = np.array(image)  # convert the PIL image to a NumPy array of shape (H, W, 3)
input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.uint8)  # add a batch dimension
image  # display the original image in the notebook

Output:
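
If you prefer to read images with OpenCV instead of Pillow (for example, when processing video frames), remember that OpenCV returns BGR arrays, so convert them to RGB before building the tensor. A minimal sketch, assuming the same local file detect.jpg:

# Alternative: read the image with OpenCV and convert BGR -> RGB before detection.
bgr = cv2.imread("detect.jpg")
image_np = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.uint8)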

Step 4: Perform Object Detection

Here we use the loaded model to perform object detection on the input image and extract bounding box coordinates, class IDs, and scores.




detection = model(input_tensor)
 
# Parse the detection results
boxes = detection['detection_boxes'].numpy()
classes = detection['detection_classes'].numpy().astype(int)
scores = detection['detection_scores'].numpy()
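
The output dictionary also includes a num_detections entry, and detections are sorted by score, so you can quickly print the strongest few results. A small sketch, assuming the standard output keys of TensorFlow Object Detection API exports:

# Print the top-5 detections; results are already sorted by confidence score.
num_detections = int(detection['num_detections'][0])
for i in range(min(5, num_detections)):
    print(f"class_id={classes[0, i]}, score={scores[0, i]:.2f}, box={boxes[0, i]}")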

Step 5: Add the COCO Labels

These are the labels for the COCO dataset, which map class IDs to the corresponding class names.

The model only gives us the integer class IDs it was trained on, i.e. the IDs of the COCO dataset. To translate those integers into meaningful class names, we need this list of labels. A few COCO IDs are unused by the detection label map, so 'N/A' placeholders are included to keep each list index aligned with its class ID.




labels = ['__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
          'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
          'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A', 'handbag',
          'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
          'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A',
          'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
          'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
          'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv',
          'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
          'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
          'hair drier', 'toothbrush']
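
For example, you can translate the class ID of the highest-scoring detection into a readable name (assuming at least one object was detected in the image):

# Map the top detection's class ID to its human-readable COCO name.
top_class_id = classes[0, 0]
print("Top detection:", labels[top_class_id], "with score", round(float(scores[0, 0]), 2))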

Before going further, let’s learn about two concepts used in the next step:

- Normalized coordinates: the model returns each bounding box as [ymin, xmin, ymax, xmax] with values between 0 and 1, expressed as fractions of the image height and width rather than as pixels. Multiplying by the image dimensions converts them back to pixel coordinates, as shown in the sketch below.
- Confidence score: a value between 0 and 1 indicating how sure the model is about a detection; we keep only detections above a chosen threshold.

Let’s understand with an Analogy

Think of a treasure map. Instead of saying “walk 50 steps north,” which depends on the map’s size, you say “walk halfway up the map.” Normalized coordinates provide a universal language for pinpointing locations, independent of the image’s resolution.
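
Concretely, a normalized box [ymin, xmin, ymax, xmax] is converted to pixel coordinates by scaling by the image height and width. A tiny worked example with made-up numbers:

# Made-up example: convert a normalized box to pixel coordinates for a 640x480 image.
h, w = 480, 640                                    # image height and width in pixels
ymin, xmin, ymax, xmax = 0.25, 0.10, 0.75, 0.60    # normalized box
print((int(xmin * w), int(ymin * h)), (int(xmax * w), int(ymax * h)))
# -> (64, 120) and (384, 360): the top-left and bottom-right corners in pixels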

Step 6: Visualize the Detected Objects

Now let’s look at the code. We iterate through the detected objects, filter out low-confidence detections, convert the normalized coordinates to pixel coordinates, look up the class names, and draw randomly colored boxes and labels on the image. Adjust the confidence threshold (0.5 in this case) and other parameters as needed.




for i in range(classes.shape[1]):
    class_id = int(classes[0, i])
    score = float(scores[0, i])

    if score > 0.5:  # Filter out low-confidence detections
        h, w, _ = image_np.shape
        ymin, xmin, ymax, xmax = boxes[0, i]

        # Convert normalized coordinates to pixel coordinates
        xmin = int(xmin * w)
        xmax = int(xmax * w)
        ymin = int(ymin * h)
        ymax = int(ymax * h)

        # Get the class name from the labels list
        class_name = labels[class_id]

        # Pick a random color for this box (RGB values in 0-255)
        random_color = (randint(0, 255), randint(0, 255), randint(0, 255))

        # Draw bounding box and label on the image
        cv2.rectangle(image_np, (xmin, ymin), (xmax, ymax), random_color, 2)
        label = f"Class: {class_name}, Score: {score:.2f}"
        cv2.putText(image_np, label, (xmin, ymin - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, random_color, 2)

# Display the result
plt.imshow(image_np)
plt.axis('off')
plt.show()

Output:
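
If you want to keep the annotated image rather than just display it, you can write the array back to disk with Pillow, which is already imported (the output filename here is just an example):

# Save the annotated image; image_np is RGB, so Pillow writes it correctly as-is.
Image.fromarray(image_np).save("detect_annotated.png")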

Applications of Object Detection

Object detection finds applications in diverse fields, including:

- Autonomous vehicles: detecting pedestrians, vehicles, and traffic signs for safe navigation.
- Surveillance and security: identifying people and objects of interest in video feeds.
- Medical imaging: localizing abnormalities such as tumors in scans.
- Retail analytics: tracking products and shopper activity in stores.
- Augmented reality: anchoring virtual content to real-world objects.
Conclusion

Object detection with models like these opens doors to a myriad of applications. From autonomous vehicles and surveillance systems to retail analytics and augmented reality, the impact is profound. As technology advances, we can anticipate further developments in model architectures, dataset diversity, and real-time deployment, ushering in a new era of intelligent visual perception.

