
Object Detection by YOLO using Tensorflow

You Only Look Once (YOLO) is a fast, accurate, state-of-the-art technique for real-time object detection. In this article, we will implement YOLOv3 from scratch using TensorFlow 2.0.

Object detection is a key component of computer vision, and our implementation uses TensorFlow to build the cutting-edge YOLOv3 object detection model.



Object Detection by YOLO

Object Detection

Object Detection is a computer vision task that involves identifying and locating objects of interest within an image or a video. The main objectives are to identify objects, ascertain their classes, and supply bounding box coordinates surrounding them.

YOLOv3

YOLOv3 is an object detection technique that first divides the input image into a grid and then predicts bounding boxes and class probabilities for each grid cell. YOLO is effective for real-time applications because it processes the entire image in a single forward pass, in contrast to traditional object detection techniques that rely on region proposal networks and intricate multi-stage pipelines.



Prerequisites:

pip install opencv-python
pip install tensorflow

Object Detection by YOLO using TensorFlow: Implementation

Importing necessary libraries:




import numpy as np
import pandas as pd
import cv2, os, glob
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import (
    Add, Concatenate, Conv2D,
    Input, Lambda, LeakyReLU,
    MaxPool2D, UpSampling2D, ZeroPadding2D
)
from tensorflow.keras.regularizers import l2
from tensorflow.keras.utils import plot_model
from tensorflow.keras.losses import (
    binary_crossentropy,
    sparse_categorical_crossentropy
)

The xml.etree.ElementTree module is used for parsing XML files, such as Pascal VOC-style annotation files.
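The snippets in this article do not call it directly; as a minimal sketch, a Pascal VOC annotation file could be parsed like this (the tag names assume the standard VOC format):

def parse_voc_annotation(xml_path):
    # Parse one Pascal VOC XML file and collect (label, box) pairs.
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.getroot().iter('object'):
        name = obj.find('name').text
        bndbox = obj.find('bndbox')
        box = [int(float(bndbox.find(k).text))
               for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        boxes.append((name, box))
    return boxes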

Model Configuration:

We define some hyperparameters for YOLOv3.

Anchors are predefined bounding boxes with specific sizes and aspect ratios that serve as reference points for localization predictions. They enable models to handle variations in object scale and shape, improving flexibility and computational efficiency during training and inference.




YOLOV3_LAYER_LIST = [
    'yolo_darknet',
    'yolo_conv_0',
    'yolo_output_0',
    'yolo_conv_1',
    'yolo_output_1',
    'yolo_conv_2',
    'yolo_output_2',
]
yolo_anchors = np.array([
    (10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
    (59, 119), (116, 90), (156, 198), (373, 326)],
    np.float32) / 416
 
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

Class names:




class_names = [
    'person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train',
    'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
    'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
    'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
    'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
    'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant',
    'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote',
    'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
    'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]

Class names represent a collection of object classes commonly encountered in object detection tasks. This list is widely used with datasets such as COCO (Common Objects in Context) to label and identify objects within images. There are 80 class labels, and each element serves as a unique identifier for a specific object class.
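As a trivial illustration, a predicted class index maps back to its label through this list:

print(class_names[16])           # dog
print(class_names.index('dog'))  # 16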

Model Building:

We define a function named load_darknet_weights that loads weights from a Darknet weight file into a given model.




def load_darknet_weights(model, weights_file):
    wf = open(weights_file, 'rb')
    major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)
     
    layers = YOLOV3_LAYER_LIST  # Assuming YOLO architecture, adjust if needed
     
    for layer_name in layers:
        sub_model = model.get_layer(layer_name)
        for i, layer in enumerate(sub_model.layers):
            if not layer.name.startswith('conv2d'):
                continue
            batch_norm = None
            if i + 1 < len(sub_model.layers) and sub_model.layers[i + 1].name.startswith('batch_norm'):
                batch_norm = sub_model.layers[i + 1]
            filters = layer.filters
            size = layer.kernel_size[0]
            in_dim = layer.input_shape[-1]
            if batch_norm is None:
                conv_bias = np.fromfile(wf, dtype=np.float32, count=filters)
            else:
                bn_weights = np.fromfile(wf, dtype=np.float32, count=4 * filters)
                bn_weights = bn_weights.reshape((4, filters))[[1, 0, 2, 3]]
                 
            conv_shape = (filters, in_dim, size, size)
            conv_weights = np.fromfile(wf, dtype=np.float32, count=np.prod(conv_shape))
            conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
             
            if batch_norm is None:
                layer.set_weights([conv_weights, conv_bias])
            else:
                layer.set_weights([conv_weights])
                batch_norm.set_weights(bn_weights)
     
    assert len(wf.read()) == 0, 'failed to read all data'
    wf.close()

Intersection over Union (IoU) Calculation for Bounding Boxes

IoU is a metric used to measure the overlap between two bounding boxes or regions in object detection tasks. It is calculated by dividing the area of intersection between the predicted and ground truth bounding boxes by the area of their union.

Bounding boxes are rectangular frames used to delineate the location of objects in images, defined by their top-left (x_min, y_min) and bottom-right (x_max, y_max) coordinates. They are essential in computer vision for tasks like object detection and image annotation.




def broadcast_iou(box_1, box_2):
    # box_1: (..., (x1, y1, x2, y2)); box_2: (N, (x1, y1, x2, y2))
    # broadcast boxes
    box_1 = tf.expand_dims(box_1, -2)
    box_2 = tf.expand_dims(box_2, 0)
    # new_shape: (..., N, (x1, y1, x2, y2))
    new_shape = tf.broadcast_dynamic_shape(tf.shape(box_1), tf.shape(box_2))
    box_1 = tf.broadcast_to(box_1, new_shape)
    box_2 = tf.broadcast_to(box_2, new_shape)
    int_w = tf.maximum(tf.minimum(box_1[..., 2], box_2[..., 2]) - tf.maximum(box_1[..., 0], box_2[..., 0]), 0)
    int_h = tf.maximum(tf.minimum(box_1[..., 3], box_2[..., 3]) - tf.maximum(box_1[..., 1], box_2[..., 1]), 0)
    int_area = int_w * int_h
    box_1_area = (box_1[..., 2] - box_1[..., 0]) * (box_1[..., 3] - box_1[..., 1])
    box_2_area = (box_2[..., 2] - box_2[..., 0]) * (box_2[..., 3] - box_2[..., 1])
    return int_area / (box_1_area + box_2_area - int_area)

A higher Intersection over Union (IoU) signifies greater overlap between bounding boxes, indicating better alignment and localization of objects. It is commonly employed as a key evaluation metric for model accuracy in computer vision applications, particularly object detection.
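As a quick sanity check of broadcast_iou, consider two hypothetical boxes in (x1, y1, x2, y2) format: the intersection area is 1 and the union is 4 + 4 - 1 = 7, so the IoU should be about 0.143.

box_a = tf.constant([[0.0, 0.0, 2.0, 2.0]])
box_b = tf.constant([[1.0, 1.0, 3.0, 3.0]])
print(broadcast_iou(box_a, box_b).numpy())  # roughly [[0.1429]]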

Model Freezing




def freeze_all(model, frozen = True):
    model.trainable = not frozen
    if isinstance(model, tf.keras.Model):
        for l in model.layers:
            freeze_all(l, frozen)

The freeze_all(model, frozen=True) function freezes or unfreezes all layers in a given model based on the Boolean parameter frozen. It recursively traverses the layers of the model and sets their trainable attribute accordingly.
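A typical use, once the full model is built later in this article, would be freezing the Darknet backbone for transfer learning while the detection heads stay trainable; a minimal sketch (it assumes the yolo model created in the Model Summary section, whose backbone layer is named 'yolo_darknet'):

# Freeze only the backbone; yolo_conv_* and yolo_output_* remain trainable.
freeze_all(yolo.get_layer('yolo_darknet'))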

Visualizing Predictions




def draw_outputs(img, outputs, class_names):
    boxes, objectness, classes, nums = outputs
    boxes, objectness, classes, nums = boxes[0], objectness[0], classes[0], nums[0]
    wh = np.flip(img.shape[0:2])
    for i in range(nums):
        x1y1 = tuple((np.array(boxes[i][0:2]) * wh).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4]) * wh).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        img = cv2.putText(img, '{} {:.4f}'.format(
            class_names[int(classes[i])], objectness[i]),
            x1y1, cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 0, 255), 2)
    return img

draw_outputs(img, outputs, class_names): This function takes an image (img), model outputs (outputs), and a list of class names (class_names). It draws bounding boxes and class labels on the image based on the model predictions.

Image Transformation




def transform_images(x_train, size):
    x_train = tf.image.resize(x_train, (size, size))
    x_train = x_train / 255
    return x_train

The transform_images(x_train, size) function resizes input images (x_train) to a specified size (size) and normalizes pixel values to the range [0, 1].

Target Transformation Function




@tf.function
def transform_targets_for_output(y_true, grid_size, anchor_idxs, classes):
    N = tf.shape(y_true)[0]
    y_true_out = tf.zeros(
        (N, grid_size, grid_size, tf.shape(anchor_idxs)[0], 6))
    anchor_idxs = tf.cast(anchor_idxs, tf.int32)
    indexes = tf.TensorArray(tf.int32, 1, dynamic_size=True)
    updates = tf.TensorArray(tf.float32, 1, dynamic_size=True)
    idx = 0
    for i in tf.range(N):
        for j in tf.range(tf.shape(y_true)[1]):
            if tf.equal(y_true[i][j][2], 0):
                continue
            anchor_eq = tf.equal(
                anchor_idxs, tf.cast(y_true[i][j][5], tf.int32))
            if tf.reduce_any(anchor_eq):
                box = y_true[i][j][0:4]
                box_xy = (y_true[i][j][0:2] + y_true[i][j][2:4]) / 2
                anchor_idx = tf.cast(tf.where(anchor_eq), tf.int32)
                grid_xy = tf.cast(box_xy // (1/grid_size), tf.int32)
                indexes = indexes.write(
                    idx, [i, grid_xy[1], grid_xy[0], anchor_idx[0][0]])
                updates = updates.write(
                    idx, [box[0], box[1], box[2], box[3], 1, y_true[i][j][4]])
                idx += 1
    return tf.tensor_scatter_nd_update(
        y_true_out, indexes.stack(), updates.stack())


def transform_targets(y_train, anchors, anchor_masks, classes):
    y_outs = []
    grid_size = 13
    anchors = tf.cast(anchors, tf.float32)
    anchor_area = anchors[..., 0] * anchors[..., 1]
    box_wh = y_train[..., 2:4] - y_train[..., 0:2]
    box_wh = tf.tile(tf.expand_dims(box_wh, -2), (1, 1, tf.shape(anchors)[0], 1))
    box_area = box_wh[..., 0] * box_wh[..., 1]
    intersection = tf.minimum(box_wh[..., 0], anchors[..., 0]) * tf.minimum(box_wh[..., 1], anchors[..., 1])
    iou = intersection / (box_area + anchor_area - intersection)
    anchor_idx = tf.cast(tf.argmax(iou, axis=-1), tf.float32)
    anchor_idx = tf.expand_dims(anchor_idx, axis=-1)
    y_train = tf.concat([y_train, anchor_idx], axis=-1)
    for anchor_idxs in anchor_masks:
        y_outs.append(transform_targets_for_output(
            y_train, grid_size, anchor_idxs, classes))
        grid_size *= 2
    return tuple(y_outs)

The transform_targets_for_output function transforms bounding boxes into a target tensor tailored for a specific output grid in an object detection model, considering anchor box information, grid positions, and objectness confidence, thus facilitating the training of YOLO-like architectures.

The transform_targets function prepares target tensors for YOLO-like object detection models by incorporating ground truth labels, anchor boxes, and class information. It calculates anchor indices based on Intersection over Union, appends them to labels, and generates target tensors for multiple output grids with varying scales, essential for effective model training.

These functions work together to prepare ground truth labels for training a YOLO model with multiple output scales and anchor configurations.
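A minimal sketch of how these functions can be invoked, assuming labels arrive as normalized (x1, y1, x2, y2, class) rows per image:

# One image containing a single ground-truth box with class id 5.
y_train = tf.constant([[[0.1, 0.2, 0.4, 0.6, 5.0]]], tf.float32)
targets = transform_targets(y_train, yolo_anchors, yolo_anchor_masks, classes=80)
for t in targets:
    print(t.shape)  # (1, 13, 13, 3, 6), (1, 26, 26, 3, 6), (1, 52, 52, 3, 6)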

Custom Batch Normalization




class BatchNormalization(tf.keras.layers.BatchNormalization):
 
    def call(self, x, training=False):
        if training is None:
            training = tf.constant(False)
        training = tf.logical_and(training, self.trainable)
        return super().call(x, training)

The code presents a custom Batch Normalization layer implemented using TensorFlow’s Keras API. By inheriting from the standard tf.keras.layers.BatchNormalization class and overriding the call method, the custom layer introduces additional logic to handle the training parameter. Notably, it sets training to False if it is initially None and ensures that Batch Normalization is applied only when the layer is trainable. This layer offers flexibility in controlling the application of Batch Normalization based on training mode and the layer’s trainable status, making it suitable for specific training scenarios or model architectures.

Darknet Convolution




def DarknetConv(x, filters, size, strides=1, batch_norm=True):
    if strides == 1:
        padding = 'same'
    else:
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)  # top left half-padding
        padding = 'valid'
    x = Conv2D(filters=filters, kernel_size=size,
               strides=strides, padding=padding,
               use_bias=not batch_norm, kernel_regularizer=l2(0.0005))(x)
    if batch_norm:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x

The code defines a function named DarknetConv, serving as a modular building block for convolutional layers within the Darknet architecture, notably used in YOLO (You Only Look Once) models. This function creates a 2D convolutional layer with options for customized padding, strides, and batch normalization. The function’s versatility allows for seamless integration into the Darknet backbone, enabling the construction of feature extraction layers. The inclusion of batch normalization and Leaky ReLU activation enhances training stability and facilitates feature learning. This modular approach enhances code readability and reusability, contributing to the efficient design and implementation of convolutional neural networks, particularly those based on the Darknet architecture.
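A brief illustration of the two padding modes (the shapes assume a 416×416 input):

x = Input([416, 416, 3])
y = DarknetConv(x, 32, 3)             # stride 1, 'same' padding -> (None, 416, 416, 32)
z = DarknetConv(y, 64, 3, strides=2)  # stride 2, half padding   -> (None, 208, 208, 64)
print(y.shape, z.shape)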

Darknet Residual and Darknet Block




def DarknetResidual(x, filters):
    prev = x
    x = DarknetConv(x, filters // 2, 1)
    x = DarknetConv(x, filters, 3)
    x = Add()([prev, x])
    return x

def DarknetBlock(x, filters, blocks):
    x = DarknetConv(x, filters, 3, strides=2)
    for _ in range(blocks):
        x = DarknetResidual(x, filters)
    return x

The two functions, DarknetResidual and DarknetBlock contribute to the construction of the Darknet architecture commonly employed in YOLO (You Only Look Once) models for object detection. The DarknetResidual function defines a residual block, where the input tensor x undergoes a series of DarknetConv operations, incorporating 1×1 and 3×3 convolutions. The result is added element-wise to the original input tensor, promoting feature reuse and gradient flow. The DarknetBlock function, on the other hand, orchestrates the creation of a Darknet block by utilizing the DarknetConv function with specific parameters. It includes a 3×3 convolutional layer with strided downsampling, followed by a series of DarknetResidual blocks. These functions contribute to the modularity and efficiency of the Darknet architecture, facilitating the design and implementation of deep neural networks for object detection tasks.

Darknet Architecture




def Darknet(name=None):
    x = inputs = Input([None, None, 3])
    x = DarknetConv(x, 32, 3)
    x = DarknetBlock(x, 64, 1)
    x = DarknetBlock(x, 128, 2)  # skip connection
    x = x_36 = DarknetBlock(x, 256, 8)  # skip connection
    x = x_61 = DarknetBlock(x, 512, 8)
    x = DarknetBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)

The Darknet function constructs the YOLO Darknet backbone, starting from a 3-channel input tensor and an initial convolutional layer with 32 filters. DarknetBlocks with increasing filter counts and residuals form the architecture, and intermediate outputs are captured after the 256-filter block (x_36) and the 512-filter block (x_61) to serve as skip connections. The model, encapsulated in a TensorFlow Keras Model, outputs three scales of feature maps, adhering to YOLO's multi-scale feature extraction for robust object detection.
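Passing a dummy tensor through the backbone confirms the three scales; a quick sketch assuming a 416×416 input:

backbone = Darknet(name='darknet_demo')
x_36, x_61, x = backbone(tf.zeros([1, 416, 416, 3]))
print(x_36.shape, x_61.shape, x.shape)
# (1, 52, 52, 256) (1, 26, 26, 512) (1, 13, 13, 1024)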

YOLO Convolution Block




def YoloConv(x_in, filters, name=None):
    if isinstance(x_in, tuple):
        inputs = Input(x_in[0].shape[1:]), Input(x_in[1].shape[1:])
        x, x_skip = inputs
        # concat with skip connection
        x = DarknetConv(x, filters, 1)
        x = UpSampling2D(2)(x)
        x = Concatenate()([x, x_skip])
    else:
        x = inputs = Input(x_in.shape[1:])
    x = DarknetConv(x, filters, 1)
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, filters, 1)
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, filters, 1)
    return Model(inputs, x, name=name)(x_in)

The YoloConv(x_in, filters, name=None) function defines a YOLO convolutional block consisting of multiple convolutional layers; when given a tuple of inputs, it first upsamples the feature map and concatenates it with the skip connection.

Output Function




def YoloOutput(x_in, filters, anchors, classes, name=None):
    x = inputs = Input(x_in.shape[1:])
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, anchors * (classes + 5), 1, batch_norm=False)
    x = Lambda(lambda x: tf.reshape(x, (-1, tf.shape(x)[1], tf.shape(x)[2], anchors, classes + 5)))(x)
    return tf.keras.Model(inputs, x, name=name)(x_in)

YoloOutput function constructs a YOLO output block responsible for predicting bounding boxes, objectness scores, and class probabilities. The output is reshaped to facilitate subsequent processing.

Post-processing Functions




def yolo_boxes(pred, anchors, classes):
    '''pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...classes))'''
    grid_size = tf.shape(pred)[1]
    box_xy, box_wh, objectness, class_probs = tf.split(
        pred, (2, 2, 1, classes), axis=-1)
    box_xy = tf.sigmoid(box_xy)
    objectness = tf.sigmoid(objectness)
    class_probs = tf.sigmoid(class_probs)
    pred_box = tf.concat((box_xy, box_wh), axis=-1)  # original xywh for loss
    grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # [gx, gy, 1, 2]
    box_xy = (box_xy + tf.cast(grid, tf.float32)) / \
        tf.cast(grid_size, tf.float32)
    box_wh = tf.exp(box_wh) * anchors
    box_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)
    return bbox, objectness, class_probs, pred_box


def yolo_nms(outputs, anchors, masks, classes):
    '''boxes, conf, type'''
    b, c, t = [], [], []
    for o in outputs:
        b.append(tf.reshape(o[0], (tf.shape(o[0])[0], -1, tf.shape(o[0])[-1])))
        c.append(tf.reshape(o[1], (tf.shape(o[1])[0], -1, tf.shape(o[1])[-1])))
        t.append(tf.reshape(o[2], (tf.shape(o[2])[0], -1, tf.shape(o[2])[-1])))
    bbox = tf.concat(b, axis=1)
    confidence = tf.concat(c, axis=1)
    class_probs = tf.concat(t, axis=1)
    scores = confidence * class_probs
    boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
        boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),
        scores=tf.reshape(
            scores,
            (tf.shape(scores)[0], -1, tf.shape(scores)[-1])
        ),
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=0.5,
        score_threshold=0.5
    )
    return boxes, scores, classes, valid_detections

  1. yolo_boxes(pred, anchors, classes): Decodes model predictions into bounding boxes, objectness scores, and class probabilities. Applies sigmoid functions and calculates box coordinates.
  2. yolo_nms(outputs, anchors, masks, classes): Performs non-maximum suppression (NMS) on model outputs, filtering redundant bounding boxes based on confidence scores and IoU thresholds.

Model Architecture




def YoloV3(size=None, channels=3, anchors=yolo_anchors, masks=yolo_anchor_masks, classes=80, training=False):
    x = inputs = Input([size, size, channels])
    x_36, x_61, x = Darknet(name='yolo_darknet')(x)
    x = YoloConv(x, 512, name='yolo_conv_0')
    output_0 = YoloOutput(x, 512, len(masks[0]), classes, name='yolo_output_0')
    x = YoloConv((x, x_61), 256, name='yolo_conv_1')
    output_1 = YoloOutput(x, 256, len(masks[1]), classes, name='yolo_output_1')
    x = YoloConv((x, x_36), 128, name='yolo_conv_2')
    output_2 = YoloOutput(x, 128, len(masks[2]), classes, name='yolo_output_2')
    if training:
        return Model(inputs, (output_0, output_1, output_2), name='yolov3')
    boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes),
                     name='yolo_boxes_0')(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes),
                     name='yolo_boxes_1')(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes),
                     name='yolo_boxes_2')(output_2)
    outputs = Lambda(lambda x: yolo_nms(x, anchors, masks, classes),
                     name='yolo_nms')((boxes_0[:3], boxes_1[:3], boxes_2[:3]))
    return Model(inputs, outputs, name='yolov3')

The YOLOv3 model architecture for object detection is defined, including functions for model creation, loss calculation, and training. Key components include YOLO-like convolutional layers, output layers, and loss functions.

Loss Function




def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
    def yolo_loss(y_true, y_pred):
        # 1. transform all pred outputs
        # y_pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...cls))
        pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(y_pred, anchors, classes)
        pred_xy = pred_xywh[..., 0:2]
        pred_wh = pred_xywh[..., 2:4]
        # 2. transform all true outputs
        # y_true: (batch_size, grid, grid, anchors, (x1, y1, x2, y2, obj, cls))
        true_box, true_obj, true_class_idx = tf.split(
            y_true, (4, 1, 1), axis=-1)
        true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
        true_wh = true_box[..., 2:4] - true_box[..., 0:2]
        # give higher weights to small boxes
        box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]
        # 3. inverting the pred box equations
        grid_size = tf.shape(y_true)[1]
        grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
        true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
            tf.cast(grid, tf.float32)
        true_wh = tf.math.log(true_wh / anchors)
        true_wh = tf.where(tf.math.is_inf(true_wh), tf.zeros_like(true_wh), true_wh)
        # 4. calculate all masks
        obj_mask = tf.squeeze(true_obj, -1)
        # ignore false positive when iou is over threshold
        true_box_flat = tf.boolean_mask(true_box, tf.cast(obj_mask, tf.bool))
        best_iou = tf.reduce_max(broadcast_iou(
            pred_box, true_box_flat), axis=-1)
        ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)
        # 5. calculate all losses
        xy_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
        wh_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
        obj_loss = binary_crossentropy(true_obj, pred_obj)
        obj_loss = obj_mask * obj_loss + \
            (1 - obj_mask) * ignore_mask * obj_loss
        # Could also use binary_crossentropy instead
        class_loss = obj_mask * sparse_categorical_crossentropy(
            true_class_idx, pred_class)
        # 6. sum over (batch, gridx, gridy, anchors) => (batch, 1)
        xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
        wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
        obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
        class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))
        return xy_loss + wh_loss + obj_loss + class_loss
    return yolo_loss

A custom YOLOv3 loss function for TensorFlow, crucial for training object detection models, is defined here. The loss computation excludes false positives whose IoU with a ground-truth box exceeds a threshold and takes box size into account via a box loss scale. The function evaluates the discrepancies between predicted and true bounding boxes, objectness scores, and class probabilities, capturing the subtleties of YOLO-style object detection and helping to improve model accuracy during training.
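For training, one loss instance is typically created per output scale and matched to its anchor mask; a minimal compilation sketch, assuming the training-mode model defined above:

model = YoloV3(416, training=True, classes=80)
loss = [YoloLoss(yolo_anchors[mask], classes=80)
        for mask in yolo_anchor_masks]
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=loss)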

Model Summary




yolo = YoloV3(classes = 80)
yolo.summary()

Model: "yolov3"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, None, None, 3)] 0 []

yolo_darknet (Functional) ((None, None, None, 256), 4062064 ['input_1[0][0]']
(None, None, None, 512), 0
(None, None, None, 1024))

yolo_conv_0 (Functional) (None, None, None, 512) 1102438 ['yolo_darknet[0][2]']
4

yolo_conv_1 (Functional) (None, None, None, 256) 2957312 ['yolo_conv_0[0][0]',
'yolo_darknet[0][1]']

yolo_conv_2 (Functional) (None, None, None, 128) 741376 ['yolo_conv_1[0][0]',
'yolo_darknet[0][0]']

yolo_output_0 (Functional) (None, None, None, 3, 85) 4984063 ['yolo_conv_0[0][0]']

yolo_output_1 (Functional) (None, None, None, 3, 85) 1312511 ['yolo_conv_1[0][0]']

yolo_output_2 (Functional) (None, None, None, 3, 85) 361471 ['yolo_conv_2[0][0]']

yolo_boxes_0 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_0[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_boxes_1 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_1[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_boxes_2 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_2[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_nms (Lambda) ((None, 100, 4), 0 ['yolo_boxes_0[0][0]',
(None, 100), 'yolo_boxes_0[0][1]',
(None, 100), 'yolo_boxes_0[0][2]',
(None,)) 'yolo_boxes_1[0][0]',
'yolo_boxes_1[0][1]',
'yolo_boxes_1[0][2]',
'yolo_boxes_2[0][0]',
'yolo_boxes_2[0][1]',
'yolo_boxes_2[0][2]']

==================================================================================================
Total params: 62001757 (236.52 MB)
Trainable params: 61949149 (236.32 MB)
Non-trainable params: 52608 (205.50 KB)
________________________________________

Visualizing the Model Architecture




plot_model(
    yolo, rankdir = 'TB',
    to_file = 'yolo_model1.png',
    show_shapes = False,
    show_layer_names = True,
    expand_nested = False
)

Output:

Yolo Model

Loading Weights and Making Predictions on Images




load_darknet_weights(yolo, '/Users/gfg0406/Desktop/GFG TASKS/yolov3.weights')

def predict(image_file, visualize = True, figsize = (16, 16)):
    img = tf.image.decode_image(open(image_file, 'rb').read(), channels=3)
    img = tf.expand_dims(img, 0)
    img = transform_images(img, 416)
    boxes, scores, classes, nums = yolo.predict(img)
    img = cv2.cvtColor(cv2.imread(image_file), cv2.COLOR_BGR2RGB)
    img = draw_outputs(img, (boxes, scores, classes, nums), class_names)
    if visualize:
        fig, axes = plt.subplots(figsize = figsize)
        plt.imshow(img)
        plt.show()
    return boxes, scores, classes, nums

image_file = glob.glob('/Users/gfg0406/Desktop/GFG TASKS/Images/*')

First, we load the YOLOv3 model (yolo) with pre-trained Darknet weights. Then a predict function is defined that makes predictions from an image file path. The image is read, decoded, and preprocessed to the YOLOv3 input size using the transform_images function. The yolo.predict method returns the bounding box predictions, confidence scores, predicted classes, and number of detections. OpenCV is used to read the original image and convert it to RGB. Finally, the predicted outputs are drawn onto the image with the draw_outputs function, and the result is displayed if the visualize option is set to True. The function returns the predicted boxes, scores, classes, and the total number of detections.

The code then applies this prediction function to a list of image files in the specified directory.

Change the directory path according to your device.

Detections for Sample Images




boxes, scores, classes, nums = predict(image_file[0], figsize = (20, 20))
boxes, scores, classes, nums = predict(image_file[1], figsize = (20, 20))

Output:

Object Detection by YOLO

Object Detection by YOLO

