
Object Detection by YOLO using Tensorflow

You Only Look Once (YOLO) is a fast, accurate, state-of-the-art technique for real-time object detection. In this article, we will implement YOLOv3 from scratch using TensorFlow 2.0.

Object detection is a key component of computer vision, and our implementation uses TensorFlow to build the cutting-edge YOLOv3 object detection model.



Object Detection by YOLO

Object Detection

Object Detection is a computer vision task that involves identifying and locating objects of interest within an image or a video. The main objectives are to identify objects, ascertain their classes, and supply bounding box coordinates surrounding them.

YOLOv3

YOLOv3 is an object detection technique that first divides the input image into a grid and then predicts bounding boxes and class probabilities for each grid cell. YOLO is effective for real-time applications because it processes the entire image in a single forward pass, in contrast to traditional object detection techniques that rely on region proposal networks and intricate multi-stage pipelines.



Prerequisites:

pip install opencv-python
pip install tensorflow

Object Detection by YOLO using TensorFlow: Implementation

Importing necessary libraries:




import numpy as np
import pandas as pd
import cv2, os, glob
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import (
    Add, Concatenate, Conv2D,
    Input, Lambda, LeakyReLU,
    MaxPool2D, UpSampling2D, ZeroPadding2D
)
from tensorflow.keras.regularizers import l2
from tensorflow.keras.utils import plot_model
from tensorflow.keras.losses import (
    binary_crossentropy,
    sparse_categorical_crossentropy
)

The xml.etree.ElementTree module is used for parsing XML files, such as Pascal VOC-style annotation files.
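The snippets in this article do not call it directly; as a minimal sketch, a Pascal VOC annotation file could be parsed like this (the tag names assume the standard VOC format):

def parse_voc_annotation(xml_path):
    # Parse one Pascal VOC XML file and collect (label, box) pairs.
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.getroot().iter('object'):
        name = obj.find('name').text
        bndbox = obj.find('bndbox')
        box = [int(float(bndbox.find(k).text))
               for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        boxes.append((name, box))
    return boxes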

Model Configuration:

We define some hyperparameters for YOLOv3.

Anchors are predefined bounding boxes with specific sizes and aspect ratios that serve as reference points for localization predictions. They enable models to handle variations in object scale and shape, improving flexibility and computational efficiency during training and inference.




YOLOV3_LAYER_LIST = [
    'yolo_darknet',
    'yolo_conv_0',
    'yolo_output_0',
    'yolo_conv_1',
    'yolo_output_1',
    'yolo_conv_2',
    'yolo_output_2',
]
yolo_anchors = np.array([
    (10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
    (59, 119), (116, 90), (156, 198), (373, 326)],
    np.float32) / 416
 
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

Class names:




class_names = [
    'person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train',
    'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
    'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
    'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
    'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
    'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant',
    'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote',
    'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
    'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]

Class names represent a collection of object classes commonly encountered in object detection tasks. This list is widely used with datasets such as COCO (Common Objects in Context) to label and identify objects within images. There are 80 class labels, and each element serves as a unique identifier for a specific object class.
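As a trivial illustration, a predicted class index maps back to its label through this list:

print(class_names[16])           # dog
print(class_names.index('dog'))  # 16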

Model Building:

We define a function named load_darknet_weights that loads weights from a Darknet weight file into a given model.




def load_darknet_weights(model, weights_file):
    wf = open(weights_file, 'rb')
    major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)
     
    layers = YOLOV3_LAYER_LIST  # Assuming YOLO architecture, adjust if needed
     
    for layer_name in layers:
        sub_model = model.get_layer(layer_name)
        for i, layer in enumerate(sub_model.layers):
            if not layer.name.startswith('conv2d'):
                continue
            batch_norm = None
            if i + 1 < len(sub_model.layers) and sub_model.layers[i + 1].name.startswith('batch_norm'):
                batch_norm = sub_model.layers[i + 1]
            filters = layer.filters
            size = layer.kernel_size[0]
            in_dim = layer.input_shape[-1]
            if batch_norm is None:
                conv_bias = np.fromfile(wf, dtype=np.float32, count=filters)
            else:
                bn_weights = np.fromfile(wf, dtype=np.float32, count=4 * filters)
                bn_weights = bn_weights.reshape((4, filters))[[1, 0, 2, 3]]
                 
            conv_shape = (filters, in_dim, size, size)
            conv_weights = np.fromfile(wf, dtype=np.float32, count=np.prod(conv_shape))
            conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
             
            if batch_norm is None:
                layer.set_weights([conv_weights, conv_bias])
            else:
                layer.set_weights([conv_weights])
                batch_norm.set_weights(bn_weights)
     
    assert len(wf.read()) == 0, 'failed to read all data'
    wf.close()

Intersection over Union (IoU) Calculation for Bounding Boxes

IoU is a metric used to measure the overlap between two bounding boxes or regions in object detection tasks. It is calculated by dividing the area of intersection between the predicted and ground truth bounding boxes by the area of their union.

Bounding boxes are rectangular frames used to delineate the location of objects in images, defined by their top-left (x_min, y_min) and bottom-right (x_max, y_max) coordinates. They are essential in computer vision for tasks like object detection and image annotation.




def broadcast_iou(box_1, box_2):
    # box_1: (..., (x1, y1, x2, y2)); box_2: (N, (x1, y1, x2, y2))
    # broadcast boxes
    box_1 = tf.expand_dims(box_1, -2)
    box_2 = tf.expand_dims(box_2, 0)
    # new_shape: (..., N, (x1, y1, x2, y2))
    new_shape = tf.broadcast_dynamic_shape(tf.shape(box_1), tf.shape(box_2))
    box_1 = tf.broadcast_to(box_1, new_shape)
    box_2 = tf.broadcast_to(box_2, new_shape)
    int_w = tf.maximum(tf.minimum(box_1[..., 2], box_2[..., 2]) - tf.maximum(box_1[..., 0], box_2[..., 0]), 0)
    int_h = tf.maximum(tf.minimum(box_1[..., 3], box_2[..., 3]) - tf.maximum(box_1[..., 1], box_2[..., 1]), 0)
    int_area = int_w * int_h
    box_1_area = (box_1[..., 2] - box_1[..., 0]) * (box_1[..., 3] - box_1[..., 1])
    box_2_area = (box_2[..., 2] - box_2[..., 0]) * (box_2[..., 3] - box_2[..., 1])
    return int_area / (box_1_area + box_2_area - int_area)

A higher Intersection over Union (IoU) signifies greater overlap between bounding boxes, indicating better alignment and localization of objects. It is commonly employed as a key evaluation metric for model accuracy in computer vision applications, particularly object detection.
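As a quick sanity check of broadcast_iou, consider two hypothetical boxes in (x1, y1, x2, y2) format: the intersection area is 1 and the union is 4 + 4 - 1 = 7, so the IoU should be about 0.143.

box_a = tf.constant([[0.0, 0.0, 2.0, 2.0]])
box_b = tf.constant([[1.0, 1.0, 3.0, 3.0]])
print(broadcast_iou(box_a, box_b).numpy())  # roughly [[0.1429]]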

Model Freezing




def freeze_all(model, frozen = True):
    model.trainable = not frozen
    if isinstance(model, tf.keras.Model):
        for l in model.layers:
            freeze_all(l, frozen)

The freeze_all(model, frozen=True) function freezes or unfreezes all layers in a given model based on the Boolean parameter frozen. It recursively traverses the layers of the model and sets their trainable attribute accordingly.
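A typical use, once the full model is built later in this article, would be freezing the Darknet backbone for transfer learning while the detection heads stay trainable; a minimal sketch (it assumes the yolo model created in the Model Summary section, whose backbone layer is named 'yolo_darknet'):

# Freeze only the backbone; yolo_conv_* and yolo_output_* remain trainable.
freeze_all(yolo.get_layer('yolo_darknet'))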

Visualizing Predictions




def draw_outputs(img, outputs, class_names):
    boxes, objectness, classes, nums = outputs
    boxes, objectness, classes, nums = boxes[0], objectness[0], classes[0], nums[0]
    wh = np.flip(img.shape[0:2])
    for i in range(nums):
        x1y1 = tuple((np.array(boxes[i][0:2]) * wh).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4]) * wh).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        img = cv2.putText(img, '{} {:.4f}'.format(
            class_names[int(classes[i])], objectness[i]),
            x1y1, cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 0, 255), 2)
    return img

draw_outputs(img, outputs, class_names): This function takes an image (img), model outputs (outputs), and a list of class names (class_names). It draws bounding boxes and class labels on the image based on the model predictions.

Image Transformation




def transform_images(x_train, size):
    x_train = tf.image.resize(x_train, (size, size))
    x_train = x_train / 255
    return x_train

The transform_images(x_train, size) function resizes input images (x_train) to a specified size (size) and normalizes pixel values to the range [0, 1].

Target Transformation Function




@tf.function
def transform_targets_for_output(y_true, grid_size, anchor_idxs, classes):
    N = tf.shape(y_true)[0]
    y_true_out = tf.zeros(
        (N, grid_size, grid_size, tf.shape(anchor_idxs)[0], 6))
    anchor_idxs = tf.cast(anchor_idxs, tf.int32)
    indexes = tf.TensorArray(tf.int32, 1, dynamic_size=True)
    updates = tf.TensorArray(tf.float32, 1, dynamic_size=True)
    idx = 0
    for i in tf.range(N):
        for j in tf.range(tf.shape(y_true)[1]):
            if tf.equal(y_true[i][j][2], 0):
                continue
            anchor_eq = tf.equal(
                anchor_idxs, tf.cast(y_true[i][j][5], tf.int32))
            if tf.reduce_any(anchor_eq):
                box = y_true[i][j][0:4]
                box_xy = (y_true[i][j][0:2] + y_true[i][j][2:4]) / 2
                anchor_idx = tf.cast(tf.where(anchor_eq), tf.int32)
                grid_xy = tf.cast(box_xy // (1/grid_size), tf.int32)
                indexes = indexes.write(
                    idx, [i, grid_xy[1], grid_xy[0], anchor_idx[0][0]])
                updates = updates.write(
                    idx, [box[0], box[1], box[2], box[3], 1, y_true[i][j][4]])
                idx += 1
    return tf.tensor_scatter_nd_update(
        y_true_out, indexes.stack(), updates.stack())


def transform_targets(y_train, anchors, anchor_masks, classes):
    y_outs = []
    grid_size = 13
    anchors = tf.cast(anchors, tf.float32)
    anchor_area = anchors[..., 0] * anchors[..., 1]
    box_wh = y_train[..., 2:4] - y_train[..., 0:2]
    box_wh = tf.tile(tf.expand_dims(box_wh, -2), (1, 1, tf.shape(anchors)[0], 1))
    box_area = box_wh[..., 0] * box_wh[..., 1]
    intersection = tf.minimum(box_wh[..., 0], anchors[..., 0]) * tf.minimum(box_wh[..., 1], anchors[..., 1])
    iou = intersection / (box_area + anchor_area - intersection)
    anchor_idx = tf.cast(tf.argmax(iou, axis=-1), tf.float32)
    anchor_idx = tf.expand_dims(anchor_idx, axis=-1)
    y_train = tf.concat([y_train, anchor_idx], axis=-1)
    for anchor_idxs in anchor_masks:
        y_outs.append(transform_targets_for_output(
            y_train, grid_size, anchor_idxs, classes))
        grid_size *= 2
    return tuple(y_outs)

The transform_targets_for_output function transforms bounding boxes into a target tensor tailored for a specific output grid in an object detection model, considering anchor box information, grid positions, and objectness confidence, thus facilitating the training of YOLO-like architectures.

The transform_targets function prepares target tensors for YOLO-like object detection models by incorporating ground truth labels, anchor boxes, and class information. It calculates anchor indices based on Intersection over Union, appends them to labels, and generates target tensors for multiple output grids with varying scales, essential for effective model training.

These functions work together to prepare ground truth labels for training a YOLO model with multiple output scales and anchor configurations.
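A minimal sketch of how these functions can be invoked, assuming labels arrive as normalized (x1, y1, x2, y2, class) rows per image:

# One image containing a single ground-truth box with class id 5.
y_train = tf.constant([[[0.1, 0.2, 0.4, 0.6, 5.0]]], tf.float32)
targets = transform_targets(y_train, yolo_anchors, yolo_anchor_masks, classes=80)
for t in targets:
    print(t.shape)  # (1, 13, 13, 3, 6), (1, 26, 26, 3, 6), (1, 52, 52, 3, 6)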

Custom Batch Normalization




class BatchNormalization(tf.keras.layers.BatchNormalization):
 
    def call(self, x, training=False):
        if training is None:
            training = tf.constant(False)
        training = tf.logical_and(training, self.trainable)
        return super().call(x, training)

The code presents a custom Batch Normalization layer implemented using TensorFlow’s Keras API. By inheriting from the standard tf.keras.layers.BatchNormalization class and overriding the call method, the custom layer introduces additional logic to handle the training parameter. Notably, it sets training to False if it is initially None and ensures that Batch Normalization is applied only when the layer is trainable. This layer offers flexibility in controlling the application of Batch Normalization based on training mode and the layer’s trainable status, making it suitable for specific training scenarios or model architectures.

Darknet Convolution




def DarknetConv(x, filters, size, strides=1, batch_norm=True):
    if strides == 1:
        padding = 'same'
    else:
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)  # top left half-padding
        padding = 'valid'
    x = Conv2D(filters=filters, kernel_size=size,
               strides=strides, padding=padding,
               use_bias=not batch_norm, kernel_regularizer=l2(0.0005))(x)
    if batch_norm:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x

The code defines a function named DarknetConv, serving as a modular building block for convolutional layers within the Darknet architecture, notably used in YOLO (You Only Look Once) models. This function creates a 2D convolutional layer with options for customized padding, strides, and batch normalization. The function’s versatility allows for seamless integration into the Darknet backbone, enabling the construction of feature extraction layers. The inclusion of batch normalization and Leaky ReLU activation enhances training stability and facilitates feature learning. This modular approach enhances code readability and reusability, contributing to the efficient design and implementation of convolutional neural networks, particularly those based on the Darknet architecture.
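A brief illustration of the two padding modes (the shapes assume a 416×416 input):

x = Input([416, 416, 3])
y = DarknetConv(x, 32, 3)             # stride 1, 'same' padding -> (None, 416, 416, 32)
z = DarknetConv(y, 64, 3, strides=2)  # stride 2, half padding   -> (None, 208, 208, 64)
print(y.shape, z.shape)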

Darknet Residual and Darknet Block




def DarknetResidual(x, filters):
    prev = x
    x = DarknetConv(x, filters // 2, 1)
    x = DarknetConv(x, filters, 3)
    x = Add()([prev, x])
    return x

def DarknetBlock(x, filters, blocks):
    x = DarknetConv(x, filters, 3, strides=2)
    for _ in range(blocks):
        x = DarknetResidual(x, filters)
    return x

The two functions, DarknetResidual and DarknetBlock contribute to the construction of the Darknet architecture commonly employed in YOLO (You Only Look Once) models for object detection. The DarknetResidual function defines a residual block, where the input tensor x undergoes a series of DarknetConv operations, incorporating 1×1 and 3×3 convolutions. The result is added element-wise to the original input tensor, promoting feature reuse and gradient flow. The DarknetBlock function, on the other hand, orchestrates the creation of a Darknet block by utilizing the DarknetConv function with specific parameters. It includes a 3×3 convolutional layer with strided downsampling, followed by a series of DarknetResidual blocks. These functions contribute to the modularity and efficiency of the Darknet architecture, facilitating the design and implementation of deep neural networks for object detection tasks.

Darknet Architecture




def Darknet(name=None):
    x = inputs = Input([None, None, 3])
    x = DarknetConv(x, 32, 3)
    x = DarknetBlock(x, 64, 1)
    x = DarknetBlock(x, 128, 2)  # skip connection
    x = x_36 = DarknetBlock(x, 256, 8)  # skip connection
    x = x_61 = DarknetBlock(x, 512, 8)
    x = DarknetBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)

The Darknet function constructs the YOLO Darknet backbone, starting from a 3-channel input tensor and an initial convolutional layer with 32 filters. DarknetBlocks with increasing filter counts and residuals form the architecture, and intermediate outputs are captured after the 256-filter block (x_36) and the 512-filter block (x_61) to serve as skip connections. The model, encapsulated in a TensorFlow Keras Model, outputs three scales of feature maps, adhering to YOLO's multi-scale feature extraction for robust object detection.
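Passing a dummy tensor through the backbone confirms the three scales; a quick sketch assuming a 416×416 input:

backbone = Darknet(name='darknet_demo')
x_36, x_61, x = backbone(tf.zeros([1, 416, 416, 3]))
print(x_36.shape, x_61.shape, x.shape)
# (1, 52, 52, 256) (1, 26, 26, 512) (1, 13, 13, 1024)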

YOLO Convolution Block




def YoloConv(x_in, filters, name=None):
    if isinstance(x_in, tuple):
        inputs = Input(x_in[0].shape[1:]), Input(x_in[1].shape[1:])
        x, x_skip = inputs
        # concat with skip connection
        x = DarknetConv(x, filters, 1)
        x = UpSampling2D(2)(x)
        x = Concatenate()([x, x_skip])
    else:
        x = inputs = Input(x_in.shape[1:])
    x = DarknetConv(x, filters, 1)
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, filters, 1)
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, filters, 1)
    return Model(inputs, x, name=name)(x_in)

The YoloConv(x_in, filters, name=None) function defines a YOLO convolutional block consisting of multiple convolutional layers; when given a tuple of inputs, it first upsamples the feature map and concatenates it with the skip connection.

Output Function




def YoloOutput(x_in, filters, anchors, classes, name=None):
    x = inputs = Input(x_in.shape[1:])
    x = DarknetConv(x, filters * 2, 3)
    x = DarknetConv(x, anchors * (classes + 5), 1, batch_norm=False)
    x = Lambda(lambda x: tf.reshape(x, (-1, tf.shape(x)[1], tf.shape(x)[2], anchors, classes + 5)))(x)
    return tf.keras.Model(inputs, x, name=name)(x_in)

YoloOutput function constructs a YOLO output block responsible for predicting bounding boxes, objectness scores, and class probabilities. The output is reshaped to facilitate subsequent processing.

Post-processing Functions




def yolo_boxes(pred, anchors, classes):
    '''pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...classes))'''
    grid_size = tf.shape(pred)[1]
    box_xy, box_wh, objectness, class_probs = tf.split(
        pred, (2, 2, 1, classes), axis=-1)
    box_xy = tf.sigmoid(box_xy)
    objectness = tf.sigmoid(objectness)
    class_probs = tf.sigmoid(class_probs)
    pred_box = tf.concat((box_xy, box_wh), axis=-1)  # original xywh for loss
    grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # [gx, gy, 1, 2]
    box_xy = (box_xy + tf.cast(grid, tf.float32)) / \
        tf.cast(grid_size, tf.float32)
    box_wh = tf.exp(box_wh) * anchors
    box_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)
    return bbox, objectness, class_probs, pred_box


def yolo_nms(outputs, anchors, masks, classes):
    '''boxes, conf, type'''
    b, c, t = [], [], []
    for o in outputs:
        b.append(tf.reshape(o[0], (tf.shape(o[0])[0], -1, tf.shape(o[0])[-1])))
        c.append(tf.reshape(o[1], (tf.shape(o[1])[0], -1, tf.shape(o[1])[-1])))
        t.append(tf.reshape(o[2], (tf.shape(o[2])[0], -1, tf.shape(o[2])[-1])))
    bbox = tf.concat(b, axis=1)
    confidence = tf.concat(c, axis=1)
    class_probs = tf.concat(t, axis=1)
    scores = confidence * class_probs
    boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
        boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),
        scores=tf.reshape(
            scores,
            (tf.shape(scores)[0], -1, tf.shape(scores)[-1])
        ),
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=0.5,
        score_threshold=0.5
    )
    return boxes, scores, classes, valid_detections

  1. yolo_boxes(pred, anchors, classes): Decodes model predictions into bounding boxes, objectness scores, and class probabilities. Applies sigmoid functions and calculates box coordinates.
  2. yolo_nms(outputs, anchors, masks, classes): Performs non-maximum suppression (NMS) on model outputs, filtering redundant bounding boxes based on confidence scores and IoU thresholds.

Model Architecture




def YoloV3(size=None, channels=3, anchors=yolo_anchors, masks=yolo_anchor_masks, classes=80, training=False):
    x = inputs = Input([size, size, channels])
    x_36, x_61, x = Darknet(name='yolo_darknet')(x)
    x = YoloConv(x, 512, name='yolo_conv_0')
    output_0 = YoloOutput(x, 512, len(masks[0]), classes, name='yolo_output_0')
    x = YoloConv((x, x_61), 256, name='yolo_conv_1')
    output_1 = YoloOutput(x, 256, len(masks[1]), classes, name='yolo_output_1')
    x = YoloConv((x, x_36), 128, name='yolo_conv_2')
    output_2 = YoloOutput(x, 128, len(masks[2]), classes, name='yolo_output_2')
    if training:
        return Model(inputs, (output_0, output_1, output_2), name='yolov3')
    boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes),
                     name='yolo_boxes_0')(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes),
                     name='yolo_boxes_1')(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes),
                     name='yolo_boxes_2')(output_2)
    outputs = Lambda(lambda x: yolo_nms(x, anchors, masks, classes),
                     name='yolo_nms')((boxes_0[:3], boxes_1[:3], boxes_2[:3]))
    return Model(inputs, outputs, name='yolov3')

The YOLOv3 model architecture for object detection is defined, including functions for model creation, loss calculation, and training. Key components include YOLO-like convolutional layers, output layers, and loss functions.

Loss Function




def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
    def yolo_loss(y_true, y_pred):
        # 1. transform all pred outputs
        # y_pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...cls))
        pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(y_pred, anchors, classes)
        pred_xy = pred_xywh[..., 0:2]
        pred_wh = pred_xywh[..., 2:4]
        # 2. transform all true outputs
        # y_true: (batch_size, grid, grid, anchors, (x1, y1, x2, y2, obj, cls))
        true_box, true_obj, true_class_idx = tf.split(
            y_true, (4, 1, 1), axis=-1)
        true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
        true_wh = true_box[..., 2:4] - true_box[..., 0:2]
        # give higher weights to small boxes
        box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]
        # 3. inverting the pred box equations
        grid_size = tf.shape(y_true)[1]
        grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
        true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
            tf.cast(grid, tf.float32)
        true_wh = tf.math.log(true_wh / anchors)
        true_wh = tf.where(tf.math.is_inf(true_wh), tf.zeros_like(true_wh), true_wh)
        # 4. calculate all masks
        obj_mask = tf.squeeze(true_obj, -1)
        # ignore false positive when iou is over threshold
        true_box_flat = tf.boolean_mask(true_box, tf.cast(obj_mask, tf.bool))
        best_iou = tf.reduce_max(broadcast_iou(
            pred_box, true_box_flat), axis=-1)
        ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)
        # 5. calculate all losses
        xy_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
        wh_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
        obj_loss = binary_crossentropy(true_obj, pred_obj)
        obj_loss = obj_mask * obj_loss + \
            (1 - obj_mask) * ignore_mask * obj_loss
        # Could also use binary_crossentropy instead
        class_loss = obj_mask * sparse_categorical_crossentropy(
            true_class_idx, pred_class)
        # 6. sum over (batch, gridx, gridy, anchors) => (batch, 1)
        xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
        wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
        obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
        class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))
        return xy_loss + wh_loss + obj_loss + class_loss
    return yolo_loss

A custom YOLOv3 loss function for TensorFlow, crucial for training object detection models, is defined here. The loss computation excludes false positives whose IoU with a ground-truth box exceeds a threshold and takes box size into account via a box loss scale. The function evaluates the discrepancies between predicted and true bounding boxes, objectness scores, and class probabilities, capturing the subtleties of YOLO-style object detection and helping to improve model accuracy during training.
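For training, one loss instance is typically created per output scale and matched to its anchor mask; a minimal compilation sketch, assuming the training-mode model defined above:

model = YoloV3(416, training=True, classes=80)
loss = [YoloLoss(yolo_anchors[mask], classes=80)
        for mask in yolo_anchor_masks]
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=loss)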

Model Summary




yolo = YoloV3(classes = 80)
yolo.summary()

Model: "yolov3"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, None, None, 3)] 0 []

yolo_darknet (Functional) ((None, None, None, 256), 4062064 ['input_1[0][0]']
(None, None, None, 512), 0
(None, None, None, 1024))

yolo_conv_0 (Functional) (None, None, None, 512) 1102438 ['yolo_darknet[0][2]']
4

yolo_conv_1 (Functional) (None, None, None, 256) 2957312 ['yolo_conv_0[0][0]',
'yolo_darknet[0][1]']

yolo_conv_2 (Functional) (None, None, None, 128) 741376 ['yolo_conv_1[0][0]',
'yolo_darknet[0][0]']

yolo_output_0 (Functional) (None, None, None, 3, 85) 4984063 ['yolo_conv_0[0][0]']

yolo_output_1 (Functional) (None, None, None, 3, 85) 1312511 ['yolo_conv_1[0][0]']

yolo_output_2 (Functional) (None, None, None, 3, 85) 361471 ['yolo_conv_2[0][0]']

yolo_boxes_0 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_0[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_boxes_1 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_1[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_boxes_2 (Lambda) ((None, None, None, 3, 4), 0 ['yolo_output_2[0][0]']
(None, None, None, 3, 1),
(None, None, None, 3, 80)
, (None, None, None, 3, 4)
)

yolo_nms (Lambda) ((None, 100, 4), 0 ['yolo_boxes_0[0][0]',
(None, 100), 'yolo_boxes_0[0][1]',
(None, 100), 'yolo_boxes_0[0][2]',
(None,)) 'yolo_boxes_1[0][0]',
'yolo_boxes_1[0][1]',
'yolo_boxes_1[0][2]',
'yolo_boxes_2[0][0]',
'yolo_boxes_2[0][1]',
'yolo_boxes_2[0][2]']

==================================================================================================
Total params: 62001757 (236.52 MB)
Trainable params: 61949149 (236.32 MB)
Non-trainable params: 52608 (205.50 KB)
________________________________________

Visualizing the Model Architecture




plot_model(
    yolo, rankdir = 'TB',
    to_file = 'yolo_model1.png',
    show_shapes = False,
    show_layer_names = True,
    expand_nested = False
)

Output:

Yolo Model

Loading Weights and Making Predictions on Images




load_darknet_weights(yolo, '/Users/gfg0406/Desktop/GFG TASKS/yolov3.weights')

def predict(image_file, visualize = True, figsize = (16, 16)):
    img = tf.image.decode_image(open(image_file, 'rb').read(), channels=3)
    img = tf.expand_dims(img, 0)
    img = transform_images(img, 416)
    boxes, scores, classes, nums = yolo.predict(img)
    img = cv2.cvtColor(cv2.imread(image_file), cv2.COLOR_BGR2RGB)
    img = draw_outputs(img, (boxes, scores, classes, nums), class_names)
    if visualize:
        fig, axes = plt.subplots(figsize = figsize)
        plt.imshow(img)
        plt.show()
    return boxes, scores, classes, nums

image_file = glob.glob('/Users/gfg0406/Desktop/GFG TASKS/Images/*')

First, we load the YOLOv3 model (yolo) with pre-trained Darknet weights. Then a predict function is defined that makes predictions from an image file path. The image is read, decoded, and preprocessed to the YOLOv3 input size using the transform_images function. The yolo.predict method returns the bounding box predictions, confidence scores, predicted classes, and number of detections. OpenCV is used to read the original image and convert it to RGB. Finally, the predicted outputs are drawn onto the image with the draw_outputs function, and the result is displayed if the visualize option is set to True. The function returns the predicted boxes, scores, classes, and the total number of detections.

The code then applies this prediction function to a list of image files in the specified directory.

Change the directory path according to your device.

Detections for Sample Images




boxes, scores, classes, nums = predict(image_file[0], figsize = (20, 20))
boxes, scores, classes, nums = predict(image_file[1], figsize = (20, 20))

Output:

Object Detection by YOLO

Object Detection by YOLO

