
Local Relational Network

Local Relational Network (LR-Net) was proposed by researchers at Tsinghua University and Microsoft Research. The idea behind the paper is that although the convolution layer has been the dominant feature extractor in computer vision for years, its spatial aggregation is essentially a pattern-matching process, which is inefficient at modeling visual elements with varying spatial distributions.

To deal with this inefficiency, the paper presents a new image feature extractor, called the local relation layer, that adaptively determines its aggregation weights based on the compositional relationship of local pixel pairs. With this relational approach, it can compose visual elements into higher-level entities in a more efficient manner that benefits semantic inference. The local relation layer can be used directly in place of a convolution layer with only a small overhead.



Architecture:

In this section, we describe the general feature-extraction formulation on which the local relational layer is based.

Let's denote the input and output of a layer by x ∈ R^(C×H×W) and y ∈ R^(C'×H'×W'), with C and C' being the number of input/output channels and H, W and H', W' the input/output spatial resolutions. Existing basic image feature extractors generally produce the output feature by a weighted aggregation of input features, which can be represented by the equation below:

y(c', p') = Σ_{c ∈ Ω_c', p ∈ Ω_p'} ω(c', c, p', p) · x(c, p)

where c, c' and p = (h, w), p' = (h', w') index the input and output channels and feature-map positions, respectively; Ω_c' and Ω_p' denote the channel and spatial scopes of input features aggregated in producing the output value at channel c' and position p'; and ω(c', c, p', p) denotes the aggregation weight from (c, p) to (c', p').
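For intuition, here is a small, self-contained sketch (illustrative code, not from the paper) showing that an ordinary convolution is a special case of this formulation: its aggregation weight ω(c', c, p', p) depends only on the channel pair and the relative offset p − p', and is shared across all positions p'.

import torch

# Illustrative only: a 3x3 convolution written as the generic weighted
# aggregation above, where w(c', c, p', p) = w[c', c, p - p'] depends only on
# the relative offset and is shared across all output positions p'.
C_in, C_out, H, W, k = 2, 4, 5, 5, 3
x = torch.randn(1, C_in, H, W)
w = torch.randn(C_out, C_in, k, k)
x_pad = torch.nn.functional.pad(x, (1, 1, 1, 1))

y = torch.zeros(1, C_out, H, W)
for c_out in range(C_out):                        # output channel c'
    for i in range(H):                            # output position p' = (i, j)
        for j in range(W):
            patch = x_pad[0, :, i:i + k, j:j + k]   # spatial scope around p'
            y[0, c_out, i, j] = (w[c_out] * patch).sum()

# Matches PyTorch's convolution (cross-correlation) with the same weights.
assert torch.allclose(y, torch.nn.functional.conv2d(x, w, padding=1), atol=1e-5)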

Local Relational Layer

The aggregation weight of the local relational layer can be expressed with the following expression:

ω(p', p) = softmax_{p ∈ Ω_p'} ( Φ( f_θq(x_p'), f_θk(x_p) ) + f_θg(p − p') )

where Φ(f_θq(x_p'), f_θk(x_p)) is the measure of composability between the target pixel p' and a pixel p within its scope Ω_p', based on their appearance after the query transformation f_θq and the key transformation f_θk, and f_θg(p − p') is a learnable geometric prior on the relative position of p with respect to p'.
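As a toy illustration (hypothetical numbers, not the paper's code), the weight vector for a single target pixel p' with a 3×3 scope is obtained by adding the appearance and geometry terms and normalizing them with a softmax over the scope:

import torch

k = 3
appearance = torch.randn(k * k)   # Phi(f_q(x_p'), f_k(x_p)) for each p in the scope
geometry = torch.randn(k * k)     # f_g(p - p') for each relative offset
weights = torch.softmax(appearance + geometry, dim=0)
print(weights.sum())              # the k*k aggregation weights sum to 1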

The total complexity of the local relational layer, dominated by the 1×1 key/query projections, the appearance-composability terms and the weighted aggregation, can be approximated by:

Ω ≈ H × W × ( 2 × C × (C/m) + k² × (C/m) + k² × C )

where H×W is the spatial size of the input feature map, k×k is the spatial neighborhood aggregated for each output position, C is the number of channels, and every m channels share one aggregation weight computation.
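Plugging illustrative numbers into this estimate (the values below are examples, not figures from the paper) gives a feel for the cost of a single layer:

# Rough FLOP estimate for one local relational layer using the
# approximation above (illustrative values only).
H, W, C, k, m = 56, 56, 256, 7, 8
key_query = 2 * C * (C // m)      # 1x1 key/query projections per position
composability = k * k * (C // m)  # appearance terms over the k x k scope
aggregation = k * k * C           # weighted aggregation over the scope
flops = H * W * (key_query + composability + aggregation)
print(round(flops / 1e6, 1), "MFLOPs")   # ~95.6 MFLOPs for this setting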

Local Relational Network:

LR-Net is similar to the ResNet architecture, except that the spatial convolution layers in ResNet (the 7×7 stem convolution and the 3×3 convolutions in the bottleneck blocks) are replaced by local relational layers, while the 1×1 convolutions are kept. Below is the architecture of the Local Relational Network (LR-Net-50) compared with ResNet-50:

Stage | Output | ResNet-50 | LR-Net-50 (7×7, m=8)
res1 | 112×112 | 7×7 conv, 64, stride 2 | 1×1 conv, 64; 7×7 LR, 64, stride 2
res2 (×3) | 56×56 | 3×3 max pool, stride 2, then [1×1, 64; 3×3 conv, 64; 1×1, 256] ×3 | 3×3 max pool, stride 2, then [1×1, 100; 7×7 LR, 100; 1×1, 256] ×3
res3 (×4) | 28×28 | [1×1, 128; 3×3 conv, 128; 1×1, 512] ×4 | [1×1, 200; 7×7 LR, 200; 1×1, 512] ×4
res4 (×6) | 14×14 | [1×1, 256; 3×3 conv, 256; 1×1, 1024] ×6 | [1×1, 400; 7×7 LR, 400; 1×1, 1024] ×6
res5 (×3) | 7×7 | [1×1, 512; 3×3 conv, 512; 1×1, 2048] ×3 | [1×1, 800; 7×7 LR, 800; 1×1, 2048] ×3
 | 1×1 | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax
# params | | 25.5 × 10⁶ | 23.3 × 10⁶
FLOPs | | 4.3 × 10⁹ | 4.3 × 10⁹

Implementation

In this implementation, we will be using the PyTorch and Torchvision libraries. These libraries come pre-installed in Colaboratory; to install them locally (for example, with pip install torch torchvision), please check out this guide.

import torch


class GeometryPrior(torch.nn.Module):
    # Learns the geometric prior term f_g(p - p') from the k x k relative positions.
    def __init__(self, k, channels, multiplier=0.5):
        super(GeometryPrior, self).__init__()
        self.channels = channels
        self.k = k
        # Learnable grid of relative positions in [-1, 1], registered as a
        # parameter so it is trained and moved to the right device with the module.
        self.position = torch.nn.Parameter(2 * torch.rand(1, 2, k, k) - 1)
        self.l1 = torch.nn.Conv2d(2, int(multiplier * channels), 1)
        self.l2 = torch.nn.Conv2d(int(multiplier * channels), channels, 1)

    def forward(self):
        # Two 1x1 convolutions map each 2-d offset to a per-channel prior value.
        x = self.l2(torch.nn.functional.relu(self.l1(self.position)))
        return x.view(1, self.channels, self.k ** 2, 1)


class KeyQueryMap(torch.nn.Module):
    # 1x1 convolution producing the key or query map with channels // m channels.
    def __init__(self, channels, m):
        super(KeyQueryMap, self).__init__()
        self.l = torch.nn.Conv2d(channels, channels // m, 1)

    def forward(self, x):
        return self.l(x)


class AppearanceComposability(torch.nn.Module):
    # Appearance term: composability of every key pixel in the k x k scope
    # with the query (centre) pixel of that scope.
    def __init__(self, k, padding, stride):
        super(AppearanceComposability, self).__init__()
        self.k = k
        self.unfold = torch.nn.Unfold(k, 1, padding, stride)

    def forward(self, x):
        key_map, query_map = x
        k = self.k
        # Unfold to (N, C', k*k, L), where L is the number of output positions.
        key_map_unfold = self.unfold(key_map).view(
            key_map.shape[0], key_map.shape[1], k ** 2, -1)
        query_map_unfold = self.unfold(query_map).view(
            query_map.shape[0], query_map.shape[1], k ** 2, -1)
        # Multiply every key in the scope by the query at the centre of the scope.
        return key_map_unfold * query_map_unfold[:, :, k ** 2 // 2:k ** 2 // 2 + 1, :]


def combine_priors(appearance_kernel, geometry_kernel):
    # Aggregation weights: softmax of (appearance + geometry) over the k*k scope.
    return torch.nn.functional.softmax(appearance_kernel + geometry_kernel,
                                       dim=2)


class LocalRelationalLayer(torch.nn.Module):
    """
    Local Relational Layer as described in the paper.
    """
    def __init__(self, channels, k, stride=1, m=None, padding=0):
        super(LocalRelationalLayer, self).__init__()
        self.channels = channels
        self.k = k
        self.stride = stride
        self.m = m if m is not None else 8
        self.padding = padding
        self.kmap = KeyQueryMap(channels, self.m)
        self.qmap = KeyQueryMap(channels, self.m)
        self.ac = AppearanceComposability(k, padding, stride)
        self.gp = GeometryPrior(k, channels // self.m)
        self.unfold = torch.nn.Unfold(k, 1, padding, stride)
        self.final1x1 = torch.nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        gpk = self.gp()                        # (1, C/m, k*k, 1) geometric prior
        km = self.kmap(x)                      # (N, C/m, H, W) key map
        qm = self.qmap(x)                      # (N, C/m, H, W) query map
        ak = self.ac((km, qm))                 # (N, C/m, k*k, L) appearance term
        ck = combine_priors(ak, gpk)[:, None]  # (N, 1, C/m, k*k, L) weights
        # Unfold the input so that every m channels share the same weights.
        x_unfold = self.unfold(x).view(x.shape[0], self.m,
                                       x.shape[1] // self.m, self.k ** 2, -1)
        # Weighted aggregation over the k x k scope of each output position.
        pre_output = torch.sum(ck * x_unfold, dim=3)   # (N, m, C/m, L)
        h_out = (x.shape[2] + 2 * self.padding - (self.k - 1) - 1) // \
            self.stride + 1
        w_out = (x.shape[3] + 2 * self.padding - (self.k - 1) - 1) // \
            self.stride + 1
        pre_output = pre_output.reshape(x.shape[0], x.shape[1], h_out, w_out)
        return self.final1x1(pre_output)
layer = LocalRelationalLayer(channels=64, k=7, stride=1, m=8)
print(layer)

Output:
LocalRelationalLayer(
  (kmap): KeyQueryMap(
    (l): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (qmap): KeyQueryMap(
    (l): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (ac): AppearanceComposability(
    (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  )
  (gp): GeometryPrior(
    (l1): Conv2d(2, 4, kernel_size=(1, 1), stride=(1, 1))
    (l2): Conv2d(4, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  (final1x1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
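As a quick sanity check (an assumed usage example, not part of the original article's code), the layer can be applied to a random feature map; with k = 7 and no padding, a 32×32 input shrinks to 26×26 while the channel count is preserved:

x = torch.randn(1, 64, 32, 32)   # dummy batch of feature maps
out = layer(x)
print(out.shape)                 # torch.Size([1, 64, 26, 26])

To mirror the LR-Net-50 table above, one rough (hypothetical) way to assemble a full network is to take torchvision's ResNet-50 and swap the 3×3 convolution in every bottleneck block for a 7×7 local relational layer. Note that this sketch keeps ResNet's original channel widths (64/128/256/512) rather than the widened 100/200/400/800 channels of LR-Net-50, because the simplified layer above requires the channel count to be divisible by m:

import torchvision

def convert_to_lr_net(model, k=7, m=8):
    # Replace the 3x3 convolution of every bottleneck block with a k x k
    # local relational layer of the same width and stride.
    for block in [*model.layer1, *model.layer2, *model.layer3, *model.layer4]:
        conv2 = block.conv2
        block.conv2 = LocalRelationalLayer(conv2.in_channels, k=k,
                                           stride=conv2.stride[0], m=m,
                                           padding=k // 2)
    return model

lr_like_net = convert_to_lr_net(torchvision.models.resnet50())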

References:

Han Hu, Zheng Zhang, Zhenda Xie, Stephen Lin. "Local Relation Networks for Image Recognition." ICCV 2019.
