
Local Relational Network


Local Relational Network (LR-Net) was proposed by researchers from Tsinghua University and Microsoft Research. The motivation behind the paper is that the convolution layer has been the dominant feature extractor in computer vision for years, yet its spatial aggregation is essentially a pattern-matching process with fixed filters, which is inefficient at modeling visual elements whose spatial distribution varies.

To deal with this inefficiency, the paper presents a new image feature extractor, called the local relation layer, that adaptively determines aggregation weights based on the compositional relationship of local pixel pairs. With this relational approach, it can compose visual elements into higher-level entities more efficiently, which benefits semantic inference. The local relation layer can be used as a direct replacement for the convolution layer, at the cost of some overhead.


This section describes the general formulation of an image feature extractor, on which the local relational layer is based.

Let the input and output of a layer be x ∈ R^{C×H×W} and y ∈ R^{C'×H'×W'}, where C and C' are the numbers of input and output channels and H, W, H', W' are the input and output spatial resolutions. Existing basic image feature extractors generally produce the output feature by weighted aggregation of the input features, which can be represented by the equation below:

y(c', p') = \sum_{c \in \Omega_{c'},\ p \in \Omega_{p'}} \omega(c', c, p', p) \cdot x(c, p)

where c, c' and p = (h, w), p' = (h', w') index the input and output channels and feature map positions, respectively; Ω_{c'} and Ω_{p'} denote the scope of channel and spatial aggregation of input features in producing the output feature value at channel c' and position p'; and ω(c', c, p', p) denotes the aggregation weight from (c, p) to (c', p').

Feature extractors of this form differ in three aspects:
  • Parameterization method: It defines the model weights to be learned. The most common parameterization is to learn the aggregation weights ω directly. Some methods instead learn the weights of a meta-network (θ) that generates the aggregation weights.
  • Aggregation scope: It defines the range of channels and spatial positions involved in the aggregation computation. For the channel scope, regular convolution includes all input channels when computing each output channel. For greater efficiency, some methods consider only one or a group of input channels when producing one channel of the output feature.
  • Aggregation weights: These are typically learned as network parameters or computed from them. Almost all variants of convolution determine the weights in a top-down fashion, where they are either fixed across positions or determined by a meta-network on the input features.
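
To make the general formulation concrete, the sketch below writes the weighted aggregation as explicit loops for the simplest case, a regular convolution, where the weights depend only on the channels and on the offset between p and p'. The function name `aggregate` and the tensor sizes are illustrative assumptions, not part of the paper; the result is compared against `torch.nn.functional.conv2d` as a sanity check.

```python
import torch
import torch.nn.functional as F


def aggregate(x, weight):
    """Weighted aggregation with weights shared across output positions.

    x:      (C, H, W) input feature map
    weight: (C_out, C, k, k) aggregation weights
    Returns a (C_out, H - k + 1, W - k + 1) map (no padding, stride 1).
    """
    c_out, _, k, _ = weight.shape
    _, h, w = x.shape
    y = torch.zeros(c_out, h - k + 1, w - k + 1)
    for i in range(h - k + 1):          # row of the output position p'
        for j in range(w - k + 1):      # column of the output position p'
            patch = x[:, i:i + k, j:j + k]   # spatial scope around p'
            # Sum over every input channel c and every position p in the scope.
            y[:, i, j] = (weight * patch).sum(dim=(1, 2, 3))
    return y


torch.manual_seed(0)
x = torch.randn(3, 6, 6)
weight = torch.randn(4, 3, 3, 3)
y = aggregate(x, weight)
y_ref = F.conv2d(x[None], weight)[0]    # the same aggregation, done by PyTorch
print(torch.allclose(y, y_ref, atol=1e-4))
```

In this special case the weights are fixed across positions; the local relational layer described next instead computes them from the input features at every position.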

Local Relational Layer

The aggregation weights of the local relational layer can be expressed as:

\omega(p', p) = \mathrm{softmax}\left( \Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p)) + f_{\theta_g}(p - p') \right)

where \Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p)) is the measure of composability between the target pixel p' and a pixel p within its scope, based on their appearance after the transformations f_{\theta_q}(x_{p'}) and f_{\theta_k}(x_p), and f_{\theta_g}(p - p') is a geometric prior computed from the relative position of p to p'.
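
As a rough illustration of this expression, the sketch below computes the aggregation weights for a single target pixel over a 3×3 neighborhood, using scalar query and key values and the three composability functions listed later in this article. The function name `local_relation_weights`, the `phi` argument and the example values are assumptions made for this sketch, and the geometric prior is simply set to zero here.

```python
import torch


def local_relation_weights(q_center, k_neigh, geo_prior, phi="mul"):
    """Aggregation weights over one neighborhood of a target pixel p'.

    q_center:  scalar query value of the target pixel p'
    k_neigh:   (k*k,) scalar key values of the pixels p in the neighborhood
    geo_prior: (k*k,) geometric prior values for the offsets p - p'
    """
    if phi == "sq":        # squared difference
        appearance = -(q_center - k_neigh) ** 2
    elif phi == "abs":     # absolute difference
        appearance = -(q_center - k_neigh).abs()
    elif phi == "mul":     # multiplication
        appearance = q_center * k_neigh
    else:
        raise ValueError(f"unknown composability function: {phi}")
    # Softmax normalizes the weights over the neighborhood.
    return torch.softmax(appearance + geo_prior, dim=-1)


q = torch.tensor(0.5)
keys = torch.tensor([0.1, 0.5, 0.9, -0.3, 0.5, 1.2, 0.0, 0.7, -1.0])
geo = torch.zeros(9)       # no geometric preference in this toy example

w_sq = local_relation_weights(q, keys, geo, "sq")
w_abs = local_relation_weights(q, keys, geo, "abs")
w_mul = local_relation_weights(q, keys, geo, "mul")
print(w_sq.sum(), w_abs.sum(), w_mul.sum())   # each set of weights sums to 1
```

With the two difference-based functions, neighbors whose key is closest to the query receive the largest weight, while multiplication favors large keys of the same sign as the query.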

  • Locality: Bottom-up determination of aggregation weights is typically applied over the full image. The local relational layer instead limits the aggregation computation to a small local area, e.g. a 7×7 neighborhood. The layer proves to be more effective at utilizing large kernels: its accuracy improves steadily as the kernel size grows, whereas the accuracy of ConvNets saturates. This may be because the representation power of ConvNets is bottlenecked by the number of fixed filters.
  • Appearance Composability: The authors follow a common approach in deep learning to calculate the appearance composability:

\Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p))

    where x_{p'} and x_p are projected into a query and a key embedding space, respectively. Previous models use vectors for these embeddings, but here the authors use scalars and argue that this gives a better speed-accuracy trade-off. The authors consider the following instantiations of Φ:
    • Squared difference: \Phi(q_{p'}, k_p) = -(q_{p'} - k_p)^{2}
    • Absolute difference: \Phi(q_{p'}, k_p) = -\left| q_{p'} - k_p \right|
    • Multiplication: \Phi(q_{p'}, k_p) = q_{p'} \cdot k_p
  • Geometric Priors: An important aspect differentiating the local relational layer from the convolution layer is the use of geometric priors. The geometric prior is encoded by a small network applied to the relative position of p to p'. This small network consists of two channel transformation layers with a ReLU activation between them. The authors argue that a small network is better than directly learning the values of the geometric prior, especially when the neighborhood size is large.
  • Weight Normalization: The layer uses softmax to normalize the aggregation weights.
  • Channel Sharing: The layer uses channel sharing in the aggregation computation, where multiple channels share the same aggregation weights. This reduces computation without significantly affecting accuracy.
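
The channel sharing idea from the list above can be sketched with plain broadcasting. The sizes below (8 channels, 4 channels per shared weight map, a 3×3 neighborhood) and the choice of which channels form a group are assumptions for illustration only; the point is that only C / m weight vectors are needed per position instead of C.

```python
import torch

C, m, k = 8, 4, 3            # channels, channels per shared weight map, kernel size
n_groups = C // m            # distinct weight maps per position (2 here)

torch.manual_seed(0)
# One k*k neighborhood of input features for every channel at one position p'.
features = torch.randn(C, k * k)
# One normalized weight vector per group instead of one per channel.
weights = torch.softmax(torch.randn(n_groups, k * k), dim=-1)

# Channel c uses weight map c % n_groups: view the channels as (m, n_groups)
# and broadcast the (n_groups, k*k) weights over the m axis.
shared = features.view(m, n_groups, k * k) * weights[None]
out = shared.sum(dim=-1).view(C)     # one aggregated value per channel
print(out.shape)
```

Computing the weights is the expensive part of the layer, so dividing their number by m is where the saving comes from.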

The total complexity of the local relational layer can be calculated by:

\text{Complexity} = O\left( \left( \frac{1 + s^2}{m} + 1 \right) \cdot C \cdot (C + k^2) \cdot \frac{HW}{s^2} \right)

where H×W is the spatial size of the input feature map, k×k is the spatial neighborhood, C is the number of channels, m is the number of channels per aggregation computation, and s is the stride.
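
To get a feel for how channel sharing affects this expression, the snippet below simply evaluates it for two values of m. It assumes that s denotes the stride and ignores the constant factors hidden by the O-notation, so the numbers are only meaningful relative to each other; `lr_layer_cost` is a helper name made up for this sketch.

```python
def lr_layer_cost(H, W, C, k, m, s=1):
    """Evaluate the complexity expression above, without constant factors."""
    return ((1 + s ** 2) / m + 1) * C * (C + k ** 2) * H * W / s ** 2


# A 56x56 feature map with 100 channels and a 7x7 neighborhood.
no_sharing = lr_layer_cost(H=56, W=56, C=100, k=7, m=1)
sharing = lr_layer_cost(H=56, W=56, C=100, k=7, m=8)
ratio = sharing / no_sharing
print(f"relative cost with m=8: {ratio:.3f}")
```

Only the first factor depends on m, so for a fixed stride the ratio is the same for any feature map size.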

Local Relational Network:

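LR-Net follows the ResNet architecture, except that the spatial convolution layers in ResNet (the initial 7×7 convolution and the 3×3 convolutions) are replaced by local relational layers, while the 1×1 layers are kept. Below is the architecture of the Local Relational Network (LR-Net) next to ResNet-50: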

| Stage | Output | ResNet-50 | LR-Net-50 (7×7, m=8) |
|---|---|---|---|
| res1 | 112×112 | 7×7 conv, 64, stride 2 | 1×1, 64; 7×7 LR, 64, stride 2 |
| res2 | 56×56 | 3×3 max pool, stride 2; [1×1, 64; 3×3 conv, 64; 1×1, 256] ×3 | 3×3 max pool, stride 2; [1×1, 100; 7×7 LR, 100; 1×1, 256] ×3 |
| res3 | 28×28 | [1×1, 128; 3×3 conv, 128; 1×1, 512] ×4 | [1×1, 200; 7×7 LR, 200; 1×1, 512] ×4 |
| res4 | 14×14 | [1×1, 256; 3×3 conv, 256; 1×1, 1024] ×6 | [1×1, 400; 7×7 LR, 400; 1×1, 1024] ×6 |
| res5 | 7×7 | [1×1, 512; 3×3 conv, 512; 1×1, 2048] ×3 | [1×1, 800; 7×7 LR, 800; 1×1, 2048] ×3 |
| | | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax |
| # params | | 25.5 × 10^6 | 23.3 × 10^6 |
| FLOPs | | 4.3 × 10^9 | 4.3 × 10^9 |


In this implementation, we will be using the PyTorch library, which is pre-installed in Colaboratory. To install it locally, please follow the PyTorch installation guide.


import torch


class GeometryPrior(torch.nn.Module):
    """Small network that maps relative positions (p - p') to prior values."""

    def __init__(self, k, channels, multiplier=0.5):
        super(GeometryPrior, self).__init__()
        self.channels = channels
        self.k = k
        # Relative (row, col) offsets of the k x k neighborhood, scaled to [-1, 1].
        # Stored as a buffer so that it moves with the module across devices.
        coords = torch.linspace(-1.0, 1.0, k)
        rows = coords.view(k, 1).expand(k, k)
        cols = coords.view(1, k).expand(k, k)
        self.register_buffer('position', torch.stack([rows, cols]).unsqueeze(0))
        # Two channel transformation layers with a ReLU in between.
        self.l1 = torch.nn.Conv2d(2, int(multiplier * channels), 1)
        self.l2 = torch.nn.Conv2d(int(multiplier * channels), channels, 1)

    def forward(self):
        x = self.l2(torch.nn.functional.relu(self.l1(self.position)))
        return x.view(1, self.channels, 1, self.k ** 2)


class KeyQueryMap(torch.nn.Module):
    """1x1 convolution that produces one key (or query) map per group of m channels."""

    def __init__(self, channels, m):
        super(KeyQueryMap, self).__init__()
        self.l = torch.nn.Conv2d(channels, channels // m, 1)

    def forward(self, x):
        return self.l(x)


class AppearanceComposability(torch.nn.Module):
    """Multiplies every key in a k x k neighborhood with the query at its center."""

    def __init__(self, k, padding, stride):
        super(AppearanceComposability, self).__init__()
        self.k = k
        self.unfold = torch.nn.Unfold(k, 1, padding, stride)

    def forward(self, x):
        key_map, query_map = x
        n, c = key_map.shape[:2]
        k2 = self.k ** 2
        # Unfold returns (N, C * k * k, L); rearrange to (N, C, L, k * k).
        key_unfold = self.unfold(key_map).view(n, c, k2, -1).transpose(2, 3)
        query_unfold = self.unfold(query_map).view(n, c, k2, -1).transpose(2, 3)
        center = k2 // 2  # index of the target pixel p' (k is assumed to be odd)
        return key_unfold * query_unfold[:, :, :, center:center + 1]


def combine_prior(appearance_kernel, geometry_kernel):
    # Softmax over the k * k positions of each neighborhood.
    return torch.nn.functional.softmax(appearance_kernel + geometry_kernel,
                                       dim=-1)


class LocalRelationalLayer(torch.nn.Module):
    """Local relational layer as described in the paper."""

    def __init__(self, channels, k, stride=1, m=None, padding=0):
        super(LocalRelationalLayer, self).__init__()
        self.channels = channels
        self.k = k
        self.stride = stride
        self.m = m if m is not None else 8  # channels sharing one weight map
        self.padding = padding
        self.kmap = KeyQueryMap(channels, self.m)
        self.qmap = KeyQueryMap(channels, self.m)
        self.ac = AppearanceComposability(k, padding, stride)
        self.gp = GeometryPrior(k, channels // self.m)
        self.unfold = torch.nn.Unfold(k, 1, padding, stride)
        self.final1x1 = torch.nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        k2 = self.k ** 2
        gpk = self.gp()                                # (1, C/m, 1, k*k)
        km = self.kmap(x)
        qm = self.qmap(x)
        ak = self.ac((km, qm))                         # (N, C/m, L, k*k)
        ck = combine_prior(ak, gpk)[:, None]           # (N, 1, C/m, L, k*k)
        # Group the input channels as (m, C/m) so each weight map serves m channels.
        x_unfold = self.unfold(x).view(n, self.m, c // self.m, k2, -1)
        x_unfold = x_unfold.transpose(3, 4)            # (N, m, C/m, L, k*k)
        h_out = (h + 2 * self.padding - (self.k - 1) - 1) // self.stride + 1
        w_out = (w + 2 * self.padding - (self.k - 1) - 1) // self.stride + 1
        # Weighted sum over each neighborhood, then back to a feature map.
        pre_output = (ck * x_unfold).sum(dim=-1).reshape(n, c, h_out, w_out)
        return self.final1x1(pre_output)


layer = LocalRelationalLayer(channels=64, k=7, stride=1, m=8)
print(layer)

Output:

LocalRelationalLayer(
  (kmap): KeyQueryMap(
    (l): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (qmap): KeyQueryMap(
    (l): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (ac): AppearanceComposability(
    (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  )
  (gp): GeometryPrior(
    (l1): Conv2d(2, 4, kernel_size=(1, 1), stride=(1, 1))
    (l2): Conv2d(4, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  (final1x1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
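
The layer gathers neighborhoods with `torch.nn.Unfold`, so it helps to check how that module lays out its result. The small sketch below uses a toy 5×5 input with made-up sizes: `Unfold` returns a tensor of shape (N, C·k·k, L), with the channel as the slow axis and the kernel offset as the fast axis of the middle dimension, and L the number of output positions in row-major order.

```python
import torch

N, C, H, W, k = 1, 2, 5, 5, 3
x = torch.arange(N * C * H * W, dtype=torch.float32).view(N, C, H, W)

unfold = torch.nn.Unfold(kernel_size=k, dilation=1, padding=0, stride=1)
cols = unfold(x)                               # (N, C * k * k, L)

# Output size for padding 0, dilation 1, stride 1.
h_out = (H - (k - 1) - 1) // 1 + 1
w_out = (W - (k - 1) - 1) // 1 + 1
L = h_out * w_out

# Split the middle dimension into (C, k * k) and move L in front of k * k.
patches = cols.view(N, C, k * k, L).transpose(2, 3)   # (N, C, L, k * k)

# Neighborhood whose top-left corner is at (i, j), taken from channel c.
i, j, c = 1, 2, 1
direct = x[0, c, i:i + k, j:j + k].reshape(-1)
from_unfold = patches[0, c, i * w_out + j]
print(torch.equal(direct, from_unfold))
```

Viewing the unfolded tensor directly as (N, C, L, k·k) without the transpose would mix kernel offsets with output positions, so the split into (C, k·k) has to come first.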



Last Updated : 06 Jun, 2021