Local Relational Network

Last Updated : 06 Jun, 2021

Local Relational Network was proposed by researchers from Tsinghua University and Microsoft Research. The paper starts from the observation that the convolution layer has been the dominant feature extractor in computer vision for years. However, its spatial aggregation is essentially a pattern matching process, which is inefficient at modeling visual elements with varying spatial distributions.

To address this inefficiency, the paper presents a new image feature extractor, called the local relation layer, that adaptively determines aggregation weights based on the compositional relationship of local pixel pairs. With this relational approach, it can compose visual elements into higher-level entities more efficiently, which benefits semantic inference. The local relation layer can be used as a direct replacement for the convolution layer, with some overhead.

Architecture:

In this section, we describe the general formulation of a feature extractor, on which the local relational layer is based:

Let the input and output of a layer be denoted by $x \in \mathbb{R}^{C \times H \times W}$ and $y \in \mathbb{R}^{C' \times H' \times W'}$, with C and C' being the number of input/output channels and H, W, H', W' the input/output spatial resolution. Existing basic image feature extractors generally produce the output feature by weighted aggregation of the input features, which can be represented by the equation below:

$y(c', p') = \sum_{c \in \Omega_{c'},\, p \in \Omega_{p'}} \omega(c', c, p', p) \cdot x(c, p)$

where c, c' and p = (h, w), p' = (h', w') index the input and output channels and feature map positions, respectively; $\Omega_{c'}$ and $\Omega_{p'}$ denote the scope for channel and spatial aggregation of input features in producing the output feature value at channel c' and position p', respectively; ω(c', c, p', p) denotes the aggregation weight from c, p to c', p'.

Parameterization method: This defines the model weights to be learned. The most common parameterization is to learn the aggregation weights ω directly. Some methods instead learn the weights θ of a meta-network that generates the aggregation weights.
Aggregation scope: This defines the range of channels and spatial positions involved in the aggregation computation. For the channel scope, regular convolution includes all input channels when computing each output channel. For greater efficiency, some methods consider only one or a group of input channels when producing one channel of the output feature.
Aggregation Weights: These are typically learned as network parameters or computed from them. In almost all variants of convolution, the aggregation weights are determined in a top-down fashion: they are either fixed across positions or determined by a meta-network on the input features at each position.
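To make the general formulation concrete, the following is a minimal sketch in plain Python that evaluates the weighted aggregation over all input channels and a square spatial scope. The function names `aggregate` and `make_conv_weight`, the `radius` parameter, and the choice to skip positions outside the feature map are assumptions made for illustration only. Regular convolution appears as the special case where the weight depends only on the channel pair and the relative offset, so it is fixed across positions.

```python
def aggregate(x, out_channels, weight_fn, radius):
    """Evaluate y(c', p') = sum over c and p of weight(c', c, p', p) * x(c, p).

    x is a nested list indexed as x[c][h][w]. The spatial scope is the
    (2 * radius + 1) x (2 * radius + 1) window centred on p'.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    y = [[[0.0] * W for _ in range(H)] for _ in range(out_channels)]
    for c_out in range(out_channels):
        for h_out in range(H):
            for w_out in range(W):
                total = 0.0
                for c_in in range(C):  # channel scope: all input channels
                    for h in range(h_out - radius, h_out + radius + 1):
                        for w in range(w_out - radius, w_out + radius + 1):
                            if 0 <= h < H and 0 <= w < W:  # skip positions outside the map
                                weight = weight_fn(c_out, c_in, (h_out, w_out), (h, w))
                                total += weight * x[c_in][h][w]
                y[c_out][h_out][w_out] = total
    return y


def make_conv_weight(kernel, radius):
    """Convolution-style weights: they depend only on the offset p - p'."""
    def weight_fn(c_out, c_in, p_out, p_in):
        dh = p_in[0] - p_out[0] + radius
        dw = p_in[1] - p_out[1] + radius
        return kernel[c_out][c_in][dh][dw]
    return weight_fn


# One input channel, one output channel, 3x3 box filter (illustrative values).
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]]
box = [[[[1.0] * 3 for _ in range(3)]]]
y = aggregate(x, 1, make_conv_weight(box, 1), 1)
```

Passing a different `weight_fn`, one that inspects the input features at `p_out` and `p_in`, would turn the same loop into an adaptive aggregation.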

Local Relational Layer

The aggregation weights of the local relational layer can be expressed as:

$\omega(p', p) = \mathrm{softmax}\left(\Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p)) + f_{\theta_g}(p - p')\right)$

where $\Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p))$ is a measure of composability between the target pixel p' and a pixel p within its scope, based on their appearance after the transformations $f_{\theta_q}$ and $f_{\theta_k}$, and $f_{\theta_g}(p - p')$ is a geometric prior on the position of p relative to p'.
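As a rough illustration of this expression, the sketch below computes the aggregation weights for a single target pixel over a flattened 3×3 neighborhood. It assumes scalar query and key values that have already been produced by the two transformations, takes Φ to be a plain product (one possible choice), and uses an all-zero geometric term; the function names and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_relation_weights(q_target, keys, geometry):
    """Weights for one target pixel p' over its neighborhood.

    q_target : scalar query embedding of the target pixel
    keys     : scalar key embeddings of the neighborhood pixels
    geometry : geometric prior values for each relative offset p - p'
    """
    # Composability is taken to be a product here (an assumption).
    scores = [q_target * k + g for k, g in zip(keys, geometry)]
    return softmax(scores)


# Flattened 3x3 neighborhood; index 4 is the centre pixel.
keys = [0.1, 0.9, 0.2, 0.8, 1.0, 0.7, 0.0, 0.3, 0.1]
geometry = [0.0] * 9
weights = local_relation_weights(1.0, keys, geometry)
```

The weights sum to one, and the neighbor whose key is most compatible with the query receives the largest share of the aggregation.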

Locality: Bottom-up determination of aggregation weights has typically been applied over the full image. The local relational layer instead limits the aggregation computation to a small local area, e.g. a 7×7 neighborhood. The layer proves more effective at utilizing large kernels: its accuracy improves steadily as the neighborhood grows, whereas the accuracy of ConvNets saturates as the kernel size increases. This may be because the power of ConvNets is bottlenecked by the number of filters.
Appearance Composability: The authors follow a common approach in deep learning to calculate the appearance composability:

$\Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p))$

where x_p' and x_p are projected to a query embedding $q_{p'} = f_{\theta_q}(x_{p'})$ and a key embedding $k_p = f_{\theta_k}(x_p)$, respectively. Previous models use vectors for these embeddings, but here the authors use scalars and argue that this gives a better speed-accuracy trade-off. The authors consider the following instantiations of Φ:
- Squared difference: $\Phi(q_{p'}, k_p) = -(q_{p'} - k_p)^{2}$
- Absolute difference: $\Phi(q_{p'}, k_p) = -\left| q_{p'} - k_p \right|$
- Multiplication: $\Phi(q_{p'}, k_p) = q_{p'} \cdot k_p$
Geometric Priors: An important aspect differentiating the local relational layer from convolution layers is the use of geometric priors. The geometric prior is encoded by a small network applied to the position of p relative to p'. This small network consists of two channel transformation layers with a ReLU activation between them. The authors argue that a small network is better than directly learning the values of the geometric prior, especially when the neighborhood size is large.
Weight Normalization: The layer normalizes the aggregation weights with a softmax over the spatial scope.
Channel Sharing: The local relational layer uses channel sharing in the aggregation computation, where multiple channels share the same aggregation weights. This reduces computation without significantly affecting accuracy.
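The components listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the hidden width of the geometric prior network, and every parameter value are assumptions chosen to keep the example small.

```python
def squared_difference(q, k):
    return -(q - k) ** 2


def absolute_difference(q, k):
    return -abs(q - k)


def multiplication(q, k):
    return q * k


def geometry_prior(offset, w1, b1, w2, b2):
    """Small network on the relative position p - p'.

    Two linear transformations with a ReLU in between, mapping a
    (dh, dw) offset to a single scalar prior value.
    """
    hidden = [max(0.0, sum(w * o for w, o in zip(row, offset)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def aggregate_shared(channels, weights):
    """Channel sharing: every channel in the group reuses the same weights.

    channels : list of channels, each a flat list over the neighborhood
    weights  : one set of aggregation weights over the same neighborhood
    """
    return [sum(w * v for w, v in zip(weights, channel)) for channel in channels]


# Hypothetical parameters for a prior network with two hidden units.
w1 = [[1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
b2 = 0.0
prior = geometry_prior((2, -3), w1, b1, w2, b2)

# Two channels aggregated with one shared set of weights.
shared = aggregate_shared([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]], [0.25, 0.5, 0.25])
```

Because the weights are computed once per group rather than once per channel, the cost of the composability and softmax steps is divided by the group size.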

The total computational complexity of the local relational layer is given by:

$O\left(\left(\frac{1 + s^2}{m} + 1\right) \cdot C \cdot (C + k^2) \cdot \frac{HW}{s^2}\right)$

where H×W is the size of the input feature map, k×k the spatial neighborhood, C the number of channels, m the number of channels per aggregation computation, and s the stride of the layer.
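To get a feel for how the terms interact, the sketch below simply evaluates the expression inside the O(·) as written above, ignoring constant factors. The function name `lr_layer_cost` is hypothetical, `s` is treated as the stride (an assumption), and the feature map sizes are arbitrary example values.

```python
def lr_layer_cost(H, W, C, k, m, s=1):
    """Evaluate ((1 + s^2) / m + 1) * C * (C + k^2) * H * W / s^2.

    This is the term inside the big-O expression, not a measured FLOP count.
    """
    return ((1 + s ** 2) / m + 1) * C * (C + k ** 2) * H * W / s ** 2


# Example: 56x56 feature map, 64 channels, 7x7 neighborhood.
cost_no_sharing = lr_layer_cost(56, 56, 64, 7, m=1)
cost_shared = lr_layer_cost(56, 56, 64, 7, m=8)
```

Increasing `m`, the number of channels that share one set of aggregation weights, shrinks the first factor and therefore the overall cost.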