Local Relational Network
The Local Relational Network was proposed by researchers from Tsinghua University and Microsoft Research. The paper starts from the observation that the convolution layer has been the dominant feature extractor in computer vision for years. However, its spatial aggregation is essentially a pattern-matching process, which is inefficient at modeling visual elements with varying spatial distributions.
To address this inefficiency, the paper presents a new image feature extractor, called the local relational layer, that adaptively determines aggregation weights based on the compositional relationship of local pixel pairs. With this relational approach, it can compose visual elements into higher-level entities more efficiently, which benefits semantic inference. The local relational layer can be used as a direct replacement for the convolution layer, with some overhead.
In this section, we describe the general formulation of feature extraction on which the local relational layer is based:
Let the input and output of a layer be x ∈ R^(C×H×W) and y ∈ R^(C’×H’×W’), with C and C’ being the number of input/output channels and H, W, H’, W’ the input/output spatial resolution. Existing basic image feature extractors generally produce the output feature by weighted aggregation of the input features, which can be represented by the equation below:

$$y(c', p') = \sum_{c \in \Omega_c,\; p \in \Omega_p} \omega(c', c, p', p) \cdot x(c, p)$$

where c, c’ and p = (h, w), p’ = (h’, w’) index the input and output channels and feature map positions, respectively; Ω_c and Ω_p denote the scope for channel and spatial aggregation of input features in producing the output feature value at channel c’ and position p’, respectively; and ω(c’, c, p’, p) denotes the aggregation weight from (c, p) to (c’, p’).
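The weighted aggregation described above can be written as a naive loop. The following is a minimal NumPy sketch, not an efficient implementation. The names `aggregate` and `conv_weight` are hypothetical, and the sketch assumes a square k×k spatial scope with zero padding and a channel scope that covers all input channels. Regular convolution is recovered when the weight depends only on c’, c and the offset p − p’.

```python
import numpy as np

def aggregate(x, weight_fn, out_channels, k=3):
    """y[c', p'] = sum over c and over p in a k x k window of w(c', c, p', p) * x[c, p]."""
    C, H, W = x.shape
    r = k // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))  # zero padding keeps the H x W resolution
    y = np.zeros((out_channels, H, W))
    for co in range(out_channels):
        for i in range(H):
            for j in range(W):
                for c in range(C):                   # channel scope: all input channels
                    for di in range(-r, r + 1):      # spatial scope: k x k window
                        for dj in range(-r, r + 1):
                            w = weight_fn(co, c, (i, j), (i + di, j + dj))
                            y[co, i, j] += w * xp[c, i + di + r, j + dj + r]
    return y

def conv_weight(kernel):
    """Regular convolution: the weight is fixed across positions and depends only on p - p'."""
    r = kernel.shape[-1] // 2
    def w(co, c, p_out, p_in):
        return kernel[co, c, p_in[0] - p_out[0] + r, p_in[1] - p_out[1] + r]
    return w

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4))          # C=2, H=W=4
kernel = rng.standard_normal((3, 2, 3, 3))  # (C', C, k, k)
y = aggregate(x, conv_weight(kernel), out_channels=3)
print(y.shape)  # (3, 4, 4)
```

Passing a different `weight_fn` changes how the weights are determined without touching the aggregation loop itself.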
Variants of this formulation differ in three aspects:
- Parameterization method: This defines the model weights to be learned. The most common approach is to learn the weights ω directly. Some methods instead learn the weights of a meta-network (θ) that generates the aggregation weights.
- Aggregation scope: This defines the range of channels and spatial positions involved in the aggregation computation. For the channel scope, regular convolution includes all input channels when computing each output channel. For greater efficiency, some methods consider only one or a group of input channels in producing one channel of the output feature.
- Aggregation weights: These are typically learned as network parameters or computed from those parameters. Almost all convolution variants determine their weights in a top-down fashion, where the weights are either fixed across positions or determined by a meta-network on the input features.
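As a rough illustration of how the channel aggregation scope affects cost, the sketch below counts the learned weights of a k×k convolution when each output channel aggregates over all input channels, over a group of them, or over a single one. The helper `conv_weight_count` and the 64-channel, 3×3 setting are assumptions chosen for illustration.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Learned weights (bias excluded) of a k x k convolution with channel grouping."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

regular = conv_weight_count(64, 64, 3)               # each output sees all 64 inputs
grouped = conv_weight_count(64, 64, 3, groups=8)     # each output sees a group of 8 inputs
depthwise = conv_weight_count(64, 64, 3, groups=64)  # each output sees one input channel
print(regular, grouped, depthwise)  # 36864 4608 576
```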
Local Relational Layer
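The local relational layer computes its aggregation weights with the following expression:

$$\omega(p', p) = \mathrm{softmax}\big(\Phi(f_{\theta_q}(x_{p'}), f_{\theta_k}(x_p)) + f_{\theta_g}(p - p')\big)$$

where Φ(f_θq(x_p’), f_θk(x_p)) is the measure of composability between the target pixel p’ and a pixel p within its scope, based on their appearance after the transformations f_θq and f_θk, and f_θg(p − p’) is a geometric prior on their relative position.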
- Locality: Bottom-up methods typically aggregate input features over the full image. In contrast, the local relational layer limits the aggregation computation to a small local area, e.g., a 7×7 neighborhood. The layer makes more effective use of large kernels: its accuracy improves steadily as the kernel size grows, whereas the accuracy of regular ConvNets saturates. This may be because the power of ConvNets is bottlenecked by the number of filters.
- Appearance Composability: The authors follow a common approach in deep learning to calculate the appearance composability, in which x_p’ and x_p are projected to a query and a key embedding space, giving q_p’ = f_θq(x_p’) and k_p = f_θk(x_p). Previous models use vectors for these embeddings, but here the authors use scalars and argue that they give a better speed-accuracy trade-off. The authors consider the following instantiations of the composability function Φ:
  - Squared difference: Φ(q_p’, k_p) = −(q_p’ − k_p)²
  - Absolute difference: Φ(q_p’, k_p) = −|q_p’ − k_p|
- Geometric Priors: An important aspect that differentiates the local relational layer from the convolution layer is its use of geometric priors. The geometric prior is encoded by a small network applied to the relative position of p to p’. This small network consists of two channel-transformation layers with a ReLU activation between them. The authors argue that a small network is better than directly learning the values of the geometric prior, especially when the neighborhood size is large.
- Weight Normalization: The layer uses softmax to normalize the aggregation weights.
- Channel Sharing: Each local relational layer uses channel sharing in the aggregation computation, where multiple channels share the same aggregation weights. This reduces computation without significantly affecting accuracy.
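The components above can be combined into a small NumPy sketch of the aggregation step. It assumes scalar queries and keys per channel group, the squared-difference composability, a two-layer geometric-prior network, softmax normalization over the k×k window, and zero padding at the borders. The function name `local_relation`, the weight shapes and the padding choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_relation(x, Wq, Wk, W1, W2, k=3, m=2):
    """x: (C, H, W). Wq, Wk: (C // m, C) scalar query/key maps, one per channel group.
    W1: (hidden, 2) and W2: (C // m, hidden) form the geometric-prior network."""
    C, H, W = x.shape
    G, r = C // m, k // 2
    q = np.einsum("gc,chw->ghw", Wq, x)    # one scalar query per group and pixel
    key = np.einsum("gc,chw->ghw", Wk, x)  # one scalar key per group and pixel
    pad = ((0, 0), (r, r), (r, r))
    key_p, x_p = np.pad(key, pad), np.pad(x, pad)  # zero padding (a simplification)

    # Relative positions p - p' inside the window, in row-major order: (k*k, 2).
    offs = np.array([(dy, dx) for dy in range(-r, r + 1)
                     for dx in range(-r, r + 1)], dtype=float)
    prior = W2 @ np.maximum(W1 @ offs.T, 0.0)  # (G, k*k): two layers with a ReLU between

    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            k_win = key_p[:, i:i + k, j:j + k].reshape(G, k * k)
            x_win = x_p[:, i:i + k, j:j + k].reshape(C, k * k)
            score = -(q[:, i, j, None] - k_win) ** 2 + prior  # composability + prior
            w = softmax(score, axis=1)                        # weights sum to 1 per group
            w_full = np.repeat(w, m, axis=0)                  # m channels share each weight map
            y[:, i, j] = (w_full * x_win).sum(axis=1)
    return y

rng = np.random.default_rng(0)
C, m, hidden = 4, 2, 3
x = rng.standard_normal((C, 5, 5))
Wq, Wk = rng.standard_normal((C // m, C)), rng.standard_normal((C // m, C))
W1, W2 = rng.standard_normal((hidden, 2)), rng.standard_normal((C // m, hidden))
y = local_relation(x, Wq, Wk, W1, W2, k=3, m=m)
print(y.shape)  # (4, 5, 5)
```

Because the weights in each window are non-negative and sum to one, every output value is a convex combination of the input values in that window, which is a quick sanity check for any implementation.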
The total complexity of the local relational layer is determined by the H×W dimensions of the input feature map, the k×k spatial neighborhood, the number of channels C, and the number of channels m per aggregation computation.
Local Relational Network:
LR-Net is similar to the ResNet architecture, except that the spatial convolution layers in ResNet (the 7×7 and 3×3 convolutions) are replaced by local relational layers. Below is the architecture of the Local Relational Network (LR-Net-50) alongside ResNet-50. The bracketed block in each stage is repeated the number of times given in the Stage column.

| Stage | Output | ResNet-50 | LR-Net-50 (7×7, m=8) |
|---|---|---|---|
| res1 | 112×112 | 7×7 conv, 64, stride 2 | 1×1, 64; 7×7 LR, 64, stride 2 |
| res2 (×3) | 56×56 | 3×3 max pool, stride 2; [1×1, 64; 3×3 conv, 64; 1×1, 256] | 3×3 max pool, stride 2; [1×1, 100; 7×7 LR, 100; 1×1, 256] |
| res3 (×4) | 28×28 | [1×1, 128; 3×3 conv, 128; 1×1, 512] | [1×1, 200; 7×7 LR, 200; 1×1, 512] |
| res4 (×6) | 14×14 | [1×1, 256; 3×3 conv, 256; 1×1, 1024] | [1×1, 400; 7×7 LR, 400; 1×1, 1024] |
| res5 (×3) | 7×7 | [1×1, 512; 3×3 conv, 512; 1×1, 2048] | [1×1, 800; 7×7 LR, 800; 1×1, 2048] |
| | 1×1 | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax |
| # params | | 25.5 × 10⁶ | 23.3 × 10⁶ |
| FLOPs | | 4.3 × 10⁹ | 4.3 × 10⁹ |
In this implementation, we will be using the PyTorch and Torchvision libraries. These libraries are pre-installed in Colaboratory. To install them locally, refer to the PyTorch installation guide.
Printing the layer gives the following module structure:

LocalRelationalLayer(
  (kmap): KeyQueryMap(
    (l): Conv2d(64, 9, kernel_size=(1, 1), stride=(1, 1))
  )
  (qmap): KeyQueryMap(
    (l): Conv2d(64, 9, kernel_size=(1, 1), stride=(1, 1))
  )
  (ac): AppearanceComposability(
    (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  )
  (gp): GeometryPrior(
    (l1): Conv2d(2, 4, kernel_size=(1, 1), stride=(1, 1))
    (l2): Conv2d(4, 8, kernel_size=(1, 1), stride=(1, 1))
  )
  (unfold): Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
  (final1x1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
)
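A layer with a broadly similar structure could be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the implementation that produced the printout above: the class name `LocalRelationSketch`, the hidden width of the geometric-prior network, the squared-difference composability and the zero padding are all choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationSketch(nn.Module):
    """Illustrative local relation layer; names and sizes are assumptions."""

    def __init__(self, channels, k=7, m=8, hidden=4):
        super().__init__()
        assert channels % m == 0, "channels must be divisible by m"
        self.k, self.m, self.groups = k, m, channels // m
        # 1x1 convolutions that map each pixel to one scalar query/key per channel group.
        self.qmap = nn.Conv2d(channels, self.groups, kernel_size=1)
        self.kmap = nn.Conv2d(channels, self.groups, kernel_size=1)
        # Geometric prior: two channel-transformation layers with a ReLU in between.
        self.gp = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, self.groups, kernel_size=1),
        )
        self.final1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        # Relative positions (dy, dx) inside the k x k window, shape (1, 2, k, k).
        r = k // 2
        coords = torch.arange(-r, r + 1, dtype=torch.float32)
        dy, dx = torch.meshgrid(coords, coords, indexing="ij")
        self.register_buffer("offsets", torch.stack([dy, dx]).unsqueeze(0))

    def forward(self, x):
        n, c, h, w = x.shape
        k, g, m = self.k, self.groups, self.m
        q = self.qmap(x).reshape(n, g, 1, h * w)
        # Gather the k*k neighboring keys of every pixel (zero padded at the borders).
        keys = F.unfold(self.kmap(x), k, padding=k // 2).reshape(n, g, k * k, h * w)
        prior = self.gp(self.offsets).reshape(1, g, k * k, 1)
        # Squared-difference composability plus prior, normalized over the window.
        weights = torch.softmax(-(q - keys) ** 2 + prior, dim=2)
        vals = F.unfold(x, k, padding=k // 2).reshape(n, g, m, k * k, h * w)
        # The m channels of each group share the same aggregation weights.
        out = (weights.unsqueeze(2) * vals).sum(dim=3).reshape(n, c, h, w)
        return self.final1x1(out)


layer = LocalRelationSketch(channels=64, k=7, m=8)
x = torch.randn(1, 64, 14, 14)
print(layer(x).shape)  # torch.Size([1, 64, 14, 14])
```

Using `F.unfold` keeps the sketch short at the cost of materializing all k×k neighborhoods in memory, which is acceptable for small feature maps but would likely need a dedicated kernel at scale.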