Figure 1: Sketch of a RepMLP. Here N, C, H, W are the batch size, number of input channels, height, and width; h, w, g, p, O are the desired partition height and width, number of groups, padding, and number of output channels, respectively.
The input feature map is split into a set of partitions, and the Global Perceptron adds the correlations among partitions onto each partition. Then the Local Perceptron captures the local patterns with several conv layers, and the Partition Perceptron models the long-range dependencies. This sketch assumes $N = C = 1$, $H = W$, $H/h = W/w = 2$ (i.e., a channel is split into four partitions) for better readability. We assume $h, w \geq 7$ so that the Local Perceptron has conv branches of kernel sizes 1, 3, 5, 7. The shapes of the parameter tensors are shown alongside the FC and conv layers. Via structural re-parameterization, the training-time block with conv and BN layers is equivalently converted into a three-FC block, which is saved and used for inference.
2.1 Formulation
We assume the feature map is denoted as $M \in R^{N \times C \times H \times W}$, and we use $F$ and $W$ for the kernels of conv and FC layers, respectively. For simplicity, this paper adopts a PyTorch-style data layout and pseudo-code notation; for example, the data flow through a conv layer is formulated as:
$$
M^{(out)} = \mathrm{CONV}(M^{(in)}, F, p)
$$
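As a concrete illustration of this notation, the formula above is just a 2D convolution in PyTorch. Below is a minimal sketch with illustrative sizes (the names and numbers are assumptions for the example, not from the paper):

```python
import torch
import torch.nn.functional as nnF

N, C, H, W = 2, 4, 14, 14           # batch size, input channels, height, width
O, K, p = 8, 3, 1                   # output channels, kernel size, padding
M_in = torch.randn(N, C, H, W)      # M^(in)
F_kernel = torch.randn(O, C, K, K)  # conv kernel F
M_out = nnF.conv2d(M_in, F_kernel, padding=p)  # M^(out) = CONV(M^(in), F, p)
assert M_out.shape == (N, O, H, W)
```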
2.3 Partition Perceptron
Partition Perceptron contains FC and BN layers and takes the partition map as input. Its output is then transformed back into $(N, O, H, W)$ via reshape, re-arrange, and reshape operations. To reduce the parameters of FC3, we further borrow the idea of grouped conv, which is defined as:
$$
M^{(out)} = \mathrm{gCONV}(M^{(in)}, F, g, p), \quad F \in R^{O \times \frac{C}{g} \times K \times K}
$$
Similarly, the kernel of a grouped FC with $g$ groups is $W \in R^{Q \times \frac{P}{g}}$, so the parameter count is reduced by $g\times$. Although such a grouped FC is not directly supported by some frameworks (e.g., PyTorch), it can be implemented with a grouped $1 \times 1$ conv in three steps:
1) reshape $V^{(in)}$ into a feature map $M'$ with spatial size $1 \times 1$;
2) perform a $1 \times 1$ conv with groups $= g$;
3) reshape the output feature map back into $(N, Q)$.
The whole procedure is formulated as follows:
$$
\begin{array}{c}
M' = \mathrm{RS}(V^{(in)}, (N, P, 1, 1)) \\
F' = \mathrm{RS}(W, (Q, \frac{P}{g}, 1, 1)) \\
\mathrm{gMMUL}(V^{(in)}, W, g) = \mathrm{RS}(\mathrm{gCONV}(M', F', g, 0), (N, Q))
\end{array}
$$
The corresponding code is as follows:
```python
def __init__(...):
    ...
    self.fc3 = nn.Conv2d(self.C * self.h * self.w, self.O * self.h * self.w, 1, 1, 0,
                         bias=deploy, groups=fc3_groups)
    self.fc3_bn = nn.Identity() if deploy else nn.BatchNorm1d(self.O * self.h * self.w)

def forward(...):
    ...
    # Feed partition map into Partition Perceptron
    fc3_inputs = partitions.reshape(-1, self.C * self.h * self.w, 1, 1)
    fc3_out = self.fc3(fc3_inputs)
    fc3_out = fc3_out.reshape(-1, self.O * self.h * self.w)
    fc3_out = self.fc3_bn(fc3_out)
    fc3_out = fc3_out.reshape(-1, self.h_parts, self.w_parts, self.O, self.h, self.w)
```
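The three steps above can also be distilled into a standalone function. The following is a minimal runnable sketch (the name gmmul and the sizes are assumptions for illustration, not part of the released code):

```python
import torch
import torch.nn.functional as nnF

def gmmul(v, w, g):
    # gMMUL(V^(in), W, g): a grouped FC implemented with a grouped 1x1 conv.
    # v: (N, P) input vectors; w: (Q, P // g) grouped FC kernel.
    n, p = v.shape
    q = w.shape[0]
    m = v.reshape(n, p, 1, 1)          # step 1: M' = RS(V^(in), (N, P, 1, 1))
    f = w.reshape(q, p // g, 1, 1)     # F' = RS(W, (Q, P/g, 1, 1))
    out = nnF.conv2d(m, f, groups=g)   # step 2: 1x1 conv with groups = g
    return out.reshape(n, q)           # step 3: RS back to (N, Q)

v = torch.randn(4, 96)
w = torch.randn(64, 96 // 2)
assert gmmul(v, w, g=2).shape == (4, 64)
```

With $g = 1$ this reduces to an ordinary FC, and the kernel's parameter count shrinks by $g\times$ as stated above.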
2.4 Local Perceptron
Local Perceptron feeds the partition map through several conv layers; Fig. 1 sketches the branches with kernel sizes 1, 3, 5, 7. In theory, the only constraint on the kernel size $K$ is $K \leq h, w$ (since a kernel larger than the resolution makes no sense), but we use only odd kernel sizes, following the common practice of ConvNets. For simplicity, we use square $K \times K$ kernels; other forms may also work. The outputs of all the conv branches are added to the output of the Partition Perceptron as the final output.
```python
def __init__(...):
    ...
    self.reparam_conv_k = reparam_conv_k
    if not deploy and reparam_conv_k is not None:
        for k in reparam_conv_k:
            conv_branch = nn.Sequential()
            conv_branch.add_module('conv', nn.Conv2d(in_channels=self.C, out_channels=self.O,
                                                     kernel_size=k, padding=k // 2,
                                                     bias=False, groups=fc3_groups))
            conv_branch.add_module('bn', nn.BatchNorm2d(self.O))
            self.__setattr__('repconv{}'.format(k), conv_branch)

def forward(...):
    ...
    # Feed partition map into Local Perceptron
    if self.reparam_conv_k is not None and not self.deploy:
        conv_inputs = partitions.reshape(-1, self.C, self.h, self.w)
        conv_out = 0
        for k in self.reparam_conv_k:
            conv_branch = self.__getattr__('repconv{}'.format(k))
            conv_out += conv_branch(conv_inputs)
        conv_out = conv_out.reshape(-1, self.h_parts, self.w_parts, self.O, self.h, self.w)
        fc3_out += conv_out  # fc3_out is the Partition Perceptron output from above
```
2.5 Merging Conv into FC
Before converting a RepMLP into three FC layers, we first show how to merge a conv into an FC. Given an FC kernel $W^{(1)}$ and a conv kernel $F$ with padding $p$, we wish to construct $W'$ such that:
$$
\mathrm{MMUL}(M^{(in)}, W') = \mathrm{MMUL}(M^{(in)}, W^{(1)}) + \mathrm{CONV}(M^{(in)}, F, p)
$$
We note that MMUL is additive: for any two kernels $W^{(1)}$ and $W^{(2)}$ of the same shape,
$$
\mathrm{MMUL}(M^{(in)}, W^{(1)}) + \mathrm{MMUL}(M^{(in)}, W^{(2)}) = \mathrm{MMUL}(M^{(in)}, W^{(1)} + W^{(2)})
$$
Therefore, as long as we can construct a $W^{(F,p)}$ of the same shape as $W^{(1)}$, we can merge $F$ into $W^{(1)}$, provided that:
$$
\mathrm{MMUL}(M^{(in)}, W^{(F,p)}) = \mathrm{CONV}(M^{(in)}, F, p)
$$
Obviously, such a $W^{(F,p)}$ must exist, since a conv can be viewed as a sparse FC. However, platforms differ in their conv acceleration strategies and memory layouts, so a construction of this matrix tailored to platform A may not suit platform B. We therefore propose a simple, platform-agnostic solution.
As noted above, for any input $M^{(in)}$ and conv kernel $F$ with padding $p$, there exists an FC kernel $W^{(F,p)}$ such that:
$$
M^{(out)} = \mathrm{CONV}(M^{(in)}, F, p) = \mathrm{MMUL}(M^{(in)}, W^{(F,p)})
$$
Written in matrix-multiplication form, this becomes:
$$
V^{(out)} = V^{(in)} \cdot {W^{(F,p)}}^{T}
$$
Now insert an identity matrix $I \in R^{Chw \times Chw}$: since ${W^{(F,p)}}^{T} = I \cdot {W^{(F,p)}}^{T}$, reshaping $I$ into a batch of feature maps and convolving it yields exactly ${W^{(F,p)}}^{T}$:
$$
\begin{array}{c}
M^{(I)} = \mathrm{RS}(I, (Chw, C, h, w)) \\
I \cdot {W^{(F,p)}}^{T} = \mathrm{RS}(\mathrm{CONV}(M^{(I)}, F, p), (Chw, Ohw)) \\
V^{(out)} = V^{(in)} \cdot \mathrm{RS}(\mathrm{CONV}(M^{(I)}, F, p), (Chw, Ohw))
\end{array}
$$
Comparing the last formula with the matrix-multiplication form above, we obtain:
$$
W^{(F,p)} = \mathrm{RS}(\mathrm{CONV}(M^{(I)}, F, p), (Chw, Ohw))^{T}
$$
This formula shows exactly how to construct $W^{(F,p)}$ from $F$ and $p$. In short, the FC kernel equivalent to a conv kernel can be obtained by feeding an appropriately reshaped identity matrix through the conv and reshaping the output.
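This construction is easy to implement. Below is a minimal sketch for the single-group case (the name conv_to_fc and all sizes are assumptions for illustration), followed by a numerical check that the resulting FC reproduces the conv:

```python
import torch
import torch.nn.functional as nnF

def conv_to_fc(conv_kernel, h, w, p):
    # Build W^(F,p), the FC kernel equivalent to conv kernel F with padding p
    # (single group assumed). conv_kernel: (O, C, K, K); returns (Ohw, Chw).
    O, C = conv_kernel.shape[0], conv_kernel.shape[1]
    identity = torch.eye(C * h * w)                # I in R^{Chw x Chw}
    m_i = identity.reshape(C * h * w, C, h, w)     # M^(I) = RS(I, (Chw, C, h, w))
    out = nnF.conv2d(m_i, conv_kernel, padding=p)  # CONV(M^(I), F, p)
    return out.reshape(C * h * w, O * h * w).t()   # RS(..., (Chw, Ohw))^T

C, O, h, w, K, p = 3, 5, 8, 8, 3, 1
F_kernel = torch.randn(O, C, K, K)
x = torch.randn(2, C, h, w)
W_fp = conv_to_fc(F_kernel, h, w, p)
y_conv = nnF.conv2d(x, F_kernel, padding=p).reshape(2, -1)  # CONV(M^(in), F, p)
y_fc = x.reshape(2, -1) @ W_fp.t()                          # MMUL(M^(in), W^(F,p))
assert torch.allclose(y_conv, y_fc, atol=1e-4)
```

The grouped case follows the same recipe with the identity inputs built group-wise; we omit it here for brevity.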
Figure 2: The pure MLP model and the convolutional counterpart
The stage1 and stage3 are displayed in detail. Taking stage1 for example, $32\times32$ is the resolution and $C = 16$ is the number of output channels (except the last layer). Left: FC(32, 16) denotes the kernel size, i.e., this FC (equivalent to a $1\times1$ conv) projects 16 channels into 32 channels; all the RepMLPs are configured with $g = 2$, $h = w = 8$. Right: the convolutional counterpart uses $3\times3$ conv. A BN follows every conv, and a ReLU follows every RepMLP or conv-BN sequence.
Table 4: Comparisons with traditional ConvNets on ImageNet, all trained with identical settings. The speed is tested on the same 1080Ti with a batch size of 128. The input resolutions of the EfficientNets differ because they are fixed as structural hyper-parameters.