Fig. 1 The network structure of Focus-MOT.
Figure 2 shows the structure of the Field Enhancement Refinement Module. The input first passes through five parallel branches: adaptive pooling, three 3×3 atrous convolutions with dilation rates of 6, 8, and 12, and a 1×1 convolution; their outputs are then concatenated, so that multi-scale information is captured while the receptive field of the feature map is enlarged. After this operation, we design two parallel attention branches that capture rich contextual relationships, so as to achieve more compact intra-class feature representations.
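To make the branch layout concrete, the following is a minimal PyTorch sketch of the five parallel branches. The class name, the channel widths, and the bilinear up-sampling of the pooled branch are our assumptions; the paper does not fix these details.
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldEnhancementBranches(nn.Module):
    # Five parallel branches: adaptive pooling, three dilated 3x3
    # convolutions (rates 6, 8, 12), and a 1x1 convolution, with the
    # outputs concatenated along the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.atrous = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in (6, 8, 12)])
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        # Up-sample the pooled branch back to the input resolution
        # (bilinear interpolation is an assumption).
        pooled = F.interpolate(self.pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        feats = [pooled] + [conv(x) for conv in self.atrous]
        feats.append(self.conv1x1(x))
        return torch.cat(feats, dim=1)  # 5 * out_ch channels
\end{verbatim}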
The first is the upper branch. Let A be the input feature map of this parallel module, of size C × H × W. A is first passed through convolution layers to obtain new feature maps B and C, each of size C × H × W, which are then reshaped to C × N, where N = H × W. The transpose of B is multiplied with C, and a softmax is applied to the result to obtain the spatial attention map S of size N × N. Each row of S sums to 1: s_ji can be interpreted as the weight of the pixel at position i on the pixel at position j, i.e., for a fixed position j, the weights over all positions i sum to 1.
\begin{equation}
s_{ji}=\frac{\exp\left(B_{i}\cdot C_{j}\right)}{\sum_{i=1}^{N}\exp\left(B_{i}\cdot C_{j}\right)}\nonumber
\end{equation}
Meanwhile, A is passed through another convolution layer to obtain the feature map D (of size C × H × W), which is likewise reshaped to C × N. D is multiplied with the transpose of S to obtain a result of size C × N, which is reshaped back to C × H × W and multiplied by a coefficient γ. Finally, this result is added to A to obtain the final feature map E, which incorporates positional information. Here γ is a learnable weight parameter initialized to 0.
\begin{equation}
E_{j}=\gamma\sum_{i=1}^{N}\left(s_{ji}D_{i}\right)+A_{j}\nonumber
\end{equation}
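The upper branch can be sketched as follows. This is a minimal implementation of the two equations above, under the assumption (stated in the text) that the convolutions producing B, C, and D preserve the channel count; it reuses the imports from the previous sketch.
\begin{verbatim}
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    # Spatial branch: S = softmax over positions of the pairwise
    # affinities B_i . C_j, then E = gamma * (D S^T) + A.
    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, 1)
        self.conv_c = nn.Conv2d(channels, channels, 1)
        self.conv_d = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # initialized to 0

    def forward(self, a):
        n, c, h, w = a.shape
        b = self.conv_b(a).view(n, c, -1)   # C x N, with N = H*W
        cc = self.conv_c(a).view(n, c, -1)  # C x N
        d = self.conv_d(a).view(n, c, -1)   # C x N
        # s[j, i] = softmax_i(B_i . C_j): each row of S sums to 1.
        s = torch.softmax(torch.bmm(cc.transpose(1, 2), b), dim=-1)
        # E_j = gamma * sum_i s_ji * D_i + A_j
        e = torch.bmm(d, s.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * e + a
\end{verbatim}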
This branch builds rich contextual relationships over local features, encoding broader contextual information into them and thus enhancing their representational power. In the second branch, we argue that each channel map of a high-level feature can be regarded as a class-specific response, and that by mining the interdependencies between channel maps, the interdependent feature maps can be highlighted and the semantics-specific feature representation improved. This branch therefore builds a channel attention module that explicitly models the dependencies between channels. It proceeds similarly to the first branch, except that no convolution is applied to the feature map A; the attention is computed directly on A. A is reshaped to C × N and denoted B; B is then multiplied with its own transpose, and a softmax is applied to obtain the channel attention map X of size C × C.
\begin{equation}
x_{ji}=\frac{\exp\left(A_{i}\cdot A_{j}\right)}{\sum_{i=1}^{C}\exp\left(A_{i}\cdot A_{j}\right)}\nonumber
\end{equation}
X is then multiplied with B, the result is reshaped back to C × H × W and multiplied by a coefficient β, and the product is denoted D. Adding A to D gives the final feature map E with fused channel information. Like γ, β is a learnable parameter initialized to 0.
\begin{equation}
E_{j}=\beta\sum_{i=1}^{C}\left(x_{ji}A_{i}\right)+A_{j}\nonumber
\end{equation}
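The channel branch can be sketched in the same style. Following the text, no convolution precedes the attention, and the final product implements the equation above directly.
\begin{verbatim}
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel branch: X = softmax over channels of A_i . A_j,
    # then E = beta * (X B) + A, computed directly on A.
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # initialized to 0

    def forward(self, a):
        n, c, h, w = a.shape
        b = a.view(n, c, -1)  # C x N
        # x[j, i] = softmax_i(A_i . A_j): a C x C channel affinity map.
        x = torch.softmax(torch.bmm(b, b.transpose(1, 2)), dim=-1)
        # E_j = beta * sum_i x_ji * A_i + A_j
        e = torch.bmm(x, b).view(n, c, h, w)
        return self.beta * e + a
\end{verbatim}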
After the input features are processed by these two parallel branches, the two resulting feature maps are added element-wise to fuse them, and a 1×1 convolution is used to reduce the dimensionality, so that the whole Spatial Fusion module can enhance the fusion of the low-level features.
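A short sketch of this fusion step, reusing the two branch modules above; the output channel width is an assumption.
\begin{verbatim}
import torch.nn as nn

class DualBranchFusion(nn.Module):
    # Element-wise sum of the two attention branches, followed by a
    # 1x1 convolution for dimensionality reduction.
    def __init__(self, channels, out_channels):
        super().__init__()
        self.position = PositionAttention(channels)
        self.channel = ChannelAttention()
        self.reduce = nn.Conv2d(channels, out_channels, 1)

    def forward(self, a):
        return self.reduce(self.position(a) + self.channel(a))
\end{verbatim}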
After completing this series of operations, the designed Information Aggregation Module up-samples the high-level features step by step; at each up-sampling step, the result is added element-wise to the feature map of the same resolution output by Res2Net-50, and the final four feature maps are output.
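The following sketch illustrates this top-down aggregation. The use of bilinear interpolation and the assumption that all lateral maps share one channel width are ours, not details fixed by the paper.
\begin{verbatim}
import torch.nn.functional as F

def aggregate(features):
    # features: [c2, c3, c4, c5] from Res2Net-50, shallow to deep,
    # all assumed projected to the same channel count.
    outs = [features[-1]]
    for lateral in reversed(features[:-1]):
        up = F.interpolate(outs[-1], size=lateral.shape[2:],
                           mode='bilinear', align_corners=False)
        outs.append(up + lateral)  # element-wise sum at same resolution
    return outs[::-1]  # the final four feature maps, shallow to deep
\end{verbatim}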