Fig. 1 The network structure of Focus-MOT.
Figure 2 shows the structure of the Field Enhancement Refinement Module. The input first passes through five parallel branches: adaptive pooling, three 3×3 convolutions with dilation rates of 6, 8, and 12, and a 1×1 convolution. Their outputs are then concatenated, so that multi-scale information is captured while the receptive field of the feature map is enlarged. After this operation, we design two parallel branches that capture rich contextual relationships, so as to achieve a more compact intra-class feature representation.
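This five-branch layout resembles an ASPP-style block. The following minimal PyTorch sketch illustrates one plausible reading; the module name, the channel arguments, the global-average choice for the adaptive pooling, and the final 1×1 projection after concatenation are our assumptions, not details given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBlock(nn.Module):
    """ASPP-style front end of the Field Enhancement Refinement Module (sketch).

    Five parallel branches: adaptive pooling, three 3x3 convolutions with
    dilation rates 6/8/12, and a 1x1 convolution; the outputs are
    concatenated and projected back to `out_ch` channels (assumed step).
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.Sequential(                  # image-level context
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        self.dil6 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6, bias=False)
        self.dil8 = nn.Conv2d(in_ch, out_ch, 3, padding=8, dilation=8, bias=False)
        self.dil12 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12, bias=False)
        self.point = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        # broadcast pooled context back to the input resolution
        pooled = F.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        cat = torch.cat(
            [pooled, self.dil6(x), self.dil8(x), self.dil12(x), self.point(x)],
            dim=1)
        return self.project(cat)
```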
Consider the upper branch first. Let A be the feature map fed into the parallel module, of size C × H × W. A is first fed through convolution layers to obtain two new feature maps B and C (identical in size, C × H × W), which are then reshaped to C × N, where N = H × W. B is transposed and multiplied with C, and a softmax is applied to the result to obtain the attention map S of size N × N. The sum of each row of S is 1: s_ji can be interpreted as the weight of the pixel at position i on the pixel at position j, i.e., for a fixed pixel j the weights of all pixels i sum to 1.
\begin{equation}
s_{ji}=\frac{\exp\left(B_{i}\cdot C_{j}\right)}{\sum_{i=1}^{N}\exp\left(B_{i}\cdot C_{j}\right)}\nonumber
\end{equation}
Meanwhile, A is passed through another convolution to obtain a feature map D (of size C × H × W), which is likewise reshaped to C × N. D is multiplied by the transpose of S to obtain a result of size C × N, which is reshaped back to C × H × W and scaled by a coefficient γ. Finally, this result is added to A to yield the final feature map E, which incorporates positional information; here γ is a learnable weight initialized to 0.
\begin{equation}
E_{j}=\gamma\sum_{i=1}^{N}\left(s_{ji}D_{i}\right)+A_{j}\nonumber
\end{equation}
Such a branch builds rich contextual relationships over local features, encoding broader contextual information into them and thus enhancing their representational power.
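This branch matches the position attention design popularized by DANet. For concreteness, a minimal PyTorch sketch under that reading follows; the class name, the use of 1×1 convolutions to produce B, C, and D, and the batch-dimension handling are our assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position attention branch (sketch): E_j = gamma * sum_i(s_ji * D_i) + A_j."""

    def __init__(self, ch):
        super().__init__()
        self.conv_b = nn.Conv2d(ch, ch, 1)          # produces B
        self.conv_c = nn.Conv2d(ch, ch, 1)          # produces C
        self.conv_d = nn.Conv2d(ch, ch, 1)          # produces D
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                           # a: (batch, C, H, W)
        n, c, h, w = a.shape
        b = self.conv_b(a).view(n, c, h * w)        # (batch, C, N)
        cmap = self.conv_c(a).view(n, c, h * w)
        d = self.conv_d(a).view(n, c, h * w)
        # S: (batch, N, N); row j holds the weights s_ji of all positions i on j
        s = self.softmax(torch.bmm(b.transpose(1, 2), cmap))
        # out_j = sum_i s_ji * D_i, reshaped back to C x H x W
        out = torch.bmm(d, s.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + a
```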
In the second branch, we argue that each channel map of a high-level feature can be regarded as a class-specific response; by mining the interdependencies between channel maps, interdependent feature maps can be highlighted and the semantics-specific feature representation improved. This branch therefore builds a channel attention module that explicitly models the dependencies between channels. It proceeds like the previous branch, except that no convolution is applied to the feature map A; the operations act on A directly. A is reshaped to C × N and denoted B; B is then multiplied by its own transpose, and a softmax yields the attention map X of size C × C.
\begin{equation}
x_{ji}=\frac{\exp\left(A_{i}\cdot A_{j}\right)}{\sum_{i=1}^{C}\exp\left(A_{i}\cdot A_{j}\right)}\nonumber
\end{equation}
The transpose of X is multiplied by B, the result is reshaped back to C × H × W and scaled by a factor β, and we denote this result D. Adding A to D gives the final feature map E with fused channel information. Like γ, β is a learnable weight initialized to 0.
\begin{equation}
E_{j}=\beta\sum_{i=1}^{C}\left(x_{ji}A_{i}\right)+A_{j}\nonumber
\end{equation}
After the input features have been processed by these two parallel branches, the two resulting feature maps are added element-wise to complete their fusion, and a 1×1 convolution reduces the dimensionality, so that the whole Spatial Fusion module strengthens the fusion of the low-level features.
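Combining the channel branch with the element-wise fusion and 1×1 reduction just described, a minimal PyTorch sketch follows; it reuses the PositionAttention sketch above, and the class names and channel arguments are illustrative assumptions. Note that because the C × C energy matrix B·Bᵀ is symmetric, a row-wise softmax yields Xᵀ directly, which is exactly the factor the multiplication needs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch (sketch): E_j = beta * sum_i(x_ji * A_i) + A_j."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))    # learnable, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                           # a: (batch, C, H, W)
        n, c, h, w = a.shape
        b = a.view(n, c, h * w)                     # B: (batch, C, N); no conv on A
        energy = torch.bmm(b, b.transpose(1, 2))    # (batch, C, C), symmetric
        # row-wise softmax of a symmetric energy equals X^T from the text
        attn = self.softmax(energy)
        out = torch.bmm(attn, b).view(n, c, h, w)   # X^T . B, reshaped to C x H x W
        return self.beta * out + a

class SpatialFusion(nn.Module):
    """Element-wise sum of the two branch outputs, then 1x1 reduction (sketch)."""

    def __init__(self, ch, out_ch):
        super().__init__()
        self.pos = PositionAttention(ch)            # from the earlier sketch
        self.chn = ChannelAttention()
        self.reduce = nn.Conv2d(ch, out_ch, 1)      # 1x1 dimensionality reduction

    def forward(self, a):
        return self.reduce(self.pos(a) + self.chn(a))
```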
After this series of operations, the designed Information Aggregation Module up-samples the high-level features step by step; each up-sampled map is added element-wise to the Res2net-50 output feature map of the same resolution, and the final four feature maps are produced.
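This aggregation reads like a top-down, FPN-style pathway. A sketch of one plausible implementation is given below; the function name, the bilinear interpolation mode, and the calling convention (four backbone maps ordered coarse to fine) are our assumptions, not details fixed by the text.

```python
import torch
import torch.nn.functional as F

def aggregate(top, backbone_feats):
    """Top-down aggregation (sketch): repeatedly up-sample and add the
    backbone feature map of matching resolution; returns the fused maps.

    top:            highest-level feature map (coarsest resolution)
    backbone_feats: Res2net-50 outputs ordered coarse -> fine, one per
                    up-sampling step (hypothetical calling convention)
    """
    outputs = []
    x = top
    for feat in backbone_feats:
        # resize to the resolution of the current backbone map
        x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                          align_corners=False)
        x = x + feat                    # element-wise fusion
        outputs.append(x)
    return outputs                      # four maps when four levels are given
```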