Focus-MOT’s network structure
Figure 1 shows the proposed network structure of Focus-MOT. Focus-MOT uses Res2Net-50 [7] as the backbone network, which enlarges the receptive field of the network layers by constructing hierarchical residual-like connections within a single residual block. The input image is normalized to 3×608×1088, and the backbone produces five feature maps of sizes 64×304×544, 256×152×272, 512×76×136, 1024×38×68, and 2048×19×34.
feature maps are enhanced by the designed Field Enhancement Refinement
Module to expand the perceptual field of the high-dimensional features,
while completing the refinement of the features from the spatial
dimension and the channel dimension, and then through the Information
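The exact design of this module is not given in this excerpt; the sketch below is only an illustration of the stated idea, assuming parallel dilated convolutions for receptive-field expansion followed by channel and spatial gating for refinement. The class name and layer choices are placeholders, not the authors' implementation:

```python
import torch
import torch.nn as nn


class FieldEnhancementRefinement(nn.Module):
    """Illustrative sketch (not the paper's exact design): dilated
    convolutions enlarge the receptive field, then channel and spatial
    attention refine the features along both dimensions."""

    def __init__(self, channels):
        super().__init__()
        # Receptive-field expansion: parallel 3x3 convolutions with
        # increasing dilation rates, fused back to `channels`.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        # Channel refinement: squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # Spatial refinement: a single-channel attention map.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        y = y * self.channel_gate(y)   # channel-dimension refinement
        y = y * self.spatial_gate(y)   # spatial-dimension refinement
        return y + x                   # residual connection to the input
```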
The refined features are then passed through the Information Aggregation Module, which fuses features from the high level down to the low level, completing the information interaction between high-level semantic information and low-level detail information.
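A minimal sketch of such high-to-low fusion is given below, assuming an FPN-style pathway in which each higher-level map is projected, upsampled, and merged into the next lower level; this illustrates the described information flow, not the authors' actual Information Aggregation Module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InformationAggregation(nn.Module):
    """Illustrative high-to-low fusion (FPN-style), not the paper's exact
    module: each higher-level map is projected, upsampled, and added to
    the next lower level so semantic and detail information interact."""

    def __init__(self, in_channels=(64, 256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: (c1, ..., c5) ordered from low level (high resolution)
        # to high level (low resolution).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Fuse from the highest level down to the lowest level.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]
```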