Focus-MOT’s network structure
Figure 1 shows the proposed network structure of Focus-MOT. Focus-MOT uses Res2Net-50 [7] as the backbone network, which enlarges the receptive field of the network layers by constructing hierarchical residual-like connections within a single residual block. The input image is normalized to 3×608×1088, and the backbone produces five feature maps of sizes 64×304×544, 256×152×272, 512×76×136, 1024×38×68, and 2048×19×34. These five feature maps are first enhanced by the designed Field Enhancement Refinement Module, which expands the receptive field of the high-dimensional features while refining them along both the spatial and channel dimensions. The Information Aggregation Module then performs feature fusion from the high level to the low level, completing the information interaction between high-level semantic information and low-level detail information.
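As a quick sanity check, the per-stage feature-map shapes follow directly from the input resolution, assuming the standard Res2Net-50 stage channel widths (64, 256, 512, 1024, 2048) and cumulative downsampling strides of 2, 4, 8, 16, and 32; the function below is an illustrative sketch, not part of the Focus-MOT code:

```python
def backbone_feature_shapes(in_h=608, in_w=1088):
    """Return (channels, height, width) for each of the five backbone stages,
    assuming Res2Net-50 stage widths and stride-2 downsampling per stage."""
    channels = [64, 256, 512, 1024, 2048]
    strides = [2, 4, 8, 16, 32]  # cumulative stride at each stage output
    return [(c, in_h // s, in_w // s) for c, s in zip(channels, strides)]

for shape in backbone_feature_shapes():
    print(shape)
```

With the 608×1088 input this yields 304×544 at the first stage down to 19×34 at the last, matching the successive halving of the spatial resolution between stages.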