The Feature Pyramid Network (FPN) is a well-known architecture that applies a multi-scale strategy to a base feature extractor. FPN follows the idea of the image pyramid and extends it to a pyramid of feature maps. The goal of FPN is to combine the advantages of both high-level and low-level feature maps. As shown in Figure 2, FPN consists of two opposite pathways: a bottom-up pathway and a top-down pathway. The bottom-up pathway is the base feature extractor mentioned above (on the left in Figure 2), and usually employs a convolutional neural network (CNN) classifier. Along the direction of the dataflow in the bottom-up pathway, the base feature extractor is separated into five stages, and a downsampling operation is applied in each stage. The top layers export feature maps with richer semantic information, while the outputs of the lower layers possess a higher spatial resolution. Following the architecture in [27], the base feature extractor adopts a Residual Neural Network (ResNet) [45]. Concretely, ResNet-50 is chosen to balance performance and computational complexity. The architecture of the adopted ResNet-50 is displayed on the left side of Figure 2. The learnable convolutional layers are organized into five stages, with Stage1 as a convolutional layer (out-channel = 64 and stride = 2) and Stage2~Stage5 as several stacked convolutional blocks. Each convolutional block in Stage2~Stage5 has three convolutional layers, which correspond to the three lines in the "Stage" boxes in Figure 2. (Each line lists the settings for the kernel size, input channels, and output channels; the first line contains two input channels, where the former applies to the initial convolutional layer and the latter to the two subsequent convolutional layers.)
As depicted in Figure 2, FPN also provides a top-down pathway that contains top-down and lateral connections. The top-down connections are responsible for upsampling the higher-level feature maps to the same size as their lower-level counterparts. Specifically, the upsampling operation is based on nearest-neighbor interpolation. Meanwhile, the lateral connections use a convolutional layer to adjust the channel dimension of the bottom-up feature maps to match that of the top-down ones. The upsampled and channel-adjusted feature maps are then merged and fed into a convolutional layer to generate the pyramid feature maps (P5, P4, P3, and P2). From top to bottom, the top-down and lateral connections cooperate to handle the original feature maps from ResNet-50 stage by stage.
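As a concrete illustration, the following PyTorch-style sketch shows one way the lateral and top-down connections described above could be combined. The module name, the choice of 1×1 lateral and 3×3 output convolutions, and the 256-channel pyramid width are illustrative assumptions rather than details taken from Figure 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNTopDown(nn.Module):
    """Minimal top-down pathway sketch: backbone outputs C2..C5 -> pyramid maps P2..P5."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions adjust each stage's channels to the pyramid width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps and produce P2..P5.
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [conv(c) for conv, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down: upsample the higher level (nearest neighbor) and merge with the lateral map.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [conv(x) for conv, x in zip(self.output, laterals)]
        return p2, p3, p4, p5
```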

3.2 RPN with PU learning for incomplete annotations

In insulator detection, incomplete annotation causes some unlabeled insulators to be treated as background during the training process, which introduces ambiguity between targets and background [34]. Therefore, we introduce PU learning as a new loss for the Region Proposal Network (RPN).

A. Region Proposal Network for insulator proposals

RPN is a typical anchor-based detector, which implies that targets are detected from anchor regions. The anchors are obtained by partitioning the input images. RPN is in charge of determining whether an insulator exists in each anchor and locating the target's offsets. The centers of the anchors correspond to the centers of the receptive fields and, more specifically, to the pixels in the top feature map P5. At the center of each anchor, anchor boxes with different height-to-width ratios are placed to accommodate targets of various shapes. Following [27], the Faster RCNN pipeline uses nine anchor boxes with varying scales and height-to-width ratios.
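A minimal sketch of how nine anchor boxes per location could be enumerated is given below; the particular scales and aspect ratios are placeholders for illustration, not values reported in the paper.

```python
import itertools
import torch

def make_anchor_boxes(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the nine (x1, y1, x2, y2) anchor boxes centered at one feature-map location."""
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        # ratio = height / width; keep the anchor area equal to scale**2.
        w = scale / ratio ** 0.5
        h = scale * ratio ** 0.5
        boxes.append([center_x - w / 2, center_y - h / 2,
                      center_x + w / 2, center_y + h / 2])
    return torch.tensor(boxes)  # shape: (9, 4)
```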
Our PU-RPN inherits the architecture and supervision method of the vanilla RPN. As seen in Figure 3 (or Module B in Figure 1), the feature maps from FPN are fed into PU-RPN, which generates proposals and crops the feature maps based on these proposals. The cropped feature maps serve as the ROI Head's inputs. PU-RPN comprises a shared convolutional layer followed by two separate convolutional layers. The shared convolutional layer learns from the pyramid feature maps (P5, P4, P3, and P2) and expands the input channels (256) to the output channels (512). In Figure 3, the upper classification branch uses a convolutional layer as a binary classifier between insulators and the background. The number of output channels in this layer is eighteen, which corresponds to two categories and nine anchors. Similarly, the other branch is a regressor that predicts the coordinate offsets of the insulators (offsets for the center coordinates, width, and height). The output dimension of the regressor branch is 36 (four offsets for each of the nine anchors).
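The sketch below illustrates one way such a head could be written in PyTorch. The 3×3/1×1 kernel sizes and module names are our assumptions; only the 256→512 channel expansion and the 18-/36-channel outputs follow the description above.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Shared convolution followed by a classification branch and a regression branch."""

    def __init__(self, in_channels=256, mid_channels=512, num_anchors=9):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),  # 256 -> 512 channels
            nn.ReLU(inplace=True),
        )
        # 2 classes (insulator / background) per anchor -> 18 output channels.
        self.cls_branch = nn.Conv2d(mid_channels, num_anchors * 2, 1)
        # 4 coordinate offsets per anchor -> 36 output channels.
        self.reg_branch = nn.Conv2d(mid_channels, num_anchors * 4, 1)

    def forward(self, pyramid_features):
        # The same head is applied to every pyramid level (P2..P5).
        scores, offsets = [], []
        for feature in pyramid_features:
            shared = self.shared(feature)
            scores.append(self.cls_branch(shared))
            offsets.append(self.reg_branch(shared))
        return scores, offsets
```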
Before training, the ground-truth bounding boxes need to be converted into supervision information for the anchors. A positive label is assigned to an anchor when its Intersection over Union (IoU) with any ground-truth box is greater than 0.7, whereas a negative label corresponds to IoU values below 0.3. The coordinate offsets are computed from the difference between the annotated bounding boxes and the positive anchors. The coordinate offsets of negative anchors are set randomly.
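A compact sketch of this IoU-based assignment is shown below; torchvision's box_iou is used for the overlap computation, and anchors whose IoU falls between the two thresholds are simply ignored here, which is one common convention rather than a detail stated in the paper.

```python
import torch
from torchvision.ops import box_iou

def assign_anchor_labels(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = box_iou(anchors, gt_boxes)            # (num_anchors, num_gt)
    best_iou, best_gt = iou.max(dim=1)          # best-matching ground truth per anchor
    labels = torch.full((anchors.shape[0],), -1, dtype=torch.long)
    labels[best_iou < neg_thr] = 0              # background
    labels[best_iou > pos_thr] = 1              # insulator
    return labels, best_gt
```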
The loss functions of the original RPN can be summarized in two parts: Positive-Negative (PN) classification and smooth L1 regression. The PN classification treats both good and defective insulators as positive samples, while the background is regarded as negative samples. The loss function for this PN classification is computed as follows:
\[
L_{\mathrm{PN}} = \frac{1}{n_P}\sum_{i=1}^{n_P} \ell\!\left(s_P^{\,i}, 1\right) + \frac{1}{n_N}\sum_{j=1}^{n_N} \ell\!\left(s_N^{\,j}, 0\right),
\]
where $n$ and $s$ separately represent the total number of a specific class and the predicted classification score of a particular anchor. The subscripts $P$ and $N$ stand for the positive and negative class, respectively. The superscripts $i$ and $j$ are the indices of positive and negative anchors, respectively. $\ell$ is usually set to a cross-entropy loss that calculates the error between an anchor's predicted classification probability and the corresponding ground-truth label.
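A minimal sketch of this PN classification term is given below, assuming the classification branch outputs raw logits for the insulator class and that each mini-batch contains at least one positive and one negative anchor.

```python
import torch
import torch.nn.functional as F

def pn_classification_loss(scores, labels):
    """PN loss: mean BCE over positive anchors plus mean BCE over negative anchors.

    scores: (N,) raw logits for the insulator class; labels: (N,) with 1/0/-1 (ignored).
    Assumes the batch contains at least one positive and one negative anchor.
    """
    pos, neg = labels == 1, labels == 0
    loss_pos = F.binary_cross_entropy_with_logits(scores[pos], torch.ones_like(scores[pos]))
    loss_neg = F.binary_cross_entropy_with_logits(scores[neg], torch.zeros_like(scores[neg]))
    return loss_pos + loss_neg
```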
When it comes to the localization loss for insulator defect detection, a typical choice is the smooth L1 loss function [46]. The predicted bounding box is denoted as $t$, while the ground-truth bounding box is represented as $t^{*}$. Hence, the localization loss is defined as
\[
L_{\mathrm{loc}} = \frac{1}{n_P}\sum_{i=1}^{n_P} \operatorname{smooth}_{L1}\!\left(t^{\,i} - t^{*\,i}\right), \qquad
\operatorname{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^{2}, & |x| < 1,\\
|x| - 0.5, & \text{otherwise}.
\end{cases}
\]
In this equation, $n_P$ and $i$ are the same as in the PN classification loss above. The complete loss function for insulator defect detection is based on the combination of the PN classification loss $L_{\mathrm{PN}}$ and the localization loss $L_{\mathrm{loc}}$:
\[
L_{\mathrm{RPN}} = L_{\mathrm{PN}} + L_{\mathrm{loc}}.
\]
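A short sketch of the regression term, using PyTorch's built-in smooth L1 loss, might look as follows; combining it with the classification term by a plain sum mirrors the formulation above, although the two terms could also be weighted differently.

```python
import torch.nn.functional as F

def smooth_l1_regression_loss(pred_offsets, target_offsets, pos_mask, beta=1.0):
    """Smooth L1 loss over the positive anchors only (negative anchors are not regressed)."""
    return F.smooth_l1_loss(pred_offsets[pos_mask], target_offsets[pos_mask], beta=beta)

def rpn_total_loss(cls_loss, reg_loss, reg_weight=1.0):
    """Combined RPN loss: classification term plus (optionally weighted) regression term."""
    return cls_loss + reg_weight * reg_loss
```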
This combined loss is used to train the original RPN in the Faster RCNN. Our proposed PU-RPN replaces the PN loss with the PU loss; the details are given in the following section.

B. PU learning for incomplete annotations

For insulator defect detection from images, manual annotation has to overcome the problems derived from the varied insulator appearances and the complicated background. In the scenario of incomplete annotations, the missing-labeled regions containing insulators are treated as background. If PU-RPN is trained with the combined loss defined above, the PN loss will lead to semantic ambiguity. To solve this issue, PU learning is introduced in PU-RPN as an alternative to the PN loss. Furthermore, PU learning can mitigate the effect of unlabeled insulators being treated as background.
In the framework of PU learning [47], the class prior π is usually introduced to represent the proportion of the actual positive samples in the dataset. The loss function of PU learning can be defined as:
\[
L_{\mathrm{PU}} = \frac{\pi}{n_P}\sum_{i=1}^{n_P} \ell\!\left(s_P^{\,i}, 1\right)
+ \left(\frac{1}{n_U}\sum_{j=1}^{n_U} \ell\!\left(s_U^{\,j}, 0\right)
- \frac{\pi}{n_P}\sum_{i=1}^{n_P} \ell\!\left(s_P^{\,i}, 0\right)\right),
\]
where $n_P$ and $n_U$ therein stand for the number of labeled positive samples and unlabeled samples, respectively. $j$ and $s_U^{\,j}$ represent the indices of unlabeled anchors and the corresponding classification probability, respectively. The remaining symbols are the same as in the PN classification loss above. The first term approximately estimates the loss from predicting true-positive samples as positive. The second term is the difference between the loss of all anchors and the loss of true-positive anchors, when both are predicted to be negative. A non-negative operation is then applied to the second term as suggested in [47], which leads to
\[
L_{\mathrm{nnPU}} = \frac{\pi}{n_P}\sum_{i=1}^{n_P} \ell\!\left(s_P^{\,i}, 1\right)
+ \max\!\left\{0,\; \frac{1}{n_U}\sum_{j=1}^{n_U} \ell\!\left(s_U^{\,j}, 0\right)
- \frac{\pi}{n_P}\sum_{i=1}^{n_P} \ell\!\left(s_P^{\,i}, 0\right)\right\}.
\]
The estimation of the class prior π is crucial for the PU classification loss. The approach to determine the class prior is described in Section 2.4. Based on the PU classification loss $L_{\mathrm{nnPU}}$, the complete RPN loss is rewritten as:
\[
L_{\mathrm{RPN}} = L_{\mathrm{nnPU}} + L_{\mathrm{loc}}.
\]
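The non-negative PU classification term can be sketched in a few lines of PyTorch; in this sketch the surrogate loss ℓ is taken to be binary cross-entropy, and the class prior pi is assumed to be supplied externally (e.g., estimated as described in Section 2.4).

```python
import torch
import torch.nn.functional as F

def nn_pu_loss(scores, labels, pi):
    """Non-negative PU classification loss for anchor scores.

    scores: (N,) raw logits; labels: (N,) with 1 = labeled positive, 0 = unlabeled;
    pi: class prior, the estimated fraction of true positives among all anchors.
    Assumes at least one labeled positive and one unlabeled anchor in the batch.
    """
    pos, unl = labels == 1, labels == 0
    bce = F.binary_cross_entropy_with_logits
    # pi * mean loss of labeled positives predicted as positive.
    risk_pos = pi * bce(scores[pos], torch.ones_like(scores[pos]))
    # Negative risk: unlabeled anchors predicted as negative, minus the positive part.
    risk_neg = bce(scores[unl], torch.zeros_like(scores[unl])) \
        - pi * bce(scores[pos], torch.zeros_like(scores[pos]))
    # Non-negative correction of the second term, as in nnPU.
    return risk_pos + torch.clamp(risk_neg, min=0.0)
```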

3.3 ROI Head with focal loss for sample imbalance

When identifying the categories of the insulators, the different categories have various quantities of samples. Consequently, the categories' contributions to the loss are not equal, which inclines a category with fewer samples toward worse performance. Therefore, we introduce focal loss into the Region of Interest (ROI) Head to alleviate this issue.

A. Region of Interest (ROI) Head for insulator detection

The ROI Head is the second-stage detector at the end of the Faster RCNN. The schematic diagram of the ROI Head is shown in Figure 4 (or Module C in Figure 1). It follows the FPN and RPN modules, which reserve the top k proposal regions as ROIs. The ROI Head refines the classification and regression results predicted by RPN. Its network architecture is composed of an ROI pooling layer and several fully-connected (FC) layers. ROI pooling projects ROIs of different spatial sizes to fixed-size feature maps. The FC layers imitate the VGG classifier head, which possesses two shared FC layers and two parallel separate FC layers as the classification and detection branches. The goal of the classification branch is to identify the good or defective insulators from the ROIs' feature maps.
During the training process, ROI pooling initially receives a large number of ROIs of different sizes. Each ROI's feature map is partitioned into roughly equal bins along the spatial dimensions. ROI pooling then applies max-pooling to the values in each bin. As a result, each bin is replaced by its maximum, ensuring that all ROIs' feature maps have the same size. Furthermore, the ROI-pooled feature maps are reshaped into a feature vector. The vector passes through the shared FC layers to enhance the semantic information. The last two separate FC layers complete the classification and detection tasks.
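The following sketch outlines such an ROI Head in PyTorch. Torchvision's roi_pool operator is used for the pooling step, and the 7×7 output size, the 1024-dimensional FC layers, the spatial scale, and the three-way class split (good / defective / background) are illustrative assumptions rather than settings stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ROIHead(nn.Module):
    """ROI pooling followed by two shared FC layers and two task-specific FC layers."""

    def __init__(self, in_channels=256, pool_size=7, fc_dim=1024, num_classes=3):
        super().__init__()
        self.pool_size = pool_size
        self.shared_fc = nn.Sequential(
            nn.Linear(in_channels * pool_size * pool_size, fc_dim), nn.ReLU(inplace=True),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(inplace=True),
        )
        self.cls_fc = nn.Linear(fc_dim, num_classes)       # e.g. good / defective / background
        self.reg_fc = nn.Linear(fc_dim, num_classes * 4)   # per-class box offsets

    def forward(self, feature_map, rois):
        # rois: (K, 5) tensor of (batch_index, x1, y1, x2, y2) proposals from PU-RPN.
        pooled = roi_pool(feature_map, rois, output_size=self.pool_size, spatial_scale=1 / 16)
        flat = pooled.flatten(start_dim=1)
        shared = self.shared_fc(flat)
        return self.cls_fc(shared), self.reg_fc(shared)
```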
The loss functions for the ROI Head contain a multi-class cross entropy and the smooth L1 localization loss defined above. The multi-class cross entropy is defined as
\[
L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} y_{j}^{\,i}\log p_{j}^{\,i},
\]
where $p$ is the predicted probability vector, $y$ stands for the one-hot vector of the label, and $N$ is the number of ROIs. The $i$ and $j$ correspond to the indices of ROIs and of elements in the predicted vectors, respectively.
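For reference, this term corresponds to the standard multi-class cross-entropy over the ROI classification logits, as in the short sketch below (integer class indices rather than one-hot vectors are passed to PyTorch's built-in function).

```python
import torch
import torch.nn.functional as F

def roi_classification_loss(cls_logits, gt_classes):
    """Mean multi-class cross-entropy over all ROIs.

    cls_logits: (num_rois, num_classes) raw scores; gt_classes: (num_rois,) integer labels.
    """
    return F.cross_entropy(cls_logits, gt_classes)
```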

B. Focal loss for sample imbalances

The original Faster RCNN employs the multi-class cross entropy, which penalizes samples of all classes equally. This leads to the drawback that the classes with more samples are weighted by a larger factor in the overall loss.
Focal loss is a better alternative to cross-entropy when the problem of sample imbalance exists [48]. In our framework, focal loss is incorporated into the ROI Head to alleviate the effect of sample imbalance. The focal loss for the multi-class case can be defined as follows: