The Feature Pyramid Network (FPN) is a well-known architecture that applies
the multi-scale strategy to a base feature extractor. FPN follows the idea of
the image pyramid and extends it to a pyramid of feature maps. The goal of FPN
is to combine the advantages of both high-level and low-level feature maps. As
shown in Figure 2, FPN consists of two inverse pathways: a bottom-up pathway
and a top-down pathway. The bottom-up pathway is the base feature extractor
mentioned above (on the left in Figure 2), and usually employs a convolutional
neural network (CNN) classifier. Along the direction of the dataflow in the
bottom-up pathway, the base feature extractor is separated into five stages,
and a downsampling operation is applied in each stage. The top stages export
feature maps with richer semantic information, while the outputs of the lower
stages possess higher spatial resolution.
architecture in [27], the base feature extractor adopts a
Residual Neural Network (ResNet) [45]. Concretely, ResNet-50 is
chosen to balance performance and computational complexity. The
architecture of the adopted ResNet-50 is displayed on the left side of
Figure 2. The learnable convolutional layers are organized into five stages,
with Stage1 being a single convolutional layer (out-channels = 64, stride = 2)
and Stage2~Stage5 consisting of several stacked convolutional blocks. Each
convolutional block in Stage2~Stage5 has three convolutional layers, which
correspond to the three lines in the "Stage" boxes in Figure 2 (each line
contains settings for the kernel size, input channels, and output channels;
the first line contains two input-channel values, the former being a parameter
of the initial convolutional layer, whereas the latter is a parameter of the
two subsequent convolutional layers).
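To make the stage split concrete, the following PyTorch sketch shows one way to expose the five stages of a torchvision ResNet-50; the grouping of the stem and max-pooling layer, the input size, and the use of `weights=None` are illustrative assumptions, not details taken from [27] or [45].

```python
import torch
import torchvision

# Hypothetical split of a torchvision ResNet-50 into the five stages described above.
resnet = torchvision.models.resnet50(weights=None)

stage1 = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu)  # 7x7 conv, 64 ch, stride 2
stage2 = torch.nn.Sequential(resnet.maxpool, resnet.layer1)          # 3 bottleneck blocks, 256 ch out
stage3 = resnet.layer2                                               # 4 bottleneck blocks, 512 ch out
stage4 = resnet.layer3                                               # 6 bottleneck blocks, 1024 ch out
stage5 = resnet.layer4                                               # 3 bottleneck blocks, 2048 ch out

x = torch.randn(1, 3, 800, 800)   # assumed input resolution
c1 = stage1(x)                    # 1/2 resolution
c2 = stage2(c1)                   # 1/4 resolution, feeds the lateral connection for P2
c3 = stage3(c2)                   # 1/8  -> P3
c4 = stage4(c3)                   # 1/16 -> P4
c5 = stage5(c4)                   # 1/32 -> P5
```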
As depicted in Figure 2, FPN also provides a top-down pathway that contains
top-down and lateral connections. The top-down connections are responsible for
upsampling the higher-level feature maps to the same size as their lower-level
counterparts. Specifically, the upsampling operation is based on
nearest-neighbor interpolation. Meanwhile, the lateral connections use a
convolutional layer to adjust the channel dimension of the bottom-up feature
maps to match that of the top-down ones. The upsampled and channel-adjusted
feature maps are then merged and fed into a convolutional layer to generate
the pyramid feature maps (P5, P4, P3, and P2). From top to bottom, the
top-down and lateral connections cooperate to process the original feature
maps from ResNet-50 stage by stage.
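A minimal PyTorch sketch of this top-down pathway is given below; the 256-channel output, the element-wise addition used for merging, and the 3x3 smoothing convolutions follow the common FPN implementation and are assumptions about details not spelled out above.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SimpleFPN(nn.Module):
    """Minimal FPN top-down pathway sketch (channel counts are assumptions)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions align the channel dimension of C2-C5 to 256.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps into P2-P5.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Top-down: upsample with nearest-neighbor interpolation, add the lateral projection.
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        # Returned in the order (P2, P3, P4, P5).
        return [self.smooth[i](p) for i, p in enumerate((p2, p3, p4, p5))]
```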
3.2 RPN with PU learning for incomplete
annotations
In insulator detection, incomplete annotation will lead to some unlabeled
insulators being treated as background during the training process, which
causes ambiguity between targets and background [34]. Therefore, we introduce
PU learning as the new loss of the Region Proposal Network (RPN).
A. Region Proposal Network for insulator
proposals
RPN is a typical anchor-based detector, which means targets are detected from
anchor regions. The anchors are obtained by partitioning the input images. RPN
is in charge of determining whether an insulator exists in each anchor and
locating the target's offsets. The centers of the anchors correspond to the
centers of the receptive fields and, more specifically, to the pixels in the
top feature map P5. The anchor boxes at the center of each anchor have
variable height-to-width ratios to accommodate targets of various shapes.
According to [27], the Faster RCNN pipeline uses nine anchor boxes with
varying scales and height-to-width ratios.
Our PU-RPN inherits the architecture and supervision method of the vanilla
RPN. As seen in Figure 3 (or Module B in Figure 1), the feature maps from FPN
are fed into PU-RPN, which generates proposals and crops the feature maps
based on the proposals. The cropped feature maps serve as the ROI Head's
inputs. PU-RPN is composed of a convolutional layer followed by two separate
convolutional layers. The former convolutional layer learns from the pyramid
feature maps (P5, P4, P3, and P2), expanding the input channels (256) to the
output channels (512). In Figure 3, the upper classification branch uses a
convolutional layer as a binary classifier between insulators and the
background. The number of output channels in this layer is eighteen, which
corresponds to two categories and nine anchors. Similarly, the other
regression branch aims to predict the coordinate offsets of the insulators
(offsets for the box center coordinates, width, and height). The output
dimension of the regression branch is 36 (nine anchors × four offsets).
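The sketch below illustrates the layer sizes just described; the 3x3 kernel of the shared convolution and the 1x1 kernels of the two branches follow the standard Faster RCNN RPN head and are assumptions here.

```python
from torch import nn

class RPNHead(nn.Module):
    """Sketch of the PU-RPN head described above (kernel sizes are assumptions)."""
    def __init__(self, in_channels=256, mid_channels=512, num_anchors=9):
        super().__init__()
        # Shared convolution expands the 256 input channels to 512.
        self.shared = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        # Classification branch: 2 categories (insulator vs. background) x 9 anchors = 18 channels.
        self.cls = nn.Conv2d(mid_channels, 2 * num_anchors, 1)
        # Regression branch: 4 coordinate offsets x 9 anchors = 36 channels.
        self.reg = nn.Conv2d(mid_channels, 4 * num_anchors, 1)

    def forward(self, feature_map):
        x = self.shared(feature_map).relu()
        return self.cls(x), self.reg(x)
```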
Before the training process, the ground-truth bounding boxes need to be
converted into supervision information for the anchors. A positive label is
assigned to an anchor when its Intersection over Union (IoU) with any
ground-truth box is greater than 0.7, whereas a negative label corresponds to
IoU values below 0.3. The coordinate offsets are determined from the
difference between the annotated bounding boxes and the positive anchors. The
coordinate offsets of negative anchors are set randomly.
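A hedged sketch of this IoU-based assignment using torchvision's box_iou is given below; treating anchors whose best IoU falls between the two thresholds as ignored is an assumption about how the ambiguous cases are handled.

```python
import torch
from torchvision.ops import box_iou

def assign_anchor_labels(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Assign 1 (insulator), 0 (background), or -1 (ignored) to each anchor.

    `anchors` and `gt_boxes` are (N, 4) / (M, 4) tensors in (x1, y1, x2, y2) format.
    """
    iou = box_iou(anchors, gt_boxes)       # (N, M) pairwise IoU matrix
    max_iou, _ = iou.max(dim=1)            # best ground-truth match for each anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[max_iou > pos_thresh] = 1       # positive: IoU greater than 0.7 with some gt box
    labels[max_iou < neg_thresh] = 0       # negative: IoU below 0.3 with every gt box
    return labels
```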
The loss functions of the original RPN can be summarized in two parts:
Positive-Negative (PN) classification and smooth L1 regression.
The PN classification of insulators treats both good and defective insulators
as positive samples, while the background is regarded as the negative class.
The loss function for this PN classification is computed as follows:

$$L_{PN} = \frac{1}{n_P}\sum_{i=1}^{n_P}\ell\left(p_P^{i}, 1\right) + \frac{1}{n_N}\sum_{j=1}^{n_N}\ell\left(p_N^{j}, 0\right) \quad (1)$$

where $n$ and $p$ separately represent the total number of samples of a
specific class and the predicted classification score of a particular anchor.
The subscripts $P$ and $N$ stand for the positive and negative class,
respectively. The superscripts $i$ and $j$ are the indices of positive and
negative anchors, respectively. $\ell(\cdot)$ is usually set to a
cross-entropy loss that calculates the error between the anchors' predicted
classification probabilities and the corresponding ground-truth labels.
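As a sanity check, Equation (1) can be written in a few lines of PyTorch; the score and label conventions follow the assignment sketch above and are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pn_classification_loss(scores, labels):
    """Equation (1): averaged cross-entropy over positive and negative anchors.

    `scores` are predicted insulator probabilities in (0, 1); `labels` are
    1 (positive), 0 (negative), or -1 (ignored).
    """
    pos, neg = scores[labels == 1], scores[labels == 0]
    loss_pos = F.binary_cross_entropy(pos, torch.ones_like(pos))   # mean over the n_P positives
    loss_neg = F.binary_cross_entropy(neg, torch.zeros_like(neg))  # mean over the n_N negatives
    return loss_pos + loss_neg
```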
When it comes to the localization loss for insulator defect detection, a
typical choice is the smooth L1 loss function [46]. The predicted bounding box
is denoted as $t$, while the ground-truth bounding box is represented as
$t^{*}$. Hence, the localization loss is defined as

$$L_{loc} = \frac{1}{n_P}\sum_{i=1}^{n_P}\mathrm{smooth}_{L_1}\left(t^{i} - t^{*i}\right) \quad (2)$$

In this equation, $n_P$ and $i$ are the same as in Equation (1). The complete
loss function for insulator defect detection is based on the combination of
the PN classification loss $L_{PN}$ and the localization loss $L_{loc}$:

$$L_{RPN} = L_{PN} + L_{loc} \quad (3)$$
The loss in Equation (3) is used to train the original RPN in the Faster RCNN.
Our proposed PU-RPN replaces the PN loss with the PU loss, and the details are
given in the following sections.
B. PU learning for incomplete
annotations
For insulator defect detection from images, manual annotation needs to
overcome the problems derived from the varied insulator appearances and the
complicated background. In the scenario of incomplete annotations, the
unlabeled regions that contain insulators are treated as background. If PU-RPN
is trained with the loss defined in Equation (3), the PN loss will lead to
semantic ambiguity. To solve this issue, PU learning is introduced in PU-RPN
as an alternative to the PN loss. Furthermore, PU learning can mitigate the
effect of unlabeled insulators being treated as background.
In the framework of PU learning [47], the class prior $\pi$ is usually
introduced to represent the proportion of the actual positive samples in the
dataset. The loss function of PU learning can be defined as:

$$L_{PU} = \frac{\pi}{n_P}\sum_{i=1}^{n_P}\ell\left(p_P^{i}, 1\right) + \left(\frac{1}{n_U}\sum_{k=1}^{n_U}\ell\left(p_U^{k}, 0\right) - \frac{\pi}{n_P}\sum_{i=1}^{n_P}\ell\left(p_P^{i}, 0\right)\right) \quad (4)$$

where $n_P$ and $n_U$ stand for the number of labeled positive samples and
unlabeled samples, respectively. $k$ and $p_U^{k}$ represent the indices of
the unlabeled anchors and the corresponding classification probabilities,
respectively. The remaining symbols refer to Equation (1). The first term in
Equation (4) approximately estimates the loss from predicting true-positive
samples as positive. The second term is the difference between the loss from
predicting all anchors as negative and that from predicting the true-positive
anchors as negative. Then a non-negative operation is applied to the second
term as suggested in [47], which leads to

$$L_{PU} = \frac{\pi}{n_P}\sum_{i=1}^{n_P}\ell\left(p_P^{i}, 1\right) + \max\left(0,\ \frac{1}{n_U}\sum_{k=1}^{n_U}\ell\left(p_U^{k}, 0\right) - \frac{\pi}{n_P}\sum_{i=1}^{n_P}\ell\left(p_P^{i}, 0\right)\right) \quad (5)$$
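A hedged PyTorch sketch of the non-negative PU loss in Equation (5) is shown below; treating every non-positive anchor as "unlabeled" and using binary cross-entropy as $\ell$ are assumptions consistent with the formulation above, not a description of the authors' code.

```python
import torch
import torch.nn.functional as F

def nn_pu_classification_loss(scores, labels, prior):
    """Non-negative PU loss (Equation (5)).

    `scores` are predicted insulator probabilities, `labels` mark labeled
    positives (1) and unlabeled anchors (0), and `prior` is the class prior pi.
    """
    pos, unl = scores[labels == 1], scores[labels == 0]
    # Risk of predicting the labeled positives as positive, weighted by the prior.
    risk_pos = prior * F.binary_cross_entropy(pos, torch.ones_like(pos))
    # Negative risk on unlabeled anchors, corrected by the positives' negative risk.
    risk_neg = (F.binary_cross_entropy(unl, torch.zeros_like(unl))
                - prior * F.binary_cross_entropy(pos, torch.zeros_like(pos)))
    # Non-negative correction from [47]: clamp the second term at zero.
    return risk_pos + torch.clamp(risk_neg, min=0.0)
```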
The estimation of the class prior $\pi$ is crucial for the PU classification
loss. The approach used to determine the class prior is described in Section
2.4. Based on the PU classification loss $L_{PU}$, Equation (3) is rewritten as:

$$L_{RPN} = L_{PU} + L_{loc} \quad (6)$$
3.3 ROI Head with focal loss for sample
imbalance
When identifying the categories of the insulators, different categories have
different numbers of samples. Consequently, the categories' contributions to
the loss are unequal, and a category with fewer samples tends to obtain worse
performance. Therefore, we introduce focal loss into the Region of Interest
(ROI) Head to alleviate this issue.
A. Region of Interest (ROI) Head for insulator
detection
The ROI Head is the second-stage detector at the end of the Faster RCNN. The
schematic diagram of the ROI Head is shown in Figure 4 (or Module C in Figure
1). It follows the FPN and RPN modules, which reserve the top-k proposal
regions as ROIs. The ROI Head refines the classification and regression
results predicted by RPN. Its network architecture is composed of ROI pooling
layers and several fully-connected (FC) layers. ROI pooling projects the ROIs'
spatial dimensions onto fixed-size feature maps. The FC layers imitate the VGG
classifier head, which possesses two shared FC layers and two parallel
separate FC layers as the classification and detection branches. The goal of
the classification branch is to identify good or defective insulators from
those ROIs' feature maps.
During the training process, ROI pooling initially receives many ROIs of
different sizes. Each ROI's feature map is partitioned into roughly equal bins
along the spatial dimensions. Then ROI pooling employs max-pooling to handle
the values in the bins. As a result, each bin generates one maximum as its
replacement, ensuring that all the ROIs' feature maps have the same size.
Furthermore, the ROI-pooled feature maps are reshaped into a feature vector.
The vector passes through the shared FC layers to enhance the semantic
information. The last two separate FC layers finish the classification and
detection tasks.
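The following sketch captures this ROI Head structure with torchvision's roi_pool; the 7x7 pool size, the 1024-dimensional FC layers, and the three-way classifier (good insulator, defective insulator, background) are assumptions used only for illustration.

```python
from torch import nn
from torchvision.ops import roi_pool

class ROIHead(nn.Module):
    """ROI pooling, two shared FC layers, and separate classification /
    regression branches (all sizes are assumptions)."""
    def __init__(self, in_channels=256, pool_size=7, fc_dim=1024, num_classes=3):
        super().__init__()
        self.pool_size = pool_size
        self.fc1 = nn.Linear(in_channels * pool_size * pool_size, fc_dim)
        self.fc2 = nn.Linear(fc_dim, fc_dim)
        self.cls = nn.Linear(fc_dim, num_classes)      # good / defective / background
        self.reg = nn.Linear(fc_dim, 4 * num_classes)  # per-class box offsets

    def forward(self, feature_map, rois, spatial_scale):
        # roi_pool crops each ROI and max-pools it to a fixed pool_size x pool_size map.
        x = roi_pool(feature_map, rois, (self.pool_size, self.pool_size), spatial_scale)
        x = x.flatten(start_dim=1)                      # reshape to a feature vector
        x = self.fc2(self.fc1(x).relu()).relu()         # two shared FC layers
        return self.cls(x), self.reg(x)                 # classification and detection branches
```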
The loss functions for the ROI Head contain a multi-class cross entropy and
the smooth L1 localization loss defined in Equation (2). The multi-class cross
entropy is defined as

$$L_{cls} = -\sum_{m}\sum_{c} y_{m,c}\,\log\left(q_{m,c}\right) \quad (7)$$

where $q$ is the predicted vector, and $y$ stands for the one-hot vector of
the label. The subscripts $m$ and $c$ correspond to the indices of ROIs and of
the elements in the predicted vectors, respectively.
B. Focal loss for sample
imbalances
The original Faster RCNN employs the multi-class cross entropy, which
penalizes the samples of all classes equally. This leads to the drawback that
the classes with more samples are weighted by a larger factor in the overall
loss.
Focal loss is a better alternative to cross entropy when the problem of sample
imbalance exists [48]. In our framework, focal loss is incorporated into the
ROI Head to alleviate the effect of sample imbalance. The multi-class focal
loss can be defined as follows: