Part Detectors and CNNs


Abstract. First, we review the part detectors we developed in the re-id book chapter. Second, we review the structure of CNNs. The two topics are connected by one observation: the first convolutional layer of a CNN resembles a bank of spatial filters, while the part detectors are built on histograms of oriented gradients (HOG) features. Is there transferable knowledge between the two approaches? Could it suggest a new type of CNN layer, or a new type of feature extraction for HOG? And what about the subsequent convolutional layers of a CNN?

\label{fig:HOG_Overview}Overview of the HOG feature extraction.

Part Detectors

Calculating the HOG features requires a series of steps, summarized in Fig. \ref{fig:HOG_Overview}. At each step, Dalal and Triggs (Dalal 2005) show experimentally that certain choices produce better results than others, and they call the resulting procedure the default detector (HOG-dd). Like other recent implementations (Felzenszwalb 2010), we largely make the same choices, but also introduce some tweaks.


Here, we assume the input is an image window of canonical size for the body part under consideration. As in HOG-dd, we compute the gradients directly with the centered masks \([-1,0,1]\) and \([-1,0,1]^{T}\). For color images, each RGB channel is processed separately, and each pixel takes the gradient vector with the largest norm among the three channels. While this does not take full advantage of the color information, it is better than discarding it, as Andriluka’s detector does.
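As a minimal sketch of this step, the following NumPy code computes centered differences per channel and keeps, at each pixel, the gradient of the channel with the largest norm. The function name `max_norm_gradient`, the `(H, W, 3)` array layout, and the choice to leave border pixels at zero are assumptions made for illustration, not details taken from the detector.

```python
import numpy as np


def max_norm_gradient(image):
    """Per-pixel gradient of an (H, W, 3) image, taken from the channel
    whose gradient vector has the largest norm.

    Returns two (H, W) arrays: horizontal (gx) and vertical (gy) components.
    """
    img = np.asarray(image, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)

    # Centered differences, i.e. the [-1, 0, 1] mask along columns and rows.
    # Border pixels are left at zero (an assumption of this sketch).
    gx[:, 1:-1, :] = img[:, 2:, :] - img[:, :-2, :]
    gy[1:-1, :, :] = img[2:, :, :] - img[:-2, :, :]

    # Gradient norm per channel, then the index of the strongest channel.
    norm = np.hypot(gx, gy)           # shape (H, W, 3)
    best = norm.argmax(axis=2)        # shape (H, W)

    rows, cols = np.indices(best.shape)
    return gx[rows, cols, best], gy[rows, cols, best]
```

For a grayscale window, the same code can be used by stacking the single channel three times; the channel selection then has no effect.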


Next, we turn each pixel's gradient vector into a histogram by quantizing its orientation into 18 bins. The orientation bins are evenly spaced over the range \(0^{\circ}-180^{\circ}\), so each bin spans \(10^{\circ}\). For pedestrians, the variability of clothing and scenes means there is no a priori light/dark relationship between foreground and background that would justify the use of “signed” gradients with range \(0^{\circ}-360^{\circ}\); in other words, we use the contrast-insensitive version (Felzenszwalb 2010). To reduce aliasing, when an angle does not fall exactly at the center of a bin, its gradient magnitude is split linearly between the two neighboring bin centers. The outcome can be seen as a sparse image with 18 channels, which is further processed by a spatial convolution that spreads the votes to 4 neighboring pixels (Wang 2009).
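The orientation voting can be sketched as follows. This is an illustrative version only: it assumes bin centers at \(5^{\circ}, 15^{\circ}, \ldots, 175^{\circ}\) and lets votes wrap around between the last and the first bin, since unsigned orientations are periodic over \(180^{\circ}\). The function name `orientation_histograms` is hypothetical, and the subsequent spatial convolution (Wang 2009) is not included.

```python
import numpy as np

N_BINS = 18                    # orientation bins over [0, 180) degrees
BIN_WIDTH = 180.0 / N_BINS     # 10 degrees per bin


def orientation_histograms(gx, gy):
    """Turn (H, W) gradient components into an (H, W, N_BINS) sparse image.

    Each pixel votes with its gradient magnitude, split linearly between
    the two nearest bin centers (assumed at 5, 15, ..., 175 degrees).
    """
    mag = np.hypot(gx, gy)
    # Unsigned (contrast-insensitive) orientation in [0, 180).
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0

    # Fractional position measured in bins from the first bin center.
    pos = theta / BIN_WIDTH - 0.5
    lo = np.floor(pos).astype(int)     # lower neighboring bin (may be -1)
    w_hi = pos - lo                    # weight given to the upper bin

    hist = np.zeros(mag.shape + (N_BINS,))
    rows, cols = np.indices(mag.shape)
    # The modulo wraps votes across the 0/180 degree boundary.
    hist[rows, cols, lo % N_BINS] += mag * (1.0 - w_hi)
    hist[rows, cols, (lo + 1) % N_BINS] += mag * w_hi
    return hist
```

By construction, the 18 values at each pixel sum to that pixel's gradient magnitude, and at most two of them are non-zero, which is why the result can be viewed as a sparse 18-channel image.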


We then spatially aggregate the histograms into cells of \(7\times 7\) pixels, defining the feature vector of a cell as the sum of its pixel-level histograms.
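A possible way to express this aggregation, assuming the pixel-level histograms are stored as an `(H, W, bins)` array, is a reshape followed by a sum. The function name `aggregate_cells` is hypothetical, and discarding incomplete cells at the right and bottom borders is an assumption of this sketch.

```python
import numpy as np

CELL = 7   # cell side in pixels


def aggregate_cells(pixel_hist, cell=CELL):
    """Sum per-pixel histograms of shape (H, W, bins) over cell x cell regions.

    Returns an array of shape (H // cell, W // cell, bins).
    """
    h, w, n_bins = pixel_hist.shape
    ch, cw = h // cell, w // cell
    # Drop rows and columns that do not fill a complete cell (assumption).
    trimmed = pixel_hist[:ch * cell, :cw * cell]
    # Split each spatial axis into (cell index, offset within the cell),
    # then sum over the two within-cell offsets.
    return trimmed.reshape(ch, cell, cw, cell, n_bins).sum(axis=(1, 3))
```

For example, a \(14\times 21\) pixel window yields a \(2\times 3\) grid of cells.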


As in HOG-dd, we group cells into larger blocks and contrast-normalize each block separately. In particular, we concatenate the features from \(2\times 2\) contiguous cells into a vector \(\mathbf{v}\), then normalize it as \(\tilde{\mathbf{v}}=\min(\mathbf{v}/\lVert\mathbf{v}\rVert_{2},0.2)\), that is, L2 normalization followed by element-wise clipping. With 18 orientation bins per cell, this produces a 72-dimensional feature vector for each block (\(4\times 18\)). The final feature vector for the whole part image is obtained by concatenating the vectors of all the blocks.
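The block normalization can be sketched as below. The text does not state how blocks are laid out, so the sketch assumes overlapping blocks with a stride of one cell, as in HOG-dd; the function name `block_features` and the small constant `eps` that guards against division by zero are likewise assumptions. The length of each block vector is four times the number of orientation bins per cell.

```python
import numpy as np


def block_features(cells, clip=0.2, eps=1e-12):
    """Concatenate normalized 2x2-cell blocks into one feature vector.

    cells: array of shape (cells_y, cells_x, bins).
    Blocks overlap with a stride of one cell (an assumption of this sketch).
    """
    ch, cw, _ = cells.shape
    blocks = []
    for i in range(ch - 1):
        for j in range(cw - 1):
            # Concatenate the histograms of the 2x2 cells into v.
            v = cells[i:i + 2, j:j + 2].reshape(-1)
            # L2 normalization followed by element-wise clipping at `clip`.
            v = v / (np.linalg.norm(v) + eps)
            blocks.append(np.minimum(v, clip))
    if not blocks:
        # Fewer than 2x2 cells: no block can be formed.
        return np.zeros(0)
    return np.concatenate(blocks)
```

With a stride of one cell, a grid of \(3\times 3\) cells yields four blocks, so the final descriptor is four block vectors long.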

When the part is rotated so that its orientation is not aligned with the image grid, the default approach is to counter-rotate the entire image (or the bounding box of the part) before processing it as a canonical window. This can be computationally expensive during training, where image parts appear at all sorts of orientations, and also during testing, even if we limit the number of detectable angles. Furthermore, handling changes in the scale of the human figures and the foreshortening of limbs adds further computational burden. In the following, we introduce a novel approximation method that speeds up the detection process.