Abstract
Non-maximum suppression (NMS) is a post-processing step in almost every
visual object detector. NMS aims to prune overlapping candidate
regions-of-interest (ROIs) detected on an image, in order to
assign a single, spatially accurate detection to each object. The
default NMS algorithm (GreedyNMS) is fairly simple and suffers from
severe drawbacks, due to its need for manual tuning. A typical failure
case with high application relevance is pedestrian/person detection
in dense human crowds, where GreedyNMS does not provide accurate results.
This paper proposes an efficient deep neural architecture for NMS in the
person detection scenario, which captures relations between neighbouring
ROIs and aims to assign precisely one detection per person. The
presented Seq2Seq-NMS architecture adopts a sequence-to-sequence
formulation of the NMS problem, exploits the Multihead Scaled Dot-Product
Attention mechanism and jointly processes both geometric and visual
properties of the input candidate ROIs. Thorough experimental evaluation
on three public person detection datasets shows favourable results
against competing methods, with acceptable inference runtime
requirements and good behaviour for large numbers of raw candidate ROIs
per image.
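
For reference, the GreedyNMS baseline mentioned above can be sketched as follows. This is a minimal illustration of the standard greedy procedure, not code from the paper: the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box layout and the default threshold of 0.5 are all assumptions made for the example. The hand-chosen `iou_threshold` is the manually tuned parameter that makes the baseline fragile in crowds.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is kept only if it overlaps every already-kept box by at most
    iou_threshold (an assumed, manually chosen value).
    """
    # Visit candidates in order of decreasing detector confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one isolated box (illustrative values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.5))
print(greedy_nms(boxes, scores, iou_threshold=0.7))
```

In this toy input the first two boxes have an IoU of roughly 0.68, so the second one is suppressed at a threshold of 0.5 but kept at 0.7. If those two boxes were two different people standing close together, the lower threshold would drop a true detection, while the higher one would let duplicates through elsewhere; this is the trade-off that a learned NMS aims to avoid.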