Abstract
This paper presents TMNet, a new approach to the unsupervised video
object segmentation (UVOS) problem. UVOS remains challenging because
prior methods generalize poorly when segmenting multiple objects in
unseen test videos (the category-agnostic setting), rely too heavily on
inaccurate optical flow, and struggle to capture fine details at object
boundaries. These issues make UVOS, particularly in the presence of
multiple objects, an ill-defined problem. Our focus is to constrain the
problem and improve the
segmentation results by combining multiple available cues, such as
appearance, motion, image edges, flow edges, and tracking information,
through neural attention. To solve the challenging category-agnostic
multi-object UVOS problem, our model predicts neighbourhood affinities
that indicate membership of the same object and clusters them to obtain
an accurate segmentation. To achieve multi-cue neural attention, we
design a Temporal Motion Attention module, as part of our segmentation
framework, that learns spatio-temporal features. To refine and improve
the accuracy of object segmentation boundaries, an edge refinement
module (using image and optical flow edges) and a geometry-based loss
function are incorporated. The overall framework segments objects and
finds accurate object boundaries without any heuristic post-processing.
This enables the method to be applied to unseen videos. Experimental
results on the challenging DAVIS16 and multi-object DAVIS17 datasets
show that the proposed TMNet performs favourably compared to the
state-of-the-art methods without post-processing.