Sandipan Sarma

The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient "zero-shot" scene understanding, with transformers used for temporal modeling. However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT) that take image and text encodings from a pretrained I-VL model as inputs. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multiscale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on two widely used benchmarks, UCF101 and HMDB51, show that we achieve state-of-the-art results. Specifically, we obtain absolute gains of 2.8% and 2.3% on these datasets in the conventional setting, and 8.6% on UCF101 in the generalized zero-shot action recognition setting. Additionally, we gain 18.6% and 5.8% on UCF101 and HMDB51, respectively, under the recent "TruZe" evaluation protocol.
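
To make the two architectural ingredients concrete, the sketch below illustrates (i) multiscale dilated 1-D convolutions over per-frame I-VL embeddings for temporal modeling, in the spirit of LoCATe, and (ii) a single graph-attention layer over class text embeddings, in the spirit of the proposed GAT. This is not the authors' implementation: the module names, layer sizes, adjacency construction, and cosine-similarity scoring are illustrative assumptions, assuming CLIP-style 512-dimensional frame and text embeddings.

```python
# Minimal sketch (not the authors' code) of dilated temporal context aggregation
# and graph attention over class embeddings for zero-shot action recognition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalContextTemporalBlock(nn.Module):
    """Multiscale temporal modeling over per-frame I-VL embeddings using
    parallel dilated 1-D convolutions (an assumed stand-in for LoCATe)."""
    def __init__(self, dim=512, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.proj = nn.Linear(dim * len(dilations), dim)

    def forward(self, frame_emb):             # frame_emb: (B, T, dim)
        x = frame_emb.transpose(1, 2)          # (B, dim, T) for Conv1d
        ctx = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        ctx = ctx.mean(dim=-1)                 # temporal pooling -> (B, dim * n_branches)
        return self.proj(ctx)                  # video embedding (B, dim)

class SimpleGATLayer(nn.Module):
    """Single-head graph attention over class text embeddings; the adjacency
    matrix encodes assumed semantic relations between action classes."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, text_emb, adj):          # text_emb: (C, dim), adj: (C, C) in {0, 1}
        h = self.W(text_emb)                   # (C, dim)
        C = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(C, C, -1),
                           h.unsqueeze(0).expand(C, C, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))        # attention logits (C, C)
        e = e.masked_fill(adj == 0, float("-inf"))         # attend only to neighbors
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)                # refined class embeddings (C, dim)

# Zero-shot scoring: cosine similarity between video and class embeddings.
video = LocalContextTemporalBlock()(torch.randn(2, 16, 512))     # 2 clips, 16 frames
classes = SimpleGATLayer()(torch.randn(10, 512), torch.eye(10))  # 10 unseen classes
logits = F.normalize(video, dim=-1) @ F.normalize(classes, dim=-1).T
```

In this sketch, an unseen clip is classified by the highest cosine similarity between its aggregated video embedding and the GAT-refined class embeddings; the actual paper's scoring and graph construction may differ.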