Cross-Modal Self-Supervised Vision Language Pre-training with Multiple
Objectives for Medical Visual Question Answering
Abstract
Medical Visual Question Answering (VQA) aims to answer questions about
medical images, drawing on both visual and textual information in the
reasoning process. The absence of large-scale annotated medical VQA
datasets presents a formidable obstacle to training a medical VQA model
from scratch in an end-to-end manner. Existing works use image captioning
datasets in the pre-training stage and fine-tune on downstream VQA tasks.
Following the same paradigm, we use a collection of public medical image
captioning datasets to pre-train multimodal models in a self-supervised
setup, and fine-tune them on downstream medical VQA tasks. In this work,
we propose Cross-Modal pre-training with Multiple Objectives (CMMO),
which includes masked image modelling, masked language modelling,
image-text matching, and image-text contrastive learning. The proposed
method is designed to associate the visual features of medical images
with the corresponding medical concepts in captions, learning aligned
vision and language feature representations and multi-modal interactions.
The experimental
results reveal that our proposed CMMO method outperforms
state-of-the-art methods on three public medical VQA datasets, showing
absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD,
PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive
ablation studies to validate our method, and visualize attention maps
that show strong interpretability. The code and pre-trained
weights will be released at https://github.com/pengfeiliHEU/CMMO.
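As a rough illustration of how the four pre-training objectives named in the abstract could be combined, the sketch below sums per-objective losses into a single pre-training loss. This is only a minimal sketch under assumed tensor shapes; the function name, argument names, and loss weights are hypothetical placeholders and are not taken from the CMMO implementation.

```python
# Illustrative sketch (not the authors' released code): one combined loss over
# the four self-supervised objectives -- masked language modelling (MLM),
# masked image modelling (MIM), image-text matching (ITM), and image-text
# contrastive learning (ITC). All names and weights are hypothetical.
import torch
import torch.nn.functional as F


def multi_objective_pretraining_loss(mlm_logits, mlm_labels,
                                     mim_pred, mim_target, mim_mask,
                                     itm_logits, itm_labels,
                                     img_emb, txt_emb,
                                     temperature=0.07,
                                     weights=(1.0, 1.0, 1.0, 1.0)):
    # MLM: cross-entropy over masked token positions; unmasked positions are
    # assumed to carry the ignore index -100.
    loss_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                               ignore_index=-100)

    # MIM: reconstruct masked image patches; average the L2 error over the
    # masked positions only (mim_mask is 1 where a patch was masked).
    per_patch = F.mse_loss(mim_pred, mim_target, reduction="none").mean(dim=-1)
    loss_mim = (per_patch * mim_mask).sum() / mim_mask.sum().clamp(min=1)

    # ITM: binary classification of matched vs. mismatched image-text pairs.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # ITC: symmetric InfoNCE over the batch, treating the diagonal as positives.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_itc = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    w_mlm, w_mim, w_itm, w_itc = weights
    return w_mlm * loss_mlm + w_mim * loss_mim + w_itm * loss_itm + w_itc * loss_itc
```

In this sketch the four terms are simply weighted and summed; the actual weighting and scheduling used by CMMO are described in the paper and released code rather than here.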