Abstract
Textbook Question Answering (TQA) is the task of answering diagram and
non-diagram questions given large multi-modal contexts consisting of
abundant text and diagrams. This task requires both deep text
understanding and effective learning of diagram semantics.
In this paper, we propose a Weakly Supervised learning
method for TQA (WSTQ), which treats the imperfect results
of essential intermediate procedures as weak supervision to
construct Text Matching (TM) and Relation Detection (RD) tasks, and then
uses these tasks to learn strong text comprehension
and effective diagram semantics, respectively. Specifically, we use the
results of text retrieval to build positive and negative text
pairs. To learn deep text understanding, we first pre-train
the text understanding module of WSTQ on TM and then fine-tune it on
TQA. We build positive and negative relation pairs by checking
whether there is any overlap between the items/regions detected from
diagrams using object detection. The RD task forces our method to learn
the relationships between regions, which are crucial for expressing the
diagram semantics. We train WSTQ on RD and TQA simultaneously,
\emph{i.e.}, multitask learning, to obtain effective
diagram semantics and thereby improve TQA performance. Extensive
experiments are carried out on CK12-QA and AI2D to verify the
effectiveness of WSTQ. Experimental results show that our method
improves accuracy by $5.02\%$
and $4.12\%$ on the test splits of these datasets,
respectively, over the current state-of-the-art baseline. We have
released our code at
\url{https://github.com/dr-majie/WSTQ}.
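
As an illustrative sketch of the weak RD supervision described above (the
notation is an assumption introduced here, not the formal definition used by
the method), suppose object detection returns axis-aligned regions
$b_1, \dots, b_n$ for a diagram. Assuming that overlapping regions form
positive pairs and disjoint regions form negative pairs, a pair $(b_i, b_j)$
with $i \neq j$ could be labeled as
\[
y_{ij} =
\left\{
\begin{array}{ll}
1, & \mathrm{area}(b_i \cap b_j) > 0, \\
0, & \mbox{otherwise},
\end{array}
\right.
\]
where $y_{ij}$ is a hypothetical weak label. Such labels inherit any errors
made by the detector, which is why they serve only as weak supervision.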