DocXClassifier: Towards a Robust and Interpretable Deep Neural Network
for Document Image Classification
Abstract
Model interpretability and robustness are becoming increasingly critical
for the safe and practical deployment of deep learning (DL) models in
industrial settings. As DL-backed automated document processing systems
become more common in business workflows, there is a pressing need to
improve both properties for the task of document image classification,
an integral component of such
systems. Surprisingly, while much research has been devoted to improving
the performance of deep models for this task, little attention has been
given to their interpretability and robustness. In this paper, we aim to
improve upon both aspects and introduce DocXClassifier, an inherently
interpretable deep document classifier that not only achieves
significant performance improvements over existing approaches in
image-based document classification, but also generates feature
importance maps alongside its
predictions. Our approach attains state-of-the-art performance in
image-based classification on two popular document datasets, RVL-CDIP
and Tobacco3482, with top-1 classification accuracies of 94.17% and
95.57%, respectively. It also sets a new record for image-based
classification accuracy on Tobacco3482 without transfer learning from
RVL-CDIP, at 90.14%. Furthermore, our proposed
training strategy demonstrates superior robustness compared to existing
approaches, significantly outperforming them on 19 out of 21 different
types of novel data distortions, while achieving comparable results on
the remaining two. By combining robustness with interpretability,
DocXClassifier represents a promising step towards the practical
deployment of DL models for document classification tasks.