Quantitative characterizations and estimations of uncertainty are of fundamental importance for machine learning classification, particularly in safety-critical settings such as the military battlefield, where continuous real-time monitoring requires explainable and reliable scoring. Reliance on the maximum a posteriori principle to determine label classification can obscure a model’s certainty of label assignment. We develop quantitative scores of certainty and competence based on predicted probability estimates as an effective tool for inferring the verity of positives across different data modalities and architectures. Our theoretical results establish that competent models have distinct distributions of certainty for true and false positives. Our empirical results bear out that certainty scores are distributed distinctly on training and holdout data, as well as on data that is a priori out-of-distribution. Further, we find that the most reliable test for out-of-distribution data is to compare the global true-positive certainty score distribution against that of the test data. At least 92.3% of out-of-distribution inputs are successfully identified this way, at the tranche level, across our two experimental modalities. Moreover, 100% of out-of-context images are identified as out-of-distribution using the stochastic form of our out-of-distribution detection test across all five stochastic variants of the ResNet models. Consequently, we find that our certainty framework provides a robust means of detecting out-of-distribution inputs, while also serving as a reliable mechanism for comparing how accurately models distinguish true from false positives, particularly in safety-critical contexts.
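The abstract describes flagging out-of-distribution inputs by comparing the global true-positive certainty score distribution against test data. The sketch below is a minimal illustration only, not the authors' published method: it assumes certainty is the top predicted (softmax) probability and uses a two-sample Kolmogorov-Smirnov test as the distribution comparison; the exact score definitions, the stochastic variant of the test, and the function names (certainty_scores, true_positive_certainties, flag_out_of_distribution) are all assumptions introduced here.

```python
# Hedged sketch: assumes certainty = top predicted probability and a
# two-sample KS test as the distribution comparison; the paper's exact
# certainty/competence scores and OOD test may differ.
import numpy as np
from scipy import stats


def certainty_scores(probs: np.ndarray) -> np.ndarray:
    """Certainty of each prediction, taken here as the top predicted probability."""
    return probs.max(axis=1)


def true_positive_certainties(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Certainty scores restricted to correctly classified (true positive) samples."""
    preds = probs.argmax(axis=1)
    return certainty_scores(probs)[preds == labels]


def flag_out_of_distribution(reference_tp_certainty: np.ndarray,
                             test_probs: np.ndarray,
                             alpha: float = 0.05) -> bool:
    """Flag a test tranche as OOD if its certainty distribution differs
    significantly from the reference true-positive certainty distribution."""
    test_certainty = certainty_scores(test_probs)
    statistic, p_value = stats.ks_2samp(reference_tp_certainty, test_certainty)
    return p_value < alpha
```

In this reading, the reference distribution is built once from held-in true positives, and each incoming tranche of predicted probabilities is tested against it; a significant divergence is treated as evidence of out-of-distribution input.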

Mark Stefik et al.

COGLE (COmmon Ground Learning and Explanation) is an explainable artificial intelligence (XAI) system for autonomous drones that deliver supplies to field units in mountainous areas. The drone missions have risks that vary with topography, flight decisions, and mission goals in a simulated environment. Users must determine which AI-controlled drone is better for a mission. Narrative explanations identify the advantages of a drone’s plan (“What?”) and the reasons the better drone is able to achieve them (“Why?”). Visual explanations highlight risks from obstacles that users may have overlooked (“Where?”). A model induction user study showed that post-decision explanations produced a small effect on participants’ ability to identify the better of two imperfect drones and their plans for a mission, but the explanations did not teach participants to judge the multiple success factors in complex missions as well as the AI pilots. In a decision support variation of the task, users would receive pre-decision explanations to help them decide when to trust the XAI’s decision. In a fielded XAI application, every drone available for a mission may lack some competencies. We created a proof-of-concept demonstration of automatic ways to combine knowledge from multiple imperfect AIs to obtain better solutions than the individual AIs find on their own. This paper reports on the research challenges, technical approach, and findings of the project and also reflects on the multidisciplinary journey that we took.