
Exploring Bias Assessment and Strong Calibration in AI-based Medical Risk Prediction Models
  • Nidhi Gaonkar

Corresponding Author: [email protected]



Rapid advancements in machine learning and artificial intelligence have raised important concerns regarding fairness and bias in algorithmic decision-making. This paper evaluates the biases present in four major machine learning models (Random Forest, K-nearest neighbors, XGBoost, and Naive Bayes) by comparing their performance on different subgroups of health data, including gender and racial groups. We also analyze the models for strong calibration, that is, whether the average predicted probability corresponds to the actual observed rate in every subgroup. Through a t-test and an ANOVA test of model performance on different subgroups, we demonstrate that the models have significantly different True Positive Rates across racial groups, but not across gender groups. A comparison of parity metrics, together with a test for strong calibration through changepoint detection on the Random Forest model, further showed disparities in model performance. Cross-validation and a paired t-test showed that Brier scores differed across the k folds, indicating inconsistencies in model calibration. When we employed reweighting as a bias mitigation strategy, however, the disparity was reduced. We also suggest a new method for bias mitigation that jointly employs reweighting and adversarial training. This work serves as a caution about the use of machine learning models in medical contexts, especially for clinical outcome prediction, where estimates of risk or mortality likelihood may be significantly affected by biases in the model or data. It is imperative that creators and users of machine learning models prioritize diversity and representation with respect to sensitive attributes, in both datasets and trained models, to ensure equity and ethical practice in this field. We also suggest directions for future work.
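To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It computes a per-subgroup True Positive Rate, Brier score, and gap between mean predicted probability and observed rate. It also computes sample weights for one common reweighting scheme, in which each (group, label) cell is weighted by P(group) · P(label) / P(group, label). The abstract does not say which reweighting scheme was used, so that choice is an assumption. The function names, the 0.5 decision threshold, and the toy data are also illustrative assumptions. The changepoint-based calibration test and adversarial training are not shown.

```python
import numpy as np


def per_group_report(y_true, y_prob, groups, threshold=0.5):
    """Per-subgroup TPR, Brier score, and calibration gap (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.asarray(y_prob, dtype=float)
    groups = np.asarray(groups)
    y_pred = (y_prob >= threshold).astype(int)  # assumed decision threshold

    report = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp, pr = y_true[mask], y_prob[mask], y_pred[mask]
        positives = yt == 1
        # TPR is undefined when a subgroup has no positive cases.
        tpr = float(pr[positives].mean()) if positives.any() else float("nan")
        report[str(g)] = {
            "n": int(mask.sum()),
            "tpr": tpr,
            # Brier score: mean squared error of the predicted probabilities.
            "brier": float(np.mean((yp - yt) ** 2)),
            # Mean predicted probability minus observed event rate.
            "calibration_gap": float(yp.mean() - yt.mean()),
        }
    return report


def reweighing_weights(y_true, groups):
    """Sample weights P(group) * P(label) / P(group, label) (assumed scheme)."""
    y_true = np.asarray(y_true, dtype=int)
    groups = np.asarray(groups)
    weights = np.empty(len(y_true), dtype=float)
    for g in np.unique(groups):
        for y in np.unique(y_true):
            cell = (groups == g) & (y_true == y)
            if cell.any():
                expected = (groups == g).mean() * (y_true == y).mean()
                weights[cell] = expected / cell.mean()
    return weights


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_prob = [0.9, 0.8, 0.2, 0.1, 0.4, 0.3, 0.7, 0.6]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    for name, stats in per_group_report(y_true, y_prob, groups).items():
        print(name, stats)
    print(reweighing_weights(y_true, groups))
```

Weights of this kind can be passed to estimators that accept a `sample_weight` argument in `fit`, such as scikit-learn's `RandomForestClassifier`. In the toy data, the two groups have the same calibration gap but different Brier scores and True Positive Rates, which illustrates why the abstract examines several metrics rather than one.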