Insert Figure 4.
One of AI’s most significant benefits is its ability to scale intelligence at
an unprecedented pace. In the time a clinician needs to diagnose a single
patient, an AI system could, at least in theory, analyze an unlimited number
of patients. However, the same scalability applies to mistakes and faulty
diagnoses, so validation is of the utmost importance to guard against poor
generalizability of AI models. AI models tend to ‘overfit’ the training data,
resulting in a model that seemingly works well on the training population but
predicts poorly for future or other patients, a risk that is especially
pronounced for high-dimensional models. One example is IBM’s Watson, which
recommended unsafe cancer treatments because it was trained on a sample size
too limited for its dimensionality. For models to be broadly applicable and
generalizable to other populations, diligent validation and replication in
external datasets are paramount. Unfortunately, the former is often
insufficient and the latter altogether missing. Even FDA-approved AI
applications fall short in this domain: only 11 of 118 FDA applications (up
to 2021) reported a validation set of more than 1,000 samples, and only 19 of
118 reported a multi-reader, multicenter validation study. Site-specific
recalibration or retraining on multiple datasets can adapt a model to another
context, although caution is required to avoid learning spurious patterns.
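To make this pitfall and remedy concrete, the minimal sketch below (Python, simulated data only; the sites, sample sizes, and numbers are illustrative and not taken from the studies cited above) contrasts apparent performance of a small, high-dimensional model with its performance at a simulated external site, and then applies a simple site-specific logistic recalibration of the predicted risks.

```python
# Minimal sketch (simulated data only): apparent vs. external performance of a
# high-dimensional model, followed by simple site-specific recalibration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

def simulate_site(n, shift=0.0):
    """Simulate one hospital site: 50 features, outcome driven by 5 of them."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 50))
    logits = X[:, :5].sum(axis=1) - 10.0 * shift      # shifted baseline risk
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return X, y

# Small, high-dimensional development cohort -> prone to overfitting.
X_dev, y_dev = simulate_site(n=150)
# External site with a different case mix (distribution shift).
X_ext, y_ext = simulate_site(n=2200, shift=0.5)
X_recal, y_recal = X_ext[:200], y_ext[:200]    # small local sample
X_test, y_test = X_ext[200:], y_ext[200:]      # held-out external evaluation

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
p_dev = model.predict_proba(X_dev)[:, 1]
p_test = model.predict_proba(X_test)[:, 1]
print("Apparent (training) AUC :", round(roc_auc_score(y_dev, p_dev), 2))
print("External AUC            :", round(roc_auc_score(y_test, p_test), 2))
print("External Brier score    :", round(brier_score_loss(y_test, p_test), 2))

# Site-specific recalibration (logistic recalibration): refit only an intercept
# and slope on the original model's log-odds using the small local sample;
# discrimination is unchanged, but calibration at the new site improves.
z_recal = model.decision_function(X_recal).reshape(-1, 1)
recal = LogisticRegression().fit(z_recal, y_recal)
p_recal = recal.predict_proba(model.decision_function(X_test).reshape(-1, 1))[:, 1]
print("Brier score after recal.:", round(brier_score_loss(y_test, p_recal), 2))
```

Note that recalibration only rescales the model’s risk estimates for the new context; it cannot repair poor discrimination, which requires retraining on more representative data.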
Randomized controlled trials (RCTs) and prospective validations are scarce
in medical AI80,81. Most applications are tested only on retrospective data
and have not passed prospective validation in an independent dataset. In the
past two years, new guidelines have emerged for reporting and evaluating RCTs
with an AI intervention component, such as the CONSORT-AI and SPIRIT-AI
standards. A systematic review from 2022 reported that none of the 41
assessed RCTs adhered to this standard and suggested that AI applications
with FDA approval do not always prove efficacy. Thus, clinical utility and
safety remain uncertain, providing a clear direction for future research
before AI can be confidently implemented in clinical medicine.
Ethical considerations
AI systems often rely on and are trained with confidential personal data,
such as health records, imaging, or genomic data. The more voluminous these
data become, especially as multiple data sources are integrated and new ones
are unlocked, the more critical privacy becomes. The EU’s General Data
Protection Regulation (GDPR) already provides a ‘right to explanation’ when
decisions are based on “automated processing” such as AI. The relationship
between privacy and trust is complicated: if the mechanisms of an algorithm
remain hidden for privacy reasons, this can impede trust in the solution and
slow its adoption by patients and clinicians. Furthermore, being
overprotective of privacy in data collection, usage, and sharing can hinder
the potential patient benefits of using these data to drive AI solutions for
novel diagnostic or therapeutic options. Novel approaches are emerging that
preserve privacy without slowing down innovation, such as the generation of
synthetic data. Rather than (pseudo-)anonymizing real samples, AI-generated
synthetic samples can be used for safe data sharing or even new model
development.
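As a minimal illustration of the idea, the sketch below fits a deliberately simple generative model (a Gaussian mixture, used here only as a stand-in for the more expressive synthesizers, such as GANs or copula-based methods, typically used in practice) to a simulated clinical table and samples new, artificial records for sharing or downstream development. All variables and values are hypothetical.

```python
# Minimal sketch (simulated data only): generating synthetic records from a
# simple density model as a stand-in for the generative approaches (GANs,
# copula-based synthesizers) used for privacy-preserving data sharing.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Pretend this is a sensitive clinical table that cannot be shared directly.
real = pd.DataFrame({
    "age": rng.normal(60, 12, 500),
    "bmi": rng.normal(27, 4, 500),
    "systolic_bp": rng.normal(135, 18, 500),
})

# Fit a density model to the real data ...
gmm = GaussianMixture(n_components=5, random_state=0).fit(real.to_numpy())

# ... and draw brand-new, artificial records from it for sharing or downstream
# model development. No row corresponds to a real patient, although formal
# privacy guarantees still require dedicated evaluation (e.g. for memorization).
synthetic_values, _ = gmm.sample(n_samples=500)
synthetic = pd.DataFrame(synthetic_values, columns=real.columns)

print(real.describe().loc[["mean", "std"]].round(1))
print(synthetic.describe().loc[["mean", "std"]].round(1))
```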
While AI systems are not moral agents, their decisions can have ethical
consequences. Bias and fairness, in particular, are two key concepts in this
context, and various cases of biases embedded in developed models exist. The
FDA’s 2021 AI action plan warns that biases in healthcare systems, such as
racial or gender biases, can be inadvertently introduced into algorithms.
This leads to research conclusions and applications that are biased toward
specific populations while overlooking others. If not corrected, this could
further reinforce biases and exacerbate the health inequalities experienced
by certain underrepresented populations by excluding them from AI-driven
medical innovations. Therefore, researchers need to ensure that the training
sample is diverse and represents any future population to which the AI model
will be applied.
While the above risks are important, it is essential to realize that humans
are not free from implicit biases either. For instance, cardiologists are
trained to recognize the symptoms of coronary artery disease as they
typically present in men, resulting in underdiagnosis in women. The advantage
of data and algorithms is that biases can be detected, corrected, or even
prevented. From the study’s outset, during the data collection phase,
investigators should strive for a representative training dataset that
resembles the data distribution the algorithm will encounter once deployed.
Before model development, guidelines such as the PROBAST tool have been
defined to assess the risk of algorithmic bias. Likewise, new techniques for
the modeling phase, such as adversarial debiasing, are emerging to help
mitigate bias. Lastly, dedicated tools have been developed to evaluate the
fairness of algorithms along a variety of fairness definitions, such as the
open-source Python library AI Fairness 360.
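The sketch below illustrates the kind of group-fairness audit such tools perform, computing two common metrics, the demographic parity difference and the equal-opportunity difference, by hand on simulated predictions; AI Fairness 360 and similar libraries report these and many related definitions out of the box. The group labels and the model shown here are hypothetical.

```python
# Minimal sketch (simulated data only) of a group-fairness audit, computing
# two common metrics by hand; toolkits such as AI Fairness 360 report these
# and many related fairness definitions out of the box.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
group = rng.binomial(1, 0.3, n)                    # 1 = underrepresented group (illustrative)
y_true = rng.binomial(1, 0.2 + 0.05 * group, n)    # true outcome
# A hypothetical model that systematically under-diagnoses the underrepresented group:
score = 0.2 + 0.4 * y_true - 0.1 * group + rng.normal(0, 0.1, n)
y_pred = (score >= 0.5).astype(int)

def positive_rate(mask):
    """Fraction of positive predictions within a subgroup."""
    return y_pred[mask].mean()

# Demographic parity difference: gap in positive-prediction rates between groups.
dp_diff = positive_rate(group == 1) - positive_rate(group == 0)

# Equal-opportunity difference: gap in true-positive rates between groups.
eo_diff = (positive_rate((group == 1) & (y_true == 1))
           - positive_rate((group == 0) & (y_true == 1)))

print(f"Demographic parity difference: {dp_diff:+.3f}")
print(f"Equal opportunity difference : {eo_diff:+.3f}")
```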
When the above considerations are not managed adequately, an AI system
may make mistakes. This raises the intricate question of (moral)
accountability, which becomes increasingly pressing with more clinical
applications in place. However, the traditional notion of accountability
is problematic in the context of an AI system. It is questionable
whether a clinician can be held responsible for such a system’s
decisions. Furthermore, the system’s complexity can make it infeasible
for the clinician, and sometimes even the designer, to understand
precisely why certain decisions are made. Therefore, we anticipate that the
introduction of AI in clinical medicine will initially be limited to decision
support systems, with the final clinical decision made by the treating
physician.
Clinical implementation
Despite exciting showcases, AI has been criticized for underdelivering on
tangible clinical impact. Translating solid AI models into effective action
remains an open challenge, and actual clinical use is still nascent. Even
with the recent surge of COVID-19-related AI research, the clinical value of
AI applications has remained limited. Important challenges for clinical
implementation include questionable clinical advantages, inadequate
reporting, and poor adoption and integration into clinical practice.
Developers of algorithms are also urged to be transparent and complete in
their reporting to provide a fair view of whether a model improves patient
care. The RISE criteria (Regulatory aspects, Interpretability,
Interoperability, Structured Data, and Evidence) can help overcome major
pitfalls in developing AI applications for clinical practice. Recently, the
DECIDE-AI guideline has been introduced as a reporting checklist for the
early-stage clinical evaluation of AI-based decision support systems.
In addition, clinicians and patients must adapt to working with and trusting
new AI systems, and such behavioral change is notoriously hard. There is a
need for (better) AI education for clinicians, who will need to adapt to new
roles and to the tools that support their decision-making. To smooth this
transition, integration of AI into the medical education system has been
proposed. A recent American Academy of Allergy, Asthma & Immunology workgroup
has underscored a knowledge and educational gap in the allergy and immunology
field. Furthermore, the interoperability of AI systems is vital to ensure
that they can be integrated with existing clinical and technical workflows
across sites and health systems.