Assessing the Optimal Time to Integrate an AI Tool into a Clinical Setting

Objective: To reflect on the appropriate speed and methodology for implementing new AI tools in clinical medicine, analyzing the tension between innovation access and evidence-based practice.

Background: A friend tells you that her friend just had a mammogram. At the radiology office, they asked her whether she would like to have her mammogram “read” by an AI tool in addition to the radiologist. There was a $40 charge for this extra service. If the radiologist and the AI tool agreed on the interpretation (no cancer-no cancer or cancer-cancer), that would be the conclusion. But if there was disagreement, the mammogram would be sent to a second radiologist. The friend agreed to pay the $40.

The results came back that the first radiologist did not see cancer, but the AI tool did. The second radiologist agreed with the AI tool. The woman was treated for breast cancer. Your friend tells you that this is a great use case to teach why AI is beneficial.

Part I: Initial Reflection

What is your opinion on this scenario? Address the following:

What if this was a new drug instead of an AI tool?

If the AI tool were a new drug, the regulatory and ethical landscape would be significantly different.

Rigorous Testing and Approval: A new drug would require extensive Phase I, II, and III clinical trials to demonstrate safety, efficacy, and appropriate dosing. The drug would only be approved by agencies like the FDA after proving its benefit outweighs its risks in a large, diverse patient population.
Evidence Standard: Anecdotal success is insufficient for drug approval. The standard is statistical evidence derived from controlled studies.
Cost and Access: While drugs are costly, if a drug were proven life-saving, there would be immense pressure for insurance coverage, subsidies, or making it standard of care, rather than a $40 optional add-on.

The AI tool, while not physically administered, is a diagnostic “intervention”. This scenario highlights that AI tools, especially in high-stakes fields like cancer detection, should be held to a similar high standard of validation and regulation as new drugs or medical devices.

Is the plural of anecdote, evidence? What if the next 99 women suffered from unnecessary surgery?

The plural of anecdote is not evidence. The core risk in the hypothetical is false positives. If the AI tool has a high false-positive rate, it may correctly detect cancer in this one woman but incorrectly suggest cancer in many others, leading to:

- Unnecessary biopsies and surgeries (e.g., the next 99 women).
- Emotional distress and anxiety.
- Increased healthcare costs.

A tool that saves one life but causes 99 instances of unnecessary, costly, and invasive procedures may not be deemed beneficial for the overall healthcare system.
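To make the false-positive concern concrete, here is a minimal sketch that applies Bayes' rule to a screening test. The prevalence, sensitivity, and specificity values are hypothetical assumptions chosen only for illustration; they do not come from the scenario or from any published study of a real tool.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a flagged mammogram is truly cancer (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


def expected_flags(n_screened, prevalence, sensitivity, specificity):
    """Expected counts of true and false positives among n_screened women."""
    tp = n_screened * prevalence * sensitivity
    fp = n_screened * (1.0 - prevalence) * (1.0 - specificity)
    return tp, fp


# Hypothetical inputs (assumptions, not data from the scenario).
PREVALENCE = 0.005   # 5 cancers per 1,000 women screened
SENSITIVITY = 0.90   # share of true cancers the tool flags
SPECIFICITY = 0.90   # share of healthy women the tool clears

ppv = positive_predictive_value(PREVALENCE, SENSITIVITY, SPECIFICITY)
true_pos, false_pos = expected_flags(10_000, PREVALENCE, SENSITIVITY, SPECIFICITY)

print(f"PPV: {ppv:.1%}")
print(f"Per 10,000 screened: {true_pos:.0f} true positives, "
      f"{false_pos:.0f} false positives")
```

Under these assumed inputs, only about 4% of flagged mammograms would be true cancers, because the disease is rare and even a modest false-positive rate applies to a very large healthy population. This is why a single success story says little about net benefit.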

Does this case fully illustrate why AI is beneficial, or are there additional considerations such as costs, ethics, or potential risks of over-reliance on AI tools?

While this case is a great example of AI’s potential to augment human performance and catch missed diagnoses, it does not fully illustrate the overall benefit without considering other factors such as cost, ethics, explainability and transparency, and automation bias.

Discuss the implications of trusting AI tools when human opinions differ, especially in high-stakes healthcare situations.

The scenario where the AI saw cancer and the first radiologist did not, leading to a life-saving diagnosis, perfectly illustrates the critical tension in integrating AI into high-stakes medical decisions. When human and AI opinions diverge, the implications are far-reaching, touching on issues of professional responsibility, patient safety, and bias.

Is it ethical to charge $40 for a potentially life-saving service? Should such decisions be subsidized or made standard practice?

The answer depends on the stage of AI development in healthcare. If the technology is mature and already widely deployed, the cost will likely have been driven down to a minimum and the question becomes moot. Until then, access and equity belong at the center of the ethical discussion. Treatments may differ from patient to patient, but the principle of delivering comparably accurate diagnostics to rich and poor alike should hold.

However, the $40 charge might be necessary to cover the development, validation, and computational costs of maintaining the AI system. If the provider cannot charge for the service, it may not be offered at all. One resolution is to standardize the imaging process and bill the AI read as part of a single line item.

Given the high-stakes nature of cancer detection, if the AI’s effectiveness is proven, it should be subsidized or integrated into the standard cost of the mammogram to ensure equal access to the best available care.

Part II: Deeper Analysis

Provide critical analysis of the following considerations:

How do you address automation bias in situations where AI challenges human judgment?

Automation bias raises several important concerns, which fall into the following categories:

1. Overtreatment: The physician incorrectly adopts the AI’s erroneous advice (e.g., performing unnecessary surgery because the AI flagged cancer, as in the hypothetical risk above).
1. Missed Diagnosis: The physician accepts the AI’s “low-risk” classification and fails to investigate further, missing a real issue.
1. Degraded Performance: Studies show that exposing physicians to erroneous AI recommendations can significantly degrade their diagnostic performance. Even AI-trained physicians can defer to flawed AI output, highlighting a critical patient safety risk.
1. Susceptibility: Non-specialists, who stand to gain the most from AI-enabled Clinical Decision Support Systems (CDSS), are often the most susceptible to automation bias.

What metrics might validate the AI tool’s effectiveness, and how would you assess whether it improves outcomes over traditional methods?

Several metrics derived from the confusion matrix, which assess the algorithm’s capability under controlled conditions, can help validate the AI tool’s effectiveness in the mammogram scenario:

As covered in class, metrics such as sensitivity (recall, or true positive rate), specificity (true negative rate), precision (positive predictive value, PPV), and AUC (area under the ROC curve) all describe technical performance. For clinical implementation and workflow, metrics such as workload reduction, turnaround time, and inter-rater agreement (Cohen’s kappa) can be used to evaluate how the tool performs in the real-world clinical environment.
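As an illustration of the metrics named above, the following sketch computes sensitivity, specificity, and PPV from binary labels, plus Cohen's kappa between two readers. The function names and the ten-case toy data are hypothetical and invented purely for this example; zero denominators are not handled.

```python
def confusion_counts(truth, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 means cancer."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and PPV (assumes non-zero denominators)."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # precision
    }


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two binary raters."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    p_a = sum(rater_a) / n  # share of positives called by rater A
    p_b = sum(rater_b) / n  # share of positives called by rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): biopsy-confirmed truth, AI reads, radiologist reads.
truth       = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ai_reads    = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
radiologist = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

ai_metrics = screening_metrics(*confusion_counts(truth, ai_reads))
kappa = cohens_kappa(ai_reads, radiologist)
print(ai_metrics)
print(f"Cohen's kappa (AI vs. radiologist): {kappa:.2f}")
```

Kappa is computed here between the AI and the radiologist rather than against the ground truth, since it measures agreement between readers, not accuracy.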

Ultimately, showing whether the AI tool leads to a measurable improvement in patient health and quality of life compared to the existing standard of care (traditional methods) requires rigorous study design that goes beyond technical validation. This includes:

A pragmatic randomized controlled trial (pRCT) compares two groups of patients, a control group (traditional method) and an intervention group (AI-augmented). It is the gold standard of clinical evidence for proving that AI provides a genuine benefit over traditional methods.
For patient outcomes, we can measure the mortality rate, the stage at diagnosis, and/or the rate of unnecessary biopsies, which captures the number of false positives leading to invasive, costly follow-up procedures.
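A comparison between the two trial arms can be sketched with a simple two-proportion z-test. This is only one possible analysis, shown under the assumption of individually randomized patients and a binary outcome; a real pRCT would follow a pre-specified statistical plan (and might use cluster randomization). The event counts below are hypothetical.

```python
import math


def two_proportion_z_test(events_control, n_control, events_ai, n_ai):
    """Two-sided z-test for a difference in event rates between two arms.

    Returns (rate difference, z statistic, p-value), where the difference
    is the AI-augmented arm minus the control arm.
    """
    p_control = events_control / n_control
    p_ai = events_ai / n_ai
    pooled = (events_control + events_ai) / (n_control + n_ai)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_ai))
    z = (p_ai - p_control) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_ai - p_control, z, p_value


# Hypothetical counts: cancers detected per arm among 10,000 women each.
diff, z, p_value = two_proportion_z_test(
    events_control=50, n_control=10_000,
    events_ai=70, n_ai=10_000,
)
print(f"Detection-rate difference: {diff:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

The same function could be applied to the unnecessary-biopsy rate, so that any gain in detection is weighed against added false positives within the same trial.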

Beyond clinical impact, the success of an AI tool is also measured by its value to the healthcare system. This includes the cost of implementing the AI and managing its false positives, patient satisfaction, and ethical and fairness audits to ensure the AI is not introducing or perpetuating bias.

By successfully demonstrating superior performance across these categories—technical, clinical, and economic—an AI tool can be fully validated as a beneficial addition to standard care.
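The fairness audit mentioned above could start from something as simple as comparing error rates across patient subgroups. The sketch below is a minimal illustration, assuming each record carries a subgroup label, a confirmed diagnosis, and the AI's call; the group names and records are hypothetical, and the code assumes every group contains at least one positive and one negative case.

```python
from collections import defaultdict


def per_group_rates(records):
    """Compute sensitivity and false-positive rate for each subgroup.

    records: iterable of (group, truth, predicted) with 0/1 labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, truth, predicted in records:
        if truth == 1 and predicted == 1:
            key = "tp"
        elif truth == 0 and predicted == 1:
            key = "fp"
        elif truth == 1 and predicted == 0:
            key = "fn"
        else:
            key = "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "sensitivity": c["tp"] / (c["tp"] + c["fn"]),
            "false_positive_rate": c["fp"] / (c["fp"] + c["tn"]),
        }
    return rates


# Hypothetical audit records: (subgroup, confirmed cancer, AI flagged).
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_a", 0, 0), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 1, 0),
    ("group_b", 0, 0), ("group_b", 0, 0), ("group_b", 0, 0),
]

rates = per_group_rates(records)
sensitivity_gap = abs(rates["group_a"]["sensitivity"]
                      - rates["group_b"]["sensitivity"])
for group, r in sorted(rates.items()):
    print(group, r)
print(f"Sensitivity gap between groups: {sensitivity_gap:.2f}")
```

A large gap in sensitivity between subgroups would suggest the tool misses more cancers in one population, which is the kind of bias such an audit is meant to surface before deployment.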

What are the broader implications of integrating AI into routine diagnostics? How might this influence healthcare disparities?

This depends heavily on geographical location and cultural background. Two main topics need to be carefully studied:

Implications for clinical practice and office workflow, centering on how AI transforms the day-to-day operations of patient care:
1. Resource optimization: allocation and efficiency.
1. A shift in the clinician’s role: from primary data interpreter to validator and curator.
1. EHR (electronic health record) integration, including training to help clinicians understand the AI’s behaviors, capabilities, and limitations.
1. Error reduction through rigorous pattern training, while lowering the risk of automation bias.

Healthcare disparities from both ethical and societal perspectives: this is probably the most critical and complicated implication, because AI has the potential to both mitigate and exacerbate existing healthcare disparities, largely depending on how the system is designed and implemented.

The opportunity: using AI to mitigate medical disparities:
1. Giving patients in under-resourced and remote regions access to high-quality, accurate diagnostics.
1. Learning from vast amounts of data to uncover correlations and underlying mechanisms of health disparities.
1. Cost reduction and proactive care.
1. Freeing clinicians’ time for more critical decisions.

Risk management: controlling existing inequities exacerbated by AI systems:
1. Data quality, consistency, and algorithmic bias.
1. A vast number of biomarkers and imaging features overpowering social determinants of health (SDOH).
1. Tiered care affecting treatment quality: spending $40 gets you a more accurate result.

Does this example overstate AI reliability, or might it underrepresent limitations of human expertise?

This question is common with new technologies and innovations. The answer breaks into two aspects:

1. Marketing: It is common to overstate the effects in order to spread AI to various levels of the healthcare system. Word of mouth makes it easier for more people to learn what a new AI technology brings to the table.
1. Single-success fallacy: A single success paired with an unknown false-positive rate can hurt the AI’s reputation from the beginning. The tool needs to demonstrate generalizable performance across all cases.

On the other hand, the scenario provides a perfect example of human error, highlighting a known limitation of the current standard of care. This includes:

Human variability and fatigue: The first radiologist’s miss hints at human-factor limitations in diagnostics, such as inter-rater variability and degraded performance due to possible fatigue and/or high workload.
AI augmentation: As mentioned in class, AI’s most valuable role is not replacing humans but augmenting them. It acts as a critical second pair of eyes to mitigate human limitations.
Confirmation bias: It is possible that the second radiologist, knowing the AI had flagged cancer, was predisposed to look harder or to confirm the AI’s finding. While the second opinion saved the patient, the scenario doesn’t isolate whether the human limitation (the miss) was purely visual or a function of the system.

The scenario shows that AI can benefit diagnostics, but claiming that AI is more accurate or more powerful is an overstatement without examining other issues such as the reliability of the system. The more neutral implication is that AI, when implemented properly, can compensate for the inherent variability and limitations of human expertise, leading to better patient outcomes.

What would you do if you were a patient given this choice?

It depends on the reputation of the hospital, the department, the facility rating, and the clinicians. If the doctors are highly reputable and the imaging machines are newer, I might decline the AI offer, knowing that the AI is still at an experimental stage.

Part III: Comparative Evaluation

Address the regulatory and implementation pathway questions:

Why not perform a pragmatic randomized controlled trial of usual care vs. usual care + AI?

A pRCT is the gold standard for proving clinical effectiveness and should be a first priority. However, several contextual factors need to be weighed before launching one immediately:

- Adoption and usability: Even a well-designed pRCT may fail to detect an impact if clinicians and patients do not find the AI tool useful or easy to integrate into their workflow, leading to low adoption within the trial. As with any software, if the staff doesn’t trust the AI or the interface is poor, the intervention may fail regardless of its theoretical performance or claimed benefits.
- Ethical constraints: The principle of clinical equipoise (genuine uncertainty about which arm of the pRCT is better) is difficult to maintain when the benefit already seems clear.
- Research design and adjustments for AI: When setting up the pRCT, clinicians, informatics experts, and biostatisticians need to agree on the design and adapt it to the AI context, including transparency of the training data, calibration methodology, and identification and analysis of performance errors.
- Logistics and cost: A pRCT can be costly and take a long time to complete, because it must recruit a large and diverse cohort across multiple settings to ensure the results generalize to the real-world patient population. As a result, it can delay access to potentially life-saving technology.

Some argue this allows faster access to potentially beneficial technology. Others worry it risks overuse, patient anxiety, false positives, and healthcare costs without clear outcome benefits. What is your opinion?

My opinion is that patient safety and equity must override the speed of deployment. While I support faster access to proven beneficial technology, I am highly concerned about precipitous adoption of untested systems.

From a management perspective, faster access is indeed a great opportunity for equity and early detection. AI’s fundamental role in reducing human error, expediting diagnosis, and potentially catching early-stage cancers, as in the mammogram scenario, is a strong proposition for moving the business forward. However, we also need to attend to the other side of the story: the risks, which include threats to patient trust and cost control. These risks are real and consequential. False positives, such as incorrectly flagged cancer, lead directly to patient anxiety, expensive and unnecessary follow-up procedures (e.g., biopsies), and potentially defensive medicine, increasing the overall cost of care. Widespread adoption without clear outcome benefits can erode patient trust in both the technology and the clinician.

The prudent path is, first, to adopt a cautious, evidence-driven approach guided by regulatory principles; second, to verify that the AI can replicate the success in the scenario (reducing false negatives); third, to protect patient well-being and control costs (lowering false positives); and last, to ensure the AI performs equally well across all demographic groups (fairness).

In the U.S., many AI tools in radiology receive FDA clearance through the 510(k) pathway, requiring “substantial equivalence” to existing devices based on retrospective data and reader studies—not necessarily RCTs. Is this pathway being abused?

First of all, many 510(k) clearances offer limited transparency about the algorithm’s training details, which are critical for assessing quality and safety. While the 510(k) pathway offers regulatory speed, which benefits innovation, its application to AI often results in market clearance of devices without the rigorous prospective clinical data (such as an RCT) needed to definitively prove a meaningful clinical benefit and rule out safety concerns such as high false-positive rates. This regulatory issue can sometimes block the progress of new technology and innovation.

By comparison, competing markets such as China appear to lack equally complete and meticulous government regulation. AI development in healthcare remains a high-stakes field there, but projects can take various directions or approaches as long as risk management is the top priority, and the development of practical AI applications is booming. For now, hospitals in the U.S. have to rely on independent, post-market validation studies to confirm a technology’s effectiveness before integrating it as standard of care. It may look as though the pathway is being abused, but that reflects the current, initial stage and its specific context. Once the FDA treats AI capability in healthcare as part of the national strategy, regulation might shift gears. Power ahead!

Disclaimer: This research paper was originally authored in English. You are currently viewing an automated machine translation.
