Objective: To reflect on the current state of bias and fairness in the application of AI in medicine and propose specific and practical solutions for achieving more equitable healthcare AI systems.
Part I: Initial Reflection
Address the following questions with clear reasoning and examples:
- In your opinion, what is the best approach for improving fairness with the use of AI in medicine?
- Is the AI field headed in the right direction? If not, what needs to change?
- Consider both technical solutions (algorithmic approaches) and systemic changes (policy, regulation, data collection practices)
Bias and fairness in artificial intelligence (AI) in medicine remain pressing concerns, as AI systems have repeatedly demonstrated disparities in performance across demographic groups, particularly for racial and ethnic minorities, women, and other historically marginalized populations. These disparities arise from multiple sources, including underrepresentation in training datasets, algorithmic design choices, and the use of proxies that reflect structural inequities rather than true clinical need. For example, risk prediction tools trained on non-representative data have led to underdiagnosis and undertreatment in Black and Hispanic populations, exacerbating existing health inequities.[1][2] Moreover, because policies governing the management and application of AI systems differ across regions, what counts as fairness is itself defined differently from one context to another.
The best approach for improving fairness with the use of artificial intelligence in medicine is to apply an equity-centered, multidisciplinary framework that integrates both technical and systemic solutions throughout the entire AI lifecycle. This includes:
- intentional problem formulation that centers the needs of populations most at risk of algorithmic harm;
- collection and curation of diverse, representative datasets using federated and community-engaged methods;
- rigorous subgroup performance audits and use of fairness metrics such as calibration within groups and equalized odds;
- transparent reporting of data sources, model assumptions, and subgroup validation;
- ongoing governance, including inventory and periodic bias audits of deployed algorithms; and
- authentic community engagement at every stage, from design to deployment, to ensure trustworthiness and relevance.[3][4]
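As a rough illustration of one of the fairness metrics named above, the following sketch computes per-group true-positive and false-positive rates and the resulting equalized-odds gap. The function names and the toy labels are hypothetical, chosen only for illustration; a real audit would use validated cohort data and report uncertainty around each rate.

```python
from collections import defaultdict

def rates_by_group(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} from binary labels and binary predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        key = ("tp" if p else "fn") if y else ("fp" if p else "tn")
        counts[g][key] += 1
    out = {}
    for g, c in counts.items():
        pos, neg = c["tp"] + c["fn"], c["fp"] + c["tn"]
        # A group with no positives (or no negatives) yields NaN and needs
        # separate handling before any gap is computed.
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        out[g] = (tpr, fpr)
    return out

def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group difference in TPR or FPR (0 means parity)."""
    rates = rates_by_group(y_true, y_pred, groups)
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Toy data (illustrative only): the model is perfect for group A, noisy for B.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A"] * 4 + ["B"] * 4
print(rates_by_group(y_true, y_pred, groups))
print(equalized_odds_gap(y_true, y_pred, groups))  # 0.5 for this toy data
```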
Technically, distributional methods (e.g., data augmentation, reweighting, federated learning) and algorithmic approaches (e.g., adversarial learning, fairness regularization) are effective for mitigating bias, but must be tailored to the clinical context and disease-specific risks.[5][3] Systemic changes are equally critical: policy frameworks should mandate transparency, accountability, and regular external validation; regulatory bodies such as the FDA and World Health Organization emphasize inclusiveness and equity in deployment strategies; and institutions should establish governance structures for oversight and deimplementation when harm is detected.[4][6]
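To make the reweighting idea concrete, here is a minimal sketch of one common scheme (often called reweighing), in which each training example receives the weight P(group) x P(label) / P(group, label), so that group and label become statistically independent in the weighted data. This is one possible instantiation, not necessarily the method used in the cited work; the function name and toy data are assumptions.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-example weights that decorrelate group membership from the label."""
    n = len(labels)
    g_count = Counter(groups)
    y_count = Counter(labels)
    gy_count = Counter(zip(groups, labels))
    # weight = P(g) * P(y) / P(g, y), written with raw counts
    return [
        (g_count[g] * y_count[y]) / (n * gy_count[(g, y)])
        for g, y in zip(groups, labels)
    ]

# Toy data (illustrative only): positives are over-represented in group A.
groups = ["A", "A", "A", "B", "B", "B"]
labels = [1, 1, 0, 1, 0, 0]
weights = reweighing_weights(groups, labels)
print(weights)  # [0.75, 0.75, 1.5, 1.5, 0.75, 0.75]
```

The weights would then be passed to a learner that accepts per-sample weights. After weighting, the positive rate in this toy example is 0.5 in both groups.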
Currently, the field of artificial intelligence is not yet headed fully in the right direction regarding bias and fairness. While awareness is increasing and national initiatives (e.g., NIH Bridge2AI) are building diverse datasets, most AI tools still rely on nonrepresentative data and lack robust subgroup reporting and governance.[7][8][9] To correct course, changes needed include:
- prioritizing equity in funding and research agendas;
- enforcing standards for dataset diversity and transparency;
- requiring external validation in all relevant populations;
- mandating transparent subgroup performance reporting; and
- centering community engagement and distributive justice in all phases of AI development and deployment.[3][4][9]
Finally, cross-border and geopolitical contexts must be explicitly integrated into the analysis of bias and fairness in medical artificial intelligence, as the movement of immigrants and refugees introduces new disease patterns and demographic shifts that challenge the generalizability and equity of AI models. AI systems trained on data from one region or population may not perform accurately for migrants, refugees, or populations with different genetic, cultural, or environmental backgrounds, leading to misdiagnosis or inequitable care.[10][11][12]
Data poverty, contextual shifts, and structural biases are particularly pronounced in low- and middle-income countries and among mobile populations, exacerbating global health disparities if not addressed.[11][12]
In summary, the pursuit of fairness in medical AI requires shifting from purely technical fixes to an equity-centered, multidisciplinary framework that addresses systemic disparities throughout the entire AI lifecycle. While current AI systems often exacerbate health inequities for women and marginalized racial groups due to unrepresentative data and biased algorithmic design, these harms can be mitigated through a combination of technical interventions (such as adversarial learning and reweighting) and systemic reforms (such as rigorous subgroup auditing and transparent governance). In addition, factoring cross-border and geopolitical contexts into AI fairness analysis requires representative data, ongoing model adaptation, stakeholder engagement, and robust ethical governance to ensure equitable care for immigrants, refugees, and globally mobile populations.[3][4][5][6]
Part II: Deeper Analysis
Provide critical analysis of the following considerations:
- What are the advantages and disadvantages of including race as a predictor in an AI tool?
- Hospitalized patients with missing data for race routinely have the worst outcomes. How would you explain this phenomenon?
- Can you imagine that AI tools might improve fairness and decrease bias? How would you design a project to make that happen?
- Are minoritized, rural, socioeconomically disadvantaged, or non-Western populations adequately represented in the datasets used to create most AI tools? If not, how would you improve representation?
- In your opinion, what needs to be changed to address the following systemic issues:
- The choice of proxies for health outcomes (e.g., cost instead of illness)?
- Unequal training data (underrepresentation of minorities, women, rural populations)?
- Unvalidated clinical assumptions (e.g., pulse oximeter accuracy across skin tones)?
- Lack of transparent subgroup performance reporting?
Evaluating the inclusion of race in predictive models through the lenses of precision and equity yields competing arguments on each side. The most important step is to push for appropriate regulation informed by a global perspective. The consensus in the field is rapidly shifting from treating race as a biological variable (race-based medicine) toward using it to understand social and environmental impacts (race-conscious medicine). The key is to apply safeguards that prevent discrimination while retaining enough information to support diagnosis. With such safeguards in place, the use of race may cease to be a fairness concern in the future.
Below is a critical analysis of the advantages and disadvantages of including race as a predictor in medical AI tools.
| Consideration | Advantages (The “Precision” Argument) | Disadvantages (The “Equity & Scientific” Argument) |
| --- | --- | --- |
| Statistical Accuracy | Race can capture a cluster of variables (genetics, environment, diet, stress) that are otherwise difficult to measure. Removing it without having better variables can decrease the model’s overall predictive power for certain groups. | Race is a social construct, not a biological one. Using it as a “shorthand” for biology is scientifically inaccurate and often masks the true underlying drivers like social determinants of health (SDOH). |
| Addressing Disparities | “Fairness through Awareness:” By including race, developers can explicitly audit and “tune” the model to ensure it doesn’t perform worse for marginalized groups. | “Automation Bias:” If an algorithm predicts lower success for a group (e.g., Black patients in VBAC calculators), doctors may blindly follow the “data,” leading to systematic undertreatment or unnecessary procedures. |
| Biomarker Proxy | In specific cases, race may correlate with outcomes due to shared ancestry (e.g., sickle cell trait). It can act as a temporary proxy until direct genetic markers are available. | Race is a leaky proxy. It is too imprecise to account for the vast genetic diversity within racial groups and completely fails to account for multiracial identities. |
| Health Equity Goals | It allows for “race-conscious” interventions, such as prioritizing resources for communities historically denied care. | “Race-Norming” Harm: Applying “correction factors” (like in eGFR kidney function tests) can make sicker patients appear “healthier” on paper, delaying life-saving transplants. |
Hospitalized patients with missing race data often have the worst clinical outcomes. This is rarely a random occurrence; it is a signal of systemic neglect or structural barriers.
- Selection Bias: Vulnerable populations—such as those with higher comorbidities or lower-quality insurance—are statistically more likely to have incomplete documentation in structured Electronic Health Records (EHR).
- Safety Net Strain: Facilities with fewer resources (under-resourced safety-net hospitals) often have less robust data-capture protocols, and the patients they serve are already at higher risk due to adverse social determinants of health.
- Trust Gaps: Patients from marginalized communities may intentionally withhold demographic information due to historical mistrust of medical institutions, which can lead to fragmented care.
AI has the potential to expose and mitigate bias rather than just repeat it. A project designed for this would include:
- Human-Centered Design: Involving diverse stakeholders—doctors, ethicists, and community members—from day one to define what “fairness” looks like for a specific population.
- Fairness-Aware Algorithms: Using “in-processing” techniques where the model is penalized during training if it shows a performance gap between demographic subgroups.
- Open-Source Auditing: Utilizing tools like “SHAP” or “LIME” to explain why an AI made a decision, ensuring it isn’t relying on biased proxies like zip codes.
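The in-processing idea in the list above can be sketched as an objective that adds a penalty for the gap in average loss between subgroups. The snippet below only evaluates such an objective for fixed predictions; actual training would minimize it with a differentiable framework. The penalty form (largest gap in mean log-loss, scaled by a hypothetical `lam`) is an assumption for illustration, and other penalties are equally plausible.

```python
import math

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one example, with clipping for stability."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fairness_penalized_loss(y_true, p_hat, groups, lam=1.0):
    """Return (overall loss + lam * subgroup loss gap, subgroup loss gap)."""
    losses = [log_loss(y, p) for y, p in zip(y_true, p_hat)]
    overall = sum(losses) / len(losses)
    per_group = {}
    for loss, g in zip(losses, groups):
        per_group.setdefault(g, []).append(loss)
    group_means = [sum(v) / len(v) for v in per_group.values()]
    gap = max(group_means) - min(group_means)  # performance gap across groups
    return overall + lam * gap, gap

# Toy data (illustrative only): predictions are less confident for group B.
y_true = [1, 0, 1, 0]
p_hat = [0.9, 0.1, 0.6, 0.4]
groups = ["A", "A", "B", "B"]
total, gap = fairness_penalized_loss(y_true, p_hat, groups, lam=2.0)
print(round(total, 3), round(gap, 3))
```

A larger `lam` trades overall accuracy for a smaller subgroup gap, and that trade-off is itself something the stakeholders described above would need to agree on.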
Outside the U.S., there is a significant opportunity to use AI systems to build personalized health profiles that pertain to individuals rather than to groups of people. This is most feasible where citizens are covered by universal public health insurance programs, as in many European countries, Canada, and Taiwan, where governments continually seek ways to improve care and lower costs. In such a model, the government operates an AI system that is fed each patient’s latest health profile and continuously receives treatment updates from the EHR, so roles and responsibilities are clear. If an AI tool must use race, it should be used to direct more resources to those in need, rather than to deny or delay care based on a perceived biological difference.
To improve representation in datasets, we must invest in digital infrastructure in data-poor regions (e.g., digitizing records in rural clinics). Additionally, “federated learning” and “federated regulated” approaches can be used to train models across different hospitals without moving sensitive patient data, allowing for more diverse data curation while respecting privacy.
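As a simplified sketch of how federated learning keeps patient data on site, the snippet below shows only the aggregation step of federated averaging: each hospital trains locally and shares model parameters plus its sample count, and a coordinator computes a sample-weighted mean. The function name and numbers are hypothetical; production systems typically add secure aggregation, multiple communication rounds, and other safeguards not shown here.

```python
def federated_average(site_updates):
    """Sample-weighted average of model parameters.

    site_updates: list of (parameters, n_samples) pairs, one per hospital.
    Only these parameters leave each site; raw patient records do not.
    """
    total = sum(n for _, n in site_updates)
    dim = len(site_updates[0][0])
    return [
        sum(params[i] * n for params, n in site_updates) / total
        for i in range(dim)
    ]

# Toy example (illustrative only): a small rural clinic and a large hospital.
rural_clinic = ([1.0, 2.0], 100)
urban_hospital = ([3.0, 4.0], 300)
print(federated_average([rural_clinic, urban_hospital]))  # [2.5, 3.5]
```

Note that plain sample-size weighting lets large sites dominate the global model, so smaller or underrepresented sites may need a different weighting to be heard.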
Finally, addressing these systemic issues will require substantial work, including:
- Faulty proxies: avoid using cost/utilization as a proxy for need.
- Unequal training data: implement mandated diversity quotas for training sets and use data augmentation to balance underrepresented groups.
- Unvalidated assumptions: conduct real-world and cross-border validation across races, body types, lifestyles, and similar characteristics.
Part III: Evaluation Methods and Implementation
Address the methodological and practical evaluation questions:
- How can fairness and bias be evaluated in the model evaluation stage?
- How can fairness and bias be evaluated in the implementation stage?
- How could a forest plot be used to graph fairness results across different demographic subgroups?
- What are better ways of evaluating fairness and bias in clinical AI research than a pragmatic randomized controlled trial? Consider alternative study designs and metrics.
To approach these evaluation questions, it is important to bridge the gap between technical rigor and real-world equity. Standard population-level metrics often mask “hidden stratification,” where a model appears successful overall but fails specific subgroups. We can shift from fairness through unawareness to fairness through awareness in order to move toward truly inclusive health systems.
In the model evaluation stage, the focus should be on identifying systematic errors in how the model was trained. This occurs before any patient contact, so the evaluation is at the system level:
- Stratified performance metrics: report AUROC, AUPRC, and similar metrics for every demographic slice.
- Data bias metrics: quantify the distance between label distributions across demographic groups.
- Calibration audits: verify that predicted risk matches observed risk within each group.
- Net benefit parity: beyond simple accuracy, use standardized net benefit (sNB) to compare models’ clinical utility across groups at different decision thresholds.
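The model-stage checks listed above could be combined in a small audit routine along the following lines. This is a sketch under stated assumptions: AUROC is computed by pairwise comparison, calibration is summarized as mean predicted risk minus observed event rate, and standardized net benefit is taken as net benefit divided by prevalence. All function names and the toy cohort are hypothetical.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_gap(y_true, scores):
    """Mean predicted risk minus observed event rate (0 is ideal)."""
    return sum(scores) / len(scores) - sum(y_true) / len(y_true)

def standardized_net_benefit(y_true, scores, threshold):
    """Net benefit divided by prevalence; assumes at least one event."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    net_benefit = tp / n - (fp / n) * threshold / (1 - threshold)
    return net_benefit / (sum(y_true) / n)

def audit_by_group(y_true, scores, groups, threshold=0.5):
    """Compute the three metrics separately for every demographic slice."""
    by_group = {}
    for y, s, g in zip(y_true, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    return {
        g: {
            "auroc": auroc(ys, ss),
            "calibration_gap": calibration_gap(ys, ss),
            "snb": standardized_net_benefit(ys, ss, threshold),
        }
        for g, (ys, ss) in by_group.items()
    }

# Toy cohort (illustrative only).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1, 0.6, 0.3, 0.4, 0.2]
groups = ["A"] * 4 + ["B"] * 4
for group, metrics in audit_by_group(y_true, scores, groups).items():
    print(group, metrics)
```

In this toy cohort, group B has lower discrimination, under-predicted risk, and half the standardized net benefit of group A at the 0.5 threshold, even though a pooled metric would hide the difference.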
In the implementation stage, the focus switches to how the system interacts with real-world workflows and existing systemic inequities. Evaluations could include:
- Predictive consistency: check whether clinically similar patients from different demographic groups receive the same prediction.
- Algorithmic impact assessment: do doctors or nurse practitioners follow the AI alerts more often for certain types of patients?
- Proxy drift: monitor whether the model comes to rely on leaky proxies that cause automated discrimination.
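One of the implementation-stage checks above, whether clinicians act on alerts equally across patient groups, can be monitored with a simple tally from deployment logs. The sketch below assumes a hypothetical log format of (group, alert_fired, clinician_acted) tuples; real logs would differ and would need adjustment for case mix before any conclusion is drawn.

```python
def alert_follow_rates(events):
    """Share of fired alerts that clinicians acted on, per patient group.

    events: iterable of (group, alert_fired, clinician_acted) tuples.
    """
    fired, acted = {}, {}
    for group, alert_fired, clinician_acted in events:
        if not alert_fired:
            continue  # only fired alerts count toward adherence
        fired[group] = fired.get(group, 0) + 1
        acted[group] = acted.get(group, 0) + int(clinician_acted)
    return {g: acted[g] / fired[g] for g in fired}

# Toy log (illustrative only).
log = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", True, False),
    ("B", True, True), ("B", True, False), ("B", True, False), ("B", True, False),
    ("B", False, False),
]
print(alert_follow_rates(log))  # {'A': 0.75, 'B': 0.25}
```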
A forest plot could be an ideal visual tool, giving health administrators a snapshot of subgroup equity.
- The “Line of No Effect”: In a fairness forest plot, the central vertical line represents exact parity (e.g., a ratio of 1.0 between a subgroup’s metric and that of a reference group).
- Plotting Disparities: Each row represents a demographic subgroup such as “Rural Male”. The dot represents the subgroup’s point estimate (e.g., its AUROC expressed relative to the reference group), and the horizontal whiskers represent the 95% confidence interval.
- Interpreting Results: If a subgroup’s whiskers do not overlap those of another group, or do not cross the central line, the gap is statistically significant at that confidence level and warrants administrative review.
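The rows of such a plot can be prepared with a few lines of code. The sketch below computes each subgroup's sensitivity with a Wilson 95% confidence interval and renders a text-only forest plot. As a simplification, the reference line is placed at the pooled sensitivity rather than at a ratio of 1.0 (a ratio-based version would need intervals for the ratio, e.g., from bootstrapping). The subgroup counts and function names are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Point estimate and Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

def forest_rows(counts):
    """counts: {subgroup: (true_positives, positives)} -> (label, est, lo, hi)."""
    return [(g, *wilson_interval(tp, n)) for g, (tp, n) in counts.items()]

def render_forest(rows, reference, width=41):
    """Text forest plot on a 0..1 axis: '-' whiskers, 'o' estimate, '|' reference."""
    def pos(x):
        return round(x * (width - 1))
    lines = []
    for label, est, lo, hi in rows:
        cells = [" "] * width
        for i in range(pos(lo), pos(hi) + 1):
            cells[i] = "-"
        cells[pos(reference)] = "|"
        cells[pos(est)] = "o"
        lines.append(f"{label:<14}{''.join(cells)}  {est:.2f} [{lo:.2f}, {hi:.2f}]")
    return "\n".join(lines)

# Toy counts (illustrative only): detected cases out of all true cases.
counts = {"Urban female": (90, 100), "Rural male": (30, 50)}
pooled = sum(tp for tp, _ in counts.values()) / sum(n for _, n in counts.values())
rows = forest_rows(counts)
print(render_forest(rows, reference=pooled))
```

With these toy counts, the "Rural male" interval lies entirely below the pooled line, which is the pattern the interpretation rule above would flag for review.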
While pragmatic randomized controlled trials (pRCTs) are the gold standard, they are often considered too slow and too expensive to keep pace with rapidly evolving AI software. After reviewing the literature, I found several alternatives:
- Target Trial Emulation: This uses high-quality Real-World Data (RWD) to replicate the conditions of a trial retrospectively. It allows you to see how the AI would have performed on diverse populations not typically captured in clinical trials.
- Difference-in-Differences (DiD): This quasi-experimental design compares outcomes before and after AI implementation across different hospitals or clinics (one with the AI, one without) to see if the tool reduced or widened existing health gaps.
- Shadow Mode (Parallel) Studies: Running the AI in the background of clinical care without it influencing decisions. This allows for a safe “stress test” of the model’s fairness on real-time data before it is “turned on” for patients.
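The difference-in-differences design listed above reduces to simple arithmetic once the outcome is defined. In the sketch below, the outcome is a hypothetical disparity gap (for example, the difference in missed-diagnosis rate between two subgroups) measured at a hospital that adopted the AI tool and at a comparison hospital. The estimate is only interpretable under the usual parallel-trends assumption, and all numbers are made up for illustration.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: change at the AI site minus change at the control site."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Toy disparity gaps per observation period (illustrative only).
ai_hospital_pre = [0.10, 0.12]
ai_hospital_post = [0.06, 0.08]
comparison_pre = [0.10, 0.10]
comparison_post = [0.09, 0.09]

effect = did_estimate(ai_hospital_pre, ai_hospital_post, comparison_pre, comparison_post)
print(round(effect, 3))  # negative values mean the gap narrowed more at the AI site
```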
Although I have no personal experience with these designs, in a research setting, target trial emulation combined with “γ-subgroup fairness” reporting seems promising. It can provide a statistically robust view of equity without the $10M+ price tag of a prospective trial.
References
[1] Considering Biased Data as Informative Artifacts in AI-Assisted Health Care. Ferryman K, Mackintosh M, Ghassemi M. The New England Journal of Medicine. 2023;389(9):833-838. doi:10.1056/NEJMra2214964.
[2] Sociodemographic Bias in Clinical Machine Learning Models: A Scoping Review of Algorithmic Bias Instances and Mechanisms. Colacci M, Huang YQ, Postill G, et al. Journal of Clinical Epidemiology. 2025;178:111606. doi:10.1016/j.jclinepi.2024.111606.
[3] Use of Artificial Intelligence in Improving Outcomes in Heart Disease: A Scientific Statement From the American Heart Association. Armoundas AA, Narayan SM, Arnett DK, et al. Circulation. 2024;149(14):e1028-e1050. doi:10.1161/CIR.0000000000001201.
[4] Guiding Principles to Address the Impact of Algorithm Bias on Racial and Ethnic Disparities in Health and Health Care. Chin MH, Afsar-Manesh N, Bierman AS, et al. JAMA Network Open. 2023;6(12):e2345050. doi:10.1001/jamanetworkopen.2023.45050.
[5] A Survey of Recent Methods for Addressing AI Fairness and Bias in Biomedicine. Yang Y, Lin M, Zhao H, et al. Journal of Biomedical Informatics. 2024;154:104646. doi:10.1016/j.jbi.2024.104646.
[6] Artificial Intelligence in Cardiovascular Care-Part 2: Applications: JACC Review Topic of the Week. Jain SS, Elias P, Poterucha T, et al. Journal of the American College of Cardiology. 2024;83(24):2487-2496. doi:10.1016/j.jacc.2024.03.401.
[7] Considering Biased Data as Informative Artifacts in AI-Assisted Health Care. Ferryman K, Mackintosh M, Ghassemi M. The New England Journal of Medicine. 2023;389(9):833-838. doi:10.1056/NEJMra2214964.
[8] Artificial Intelligence and Health Equity. Bright TJ, Norris KC. Annual Review of Medicine. 2025;. doi:10.1146/annurev-med-043024-125309.
[9] Bridging the Digital Divide: Artificial Intelligence as a Catalyst for Health Equity in Primary Care Settings. Osonuga A, Osonuga AA, Fidelis SC, et al. International Journal of Medical Informatics. 2025;204:106051. doi:10.1016/j.ijmedinf.2025.106051.
[10] Artificial Intelligence in Migrant Health: A Critical Perspective on Opportunities and Risks. Matlin SA, Claron IMM, Merone J, et al. The Lancet Regional Health. Europe. 2025;57:101421. doi:10.1016/j.lanepe.2025.101421.
[11] Use of Artificial Intelligence to Address Health Disparities in Low- And Middle-Income Countries: A Thematic Analysis of Ethical Issues. Yu L, Zhai X. Public Health. 2024;234:77-83. doi:10.1016/j.puhe.2024.05.029.
[12] Artificial Intelligence and the Future of Global Health. Schwalbe N, Wahl B. Lancet (London, England). 2020;395(10236):1579-1586. doi:10.1016/S0140-6736(20)30226-9.
