Introduction
Medical imaging, such as computed tomography (CT) and magnetic resonance imaging (MRI), provides detailed structural and functional insights from organs to cells and is essential in modern medicine. By analyzing lesion morphology, signal intensity, and spatial distribution in medical images, clinicians can determine tissue types, assess invasiveness, and understand interactions with adjacent structures, providing support for disease diagnosis and treatment. However, clinical imaging diagnostics are limited by their heavy reliance on subjective expertise, resulting in variability in interpretations. In addition, the heterogeneity of diseases such as cancer, reflected in biological variability both among patients and even within lesions of the same patient, presents significant challenges. This inherent complexity constrains the effectiveness of conventional imaging descriptors and parameters in achieving the high precision necessary for personalized medicine.
Over the past decade, artificial intelligence (AI) technologies have made remarkable progress in computer vision and natural language processing (NLP) and have shown great potential in medical imaging. Notable breakthroughs include improved disease visualization and quantification. AI techniques have enhanced MRI reconstruction quality and efficiency[1] and increased CT resolution while preserving tumor edges and textures.[2] Moreover, AI-driven disease quantification has advanced across a broad spectrum of clinical tasks, addressing the varied demands of precision medicine. Compared with traditional diagnostics, deep learning networks and radiomics models based on predefined features can extract multidimensional features from medical images with higher precision, enabling robust mappings to microscopic pathological information and clinical outcomes.[3] For example, some models have demonstrated high accuracy in disease diagnosis[4] and in predicting lymph node and distant metastases,[5,6] and have surpassed clinical indicators in evaluating treatment response.[7]
Currently, most AI models remain in the research stage, facing significant challenges in transitioning to real-world implementation. Beyond the need for large-scale clinical trials and addressing ethical and regulatory concerns, key technical hurdles lie in improving model accuracy and generalizability while developing systematic pathways for seamless integration into existing clinical workflows. Specifically, current methods often suffer from limited interactivity, rigid input–output patterns, and outputs restricted to single-value results, which are unintuitive for clinicians and increase the learning and usage burden. Although visualization techniques such as heatmaps and feature importance weights have been introduced, they lack deep causal inference and fail to illuminate decision paths, leaving clinicians to rely on subjective judgment when interpreting results. Furthermore, existing models struggle to effectively extract and use fundamental disease knowledge, making their performance highly dependent on the distribution of the experimental data. As a result, these models often fail to adapt to diverse patient populations and complex clinical conditions when applied to external cohorts, leading to poor generalizability and scalability. In addition, most models rely on single-modality imaging data and lack the ability to process and integrate multimodal information. Clinical practice, however, typically involves combining diverse data sources, such as medical records, clinical indicators, and multimodal imaging, to achieve a comprehensive understanding of a patient’s condition. Leveraging the complementary and interactive information within multimodal data has significant potential to enhance diagnostic accuracy and personalization. Although researchers have proposed various data fusion strategies at the pixel, feature, and decision levels, robust paradigms are still lacking. Additional challenges include resolving differences in resolution, dimensionality, and signal-to-noise ratios across modalities, as well as inconsistencies in data quality and handling missing modalities.
To address these challenges, large model technologies present a promising solution. Large language models (LLMs), such as GPT (Generative Pre-trained Transformer)[8] and LLaMA (Large Language Model Meta AI),[9] exhibit robust NLP capabilities. These models can integrate complex imaging analysis results with patient records and laboratory data to generate clear reports and personalized treatment recommendations. Moreover, clinicians can engage with these systems through real-time dialogue, enhancing decision-making efficiency while reducing reliance on technical expertise. Notably, the integration of LLMs with techniques such as retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning further strengthens clinical support by enabling information retrieval and stepwise reasoning. For instance, a study[10] has shown that the OpenAI o1-preview model, leveraging CoT technology, performs exceptionally well in complex multi-step reasoning on medical tasks, particularly in differential diagnosis generation, diagnostic reasoning presentation, and management reasoning, surpassing the performance of human physicians and earlier models. SAM (Segment Anything Model),[11] a representative large visual model (LVM), demonstrates outstanding zero-shot generalization capabilities. It captures precise details and edges of objects, significantly reducing manual intervention and improving the efficiency and accuracy of image segmentation. In addition, multimodal large models (MLMs), exemplified by vision-language models, have achieved remarkable progress in integrating data from different modalities. For instance, UNITER (Universal Image-Text Representation)[12] leverages the self-attention mechanism of transformers to learn interaction features between images and text, progressively aligning them to enhance semantic correlation. CLIP (Contrastive Language-Image Pretraining)[13] uses contrastive learning to align image and text features in a unified vector space, ensuring that semantically similar features are closely positioned. Flamingo[14] combines pre-trained visual encoders with language models, enabling efficient learning for new tasks. DeepSeek,[15] a state-of-the-art mixture-of-experts (MoE) language model, further advances large model capabilities by offering highly efficient training and inference through innovative architectural designs, including multi-head latent attention (MLA) and DeepSeekMoE. With its ability to activate only a fraction of its parameters during inference, DeepSeek achieves top-tier performance while significantly reducing training costs and memory usage. Its robust multimodal capabilities have the potential to handle diverse medical data sources, integrating imaging, textual, and clinical information to enhance diagnostic and decision-making processes, making it an ideal candidate for medical AI applications. Taken together, the ability of MLMs to capture cross-modal features and semantic relationships allows them to retain strong inferential capabilities even with limited high-quality data or missing modalities, making them highly promising for multimodal medical scenarios.
Emerging large models enable effective processing of multimodal medical data and can uncover deeper insights into disease mechanisms and clinical patterns. Their integration into medical imaging workflows offers the potential to simplify complex tasks for clinicians by providing more intuitive and interpretable outputs. These advancements are set to transform the field of medical imaging, driving the adoption of AI-driven solutions in precision diagnostics and treatment planning. In recent years, large models in medical imaging have attracted significant attention, with growing research highlighting their potential. The following sections outline their core concepts and methods, review key advances in precision medicine, and explore challenges and opportunities for their clinical application [Figure 1].
Figure 1: The overall structure of this review article. AI: Artificial intelligence; CoT: Chain-of-thought; GPT: Generative Pre-trained Transformer; LLM: Large language model; LVM: Large visual model; MLM: Multimodal large model; SAM: Segment Anything Model.
Concepts and Methods of Large Models in Medical Imaging
Supervised learning has long been the dominant paradigm in medical imaging AI; however, it heavily relies on the scale and quality of labeled data and lacks generalization capabilities for new tasks. Models like BERT,[16] built on the transformer architecture, introduced self-supervised pretraining and fine-tuning strategies, significantly reducing the need for training from scratch and achieving breakthroughs first in NLP. Notably, GPT-3[8] demonstrated exceptional emergent abilities, particularly in few-shot and zero-shot scenarios, greatly accelerating advances in prompt engineering. In computer vision, the vision transformer (ViT)[17] has surpassed traditional convolutional neural networks (CNNs) on various image classification benchmarks, efficiently capturing global context through self-attention mechanisms, which is especially advantageous for large-scale datasets and out-of-distribution scenarios. Building upon the ViT architecture, SAM further enhanced these capabilities, excelling in zero-shot and prompt-driven image segmentation tasks. In the medical imaging domain, a counterpart known as MedSAM[18] extends this approach by fine-tuning SAM for universal segmentation across various medical image modalities, reducing the need for extensive domain-specific labels. Meanwhile, MLMs such as LLaVA (Large Language and Vision Assistant)[19] have bridged the gap between different modalities by integrating visual encoders with LLMs, propelling advancements in cross-modal understanding. Similarly, LLaVA-Med[20] has been developed to bring these vision-language capabilities into the biomedical field, enabling sophisticated multimodal interactions (e.g., visual question answering [VQA]) on medical images. Currently, research on large models in medical imaging is rapidly evolving, predominantly leveraging general-purpose large models fine-tuned on medical datasets (Figure 2; Table 1 summarizes the key technologies discussed in this section). This approach capitalizes on the extensive knowledge and feature representations learned from large-scale natural image or multimodal pretraining, significantly reducing the dependence on domain-specific annotated data. Another notable example is M3D-LaMed,[21] a specialized MLM for 3D medical image analysis. It incorporates LLaMA-2-7B as its language backbone, leveraging the advanced capabilities of this pre-trained LLM for medical text understanding and generation. This allows M3D-LaMed to interpret complex clinical queries and generate coherent, contextually relevant radiology reports. By integrating a powerful language model with a dedicated 3D vision encoder, M3D-LaMed effectively tackles complex volumetric imaging tasks, such as text-guided 3D CT image segmentation and automated report generation, benefiting from broad pretraining while being optimized for medical applications.
Figure 2: Illustration of mainstream approaches for constructing large models in medical imaging. AI: Artificial intelligence; AUC: Area under the curve; BLEU: Bilingual evaluation understudy; DSC: Dice similarity coefficient; IoU: Intersection over union.
Table 1 - Essential AI technologies for large models in medical imaging.

| Technology | Explanation |
| --- | --- |
| Transformer architecture | A deep learning model that uses multi-head self-attention and feed-forward networks to process sequential data, excelling at capturing long-range dependencies, and is widely used in text and image analysis. |
| Multi-head self-attention | The core mechanism of the transformer, where multiple parallel attention heads capture complex interrelations from different perspectives. |
| Vision transformer | Divides an image into fixed-size patches, then embeds each patch into a vector that is fed into a transformer encoder for processing. Primarily used for image classification, it can also be extended to other visual tasks. |
| Segment anything model | Achieves universal high-precision image segmentation based on prompts (such as clicks or bounding boxes) to segment any target; in medical imaging, domain adaptation and fine-tuning are typically required. |
| Self-supervised learning | Pre-trains on unlabeled data by generating its own supervisory signal to learn features, thereby reducing the dependency on annotated data. |
| Masked autoencoders | Pre-trains by masking parts of the input data and reconstructing them, which helps the model learn both global and local features. |
| Supervised fine-tuning | Optimizes a pre-trained model using annotated data, enabling the model to better adapt to specific tasks such as classification or prediction. |
| Full parameter fine-tuning | Adjusts all parameters of the model to adapt to a specific task, typically achieving excellent performance, albeit with higher computational resource requirements. |
| Parameter-efficient fine-tuning | Adjusts only a small subset of parameters or specific modules to quickly adapt the model to new tasks, thus saving computational resources. |
| Continual pretraining | Continuously updates a pre-trained model using domain-specific data to enhance its adaptability within a particular field. |
| Prompt engineering | Optimizes input prompts to activate the model’s inherent knowledge and enhance task performance without additional training, commonly used for guiding large models. |
| Zero-shot/one-shot/few-shot prompting | Guides the model to perform tasks with no, a single, or few examples, suitable for data-scarce scenarios and mainly used in large models. |
| Chain-of-thought prompting | Guides the model to generate a step-by-step reasoning process to enhance the logical flow and interpretability of its output, suitable for complex decision-making tasks. |

AI: Artificial intelligence.
Database construction
Databases play a critical role in the development of large models, as the quality and scale of data directly impact model performance and generalization capabilities. Currently, most training data for large models are sourced from the internet, covering natural images and text. However, internet data often include harmful content and biases, which can degrade model performance in specific tasks. Data cleaning and weighting are essential preprocessing steps. For example, selecting low-compression, information-rich data can improve learning efficiency and model performance.[22] In addition, fine-tuning with high-quality datasets is often used to further enhance model capabilities.
With advancements in data-sharing platforms, several public medical resources now provide valuable datasets [Table 2]. For example, TCIA (The Cancer Imaging Archive)[23] hosts CT, MRI, and other medical images for various cancers, accompanied by clinical information. MIMIC (Medical Information Mart for Intensive Care)[24] offers clinical data from intensive care unit patients, while SEER (Surveillance, Epidemiology, and End Results Program)[25] compiles cancer patient data from across the United States to support epidemiological research. In addition, AbdomenAtlas[26] includes 20,460 3D CT scans and 673,000 high-quality abdominal anatomical masks from 112 hospitals. Despite these resources, large-scale medical datasets remain scarce. Their construction is a complex and time-intensive process, further complicated by challenges related to data privacy and ethical considerations.
Table 2 - Overview of representative medical datasets.

| Database | Modality | Source | Scale | Description | URL |
| --- | --- | --- | --- | --- | --- |
| TCIA | Medical images (CT, MRI, etc.) | Cancer Imaging Archive | Multiple TB of data across various cancer types | Repository of medical images of cancer with corresponding clinical data | https://www.cancerimagingarchive.net/ |
| MIMIC | Clinical data | Beth Israel Deaconess Medical Center | Over 40,000 ICU patients | De-identified health data from intensive care unit patients | https://mimic.mit.edu/ |
| SEER | Cancer statistics | National Cancer Institute | Data from ~30% of US population | Epidemiological information on cancer incidence and survival | https://seer.cancer.gov/ |
| AbdomenAtlas | CT scans | 112 hospitals | 20,460 3D CT scans, 673,000 abdominal anatomical masks | Large-scale dataset of abdominal CT scans with high-quality masks | https://www.zongweiz.com/dataset |
| PMC-OA | Image-text pairs | PubMed Central | 1.65 million image-text pairs | Open-access biomedical literature with figures and captions | https://www.ncbi.nlm.nih.gov/pmc/ |
| CMeKG | Knowledge base | Medical literature and databases | 1.56M relationships linking thousands of diseases, drugs, symptoms, and diagnostic and therapeutic techniques | Structured medical knowledge linking diseases, symptoms, drugs, and treatments | http://cmekg.pcl.ac.cn/ |
| GenMedGPT-5k | Text (dialogues) | AI-generated | 5000 doctor–patient dialogues | Synthetic medical dialogues generated by ChatGPT | https://github.com/Kent0n-Li/ChatDoctor |

AI: Artificial intelligence; CMeKG: Chinese Medical Knowledge Graph; ICU: Intensive care units; MIMIC: Medical Information Mart for Intensive Care; PMC: PubMed Central; SEER: Surveillance, Epidemiology, and End Results Program; TB: Terabytes; TCIA: The Cancer Imaging Archive.
Published medical literature also serves as a valuable resource for training large models. For example, the PMC-OA dataset contains 1.65 million image-text pairs from PubMed Central (PMC), which have been used to train models such as PMC-CLIP[27] and Med-Flamingo.[28] Furthermore, medical knowledge bases like CMeKG (Chinese Medical Knowledge Graph)[29] provide comprehensive medical terms and concepts, enriching the knowledge foundation required for these models to better understand medical contexts.
Generative models offer a promising solution to alleviate data limitations by producing large volumes of high-quality synthetic data. GenMedGPT-5k[30] contains 5000 doctor–patient dialogues generated by ChatGPT. LLaVA[19] enhances multimodal capabilities through fine-tuning on instruction-following data generated by GPT-4. Endora[31] combines spatial-temporal transformers and diffusion models to generate high-quality endoscopy videos, demonstrating potential as a data augmentation tool for downstream video analysis tasks.
Frameworks of general-purpose large models
The transformer architecture underpins most general-purpose large models, serving as the core technology for processing text, image, and multimodal tasks owing to its strong representation learning capabilities. Transformers are built from stacks of encoder and decoder layers, tailored to accommodate diverse objectives. Encoders extract global feature representations, whereas decoders generate target sequences for generative tasks. Through multi-head self-attention (MSA) and feed-forward neural networks (FNNs), transformers effectively integrate information and capture long-range dependencies.
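To make this building block concrete, the following PyTorch sketch combines multi-head self-attention with a feed-forward network, residual connections, and layer normalization. It is a minimal, hypothetical example; the dimensions (768-dimensional embeddings, 12 heads) are illustrative assumptions rather than the configuration of any particular model.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: multi-head self-attention + feed-forward network."""

    def __init__(self, dim=768, num_heads=12, ffn_dim=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, attn_mask=None):
        # Self-attention sub-layer with residual connection (pre-norm formulation).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out
        # Feed-forward sub-layer with residual connection.
        return x + self.ffn(self.norm2(x))

tokens = torch.randn(2, 196, 768)            # (batch, sequence length, embedding dim)
print(EncoderBlock()(tokens).shape)          # torch.Size([2, 196, 768])
```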
In language tasks, the GPT series, as a representative of mainstream LLMs, adopts a decoder-only architecture with multiple transformer decoder layers. It leverages extensive unsupervised pretraining on large-scale text corpora to learn contextual representations and language patterns. Using causal masking, GPT ensures that each token attends only to preceding tokens, enabling it to excel in tasks such as text completion, summarization, translation, and dialogue generation. In contrast, encoder–decoder models, while combining text understanding and generation, generally underperform decoder-only LLMs in zero-shot tasks without fine-tuning.[32]
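The causal masking described above can be expressed as a simple upper-triangular boolean mask; the snippet below is an illustrative sketch (compatible with PyTorch attention modules, where True marks disallowed positions), not the internal implementation of any GPT variant.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks positions a token must NOT attend to.

    Each token may attend only to itself and to earlier positions, which is what
    allows decoder-only models to be trained with next-token prediction.
    """
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```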
In vision tasks, ViT successfully applied transformers to image analysis by dividing images into fixed-size patches and treating them as one-dimensional sequences. This achievement demonstrated that transformer architecture is not limited to text and can effectively process image data. Subsequently, SAM[11] further expanded the application of transformers in vision tasks. SAM uses an image encoder (e.g., ViT) to extract global features, a prompt encoder to process user-provided input (e.g., points or bounding boxes), and a mask decoder to generate precise segmentation outputs. Leveraging prompt engineering, SAM can adapt to diverse image segmentation tasks without additional training. In addition, stable diffusion (SD),[33] a generative model based on transformer architecture, uses a latent diffusion model (LDM) to generate high-quality images. SD can generate images from textual descriptions, showcasing powerful creative generation capabilities.
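As a minimal sketch of how ViT turns an image into a token sequence, the hypothetical module below embeds fixed-size patches with a strided convolution, prepends a class token, and adds learned positional embeddings; the single-channel 224 × 224 input is an assumption chosen to mimic a grayscale slice.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and embed each patch as a token."""

    def __init__(self, img_size=224, patch_size=16, in_channels=1, dim=768):
        super().__init__()
        # A convolution with kernel size = stride = patch size acts as a per-patch linear layer.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                    # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

slices = torch.randn(2, 1, 224, 224)                         # e.g., grayscale image slices
print(PatchEmbedding()(slices).shape)                        # torch.Size([2, 197, 768])
```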
Transformers treat data as tokens, enabling unified token-based representation across different modalities through embedding layers, which makes them highly suitable for multimodal data processing. In multimodal tasks, many MLMs use LLMs as cognitive engines, integrating them with vision encoders (e.g., ViT, CLIP-ViT, or ResNet) and using modality alignment modules to facilitate interactions between visual and textual information. This setup allows LLMs to interpret visual inputs and produce corresponding textual outputs. In contrast, models like Text2Seg[34] and SAMText[35] rely on LVMs such as SAM, combining visual inputs with text prompts through prompt-enhancement methods to improve adaptability for image segmentation and understanding. Furthermore, systems that integrate multiple models, such as Inpaint Anything[36] and SAM-Track,[37] advance the multifunctionality and adaptability required for complex scene tasks.
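A common form of the modality-alignment module mentioned above is a learned projection that maps vision-encoder features into the LLM's token-embedding space, after which visual and text tokens are processed as one sequence. The sketch below is schematic, with made-up feature dimensions, and does not reproduce the projector of any specific MLM.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Project vision-encoder features into an LLM's token-embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (B, num_patches, vision_dim) from a (typically frozen) image encoder
        # text_embeds:  (B, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        # The LLM then treats visual and text tokens as a single input sequence.
        return torch.cat([visual_tokens, text_embeds], dim=1)

vision_feats = torch.randn(1, 256, 1024)
text_embeds = torch.randn(1, 32, 4096)
print(VisionToLLMProjector()(vision_feats, text_embeds).shape)   # torch.Size([1, 288, 4096])
```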
Training strategies for large models
Large models primarily rely on self-supervised learning, supplemented by supervised learning, to acquire their representation capabilities. LLMs adopting the pretraining–fine-tuning paradigm are first pre-trained on large-scale unlabeled data using pretext tasks such as masked language modeling or next-token prediction. They are then fine-tuned on labeled data to enhance task-specific performance. For example, after the pretraining phase, InstructGPT[38] undergoes supervised fine-tuning (SFT) using task-specific data, such as question answering, summarization, or dialogue. This process tailors the model for conversational tasks, enabling it to produce text that aligns more naturally with human conversational norms. Reinforcement learning from human feedback (RLHF) further refines the model by aligning its outputs with human preferences, enabling it to perform more coherently and effectively in multi-turn conversations.
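The next-token-prediction pretext task referred to above reduces to a cross-entropy loss over shifted token positions. The toy snippet below, using random logits and an assumed vocabulary size, only illustrates the loss computation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Cross-entropy loss for next-token prediction.

    logits:    (B, T, vocab_size) model outputs at positions 0..T-1
    token_ids: (B, T) observed tokens; the token at position t+1 is the target
               for the prediction made at position t.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)

logits = torch.randn(2, 16, 32000)                # assumed vocabulary of 32,000 tokens
tokens = torch.randint(0, 32000, (2, 16))
print(next_token_loss(logits, tokens))
```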
Masked autoencoders (MAEs)[39] facilitate the learning of both global and local features in visual data by masking and reconstructing portions of image patches. This strategy is widely applied in LVMs with ViT as the backbone, providing robust initialization for downstream vision tasks and improving training efficiency. For example, SAM[11] uses MAE as the pretraining strategy for its image encoder, enabling it to capture powerful multi-scale and multi-semantic representations. SAM is then further trained on a large-scale supervised dataset (SA-1B), optimizing its response to various prompts such as points, boxes, and masks. This significantly enhances its generalization ability and real-time interactivity, achieving zero-shot transfer across diverse downstream segmentation tasks. In contrast, DINOv2[40] leverages self-supervised contrastive learning and knowledge distillation, focusing more on global semantic feature learning rather than feature reconstruction. By aligning the features of teacher and student models, DINOv2 efficiently trains on unlabeled data, generating globally consistent and highly generalizable representations. This approach is particularly suited for vision tasks that require capturing global context and semantic structures, such as image classification, retrieval, and clustering.
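The masking step at the heart of MAE-style pretraining can be sketched as a random selection of visible patch tokens, with the remaining indices kept so that a decoder can later be asked to reconstruct them; the masking ratio and tensor shapes below are illustrative assumptions.

```python
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens, as in MAE-style pretraining.

    patch_tokens: (B, N, D). Returns the visible tokens and the indices of the
    masked patches, which the decoder would later be asked to reconstruct.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # one random score per patch
    ids_shuffle = noise.argsort(dim=1)           # patches with the lowest scores are kept
    ids_keep, ids_masked = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_masked

tokens = torch.randn(2, 196, 768)
visible, masked_ids = random_mask_patches(tokens)
print(visible.shape, masked_ids.shape)           # torch.Size([2, 49, 768]) torch.Size([2, 147])
```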
In addition, training strategies for multimodal tasks have also gained increasing attention, emphasizing the alignment and integration of features across different modalities, such as vision and language, to drive advancements in cross-modal intelligence. CLIP,[13] for example, uses contrastive learning on large-scale image-text pairs, maximizing the similarity of correct pairs while minimizing that of incorrect ones, effectively aligning multimodal features through its image and text encoders. LLaVA[19] integrates visual features extracted by the CLIP vision encoder into the input space of the LLM LLaMA[9] via a linear mapping. It is then fine-tuned with multimodal instruction data generated by GPT-4, demonstrating excellent performance in VQA and image understanding. BLIP-2 (Bootstrapped Language-Image Pretraining)[41] freezes the pre-trained image encoder and uses a lightweight querying transformer (Q-Former) to extract key visual features, mapping them into the input space of LLMs. It leverages learnable queries to optimize vision-language alignment, enabling efficient multimodal feature fusion and generation through staged training.
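The contrastive objective that CLIP uses can be written as a symmetric cross-entropy over a batch of matched image-text pairs; the sketch below assumes the image and text features have already been produced by their respective encoders and is illustrative rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Matched pairs (the diagonal of the similarity matrix) are pulled together,
    while all other pairings in the batch act as negatives.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)             # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> matching image
    return (loss_i2t + loss_t2i) / 2

image_feats = torch.randn(8, 512)
text_feats = torch.randn(8, 512)
print(clip_contrastive_loss(image_feats, text_feats))
```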
Construction of large models in medical imaging
Although general-purpose large models exhibit strong representational capabilities, their application in medical imaging often faces numerous limitations due to significant differences between medical and natural images, as well as the lack of domain-specific medical knowledge. To handle fine-grained medical tasks effectively, models require targeted adaptation, with parameters optimized to align with the characteristics of medical data and clinical needs.
Full parameter fine-tuning is a commonly used strategy that adjusts all parameters of a large model based on medical data.[20] This approach leverages the model’s existing representational power while adapting it to specific medical applications. In contrast, parameter-efficient fine-tuning (PEFT) achieves task adaptation by modifying only a subset of the model’s parameters or incorporating lightweight modules, significantly reducing computational and data requirements. This makes PEFT a viable alternative or complementary strategy under resource-constrained conditions. For example, low-rank adaptation (LoRA)[42] fine-tunes a model by introducing low-rank matrices into the weight updates, significantly reducing the number of trainable parameters during fine-tuning. This approach modifies only the weight updates while preserving the original weights and structure of the pre-trained model. Prompt tuning modifies task-specific prompt vectors instead of directly adjusting the model’s weights.[32] This allows for flexible task adaptation while preserving the pre-trained model’s internal parameters, thereby minimizing interference with its existing knowledge. Adapter tuning[43] introduces lightweight modules into existing network layers and trains only these modules, effectively reducing computational overhead. In addition, quantization methods like quantized LoRA (QLoRA)[44] reduce the precision of the model’s floating-point weights by converting them into lower-bit integer representations. This transformation significantly lowers both storage requirements and computational complexity, while improving efficiency on hardware accelerators.
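As an illustration of the LoRA idea, the minimal sketch below freezes a pretrained linear layer and learns only a low-rank update; the rank, scaling factor, and layer size are arbitrary assumptions for demonstration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs. ~16.8 million in the frozen base layer
```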
Depending on available resources, a combination of fine-tuning techniques can be flexibly chosen to leverage their advantages. SFT trains the model on high-quality annotated medical datasets to significantly improve performance for specific tasks while aligning the generated content with human preferences and ethical standards. Instruction tuning adapts the model using data consisting of instruction-output pairs, improving its ability to interpret and generate appropriate responses to task-specific instructions. These pairs are typically derived from manually curated seed data or instruction templates. Reinforcement learning strategies are also widely used to optimize model performance, such as RLHF, which integrates expert feedback to improve model interpretability and user satisfaction. Reinforcement learning from AI feedback (RLAIF) substitutes or supplements human feedback with signals generated by auxiliary AI systems, reducing annotation costs and enhancing optimization efficiency. In contrast, direct preference optimization (DPO)[45] optimizes preference pairs directly, bypassing the complex reward modeling process and providing a more straightforward approach to fine-tuning for granular tasks. Simple preference optimization (SimPO)[46] further reduces complexity and computational cost by using implicit reward functions.
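For intuition, the DPO objective mentioned above can be computed directly from the summed log-probabilities of preferred and rejected responses under the trained policy and a frozen reference model; the snippet below is a schematic sketch of that loss with invented values, not a full training loop.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct preference optimization over a batch of preference pairs.

    Each argument holds the summed log-probability of the chosen/rejected response
    under the policy being trained or under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```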
In addition to fine-tuning techniques, continual pretraining (CPT)[47] adapts models by using unannotated medical data, addressing significant differences between medical texts/images and general pretraining data. This enhances the model’s domain-specific representation, providing a foundation for efficient adaptation to medical tasks.
It is important to note that beyond fine-tuning general-purpose large models to acquire medical domain knowledge, researchers have increasingly focused on training large models from scratch specifically for the medical domain. These models are pre-trained on large-scale medical datasets, enabling them to develop a deep understanding of specialized medical knowledge. BioMedLM[48] was trained using PubMed abstracts and full-text articles, achieving outstanding performance in medical NLP tasks, particularly in medical question answering and information extraction. BioMedLM uses a specialized medical vocabulary and tokenizer, which allows it to process medical terminology with greater precision while avoiding word fragmentation issues. For instance, the standard GPT-2 tokenizer would split “thrombin” into “th”, “rom”, and “bin”, disrupting the integrity of medical concepts. In contrast, BioMedLM’s customized tokenization strategy preserves term integrity, thereby enhancing the model’s comprehension of medical texts. Similarly, GatorTron[49] was pre-trained on over 90 billion words of medical text, including more than 82 billion words from de-identified electronic health records (EHRs) and other biomedical literature, making it highly adaptable to EHR analysis tasks. The model was specifically optimized for tasks such as medical named entity recognition, relation extraction, and clinical text summarization. Compared to fine-tuned general-purpose models, GatorTron demonstrates superior performance in medical question answering. In addition, because GatorTron is trained primarily on de-identified EHR data, it can be deployed on local servers, reducing reliance on external application programming interfaces (APIs) and minimizing the need to transmit sensitive patient data to cloud services. This enhances data security and privacy protection. When it comes to medical imaging, research on training models from scratch rather than fine-tuning general-purpose models remains limited. A relatively small yet noteworthy example is PMC-CLIP,[27] a multimodal model designed specifically for medical image-text matching. PMC-CLIP was trained from scratch on 1.65 million medical image-text pairs. It uses a ResNet-50 + PubMedBERT + Transformer architecture, integrating image-text contrastive learning and masked language modeling to enhance the alignment between medical images and corresponding textual descriptions. This domain-specific design significantly improves the model’s ability to understand and associate medical images with relevant textual information, demonstrating the potential of fully domain-trained large models in medical imaging.
Prompt engineering
A prompt is user-provided input that guides a large model to accurately understand intent and generate specific outputs. Prompt engineering optimizes the structure and content of the input to activate the implicit knowledge of the pre-trained model, enhancing its understanding and execution of tasks without modifying or with only minimal fine-tuning of model parameters.
Providing examples in prompts helps models better understand user expectations. Based on the number of examples, prompts can be categorized into zero-shot, one-shot, and few-shot prompting. Zero-shot prompting triggers the model’s ability to perform a task using a single instruction, leveraging the knowledge it has accumulated during pretraining. One-shot and few-shot prompting, however, help the model understand the task by including contextual examples in the instruction. This reduces bias in task understanding and ensures that the generated content aligns more closely with the intended style.
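As a schematic example of few-shot prompting, the snippet below prepends two worked findings-to-impression examples before the new case so the model can infer the task format from context; the example reports and wording are invented purely for illustration.

```python
EXAMPLES = [
    ("Findings: No focal consolidation, effusion, or pneumothorax.",
     "Impression: No acute cardiopulmonary abnormality."),
    ("Findings: Patchy opacity in the right lower lobe.",
     "Impression: Findings suggestive of right lower lobe pneumonia."),
]

def build_few_shot_prompt(findings: str) -> str:
    """Prepend worked examples so the model can infer the task format from context."""
    shots = "\n\n".join(f"{f}\n{i}" for f, i in EXAMPLES)
    return f"{shots}\n\nFindings: {findings}\nImpression:"

print(build_few_shot_prompt("Mild cardiomegaly without pulmonary edema."))
```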
To further enhance the performance and interpretability of large models in complex tasks, various strategies have been proposed to fully harness their potential. CoT prompting guides the model to generate intermediate reasoning steps, boosting its logical reasoning ability and enabling it to produce more accurate and interpretable responses. Tree of thought (ToT)[50] extends this by expanding linear reasoning paths into multi-branch tree structures, allowing the model to perform forward or backward reasoning in complex tasks while selecting the optimal execution path. Self-consistency[51] prompting improves the reliability and robustness of the model’s reasoning by generating multiple responses to the same task and selecting the most consistent one. This approach helps ensure accurate results, even when some individual responses may be incorrect. In addition, self-refine prompting[52] guides the model to iteratively improve its generated responses by self-evaluating and refining them based on internal feedback, leading to continuous performance enhancement after multiple cycles.
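Self-consistency prompting ultimately reduces to sampling several reasoning chains and taking a majority vote over their final answers, as in the minimal sketch below (the sampled answers are invented for illustration).

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer among independently sampled reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

# e.g., final answers from five chain-of-thought samples for the same question
samples = ["pneumonia", "pneumonia", "pulmonary edema", "pneumonia", "atelectasis"]
print(self_consistency(samples))   # "pneumonia"
```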
Various prompt optimization strategies have been developed for visual or multimodal tasks. Visual prompt tuning (VPT)[53] introduces trainable prompts into the input space, optimizing only the prompt parameters while keeping the pre-trained backbone model frozen, thereby enhancing performance in specific visual tasks. DAM-VP (Diversity-Aware Meta Visual Prompting)[54] uses a diversity-adaptive mechanism, dividing the downstream dataset into smaller homogeneous subsets and optimizing prompts to improve the model’s knowledge transfer and generalization across complex data distributions. For text prompt optimization, CoOp (Context Optimization)[55] replaces manual prompts with learnable text inputs to optimize CLIP’s performance in downstream tasks. DenseCLIP[56] transforms image-text matching into pixel-text matching and introduces visual-to-language prompt methods, enabling efficient application of pre-trained knowledge in dense tasks such as semantic segmentation, object detection, and instance segmentation. In addition, MaPLe[57] dynamically couples visual and text prompts, achieving collaborative optimization of the two modality representation spaces. It demonstrates superior performance in generalizing to novel categories, transferring across datasets, and handling domain shifts.
Evaluation metrics
Due to the diversity of tasks, evaluation methods for large models in medical imaging are multifaceted. Comprehensive evaluation from multiple dimensions and perspectives is necessary to understand the potential and limitations of these models in clinical applications, especially in complex clinical scenarios.
To quantitatively assess the accuracy and reliability of a model, standardized evaluation metrics are required. For classification tasks, metrics such as accuracy, recall, F1 score, and area under the curve (AUC) are used to evaluate performance. For segmentation tasks, intersection over union (IoU) and Dice coefficient are used to measure segmentation precision. For dialogue and text generation tasks, metrics like BLEU and ROUGE are used to assess response accuracy, fluency, and diversity. In the evaluation of multimodal tasks, assessing the contribution of each modality helps prevent over-reliance on any single modality. In addition, evaluation methods can incorporate contrastive learning and multimodal alignment to assess the model’s effectiveness in integrating information from different modalities.
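For reference, the segmentation metrics mentioned above can be computed directly from binary masks; the snippet below is a simple NumPy sketch of the Dice coefficient and IoU on synthetic masks.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8):
    """Dice similarity coefficient and intersection over union for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2 * intersection / (pred.sum() + target.sum() + eps)
    iou = intersection / (union + eps)
    return dice, iou

pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt = np.zeros((64, 64), dtype=np.uint8); gt[15:45, 15:45] = 1
print(dice_and_iou(pred, gt))   # approximately (0.69, 0.53)
```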
In the evaluation of vision-language models, tasks such as image-text matching, VQA, and image captioning are used to examine the integration of visual and linguistic information. Luo et al[58] proposed the CODIS benchmark, designed to assess the ability of MLMs to understand context-dependent visual information. This benchmark includes tasks involving ambiguous questions and image pairs, integrating contextual information such as location, direction, time, cultural background, object attributes, and interpersonal relationships, to systematically assess how models use contextual data to resolve image ambiguities and generate accurate answers. Jiang et al[59] evaluated the performance of vision-language large models such as MiniGPT-v2 and LLaVA-1.5 on medical and general tasks including polyp detection, skin lesion detection, absurd question answering, and spatial relation reasoning. They used accuracy, precision, and other metrics to measure recognition and localization abilities, combined with SAM to assess segmentation results, and conducted a comprehensive discussion of model performance by manually analyzing failure cases and evaluating task completion quality.
Moreover, models need to be assessed for robustness across different lesion subgroups, multi-center datasets, and scenarios involving low-quality or missing data. The applicability of the model within actual clinical workflows requires validation, encompassing considerations such as processing speed, real-time performance, and user-friendliness. A thorough evaluation can help ensure smoother integration into clinical decision support systems, potentially improving practical outcomes in real-world applications.
In addition to objective metrics, researchers are increasingly incorporating human evaluations and AI-based assessments. For example, medical experts may be asked to evaluate model responses based on dimensions such as naturalness, empathy, effectiveness, and safety. To address the limitations of human evaluation, such as subjectivity and high cost, AI-driven assessments aligned with human values are being explored as a complementary approach. For instance, LLaVA-Med[20] uses GPT-4 to assess the accuracy, relevance, and detail richness of medical dialogues.
Exploring Tasks for Large Models in Medical Imaging
The remarkable performance of large models across various medical imaging tasks demonstrates their significant potential to enhance clinical workflows [Table 3]. These models have achieved results that were difficult to attain with traditional technologies. In the following, we present representative results from large models on several specific tasks. It is important to note that this section primarily focuses on the marginal benefits AI can achieve in clinical applications in the context of advances in large models, and thus does not completely exclude models that are relatively smaller in scale than general-purpose LLMs. We believe that, building on these pioneering paradigms, large models will increasingly be adapted to address related tasks.
Table 3 - Representative studies of large models in medical imaging.

| Model name | Modalities | Architecture | Training method | Downstream tasks |
| --- | --- | --- | --- | --- |
| MedSAM[18] | CT, MRI, ultrasound, X-ray, endoscopy, fundus, dermoscopy, pathology, OCT, mammography | SAM | Fine-tuned from SAM, updating image encoder and mask decoder for segmentation tasks | Medical image segmentation |
| Med-SA[60] | CT, MRI, ultrasound, fundus images, dermoscopic images | SAM with SD-Trans and HyP-Adpt modules | Parameter-efficient fine-tuning, updating only adapter modules (~2% of parameters) | Medical image segmentation |
| Med-MLLM[61] | CT, X-ray, chest radiograph, medical reports, clinical notes | ResNet-50 (image encoder), transformer (text encoder/decoder) with multimodal pretraining modules | Large-scale pretraining on unlabeled data with multimodal contrastive learning; fine-tuned with limited labeled data | Medical report generation, disease diagnosis, survival prognosis |
| R2GenGPT[62] | X-ray | Swin transformer (visual encoder), linear mapper (visual mapper), LLaMA-2-7B (frozen LLM) | Parameter-efficient fine-tuning with delta alignment | Radiology report generation |
| Med-Flamingo[28] | X-ray, ultrasound, pathology, radiology images, multimodal medical text | OpenFlamingo-9B | Continued pretraining on paired and interleaved medical image-text datasets; supports multimodal few-shot learning | VQA, rationale generation, multimodal medical reasoning |
| LLaVA-Med[20] | CT, X-ray, MRI, histopathology, gross pathology, biomedical text (e.g., captions, PubMed context) | LLaVA | Fine-tuned from LLaVA with domain-specific biomedical image-text pairs, using a two-stage curriculum learning method: Stage 1 for concept alignment, Stage 2 for open-ended instruction-following | VQA, biomedical visual chatbot |
| ChatCAD+[63] | CT, X-ray, endoscopy, fundus, dermoscopy, mammography, blood cell images, text | Domain-specific CAD models (e.g., chest X-ray, knee MRI) integrated with CLIP and LLMs | Integration of pre-trained CAD models and LLMs (e.g., ChatGPT), using TF-IDF-based retrieval and LLM-guided hierarchical knowledge navigation | Multimodal report generation, reliable interactive medical consultation |
| LLM-CXR[64] | Chest X-ray, radiology reports | VQ-GAN (image tokenizer), Dolly-v2-3B (pre-trained LLM with extended token embedding table) | Two-stage instruction fine-tuning with image-text bidirectional tasks; clinical information-preserving loss to enhance image tokenization | Radiology report generation, image generation from text reports, VQA |
| M3D-LaMed[21] | CT, MRI (3D medical images), diagnostic reports | 3D ViT (image encoder), 3D spatial pooling perceiver, LLaMA-2-7B (LLM), SegVol (segmentation module) | Pre-trained 3D image encoder using a CLIP-like strategy with cross-modal contrastive learning; fine-tuned with instruction-response pairs and LoRA-based parameter-efficient fine-tuning for multimodal tasks | Image-text retrieval, report generation, VQA, positioning, segmentation |
| SurgicalGPT[65] | Endoscopy (surgical scenes), textual questions | GPT2 with integrated vision tokenizer (ResNet18, Swin transformer, ViT) | Fine-tuned GPT2 with trainable vision tokenizer and vision token embedding (type and pose); text tokens are sequenced before vision tokens during training | VQA in surgical scenarios |
| SkinGPT-4[66] | Skin images, clinical notes | ViT with integrated Q-Former and LLaMA-2-13b-chat | Sequential training, first for clinical concept alignment and then for fine-tuning on specific diagnosis tasks | Dermatology diagnosis, interactive treatment recommendations |

CAD: Computer-aided diagnosis; CLIP: Contrastive Language-Image Pretraining; CT: Computed tomography; GPT: Generative Pre-trained Transformer; LLM: Large language model; LoRA: Low-rank adaptation; MRI: Magnetic resonance imaging; SAM: Segment Anything Model; OCT: Optical coherence tomography; ViT: Vision transformer; VQA: Visual question answering.
Segmentation
Automated segmentation is fundamental to diagnosis and treatment planning. Traditional segmentation models rely on task-specific datasets, requiring large amounts of labeled data and repeated training to achieve high accuracy for specific anatomical structures or lesion types, and they often lack broad applicability. When new lesion types or anatomical structures arise, manual labeling and retraining are necessary. In addition, such models face challenges in handling complex structures and diverse imaging data.
With the rise of models like SAM, general-purpose segmentation models have become a research focus, driving a paradigm shift in image segmentation. SAM, pre-trained on a massive volume of natural images, uses prompts as input to quickly segment target regions. These models exhibit great potential in the medical imaging field. On one hand, they demonstrate adaptability across modalities (CT, MRI, ultrasound, etc.) and anatomical structures, reducing the cost of training separate models for each specific task. On the other hand, they can serve as annotation aids, rapidly generating precise segmentation results with expert interaction, significantly reducing annotation costs and improving quality. Furthermore, researchers are exploring various strategies to integrate SAM into medical image segmentation tasks to fully unlock its potential. For example, Zhang et al[67] proposed the SAMAug method, which uses segmentation masks generated by SAM as additional input channels to enhance original medical images, thereby providing rich high-level semantic information for downstream segmentation models. This input augmentation strategy leverages SAM’s general segmentation capabilities to improve segmentation performance in multi-task scenarios while avoiding additional computational overhead. Similarly, MedCLIP-SAM[68] combines a biomedical vision-language model (BioMedCLIP) with SAM to enable medical image segmentation through text prompts. By using weak supervision based on global similarity scores, MedCLIP-SAM reduces the need for pixel-level annotations and demonstrates superior performance across multiple medical imaging datasets, highlighting the potential of cross-modal learning. In the context of specific medical tasks, TP-DRSeg[69] uses explicit text prompts to assist SAM in segmenting diabetic retinopathy lesions. This approach optimizes prompt strategies to improve segmentation accuracy for low-contrast and complex boundary targets, further showcasing the versatility of prompt-driven models in specialized applications.
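To illustrate the prompt-driven workflow described above, the sketch below shows how a bounding-box prompt can be passed to the publicly released segment-anything package; it assumes a downloaded ViT-B checkpoint and stands in for a CT slice with a random array, so it is a usage sketch rather than a validated medical pipeline.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM (the checkpoint file must be downloaded separately).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SAM expects a 3-channel, 8-bit HWC image; a random array stands in for a CT slice here.
image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
predictor.set_image(image)

# A rough bounding box around the suspected lesion serves as the prompt (XYXY pixel coordinates).
box = np.array([120, 140, 260, 300])
masks, scores, _ = predictor.predict(box=box, multimask_output=True)
print(masks.shape, scores)       # (3, 512, 512) candidate masks with confidence scores
```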
To systematically evaluate SAM’s performance in medical image segmentation, Huang et al[70] constructed the COSMOS 1050K dataset, which covers 18 modalities and 84 targets. Their experimental results demonstrated strong performance, with the Dice coefficient surpassing 90% for segmenting anatomical structures such as the brain, kidneys, and lungs. In addition, manual prompts, such as box prompts, significantly improved segmentation accuracy, enabling SAM to quickly generate high-quality labeled results. However, SAM struggled with targets featuring complex boundaries or low contrast, such as brain tumors and the pancreas. This highlights the need for structural modifications and strategic optimizations to enhance SAM’s adaptability to complex medical segmentation tasks. Ma et al[18] proposed MedSAM based on SAM and fine-tuned it on a large-scale dataset consisting of 1.57 million pairs of medical images and their corresponding masks. This fine-tuning enabled MedSAM to adapt to multiple medical imaging modalities (such as CT, MRI, and ultrasound) and different pathological types (such as brain tumors and colon polyps). MedSAM improved the interaction mechanism between the prompt encoder and decoder, enhancing prompt responsiveness. It also refined the embedding of positional information and the efficiency of feature fusion, leading to more accurate associations between prompts and target regions. In addition, the optimized prompt generation strategy further improved the model’s performance in segmenting weak boundaries and low-contrast targets, significantly enhancing segmentation accuracy and robustness. For 3D medical imaging, SAM-Med3D[71] was designed to handle volumetric data efficiently. Trained on 143,000 3D masks across 245 categories and evaluated on 16 common datasets, SAM-Med3D demonstrates the capability of prompt-driven segmentation in 3D tasks, addressing the challenges of 3D segmentation discussed earlier. Sheng et al[72] decoupled SAM’s components and integrated its mask decoder and prompt encoder with the Swin Transformer and DETR[73] architectures. This enhanced the practical applicability of segmentation models in real-time scenarios and eliminated bottlenecks caused by manual interaction. By replacing SAM’s original image encoder with DETR and using automatically generated prompt boxes instead of manual prompts, the approach enabled real-time segmentation of surgical instruments.
Auxiliary diagnosis
Disease diagnosis is a core task in medical image analysis, aimed at identifying and classifying lesions from imaging data to support clinical decision-making. Large models, by integrating multimodal information and large-scale pretraining strategies, demonstrate higher diagnostic accuracy and stronger cross-task generalization compared to traditional models. Traditional disease diagnosis models are typically trained for specific diseases or imaging modalities (such as CT-based lung nodule detection or MRI-based breast tumor diagnosis) and rely on large amounts of annotated data for supervised learning. These models often require data collection, annotation, and retraining when faced with new disease subtypes, patient populations, or imaging parameters, resulting in high costs and limited generalization.
Through pretraining on vast amounts of heterogeneous data, models can learn highly generalizable feature representations with deep semantic understanding, encompassing multi-level feature patterns related to anatomical structures, tissue details, and imaging modalities. This supports higher diagnostic accuracy. For instance, RetiZero,[74] a vision-language model for fundus disease detection, was trained on 341,896 fundus images paired with textual descriptions, covering over 400 retinal and optic nerve diseases across diverse populations. It demonstrated superior performance in zero-shot disease recognition and image retrieval, even surpassing ophthalmologists in certain diagnostic tasks. Similarly, RETFound[75] was pre-trained via self-supervised learning on over 1.6 million unlabeled retinal images, adapting to specific disease detection tasks with a small amount of labeled data. Experiments showed that RETFound significantly outperformed traditional transfer learning models in diagnosing eye diseases such as diabetic retinopathy (DR) and glaucoma, as well as in predicting systemic diseases like heart failure and myocardial infarction, with a significantly higher AUC. Endo-FM,[76] designed for endoscopic video and pre-trained on a dataset of over 5 million video frames, captures spatiotemporal information and improved F1 scores by 3.1% to 5.5% in classification, segmentation, and detection tasks, enhancing adaptability to dynamic scenes and spatial variations. EchoCLIP[77] uses self-supervised learning based on over 1 million pairs of cardiac ultrasound videos and clinical text data. The model achieved a mean absolute error of 7.1% in predicting heart function and performed well in several benchmark tasks, including left ventricular ejection fraction prediction and implant device detection.
Large-scale pretraining significantly strengthens a model’s ability to transfer knowledge across tasks and improve its generalization performance. This effectively reduces training costs and time for task-specific models, enabling rapid adaptation and flexible deployment across diverse clinical scenarios. For example, Prov-GigaPath,[78] pre-trained on over 1.3 billion pathological image slices, achieved state-of-the-art performance in tumor classification across nine major cancer types, including lung, breast, and colorectal cancer. Pai et al[79] proposed a framework for cancer imaging analysis, using self-supervised learning to effectively extract key features from images without requiring large-scale labeled data. Their model demonstrated high stability and robustness in tasks such as malignancy prediction in lung nodules and prognosis assessment in non-small cell lung cancer. In addition, Med-MLLM,[61] which integrates visual and text modality data, exhibited excellent performance in COVID-19 report generation, diagnosis, and prognosis prediction tasks, particularly in cross-task adaptation for Delta and Omicron variants. Even with only 1% of annotated data, the model was able to quickly and accurately handle emerging infectious diseases. Its framework has also been successfully extended to the analysis of 14 common chest diseases, including pneumonia and tuberculosis, highlighting its flexibility and efficiency in clinical applications for emerging infectious diseases.
Automated report generation
Radiologists are required to produce standardized textual reports based on accurate interpretation of medical images. Although structured reporting techniques have enhanced the clarity and standardization of reports to some extent, thereby reducing the workload of radiologists, the process remains susceptible to human-related factors. Variations in experience, personal preferences, and working conditions can lead to issues such as omitted information, imprecise wording, inconsistent report formats, and insufficiently detailed descriptions of lesions. Furthermore, reports generated using traditional template-based methods or shallow NLP techniques often lack clinical depth and tend to be overly mechanical, making them inadequate for addressing complex cases and meeting specific clinical needs.
In recent years, LLMs integrating visual inputs have shown promise in medical report generation. These technologies aim to translate imaging content into structured textual reports, potentially improving consistency and efficiency while reducing subjectivity in interpretation. However, recent research suggests that current general-purpose large models still face significant limitations in clinical applications. Zhou et al[80] evaluated GPT-4V’s ability to detect radiologic findings from chest radiographs and found that while it demonstrates basic multimodal reasoning capabilities, its performance remains insufficient for real-world diagnostic usage. Although incorporating a limited number of annotated examples (few-shot learning) marginally improves its accuracy, GPT-4V struggles with correctly identifying International Classification of Diseases (ICD-10) codes and corresponding anatomical locations. Jiang et al[81] further highlight that GPT-4V fails to generate clinically reliable radiology reports due to its lack of meaningful medical image interpretation and underperforms compared to fine-tuned models. These findings underscore the need for domain-specific model optimization and rigorous validation before such AI systems can be integrated into clinical workflows.
Recent advances in multimodal learning have significantly improved automated radiology report generation. Lu et al[82] proposed a two-stage fine-tuning strategy for MLMs, improving radiology report generation by better aligning visual features with language model embeddings. Their approach, based on OpenLLaMA-7B, achieves state-of-the-art performance on MIMIC-CXR without domain-specific pretraining. Interestingly, they found that larger language models tend to rely less on visual features, making reports less grounded in image content. To further strengthen the connection between reports and images, Bannur et al[83] introduced MAIRA-2, designed for spatially grounded radiology report generation. Unlike traditional models, MAIRA-2 explicitly links findings to specific image locations using bounding boxes. Trained on MIMIC-CXR, PadChest, and USMix, it incorporates lateral views and prior images to improve accuracy and reduce hallucinations. They also propose RadFact, an LLM-based framework to evaluate both textual correctness and spatial grounding, setting a new benchmark in the field. Wang et al[62] proposed R2GenGPT, which uses a Swin Transformer-based visual encoder to extract imaging information and then maps these low-dimensional visual features into a high-dimensional feature space of an LLM through a linear projector, enabling image-to-text conversion. By leveraging the strong semantic modeling capabilities of LLMs, R2GenGPT has demonstrated superior performance across multiple benchmark datasets (e.g., IU-Xray and MIMIC-CXR), achieving robust results on multiple metrics such as BLEU, ROUGE, and F1 score. Beyond radiology, multimodal learning has also been applied to ophthalmology and diabetes management. DeepDR-LLM[84] combines deep learning for DR detection with an LLM for diabetes management. Its DeepDR-Transformer module analyzes fundus images, assessing image quality, DR lesions, and severity, while the LLM module provides personalized care recommendations. It improved primary care physicians’ DR diagnosis accuracy from 81.0% to 92.3% and generated management plans comparable to endocrinology residents. A two-arm, real-world prospective study showed higher patient adherence to diabetes care and DR referrals, making it valuable for low-resource settings. We anticipate that future research will focus on enhancing models’ understanding and expressive capabilities for complex content of medical images, exploring more efficient multimodal fusion strategies, and achieving greater adaptability across diverse clinical scenarios. In addition, incorporating medical knowledge graphs and external semantic information into the report-generation process is also a potential approach to enhancing the overall quality of model-generated reports.
Although text-based unimodal LLMs are not the central focus of this review, it is important to acknowledge their indirect yet significant contributions to clinical workflows. Even without direct engagement in image analysis, LLMs enhance clinical documentation through automated report refinement and structured information extraction. By efficiently identifying critical data from unstructured free-text reports and reorganizing them into standardized formats, LLMs improve readability, semantic coherence, and terminological consistency[85]—key factors in reducing ambiguity and ensuring actionable clinical insights. Furthermore, LLMs are capable of translating complex medical terminology into more accessible language, which facilitates clearer communication between physicians and patients.[86] Overall, LLMs not only boost the efficiency of report processing and generation but also provide robust technical support for batch processing and information mining of large-scale report databases. They additionally exhibit promising potential in enhancing data privacy protection and adapting to evolving clinical requirements.
Interactive systems
The application of large models has made it possible to develop interactive systems that provide deep medical support. These systems possess advanced semantic understanding and contextual reasoning capabilities, allowing them to maintain coherence and accuracy in complex multi-turn dialogues, thereby offering an unprecedented experience for medical professionals, students, and patients (see Figure 3 for examples of a radiologist using large models in multi-turn clinical diagnosis). In this context, representative models such as Med-Flamingo,[28] LLaVA-Med,[20] and Med-PaLM M[87] have emerged. Med-Flamingo integrates large-scale medical image and text data and leverages few-shot learning techniques to perform well in complex reasoning tasks without requiring extensive annotated data. Its multimodal fusion architecture enables simultaneous processing of image and text information, demonstrating impressive visual reasoning and semantic understanding capabilities in USMLE-style open-ended question-answering tasks. LLaVA-Med, through domain-specific instruction tuning, significantly enhances the quality of semantic understanding and text generation, forming a multimodal dialogue assistant for medical image interpretation. This system supports multiple medical imaging modalities, including X-ray, CT, and MRI, while ensuring consistent and context-aware interactions throughout multi-turn dialogues. Med-PaLM M was trained and evaluated using MultiMedBench, a multimodal biomedical benchmark encompassing 14 tasks across text, medical imaging, and genomics. Based on PaLM-E,[88] the model was fine-tuned to handle multiple biomedical tasks with a unified architecture and shared model weights, using instruction tuning and zero-shot reasoning to enhance generalization capabilities. Experimental results demonstrate that Med-PaLM M achieves or surpasses state-of-the-art performance on all MultiMedBench tasks, exhibiting emergent zero-shot medical reasoning and positive cross-task transfer. Furthermore, ChatCAD+[63] integrates multiple computer-aided diagnosis (CAD) models with LLMs, using a domain recognition module to select the appropriate model based on input images and hierarchical contextual learning to refine outputs with insights from similar case reports. In addition, ChatCAD+ provides accurate interactive medical recommendations based on professional knowledge bases, such as the Merck Manual,[89] improving both the quality of diagnostic reports and the reliability of interactive medical question answering.
Figure 3: Examples of using large models for VQA in multi-turn dialogue for clinical diagnosis. VQA: Visual question answering.
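As Figure 3 suggests, coherent multi-turn interaction depends on carrying the accumulated dialogue context into every new query. The snippet below is a minimal, model-agnostic sketch of that bookkeeping; `answer_fn` and all other names are hypothetical placeholders rather than an interface exposed by any of the systems discussed here.

```python
# Minimal sketch of maintaining multi-turn dialogue state for image-grounded
# question answering; `answer_fn` stands in for any multimodal model backend.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Turn:
    role: str      # "radiologist" or "assistant"
    content: str

@dataclass
class Dialogue:
    image_id: str                      # reference to the study being discussed
    history: List[Turn] = field(default_factory=list)

    def ask(self, question: str, answer_fn: Callable[[str, List[Turn]], str]) -> str:
        self.history.append(Turn("radiologist", question))
        # The full history is passed so each answer stays consistent with prior turns.
        answer = answer_fn(self.image_id, self.history)
        self.history.append(Turn("assistant", answer))
        return answer

# Example with a stub model that simply reports the turn count.
dialogue = Dialogue(image_id="chest_ct_001")
stub = lambda image_id, history: f"[answer {len(history)} for {image_id}]"
print(dialogue.ask("Is there a nodule in the right upper lobe?", stub))
print(dialogue.ask("Has it grown compared with the prior study?", stub))
```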
In specialized clinical domains, large models demonstrate strong adaptability, providing innovative solutions for various medical challenges. For instance, SurgicalGPT[65] focuses on surgical visual analysis and intelligent question answering. It outperforms existing models across multiple surgical datasets by deeply understanding surgical images and related information. This interactivity can reduce reliance on real-time expert consultations, enhance the patient consultation experience, and improve medical education efficiency. SkinGPT-4[66] specializes in dermatological diagnosis, leveraging tens of thousands of skin disease images along with physician notes and clinical concepts to enable precise recognition and interactive diagnostic report generation. By providing personalized treatment recommendations and classification analyses, SkinGPT-4 holds the potential to expand access to dermatological diagnostics and consultations, particularly in resource-limited regions.
Beyond static models, agentic AI systems are redefining medical AI, evolving from passive tools into interactive collaborators within clinical workflows. These large-model-driven agents not only respond to queries but also proactively gather information, recall patient histories, and coordinate with other models or clinicians to improve diagnostic accuracy and medical decision-making.[90,91] For instance, AMIE (Articulate Medical Intelligence Explorer)[92] leverages self-play training to enhance its conversational diagnostic reasoning, outperforming primary care physicians in history-taking and patient engagement. Meanwhile, AgentClinic[93] serves as a benchmark for evaluating multimodal medical AI agents in realistic clinical environments, highlighting both the challenges and advantages of AI-driven sequential decision-making in patient interactions. As these systems continue to evolve, autonomous, adaptive, and interactive AI agents have the potential to transform healthcare delivery by augmenting human expertise across various medical specialties.
Future Directions and Challenges
As demand grows for the practical application of large models in medical imaging, future exploration and innovation will involve not only improvements in modeling methods and technical tools but also close coordination with data ecosystems, regulatory mechanisms, and the broader industrial chain. The following discussion addresses several dimensions: deeper integration of domain knowledge, enhancement of imaging analysis capabilities, assurance of clinical reliability, and optimization of deployment pathways [Table 4].
Table 4: Key challenges and proposed solutions for large models in medical imaging.

| Category | Key challenges | Proposed solutions |
|---|---|---|
| Data availability & quality | Scarcity of high-quality, diverse medical datasets, compounded by privacy concerns and sharing restrictions. | Develop large-scale, multi-center datasets; utilize generative adversarial networks to generate diverse and high-quality data; enhance data sharing and privacy measures. |
| Model adaptability & generalization | Limited generalization capabilities across various tasks and patient populations. | Leverage continuous pretraining for domain-specific representations; explore cross-task, multimodal integration strategies. |
| Computational resources & efficiency | High resource consumption for training and inference, limiting deployment in real-world clinical settings. | Employ model compression, mixed-precision training, and parameter-efficient fine-tuning to optimize resource usage. |
| Integration of medical knowledge | Insufficient integration of deep medical expertise, affecting model accuracy in clinical contexts. | Strengthen links with medical knowledge bases and clinical guidelines to enhance the understanding of disease mechanisms and clinical environments. |
| Clinical reliability & interpretability | Lack of transparency in decision-making, leading to concerns over reliability and safety. | Incorporate interpretability techniques such as attention visualization, causal reasoning, and feature importance analysis; build feedback-driven optimization systems. |
| Multimodal data processing | Difficulty in aligning and integrating diverse data types, facing issues like signal noise and resolution inconsistencies. | Develop more effective methods for modality alignment and multimodal representation learning; enhance data fusion strategies. |
| Ethical & privacy concerns | Privacy risks and ethical issues arising from sensitive medical data. | Improve data anonymization techniques; adopt federated learning and secure multi-party computation to ensure privacy protection. |
| Clinical deployment & workflow integration | Challenges in standardizing deployment procedures and integrating with existing clinical workflows. | Create unified deployment and evaluation standards; design lightweight models tailored for diverse clinical environments. |
Enhancing domain knowledge integration
To meet the evolving professional demands of medical image analysis, large models must go beyond their current capabilities and further deepen their understanding of medical knowledge. This involves not only integrating a wide range of clinical knowledge sources such as clinical guidelines, diagnostic workflows, radiology textbooks, and the most up-to-date professional literature, but also expanding their ability to learn from diverse medical contexts, including niche specialties and rare diseases. Moreover, although current AI exhibits remarkable performance in creative divergent thinking, top experts still maintain irreplaceable advantages in intuitive judgment and advanced pattern recognition.[94] Large models should focus on integrating data-driven intelligent analysis with the rich experience of experts to achieve more precise and reliable clinical decision-making.
Data scarcity, which remains one of the most significant bottlenecks in the development of medical large models, severely limits the ability to perform more complex pretraining tasks and restricts fine-tuning for specialized tasks. Accelerating the construction of large-scale, high-quality medical datasets is essential to enable the development of more sophisticated models. These datasets must encompass diverse patient populations, including variations in demographics, diseases, and medical imaging modalities. Researchers[95] have repeatedly demonstrated that pretraining models on domain-specific data can dramatically enhance their performance on downstream tasks, particularly when dealing with the intricacies and complexities of medical image analysis. This domain-specific knowledge allows models to better understand and interpret the context of medical images. Despite the considerable computational resources required to train such large models, particularly in terms of processing power and memory, training models from scratch on high-quality, domain-specific data holds immense promise.[49] These models are better equipped to handle the unique challenges posed by medical data, including high dimensionality and variability in imaging patterns.
In light of these challenges, it is imperative for international medical institutions, research organizations, and private enterprises to collaborate in creating open, shared, and rigorously anonymized medical datasets. These datasets should be designed to support comprehensive analysis across different regions, ethnic groups, and disease spectrums, minimizing bias that could otherwise lead to discriminatory outcomes.[96] Studies[97–99] show that biased data in AI systems can lead to unfair results, such as misdiagnosing underserved populations or inequitable resource allocation. Furthermore, the advancement of multimodal generative AI frameworks capable of producing high-fidelity synthetic data—including anatomically plausible medical images[100] and semantically aligned text-image pairs[101]—has become increasingly important in addressing real-world data scarcity. By simulating edge cases and rare pathological manifestations, such synthetic datasets expose models to clinically relevant variations that may be absent or underrepresented in conventional datasets. Critically, this paradigm shifts the focus from mere data volume expansion to targeted diversity engineering, fostering AI systems resilient to real-world heterogeneity while maintaining diagnostic fidelity.
Improving imaging analysis capabilities
Although large models have achieved notable success in processing textual information, they still exhibit significant shortcomings in analyzing image data, particularly within the medical domain. Medical images inherently possess a high level of complexity that demands more specialized techniques than those typically used in general computer vision tasks.
Currently, multimodal LLMs are advancing rapidly by converting image data into learnable vectors using visual feature extractors and mapping these vectors into word embedding spaces for unified modeling with language information. Although this approach has enabled some progress, the simple vector representations produced may not fully capture the rich phenotypic features inherent in medical images. Medical images often include complex anatomical structures, subtle lesion details, and unique subtype characteristics that require a much finer level of representation. General-purpose visual feature extractors, which are typically optimized for natural images, may overlook critical diagnostic details such as the slight textural variations that distinguish benign from malignant tissues.
In the future, integrating LLMs with LVMs is expected to offer a more comprehensive framework for capturing the fine-grained features of medical images. This integrated approach leverages the deep contextual understanding inherent in LLMs alongside the detailed perceptual abilities of advanced LVMs. By fusing these modalities, it becomes possible to more accurately interpret complex imaging phenotypes, thereby enhancing diagnostic precision and overall clinical utility.
Moreover, there is a growing need to design visual feature extractors specifically tailored to the characteristics of medical images. Such extractors would be developed with an understanding of the unique imaging modalities, contrast variations, and anatomical complexities present in clinical data. This specialization could involve training on extensive, domain-specific datasets or incorporating domain knowledge into the model architecture.
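One concrete way such tailoring can begin is with modality-aware preprocessing. The sketch below applies an illustrative soft-tissue window to CT intensities (expressed in Hounsfield units) before the data reach a feature extractor; the window values and function name are examples, not a prescribed pipeline.

```python
# Minimal sketch of modality-aware preprocessing before feature extraction:
# CT intensities are in Hounsfield units, so a clinically meaningful window
# (here an illustrative soft-tissue window) is applied before the encoder.
import numpy as np

def window_ct(volume_hu: np.ndarray, center: float = 40.0, width: float = 400.0) -> np.ndarray:
    """Clip a CT volume to a display window and rescale to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    windowed = np.clip(volume_hu, low, high)
    return (windowed - low) / (high - low)

ct = np.random.uniform(-1000, 1000, size=(4, 64, 64))  # toy volume in HU
x = window_ct(ct)                                       # normalized input for a medical-image encoder
print(x.min(), x.max())                                 # approximately 0.0 and 1.0
```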
Ensuring clinical reliability and safety
The “black-box” nature of decision-making in large models, coupled with the risk of generating false or misleading “hallucinations,” presents significant challenges to clinical trust and safety. This opacity in the model’s internal workings can lead to skepticism among clinicians. Improving interpretability is therefore essential: methods such as attention mapping, feature importance scoring, and surrogate interpretable models can shed light on how decisions are made. For instance, visual explanations that highlight the regions of an image that influenced a diagnosis, or textual explanations that summarize the key factors in the model’s reasoning, can help clinicians grasp the logic behind diagnostic recommendations.
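As a simple illustration of such visual explanations, the sketch below computes an occlusion-sensitivity map: patches of an image are masked one at a time and the drop in the model’s score is recorded. The toy predictor, patch size, and shapes are assumptions for demonstration only, not a recommended clinical tool.

```python
# Minimal sketch of occlusion sensitivity, one simple way to highlight image
# regions that influence a model's prediction (model and shapes are illustrative).
import numpy as np

def occlusion_map(image: np.ndarray, predict: callable, patch: int = 16) -> np.ndarray:
    """Mask each patch in turn and record how much the predicted score falls."""
    base = predict(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()  # mask one patch
            heat[i // patch, j // patch] = base - predict(occluded)
    return heat  # larger values = more influential regions

# Toy model: "probability" driven by the mean intensity of the central region.
toy_predict = lambda img: float(img[24:40, 24:40].mean())
heatmap = occlusion_map(np.random.rand(64, 64), toy_predict)
print(heatmap.shape)  # (4, 4)
```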
A systematic review[102] of healthcare applications of LLMs revealed that most studies concentrate primarily on accuracy, often neglecting critical aspects such as robustness, fairness, and practical applicability. Furthermore, the majority of evaluations rely on medical exam questions, simulated clinical settings, or expert-generated queries rather than real-world data, which fails to reflect the complexity of actual clinical environments and may impede the safe, effective adoption of LLMs in routine medical practice. The S.C.O.R.E. (Safety, Consensus, Objectivity, Reproducibility, and Explainability) framework[103] offers a structured qualitative approach for evaluating LLMs in clinical applications along exactly these dimensions.
Integrating large models into highly realistic clinical simulations and evaluating their impact on clinical workflows helps optimize their application in real-world healthcare settings, enhances interactions with clinical tools and data sources, and supports reliability and safety in multi-stakeholder environments.[104] In addition, establishing closed-loop systems based on human expert feedback enables continuous model improvement. In such systems, clinician feedback is systematically collected and used to refine the model. By incorporating methods such as RLHF, the system can dynamically adjust its predictions, gradually aligning its outputs with clinical realities. This closed-loop approach not only boosts the model’s credibility and robustness but also ensures that it evolves in tandem with emerging clinical insights and standards.
Increasing the weight of high-quality data samples during the fine-tuning phase is also critical. Prioritizing well-annotated, verified clinical cases can help the model learn more accurate and contextually relevant patterns. Furthermore, incorporating external knowledge bases—such as updated clinical guidelines, curated medical ontologies, and diagnostic protocols—during inference can validate the generated content in real time. This dual strategy of data weighting and external validation improves reliability and mitigates the risk of erroneous outputs.
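A minimal sketch of the external-validation idea is shown below: each generated sentence is compared against knowledge-base passages by embedding similarity, and poorly supported sentences are flagged for review. The `embed` function, threshold, and example texts are hypothetical placeholders, not a validated pipeline.

```python
# Minimal sketch of flagging generated sentences that lack support in an
# external knowledge base; `embed` is a placeholder for any sentence encoder.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flag_unsupported(sentences, kb_texts, embed, threshold: float = 0.75):
    """Return sentences whose best match in the knowledge base falls below threshold."""
    kb_vecs = [embed(t) for t in kb_texts]
    flagged = []
    for s in sentences:
        v = embed(s)
        best = max(cosine(v, k) for k in kb_vecs)
        if best < threshold:
            flagged.append((s, best))
    return flagged

# Toy embedding: bag-of-characters, for illustration only.
embed = lambda text: np.array([text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"], dtype=float)
report = ["Ground-glass opacity in the left lower lobe.", "Recommend colonoscopy."]
guidelines = ["Ground-glass opacities warrant follow-up CT per guideline."]
print(flag_unsupported(report, guidelines, embed))
```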
The potential for large models to inadvertently leak private information from their training datasets poses serious legal and ethical challenges. The risk of exposing sensitive patient data necessitates rigorous safeguards. Techniques such as thorough data anonymization, the addition of controlled noise, and strict data cleaning protocols are vital for minimizing privacy risks. Strengthening security controls over model outputs—through real-time monitoring systems and content filtering algorithms—can further prevent the accidental disclosure of sensitive information.
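As one small, illustrative piece of such safeguards, the sketch below applies rule-based de-identification to report text. The regular expressions cover only a few identifier types and would need substantial extension and validation before handling real protected health information.

```python
# Minimal sketch of rule-based de-identification for free-text report metadata;
# patterns are illustrative and incomplete for real PHI categories.
import re

PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def deidentify(text: str) -> str:
    """Replace matched identifiers with category placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Scan on 2024-03-18, MRN: 00123456, callback 555-201-3344."
print(deidentify(note))  # -> "Scan on [DATE], [MRN], callback [PHONE]."
```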
Optimizing deployment and efficiency
In clinical settings, the deployment of large models faces significant cost and time pressures due to hardware limitations and computational resource constraints. These challenges are particularly acute in environments where infrastructure is limited, such as rural hospitals or mobile diagnostic units.
Future research should focus on techniques such as model compression, weight pruning, mixed-precision computing, and knowledge distillation to significantly reduce parameter size and inference costs while maintaining, or even enhancing, model performance. These methods aim to streamline models by eliminating redundant parameters, lowering numerical precision without sacrificing accuracy, and transferring knowledge from large, complex models to smaller, more efficient ones. In addition to these techniques, targeted efficient model architectures and optimization methods offer flexible and innovative solutions to this challenge. For example, GPTQ[105] achieves rapid inference in low-resource environments through fine-grained quantization strategies that reduce the computational load by converting model parameters into lower-precision formats. RWKV (Receptance Weighted Key Value)[106] represents an intriguing architectural innovation by combining the expressiveness of transformers with the efficiency of recurrent neural networks. This hybrid design enables efficient handling of long-sequence data, which is especially beneficial for medical image analysis tasks that require dynamic temporal modeling, such as monitoring disease progression over time or analyzing sequences in video-based diagnostics. Mamba[107], a selective state-space model with linear-time sequence modeling, further simplifies model structure and computation, aiming to strike an optimal balance between complexity and efficiency while greatly reducing hardware requirements for deployment without compromising the depth of analysis. By combining these strategies, large models can be deployed more effectively on hospital devices, portable imaging equipment, and remote medical facilities, thereby promoting the broader adoption of advanced large models.
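To illustrate the lower-precision principle underlying post-training quantization, the sketch below performs simple symmetric int8 weight quantization. This is a didactic example only, not the GPTQ algorithm, which additionally minimizes layer-wise quantization error.

```python
# Minimal sketch of symmetric int8 weight quantization, illustrating the
# lower-precision idea behind post-training quantization (not GPTQ itself).
import torch

def quantize_int8(weight: torch.Tensor):
    """Quantize a float weight tensor to int8 with a single scale factor."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                             # toy float32 weight matrix
q, scale = quantize_int8(w)
error = (dequantize(q, scale) - w).abs().mean().item()
print(f"int8 storage is 4x smaller than float32; mean absolute error {error:.5f}")
```

Per-channel scales, calibration data, and error compensation (as in GPTQ) would typically be layered on top of this basic scheme to preserve accuracy at scale.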
Conclusions
Large models hold significant potential in the field of medical imaging. By integrating advanced technologies from computer vision, NLP, and multimodal learning, these models can greatly enhance the accuracy and efficiency of image analysis, especially in tasks like automated segmentation, auxiliary diagnosis, and report generation. However, despite their notable progress in various domains, they still face challenges in terms of data diversity, model specialization, and the reliability and transparency of clinical applications.
Future developments will need to further integrate domain-specific medical knowledge into model training, optimize architectures to better handle the complexity of medical images, and address challenges related to data scarcity and multimodal fusion. To ensure the widespread adoption of large models in clinical settings, it will also be crucial to establish more rigorous evaluation frameworks, promote transparent model interpretability, and enhance the clinical reliability of these models. In addition, optimizing deployment efficiency and adapting to resource-constrained environments will be key to advancing these technologies.
Funding
This work was supported by the National Key R&D Program of China (2023YFC2415200), National Natural Science Foundation of China (82361168664, 82372053, 82441018, U24A20759, 62222609, 62076236, 32350010, 82302407, 82302296), Beijing Natural Science Foundation (JQ24048, 7232346), Beijing Nova Program (20240484528), Science and Technology Development Fund of Macao Special Administrative Region (0006/2023/AFJ), and China Postdoctoral Science Foundation (2022M720357).
Conflicts of interest
None.
References
1. Alkan C, Mardani M, Liao C, Li Z, Vasanawala SS, Pauly JM. AutoSamp: Autoencoding k-space sampling via variational information maximization for 3D MRI. IEEE Trans Med Imaging 2025;44:270–283. doi: 10.1109/TMI.2024.3443292.
2. Koetzier LR, Mastrodicasa D, Szczykutowicz TP, van der Werf NR, Wang AS, Sandfort V, et al. Deep learning image reconstruction for CT: Technical principles and clinical prospects. Radiology 2023;306:e221257. doi: 10.1148/radiol.221257.
3. Wang Z, Fang M, Zhang J, Tang L, Zhong L, Li H, et al. Radiomics and deep learning in nasopharyngeal carcinoma: A review. IEEE Rev Biomed Eng 2024;17:118–135. doi: 10.1109/RBME.2023.3269776.
4. Korfiatis P, Suman G, Patnam NG, Trivedi KH, Karbhari A, Mukherjee S, et al. Automated artificial intelligence model trained on a large data set can detect pancreas cancer on diagnostic computed tomography scans as well as visually occult preinvasive cancer on prediagnostic computed tomography scans. Gastroenterology 2023;165:1533–1546.e4. doi: 10.1053/j.gastro.2023.08.034.
5. Dong D, Fang MJ, Tang L, Shan XH, Gao JB, Giganti F, et al. Deep learning radiomic nomogram can predict the number of lymph node metastasis in locally advanced gastric cancer: an international multicenter study. Ann Oncol 2020;31:912–920. doi: 10.1016/j.annonc.2020.04.003.
6. Dong D, Tang L, Li ZY, Fang MJ, Gao JB, Shan XH, et al. Development and validation of an individualized nomogram to identify occult peritoneal metastasis in patients with advanced gastric cancer. Ann Oncol 2019;30:431–438. doi: 10.1093/annonc/mdz001.
7. Hu C, Chen W, Li F, Zhang Y, Yu P, Yang L, et al. Deep learning radio-clinical signatures for predicting neoadjuvant chemotherapy response and prognosis from pretreatment CT images of locally advanced gastric cancer patients. Int J Surg 2023;109:1980–1992. doi: 10.1097/JS9.0000000000000432.
8. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 2020;33:1877–1901. Available at
https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html (Last accessed on Feb 10th, 2025).
9. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv 2023;2302.13971. doi: 10.48550/arXiv.2302.13971.
10. Brodeur PG, Buckley TA, Kanjee Z, Goh E, Ling EB, Jain P, et al. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv 2024;2412.10849. doi: 10.48550/arXiv.2412.10849.
11. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. Proceedings of the IEEE/CVF International Conference on Computer Vision 2023;4015–4026. doi: 10.1109/ICCV51070.2023.00371.
12. Chen YC, Li L, Yu L, El Kholy A, Ahmed F, Gan Z, et al. UNITER: Universal image-text representation learning. European conference on computer vision. Cham: Springer International Publishing; 2020:104–120. doi: 10.1007/978-3-030-58577-8_7.
13. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning 2021;139:8748–8763. Available at
https://proceedings.mlr.press/v139/radford21a.html (Last accessed on Feb 10th, 2025).
14. Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems 2022;35:23716–23736. Available at
https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html (Last accessed on Feb 10th, 2025).
15. Liu A, Feng B, Wang B, Wang B, Liu B, Zhao C, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv 2024;2405.04434. doi: 10.48550/arXiv.2405.04434.
16. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018;1810.04805. doi: 10.48550/arXiv.1810.04805.
17. Dehghani M, Djolonga J, Mustafa B, Padlewski P, Heek J, Gilmer J, et al. Scaling vision transformers to 22 billion parameters. Proceedings of the 40th International Conference on Machine Learning 2023;202:7480–7512. Available at
https://proceedings.mlr.press/v202/dehghani23a.html (Last accessed on Feb 10th, 2025).
18. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun 2024;15:654. doi: 10.1038/s41467-024-44824-z.
19. Liu H, Li C, Li Y, Lee YJ. Improved baselines with visual instruction tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024;26296–26306. doi: 10.1109/CVPR52733.2024.02484.
20. Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, et al. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 2024;36. Available at
https://proceedings.neurips.cc/paper_files/paper/2023/hash/5abcdf8ecdcacba028c6662789194572-Abstract-Datasets_and_Benchmarks.html (Last accessed on Feb 10th, 2025).
21. Bai F, Du Y, Huang T, Meng MQH, Zhao B. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv 2024;2404.00578. doi: 10.48550/arXiv.2404.00578.
22. Yin M, Wu C, Wang Y, Wang H, Guo W, Wang Y, et al. Entropy law: The story behind data compression and LLM performance. arXiv 2024;2407.06645. doi: 10.48550/arXiv.2407.06645.
23. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, et al. The cancer imaging archive (TCIA): Maintaining and operating a public information repository. J Digit Imaging 2013;26:1045–1057. doi: 10.1007/s10278-013-9622-7.
24. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016;3:160035. doi: 10.1038/sdata.2016.35.
25. Hankey BF, Ries LA, Edwards BK. The surveillance, epidemiology, and end results program: A national resource. Cancer Epidemiol Biomarkers Prev 1999;8:1117–1121. Available at
https://aacrjournals.org/cebp/article/8/12/1117/177795/The-Surveillance-Epidemiology-and-End-Results (Last accessed on Feb 10th, 2025).
26. Li W, Qu C, Chen X, Bassi PRAS, Shi Y, Lai Y, et al. AbdomenAtlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking. Med Image Anal 2024;97:103285. doi: 10.1016/j.media.2024.103285.
27. Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, et al. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland 2023:525–536. doi: 10.1007/978-3-031-43993-3_51.
28. Moor M, Huang Q, Wu S, Yasunaga M, Dalmia Y, Leskovec J, et al. Med-flamingo: A multimodal medical few-shot learner. Proceedings of the 3rd Machine Learning for Health Symposium 2023;225:353–367. Available at
https://proceedings.mlr.press/v225/moor23a.html (Last accessed on Feb 10th, 2025).
29. Byambasuren O, Yang Y, Sui Z, Dai D, Chang B, Li S, et al. Preliminary study on the construction of Chinese medical knowledge graph. J Chin Inf Process 2019;33:1–9. Available at
http://jcip.cipsc.org.cn/CN/abstract/abstract2840.shtml (Last accessed on Feb 10th, 2025).
30. Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: A medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 2023;15:e40895. doi: 10.7759/cureus.40895.
31. Li C, Liu H, Liu Y, Feng BY, Li W, Liu X, et al. Endora: Video generation models as endoscopy simulators. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2024:230–240. doi: 10.1007/978-3-031-72089-5_22.
32. Xiao H, Zhou F, Liu X, Liu T, Li Z, Liu X, et al. A comprehensive survey of large language models and multimodal large language models in medicine. arXiv 2024;2405.08603. doi: 10.48550/arXiv.2405.08603.
33. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2022;10684–10695. doi: 10.1109/CVPR52688.2022.01042.
34. Zhang J, Zhou Z, Mai G, Hu M, Guan Z, Li S, et al. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models. arXiv 2023;2304.10597. doi: 10.48550/arXiv.2304.10597.
35. He H, Zhang J, Xu M, Liu J, Du B, Tao D. Scalable mask annotation for video text spotting. arXiv 2023;2305.01443. doi: 10.48550/arXiv.2305.01443.
36. Yu T, Feng R, Feng R, Liu J, Jin X, Zeng W, et al. Inpaint anything: Segment anything meets image inpainting. arXiv 2023;2304.06790. doi: 10.48550/arXiv.2304.06790.
37. Cheng Y, Li L, Xu Y, Li X, Yang Z, Wang W, et al. Segment and track anything. arXiv 2023;2305.06558. doi: 10.48550/arXiv.2305.06558.
38. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022;35:27730–27744. Available at
https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html (Last accessed on Feb 10th, 2025).
39. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022;16000–16009. doi: 10.1109/CVPR52688.2022.01553.
40. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, et al. Dinov2: Learning robust visual features without supervision. arXiv 2023;2304.07193. doi: 10.48550/arXiv.2304.07193.
41. Li J, Li D, Savarese S, Hoi S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the 40th International Conference on Machine Learning 2023;202:19730–19742. Available at
https://proceedings.mlr.press/v202/li23q.html (Last accessed on Feb 10th, 2025).
42. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-rank adaptation of large language models. arXiv 2021;2106.09685. doi: 10.48550/arXiv.2106.09685.
43. Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning 2019;97:2790–2799. Available at
https://proceedings.mlr.press/v97/houlsby19a.html (Last accessed on Feb 10th, 2025).
44. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 2024;36. Available at
https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html (Last accessed on Feb 10th, 2025).
45. Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 2024;36. Available at
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html (Last accessed on Feb 10th, 2025).
46. Meng Y, Xia M, Chen D. SimPO: Simple preference optimization with a reference-free reward. arXiv 2024;2405.14734. doi: 10.48550/arXiv.2405.14734.
47. Wu T, Luo L, Li YF, Pan S, Vu TT, Haffari G. Continual learning for large language models: A survey. arXiv 2024;2402.01364. doi: 10.48550/arXiv.2402.01364.
48. Bolton E, Venigalla A, Yasunaga M, Hall D, Xiong B, Lee T, et al. BioMedLM: A 2.7B parameter language model trained on biomedical text. arXiv 2024;2403.18421. doi: 10.48550/arXiv.2403.18421.
49. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med 2022;5:194. doi: 10.1038/s41746-022-00742-2.
50. Yao S, Yu D, Zhao J, Shafran I, Griffiths T, Cao Y, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv 2023;2305.10601. doi: 10.48550/arXiv.2305.10601.
51. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, et al. Self-consistency improves chain of thought reasoning in language models. arXiv 2022;2203.11171. doi: 10.48550/arXiv.2203.11171.
52. Madaan A, Tandon N, Gupta P, Hallinan S, Gao L, Wiegreffe S. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 2024;36. Available at
https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html (Last accessed on Feb 10th, 2025).
53. Jia M, Tang L, Chen BC, Cardie C, Belongie S, Hariharan B, et al. Visual prompt tuning. European Conference on Computer Vision. Cham: Springer Nature Switzerland; 2022; 709–727. doi: 10.1007/978-3-031-19827-4_41.
54. Huang Q, Dong X, Chen D, Zhang W, Wang F, Hua G, et al. Diversity-aware meta visual prompting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023;10878–10887. doi: 10.1109/CVPR52729.2023.01047.
55. Zhou K, Yang J, Loy CC, Liu Z. Learning to prompt for vision-language models. Int J Comput Vis 2022;130:2337–2348. doi: 10.1007/s11263-022-01653-1.
56. Rao Y, Zhao W, Chen G, Tang Y, Zhu Z, Huang G, et al. DenseCLIP: Language-guided dense prediction with context-aware prompting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022;18082–18091. doi: 10.1109/CVPR52688.2022.01755.
57. Khattak MU, Rasheed H, Maaz M, Khan S, Khan FS. MaPLe: Multi-modal prompt learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023;19113–19122. doi: 10.1109/CVPR52729.2023.01832.
58. Luo F, Chen C, Wan Z, Kang Z, Yan Q, Li Y, et al. CODIS: Benchmarking context-dependent visual comprehension for multimodal large language models. arXiv 2024;2402.13607. doi: 10.48550/arXiv.2402.13607.
59. Jiang Y, Yan X, Ji GP, Fu K, Sun M, Xiong H, et al. Effectiveness assessment of recent large vision-language models. Visual Intelligence 2024;2:17. doi: 10.1007/s44267-024-00050-1.
60. Wu J, Ji W, Liu Y, Fu H, Xu M, Xu Y, et al. Medical SAM adapter: Adapting segment anything model for medical image segmentation. arXiv 2023;2304.12620. doi: 10.48550/arXiv.2304.12620.
61. Liu F, Zhu T, Wu X, Yang B, You C, Wang C, et al. A medical multimodal large language model for future pandemics. NPJ Digit Med 2023;6:226. doi: 10.1038/s41746-023-00952-2.
62. Wang Z, Liu L, Wang L, Zhou L. R2GenGPT: Radiology report generation with frozen LLMs. Meta-Radiology 2023;1:100033. doi: 10.1016/j.metrad.2023.100033.
63. Zhao Z, Wang S, Gu J, Zhu Y, Mei L, Zhuang Z, et al. ChatCAD+: Toward a universal and reliable interactive CAD using LLMs. IEEE Trans Med Imaging 2024;43:3755–3766. doi: 10.1109/TMI.2024.3398350.
64. Lee S, Kim WJ, Chang J, Ye JC. LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation. arXiv 2023;2305.11490. doi: 10.48550/arXiv.2305.11490.
65. Seenivasan L, Islam M, Kannan G, Ren H. SurgicalGPT: end-to-end language-vision GPT for visual question answering in surgery. International conference on medical image computing and computer-assisted intervention. Cham: Springer Nature Switzerland; 2023:281–290. doi: 10.1007/978-3-031-43996-4_27.
66. Zhou J, He X, Sun L, Xu J, Chen X, Chu Y, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun 2024;15:5649. doi: 10.1038/s41467-024-50043-3.
67. Zhang Y, Zhou T, Wang S, Liang P, Zhang Y, Chen DZ. Input augmentation with SAM: Boosting medical image segmentation with segmentation foundation model. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2023:129–139. doi: 10.1007/978-3-031-47401-9_13.
68. Koleilat T, Asgariandehkordi H, Rivaz H, Xiao Y. MedCLIP-SAM: Bridging text and image towards universal medical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2024:643–653. doi: 10.1007/978-3-031-72390-2_60.
69. Li W, Xiong X, Xia P, Ju L, Ge Z. TP-DRSeg: Improving diabetic retinopathy lesion segmentation with explicit text-prompts assisted SAM. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2024: 743–753. doi: 10.1007/978-3-031-72111-3_70.
70. Huang Y, Yang X, Liu L, Zhou H, Chang A, Zhou X, et al. Segment anything model for medical images? Med Image Anal 2024;92:103061. doi: 10.1016/j.media.2023.103061.
71. Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, et al. SAM-Med3D: Towards general-purpose segmentation models for volumetric medical images. arXiv 2023;2310.15161. doi: 10.48550/arXiv.2310.15161.
72. Sheng Y, Bano S, Clarkson MJ, Islam M. Surgical-DeSAM: decoupling SAM for instrument segmentation in robotic surgery. Int J Comput Assist Radiol Surg 2024;19:1267–1271. doi: 10.1007/s11548-024-03163-6.
73. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. arXiv 2020;2005.12872. doi: 10.48550/arXiv.2005.12872.
74. Wang M, Lin T, Lin A, Yu K, Peng Y, Wang L, et al. Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. arXiv 2024;2406.09317. doi: 10.48550/arXiv.2406.09317.
75. Zhou Y, Chia MA, Wagner SK, Ayhan MS, Williamson DJ, Struyven RR, et al. A foundation model for generalizable disease detection from retinal images. Nature 2023;622:156–163. doi: 10.1038/s41586-023-06555-x.
76. Wang Z, Liu C, Zhang S, Dou Q. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2023: 101–111. doi: 10.1007/978-3-031-43996-4_10.
77. Christensen M, Vukadinovic M, Yuan N, Ouyang D. Vision-language foundation model for echocardiogram interpretation. Nat Med 2024;30:1481–1488. doi: 10.1038/s41591-024-02959-y.
78. Xu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, et al. A whole-slide foundation model for digital pathology from real-world data. Nature 2024;630:181–188. doi: 10.1038/s41586-024-07441-w.
79. Pai S, Bontempi D, Hadzic I, Prudente V, Sokač M, Chaunzwa TL, et al. Foundation model for cancer imaging biomarkers. Nat Mach Intell 2024;6:354–367. doi: 10.1038/s42256-024-00807-9.
80. Zhou Y, Ong H, Kennedy P, Wu CC, Kazam J, Hentel K, et al. Evaluating GPT-4V (GPT-4 with vision) on detection of radiologic findings on chest radiographs. Radiology 2024;311:e233270. doi: 10.1148/radiol.233270.
81. Jiang Y, Chen C, Nguyen D, Mervak BM, Tan C. GPT-4V cannot generate radiology reports yet. arXiv 2024;2407.12176. doi: 10.48550/arXiv.2407.12176.
82. Lu Y, Hong S, Shah Y, Xu P. Effectively fine-tune to improve large multimodal models for radiology report generation. arXiv 2023;2312.01504. doi: 10.48550/arXiv.2312.01504.
83. Bannur S, Bouzid K, Castro DC, Schwaighofer A, Thieme A, Bond-Taylor S, et al. Maira-2: Grounded radiology report generation. arXiv 2024;2406.04449. doi: 10.48550/arXiv.2406.04449.
84. Li J, Guan Z, Wang J, Cheung CY, Zheng Y, Lim LL, et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med 2024;30:2886–2896. doi: 10.1038/s41591-024-03139-8.
85. Hasani AM, Singh S, Zahergivar A, Ryan B, Nethala D, Bravomontenegro G, et al. Evaluating the performance of generative pre-trained transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol 2024;34:3566–3574. doi: 10.1007/s00330-023-10384-x.
86. Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, et al. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: Results, limitations, and potential. Vis Comput Ind Biomed Art 2023;6:9. doi: 10.1186/s42492-023-00136-5.
87. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang PC, et al. Towards generalist biomedical AI. NEJM AI 2024;1:AIoa2300138. doi: 10.1056/AIoa2300138.
88. Driess D, Xia F, Sajjadi MS, Lynch C, Chowdhery A, Ichter B, et al. PaLM-E: An embodied multimodal language model. arXiv 2023;2303.03378. doi: 10.48550/arXiv.2303.03378.
89. Porter RS, Kaplan JL. The Merck manual of diagnosis and therapy. New Jersey: Merck Sharp & Dohme Corp, 2011.
90. Zou J, Topol EJ. The rise of agentic AI teammates in medicine. Lancet 2025;405:457. doi: 10.1016/S0140-6736(25)00202-8.
91. Qiu J, Lam K, Li G, Acharya A, Wong TY, Darzi A, et al. LLM-based agentic systems in medicine and healthcare. Nat Mach Intell 2024;6:1418–1420. doi: 10.1038/s42256-024-00944-1.
92. Tu T, Palepu A, Schaekermann M, Saab K, Freyberg J, Tanno R, et al. Towards conversational diagnostic AI. arXiv 2024;2401.05654. doi: 10.48550/arXiv.2401.05654.
93. Schmidgall S, Ziaei R, Harris C, Reis E, Jopling J, Moor M. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv 2024;2405.07960. doi: 10.48550/arXiv.2405.07960.
94. Koivisto M, Grassini S. Best humans still outperform artificial intelligence in a creative divergent thinking task. Sci Rep 2023;13:13601. doi: 10.1038/s41598-023-40858-3.
95. Boers TGW, Fockens KN, van der Putten JA, Jaspers TJM, Kusters CHJ, Jukema JB, et al. Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency. Med Image Anal 2024;98:103298. doi: 10.1016/j.media.2024.103298.
96. McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, et al. Towards accurate differential diagnosis with large language models. arXiv 2023;2312.00164. doi: 10.48550/arXiv.2312.00164.
97. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447–453. doi: 10.1126/science.aax2342.
98. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27:2176–2182. doi: 10.1038/s41591-021-01595-0.
99. Samorani M, Harris SL, Blount LG, Lu H, Santoro MA. Overbooked and overlooked: Machine learning and racial bias in medical appointment scheduling. M&SOM-Manuf Serv Oper Manag 2022;24:2825–2842. doi: 10.1287/msom.2021.0999.
100. Bluethgen C, Chambon P, Delbrouck JB, van der Sluijs R, Połacin M, Zambrano Chaves JM, et al. A vision-language foundation model for the generation of realistic chest X-ray images. Nat Biomed Eng 2025;9:494–506. doi: 10.1038/s41551-024-01246-y.
101. Tao M, Bao BK, Tang H, Xu C. Galip: Generative adversarial clips for text-to-image synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023;14214–14223. doi: 10.1109/CVPR52729.2023.01366.
102. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. A systematic review of testing and evaluation of healthcare applications of large language models (LLMs). medRxiv 2024;2024.04.15.24305869. doi: 10.1101/2024.04.15.24305869.
103. Tan TF, Elangovan K, Ong J, Shah N, Sung J, Wong TY, et al. A proposed score evaluation framework for large language models: Safety, consensus, objectivity, reproducibility and explainability. arXiv 2024;2407.07666. doi: 10.48550/arXiv.2407.07666.
104. Mehandru N, Miao BY, Almaraz ER, Sushil M, Butte AJ, Alaa A. Evaluating large language models as agents in the clinic. NPJ Digit Med 2024;7:84. doi: 10.1038/s41746-024-01083-y.
105. Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv 2022;2210.17323. doi: 10.48550/arXiv.2210.17323.
106. Peng B, Alcaide E, Anthony Q, Albalak A, Arcadinho S, Biderman S, et al. RWKV: Reinventing RNNs for the transformer era. arXiv 2023;2305.13048. doi: 10.48550/arXiv.2305.13048.
107. Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023;2312.00752. doi: 10.48550/arXiv.2312.00752.