AiScan: Interdisciplinary Innovation & Ethical Use
Phase 2 Submission Questions
Please complete these questions for Phase 2 participation.
Team Name

The Curators of the University of Missouri

Team Member Updates (non-scored criteria)

Our team member, Dr. Chen, was elected an AIMBE Fellow earlier this year. Our interdisciplinary team remains unchanged from Round 1 and consists of ophthalmology researchers and data scientists:

Dr. Shu-Ching Chen is a fellow of IEEE, AAAS, AIMBE, AAIA, and SIRI, and is the inaugural Executive Director of the Data Science and Analytics Innovation Center (dSAIC). dSAIC is a multi-university center based at the University of Missouri-Kansas City (UMKC).

Dr. Mei-Ling Shyu is a fellow of IEEE, AAAS, AIMBE, AAIA, and SIRI, professor of Electrical and Computer Engineering at UMKC, and has extensive research experience in the areas of machine learning, data mining, big data analytics, data integration and information fusion with successful applications in healthcare including ophthalmology.

Dr. Koulen is a fellow of ARVO, Professor of Ophthalmology and the Endowed Chair in Vision Research at UMKC's School of Medicine, and also serves as the Department of Ophthalmology's Vice Chair for Research.

GREI Repository Datasets
Figshare
Repository dataset DOIs

The FIVES dataset was obtained from Figshare, a GREI member repository. FIVES: https://doi.org/10.6084/m9.figshare.19688169.v1

The remaining datasets were collected from external sources and included the High-Resolution Fundus (HRF) image database and the StoneRounds.org Disease Atlas. 

HRF: https://doi.org/10.1155/2013/154860

StoneRounds (three DOIs as examples for the 178 cases):

SR2133. (2024, February 12). Retrieved October 21, 2024, from https://stonerounds.org/SR/SR2133, doi:10.34683/SR2133

SR2843. (2024, February 22). Retrieved October 21, 2024, from https://stonerounds.org/SR/SR2843, doi:10.34683/SR2843

SR583. (2024, February 12). Retrieved October 21, 2024, from https://stonerounds.org/SR/SR583, doi:10.34683/SR583

Overview/Abstract

This research sought to optimize the contextual understanding of a multimodal large language model (MLLM) for the analysis of retinal images and patient clinical text. Internal feature embeddings from a U-Net segmentation model were injected directly into multiple cross-attention layers throughout the MLLM language pipeline using a customized attention-based architecture. Integrating the U-Net’s low-level semantic features proved highly effective in increasing the learning capacity of the MLLM’s global contextual understanding, yielding a 56% relative increase in disease-type classification accuracy over the baseline MLLM. This performance highlights the application’s utility for improving healthcare by assisting ophthalmology specialists in a clinical setting and points to key research directions for realizing its potential.

Secondary Analysis

This project aimed to integrate internal layer embeddings from two well-known artificial intelligence model architectures to investigate the potential of a cross-modal learning environment for ophthalmological imaging and clinical text analysis. These were a U-Net convolutional network optimized for segmentation of blood vessel structures in fundus photography images and the LLaMA 3.2 11B Vision multimodal large language model (MLLM), which is adept at image and text analysis for textual content generation. The project developed a novel approach for fusing internal embeddings from each model along multiple depth-wise layers, with the intent of enabling the MLLM to gain additional contextual understanding from the U-Net’s fine-grained features.

The primary data used in the analysis were drawn from three sources: the Fundus Image Vessel Segmentation (FIVES) dataset, obtained from a GREI repository, and the High-Resolution Fundus (HRF) dataset and the StoneRounds.org Disease Atlas, both obtained from external sources. FIVES and HRF consisted of 630 retinal fundus images of various acquired disease types, each with a corresponding binary vessel segmentation mask. StoneRounds consisted of 490 fundus images and patient clinical text, all from patients diagnosed with a type of inherited retinal disease. To complete the research data, the StoneRounds images were professionally annotated by ophthalmology specialists at the University of Missouri – Kansas City (UMKC) School of Medicine to obtain corresponding segmentation masks.

The proposed architecture incorporated custom fusion blocks for integrating model embeddings. Each layer of the U-Net encoder produces an embedding of the original image that decreases in spatial size but increases in channel depth as it moves through the encoder pipeline. To integrate these U-Net embeddings (the features carried by the skip connections) into the MLLM vision encoder outputs, a separate transformer encoder was developed that split each U-Net embedding into a set of non-overlapping 16 x 16 patches, which were vectorized and fused with the MLLM embeddings through a cross-attention mechanism. Outputs from multiple U-Net encoder layers were fused with outputs from the MLLM vision encoder, and the fused embeddings then served as inputs to the cross-attention layers within the MLLM language model, where text features are fused with the vision components. This fusion occurs at multiple depths so that U-Net features enter at key points along the language pipeline and the segmentation information is distributed for maximum effect on text generation. Concurrently, output embeddings from the MLLM cross-attention layers were fused with the original U-Net encoder skip layers through feature-wise linear modulation (FiLM) before passing to the U-Net decoder layers. The overall goal was to integrate internal feature information across the two models so that the consolidated architecture demonstrates increased contextual awareness in the analysis of retinal images and patient clinical text. Although segmentation was not a primary focus, the FiLM-based pathway was retained as a key architectural component to keep gradient updates synchronized within the MLLM architecture.
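The patch-and-fuse step can be illustrated with a small, dependency-free Python sketch. It is not the project's implementation. The function names (patchify, cross_attention), the single attention head, the absence of learned query/key/value projections, the choice of MLLM tokens as queries, the residual addition, and all toy sizes are assumptions made for illustration. A real model would use PyTorch tensors and learned projection layers.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows of equal length


def patchify(fmap: List[Matrix], patch: int = 16) -> Matrix:
    """Split a C x H x W feature map into non-overlapping patch x patch tiles.

    Each tile is flattened (channel-major) into one vector, so the result has
    (H // patch) * (W // patch) rows of length C * patch * patch.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    if height % patch or width % patch:
        raise ValueError("H and W must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                channel[r][c]
                for channel in fmap
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        fused.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return fused


# Toy U-Net encoder output: 1 channel on a 32 x 32 grid (arbitrary values).
unet_map = [[[(r * 32 + c) / 1024.0 for c in range(32)] for r in range(32)]]
unet_tokens = patchify(unet_map)  # 4 patch tokens, each of length 256

# Toy stand-ins for MLLM vision-encoder tokens (3 tokens, same width as patches).
mllm_tokens = [[0.01 * (i + 1)] * 256 for i in range(3)]

# Assumed direction: MLLM tokens query the U-Net patches, then a residual add.
attended = cross_attention(mllm_tokens, unet_tokens, unet_tokens)
fused_tokens = [[m + a for m, a in zip(mt, at)]
                for mt, at in zip(mllm_tokens, attended)]
```

In this sketch the fused tokens keep the shape of the MLLM tokens, which is what would let them pass on to later cross-attention layers unchanged in size.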

The proposed architecture proved highly effective at improving the MLLM’s visual understanding of the images. The MLLM was trained independently (i.e., without U-Net fusion) on the combined datasets to serve as a baseline for comparison. This baseline MLLM achieved 60% classification accuracy across retinal disease categories, while the U-Net fused MLLM achieved 94% accuracy. This represents a 56% relative improvement (34 percentage points) and demonstrates the significance of the segmentation features in guiding the MLLM’s understanding. An additional ablation study showed that the original U-Net and MLLM parameters must continue to be refined alongside the training of the fusion blocks: freezing U-Net and/or MLLM weights during training produced results that were no better than the baseline MLLM.
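The improvement figure is relative to the baseline rather than a difference in percentage points. The short calculation below uses only the two accuracies reported in this section and gives roughly 56.7%, consistent with the 56% quoted here. The variable names are illustrative.

```python
baseline_acc = 60  # percent, baseline MLLM (reported above)
fused_acc = 94     # percent, U-Net fused MLLM (reported above)

absolute_gain = fused_acc - baseline_acc            # in percentage points
relative_gain = 100 * absolute_gain / baseline_acc  # as a percent of the baseline

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```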

The LLaMA 3.2 model’s existing vision and natural language processing capabilities also benefited from the U-Net’s low-level semantic features in terms of prediction explainability. A review of model outputs showed more detailed descriptions of diagnostic reasoning than were evident in the baseline MLLM outputs. The architecture presented in this research demonstrated an increased capacity for learning by incorporating fine-grained semantic information from a segmentation model to complement its own global contextual understanding of medical imaging. It is also designed for scalability and utility, since it requires limited input and can be adapted for use in clinical settings.

Outcomes and Outputs

This research project successfully developed a contextually aware MLLM architecture for the analysis of retinal fundus images and patient clinical text. This awareness was enhanced by integrating additional feature information from the internal skip embeddings of a separate U-Net segmentation model, fused directly at multiple depths of the MLLM language pipeline. This implementation resulted in a 56% increase in disease-type classification accuracy over the baseline MLLM architecture. Using the pretrained attention mechanisms of the LLaMA 3.2 11B Vision model through transfer learning also allowed model outputs to provide a customized text analysis with additional reasoning to complement the corresponding classification predictions. Example outputs of the U-Net fused MLLM architecture are displayed below and highlight the contextual depth of output responses relative to the baseline MLLM architecture and associated prompts.

Fusion MLLM example outputs:

  • The diagnosis disease is Glaucoma
  • The diagnosis disease is Glaucoma Prevention and Treatment Disease Glaucoma is a common chronic disease caused by high intraocular pressure, which can lead to optic nerve damage and vision loss. The diagnosis disease is Glaucoma.

Baseline MLLM outputs (corresponding to the fusion MLLM examples above):

  • The diagnosis disease is Diabetic Retinopathy
  • The diagnosis disease is Glaucoma
  • The diagnosis disease is AR Retinitis Pigmentosa, which is associated with the ZNF408 gene, Glu536Stop GAG>TAG alleles, and AR inheritance mode.

Output validation included an ablation study that assessed the disease-category classification accuracy of the fusion architecture under various training scenarios relative to the baseline MLLM, which was fine-tuned on the same datasets without integration of U-Net convolutional features. The fusion model incorporated transfer learning: it was initialized with the parameters of the U-Net and the MLLM after each had been trained in isolation. This initialization was intended to demonstrate the fusion effects directly and to account for the different training time requirements of the two models. The ablation analysis included variants in which parts of the architecture were frozen. The results showed that the whole architecture must be fine-tuned, since classification performance with any portion excluded from gradient updates showed no improvement over the baseline MLLM.
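One way to picture the ablation grid is to enumerate every freeze/train combination of the two pretrained backbones. The sketch below is an assumption about how such a grid could be organized, not a record of the runs actually performed. The names unet, mllm, and fusion_blocks are hypothetical, and the fusion blocks are assumed trainable in every run.

```python
from itertools import product
from typing import Dict, Iterator

BACKBONES = ("unet", "mllm")  # pretrained parts that may be frozen (hypothetical names)


def ablation_configs() -> Iterator[Dict[str, bool]]:
    """Yield one trainable/frozen map per ablation run.

    True means the component receives gradient updates. The fusion blocks
    are assumed to be trainable in every run.
    """
    for flags in product((True, False), repeat=len(BACKBONES)):
        config = dict(zip(BACKBONES, flags))
        config["fusion_blocks"] = True
        yield config


configs = list(ablation_configs())
for config in configs:
    frozen = [name for name, trainable in config.items() if not trainable]
    print("frozen:", ", ".join(frozen) if frozen else "none (full fine-tuning)")
```

In PyTorch, a component marked as frozen would typically have requires_grad_(False) called on its module before the optimizer is built.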

The data used to train this model included over 1000 images spanning two acquired disease types and five inherited disease types, many of which were accompanied by patient clinical text. Additional data included binary vessel segmentation masks for the inherited disease types, which were provided by ophthalmology specialists from the UMKC School of Medicine and used to fine-tune the U-Net architecture for segmentation. These disease types were chosen because of their prevalence among common retinal conditions, which increases the model’s potential for utility in a clinical setting. The model’s performance further supports its applicability.

Computing resources for this project comprised 3 nodes containing a total of 12 GPUs. Nodes 1 and 2 were each composed of 4 NVIDIA L40S GPUs with a 64-core Intel Xeon Platinum 8562Y CPU. Node 3 incorporated 4 NVIDIA A40 GPUs with a 32-core Intel Xeon Gold 6326 CPU. The overall model was developed in PyTorch, and an AdamW optimizer was selected for parameter optimization of both the U-Net and MLLM architectures during training. Replicability and reproducibility will be encouraged through open publication of all development and training methodologies. Code architectures, along with application requirements and instructions, will be shared through an open-source GitHub repository.
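For readers unfamiliar with AdamW, the sketch below writes out one update for a single scalar parameter in plain Python. The submission does not state the hyperparameters used, so the values here are assumptions (the defaults mirror those documented for PyTorch's AdamW), and the quadratic objective is a toy stand-in for a training loss.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter; `step` counts from 1.

    Returns the updated (param, m, v).
    """
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adamw_step(w, 2.0 * (w - 3.0), m, v, t, lr=0.01)
print(f"w after 2000 steps: {w:.3f}")
```

The weight-decay term pulls the parameter slightly toward zero, so the toy run settles close to, but not exactly at, the unregularized minimum of 3.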

Impact

This research project made a meaningful contribution to both computer science and healthcare, particularly within ophthalmology, by introducing a novel model architecture that improves on baselines in multimodal AI diagnostics with multimodal large language models (MLLMs). We developed a hybrid architecture called MDF-MLLM (Multi-Depth Fusion Multimodal Large Language Model), which fuses a U-Net-based segmentation pipeline with a vision-instructed LLaMA 3.2. The model combines detailed visual features extracted at multiple depths of the U-Net with the reasoning abilities of a transformer-based MLLM. This is done via custom cross-attention fusion and feature-wise linear modulation (FiLM) to enable deeper semantic understanding during diagnosis from retinal fundus images.
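FiLM itself is a simple operation: a conditioning vector is mapped to one scale (gamma) and one shift (beta) per channel, and each feature channel is transformed as gamma * x + beta. The sketch below shows that operation in plain Python. The helper names, the fixed weights, and the idea of using a pooled MLLM output as the conditioning vector are assumptions for illustration. In a trained model the weights of the affine maps would be learned.

```python
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # C x H x W


def linear(vec: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Affine map used here to predict gamma or beta from the conditioning vector."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features: FeatureMap, gamma: Sequence[float],
         beta: Sequence[float]) -> FeatureMap:
    """Scale and shift every value in channel c by gamma[c] and beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one (gamma, beta) pair per channel")
    return [
        [[g * x + b for x in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Hypothetical conditioning vector, standing in for a pooled MLLM embedding.
cond = [1.0, -1.0]

# Hypothetical fixed weights; these would be learned parameters in practice.
gamma = linear(cond, [[0.5, 0.0], [0.0, 0.5]], [1.0, 1.0])
beta = linear(cond, [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])

# Toy U-Net skip feature map with 2 channels on a 2 x 2 grid.
skip = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
modulated = film(skip, gamma, beta)
```

Because gamma and beta depend only on the conditioning vector, the same modulation is applied at every spatial position within a channel.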

The model enables more accurate disease detection in clinical practice using just retinal fundus scans, optionally enhanced by text-based patient history. This provides an auxiliary diagnostic tool for ophthalmologists and clinicians that can flag potentially abnormal cases, triage patients, and suggest probable disease classes in an explainable manner. The model can also operate with minimal input (just an image and optional patient metadata), making it a viable candidate for use in low-resource or underserved healthcare settings. In regions with limited access to ophthalmology specialists, the model can act as a co-pilot or decision-support system to assist general practitioners or field health workers with early detection of vision-threatening diseases.

The generative nature of the MLLM ensures that its outputs are not just classification labels, but detailed assessments that include diagnostic reasoning. This directly addresses one of the key challenges in AI-assisted healthcare: the lack of explainability in predictions. MDF-MLLM generates human-readable text that explains why a certain disease label was selected, citing features in the image or correlating them with patient background, which is especially important in clinical contexts where trust and interpretability are essential.

From a technical perspective, the project contributes to the scientific community by demonstrating that multi-depth visual fusion with an MLLM can mitigate a design limitation in current vision-language models: their inability to capture fine-grained spatial features. Many current MLLMs overemphasize global context at the expense of local detail. By explicitly injecting skip-level features from a U-Net at key cross-attention layers of the MLLM, this project offers a generalizable design pattern for enhancing low-level visual grounding in multimodal reasoning. This can be extended beyond ophthalmology to other domains of medicine that rely on complex imaging data, such as radiology, pathology, and dermatology, making it a broadly impactful development for biomedical AI.

Considerations

The project was completed during the proposed period. As model design and experiments evolved, a strategic shift was made to maximize impact. The original proposal focused on improving the segmentation of retinal scans by enhancing the U-Net design with cues from a language model. However, as we began evaluating the broader diagnostic utility and limitations of the data, the focus shifted toward enhancing disease detection and explanation capabilities of the language model itself, leveraging U-Net to improve its visual reasoning.

This pivot responded to an emerging understanding that segmentation research has reached near-ceiling performance with low clinical interpretability, while an LLM’s capacity for accurate and explainable diagnostic predictions remains an open challenge. The revised approach allowed development of a system with more direct clinical relevance and more significant downstream effects for patient triage and early diagnosis.

While public datasets were utilized, several limitations were apparent. 

  • StoneRounds: has patient history but no segmentation masks.
  • FIVES & HRF: have segmentation masks but no patient history.

To address this, a team of domain experts created the missing segmentation masks where appropriate. The lack of clinical history for many samples was treated not as a limitation but as a training condition: the model was trained to infer from image-only inputs as well as image-text pairs, to enhance robustness and generalizability when data are scarce.

Research Quality

All datasets used are peer-reviewed and curated medical samples or cases. Additional segmentation maps were manually created by trained medical personnel to supplement missing annotations.

During model training, we employed controlled prompts to standardize how textual information was introduced to the MLLM, and pretrained checkpoints were leveraged to initialize both the vision and language components before joint fine-tuning.
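A controlled prompt can be as simple as a fixed template that is filled in the same way for every sample, including samples with no clinical text. The sketch below illustrates the idea only. The template wording, the function name build_prompt, and the placeholder used for missing history are hypothetical and are not the prompts used in this project.

```python
from typing import Optional

# Hypothetical instruction text; the project's actual prompt wording is not given here.
INSTRUCTION = "Review the retinal fundus image and state the diagnosis."


def build_prompt(clinical_text: Optional[str] = None) -> str:
    """Return a fixed-format prompt, with or without clinical history."""
    history = (clinical_text or "").strip()
    if not history:
        # Image-only samples get the same structure with an explicit placeholder.
        history = "Not provided."
    return f"Clinical history: {history}\n{INSTRUCTION}"


# Invented example text, used only to show the two cases.
print(build_prompt("  Night blindness since childhood. "))
print(build_prompt())
```

Keeping the structure identical for both cases means the model sees the same layout whether or not patient history is available.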

We conducted ablation studies to assess the individual contributions of U-Net features and tested multiple fusion configurations (e.g., freezing/unfreezing parameters) to evaluate the sensitivity of each module to training dynamics. Quality was further assured by comparing fusion-based performance against baselines, using disease classification accuracy as the benchmark. Visual inspection of outputs also confirmed that hallucinations and off-target generations occurred at similar rates in the baseline and fused designs.
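Because the model produces free text, classification accuracy has to be computed from a label extracted from each generated response. The sketch below shows one possible way to do that, assuming responses follow the "The diagnosis disease is ..." pattern seen in the example outputs. The function names, the class list, and the target labels are hypothetical. The targets are invented for the demonstration and are not the true labels of any sample.

```python
from typing import Optional, Sequence

PREFIX = "The diagnosis disease is "


def extract_label(text: str, classes: Sequence[str]) -> Optional[str]:
    """Return the known class named right after PREFIX, or None if absent."""
    start = text.find(PREFIX)
    if start == -1:
        return None
    rest = text[start + len(PREFIX):].lower()
    # Try longer names first so a short class name cannot shadow a longer one.
    for name in sorted(classes, key=len, reverse=True):
        if rest.startswith(name.lower()):
            return name
    return None


def accuracy(outputs: Sequence[str], targets: Sequence[str],
             classes: Sequence[str]) -> float:
    """Fraction of outputs whose extracted label equals the target label."""
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    hits = sum(extract_label(o, classes) == t for o, t in zip(outputs, targets))
    return hits / len(targets)


CLASSES = ["Glaucoma", "Diabetic Retinopathy", "AR Retinitis Pigmentosa"]
outputs = [
    "The diagnosis disease is Diabetic Retinopathy",
    "The diagnosis disease is Glaucoma",
    "The diagnosis disease is AR Retinitis Pigmentosa, which is associated with the ZNF408 gene",
]
targets = ["Glaucoma", "Glaucoma", "AR Retinitis Pigmentosa"]  # invented for the demo

print(f"accuracy: {accuracy(outputs, targets, CLASSES):.2f}")
```

Responses without a recognizable label are counted as incorrect, which keeps off-target generations from inflating the score.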

Research Challenges (non-scored criteria)

The data challenges related to generalizability and access. The FIVES and HRF image datasets were designed primarily for retinal vessel segmentation, and the only other information they provide is a disease diagnosis. They were not well suited to language model development, but specialized prompt engineering was designed to enhance their utility. The StoneRounds data had a more general framework, containing clinical text and multiple image types collected from individual patients over an extended time. However, it lacked the segmentation masks that were critical to this research. Medical professionals at the UMKC School of Medicine manually created these masks, without which the StoneRounds data would have been unusable. In terms of access, the FIVES and HRF images were easily downloadable. The structure of the StoneRounds.org website, however, allowed only individual images to be downloaded, which made data collection laborious.

Supporting Documents
Provide resources for the evaluation of your secondary research project including but not limited to (up to 10 URLs):

  • Publicly available outputs of the secondary analysis including results, methods, conclusions and relevant metadata
  • The persistent identifier of the datasets used and generated
  • Standards, tools, and metadata associated with the implementation outputs
  • Articles, preprints, or scientific publications on the project
  • Other relevant related resource