JAIT 2025 Vol.16(4): 568-581
doi: 10.12720/jait.16.4.568-581

Multimodal Medical Image Analysis: Integrating LLM and RAG Deep Learning Strategies

Hanrui Yan and Dan Shao *
ASCENDING Inc., Fairfax VA 22031, USA
Email: hyan@asendingdc.com (H.Y.); celeste@ascendingdc.com (D.S.)
*Corresponding author

Manuscript received December 3, 2024; revised December 30, 2024; accepted January 20, 2025; published April 27, 2025.

Abstract—This study explores a method that combines Retrieval-Augmented Generation (RAG), prompt learning for Multimodal Large Language Models (MLLM), and Deep Self-Supervised Learning (DSL) to improve the efficiency and accuracy of medical data management and analysis, particularly in medical image processing and diagnostic tasks. We propose a novel medical MLLM framework that integrates RAG to strengthen knowledge retrieval and improves generation quality through a carefully designed prompt learning mechanism. We also incorporate DSL, which uses self-supervised tasks to uncover critical features in unlabeled medical data and thereby improves the model’s learning capability. The framework is designed to ensure secure training on medical data, and it dynamically adjusts the retrieval context and prompt format to adapt to diverse medical scenarios. We conducted extensive experiments on medical datasets spanning radiology, ophthalmology, and pathology, covering medical Visual Question Answering (VQA) and report generation tasks. The results show that the proposed framework significantly outperforms existing methods in factual accuracy, generation quality, and model adaptability. These findings indicate that combining RAG, MLLM prompt learning, and DSL effectively improves medical data processing performance, and they support the approach's feasibility for secure and efficient data management in medical contexts. The framework offers new ideas and approaches for future medical AI applications and for the intelligent development of the healthcare industry.
 
Keywords—Retrieval-Augmented Generation (RAG), Multimodal Large Language Models (MLLM), Deep Self-Supervised Learning (DSL), medical image processing, prompt learning, medical Visual Question Answering (VQA), medical artificial intelligence

Cite: Hanrui Yan and Dan Shao, "Multimodal Medical Image Analysis: Integrating LLM and RAG Deep Learning Strategies," Journal of Advances in Information Technology, Vol. 16, No. 4, pp. 568-581, 2025. doi: 10.12720/jait.16.4.568-581

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
