MDPipe

Figure 1. Overview of the MDPipe multi-modal diagnostic pipeline for ocular surface disease diagnosis using LLMs.

TL;DR

The problem: Accurate ocular surface disease diagnosis requires integrating clinical metadata and meibography imaging. Traditional assessments lack precision, while machine-based methods treat diagnoses as closed-set classification without clinical reasoning.
Our approach: MDPipe employs a visual translator to convert meibography images into quantifiable morphology data, then uses an LLM-based summarizer to generate clinical report summaries from the combined data.
The payoff: MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for its diagnoses, as validated by a five-clinician preference study.

Abstract

Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology, and it hinges on integrating clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnoses as multi-class classification problems, limiting the diagnoses to a predefined closed set of curated answers without reasoning about the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) that employs large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce an LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata and to generate clinical report summaries. Finally, we refine the LLMs' reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses.

Key Findings

Visual Translator Bridges MLLMs and Clinical Imaging

Current MLLMs fail to process specialized medical images like meibography. Our visual translator converts these images into quantifiable MG morphology data, enabling effective communication with LLMs.

LLM-Based Summarizer for Clinical Context

An LLM-based summarizer generates Q&A clinical reports from combined morphology and clinical metadata, contextualizing the insight and enhancing LLMs' learning capability for diagnosis.

Outperforms GPT-4 with Clinician Validation

MDPipe surpasses GPT-4 across all ocular surface disease benchmarks. Five clinicians rated MDPipe higher in clinical accuracy, diagnostic completeness, rationale, and robustness.

MLLMs Have Shortcomings? We Apply a Visual Translator

Current multimodal large language models (MLLMs) struggle to process specialized medical visual data such as meibography images. Our visual translator is designed to interpret visual data by converting them into quantifiable meibomian gland (MG) morphology data, facilitating their integration with clinical metadata.

Figure 2. (a) Limitations of current MLLMs in processing visual data. (b) Our visual translator V is designed to interpret visual data I by converting them into quantifiable MG morphology data.
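
To make "quantifiable MG morphology data" concrete, here is a minimal sketch of the kind of numbers such a step could emit. The source does not specify how the visual translator V is implemented or which morphology features it produces, so everything below is an assumption for illustration: the input is taken to be an already-segmented binary gland mask, and the function name `gland_morphology` and the three features are hypothetical.

```python
from collections import deque


def gland_morphology(mask):
    """Summarize a binary gland mask (rows of 0/1) as simple morphology numbers.

    Hypothetical feature set; the real visual translator may output different data.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected gland pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return {
        "gland_count": len(areas),
        "gland_area_fraction": sum(areas) / (rows * cols),
        "mean_gland_area_px": sum(areas) / len(areas) if areas else 0.0,
    }


# Toy 4x4 mask (made-up values, not real meibography data).
toy_mask = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
features = gland_morphology(toy_mask)
```

The point of the sketch is the output shape: a flat set of named numbers that can sit alongside clinical metadata and be passed to an LLM as text, which an image cannot.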

LLM-based Clinical Report Summarizer

We employ an LLM-based summarizer (via GPT-4) to generate Q&A clinical reports that contextualize insights from both the non-narrative clinical metadata and the MG morphology data, enhancing the LLMs' learning capability.

Figure 3. The LLM-based clinical report summarizer contextualizes insights from combined MG morphology and clinical metadata to generate comprehensive Q&A clinical reports.
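
As a rough illustration of the input side of this step, the sketch below renders combined morphology and metadata fields as a plain-text prompt requesting a Q&A-style report. The source does not give the actual prompt, field names, or report format, so `build_summarizer_prompt`, the field names, and the values are all hypothetical placeholders, and no call to GPT-4 is made here.

```python
def build_summarizer_prompt(morphology, metadata):
    """Render structured findings as a plain-text prompt asking for a Q&A report.

    Hypothetical prompt layout; the wording used in MDPipe is not given in the source.
    """
    lines = ["Structured findings for one patient.", "", "MG morphology:"]
    lines += [f"- {name}: {value}" for name, value in sorted(morphology.items())]
    lines += ["", "Clinical metadata:"]
    lines += [f"- {name}: {value}" for name, value in sorted(metadata.items())]
    lines += [
        "",
        "Summarize these findings as a clinical report written as question-and-answer pairs.",
    ]
    return "\n".join(lines)


# Placeholder values for illustration only (not patient data).
prompt = build_summarizer_prompt(
    {"gland_count": 3, "gland_area_fraction": 0.375},
    {"age_years": 42, "contact_lens_wearer": "yes"},
)
```

The resulting string would then be sent to whichever LLM acts as the summarizer; that call is left out because its details are not described in the source.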

User (Clinician) Preference Study: MDPipe vs GPT-4

Five clinicians, masked as to which model produced each output, read and rated the two models' outputs on a scale from 1 (poor) to 5 (best) on four criteria: 1) clinical accuracy, 2) diagnostic completeness, 3) diagnostic rationale, and 4) the model's robustness in handling ambiguous or incomplete patient data.

Figure 4. Comparative evaluation and clinician study between MDPipe and GPT-4, showing MDPipe is preferred across all four evaluation criteria.
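
A masked rating study like this one is typically summarized by averaging scores per model and criterion. The sketch below shows one way to do that; it is an assumption about the bookkeeping, not the authors' analysis code. The record layout, the masked labels "A" and "B", and the sample scores are made up for illustration and are not results from the study.

```python
from statistics import mean

CRITERIA = (
    "clinical_accuracy",
    "diagnostic_completeness",
    "diagnostic_rationale",
    "robustness",
)


def mean_scores(ratings):
    """Average 1-5 ratings per (masked model label, criterion).

    ratings: iterable of (rater_id, model_label, criterion, score) tuples.
    """
    buckets = {}
    for _rater, model, criterion, score in ratings:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets.setdefault((model, criterion), []).append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up ratings from two hypothetical raters; models stay masked as "A" and "B".
toy_ratings = [
    ("r1", "A", "clinical_accuracy", 4),
    ("r2", "A", "clinical_accuracy", 5),
    ("r1", "B", "clinical_accuracy", 3),
    ("r2", "B", "clinical_accuracy", 2),
]
summary = mean_scores(toy_ratings)
```

Keeping the model labels masked in the data until after aggregation mirrors the masking described above, so the averages are computed without knowing which system is which.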

BibTeX

@inproceedings{yeh2024insight,
  title={Insight: A Multi-modal Diagnostic Pipeline Using LLMs for Ocular Surface Disease Diagnosis},
  author={Yeh, Chun-Hsiao and Wang, Jiayun and Graham, Andrew D and Liu, Andrea J and Tan, Bo and Chen, Yubei and Ma, Yi and Lin, Meng C},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={711--721},
  year={2024},
  organization={Springer}
}
