Evaluation of Medical Large Language Models: Taxonomy, Review, and Directions
Anisio Lacerda, Gisele Pappa, Adriano César Machado Pereira, Wagner Meira Jr, Alexandre Guimarães de Almeida Barros
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Survey Track. Pages 10528-10536.
https://doi.org/10.24963/ijcai.2025/1169
The integration of Large Language Models (LLMs) into medicine presents both great opportunities and significant challenges, particularly in ensuring that these models are accurate, reliable, and safe. While LLMs have shown impressive capabilities in understanding and generating human language, their application in the medical domain requires careful evaluation because medical applications are inherently linked to patient life and health. Current evaluations of LLMs in medicine are often fragmented and incomplete: they lack standardized performance metrics, make limited use of real patient data, and pay too little attention to important applications such as documentation, education, and research. Furthermore, traditional NLP-based evaluations are often inadequate for assessing the text generated by LLMs. A robust evaluation is therefore essential to ensure the responsible and effective use of LLMs in medical settings and to address the challenges associated with their implementation. This paper explores the various dimensions of LLM evaluation in the medical domain, proposes a new taxonomy for categorizing medical applications, and discusses directions for future research in this critical area.
Keywords:
Machine Learning: ML: Evaluation
Machine Learning: ML: Generative models
Multidisciplinary Topics and Applications: MTA: Health and medicine
