Artificial intelligence-driven clinical guideline recommendations in maternal care: How trustworthy are they?

Jairo J. Pérez, Andrés F. Giraldo-Forero, Santiago Rúa, Daniel Betancur, Zuliany Urquina, Pablo Castañeda, Sara Arango-Valencia, Juan Guillermo Barrientos-Gómez, Ever A. Torres-Silva, Andrés Orozco-Duque

Keywords: Clinical guidelines as topic, maternal healthcare services, artificial intelligence, large language models, natural language processing

Abstract

Introduction. Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines.
Objectives. To assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics.
Material and methods. A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer–concept ranking and retrieval-augmented generation assessment metrics, judged by two independent large language models.
Results. Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4o evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest across both judges (0.94 and 0.86).
Conclusions. Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.
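
The two kinds of score reported above can be illustrated with a minimal sketch. It assumes that physician-assessed accuracy is the fraction of answers judged correct, and that faithfulness follows the RAGAS-style definition (claims in an answer supported by the retrieved context, divided by all claims in that answer). The function names and the verdict lists are hypothetical and are not study data or the authors' implementation.

```python
from typing import Sequence


def binary_accuracy(judgments: Sequence[bool]) -> float:
    """Fraction of answers marked correct by the reviewer."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """RAGAS-style faithfulness for one answer: supported claims / total claims."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical physician verdicts for ten questions (illustrative only).
physician_verdicts = [True] * 9 + [False]
accuracy = binary_accuracy(physician_verdicts)

# Hypothetical judge verdicts for the claims extracted from a single answer.
judge_verdicts = [True, True, True, False]
answer_faithfulness = faithfulness(judge_verdicts)

print(f"accuracy={accuracy:.2f}, faithfulness={answer_faithfulness:.2f}")
```

Under these assumptions, an accuracy of 0.90 over ten questions would correspond to nine answers judged correct. In an automated setup, the per-claim verdicts would come from the judging language model rather than from a hand-written list.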

How to Cite

Pérez JJ, Giraldo-Forero AF, Rúa S, Betancur D, Urquina Z, Castañeda P, et al. Artificial intelligence-driven clinical guideline recommendations in maternal care: How trustworthy are they? Biomed. [Internet]. 2025 Dec. 10 [cited 2026 Jan. 12];45(Sp. 3):37-51. Available from: https://revistabiomedicaorg.biteca.online/index.php/biomedica/article/view/7902

Published
2025-12-10