Artificial intelligence-driven clinical guideline recommendations in maternal care: How trustworthy are they?
Abstract
Introduction. Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines.
Objectives. To assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics.
Material and methods. A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer–concept ranking and retrieval-augmented generation assessment metrics, judged by two independent large language models.
Results. Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4o evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest across both judges (0.94 and 0.86).
Conclusions. Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.
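The three scores reported above can be illustrated with a minimal sketch. It assumes the usual RAGAS-style definitions: faithfulness as the share of answer statements supported by the retrieved context, and answer relevance as the mean cosine similarity between the original question and questions regenerated from the answer. The function names and toy inputs are hypothetical, and the large language model judging steps used in the study are not reproduced here; the judgments and embeddings are taken as given.

```python
from math import sqrt


def binary_accuracy(judgments):
    """Fraction of answers marked correct (1) rather than incorrect (0)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


def faithfulness(statement_supported):
    """Share of an answer's statements judged as supported by the context."""
    if not statement_supported:
        return 0.0
    supported = sum(1 for flag in statement_supported if flag)
    return supported / len(statement_supported)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevance(question_vec, regenerated_question_vecs):
    """Mean cosine similarity between the question and regenerated questions."""
    if not regenerated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, vec) for vec in regenerated_question_vecs]
    return sum(sims) / len(sims)


# Toy inputs (hypothetical): nine of ten answers judged correct.
accuracy_example = binary_accuracy([1] * 9 + [0])
# Three of four statements supported by the retrieved passages.
faithfulness_example = faithfulness([True, True, True, False])
# Two-dimensional stand-ins for real sentence embeddings.
relevance_example = answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With ten questions, an accuracy of 0.90 corresponds to nine answers judged correct, which is the first toy case.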
Some similar items:
- María Clara Echeverry, Nubia Catalina Tovar, Guillermo Mora, Presence of antibodies to cardiac neuroreceptors in patients with Chagas disease, Biomedica: Vol. 29 No. 3 (2009)
- David Yepes, Francisco Molina, Gloria Ortiz, Ricardo Aguirre, Risk factors associated with the presence of pneumonia in patients with brain injury, Biomedica: Vol. 29 No. 2 (2009)
- Lázaro Vélez, Natalia Loaiza, Lina María Gaviria, María Angélica Maya, Zulma Vanessa Rueda, Luz Teresita Correa, Jorge Ortega, Héctor Ortega, Concordance between two methods of bronchoalveolar lavage for the microbiological diagnosis of pneumonia in mechanically ventilated patients, Biomedica: Vol. 28 No. 4 (2008)
- Larry Niño, Use of the function semivariogram and kriging estimation in the spacial analysis of Aedes aegypti (Diptera: Culicidae) distributions, Biomedica: Vol. 28 No. 4 (2008)
- Carlos Julio Montoya, Zoraída Ramirez, Juan Carlos Cataño, Alejandro Román, María Teresa Rugeles, Effect of opportunistic infections on the frequency of leukocyte subpopulations from type-1 human immunodeficiency virus infected individuals, Biomedica: Vol. 28 No. 1 (2008)
- Guillermo Mora, María Clara Echeverry, Gustavo Enrique Rey, Myriam Consuelo López, Luisa Fernanda Posada, Fabio Aurelio Rivas, Frequency of Trypanosoma cruzi infection in patients with implanted pacemaker, Biomedica: Vol. 27 No. 4 (2007)
- Mario Francisco Guerrero, Elements for the effective evaluation of natural products with possible antihypertensive effects, Biomedica: Vol. 29 No. 4 (2009)
- Angélica Knudson, Rubén Santiago Nicholls, Ángela Patricia Guerra, Ricardo Sánchez, Clinical profiles of patients with uncomplicated Plasmodium falciparum malaria in northwestern Colombia, Biomedica: Vol. 27 No. 4 (2007)
- Juan Carlos Hernández, Carlos Julio Montoya, Silvio Urcuqui-Inchima, The role of toll-like receptors in viral infections: HIV-1 as a model, Biomedica: Vol. 27 No. 2 (2007)
- María Teresa Rugeles, Paula A. Velilla, Carlos J. Montoya, Mechanisms of human natural resistance to HIV: A summary of ten years of research in the Colombian population, Biomedica: Vol. 31 No. 2 (2011)
Copyright (c) 2025 Biomedica

This work is licensed under a Creative Commons Attribution 4.0 International License.










