Multidimensional evaluation of large language models in radiology report readability
npj Digital Medicine, Published online: 01 April 2026; doi:10.1038/s41746-026-02589-3
Abstract
This study systematically investigated the influence of demographic characteristics on the readability of patient-centered radiology reports and compared the performance of different large language models (LLMs) in generating such reports. Using a sequential two-stage design, the research first conducted a retrospective evaluation of 320 radiology reports, followed by validation in a clinical setting with 800 patients. Results suggested that all three LLMs significantly improved the readability of radiology reports (P < 0.05), with DeepSeek-R1 showing potentially superior performance within this specific cohort. Demographic analysis revealed significant interaction effects: higher education and older age (within consistent educational levels) were associated with better comprehension. The clinical setting validation further indicated that reading simplified reports has the potential to significantly improve patients’ subjective and objective comprehension while significantly alleviating medical anxiety (P < 0.05). However, limitations persist, including inconsistent model outputs, missing anatomical details, and comprehension variances driven by demographic factors. Consequently, LLMs should be integrated as auxiliary communication tools for radiologists rather than as standalone solutions, and their use calls for personalized interventions tailored to specific demographic profiles.
Data availability
The datasets generated or analyzed during the study are available from the corresponding author on reasonable request.
Acknowledgements
We gratefully acknowledge the Radiology Department of the Third Hospital of Jilin University for their support of this research, and Professor Mengchao Zhang on the research team.
Author information
Author notes
- These authors contributed equally: Yunhai Mao, Chunyan Wang.
Authors and Affiliations
- Department of Radiology, the Third Hospital of Jilin University, Changchun, China
Yunhai Mao, Chunyan Wang, Yuxin Li, Wei Wang & Mengchao Zhang
Contributions
M.Z. conceptualized the study, performed formal analysis and investigation, and was responsible for project administration and supervision. Y.M. and C.W. (equal contributors) contributed to data curation, formal analysis, methodology, validation, visualization, and wrote the original draft and revised the manuscript. Y.L. contributed to data curation, methodology, and visualization. W.W. contributed to methodology and writing the original draft. All authors have read and approved the manuscript.
Corresponding author
Correspondence to Mengchao Zhang.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mao, Y., Wang, C., Li, Y. et al. Multidimensional evaluation of large language models in radiology report readability. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02589-3
- Received: 15 December 2025
- Accepted: 17 March 2026
- Published: 01 April 2026
- DOI: https://doi.org/10.1038/s41746-026-02589-3