Multidimensional evaluation of large language models in radiology report readability

nature.com, by Yunhai Mao, April 1, 2026

npj Digital Medicine. Published online: 01 April 2026. doi:10.1038/s41746-026-02589-3

Abstract

This study systematically investigated the influence of demographic characteristics on the readability of patient-centered radiology reports and compared the performance of different large language models (LLMs) in generating such reports. Adopting a sequential two-stage design, the research first conducted a retrospective evaluation of 320 radiology reports, followed by a clinical setting validation with 800 patients. Results suggested that all three LLMs significantly improved the readability of radiology reports (P < 0.05), with DeepSeek-R1 showing potentially superior performance within this specific cohort. Demographic analysis revealed significant interactive effects: higher education and older age (within consistent educational levels) were associated with better comprehension. Clinical setting validation further indicated that reading simplified reports may significantly improve patients’ subjective and objective comprehension while significantly alleviating medical anxiety (P < 0.05). However, limitations persist, including inconsistent model outputs, missing anatomical details, and comprehension variances driven by demographic factors. Consequently, LLMs should be integrated as auxiliary communication tools for radiologists rather than as standalone solutions, and their use calls for personalized interventions tailored to specific demographic profiles.

Data availability

The datasets generated or analyzed during the study are available from the corresponding author on reasonable request.

Acknowledgements

We gratefully acknowledge the Radiology Department of the Third Hospital of Jilin University for its support of this research, and Professor Mengchao Zhang of the research team.

Author information

Author notes

These authors contributed equally: Yunhai Mao, Chunyan Wang.

Authors and Affiliations

Department of Radiology, the Third Hospital of Jilin University, Changchun, China

Yunhai Mao, Chunyan Wang, Yuxin Li, Wei Wang & Mengchao Zhang

Contributions

M.Z. conceptualized the study, performed formal analysis and investigation, and was responsible for project administration and supervision. Y.M. and C.W. (equal contributors) contributed to data curation, formal analysis, methodology, validation, visualization, and wrote the original draft and revised the manuscript. Y.L. contributed to data curation, methodology, and visualization. W.W. contributed to methodology and writing the original draft. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Mengchao Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Mao, Y., Wang, C., Li, Y. et al. Multidimensional evaluation of large language models in radiology report readability. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02589-3

Received: 15 December 2025
Accepted: 17 March 2026
Published: 01 April 2026
DOI: https://doi.org/10.1038/s41746-026-02589-3

Original source

nature.com

https://www.nature.com/articles/s41746-026-02589-3
