Integrating large language models for enhanced predictive analytics in healthcare
npj Digital Medicine, Published online: 02 April 2026; doi:10.1038/s41746-026-02572-y
References
- Woolf, S. H. et al. Promoting informed choice: transforming health care to dispense knowledge for decision making (2005).
- Kaur, S. et al. Medical diagnostic systems using artificial intelligence (AI) algorithms: principles and perspectives. IEEE Access 8, 228049–228069 (2020).
- Graber, M. L. The incidence of diagnostic error in medicine. BMJ Qual. Saf. 22, ii21–ii27 (2013).
- Stern, S. D. Symptom to Diagnosis an Evidence-Based Guide (McGraw-Hill Education, 2010).
- Achour, S. L., Dojat, M., Rieux, C., Bierling, P. & Lepage, E. A UMLS-based knowledge acquisition tool for rule-based clinical decision support system development. J. Am. Med. Inform. Assoc. 8, 351–360 (2001).
- Papadopoulos, P., Soflano, M., Chaudy, Y., Adejo, W. & Connolly, T. M. A systematic review of technologies and standards used in the development of rule-based clinical decision support systems. Health Technol. 12, 713–727 (2022).
- Riley, R. D. & Collins, G. S. Stability of clinical prediction models developed using statistical or machine learning methods. Biometrical J. 65, 2200302 (2023).
- Eloranta, S. & Boman, M. Predictive models for clinical decision making: Deep dives in practical machine learning. J. Intern. Med. 292, 278–295 (2022).
- Shouval, R. et al. Application of machine learning algorithms for clinical predictive modeling: a data-mining approach in SCT. Bone Marrow Transplant. 49, 332–337 (2014).
- Zhong, Z. et al. ABN-BLIP: Abnormality-aligned bootstrapping language-image pre-training for pulmonary embolism diagnosis and report generation from CTPA. Med. Image Anal. 107, 103786 (2026).
- Giesa, N. et al. Applying a transformer architecture to intraoperative temporal dynamics improves the prediction of postoperative delirium. Commun. Med. 4, 251 (2024).
- Xu, Y., Xu, S., Ramprassad, M., Tumanov, A. & Zhang, C. TransEHR: Self-supervised transformer for clinical time series data. In Machine Learning for Health (ML4H), 623–635 (PMLR, 2023).
- Oh, J., Wang, J. & Wiens, J. Learning to exploit invariances in clinical time-series data using sequence transformer networks. In Machine Learning for Healthcare Conference, 332–347 (PMLR, 2018).
- Guo, H. et al. A multitask framework for automated interpretation of multi-frame right upper quadrant ultrasound in clinical decision support. arXiv preprint arXiv:2601.12174 (2026).
- Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (2019).
- Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
- Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. et al. Improving language understanding by generative pre-training. arXiv preprint (2018).
- Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019).
- Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
- Yang, X. et al. GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. arXiv preprint arXiv:2203.03540 (2022).
- Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).
- Chen, C. et al. Integration of large language models and federated learning. Patterns 5 (2024).
- Kokash, N. et al. Ontology- and LLM-based data harmonization for federated learning in healthcare. arXiv preprint arXiv:2505.20020 (2025).
- Nascimento, L. et al. Federated large language models in healthcare: a systematic review, opportunities and challenges. Eng. Archive (2025).
- Nguyen, D.-T. et al. Federated learning for renal tumor segmentation and classification on multi-center mri dataset. J. Magn. Reson. Imaging 62, 814–824 (2025).
- Floridi, L. & Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).
- Pan, T., Shen, J. & Xu, M. Enhancing the performance of neurosurgery medical question-answering systems using a multi-task knowledge graph-augmented answer generation model. Front. Neurosci. 19, 1606038 (2025).
- Xu, L. et al. End-to-end knowledge-routed relational dialogue system for automatic diagnosis. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 7346–7353 (2019).
- Liu, W. et al. MedDG: A large-scale medical consultation dataset for building medical dialogue system. arXiv preprint (2020).
- Martino, A., Iannelli, M. & Truong, C. Knowledge injection to counter large language model (LLM) hallucination. In European Semantic Web Conference, 182–185 (Springer, 2023).
- Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
- Touvron, H. et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).
- Kirk, H. R., Vidgen, B., Röttger, P. & Hale, S. A. The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nat. Mach. Intell. 6, 383–392 (2024).
- Sutton, N. R. et al. Coronary artery disease evaluation and management considerations for high-risk occupations: commercial vehicle drivers and pilots. Circ. Cardiovasc. Interv. 14, e009950 (2021).
- Righini, M. et al. The simplified pulmonary embolism severity index (PESI): validation of a clinical prognostic model for pulmonary embolism. J. Thromb. Haemost. 9, 2115–2117 (2011).
- Budoff, M. J. et al. Ten-year association of coronary artery calcium with atherosclerotic cardiovascular disease (ASCVD) events: the Multi-Ethnic Study of Atherosclerosis (MESA). Eur. Heart J. 39, 2401–2408 (2018).
- Guo, D. et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
- Team, G. et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025).
- Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
- Toma, A. et al. Clinical Camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031 (2023).
- Zhao, L. et al. Artificial intelligence-based lesion characterization and outcome prediction of prostate cancer on [18F]DCFPyL PSMA imaging. Radiother. Oncol. 111265 (2025).
- Wu, J., Roy, J. & Stewart, W. F. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med. Care 48, S106–S113 (2010).
- Bernstein, I. A. et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw. Open 6, e2330320–e2330320 (2023).
- Xu, F. et al. Are large language models really good logical reasoners? A comprehensive evaluation and beyond. IEEE Trans. Knowl. Data Eng. (2025).
- Wang, C. et al. Survey on factuality in large language models. ACM Comput. Surv. 58, 1–37 (2025).
- Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
- Shamout, F., Zhu, T. & Clifton, D. A. Machine learning for clinical outcome prediction. IEEE Rev. Biomed. Eng. 14, 116–126 (2020).
- Kim, J. I. et al. Machine learning for antimicrobial resistance prediction: current practice, limitations, and clinical perspective. Clin. Microbiol. Rev. 35, e00179–21 (2022).
- Rajkomar, A., Dean, J. & Kohane, I. Machine learning in medicine. N. Engl. J. Med. 380, 1347–1358 (2019).
- Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA 319, 1317–1318 (2018).
- Perez, E., Kiela, D. & Cho, K. True few-shot learning with language models. Adv. Neural Inf. Process. Syst. 34, 11054–11070 (2021).
- Zhang, C., Morris, J. X. & Shmatikov, V. Extracting prompts by inverting llm outputs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 14753–14777 (2024).
- Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inform. Syst. 43, 1–55 (2025).
- Mahajan, A., Obermeyer, Z., Daneshjou, R., Lester, J. & Powell, D. Cognitive bias in clinical large language models. npj Digital Med. 8, 428 (2025).
- Suenghataiphorn, T., Tribuddharat, N., Danpanichkul, P. & Kulthamrongsri, N. Bias in large language models across clinical applications: A systematic review. arXiv preprint arXiv:2504.02917 (2025).
- Hsu, W.-C. et al. MRI-based ovarian lesion classification via a foundation segmentation model and multimodal analysis: A multicenter study. Radiology 316, e243412 (2025).
- Wu, J. et al. Vision-language foundation model for 3d medical imaging. npj Artif. Intell. 1, 17 (2025).
- Zhong, Z. et al. Vision-language model for report generation and outcome prediction in CT pulmonary angiogram. npj Digital Med. 8, 432 (2025).
- Huang, Z. et al. A pathologist–ai collaboration framework for enhancing diagnostic accuracies and efficiencies. Nat. Biomed. Eng. 9, 455–470 (2025).
- Huang, X. et al. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024).
- Zhao, A. et al. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 19632–19642 (2024).
- Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J. & Fernández-Leal, Á. Human-in-the-loop machine learning: a state of the art. Artif. Intell. Rev. 56, 3005–3054 (2023).
- Cook, R. J., Zeng, L. & Yi, G. Y. Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics 60, 820–828 (2004).
- Xue, H. & Salim, F. D. PromptCast: A new prompt-based learning paradigm for time series forecasting. IEEE Trans. Knowl. Data Eng. 36, 6851–6864 (2023).
- Liu, H., Zhao, Z., Wang, J., Kamarthi, H. & Prakash, B. A. LSTPrompt: Large language models as zero-shot time series forecasters by long-short-term prompting. In Findings of the Association for Computational Linguistics: ACL 2024, 7832–7840 (2024).
- Moon, H. C., Joty, S. & Chi, X. GradMask: Gradient-guided token masking for textual adversarial example detection. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3603–3613 (2022).
- Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
- Dwivedi, A. K., Mallawaarachchi, I. & Alvarado, L. A. Analysis of small sample size studies using nonparametric bootstrap test with pooled resampling method. Stat. Med. 36, 2187–2205 (2017).
- Tong, X. et al. A novel subpixel phase correlation method using singular value decomposition and unified random sample consensus. IEEE Trans. Geosci. Remote Sens. 53, 4143–4156 (2015).
- Naidu, K., Beenen, E., Gananadha, S. & Mosse, C. The yield of fever, inflammatory markers and ultrasound in the diagnosis of acute cholecystitis: a validation of the 2013 Tokyo Guidelines. World J. Surg. 40, 2892–2897 (2016).