
Leveraging Large Language Models for More Efficient Systematic Reviews in Medicine and Beyond

Vector Institute · by Kylie Williams · March 25, 2025 · 6 min read

New work from Vector Applied Scientist David Emerson highlights how generalist large language models (LLMs) may be used to automate systematic review screening. “Development of prompt templates for LLM-driven screening in systematic reviews,” co-authored by Emerson, Christian Cao, Jason Sang, and Rohit Arora, demonstrates how sophisticated prompting strategies can dramatically enhance LLM performance for text classification tasks critical to systematic review processes. By creating generalized prompting templates (as opposed to review-specific solutions), the researchers establish an accessible approach to systematic review automation that delivers significant cost and time savings.

Prior applications of LLMs for systematic review screening and text classification tasks primarily relied on zero-shot prompting methods, which heavily underestimated the true capabilities of LLMs for downstream tasks. As such, LLMs had, to this point, been characterized as incapable of performing these tasks well enough to be used effectively for systematic reviews. By applying prompting best practices, as well as novel prompting techniques, the authors were able to leverage the full capabilities of LLMs.

Background and Motivation

In medicine, systematic reviews (SRs) summarize clinical and research study results, providing evidence of an intervention’s effectiveness. They are the gold standard for evidence-based practice. However, they are exceptionally resource-intensive, typically requiring a year and over $100,000 to complete. The goal of an SR is to sift through large collections of research, identify relevant articles, and synthesize new insights that, as a whole, answer an important and specific question. A simplified example might be, “Does intervention X improve patient outcomes?” The initial screening phase of SRs is particularly demanding, requiring two investigators to independently review articles against eligibility criteria in two stages. First, investigators review article abstracts as a first pass article filter. In the second phase, the investigators review the entire article (‘full-text’) to determine whether the articles meet the predetermined criteria for final inclusion.

Despite existing tools, SR automation remains elusive as current solutions only supplement human workflows, lack necessary performance for independent decision-making, and require extensive historical training data. The emergence of LLMs presents new opportunities for SR screening automation, potentially reducing time and resource requirements significantly.

Methodology

In this work, the researchers developed generic prompt templates for LLM-driven systematic review screening for both abstracts and full-text articles. They first created BenchSR, a database of 10 previously published systematic reviews spanning nine clinical domains, which served as the testing ground. Through iterative experimentation with various prompting techniques, they developed “Framework Chain-of-Thought,” a novel approach that guides LLMs to reason systematically against predefined criteria—mimicking human cognitive processes by evaluating each criterion before making final decisions. For full-text screening, they discovered that repeating prompt instructions at both the beginning and end of prompts significantly improved performance with lengthy documents, effectively addressing the “lost-in-the-middle” phenomenon where LLMs typically struggle with retaining information from the middle of large texts. These insights culminated in two optimized templates: “Abstract ScreenPrompt” for abstract screening and “ISO-ScreenPrompt” (Instruction-Structure-Optimized) for full-text screening, both designed to maximize sensitivity while maintaining acceptable specificity across diverse systematic review types.
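To make the two prompt-design ideas above concrete, here is a minimal sketch of how a screening prompt in this style might be assembled. The criteria text, instruction wording, and function name are illustrative assumptions, not the authors' actual templates; the structure just demonstrates criterion-by-criterion reasoning plus instruction repetition at both ends of the prompt.

```python
# Sketch of a Framework Chain-of-Thought-style screening prompt builder.
# All wording here is hypothetical; the paper's real templates differ.

def build_screening_prompt(abstract: str, criteria: list[str]) -> str:
    """Assemble a prompt asking the model to reason through each
    eligibility criterion before giving a final include/exclude call."""
    criteria_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    instructions = (
        "You are screening articles for a systematic review.\n"
        "Evaluate the abstract against EACH criterion below, one at a time,\n"
        "stating your reasoning, then give a final answer: INCLUDE or EXCLUDE."
    )
    # Repeating the instructions at both the start and the end of the prompt
    # mirrors the paper's fix for the "lost-in-the-middle" effect on long inputs.
    return (
        f"{instructions}\n\nCriteria:\n{criteria_block}\n\n"
        f"Abstract:\n{abstract}\n\n{instructions}"
    )

prompt = build_screening_prompt(
    "We conducted a randomized trial of intervention X in 200 adults...",
    ["Reports a randomized controlled trial", "Studies adult participants"],
)
```

For full-text screening the same idea applies, with the article body in place of the abstract, which is where the instruction repetition matters most.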

Experimental Setup

The study compared multiple prompting approaches including zero-shot, few-shot, and Chain-of-Thought techniques across a wide array of LLMs (GPT-3.5, GPT-4 variants, Gemini Pro, Mixtral, Mistral, and Claude-3.5-Sonnet). Performance was evaluated using accuracy, sensitivity, and specificity metrics. The final human screening decisions from the original systematic reviews served as the gold standard reference against which all model decisions were compared, for both the abstract and full-article screening phases. The experimental protocol followed training-validation-testing workflows: prompts were initially optimized using training samples from a single systematic review (SeroTracker), validated on a separate SeroTracker sample, and finally thoroughly tested on a SeroTracker test set and nine additional systematic review datasets. For abstract screening, researchers examined the complete set of titles/abstracts from original searches, while full-text screening evaluated all freely accessible PubMed Central articles. The team also conducted time and cost analyses comparing LLM-based screening with traditional human approaches, providing practical implementation insights for research teams.
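The evaluation metrics above are the standard ones for binary screening decisions. A small self-contained sketch, with made-up decisions rather than the paper's data, shows how sensitivity and specificity are computed against the human gold standard:

```python
# Sensitivity = fraction of truly relevant articles the model includes;
# specificity = fraction of truly irrelevant articles it excludes.
# The gold labels come from the original human screening decisions.

def sensitivity_specificity(gold: list[bool], pred: list[bool]) -> tuple[float, float]:
    tp = sum(g and p for g, p in zip(gold, pred))          # correctly included
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))  # correctly excluded
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # relevant but missed
    fp = sum((not g) and p for g, p in zip(gold, pred))    # irrelevant but included
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 2 relevant articles, 3 irrelevant ones.
gold = [True, True, False, False, False]
pred = [True, True, True, False, False]
sens, spec = sensitivity_specificity(gold, pred)  # sens = 1.0, spec = 2/3
```

In screening, false negatives are costlier than false positives: a wrongly excluded article never reaches full-text review, whereas a wrongly included one is simply filtered out later.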

Key Findings

  • The optimized Abstract ScreenPrompt template achieved high performance across 10 diverse reviews (97.7% weighted sensitivity, 85.2% weighted specificity), significantly outperforming zero-shot prompting (49.0% weighted sensitivity, 97.9% weighted specificity) and previous screening tools. High sensitivity with acceptable specificity is the key target for abstract screening, because relevant articles excluded at this stage are lost from the review entirely.

  • The ISO-ScreenPrompt template for full-text screening demonstrated similar high performance (96.5% weighted sensitivity, 91.2% weighted specificity).

  • Performance of both abstract and full-text prompt templates surpassed previous literature estimates of single human-reviewer performance (86.6% sensitivity and 79.2% specificity).

  • LLM-based screening substantially reduced costs and time requirements. Depending on the SR, Abstract ScreenPrompt cost between $16.74 and $157.02, versus $194.83 to $1,666.67 for single-human screening, while ISO-ScreenPrompt cost $14.53 to $622.12, versus $676.35 to $25,956.40 for human full-text screening. Both LLM approaches completed screening within 24 hours, compared to 9.74-83.33 hours (abstracts) and 33.82-1,297.82 hours (full-texts) for human reviewers.
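The "weighted" sensitivity and specificity figures in the bullets above pool per-review results by article counts, so larger reviews contribute proportionally more. A brief sketch with illustrative numbers (not the paper's data):

```python
# Weighted pooling of per-review metrics. Each review's sensitivity is
# weighted by how many truly relevant articles that review contained.

def weighted_metric(values: list[float], weights: list[int]) -> float:
    """Weighted average of per-review metric values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

sensitivities = [0.99, 0.95, 0.97]  # hypothetical per-review sensitivity
n_relevant = [400, 50, 150]         # relevant articles in each review
weighted_sens = weighted_metric(sensitivities, n_relevant)  # ~0.982
```

The same computation with per-review specificities, weighted by the number of irrelevant articles, yields the weighted specificity.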

Conclusion and Implications

This research demonstrates that well-engineered LLM prompts can achieve high sensitivity and specificity for systematic review screening across diverse reviews without requiring model fine-tuning or labeled training data. The study offers immediate implementation pathways: LLMs can serve as independent single reviewers, complement human reviewers to halve the workload, or act as pre-screening tools to reduce human screening volume by 66-95%. Future research will validate these templates across a broader spectrum of systematic reviews, evaluate their performance against human reviewers in prospective studies, and explore applying similar prompting techniques to other criteria-based tasks in medical sciences.
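The pre-screening pathway mentioned above can be sketched in a few lines: the LLM filters the candidate set first, and humans only screen what it includes. Here `llm_screen` is a hypothetical stand-in for a call that applies the paper's templates to one abstract; the toy keyword screener below is purely illustrative.

```python
from typing import Callable

# Pre-screening triage sketch: humans review only the abstracts the
# LLM flags for inclusion, shrinking the human workload.

def prescreen(abstracts: list[str], llm_screen: Callable[[str], bool]) -> list[str]:
    """Return only the abstracts the LLM includes; humans screen this subset."""
    return [a for a in abstracts if llm_screen(a)]

# Toy stand-in screener: keep abstracts mentioning a trial.
kept = prescreen(
    ["randomized trial of X", "narrative essay", "cohort trial of Y"],
    lambda a: "trial" in a,
)
# Humans now review 2 of 3 abstracts.
```

Because abstract screening favors sensitivity over specificity, a pre-screener tuned this way rarely discards relevant work while still cutting the human reading load substantially.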

Created by AI, edited by humans, about AI

This blog post is part of our ‘ANDERS – AI Noteworthy Developments Explained & Research Simplified’ series. Here we utilize AI Agents to create initial drafts from research papers, which are then carefully edited and refined by our human editors. The goal is to bring you clear, concise explanations of cutting-edge research conducted by Vector researchers. Through ANDERS, we strive to bridge the gap between complex scientific advancements and everyday understanding, highlighting why these developments are important and how they impact our world.
