
Leveraging Large Language Models for More Efficient Systematic Reviews in Medicine and Beyond

Vector Institute · by Kylie Williams · March 25, 2025 · 6 min read

New work from Vector Applied Scientist David Emerson highlights how generalist large language models (LLMs) may be used to automate systematic review screening. “Development of prompt templates for LLM-driven screening in systematic reviews,” co-authored by Emerson, Christian Cao, Jason Sang, and Rohit Arora, demonstrates how sophisticated prompting strategies can dramatically enhance LLM performance for text classification tasks critical to systematic review processes. By creating generalized prompting templates (as opposed to review-specific solutions), the researchers establish an accessible approach to systematic review automation that delivers significant cost and time savings.

Prior applications of LLMs for systematic review screening and text classification tasks primarily relied on zero-shot prompting methods, which heavily underestimated the true capabilities of LLMs for downstream tasks. As such, LLMs had, to this point, been characterized as incapable of performing these tasks well enough to be used effectively for systematic reviews. By applying prompting best practices, as well as novel prompting techniques, the authors were able to leverage the full capabilities of LLMs.

Background and Motivation

In medicine, systematic reviews (SRs) summarize clinical and research study results, providing evidence of an intervention’s effectiveness. They are the gold standard for evidence-based practice. However, they are exceptionally resource-intensive, typically requiring a year and over $100,000 to complete. The goal of an SR is to sift through large collections of research, identify relevant articles, and synthesize new insights that, as a whole, answer an important and specific question. A simplified example might be, “Does intervention X improve patient outcomes?” The initial screening phase of SRs is particularly demanding, requiring two investigators to independently review articles against eligibility criteria in two stages. First, investigators review article abstracts as a first pass article filter. In the second phase, the investigators review the entire article (‘full-text’) to determine whether the articles meet the predetermined criteria for final inclusion.

Despite existing tools, SR automation remains elusive as current solutions only supplement human workflows, lack necessary performance for independent decision-making, and require extensive historical training data. The emergence of LLMs presents new opportunities for SR screening automation, potentially reducing time and resource requirements significantly.

Methodology

In this work, the researchers developed generic prompt templates for LLM-driven systematic review screening for both abstracts and full-text articles. They first created BenchSR, a database of 10 previously published systematic reviews spanning nine clinical domains, which served as the testing ground. Through iterative experimentation with various prompting techniques, they developed “Framework Chain-of-Thought,” a novel approach that guides LLMs to reason systematically against predefined criteria—mimicking human cognitive processes by evaluating each criterion before making final decisions. For full-text screening, they discovered that repeating prompt instructions at both the beginning and end of prompts significantly improved performance with lengthy documents, effectively addressing the “lost-in-the-middle” phenomenon where LLMs typically struggle with retaining information from the middle of large texts. These insights culminated in two optimized templates: “Abstract ScreenPrompt” for abstract screening and “ISO-ScreenPrompt” (Instruction-Structure-Optimized) for full-text screening, both designed to maximize sensitivity while maintaining acceptable specificity across diverse systematic review types.
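To make the two prompt-design ideas above concrete, here is a minimal sketch of how a screening prompt in this style might be assembled. The criteria text, instruction wording, and function name are illustrative assumptions, not the authors' actual templates; the structure just demonstrates criterion-by-criterion reasoning plus instruction repetition at both ends of the prompt.

```python
# Sketch of a Framework Chain-of-Thought-style screening prompt builder.
# All wording here is hypothetical; the paper's real templates differ.

def build_screening_prompt(abstract: str, criteria: list[str]) -> str:
    """Assemble a prompt asking the model to reason through each
    eligibility criterion before giving a final include/exclude call."""
    criteria_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    instructions = (
        "You are screening articles for a systematic review.\n"
        "Evaluate the abstract against EACH criterion below, one at a time,\n"
        "stating your reasoning, then give a final answer: INCLUDE or EXCLUDE."
    )
    # Repeating the instructions at both the start and the end of the prompt
    # mirrors the paper's fix for the "lost-in-the-middle" effect on long inputs.
    return (
        f"{instructions}\n\nCriteria:\n{criteria_block}\n\n"
        f"Abstract:\n{abstract}\n\n{instructions}"
    )

prompt = build_screening_prompt(
    "We conducted a randomized trial of intervention X in 200 adults...",
    ["Reports a randomized controlled trial", "Studies adult participants"],
)
```

For full-text screening the same idea applies, with the article body in place of the abstract, which is where the instruction repetition matters most.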

Experimental Setup

The study compared multiple prompting approaches including zero-shot, few-shot, and Chain-of-Thought techniques across a wide array of LLMs (GPT-3.5, GPT-4 variants, Gemini Pro, Mixtral, Mistral, and Claude-3.5-Sonnet). Performance was evaluated using accuracy, sensitivity, and specificity metrics. The final human screening decisions from the original systematic reviews served as the gold standard reference against which all model decisions were compared, for both the abstract and full-article screening phases. The experimental protocol followed training-validation-testing workflows: prompts were initially optimized using training samples from a single systematic review (SeroTracker), validated on a separate SeroTracker sample, and finally thoroughly tested on a SeroTracker test set and nine additional systematic review datasets. For abstract screening, researchers examined the complete set of titles/abstracts from original searches, while full-text screening evaluated all freely accessible PubMed Central articles. The team also conducted time and cost analyses comparing LLM-based screening with traditional human approaches, providing practical implementation insights for research teams.
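The evaluation metrics above are the standard ones for binary screening decisions. A small self-contained sketch, with made-up decisions rather than the paper's data, shows how sensitivity and specificity are computed against the human gold standard:

```python
# Sensitivity = fraction of truly relevant articles the model includes;
# specificity = fraction of truly irrelevant articles it excludes.
# The gold labels come from the original human screening decisions.

def sensitivity_specificity(gold: list[bool], pred: list[bool]) -> tuple[float, float]:
    tp = sum(g and p for g, p in zip(gold, pred))          # correctly included
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))  # correctly excluded
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # relevant but missed
    fp = sum((not g) and p for g, p in zip(gold, pred))    # irrelevant but included
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 2 relevant articles, 3 irrelevant ones.
gold = [True, True, False, False, False]
pred = [True, True, True, False, False]
sens, spec = sensitivity_specificity(gold, pred)  # sens = 1.0, spec = 2/3
```

In screening, false negatives are costlier than false positives: a wrongly excluded article never reaches full-text review, whereas a wrongly included one is simply filtered out later.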

Key Findings

  • The optimized Abstract ScreenPrompt template achieved high performance across 10 diverse reviews (97.7% weighted sensitivity, 85.2% weighted specificity), significantly outperforming zero-shot prompting (49.0% weighted sensitivity, 97.9% weighted specificity) and previous screening tools. High sensitivity with acceptable specificity is the key target for abstract screening, because relevant articles excluded at this stage are lost from the review entirely.

  • The ISO-ScreenPrompt template for full-text screening demonstrated similar high performance (96.5% weighted sensitivity, 91.2% weighted specificity).

  • Performance of both abstract and full-text prompt templates surpassed previous literature estimates of single human-reviewer performance (86.6% sensitivity and 79.2% specificity).

  • LLM-based screening substantially reduced costs and time requirements. Depending on the SR, Abstract ScreenPrompt cost between $16.74 and $157.02, versus $194.83 to $1,666.67 for single-human screening, while ISO-ScreenPrompt cost $14.53 to $622.12, versus $676.35 to $25,956.40 for human full-text screening. Both LLM approaches completed screening within 24 hours, compared to 9.74-83.33 hours (abstracts) and 33.82-1,297.82 hours (full-texts) for human reviewers.
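The "weighted" sensitivity and specificity figures in the bullets above pool per-review results by article counts, so larger reviews contribute proportionally more. A brief sketch with illustrative numbers (not the paper's data):

```python
# Weighted pooling of per-review metrics. Each review's sensitivity is
# weighted by how many truly relevant articles that review contained.

def weighted_metric(values: list[float], weights: list[int]) -> float:
    """Weighted average of per-review metric values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

sensitivities = [0.99, 0.95, 0.97]  # hypothetical per-review sensitivity
n_relevant = [400, 50, 150]         # relevant articles in each review
weighted_sens = weighted_metric(sensitivities, n_relevant)  # ~0.982
```

The same computation with per-review specificities, weighted by the number of irrelevant articles, yields the weighted specificity.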

Conclusion and Implications

This research demonstrates that well-engineered LLM prompts can achieve high sensitivity and specificity for systematic review screening across diverse reviews without requiring model fine-tuning or labeled training data. The study offers immediate implementation pathways: LLMs can serve as independent single reviewers, complement human reviewers to halve the workload, or act as pre-screening tools to reduce human screening volume by 66-95%. Future research will validate these templates across a broader spectrum of systematic reviews, evaluate their performance against human reviewers in prospective studies, and explore applying similar prompting techniques to other criteria-based tasks in medical sciences.
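The pre-screening pathway mentioned above can be sketched in a few lines: the LLM filters the candidate set first, and humans only screen what it includes. Here `llm_screen` is a hypothetical stand-in for a call that applies the paper's templates to one abstract; the toy keyword screener below is purely illustrative.

```python
from typing import Callable

# Pre-screening triage sketch: humans review only the abstracts the
# LLM flags for inclusion, shrinking the human workload.

def prescreen(abstracts: list[str], llm_screen: Callable[[str], bool]) -> list[str]:
    """Return only the abstracts the LLM includes; humans screen this subset."""
    return [a for a in abstracts if llm_screen(a)]

# Toy stand-in screener: keep abstracts mentioning a trial.
kept = prescreen(
    ["randomized trial of X", "narrative essay", "cohort trial of Y"],
    lambda a: "trial" in a,
)
# Humans now review 2 of 3 abstracts.
```

Because abstract screening favors sensitivity over specificity, a pre-screener tuned this way rarely discards relevant work while still cutting the human reading load substantially.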

Created by AI, edited by humans, about AI

This blog post is part of our ‘ANDERS – AI Noteworthy Developments Explained & Research Simplified’ series. Here we utilize AI Agents to create initial drafts from research papers, which are then carefully edited and refined by our human editors. The goal is to bring you clear, concise explanations of cutting-edge research conducted by Vector researchers. Through ANDERS, we strive to bridge the gap between complex scientific advancements and everyday understanding, highlighting why these developments are important and how they impact our world.
