Leveraging Large Language Models for More Efficient Systematic Reviews in Medicine and Beyond
New work from Vector Applied Scientist David Emerson highlights how generalist large language models (LLMs) may be used to automate systematic review screening. “Development of prompt templates for LLM-driven screening in systematic reviews,” co-authored by Emerson, Christian Cao, Jason Sang, and Rohit Arora, demonstrates how sophisticated prompting strategies can dramatically enhance LLM performance for text classification tasks critical to systematic review processes. By creating generalized prompting templates (as opposed to review-specific solutions), the researchers establish an accessible approach to systematic review automation that delivers significant cost and time savings.
Prior applications of LLMs for systematic review screening and text classification tasks primarily relied on zero-shot prompting methods, which heavily underestimated the true capabilities of LLMs for downstream tasks. As such, LLMs had, to this point, been characterized as incapable of performing these tasks well enough to be used effectively for systematic reviews. By applying prompting best practices, as well as novel prompting techniques, the authors were able to leverage the full capabilities of LLMs.
Background and Motivation
In medicine, systematic reviews (SRs) summarize clinical and research study results, providing evidence of an intervention’s effectiveness. They are the gold standard for evidence-based practice. However, they are exceptionally resource-intensive, typically requiring a year and over $100,000 to complete. The goal of an SR is to sift through large collections of research, identify relevant articles, and synthesize new insights that, as a whole, answer an important and specific question. A simplified example might be, “Does intervention X improve patient outcomes?” The initial screening phase of SRs is particularly demanding, requiring two investigators to independently review articles against eligibility criteria in two stages. First, investigators review article abstracts as a first-pass filter. Second, they review each remaining article in its entirety (‘full-text’ screening) to determine whether it meets the predetermined criteria for final inclusion.
Despite existing tools, SR automation remains elusive as current solutions only supplement human workflows, lack necessary performance for independent decision-making, and require extensive historical training data. The emergence of LLMs presents new opportunities for SR screening automation, potentially reducing time and resource requirements significantly.
Methodology
In this work, the researchers developed generic prompt templates for LLM-driven systematic review screening for both abstracts and full-text articles. They first created BenchSR, a database of 10 previously published systematic reviews spanning nine clinical domains, which served as the testing ground. Through iterative experimentation with various prompting techniques, they developed “Framework Chain-of-Thought,” a novel approach that guides LLMs to reason systematically against predefined criteria—mimicking human cognitive processes by evaluating each criterion before making final decisions. For full-text screening, they discovered that repeating prompt instructions at both the beginning and end of prompts significantly improved performance with lengthy documents, effectively addressing the “lost-in-the-middle” phenomenon where LLMs typically struggle with retaining information from the middle of large texts. These insights culminated in two optimized templates: “Abstract ScreenPrompt” for abstract screening and “ISO-ScreenPrompt” (Instruction-Structure-Optimized) for full-text screening, both designed to maximize sensitivity while maintaining acceptable specificity across diverse systematic review types.
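To make the two ideas above concrete, here is a minimal sketch of a screening-prompt builder. It is illustrative only: the wording, function name, and structure are assumptions for this post, not the authors’ actual Abstract ScreenPrompt or ISO-ScreenPrompt templates. It shows Framework Chain-of-Thought (walking the model through each eligibility criterion before a final decision) and instruction repetition for long full-text inputs (restating the task after the document to counter the “lost-in-the-middle” effect).

```python
def build_screening_prompt(criteria, document, repeat_instructions=False):
    """Assemble a hypothetical screening prompt.

    criteria: list of eligibility criteria strings.
    document: the abstract or full text to screen.
    repeat_instructions: restate the task after the document
        (useful for long full-text inputs).
    """
    instructions = (
        "You are screening articles for a systematic review.\n"
        "Evaluate the article against EACH criterion below, one at a time,\n"
        "stating your reasoning for each, then give a final "
        "INCLUDE/EXCLUDE decision."
    )
    # Number the criteria so the model can address them individually.
    criteria_block = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    parts = [instructions, "Eligibility criteria:", criteria_block,
             "Article:", document]
    if repeat_instructions:
        # Repeat the instructions after the (possibly very long) document.
        parts += ["Reminder:", instructions]
    return "\n\n".join(parts)
```

The key design choice mirrored here is that the model is asked to reason criterion by criterion, rather than to emit a one-shot include/exclude label, which is what distinguishes this style of prompt from plain zero-shot classification.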
Experimental Setup
The study compared multiple prompting approaches including zero-shot, few-shot, and Chain-of-Thought techniques across a wide array of LLMs (GPT-3.5, GPT-4 variants, Gemini Pro, Mixtral, Mistral, and Claude-3.5-Sonnet). Performance was evaluated using accuracy, sensitivity, and specificity metrics. The final human screening decisions from the original systematic reviews served as the gold standard reference against which all model decisions were compared, for both the abstract and full-article screening phases. The experimental protocol followed training-validation-testing workflows: prompts were initially optimized using training samples from a single systematic review (SeroTracker), validated on a separate SeroTracker sample, and finally thoroughly tested on a SeroTracker test set and nine additional systematic review datasets. For abstract screening, researchers examined the complete set of titles/abstracts from original searches, while full-text screening evaluated all freely accessible PubMed Central articles. The team also conducted time and cost analyses comparing LLM-based screening with traditional human approaches, providing practical implementation insights for research teams.
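The three evaluation metrics named above follow directly from a confusion matrix of model decisions against the human gold-standard labels. The sketch below (with made-up counts in the usage note) spells out the definitions used when screening articles:

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute screening metrics from confusion-matrix counts.

    tp: relevant articles the model included (true positives)
    fp: irrelevant articles the model included (false positives)
    tn: irrelevant articles the model excluded (true negatives)
    fn: relevant articles the model excluded (false negatives)
    """
    sensitivity = tp / (tp + fn)   # fraction of truly relevant articles kept
    specificity = tn / (tn + fp)   # fraction of irrelevant articles excluded
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy
```

For example, `screening_metrics(90, 50, 850, 10)` yields a sensitivity of 0.90 and an accuracy of 0.94. Sensitivity is the critical metric for screening: a false negative silently drops a relevant study from the review, whereas a false positive merely costs a second look.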
Key Findings
- The optimized Abstract ScreenPrompt template achieved high performance across 10 diverse reviews (97.7% weighted sensitivity, 85.2% weighted specificity), significantly outperforming zero-shot prompting (49.0% weighted sensitivity, 97.9% weighted specificity) and previous screening tools. High sensitivity, maintained alongside good specificity, is the most important property for abstract screening, since relevant articles excluded at this stage are lost to the review.
- The ISO-ScreenPrompt template for full-text screening demonstrated similarly high performance (96.5% weighted sensitivity, 91.2% weighted specificity).
- Both prompt templates surpassed previous literature estimates of single-human-reviewer performance (86.6% sensitivity and 79.2% specificity).
- LLM-based screening substantially reduced costs and time. Depending on the SR, Abstract ScreenPrompt cost $16.74-$157.02, versus $194.83-$1,666.67 for single-human abstract screening, while ISO-ScreenPrompt cost $14.53-$622.12, versus $676.35-$25,956.40 for human full-text screening. Both LLM approaches completed screening within 24 hours, compared to 9.74-83.33 hours (abstracts) and 33.82-1,297.82 hours (full texts) for human reviewers.
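The “weighted” sensitivity and specificity figures in these findings aggregate per-review results across the benchmark reviews. One plausible scheme, shown below purely as an assumption (the paper may weight differently), averages each review’s metric weighted by the number of articles it screened:

```python
def weighted_metric(per_review):
    """Weighted average of a metric across reviews.

    per_review: list of (metric_value, n_articles) pairs,
        one pair per systematic review.
    """
    total = sum(n for _, n in per_review)
    # Larger reviews contribute proportionally more to the aggregate.
    return sum(value * n for value, n in per_review) / total
```

For instance, a small review with 0.90 sensitivity (100 articles) and a large one with 1.00 sensitivity (300 articles) combine to a weighted sensitivity of 0.975, rather than the unweighted mean of 0.95.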
Conclusion and Implications
This research demonstrates that well-engineered LLM prompts can achieve high sensitivity and specificity for systematic review screening across diverse reviews without requiring model fine-tuning or labeled training data. The study offers immediate implementation pathways: LLMs can serve as independent single reviewers, complement human reviewers to halve workload, or as pre-screening tools to reduce human screening volume by 66-95%. Future research will validate these templates across a broader spectrum of systematic reviews, evaluate their performance against human reviewers in prospective studies, and explore applying similar prompting techniques to other criteria-based tasks in medical sciences.
Created by AI, edited by humans, about AI
This blog post is part of our ‘ANDERS – AI Noteworthy Developments Explained & Research Simplified’ series. Here we utilize AI Agents to create initial drafts from research papers, which are then carefully edited and refined by our humans. The goal is to bring you clear, concise explanations of cutting-edge research conducted by Vector researchers. Through ANDERS, we strive to bridge the gap between complex scientific advancements and everyday understanding, highlighting why these developments are important and how they impact our world.
Additional reading
- Github
- Editorial
Vector Institute
https://vectorinstitute.ai/leveraging-llms-for-more-efficient-systematic-reviews-in-medicine-and-beyond/
