
Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

arXiv, March 31, 2026

arXiv:2603.26898v1 (Announce Type: cross)

Authors: Lorcan McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde


Abstract: Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance. We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.
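The abstract's "prompt freezing and held-out evaluation" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy annotators stand in for LLM pipeline configurations, accuracy stands in for whatever metric a study actually uses, and the function names (`split_dev_heldout`, `select_and_freeze`) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]          # (text, gold_label)
Annotator = Callable[[str], str]   # hypothetical stand-in for one model + prompt configuration


def split_dev_heldout(
    data: Sequence[Example], dev_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle once with a fixed seed, then split into development and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(annotate: Annotator, examples: Sequence[Example]) -> float:
    """Share of examples where the annotator's label matches the gold label."""
    if not examples:
        raise ValueError("cannot score an empty set of examples")
    return sum(annotate(text) == gold for text, gold in examples) / len(examples)


def select_and_freeze(
    candidates: Dict[str, Annotator], dev: Sequence[Example]
) -> Tuple[str, Dict[str, float]]:
    """Pick the best configuration using development data only."""
    scores = {name: accuracy(fn, dev) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first candidate listed
    return best, scores


# Toy annotators (assumptions): real pipelines would call an LLM here.
def keyword_annotator(text: str) -> str:
    return "econ" if "tax" in text.lower() else "other"


def always_other(text: str) -> str:
    return "other"


candidates: Dict[str, Annotator] = {
    "keyword": keyword_annotator,
    "always_other": always_other,
}

# Toy labelled data (invented for this sketch).
data: List[Example] = [
    ("Cut the tax rate", "econ"),
    ("Raise taxes on wealth", "econ"),
    ("Protect the border", "other"),
    ("Fund more schools", "other"),
    ("Tax relief for families", "econ"),
    ("Reform the courts", "other"),
]

dev, heldout = split_dev_heldout(data, dev_fraction=0.5, seed=0)
best_name, dev_scores = select_and_freeze(candidates, dev)

# The configuration is now frozen; the held-out set is scored exactly once.
heldout_score = accuracy(candidates[best_name], heldout)
print(best_name, dev_scores, heldout_score)
```

The design point is the ordering: every comparison between configurations touches only the development split, so the held-out score is not inflated by the selection step.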

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.26898 [cs.CL]

(or arXiv:2603.26898v2 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.26898 (arXiv-issued DOI via DataCite)

Submission history

From: Lorcan McLaren
[v1] Fri, 27 Mar 2026 18:17:21 UTC (3,215 KB)
[v2] Tue, 31 Mar 2026 10:22:41 UTC (3,215 KB)
