HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

arXiv, March 31, 2026

arXiv:2511.15355v2 (announce type: replace)

Authors: Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares

Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
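
The abstract names probability-based answer selection as one of the benchmarked strategies but does not describe how it is computed. As a rough illustration only, one common way to do this for multiple-choice questions is to score each candidate answer by the log-probability a model assigns to its tokens and pick the highest-scoring option. The sketch below assumes those per-token log-probabilities have already been obtained from some model; the function name `select_answer`, the `normalize` flag, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_answer(option_logprobs: Dict[str, List[float]], normalize: bool = True) -> str:
    """Return the option label with the highest (optionally length-normalized) log-probability.

    option_logprobs maps an option label (e.g. "A") to the per-token
    log-probabilities a model assigned to that option's text. How these are
    obtained is outside this sketch and is an assumption, not the paper's setup.
    """
    if not option_logprobs:
        raise ValueError("no options to score")

    def score(token_logprobs: List[float]) -> float:
        if not token_logprobs:
            return float("-inf")  # an empty option can never win
        total = sum(token_logprobs)
        # Dividing by length avoids penalizing longer answers for having more tokens.
        return total / len(token_logprobs) if normalize else total

    return max(option_logprobs, key=lambda label: score(option_logprobs[label]))


# Made-up log-probabilities for a four-option question (illustrative only).
example = {
    "A": [-2.0, -1.5],
    "B": [-0.4, -0.6, -0.5],
    "C": [-1.0],
    "D": [-3.0, -0.2],
}

print(select_answer(example))                   # mean log-probability per token
print(select_answer(example, normalize=False))  # summed log-probability
```

With these made-up numbers the two scoring variants choose different options, which is why the normalization choice is worth stating explicitly when reporting results with this kind of selection.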

Comments: LREC 2026 camera-ready version

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2511.15355 [cs.CL] (or arXiv:2511.15355v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2511.15355

arXiv-issued DOI via DataCite

Submission history

From: David Vilares
[v1] Wed, 19 Nov 2025 11:31:32 UTC (705 KB)
[v2] Mon, 30 Mar 2026 08:04:09 UTC (705 KB)
