Evaluating LLMs for Answering Student Questions in Introductory Programming Courses
arXiv:2603.28295v1 Announce Type: new
Authors: Thomas Van Mullem, Bart Mesuere, Peter Dawyndt
Abstract: The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models such as Gemini 3 Flash can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a "teacher-in-the-loop" implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.
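The evaluation process described in the abstract (benchmark questions paired with expert answers, scored by a judge before deployment) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the names `BenchmarkItem`, `evaluate`, and `toy_overlap_judge`, the 0-to-1 score range, and the pass threshold do not come from the paper. In particular, `toy_overlap_judge` is a deliberately crude word-overlap stand-in so the example runs offline; the paper argues that text-matching metrics are insufficient, so a real setup would replace it with a call to a judge model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark entry: a student question and an expert-written answer."""
    question: str
    expert_answer: str


# Hypothetical judge interface: (question, expert answer, candidate answer)
# mapped to a score in [0, 1]. The paper's actual judge is not shown here.
Judge = Callable[[str, str, str], float]


def toy_overlap_judge(question: str, expert_answer: str, candidate_answer: str) -> float:
    """Placeholder judge: fraction of expert-answer words found in the candidate.

    This is NOT the paper's metric; it only keeps the sketch runnable offline.
    """
    expert_words = set(expert_answer.lower().split())
    candidate_words = set(candidate_answer.lower().split())
    if not expert_words:
        return 0.0
    return len(expert_words & candidate_words) / len(expert_words)


def evaluate(
    items: Sequence[BenchmarkItem],
    answer_fn: Callable[[str], str],
    judge: Judge,
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> Dict[str, float]:
    """Score a candidate model's answers against the expert answers."""
    if not items:
        raise ValueError("benchmark must contain at least one item")
    scores = [
        judge(item.question, item.expert_answer, answer_fn(item.question))
        for item in items
    ]
    return {
        "n": float(len(scores)),
        "mean_score": mean(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }


if __name__ == "__main__":
    # Tiny made-up benchmark; the real dataset has 170 authentic questions.
    benchmark = [
        BenchmarkItem("Why does my loop never stop?", "check the loop condition"),
        BenchmarkItem("What does this error mean?", "the variable is not defined"),
    ]
    # Stand-in for a model under test: always gives the same hint.
    report = evaluate(benchmark, lambda q: "check the loop condition", toy_overlap_judge)
    print(report)
```

Keeping the judge behind a plain callable means the same loop can compare several candidate models, or several judges, on the same benchmark before anything reaches students.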
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.28295 [cs.AI]
(or arXiv:2603.28295v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.28295
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Thomas Van Mullem
[v1] Mon, 30 Mar 2026 11:22:58 UTC (1,916 KB)