
Evaluating LLMs for Answering Student Questions in Introductory Programming Courses

arXiv, March 31, 2026

arXiv:2603.28295v1 (Announce Type: new)
Authors: Thomas Van Mullem, Bart Mesuere, Peter Dawyndt


Abstract: The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models such as Gemini 3 flash can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a "teacher-in-the-loop" implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.
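The pre-deployment validation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BenchmarkItem`, `evaluate`, `passes_validation`, and `toy_judge` are hypothetical, the score range of 0 to 1 is an assumption, and the toy judge below is only an offline stand-in for a call to an LLM judge (the abstract notes that simple text matching is insufficient for real use).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class BenchmarkItem:
    question: str       # authentic student question
    expert_answer: str  # ground-truth response written by a subject matter expert


# Hypothetical signatures (assumptions, not from the paper):
# an answerer maps a question to a response; a judge maps
# (question, expert_answer, candidate_answer) to a score in [0, 1].
Answerer = Callable[[str], str]
Judge = Callable[[str, str, str], float]


def evaluate(items: List[BenchmarkItem], answer: Answerer, judge: Judge) -> float:
    """Return the mean judge score of `answer` over the benchmark."""
    if not items:
        raise ValueError("benchmark is empty")
    scores = []
    for item in items:
        candidate = answer(item.question)
        score = judge(item.question, item.expert_answer, candidate)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge score out of range: {score}")
        scores.append(score)
    return mean(scores)


def passes_validation(candidate_score: float, baseline_score: float) -> bool:
    """Pre-deployment gate: the candidate must at least match the baseline."""
    return candidate_score >= baseline_score


def toy_judge(question: str, expert_answer: str, candidate: str) -> float:
    """Offline placeholder for an LLM judge: fraction of expert words reused.

    A real judge would prompt an LLM with a rubric; this word-overlap
    stand-in exists only so the sketch runs without network access.
    """
    expert = set(expert_answer.lower().split())
    cand = set(candidate.lower().split())
    return len(expert & cand) / len(expert) if expert else 0.0
```

In use, the same `evaluate` call would be run once for a candidate model and once for a baseline set of educator responses, and `passes_validation` would compare the two means before anything is deployed to students.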

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.28295 [cs.AI] (or arXiv:2603.28295v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.28295

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Thomas Van Mullem
[v1] Mon, 30 Mar 2026 11:22:58 UTC (1,916 KB)
