Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
Liang Zhang, Yu Fu, Xinyi Jin
Abstract: Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
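The within-model comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Item` record, the use of `-1` to mark a solution with no erroneous step, and the choice of a Pearson chi-square test on a 2x2 table (the abstract does not say which test the authors used). The counts are toy values, not results from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    """One benchmark problem, evaluated in both tasks (hypothetical record)."""
    solved_correctly: bool  # did the model solve the original problem?
    gold_step: int          # annotated earliest erroneous step; -1 = no error (assumed convention)
    predicted_step: int     # model's predicted earliest erroneous step


def contingency(items):
    """Count items into a 2x2 table.

    Rows: solved correctly / solved incorrectly.
    Columns: assessed correctly / assessed incorrectly.
    """
    table = [[0, 0], [0, 0]]
    for it in items:
        row = 0 if it.solved_correctly else 1
        col = 0 if it.predicted_step == it.gold_step else 1
        table[row][col] += 1
    return table


def conditional_accuracy(table):
    """Assessment accuracy within each row (solved, not solved)."""
    accs = []
    for right, wrong in table:
        n = right + wrong
        accs.append(right / n if n else float("nan"))
    return tuple(accs)


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and p-value, 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Toy data, invented for illustration only.
items = (
    [Item(True, 2, 2)] * 30      # solved, assessed correctly
    + [Item(True, -1, 1)] * 10   # solved, assessed incorrectly
    + [Item(False, 3, 3)] * 8    # not solved, assessed correctly
    + [Item(False, 0, 2)] * 12   # not solved, assessed incorrectly
)
table = contingency(items)
acc_solved, acc_unsolved = conditional_accuracy(table)
chi2, p = chi_square_2x2(table)
print(f"accuracy | solved: {acc_solved:.2f}, accuracy | not solved: {acc_unsolved:.2f}")
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

Exact-match scoring of the predicted step is one plausible way to score "earliest erroneous step" prediction. With small cell counts, an exact test would be a more cautious choice than the chi-square approximation.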
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.25633 [cs.AI]
(or arXiv:2603.25633v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.25633
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Liang Zhang [v1] Thu, 26 Mar 2026 16:43:54 UTC (102 KB)