
Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

arXiv, March 26, 2026

Liang Zhang, Yu Fu, Xinyi Jin


Abstract: Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
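The within-model comparison described in the abstract can be illustrated with a small sketch. The abstract does not say which statistical test the authors used, so the Pearson chi-square test on a 2x2 table below is an assumption, as are the `Item` record, the function names, and the counts, which are made up for illustration and are not results from the paper.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Item(NamedTuple):
    """One benchmark problem, scored on both tasks (hypothetical record layout)."""
    solved_correctly: bool    # the model's own final answer was right
    assessed_correctly: bool  # its predicted earliest erroneous step matched the label


def conditional_accuracy(items: Sequence[Item], solved: bool) -> float:
    """Assessment accuracy over items whose solving outcome equals `solved`."""
    subset = [it for it in items if it.solved_correctly == solved]
    if not subset:
        raise ValueError("no items in this group")
    return sum(it.assessed_correctly for it in subset) / len(subset)


def chi_square_2x2(items: Sequence[Item]) -> Tuple[float, float]:
    """Pearson chi-square statistic (no continuity correction) and p-value, df = 1."""
    a = sum(it.solved_correctly and it.assessed_correctly for it in items)
    b = sum(it.solved_correctly and not it.assessed_correctly for it in items)
    c = sum(not it.solved_correctly and it.assessed_correctly for it in items)
    d = sum(not it.solved_correctly and not it.assessed_correctly for it in items)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts for illustration only (not data from the paper).
made_up_items = (
    [Item(True, True)] * 70 + [Item(True, False)] * 30
    + [Item(False, True)] * 20 + [Item(False, False)] * 30
)

print(conditional_accuracy(made_up_items, solved=True))
print(conditional_accuracy(made_up_items, solved=False))
print(chi_square_2x2(made_up_items))
```

With these invented counts, assessment accuracy is 0.7 on solved items and 0.4 on unsolved items, and the chi-square statistic is 12.5. A test of this kind shows only an association between the two outcomes within one model; it says nothing about why the association holds.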

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.25633 [cs.AI]

(or arXiv:2603.25633v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.25633

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Liang Zhang

[v1] Thu, 26 Mar 2026 16:43:54 UTC (102 KB)




