Research Papers research paper arxiv nlp language-models

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

arXivby [Submitted on 26 Mar 2026]March 26, 20262 min read1 views

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce S — Dingjie Song, Tianlong Xu, Yi-Fan Zhang

View PDF HTML (experimental)

Abstract:Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

Comments: Accepted by the 27th International Conference on Artificial Intelligence in Education (AIED'26)

Subjects:

Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

ACM classes: I.2.7; K.3.1

Cite as: arXiv:2603.24961 [cs.AI]

(or arXiv:2603.24961v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.24961

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Dingjie Song [view email] [v1] Thu, 26 Mar 2026 02:57:20 UTC (3,295 KB)

Original source

arXiv

https://arxiv.org/abs/2603.24961v1

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ProductsFresh

Average-Case Reductions for $k$-XOR and Tensor PCA

arXiv:2601.19016v2 Announce Type: replace-cross Abstract: We study the computational properties of two canonical planted average-case problems -- noisy planted $k$-XOR and Tensor PCA -- by formally unifying them into a family of planted problems parametrized by tensor order $k$, number of entries $m$, and noise level $\delta$. We build a wide range of poly-time average-case reductions within this family, across all regimes $m \in [1, n^k]$. In the denser $m \geq n^{k/2}$ regime, our reductions preserve proximity to the computational threshold, and, as a central application, reduce conjectured-hard $k$-XOR instances with $m \approx n^{k/2}$ to conjectured-hard instances of Tensor PCA. Additionally, we give new order-reducing maps at fixed densities (e.g., $5\to 4$ for $k$-XOR with $m \appro

arXiv cs.CR

2mabout 6 hours ago

ProductsFresh

Differential Privacy for Secure Machine Learning in Healthcare IoT-Cloud Systems

arXiv:2512.10426v3 Announce Type: replace Abstract: Healthcare has become exceptionally sophisticated, as wearables and connected medical devices revolutionize remote patient monitoring, emergency response, medication management, diagnosis, and predictive and prescriptive analytics. Internet of Things and Cloud computing integrated systems (IoT-Cloud) facilitate sensing, automation, and processing for these healthcare applications. While real-time response is crucial for alleviating patient emergencies, protecting patient privacy is paramount in data-driven healthcare. In this paper, we propose a multi-layer IoT, Edge, and Cloud architecture to enhance emergency healthcare response times by distributing tasks based on response criticality and data permanence requirements. We ensure patient

arXiv cs.CR

2mabout 6 hours ago

Frontier Research

Cutting-edge imaging, AI research seeks out minuscule defects in chips - Purdue University

Cutting-edge imaging, AI research seeks out minuscule defects in chips Purdue University

GNews AI chips

1m6 months ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 163 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Realistic Lip Motion Generation Based on 3D Dynamic Viseme and Coarticulation Modeling for Human-Robot Interaction

arXiv:2604.01756v1 Announce Type: new Abstract: Realistic lip synchronization is essential for the natural human-robot non-verbal interaction of humanoid robots. Motivated by this need, this paper presents a lip motion generation framework based on 3D dynamic viseme and coarticulation modeling. By analyzing Chinese pronunciation theory, a 3D dynamic viseme library is constructed based on the ARKit standard, which offers coherent prior trajectories of lips. To resolve motion conflicts within continuous speech streams, a coarticulation mechanism is developed by incorporating initial-final (Shengmu-Yunmu) decoupling and energy modulation. After developing a strategy to retarget high-dimensional spatial lip motion to a 14-DOF lip actuation system of a humanoid head platform, the efficiency and

arXiv cs.RO

2mabout 6 hours ago

Research PapersFresh

3-D Relative Localization for Multi-Robot Systems with Angle and Self-Displacement Measurements

arXiv:2604.01703v1 Announce Type: new Abstract: Realizing relative localization by leveraging inter-robot local measurements is a challenging problem, especially in the presence of measurement noise. Motivated by this challenge, in this paper we propose a novel and systematic 3-D relative localization framework based on inter-robot interior angle and self-displacement measurements. Initially, we propose a linear relative localization theory comprising a distributed linear relative localization algorithm and sufficient conditions for localizability. According to this theory, robots can determine their neighbors' relative positions and orientations in a purely linear manner. Subsequently, in order to deal with measurement noise, we present an advanced Maximum a Posterior (MAP) estimator by a

arXiv cs.RO

2mabout 6 hours ago

Research PapersFresh

Coupler Position Optimization and Channel Estimation for Flexible Coupler Antenna Aided Multiuser Communication

arXiv:2602.11319v2 Announce Type: replace-cross Abstract: In this paper, we propose a distributed flexible coupler antenna (FCA) array to enhance communication performance with low hardware cost. At each FCA, there is one fixed-position active antenna and multiple passive couplers that can move within a designated region around the active antenna. Moreover, each FCA is equipped with a local processing unit (LPU). All LPUs exchange signals with a central processing unit (CPU) for joint signal processing. We study an FCA-aided multiuser multiple-input multiple-output (MIMO) system, where an FCA array base station (BS) is deployed to enhance the downlink communication between the BS and multiple single-antenna users. We formulate optimization problems to maximize the achievable sum rate of us

arXiv eess.SP

2mabout 6 hours ago

Research PapersFresh

Taxonomy for Cybersecurity Threat Attributes and Countermeasures in Smart Manufacturing Systems

arXiv:2401.01374v2 Announce Type: replace Abstract: An attack taxonomy offers a consistent and structured classification scheme to systematically understand, identify, and classify cybersecurity threat attributes. However, existing taxonomies only focus on a narrow range of attacks and limited threat attributes, lacking a comprehensive characterization of manufacturing cybersecurity threats. There is little to no focus on characterizing threat actors and their intent, specific system and machine behavioral deviations introduced by cyberattacks, system-level and operational implications of attacks, and potential countermeasures against those attacks. To close this pressing research gap, this work proposes a comprehensive attack taxonomy for a holistic understanding and characterization of c

arXiv cs.CR

2mabout 6 hours ago