Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

arXiv, March 31, 2026

arXiv:2602.17542v2 (Announce Type: replace)

Authors: Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan

Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
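The two evaluation tools named in the abstract can be illustrated with a minimal sketch. It assumes the textbook forms of both: the power law of practice as E(t) = a · t^(-b), where E(t) is the error rate on a KC at its t-th practice opportunity, and the Additive Factors Model as a logistic model whose logit is the student's proficiency plus, for each KC in the problem, an easiness term and a learning rate times the number of prior opportunities. The function names, parameter names, and the log-log least-squares fit are illustrative assumptions, not the authors' implementation, which the abstract does not describe.

```python
import math


def fit_power_law(error_rates):
    """Fit E(t) = a * t**(-b) by ordinary least squares in log-log space.

    error_rates[i] is the observed error rate at practice opportunity i + 1.
    Zero error rates are skipped because log(0) is undefined.
    Returns the pair (a, b). Hypothetical helper, for illustration only.
    """
    points = [
        (math.log(t), math.log(e))
        for t, e in enumerate(error_rates, start=1)
        if e > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two nonzero error rates")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def afm_probability(theta, kc_params, opportunities):
    """Standard AFM prediction for one student on one problem.

    theta: student proficiency.
    kc_params: {kc_name: (beta, gamma)} for the KCs the problem involves,
        where beta is KC easiness and gamma is the KC learning rate.
    opportunities: {kc_name: prior practice count for this student}.
    Returns the predicted probability of a correct response.
    """
    logit = theta + sum(
        beta + gamma * opportunities[kc]
        for kc, (beta, gamma) in kc_params.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Synthetic error rates that follow the power law exactly (made-up numbers).
synthetic_rates = [0.8 * t ** -0.5 for t in range(1, 7)]
a, b = fit_power_law(synthetic_rates)
```

A larger fitted b means the error rate falls faster with practice. Labels that mark every KC in a failed submission as incorrect can flatten this curve, which is the problem with problem-level propagation that the abstract describes.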

Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)

Cite as: arXiv:2602.17542 [cs.CL]

(or arXiv:2602.17542v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.17542

arXiv-issued DOI via DataCite

Submission history

From: Zhangqi Duan
[v1] Thu, 19 Feb 2026 16:58:34 UTC (792 KB)
[v2] Fri, 27 Mar 2026 21:30:24 UTC (794 KB)

Original source

arXiv

https://arxiv.org/abs/2602.17542
