When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

arXiv | Submitted on 28 Mar 2026 | March 31, 2026
arXiv:2603.27076v1 Announce Type: new

Authors: Tahreem Yasir, Sutapa Dey Tithi, Benyamin Tabarsi, Dmitri Droujkov, Sam Gilson Yasitha Rajapaksha, Xiaoyi Tian, Arun Ramesh, DongKuan (DK) Xu, Tiffany Barnes

Abstract: Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4-5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
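The routing idea in the last sentence of the abstract can be illustrated with a small sketch. This is not the authors' implementation: only the three numeric thresholds come from the abstract, while the `ProofState` container, the `route` function, the pipeline labels, the "escalate" fallback, and the handling of the 70-85% band are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; how to act on them is an assumption.
LOW_RELIABILITY = 0.70    # verification helped below this upstream accuracy
HIGH_RELIABILITY = 0.85   # verification hurt above this upstream accuracy
COMPLEXITY_CEILING = 4    # abstract reports a ceiling at "4-5"; 4 is a conservative pick


@dataclass
class ProofState:
    """Hypothetical container; the benchmark's real schema is not shown in the abstract."""
    state_id: str
    complexity: float  # estimated difficulty of the proof state


def route(state: ProofState, upstream_accuracy: float, default: str = "tutor") -> str:
    """Return a hypothetical pipeline label for one proof state."""
    # Past the ceiling no pipeline was reliable, so hand off elsewhere
    # (the "escalate" action itself is an assumption, e.g. a human tutor).
    if state.complexity > COMPLEXITY_CEILING:
        return "escalate"
    # Error-prone upstream feedback: a Judge pass improved outcomes.
    if upstream_accuracy < LOW_RELIABILITY:
        return "tutor+judge"
    # Reliable upstream feedback: verification cost 4-6 points, so skip it.
    if upstream_accuracy > HIGH_RELIABILITY:
        return "tutor"
    # The abstract does not characterize the 70-85% band; the caller chooses.
    return default


if __name__ == "__main__":
    print(route(ProofState("s1", complexity=2), upstream_accuracy=0.60))
    print(route(ProofState("s2", complexity=2), upstream_accuracy=0.90))
    print(route(ProofState("s3", complexity=6), upstream_accuracy=0.90))
```

The complexity check runs first because the reported ceiling applied to every model and pipeline, so upstream reliability is only consulted for proof states below it.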

Comments: 21 pages, 1 figure

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.27076 [cs.AI]

(or arXiv:2603.27076v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.27076

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tahreem Yasir
[v1] Sat, 28 Mar 2026 01:35:59 UTC (2,066 KB)
