SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
arXiv:2603.27977v1 Announce Type: new Abstract: Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) — Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
View PDF HTML (experimental)
Abstract:Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning) and extend traditional RLVR to open ended settings. We introduce structure aware reinforcement learning (SARL), a label free framework that constructs a per response Reasoning Map from intermediate thinking steps and rewards its small world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground truth based RL and prior label free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open ended tasks. Beyond good performance, SARL also exhibits lower KL divergence, higher policy entropy, indicating a more stable and exploratory training and generalized reasoning ability.
Subjects:
Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.27977 [cs.AI]
(or arXiv:2603.27977v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.27977
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Yifan Wang [view email] [v1] Mon, 30 Mar 2026 02:54:48 UTC (817 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv
Google's TurboQuant saves memory, but won't save us from DRAM-pricing hell
<h4>Chocolate Factory’s compression tech clears the way to cheaper AI inference, not more affordable memory</h4> <p>When Google unveiled <a target="_blank" rel="nofollow" href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a>, an AI data compression technology that promises to slash the amount of memory required to serve models, many hoped it would help with a memory shortage that has seen prices triple since last year. Not so much.…</p>
Illinois Tech computer science researcher honored by IEEE Chicago Section - EurekAlert!
<a href="https://news.google.com/rss/articles/CBMiXEFVX3lxTE13OVpWMEk1Z3hlMkR2bHNBQ2dkazFwb3VqN3hCa29GWGJvSVlPa00zd2xUakRmYXFqQmc5OWU0eGl4a21FMDAwWUN2Q3p0M3FrbXBkNV8zN0cxaG1s?oc=5" target="_blank">Illinois Tech computer science researcher honored by IEEE Chicago Section</a> <font color="#6f6f6f">EurekAlert!</font>

My Journey to becoming a Quantum Engineer
<p>I have procrastinated on documenting this process for the longest time. But I think i am ready now (maybe). <br> Coming from a front end engineering background, I am fascinated by the work being done by the quantum engineers at IBM. I am not that great with maths and statistics but I believe anything can be learned with tons of practice and consistency. I want to use this platform to hold myself accountable (that is if i don't give up half way and delete all my posts. I'll try not to btw). </p> <p>This is an article describing <a href="https://www.ibm.com/think/topics/quantum-computing" rel="noopener noreferrer">what quantum computing is</a> and some of it's use cases.</p> <p>I became an IBM qiskit advocate late last year and I have been exposed to a lot of resources and networked a bun
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
Illinois Tech computer science researcher honored by IEEE Chicago Section - EurekAlert!
<a href="https://news.google.com/rss/articles/CBMiXEFVX3lxTE13OVpWMEk1Z3hlMkR2bHNBQ2dkazFwb3VqN3hCa29GWGJvSVlPa00zd2xUakRmYXFqQmc5OWU0eGl4a21FMDAwWUN2Q3p0M3FrbXBkNV8zN0cxaG1s?oc=5" target="_blank">Illinois Tech computer science researcher honored by IEEE Chicago Section</a> <font color="#6f6f6f">EurekAlert!</font>
AI maps science papers to predict research trends two to three years ahead - Tech Xplore
<a href="https://news.google.com/rss/articles/CBMie0FVX3lxTE5aTkZYTWdaRDZwTXNRMldpMG1WZ1YzWDZTOHN5M183Z3A1ZTFYbnhEWTdPRmpvZnZFU0xodlRsNWxFaGxTcEpwalhJNmJpQWE5VjhaRS1tOXJIeTc5Z0JNblJ3dFd4WjRYZGJOX0NrWGt6ZmZJVTBpRm5wWQ?oc=5" target="_blank">AI maps science papers to predict research trends two to three years ahead</a> <font color="#6f6f6f">Tech Xplore</font>
AI inspires new research topics in materials science - Nanowerk
<a href="https://news.google.com/rss/articles/CBMiZ0FVX3lxTFBPWlJSM2ExeVQ3LVppTm45NHpEMW9YVkxscThCNDd2OVB0c3J1ZmVCbWNSZWZ0TjZwSzlOdEFXN2UtRk5LU1hxdXd4ZklldGxoM0FZSnhCd19PWkNHQ1ZRVDNwSHNUSk0?oc=5" target="_blank">AI inspires new research topics in materials science</a> <font color="#6f6f6f">Nanowerk</font>


Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!