TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning
Authors:Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu
Abstract: In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to provide reward signals and generate formative feedback, guiding the LLM through iterative refinement. Finally, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the final answer determined through a last round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at this https URL.
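The pseudo-labeling step the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `generate` callable stands in for the LLM (which would return sampled answers for an instance), and the binary match-the-pseudo-label reward is an assumption about how the reward proxy might be computed.

```python
from collections import Counter

def majority_vote(candidates):
    """Derive a pseudo-label: the most frequent answer among sampled candidates."""
    return Counter(candidates).most_common(1)[0][0]

def tr_icrl_round(generate, instances, n_samples=5):
    """One ICRL iteration over retrieved instances (illustrative sketch).

    `generate(instance, k)` is a hypothetical stand-in for the LLM that
    returns k candidate answers. Each candidate receives reward 1 if it
    matches the majority-vote pseudo-label, else 0; the resulting feedback
    would be fed back into the context for the next refinement round.
    """
    feedback = {}
    for inst in instances:
        candidates = generate(inst, n_samples)
        pseudo_label = majority_vote(candidates)
        rewards = [int(c == pseudo_label) for c in candidates]
        feedback[inst] = {"pseudo_label": pseudo_label, "rewards": rewards}
    return feedback

# Toy usage with a deterministic mock generator:
mock_generate = lambda inst, k: ["x", "x", "y"][:k]
print(tr_icrl_round(mock_generate, ["q1"], n_samples=3))
```

The final-answer step in the abstract reuses the same `majority_vote` over candidates produced from the enriched prompt.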
Comments: 14 pages, 7 figures
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2604.00438 [cs.CL]
(or arXiv:2604.00438v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.00438
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wenxuan Jiang [v1] Wed, 1 Apr 2026 03:34:05 UTC (210 KB)