TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning
Authors:Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu
Abstract: In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to provide reward signals and generate formative feedback, guiding the LLM through iterative refinement. Finally, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the final answer determined through a last round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at this https URL.
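The pseudo-labeling step the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `generate` callable stands in for the LLM (which would return sampled answers for an instance), and the binary match-the-pseudo-label reward is an assumption about how the reward proxy might be computed.

```python
from collections import Counter

def majority_vote(candidates):
    """Derive a pseudo-label: the most frequent answer among sampled candidates."""
    return Counter(candidates).most_common(1)[0][0]

def tr_icrl_round(generate, instances, n_samples=5):
    """One ICRL iteration over retrieved instances (illustrative sketch).

    `generate(instance, k)` is a hypothetical stand-in for the LLM that
    returns k candidate answers. Each candidate receives reward 1 if it
    matches the majority-vote pseudo-label, else 0; the resulting feedback
    would be fed back into the context for the next refinement round.
    """
    feedback = {}
    for inst in instances:
        candidates = generate(inst, n_samples)
        pseudo_label = majority_vote(candidates)
        rewards = [int(c == pseudo_label) for c in candidates]
        feedback[inst] = {"pseudo_label": pseudo_label, "rewards": rewards}
    return feedback

# Toy usage with a deterministic mock generator:
mock_generate = lambda inst, k: ["x", "x", "y"][:k]
print(tr_icrl_round(mock_generate, ["q1"], n_samples=3))
```

The final-answer step in the abstract reuses the same `majority_vote` over candidates produced from the enriched prompt.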
Comments: 14 pages, 7 figures
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2604.00438 [cs.CL]
(or arXiv:2604.00438v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.00438
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wenxuan Jiang [v1] Wed, 1 Apr 2026 03:34:05 UTC (210 KB)