From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

arXiv · March 31, 2026 · 10 min read

Authors: Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B) spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
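
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the phases are switched or how the entropy term is weighted, so the hard switch at `switch_step`, the weight `beta`, the additive form of the objective, and every function name below are assumptions made for illustration. The sketch shows a token-level entropy, an entropy term whose sign flips between phases, and the standard group-relative advantage normalization, which collapses to zero when every sample in a group receives the same reward.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_sign(step, switch_step):
    """Assumed schedule: +1 rewards entropy (exploration) before
    switch_step, -1 penalizes it (exploitation) from switch_step on."""
    return 1.0 if step < switch_step else -1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale).
    If all rewards are identical, every advantage is 0 and the group
    contributes no learning signal."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def two_stage_objective(policy_term, token_dists, step, switch_step, beta=0.01):
    """Hypothetical objective to maximize: a policy term plus a signed,
    beta-weighted mean token-level entropy over the generated tokens."""
    mean_entropy = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    return policy_term + entropy_sign(step, switch_step) * beta * mean_entropy
```

Under these assumptions, a uniform two-token distribution raises the objective before `switch_step` and lowers it afterwards, while a group with rewards `[1, 1, 1, 1]` yields all-zero advantages. That second case is the situation the exploration phase is meant to avoid by keeping outputs within a group diverse.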

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2511.07738 [cs.LG] (or arXiv:2511.07738v2 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2511.07738 (arXiv-issued DOI via DataCite)

Submission history

From: Donglai Xu
[v1] Tue, 11 Nov 2025 01:42:37 UTC (421 KB)
[v2] Sat, 28 Mar 2026 07:59:38 UTC (1,963 KB)
