Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessHow to Use Shaders in React (2026 WebGPU / WebGL Tutorial)DEV CommunityThe 5th Agent Orchestration Pattern: Market-Based Task AllocationDEV CommunityThe Hidden Cost of Copy-Pasting Code Into ChatGPTDEV Community14-Package Monorepo: How We Structured WAIaaS for AI Agent BuildersDEV CommunityPromoting raw BG3 gameplay bundle previews in the TD2 SDL portDEV CommunityWhat Is New In Helm 4 And How It Improves Over Helm 3DEV CommunityDevelopers Are Designing for AI Before Users NowDEV CommunityStop Using Elaborate Personas: Research Shows They Degrade Claude Code OutputDEV CommunityAn Engineering-grade breakdown of RAG PipelineDEV CommunityUnlock the Power of Private AI: Build a Local RAG Pipeline with LangGraph, Ollama & Vector DatabasesDEV CommunityDeepSource for Python: Static Analysis and AutofixDEV CommunityI tried to destroy this AirTag alternative, but it wouldn't crack - unlike othersZDNet AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessHow to Use Shaders in React (2026 WebGPU / WebGL Tutorial)DEV CommunityThe 5th Agent Orchestration Pattern: Market-Based Task AllocationDEV CommunityThe Hidden Cost of Copy-Pasting Code Into ChatGPTDEV Community14-Package Monorepo: How We Structured WAIaaS for AI Agent BuildersDEV CommunityPromoting raw BG3 gameplay bundle previews in the TD2 SDL portDEV CommunityWhat Is New In Helm 4 And How It Improves Over Helm 3DEV CommunityDevelopers Are Designing for AI Before Users NowDEV CommunityStop Using Elaborate Personas: Research Shows They Degrade Claude Code OutputDEV CommunityAn Engineering-grade breakdown of RAG PipelineDEV CommunityUnlock the Power of Private AI: Build a Local RAG Pipeline with LangGraph, Ollama & Vector DatabasesDEV CommunityDeepSource for Python: Static Analysis and AutofixDEV CommunityI tried to destroy this AirTag alternative, but it wouldn't crack - unlike othersZDNet AI

MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation

arXivMarch 31, 20262 min read0 views
Source Quiz

arXiv:2603.27185v1 Announce Type: new Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints, (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimiz — Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng, Liang Wang

View PDF HTML (experimental)

Abstract:Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints, (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantics representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27185 [cs.CV]

(or arXiv:2603.27185v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27185

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xiaofeng Tan [view email] [v1] Sat, 28 Mar 2026 08:10:53 UTC (4,086 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
MotionRFT: …researchpaperarxivcomputer-vi…image-recog…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 179 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers