
ExFusion: Efficient Transformer Training via Multi-Experts Fusion

arXiv · March 31, 2026

Authors: Jiacheng Ruan, Daize Dong, Xiaoye Qu, Tong Zhu, Ting Liu, Yuzhuo Fu, Yu Cheng, Suncheng Xiang


Abstract: Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate the multiple experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
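The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each expert starts as a copy of the dense FFN, that the per-expert fusion weights are normalized with a softmax, and that fusion is a weighted sum of expert parameters. The abstract specifies none of these details, and all function and variable names here (`upcycle`, `fuse`, `fusion_logits`) are hypothetical.

```python
import math
import random


def matmul_vec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(params, x):
    """Two-layer FFN with ReLU: W2 @ relu(W1 @ x). Biases omitted for brevity."""
    hidden = [max(0.0, h) for h in matmul_vec(params["W1"], x)]
    return matmul_vec(params["W2"], hidden)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def upcycle(params, num_experts):
    """Assumed initialization: every expert is a deep copy of the dense FFN."""
    return [
        {key: [row[:] for row in W] for key, W in params.items()}
        for _ in range(num_experts)
    ]


def fuse(experts, fusion_logits):
    """Weighted sum of expert parameters, giving one FFN-shaped parameter set."""
    weights = softmax(fusion_logits)  # assumption: weights are softmax-normalized
    fused = {}
    for key in experts[0]:
        rows = len(experts[0][key])
        cols = len(experts[0][key][0])
        fused[key] = [
            [sum(w * e[key][r][c] for w, e in zip(weights, experts)) for c in range(cols)]
            for r in range(rows)
        ]
    return fused


random.seed(0)
# Dense FFN: 3 inputs -> 4 hidden units -> 3 outputs.
dense = {
    "W1": [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)],
    "W2": [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)],
}
experts = upcycle(dense, num_experts=4)
fusion_logits = [0.0, 0.0, 0.0, 0.0]  # would be learned during training

x = [0.5, -1.0, 2.0]
fused = fuse(experts, fusion_logits)
y_dense = ffn(dense, x)
y_fused = ffn(fused, x)  # one FFN forward pass, regardless of expert count
```

Under these assumptions the fused expert has exactly the shape of the original FFN, so the forward pass costs the same as a dense layer, and at initialization it reproduces the dense output. After training, calling `fuse` once more would yield the single parameter set that is kept for deployment.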

Comments: Accepted by IEEE TMM 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27965 [cs.CV] (or arXiv:2603.27965v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27965

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Suncheng Xiang
[v1] Mon, 30 Mar 2026 02:40:20 UTC (475 KB)
