A Systematic Study of In-the-Wild Model Merging for Large Language Models
arXiv:2511.21437v2 Announce Type: replace-cross Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for settings where all merged experts have distinct roles and are tuned on clearly separated tasks also hold in settings where the merged experts do not have clearly distinct roles, but are trained on overlapping or even conflicting objectives. To evaluate this setting, we present a large — O\u{g}uz Ka\u{g}an Hitit, Leander Girrbach, Zeynep Akata
View PDF HTML (experimental)
Abstract:Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for settings where all merged experts have distinct roles and are tuned on clearly separated tasks also hold in settings where the merged experts do not have clearly distinct roles, but are trained on overlapping or even conflicting objectives. To evaluate this setting, we present a large-scale, systematic evaluation of "in-the-wild" model merging of heterogeneous experts, that may have been trained on overlapping or conflicting objectives. Concretely, we evaluate six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a model merged from a heterogeneous set of experts outperforms the base model and we measure relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs in this "in-the-wild" setting. Other interference-aware and subspace merging methods typically do not result in notable improvements over the base model. Our findings indicate that current merging techniques mostly do not enable extracting useful weight updates from heterogeneous and potentially conflicting versions. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods.
Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2511.21437 [cs.CL]
(or arXiv:2511.21437v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2511.21437
arXiv-issued DOI via DataCite
Journal reference: Transactions on Machine Learning Research (03/2026)
Submission history
From: Oğuz Kağan Hitit [view email] [v1] Wed, 26 Nov 2025 14:28:11 UTC (2,098 KB) [v2] Sun, 29 Mar 2026 15:10:49 UTC (2,065 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.






Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!