OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
Hey there, little explorer! Imagine you have a super smart robot friend who loves to learn new things, like talking about space or counting toys.
Usually, to teach the robot, we mix all the space books and toy books together before he starts learning. But sometimes, we don't know the best mix!
This new idea, called OptiMer, is like teaching the robot about space and toys separately first. Then, it's like we have magic dials for "space knowledge" and "toy knowledge." We can turn these dials after he's learned, to make him super good at exactly what we need, without making him learn everything all over again!
It's like having a super-duper recipe that you can change after you've baked the cake, to make it taste just right! Yum!
AI-generated summary
OptiMer enables flexible continual pre-training by decoupling data mixture ratio selection from training through post-hoc Bayesian optimization of distribution vectors extracted from individual dataset models.
Abstract
Continual pre-training (CPT) is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines at 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data-mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
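The recipe in the abstract can be made concrete in a few lines. The sketch below is a toy illustration of the idea, not the authors' released code: the `base` parameters, the per-dataset CPT models, and the `evaluate` objective are placeholder assumptions, and scikit-optimize's `gp_minimize` is just one possible Bayesian-optimization backend rather than the one used in the paper.

```python
# Minimal sketch of the OptiMer idea as described in the abstract, NOT the
# authors' implementation. Toy tensors stand in for real LLM state dicts;
# `evaluate` stands in for a validation metric on the target objective.
import torch
from skopt import gp_minimize  # one possible BO backend (an assumption)

torch.manual_seed(0)

# Stand-ins: base model parameters and two per-dataset CPT models.
base = {"w": torch.randn(8)}
cpt_models = [{"w": base["w"] + 0.1 * torch.randn(8)} for _ in range(2)]

# Step 1: extract one distribution vector per dataset
# (the parameter shift delta_i = theta_i - theta_base).
deltas = [{k: m[k] - base[k] for k in base} for m in cpt_models]

def merge(weights):
    # theta(w) = theta_base + sum_i w_i * delta_i
    return {k: base[k] + sum(w * d[k] for w, d in zip(weights, deltas))
            for k in base}

def evaluate(params):
    # Placeholder objective; in practice this would be the validation
    # loss of the merged LLM on the target language or domain.
    target = base["w"] + 0.7 * deltas[0]["w"] + 0.2 * deltas[1]["w"]
    return float(torch.norm(params["w"] - target))

# Step 2: post-hoc Bayesian optimization over composition weights.
# No retraining is needed; only cheap merges and evaluations.
result = gp_minimize(lambda w: evaluate(merge(w)),
                     dimensions=[(0.0, 1.0)] * len(deltas),
                     n_calls=25, random_state=0)
print("optimized weights:", result.x)
```

Because the search touches only the frozen pool of distribution vectors, the same `deltas` can be re-optimized against a different `evaluate` objective to produce a differently tailored model on demand, which is the flexibility the abstract highlights.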
