OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

arXiv cs.CL, by Haiyue Song and Masao Utiyama, April 1, 2026 (1 min read)

Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
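The pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a distribution vector is the elementwise difference between a CPT model's parameters and the base model's, that merging adds a weighted sum of those vectors back to the base, and it swaps in plain random search where the paper uses Bayesian optimization. The function names, the [0, 1) weight range, the three-parameter "models", and the scoring function are all hypothetical.

```python
import random

def distribution_vector(cpt_params, base_params):
    """Parameter shift induced by one dataset (assumed: theta_cpt - theta_base)."""
    return [c - b for c, b in zip(cpt_params, base_params)]

def merge(base_params, vectors, weights):
    """Assumed merge rule: theta_base + sum_i w_i * tau_i, elementwise."""
    merged = list(base_params)
    for w, tau in zip(weights, vectors):
        for j, t in enumerate(tau):
            merged[j] += w * t
    return merged

def search_weights(base_params, vectors, score, n_trials=200, seed=0):
    """Random search over weights in [0, 1).

    A simple stand-in for the Bayesian optimization named in the abstract.
    """
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.random() for _ in vectors]
        s = score(merge(base_params, vectors, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy data: 3-parameter "models"; every value here is made up.
base = [0.0, 0.0, 0.0]
cpt_models = {"japanese": [1.0, 0.0, 0.0], "math": [0.0, 2.0, 0.0]}
vectors = [distribution_vector(p, base) for p in cpt_models.values()]

# Hypothetical objective: closeness to a target vector (higher is better).
# In practice this would be a benchmark score of the merged model.
target = [0.5, 1.0, 0.0]

def toy_score(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))

best_weights, best_score = search_weights(base, vectors, toy_score)
print(best_weights, best_score)
```

Because the vectors are computed once and only the weights change, re-running `search_weights` with a different `score` function needs no retraining, which is the property the abstract highlights.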

Comments: Preprint, 20 pages, 10 tables, 12 figures

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.28858 [cs.CL]

(or arXiv:2603.28858v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.28858

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Haiyue Song
[v1] Mon, 30 Mar 2026 18:00:02 UTC (1,202 KB)

Original source

arXiv cs.CL

https://arxiv.org/abs/2603.28858

Models | Fresh

Intel B70 with Qwen3.5 35B

Intel recently released support for Qwen3.5: https://github.com/intel/llm-scaler/releases/tag/vllm-0.14.0-b8.1

Is anyone with a B70 willing to run llama-benchy with the settings below on the 35B model?

uvx llama-benchy --base-url $URL --model $MODEL --depth 0 --pp 2048 --tg 512 --concurrency 1 --runs 3 --latency-mode generation --no-cache --save-total-throughput-timeseries

Submitted by /u/Fmstrat

Reddit r/LocalLLaMA

1m | about 2 hours ago

Models | Fresh

We made significant improvements to the Kokoro TTS trainer

Kokoro is a pretty popular tool, and for good reason: it can run on CPUs on desktops and phones. We found it pretty useful ourselves, with only one issue: training custom voices. There was a great tool called KVoiceWalk that solved this, with only one problem: it ran only on CPU and took about 26 hours to train a single voice. So we made significant improvements in our fork: https://github.com/BovineOverlord/kvoicewalk-with-GPU-CUDA-and-GUI-queue-system

As the name suggests, we added GPU/CUDA support to the tool. Results were 6.5x faster on a 3060. We also created a GUI for easier use, which includes a queuing system for training multiple voices. Hope this helps the community. We'll be adding this TTS with our own custom voices to our game in the coming days. Let me know if you have any questions.

Reddit r/LocalLLaMA

1m | about 2 hours ago

Market News | Fresh

[D] The memory chip market lost tens of billions over a paper this community would have understood in 10 minutes

TurboQuant was teased recently, and tens of billions were wiped from the memory chip market in 48 hours, but anyone in this community who read the paper would have seen the problem with the panic immediately. TurboQuant compresses the KV cache down to 3 bits per value from the standard 16 using polar coordinate quantization. But the KV cache is inference memory. Training memory (activations, gradients, optimizer states) is a completely different thing and completely untouched. And the majority of HBM demand comes from training, so an inference compression paper doesn't move that number. The commercial inference baseline also already runs at 4 to 8 bit precision, while the 6x headline is benchmarked against 16 bit full precision. The real marginal gain over what's actually deployed is considerably smaller than that.

Reddit r/MachineLearning

2m | about 4 hours ago
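The baseline argument in the TurboQuant post above comes down to simple arithmetic, sketched below. It compares raw bits per value only, using the 3, 4, 8 and 16 bit figures quoted in the post, and ignores any per-block metadata or other overhead a real quantization scheme may carry, so the numbers are rough illustrations rather than measured results. The function name is hypothetical.

```python
def compression_ratio(baseline_bits, compressed_bits):
    """Raw size ratio between two bit widths, ignoring any format overhead."""
    return baseline_bits / compressed_bits

COMPRESSED_BITS = 3  # bits per KV cache value quoted in the post

for baseline_bits in (16, 8, 4):
    ratio = compression_ratio(baseline_bits, COMPRESSED_BITS)
    print(f"{baseline_bits}-bit baseline -> {COMPRESSED_BITS}-bit: {ratio:.2f}x smaller")

# 16-bit baseline -> 3-bit: 5.33x smaller
# 8-bit baseline -> 3-bit: 2.67x smaller
# 4-bit baseline -> 3-bit: 1.33x smaller
```

Measured against an 8 bit or 4 bit deployment, the raw saving is roughly a half to a quarter of the saving measured against 16 bit, which is the gap the post points at.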

More in Models

Models

Mistral AI Raises $830 Million in Debt For Nvidia-Powered Data Center - WSJ

GNews AI Mistral

1m | 7 days ago

Models | Live

Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought: English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch: no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture:
LLaMA-style with GQA (20 query heads, 4 KV heads, 5:1 ratio)
SwiGLU FFN, RMSNorm, RoPE
d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
Weight

Reddit r/LocalLLaMA

4m | 37 minutes ago
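The Dante-2B architecture figures quoted above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers in the post (d_model, layer count, head counts, head size) and assumes the usual KV cache layout of one key and one value vector per KV head per layer. The helper name is hypothetical, and the counts are in stored values, not bytes, since the post excerpt does not state a cache precision.

```python
D_MODEL, N_LAYERS, D_HEAD = 2560, 28, 128
N_Q_HEADS, N_KV_HEADS = 20, 4

def kv_values_per_token(n_kv_heads, d_head, n_layers):
    """Keys plus values cached per token, summed over all layers."""
    return 2 * n_kv_heads * d_head * n_layers

# The query heads tile d_model exactly: 20 * 128 = 2560.
assert N_Q_HEADS * D_HEAD == D_MODEL

gqa = kv_values_per_token(N_KV_HEADS, D_HEAD, N_LAYERS)  # grouped-query attention
mha = kv_values_per_token(N_Q_HEADS, D_HEAD, N_LAYERS)   # if every query head had its own KV head
reduction = mha / gqa

print(gqa, mha, reduction)  # 28672 143360 5.0
```

The 5x reduction in cached values per token matches the 5:1 query-to-KV head ratio given in the post.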