Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture (see the config sketch after the article link below):

- LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight
Could not retrieve the full article text.
Read on Reddit r/LocalLLaMA →
https://www.reddit.com/r/LocalLLaMA/comments/1sdfwmu/dante2b_im_training_a_21b_bilingual_fully_open/
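For concreteness, here is roughly how the architecture numbers quoted in the excerpt map onto a standard Hugging Face LlamaConfig. This is a minimal sketch, not the author's actual training config: the vocabulary size, FFN width, context length, RoPE base, and embedding tying are not stated in the post, so those values below are assumptions.

```python
# Sketch of a Dante-2B-like config using the numbers quoted in the post.
# Values marked ASSUMPTION are not given in the excerpt.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=64_000,             # ASSUMPTION: bilingual IT/EN tokenizer size not stated
    hidden_size=2560,              # d_model from the post
    num_hidden_layers=28,          # 28 layers
    num_attention_heads=20,        # 20 query heads -> d_head = 2560 / 20 = 128
    num_key_value_heads=4,         # GQA: 4 KV heads, 5:1 query:KV ratio
    intermediate_size=6912,        # ASSUMPTION: SwiGLU FFN width not stated
    hidden_act="silu",             # SwiGLU gate activation
    rms_norm_eps=1e-5,             # RMSNorm
    rope_theta=10_000.0,           # ASSUMPTION: RoPE base not stated
    max_position_embeddings=4096,  # ASSUMPTION: context length not stated
    tie_word_embeddings=True,      # ASSUMPTION: common for ~2B models
)

model = LlamaForCausalLM(config)   # random init, matching the post's from-scratch setup
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

With the assumed vocabulary and FFN sizes, the parameter count lands near 2.1B, which is consistent with the post's headline figure; the true config may differ.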

More in Models

Got Gemma 4 running locally on CUDA, both float and GGUF quantized, with benchmarks
Spent the last week getting Gemma 4 working on CUDA with both full-precision (BF16) and GGUF quantized inference. Here's a video of it running. Sharing some findings because this model has some quirks that aren't obvious.

Performance (Gemma4 E2B, RTX 3090):

| Config                   | BF16 Float | Q4_K_M GGUF |
|--------------------------|------------|-------------|
| short gen (p=1, g=32)    | 110 tok/s  | 170 tok/s   |
| long gen (p=512, g=128)  | 72 tok/s   | 93 tok/s    |

The precision trap nobody warns you about

Honestly, making it work was harder than I thought. Gemma 4 uses attention_scale=1.0 (QK-norm instead of the usual 1/sqrt(d_k) scaling). This makes it roughly 22x more sensitive to precision errors than standard transformers. Things that work fine on LLaMA or Qwen will silently produce garbage on Gemma 4: F1
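To make the scaling difference concrete, here is a minimal PyTorch sketch contrasting standard 1/sqrt(d_k) attention scaling with a QK-norm-style variant (normalize queries and keys, then use scale 1.0). This illustrates the general technique described in the excerpt only; it is not Gemma's actual implementation, real QK-norm layers typically carry learnable per-head norm weights, and the tensor shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # RMS-normalize the last dimension (per-head q/k normalization, no learnable weight here)
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def standard_attention(q, k, v):
    # Classic scaled dot-product: logits are shrunk by 1/sqrt(d_k)
    d_k = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d_k**0.5
    return F.softmax(scores, dim=-1) @ v

def qk_norm_attention(q, k, v):
    # QK-norm style: normalize q and k first, keep attention_scale = 1.0,
    # so logit magnitudes depend directly on the un-shrunk dot products
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.transpose(-2, -1)
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: (batch, heads, seq, head_dim)
q, k, v = (torch.randn(1, 4, 8, 128) for _ in range(3))
print(standard_attention(q, k, v).shape, qk_norm_attention(q, k, v).shape)
```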




