Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts, Enhanced Hugging Face Integrations
Together AI expands the Fine-Tuning Platform: train 100B+ parameter models, extend context lengths, integrate with the Hugging Face Hub, and access new DPO options.
Model customization is a versatile tool for many kinds of AI developers. For instance, you can make the strongest open LLMs even better at business-critical tasks by fine-tuning them on domain-specific data. You can also drastically reduce both inference costs and latency by training smaller but equally capable models.
Our goal with the Together Fine-Tuning Platform is to streamline model training for AI developers, helping them quickly build the best models for their applications with convenient and affordable tools. This release brings a new package of improvements that drastically expands the scope of what you can train: from native support for over a dozen of the latest LLMs to new DPO options and deeper integration with the Hugging Face Hub. Learn more about the new features in this blog post!
Large models at Together AI
In 2025, we have seen a great number of models with over 100B parameters released to the public. These models, such as DeepSeek-R1, Qwen3-235B, or Llama 4 Maverick, offer a dramatic jump in capabilities, sometimes rivaling even the strongest proprietary models on certain tasks. With fine-tuning, you can further refine the abilities of these models, steering them towards the precise behavior you need or showing them how to solve complex tasks through SFT demonstrations. However, training large models is challenging: even orchestrating multi-node jobs reliably is non-trivial, and running them efficiently requires substantial engineering investment across the stack.
Now, you can train the latest large models on the Together Fine-Tuning Platform! By implementing the latest training optimizations and carefully engineering our platform, we made it possible to easily train models with hundreds of billions of parameters at a low cost. We recently announced the general availability of OpenAI's gpt-oss fine-tuning on our platform, and now we support even more model families, covering recent releases from DeepSeek, Qwen, and Meta.
The full list of new large models is as follows:
- openai/gpt-oss-120b
- deepseek-ai/DeepSeek-V3.1
- deepseek-ai/DeepSeek-V3.1-Base
- deepseek-ai/DeepSeek-R1-0528
- deepseek-ai/DeepSeek-R1
- deepseek-ai/DeepSeek-V3-0324
- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepSeek-V3-Base
- Qwen/Qwen3-Coder-480B-A35B-Instruct
- Qwen/Qwen3-235B-A22B (context length 32,768 for SFT and 16,384 for DPO)
- Qwen/Qwen3-235B-A22B-Instruct-2507 (context length 32,768 for SFT and 16,384 for DPO)
- meta-llama/Llama-4-Maverick-17B-128E
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
- meta-llama/Llama-4-Scout-17B-16E
- meta-llama/Llama-4-Scout-17B-16E-Instruct
Unless stated otherwise, we support a context length of 16,384 tokens for SFT and 8,192 tokens for DPO training. Once the training run finishes, you can start a Dedicated Endpoint to run inference for these models, as well as download their final or intermediate checkpoints.
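Here is a minimal sketch of what launching such a job looks like with the Together Python client (the file name and hyperparameters are illustrative; see the fine-tuning docs for the full set of options):

```python
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Upload a JSONL training file with your SFT examples.
train_file = client.files.upload(file="train_data.jsonl")

# Launch an SFT job on one of the newly supported 100B+ parameter models.
job = client.fine_tuning.create(
    model="deepseek-ai/DeepSeek-V3.1",
    training_file=train_file.id,
    n_epochs=1,
    learning_rate=1e-5,
    suffix="my-domain-v1",
)

# Check progress; once the job finishes, deploy the model on a Dedicated
# Endpoint or download its final or intermediate checkpoints.
print(client.fine_tuning.retrieve(job.id).status)
```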
See the complete list of models supported for fine-tuning in our docs, and check out the pricing page for details about the cost of fine-tuning 100B+ parameter models.
Context length extensions
With recent progress on tasks such as long-document processing, editing of large codebases, and agentic interaction chains, reliable handling of long contexts is more important than ever. Ideally, you want these long examples to be present in your training data, as this eliminates the mismatch between training and test-time inputs, boosting results on the target task.
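In practice, long-context examples use the same conversational JSONL format as shorter ones, just with longer turns. The snippet below is a minimal sketch of preparing such a file (the OpenAI-style messages schema shown here is the common format for conversational fine-tuning data; check the data-preparation docs for the exact fields expected):

```python
import json

# Each line in the training file is one example; long documents or lengthy
# multi-turn traces simply become long `content` strings or many turns.
example = {
    "messages": [
        {"role": "system", "content": "You are a contract-review assistant."},
        {"role": "user", "content": "<full text of a 50-page contract>"},
        {"role": "assistant", "content": "<clause-by-clause analysis>"},
    ]
}

with open("long_context_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```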
Given this trend, we wanted to make it possible for AI developers to harness long-context abilities in fine-tuning. To make this happen, we have overhauled our training systems and identified ways to increase the maximum supported context length for most of our models, at no additional cost to you. On average, you can expect 2x-4x increases in context length, with some models (like Llama 3.1 8B or Gemma 3 4B) jumping to their maximum length of 131K tokens.
See the figure below for example context-length increases.
Slingshot AI, the company behind the AI therapy app Ash, built a foundation model for psychology and fine-tuned it with long-context clinical conversations. For their use case, long-context fine-tuning was essential to capture the full scope of these conversations.
"The technical challenge was running our multi-stage pipeline reliably at the conversation lengths our therapy models require," explains Daniel Cahn. "Together AI's platform eliminated the context length constraints and job failures we hit elsewhere, letting us experiment rapidly." - Daniel Cahn, Co-founder & CEO, Slingshot AI
For some of the larger models like Llama-3.3-70B, we also offer a separate option of full-context fine-tuning. See the complete list of such models, as well as the context lengths we support, in the docs.
Our work here is far from done: as we discover and implement additional optimizations of our training systems, we will push for larger context lengths (even for 100B+ models) while aiming to keep the runtime and costs of training low. If you need long-context training for a model that is currently missing or need to further increase the context length, we would love to learn more about your use case and support it!
Fine-tune your own model, upload to HF Hub
Given the pace of progress in AI, the community releases stronger task-specific models nearly every day. Some of these models are further fine-tuned variants of base models released by companies like DeepSeek or Meta, while others share a similar architecture but use a different configuration or were trained from scratch. No matter the setup, it is almost a given that these models are published on the Hugging Face Hub, the most popular choice for hosting checkpoints these days.
Today, we are making this plethora of models from the Hugging Face Hub available for fine-tuning through Together AI! Our reasoning is simple: if a model already adapted to a relevant task exists, you should be able to use it as a starting point for further fine-tuning. Similarly, you can also save the outputs of your runs directly into a repository on the Hub. Just specify an HF API key with the appropriate permissions (either when starting your job or in your Together AI profile on the website), and your runs will be able to read or write any checkpoints that this key can access.
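The sketch below shows the rough shape of this flow with the Python client. Note that the Hugging Face-related argument names (`hf_api_token`, `hf_output_repo_name`) and the use of a Hub repo id as the `model` argument are illustrative placeholders rather than confirmed parameter names; refer to the Hugging Face integration guide in the docs for the exact arguments.

```python
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

train_file = client.files.upload(file="train_data.jsonl")

# Start from a community checkpoint hosted on the Hugging Face Hub and push
# the fine-tuned result back to a repo that your HF token can write to.
# NOTE: the HF-related argument names below are illustrative placeholders;
# see the Together docs for the exact parameter names.
job = client.fine_tuning.create(
    model="your-org/your-custom-llama-8b",        # base checkpoint on the HF Hub
    training_file=train_file.id,
    hf_api_token=os.environ["HF_TOKEN"],          # placeholder argument name
    hf_output_repo_name="your-org/your-finetuned-model",  # placeholder argument name
)
print(job.id)
```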
Early customers such as Slingshot and Parsed have already been using this feature to train their best models, seamlessly integrating the Together Fine-Tuning Platform into their pipelines with the Hugging Face Hub as an intermediate step. Now, we're making this feature available to every user of Together AI.
To showcase how you can start training models from the Hugging Face Hub and upload the final model to your organization, read our deep-dive blog post, written in partnership with Hugging Face. Keep in mind this feature is experimental and best-effort: while we are continuously stress-testing our platform against a broad variety of models, expect some rough edges for more exotic setups. For a detailed description of Hugging Face integrations, you can also refer to our guide in the docs!
Further improvements
We are also expanding the support for Preference Optimization with more advanced training objectives, inspired by training recipes of state-of-the-art models that used DPO variants. In particular, we now support variants such as length-normalized DPO (LN-DPO), DPO+NLL (from the Iterative RPO paper), and SimPO through corresponding flags --dpo-normalize-logratios-by-length, --rpo-alpha, and --simpo-gamma. Depending on the target setting, you can combine those options to arrive at the best setup for training on preference data.
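As a rough reference for what each flag controls, here is a sketch of these objectives as they are commonly written in the literature (the platform's exact parameterization and defaults may differ; see the preference-optimization docs):

```latex
% Standard DPO on a preference pair (x, y_w, y_l), with implicit reward r(y):
\mathcal{L}_{\mathrm{DPO}} = -\log\sigma\big(\beta\,[\,r(y_w) - r(y_l)\,]\big),
\qquad r(y) = \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% LN-DPO (--dpo-normalize-logratios-by-length): divide each log-ratio by the response length
r_{\mathrm{LN}}(y) = \frac{1}{|y|}\,\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% DPO+NLL from Iterative RPO (--rpo-alpha): add a length-normalized NLL term on the chosen response
\mathcal{L}_{\mathrm{RPO}} = \mathcal{L}_{\mathrm{DPO}} - \frac{\alpha}{|y_w|}\,\log\pi_\theta(y_w \mid x)

% SimPO (--simpo-gamma): reference-free, length-normalized rewards with target margin gamma
\mathcal{L}_{\mathrm{SimPO}} = -\log\sigma\Big(\frac{\beta}{|y_w|}\log\pi_\theta(y_w \mid x)
  - \frac{\beta}{|y_l|}\log\pi_\theta(y_l \mid x) - \gamma\Big)
```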
Lastly, we now fully support batch_size="max" when starting your jobs through the API or the Python client. This convenience option always sets the batch size to the highest value we support on the platform, regardless of the model or the training mode (SFT or DPO). We highly recommend keeping the batch size set to "max" unless you need to control it for your experiments. This way, you'll automatically benefit from further training optimizations as we roll them out.
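For example, through the Python client (a minimal sketch; the training file id is a placeholder):

```python
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

job = client.fine_tuning.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    training_file="file-abc123",   # id returned by client.files.upload(...)
    batch_size="max",              # always use the largest batch size supported
)
```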
Looking ahead
At Together AI, we aim to offer the best set of tools for engineers and researchers working with large language models. With the latest set of additions to our Fine-Tuning Platform, you can easily kick off even the most demanding and sophisticated training runs with little manual effort. At the same time, thanks to our research in ML systems, running experiments via Together AI is remarkably fast and cost-effective, which helps you turn fine-tuning into a natural part of your AI development cycle.
We are thrilled to see what you will build with the features and models we've added! As always, if you have any feedback about fine-tuning at Together AI or want to explore it for your tasks in more depth, feel free to reach out to us!