AI News Hub · by Eigenvector

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

HuggingFace Papers · March 29, 2026 · 2 min read

Discrete Native Autoregressive framework enables unified multimodal processing by representing diverse modalities in a shared discrete space through a novel visual transformer architecture. (43 upvotes on HuggingFace)



Abstract


The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
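The core idea in the abstract, representing every modality as discrete tokens in one shared space so that a single next-token objective covers them all, can be illustrated with a toy sketch. This is not the paper's implementation: the codebook, patch representation, and vocabulary offset below are invented for illustration only.

```python
# Illustrative sketch (NOT the DiNA/dNaViT implementation): a toy
# vector quantizer maps continuous "visual patches" to discrete codebook
# indices, which are then offset past the text vocabulary so text and
# visual tokens live in one shared sequence for next-token prediction.

TEXT_VOCAB_SIZE = 1000  # assumed text vocabulary size (hypothetical)

def quantize_patch(patch, codebook):
    """Return the index of the nearest codebook entry (squared L2
    distance) -- the patch's discrete token."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))

def to_unified_sequence(text_ids, patches, codebook):
    """Concatenate text token ids with quantized visual token ids.
    Visual indices are shifted by TEXT_VOCAB_SIZE so one autoregressive
    model can predict either kind of token from a single distribution."""
    visual_ids = [TEXT_VOCAB_SIZE + quantize_patch(p, codebook) for p in patches]
    return text_ids + visual_ids

# Tiny 2-D codebook with three entries, purely for demonstration.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
seq = to_unified_sequence([5, 17], [[0.9, 1.1], [0.1, -0.2]], codebook)
# seq -> [5, 17, 1001, 1000]: two text tokens followed by two visual tokens.
```

A real system would learn the codebook (the paper's dNaViT produces hierarchical tokens at arbitrary resolutions), but the payoff is the same: once everything is an integer in one vocabulary, the standard NTP loss applies uniformly across modalities.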


Get this paper in your agent:

hf papers read 2603.27538

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

