Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

By Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao
Source: ArXiv CS.AI, April 1, 2026

Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
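The abstract reports both an "average recall" over seven tasks and a "weighted overall recall" for the adversarial OCR setting. The sketch below illustrates how these two aggregates can differ. It is not the authors' evaluation code: the abstract does not say how the weights are defined, so weighting each category by its number of positive samples is an assumption here, and the task names and counts are hypothetical.

```python
from typing import Dict, Tuple

# Each entry maps a task name to (true_positives, false_negatives).
Counts = Dict[str, Tuple[int, int]]


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when a task has no positives."""
    total = tp + fn
    return tp / total if total else 0.0


def average_recall(counts: Counts) -> float:
    """Unweighted mean of per-task recall: every task counts equally."""
    return sum(recall(tp, fn) for tp, fn in counts.values()) / len(counts)


def weighted_recall(counts: Counts) -> float:
    """Mean of per-task recall weighted by positives per task (assumed weighting).

    With this weighting the result equals pooled TP / pooled positives.
    """
    total_positives = sum(tp + fn for tp, fn in counts.values())
    total_tp = sum(tp for tp, _ in counts.values())
    return total_tp / total_positives if total_positives else 0.0


# Hypothetical counts, chosen only to show the two aggregates diverging.
example_counts: Counts = {
    "task_a": (90, 10),  # recall 0.90 on 100 positives
    "task_b": (45, 5),   # recall 0.90 on 50 positives
    "task_c": (8, 2),    # recall 0.80 on 10 positives
}

if __name__ == "__main__":
    print(f"average recall:  {average_recall(example_counts):.4f}")
    print(f"weighted recall: {weighted_recall(example_counts):.4f}")
```

Under this assumed weighting, a small task with low recall pulls the unweighted average down more than the weighted figure, which is one reason the two numbers in the abstract are not directly comparable.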

Comments: 41 pages, 10 figures

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.29211 [cs.AI]

(or arXiv:2603.29211v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29211

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xiaolei Lv
[v1] Tue, 31 Mar 2026 03:27:49 UTC (4,509 KB)
