
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

arXiv, March 30, 2026

arXiv:2603.23885v2 (announce type: replace)

Authors: Gengluo Li, Pengyuan Lyu, Chengquan Zhang, Huawen Shen, Liang Wu, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou


Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions, primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
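The abstract does not specify how layout templates and document elements are combined, so the following is only a minimal sketch of the general idea of template-based page synthesis, not the authors' pipeline. Every name in it (Region, TEMPLATE, ELEMENT_POOL, synthesize_page) is hypothetical, the element pool is toy data, and the rendering step that would turn the placed elements into a page image is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Region:
    """One slot of a hypothetical layout template."""
    kind: str    # element type the slot accepts, e.g. "title", "text", "table"
    bbox: tuple  # (x0, y0, x1, y1) in normalized page coordinates


# Hypothetical single-column template; slots are listed in reading order.
TEMPLATE = [
    Region("title", (0.1, 0.05, 0.9, 0.12)),
    Region("text", (0.1, 0.15, 0.9, 0.45)),
    Region("table", (0.1, 0.50, 0.9, 0.75)),
    Region("text", (0.1, 0.80, 0.9, 0.95)),
]

# Toy element pool (an assumption): markup strings grouped by element type.
ELEMENT_POOL = {
    "title": ["# Quarterly Report", "# Meeting Notes"],
    "text": ["Revenue grew in the third quarter.", "Action items are listed below."],
    "table": ["| Item | Value |\n| --- | --- |\n| A | 1 |"],
}


def synthesize_page(template, pool, rng):
    """Fill each template slot with a sampled element.

    Returns the placed elements (which a renderer could draw onto a page)
    and the full-page target string, built by joining element markup in
    reading order.
    """
    placed = []
    for region in template:
        content = rng.choice(pool[region.kind])
        placed.append({"kind": region.kind, "bbox": region.bbox, "content": content})
    target = "\n\n".join(item["content"] for item in placed)
    return placed, target


placed, target = synthesize_page(TEMPLATE, ELEMENT_POOL, random.Random(0))
print(target)
```

Because the target string is assembled from the same elements that are placed on the page, each synthetic page comes with full-page supervision at no annotation cost. Varying the templates and the element pool is one way such a scheme could reach the structural diversity the abstract mentions.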

Comments: Accepted to CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.23885 [cs.CV] (or arXiv:2603.23885v2 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.23885 (arXiv-issued DOI via DataCite)

Submission history

From: Gengluo Li

[v1] Wed, 25 Mar 2026 03:19:09 UTC (873 KB)

[v2] Fri, 27 Mar 2026 06:38:20 UTC (873 KB)

Original source: arXiv, https://arxiv.org/abs/2603.23885
