
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

arXiv, March 31, 2026

arXiv:2601.08026v4 (announce type: replace-cross)

Authors: Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang


Abstract: Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% only have captions shorter than ten words, causing them to be discarded by existing caption-decomposition pipelines. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. FigEx2 achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.
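The detection score in the abstract is reported over the IoU range 0.5:0.95. Assuming the paper follows the common COCO-style convention (the abstract does not say so explicitly), that range means precision is averaged over ten intersection-over-union thresholds from 0.50 to 0.95 in steps of 0.05. The minimal Python sketch below shows only the threshold sweep for one predicted panel box against one ground-truth box. The box format and the helper names `iou` and `thresholds_passed` are hypothetical, chosen for illustration, and this is not the authors' evaluation code.

```python
from typing import List, Sequence

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS: List[float] = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which `pred` would count as a match for `truth`."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A prediction that overlaps its target loosely passes only the lower thresholds, so a score averaged over the whole range rewards tight panel localization more than a single-threshold score at 0.5 would. A full mAP computation additionally ranks predictions by confidence and integrates precision over recall, which this sketch omits.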

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2601.08026 [cs.CV]

(or arXiv:2601.08026v4 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.08026

arXiv-issued DOI via DataCite

Submission history

From: Jifeng Song

[v1] Mon, 12 Jan 2026 21:57:52 UTC (4,239 KB)

[v2] Wed, 14 Jan 2026 15:49:01 UTC (4,238 KB)

[v3] Wed, 25 Feb 2026 13:52:52 UTC (4,238 KB)

[v4] Mon, 30 Mar 2026 15:19:06 UTC (4,237 KB)

Original source

arXiv

https://arxiv.org/abs/2601.08026
