ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation
Abstract: Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: this https URL
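The abstract contrasts independent patch-wise prediction with joint, context-aware inference over correlated spatial units. The following is a minimal NumPy sketch of that contrast only, not ConInfer's actual algorithm: the function names, the affinity-based propagation scheme, and the parameters alpha and iters are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical illustration (not the authors' released code): contrast
# independent patch-wise prediction with a simple context-aware joint
# inference that propagates class evidence between spatially and
# semantically correlated patches before the final argmax.

def patchwise_predict(sim):
    """Baseline: each patch picks its best-matching text class independently.

    sim: (N, C) cosine similarities between N patch embeddings and C
    open-vocabulary class (text) embeddings.
    """
    return sim.argmax(axis=1)

def context_aware_predict(sim, feats, alpha=0.5, iters=10):
    """Sketch of joint inference: mix each patch's class logits with those
    of similar patches (a crude stand-in for the inter-unit semantic
    dependencies described in the abstract).

    feats: (N, D) patch embeddings used to build an affinity matrix.
    alpha, iters: illustrative propagation strength and iteration count.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    aff = np.clip(f @ f.T, 0, None)               # nonnegative patch affinities
    np.fill_diagonal(aff, 0.0)                    # no self-loops
    aff /= aff.sum(axis=1, keepdims=True) + 1e-8  # row-normalize
    logits = sim.copy()
    for _ in range(iters):                        # simple label propagation
        logits = (1 - alpha) * sim + alpha * (aff @ logits)
    return logits.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 32))             # toy patch embeddings
    sim = rng.normal(size=(64, 8))                # toy patch-text similarities
    print(patchwise_predict(sim)[:8])
    print(context_aware_predict(sim, feats)[:8])
```

Under this sketch's assumptions, mixing each patch's class logits with those of correlated patches is what yields the segmentation consistency the abstract attributes to incorporating global contextual cues; the paper's actual mechanism should be taken from the linked implementation.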
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29271 [cs.CV]
(or arXiv:2603.29271v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29271
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wenyang Chen [v1] Tue, 31 Mar 2026 05:12:02 UTC (22,408 KB)