
Contextual inference from single objects in Vision-Language models

arXiv, March 31, 2026

arXiv:2603.26731v1 Announce Type: cross

Authors: Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig


Abstract: How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures
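The abstract names three ingredients that a small sketch may make more concrete: showing an object on a masked background, checking how strongly correctness at one level (object, scene, superordinate) is coupled to correctness at another, and measuring how stable an object representation stays once the background is removed. The Python below is only an illustration under stated assumptions and is not the authors' code: the helper names, the gray fill value, the conditional-accuracy coupling measure, the cosine-similarity stability measure, and all toy data are hypothetical choices made for this example.

```python
import math

def mask_background(image, mask, fill=128):
    """Keep pixels where mask is 1 and replace all others with `fill`.

    `image` and `mask` are equally sized nested lists; the gray fill
    value is an assumption, not a detail taken from the paper.
    """
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

def accuracy(correct):
    """Fraction of trials answered correctly (entries are 0 or 1)."""
    return sum(correct) / len(correct)

def conditional_accuracy(target, given):
    """Accuracy on `target` restricted to trials where `given` was correct.

    Returns None when `given` is never correct. Comparing this value with
    the unconditional accuracy is one simple, assumed way to look at how
    tightly two prediction levels are coupled.
    """
    hits = [t for t, g in zip(target, given) if g]
    return sum(hits) / len(hits) if hits else None

def cosine_similarity(u, v):
    """Cosine similarity between two vectors, used here as a stand-in
    stability score between representations with and without background."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3x3 grayscale image and a plus-shaped object mask (hypothetical data).
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
mask = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
masked = mask_background(image, mask)

# Hypothetical per-trial correctness at the three levels.
object_ok = [1, 1, 0, 1, 0, 1]
scene_ok = [1, 0, 0, 1, 1, 0]
super_ok = [1, 1, 1, 0, 1, 1]

scene_acc = accuracy(scene_ok)
scene_given_object = conditional_accuracy(scene_ok, object_ok)
super_given_scene = conditional_accuracy(super_ok, scene_ok)

# Hypothetical object embeddings with and without the original background.
stability = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.5, 1.0])
```

In this toy data the scene accuracy is the same whether or not the object was identified correctly, which is the kind of pattern the abstract describes as partial dissociability. For a balanced indoor vs. outdoor task the chance level would be 0.5; the chance level for scene categories depends on the number of categories, which the abstract does not state.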

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.26731 [cs.CV]

(or arXiv:2603.26731v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26731

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Martina G. Vilas
[v1] Fri, 20 Mar 2026 13:24:15 UTC (2,920 KB)

