The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
arXiv:2603.26589v1 Announce Type: new Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene und — Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene
View PDF
Abstract:What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
Comments: 7 figures, 5 tables
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.26589 [cs.CV]
(or arXiv:2603.26589v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.26589
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Michelle Greene [view email] [v1] Fri, 27 Mar 2026 16:52:46 UTC (11,337 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxivExclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models - wsj.com
<a href="https://news.google.com/rss/articles/CBMiuANBVV95cUxPTU4tZnRTaG1rUGQ4a3l6RXdVczBjYkhlVkFTaU9BREZmY3MxMkFtcXJGckJfTDB0dndpSHVYR1JqeEdfV3VwRGRQcGtZQk5fbF9PVkhxS1pDX0wtSXdYOGVOOWZ4cEhkNTJxSFdhQ3FRdjJrSlppOFJrRHd2bUFyZDdCd193U1Q3cmFFMkNWUFh6Wmx1ZjhnRmRDaE1QVFZQeVJCb3JyYWVCbDlJY1QwcG42NS1leXRnamZGd1dXRUlUV2RybGZScGtBc1I2TDFHY0FXeW9ORV9lVzE3cWpvemlNcE0wVjVSRVd4SkJEUnlPc3VWNjB2Y2pnaGFEOGl4V28zamNEVEtsRDROMGhEbGpzc2djelJVZ2lGUjNRNGprZ0p2SWhRTnE2UVRHSW8yX3k3Zm1BcWg4NjJheGw0S0U3ZmNKeXFaRmYwSGtERFRnYzU2QUJhUElCcHFicWV5YlRGRGtHbzB6ZURRdnpaTHFDOHYtbkNQS3NTZzNwNXNJQkk5SS05N3g0bWVaN2hnVi1KLTFtMVZnZUZlN05NMTY5dGZBdmxaSVdXUXg5NEhmT0ZUYkdmcQ?oc=5" target="_blank">Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models</a> <font color="#6f6f6f">wsj.com</font>

Former Baidu President on AI Tokenization in China
The former President of Baidu says AI tokenization is exploding in China, far beyond what OpenClaw illustrated earlier this year. Zhang Yaqin, who runs China's Institute of AI Industry Research at Beijing's Tsinghua University speaks to Bloomberg's Chief North Asia Correspondent Stephen Engle in Beijing. (Source: Bloomberg)
Uganda To Host Climate Change, Artificial Intelligence Summit, Sept 5-6 - Independent Newspaper Nigeria
<a href="https://news.google.com/rss/articles/CBMimAFBVV95cUxNcnBtdldJUERlX0dzOTJEY2sybEc2ZjZSbUtiLWIzUUhJbkQ1N3BwUWlCcV95YmZNSmFGbFQ1enE5VWJlY0JBWDhlSENlNEFNMmM5Q0hrM080V3Q2eUF3cmpkeFBXRS01YXBpRUI4Uk5KOVY5bjFaRm1GNmVudGUtNTFmVDlBMDIyNGVGaF9WTkdHTDMxY1BZcw?oc=5" target="_blank">Uganda To Host Climate Change, Artificial Intelligence Summit, Sept 5-6</a> <font color="#6f6f6f">Independent Newspaper Nigeria</font>
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
AI could transform research assessment — and some academics are worried - Nature
<a href="https://news.google.com/rss/articles/CBMiX0FVX3lxTE12VmJ3THU1WmwzcENmWFJqTVRfclJGVkhzTG9Kcm9mTm1VZnJsV2IyZGwtc21EWnZRSkRfSXM3SDRlOVZnUlhpVm9VUEMtRWRRYmNDVU1kdHg5NllvSERj?oc=5" target="_blank">AI could transform research assessment — and some academics are worried</a> <font color="#6f6f6f">Nature</font>

As AI-Generated Music Advances, Humans Still Lead in Creativity, CMU Research Finds
<p> <img loading="lazy" src="https://www.cmu.edu/news/sites/default/files/styles/listings_desktop_1x_/public/2026-01/251104A_WTM_AI-Creativity-Music102.jpg.webp?itok=uEc2ayOO" width="900" height="508" alt="A woman with long black hair is seated on the right opposite a computer screen with a small piano keyboard and computer keyboard in front of her on a desk, where a man next to her with glasses and wavy black hair operates the mouse and talks to her."> </p> AI can write songs, but still has a way to go before matching the creativity of tunes made by people, according to Carnegie Mellon University research.


Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!