Research Papers research paper arxiv ai artificial-intelligence

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

arXivMarch 31, 202610 min read0 views

arXiv:2603.26798v1 Announce Type: cross Abstract: Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing t — Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika M\"utze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Authors:Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

View PDF HTML (experimental)

Abstract:Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.26798 [cs.LG]

(or arXiv:2603.26798v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.26798

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Gesina Schwalbe [view email] [v1] Thu, 26 Mar 2026 00:38:38 UTC (804 KB)

Original source

arXiv

https://arxiv.org/abs/2603.26798

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ReleasesFresh

Reachability-Aware Time Scaling for Path Tracking

arXiv:2604.00439v1 Announce Type: new Abstract: This paper studies tracking of collision-free waypoint paths produced by an offline planner for a planar double-integrator system with bounded speed and acceleration. Because sampling-based planners must route around obstacles, the resulting waypoint paths can contain sharp turns and high-curvature regions, so one-step reachability under acceleration limits becomes critical even when the path geometry is collision-free. We build on a pure-pursuit-style, reachability-guided quadratic-program (QP) tracker with a one-step acceleration margin. Offline, we evaluate this margin along a spline fitted to the waypoint path and update a scalar speed-scaling profile so that the required one-step acceleration remains below the available bound. Online, th

arXiv cs.RO

1mabout 5 hours ago

Research PapersFresh

Sampling-based Task and Kinodynamic Motion Planning under Semantic Uncertainty

arXiv:2604.00401v1 Announce Type: new Abstract: This paper tackles the problem of integrated task and kinodynamic motion planning in uncertain environments. We consider a robot with nonlinear dynamics tasked with a Linear Temporal Logic over finite traces ($\ltlf$) specification operating in a partially observable environment. Specifically, the uncertainty is in the semantic labels of the environment. We show how the problem can be modeled as a Partially Observable Stochastic Hybrid System that captures the robot dynamics, $\ltlf$ task, and uncertainty in the environment state variables. We propose an anytime algorithm that takes advantage of the structure of the hybrid system, and combines the effectiveness of decision-making techniques and sampling-based motion planning. We prove the sou

arXiv cs.RO

1mabout 5 hours ago

ModelsFresh

Behavioral Score Diffusion: Model-Free Trajectory Planning via Kernel-Based Score Estimation from Data

arXiv:2604.00391v1 Announce Type: new Abstract: Diffusion-based trajectory optimization has emerged as a powerful planning paradigm, but existing methods require either learned score networks trained on large datasets or analytical dynamics models for score computation. We introduce \emph{Behavioral Score Diffusion} (BSD), a training-free and model-free trajectory planner that computes the diffusion score function directly from a library of trajectory data via kernel-weighted estimation. At each denoising step, BSD retrieves relevant trajectories using a triple-kernel weighting scheme -- diffusion proximity, state context, and goal relevance -- and computes a Nadaraya-Watson estimate of the denoised trajectory. The diffusion noise schedule naturally controls kernel bandwidths, creating a m

arXiv cs.RO

2mabout 5 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 166 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Sampling-based Task and Kinodynamic Motion Planning under Semantic Uncertainty

arXiv cs.RO

1mabout 5 hours ago

Research PapersFresh

Real Time Local Wind Inference for Robust Autonomous Navigation

arXiv:2604.00343v1 Announce Type: new Abstract: This thesis presents a solution that enables aerial robots to reason about surrounding wind flow fields in real time using on board sensors and embedded flight hardware. The core novelty of this research is the fusion of range measurements with sparse in situ wind measurements to predict surrounding flow fields. We aim to address two fundamental questions: first, the sufficiency of topographical data for accurate wind prediction in dense urban environments; and second, the utility of learned wind models for motion planning with an emphasis on energy efficiency and obstacle avoidance. Drawing on tools from deep learning, fluid mechanics, and optimal control, we establish a framework for local wind prediction using navigational LiDAR, and then

arXiv cs.RO

2mabout 5 hours ago

Research PapersFresh

Play-Testing REMind: Evaluating an Educational Robot-Mediated Role-Play Game

arXiv:2604.00300v1 Announce Type: new Abstract: This paper presents REMind, an innovative educational robot-mediated role-play game designed to support anti-bullying bystander intervention among children. REMind invites players to observe a bullying scenario enacted by social robots, reflect on the perspectives of the characters, and rehearse defending strategies by puppeteering a robotic avatar. We evaluated REMind through a mixed-methods play-testing study with 18 children aged 9--10. The findings suggest that the experience supported key learning goals related to self-efficacy, perspective-taking, understanding outcomes of defending, and intervention strategies. These results highlight the promise of Robot-Mediated Applied Drama (RMAD) as a novel pedagogical framework to support Social-

arXiv cs.RO

1mabout 5 hours ago

Research PapersFresh

Long-Horizon Geometry-Aware Navigation among Polytopes via MILP-MPC and Minkowski-Based CBFs

arXiv:2604.00162v1 Announce Type: new Abstract: Autonomous navigation in complex, non-convex environments remains challenging when robot dynamics, control limits, and exact robot geometry must all be taken into account. In this paper, we propose a hierarchical planning and control framework that bridges long-horizon guidance and geometry-aware safety guarantees for a polytopic robot navigating among polytopic obstacles. At the high level, Mixed-Integer Linear Programming (MILP) is embedded within a Model Predictive Control (MPC) framework to generate a nominal trajectory around polytopic obstacles while modeling the robot as a point mass for computational tractability. At the low level, we employ a control barrier function (CBF) based on the exact signed distance in the Minkowski-differenc

arXiv cs.RO

1mabout 5 hours ago