Research Papers research paper arxiv computer-vision image-recognition

Explaining CLIP Zero-shot Predictions Through Concepts

arXivMarch 31, 20262 min read0 views

arXiv:2603.28211v1 Announce Type: new Abstract: Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method p — Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas

View PDF HTML (experimental)

Abstract:Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at this https URL.

Comments: Accepted to CVPR 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28211 [cs.CV]

(or arXiv:2603.28211v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28211

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Onat Ozdemir [view email] [v1] Mon, 30 Mar 2026 09:31:33 UTC (7,032 KB)

Original source

arXiv

https://arxiv.org/abs/2603.28211

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ModelsRecent

[P] Federated Adversarial Learning

I'm a CS/ML engineering student in my 4th year, and I need help for a project I recently got assigned to (as an "end of the year" project). I am familiar with basic ML stuff, deep learning etc and made a few "standard" projects here and there about it... However I found this topic a bit challenging since it combines both FL and the adversarial aspect, I did a lot of research especially on arxiv to try to understand the gist of it. HOWEVER, the subject is essentially "federated adversarial learning" and I am struggeling to understand what I'm supposed to do. (I found ONE article on arxiv but ngl i find it very hard to understand as it is very theoritical.) I talked to my teachers/supervisors about this but they said "do whatever you want" which doesn't help AT ALL..... They did provide a da

Reddit r/MachineLearning

2mabout 17 hours ago

ModelsRecent

[R] The SPORE Clustering Algorithm

https://preview.redd.it/di99yw56tksg1.png?width=992 format=png auto=webp s=8828c9459dcf8f8541718e4d7a9fae52bfc0b95a I created a clustering algorithm SPORE ( S keleton P ropagation O ver R ecalibrating E xpansions) for general purpose clustering, intended to handle nonconvex, convex, low-d and high-d data alike. I've benchmarked it on 28 datasets from 2-784D and released a Python package as well as a research paper . Short Summary SPORE is a density-variance-based method meant for general clustering in arbitrary geometries and dimensionalities. After building a knn graph, it has 2 phases. Phase 1 (Expansion) uses BFS with a continually refined density-variance constraint to expand initial clusters in a way that adapts to their specific scale. The aim is to capture inner, well-shielded skele

Reddit r/MachineLearning

5mabout 19 hours ago

Research PapersRecent

[D] Does seeing the identify of authors influence your scoring?

Let's be honest, at some stage of the review process. A lot of us have gotten bored and tried to Google the papers we are reviewing. And sometimes those papers might have already been uploaded onto arXiv with the identity of the authors. Which we then tried to look them up. As a first-time reviewer, I noticed the top 2 papers in my batch happened to be the only papers in my batch that is on arXiv. I am trying to work out if revealing the author's identity had influenced my decision. Or it's just a coincidence. submitted by /u/d_edge_sword [link] [comments]

Reddit r/MachineLearning

1mabout 20 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 303 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersRecent

[D] Does seeing the identify of authors influence your scoring?

Reddit r/MachineLearning

1mabout 20 hours ago

Research PapersFresh

Leveraging Commit Size Context and Hyper Co-Change Graph Centralities for Defect Prediction

arXiv:2604.01132v1 Announce Type: new Abstract: File-level defect prediction models traditionally rely on product and process metrics. While process metrics effectively complement product metrics, they often overlook commit size the number of files changed per commit despite its strong association with software quality. Network centrality measures on dependency graphs have also proven to be valuable product level indicators. Motivated by this, we first redefine process metrics as commit size aware process metric vectors, transforming conventional scalar measures into 100 dimensional profiles that capture the distribution of changes across commit size strata. We then model change history as a hyper co change graph, where hyperedges naturally encode commit-size semantics. Vector centralities

arXiv cs.SE

1mabout 4 hours ago

Research PapersFresh

Detecting Call Graph Unsoundness without Ground Truth

arXiv:2604.00885v1 Announce Type: new Abstract: Java static analysis frameworks are commonly compared under the assumption that analysis algorithms and configurations compose monotonically and yield semantically comparable results across tools. In this work, we show that this assumption is fundamentally flawed. We present a large-scale empirical study of semantic consistency within and across four widely used Java static analysis frameworks: Soot, SootUp, WALA, and Doop. Using precision partial orders over analysis algorithms and configurations, we systematically identify violations where increased precision introduces new call-graph edges or amplifies inconsistencies. Our results reveal three key findings. First, algorithmic precision orders frequently break within frameworks due to moder

arXiv cs.SE

1mabout 4 hours ago

Research PapersFresh

Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

arXiv:2604.01072v1 Announce Type: new Abstract: Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated exe

arXiv cs.SE

2mabout 4 hours ago