MEDiC: Multi-objective Exploration of Distillation from CLIP
arXiv:2603.29009v1 Announce Type: new Abstract: Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hi
View PDF HTML (experimental)
Abstract:Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher's inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29009 [cs.CV]
(or arXiv:2603.29009v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29009
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Konstantinos Georgiou [view email] [v1] Mon, 30 Mar 2026 21:16:05 UTC (2,355 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modelannouncefeature
research-llm-apis 2026-04-04
Release: research-llm-apis 2026-04-04 I'm working on a major change to my LLM Python library and CLI tool. LLM provides an abstraction layer over hundreds of different LLMs from dozens of different vendors thanks to its plugin system, and some of those vendors have grown new features over the past year which LLM's abstraction layer can't handle, such as server-side tool execution. To help design that new abstraction layer I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft curl commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios. Both the scripts and the captured outputs now live in this new repo. Tags: llm , apis , json , llms

Untitled
You have 50 models. Each trained on different data, different domain, different patient population. You want them to get smarter from each other. So you do the obvious thing — you set up a central aggregator. Round 1: gradients in, averaged weights out. Works fine at N=5. At N=20 you notice the coordinator is sweating. At N=50, round latency has tripled, your smallest sites are timing out, and your bandwidth budget is gone. You tune the hell out of it. Same ceiling. This is not a configuration problem. This is an architecture ceiling. The math underneath it guarantees you hit a wall. A different architecture changes the math. The combinatorics you are not harvesting Start with a fact that has nothing to do with any particular framework: N agents have exactly N(N-1)/2 unique pairwise relati

AI News This Week: April 05, 2026 - A New Era of Rapid Development and Multimodal Intelligence
AI News This Week: April 05, 2026 - A New Era of Rapid Development and Multimodal Intelligence Published: April 05, 2026 | Reading time: ~10 min This week has been nothing short of phenomenal for the AI community, with breakthroughs and announcements that promise to revolutionize the way we develop and interact with artificial intelligence. From building personal AI agents in a matter of hours to the unveiling of cutting-edge multimodal intelligence models, the pace of innovation is not just accelerating - it's transforming the landscape of what's possible. Whether you're a seasoned developer or just starting to explore the world of AI, this week's news is a must-know, offering insights into how technology is making AI more accessible, powerful, and integrated into our daily lives. Buildin
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.





Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!