
MEDiC: Multi-objective Exploration of Distillation from CLIP

arXiv cs.CV · by Konstantinos Georgiou, Maofeng Tang, Hairong Qi · April 1, 2026



Abstract: Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher's inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.
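As a rough illustration of how three objectives of this kind might be combined with scalar weights, the sketch below computes a weighted sum of a token-distillation term, a CLS-alignment term, and a masked pixel-reconstruction term. The abstract does not state the loss forms, so everything specific here is an assumption: cosine distance for the two feature terms, mean squared error on masked patches for the pixel term, and the hypothetical names `combined_loss`, `w_token`, `w_cls`, and `w_pixel`. It is not the authors' implementation.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Mean of (1 - cosine similarity) over the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


def combined_loss(student_tokens, teacher_tokens,
                  student_cls, teacher_cls,
                  decoded_pixels, target_pixels, mask,
                  w_token=1.0, w_cls=1.0, w_pixel=1.0):
    """Weighted sum of three objectives (loss forms are assumptions).

    student_tokens, teacher_tokens: (B, N, D) patch features
    student_cls, teacher_cls:       (B, D) global features
    decoded_pixels, target_pixels:  (B, N, P) flattened patch pixels
    mask:                           (B, N), 1 where a patch was masked
    """
    # Patch-level distillation against the frozen teacher's tokens.
    l_token = cosine_distance(student_tokens, teacher_tokens)
    # Global alignment of the CLS embeddings.
    l_cls = cosine_distance(student_cls, teacher_cls)
    # Pixel reconstruction, averaged over masked patches only.
    per_patch = np.mean((decoded_pixels - target_pixels) ** 2, axis=-1)
    l_pixel = float(np.sum(per_patch * mask) / max(np.sum(mask), 1))

    total = w_token * l_token + w_cls * l_cls + w_pixel * l_pixel
    return total, {"token": l_token, "cls": l_cls, "pixel": l_pixel}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens_s = rng.normal(size=(2, 4, 8))
    tokens_t = rng.normal(size=(2, 4, 8))
    cls_s, cls_t = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
    pix_d, pix_t = rng.normal(size=(2, 4, 12)), rng.normal(size=(2, 4, 12))
    mask = np.array([[1, 1, 0, 0], [0, 1, 1, 0]], dtype=float)
    total, parts = combined_loss(tokens_s, tokens_t, cls_s, cls_t,
                                 pix_d, pix_t, mask, w_pixel=0.5)
    print(total, parts)
```

Returning the individual terms alongside the total makes it easy to see how a change in one weight shifts the balance, which is relevant to the abstract's observation that the scalar weights are fragile.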

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.29009 [cs.CV]

(or arXiv:2603.29009v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.29009

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Konstantinos Georgiou
[v1] Mon, 30 Mar 2026 21:16:05 UTC (2,355 KB)
