Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessHong Kong hasn’t issued a single HKD stablecoin license after March targetCoinDesk AIBitcoin is closer to its 'buy zone' than it's been in three yearsCoinDesk AIRAG Web Browser: Give Your AI Real-Time Web Access Without HallucinationsDEV CommunityWhat Nobody Tells You About Building a Protocol for AI AgentsDEV CommunityThe Evidence Is in the Phone. Most of It Never Makes It Into the Case.DEV CommunityIt's Not Smarter Models — It's Cheaper Memory: TurboQuant's Real Impact, Wall Street Panic & Academic StormDEV CommunityWindows might be hiding some of your PC's storage by default - here's how to reclaim itZDNet Big DataYour Production Code Is Training AI Models Right Now (And How to Audit Your Stack)DEV CommunitySetting Up Your Databricks Account (Free Trial + First Look at the UI)DEV CommunityHow to Use Augmented Coding to Build a Web AppDEV CommunitySetting up a hugo static site hosted with PorkbunDEV CommunityWhy Domain Knowledge Is the Core Architecture of Fine-Tuning and RAG — Not an AfterthoughtDEV CommunityBlack Hat USADark ReadingBlack Hat AsiaAI BusinessHong Kong hasn’t issued a single HKD stablecoin license after March targetCoinDesk AIBitcoin is closer to its 'buy zone' than it's been in three yearsCoinDesk AIRAG Web Browser: Give Your AI Real-Time Web Access Without HallucinationsDEV CommunityWhat Nobody Tells You About Building a Protocol for AI AgentsDEV CommunityThe Evidence Is in the Phone. Most of It Never Makes It Into the Case.DEV CommunityIt's Not Smarter Models — It's Cheaper Memory: TurboQuant's Real Impact, Wall Street Panic & Academic StormDEV CommunityWindows might be hiding some of your PC's storage by default - here's how to reclaim itZDNet Big DataYour Production Code Is Training AI Models Right Now (And How to Audit Your Stack)DEV CommunitySetting Up Your Databricks Account (Free Trial + First Look at the UI)DEV CommunityHow to Use Augmented Coding to Build a Web AppDEV CommunitySetting up a hugo static site hosted with PorkbunDEV CommunityWhy Domain Knowledge Is the Core Architecture of Fine-Tuning and RAG — Not an AfterthoughtDEV Community

MolmoPoint: Better Pointing for VLMs with Grounding Tokens

arXivMarch 31, 202610 min read0 views
Source Quiz

arXiv:2603.28069v1 Announce Type: cross Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, — Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

Authors:Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

View PDF HTML (experimental)

Abstract:Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.28069 [cs.CV]

(or arXiv:2603.28069v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28069

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Christopher Clark [view email] [v1] Mon, 30 Mar 2026 06:15:06 UTC (5,001 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
MolmoPoint:…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 108 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers