MolmoPoint: Better Pointing for VLMs with Grounding Tokens

arXiv · March 31, 2026

Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
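The coarse-to-fine selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the scaled dot-product scoring stands in for the model's cross-attention, and the function names (`select_token`, `to_pixel`), the square patch grid, the 2x2 subpatch layout, and the fractional in-subpatch offset are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_token(query, keys, no_more_logit):
    """Score each visual token against a pointing-token query (hypothetical helper).

    An extra logit is appended for an assumed "no more points" class.
    Returns (index, probs), where index is None if that extra class wins.
    """
    logits = keys @ query / np.sqrt(query.shape[0])
    logits = np.append(logits, no_more_logit)
    probs = softmax(logits)
    choice = int(np.argmax(probs))
    return (None if choice == len(keys) else choice), probs

def to_pixel(patch_idx, sub_idx, offset, grid_w, patch_px, sub_per_side):
    """Combine coarse patch, subpatch, and in-subpatch offset into (x, y) pixels.

    offset is an assumed (fx, fy) pair of fractions in [0, 1] within the subpatch.
    """
    row, col = divmod(patch_idx, grid_w)
    sub_row, sub_col = divmod(sub_idx, sub_per_side)
    sub_px = patch_px / sub_per_side
    x = col * patch_px + (sub_col + offset[0]) * sub_px
    y = row * patch_px + (sub_row + offset[1]) * sub_px
    return x, y

# Toy example: a 4x4 grid of 14-pixel patches with one-hot token features.
keys = np.eye(16)
query = 8.0 * keys[6]                      # query aligned with token 6
patch_idx, _ = select_token(query, keys, no_more_logit=0.0)
point = to_pixel(patch_idx, sub_idx=3, offset=(0.5, 0.5),
                 grid_w=4, patch_px=14, sub_per_side=2)
print(patch_idx, point)
```

In this toy setup the query selects token 6 (row 1, column 2 of the grid), and the subpatch and offset refine that choice to a single pixel location. A query that matches no token, paired with a larger `no_more_logit`, returns `None`, which is how a stopping decision could be represented under these assumptions.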

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.28069 [cs.CV]

(or arXiv:2603.28069v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28069

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Christopher Clark
[v1] Mon, 30 Mar 2026 06:15:06 UTC (5,001 KB)

Original source: arXiv, https://arxiv.org/abs/2603.28069

