Research Papers research paper arxiv vision-language models visual tokens cross-attends

MolmoPoint: Better Pointing for VLMs with Grounding Tokens

HuggingFace PapersMarch 30, 20268 min read0 views

A vision-language model approach for grounding that directly selects visual tokens containing target concepts through specialized pointing tokens, achieving superior performance in image, GUI, video pointing, and tracking tasks. (0 upvotes on HuggingFace)

Published on Mar 30

Authors:

Abstract

AI-generated summary

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.

View arXiv page View PDF Add to collection

Get this paper in your agent:

hf papers read 2603.28069

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.28069 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.28069 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.28069 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Original source

HuggingFace Papers

https://huggingface.co/papers/2603.28069

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Research PapersLive

Precision Proactivity: Measuring Cognitive Load in Real-World AI-Assisted Work

Article URL: https://arxiv.org/abs/2505.10742 Comments URL: https://news.ycombinator.com/item?id=47595100 Points: 1 # Comments: 0

Hacker News AI Top

2m40 minutes ago

ModelsLive

We're running an AI-authored research workshop for Northeast India's 200+ languages - and publishing everything openly

At MWire Labs, we build language technology for Northeast India's indigenous languages - ASR, MT, OCR, LLMs. The region has 200+ languages. Almost none of them exist in mainstream AI datasets. So we're doing something a bit unusual. NortheastGenAI 2026 is a virtual workshop on May 29 where every submission must be AI-generated or AI-assisted - with full disclosure of how. All reviews are AI-assisted too, followed by a human editorial check. Everything is public on OpenReview. Inspired by Agents4Science 2025 (Stanford). We're not claiming AI research is ready. We're asking the question openly and publishing whatever comes out. *Three tracks: * Language, Culture & Heritage Society, History & Anthropology AI and Technology for NE In

Dev.to AI

1m24 minutes ago

Research Papers

Antonia Georgopoulou starts as Cyber Valley Max Planck Independent Research Group Leader

is.mpg.de

1m6 months ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 163 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersLive

Precision Proactivity: Measuring Cognitive Load in Real-World AI-Assisted Work

Article URL: https://arxiv.org/abs/2505.10742 Comments URL: https://news.ycombinator.com/item?id=47595100 Points: 1 # Comments: 0

Hacker News AI Top

2m40 minutes ago

Research PapersFresh

Iran’s Revolutionary Guards just named 18 US tech firms as military targets. The age of the civilian data centre is over.

At 8pm Tehran time on Tuesday, a new kind of front line was drawn, not through desert terrain or along a disputed border, but through the server farms, cloud regions, and corporate campuses of America’s largest technology companies. The Islamic Revolutionary Guard Corps published a statement on its official Sepah News channel naming 18 US […] This story continues at The Next Web

The Next Web AI

1mabout 3 hours ago

Research Papers

Antonia Georgopoulou starts as Cyber Valley Max Planck Independent Research Group Leader

is.mpg.de

1m6 months ago

Research Papers

Real-time speech-to-speech translation - research.google

<a href="https://news.google.com/rss/articles/CBMid0FVX3lxTFAxeFFhNVhOTjVXeEhXeGFHOXE3WENYeGFISjlpVGNueGtDS2ZZTEVsZHh6dkhLc191aFFYNEpMUUxraV9uTWF6YW1RcF9VTFlIZDBuQTlpbkhBRnJxU1FuTGY4aEtFc2FEaWMxekxUTnlzV3dFN1ow?oc=5" target="_blank">Real-time speech-to-speech translation</a> research.google

GNews AI voice

1m4 months ago