
LanteRn: Latent Visual Structured Reasoning

arXiv · March 26, 2026

Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann


Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.
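To make the idea of interleaving language with latent visual thoughts concrete, here is a toy sketch of such a decoding loop. It is not the authors' implementation: the abstract does not specify the mechanism, so everything here is an assumption for illustration, including the `<lat>` and `</lat>` delimiter tokens, the fixed number of latent steps, the `toy_model` stand-in, and the choice to feed the hidden state back as the next input embedding. The text tokens are scripted rather than sampled so the example stays deterministic.

```python
import math
import random

DIM = 8            # toy embedding width (assumption)
LATENT_STEPS = 4   # latent embeddings per visual thought (assumption)
VOCAB = ["<lat>", "</lat>", "the", "cup", "is", "left", "<eos>"]  # hypothetical


def toy_model(context):
    """Stand-in for a vision-language transformer (hypothetical).

    Maps the context, a list of DIM-sized vectors, to one hidden state.
    """
    n = len(context)
    mean = [sum(vec[i] for vec in context) / n for i in range(DIM)]
    return [math.tanh(x + 0.1 * n) for x in mean]


def decode(image_embeddings, token_table, script):
    """Interleave discrete tokens with continuous latent embeddings.

    `script` fixes which token is emitted at each text step; a real model
    would choose tokens from its own output distribution instead.
    """
    context = list(image_embeddings)
    trace = []
    for token in script:
        # Text mode: append the embedding of the emitted token.
        context.append(token_table[token])
        trace.append(token)
        if token == "<lat>":
            # Latent mode: the hidden state is appended to the context
            # directly, without being decoded into a word or an image.
            for _ in range(LATENT_STEPS):
                context.append(toy_model(context))
                trace.append("[latent]")
    return trace, context


random.seed(0)
token_table = {tok: [random.uniform(-1, 1) for _ in range(DIM)] for tok in VOCAB}
image_embeddings = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
trace, context = decode(
    image_embeddings, token_table, ["the", "<lat>", "</lat>", "cup", "<eos>"]
)
print(trace)
```

In this sketch, later steps can attend to the latent vectors because they sit in the same context as the text and image embeddings. The two training stages described in the abstract (supervised fine-tuning, then reinforcement learning) are not modeled here.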

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as: arXiv:2603.25629 [cs.CV]

(or arXiv:2603.25629v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.25629

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: André Viveiros Guilherme
[v1] Thu, 26 Mar 2026 16:41:59 UTC (203 KB)
