
Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

arXiv, March 31, 2026

Authors: Dongsheng Yang, Yinfeng Yu, Liejun Wang


Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to capture key semantic cues and align them accurately with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and uses Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, success rate (SR) increased by 5% and 2.07%, respectively, and success weighted by path length (SPL) increased by 4% and 3.69%, respectively. The source code is available at this https URL.
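For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how SR and SPL are commonly computed in VLN evaluation. It follows the standard definition of SPL (each successful episode is weighted by the ratio of the shortest path length to the longer of the agent's path and the shortest path) and is not taken from the BTK source code. The `Episode` class, its field names, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container, not from the paper)."""
    success: bool         # agent stopped within the success radius of the goal
    shortest_path: float  # length of the shortest start-to-goal path
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by path efficiency, averaged over all episodes."""
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for e in episodes:
        if e.success:
            # Efficiency is 1.0 for an optimal path and shrinks as the agent detours.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative data only; these are not results from the paper.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),
    Episode(success=True, shortest_path=10.0, agent_path=20.0),
    Episode(success=False, shortest_path=10.0, agent_path=5.0),
    Episode(success=True, shortest_path=8.0, agent_path=8.0),
]

print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

Because SPL penalizes detours, it can never exceed SR on the same set of episodes, which is why the two numbers are usually reported together.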

Comments: Main paper (37 pages). Accepted for publication in Information Processing and Management, Volume 63, Issue 6, September 2026, 104766

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Cite as: arXiv:2603.26859 [cs.CV]

(or arXiv:2603.26859v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26859

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yinfeng Yu
[v1] Fri, 27 Mar 2026 15:25:30 UTC (4,727 KB)
