
StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

Google Research Blog · October 29, 2025

Interactive streetscape tools, available today in every major mapping service, have revolutionized how people virtually navigate and explore the world — from previewing routes and inspecting destinations to remotely visiting world-class tourist locations. But to date, screen readers have not been able to interpret street view imagery, and alt text is unavailable. We now have an opportunity to redefine this immersive streetscape experience to be inclusive for all with multimodal AI and image understanding. This could eventually allow a service like Google Street View, which has over 220 billion images spanning 110+ countries and territories, to be more accessible to people in the blind and low-vision community, offering an immersive visual experience and opening up new possibilities for exploration.

In “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI”, presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept accessible street view prototype that uses context-aware, real-time AI and accessible navigation controls. StreetReaderAI was designed iteratively by a team of blind and sighted accessibility researchers, drawing on previous work in accessible first-person gaming and navigation tools, such as Shades of Doom, BlindSquare, and SoundScape. Key capabilities include:

  • Real-time AI-generated descriptions of nearby roads, intersections, and places.
  • Dynamic conversation with a multimodal AI agent about scenes and local geography.
  • Accessible panning and movement between panoramic images using voice commands or keyboard shortcuts.

Navigating in StreetReaderAI

StreetReaderAI offers an immersive, first-person exploration experience, much like a video game where audio is the primary interface.

StreetReaderAI provides seamless navigation through both keyboard and voice interaction. Users can explore their surroundings using the left and right arrow keys to shift their view. As the user pans, StreetReaderAI shares audio feedback, voicing the current heading as a cardinal or intercardinal direction (e.g., “Now facing: North” or “Northeast”). It also expresses whether the user can move forward and if they are currently facing a nearby landmark or place.

To move, the user can take “virtual steps” using the up arrow or move backward with the down arrow. As a user moves through the virtual streetscape, StreetReaderAI describes how far the user traveled and key geographic information, such as nearby places. Users can also use “jump” or “teleport” features to quickly move to new locations.

How StreetReaderAI serves as a virtual guide

The core of StreetReaderAI is its two underlying AI subsystems backed by Gemini: AI Describer and AI Chat. Both subsystems take in a static prompt and optional user profile as well as dynamic information about the user’s current location, such as nearby places, road information, and the current field-of-view image (i.e., what’s being shown in Street View).

AI Describer

AI Describer functions as a context-aware scene description tool that combines dynamic geographic information about the user’s virtual location along with an analysis of the current Street View image to generate a real-time audio description.

It has two modes: a “default” prompt emphasizing navigation and safety for blind pedestrians, and a “tour guide” prompt that provides additional tourism information (e.g., historic and architectural context). We also use Gemini to predict likely follow-up questions specific to the current scene and local geography that may be of interest to blind or low-vision travelers.
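The real prompts are not published; condensed, hypothetical versions of the two modes might be selected like this:

```python
PROMPTS = {
    # Hypothetical condensed stand-ins for the two modes described above.
    "default": ("You are assisting a blind pedestrian. Describe the scene, "
                "prioritizing navigation and safety: sidewalks, crossings, "
                "and obstacles."),
    "tour_guide": ("You are a tour guide for a blind visitor. Describe the "
                   "scene, adding historic and architectural context where "
                   "relevant."),
}

def describer_prompt(mode: str = "default") -> str:
    """Select the AI Describer system prompt for the chosen mode."""
    if mode not in PROMPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    return PROMPTS[mode]
```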

AI Chat

AI Chat builds on AI Describer but allows users to ask questions about their current view, past views, and nearby geography. The chat agent uses Google's Multimodal Live API, which supports real-time interaction, function calling, and temporarily retains memory of all interactions within a single session. We track and send each pan or movement interaction along with the user's current view and geographic context (e.g., nearby places, current heading).

What makes AI Chat so powerful is its temporary “memory” of the user's session: the context window is set to a maximum of 1,048,576 input tokens, roughly equivalent to over 4,000 input images. Because AI Chat receives the user's view and location with every virtual step, it steadily accumulates context about where the user has been and what they have seen. A user can virtually walk past a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can recall its previous context, analyze the current geographic input, and answer, “The bus stop is behind you, approximately 12 meters away.”
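The per-step tracking described above can be sketched as a rolling interaction log. This is illustrative only: the real system sends each step's context to the model and relies on the model's own context window rather than querying a local log.

```python
from collections import deque

class SessionContext:
    """Rolling log of (step, heading, nearby places) for one session
    (hypothetical sketch of the tracked per-step context)."""
    def __init__(self, max_steps: int = 1000):
        self.history = deque(maxlen=max_steps)

    def record(self, step: int, heading: str, nearby_places: list[str]):
        self.history.append({"step": step, "heading": heading,
                             "places": nearby_places})

    def last_seen(self, place: str):
        """Return the most recent log entry in which a place was nearby,
        or None if it never appeared."""
        for entry in reversed(self.history):
            if place in entry["places"]:
                return entry
        return None
```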

Testing StreetReaderAI with blind users

To evaluate StreetReaderAI, we conducted an in-person lab study with eleven blind screen reader users. During the sessions, participants learned about StreetReaderAI and used it to explore multiple locations and evaluate potential walking routes to destinations.

Overall, participants reacted positively to StreetReaderAI, rating its overall usefulness 6.4 on average (median=7; SD=0.9) on a Likert scale from 1–7 (where 1 was ‘not at all useful’ and 7 was ‘very useful’), emphasizing the interplay between virtual navigation and AI, the seamlessness of the interactive AI Chat interface, and the value of the information provided. Qualitative feedback consistently highlighted StreetReaderAI as a significant accessibility advance for navigation, noting that existing street view tools lack this level of accessibility. Participants also described the interactive AI chat feature as making conversations about streets and places both engaging and helpful.

During the study, participants visited over 350 panoramas and made over 1,000 AI requests. Interestingly, AI Chat was used six times more often than AI Describer, indicating a clear preference for personalized, conversational inquiries. While participants found value in StreetReaderAI and adeptly combined virtual navigation with AI interactions, there is room for improvement: participants sometimes struggled to orient themselves, to judge the veracity of AI responses, and to determine the limits of the AI's knowledge.

Results

As the first study of an accessible street view system, our research also provides the first-ever analysis of the types of questions blind people ask about streetscape imagery. We analyzed all 917 AI Chat interactions and annotated each with up to three tags drawn from an emergent list of 23 question type categories. The four most common question types included:

  • Spatial orientation: 27.0% of questions concerned the location and distance of objects, e.g., “How far is the bus stop from where I'm standing?” and “Which side of the bench are the garbage cans on?”
  • Object existence: 26.5% of questions asked about the presence of key features like sidewalks, obstacles, and doors, e.g., “Is there a crosswalk here?”
  • General description: 18.4% of questions requested a summary of the current view, often to start an AI Chat session, e.g., “What's in front of me?”
  • Object/place location: 14.9% of questions asked where things were, such as, “Where is the nearest intersection?” or “Can you help me find the door?”
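The multi-tag annotation scheme above (up to three tags per question) can be tallied as in this small sketch; note that percentages can sum to more than 100% because a question may carry several tags:

```python
from collections import Counter

def tag_percentages(annotations: list[list[str]]) -> dict[str, float]:
    """Given per-question tag lists, return each tag's share of total
    questions, rounded to one decimal place (illustrative tally)."""
    n = len(annotations)
    # set() guards against a tag being counted twice on one question.
    counts = Counter(tag for tags in annotations for tag in set(tags))
    return {tag: round(100 * c / n, 1) for tag, c in counts.items()}
```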

StreetReaderAI accuracy

Because StreetReaderAI relies so significantly on AI, a critical challenge is response accuracy. Of the 816 questions that participants asked AI Chat:

  • 703 (86.3%) were answered correctly.
  • 32 (3.9%) were answered incorrectly.
  • The rest were either partially correct (26; 3.2%) or the AI refused to answer (54; 6.6%).

Of the 32 incorrect responses:

  • 20 (62.5%) were false negatives, e.g., stating that a bike rack did not exist when it did.
  • 12 (37.5%) were misidentifications (e.g., a yellow speed bump interpreted as a crosswalk) or miscellaneous errors caused by AI Chat not yet having seen the target in Street View.

More work is necessary to explore how StreetReaderAI performs in other contexts and beyond lab settings.

What’s next?

StreetReaderAI is a promising first step toward making streetscape tools accessible to all. Our study highlights what information blind users desire from and ask about streetscape imagery and the potential for multimodal AI to answer their questions.

There are several other opportunities to expand on this work:

  • Towards Geo-visual Agents: We envision a more autonomous AI Chat agent that can explore on its own. For example, a user could ask, “What’s the next bus stop down this road?” and the agent could automatically navigate the Street View network, find the stop, analyze its features (benches, shelters), and report back.
  • Supporting Route Planning: Similarly, StreetReaderAI does not yet support full origin-to-destination routing. Imagine asking, “What’s the walk like from the nearest subway station to the library?” A future AI agent could “pre-walk” the route, analyzing every Street View image to generate a blind-friendly summary, noting potential obstacles, and identifying the exact location of the library’s door.
  • Richer Audio Interface: The primary output of StreetReaderAI is speech. We are also exploring richer, non-verbal feedback, including spatialized audio and fully immersive 3D audio soundscapes synthesized from the images themselves.

Though a “proof-of-concept” research prototype, StreetReaderAI helps demonstrate the potential of making immersive streetscape environments accessible.

Acknowledgements

This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra and the Google Geo teams for their feedback as well as our participants. Diagram icons are from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.
