StreetReaderAI: Towards making street view accessible via context-aware multimodal AI
Generative AI
Interactive streetscape tools, available today in every major mapping service, have revolutionized how people virtually navigate and explore the world — from previewing routes and inspecting destinations to remotely visiting world-class tourist locations. But to date, screen readers have not been able to interpret street view imagery, and alt text is unavailable. We now have an opportunity to redefine this immersive streetscape experience to be inclusive for all with multimodal AI and image understanding. This could eventually allow a service like Google Street View, which has over 220 billion images spanning 110+ countries and territories, to be more accessible to people in the blind and low-vision community, offering an immersive visual experience and opening up new possibilities for exploration.
In “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI”, presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept accessible street view prototype that uses context-aware, real-time AI and accessible navigation controls. StreetReaderAI was designed iteratively by a team of blind and sighted accessibility researchers, drawing on previous work in accessible first-person gaming and navigation tools, such as Shades of Doom, BlindSquare, and SoundScape. Key capabilities include:
- Real-time AI-generated descriptions of nearby roads, intersections, and places.
- Dynamic conversation with a multimodal AI agent about scenes and local geography.
- Accessible panning and movement between panoramic images using voice commands or keyboard shortcuts.
Navigating in StreetReaderAI
StreetReaderAI offers an immersive, first-person exploration experience, much like a video game where audio is the primary interface.
StreetReaderAI provides seamless navigation through both keyboard and voice interaction. Users can explore their surroundings using the left and right arrow keys to shift their view. As the user pans, StreetReaderAI shares audio feedback, voicing the current heading as a cardinal or intercardinal direction (e.g., “Now facing: North” or “Northeast”). It also expresses whether the user can move forward and if they are currently facing a nearby landmark or place.
To move, the user can take “virtual steps” using the up arrow or move backward with the down arrow. As a user moves through the virtual streetscape, StreetReaderAI describes how far the user traveled and key geographic information, such as nearby places. Users can also use “jump” or “teleport” features to quickly move to new locations.
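The panning feedback above amounts to snapping a continuous heading to one of the eight compass directions. As a minimal illustrative sketch (the function and message names here are assumptions, not StreetReaderAI's actual code):

```python
# Hypothetical sketch of the panning feedback described above: mapping a
# heading in degrees to the cardinal/intercardinal label that is voiced.
DIRECTIONS = ["North", "Northeast", "East", "Southeast",
              "South", "Southwest", "West", "Northwest"]

def heading_to_direction(heading_degrees: float) -> str:
    """Snap a 0-360 degree heading to the nearest of 8 compass directions."""
    index = round((heading_degrees % 360) / 45) % 8
    return DIRECTIONS[index]

def pan_announcement(heading_degrees: float) -> str:
    """Build the spoken message announced after each pan."""
    return f"Now facing: {heading_to_direction(heading_degrees)}"
```

In practice this string would be sent to a text-to-speech engine alongside the move/landmark cues the system also voices.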
How StreetReaderAI serves as a virtual guide
The core of StreetReaderAI is its two underlying AI subsystems backed by Gemini: AI Describer and AI Chat. Both subsystems take in a static prompt and optional user profile as well as dynamic information about the user’s current location, such as nearby places, road information, and the current field-of-view image (i.e., what’s being shown in Street View).
AI Describer
AI Describer functions as a context-aware scene description tool that combines dynamic geographic information about the user’s virtual location along with an analysis of the current Street View image to generate a real-time audio description.
It has two modes: a “default” prompt emphasizing navigation and safety for blind pedestrians, and a “tour guide” prompt that provides additional tourism information (e.g., historic and architectural context). We also use Gemini to predict likely follow-up questions specific to the current scene and local geography that may be of interest to blind or low-vision travelers.
AI Chat
AI Chat builds on AI Describer but allows users to ask questions about their current view, past views, and nearby geography. The chat agent uses Google's Multimodal Live API, which supports real-time interaction, function calling, and temporarily retains memory of all interactions within a single session. We track and send each pan or movement interaction along with the user's current view and geographic context (e.g., nearby places, current heading).
What makes AI Chat so powerful is its ability to hold a temporary “memory” of the user's session — the context window is set to a maximum of 1,048,576 input tokens, which is roughly equivalent to over 4k input images. Because AI Chat receives the user's view and location with every virtual step, it collects information about the user’s location and context. A user can virtually walk past a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can recall its previous context, analyze the current geographic input, and answer, “The bus stop is behind you, approximately 12 meters away.”
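The geographic half of an answer like "behind you, approximately 12 meters away" reduces to computing a distance and a bearing relative to the user's heading. The sketch below uses a simple equirectangular approximation (adequate at street scale); it is a minimal illustration of the idea, not StreetReaderAI's implementation, and all names are hypothetical:

```python
import math

def distance_and_relative_bearing(user_lat, user_lng, user_heading,
                                  place_lat, place_lng):
    """Return (distance in meters, bearing relative to the user's heading)
    using an equirectangular approximation, fine over tens of meters."""
    meters_per_deg = 111_320  # approximate meters per degree of latitude
    dx = (place_lng - user_lng) * meters_per_deg * math.cos(math.radians(user_lat))
    dy = (place_lat - user_lat) * meters_per_deg
    distance = math.hypot(dx, dy)
    absolute_bearing = math.degrees(math.atan2(dx, dy)) % 360
    relative = (absolute_bearing - user_heading) % 360
    return distance, relative

def describe_relative(relative_bearing: float) -> str:
    """Turn a relative bearing into the coarse phrasing a guide would use."""
    if relative_bearing < 45 or relative_bearing > 315:
        return "ahead of you"
    if relative_bearing < 135:
        return "to your right"
    if relative_bearing < 225:
        return "behind you"
    return "to your left"
```

Because the agent has logged the bus stop's coordinates from an earlier view, recalling it later is a matter of re-running this kind of computation against the user's current position and heading.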
Testing StreetReaderAI with blind users
To evaluate StreetReaderAI, we conducted an in-person lab study with eleven blind screen reader users. During the sessions, participants learned about StreetReaderAI and used it to explore multiple locations and evaluate potential walking routes to destinations.
Overall, participants reacted positively to StreetReaderAI, rating its overall usefulness 6.4 on average (median=7; SD=0.9) on a 1–7 Likert scale (where 1 was ‘not at all useful’ and 7 was ‘very useful’). They emphasized the interplay between virtual navigation and AI, the seamlessness of the interactive AI Chat interface, and the value of the information provided. Qualitative feedback consistently highlighted StreetReaderAI as a significant accessibility advance for navigation, noting that existing street view tools lack this level of accessibility. Participants also described the interactive AI Chat feature as making conversations about streets and places both engaging and helpful.
During the study, participants visited over 350 panoramas and made over 1,000 AI requests. Interestingly, AI Chat was used six times more often than AI Describer, indicating a clear preference for personalized, conversational inquiries. While participants found value in StreetReaderAI and adeptly combined virtual-world navigation with AI interactions, there is room for improvement: participants sometimes struggled with properly orienting themselves, judging the veracity of AI responses, and determining the limits of the AI's knowledge.
Results
As the first study of an accessible street view system, our research also provides the first-ever analysis of the types of questions blind people ask about streetscape imagery. We analyzed all 917 AI Chat interactions and annotated each with up to three tags drawn from an emergent list of 23 question type categories. The four most common question types included:
- Spatial orientation (27.0% of questions): participants asked about the location and distance of objects, e.g., “How far is the bus stop from where I'm standing?” and “Which side are the garbage cans next to the bench?”
- Object existence (26.5%): queries about the presence of key features like sidewalks, obstacles, and doors, e.g., “Is there a crosswalk here?”
- General description (18.4%): participants often started AI Chat by requesting a summary of the current view, asking, “What's in front of me?”
- Object/place location (14.9%): questions about where things were, such as, “Where is the nearest intersection?” or “Can you help me find the door?”
StreetReaderAI accuracy
Because StreetReaderAI relies so significantly on AI, a critical challenge is response accuracy. Of the 816 questions that participants asked AI Chat:
- 703 (86.3%) were answered correctly.
- 32 (3.9%) were answered incorrectly.
- The remaining responses were either partially correct (26; 3.2%) or refusals to answer (54; 6.6%).
Of the 32 incorrect responses:
- 20 (62.5%) were false negatives, e.g., stating that a bike rack did not exist when it did.
- 12 (37.5%) were misidentifications (e.g., a yellow speed bump interpreted as a crosswalk) or miscellaneous errors caused by AI Chat not yet having seen the target in Street View.
More work is necessary to explore how StreetReaderAI performs in other contexts and beyond lab settings.
What’s next?
StreetReaderAI is a promising first step toward making streetscape tools accessible to all. Our study highlights what information blind users desire from and ask about streetscape imagery and the potential for multimodal AI to answer their questions.
There are several other opportunities to expand on this work:
- Towards Geo-visual Agents: We envision a more autonomous AI Chat agent that can explore on its own. For example, a user could ask, “What’s the next bus stop down this road?” and the agent could automatically navigate the Street View network, find the stop, analyze its features (benches, shelters), and report back.
- Supporting Route Planning: Similarly, StreetReaderAI does not yet support full origin-to-destination routing. Imagine asking, “What’s the walk like from the nearest subway station to the library?” A future AI agent could “pre-walk” the route, analyzing every Street View image to generate a blind-friendly summary, noting potential obstacles, and identifying the exact location of the library’s door.
- Richer Audio Interface: The primary output of StreetReaderAI is speech. We are also exploring richer, non-verbal feedback, including spatialized audio and fully immersive 3D audio soundscapes synthesized from the images themselves.
Though a “proof-of-concept” research prototype, StreetReaderAI helps demonstrate the potential of making immersive streetscape environments accessible.
Acknowledgements
This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra and the Google Geo teams for their feedback as well as our participants. Diagram icons are from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.
Google Research Blog
https://research.google/blog/streetreaderai-towards-making-street-view-accessible-via-context-aware-multimodal-ai/
