StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
arXiv:2512.01707v2 Announce Type: replace-cross
Authors: Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal
Abstract: Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.
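The abstract names fixation extraction and scanpath construction as pipeline stages but does not say which algorithm is used. As a purely illustrative sketch, and not the StreamGaze implementation, the following shows one common way to turn a raw gaze trajectory into fixations: the dispersion-threshold (I-DT) method. The sample layout `(timestamp_s, x, y)`, the `Fixation` record, the function name `extract_fixations`, and both threshold defaults are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, x, y) with x, y
# normalized to [0, 1] in the video frame.
GazeSample = Tuple[float, float, float]


@dataclass
class Fixation:
    start: float  # timestamp of the first sample in the fixation
    end: float    # timestamp of the last sample in the fixation
    x: float      # centroid x
    y: float      # centroid y


def _dispersion(window: List[GazeSample]) -> float:
    """I-DT dispersion: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def extract_fixations(
    samples: List[GazeSample],
    max_dispersion: float = 0.05,  # assumed threshold, not from the paper
    min_duration: float = 0.1,     # assumed threshold, not from the paper
) -> List[Fixation]:
    """Group time-ordered gaze samples into fixations with I-DT."""
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i < n:
        # Find the first index j whose timestamp is at least
        # min_duration after the window start.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break  # not enough remaining samples to span min_duration
        if _dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the gaze stays tightly clustered.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            fixations.append(
                Fixation(
                    start=window[0][0],
                    end=window[-1][0],
                    x=sum(s[1] for s in window) / len(window),
                    y=sum(s[2] for s in window) / len(window),
                )
            )
            i = j + 1
        else:
            i += 1  # window too spread out: slide the start forward
    return fixations
```

Under these assumptions, the time-ordered list of fixations returned by `extract_fixations` can be read as a simple scanpath: a sequence of gaze locations with onset and offset times that could then be aligned with video frames.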
Comments: Accepted to CVPR 2026, Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2512.01707 [cs.CV]
(or arXiv:2512.01707v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.01707
arXiv-issued DOI via DataCite
Submission history
From: Daeun Lee
[v1] Mon, 1 Dec 2025 14:15:44 UTC (4,693 KB)
[v2] Fri, 27 Mar 2026 17:30:08 UTC (4,748 KB)