
HippoCamp: Benchmarking Contextual Agents on Personal Computers

ArXiv CS.AI | Submitted on 1 Apr 2026 | April 2, 2026


Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu


Abstract: We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
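The abstract mentions per-capability accuracy and step-wise failure diagnosis without describing the data format. The sketch below shows one way such numbers could be tallied from graded QA records. It is an illustration only: the `QAResult` record, its field names, and the step labels are hypothetical and are not taken from the HippoCamp release, and the demo data is made up rather than drawn from the paper's results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class QAResult:
    """One graded QA pair. All field names here are hypothetical."""
    capability: str                    # e.g. "search", "multi_step_reasoning"
    correct: bool                      # whether the agent's final answer was judged correct
    failed_step: Optional[str] = None  # first annotated step that went wrong, if known


def accuracy_by_capability(results: Iterable[QAResult]) -> Dict[str, float]:
    """Return the fraction of correct answers for each capability label."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.capability] += 1
        hits[r.capability] += int(r.correct)
    return {cap: hits[cap] / totals[cap] for cap in totals}


def failure_histogram(results: Iterable[QAResult]) -> Dict[str, int]:
    """Count incorrect answers by the step at which they first failed."""
    counts: Dict[str, int] = defaultdict(int)
    for r in results:
        if not r.correct and r.failed_step is not None:
            counts[r.failed_step] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up demo records, not benchmark data.
    demo = [
        QAResult("search", True),
        QAResult("search", False, "retrieval"),
        QAResult("multi_step_reasoning", False, "evidence_grounding"),
    ]
    print(accuracy_by_capability(demo))
    print(failure_histogram(demo))
```

Grouping failures by the first failing step is what lets a diagnosis of this kind point at a bottleneck such as perception or evidence grounding, rather than reporting a single aggregate score.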

Comments: Project Page: this https URL

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2604.01221 [cs.AI]

(or arXiv:2604.01221v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.01221

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Shulin Tian
[v1] Wed, 1 Apr 2026 17:58:33 UTC (24,493 KB)

Original source

ArXiv CS.AI

https://arxiv.org/abs/2604.01221
