GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan
Abstract: We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
Comments: 9 figures, 20 tables; code at this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes: H.3.3; I.2.7
Cite as: arXiv:2603.29112 [cs.AI]
(or arXiv:2603.29112v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29112
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Iordanis Fostiropoulos
[v1] Tue, 31 Mar 2026 01:01:56 UTC (10,694 KB)
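
The abstract describes Interest Groundedness (IG) as split into a precision component, which penalizes hallucinated interest categories, and a recall component, which rewards coverage. The abstract does not give the formulas, so the following is only a minimal sketch of that idea, assuming plain set-based precision and recall over category labels. The function name `interest_groundedness` and both arguments are hypothetical and are not taken from the paper or its code release.

```python
from typing import Iterable, Tuple


def interest_groundedness(
    predicted: Iterable[str], verified: Iterable[str]
) -> Tuple[float, float]:
    """Set-based precision/recall sketch (an assumption, not the paper's definition).

    predicted: interest categories proposed by the LLM for one user.
    verified:  interest categories supported by evidence in that user's
               engagement history (hypothetical reference set).
    """
    predicted_set, verified_set = set(predicted), set(verified)
    grounded = predicted_set & verified_set

    # Precision drops when the model proposes categories with no supporting evidence.
    precision = len(grounded) / len(predicted_set) if predicted_set else 0.0
    # Recall drops when evidence-backed categories are missing from the prediction.
    recall = len(grounded) / len(verified_set) if verified_set else 0.0
    return precision, recall


precision, recall = interest_groundedness(
    ["cooking", "travel", "chess"],
    ["cooking", "travel", "fitness", "music"],
)
```

In this toy example, one of the three predicted categories ("chess") has no supporting evidence, and two of the four evidence-backed categories are missed. The benchmark's actual IG metrics may weight or verify categories differently.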