GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

ArXiv CS.AI · Submitted on 31 Mar 2026 · April 1, 2026

Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan

Abstract: We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
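The abstract does not give formulas for Interest Groundedness, so the following is only an illustrative sketch of how a precision/recall decomposition over interest categories could look. It assumes a plain set-overlap definition; the function name `interest_groundedness`, the `supported` set standing in for evidence-backed interests, and the example categories are all hypothetical and are not taken from the paper or its code.

```python
from typing import Set, Tuple


def interest_groundedness(predicted: Set[str], supported: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one user's predicted interest categories.

    predicted: categories the LLM claims the user is interested in.
    supported: categories backed by evidence in the engagement history
               (a hypothetical stand-in; the paper's verification step is not reproduced here).
    """
    verified = predicted & supported
    # Precision drops when the model predicts categories with no supporting evidence.
    precision = len(verified) / len(predicted) if predicted else 0.0
    # Recall rises as more of the evidence-backed categories are covered.
    recall = len(verified) / len(supported) if supported else 0.0
    return precision, recall


# Made-up example data, for illustration only.
predicted = {"cooking", "travel", "chess"}
supported = {"cooking", "travel", "gardening", "fitness"}
precision, recall = interest_groundedness(predicted, supported)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In this toy case "chess" is an unsupported prediction, which lowers precision, while "gardening" and "fitness" are missed, which lowers recall. The benchmark's actual scoring may differ from this set-based simplification.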

Comments: 9 figures, 20 tables; code at this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

ACM classes: H.3.3; I.2.7

Cite as: arXiv:2603.29112 [cs.AI] (or arXiv:2603.29112v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29112

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Iordanis Fostiropoulos
[v1] Tue, 31 Mar 2026 01:01:56 UTC (10,694 KB)

Original source: ArXiv CS.AI, https://arxiv.org/abs/2603.29112
