
GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

ArXiv CS.AI | Submitted on 31 Mar 2026 | Posted April 1, 2026

arXiv:2603.29112v1 Announce Type: new

Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan

Abstract: We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
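The precision and recall split of Interest Groundedness can be illustrated with a small set-based sketch. This is not the paper's definition, which the abstract does not spell out; it assumes that both the LLM-predicted interests and the evidence-supported interests are available as sets of category labels. The function name `interest_groundedness` and the example labels are hypothetical.

```python
from typing import Iterable, Tuple


def interest_groundedness(
    predicted: Iterable[str], supported: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for predicted interest categories.

    Assumption: `supported` holds the categories that the engagement
    evidence actually backs. How the benchmark verifies them is not
    described in the abstract, so this is an illustration only.
    """
    # Normalize labels so that "Cooking" and "cooking " compare equal.
    pred = {p.strip().lower() for p in predicted}
    gold = {s.strip().lower() for s in supported}
    hits = pred & gold

    # Precision falls when the model hallucinates unsupported categories.
    precision = len(hits) / len(pred) if pred else 0.0
    # Recall rises as more of the supported categories are covered.
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical profile: "chess" has no supporting engagement evidence.
precision, recall = interest_groundedness(
    ["cooking", "travel", "chess"],
    ["cooking", "travel", "fitness", "gardening"],
)
```

In this example two of the three predicted categories are supported, and two of the four supported categories are recovered, so the unsupported "chess" lowers precision while the missed "fitness" and "gardening" lower recall.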

Comments: 9 figures, 20 tables; code at this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

ACM classes: H.3.3; I.2.7

Cite as: arXiv:2603.29112 [cs.AI]

(or arXiv:2603.29112v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29112

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Iordanis Fostiropoulos
[v1] Tue, 31 Mar 2026 01:01:56 UTC (10,694 KB)
