The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation

arXiv cs.CL · by Pavel Braslavski, Dmitrii Iarosh, Nikita Sushko, Andrey Sakhovskiy, Vasily Konovalov, Elena Tutubalina, Alexander Panchenko · April 2, 2026

Abstract: We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.
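The idea of sampling entities across popularity tiers can be illustrated with a minimal sketch. This is not the authors' released pipeline: the `Entity` record, the `assign_tiers` and `sample_controlled` helpers, the numeric popularity score (for example, page views), and the rank-based tier split are all assumptions made for illustration, since the abstract does not say how popularity is measured or how tiers are defined.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record; field names are illustrative only."""
    name: str
    domain: str
    popularity: float  # assumed proxy such as page views; not specified in the abstract


def assign_tiers(entities, n_tiers=3):
    """Rank entities by popularity and split them into n_tiers near-equal groups.

    Tier 0 holds the least popular entities, tier n_tiers - 1 the most popular.
    """
    ranked = sorted(entities, key=lambda e: e.popularity)
    tiers = {t: [] for t in range(n_tiers)}
    for i, entity in enumerate(ranked):
        tiers[i * n_tiers // len(ranked)].append(entity)
    return tiers


def sample_controlled(entities, per_tier, n_tiers=3, seed=0):
    """Draw up to per_tier entities from each tier, so rare entities are not crowded out."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    tiers = assign_tiers(entities, n_tiers)
    sample = []
    for t in range(n_tiers):
        pool = tiers[t]
        sample.extend(rng.sample(pool, min(per_tier, len(pool))))
    return sample
```

Sampling a fixed number per tier, rather than uniformly over all entities, is one simple way to keep low-popularity entities represented in the evaluation set; the actual pipeline may control the distribution differently.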

Comments: Accepted to LREC 2026

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.00019 [cs.CL] (or arXiv:2604.00019v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.00019 (arXiv-issued DOI via DataCite)

Submission history

From: Andrey Sakhovskiy
[v1] Wed, 11 Mar 2026 01:02:55 UTC (1,200 KB)
