A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos
arXiv:2603.27662v1 Announce Type: new Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datase — David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro
View PDF HTML (experimental)
Abstract:News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma~3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27662 [cs.CV]
(or arXiv:2603.27662v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27662
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jose M. Saavedra PhD [view email] [v1] Sun, 29 Mar 2026 12:28:35 UTC (4,439 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.


Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!