A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos

arXiv, March 31, 2026

arXiv:2603.27662v1 (announce type: new)

Authors: David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro

Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and the Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
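To make the contrast between a surface-form metric and an entity-oriented check concrete, here is a minimal Python sketch. It assumes lowercased whitespace tokenization and computes ROUGE-L as the F1 of precision and recall based on the longest common subsequence. The `entity_coverage` helper is a hypothetical illustration of "named-entity coverage" using case-insensitive substring matching; the abstract does not define EFS, so this should not be read as the paper's formula.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def entity_coverage(reference_entities, candidate):
    """Fraction of reference entities that appear in the candidate caption.

    Hypothetical helper for illustration only; NOT the paper's EFS definition.
    Matching is a case-insensitive substring test.
    """
    if not reference_entities:
        return 0.0
    text = candidate.lower()
    hits = sum(1 for entity in reference_entities if entity.lower() in text)
    return hits / len(reference_entities)


if __name__ == "__main__":
    reference = "the minister resigned today in Santiago"
    caption = "in Santiago the minister stepped down today"
    print("ROUGE-L F1:", round(rouge_l_f1(reference, caption), 3))
    print("Entity coverage:", entity_coverage(["Santiago"], caption))
```

In this toy pair the caption conveys the same event and names the same place, yet the lexical score is penalized for the reordering and the synonym, which is the kind of surface-form dependence the abstract describes. In practice the reference entities would come from a named-entity recognizer rather than a hand-written list.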

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27662 [cs.CV] (or arXiv:2603.27662v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27662

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jose M. Saavedra, PhD
[v1] Sun, 29 Mar 2026 12:28:35 UTC (4,439 KB)
