
A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos

arXiv · March 31, 2026

arXiv:2603.27662v1 · Announce Type: new

Authors: David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro


Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
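The abstract attributes the weak discrimination of standard metrics partly to function-word inflation. The Python sketch below illustrates that effect with ROUGE-L, computed here as the F1 of longest-common-subsequence (LCS) precision and recall (the β = 1 form that many toolkits report), and also shows the usual Mean Reciprocal Rank formula. The tokenizer, the small function-word list, and the example captions are illustrative assumptions. This is not the authors' evaluation code, and it does not implement TFS or EFS, whose definitions are not given in the abstract.

```python
import re

# Illustrative function-word list (an assumption, not the paper's list).
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "at"}


def tokenize(text: str, drop_function_words: bool = False) -> list[str]:
    """Lowercase, split into alphanumeric tokens, optionally drop function words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if drop_function_words:
        tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return tokens


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via a single-row DP table."""
    dp = [0] * (len(b) + 1)  # dp[j] = LCS of a[:i] and b[:j] for current row i
    for x in a:
        prev_diag = 0  # value of dp[j - 1] from the previous row
        for j, y in enumerate(b, start=1):
            prev_row = dp[j]
            dp[j] = prev_diag + 1 if x == y else max(dp[j], dp[j - 1])
            prev_diag = prev_row
    return dp[-1]


def rouge_l_f1(reference: str, candidate: str, drop_function_words: bool = False) -> float:
    """ROUGE-L as the balanced F1 of LCS-based precision and recall."""
    ref = tokenize(reference, drop_function_words)
    cand = tokenize(candidate, drop_function_words)
    lcs = lcs_length(ref, cand)
    if lcs == 0:  # also covers empty token lists, avoiding division by zero
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """MRR over the 1-based rank of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    reference = "The president of Chile visits the port of Valparaiso"
    unrelated = "The anchor of the show talks in the studio of the channel"
    related = "Chile president visits Valparaiso"

    for name, cand in [("unrelated", unrelated), ("related", related)]:
        full = rouge_l_f1(reference, cand)
        content = rouge_l_f1(reference, cand, drop_function_words=True)
        print(f"{name}: all tokens {full:.3f}, content words only {content:.3f}")
    # unrelated: all tokens 0.381, content words only 0.000
    # related: all tokens 0.462, content words only 0.667

    print(f"MRR: {mean_reciprocal_rank([1, 2, 4]):.3f}")  # MRR: 0.583
```

With every token counted, the unrelated caption earns 0.381 solely from shared words such as "the" and "of", only slightly below the related caption's 0.462. Once those words are dropped, the gap widens to 0.000 versus 0.667, which is the kind of surface-form effect the abstract describes.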

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27662 [cs.CV]

(or arXiv:2603.27662v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27662

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jose M. Saavedra PhD

[v1] Sun, 29 Mar 2026 12:28:35 UTC (4,439 KB)
