Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessGeopolitics, AI, and Cybersecurity: Insights From RSAC 2026Dark ReadingThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechPriced Out by AI: The Memory Chip Crisis Hitting Every ConsumerHacker News AI TopShow HN: AgentDog – Open-source dashboard for monitoring local AI agentsHacker News AI TopGemma 4 and Qwen3.5 on shared benchmarksReddit r/LocalLLaMA[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on BlackwellReddit r/MachineLearningThe energy and environmental impact of AI and how it undermines democracy - greenpeace.orgGNews AI energyShow HN: A TUI for checking and comparing cloud and AI pricingHacker News AI TopAttorney General Pam Bondi pushed outAxios TechShow HN: Screenbox – Self-hosted virtual desktops for AI agentsHacker News AI TopMoonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun SunLatent SpaceWe're talking Google, Apple and AI: 19 Years of HN data analyzedHacker News AI TopBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessGeopolitics, AI, and Cybersecurity: Insights From RSAC 2026Dark ReadingThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechPriced Out by AI: The Memory Chip Crisis Hitting Every ConsumerHacker News AI TopShow HN: AgentDog – Open-source dashboard for monitoring local AI agentsHacker News AI TopGemma 4 and Qwen3.5 on shared benchmarksReddit r/LocalLLaMA[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on BlackwellReddit r/MachineLearningThe energy and environmental impact of AI and how it undermines democracy - greenpeace.orgGNews AI energyShow HN: A TUI for checking and comparing cloud and AI pricingHacker News AI TopAttorney General Pam Bondi pushed outAxios TechShow HN: Screenbox – Self-hosted virtual desktops for AI agentsHacker News AI TopMoonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun SunLatent SpaceWe're talking Google, Apple and AI: 19 Years of HN data analyzedHacker News AI Top
AI NEWS HUBbyEIGENVECTOREigenvector

Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

arXiv cs.IRby Zeyad Ahmed, Paul Sheridan, Michael McIsaac, Aitazaz A. FarooqueApril 2, 20261 min read0 views
Source Quiz

arXiv:2604.00672v1 Announce Type: cross Abstract: TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme given rise to by this test statistic performs compar

View PDF HTML (experimental)

Abstract:TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme given rise to by this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.

Comments: 27 pages, 3 tables, 7 figures, accepted in Discover Computing 2026

Subjects:

Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)

Cite as: arXiv:2604.00672 [cs.CL]

(or arXiv:2604.00672v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.00672

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Paul Sheridan [view email] [v1] Wed, 1 Apr 2026 09:17:01 UTC (784 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Common TF-I…modelannounceinsightperspectivecomponentpaperarXiv cs.IR

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 133 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!