Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness
Abstract: TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
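The test setup described in the abstract can be illustrated with a rough sketch. This is not the paper's derivation or its term-weighting formula. The mean/precision parameterization of the beta-binomial, the pooled estimate of the word rate, the gamma hyperparameters (`shape=2.0`, `rate=0.01`), the grid search over the precision, and all function names are assumptions made only for illustration.

```python
import math

def log_binom_coeff(n, k):
    # log of "n choose k" via log-gamma
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta(a, b):
    # log of the beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binomial_loglik(counts, lengths, p):
    # Null hypothesis: the word occurs binomially in every document.
    return sum(
        log_binom_coeff(n, k) + k * math.log(p) + (n - k) * math.log(1 - p)
        for k, n in zip(counts, lengths)
    )

def beta_binomial_loglik(counts, lengths, mean, precision):
    # Alternative hypothesis: beta-binomial counts, parameterized here
    # (an assumption) by mean and precision, so a = mean * precision.
    a = mean * precision
    b = (1 - mean) * precision
    return sum(
        log_binom_coeff(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)
        for k, n in zip(counts, lengths)
    )

def gamma_log_penalty(precision, shape=2.0, rate=0.01):
    # Log-density of a Gamma(shape, rate) penalty on the precision.
    # The hyperparameter values are illustrative, not taken from the paper.
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(precision) - rate * precision)

def penalized_lr_statistic(counts, lengths):
    # counts[i]: occurrences of the word in document i
    # lengths[i]: total number of tokens in document i
    p = sum(counts) / sum(lengths)  # pooled word rate (assumed estimator)
    if not 0.0 < p < 1.0:
        raise ValueError("pooled word rate must lie strictly between 0 and 1")
    # Coarse log-spaced grid for the precision, from 1e-2 to 1e6.
    grid = [10 ** (e / 4) for e in range(-8, 25)]
    null = binomial_loglik(counts, lengths, p)
    alt = max(
        beta_binomial_loglik(counts, lengths, p, s) + gamma_log_penalty(s)
        for s in grid
    )
    return 2.0 * (alt - null)

# Same total count (30 occurrences in 600 tokens), different spread.
lengths = [100] * 6
bursty = penalized_lr_statistic([30, 0, 0, 0, 0, 0], lengths)
even = penalized_lr_statistic([5, 5, 5, 5, 5, 5], lengths)
```

Under these assumptions, a word concentrated in one document yields a larger statistic than a word spread evenly across documents, which is the burstiness signal the abstract refers to.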
Comments: 27 pages, 3 tables, 7 figures, accepted in Discover Computing 2026
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)
Cite as: arXiv:2604.00672 [cs.CL]
(or arXiv:2604.00672v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.00672
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Paul Sheridan
[v1] Wed, 1 Apr 2026 09:17:01 UTC (784 KB)