Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in acad — Mingmeng Geng, Yuhang Dong, Thierry Poibeau
View PDF HTML (experimental)
Abstract:Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
Comments: Visualization of word usage patterns in arXiv abstracts: this https URL
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Cite as: arXiv:2603.25638 [cs.CL]
(or arXiv:2603.25638v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.25638
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Mingmeng Geng [view email] [v1] Thu, 26 Mar 2026 16:49:00 UTC (678 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxivOptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
OptiMer enables flexible continual pre-training by decoupling data mixture ratio selection from training through post-hoc Bayesian optimization of distribution vectors extracted from individual dataset models. (1 upvotes on HuggingFace)
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Discrete Native Autoregressive framework enables unified multimodal processing by representing diverse modalities in a shared discrete space through a novel visual transformer architecture. (43 upvotes on HuggingFace)
CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
CARLA-Air integrates high-fidelity driving and multirotor flight simulation within a unified Unreal Engine framework, supporting joint air-ground agent modeling with photorealistic environments and multi-modal sensing capabilities. (1 upvotes on HuggingFace)
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
CARLA-Air integrates high-fidelity driving and multirotor flight simulation within a unified Unreal Engine framework, supporting joint air-ground agent modeling with photorealistic environments and multi-modal sensing capabilities. (1 upvotes on HuggingFace)
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Discrete Native Autoregressive framework enables unified multimodal processing by representing diverse modalities in a shared discrete space through a novel visual transformer architecture. (43 upvotes on HuggingFace)
OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
OptiMer enables flexible continual pre-training by decoupling data mixture ratio selection from training through post-hoc Bayesian optimization of distribution vectors extracted from individual dataset models. (1 upvotes on HuggingFace)
AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
AutoWeather4D is a 3D-aware weather editing framework that decouples geometry and illumination through a dual-pass mechanism, enabling efficient and physically accurate weather modification for autonomous driving applications. (1 upvotes on HuggingFace)
Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!