The Rise of Language Models in Mining Software Repositories: A Survey

arXiv cs.SE [Submitted on 1 Apr 2026] April 2, 2026

Abstract: The Mining Software Repositories (MSR) field focuses on analysing the rich data contained in software repositories to derive actionable insights into software processes and products. Mining repositories at scale requires techniques capable of handling large volumes of heterogeneous data, a challenge for which language models (LMs) are increasingly well-suited. Since the advent of Transformer-based architectures, LMs have been rapidly adopted across a wide range of MSR tasks. This article presents a comprehensive survey of the use of LMs in MSR, based on an analysis of 85 papers. We examine how LMs are applied, the types of artefacts analysed, which models are used, how their adoption has evolved over time, and the extent to which studies support reproducibility and reuse. Building on this analysis, we propose a taxonomy of LM applications in MSR, identify key trends shaping the field, and highlight open challenges alongside actionable directions for future research.

Subjects: Software Engineering (cs.SE)

Cite as: arXiv:2604.00787 [cs.SE] (or arXiv:2604.00787v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.00787

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Miguel Romero-Arjona
[v1] Wed, 1 Apr 2026 11:53:12 UTC (2,033 KB)

Original source

arXiv cs.SE

https://arxiv.org/abs/2604.00787
