The Rise of Language Models in Mining Software Repositories: A Survey
Abstract: The Mining Software Repositories (MSR) field focuses on analysing the rich data contained in software repositories to derive actionable insights into software processes and products. Mining repositories at scale requires techniques capable of handling large volumes of heterogeneous data, a challenge for which language models (LMs) are increasingly well-suited. Since the advent of Transformer-based architectures, LMs have been rapidly adopted across a wide range of MSR tasks. This article presents a comprehensive survey of the use of LMs in MSR, based on an analysis of 85 papers. We examine how LMs are applied, the types of artefacts analysed, which models are used, how their adoption has evolved over time, and the extent to which studies support reproducibility and reuse. Building on this analysis, we propose a taxonomy of LM applications in MSR, identify key trends shaping the field, and highlight open challenges alongside actionable directions for future research.
Subjects: Software Engineering (cs.SE)
Cite as: arXiv:2604.00787 [cs.SE]
(or arXiv:2604.00787v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2604.00787
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Miguel Romero-Arjona
[v1] Wed, 1 Apr 2026 11:53:12 UTC (2,033 KB)