From Code Changes to Quality Gains: An Empirical Study in Python ML Systems with PyQu
Abstract: In an era shaped by Generative Artificial Intelligence for code generation and the rising adoption of Python-based Machine Learning systems (MLS), software quality has emerged as a major concern. As these systems grow in complexity and importance, a key obstacle lies in understanding exactly how specific code changes affect overall quality, a shortfall aggravated by the lack of quality-assessment tools and of a clear mapping between ML-system code changes and their quality effects. Although prior work has explored code changes in MLS, it mostly stops at identifying what the changes are, leaving a gap in our knowledge of the relationship between code changes and MLS quality. To address this gap, we conducted a large-scale empirical study of 3,340 open-source Python ML projects, encompassing more than 3.7 million commits and 2.7 trillion lines of code. We introduce PyQu, a novel tool that leverages low-level software metrics to identify quality-enhancing commits, achieving an average accuracy, precision, and recall of 0.84 and an average F1 score of 0.85. Using PyQu and a thematic analysis, we identified 61 code changes, each demonstrating a direct impact on software quality, and classified them into 13 categories based on contextual characteristics. 41% of the changes are newly discovered by our study and had not been identified by state-of-the-art Python change-detection tools. Our work offers a vital foundation for researchers, practitioners, educators, and tool developers, advancing the quest for automated quality assessment and best practices in Python-based ML software.
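The abstract describes classifying commits as quality-enhancing from low-level software metrics. A minimal, hypothetical sketch of that metric-delta idea follows; the function names, metric names, and weights are illustrative assumptions for this sketch, not PyQu's actual features or model, which are described in the paper.

```python
# Hypothetical sketch: flag a commit as quality-enhancing by comparing
# aggregate low-level metrics (e.g., cyclomatic complexity, nesting depth)
# over the files it touches, before vs. after the change.

def metric_deltas(before, after):
    """Per-metric change introduced by a commit.

    `before` and `after` map metric names to aggregate values, e.g.
    {"cyclomatic_complexity": 42, "max_nesting": 6}.
    """
    return {m: after.get(m, 0) - before.get(m, 0)
            for m in set(before) | set(after)}

def is_quality_enhancing(before, after, weights=None):
    """Return True when the weighted sum of metric deltas decreases.

    Negative deltas (lower complexity, less duplication) count as
    improvement; the weights here are arbitrary placeholders.
    """
    weights = weights or {"cyclomatic_complexity": 1.0,
                          "max_nesting": 1.0,
                          "duplicated_lines": 1.0}
    deltas = metric_deltas(before, after)
    score = sum(weights.get(m, 0.0) * d for m, d in deltas.items())
    return score < 0

# A refactoring commit that reduces complexity and nesting:
before = {"cyclomatic_complexity": 42, "max_nesting": 6, "duplicated_lines": 20}
after = {"cyclomatic_complexity": 35, "max_nesting": 4, "duplicated_lines": 20}
print(is_quality_enhancing(before, after))  # → True
```

In practice the metrics would be computed from the repository itself (e.g., with a static-analysis library) and a trained classifier would replace the fixed threshold, but the delta-then-score structure is the core of the idea.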
Comments: Accepted for publication in the proceedings of the IEEE/ACM 48th International Conference on Software Engineering (ICSE 2026)
Subjects:
Software Engineering (cs.SE)
Cite as: arXiv:2511.02827 [cs.SE]
(or arXiv:2511.02827v3 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2511.02827
arXiv-issued DOI via DataCite
Submission history
From: Mohamed Almukhtar [view email] [v1] Tue, 4 Nov 2025 18:55:19 UTC (515 KB) [v2] Wed, 7 Jan 2026 16:16:47 UTC (515 KB) [v3] Wed, 1 Apr 2026 01:00:55 UTC (515 KB)