
Enhancing LLM-Based Bug Reproduction for Android Apps via Pre-Assessment of Visual Effects

arXiv cs.SE · by Xiangyang Xiao, Huaxun Huang, Rongxin Wu · April 1, 2026



Abstract: In the development and maintenance of Android apps, the quick and accurate reproduction of user-reported bugs is crucial to ensure application quality and improve user satisfaction. However, this process is often time-consuming and complex. Therefore, there is a need for an automated approach that can explore the Application Under Test (AUT) and identify the correct sequence of User Interface (UI) actions required to reproduce a bug, given only a complete bug report. Large Language Models (LLMs) have shown remarkable capabilities in understanding textual and visual semantics, making them a promising tool for planning UI actions. Nevertheless, our study shows that even when using state-of-the-art LLM-based approaches, these methods still struggle to follow detailed bug reproduction instructions and replan based on new information, due to their inability to accurately predict and interpret the visual effects of UI components. To address these limitations, we propose LTGDroid. Our insight is to execute all possible UI actions on the current UI page during exploration, record their corresponding visual effects, and leverage these visual cues to guide the LLM in selecting UI actions that are likely to reproduce the bug. We evaluated LTGDroid, instantiated with GPT-4.1, on a benchmark consisting of 75 bug reports from 45 popular Android apps. The results show that LTGDroid achieves a reproduction success rate of 87.51%, improving over the state-of-the-art baselines by 49.16% and 556.30%, while requiring an average of 20.45 minutes and approximately $0.27 to successfully reproduce a bug. The LTGDroid implementation is publicly available at this https URL.
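The core insight described in the abstract, probing every available UI action first, recording its visual effect, and only then letting the LLM choose, can be sketched as a small loop. This is a hypothetical illustration, not the paper's actual implementation: the names (`UIAction`, `probe_visual_effect`, `ask_llm_to_rank`) are invented, and the LLM call is replaced by naive keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIAction:
    widget: str  # e.g. the resource id of a UI component
    kind: str    # "tap", "scroll", ...


def probe_visual_effect(action: UIAction) -> str:
    """Stand-in for executing the action on the AUT and diffing
    before/after screenshots to describe what visibly changed."""
    effects = {
        ("settings_btn", "tap"): "opens Settings screen",
        ("list", "scroll"): "reveals more list items",
        ("save_btn", "tap"): "shows saved toast",
    }
    return effects.get((action.widget, action.kind), "no visible change")


def ask_llm_to_rank(bug_step: str, observed: dict) -> UIAction:
    """Stand-in for the LLM call: pick the action whose recorded visual
    effect best matches the next step of the bug report. Here the 'LLM'
    is just keyword overlap between the step and the effect text."""
    def score(item):
        _, effect = item
        return len(set(bug_step.lower().split()) & set(effect.lower().split()))
    return max(observed.items(), key=score)[0]


# One exploration step: probe all candidate actions on the current page,
# record their visual effects, then choose the most promising action.
candidates = [
    UIAction("settings_btn", "tap"),
    UIAction("list", "scroll"),
    UIAction("save_btn", "tap"),
]
observed = {a: probe_visual_effect(a) for a in candidates}
chosen = ask_llm_to_rank("open the Settings screen", observed)
print(chosen.widget)  # settings_btn
```

The point of the pattern is that the selector never has to *predict* what a widget does; it ranks actions against effects that were already observed, which is what the abstract argues current LLM-only planners cannot do reliably.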

Subjects: Software Engineering (cs.SE)

Cite as: arXiv:2603.29623 [cs.SE]

(or arXiv:2603.29623v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2603.29623

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Huaxun Huang [v1] Tue, 31 Mar 2026 11:44:45 UTC (390 KB)
