
Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach

arXiv, March 31, 2026

Authors: Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati


Abstract: Software systems increasingly include AI components based on deep learning (DL). Reliable testing of such systems requires near-perfect test-input validity and label accuracy, with minimal human effort. Yet the DL community has largely overlooked the need to build highly accurate datasets with minimal effort, since DL training is generally tolerant of labelling errors. This challenge instead reflects concerns more familiar to software engineering, where a central goal is to construct high-accuracy test inputs, with accuracy as close to 100% as possible, while keeping the associated costs in check. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. To evaluate OPAL, we instantiate it for two tasks in the context of testing vision systems: automatic labelling of test inputs and automated validation of test inputs. Our evaluation, based on more than 2500 experiments performed on nine datasets and comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8% while cutting manual labelling by more than half. OPAL significantly outperforms automated labelling baselines in labelling accuracy across all nine datasets when all methods are provided with the same manual-labelling budget. For automated test-input validation, OPAL on average reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the state-of-the-art (SOTA) test-input validation baselines. Finally, we show that augmenting OPAL with an active-learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
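The abstract does not spell out OPAL's MILP formulation, so the following is only a toy illustration of the general idea "minimize manual labelling effort subject to an accuracy target", not the authors' model. It assumes, hypothetically, that test inputs are already partitioned into groups, that each group has an estimated automatic-labelling accuracy, and that manual labels are always correct. The function name `plan_manual_labelling` and the group data are invented for this sketch, and the 0/1 program is solved by exhaustive enumeration instead of a real MILP solver.

```python
from itertools import product


def plan_manual_labelling(groups, target_accuracy):
    """Pick which groups to label by hand so expected accuracy meets the target.

    groups: list of (size, estimated_auto_accuracy) pairs (hypothetical inputs).
    Returns (choice, effort), where choice[i] == 1 means group i is labelled
    manually and effort is the number of manually labelled inputs.
    """
    total = sum(size for size, _ in groups)
    best_choice, best_effort = None, None

    # Enumerate every 0/1 assignment; fine for a handful of groups only.
    for choice in product((0, 1), repeat=len(groups)):
        effort = sum(size for (size, _), x in zip(groups, choice) if x)
        # Assumption: manual labels are perfect, automatic labels are
        # correct at the group's estimated rate.
        expected_correct = sum(
            size if x else size * acc
            for (size, acc), x in zip(groups, choice)
        )
        feasible = expected_correct >= target_accuracy * total - 1e-9
        if feasible and (best_effort is None or effort < best_effort):
            best_choice, best_effort = choice, effort

    return best_choice, best_effort


# Three hypothetical groups: (number of inputs, estimated auto-label accuracy).
example_groups = [(500, 0.99), (300, 0.90), (200, 0.60)]
choice, effort = plan_manual_labelling(example_groups, target_accuracy=0.95)
print(choice, effort)
```

In this made-up instance, labelling only the least reliable group by hand is enough to reach the 95% target, which is the kind of trade-off an effort-minimizing formulation is meant to find. A realistic instance would need a proper MILP solver and accuracy estimates derived from data.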

Comments: Accepted in the Empirical Software Engineering (EMSE) Journal (2026)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Cite as: arXiv:2507.04990 [cs.CV] (or arXiv:2507.04990v3 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2507.04990

arXiv-issued DOI via DataCite

Submission history

From: Mohammad Hossein Amini
[v1] Mon, 7 Jul 2025 13:30:30 UTC (1,500 KB)
[v2] Wed, 17 Sep 2025 17:06:24 UTC (1,597 KB)
[v3] Mon, 30 Mar 2026 15:52:35 UTC (1,521 KB)
