Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning
arXiv:2603.17233v2 Announce Type: replace
Authors: Zhiyu Ni, Zheng Liang, Liangcheng Song, Chenrui Cao, Xian Zhang, Alberto Sangiovanni-Vincentelli, Pierluigi Nuzzo
Abstract: Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are brittle: programs may fail to execute, or may execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantic failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It then prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from the surviving paths via majority voting. Across four representative benchmarks (AR-LSAT, ProofWriter, PrOntoQA, LogicalDeduction), D&P substantially strengthens AF-based reasoning without extra supervision. On AR-LSAT, in the AF-only setting, D&P achieves 78.43% accuracy with GPT-4 and 78.00% accuracy with GPT-4o, significantly outperforming the strongest AF baselines, MAD-LOGIC and CLOVER. D&P also attains near-ceiling performance on the other benchmarks, including 100% on PrOntoQA and LogicalDeduction.
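The draft, prune, and vote steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `draft_and_prune`, its callback parameters (`draft_plan`, `formalize`, `check_option`), and the toy stubs are all hypothetical names introduced here. The abstract does not define "contradictory" or "ambiguous" precisely, so the sketch assumes a multiple-choice setting in which a formalization is contradictory when it accepts no option and ambiguous when it accepts more than one.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def draft_and_prune(
    problem: str,
    options: Sequence[str],
    draft_plan: Callable[[str, int], str],
    formalize: Callable[[str, str], str],
    check_option: Callable[[str, str], Optional[bool]],
    num_paths: int = 5,
) -> Optional[str]:
    """Return the majority answer over surviving paths, or None if all are pruned.

    All callbacks are hypothetical placeholders: in a real pipeline,
    draft_plan and formalize would call a language model and check_option
    would run a symbolic solver.
    """
    votes: Counter = Counter()
    for index in range(num_paths):
        plan = draft_plan(problem, index)      # draft a natural-language plan
        program = formalize(problem, plan)     # generate a program conditioned on the plan
        verdicts = [check_option(program, option) for option in options]
        if any(verdict is None for verdict in verdicts):
            continue  # the program failed to execute
        accepted = [o for o, verdict in zip(options, verdicts) if verdict]
        if len(accepted) != 1:
            continue  # assumed pruning rule: no option (contradictory) or several (ambiguous)
        votes[accepted[0]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy stand-ins for the model and the solver (illustrative only).
OPTIONS = ["A", "B", "C"]
TOY_RESULTS = {
    "plan-0": {"B"},       # well-formed: votes for B
    "plan-1": {"A", "B"},  # ambiguous: pruned
    "plan-2": set(),       # contradictory: pruned
    "plan-3": {"B"},       # well-formed: votes for B
    "plan-4": None,        # fails to execute: skipped
}


def toy_draft_plan(problem: str, index: int) -> str:
    return f"plan-{index}"


def toy_formalize(problem: str, plan: str) -> str:
    return plan  # the "program" is just the plan label in this toy setup


def toy_check_option(program: str, option: str) -> Optional[bool]:
    accepted = TOY_RESULTS[program]
    if accepted is None:
        return None
    return option in accepted


answer = draft_and_prune(
    "toy problem", OPTIONS, toy_draft_plan, toy_formalize, toy_check_option
)
print(answer)  # B
```

In this toy run, only two of the five paths survive pruning and both vote for option B. Returning `None` when every path is pruned is a design choice of the sketch; the abstract does not say how that case is handled.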
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.17233 [cs.AI]
(or arXiv:2603.17233v2 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.17233
arXiv-issued DOI via DataCite
Submission history
From: Zhiyu Ni
[v1] Wed, 18 Mar 2026 00:35:14 UTC (361 KB)
[v2] Thu, 26 Mar 2026 23:54:42 UTC (357 KB)