WybeCoder: Verified Imperative Code Generation
Abstract: Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements from scaling our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, resulting in hundreds of lines of verified code and overcoming plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path toward automated construction of large-scale datasets of verified imperative code.
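To make the setting concrete, here is a minimal Lean 4 sketch (not from the paper) of the kind of task Verina and Clever pose in their functional form: the model must emit a program together with a machine-checked proof that it meets a formal specification. The names `myMax` and `myMax_spec` are hypothetical, chosen for this illustration; WybeCoder targets the imperative analogue of such specifications, where loop invariants and SMT-dispatched verification conditions play the role the `omega` call plays here.

```lean
-- Illustrative sketch only (hypothetical names, not the paper's benchmark code):
-- a program plus a Lean proof that it satisfies its specification.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Specification: the result bounds both inputs and equals one of them.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧ (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  -- Case-split on the `if`; linear arithmetic closes both subgoals,
  -- much as an SMT solver discharges simple verification conditions.
  split <;> omega
```

In the imperative setting the abstract describes, the analogous obligation is harder: each loop needs a synthesized invariant, and the generated verification conditions that automation cannot discharge become interactive Lean subgoals, which is where the reported scaling over dozens of invariants and subgoals comes in.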
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29088 [cs.SE]
(or arXiv:2603.29088v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.29088
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Mantas Baksys [v1] Tue, 31 Mar 2026 00:06:44 UTC (757 KB)