WybeCoder: Verified Imperative Code Generation
Abstract: Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements from scaling our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, resulting in hundreds of lines of verified code and overcoming plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path toward automated construction of large-scale datasets of verified imperative code.
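To make the setting concrete, here is a minimal Lean 4 sketch (not from the paper) of the kind of task Verina and Clever pose in their functional form: the model must emit a program together with a machine-checked proof that it meets a formal specification. The names `myMax` and `myMax_spec` are hypothetical, chosen for this illustration; WybeCoder targets the imperative analogue of such specifications, where loop invariants and SMT-dispatched verification conditions play the role the `omega` call plays here.

```lean
-- Illustrative sketch only (hypothetical names, not the paper's benchmark code):
-- a program plus a Lean proof that it satisfies its specification.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Specification: the result bounds both inputs and equals one of them.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧ (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  -- Case-split on the `if`; linear arithmetic closes both subgoals,
  -- much as an SMT solver discharges simple verification conditions.
  split <;> omega
```

In the imperative setting the abstract describes, the analogous obligation is harder: each loop needs a synthesized invariant, and the generated verification conditions that automation cannot discharge become interactive Lean subgoals, which is where the reported scaling over dozens of invariants and subgoals comes in.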
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29088 [cs.SE]
(or arXiv:2603.29088v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.29088
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Mantas Baksys [v1] Tue, 31 Mar 2026 00:06:44 UTC (757 KB)