Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
arXiv:2603.29292v1 Announce Type: cross
Abstract: Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
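The abstract does not give formulas, but the two ideas can be illustrated with a minimal sketch. Assuming "code semantic entropy" means clustering sampled candidate programs by identical input/output behavior on shared test inputs and taking the entropy of the cluster distribution, and that "behavioral consensus" is the fraction of samples agreeing with a given behavior, a plausible (hypothetical, not the authors') implementation looks like:

```python
# Hypothetical sketch: all names and formulas here are illustrative
# assumptions, not taken from the ConSelf paper itself.
import math
from collections import Counter

def behavior_signature(program, test_inputs):
    """Run a candidate program on shared test inputs; the tuple of outputs
    acts as its behavioral fingerprint (any runtime error collapses to a
    sentinel value, so crashing programs cluster together)."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(program(x))
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)

def code_semantic_entropy(programs, test_inputs):
    """Entropy over behavior clusters: 0 when every sample behaves
    identically, log(N) when all N samples behave differently."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(sigs)
    n = len(sigs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def consensus_weight(programs, chosen, test_inputs):
    """Fraction of samples sharing the chosen program's behavior -- one
    plausible per-pair weight for consensus-driven preference tuning."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    return sigs.count(behavior_signature(chosen, test_inputs)) / len(sigs)

# Toy example: three samples agree on doubling, one squares instead.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: 2 * x, lambda x: x ** 2]
entropy = code_semantic_entropy(samples, test_inputs=[1, 2, 3])
weight = consensus_weight(samples, samples[0], test_inputs=[1, 2, 3])
```

Under this reading, low-entropy problems are ones the model already solves consistently, high-entropy ones are where its samples disagree, and the weight down-scales preference pairs whose "chosen" solution lacks behavioral support.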
Comments: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Cite as: arXiv:2603.29292 [cs.SE]
(or arXiv:2603.29292v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.29292
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wei Hu [view email] [v1] Tue, 31 Mar 2026 05:55:17 UTC (634 KB)