Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
arXiv:2603.29292v1 Announce Type: cross
Abstract: Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
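The abstract does not give formulas, but the two ideas can be illustrated with a minimal sketch. Assuming "code semantic entropy" means clustering sampled candidate programs by identical input/output behavior on shared test inputs and taking the entropy of the cluster distribution, and that "behavioral consensus" is the fraction of samples agreeing with a given behavior, a plausible (hypothetical, not the authors') implementation looks like:

```python
# Hypothetical sketch: all names and formulas here are illustrative
# assumptions, not taken from the ConSelf paper itself.
import math
from collections import Counter

def behavior_signature(program, test_inputs):
    """Run a candidate program on shared test inputs; the tuple of outputs
    acts as its behavioral fingerprint (any runtime error collapses to a
    sentinel value, so crashing programs cluster together)."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(program(x))
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)

def code_semantic_entropy(programs, test_inputs):
    """Entropy over behavior clusters: 0 when every sample behaves
    identically, log(N) when all N samples behave differently."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(sigs)
    n = len(sigs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def consensus_weight(programs, chosen, test_inputs):
    """Fraction of samples sharing the chosen program's behavior -- one
    plausible per-pair weight for consensus-driven preference tuning."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    return sigs.count(behavior_signature(chosen, test_inputs)) / len(sigs)

# Toy example: three samples agree on doubling, one squares instead.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: 2 * x, lambda x: x ** 2]
entropy = code_semantic_entropy(samples, test_inputs=[1, 2, 3])
weight = consensus_weight(samples, samples[0], test_inputs=[1, 2, 3])
```

Under this reading, low-entropy problems are ones the model already solves consistently, high-entropy ones are where its samples disagree, and the weight down-scales preference pairs whose "chosen" solution lacks behavioral support.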
Comments: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Cite as: arXiv:2603.29292 [cs.SE]
(or arXiv:2603.29292v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.29292
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wei Hu [view email] [v1] Tue, 31 Mar 2026 05:55:17 UTC (634 KB)