
Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

arXiv cs.PL · Submitted on 31 Mar 2026

arXiv:2603.29292v1 Announce Type: cross


Abstract: Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
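The two key ideas in the abstract can be made concrete with a small sketch. Note this is an illustrative reading, not the paper's actual definitions: the `behavior_signature` helper, the grouping of sampled programs into behavioral equivalence classes, and the consensus fraction below are all assumptions about how "code semantic entropy" and "behavioral consensus" might plausibly be computed from sampled programs and shared test inputs.

```python
import math
from collections import Counter


def behavior_signature(program, test_inputs):
    """Run a candidate program on shared test inputs and record its outputs.

    Two sampled programs with identical signatures are treated as
    behaviorally equivalent; exceptions are folded into the signature
    so crashing programs form their own classes. (Hypothetical helper,
    not from the paper.)
    """
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as e:
            outputs.append(f"error:{type(e).__name__}")
    return tuple(outputs)


def code_semantic_entropy(programs, test_inputs):
    """Shannon entropy over behavioral equivalence classes of samples.

    Low entropy: samples agree on behavior (problem likely solved or
    trivially hard). High entropy: the model is uncertain, which is the
    regime the abstract suggests is most learnable for the curriculum.
    """
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(sigs)
    n = len(sigs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def consensus_weight(chosen, programs, test_inputs):
    """Fraction of all samples that behave like the chosen program.

    One plausible reading of "behavioral consensus": a preference pair
    whose chosen solution matches the majority behavior gets a larger
    weight in the DPO loss, damping noisy self-generated labels.
    """
    target = behavior_signature(chosen, test_inputs)
    agree = sum(behavior_signature(p, test_inputs) == target for p in programs)
    return agree / len(programs)


# Toy illustration: four sampled "programs" for squaring an integer.
# Three agree behaviorally (x*x and x**2 produce identical outputs);
# one is wrong (x+x), so entropy is moderate and consensus is 3/4.
samples = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x, lambda x: x + x]
H = code_semantic_entropy(samples, test_inputs=[0, 1, 2, 3])   # ~0.811 bits
w = consensus_weight(samples[0], samples, test_inputs=[0, 2, 3])  # 0.75
```

Note that only problem descriptions and test *inputs* are needed here, matching the abstract's setting: no oracle labels the correct output, so correctness is replaced by agreement among the model's own samples.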

Comments: Accepted in the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)

Subjects:

Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Cite as: arXiv:2603.29292 [cs.SE]

(or arXiv:2603.29292v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2603.29292

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Wei Hu [view email] [v1] Tue, 31 Mar 2026 05:55:17 UTC (634 KB)


