Models model language model benchmark release announce open-source

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

arXiv cs.SEby Haichuan Hu, Quanjun Zhang, Ye Shang, Guoqing Xie, Chunrong Fang, Zhenyu Chen, Liang XiaoApril 1, 20262 min read0 views

Source Quiz

arXiv:2602.23047v2 Announce Type: replace Abstract: Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a represen

View PDF HTML (experimental)

Abstract:Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.

Comments: 23 pages, 4 figures

Subjects:

Software Engineering (cs.SE)

Cite as: arXiv:2602.23047 [cs.SE]

(or arXiv:2602.23047v2 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2602.23047

arXiv-issued DOI via DataCite

Submission history

From: Shang Ye [view email] [v1] Thu, 26 Feb 2026 14:28:57 UTC (235 KB) [v2] Tue, 31 Mar 2026 14:31:22 UTC (249 KB)

Original source

arXiv cs.SE

https://arxiv.org/abs/2602.23047

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modellanguage modelbenchmark

ProductsLive

How I Replaced 6 Paid AI Subscriptions With One Free Tool (Saved $86/Month)

I was paying $86/month for AI tools. Then I found one free platform that replaced all of them. Here's the exact breakdown: The Tools I Cancelled Tool Cost What I Replaced It With ChatGPT Plus $20/mo Free GPT-4o on Kelora Otter.ai $17/mo Free audio transcription Jasper $49/mo Free AI text tools Total $86/mo $0 GPT-4o — Free Kelora gives direct access to GPT-4o, the same model inside ChatGPT Plus. No subscription, no credit card. I use it daily for code reviews, email drafts, and research summaries. Audio Transcription — Free Upload any audio file — meeting recordings, lectures, podcasts — and get accurate text back in seconds. Replaced my Otter.ai subscription instantly. AI Writing — Free Blog drafts, product copy, social posts. The text tools cover everything Jasper did for me at $49/month

Dev.to AI

2m25 minutes ago

ModelsLive

Own Your Data: The Wake-Up Call

Data plays a critical part in our lives. And with the rapid changes driven by the recent evolution of AI, owning your data is no longer optional! First , we need to answer the following question: "Is your data really safe?" On April 1st, 2026 , an article was published on the Proton blog revealing that Big Tech companies have shared data from 6.9 million user accounts with US authorities over the past decade. Read the full Proton research for more details. Read google's transparency report for user data requests for more details. On January 1st, 2026 , Google published its AI Training Data Transparency Summary it contains the following: This is Google basically saying: "We use your data to train our AI models, but trust us, we're careful about it." On November 24, 2025 , Al Jazeera publish

Dev.to AI

2m25 minutes ago

ReleasesLive

Claude Code subagent patterns: how to break big tasks into bounded scopes

Claude Code Subagent Patterns: How to Break Big Tasks into Bounded Scopes If you've ever given Claude Code a massive task — "refactor the entire auth system" — and watched it spiral into confusion after 20 minutes, you've hit the core problem: unbounded scope kills context . The solution is subagent patterns: structured ways to decompose work into bounded, parallelizable units. Why Big Tasks Fail in Claude Code Claude Code has a finite context window. When you give it a large task: It reads lots of files → context fills up It loses track of what it read first It starts making contradictory changes You hit the context limit mid-task The session crashes and you lose progress The fix isn't a bigger context window — it's smaller tasks. The Subagent Pattern Instead of one Claude session doing e

Dev.to AI

5m25 minutes ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 144 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

ModelsLive

Own Your Data: The Wake-Up Call

Dev.to AI

2m25 minutes ago

ModelsLive

FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning

Researchers introduced a neurosymbolic architecture that constrains LLM-based agents with formal ontologies, improving metric accuracy by 46% and regulatory compliance by 31.8% in controlled experiments. The system, deployed in production, serves 21 industries with over 650 agents. FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning March 2026 — Enterprise adoption of AI agents faces a critical reliability gap: a March 2026 industry report revealed 86% of AI agent pilots fail to reach production due to hallucination, domain drift, and compliance failures. A new research paper, published on arXiv on April 1, 2026, proposes a concrete architectural solution—ontology-constrained neural reasoning—that demonstrates statistically significant

Dev.to AI

7m22 minutes ago

Models

Exclusive | The Sudden Fall of OpenAI’s Most Hyped Product Since ChatGPT - WSJ

Exclusive | The Sudden Fall of OpenAI’s Most Hyped Product Since ChatGPT WSJ

GNews AI video

1m5 days ago

ModelsFresh

AI-Ready America: NSF plans to expand AI training across every state - TechRadar

AI-Ready America: NSF plans to expand AI training across every state TechRadar

GNews AI USA

1mabout 7 hours ago