CL4SE: A Context Learning Benchmark For Software Engineering Tasks
arXiv:2602.23047v2 Announce Type: replace Abstract: Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a represen
View PDF HTML (experimental)
Abstract:Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.
Comments: 23 pages, 4 figures
Subjects:
Software Engineering (cs.SE)
Cite as: arXiv:2602.23047 [cs.SE]
(or arXiv:2602.23047v2 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2602.23047
arXiv-issued DOI via DataCite
Submission history
From: Shang Ye [view email] [v1] Thu, 26 Feb 2026 14:28:57 UTC (235 KB) [v2] Tue, 31 Mar 2026 14:31:22 UTC (249 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modellanguage modelbenchmark
How I Replaced 6 Paid AI Subscriptions With One Free Tool (Saved $86/Month)
I was paying $86/month for AI tools. Then I found one free platform that replaced all of them. Here's the exact breakdown: The Tools I Cancelled Tool Cost What I Replaced It With ChatGPT Plus $20/mo Free GPT-4o on Kelora Otter.ai $17/mo Free audio transcription Jasper $49/mo Free AI text tools Total $86/mo $0 GPT-4o — Free Kelora gives direct access to GPT-4o, the same model inside ChatGPT Plus. No subscription, no credit card. I use it daily for code reviews, email drafts, and research summaries. Audio Transcription — Free Upload any audio file — meeting recordings, lectures, podcasts — and get accurate text back in seconds. Replaced my Otter.ai subscription instantly. AI Writing — Free Blog drafts, product copy, social posts. The text tools cover everything Jasper did for me at $49/month

Own Your Data: The Wake-Up Call
Data plays a critical part in our lives. And with the rapid changes driven by the recent evolution of AI, owning your data is no longer optional! First , we need to answer the following question: "Is your data really safe?" On April 1st, 2026 , an article was published on the Proton blog revealing that Big Tech companies have shared data from 6.9 million user accounts with US authorities over the past decade. Read the full Proton research for more details. Read google's transparency report for user data requests for more details. On January 1st, 2026 , Google published its AI Training Data Transparency Summary it contains the following: This is Google basically saying: "We use your data to train our AI models, but trust us, we're careful about it." On November 24, 2025 , Al Jazeera publish

Claude Code subagent patterns: how to break big tasks into bounded scopes
Claude Code Subagent Patterns: How to Break Big Tasks into Bounded Scopes If you've ever given Claude Code a massive task — "refactor the entire auth system" — and watched it spiral into confusion after 20 minutes, you've hit the core problem: unbounded scope kills context . The solution is subagent patterns: structured ways to decompose work into bounded, parallelizable units. Why Big Tasks Fail in Claude Code Claude Code has a finite context window. When you give it a large task: It reads lots of files → context fills up It loses track of what it read first It starts making contradictory changes You hit the context limit mid-task The session crashes and you lose progress The fix isn't a bigger context window — it's smaller tasks. The Subagent Pattern Instead of one Claude session doing e
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Own Your Data: The Wake-Up Call
Data plays a critical part in our lives. And with the rapid changes driven by the recent evolution of AI, owning your data is no longer optional! First , we need to answer the following question: "Is your data really safe?" On April 1st, 2026 , an article was published on the Proton blog revealing that Big Tech companies have shared data from 6.9 million user accounts with US authorities over the past decade. Read the full Proton research for more details. Read google's transparency report for user data requests for more details. On January 1st, 2026 , Google published its AI Training Data Transparency Summary it contains the following: This is Google basically saying: "We use your data to train our AI models, but trust us, we're careful about it." On November 24, 2025 , Al Jazeera publish

FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning
Researchers introduced a neurosymbolic architecture that constrains LLM-based agents with formal ontologies, improving metric accuracy by 46% and regulatory compliance by 31.8% in controlled experiments. The system, deployed in production, serves 21 industries with over 650 agents. FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning March 2026 — Enterprise adoption of AI agents faces a critical reliability gap: a March 2026 industry report revealed 86% of AI agent pilots fail to reach production due to hallucination, domain drift, and compliance failures. A new research paper, published on arXiv on April 1, 2026, proposes a concrete architectural solution—ontology-constrained neural reasoning—that demonstrates statistically significant




Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!