APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay
arXiv:2603.29093v1 Announce Type: new
Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches an 83.3% success rate (SR) from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting that backbone differences are controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
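The hybrid retrieval the abstract describes (semantic search plus structural signature matching) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`Experience`, `hybrid_score`, `retrieve`), the 0.6/0.4 weights, and the bag-of-words cosine standing in for a learned embedding are all assumptions, and the plan DAG traversal component is omitted.

```python
from collections import Counter
from dataclasses import dataclass
import math

@dataclass
class Experience:
    task_text: str    # natural-language task description
    signature: tuple  # structural signature, e.g. ordered plan operation types
    plan: list        # stored procedural plan (opaque here)
    succeeded: bool   # dual-outcome memory keeps failures too

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_text, query_sig, exp, w_sem=0.6, w_struct=0.4):
    """Blend lexical similarity with an exact structural-signature match."""
    sem = cosine(Counter(query_text.lower().split()),
                 Counter(exp.task_text.lower().split()))
    struct = 1.0 if query_sig == exp.signature else 0.0
    return w_sem * sem + w_struct * struct

def retrieve(memory, query_text, query_sig, k=2):
    """Return the k best-matching past experiences for in-context reuse."""
    return sorted(memory,
                  key=lambda e: hybrid_score(query_text, query_sig, e),
                  reverse=True)[:k]
```

Because the structural term ignores wording, a query sharing little or no vocabulary with a stored task can still retrieve it when the plan signatures coincide, which is the kind of cross-domain transfer the abstract claims.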
Comments: 17 pages, 13 figures
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as: arXiv:2603.29093 [cs.CL]
(or arXiv:2603.29093v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29093
arXiv-issued DOI via DataCite
Submission history
From: Pratyay Banerjee [view email] [v1] Tue, 31 Mar 2026 00:24:56 UTC (1,096 KB) [v2] Thu, 2 Apr 2026 21:09:27 UTC (1,096 KB)
SwEYEpinch: Exploring Intuitive, Efficient Text Entry for Extended Reality via Eye and Hand Tracking
arXiv:2604.03520v1 Announce Type: new Abstract: Despite steady progress, text entry in Extended Reality (XR) often remains slower and more effortful than typing on a physical keyboard or touchscreen. We explore a simple idea: use gaze to swipe through a virtual keyboard for the fast, low-effort "where", and a manual pinch held throughout the swipe for the "when", extending and validating it through a series of user studies. We first show that a basic version including a low-latency decoder with spatiotemporal Dynamic Time Warping and fixation filtering outperforms selecting individual keys sequentially, either by finger tapping each or gazing at each while pinching. We then add mid-swipe prediction and in-gesture cancellation, improving words per minute (WPM) without hurting accuracy. We show t
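The Dynamic Time Warping decoder the abstract mentions can be illustrated with a minimal sketch. The function names, the word-template dictionary, and the plain Euclidean point cost are assumptions for illustration; the paper's spatiotemporal decoder also performs fixation filtering, which is omitted here.

```python
import math

def dtw_distance(path_a, path_b):
    """Dynamic Time Warping distance between two 2-D point sequences."""
    n, m = len(path_a), len(path_b)
    INF = float("inf")
    # D[i][j] holds the cheapest alignment cost of path_a[:i] with path_b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(path_a[i - 1], path_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a gaze sample
                                 D[i][j - 1],      # skip a template point
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

def decode(gaze_path, templates):
    """Return the word whose ideal key-path best matches the gaze swipe."""
    return min(templates, key=lambda w: dtw_distance(gaze_path, templates[w]))
```

DTW is a natural fit here because gaze swipes vary in speed and sampling density: the warping lets a slow, jittery pass over a key align with a single template point without penalty for tempo.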

Occupational Diversity and Stratification in Platform Work: A Longitudinal Study of Online Freelancers
arXiv:2604.03517v1 Announce Type: new Abstract: We focus on occupational diversity in platform-mediated work to advance conceptual and empirical insight into the occupationally embedded nature of platform labor. We pursue this focus in response to a prevailing tendency to treat platform workers as a homogeneous group, overlooking the unique demands, constraints, and practices rooted in specific professions. Such generalizations hinder both understanding of platform work and the development of sociotechnical systems that support differentiated occupational realities. To address this gap, we present a longitudinal analysis of 108 online freelancers spanning five occupational categories. We show that occupational context structures workers' capacity to interpret and navigate platformic manage

The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading
arXiv:2604.03501v1 Announce Type: new Abstract: Experimental evidence confirms that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. We develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool's productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption. Second, when manager
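The trade-off this model formalizes can be simulated with a toy discrete-time sketch. The functional forms, parameter names (`alpha`, `beta`, `delta`, `rho`), and values below are my assumptions for illustration, not the paper's model.

```python
def simulate(u, T=200, s0=1.0, alpha=0.5, beta=0.8, delta=0.03, rho=0.01):
    """Worker output over T periods under constant AI usage intensity u in [0, 1].

    Per-period output splits into the two channels the abstract names:
    unaided work, which scales with current skill s, plus an AI channel
    with a skill-independent part (alpha) and a part that scales with
    expertise (beta * s). Skill erodes in proportion to AI usage and
    recovers slowly toward s0 with unaided practice.
    """
    s = s0
    outputs = []
    for _ in range(T):
        outputs.append((1 - u) * s + u * (alpha + beta * s))
        s = max(0.0, s + rho * (1 - u) * (s0 - s) - delta * u * s)
    return outputs
```

With these illustrative numbers, heavy usage front-loads output but erodes skill enough that long-run output falls below the pre-adoption level (output s0 at u = 0), mirroring the steady-state loss result.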
Messages in a Digital Bottle: A Youth-Coauthored Perspective on LLM Chatbots and Adolescent Loneliness
arXiv:2604.03470v1 Announce Type: new Abstract: Adolescent loneliness is a growing concern in digitally mediated social environments. This work-in-progress presents a youth-authored critical synthesis on chatbots powered by Large Language Models (LLMs) and adolescent loneliness. The first author is a 16-year-old Chinese student who recently migrated to the UK. She wrote the first draft of this paper from her lived experience, supervised by the second author. Rather than treating the youth perspective as one data point among many, we foreground it as the primary interpretive lens, grounded in interdisciplinary literature from social computing, developmental psychology, and Human-Computer Interaction (HCI). We examine how chatbots shape experiences of loneliness differently across adolescent s

Incidental Interaction: Technology to Support Elder Strength Training through Everyday Movements
arXiv:2604.03241v1 Announce Type: new Abstract: Strength training is a key determinant of healthy aging, yet adherence to formal exercise programs among older adults remains low. While many technologies aim to encourage physical activity in older adults, they typically rely on dedicated devices, wearables, or explicit exercise tasks. They therefore do not embed task practice into daily life. Our new approach, termed Incidental Interaction, instead transforms everyday actions into opportunities for deliberate strength building. It thereby operationalizes everyday movements such as sitting, standing, or lifting objects as strength exercises, encouraging participants to repeat them to build functional capacity. This repetition is encapsulated in the phrase "do it twice", and is combined with

Measuring Human Preferences in RLHF is a Social Science Problem
arXiv:2604.03238v1 Announce Type: new Abstract: RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference

The Persuasion Paradox: When LLM Explanations Fail to Improve Human-AI Team Performance
arXiv:2604.03237v1 Announce Type: new Abstract: While natural-language explanations from large language models (LLMs) are widely adopted to improve transparency and trust, their impact on objective human-AI team performance remains poorly understood. We identify a Persuasion Paradox: fluent explanations systematically increase user confidence and reliance on AI without reliably improving task accuracy, and in some cases even undermining it. Across three controlled human-subject studies spanning abstract visual reasoning (RAVEN matrices) and deductive logical reasoning (LSAT problems), we disentangle the effects of AI predictions and explanations using a multi-stage reveal design and between-subjects comparisons. In visual reasoning, LLM explanations increase confidence but do not improve accuracy
