APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay
arXiv:2603.29093v1 Announce Type: new
Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches an 83.3% success rate (SR) from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting that backbone differences are controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
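The hybrid retrieval the abstract describes (semantic search plus structural signature matching) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`Experience`, `hybrid_score`, `retrieve`), the 0.6/0.4 weights, and the bag-of-words cosine standing in for a learned embedding are all assumptions, and the plan DAG traversal component is omitted.

```python
from collections import Counter
from dataclasses import dataclass
import math

@dataclass
class Experience:
    task_text: str    # natural-language task description
    signature: tuple  # structural signature, e.g. ordered plan operation types
    plan: list        # stored procedural plan (opaque here)
    succeeded: bool   # dual-outcome memory keeps failures too

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_text, query_sig, exp, w_sem=0.6, w_struct=0.4):
    """Blend lexical similarity with an exact structural-signature match."""
    sem = cosine(Counter(query_text.lower().split()),
                 Counter(exp.task_text.lower().split()))
    struct = 1.0 if query_sig == exp.signature else 0.0
    return w_sem * sem + w_struct * struct

def retrieve(memory, query_text, query_sig, k=2):
    """Return the k best-matching past experiences for in-context reuse."""
    return sorted(memory,
                  key=lambda e: hybrid_score(query_text, query_sig, e),
                  reverse=True)[:k]
```

Because the structural term ignores wording, a query sharing little or no vocabulary with a stored task can still retrieve it when the plan signatures coincide, which is the kind of cross-domain transfer the abstract claims.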
Comments: 17 pages, 13 figures
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as: arXiv:2603.29093 [cs.CL]
(or arXiv:2603.29093v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29093
arXiv-issued DOI via DataCite
Submission history
From: Pratyay Banerjee [view email] [v1] Tue, 31 Mar 2026 00:24:56 UTC (1,096 KB) [v2] Thu, 2 Apr 2026 21:09:27 UTC (1,096 KB)
SwEYEpinch: Exploring Intuitive, Efficient Text Entry for Extended Reality via Eye and Hand Tracking
arXiv:2604.03520v1 Announce Type: new Abstract: Despite steady progress, text entry in Extended Reality (XR) often remains slower and more effortful than typing on a physical keyboard or touchscreen. We explore a simple idea: use gaze to swipe through a virtual keyboard for the fast, low-effort "where", and a manual pinch held throughout the swipe for the "when", extending and validating it through a series of user studies. We first show that a basic version including a low-latency decoder with spatiotemporal Dynamic Time Warping and fixation filtering outperforms selecting individual keys sequentially, either by finger tapping each or gazing at each while pinching. We then add mid-swipe prediction and in-gesture cancellation, improving words per minute (WPM) without hurting accuracy. We show t
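The Dynamic Time Warping decoder the abstract mentions can be illustrated with a minimal sketch. The function names, the word-template dictionary, and the plain Euclidean point cost are assumptions for illustration; the paper's spatiotemporal decoder also performs fixation filtering, which is omitted here.

```python
import math

def dtw_distance(path_a, path_b):
    """Dynamic Time Warping distance between two 2-D point sequences."""
    n, m = len(path_a), len(path_b)
    INF = float("inf")
    # D[i][j] holds the cheapest alignment cost of path_a[:i] with path_b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(path_a[i - 1], path_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a gaze sample
                                 D[i][j - 1],      # skip a template point
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

def decode(gaze_path, templates):
    """Return the word whose ideal key-path best matches the gaze swipe."""
    return min(templates, key=lambda w: dtw_distance(gaze_path, templates[w]))
```

DTW is a natural fit here because gaze swipes vary in speed and sampling density: the warping lets a slow, jittery pass over a key align with a single template point without penalty for tempo.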

Occupational Diversity and Stratification in Platform Work: A Longitudinal Study of Online Freelancers
arXiv:2604.03517v1 Announce Type: new Abstract: We focus on occupational diversity in platform-mediated work to advance conceptual and empirical insight into the occupationally embedded nature of platform labor. We pursue this focus in response to a prevailing tendency to treat platform workers as a homogeneous group, overlooking the unique demands, constraints, and practices rooted in specific professions. Such generalizations hinder both understanding of platform work and the development of sociotechnical systems that support differentiated occupational realities. To address this gap, we present a longitudinal analysis of 108 online freelancers spanning five occupational categories. We show that occupational context structures workers' capacity to interpret and navigate platformic manage

The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading
arXiv:2604.03501v1 Announce Type: new Abstract: Experimental evidence confirms that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. We develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool's productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption. Second, when manager
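The trade-off this model formalizes can be simulated with a toy discrete-time sketch. The functional forms, parameter names (`alpha`, `beta`, `delta`, `rho`), and values below are my assumptions for illustration, not the paper's model.

```python
def simulate(u, T=200, s0=1.0, alpha=0.5, beta=0.8, delta=0.03, rho=0.01):
    """Worker output over T periods under constant AI usage intensity u in [0, 1].

    Per-period output splits into the two channels the abstract names:
    unaided work, which scales with current skill s, plus an AI channel
    with a skill-independent part (alpha) and a part that scales with
    expertise (beta * s). Skill erodes in proportion to AI usage and
    recovers slowly toward s0 with unaided practice.
    """
    s = s0
    outputs = []
    for _ in range(T):
        outputs.append((1 - u) * s + u * (alpha + beta * s))
        s = max(0.0, s + rho * (1 - u) * (s0 - s) - delta * u * s)
    return outputs
```

With these illustrative numbers, heavy usage front-loads output but erodes skill enough that long-run output falls below the pre-adoption level (output s0 at u = 0), mirroring the steady-state loss result.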
Messages in a Digital Bottle: A Youth-Coauthored Perspective on LLM Chatbots and Adolescent Loneliness
arXiv:2604.03470v1 Announce Type: new Abstract: Adolescent loneliness is a growing concern in digitally mediated social environments. This work-in-progress presents a youth-authored critical synthesis on chatbots powered by Large Language Models (LLMs) and adolescent loneliness. The first author is a 16-year-old Chinese student who recently migrated to the UK. She wrote the first draft of this paper from her lived experience, supervised by the second author. Rather than treating the youth perspective as one data point among many, we foreground it as the primary interpretive lens, grounded in interdisciplinary literature from social computing, developmental psychology, and Human-Computer Interaction (HCI). We examine how chatbots shape experiences of loneliness differently across adolescent s

Incidental Interaction: Technology to Support Elder Strength Training through Everyday Movements
arXiv:2604.03241v1 Announce Type: new Abstract: Strength training is a key determinant of healthy aging, yet adherence to formal exercise programs among older adults remains low. While many technologies aim to encourage physical activity in older adults, they typically rely on dedicated devices, wearables, or explicit exercise tasks. They therefore do not embed task practice into daily life. Our new approach, termed Incidental Interaction, instead transforms everyday actions into opportunities for deliberate strength building. It thereby operationalizes everyday movements such as sitting, standing, or lifting objects as strength exercises, encouraging participants to repeat them to build functional capacity. This repetition is encapsulated in the phrase "do it twice", and is combined with

Measuring Human Preferences in RLHF is a Social Science Problem
arXiv:2604.03238v1 Announce Type: new Abstract: RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference

The Persuasion Paradox: When LLM Explanations Fail to Improve Human-AI Team Performance
arXiv:2604.03237v1 Announce Type: new Abstract: While natural-language explanations from large language models (LLMs) are widely adopted to improve transparency and trust, their impact on objective human-AI team performance remains poorly understood. We identify a Persuasion Paradox: fluent explanations systematically increase user confidence and reliance on AI without reliably improving task accuracy, and in some cases even undermining it. Across three controlled human-subject studies spanning abstract visual reasoning (RAVEN matrices) and deductive logical reasoning (LSAT problems), we disentangle the effects of AI predictions and explanations using a multi-stage reveal design and between-subjects comparisons. In visual reasoning, LLM explanations increase confidence but do not improve accuracy
