APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay
Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution (planning steps, artifacts, iteration history with error analysis, and quality scores); (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal, enabling cross-domain transfer between tasks that share no lexical overlap but have analogous operational structure. Successful experiences serve as positive in-context examples; failures serve as negative examples with structured error annotations. We evaluate on BigCodeBench \cite{zhuo2025bigcodebench}, KGQAGen-10k \cite{zhang2025kgqagen}, and Humanity's Last Exam (HLE) \cite{phan2025hle} using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches an 83.3% success rate (SR) from a 53.9% baseline (+29.4pp), exceeding MemRL's \cite{memrl2025} +11.0pp gain under comparable frozen-backbone conditions (the backbones differ; this is controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show that component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
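The dual-outcome memory and hybrid retrieval described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `Experience` fields, the `ExperienceMemory` class, the equal weights, token overlap standing in for semantic search, and a set of operation names standing in for the structural signature. Plan DAG traversal is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """Hypothetical record of one execution (field names are assumptions)."""
    task: str                 # natural-language task description
    signature: frozenset      # assumed structural signature: set of operation names
    plan_steps: list          # ordered planning steps
    success: bool             # outcome reported by a verifier
    quality: float            # quality score in [0, 1]
    error_notes: list = field(default_factory=list)  # annotations for failures


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_similarity(query: str, text: str) -> float:
    """Token overlap, a crude stand-in for embedding-based semantic search."""
    return jaccard(set(query.lower().split()), set(text.lower().split()))


class ExperienceMemory:
    """Stores successes and failures; retrieves by a weighted hybrid score."""

    def __init__(self, w_semantic: float = 0.5, w_structural: float = 0.5):
        self.w_semantic = w_semantic
        self.w_structural = w_structural
        self.items: list = []

    def ingest(self, exp: Experience) -> None:
        self.items.append(exp)

    def score(self, exp: Experience, task: str, signature: frozenset) -> float:
        return (self.w_semantic * lexical_similarity(task, exp.task)
                + self.w_structural * jaccard(set(signature), set(exp.signature)))

    def retrieve(self, task: str, signature: frozenset, k: int = 2):
        """Return (positives, negatives) among the k best-scoring experiences."""
        ranked = sorted(self.items,
                        key=lambda e: self.score(e, task, signature),
                        reverse=True)[:k]
        positives = [e for e in ranked if e.success]
        negatives = [e for e in ranked if not e.success]
        return positives, negatives


memory = ExperienceMemory()
memory.ingest(Experience("parse csv sales report",
                         frozenset({"load", "filter", "aggregate"}),
                         ["load file", "filter rows", "sum column"], True, 0.9))
memory.ingest(Experience("render svg logo",
                         frozenset({"draw", "export"}),
                         ["draw shapes", "export file"], False, 0.2,
                         ["wrong viewBox"]))

# The query shares no words with the stored task but has the same signature.
positives, negatives = memory.retrieve("summarize kg query results",
                                       frozenset({"load", "filter", "aggregate"}),
                                       k=1)
print(positives[0].task)
```

In this toy setup the structural term alone ranks the CSV experience first, which mirrors the claim that tasks with no lexical overlap can still be matched on operational structure. A real system would presumably replace `lexical_similarity` with an embedding model and add graph traversal over plan DAGs.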
Comments: 17 pages, 13 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as: arXiv:2603.29093 [cs.CL]
(or arXiv:2603.29093v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29093
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Pratyay Banerjee
[v1] Tue, 31 Mar 2026 00:24:56 UTC (1,096 KB)