
MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

arXiv, March 30, 2026

arXiv:2603.23533v2. Announce Type: replace-cross. Author: Bhavik Mangla


Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
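The third stage described in the abstract (merging chunks that share a semantic key via bin-packing) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chunk` fields, the `merge_by_key` function, the character-based size budget `max_chars`, and the choice of a first-fit-decreasing heuristic are all assumptions made for illustration. The abstract does not say which bin-packing variant or size measure MDKeyChunker uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    semantic_key: str  # hypothetical name for the key assigned during enrichment
    order: int         # hypothetical field: position of the chunk in the source document


def merge_by_key(chunks, max_chars=1200):
    """Group chunks by semantic key, then pack each group into merged chunks.

    Uses first-fit decreasing (an assumed heuristic): larger chunks are placed
    first, each into the first bin that still has room. The budget counts chunk
    text only, not the separators added when joining. A chunk longer than
    max_chars ends up alone in its own bin.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.semantic_key].append(chunk)

    merged = []
    for key, group in groups.items():
        bins = []   # each bin is a list of chunks that all share `key`
        sizes = []  # running character count per bin
        for chunk in sorted(group, key=lambda c: len(c.text), reverse=True):
            for i, size in enumerate(sizes):
                if size + len(chunk.text) <= max_chars:
                    bins[i].append(chunk)
                    sizes[i] += len(chunk.text)
                    break
            else:
                # No existing bin had room, so open a new one.
                bins.append([chunk])
                sizes.append(len(chunk.text))

        for members in bins:
            # Restore document order inside each merged chunk.
            members.sort(key=lambda c: c.order)
            merged.append(
                Chunk(
                    text="\n\n".join(c.text for c in members),
                    semantic_key=key,
                    order=members[0].order,
                )
            )
    return merged
```

With a budget of 1000 characters, three chunks keyed `install` of lengths 600, 500, and 300 would be packed into two merged chunks (600 + 300 and 500), while a chunk with a different key stays separate. Sorting by size before packing trades strict document order across bins for fewer, fuller bins; a pipeline that values order more could pack sequentially instead.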

Comments: 13 pages, 4 figures, 7 tables, 2 algorithms. Code: this https URL

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

ACM classes: H.3.3; I.2.7; I.7

Cite as: arXiv:2603.23533 [cs.CL]

(or arXiv:2603.23533v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.23533

arXiv-issued DOI via DataCite

Submission history

From: Bhavik Mangla
[v1] Sun, 8 Mar 2026 07:28:53 UTC (83 KB)
[v2] Fri, 27 Mar 2026 05:05:15 UTC (82 KB)
