Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
arXiv:2603.26246v1 Announce Type: cross Abstract: Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length — Shashi Kumar, Esa\'u Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Ba\~neras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
View PDF HTML (experimental)
Abstract:Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
Comments: 11 pages
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2603.26246 [cs.CL]
(or arXiv:2603.26246v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.26246
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Shashi Kumar [view email] [v1] Fri, 27 Mar 2026 10:09:30 UTC (1,818 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv
Gemma 4 - 31b abliterated quants
Got inspired to try and crack this egg without using heretic. FP16, Q8_0 and Q4_K_M quants, plus the abliteration script for modification/use is here: https://huggingface.co/paperscarecrow/Gemma-4-31B-it-abliterated-gguf based off of mlabonne's Orthogonalized Representation Intervention method , because I loved his ablits of gemma3 so much. Edit: Overestimated my internet speeds, still uploading the models. submitted by /u/Polymorphic-X [link] [comments]
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
Analysis of behavioral consistency in large language model agents reveals that while consistent performance correlates with higher accuracy, consistency can amplify both correct and incorrect interpretations, emphasizing that accurate interpretation is more crucial than execution consistency for production deployment. (2 upvotes on HuggingFace)
Brevity Constraints Reverse Performance Hierarchies in Language Models
Large language models can underperform smaller ones due to verbose responses that introduce errors, but constraining output length reveals their superior capabilities and improves performance across benchmarks. (16 upvotes on HuggingFace)



Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!