Speech LLMs are Contextual Reasoning Transcribers
arXiv:2604.00610v1 Announce Type: new
Abstract: Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, the paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, the paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
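The abstract does not spell out the adapter's internals, but the core idea it names (down-weighting encoder frames by CTC blank probability so that speech-bearing frames dominate the representation fed to the LLM) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the softmax-then-weight scheme, the `blank_id` convention, and the function name are all assumptions.

```python
import numpy as np

def ctc_guided_adapter(frame_embeddings, ctc_logits, blank_id=0):
    """Hypothetical sketch: weight encoder frames by CTC non-blank probability.

    frame_embeddings: (T, D) speech encoder outputs
    ctc_logits:       (T, V) per-frame CTC logits over the vocabulary
    Returns the weighted embeddings and the per-frame weights.
    """
    # Numerically stable softmax over the vocabulary axis.
    shifted = ctc_logits - ctc_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Weight of each frame = probability it is NOT the CTC blank token,
    # so silence/blank frames contribute little to the LLM-facing features.
    non_blank = 1.0 - probs[:, blank_id]            # shape (T,)
    weighted = frame_embeddings * non_blank[:, None]
    return weighted, non_blank

# Toy example: 4 frames, vocab of 3 (index 0 = blank), embedding dim 2.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 2))
logits = np.array([
    [5.0, 0.0, 0.0],   # confidently blank     -> weight near 0
    [0.0, 5.0, 0.0],   # confidently non-blank -> weight near 1
    [0.0, 0.0, 5.0],   # confidently non-blank -> weight near 1
    [1.0, 1.0, 1.0],   # uniform               -> weight 2/3
])
weighted, w = ctc_guided_adapter(emb, logits)
```

In this sketch, frames the CTC head considers blank are suppressed before the features reach the LLM's latent space; how the paper actually aligns the weighted outputs with LLM text embeddings is not described in the abstract.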
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2604.00610 [cs.CL]
(or arXiv:2604.00610v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.00610
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Keqi Deng [view email] [v1] Wed, 1 Apr 2026 08:13:50 UTC (248 KB)