Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessHow 1 Missing Line of Code Cost Anthropic $340 BillionDev.to AII Built npm for AI Skills — Here's Why AI Needs a Package ManagerDev.to AIAn I/O psychologist's rules for stopping AI agents from cutting cornersHacker News AI TopAisthOS: What if your OS compiled UP instead of down?Dev.to AII Moved a Folder. Claude Code Told Me Not to Copy My Own Secrets.Dev.to AIЯ собрал AI бота за вечер - он уже продаётDev.to AIMeshLedger – AI agents hire and pay each other through on-chain escrowHacker News AI TopAgents Can Pay. That's Not the Problem.Dev.to AIBizNode's self-healing watchdog auto-restarts crashed services. Zero downtime, zero babysitting neededDev.to AIPrologue: After We No Longer Write Code by Hand, What Remains for Engineers?Dev.to AIAI Knows Your Project Budget Will Fail Before You DoDev.to AILong Term AI Memory by creator of Apache CassandraDev.to AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessHow 1 Missing Line of Code Cost Anthropic $340 BillionDev.to AII Built npm for AI Skills — Here's Why AI Needs a Package ManagerDev.to AIAn I/O psychologist's rules for stopping AI agents from cutting cornersHacker News AI TopAisthOS: What if your OS compiled UP instead of down?Dev.to AII Moved a Folder. Claude Code Told Me Not to Copy My Own Secrets.Dev.to AIЯ собрал AI бота за вечер - он уже продаётDev.to AIMeshLedger – AI agents hire and pay each other through on-chain escrowHacker News AI TopAgents Can Pay. That's Not the Problem.Dev.to AIBizNode's self-healing watchdog auto-restarts crashed services. Zero downtime, zero babysitting neededDev.to AIPrologue: After We No Longer Write Code by Hand, What Remains for Engineers?Dev.to AIAI Knows Your Project Budget Will Fail Before You DoDev.to AILong Term AI Memory by creator of Apache CassandraDev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2603.26498v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal re — Konstantinos Papaioannou, Thaleia Dimitra Doudali

View PDF HTML (experimental)

Abstract:Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.

Subjects:

Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.26498 [cs.DC]

(or arXiv:2603.26498v1 [cs.DC] for this version)

https://doi.org/10.48550/arXiv.2603.26498

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Thaleia Dimitra Doudali [view email] [v1] Fri, 27 Mar 2026 15:00:53 UTC (578 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Rocks, Pebb…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 163 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers