Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations
arXiv:2603.25870v1 Announce Type: cross Abstract: Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predi — Suraj Prasad, Pinak Mahapatra
View PDF HTML (experimental)
Abstract:Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2603.25870 [cs.CV]
(or arXiv:2603.25870v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.25870
arXiv-issued DOI via DataCite
Submission history
From: Suraj Prasad [view email] [v1] Thu, 26 Mar 2026 19:56:56 UTC (23 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv![[D] AI research on small language models](https://d2xsxph8kpxj0f.cloudfront.net/310419663032563854/konzwo8nGf8Z4uZsMefwMr/default-img-ai-chip-closeup-KMZ5N5zRxP2NRiYJ8TB9TM.webp)
[D] AI research on small language models
i'm doing research on some trending fields in AI, currently working on small language models and would love to meet people who are working in similar domains and are looking to write/publish papers! submitted by /u/StoicWithSyrup [link] [comments]

Promising Signals on AI Governance from China
View the official memo here. China has consistently signaled a willingness to engage on global AI governance since at least 2017. This memo compiles key statements from the Chinese government and prominent figures demonstrating their desire to coordinate on the problem of AI. Chinese Vice Premier Ding Xuexiang, at the 2025 World Economic Forum, said: [ ] The post Promising Signals on AI Governance from China appeared first on Machine Intelligence Research Institute .
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Vision Language Models struggle with fine-grained visual perception tasks due to their language-centric training approach, performing poorly on unnamed visual entities despite having relevant information in their representations. (1 upvotes on HuggingFace)




Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!