
JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding

arXiv, March 31, 2026

arXiv:2603.27942v1, Announce Type: cross

Authors: Koki Maeda (Institute of Science Tokyo, Tokyo, Japan; Research and Development Center for Large Language Models, National Institute of Informatics, Tokyo, Japan), Naoaki Okazaki (Institute of Science Tokyo, Tokyo, Japan; Research and Development Center for Large Language Models, National Institute of Informatics, Tokyo, Japan)


Abstract: Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
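To make the idea of script-aware diagnosis concrete, here is a minimal Python sketch of how recognition quality could be broken down by script. It is not the authors' evaluation code, which the abstract says will be released separately. The helper names `script_of` and `per_script_recall` are hypothetical, the script buckets are a simplifying assumption based on Unicode block ranges, and the alignment uses the standard library's `difflib.SequenceMatcher` rather than any metric described in the paper.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict


def script_of(ch: str) -> str:
    """Assign one character to a coarse script bucket (illustrative assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:      # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:      # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"              # CJK Unified Ideographs and Extension A
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"


def per_script_recall(reference: str, hypothesis: str) -> Dict[str, float]:
    """Fraction of reference characters, per script, that the hypothesis matched."""
    total = Counter(script_of(c) for c in reference)
    matched: Counter = Counter()
    # Align the two strings and count reference characters inside matching blocks.
    sm = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    for block in sm.get_matching_blocks():
        for ch in reference[block.a:block.a + block.size]:
            matched[script_of(ch)] += 1
    return {script: matched[script] / n for script, n in total.items()}


if __name__ == "__main__":
    # One kanji is misread (駅 -> 訳); the hiragana is transcribed correctly.
    print(per_script_recall("東京駅はこちら", "東京訳はこちら"))
```

In this toy case the kanji recall is 2/3 while the hiragana recall is 1.0, which is the kind of per-script breakdown that can separate kanji recognition errors from errors on other scripts.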

Comments: 18 pages

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.27942 [cs.CV]

(or arXiv:2603.27942v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27942

arXiv-issued DOI via DataCite

Submission history

From: Koki Maeda

[v1] Mon, 30 Mar 2026 01:36:16 UTC (5,114 KB)

[v2] Tue, 31 Mar 2026 07:50:48 UTC (5,114 KB)

Original source: arXiv, https://arxiv.org/abs/2603.27942



