Knowledge Quiz
Test your understanding of this article
1. What is the primary purpose of the MazeBench benchmark introduced in the article?
2. According to the article, why are the high accuracy scores (e.g., GPT-5.4 at 91%) of multimodal models on maze tasks considered misleading?
3. What common two-stage strategy did qualitative traces reveal multimodal models use to solve mazes?
4. What did the text-grid ablation experiment with Claude Sonnet 4.6 demonstrate?
