
Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, and LLaMA

arXiv, March 31, 2026

arXiv:2512.12812v2 Announce Type: replace-cross

Authors: Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan


Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Polite, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Polite prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier research, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
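The pairwise comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say which significance test the authors used, so the exact McNemar test below is an assumption chosen because the same questions are answered under each tone. The function names (`accuracy`, `mcnemar_exact`, `compare_tones`) and the input layout (one list of 0/1 correctness flags per tone, aligned by question) are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import comb


def accuracy(correct):
    """Fraction of questions answered correctly (flags are 0/1 or bools)."""
    return sum(correct) / len(correct)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-question correctness.

    Assumption: this test stands in for the unspecified significance test.
    Only discordant pairs (one tone right, the other wrong) carry information.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both tones must cover the same questions")
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_tones(results):
    """Map each tone pair to (accuracy difference, p-value)."""
    out = {}
    for tone_a, tone_b in combinations(results, 2):
        diff = accuracy(results[tone_a]) - accuracy(results[tone_b])
        out[(tone_a, tone_b)] = (diff, mcnemar_exact(results[tone_a], results[tone_b]))
    return out
```

Calling `compare_tones` with a dictionary such as `{"very_polite": [...], "neutral": [...], "very_rude": [...]}` for one task and one model returns a difference and p-value for each of the three tone pairs. Running it per task and again on the pooled questions of a domain mirrors the per-task versus aggregated views the abstract contrasts, though a multiple-comparison correction would be needed before drawing conclusions.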

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2512.12812 [cs.CL]

(or arXiv:2512.12812v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.12812

arXiv-issued DOI via DataCite

Submission history

From: Hanyu Cai
[v1] Sun, 14 Dec 2025 19:25:20 UTC (83 KB)
[v2] Fri, 27 Mar 2026 20:35:20 UTC (83 KB)
