Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessRambus Unveils HBM4E Controller: 16 GT/s, 2,048-Bit Interface, Enabling C-HBM4EEE TimesGPT reasoning models have "line of sight" to AGI, says OpenAI's Greg Brockman - the-decoder.comGoogle News: OpenAIGPT reasoning models have "line of sight" to AGI, says OpenAI s Greg BrockmanThe DecoderCornell study reveals AI can regenerate famous books with amazing accuracy, sparks copyright concerns - India TodayGNews AI copyrightStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - ScienceAlertGoogle News: ChatGPTThe Sequence Chat #835: Illia Polosukhin on NEAR AI, Authoring the Transformer Paper and Decentralized and Private AI - TheSequenceGoogle News: Machine LearningOpenClaw Unlocks China’s AI Token Export BusinessBloomberg TechnologySector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025 - Crunchbase NewsGNews AI startupsSector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025Crunchbase NewsJob Pivots in the Age of AI: Lessons From Mike Mulligan and His Steam Shovel - MIT Sloan Management ReviewGoogle News: AIAre Multi-Agent Systems More Complex Than They Need to Be?The Data ExchangeStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - YahooGoogle News: ChatGPTBlack Hat USADark ReadingBlack Hat AsiaAI BusinessRambus Unveils HBM4E Controller: 16 GT/s, 2,048-Bit Interface, Enabling C-HBM4EEE TimesGPT reasoning models have "line of sight" to AGI, says OpenAI's Greg Brockman - the-decoder.comGoogle News: OpenAIGPT reasoning models have "line of sight" to AGI, says OpenAI s Greg BrockmanThe DecoderCornell study reveals AI can regenerate famous books with amazing accuracy, sparks copyright concerns - India TodayGNews AI copyrightStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - ScienceAlertGoogle News: ChatGPTThe Sequence Chat #835: Illia Polosukhin on NEAR AI, Authoring the Transformer Paper and Decentralized and Private AI - TheSequenceGoogle News: Machine LearningOpenClaw Unlocks China’s AI Token Export BusinessBloomberg TechnologySector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025 - Crunchbase NewsGNews AI startupsSector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025Crunchbase NewsJob Pivots in the Age of AI: Lessons From Mike Mulligan and His Steam Shovel - MIT Sloan Management ReviewGoogle News: AIAre Multi-Agent Systems More Complex Than They Need to Be?The Data ExchangeStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - YahooGoogle News: ChatGPT
AI NEWS HUBbyEIGENVECTOREigenvector

The Necessity of Setting Temperature in LLM-as-a-Judge

arXivMarch 31, 20262 min read0 views
Source Quiz

arXiv:2603.28304v1 Announce Type: new Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process-with values of 0.1 and 1.0 being the most prevalent choices-a convention that is largely empirical rather than principled. However, recent researches suggest that LLM performance exhibits — Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

View PDF HTML (experimental)

Abstract:LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process-with values of 0.1 and 1.0 being the most prevalent choices-a convention that is largely empirical rather than principled. However, recent researches suggest that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.

Subjects:

Computation and Language (cs.CL)

Cite as: arXiv:2603.28304 [cs.CL]

(or arXiv:2603.28304v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.28304

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Lujun Li [view email] [v1] Mon, 30 Mar 2026 11:31:29 UTC (9,073 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
The Necessi…researchpaperarxivnlplanguage-mo…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 203 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers