Research Papers research paper arxiv nlp language-models

The Necessity of Setting Temperature in LLM-as-a-Judge

arXivMarch 31, 20262 min read0 views

arXiv:2603.28304v1 Announce Type: new Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process-with values of 0.1 and 1.0 being the most prevalent choices-a convention that is largely empirical rather than principled. However, recent researches suggest that LLM performance exhibits — Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

View PDF HTML (experimental)

Abstract:LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process-with values of 0.1 and 1.0 being the most prevalent choices-a convention that is largely empirical rather than principled. However, recent researches suggest that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.

Subjects:

Computation and Language (cs.CL)

Cite as: arXiv:2603.28304 [cs.CL]

(or arXiv:2603.28304v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.28304

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Lujun Li [view email] [v1] Mon, 30 Mar 2026 11:31:29 UTC (9,073 KB)

Original source

arXiv

https://arxiv.org/abs/2603.28304

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Models

What is next in reinforcement learning for LLMs?

Reinforcement learning from verifiable rewards (RLVR) ushered in a new generation of reasoning models. Now, researchers are looking beyond RLVR to create the next breakthrough in AI. The post What is next in reinforcement learning for LLMs? first appeared on TechTalks .

TechTalks

1m4 months ago

ProductsFresh

How Are UK Adults Spending Their Time Online?

New research from Ofcom reveals how people in the UK use, understand and feel about the media and online services they interact with in their daily lives. The regulator s annual Adults’ Media Use and Attitudes and Adults’ Media Lives research reports tracked trends in the nation’s media habits and online behaviours over the last year. [ ] The post How Are UK Adults Spending Their Time Online? appeared first on DIGIT .

Digit.fyi

1mabout 2 hours ago

ModelsFresh

Claude Source Code Leak Reveals Anthropic’s Secret Plans

Anthropic has accidentally exposed more than 500,000 lines of source code for one of its flagship Claude models, allowing researchers, competitors, and hackers a window into the AI giant’s inner workings. First spotted by a security researcher on X, around 1,900 files and 513,000 lines of code relating to the architecture of Anthropic’s Claude Code [ ] The post Claude Source Code Leak Reveals Anthropic’s Secret Plans appeared first on DIGIT .

Digit.fyi

1mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 203 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Beyond Metadata: Multimodal, Policy-Aware Detection of YouTube Scam Videos

arXiv:2509.23418v2 Announce Type: replace Abstract: YouTube is a major platform for information and entertainment, but its wide accessibility also makes it attractive for scammers to upload deceptive or malicious content. Prior detection approaches rely largely on textual or statistical metadata, such as titles, descriptions, view counts, or likes, which are effective in many cases but can be evaded through benign-looking text, manipulated statistics, or other obfuscation strategies (e.g., 'Leetspeak'), while ignoring visual cues. In this study, we systematically investigate multimodal approaches for detecting YouTube scams. Our dataset consolidates established scam categories and augments them with full-length videos and policy-grounded reasoning annotations. Experiments show that a text-

arXiv cs.CR

2mabout 8 hours ago

Research PapersFresh

Online Flow Time Minimization: Tight Bounds for Non-Preemptive Algorithms

arXiv:2511.03485v3 Announce Type: replace Abstract: This paper studies the online scheduling problem of minimizing total flow time for $n$ jobs on $m$ identical machines. A classical $\Omega(n)$ lower bound shows that no deterministic single-machine algorithm can beat the trivial greedy, even when $n$ is known in advance. However, this barrier is specific to deterministic algorithms on a single machine, leaving open what randomization, multiple machines, or the kill-and-restart capability can achieve. We give a nearly complete answer. For randomized non-preemptive algorithms, we establish a tight $\Theta(\sqrt{n/m})$ competitive ratio, which also improves the best offline approximation to $O(\sqrt{n/m})$. For deterministic non-preemptive algorithms on multiple machines, we prove an $O(n/m^

arXiv cs.DS

2mabout 8 hours ago

Research PapersFresh

On the average-case complexity landscape for Tensor-Isomorphism-complete problems over finite fields

arXiv:2604.00591v1 Announce Type: cross Abstract: In Grochow and Qiao (SIAM J. Comput., 2021), the complexity class Tensor Isomorphism (TI) was introduced and isomorphism problems for groups, algebras, and polynomials were shown to be TI-complete. In this paper, we study average-case algorithms for several TI-complete problems over finite fields, including algebra isomorphism, matrix code conjugacy, and $4$-tensor isomorphism. Our main results are as follows. Over the finite field of order $q$, we devise (1) average-case polynomial-time algorithms for algebra isomorphism and matrix code conjugacy that succeed in a $1/\Theta(q)$ fraction of inputs and (2) an average-case polynomial-time algorithm for the $4$-tensor isomorphism that succeeds in a $1/q^{\Theta(1)}$ fraction of inputs. Prior t

arXiv cs.DS

2mabout 8 hours ago

Research PapersFresh

Stable algorithms cannot reliably find isolated perceptron solutions

arXiv:2604.00328v1 Announce Type: cross Abstract: We study the binary perceptron, a random constraint satisfaction problem that asks to find a Boolean vector in the intersection of independently chosen random halfspaces. A striking feature of this model is that at every positive constraint density, it is expected that a $1-o_N(1)$ fraction of solutions are \emph{strongly isolated}, i.e. separated from all others by Hamming distance $\Omega(N)$. At the same time, efficient algorithms are known to find solutions at certain positive constraint densities. This raises a natural question: can any isolated solution be algorithmically visible? We answer this in the negative: no algorithm whose output is stable under a tiny Gaussian resampling of the disorder can \emph{reliably} locate isolated sol

arXiv cs.DS

2mabout 8 hours ago