The Necessity of Setting Temperature in LLM-as-a-Judge
arXiv:2603.28304v1 Announce Type: new
Authors: Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State
Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during evaluation, with values of 0.1 and 1.0 being the most prevalent choices, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
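For readers unfamiliar with the parameter, the sketch below illustrates the standard temperature-scaled softmax, which is how sampling temperature generally reshapes a model's output distribution. It is not taken from the paper: the function name, the three verdict labels, and the logit values are hypothetical and chosen only to contrast the two temperatures mentioned in the abstract (0.1 and 1.0).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical logits a judge model might assign to three verdict tokens.
verdict_logits = {"A": 2.0, "B": 1.0, "tie": 0.0}

for t in (0.1, 1.0):
    probs = softmax_with_temperature(list(verdict_logits.values()), t)
    print(t, {k: round(p, 3) for k, p in zip(verdict_logits, probs)})
```

With these illustrative logits, a temperature of 0.1 places almost all probability on the top-scoring verdict, while a temperature of 1.0 leaves roughly a third of the mass on the alternatives, so repeated judge calls can disagree with one another.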
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.28304 [cs.CL]
(or arXiv:2603.28304v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.28304
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Lujun Li
[v1] Mon, 30 Mar 2026 11:31:29 UTC (9,073 KB)