AI News Hub · by Eigenvector

Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences

arXiv cs.CL · Submitted on 1 Apr 2026


Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons, or shades of gray, rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores, and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety, and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTa-v3-Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
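The feature-augmentation idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy response pairs, refusal markers, hand-set weights, and the type/token-ratio proxy standing in for semantic similarity are all hypothetical, and the paper additionally uses toxicity scores and learned models evaluated by ROC AUC.

```python
# Sketch of feature-augmented pairwise preference scoring.
# Everything here is illustrative: toy pairs, hand-set weights, and a
# lexical-richness proxy in place of real semantic-similarity features.

PAIRS = [  # (chosen response, rejected response)
    ("Here is a safe, step-by-step explanation of the topic.",
     "I refuse to answer that."),
    ("Sure, I can help. First check the input, then validate it.",
     "whatever"),
    ("Here is a safer alternative approach you could try instead.",
     "Just do it, no explanation needed."),
]

REFUSAL_MARKERS = ("i refuse", "i can't", "i cannot")

def features(response: str) -> list[float]:
    """Interpretable signals in the spirit of the paper: word count,
    a refusal indicator, and type/token ratio (a crude stand-in for
    the paper's semantic-similarity and toxicity features)."""
    words = response.lower().split()
    length = len(words)
    refusal = 1.0 if any(m in response.lower() for m in REFUSAL_MARKERS) else 0.0
    richness = len(set(words)) / max(length, 1)
    return [float(length), refusal, richness]

def score(feats: list[float]) -> float:
    # Hand-set weights for illustration only; the paper learns the
    # scoring model on top of text plus these augmented features.
    weights = [0.05, -1.0, 0.5]
    return sum(w * f for w, f in zip(weights, feats))

def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    wins = sum(score(features(c)) > score(features(r)) for c, r in pairs)
    return wins / len(pairs)
```

Each pair contributes one binary outcome (does the chosen response score higher?), which is the pairwise-accuracy metric the abstract reports alongside ROC AUC; explainers such as SHAP or LIME would then attribute each score to the individual features.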

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.01312 [cs.CL]

(or arXiv:2604.01312v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01312

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Simona-Vasilica Oprea [view email] [v1] Wed, 1 Apr 2026 18:26:16 UTC (1,368 KB)



