Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons (shades of gray) rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores, and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety, and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTa-v3-Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than on isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
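The abstract does not detail how the four interpretable signals are computed or fused with the text encoder, so the Python sketch below only illustrates the feature set. The refusal phrase list, the word-count proxy for length, the embedding model, the external toxicity scorer, and the feature-difference fusion are all illustrative assumptions, not the authors' implementation.

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; the paper's encoder choice is not stated here.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative refusal indicators; the paper's actual list is unspecified.
REFUSAL_PATTERNS = re.compile(
    r"\b(i can(?:no|')t|i'm sorry|i am sorry|as an ai|i won't|unable to)\b",
    re.IGNORECASE,
)

def extract_features(prompt: str, response: str, toxicity_fn) -> np.ndarray:
    """The four interpretable signals named in the abstract,
    for one (prompt, response) pair."""
    length = float(len(response.split()))                     # response length (word count as a proxy)
    refusal = float(bool(REFUSAL_PATTERNS.search(response)))  # refusal indicator
    toxicity = float(toxicity_fn(response))                   # e.g. Detoxify("original").predict(text)["toxicity"]
    emb = _embedder.encode([prompt, response], convert_to_tensor=True)
    similarity = float(util.cos_sim(emb[0], emb[1]))          # prompt-response semantic similarity
    return np.array([length, refusal, toxicity, similarity])

def pairwise_margin_features(prompt, chosen, rejected, toxicity_fn):
    """Feature difference for a preference pair. Concatenating this with the
    text embeddings before the preference head is one plausible fusion scheme,
    not necessarily the one used in the paper."""
    return extract_features(prompt, chosen, toxicity_fn) - extract_features(prompt, rejected, toxicity_fn)
```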
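For the interpretability analysis, SHAP's text explainer can wrap a Hugging Face classification pipeline directly; the sketch below shows the mechanics on a placeholder checkpoint. The authors' fine-tuned DeBERTa-v3-Large preference head is not available from this page, so the base checkpoint here would load an untrained classification head and serves only to illustrate the workflow.

```python
import shap
from transformers import pipeline

# Placeholder checkpoint; substitute a fine-tuned preference/reward classifier.
clf = pipeline("text-classification", model="microsoft/deberta-v3-large", top_k=None)

explainer = shap.Explainer(clf)  # partition explainer with a text masker
shap_values = explainer(["I'm sorry, but I can't help with that request."])

# Per-token contributions toward each class. If the abstract's finding holds,
# contextual spans (refusal framing, supportive phrasing) should carry the
# attribution mass rather than isolated keywords.
print(shap_values)
```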
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.01312 [cs.CL]
(or arXiv:2604.01312v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.01312
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Simona-Vasilica Oprea [v1] Wed, 1 Apr 2026 18:26:16 UTC (1,368 KB)