Research Papers research paper arxiv ai artificial-intelligence

Evaluating Language Models for Harmful Manipulation

arXivMarch 31, 202610 min read0 views

arXiv:2603.25326v2 Announce Type: replace Abstract: Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that that the tested model can produce manipulative behaviours when — Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Authors:Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

View PDF HTML (experimental)

Abstract:Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

Subjects:

Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as: arXiv:2603.25326 [cs.AI]

(or arXiv:2603.25326v3 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.25326

arXiv-issued DOI via DataCite

Submission history

From: Canfer Akbulut [view email] [v1] Thu, 26 Mar 2026 11:13:06 UTC (769 KB) [v2] Fri, 27 Mar 2026 17:09:52 UTC (770 KB) [v3] Thu, 2 Apr 2026 17:30:52 UTC (769 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25326

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ReleasesFresh

Simple parallel estimation of the partition ratio for Gibbs distributions

arXiv:2505.18324v2 Announce Type: replace-cross Abstract: We consider the problem of estimating the partition function $Z(\beta)=\sum_x \exp(\beta(H(x))$ of a Gibbs distribution with the Hamiltonian $H:\Omega\rightarrow\{0\}\cup[1,n]$. As shown in [Harris & Kolmogorov 2024], the log-ratio $q=\ln (Z(\beta_{\max})/Z(\beta_{\min}))$ can be estimated with accuracy $\epsilon$ using $O(\frac{q \log n}{\epsilon^2})$ calls to an oracle that produces a sample from the Gibbs distribution for parameter $\beta\in[\beta_{\min},\beta_{\max}]$. That algorithm is inherently sequential, or {\em adaptive}: the queried values of $\beta$ depend on previous samples. Recently, [Liu, Yin & Zhang 2024] developed a non-adaptive version that needs $O( q (\log^2 n) (\log q + \log \log n + \epsilon^{-2}) )$ samples.

arXiv cs.DS

1mabout 4 hours ago

Research PapersFresh

Online Graph Coloring for $k$-Colorable Graphs

arXiv:2511.16100v2 Announce Type: replace Abstract: We study the problem of online graph coloring for $k$-colorable graphs. The best previously known deterministic algorithm uses $\widetilde{O}(n^{1-\frac{1}{k!}})$ colors for general $k$ and $\widetilde{O}(n^{5/6})$ colors for $k = 4$, both given by Kierstead in 1998. In this paper, we finally break this barrier, achieving the first major improvement in nearly three decades. Our results are summarized as follows: (1) $k \geq 5$ case. We provide a deterministic online algorithm to color $k$-colorable graphs with $\widetilde{O}(n^{1-\frac{1}{k(k-1)/2}})$ colors, significantly improving the current upper bound of $\widetilde{O}(n^{1-\frac{1}{k!}})$ colors. Our algorithm also matches the best-known bound for $k = 4$ ($\widetilde{O}(n^{5/6})$ c

arXiv cs.DS

2mabout 4 hours ago

Research PapersFresh

Zero-Freeness of the Hard-Core Model with Bounded Connective Constant

arXiv:2604.02746v1 Announce Type: cross Abstract: We study the zero-free regions of the partition function of the hard-core model on finite graphs and their implications for the analyticity of the free energy on infinite lattices. Classically, zero-freeness results have been established up to the tree uniqueness threshold $\lambda_c(\Delta-1)$ determined by the maximum degree $\Delta$. However, for many graph classes, such as regular lattices, the connective constant $\sigma$ provides a more precise measure of structural complexity than the maximum degree. While recent approximation algorithms based on correlation decay and Markov chain Monte Carlo have successfully exploited the connective constant to improve the threshold to $\lambda_c(\sigma)$, analogous results for complex zero-freenes

arXiv cs.DS

2mabout 4 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 313 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Online Graph Coloring for $k$-Colorable Graphs

arXiv cs.DS

2mabout 4 hours ago

Research PapersFresh

Zero-Freeness of the Hard-Core Model with Bounded Connective Constant

arXiv cs.DS

2mabout 4 hours ago

Research PapersFresh

Stochastic Function Certification with Correlations

arXiv:2604.02611v1 Announce Type: new Abstract: We study the Stochastic Boolean Function Certification (SBFC) problem, where we are given $n$ Bernoulli random variables $\{X_e: e \in U\}$ on a ground set $U$ of $n$ elements with joint distribution $p$, a Boolean function $f: 2^U \to \{0, 1\}$, and an (unknown) scenario $S = \{e \in U: X_e = 1\}$ of active elements sampled from $p$. We seek to probe the elements one-at-a-time to reveal if they are active until we can certify $f(S) = 1$, while minimizing the expected number of probes. Unlike most previous results that assume independence, we study correlated distributions $p$ and give approximation algorithms for several classes of functions $f$. When $f(S)$ is the indicator function for whether $S$ is the spanning set of a given matroid, ou

arXiv cs.DS

2mabout 4 hours ago

Research PapersFresh

Non-Signaling Locality Lower Bounds for Dominating Set

arXiv:2604.02582v1 Announce Type: new Abstract: Minimum dominating set is a basic local covering problem and a core task in distributed computing. Despite extensive study, in the classic LOCAL model there exist significant gaps between known algorithms and lower bounds. Chang and Li prove an $\Omega(\log n)$-locality lower bound for a constant factor approximation, while Kuhn--Moscibroda--Wattenhofer gave an algorithm beating this bound beyond $\log \Delta$-approximation, along with a weaker lower bound for this degree-dependent setting scaling roughly with $\min\{\log \Delta/\log\log \Delta,\sqrt{\log n/\log\log n}\}$. Unfortunately, this latter bound is weak for small $\Delta$, and never recovers the Chang--Li bound, leaving central questions: does $O(\log \Delta)$-approximation require

arXiv cs.DS

2mabout 4 hours ago