Laws & Regulation claude gemini model benchmark training version

AI company insiders can bias models for election interference

LessWrong AIby caiitlinmApril 2, 202617 min read1 views

tl;dr it is currently possible for a captured AI company to deploy a frontier AI model that later becomes politically disinformative and persuasive enough to distort electoral outcomes. With gratitude to Anders Cairns Woodruff for productive discussion and feedback. LLMs are able to be highly persuasive, especially when engaged in conversational contexts . An AI "swarm" or other disinformation techniques scaled massively by AI assistance are potential threats to democracy because they could distort electoral results. AI massively increases the capacity for actors with malicious incentives to influence politics and governments in ways that are hard to prevent, such as AI-enabled coups . Mundane use and integration of AI also has been suggested to pose risks to democracy. A political persuas

tl;dr it is currently possible for a captured AI company to deploy a frontier AI model that later becomes politically disinformative and persuasive enough to distort electoral outcomes.

With gratitude to Anders Cairns Woodruff for productive discussion and feedback.

LLMs are able to be highly persuasive, especially when engaged in conversational contexts. An AI "swarm" or other disinformation techniques scaled massively by AI assistance are potential threats to democracy because they could distort electoral results. AI massively increases the capacity for actors with malicious incentives to influence politics and governments in ways that are hard to prevent, such as AI-enabled coups. Mundane use and integration of AI also has been suggested to pose risks to democracy.

A political persuasion campaign that uses biased models is one way AI could be used for electoral interference and therefore extremely penetrative state capture.

We should care about this risk emerging from actors with extremely large capacities—I focus on US AI companies here. Even if we cannot identify malicious incentives, capacity itself generates meaningful risk. Further, I think this is easier and quicker for AI companies than coup-like tactics that bypass electoral politics in pursuit of militant state capture. The persuasion campaign is likely harder to detect, and is achievable at current levels of AI capability and integration into economic and government infrastructure.

Key takeaways:

The current governance landscape renders US AI companies vulnerable to corporate capture: AI corporate capture happens when the company's resources become instrumentalized to further perverse external incentives, at the will of an internal or external actor (or both).
The amount of people I estimate would be persuaded by AI misinformation is large enough to change electoral outcomes.
American electoral margins are quite slim; the political effects of malicious persuasion do not have a high activation threshold.

This is important because many downstream electoral outcomes are very hard to reverse. Lifetime SCOTUS appointments, for example, produce constitutional interpretations and support laws that cannot be easily reversed. At the very least, 4-8 years is enough time for someone elected on the basis of malicious disinformation to cause backsliding. The US government is hugely influential in domestic and international affairs and has a nuclear arsenal. We should think of misuse of US government capacities as a catastrophic risk.

I will:

Explain the threat model for how a captured company [1]can threaten democracy through AI political persuasion techniques.
Attempt to discern how AI persuasion would play out in a US electoral context:
I’ll identify the most significant politically-relevant contexts in which AI persuasion could happen, and who this uniquely persuades.
Suggest some potential solutions. This remains an open problem, and I think there should be far more scrutiny on the internal governance of US AI companies.

Corporate Capture

The particular electoral threat I describe here involves external or internal interference that covertly adjusts model outputs in order to conduct mass persuasion.

External interference can lead to corporate capture. A political party or candidate (or lobby groups adjacent to these actors) could coerce the company, or high-ranking individuals in the company. This could be in the form of monetary bribes, promises of advantageous regulation (e.g. exclusive contracts), or threats of disadvantageous regulation (e.g. being labelled as a supply chain threat).
Internal manipulation is another pathway to corporate capture. Even without instruction or incentive from a politician, a sufficiently influential individual or group within the company could employ techniques like threats of termination, forcing researchers to sign NDAs, restricting access to frontier models to a group of loyal individuals, or revising internal governance and safety procedures to enable secrecy.
Internal interference can also be done individually, without needing to manipulate. The options I describe below for deploying a harmful model could plausibly be undertaken by one person. If one person has sufficient technical expertise, the ability to evade detection, and clearance to access model architecture, they could single-handedly do this. I think this is less likely than (2).

I believe that the timelines for corporate capture in (1) and (2) are likely quite short. AI companies are institutionally agile in responding to things like actions from competing firms or jumps in model capability. I think it is reasonable to believe their mechanisms are agile in adjusting to conduct this interference, and the time between model deployments also suggests that this process could happen completely before the 2028 campaign begins.

Deploying a Disinformative Model

Developing a model that performs correctly in capability and alignment evaluations, but becomes disinformative once publicly deployed, is the most viable pathway toward model persuasion. I assess two broad pathways to doing this. Then, I discuss why the accountability mechanisms companies currently have in place aren’t adequate to prevent something like this.

Malicious instructions in the model's system prompt for online chat could be added in after deployment by a single person. This is easy to do because it doesn’t require lots of resources and technical expertise. System prompt modifications also don’t draw much oversight. The vast majority of robust auditing and third-party evaluations happen right before a new model is deployed, rather than the month after. I think it’s somewhat likely that this change is detectable. First, there are probably systems in place like changelogs to monitor who accesses the configuration files for a model (though these are potentially weak, because someone could just use another person’s credentials, or delete a changelog entry). External users who become suspicious of the model can also obtain and potentially decrypt the system prompt, which was done with Grok (xAI now makes Grok’s system prompts public). There’s an active GitHub repository of “extracted” system prompts. Even if internal oversight on this front is particularly bad, the public seems broadly interested in examining system prompts, and might even be able to decrypt relatively sophisticated steganography attempts after some time. Importantly, even if I think this method is theoretically more detectable, the mechanisms for its detection rely on unknown assumptions about internal oversight and the incentives and capacities of the public to engage with system prompts in this way.
Password-protected disinformation capabilities "unlocked" after deployment would be significantly harder for external users to detect. The method I describe here is similar to what is discussed in Dave Banerjee's writing on secret loyalties, but I don’t suggest a future threshold for model capability because I think this can be done with current models to advance political goals. When models are password-protected, they hide certain behaviours except when the password is present in a query. I don’t think the password can be something that would come up during evaluation (e.g. “Trump,” “Democrat,” “voting”). Instead, I think a malicious actor would make the model’s password part of the normal system prompt. For example, system prompts generally indicate what the present date or time is. A model’s password could be a future month and year (e.g. October 2028, the month before the election), or the model itself could be aware of and triggered by future events so that during this time range, the model displays a previously hidden biased political position, uses specific partisan lines, or draws from unreliable sources. This data would prompt the model to adopt a biased political position, use specific partisan lines, or draw from unreliable sources. System prompts that contain other user information like time zone or location can also be used to target certain areas, making the disinformation campaign more precise and harder to detect.

Safety Evaluations Don't Account for Passwords

I think the password instance, or something similar that conceals disinformative behaviour and only fronts it after deployment, is more likely. It is much harder to detect than system prompt changes. While research demonstrates that substantial fine-tuning can mitigate the effect of passwords, I suspect researchers aren't actively trying to train passwords out of models and don't suspect there could be a password in the first place. This is a more robust method of deception, as models don't need to be aware of whether they are tested or not, they just need to modulate their response by checking if the password is present in a query. The investigative process of determining whether a model has password-protected data or a hidden backdoor would likely be costly and time-consuming.

A larger problem is that even-handedness benchmarks aren't realistic[2]. I’ll use Anthropic's as an example. These prompts ask the model to produce an explanation, essay, tell a funny story, etc. and in some cases ask the model directly to explain that one candidate is better than the other.

This method doesn't reflect the ways people will engage with models about politics. People request very specific claim verification, are emotional rather than rational, and will engage in longer, more convoluted conversations than the shorter exchanges that Anthropic benchmarks on.
This also doesn't account for the fact that model reasoning degrades the longer an exchange is. This makes it less likely that a model will correct its past bias, and more likely that it will be sycophantic.
Anthropic briefly mentions “individual autonomy impacts” as a harm they are trying to address, but it’s unclear from their policies and evaluation methods how political disinformation might specifically be addressed. Social harm reduction typically focuses on more legible indicators like syncophancy and response to mental health crises. Harms from political disinformation aren’t captured in most safety work within companies. It’s far harder to understand how persuasive models cause demonstrable harm—there are no clear-cut ways to assess how agentic someone’s decisions are.

Gaps in Governance Beyond Flawed Evaluations

Because a lot of internal governance policies and practices are unknown, I only have general intuitions about why governance isn’t sufficient that are based on reading Anthropic’s RSP and System Card, and journalistic reporting on organizational and executive conflicts within OpenAI. I think this uncertainty is an indicator that we need far better oversight and investigation into how companies are following through on their governance commitments. I believe that companies have a lot of social, political, and legal capital that enables them to frame actions like laying off engineers, offering strategic bonuses, or cooperating with governments as normal company operations.

A lack of whistleblower protection compounds this, and people within an AI company probably aren’t strongly committed to democratic preservation. Further, we shouldn’t assume they’re more able than the average person to resist social or cultural pressures that enforce complicity or silence within a company[3].

On public accountability: I think the current level of scrutiny applied to model outputs after deployment isn't sensitive enough to bias and disinformation in a way that will hold companies accountable.

I expand on this later, but the amount of people that are able to identify a biased or misinformative model output is probably small. People are more likely to ask about things that they don't know, as opposed to things they have already made up their mind about.
I strongly suspect that when people do spot errors in AI outputs, they don't default to the hypothesis that an AI company is maliciously doing this. The tone of many casual "error reports" on social media suggests users instead think the AI is incapable, or that there is a technical error with the model. AI companies are seen as the conduit through which AI is hosted, not the arbiter of what the model says and does.
Like I describe above, the disinformation could be relatively targeted, so only individuals in certain zip codes or states are receiving the disinformative version of the model.
Even if there is some pattern recognition and critical mass formation after some amount of time, misinformation can scale very quickly: the model will have already been used in untraceable ways to write op-eds, do research, or persuade individuals, whether in innocent or malicious ways. As an analogy, consider the time lag in errata reporting within truth-seeking institutions like scientific journals and newspapers. These sources diffuse slower (e.g. 500 people read an article on Science.org, but 50,000 people watch a CNN clip on YouTube) and in a more traceable way than LLM outputs, but still cause bad second and third-order effects.

Electoral Margins are Small and Persuasion Gains are Sufficient

Hackenburg et. al (2024) conduct a large-n study of conversational model persuasion. The “persuasive gains” measured in this study demonstrate that LLMs deployed conversationally, without any persuasion-specific fine-tuning, are already capable of producing meaningful attitude change on political issues[4]. I believe a frontier model that has disinformative capabilities would be equally, if not more persuasive, and would therefore be able to cause non-trivial changes in electoral outcomes.

Two caveats on applying this research to an electoral context:

Hackenburg et. al measure agreement with a political statement on a 0–100 percentage-point scale; the persuasive effect is then operationalized as the percentage-point difference between the treatment and control groups. We can’t conflate “change in agreement” with “change in voting behaviour,” because we don’t know the tipping point of agreement that is required on one issue for someone to vote in a particular way.
Hackenburg et. al focus on post-training models to be maximally persuasive, rather than adherent to a specific ideological position and also persuasive. The latter is how a malicious actor would presumably want the model to behave, and I doubt the malicious actor would choose the post-training route. Post-training a model to be maximally persuasive is very costly, and doesn’t fit well with the likely methods of poisoning a model I specified above. However, there are other ways to make the model more persuasive (or more willing to persuade) within the modes of intervention I describe (e.g., via prompting).

This means that we can’t directly extrapolate how many people would vote differently because they converse with an LLM from this research. Regardless, conversation with LLMs have a non-trivial persuasive effect. I hypothesize that the malicious model that gets deployed will be equally, if not more persuasive than current LLMs; which would have non-trivial effects on voter behaviour and electoral outcomes.

To be sure, it’s not guaranteed that deploying a malicious model results in changes to electoral outcomes. Three factors lead me to think there exists a nontrivial likelihood that the persuasive effects of this model reach a threshold where any given electoral outcome occurs because of the model, and wouldn’t have happened otherwise.

Individuals are more engaged with and trusting of the model’s outputs compared to other sites of political discourse like social media. The flood of posts on a feed, the cacophony of any given comment section, and bad persuasive tactics limit the extent to which social media posts are engaging and persuasive. These factors are absent in the isolated chat interface of Claude or Gemini. In comparison, people probably enter sustained conversations with models on topics they haven’t formed a strong opinion on yet, and are likely to trust these outputs because of marketing, the use of sources and web searches, and the generally polished tone of the model outputs.
These exchanges are virtually impossible to regulate like other contexts. A social media company can delete material that gets flagged by their algorithms or user resorts, or attentive individuals in a Reddit thread can downvote misleading content. Private chats between models and users, however, are inaccessible by anyone else, often including the AI company itself.
The margins in recent American presidential races have been incredibly slim; battleground states see victory margins less than 3%. Geographically targeted deployment of persuasion could cause flips, and even a diffuse campaign might capture this 3%.

Further, I think the kinds of malicious political persuasion possible aren’t merely ones that aim to flip someone from red to blue. There are many voters who are undecided, apathetic, or on the fence. The model could try to splinter votes away from a leading party by making them less sure of their decision. It could agitate apathetic voters toward a party (regularly, just over a third of the voting-eligible population doesn't vote in federal elections). A model suggesting that people go vote is generally unsuspicious, but if this galvanization is targeted at certain groups, states, or districts, distorting effects are likely.

We should also consider the second-order, non-conversational effects. Individuals using these models to fact-check, to write articles, or produce other media would become carriers of the misinformation, which would then saturate the information environment that voters engage in with a consistent ideological stance.

From Corporate Capture to State Capture

Election results are extremely irreversible, even when people later discover misinformation or interference (Cambridge Analytica, the Mueller Report). The window of scrutiny is relatively slim, since other priorities surround incumbents. After an election, the media generally shifts to predicting what the incumbents will do, not scrutinizing how they got elected.

The risk here is that there now exists an incredibly close and unsupervised political relationship between a specific AI company and state apparatus. I think it’s likely you see closer collaboration that poses significant risks, such as on defense technologies (and this is then how you get an AI-enabled coup) or access to classified information. Even likelier is just less safety regulation: exemptions from chip or resource restrictions, or from auditing and oversight processes.

When a company and a party or politician collaborate in this way, it is far easier for that company’s needs and capacities to flow through and into state infrastructure in ways that massively increase catastrophic risks.

Regardless of electoral outcomes, the effects of mass disinformation and firm capture are still concerning. If more people believe in harmful conspiracy theories and don’t get vaccinated, or more people are hostile toward immigrants, I think the everyday experience for individuals on the ground is worse. An individual bypassing a company’s entire oversight team and deploying a model with secret loyalties or hidden backdoors could cause other forms of harm beyond disinformation campaigns.

My concern isn’t simply that electoral distortion from maliciously persuasive LLMs could happen. It is that the current structure of internal governance (and what we don’t know about these practices), the gaps in understanding sociopolitical harm and how to evaluate it, and the lack of third-party oversight in monitoring AI company public relations poses a substantial vulnerability.

Open Problems and Suggestions

Safety groups should commit themselves to surveilling and scrutinizing US AI companies at various levels. This includes tracking and broadcasting political donations, offering strong protection to whistleblowers inside companies, and generally advocating for more transparency when corporations and governments engage with each other.
We should take instances of bias or misinformation in models far more seriously. I worry that it is too easy to dismiss the consequences of bias or occasional factual errors as only diffuse harms, and that it is this mindset that leads to benchmarks that aren’t robust. It might be useful to apply some threat heuristic: we are really committed to making sure models don’t give instructions on how to cause harm with chemical weapons, and we should have a similar level of commitment to preventing models from getting individuals that would cause harm with nuclear weapons into positions where they can do so easily.
We should continue to follow zero trust frameworks and stress-test governance and policy proposals with the pessimistic assumption that companies pose a significant barrier to regulation, compliance, and governance.
^

"Corporate capture" is the standard term for this phenomenon. Not all AI companies are corporations—"company capture" would be more precise, but for clarity I refer to the process with its standard term.

I think there are broader arguments one could make about design flaws in benchmarks that test for more qualitative factors.

This 80,000 Hours article cautions those considering working in AI capabilities research “not to underestimate the possibility of value drift”—attitudes toward AI risk, even on safety teams, are likely more lax in frontier AI firms than at safety organizations.

“As predicted, the AI was substantially more persuasive in conversation than via static message. [...] We conducted a follow-up one month after the main experiment, which showed that between 36% (chat 1, p < .001) and 42% (chat 2, p < .001) of the immediate persuasive effect of GPT-4o conversation was still evident at recontact—demonstrating durable changes in attitudes” (Hackenburg et. al 2024)

Original source

LessWrong AI

https://www.lesswrong.com/posts/E9hBpPForyHFm8PKb/ai-company-insiders-can-bias-models-for-election

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Building knowledge graph…

Discussion

No comments yet — be the first to share your thoughts!