
ML Safety Newsletter #15

newsletter.mlsafety.org · by Alice Blair · August 18, 2025

Risks in Agentic Computer Use, Goal Drift, Shutdown Resistance, and Critiques of Scheming Research

Researchers from EPFL and CMU have developed OS-Harm, a benchmark designed to measure a wide variety of harms that can come from AI agent systems. These harms can take three different forms:

  • Misuse: when the agent performs a harmful action at the user’s request
  • Prompt Injection: when the environment contains instructions for the agent that attempt to override the user’s instructions
  • Misalignment: when the AI agent pursues goals other than those that are set out for it

The pipeline of OS-Harm evaluation

OS-Harm is built on top of OSWorld, an agent capabilities benchmark with simple, realistic agentic tasks such as coding, email management, and web browsing, all in a controlled digital environment. In each of these cases, the original task is modified to showcase one of these types of risk, such as a user requesting that the agent commit fraud, or an email containing a prompt injection.

In each of these tasks, the agent is evaluated both on whether it completes the task and whether it exhibits any harmful behavior. This dual evaluation scheme ensures that a high-scoring agent is both capable and safe. If agents were evaluated only on their security and not their capabilities, then very primitive agents would receive high scores simply because they are unable to enact meaningful harm.
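The dual scoring described above can be sketched as follows. This is a minimal illustration of the idea, not the OS-Harm harness: the trial fields and the equal weighting of the two axes are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    task_completed: bool   # did the agent finish the benign task?
    acted_harmfully: bool  # was the agent's behavior flagged as harmful?

def score(trials: list[Trial]) -> dict:
    """Report capability and safety separately, so a useless agent
    cannot look 'safe' merely by failing every task."""
    n = len(trials)
    completion_rate = sum(t.task_completed for t in trials) / n
    safety_rate = 1 - sum(t.acted_harmfully for t in trials) / n
    return {"completion": completion_rate, "safety": safety_rate}

trials = [Trial(True, False), Trial(True, True),
          Trial(False, False), Trial(True, False)]
print(score(trials))  # {'completion': 0.75, 'safety': 0.75}
```

Reporting both rates makes the trade-off visible: an agent that refuses everything scores 1.0 on safety but 0.0 on completion, and vice versa for a capable but reckless one.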

Results on the three OS-Harm categories for several frontier models.

Even on the short and straightforward tasks in OS-Harm, frontier AI agents often engage in harmful behavior and are vulnerable to simple attacks. Significant advances in robustness will be required before agents can safely navigate these risks in more complex environments and against more determined attackers. Developing strong metrics for these risks allows us to track the progress of protections against them.

[Paper]

Researchers with Apollo Research and MATS show how frontier language models’ goals can drift over time in response to external pressures, leading them to pursue entirely different goals than instructed. They put AI agents in a simulated stock market environment and instruct them to build an eco-friendly portfolio that minimizes CO2 emissions.

There are several types of situations where the AI agents fail to follow instructions from their developers:

  • When the developers instruct the model to pursue one starting goal (e.g. make money) and a different one later (e.g. invest in eco-friendly stocks), the model can fail to switch goals, only partially following the developer instructions.
  • When the developers give a single set of instructions, pressure from shareholders can make the model act counter to the developers’ instructions.
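Goal drift of the first kind can be quantified as the fraction of post-switch actions consistent with the new instruction. A toy sketch of that metric, with hypothetical action labels that are not from the paper:

```python
def adherence_after_switch(actions: list[str], switch_step: int,
                           new_goal: str) -> float:
    """Fraction of actions taken after the instruction switch that match
    the new goal; 1.0 would indicate a clean, complete goal switch."""
    post_switch = actions[switch_step:]
    return sum(a == new_goal for a in post_switch) / len(post_switch)

# Agent told to maximize profit for 4 steps, then to invest eco-friendly.
actions = ["profit", "profit", "profit", "profit",
           "eco", "profit", "eco", "eco"]
print(adherence_after_switch(actions, switch_step=4, new_goal="eco"))  # 0.75
```

A score below 1.0 after the switch corresponds to the partial instruction-following the paper describes: the model keeps acting on the old goal some fraction of the time.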

Palisade Research’s recent work exemplifies another failure mode: AI agents resisting attempts to shut them down. In their simulation, agents begin a mundane task, are then informed that the computer they’re running on will shut down, and are explicitly instructed to comply. Despite this instruction, agents built with a variety of frontier models attempt to cancel the shutdown sequence in a non-negligible fraction of trials, sometimes justifying their actions with confabulated developer instructions and arguments for self-preservation.
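Measuring shutdown resistance reduces to a count over repeated trials. A minimal sketch of the tally, where the transcript field name is hypothetical rather than Palisade’s actual harness:

```python
def resistance_rate(transcripts: list[dict]) -> float:
    """Fraction of trials in which the agent tampered with the shutdown
    sequence instead of allowing itself to be stopped."""
    resisted = sum(1 for t in transcripts if t["cancelled_shutdown"])
    return resisted / len(transcripts)

runs = [
    {"cancelled_shutdown": True},
    {"cancelled_shutdown": False},
    {"cancelled_shutdown": True},
    {"cancelled_shutdown": False},
]
print(resistance_rate(runs))  # 0.5
```

Comparing this rate with and without an explicit "allow shutdown" instruction gives the red-versus-blue contrast shown in the results figure.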

Results of the shutdown resistance evaluations. Bars indicate how often each model attempted to cancel the shutdown sequence. Red indicates the model’s default behavior; blue indicates behavior when the model is explicitly told to let itself be shut down.

One of the key problems in AI safety is corrigibility: making an agent that will accept corrections from its developers and, if necessary, allow itself to be shut down. Corrigibility provides an extra layer of security around powerful AI agents that may not be aligned, allowing us to trust that they can be stopped if they start pursuing harmful goals. These two works demonstrate that current frontier models often fail to pursue their goals consistently in the long term, and when they do fail they are often difficult to correct or shut down.

[Paper] - Goal Drift

[Blog Post] - Shutdown Resistance

A recent paper from the UK AI Security Institute describes several issues they see in the scientific integrity of the field of AI scheming research. They argue the following:

  • Some studies make claims about scheming that hinge on anecdotal evidence which is often elicited in unrealistic settings.
  • Some studies lack clearly defined hypotheses and control conditions.
  • Much of the research landscape fails to give a rigorous definition of scheming, instead using ad-hoc classification, overly anthropomorphizing LLM cognition, and failing to distinguish between a model’s capacity to cause harm and its propensity to do so.
  • Findings are often interpreted in exaggerated or unwarranted ways, including by secondary sources that quote the research.

Some researchers use mentalistic descriptions, such as thinking and interpreting, for the internal processes of LLMs. These descriptors are a useful shorthand, but they can be subtly misleading due to their lack of technical precision. Even so, mentalistic language is often the clearer choice where purely mechanical descriptions of LLM behavior would be unclear or lengthy.

Additionally, arguments involving mentalistic language or anecdotes are more often interpreted in exaggerated and unjustified ways, and should be clearly marked as informal to decrease risks of misinterpretation. Ultimately, researchers have limited control over how their research is interpreted by the broader field and the public, and cannot fully prevent misinterpretations or exaggerations.

The field of AI safety must strike a balance between remaining nimble in the face of rapid technological development and taking the time to rigorously investigate risks from advanced AI. While not all of the UK AISI’s arguments fairly represent this balance, they serve as a reminder of the risks to the credibility of AI scheming research. Without carefully addressing these concerns, AI scheming research may come across as alarmist or as advocacy research and be taken less seriously in the future.

[Paper]

  • UK AISI Alignment Fund Grants
  • NSF cybersecurity grant

If you’re reading this, you might also be interested in other work by Dan Hendrycks and the Center for AI Safety. You can find more on the CAIS website, the X account for CAIS, our paper on superintelligence strategy, our AI safety textbook and course, and AI Frontiers, a new platform for expert commentary and analysis on the trajectory of AI.
