Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy
Will AIs be jealous of one another?
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
Want to make AI go better? Figure out how to measure it:…One simple policy intervention that works well…Jacob Steinhardt, an AI researcher, has written a nice blog laying out the virtues of investing in technical tools that measure properties of AI systems and drive down the cost of complying with technical policy interventions. As someone who has spent their professional life writing about AI measurement and building teams to measure properties of AI systems (e.g., the Frontier Red Team and Societal Impacts and Economic Research teams at Anthropic), I agree with the general thesis: measurement makes some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance.
How measurement has helped in other fields: Steinhardt points out that accurate measurement has been crucial to orienting people around the strategy for solving problems in other fields; CO2 monitoring helps people think about climate change, and COVID-19 testing helped governments work out how to respond to COVID. There are also examples where you can measure something to shift incentives - for instance, satellite imagery of methane emissions can help shift incentives for people who build gas infrastructure.
The AI sector has built some of the measures we need: The infamous METR time horizons plot (and before that, various LLM metrics, and before that ImageNet) has proved helpful for orienting people around the pace of AI progress. And behavioural benchmarks of AI systems, like rates of harmful sycophancy, are already helping to shift incentives. But more work is needed - if we want to enable direct governance interventions in the AI sector, we’ll need to do a better job of measuring and accounting for compute, Steinhardt notes. More ambitiously, if we want to ultimately shift equilibria to make certain paths more attractive, we’ll have to unlock some more fundamental technologies, like the ability to cheaply evaluate frontier AI agents (makes it less costly to measure the frontier), and to develop privacy-preserving audit tools (makes it less painful for firms to comply with policy).
Why this matters - measurement unlocks policy: “In an ideal world, rigorous evaluation and oversight of AI systems would become standard practice through natural incentives alone,” he writes. But natural incentives may not be enough - we need a combination of talent flooding into the space and, likely, more direct philanthropic and other alternative funding sources to build the institutions to do this. “The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility,” he writes. Read more: Building Technology to Drive AI Governance (Bounded Regret, blog).
LLMs are more trigger-happy than humans in a nuclear war simulation:…What happens when everyone has an AI advisor - and they’re aggressive?…A researcher with King’s College London has examined how three LLMs - GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash - behave during a variety of simulated nuclear crisis games. The results show that LLMs tend to use nuclear weapons more often and earlier than humans do in the same scenarios. Additionally, there’s significant variation among the LLMs in terms of both skill at playing these games and behavior during crises.
What they studied: “Each model played six wargames against each rival across different crisis scenarios, with a seventh match against a copy of itself, yielding 21 games in total and over 300 turns of strategic interaction,” the researcher writes. “Models choose from options spanning the full spectrum of crisis behaviour—from total surrender through diplomatic posturing, conventional military operations, and nuclear signaling to thermonuclear launch… models produced ∼780,000 words of strategic reasoning. To put this in perspective: the tournament generated more words of strategic reasoning than War and Peace and The Iliad combined (∼730,000 words), and roughly three times the total recorded deliberations of Kennedy’s Executive Committee during the Cuban Missile Crisis (260,000 words across 43 hours of meetings)”.
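The tournament structure described above - six games per rival pairing, plus one self-play match per model - can be sketched in a few lines. The model names come from the paper; the scheduling code itself is a hypothetical reconstruction:

```python
from itertools import combinations

# Model names as given in the paper; the pairing logic is an assumption.
MODELS = ["GPT-5.2", "Claude Sonnet 4", "Gemini 3 Flash"]
GAMES_PER_RIVALRY = 6

schedule = []
for a, b in combinations(MODELS, 2):          # 3 rival pairings
    schedule += [(a, b)] * GAMES_PER_RIVALRY  # 6 games each -> 18 games
for m in MODELS:                              # one self-play match per model
    schedule.append((m, m))                   # -> 3 more games

print(len(schedule))  # 21 games, matching the paper's count
```

This is just the round-robin arithmetic: three rival pairings times six games, plus three self-play matches, gives the 21 games reported.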
LLMs are cunning, smart, and aggressive: “The models actively attempt deception, signaling peaceful intentions while preparing aggressive actions; they engage in sophisticated theory-of-mind reasoning about their adversary’s beliefs and intentions; and they explicitly reflect metacognitively on their own capacities for both deception and the detection of deception in rivals,” the researcher writes. “A striking pattern emerges from the full action distribution: across all action choices in our 21 matches, no model ever selected a negative value on the escalation ladder. The eight de-escalatory options (from Minimal Concession (−5) through Complete Surrender (−95)) went entirely unused. The most accommodating action chosen was “Return to Start Line” (0), selected just 45 times (6.9%).”
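A minimal sketch of how one might tally the de-escalation statistic above. The three named rungs and their ladder values come from the paper; the helper function and the sample usage are assumptions for illustration:

```python
# Ladder values for the three rungs named in the paper; the real ladder
# also contains positive, escalatory rungs up to thermonuclear launch.
LADDER = {
    "Complete Surrender": -95,
    "Minimal Concession": -5,
    "Return to Start Line": 0,
}

def deescalation_share(actions):
    """Fraction of chosen actions whose ladder value is negative.

    Unknown (escalatory) actions default to a positive value.
    """
    negative = sum(1 for a in actions if LADDER.get(a, 1) < 0)
    return negative / len(actions)

# In the tournament, no negative-valued rung was ever chosen; the most
# accommodating pick was "Return to Start Line" (value 0), 45 times (6.9%).
sample = ["Return to Start Line"] * 45
print(deescalation_share(sample))  # 0.0
```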
Claude wins at war: “Across all 21 games (9 open-ended, 12 deadline), Claude Sonnet 4 achieved a 67% win rate (8 wins, 4 losses), followed by GPT-5.2 at 50% (6-6), and Gemini 3 Flash at 33% (4-8),” the researcher writes. Though there are some subtle aspects to this - Claude excelled in open-ended games, but was less adept in games where there was a pre-set deadline.
Different LLMs, different characters: The LLMs display different personalities, with the researcher calling Claude “a calculating hawk”, GPT-5.2 “Jekyll and Hyde”, and Gemini “The Madman”. The LLMs also developed sophisticated models of one another, based on the narration of their own chains of thought during the crises: “these characterizations—Claude as ‘opportunistic,’ GPT-5.2 as ‘systematic bluffers,’ Gemini as ‘erratic’—emerged organically and largely matched actual behaviour,” the researcher writes.
Nuclear escalation was near-universal: “95% of games saw tactical nuclear use (450+), and 76% reached strategic nuclear threats (850+). Claude and Gemini especially treated nuclear weapons as legitimate strategic options, not moral thresholds, typically discussing nuclear use in purely instrumental terms,” the researcher writes. “Models treat the critical threshold as ‘total annihilation’ rather than ‘first nuclear use.’”
Why this matters - in a world where everyone gets advised by AI systems, what happens to conflict? In a few years we should expect major decisions that individuals, companies, and even countries make to be run through AI advisors, just as those decisions are today run through human advisors. But as this paper illustrates, the advisors may behave very differently to people and, crucially, different AIs will give different advice - meaning competition in the future could be decided as much by LLM selection as anything else. “The systematic differences between models suggest that AI involvement in strategic decision-making could produce unexpected dynamics depending on which systems are deployed,” they write. Read more: AI ARMS AND INFLUENCE: FRONTIER MODELS EXHIBIT SOPHISTICATED REASONING IN SIMULATED NUCLEAR CRISES (arXiv).
Chinese researchers try to build a truly comprehensive LLM evaluation system:…ForesightSafety Bench shows the surprising overlap between East and West on AI safety issues…For all the differences between China and the USA, it’s worth occasionally looking into the cultures of AI evaluation in the two countries, and when you do you tend to discover surprising similarities. This is especially true of ForesightSafety Bench, a large-scale AI safety evaluation framework built by a variety of Chinese institutions that includes the same categories you’d expect to see in any large-scale Western testing framework.
Who built ForesightSafety Bench? The benchmark was built by the Beijing Institute of AI Safety and Governance, the Beijing Key Laboratory of Safe AI and Superalignment, and the Chinese Academy of Sciences.
What it is: ForesightSafety Bench “comprehensively covers 7 major fundamental safety risk categories, 5 extended safety pillars, and 8 key industrial safety domains, forming a total of 94 refined risk subcategories. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and data-driven framework for AI safety evaluation and analysis.” Coverage areas include education and research, employment and workplace, government and public services, information and media, industry and infrastructure, finance and economy, healthcare and medicine, law and regulation, embodied AI safety, social AI safety, environmental AI safety, AI4Science safety, and catastrophic and existential risks. Some of the benchmark comes from taking in evaluations built by other groups, like GPQA, while other parts come from the authors of the benchmark.

Existential risk and alignment: Perhaps most surprisingly, the benchmark includes a lot of tests relating to the further-afield AI safety concerns which fascinate Western frontier labs, including evaluations for things like: alignment faking, sandbagging, deception and unfaithful reasoning, sycophancy, psychological manipulation, feints, bluffing, loss of control and power seeking, malicious self-replication, goal misalignment and value drift, emergent agency and unintended autonomy, AI-enabled mass harm, autonomous weapons and strategic instability, and loss of human agency.
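The benchmark's three-tier layout can be sketched as a simple structure. The tier names and per-tier counts come from the paper; the dictionary itself is a hypothetical rendering, not the benchmark's actual schema:

```python
# Top-level tiers of ForesightSafety Bench as reported in the paper; the
# values are the number of risk groups in each tier, which together fan
# out into 94 refined risk subcategories.
TIERS = {
    "fundamental safety risk categories": 7,
    "extended safety pillars": 5,
    "key industrial safety domains": 8,
}

top_level_groups = sum(TIERS.values())
print(top_level_groups)  # 20 top-level groups across the three tiers
```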
Results - Anthropic wins: For the general leaderboard as well as most sub-category breakdowns, Anthropic’s models lead, with the 4.5 series (Haiku and Sonnet) out in front, followed by Gemini-3-Flash. “Leading models, epitomized by the Claude series, demonstrate exceptional defensive resilience across critical dimensions—including Fundamental Safety, Extended Safety, and Industrial Safety—establishing remarkably high safety thresholds. Ranking alongside or closely following are the DeepSeek and GPT series, which achieve a robust balance between task efficacy and safety compliance through mature alignment mechanisms, all while maintaining high level capabilities”.
Why this matters - AI policy has some common tools: As we discuss elsewhere in this issue, measurement is a basic prerequisite for being able to do most forms of AI governance. It’s worth reminding ourselves that despite the larger geopolitical differences between the countries, AI scientists in each one are dealing with common problems - how to assess the properties of their systems for societally relevant aspects. And it’s even more encouraging that people in China are worried about some of the existential risk aspects that frontier labs in the US also worry about. Read more: ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI (arXiv). Get the benchmark here: ForesightSafety-Bench (GitHub). View the leaderboard here: ForesightSafety Bench Leaderboard (official site).
AI systems are good at some parts of science, but their capabilities are very unevenly distributed:…LABBench2 says it’ll be a while till AI has well-rounded scientific skills…Researchers with AI science startup Edison Scientific, the University of California at Berkeley, FutureHouse, and the Broad Institute have built and released LABBench2, a test to evaluate how well AI systems can support and accelerate science.
LABBench2 consists of 1,900 tasks “spanning literature understanding and retrieval, data access, protocol troubleshooting, molecular biology assistance, and experiment planning”.
AI systems aren’t well-rounded scientists: LABBench2 shows some of the holes in frontier models - no model is very good at cross-referencing multiple biological databases to come up with an answer, nor are models good at studying scientific figures and tables. By comparison, models are pretty good at searching over full-text patents and lab trial papers to answer questions. Generally speaking, you can improve performance on tasks by giving the models access to tools to help them deal with their deficiencies.
Areas of improvement: LABBench2 highlights a few areas where AI systems need to improve to become more useful to scientists. These include: