ML Safety Newsletter #4
Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness
Welcome to the 4th issue of the ML Safety Newsletter. In this edition, we cover:
- How “model-based optimization” environments can be used to research proxy gaming
- How models can express their uncertainty through natural language
- A new plug-and-play state-of-the-art OOD detection technique
- How “rationales” can improve robustness to adversarial attacks
- Announcing our subreddit with safety papers added nightly
- ... and much more.
Functions without a smoothness prior can result in solutions that maximize a proxy but not the true objective.
A portion of “Model-based optimization” (MBO) research provides a way to study simple cases of proxy gaming: with MBO environments, we can learn how to build better proxies that yield better solutions when optimized.
MBO aims to design objects with desired properties: that is, to find a new input that maximizes an objective. The objective is typically assumed to be an expensive-to-evaluate black box. Since the black-box objective is expensive to query, researchers instead create a proxy that can be queried cheaply and repeatedly. An optimizer then finds an input that maximizes the proxy. To design proxies that yield better solutions according to the ground-truth black-box objective, this paper incorporates a smoothness prior. As there are many mathematical details, see the paper for a full description. In short, model-based optimization environments can be used to empirically study how to create better, less gameable proxies.
[Paper] [Video]
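The failure mode above can be reproduced in a few lines. The sketch below is a toy illustration (not the paper's method): a flexible, unsmooth proxy fit to noisy samples of a hidden objective is "gamed" by its optimizer, while a cruder but smoother fit is not. All functions and constants here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth objective (the expensive "black box" in real MBO): peak at x = 0.
def true_objective(x):
    return -x**2

# Fit proxies on a handful of noisy evaluations from [-1, 1].
x_train = rng.uniform(-1, 1, 20)
y_train = true_objective(x_train) + 0.1 * rng.normal(size=20)

# A high-degree polynomial: flexible but not smooth outside the data.
wiggly = np.polynomial.Polynomial.fit(x_train, y_train, deg=15)
# A low-degree fit acts as a crude stand-in for a smoothness prior.
smooth = np.polynomial.Polynomial.fit(x_train, y_train, deg=2)

# "Optimize" each proxy by grid search over a range wider than the data.
grid = np.linspace(-2, 2, 4001)
x_wiggly = grid[np.argmax(wiggly(grid))]
x_smooth = grid[np.argmax(smooth(grid))]

# The wiggly proxy is gamed: its maximizer scores highly under the proxy
# but poorly under the true objective; the smooth proxy's maximizer does not.
print(true_objective(x_wiggly), true_objective(x_smooth))
```

The unsmooth proxy's optimum lands where the fit extrapolates wildly, exactly the kind of solution that maximizes the proxy but not the true objective.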
- [Link] Why we need biased AI -- How including cognitive and ethical machine biases can enhance AI systems: “a re-evaluation of the ethical significance of machine biases”
- [Link] Generating ethical analyses of moral quandaries
- [Link 1] [Link 2] Examples of inverse scaling or anticorrelated capabilities: perceptual similarity performance does not monotonically increase with classification accuracy
- [Link] “comprehensive comparison of these provably safe RL methods”
- [Link] Inverse Reinforcement Learning Tutorial
- [Link] Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions: “We do not find that explanations in our set-up improve human accuracy”
This work shows GPT-3 can express its uncertainty in natural language, without using model logits. Moreover, it is somewhat calibrated under various distribution shifts.
This is an early step toward making model uncertainty more interpretable and expressive. In the future, perhaps models could use natural language to express complicated beliefs such as “event A will occur with 60% probability assuming event B also occurs, and with 25% probability if event B does not.” In the long term, uncertainty estimation will likely remain nontrivial, as it is not obvious how to make future models calibrated on inherently uncertain, chaotic, or computationally prohibitive questions that extend beyond existing human knowledge.
[Link]
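Calibration of verbalized confidence can be measured the same way as logit-based confidence. The sketch below computes expected calibration error (ECE) from hypothetical (stated confidence, was-correct) pairs; the data is invented for illustration, e.g. confidences parsed from answers like "I am 90% confident that ...".

```python
import numpy as np

# Hypothetical (stated confidence, was-correct) pairs.
confidences = np.array([0.9, 0.7, 0.6, 0.95, 0.5, 0.8, 0.3, 0.99])
correct     = np.array([1,   1,   0,   1,    0,   1,   0,   1])

def expected_calibration_error(conf, correct, n_bins=5):
    """Average |accuracy - mean confidence| over equal-width confidence
    bins, weighted by the fraction of samples falling in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```

An ECE near zero means the model's stated probabilities match its empirical accuracy; the paper evaluates this under distribution shift as well.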
An illustration of the Virtual Logit Matching pipeline.
Virtual logit matching is a new out-of-distribution (OOD) detection technique that does not require hyperparameter tuning, does not require retraining models, and beats the maximum softmax baseline on most OOD detection tasks. The idea is to create a “virtual logit,” which is proportional to the magnitude of the projection of the input onto the space orthogonal to the principal embedding space. Then the OOD score is roughly equal to the virtual logit minus the maximum logit: intuitively, the evidence that the input is unlike the training example embeddings minus the evidence that it is in-distribution.
[Link]
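The core computation is just linear algebra on penultimate-layer features. Below is a toy numpy sketch on synthetic features (the dimensions, classifier head, and data are all made up); the actual method additionally tunes the virtual logit's scale on training data and feeds it through a softmax, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for penultimate-layer features: in-distribution
# features lie close to a low-dimensional principal subspace.
d, k = 64, 8                                   # feature dim, subspace dim
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]        # d x k orthonormal
train_feats = (rng.normal(size=(1000, k)) @ basis.T
               + 0.05 * rng.normal(size=(1000, d)))
W = 0.1 * rng.normal(size=(d, 10))             # stand-in classifier head

# Estimate the principal subspace of the training features via SVD.
_, _, Vt = np.linalg.svd(train_feats, full_matrices=False)
P = Vt[:k]                                     # k x d: top-k directions

def vim_score(f, alpha=1.0):
    """Higher = more likely OOD. The 'virtual logit' is alpha times the
    norm of f's residual off the principal subspace; the maximum real
    logit is subtracted as evidence the input is in-distribution."""
    residual = f - P.T @ (P @ f)
    return alpha * np.linalg.norm(residual) - np.max(f @ W)

id_x = basis @ rng.normal(size=k)    # in the principal subspace: low score
ood_x = rng.normal(size=d)           # generic direction: large residual
print(vim_score(id_x), vim_score(ood_x))
```

Because the principal subspace and the classifier head come from the already-trained model, the score needs no retraining, matching the "plug-and-play" framing above.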
- [Link] “We train probes to investigate what concepts are encoded in game-playing agents like AlphaGo and how those concepts relate to natural language”
- [Link] By removing parts of an input image, one can analyze how much a model depends on a given input feature. However, removing parts of the input is often not completely sound, as removing parts confuses models. Fortunately with Vision Transformers, removing patches is a matter of simply dropping tokens, which is a more sound way to create counterfactual inputs.
- [Link] To more scalably characterize model components, this work “automatically labels neurons with open-ended, compositional, natural language descriptions”
- [Link] Transformer mechanisms that complete simple sequences are identified and shown to be emergent during training
- [Link] An interpretability benchmark: controllably generate training examples under arbitrary biases (shape, color, etc.); human subjects are then asked to predict the system's output relying on explanations
- [Link] A new library has implementations of over a dozen OOD detection techniques
- [Link] Trojan detection cat and mouse continues: a new attack “reduces the accuracy of a state-of-the-art defense mechanism from >96% to 0%”
- [Link] Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers: “we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own”
- [Link] Research on understanding emergent functionality: “We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion”
- [Link] Displaying a model's true confidence can be suboptimal for helping people make better decisions
To improve robustness, this paper asks models to explain their predictions; these explanations are called “rationales.” When models produce rationales before predicting, they are more robust to token-level and sentence-level adversarial attacks.
[Link]
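The rationale-before-prediction setup amounts to a prompting and parsing convention. The sketch below shows one hypothetical way to structure it; the template, marker string, and example are assumptions for illustration, not the paper's exact format.

```python
def rationale_prompt(passage: str, question: str) -> str:
    """Hypothetical template: elicit a free-text rationale before the
    label, so the prediction is conditioned on the model's own explanation."""
    return (
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "First explain your reasoning in one or two sentences (the rationale), "
        "then give your final answer on a new line starting with 'Answer:'."
    )

def parse_answer(completion: str) -> str:
    # Take the text after the last 'Answer:' marker as the prediction.
    return completion.rsplit("Answer:", 1)[-1].strip()

prompt = rationale_prompt("The movie was tedious and overlong.",
                          "Is the review positive or negative?")
# A plausible model completion, hard-coded here for illustration:
completion = "The review calls the film tedious, which is critical.\nAnswer: negative"
print(parse_answer(completion))
```

The attack surface stays the same (the passage and question), but the prediction is now mediated by the generated rationale, which is where the paper finds the robustness gain.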
- [Link] How well do adversarial attacks transfer? This paper provides a large-scale systematic empirical study in real-world environments
- [Link] Advancement in robustness with guarantees: “[we] provide better certificates in terms of certified accuracy, average certified radii and abstention rates as compared to concurrent approaches”
- [Link] A large-scale data collection effort to add three nines of reliability to an injury classification task
- [Link] “simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks”
- [Link] What causes CLIP's perceived robustness? Mostly dataset diversity, suggesting semantic overlap with the test distribution
- [Link] Testing RL agent robustness to abrupt changes and sudden shocks to the environment
We now have a subreddit! It features a steady stream of safety-relevant papers, including papers not covered in this newsletter, with new papers added several times a week. The subreddit’s posts are available on Twitter too.
A lecture series on social and ethical considerations of advanced AI: concrete suggestions for creating cooperative AI; discussion of the infeasibility and suboptimality of various deployment strategies; discussion of the merits of AI autonomy and reasonableness over rationality; and outlining how communities of agents could be robustly safe. (I recommend watching the final lecture, and if you’re interested consider watching the previous lectures.)