
ML Safety Newsletter #1

newsletter.mlsafety.org, by Dan Hendrycks, October 18, 2021
ICLR Safety Paper Roundup

Welcome to the 1st issue of the ML Safety Newsletter. In this edition, we cover:

  • various safety papers submitted to ICLR
  • results showing that discrete representations can improve robustness
  • a benchmark which shows larger models are more likely to repeat misinformation
  • a benchmark for detecting when models are gaming proxies
  • ... and much more.

Overview of the proposed Vision Transformer that uses discrete representations. The pixel embeddings (orange) are combined with discrete embedded tokens (pink) to create the input to the Vision Transformer.

There is much interest in the robustness of Vision Transformers, as they intrinsically scale better than ResNets in the face of unforeseen inputs and distribution shifts. This paper further enhances the robustness of Vision Transformers by augmenting the input with discrete tokens produced by a vector-quantized encoder. Why this works so well is unclear, but on datasets unlike the training distribution, the model achieves marked improvements. For example, when the model is trained on ImageNet and tested on ImageNet-Rendition (a dataset of cartoons, origami, paintings, toys, etc.), its accuracy increases from 33.0% to 44.8%.

Paper
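
To make the idea of "discrete tokens produced by a vector-quantized encoder" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the codebook values, the function names (`nearest_code`, `quantize`, `combine`), and the choice to combine the two embeddings by concatenation are all assumptions made for illustration.

```python
# Toy vector quantization: map each continuous patch feature to the index of
# its nearest codebook entry, then attach the embedding of that discrete token
# to the patch's pixel embedding. All names and values here are hypothetical.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def quantize(patch_features, codebook):
    """Turn a list of continuous patch features into discrete token ids."""
    return [nearest_code(f, codebook) for f in patch_features]

def combine(pixel_embeddings, token_ids, token_embedding_table):
    """Concatenate each pixel embedding with its discrete token's embedding.

    Concatenation is an assumption; the source only says they are "combined".
    """
    return [
        list(pix) + list(token_embedding_table[t])
        for pix, t in zip(pixel_embeddings, token_ids)
    ]

# Hypothetical 3-entry codebook and three patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patch_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
token_ids = quantize(patch_features, codebook)

# Hypothetical pixel embeddings and a learned table with one row per token id.
pixel_embeddings = [[0.5, 0.5, 0.5], [0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
token_embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
transformer_input = combine(pixel_embeddings, token_ids, token_embedding_table)
```

The point of the sketch is that many slightly different patch features collapse onto the same token id, so small input perturbations may leave the discrete part of the input unchanged. Whether that accounts for the reported robustness gains is not established by the summary above.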

Improving test-time adaptation to distribution shift using data augmentation.

Certifying robustness to adversarial patches.

Augmenting data by mixing discrete cosine transform image encodings.

Teaching models to reject adversarial examples when they are unsure of the correct class.

Models trained to predict the next token are incentivized to repeat common misconceptions.

A new benchmark shows that GPT-3 imitates human misconceptions. In fact, larger models more frequently repeat misconceptions, so simply training more capable models may make the problem worse. For example, GPT-J with 6 billion parameters is 17% worse on this benchmark than a model with 0.125 billion parameters. This demonstrates that simple objectives can inadvertently incentivize models to be misaligned and repeat misinformation. To make models' outputs truthful, we will need to find ways to counteract this new failure mode.

Paper

An expanded report towards building truthful and honest models.

Using an ensemble of one-class classifiers to create an out-of-distribution detector.

Provable performance guarantees for out-of-distribution detection.

Synthesizing outliers is becoming increasingly useful for detecting real anomalies.

As networks become larger, they can more aggressively optimize proxies and reduce performance of the true objective.

Real-world constraints often require implementing rough proxies instead of our true objectives. However, as models become more capable, they can exploit faults in the proxy and undermine performance on the true objective, a failure mode called proxy gaming. This paper finds that proxy gaming occurs in multiple environments, including a traffic control environment, a COVID response simulator, Atari Riverraid, and a simulated controller for blood glucose levels. To mitigate proxy gaming, the authors use anomaly detection to flag models that engage in it.

Paper
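
One way to picture anomaly detection for proxy gaming is to compare a candidate policy's behavior against a trusted baseline and flag large deviations. The sketch below does this with total variation distance between action distributions. It is an illustration only, not the paper's detector: the function names, the distance measure, the threshold of 0.3, and the example distributions are all assumptions.

```python
# Toy policy-anomaly check: flag a candidate policy whose action distributions
# differ sharply from a trusted policy's on the same states.
# Every name and number here is hypothetical.

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mean_divergence(trusted, candidate):
    """Average per-state distance between the two policies' action distributions."""
    distances = [total_variation(p, q) for p, q in zip(trusted, candidate)]
    return sum(distances) / len(distances)

def flags_proxy_gaming(trusted, candidate, threshold=0.3):
    """Return True when the candidate deviates from the baseline by more than `threshold`."""
    return mean_divergence(trusted, candidate) > threshold

# Action distributions over three actions, evaluated on two shared states.
trusted_policy = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
similar_policy = [[0.6, 0.3, 0.1], [0.6, 0.2, 0.2]]
gaming_policy = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.9]]

similar_flagged = flags_proxy_gaming(trusted_policy, similar_policy)
gaming_flagged = flags_proxy_gaming(trusted_policy, gaming_policy)
```

A detector like this only says that behavior has changed, not that the true objective has suffered, so in practice a flagged policy would still need to be inspected.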

A paper studying how models may be incentivized to influence users.

Safe exploration in 3D environments.

A thorough analysis of security vulnerabilities generated by GitHub Copilot.

An ML system for improved decision making.

The NSF has a new call for proposals. Among other topics, they intend to fund Trustworthy AI (which overlaps with many ML Safety topics), AI for Decision Making, and Intelligent Agents for Next-Generation Cybersecurity (the latter two are relevant for External Safety).
