
ML Safety Newsletter #3

newsletter.mlsafety.org · by Dan Hendrycks · March 8, 2022

Transformer adversarial robustness, fractals, preference learning

Welcome to the 3rd issue of the ML Safety Newsletter. In this edition, we cover:

  • NeurIPS ML safety papers
  • experiments showing that Transformers have no edge for adversarial robustness and anomaly detection
  • a new method leveraging fractals to improve various reliability metrics
  • a preference learning benchmark
  • ... and much more.

This paper evaluates the distribution shift robustness and adversarial robustness of ConvNets and Vision Transformers (ViTs). Compared with previous papers, its evaluations are fairer and more careful.

After controlling for data augmentation, they find that Transformers exhibit greater distribution shift robustness. For adversarial robustness, findings are more nuanced. First, ViTs are far more difficult to adversarially train. When successfully adversarially trained, ViTs are more robust than off-the-shelf ConvNets. However, ViTs’ higher adversarial robustness is explained by their smooth activation function, the GELU. If ConvNets use GELUs, they obtain similar adversarial robustness. Consequently, Vision Transformers are more robust than ConvNets to distribution shift, but they are not intrinsically more adversarially robust.

Paper

Video
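The GELU finding can be illustrated numerically. ReLU's derivative jumps from 0 to 1 at the origin, while GELU's derivative passes smoothly through it; this smoothness is what the paper credits for making adversarial training easier. A minimal sketch using the standard tanh approximation of GELU (the one-sided derivative check is our own illustration, not the paper's experiment):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU(x) = x * Phi(x)
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def one_sided_grads(f, x0=0.0, eps=1e-5):
    """Left and right numeric derivatives of f at x0."""
    left = (f(x0) - f(x0 - eps)) / eps
    right = (f(x0 + eps) - f(x0)) / eps
    return left, right

# ReLU has a kink at 0 (derivative jumps by 1); GELU is smooth there.
relu_jump = abs(np.subtract(*one_sided_grads(relu)))
gelu_jump = abs(np.subtract(*one_sided_grads(gelu)))
print(relu_jump, gelu_jump)
```

Intuitively, gradient-based adversarial training is better behaved when the loss surface has no kinks from the activation function.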

PixMix improves both robustness (corruptions, adversaries, prediction consistency) and uncertainty estimation (calibration, anomaly detection).

PixMix is a data augmentation strategy that mixes training examples with fractals or feature visualizations; models then learn to classify these augmented examples. Whereas previous methods sacrifice performance on some reliability axes for improvements on others, this is the first to have no major reliability tradeoffs and is near Pareto-optimal.

Paper
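The core mixing loop can be sketched in a few lines. This is a loose, hypothetical sketch of the idea only: repeatedly blend a normalized image with a random "mixing picture" (a fractal or feature visualization) using additive or multiplicative operations. The real method also composes standard augmentations and uses its own specific operations and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def pixmix(image, mixers, k=4, beta=3.0):
    """Loose sketch of PixMix-style augmentation: mix a [0, 1]-normalized
    image with randomly chosen mixing pictures up to k times."""
    mixed = image.copy()
    for _ in range(rng.integers(0, k + 1)):
        picture = mixers[rng.integers(len(mixers))]
        w = rng.beta(beta, beta)                         # mixing weight in (0, 1)
        if rng.random() < 0.5:
            mixed = w * mixed + (1 - w) * picture        # additive blend
        else:
            mixed = (mixed ** w) * (picture ** (1 - w))  # multiplicative blend
    return np.clip(mixed, 0.0, 1.0)

img = rng.random((32, 32, 3))
fractals = [rng.random((32, 32, 3)) for _ in range(4)]  # stand-ins for real fractals
out = pixmix(img, fractals)
```

The structural complexity of fractals is what makes the augmented examples useful for reliability; random noise pictures would not provide the same benefit.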

A new adversarial robustness state-of-the-art by finding a better way to leverage data augmentations.

A highly effective gradient-based adversarial attack for text-based models.

A new benchmark for detecting adversarial text attacks.

Adversarially attacking language models with bidirectional and large-scale unidirectional language models.

First works on certified robustness under distribution shift: [1], [2], [3].

A dataset where in-distribution accuracy is negatively correlated with out-of-distribution robustness.

Improving performance in tail events by augmenting prediction pipelines with retrieval.

A set of new, more realistic 3D common corruptions.

Multimodality can dramatically improve robustness.

The authors model the hidden feature representations of in-distribution examples as class-conditional Gaussians, and they sample virtual outliers from the low-likelihood region. The model is trained to separate in-distribution examples from virtual outliers.

A path towards better out-of-distribution (OOD) detection is through generating diverse and unusual examples. As a step in that direction, this paper proposes to generate hidden representations or “virtual” examples that are outliers, rather than generate raw inputs that are outliers. The method is evaluated on many object detection and classification tasks, and it works well. It is not evaluated on the more difficult setting where anomalies are held-out classes from similar data generating processes. If the authors evaluated their CIFAR-10 model’s ability to detect CIFAR-100 anomalies, then we would have more of a sense of its ability to detect more than just far-from-distribution examples. Assuming no access to extra real outlier data, this method appears to be the state-of-the-art for far-from-distribution anomaly detection.

Paper
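The sampling step above can be sketched concretely. This is a simplified, single-class illustration of the idea (fit a Gaussian to penultimate-layer features, sample candidates, keep only the lowest-likelihood fraction), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def virtual_outliers(features, n_candidates=10000, quantile=0.01):
    """Sketch of virtual-outlier sampling: fit a class-conditional Gaussian
    to feature vectors, sample candidates from it, and keep the
    lowest-density fraction as 'virtual' outliers."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    samples = rng.multivariate_normal(mu, cov, size=n_candidates)
    # Rank candidates by (unnormalized) Gaussian log-density; the
    # Mahalanobis distance suffices for ranking.
    diff = samples - mu
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    cutoff = np.quantile(-maha, quantile)   # low density = large distance
    return samples[-maha <= cutoff]

feats = rng.normal(size=(500, 8))   # stand-in for penultimate-layer features
outliers = virtual_outliers(feats)
```

Training a classifier to separate real features from these samples gives it a decision boundary around the in-distribution region without requiring any real outlier data.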

ML models can be “Trojans” and have hidden, controllable vulnerabilities. Trojan models behave correctly and benignly in almost all scenarios, but in particular circumstances (when a “trigger” is satisfied), they behave incorrectly. This paper demonstrates the simplicity of creating Trojan reinforcement learning agents that can be triggered to execute a secret, coherent, and undesirable procedure. They modify a small fraction of training observations without assuming any control over policy or reward. Future safety work could try to detect whether models are Trojans, detect whether a Trojan model is being triggered, or precisely reconstruct the trigger given the model.

Paper
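A toy sketch of the observation-poisoning step makes the attack surface concrete. This is our own illustration, not the paper's attack: stamp a small trigger pattern into a small fraction of training observations (the real attack also shapes the associated behavior so the trigger elicits a coherent procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def poison_observations(obs, poison_frac=0.01, trigger_value=1.0):
    """Toy sketch of Trojan-style data poisoning: stamp a 2x2 trigger patch
    into a random fraction of image-like observations. Returns the
    poisoned copy and the poisoned indices."""
    obs = obs.copy()
    n = len(obs)
    idx = rng.choice(n, size=max(1, int(poison_frac * n)), replace=False)
    obs[idx, :2, :2] = trigger_value   # trigger patch in a corner
    return obs, idx

observations = rng.random((1000, 8, 8))
poisoned, poisoned_idx = poison_observations(observations)
```

Because only a tiny fraction of observations is touched, the agent's behavior looks normal in evaluation unless the trigger appears, which is what makes Trojans hard to detect.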

The Species dataset contains over 700,000 images covering over 1,000 anomalous species.

While previous papers claimed that Transformers are better at OOD detection than ConvNets, it turns out their test-time “anomalous examples” were similar to examples seen during pretraining. How can we properly assess OOD detection performance for models pretrained on broad datasets? This paper creates a biological anomaly dataset with organisms not seen in broad datasets including ImageNet-22K. The OOD dataset shows that Transformers have no marked edge over ConvNets at OOD detection, and there is substantial room for improvement.

Paper
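OOD detection of this kind is commonly scored with the maximum softmax probability (MSP) baseline and AUROC. A self-contained sketch of that evaluation recipe (not the paper's code; AUROC is computed via the rank-sum identity):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: a standard OOD-detection baseline
    (higher score = more in-distribution)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def auroc(in_scores, out_scores):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    scores = np.concatenate([in_scores, out_scores])
    ranks = scores.argsort().argsort() + 1
    n_in, n_out = len(in_scores), len(out_scores)
    u = ranks[:n_in].sum() - n_in * (n_in + 1) / 2
    return u / (n_in * n_out)

# Confident in-distribution logits vs. flat OOD logits -> AUROC of 1.0.
in_logits = np.array([[10.0, 0.0, 0.0], [0.0, 9.0, 0.0]])
out_logits = np.array([[1.0, 1.0, 1.1], [0.5, 0.4, 0.6]])
result = auroc(msp_score(in_logits), msp_score(out_logits))
```

An AUROC of 0.5 is chance; the Species dataset's point is that reported AUROCs collapse when the "anomalies" were actually seen during pretraining.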

Detecting far-from-distribution examples by simply first clipping values in the penultimate layer.
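The clipping idea can be sketched as follows. This is a simplified illustration under our own assumptions (in practice the clipping threshold is estimated from in-distribution data, and the score is an energy-style logsumexp of the logits):

```python
import numpy as np

def react_score(penultimate, weights, bias, clip_percentile=90):
    """Sketch of rectified-activation OOD scoring: clip unusually large
    penultimate-layer activations, then compute an energy-style score
    (logsumexp of logits; higher = more in-distribution)."""
    c = np.percentile(penultimate, clip_percentile)  # threshold; ideally fit on in-dist data
    clipped = np.minimum(penultimate, c)
    logits = clipped @ weights + bias
    m = logits.max(axis=1)
    return m + np.log(np.exp(logits - m[:, None]).sum(axis=1))

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16))      # stand-in penultimate activations
W, b = rng.normal(size=(16, 10)), np.zeros(10)
scores = react_score(feats, W, b)
```

The intuition is that OOD inputs tend to produce abnormally large activations, so truncating them disproportionately shrinks OOD confidence.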

A new OOD detection dataset with 224K classes.

A new metric advances the state-of-the-art for predicting a model’s performance on out-of-distribution data, assuming no access to ground truth labels.

A differentiable calibration loss sacrifices a small amount of accuracy for large calibration improvements.

In a thorough analysis of calibration, ConvNets are less calibrated than Transformers and MLP models, and more pretraining data has no consistent effect on calibration.
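Calibration in analyses like these is typically measured with expected calibration error (ECE): bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of examples in bin
    return ece

# Perfectly calibrated: 80% confidence, 80% accuracy -> ECE of 0.
conf = np.full(10, 0.8)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
calibrated_ece = expected_calibration_error(conf, correct)

# Overconfident: 95% confidence, 50% accuracy -> ECE of 0.45.
overconfident_ece = expected_calibration_error(
    np.full(10, 0.95), np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))
```

A model with low ECE makes confidence scores that can be taken at face value, which matters for any downstream decision that thresholds on confidence.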

A dataset that can be used for detecting contradictions given long background contexts. Such a detector could be used for preventing models from stating falsehoods at odds with reality or their previous statements.

Factual knowledge in language models corresponds to a localized computation that can be directly edited.

Instead of assuming that the environment provides a (hand-engineered) reward, a teacher provides preferences between the agent’s behaviors, and the agent uses this feedback to learn the desired behavior.

Preference-based RL is a framework for teaching agents by providing preferences about their behavior. However, the research area lacks a commonly adopted benchmark. While access to human preferences would be ideal, this makes evaluation far more costly and slower, and it often requires navigating review board bureaucracies. This paper creates a standardized benchmark using simulated teachers. These simulated teachers have preferences, but they can exhibit various irrationalities. Some teachers skip queries, some exhibit no preference when demonstrations are only subtly different, some make random mistakes, and some overemphasize behavior at the end of the demonstration.

Paper
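The simulated-teacher idea can be sketched in a few lines. This is a hypothetical simplification of such a teacher, not the benchmark's implementation: it prefers the higher-return trajectory, skips near-ties, and makes random mistakes at some rate (the benchmark's teachers also model myopia, i.e. overweighting late-trajectory behavior, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_teacher(ret_a, ret_b, skip_margin=0.1, error_rate=0.1):
    """Sketch of an imperfect simulated teacher for preference-based RL.
    Returns 0 (prefers trajectory A), 1 (prefers B), or None (skip)."""
    if abs(ret_a - ret_b) < skip_margin:
        return None                       # behaviors too similar: skip the query
    preference = 0 if ret_a > ret_b else 1
    if rng.random() < error_rate:
        preference = 1 - preference       # occasional random mistake
    return preference
```

An agent trained against such teachers can be stress-tested for robustness to each irrationality separately, which is exactly what human-subject evaluation makes slow and expensive.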

It is sometimes easier to identify preferences when decision problems are more uncertain.

Debate about “alignment” definitions: [1], [2], [3].

Optimal policies tend to seek power, a failure mode that will become more concerning with future advanced AI.

Using model look-ahead to avoid safety constraint violations.

This work proposes a policy editor to make policies comply with safety constraints; experiments are based on Safety Gym.

Benchmarking policies that adhere to constraints specified via natural language.

Apply to Fathom Radiant, which is working on hardware for safe machine intelligence.
